K-Nearest Neighbors
Supervised > classification/regression
- a method of instance-based learning: save the examples in a database and obtain a result by \(f(x) = \text{lookup}(x)\)
- general features of instance-based learners:
- defer generalization decision until new query instance is observed
- classify a new instance based on analyzing similar instances while ignoring distant examples
- represent instances as real-valued points in Euclidean space
- k-nearest neighbors
- the choice of distance function is very important for generalizing well => domain knowledge matters
- particularly susceptible to high dimensionality => the space to be covered by examples grows exponentially with the number of features
- there is no universally best choice for \(k\); it is even possible to choose \(k = n\)
- if all instances are considered (distance-weighted), the algorithm is a global method
- if only nearest examples are considered, then the algorithm is a local method
- strategies for improving classification efficiency
- multi-dimensional access methods (k-d trees; see the sketch after this list)
- fast approximate similarity search
- locality sensitive hashing
- condensing by choosing a smaller subset of records that give the same results
- editing data by removing objects to improve efficiency
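As a concrete sketch of the multi-dimensional access-method idea (assuming SciPy is available; the data here is random and purely illustrative), a k-d tree answers exact nearest-neighbor queries without scanning every stored point:

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(10_000, 3)        # stored training points (illustrative random data)
tree = cKDTree(X)                    # build the k-d tree once at "training" time
q = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(q, k=5)      # exact 5 nearest neighbors without a full linear scan
```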
Pseudo algorithm
given:
- Training data: \(D = \{(x_i, y_i)\}\)
- Distance metric: \(d(q, x)\)
- Number of neighbors: \(k\)
- Query point: \(q\)
- \(NN = \{\, i : d(q, x_i) \text{ is among the } k \text{ smallest} \,\}\) (if more than \(k\) are equally near, take all)
return:
- classification - plurality vote: \(\hat{f}(q) = \text{plurality}(\{y_i : i \in NN\})\) (possibly distance-weighted vote)
- break ties e.g. most common overall, random choice, etc.
- regression - mean: \(\hat{f}(q) = \text{mean}(\{y_i : i \in NN\})\)
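A minimal NumPy sketch of the query procedure above (brute-force Euclidean distance; the function names are illustrative, and ties are broken arbitrarily):

```python
import numpy as np
from collections import Counter

def knn_query(X, q, k=3):
    """Indices of the k training points nearest to query q (brute force, O(n))."""
    dists = np.linalg.norm(X - q, axis=1)      # distance from q to every stored example
    return np.argpartition(dists, k - 1)[:k]   # k smallest distances, in no particular order

def knn_classify(X, y, q, k=3):
    votes = Counter(y[i] for i in knn_query(X, q, k))
    return votes.most_common(1)[0][0]          # plurality vote; ties broken arbitrarily

def knn_regress(X, y, q, k=3):
    return float(np.mean(y[knn_query(X, q, k)]))   # unweighted mean of neighbor targets
```

All of the work happens at query time; "learning" is just storing \(X\) and \(y\).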
Complexity
| method | time | space |
|---|---|---|
| k-NN, learning | \(O(1)\) | \(O(n)\) |
| k-NN, query (sorted)* | \(O(\lg n + k)\) | \(O(1)\) |
| k-NN, query (unsorted) | \(O(n)\) | \(O(1)\) |
* If \(k > \lg n\), the \(k\) term dominates; if \(k\) is small, the \(\lg n\) search dominates.
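For intuition on the sorted-query row (a 1-D case, assuming the scalar inputs are kept in a sorted list): binary search finds the query position in \(O(\lg n)\), and a two-pointer expansion then collects the \(k\) nearest in \(O(k)\):

```python
import bisect

def knn_sorted_1d(xs, q, k):
    """xs must be sorted; returns the k values nearest to q in O(lg n + k)."""
    i = bisect.bisect_left(xs, q)   # O(lg n): where q would be inserted
    lo, hi = i - 1, i               # pointers expanding outward from that position
    out = []
    while len(out) < k and (lo >= 0 or hi < len(xs)):
        # take whichever side is currently closer to q
        if hi >= len(xs) or (lo >= 0 and q - xs[lo] <= xs[hi] - q):
            out.append(xs[lo]); lo -= 1
        else:
            out.append(xs[hi]); hi += 1
    return out
```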
Example
(table of training examples and the resulting kNN estimates omitted)
The actual function is \(f(x) = x_1^2 + x_2\), so the true answer for the query \(q\) is 18; depending on the distance metric, the choice of \(k\), and the number of examples, the kNN estimates vary vastly (this is a purposely bad example).
Distance metrics make assumptions about domain, which may not apply.
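A small sketch of that point, using hypothetical training data (not the table from the original example) labeled by the true function \(f(x) = x_1^2 + x_2\): the estimate for the same query swings widely as the metric and \(k\) change, and none is close to the true value of 18.

```python
import numpy as np

# hypothetical training points, labeled by the true function f(x) = x1^2 + x2
X = np.array([[1.0, 6.0], [2.0, 4.0], [3.0, 7.0], [6.0, 8.0], [7.0, 1.0], [8.0, 4.0]])
y = X[:, 0] ** 2 + X[:, 1]
q = np.array([4.0, 2.0])                     # true value f(q) = 18

for name, p in [("euclidean", 2), ("manhattan", 1)]:
    d = np.linalg.norm(X - q, ord=p, axis=1)
    for k in (1, 3, len(X)):
        nn = np.argsort(d)[:k]               # indices of the k nearest under this metric
        print(name, k, np.mean(y[nn]))       # estimates swing between ~8 and ~42 around the true 18
```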
Locally weighted regression
Locally weighted (linear) regression is a generalization of kNN where an explicit local approximation of target function is constructed for each query instance.
- uses nearby or distance-weighted examples to form local approximation to \(f\)
- the target function is approximated using a constant, linear, or quadratic function (see the sketch after this list)
- using more complex functional forms is challenging because the cost of fitting a more complex function for every query instance is high
- these simpler functions model the target quite well over a sufficiently small subregion
- it is local because the function is approximated using only data near the query point
- it is weighted because the contribution of each point is weighted by its distance from the query point
- locally, the hypothesis space is restricted to lines (\(H = \text{lines}\))
- but since a new local fit is made for every query, the learned function overall can be richer than any single line (\(H \supseteq \text{lines}\))
- can build local knowledge from nearby concepts to represent arbitrarily complicated functions
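A minimal sketch of locally weighted linear regression (assumptions: a Gaussian kernel with bandwidth tau and a plain weighted-least-squares solve; regularization is omitted):

```python
import numpy as np

def lwr_predict(X, y, q, tau=1.0):
    """Fit a linear model weighted around query q and return its prediction at q."""
    Xb = np.hstack([np.ones((len(X), 1)), X])             # add a bias column
    qb = np.concatenate([[1.0], q])
    # Gaussian kernel: nearby examples get weight ~1, distant ones ~0
    w = np.exp(-np.sum((X - q) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # weighted least squares: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return float(qb @ theta)
```

A fresh fit is performed for every query point, which is exactly why more complex local models quickly become expensive.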
Bias
Inductive bias
- restriction bias: N/A => the kNN algorithm never forms an explicit general hypothesis \(\hat{f}\) regarding the target function \(f\)
- preference bias:
- locality: nearest points are similar => domain knowledge required to choose right distance function
- smoothness: in choosing near neighbors we expect similar points to behave similarly
- all features matter equally: points are represented in an \(n\)-dimensional space where every dimension contributes equally to the distance
Advantages
- ability to model complex target functions by collection of local approximations
- can estimate locally and differently for each new instance to be classified
- information presented in examples is never lost
- simple to implement, no training
- constructs only local approximations to the target function and never generalizes over the entire instance space => this is advantageous when the target function is very complex overall but can be described by simpler local approximations
Disadvantages
- no generalization: all examples are kept, so labeling a new instance requires consulting the stored data rather than a compact model
- cost of classifying new instances is high since all computation takes place at classification time
- overfits and can be sensitive to noise => use distance weighting to smooth out the impact of noisy examples (see the sketch after this list)
- having to break ties when two or more similar points with different labels exist
- considers all attributes when attempting to retrieve similar examples
- the most similar examples may appear far apart because many attributes inflate the distance
- highly sensitive to curse of dimensionality
- highly sensitive to choosing the right distance metric
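A minimal sketch of the distance-weighted variant mentioned above (assumption: inverse-distance weights, with a small epsilon to avoid division by zero when a neighbor coincides with the query):

```python
import numpy as np

def knn_weighted_regress(X, y, q, k=3, eps=1e-8):
    """Inverse-distance-weighted mean of the k nearest targets."""
    d = np.linalg.norm(X - q, axis=1)
    nn = np.argsort(d)[:k]              # indices of the k nearest examples
    w = 1.0 / (d[nn] + eps)             # closer neighbors get larger weights
    return float(np.sum(w * y[nn]) / np.sum(w))
```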