
K-Nearest Neighbors

Supervised > classification/regression

  • a method of instance-based learning: store the examples in a database and obtain a result by \(f(x) = \text{lookup}(x)\)
  • general features of instance-based learners:
    • defer generalization decision until new query instance is observed
    • classify a new instance based on analyzing similar instances while ignoring distant examples
    • represent instances as real-valued points in Euclidean space
  • k-nearest neighbors
    • the choice of representative distance function is very important for generalizing well => domain knowledge matters
    • particularly susceptible to high dimensionality => the space to be covered by examples grows exponentially w.r.t. the number of features
  • there is no generally best choice for \(k\); it is also possible to choose \(k = n\)
    • if all instances are considered (distance-weighted), the algorithm is a global method
    • if only nearest examples are considered, then the algorithm is a local method
  • strategies for improving classification efficiency
    • multi-dimensional access methods (k-d trees), as sketched after this list
    • fast approximate similarity search
    • locality sensitive hashing
    • condensing by choosing a smaller subset of records that give the same results
    • editing data by removing objects to improve efficiency
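
A minimal sketch of the k-d tree strategy, assuming SciPy is available (the library choice, the synthetic data, and the variable names are illustrative, not from the notes):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 3))        # hypothetical training points with 3 features
tree = KDTree(X)                   # build the multi-dimensional index once

q = np.array([0.5, 0.5, 0.5])      # query point
dist, idx = tree.query(q, k=5)     # 5 nearest neighbors without a full linear scan
print(idx, dist)                   # indices into X and their Euclidean distances
```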

Pseudo algorithm

  • given:

    • Training data \(D = \{ (x_i, y_i) \}\)
    • Distance metric \(d(q, x)\)
    • Number of neighbors: \(k\)
    • Query point: \(q\)
  • compute:

    • \(NN = \{ i : d(q, x_i) \text{ is among the } k \text{ smallest distances} \}\) (if more than \(k\) points are equally near, take all)
  • return:

    • classification - plurality vote over \(\{ y_i : i \in NN \}\) (possibly a weighted vote)
      • break ties, e.g. most common label overall, random choice, etc.
    • regression - mean of \(\{ y_i : i \in NN \}\)
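
A minimal Python sketch of this procedure (the function names and the toy data are illustrative; ties among equally near points are all kept, and classification ties are broken by a random choice, which is just one of the options above):

```python
import math
import random
from collections import Counter

def euclidean(q, x):
    return math.sqrt(sum((qi - xi) ** 2 for qi, xi in zip(q, x)))

def knn(D, d, k, q, task="classification"):
    """D: list of (x_i, y_i) pairs; d: distance metric; k: neighbor count; q: query point."""
    cutoff = sorted(d(q, x) for x, _ in D)[k - 1]        # k-th smallest distance
    NN = [y for x, y in D if d(q, x) <= cutoff]          # if > k equally near, take all
    if task == "regression":
        return sum(NN) / len(NN)                         # mean of neighbor targets
    counts = Counter(NN).most_common()                   # plurality vote
    tied = [label for label, c in counts if c == counts[0][1]]
    return random.choice(tied)                           # break remaining ties randomly

# hypothetical 2-class toy data: ((x1, x2), label)
D = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((4.0, 4.2), "b"), ((3.8, 4.0), "b")]
print(knn(D, euclidean, 3, (1.1, 1.0)))                  # -> "a"
```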

Complexity

| method                  | time             | space    |
|-------------------------|------------------|----------|
| k-NN, learning          | \(O(1)\)         | \(O(n)\) |
| k-NN, query (sorted)*   | \(O(\lg n + k)\) | \(O(1)\) |
| k-NN, query (unsorted)  | \(O(n)\)         | \(O(1)\) |

* If \(k > \lg(n)\), the \(k\) term dominates; if \(k\) is small, the \(\lg(n)\) term dominates.

Example

q = (4, 2)
|      |      | Euclidean² | Manhattan |
|  x   |  y   |  distance  | distance  | nearest
|------|------|------------|-----------|----------
| 1, 6 |   7  |     25     |     7     |
| 2, 4 |   8  |      8     |     4     |  a-b-c-d
| 3, 7 |  16  |     26     |     6     |  d
| 6, 8 |  44  |     40     |     8     |
| 7, 1 |  50  |     10     |     4     |  b-c-d
| 8, 4 |  68  |     20     |     6     |  c-d

a) 1-NN, E.D. (average) => 8 / 1            = 8
b) 1-NN, M.D. (average) => (8+50) / 2       = 29
c) 3-NN, E.D. (average) => (8+50+68) / 3    = 42
d) 3-NN, M.D. (average) => (8+16+50+68) / 4 = 35.5

The actual function is \(x_1^2 + x_2\), so \(f(q) = 18\) (the Euclidean distances above are shown squared, which does not change the neighbor ranking). Depending on the distance metric, the choice of \(k\), and the number of examples, these results differ vastly (this is a purposely bad example).
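
A short sketch that reproduces these four numbers, using the tie rule from the pseudo-algorithm (take all equally near points); the function names here are illustrative:

```python
D = [((1, 6), 7), ((2, 4), 8), ((3, 7), 16),
     ((6, 8), 44), ((7, 1), 50), ((8, 4), 68)]
q = (4, 2)

def sq_euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn_mean(d, k):
    cutoff = sorted(d(q, x) for x, _ in D)[k - 1]   # k-th smallest distance
    nn = [y for x, y in D if d(q, x) <= cutoff]     # keep all equally near points
    return sum(nn) / len(nn)

print(knn_mean(sq_euclidean, 1))   # a) 8.0
print(knn_mean(manhattan, 1))      # b) 29.0
print(knn_mean(sq_euclidean, 3))   # c) 42.0
print(knn_mean(manhattan, 3))      # d) 35.5
```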

Distance metrics make assumptions about domain, which may not apply.

Locally weighted regression

Locally weighted (linear) regression is a generalization of kNN in which an explicit local approximation of the target function is constructed for each query instance.

  • uses nearby or distance-weighted examples to form a local approximation to \(f\)
  • the target function is approximated using a constant, linear, or quadratic function
    • using more complex functional forms is challenging because the cost of fitting a more complex function for each query instance is high
    • these simpler functions model the target quite well over a sufficiently small subregion
  • it is local because the function is approximated using only data near the query point
  • it is weighted because the contribution of each point is weighted by its distance from the query point
  • starts with hypothesis space \(H = \text{lines}\)
    • over learning, the effective hypothesis space becomes \(H \supseteq \text{lines}\)
    • can build local knowledge from nearby concepts to represent arbitrarily complicated functions
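
A minimal sketch of locally weighted linear regression as described above; the Gaussian kernel and the bandwidth `tau` are common choices but are assumptions here, not from the notes:

```python
import numpy as np

def lwlr(q, X, y, tau=0.5):
    """Locally weighted linear regression: fit a weighted line around q, predict f(q).

    X: (n, d) inputs, y: (n,) targets, tau: Gaussian kernel bandwidth (assumed)."""
    w = np.exp(-np.sum((X - q) ** 2, axis=1) / (2 * tau ** 2))  # weight by closeness to q
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])               # add intercept column
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)        # weighted least squares
    return np.concatenate([[1.0], np.atleast_1d(q)]) @ theta

# noisy samples of a nonlinear target; each local fit is just a line near the query
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 60).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)
print(lwlr(np.array([1.0]), X, y))   # roughly sin(1.0) ≈ 0.84
```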

Bias

Inductive bias

  • restriction bias: N/A => kNN algorithm never forms an explicit general hypothesis \(\hat{f}\) regarding the target function \(f\)

  • preference bias

    • locality: nearest points are similar => domain knowledge is required to choose the right distance function
    • smoothness: in choosing near neighbors we expect similar points to behave similarly
    • all features matter equally: points are assumed to be representable in n-dimensional space, with every dimension contributing equally to the distance

Advantages

  • ability to model complex target functions by collection of local approximations
  • can estimate locally and differently for each new instance to be classified
  • information presented in examples is never lost
  • simple to implement, no training
  • attempts to construct only a local approximation of the target function and never generalizes over the entire instance space => this is advantageous when the target function is very complex but can be described by less complex local approximations

Disadvantages

  • no generalization because all examples are kept => labeling a new instance requires the full dataset
  • cost of classifying new instances is high since all computation takes place at classification time
  • overfits and can be sensitive to noise => use weighted distance to smooth out the impact of noisy examples (see the sketch after this list)
  • having to break ties when two or more similar points with different labels exist
  • considers all attributes when attempting to retrieve similar examples
    • the most similar examples may appear far apart when many (possibly irrelevant) attributes are included in the distance
    • highly sensitive to curse of dimensionality
  • highly sensitive to choosing the right distance metric
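
A minimal sketch of the distance-weighted smoothing mentioned in the noise bullet above; the inverse-distance weighting `1 / (d + eps)` is one common choice, assumed here (tie handling beyond the first \(k\) points is omitted for brevity):

```python
def weighted_knn_regress(D, d, k, q, eps=1e-9):
    """Distance-weighted k-NN regression: nearer neighbors get larger vote weights."""
    nearest = sorted(D, key=lambda pair: d(q, pair[0]))[:k]
    weights = [1.0 / (d(q, x) + eps) for x, _ in nearest]   # inverse-distance weighting
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# with the earlier example data, the nearer neighbors dominate the prediction
D = [((1, 6), 7), ((2, 4), 8), ((3, 7), 16), ((6, 8), 44), ((7, 1), 50), ((8, 4), 68)]
manhattan = lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b))
print(weighted_knn_regress(D, manhattan, 3, (4, 2)))   # ≈ 25.75
```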