
K-Nearest Neighbors

Supervised > classification/regression

  • a method of instance-based learning: save the training examples in a database and obtain results by f(x) = lookup(x)
  • general features of instance-based learners:
    • defer generalization decision until new query instance is observed
    • classify a new instance based on analyzing similar instances while ignoring distant examples
    • represent instances as real-valued points in Euclidean space
  • k-nearest neighbors
    • the choice of representative distance function is very important for generalizing well => domain knowledge matters
    • particularly susceptible to high dimensionality => the space to be covered by examples grows exponentially with the number of features
  • there is no generally best choice for k; it is even possible to choose k = n
    • if all instances are considered (distance-weighted), the algorithm is a global method
    • if only nearest examples are considered, then the algorithm is a local method
  • strategies for improving classification efficiency
    • multi-dimensional access methods (k-d trees; see the sketch after this list)
    • fast approximate similarity search
    • locality sensitive hashing
    • condensing by choosing a smaller subset of records that give the same results
    • editing data by removing objects to improve efficiency
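
For the k-d tree bullet above, a minimal sketch of an index-based nearest-neighbor query using scipy's cKDTree (scipy is not mentioned in these notes; the points and query are the ones from the Example section below):

```python
import numpy as np
from scipy.spatial import cKDTree

# Training points (the same six 2-D points used in the Example section)
X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]], dtype=float)

tree = cKDTree(X)                      # build the k-d tree once, then query cheaply
dist, idx = tree.query([4, 2], k=3)    # 3 nearest neighbors of q = (4, 2)
print(idx, dist)                       # indices into X and their Euclidean distances
```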

Pseudo algorithm

  • given:

    • Training data D = { (x_i, y_i) }
    • Distance metric d(q, x)
    • Number of neighbors: k
    • Query point: q
    • NN = { i : d(q, x_i) is among the k smallest } (if more than k are equally near, take all)
  • return:

    • classification - plurality vote: y_q = plurality vote of { y_i : i in NN } (possibly a weighted vote)
      • break ties e.g. most common overall, random choice, etc.
    • regression - mean: y_q = mean of { y_i : i in NN }
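
A direct Python sketch of the procedure above (brute force over D, keeping all equally-near points as the pseudo-algorithm specifies; the function name and the arbitrary tie-breaking in the vote are my own choices):

```python
from collections import Counter

def knn(D, d, k, q, mode="classification"):
    """D: list of (x, y) pairs; d: distance function d(q, x); k: neighbor count; q: query point."""
    cutoff = sorted(d(q, x) for x, _ in D)[k - 1]      # k-th smallest distance
    nn = [y for x, y in D if d(q, x) <= cutoff]        # if >k are equally near, take all
    if mode == "classification":
        return Counter(nn).most_common(1)[0][0]        # plurality vote (ties broken arbitrarily)
    return sum(nn) / len(nn)                           # regression: mean of neighbor labels
```

Called in regression mode with the Manhattan distance and k = 3 on the six points from the Example section below, this returns 35.5, matching case d).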

Complexity

| method                  | time        | space |
|-------------------------|-------------|-------|
| k-NN, learning          | O(1)        | O(n)  |
| k-NN, query (sorted)*   | O(lg n + k) | O(1)  |
| k-NN, query (unsorted)  | O(n)        | O(1)  |

* If k > lg(n), the k term dominates; if k is small, the lg(n) term dominates.

Example

q = (4, 2)

| x1, x2 |  y  | squared Euclidean | Manhattan | nearest in |
|--------|-----|-------------------|-----------|------------|
|  1, 6  |  7  |        25         |     7     |            |
|  2, 4  |  8  |         8         |     4     |  a-b-c-d   |
|  3, 7  | 16  |        26         |     6     |  d         |
|  6, 8  | 44  |        40         |     8     |  c-d       |
|  7, 1  | 50  |        10         |     4     |  b-c-d     |
|  8, 4  | 68  |        20         |     6     |  c-d       |

(The distance columns list the squared Euclidean and the Manhattan distance from q; squaring the Euclidean distance is monotonic, so it does not change which neighbors are nearest.)

a) 1-NN, E.D. (average) => 8 / 1            = 8
b) 1-NN, M.D. (average) => (8+50) / 2       = 29    (tie at distance 4, both taken)
c) 3-NN, E.D. (average) => (8+50+68) / 3    = 42
d) 3-NN, M.D. (average) => (8+16+50+68) / 4 = 35.5  (tie at distance 6, all four taken)

The actual function is f(x) = x1^2 + x2 => f(q) = 18 => depending on the distance metric, choice of k, and number of examples, these results are vastly different (this is a purposely bad example).
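
The table can be double-checked with a few lines of Python (a sketch; as noted above, the Euclidean column is the squared distance):

```python
points = {(1, 6): 7, (2, 4): 8, (3, 7): 16, (6, 8): 44, (7, 1): 50, (8, 4): 68}
q = (4, 2)

for (x1, x2), y in points.items():
    sq_euclid = (q[0] - x1) ** 2 + (q[1] - x2) ** 2    # squared Euclidean distance
    manhattan = abs(q[0] - x1) + abs(q[1] - x2)        # Manhattan distance
    print(f"({x1}, {x2})  y={y:<3} E.D.^2={sq_euclid:<3} M.D.={manhattan}")
```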

Distance metrics make assumptions about domain, which may not apply.

Locally weighted regression

Locally weighted (linear) regression is a generalization of kNN in which an explicit local approximation of the target function is constructed for each query instance.

  • uses nearby or distance-weighted examples to form local approximation to f
  • the target function is approximated using a constant, linear, or quadratic function
    • using more complex functional forms is challenging because cost of fitting more complex function for each query instance is high
    • these simpler functions model the target quite well over sufficiently small subregion
  • it is local because function is approximating only on data near the query point
  • it is weighted because the contribution of each point is weighted by its distance from the query point
  • starts with a hypothesis space of lines
    • over learning, combining many local line fits effectively yields a much richer hypothesis space
    • can build local knowledge from nearby points to represent arbitrarily complicated functions
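
A minimal sketch of locally weighted linear regression (the Gaussian kernel and the bandwidth tau are common choices but are assumptions here, not something these notes specify):

```python
import numpy as np

def lwlr(X, y, q, tau=1.0):
    """Fit a line locally around query q, weighting each training point by proximity."""
    Xb = np.hstack([np.ones((len(X), 1)), X])                     # design matrix with bias column
    qb = np.hstack([1.0, q])
    w = np.exp(-np.sum((X - q) ** 2, axis=1) / (2 * tau ** 2))    # Gaussian distance weights
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)          # weighted least squares
    return qb @ theta                                             # local prediction for f(q)
```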

Bias

Inductive bias

  • restriction bias: N/A => kNN algorithm never forms an explicit general hypothesis f^ regarding the target function f

  • preference bias

    • locality: nearest points are similar => domain knowledge required to choose right distance function
    • smoothness: in choosing near neighbors we expect similar points to behave similarly
    • all features matter equally: assume points can be represented in n-dimensional space

Advantages

  • ability to model complex target functions by collection of local approximations
  • can estimate locally and differently for each new instance to be classified
  • information presented in examples is never lost
  • simple to implement, no training
  • attempts to construct a local target function and never generalize over entire instance space => this is advantageous when the target function is very complex but can be described by less complex local approximations

Disadvantages

  • no generalization at training time: all examples are kept, so labeling a new instance requires examining the stored data
  • cost of classifying new instances is high since all computation takes place at classification time
  • overfits and can be sensitive to noise => use distance weighting to smooth out the impact of noisy examples (see the sketch after this list)
  • having to break ties when two or more similar points with different labels exist
  • considers all attributes when attempting to retrieve similar examples
    • most similar may seem far apart because of many attributes
    • highly sensitive to curse of dimensionality
  • highly sensitive to choosing the right distance metric
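
One simple implementation of the distance weighting mentioned above is inverse-square weights (a sketch; the 1/d^2 kernel and the eps guard against zero distance are my assumptions, not something the notes specify):

```python
def weighted_knn_regress(D, d, k, q, eps=1e-9):
    """Distance-weighted k-NN regression: closer neighbors contribute more."""
    nearest = sorted(D, key=lambda pair: d(q, pair[0]))[:k]        # k closest (x, y) pairs
    weights = [1.0 / (d(q, x) ** 2 + eps) for x, _ in nearest]     # inverse-square weighting
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```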