K-Nearest Neighbors
Supervised > classification/regression
- a method of instance-based learning: save the examples in a database and obtain a result by \(f(x) = \text{lookup}(x)\)
- general features of instance-based learners:
- defer generalization decision until new query instance is observed
- classify a new instance based on analyzing similar instances while ignoring distant examples
- represent instances as real-valued points in Euclidean space
- k-nearest neighbors
- the choice of distance function is very important for generalizing well => domain knowledge matters
- particularly susceptible to high dimensionality => the space to be covered by examples grows exponentially with the number of features
- there is no universally best choice for \(k\); it is even possible to choose \(k = n\)
- if all instances are considered (distance-weighted), the algorithm is a global method
- if only nearest examples are considered, then the algorithm is a local method
- strategies for improving classification efficiency
- multi-dimensional access methods (k-d trees; see the sketch after this list)
- fast approximate similarity search
- locality sensitive hashing
- condensing by choosing a smaller subset of records that give the same results
- editing data by removing objects to improve efficiency
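As a concrete sketch of the multi-dimensional access-method idea (assuming SciPy is available; the data here is random and purely illustrative), a k-d tree answers exact nearest-neighbor queries without scanning every stored point:

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(10_000, 3)        # stored training points (illustrative random data)
tree = cKDTree(X)                    # build the k-d tree once at "training" time
q = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(q, k=5)      # exact 5 nearest neighbors without a full linear scan
```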
Pseudo algorithm
given:
- Training data: \(D = \{(x_i, y_i)\}\)
- Distance metric: \(d(q, x)\)
- Number of neighbors: \(k\)
- Query point: \(q\)
- \(NN = \{\, i : d(q, x_i) \text{ is among the } k \text{ smallest} \,\}\) (if more than \(k\) are equally near, take all)
return:
- classification - plurality vote: \(\hat{f}(q) = \text{plurality}(\{y_i : i \in NN\})\) (possibly distance-weighted vote)
- break ties e.g. most common overall, random choice, etc.
- regression - mean: \(\hat{f}(q) = \text{mean}(\{y_i : i \in NN\})\)
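A minimal NumPy sketch of the query procedure above (brute-force Euclidean distance; the function names are illustrative, and ties are broken arbitrarily):

```python
import numpy as np
from collections import Counter

def knn_query(X, q, k=3):
    """Indices of the k training points nearest to query q (brute force, O(n))."""
    dists = np.linalg.norm(X - q, axis=1)      # distance from q to every stored example
    return np.argpartition(dists, k - 1)[:k]   # k smallest distances, in no particular order

def knn_classify(X, y, q, k=3):
    votes = Counter(y[i] for i in knn_query(X, q, k))
    return votes.most_common(1)[0][0]          # plurality vote; ties broken arbitrarily

def knn_regress(X, y, q, k=3):
    return float(np.mean(y[knn_query(X, q, k)]))   # unweighted mean of neighbor targets
```

All of the work happens at query time; "learning" is just storing \(X\) and \(y\).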
Complexity
| method | time | space |
|---|---|---|
| k-NN, learning | \(O(1)\) | \(O(n)\) |
| k-NN, query (sorted)* | \(O(\lg n + k)\) | \(O(1)\) |
| k-NN, query (unsorted) | \(O(n)\) | \(O(1)\) |
* If \(k > \lg n\), the \(k\) term dominates; if \(k\) is small, the \(\lg n\) search dominates.
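For intuition on the sorted-query row (a 1-D case, assuming the scalar inputs are kept in a sorted list): binary search finds the query position in \(O(\lg n)\), and a two-pointer expansion then collects the \(k\) nearest in \(O(k)\):

```python
import bisect

def knn_sorted_1d(xs, q, k):
    """xs must be sorted; returns the k values nearest to q in O(lg n + k)."""
    i = bisect.bisect_left(xs, q)   # O(lg n): where q would be inserted
    lo, hi = i - 1, i               # pointers expanding outward from that position
    out = []
    while len(out) < k and (lo >= 0 or hi < len(xs)):
        # take whichever side is currently closer to q
        if hi >= len(xs) or (lo >= 0 and q - xs[lo] <= xs[hi] - q):
            out.append(xs[lo]); lo -= 1
        else:
            out.append(xs[hi]); hi += 1
    return out
```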
Example
(table of training examples and the resulting kNN estimates omitted)
The actual function is \(f(x) = x_1^2 + x_2\), so the true answer for the query \(q\) is 18; depending on the distance metric, the choice of \(k\), and the number of examples, the kNN estimates vary vastly (this is a purposely bad example).
Distance metrics make assumptions about domain, which may not apply.
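A small sketch of that point, using hypothetical training data (not the table from the original example) labeled by the true function \(f(x) = x_1^2 + x_2\): the estimate for the same query swings widely as the metric and \(k\) change, and none is close to the true value of 18.

```python
import numpy as np

# hypothetical training points, labeled by the true function f(x) = x1^2 + x2
X = np.array([[1.0, 6.0], [2.0, 4.0], [3.0, 7.0], [6.0, 8.0], [7.0, 1.0], [8.0, 4.0]])
y = X[:, 0] ** 2 + X[:, 1]
q = np.array([4.0, 2.0])                     # true value f(q) = 18

for name, p in [("euclidean", 2), ("manhattan", 1)]:
    d = np.linalg.norm(X - q, ord=p, axis=1)
    for k in (1, 3, len(X)):
        nn = np.argsort(d)[:k]               # indices of the k nearest under this metric
        print(name, k, np.mean(y[nn]))       # estimates swing between ~8 and ~42 around the true 18
```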
Locally weighted regression
Locally weighted (linear) regression is a generalization of kNN where an explicit local approximation of target function is constructed for each query instance.
- uses nearby or distance-weighted examples to form local approximation to \(f\)
- the target function is approximated using a constant, linear, or quadratic function (see the sketch after this list)
- using more complex functional forms is challenging because the cost of fitting a more complex function for every query instance is high
- these simpler functions model the target quite well over a sufficiently small subregion
- it is local because the function is approximated using only data near the query point
- it is weighted because the contribution of each point is weighted by its distance from the query point
- locally, the hypothesis space is restricted to lines (\(H = \text{lines}\))
- but since a new local fit is made for every query, the learned function overall can be richer than any single line (\(H \supseteq \text{lines}\))
- can build local knowledge from nearby concepts to represent arbitrarily complicated functions
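A minimal sketch of locally weighted linear regression (assumptions: a Gaussian kernel with bandwidth tau and a plain weighted-least-squares solve; regularization is omitted):

```python
import numpy as np

def lwr_predict(X, y, q, tau=1.0):
    """Fit a linear model weighted around query q and return its prediction at q."""
    Xb = np.hstack([np.ones((len(X), 1)), X])             # add a bias column
    qb = np.concatenate([[1.0], q])
    # Gaussian kernel: nearby examples get weight ~1, distant ones ~0
    w = np.exp(-np.sum((X - q) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # weighted least squares: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return float(qb @ theta)
```

A fresh fit is performed for every query point, which is exactly why more complex local models quickly become expensive.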
Bias
Inductive bias
- restriction bias: N/A => the kNN algorithm never forms an explicit general hypothesis \(\hat{f}\) regarding the target function \(f\)
- preference bias:
- locality: nearest points are similar => domain knowledge required to choose right distance function
- smoothness: in choosing near neighbors we expect similar points to behave similarly
- all features matter equally: points are represented in an \(n\)-dimensional space where every dimension contributes equally to the distance
Advantages
- ability to model complex target functions by collection of local approximations
- can estimate locally and differently for each new instance to be classified
- information presented in examples is never lost
- simple to implement, no training
- constructs only local approximations to the target function and never generalizes over the entire instance space => this is advantageous when the target function is very complex overall but can be described by simpler local approximations
Disadvantages
- no generalization: all examples are kept, so labeling a new instance requires consulting the stored data rather than a compact model
- cost of classifying new instances is high since all computation takes place at classification time
- overfits and can be sensitive to noise => use distance weighting to smooth out the impact of noisy examples (see the sketch after this list)
- having to break ties when two or more similar points with different labels exist
- considers all attributes when attempting to retrieve similar examples
- the most similar examples may appear far apart because many attributes inflate the distance
- highly sensitive to curse of dimensionality
- highly sensitive to choosing the right distance metric
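A minimal sketch of the distance-weighted variant mentioned above (assumption: inverse-distance weights, with a small epsilon to avoid division by zero when a neighbor coincides with the query):

```python
import numpy as np

def knn_weighted_regress(X, y, q, k=3, eps=1e-8):
    """Inverse-distance-weighted mean of the k nearest targets."""
    d = np.linalg.norm(X - q, axis=1)
    nn = np.argsort(d)[:k]              # indices of the k nearest examples
    w = 1.0 / (d[nn] + eps)             # closer neighbors get larger weights
    return float(np.sum(w * y[nn]) / np.sum(w))
```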