K-means
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 5, section 2
Unsupervised learning > clustering
- K-means generates \(K\) clusters, each defined around a centroid prototype, which is usually the mean of the cluster's points
- usually applied to objects in continuous \(n\)-dimensional space
- if the prototype must be an actual data point (a medoid), the algorithm is called K-medoids
- applicable to a wide variety of data, e.g. continuous numeric data, documents, and time series
Pseudo code
1. select \(K\) points as initial centroids
2. repeat
3. form \(K\) clusters by assigning each point to its closest centroid
4. recompute the centroid of each cluster
5. until the centroids do not change
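A minimal Python/numpy sketch of this loop (illustrative only; Euclidean distance is assumed, and the names `kmeans`, `X`, `k` are choices made here, not from the book):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. select K points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2-3. assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. recompute each centroid as the mean of its assigned points
        #    (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 5. stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```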
- \(K\) is a user-specified parameter
- the selection of initial centroids greatly impacts the quality of the result
- various proximity measures can be used to compute the distance between points, e.g. Euclidean or cosine => prefer a simple measure, because it has to be computed repeatedly
- the centroid is the point that minimizes the sum of squared distances to the points in its cluster; with Euclidean distance, cluster quality is measured by the sum of squared error (SSE)
- over many runs of the algorithm, the clustering solution with the smallest SSE is preferred
- for the commonly used proximity functions, K-means is guaranteed to converge
- most of the change happens in the early iterations => the stopping condition can be weakened to e.g. "until fewer than 1% of the points change clusters"
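The same knobs appear as parameters of scikit-learn's `KMeans`; a usage sketch, assuming scikit-learn is installed (the toy data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)        # toy data: 200 points in 2-D
km = KMeans(n_clusters=3,         # K, user-specified
            init="random",        # plain random initialization
            n_init=10,            # keep the best of 10 runs (lowest SSE)
            tol=1e-4)             # relaxed convergence criterion
labels = km.fit_predict(X)
print(km.inertia_)                # SSE of the best run
```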
Possible Distance metrics
Function | Centroid | Objective function |
---|---|---|
Manhattan (\(L_1\)) | median | Minimize sum of \(L_1\) distance of an object to its cluster centroid |
Squared Euclidean (\(L_2^2\)) | mean | Minimize sum of squared \(L_2\) distance of an object to its cluster centroid |
Cosine | mean | Maximize sum of cosine similarity of an object to its cluster centroid |
Bregman divergence | mean | Minimize sum of Bregman divergence of an object to its cluster centroid |
All of the above are guaranteed to converge.
Choosing initial centroids
- Poor choices yield poor results.
- An optimal clustering can still be obtained if, e.g., two initial centroids fall anywhere within a pair of clusters, since the centroids will redistribute themselves, one to each cluster
- when \(K\) increases, it becomes harder to distribute the initial centroids so that they cover the natural clusters
- strategies for choosing initial centroids:
- random initialization (over multiple runs, to improve the result)
- take a sample of points and apply hierarchical clustering, then use the resulting centroids as initial centroids for K-means => works well, but is practical only for small samples (at most a few thousand records) and a small \(K\)
- (1) choose the first point at random (or take the centroid of all points), (2) repeatedly select as the next centroid the point farthest from the centroids already chosen => yields well-separated centroids, but may select outliers as centroids (mitigated by working on an initial sample; see the sketch below)
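A numpy sketch of that farthest-point strategy (illustrative; the function name, the sampling step, and starting from the overall centroid are choices made here, not prescribed by the text):

```python
import numpy as np

def farthest_point_init(X, k, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    # work on a random sample to cut cost and reduce the chance of picking outliers
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    # (1) start from the centroid of all (sampled) points
    centroids = [sample.mean(axis=0)]
    for _ in range(k - 1):
        # distance of each sampled point to its closest centroid chosen so far
        d = np.linalg.norm(sample[:, None, :] - np.array(centroids)[None, :, :],
                           axis=2).min(axis=1)
        # (2) pick the point farthest from the already selected centroids
        centroids.append(sample[d.argmax()])
    return np.array(centroids)
```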
Centroid and SSE
Computing the centroid of the \(i^{th}\) cluster:
\(\displaystyle c_i = \frac{1}{m_i} \sum_{x \in C_i} x\)
where:
- \(C_i\) is the \(i^{th}\) cluster
- \(c_i\) is the centroid of cluster \(C_i\)
- \(m_i\) is the number of objects in cluster \(C_i\)
- \(x\) is an object in cluster \(C_i\)
Computing Error
\(\displaystyle SSE = \sum_{i=1}^{k}\sum_{x \in C_i} dist(c_i, x)^2\)
For document data, compute the total cohesion (to be maximized) instead:
\(\displaystyle \text{Cohesion} = \sum_{i=1}^{k}\sum_{x \in C_i} cosine(c_i, x)\)
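A small numpy sketch of both quantities (illustrative; it assumes `X` is an \((m, n)\) data matrix, `labels` holds each object's cluster index, and `centroids` is the \((K, n)\) array of cluster means):

```python
import numpy as np

def sse(X, labels, centroids):
    # sum of squared Euclidean distances of each object to its cluster centroid
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

def total_cohesion(X, labels, centroids):
    # sum of cosine similarities of each object (document) to its cluster centroid
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return float(np.sum(np.einsum("ij,ij->i", Xn, Cn[labels])))
```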
Postprocessing
- goal: reduce SSE with postprocessing
- the obvious strategy is to find more clusters, i.e. increase \(K\)
- otherwise, try to reduce SSE by adjusting the discovered clusters through splitting and merging
- improvement is often possible because the solution tends to converge only to a local minimum of SSE
- strategies for decreasing SSE by splitting a cluster:
- choose the cluster with the largest SSE, largest standard deviation, etc., then split it
- introduce a new centroid: choose the point farthest away from any cluster centroid, or choose a point at random
- strategies for reducing the number of clusters while keeping the increase in total SSE minimal (see the sketch below):
- disperse a cluster: remove the centroid whose removal increases total SSE the least, and reassign its points to other clusters
- merge two clusters: typically the two clusters with the closest centroids, or the pair whose merge increases total SSE the least
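As an illustration of the merge strategy, a numpy sketch that merges the two clusters with the closest centroids (names and array shapes are assumptions: `X` is \((m, n)\), `labels` is the cluster index per object, `centroids` is \((K, n)\)):

```python
import numpy as np

def merge_closest_clusters(X, labels, centroids):
    # find the pair of clusters whose centroids are closest
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    a, b = np.unravel_index(d.argmin(), d.shape)
    # move cluster b's points into cluster a and recompute that centroid
    new_labels = np.where(labels == b, a, labels)
    merged_centroid = X[new_labels == a].mean(axis=0)
    new_centroids = np.delete(centroids, b, axis=0)
    new_centroids[a if a < b else a - 1] = merged_centroid
    # keep labels contiguous in 0..K-2 after removing cluster b
    new_labels = np.where(new_labels > b, new_labels - 1, new_labels)
    return new_labels, new_centroids
```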
Alternatives
K-means++
- guaranteed (in expectation) to find a solution within a factor of \(O(\log K)\) of the optimal SSE => generally better results and smaller SSE
- pick the first centroid at random, then pick each subsequent centroid to be far from those already chosen; centroids are picked incrementally until \(K\) have been selected
- each point has a probability of being picked as the next centroid that is proportional to the square of its distance to its closest already-selected centroid
- it might seem that this approach would tend to choose outliers as centroids, but because outliers are rare by definition, this is unlikely
- K-means++ can be combined with bisecting K-means and with postprocessing to further improve results
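A numpy sketch of the K-means++ selection step (illustrative; the function name and use of `numpy.random.Generator` are choices made here):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its closest already-selected centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

The resulting centroids are then used as the initial centroids for the standard K-means loop.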
Bisecting K-means
- an extension of basic K-means in which clusters are repeatedly split until \(K\) clusters have been obtained
- the choice of which cluster to split can be based on e.g. the largest SSE, the size (most points), or some combination of size and SSE criteria
- this approach is less susceptible to initialization issues, but it is not guaranteed to yield a globally or even locally minimal solution
- because clusters are split locally, the final clustering may not be a local minimum of SSE; the resulting centroids can, however, be used as initial centroids for a run of standard K-means
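A compact sketch of bisecting K-means (illustrative; it assumes a basic `kmeans(points, k, seed)` function that returns `(centroids, labels)`, like the one sketched earlier, and that each bisection yields two non-empty clusters):

```python
import numpy as np

def cluster_sse(points):
    return float(((points - points.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(X, k, kmeans, trials=5):
    clusters = [X]                                    # start with a single cluster
    while len(clusters) < k:
        # choose the cluster with the largest SSE to split
        target = clusters.pop(int(np.argmax([cluster_sse(c) for c in clusters])))
        # perform several trial bisections and keep the one with the lowest SSE
        best_split, best_sse = None, np.inf
        for seed in range(trials):
            _, labels = kmeans(target, 2, seed=seed)
            split = [target[labels == 0], target[labels == 1]]
            total = sum(cluster_sse(s) for s in split)
            if total < best_sse:
                best_split, best_sse = split, total
        clusters.extend(best_split)
    return clusters
```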
Analysis
Complexity
- \(m\) = number of points (records)
- \(n\) = number of attributes
- \(I\) = number of iterations until convergence
K-means Complexity | |
---|---|
Time | \(O(I \times K \times m \times n)\) |
Space | \(O((m + K) n)\) |
Advantages
- simple, and in the general case applicable to various types of data
- generally efficient, which makes multiple runs of the algorithm feasible
- some variants (e.g. bisecting K-means) are less susceptible to initialization issues
Limitations
- not guaranteed to find the globally optimal solution
- empty clusters may be obtained => can be handled by choosing a replacement centroid (e.g. the point farthest from any current centroid; see the sketch after this list) or by updating centroids incrementally
- outliers can distort the obtained clusters: they influence the prototypes and increase SSE => reduce their impact by removing outliers in a preprocessing step
- not well suited to clusters of non-globular shape or of highly varying sizes/densities, and a meaningful notion of a centroid must exist => one strategy is to find many small clusters and then combine them in a postprocessing step
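A numpy sketch of the replacement-centroid fix for empty clusters (illustrative; picking the point farthest from any current centroid is just one of the options mentioned above):

```python
import numpy as np

def fix_empty_clusters(X, labels, centroids):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    for j in range(len(centroids)):
        if not np.any(labels == j):                  # cluster j ended up empty
            farthest = dists.min(axis=1).argmax()    # point farthest from all centroids
            centroids[j] = X[farthest]
            labels[farthest] = j
            dists[farthest] = 0.0                    # avoid picking the same point again
    return labels, centroids
```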