K-means
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 5, section 2
Unsupervised learning > clustering
- K-means generates \(K\) clusters, each defined around a centroid prototype, which is usually the mean of the cluster's points
- usually applied to objects in continuous \(n\)-dimensional space
- if the prototype must be an actual data point (a medoid), the algorithm is called K-medoids
- applicable to a wide variety of data, e.g. continuous numeric data, documents, and time series
Pseudo code
1. select \(K\) points as initial centroids
2. repeat
3. form \(K\) clusters by assigning each point to its closest centroid
4. recompute the centroid of each cluster
5. until the centroids do not change
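A minimal Python/numpy sketch of this loop (illustrative only; Euclidean distance is assumed, and the names `kmeans`, `X`, `k` are choices made here, not from the book):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. select K points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2-3. assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. recompute each centroid as the mean of its assigned points
        #    (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 5. stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```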
- \(K\) is a user-specified parameter
- the selection of initial centroids greatly impacts the quality of the result
- various proximity measures can be used to compute the distance between points, e.g. Euclidean or cosine => prefer a simple measure, because it has to be computed repeatedly
- the centroid is the point that minimizes the sum of squared distances to the points in its cluster; with Euclidean distance, cluster quality is measured by the sum of squared error (SSE)
- over many runs of the algorithm, the clustering solution with the smallest SSE is preferred
- for the commonly used proximity functions, K-means is guaranteed to converge
- most of the change happens in the early iterations => the stopping condition can be weakened to e.g. "until fewer than 1% of the points change clusters"
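The same knobs appear as parameters of scikit-learn's `KMeans`; a usage sketch, assuming scikit-learn is installed (the toy data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)        # toy data: 200 points in 2-D
km = KMeans(n_clusters=3,         # K, user-specified
            init="random",        # plain random initialization
            n_init=10,            # keep the best of 10 runs (lowest SSE)
            tol=1e-4)             # relaxed convergence criterion
labels = km.fit_predict(X)
print(km.inertia_)                # SSE of the best run
```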
Possible Distance metrics
Function | Centroid | Objective function |
---|---|---|
Manhattan (\(L_1\)) | median | Minimize sum of \(L_1\) distance of an object to its cluster centroid |
Squared Euclidean (\(L_2^2\)) | mean | Minimize sum of squared \(L_2\) distance of an object to its cluster centroid |
Cosine | mean | Maximize sum of cosine similarity of an object to its cluster centroid |
Bregman divergence | mean | Minimize sum of Bregman divergence of an object to its cluster centroid |
All of the above are guaranteed to converge.
Choosing initial centroids
- Poor choices yield poor results.
- An optimal clustering can still be obtained if, e.g., two initial centroids fall anywhere within a pair of clusters, since the centroids will redistribute themselves, one to each cluster
- when \(K\) increases, it becomes harder to distribute the initial centroids so that they cover the natural clusters
- strategies for choosing initial centroids:
- random initialization (over multiple runs, to improve the result)
- take a sample of points and apply hierarchical clustering, then use the resulting centroids as initial centroids for K-means => works well, but is practical only for small samples (at most a few thousand records) and a small \(K\)
- (1) choose the first point at random (or take the centroid of all points), (2) repeatedly select as the next centroid the point farthest from the centroids already chosen => yields well-separated centroids, but may select outliers as centroids (mitigated by working on an initial sample; see the sketch below)
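A numpy sketch of that farthest-point strategy (illustrative; the function name, the sampling step, and starting from the overall centroid are choices made here, not prescribed by the text):

```python
import numpy as np

def farthest_point_init(X, k, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    # work on a random sample to cut cost and reduce the chance of picking outliers
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    # (1) start from the centroid of all (sampled) points
    centroids = [sample.mean(axis=0)]
    for _ in range(k - 1):
        # distance of each sampled point to its closest centroid chosen so far
        d = np.linalg.norm(sample[:, None, :] - np.array(centroids)[None, :, :],
                           axis=2).min(axis=1)
        # (2) pick the point farthest from the already selected centroids
        centroids.append(sample[d.argmax()])
    return np.array(centroids)
```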
Centroid and SSE
Computing the centroid of the \(i^{th}\) cluster:
\(\displaystyle c_i = \frac{1}{m_i} \sum_{x \in C_i} x\)
where:
- \(C_i\) is the \(i^{th}\) cluster
- \(c_i\) is the centroid of cluster \(C_i\)
- \(m_i\) is the number of objects in cluster \(C_i\)
- \(x\) is an object in cluster \(C_i\)
Computing Error
\(\displaystyle SSE = \sum_{i=1}^{k}\sum_{x \in C_i} dist(c_i, x)^2\)
For document data, compute the total cohesion (to be maximized) instead:
\(\displaystyle \text{Cohesion} = \sum_{i=1}^{k}\sum_{x \in C_i} cosine(c_i, x)\)
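A small numpy sketch of both quantities (illustrative; it assumes `X` is an \((m, n)\) data matrix, `labels` holds each object's cluster index, and `centroids` is the \((K, n)\) array of cluster means):

```python
import numpy as np

def sse(X, labels, centroids):
    # sum of squared Euclidean distances of each object to its cluster centroid
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

def total_cohesion(X, labels, centroids):
    # sum of cosine similarities of each object (document) to its cluster centroid
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return float(np.sum(np.einsum("ij,ij->i", Xn, Cn[labels])))
```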
Postprocessing
- goal: reduce SSE with postprocessing
- the obvious strategy is to find more clusters, i.e. increase \(K\)
- otherwise, try to reduce SSE by adjusting the discovered clusters through splitting and merging
- improvement is often possible because the solution tends to converge only to a local minimum of SSE
- strategies for decreasing SSE by splitting a cluster:
- choose the cluster with the largest SSE, largest standard deviation, etc., then split it
- introduce a new centroid: choose the point farthest away from any cluster centroid, or choose a point at random
- strategies for reducing the number of clusters while keeping the increase in total SSE minimal (see the sketch below):
- disperse a cluster: remove the centroid whose removal increases total SSE the least, and reassign its points to other clusters
- merge two clusters: typically the two clusters with the closest centroids, or the pair whose merge increases total SSE the least
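As an illustration of the merge strategy, a numpy sketch that merges the two clusters with the closest centroids (names and array shapes are assumptions: `X` is \((m, n)\), `labels` is the cluster index per object, `centroids` is \((K, n)\)):

```python
import numpy as np

def merge_closest_clusters(X, labels, centroids):
    # find the pair of clusters whose centroids are closest
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    a, b = np.unravel_index(d.argmin(), d.shape)
    # move cluster b's points into cluster a and recompute that centroid
    new_labels = np.where(labels == b, a, labels)
    merged_centroid = X[new_labels == a].mean(axis=0)
    new_centroids = np.delete(centroids, b, axis=0)
    new_centroids[a if a < b else a - 1] = merged_centroid
    # keep labels contiguous in 0..K-2 after removing cluster b
    new_labels = np.where(new_labels > b, new_labels - 1, new_labels)
    return new_labels, new_centroids
```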
Alternatives
K-means++
- guaranteed (in expectation) to find a solution within a factor of \(O(\log K)\) of the optimal SSE => generally better results and smaller SSE
- pick the first centroid at random, then pick each subsequent centroid to be far from those already chosen; centroids are picked incrementally until \(K\) have been selected
- each point has a probability of being picked as the next centroid that is proportional to the square of its distance to its closest already-selected centroid
- it might seem that this approach would tend to choose outliers as centroids, but because outliers are rare by definition, this is unlikely
- K-means++ can be combined with bisecting K-means and with postprocessing to further improve results
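A numpy sketch of the K-means++ selection step (illustrative; the function name and use of `numpy.random.Generator` are choices made here):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its closest already-selected centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

The resulting centroids are then used as the initial centroids for the standard K-means loop.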
Bisecting K-means
- an extension of basic K-means in which clusters are repeatedly split until \(K\) clusters have been obtained
- the choice of which cluster to split can be based on e.g. the largest SSE, the size (most points), or some combination of size and SSE criteria
- this approach is less susceptible to initialization issues, but it is not guaranteed to yield a globally or even locally minimal solution
- because clusters are split locally, the final clustering may not be a local minimum of SSE; the resulting centroids can, however, be used as initial centroids for a run of standard K-means
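A compact sketch of bisecting K-means (illustrative; it assumes a basic `kmeans(points, k, seed)` function that returns `(centroids, labels)`, like the one sketched earlier, and that each bisection yields two non-empty clusters):

```python
import numpy as np

def cluster_sse(points):
    return float(((points - points.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(X, k, kmeans, trials=5):
    clusters = [X]                                    # start with a single cluster
    while len(clusters) < k:
        # choose the cluster with the largest SSE to split
        target = clusters.pop(int(np.argmax([cluster_sse(c) for c in clusters])))
        # perform several trial bisections and keep the one with the lowest SSE
        best_split, best_sse = None, np.inf
        for seed in range(trials):
            _, labels = kmeans(target, 2, seed=seed)
            split = [target[labels == 0], target[labels == 1]]
            total = sum(cluster_sse(s) for s in split)
            if total < best_sse:
                best_split, best_sse = split, total
        clusters.extend(best_split)
    return clusters
```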
Analysis
Complexity
- \(m\) = number of points (records)
- \(n\) = number of attributes
- \(I\) = number of iterations until convergence
K-means Complexity | |
---|---|
Time | \(O(I \times K \times m \times n)\) |
Space | \(O((m + K) n)\) |
Advantages
- simple, and in the general case applicable to various types of data
- generally efficient, which makes multiple runs of the algorithm feasible
- some variants (e.g. bisecting K-means) are less susceptible to initialization issues
Limitations
- not guaranteed to find the globally optimal solution
- empty clusters may be obtained => can be handled by choosing a replacement centroid (e.g. the point farthest from any current centroid; see the sketch after this list) or by updating centroids incrementally
- outliers can distort the obtained clusters: they influence the prototypes and increase SSE => reduce their impact by removing outliers in a preprocessing step
- not well suited to clusters of non-globular shape or of highly varying sizes/densities, and a meaningful notion of a centroid must exist => one strategy is to find many small clusters and then combine them in a postprocessing step
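A numpy sketch of the replacement-centroid fix for empty clusters (illustrative; picking the point farthest from any current centroid is just one of the options mentioned above):

```python
import numpy as np

def fix_empty_clusters(X, labels, centroids):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    for j in range(len(centroids)):
        if not np.any(labels == j):                  # cluster j ended up empty
            farthest = dists.min(axis=1).argmax()    # point farthest from all centroids
            centroids[j] = X[farthest]
            labels[farthest] = j
            dists[farthest] = 0.0                    # avoid picking the same point again
    return labels, centroids
```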