Basics
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 5
- Basic clustering problem:
  - given: a set of objects $X$ and inter-object distances $D(x, y)$
  - output: a partition $P_D$ such that $P_D(x) = P_D(y)$ if $x$ and $y$ are in the same cluster
- divide data into meaningful and/or useful groups, to describe or summarize the data
- clusters describe data instances with similar characteristics
- ideally: small distances between points within a cluster; large distances between clusters
- Some utility applications: data summarization, compression, efficiency in finding nearest neighbors
- the solution of a clustering problem is very much driven by the specific choice of algorithm
Clustering techniques
- based on nesting of clusters:
  - partitional (un-nested): divides data into non-overlapping clusters such that each instance is in exactly one cluster
  - hierarchical (nested): clusters are permitted to have sub-clusters; can be visualized as a tree
- based on count of instance memberships:
  - exclusive: each instance belongs to a single cluster
  - overlapping / non-exclusive: an instance can belong to more than one cluster
  - fuzzy / soft (probabilistic): an instance's membership in each cluster is a weight between 0 and 1, and a typical additional constraint requires the weights to sum to 1 => convert fuzzy to exclusive by assigning each object to the cluster with the highest weight (see the sketch after this list)
- based on instance membership in any cluster:
  - complete: each instance is assigned to some cluster
  - partial: not all instances are assigned a cluster; primary motivations include omitting noise and outliers
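As a quick illustration of the fuzzy-to-exclusive conversion mentioned above, a minimal NumPy sketch (the `weights` matrix is a made-up example):

```python
import numpy as np

# Hypothetical (num_objects x num_clusters) membership matrix whose rows
# sum to 1, e.g. as produced by fuzzy c-means.
weights = np.array([
    [0.7, 0.2, 0.1],   # object 0: mostly cluster 0
    [0.1, 0.6, 0.3],   # object 1: mostly cluster 1
    [0.3, 0.3, 0.4],   # object 2: mostly cluster 2
])

# Convert to an exclusive clustering: each object goes to the cluster
# with the highest membership weight.
labels = weights.argmax(axis=1)
print(labels)  # [0 1 2]
```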
Types of clusters
- well-separated: the distance between any two points in different clusters is larger than the distance between any two points within the same cluster
- prototype-based / center-based: a prototype defines each cluster, and each object in a cluster is closer to the prototype that defines its cluster than to the prototype of any other cluster; the prototype may be an actual record (a medoid) or a computed centroid
- contiguity-based / nearest neighbor / transitive: each point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
- graph-based: nodes represent data objects and edges represent their connections; a cluster is a connected component, i.e. each node is connected to others within the cluster but has no connections to nodes outside it => when every pair of nodes in the cluster is directly connected, the nodes form a clique (see the sketch after this list)
- density-based: a cluster is a dense region surrounded by a region of low density => a useful strategy when clusters are irregular or intertwined, or when noise and outliers are present
- shared-property / conceptual clusters: objects in a cluster share some common property; this encompasses all of the above, but enables defining clusters not covered by the previous definitions
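To make the contiguity/graph-based view concrete, here is a small sketch (not from the book) that connects points within a hypothetical threshold `eps` and takes connected components of the resulting graph as clusters:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(1).normal(size=(20, 2))  # toy data

# Build an adjacency matrix: points closer than eps count as "connected".
eps = 0.5  # hypothetical connection threshold
adjacency = squareform(pdist(X)) < eps

# Each connected component of the graph is one contiguity-based cluster.
n_clusters, labels = connected_components(adjacency, directed=False)
```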
Clustering properties
- Richness: for any assignment of objects to clusters, there is some distance matrix $D$ such that $P_D$ returns that clustering
- Scale-invariance: scaling all distances by a positive value does not change the clustering
- Consistency: shrinking intra-cluster distances and expanding inter-cluster distances does not change the clustering
Impossibility theorem (Kleinberg): no clustering algorithm can have all three properties at once.
Other considerations:
- order dependence: whether results differ based on the order of processing of data points
- nondeterminism: whether different runs give different results
- scalability: how large a dataset the clustering algorithm can reasonably handle
- parameter selection: number of parameters and selection of optimal parameter values
Algorithms
Algorithm | Description | Category | Time Complexity (optimal) | Space Complexity |
---|---|---|---|---|
K-means | Find a user-specified number of clusters ($K$), represented by their centroids | prototype-based, partitional, complete | $O(I \cdot K \cdot m \cdot n)$ | $O((m + K) \cdot n)$ |
Agglomerative hierarchical clustering | Find hierarchical clusters by repeatedly merging the closest pair of clusters | prototype-based / graph-based, hierarchical, complete | $O(m^2 \log m)$ | $O(m^2)$ |
DBSCAN | Cluster by density, ignoring low-density regions | density-based, partitional, partial | $O(m \log m)$ | $O(m)$ |
Fuzzy C-means | Points have weighted membership in clusters | prototype-based, partitional, fuzzy | $O(I \cdot K \cdot m \cdot n)$ | $O((m + K) \cdot n)$ |
where: $m$ = number of data points, $n$ = number of attributes, $K$ = number of clusters, $I$ = number of iterations
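A minimal NumPy sketch of the K-means loop (Lloyd's algorithm) to make the table concrete; the random initialization and function names are my own, not the book's:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: O(K * m * n) distance computations per iteration.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid (keep old one if cluster empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids
```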
Also see:
- Expectation Maximization - statistical approach using mixture models
- Single-linkage clustering (SLC) - special case of agglomerative using single-link
- Self-organizing maps - clustering and data visualization based on neural networks
Cluster Validation
Some considerations:
- clustering tendency - determining whether non-random structure actually exists in the data
- determining the correct number of clusters
- evaluating how well the results of a cluster analysis fit the data without reference to external information
- comparing the results of a cluster analysis to externally known results, such as externally provided class labels
- comparing two sets of clusters to determine which is better
Defining cluster evaluation metrics is difficult; there are three general categories:
- unsupervised: measure goodness without reference to external information (e.g. SSE); two subclasses:
  - cohesion: compactness, tightness - how closely related the objects in a cluster are
  - separation: isolation - how distinct or well-separated clusters are
- supervised: measure the extent to which the clustering structure discovered by a clustering algorithm matches some external structure (e.g. entropy of cluster memberships compared to known class labels)
- relative: compare different clusterings or clusters, using either supervised or unsupervised metrics
Unsupervised
Cohesion and separation
- for the graph-based view of cohesion/separation:
  - $\text{cohesion}(C_i) = \sum_{x \in C_i,\, y \in C_i} \text{proximity}(x, y)$
  - $\text{separation}(C_i, C_j) = \sum_{x \in C_i,\, y \in C_j} \text{proximity}(x, y)$
- the proximity function can be a similarity or a dissimilarity
- if clusters have a prototype (centroid or medoid), cohesion and separation can be defined as distances involving the prototype:
  - $\text{cohesion}(C_i) = \sum_{x \in C_i} \text{proximity}(x, c_i)$
  - $\text{separation}(C_i, C_j) = \text{proximity}(c_i, c_j)$
- between-group sum of squares (SSB) measures the distance between clusters:
  - $SSB = \sum_{i=1}^{K} m_i \, \text{dist}(c_i, c)^2$, where $c_i$ is the centroid of cluster $i$, $c$ is the overall centroid, and $m_i$ is the number of points in cluster $i$
- the results of cohesion/separation analysis can be used to improve a clustering by merging or splitting clusters
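A small NumPy sketch of prototype-based cohesion (SSE) and separation (SSB), following the centroid-based definitions above (the function name is my own):

```python
import numpy as np

def sse_ssb(X, labels):
    """Total within-cluster SSE (cohesion) and between-group SSB (separation)."""
    overall = X.mean(axis=0)              # overall centroid c
    sse, ssb = 0.0, 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        c_j = members.mean(axis=0)        # cluster centroid c_j
        sse += ((members - c_j) ** 2).sum()                  # cohesion term
        ssb += len(members) * ((c_j - overall) ** 2).sum()   # separation term
    return sse, ssb
```

A useful identity: SSE + SSB equals the total sum of squares $\sum_x \text{dist}(x, c)^2$, so minimizing SSE is equivalent to maximizing SSB.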
Silhouette Coefficient
- Metric for evaluating individual points within a clustering. Steps for the $i$-th object:
  - calculate $a_i$, the average distance from object $i$ to all other objects in its cluster
  - calculate $b_i$, the minimum average distance from object $i$ to the objects of any cluster not containing $i$
  - the silhouette coefficient is: $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
- The range of the coefficient is $-1$ to $1$. A negative value is bad; a value as close to $1$ as possible is good (obtained when $a_i = 0$).
- A cluster's S.C. can be computed by taking the average S.C. of its points (see the sketch below)
- Does not work well on clusters with arbitrary shapes.
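A from-scratch sketch of the silhouette computation above (in practice, `sklearn.metrics.silhouette_score` computes the mean value directly):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette coefficients s_i = (b_i - a_i) / max(a_i, b_i)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dists
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                    # exclude the point itself
        if not same.any():
            s[i] = 0.0                     # convention for singleton clusters
            continue
        a = D[i, same].mean()              # a_i: avg distance within own cluster
        b = min(D[i, labels == other].mean()   # b_i: smallest avg distance
                for other in np.unique(labels) # to any other cluster
                if other != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

A cluster's silhouette coefficient is then `s[labels == j].mean()`, and the overall clustering's is `s.mean()`.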
Cophenetic Correlation Coefficient (CPCC)
Cophenetic Correlation Coefficient (CPCC) is a metric for evaluating hierarchical clusterings
- the cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time
- CPCC is the correlation between the entries of the cophenetic distance matrix and the original distance matrix
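A short sketch of computing CPCC with SciPy (toy data; `method='average'` is an arbitrary choice of linkage):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data
d = pdist(X)                        # original (condensed) distance matrix
Z = linkage(d, method='average')    # agglomerative hierarchical clustering
cpcc, coph_dists = cophenet(Z, d)   # correlation + cophenetic distances

# CPCC close to 1 => the dendrogram preserves the original distances well.
print(cpcc)
```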
Supervised
- supervised methods require knowing the actual class labels of the data
- these labels establish a "ground truth" against which clustering results are evaluated
- metrics analogous to precision, recall, and the F-measure can be defined for clusters
Entropy
- measures the degree to which clusters match known class labels
- for a set of clusters: $e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$
- for an individual cluster $i$: $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$, where $p_{ij} = \frac{m_{ij}}{m_i}$ is the probability that a member of cluster $i$ belongs to class $j$
- where:
  - $L$ is the number of classes and $K$ the number of clusters
  - $m$ is the total number of objects and $m_i$ the number of objects in cluster $i$
  - $m_{ij}$ is the number of objects of class $j$ in cluster $i$
Purity
- purity of an individual cluster $i$ is the maximum class probability: $\text{purity}_i = \max_j p_{ij}$
- overall purity of a clustering is: $\text{purity} = \sum_{i=1}^{K} \frac{m_i}{m}\, \text{purity}_i$ (see the sketch below)
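A small sketch computing overall entropy and purity from cluster assignments and known class labels (both toy arrays are hypothetical):

```python
import numpy as np

def entropy_purity(labels, classes):
    """Weighted overall entropy e and purity of a clustering."""
    m = len(labels)
    e, purity = 0.0, 0.0
    for i in np.unique(labels):
        members = classes[labels == i]
        m_i = len(members)
        p = np.bincount(members) / m_i    # p_ij = m_ij / m_i for each class j
        p = p[p > 0]                      # drop zeros to avoid log2(0)
        e += (m_i / m) * -(p * np.log2(p)).sum()   # weighted cluster entropy
        purity += (m_i / m) * p.max()              # weighted cluster purity
    return e, purity

labels  = np.array([0, 0, 0, 1, 1, 1])   # cluster assignments
classes = np.array([0, 0, 1, 1, 1, 1])   # known class labels
print(entropy_purity(labels, classes))   # (~0.459, ~0.833)
```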