Formulas
Bayes rule
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
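As a quick numeric illustration, the sketch below computes a posterior from a prior, a likelihood, and a false-positive rate, expanding the evidence term by total probability. The function name `posterior` and all numbers are made up for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded by total probability."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prior, 90% true-positive rate, 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 4))
```

Even with a fairly accurate test, the small prior keeps the posterior low in this made-up scenario.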
Bitvector similarity
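Simple matching | Jaccard coefficient |
---|---|
$SMC = \frac{f_{11} + f_{00}}{f_{00} + f_{01} + f_{10} + f_{11}}$ | $J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$ |
- $f_{ab}$ is the number of attributes where the first vector has value $a$ and the second has value $b$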
- measure similarity between two binary vectors
- range: [0, 1] where 1 is highly similar, 0 is dissimilar
- see: example
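A minimal sketch of both coefficients for two 0/1 vectors of equal length. The function name `smc_and_jaccard` is hypothetical, and returning 0.0 for the Jaccard coefficient when neither vector contains a 1 is an assumption of this sketch.

```python
def smc_and_jaccard(x, y):
    """Return (simple matching coefficient, Jaccard coefficient) for two 0/1 vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    mismatches = len(x) - f11 - f00  # f01 + f10
    smc = (f11 + f00) / len(x)
    # Jaccard ignores 0-0 matches; 0.0 when there are no 1s at all (assumption).
    denom = f11 + mismatches
    jaccard = f11 / denom if denom else 0.0
    return smc, jaccard


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x, y))  # (0.7, 0.0)
```

The two vectors agree mostly on zeros, so simple matching rates them as similar while Jaccard, which counts only shared ones, does not.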
Centroid
$$c_i = \frac{1}{m_i} \sum_{x \in C_i} x$$
where $c_i$ is the centroid of cluster $C_i$, $m_i$ is the number of objects in cluster $C_i$, and $x$ is an object in cluster $C_i$
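A centroid is the attribute-wise mean of the objects in a cluster. A minimal sketch, assuming each object is a tuple of numbers (the function name `centroid` is hypothetical):

```python
def centroid(points):
    """Attribute-wise mean of a non-empty list of equal-length points."""
    m = len(points)
    # zip(*points) groups the k-th attribute of every point together.
    return [sum(coords) / m for coords in zip(*points)]


print(centroid([(0, 0), (2, 0), (1, 3)]))  # [1.0, 1.0]
```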
Correlation
$$\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{SD}(x)\,\mathrm{SD}(y)}$$
- where cov is covariance and SD is standard deviation
- range min, max: [-1, 1]
- the closer to −1 or 1, the stronger the correlation between the variables
- +1: perfect direct (increasing) linear relationship
- −1: perfect inverse (decreasing) linear relationship (anti-correlation)
- invariant both to scaling (multiplication) and translation (constant offset)
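The sketch below computes correlation as covariance divided by the product of the standard deviations, and checks the invariance property on a small example. Dividing by n - 1 (the sample form) is an assumption here, and the helper names are hypothetical.

```python
import math


def mean(v):
    return sum(v) / len(v)


def covariance(x, y):
    """Sample covariance (divides by n - 1; an assumption of this sketch)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def std_dev(v):
    # The standard deviation is the square root of a variable's covariance with itself.
    return math.sqrt(covariance(v, v))


def correlation(x, y):
    return covariance(x, y) / (std_dev(x) * std_dev(y))


x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]  # y = -x / 3, a perfect inverse linear relationship
print(correlation(x, y))
# Scaling x by a positive constant and adding an offset leaves the value unchanged.
print(correlation([2 * a + 5 for a in x], y))
```

Both printed values are -1 up to floating-point rounding.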
Covariance
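$$\mathrm{cov}(x, y) = \frac{1}{n - 1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})$$
where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ and $n$ is the number of observations (sample form; divide by $n$ instead for the population covariance)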
Cosine similarity
$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
- measures similarity between vectors
- min, max range: [-1, 1]
- −1 => exactly opposite, 1 => exactly the same, 0 => orthogonal
- in-between values indicate intermediate similarity or dissimilarity
- invariant to scaling (multiplication) but not to translation (constant offset)
- see: example
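A minimal sketch of cosine similarity as the dot product divided by the product of the vector lengths. The function name and the sample vectors are illustrative; zero vectors are not handled.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 2))  # 0.31

# Scaling one vector leaves the result unchanged; adding an offset does not.
print(cosine_similarity([10 * a for a in d1], d2))
print(cosine_similarity([a + 1 for a in d1], d2))
```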
Entropy
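$$\mathrm{Entropy}(t) = -\sum_{i=1}^{c} p(i \mid t) \log_2 p(i \mid t)$$
where $p(i \mid t)$ is the proportion of instances at node $t$ belonging to class $i$ and $c$ is the total number of classes
- min: 0, max: $\log_2 c$
- binary: $-p \log_2 p - (1 - p) \log_2 (1 - p)$
- for multiple children, compute weighted entropy: $\sum_{j=1}^{k} \frac{N(v_j)}{N} \mathrm{Entropy}(v_j)$, where $N(v_j)$ is the number of instances in child $v_j$ and $N$ is the number of instances at the parent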
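A minimal sketch of node entropy and the size-weighted entropy of a split, working from per-class instance counts. It assumes base-2 logarithms and treats 0 log 0 as 0; the function names are hypothetical.

```python
import math


def entropy(counts):
    """Base-2 entropy of a node given its per-class instance counts."""
    total = sum(counts)
    # p * log2(1 / p) summed over classes; empty classes contribute nothing.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def weighted_entropy(children):
    """Entropy of a split: each child's entropy weighted by its share of instances."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * entropy(child) for child in children)


print(entropy([5, 5]))                       # 1.0 (maximum for two classes)
print(entropy([10, 0]))                      # 0.0 (pure node)
print(weighted_entropy([[5, 5], [10, 0]]))   # 0.5
```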
Euclidean distance
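$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$
where $n$ is the number of dimensions/attributes and $x_k$ and $y_k$ are the $k$-th attributes of data objects $x$ and $y$
- measure distance between instances
- not invariant to scaling (multiplication) or translation (constant offset)
- specialized case of Minkowski distance where the order parameter $r = 2$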
Gini index
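$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} p(i \mid t)^2$$
where $p(i \mid t)$ is the proportion of instances at node $t$ belonging to class $i$ (relative frequency of training instances) and $c$ is the total number of classes
- range min: 0, max: $1 - \frac{1}{c}$
- binary classification: $1 - p^2 - (1 - p)^2 = 2p(1 - p)$
- for multiple children, compute weighted Gini index: $\sum_{j=1}^{k} \frac{N(v_j)}{N} \mathrm{Gini}(v_j)$, where $N(v_j)$ is the number of instances in child $v_j$ and $N$ is the number of instances at the parent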
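A minimal sketch of the Gini index of a node and the size-weighted Gini index of a split, working from per-class instance counts. The function names are hypothetical.

```python
def gini(counts):
    """Gini index of a node given its per-class instance counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Gini index of a split: each child's index weighted by its share of instances."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


print(gini([5, 5]))                      # 0.5 (maximum for two classes)
print(gini([10, 0]))                     # 0.0 (pure node)
print(weighted_gini([[5, 5], [10, 0]]))  # 0.25
```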
Group average
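$$\mathrm{proximity}(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} \mathrm{proximity}(x, y)}{m_i \, m_j}$$
- where $m_i$ is the number of objects in cluster $C_i$ and $m_j$ is the number of objects in cluster $C_j$
- distance measure for hierarchical clustering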
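Group average is the mean pairwise proximity between the objects of two clusters. The sketch below assumes Euclidean distance as the pairwise proximity and uses `math.dist` (Python 3.8+); the function name is hypothetical.

```python
import math


def group_average(cluster_i, cluster_j):
    """Mean pairwise Euclidean distance between points of two clusters."""
    total = sum(math.dist(x, y) for x in cluster_i for y in cluster_j)
    return total / (len(cluster_i) * len(cluster_j))


# One-dimensional points written as 1-tuples; pairwise distances are 4, 6, 2, 4.
print(group_average([(0,), (2,)], [(4,), (6,)]))  # 4.0
```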
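Information gain
$$\Delta = I(\mathrm{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)$$
where $I$ is the impurity of a node (entropy, for information gain), $N$ is the number of instances at the parent, and $N(v_j)$ is the number of instances in child $v_j$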
Mahalanobis distance
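$$d(x, y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)}$$
where $\Sigma$ is the covariance matrix of the data (some texts omit the square root and use the squared form)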
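A minimal sketch of the Mahalanobis distance for two-dimensional points, inverting the 2x2 covariance matrix by hand. It assumes the square-root form of the distance; the function name and sample matrices are illustrative.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]]
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - y[0], x[1] - y[1]]
    # Quadratic form (x - y)^T * inv * (x - y)
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


# With the identity matrix the result equals the Euclidean distance.
print(mahalanobis_2d((0, 0), (3, 4), [[1, 0], [0, 1]]))  # 5.0
# A larger variance along the first axis shrinks distances along that axis.
print(mahalanobis_2d((0, 0), (2, 0), [[4, 0], [0, 1]]))  # 1.0
```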
Manhattan distance
$$d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$$
- $L_1$ distance or $L_1$ norm
- specialized case of Minkowski distance where the order parameter $r = 1$
Mean
$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k$$
Minkowski distance
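$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r}$$
where $r$ is a parameter, $n$ is the number of dimensions/attributes, and $x_k$ and $y_k$ are the $k$-th attributes of data objects $x$ and $y$
- generalization of Euclidean distance
- when $r = 1$ => Manhattan distance; when $r = 2$ => Euclidean distance
- when $r \to \infty$ => "supremum"/$L_{\max}$ (or $L_\infty$) norm distance: $\max_k |x_k - y_k|$
- note: $r$ is independent of the number of dimensions $n$ => all distances defined for all dimensions
- weighted Minkowski: $d(x, y) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^{r} \right)^{1/r}$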
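A minimal sketch of the Minkowski distance with optional per-attribute weights, covering the Manhattan, Euclidean, and supremum special cases. The function signature is hypothetical, and ignoring the weights in the supremum case is a simplification of this sketch.

```python
import math


def minkowski(x, y, r, weights=None):
    """Minkowski distance of order r, optionally weighted per attribute."""
    if weights is None:
        weights = [1.0] * len(x)
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Supremum distance: largest per-attribute difference (weights ignored here).
        return max(diffs)
    return sum(w * d ** r for w, d in zip(weights, diffs)) ** (1.0 / r)


x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))         # 7.0 (Manhattan)
print(minkowski(x, y, 2))         # 5.0 (Euclidean)
print(minkowski(x, y, math.inf))  # 4 (supremum)
```

Passing `weights=[1, 0]` with `r=2` would drop the second attribute and return 3.0 for the same pair.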
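Misclassification error
$$\mathrm{Error}(t) = 1 - \max_{i} p(i \mid t)$$
where $p(i \mid t)$ is the proportion of instances at node $t$ belonging to class $i$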
Silhouette Coefficient
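- for some point $i$ in a cluster, $s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
- $a_i$ is the average distance of $i$ to the other points in its cluster
- $b_i$ is the minimum, over all clusters not containing $i$, of the average distance from $i$ to the points of that cluster
- range -1 to 1
- negative value is bad ($a_i > b_i$); as high as possible is good (maximum of 1 when $a_i = 0$)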
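A minimal sketch of the silhouette coefficient for a single point. It assumes Euclidean distance via `math.dist` (Python 3.8+) and takes the second term as the smallest average distance to any cluster not containing the point, which is the common definition; the function name and argument layout are hypothetical.

```python
import math


def silhouette(point, own_cluster, other_clusters):
    """Silhouette coefficient of `point`.

    own_cluster: the other points in the point's cluster (excluding `point`).
    other_clusters: list of clusters (lists of points) not containing `point`.
    """
    def avg_dist(cluster):
        return sum(math.dist(point, q) for q in cluster) / len(cluster)

    a = avg_dist(own_cluster)                       # cohesion
    b = min(avg_dist(c) for c in other_clusters)    # separation
    return (b - a) / max(a, b)


# Close to its own cluster, far from the other one: value near 1.
print(silhouette((0, 0), [(0, 1)], [[(0, 10)]]))   # 0.9
# Closer to another cluster than to its own: negative value.
print(silhouette((0, 0), [(0, 10)], [[(0, 1)]]))   # -0.9
```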
Standard deviation
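$$\mathrm{SD}(x) = \sqrt{\frac{1}{n - 1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$$
where $\bar{x}$ is the mean of $x$ and $n$ is the number of observations (sample form; divide by $n$ instead for the population standard deviation)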