Formulas
Bayes rule
\(\displaystyle P(h|D) = \frac{P(D|h) P(h)}{P(D)}\)
- \(h\) is a hypothesis, \(D\) is the observed data
- \(P(h)\) is the prior, \(P(D|h)\) the likelihood, \(P(h|D)\) the posterior
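A quick numeric sanity check in Python; the prior and likelihoods below are made-up illustrative values:

```python
p_h = 0.008            # prior P(h)
p_d_given_h = 0.98     # likelihood P(D|h)
p_d_given_not_h = 0.03

# total probability: P(D) = P(D|h)P(h) + P(D|~h)P(~h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
posterior = p_d_given_h * p_h / p_d
print(posterior)  # ~0.208: the data shifts but does not decide the hypothesis
```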
Bitvector similarity
| Simple matching | Jaccard coefficient |
| --- | --- |
| \(\displaystyle \frac{f_{11}+f_{00}}{f_{00}+f_{01}+f_{10}+f_{11}}\) | \(\displaystyle \frac{f_{11}}{f_{01}+f_{10}+f_{11}}\) |
- measure similarity between two binary vectors
- \(f_{ab}\) is the number of positions where \(x_k = a\) and \(y_k = b\)
- range: [0, 1], where 1 is highly similar and 0 is dissimilar
- see the example below
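A minimal numpy sketch of both coefficients; the `smc`/`jaccard` helper names and the example vectors are illustrative only:

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient: counts both 1-1 and 0-0 agreements."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = np.sum((x == 1) & (y == 1))
    f00 = np.sum((x == 0) & (y == 0))
    return (f11 + f00) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: ignores 0-0 matches."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = np.sum((x == 1) & (y == 1))
    f01 = np.sum((x == 0) & (y == 1))
    f10 = np.sum((x == 1) & (y == 0))
    return f11 / (f01 + f10 + f11)

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y))      # 0.7: many shared zeros
print(jaccard(x, y))  # 0.0: no shared ones
```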
Centroid
\(\displaystyle c_i = \frac{1}{m_i} \sum_{x \in C_i} x\)
- \(C_i\) is the \(i^{th}\) cluster
- \(c_i\) centroid of cluster \(C_i\)
- \(m_i\) is number of objects in cluster \(C_i\)
- \(x\) is an object in cluster \(C_i\)
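In numpy the centroid is simply the column-wise mean of the cluster's points (illustrative data):

```python
import numpy as np

cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])  # objects in C_i
centroid = cluster.mean(axis=0)  # (1/m_i) * sum over points
print(centroid)  # [3. 2.]
```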
Correlation
\(\displaystyle \text{corr}(x, y) = \frac{\text{cov}(x, y)}{\text{SD}(x) \cdot \text{SD}(y)}\)
- where cov is covariance and SD is standard deviation
- range: [-1, 1]
- the closer to −1 or 1, the stronger the correlation between the variables
- +1: perfect direct (increasing) linear relationship
- −1: perfect inverse (decreasing) linear relationship (anti-correlation)
- invariant both to scaling (multiplication) and translation (constant offset)
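A quick numpy check against `np.corrcoef`, including the invariance property; the data values are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

corr = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(corr)                              # ~0.965
print(np.corrcoef(x, y)[0, 1])           # same value
print(np.corrcoef(3 * x + 7, y)[0, 1])   # unchanged: scale- and offset-invariant
```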
Covariance
\(\displaystyle \text{cov}(x, y) = \frac{1}{n-1} \sum_{k=1}^{n}(x_k-\overline{x}) (y_k-\overline{y})\)
where \(\overline{x}\) and \(\overline{y}\) are the means of \(x\) and \(y\).
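The same formula in numpy, checked against `np.cov` (which uses the \(n-1\) denominator by default); illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov, np.cov(x, y)[0, 1])  # both ~3.667
```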
Cosine similarity
\(\displaystyle \cos(x, y)\) = \(\displaystyle \frac{x \cdot y}{\Vert x \Vert \cdot \Vert y \Vert}\) = \(\displaystyle \frac{\sum_{i=1}^{n} x_iy_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}\)
- measures similarity between vectors
- range: [-1, 1]
- −1 => exactly opposite, 1 => exactly the same, 0 => orthogonal
- in-between values indicate intermediate similarity or dissimilarity
- invariant to scaling (multiplication) but not to translation (constant offset)
- see the example below
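A minimal numpy sketch, also demonstrating the invariance properties; the vectors are illustrative:

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product divided by the product of the L2 norms."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3.0, 2.0, 0.0, 5.0])
y = np.array([1.0, 0.0, 0.0, 0.0])
print(cosine_similarity(x, y))      # ~0.487
print(cosine_similarity(2 * x, y))  # unchanged: scale-invariant
print(cosine_similarity(x + 1, y))  # changes: not translation-invariant
```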
Entropy
\(\displaystyle \text{Entropy}(S) = \sum_{i=0}^{c-1} -p_i \log_2 p_i\)
- \(p_i\) proportion of \(S\) belonging to class \(i\)
- \(c\) is total number of classes
- \(0 \log_2 0 = 0\)
- min: 0, max: \(\log_2 c\)
- binary: \(- P(⊕) \log_2 P(⊕) - P(⊖) \log_2 P(⊖)\)
- for multiple children, compute weighted entropy
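A small Python helper (hypothetical name `entropy`) implementing the formula, including the \(0 \log_2 0 = 0\) convention:

```python
import numpy as np

def entropy(p):
    """Entropy of a class-proportion vector p (entries sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0: the max log2(c) for c = 2 classes
print(entropy([1.0, 0.0]))  # 0.0: a pure node
```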
Euclidean distance
\(\displaystyle d(x, y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 }\)
- \(n\) is the number of dimensions/attributes
- \(x_k\) and \(y_k\) are the \(k^{th}\) attributes of data objects \(x\) and \(y\)
- measure distance between instances
- not invariant to scaling (multiplication) or translation (constant offset)
- special case of Minkowski distance where \(r = 2\)
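A one-line numpy check against `np.linalg.norm` (illustrative points):

```python
import numpy as np

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])
d = np.sqrt(np.sum((x - y) ** 2))
print(d, np.linalg.norm(x - y))  # 5.0 both ways
```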
Gini index
\(\displaystyle \text{Gini index}(S) = 1 - \sum_{i = 0}^{c-1} p_i^2\)
- \(p_i\) proportion of \(S\) belonging to class \(i\) (relative frequency of training instances)
- \(c\) is total number of classes
- min: 0, max: \(1-1/c\)
- binary classification: \(1 - P(⊕)^2 - P(⊖)^2\)
- for multiple children, compute the weighted Gini index
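A Python sketch (hypothetical `gini` helper) including a weighted split; the split sizes and class counts are made up:

```python
import numpy as np

def gini(p):
    """Gini index of a class-proportion vector p (entries sum to 1)."""
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

print(gini([0.5, 0.5]))  # 0.5 = 1 - 1/c for c = 2, the maximum
print(gini([1.0, 0.0]))  # 0.0: a pure node

# weighted Gini of a split into children of sizes 7 and 5
children = [([3/7, 4/7], 7), ([1.0, 0.0], 5)]
print(sum(n / 12 * gini(p) for p, n in children))  # ~0.286
```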
Group average
\(\displaystyle \text{proximity}(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} \text{proximity}(x, y)}{m_i \times m_j}\)
- where \(m_i\) and \(m_j\) are the numbers of objects in clusters \(C_i\) and \(C_j\)
- distance measure for hierarchical clustering
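A numpy sketch, assuming Euclidean distance as the underlying proximity; the helper name and clusters are illustrative:

```python
import numpy as np

def group_average(ci, cj):
    """Average pairwise Euclidean distance between clusters ci and cj."""
    dists = np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=-1)
    return dists.sum() / (len(ci) * len(cj))

ci = np.array([[0.0, 0.0], [1.0, 0.0]])
cj = np.array([[4.0, 3.0], [5.0, 3.0]])
print(group_average(ci, cj))  # ~5.02: mean of the 4 cross-cluster distances
```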
Information gain
\(\displaystyle \text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Entropy}(S_v)\)
- \(A\) is an attribute and \(\text{Values}(A)\) its set of possible values
- \(S_v\) is the subset of \(S\) for which attribute \(A\) has value \(v\)
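Putting the pieces together with the entropy helper from above; the split sizes and class counts are made-up illustrative values:

```python
import numpy as np

def entropy(p):  # same helper as in the Entropy section
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# S has 10 records (6+/4-); attribute A splits it into children S_v
# of sizes 6 (5+/1-) and 4 (1+/3-)
parent = entropy([0.6, 0.4])
children = [([5/6, 1/6], 6), ([1/4, 3/4], 4)]
weighted = sum(n / 10 * entropy(p) for p, n in children)
print(parent - weighted)  # ~0.256: entropy reduction from the split
```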
Mahalanobis distance
\(\displaystyle d(x, y) = \big((x-y)^T \Sigma^{-1} (x-y)\big)^{0.5}\)
- \(\Sigma\) is the covariance matrix
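A numpy sketch where \(\Sigma\) is estimated from a small made-up data matrix:

```python
import numpy as np

X = np.array([[2.0, 0.0], [4.0, 1.0], [6.0, 3.0], [8.0, 4.0]])  # made-up data
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix

diff = X[0] - X[3]
print(np.sqrt(diff @ sigma_inv @ diff))  # ~2.45: Mahalanobis distance
print(np.linalg.norm(diff))              # ~7.21: Euclidean, for comparison
```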
Manhattan distance
\(\displaystyle d(x, y) = \Vert x - y \Vert_1 = \sum_{i=1}^{n} \vert x_i - y_i \vert\)
- \(L_1\) distance or \(l_1\) norm
- special case of Minkowski distance where \(r = 1\)
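A one-line numpy check, also via the `ord=1` norm (illustrative vectors):

```python
import numpy as np

x = np.array([1.0, 5.0, 2.0])
y = np.array([4.0, 1.0, 0.0])
print(np.sum(np.abs(x - y)))         # 9.0
print(np.linalg.norm(x - y, ord=1))  # same: L1 norm of the difference
```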
Mean
\(\displaystyle \overline{x} = \frac{1}{n} \sum_{k=1}^{n} x_k\)
Minkowski distance
\(\displaystyle d(x, y) = \Big( \sum_{k=1}^{n} \vert x_k - y_k \vert^r \Big)^\frac{1}{r}\)
- \(r\) is a parameter
- \(n\) is the number of dimensions/attributes
- \(x_k\) and \(y_k\) are the \(k^{th}\) attributes of data objects \(x\) and \(y\)
- generalization of Euclidean distance
- when \(r = 1\): Manhattan distance; when \(r = 2\): Euclidean distance
- when \(r \rightarrow \infty\): supremum (\(L_\infty\) norm) distance, \(\max_k \vert x_k - y_k \vert\)
- note: do not confuse \(r\) with \(n\); all of these distances are defined for any number of dimensions
- weighted Minkowski: \(d(x, y) = \Big( \sum_{k=1}^{n} w_k \cdot \vert x_k - y_k \vert ^ r \Big)^\frac{1}{r}\)
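A Python sketch (hypothetical `minkowski` helper) showing how the special cases fall out of the parameter \(r\):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance with parameter r."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])
print(minkowski(x, y, 1), np.sum(np.abs(x - y)))    # Manhattan: 7
print(minkowski(x, y, 2), np.linalg.norm(x - y))    # Euclidean: 5
print(minkowski(x, y, 100), np.max(np.abs(x - y)))  # -> supremum: ~4
```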
Misclassification error
\(\displaystyle \text{Error}(S) = 1 - \max_i(p_i)\)
- \(p_i\) proportion of \(S\) belonging to class \(i\)
- min: 0, max: \(1 - 1/c\)
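For example, with made-up class proportions at a node:

```python
p = [0.3, 0.6, 0.1]  # class proportions at a node
print(1 - max(p))    # 0.4: error of always predicting the majority class
```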
Silhouette Coefficient
\(\displaystyle s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\)
- for a point \(i\) in a cluster
- \(a_i\) is the average distance from \(i\) to the other points in its own cluster
- \(b_i\) is the minimum, over clusters not containing \(i\), of the average distance from \(i\) to the points of that cluster
- range: [-1, 1]
- negative values are bad; values close to 1 are good (ideally \(a_i = 0\))
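A numpy sketch for a single point, assuming Euclidean distance; the helper name and clusters are illustrative:

```python
import numpy as np

def silhouette(point, own, others):
    """s_i for one point: own = rest of its cluster, others = other clusters."""
    a = np.mean([np.linalg.norm(point - p) for p in own])
    b = min(np.mean([np.linalg.norm(point - p) for p in c]) for c in others)
    return (b - a) / max(a, b)

i = np.array([0.0, 0.0])
own = np.array([[0.0, 1.0], [1.0, 0.0]])
others = [np.array([[5.0, 5.0], [6.0, 5.0]])]
print(silhouette(i, own, others))  # ~0.87: close to 1, well clustered
```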
Standard deviation
\(\displaystyle \text{SD}(x) = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n}(x_k-\overline{x})^2}\)
where \(\overline{x}\) is the mean of \(x\).
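The same formula in numpy, checked against `x.std(ddof=1)` (which also uses the \(n-1\) denominator); illustrative data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
sd = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
print(sd, x.std(ddof=1))  # both ~2.14
```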
Sum of squared errors
\(\displaystyle \text{SSE} = \frac{1}{2}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2\)
- \(\hat{y}_i\) is the predicted and \(y_i\) the actual value for instance \(i\)
- the \(\frac{1}{2}\) factor is a convention that simplifies the derivative
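A direct numpy translation with made-up predictions and targets:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # actual values
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # predicted values
sse = 0.5 * np.sum((y_hat - y) ** 2)
print(sse)  # 0.75
```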