Formulas

Bayes' rule

\(\displaystyle P(h|D) = \frac{P(D|h) P(h)}{P(D)}\)

  • \(h\) is a hypothesis, \(D\) the observed data
  • \(P(h)\) is the prior, \(P(D|h)\) the likelihood, \(P(h|D)\) the posterior

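A quick numeric sketch in plain Python; the prior, likelihood, and evidence values are made up for illustration:

    # posterior = likelihood * prior / evidence
    p_h = 0.01          # prior P(h), hypothetical value
    p_d_given_h = 0.9   # likelihood P(D|h), hypothetical value
    p_d = 0.05          # evidence P(D), hypothetical value
    p_h_given_d = p_d_given_h * p_h / p_d
    print(p_h_given_d)  # ≈ 0.18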

Bitvector similarity

Simple matching coefficient (SMC): \(\displaystyle \frac{f_{11}+f_{00}}{f_{00}+f_{01}+f_{10}+f_{11}}\)

Jaccard coefficient: \(\displaystyle \frac{f_{11}}{f_{01}+f_{10}+f_{11}}\)

  • measure similarity between two binary vectors
  • \(f_{ij}\) is the number of positions where the first vector has value \(i\) and the second has value \(j\)
  • range: [0, 1] where 1 is highly similar, 0 is dissimilar
  • the Jaccard coefficient ignores \(f_{00}\), which suits sparse (asymmetric) binary attributes
  • see the sketch below
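
A minimal sketch in plain Python that tallies the \(f_{ij}\) counts for two example vectors:

    x = [1, 0, 1, 1, 0]
    y = [1, 1, 0, 1, 0]
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    smc = (f11 + f00) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11)
    print(smc, jaccard)  # 0.6 0.5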

Centroid

\(\displaystyle c_i = \frac{1}{m_i} \sum_{x \in C_i} x\)

  • \(C_i\) is the \(i^{th}\) cluster
  • \(c_i\) centroid of cluster \(C_i\)
  • \(m_i\) is number of objects in cluster \(C_i\)
  • \(x\) is an object in cluster \(C_i\)
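
A minimal sketch in plain Python for a made-up 2-D cluster:

    cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]  # C_i
    m = len(cluster)                                # m_i
    centroid = tuple(sum(p[d] for p in cluster) / m
                     for d in range(len(cluster[0])))
    print(centroid)  # (3.0, 2.0)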

Correlation

\(\displaystyle \text{corr}(x, y) = \frac{\text{cov}(x, y)}{\text{SD}(x) \cdot \text{SD}(y)}\)

  • where cov is covariance and SD is standard deviation.
  • range min, max: [-1, 1]
  • the closer to −1 or 1, the stronger the correlation between the variables
  • +1: perfect direct (increasing) linear relationship
  • −1: perfect inverse (decreasing) linear relationship (anti-correlation)
  • invariant both to scaling (multiplication) and translation (constant offset)
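
A self-contained sketch in plain Python, computing the sample covariance and standard deviations inline (both defined elsewhere on this page):

    from math import sqrt

    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
        sd_x = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
        sd_y = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
        return cov / (sd_x * sd_y)

    print(corr([1, 2, 3], [2, 4, 6]))  # 1.0, perfect increasing linear relationship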

Covariance

\(\displaystyle \text{cov}(x, y) = \frac{1}{n-1} \sum_{k=1}^{n}(x_k-\overline{x}) (y_k-\overline{y})\)

where \(\overline{x}\) and \(\overline{y}\) are the means of \(x\) and \(y\), and \(n\) is the number of values.
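
A direct transcription in plain Python, with illustrative values:

    def cov(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

    print(cov([1, 2, 3], [2, 4, 6]))  # 2.0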


Cosine similarity

\(\displaystyle \cos(x, y)\) = \(\displaystyle \frac{x \cdot y}{\Vert x \Vert \cdot \Vert y \Vert}\) = \(\displaystyle \frac{\sum_{i=1}^{n} x_iy_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}\)

  • measures similarity between vectors
  • min, max range: [-1, 1]
  • −1 => exactly opposite, 1 => exactly the same, 0 => orthogonal
  • in-between values indicate intermediate similarity or dissimilarity
  • invariant to scaling (multiplication) but not to translation (constant offset)
  • see the sketch below
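
A minimal sketch in plain Python:

    from math import sqrt

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = sqrt(sum(a * a for a in x))
        norm_y = sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    print(cosine([1, 0], [0, 1]))  # 0.0, orthogonal
    print(cosine([3, 4], [6, 8]))  # 1.0, same direction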

Entropy

\(\displaystyle \text{Entropy(S)} = \sum_{i=0}^{c-1} -p_i \log_2 p_i\)

  • \(p_i\) proportion of \(S\) belonging to class \(i\)
  • \(c\) is total number of classes
  • \(0 \log_2 0 = 0\)
  • min: 0, max: \(\log_2 c\)
  • binary: \(- P(⊕) \log_2 P(⊕) - P(⊖) \log_2 P(⊖)\)
  • for multiple children, compute weighted entropy
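
A minimal sketch in plain Python; the class proportions are illustrative:

    from math import log2

    def entropy(proportions):
        # the p = 0 terms are skipped, matching the 0*log2(0) = 0 convention
        return sum(-p * log2(p) for p in proportions if p > 0)

    print(entropy([0.5, 0.5]))  # 1.0, maximum for c = 2
    print(entropy([1.0, 0.0]))  # 0.0, pure set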

Euclidean distance

\(\displaystyle d(x, y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 }\)

  • \(n\) is the number of dimensions/attributes
  • \(x_k\) and \(y_k\) are the \(k^{th}\) attributes of data objects \(x\) and \(y\)
  • measure distance between instances
  • not invariant to scaling (multiplication), translation (constant offset)
  • special case of Minkowski distance where \(r = 2\)
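
A minimal sketch in plain Python:

    from math import sqrt

    def euclidean(x, y):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    print(euclidean((0, 0), (3, 4)))  # 5.0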

Gini index

\(\displaystyle \text{Gini index(S)} = 1 - \sum_{i = 0}^{c-1} p_i^2\)

  • \(p_i\) proportion of \(S\) belonging to class \(i\) (relative frequency of training instances)
  • \(c\) is total number of classes
  • range min: 0 max: \(1-1/c\)
  • binary classification: \(1 - P(⊕)^2 - P(⊖)^2\)
  • for multiple children compute weighted gini index
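
A minimal sketch in plain Python; the class proportions are illustrative:

    def gini(proportions):
        return 1 - sum(p * p for p in proportions)

    print(gini([0.5, 0.5]))  # 0.5, the maximum 1 - 1/c for c = 2
    print(gini([1.0, 0.0]))  # 0.0, pure set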

Group average

\(\displaystyle \text{proximity}(C_i, C_j) = \frac{\sum_{x \in C_i, y \in C_j} \text{proximity}(x, y)}{m_i \times m_j}\)

  • where \(m_k\) is number of objects in cluster \(C_k\)
  • distance measure for hierarchical clustering
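
A minimal sketch in plain Python, assuming Euclidean distance as the proximity measure:

    from math import sqrt

    def euclidean(x, y):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def group_average(ci, cj):
        total = sum(euclidean(x, y) for x in ci for y in cj)
        return total / (len(ci) * len(cj))

    print(group_average([(0, 0), (0, 1)], [(3, 0), (4, 0)]))  # ≈ 3.57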

Information gain

\(\displaystyle \text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Entropy}(S_v)\)

  • \(A\) is an attribute and \(\text{Values}(A)\) its set of possible values
  • \(S_v\) is the subset of \(S\) for which attribute \(A\) has value \(v\)
  • the expected reduction in entropy from splitting \(S\) on \(A\)

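A minimal sketch in plain Python; the labels and attribute values are a made-up toy split:

    from math import log2

    def entropy(labels):
        n = len(labels)
        return sum(-(labels.count(c) / n) * log2(labels.count(c) / n)
                   for c in set(labels))

    def info_gain(labels, attr_values):
        # partition the labels by the value each instance takes for attribute A
        subsets = {}
        for lbl, v in zip(labels, attr_values):
            subsets.setdefault(v, []).append(lbl)
        weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
        return entropy(labels) - weighted

    # value 'a' -> all positive, 'b' -> all negative: a perfect split
    print(info_gain(['+', '+', '-', '-'], ['a', 'a', 'b', 'b']))  # 1.0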

Mahalanobis distance

\(\displaystyle d(x, y) = \left((x-y)^T \Sigma^{-1} (x-y)\right)^{0.5}\)

  • \(\Sigma\) is the covariance matrix of the data
  • reduces to Euclidean distance when \(\Sigma\) is the identity matrix
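
A 2-D sketch in plain Python with the 2×2 matrix inverse written out by hand; with the identity covariance it reduces to Euclidean distance:

    from math import sqrt

    def mahalanobis_2d(x, y, cov):
        (a, b), (c, d) = cov                 # 2x2 covariance matrix
        det = a * d - b * c
        inv = [[d / det, -b / det], [-c / det, a / det]]
        dx = [x[0] - y[0], x[1] - y[1]]
        # quadratic form (x - y)^T * inv * (x - y)
        q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
        return sqrt(q)

    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(mahalanobis_2d((3, 4), (0, 0), identity))  # 5.0, same as Euclidean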

Manhattan distance

\(\displaystyle d(x, y) = \Vert x - y \Vert_1 = \sum_{i=1}^{n} \vert x_i - y_i \vert\)

  • \(L_1\) distance or \(l_1\) norm
  • special case of Minkowski distance where \(r = 1\)
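
A minimal sketch in plain Python:

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    print(manhattan((1, 2), (4, 6)))  # 7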

Mean

\(\displaystyle \overline{x} = \frac{1}{n} \sum_{k=1}^{n} x_k\)
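
In plain Python:

    x = [2, 4, 6, 8]
    mean = sum(x) / len(x)
    print(mean)  # 5.0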


Minkowski distance

\(\displaystyle d(x, y) = \Big( \sum_{k=1}^{n} \vert x_k - y_k \vert^r \Big)^\frac{1}{r}\)

  • \(r\) is a parameter
  • \(n\) is the number of dimensions/attributes
  • \(x_k\) and \(y_k\) are the \(k^{th}\) attributes of data objects \(x\) and \(y\)
  • generalization of Euclidean distance
  • when \(r = 1\) => Manhattan distance, \(r = 2\) => Euclidean distance
  • when \(r \rightarrow \infty\) => "supremum" (\(L_\infty\)) norm distance: \(\max_k \vert x_k - y_k \vert\)
  • note: \(r\) and \(n\) are independent; all of these distances are defined for any number of dimensions
  • weighted Minkowski: \(d(x, y) = \Big( \sum_{k=1}^{n} w_k \cdot \vert x_k - y_k \vert ^ r \Big)^\frac{1}{r}\)
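
A minimal sketch in plain Python covering the \(r = 1\), \(r = 2\), and supremum cases:

    def minkowski(x, y, r):
        return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

    p, q = (0, 0), (3, 4)
    print(minkowski(p, q, 1))  # 7.0, Manhattan
    print(minkowski(p, q, 2))  # 5.0, Euclidean
    print(max(abs(a - b) for a, b in zip(p, q)))  # 4, supremum (r -> infinity)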

Misclassification error

\(\displaystyle \text{Error}(S) = 1 - \max_i p_i\)

  • \(p_i\) proportion of \(S\) belonging to class \(i\)
  • range min: 0, max: \(1 - 1/c\)

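A minimal sketch in plain Python; the class proportions are illustrative:

    def misclassification_error(proportions):
        return 1 - max(proportions)

    print(misclassification_error([0.75, 0.25]))  # 0.25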

Silhouette Coefficient

\(\displaystyle s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\)

  • for a point \(i\) in some cluster
  • \(a_i\) is the average distance from \(i\) to the other points in its own cluster (cohesion)
  • \(b_i\) is the minimum, over clusters not containing \(i\), of the average distance from \(i\) to the points of that cluster (separation)
  • range -1 to 1
  • negative values are bad; values close to 1 are good (the maximum is attained when \(a_i = 0\))
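
A direct transcription in plain Python; the \(a_i\) and \(b_i\) values are made up:

    def silhouette(a_i, b_i):
        return (b_i - a_i) / max(a_i, b_i)

    print(silhouette(0.5, 2.0))  # 0.75, well separated from other clusters
    print(silhouette(2.0, 0.5))  # -0.75, likely assigned to the wrong cluster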

Standard deviation

\(\displaystyle \text{SD}(x) = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n}(x_k-\overline{x})^2}\)

where \(\overline{x}\) is the mean of \(x\).
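
A direct transcription in plain Python (sample standard deviation, with the \(n-1\) denominator):

    from math import sqrt

    def sd(x):
        m = sum(x) / len(x)
        return sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))

    print(sd([2, 4, 4, 4, 5, 5, 7, 9]))  # ≈ 2.14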


Sum of squared errors

\(\displaystyle \text{SSE} = \frac{1}{2}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2\)

  • \(\hat{y}_i\) is the predicted and \(y_i\) the actual value for instance \(i\)
  • \(m\) is the number of instances
  • the \(\frac{1}{2}\) factor is a common convention that cancels when differentiating
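
A minimal sketch in plain Python with made-up predictions and targets:

    def sse(y_hat, y):
        return 0.5 * sum((p - a) ** 2 for p, a in zip(y_hat, y))

    print(sse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # 0.25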