Formulas
Bayes rule
\(\displaystyle P(h|D) = \frac{P(D|h) P(h)}{P(D)}\)
- \(h\) is a hypothesis, \(D\) is the observed data
- \(P(h)\) is the prior, \(P(D|h)\) the likelihood, \(P(h|D)\) the posterior
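A quick numeric sanity check in Python; the prior and likelihoods below are made-up illustrative values:

```python
p_h = 0.008            # prior P(h)
p_d_given_h = 0.98     # likelihood P(D|h)
p_d_given_not_h = 0.03

# total probability: P(D) = P(D|h)P(h) + P(D|~h)P(~h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
posterior = p_d_given_h * p_h / p_d
print(posterior)  # ~0.208: the data shifts but does not decide the hypothesis
```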
Bitvector similarity
| Simple matching | Jaccard coefficient |
| --- | --- |
| \(\displaystyle \frac{f_{11}+f_{00}}{f_{00}+f_{01}+f_{10}+f_{11}}\) | \(\displaystyle \frac{f_{11}}{f_{01}+f_{10}+f_{11}}\) |
- measure similarity between two binary vectors
- \(f_{ab}\) is the number of positions where \(x_k = a\) and \(y_k = b\)
- range: [0, 1], where 1 is highly similar and 0 is dissimilar
- see the example below
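A minimal numpy sketch of both coefficients; the `smc`/`jaccard` helper names and the example vectors are illustrative only:

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient: counts both 1-1 and 0-0 agreements."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = np.sum((x == 1) & (y == 1))
    f00 = np.sum((x == 0) & (y == 0))
    return (f11 + f00) / len(x)

def jaccard(x, y):
    """Jaccard coefficient: ignores 0-0 matches."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = np.sum((x == 1) & (y == 1))
    f01 = np.sum((x == 0) & (y == 1))
    f10 = np.sum((x == 1) & (y == 0))
    return f11 / (f01 + f10 + f11)

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y))      # 0.7: many shared zeros
print(jaccard(x, y))  # 0.0: no shared ones
```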
Centroid
\(\displaystyle c_i = \frac{1}{m_i} \sum_{x \in C_i} x\)
- \(C_i\) is the \(i^{th}\) cluster
- \(c_i\) centroid of cluster \(C_i\)
- \(m_i\) is number of objects in cluster \(C_i\)
- \(x\) is an object in cluster \(C_i\)
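In numpy the centroid is simply the column-wise mean of the cluster's points (illustrative data):

```python
import numpy as np

cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])  # objects in C_i
centroid = cluster.mean(axis=0)  # (1/m_i) * sum over points
print(centroid)  # [3. 2.]
```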
Correlation
\(\displaystyle \text{corr}(x, y) = \frac{\text{cov}(x, y)}{\text{SD}(x) \cdot \text{SD}(y)}\)
- where cov is covariance and SD is standard deviation
- range: [-1, 1]
- the closer to −1 or 1, the stronger the correlation between the variables
- +1: perfect direct (increasing) linear relationship
- −1: perfect inverse (decreasing) linear relationship (anti-correlation)
- invariant both to scaling (multiplication) and translation (constant offset)
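A quick numpy check against `np.corrcoef`, including the invariance property; the data values are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

corr = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(corr)                              # ~0.965
print(np.corrcoef(x, y)[0, 1])           # same value
print(np.corrcoef(3 * x + 7, y)[0, 1])   # unchanged: scale- and offset-invariant
```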
Covariance
\(\displaystyle \text{cov}(x, y) = \frac{1}{n-1} \sum_{k=1}^{n}(x_k-\overline{x}) (y_k-\overline{y})\)
where \(\overline{x}\) and \(\overline{y}\) are the means of \(x\) and \(y\).
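The same formula in numpy, checked against `np.cov` (which uses the \(n-1\) denominator by default); illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov, np.cov(x, y)[0, 1])  # both ~3.667
```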
Cosine similarity
\(\displaystyle \cos(x, y)\) = \(\displaystyle \frac{x \cdot y}{\Vert x \Vert \cdot \Vert y \Vert}\) = \(\displaystyle \frac{\sum_{i=1}^{n} x_iy_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}\)
- measures similarity between vectors
- range: [-1, 1]
- −1 => exactly opposite, 1 => exactly the same, 0 => orthogonal
- in-between values indicate intermediate similarity or dissimilarity
- invariant to scaling (multiplication) but not to translation (constant offset)
- see the example below
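A minimal numpy sketch, also demonstrating the invariance properties; the vectors are illustrative:

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product divided by the product of the L2 norms."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3.0, 2.0, 0.0, 5.0])
y = np.array([1.0, 0.0, 0.0, 0.0])
print(cosine_similarity(x, y))      # ~0.487
print(cosine_similarity(2 * x, y))  # unchanged: scale-invariant
print(cosine_similarity(x + 1, y))  # changes: not translation-invariant
```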
Entropy
\(\displaystyle \text{Entropy}(S) = \sum_{i=0}^{c-1} -p_i \log_2 p_i\)
- \(p_i\) proportion of \(S\) belonging to class \(i\)
- \(c\) is total number of classes
- \(0 \log_2 0 = 0\)
- min: 0, max: \(\log_2 c\)
- binary: \(- P(⊕) \log_2 P(⊕) - P(⊖) \log_2 P(⊖)\)
- for multiple children, compute weighted entropy
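A small Python helper (hypothetical name `entropy`) implementing the formula, including the \(0 \log_2 0 = 0\) convention:

```python
import numpy as np

def entropy(p):
    """Entropy of a class-proportion vector p (entries sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0: the max log2(c) for c = 2 classes
print(entropy([1.0, 0.0]))  # 0.0: a pure node
```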
Euclidean distance
\(\displaystyle d(x, y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 }\)
- \(n\) is the number of dimensions/attributes
- \(x_k\) and \(y_k\) are the \(k^{th}\) attributes of data objects \(x\) and \(y\)
- measure distance between instances
- not invariant to scaling (multiplication) or translation (constant offset)
- special case of Minkowski distance where \(r = 2\)
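A one-line numpy check against `np.linalg.norm` (illustrative points):

```python
import numpy as np

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])
d = np.sqrt(np.sum((x - y) ** 2))
print(d, np.linalg.norm(x - y))  # 5.0 both ways
```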
Gini index
\(\displaystyle \text{Gini index}(S) = 1 - \sum_{i = 0}^{c-1} p_i^2\)
- \(p_i\) proportion of \(S\) belonging to class \(i\) (relative frequency of training instances)
- \(c\) is total number of classes
- min: 0, max: \(1-1/c\)
- binary classification: \(1 - P(⊕)^2 - P(⊖)^2\)
- for multiple children, compute the weighted Gini index
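A Python sketch (hypothetical `gini` helper) including a weighted split; the split sizes and class counts are made up:

```python
import numpy as np

def gini(p):
    """Gini index of a class-proportion vector p (entries sum to 1)."""
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

print(gini([0.5, 0.5]))  # 0.5 = 1 - 1/c for c = 2, the maximum
print(gini([1.0, 0.0]))  # 0.0: a pure node

# weighted Gini of a split into children of sizes 7 and 5
children = [([3/7, 4/7], 7), ([1.0, 0.0], 5)]
print(sum(n / 12 * gini(p) for p, n in children))  # ~0.286
```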
Group average
\(\displaystyle \text{proximity}(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} \text{proximity}(x, y)}{m_i \times m_j}\)
- where \(m_i\) and \(m_j\) are the numbers of objects in clusters \(C_i\) and \(C_j\)
- distance measure for hierarchical clustering
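A numpy sketch, assuming Euclidean distance as the underlying proximity; the helper name and clusters are illustrative:

```python
import numpy as np

def group_average(ci, cj):
    """Average pairwise Euclidean distance between clusters ci and cj."""
    dists = np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=-1)
    return dists.sum() / (len(ci) * len(cj))

ci = np.array([[0.0, 0.0], [1.0, 0.0]])
cj = np.array([[4.0, 3.0], [5.0, 3.0]])
print(group_average(ci, cj))  # ~5.02: mean of the 4 cross-cluster distances
```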
Information gain
\(\displaystyle \text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \text{Entropy}(S_v)\)
- \(A\) is an attribute and \(\text{Values}(A)\) its set of possible values
- \(S_v\) is the subset of \(S\) for which attribute \(A\) has value \(v\)
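Putting the pieces together with the entropy helper from above; the split sizes and class counts are made-up illustrative values:

```python
import numpy as np

def entropy(p):  # same helper as in the Entropy section
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# S has 10 records (6+/4-); attribute A splits it into children S_v
# of sizes 6 (5+/1-) and 4 (1+/3-)
parent = entropy([0.6, 0.4])
children = [([5/6, 1/6], 6), ([1/4, 3/4], 4)]
weighted = sum(n / 10 * entropy(p) for p, n in children)
print(parent - weighted)  # ~0.256: entropy reduction from the split
```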
Mahalanobis distance
\(\displaystyle d(x, y) = \big((x-y)^T \Sigma^{-1} (x-y)\big)^{0.5}\)
- \(\Sigma\) is the covariance matrix
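A numpy sketch where \(\Sigma\) is estimated from a small made-up data matrix:

```python
import numpy as np

X = np.array([[2.0, 0.0], [4.0, 1.0], [6.0, 3.0], [8.0, 4.0]])  # made-up data
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix

diff = X[0] - X[3]
print(np.sqrt(diff @ sigma_inv @ diff))  # ~2.45: Mahalanobis distance
print(np.linalg.norm(diff))              # ~7.21: Euclidean, for comparison
```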
Manhattan distance
\(\displaystyle d(x, y) = \Vert x - y \Vert_1 = \sum_{i=1}^{n} \vert x_i - y_i \vert\)
- \(L_1\) distance or \(l_1\) norm
- special case of Minkowski distance where \(r = 1\)
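A one-line numpy check, also via the `ord=1` norm (illustrative vectors):

```python
import numpy as np

x = np.array([1.0, 5.0, 2.0])
y = np.array([4.0, 1.0, 0.0])
print(np.sum(np.abs(x - y)))         # 9.0
print(np.linalg.norm(x - y, ord=1))  # same: L1 norm of the difference
```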
Mean
\(\displaystyle \overline{x} = \frac{1}{n} \sum_{k=1}^{n} x_k\)
Minkowski distance
\(\displaystyle d(x, y) = \Big( \sum_{k=1}^{n} \vert x_k - y_k \vert^r \Big)^\frac{1}{r}\)
- \(r\) is a parameter
- \(n\) is the number of dimensions/attributes
- \(x_k\) and \(y_k\) are the \(k^{th}\) attributes of data objects \(x\) and \(y\)
- generalization of Euclidean distance
- when \(r = 1\): Manhattan distance; when \(r = 2\): Euclidean distance
- when \(r \rightarrow \infty\): supremum (\(L_\infty\) norm) distance, \(\max_k \vert x_k - y_k \vert\)
- note: do not confuse \(r\) with \(n\); all of these distances are defined for any number of dimensions
- weighted Minkowski: \(d(x, y) = \Big( \sum_{k=1}^{n} w_k \cdot \vert x_k - y_k \vert ^ r \Big)^\frac{1}{r}\)
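A Python sketch (hypothetical `minkowski` helper) showing how the special cases fall out of the parameter \(r\):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance with parameter r."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])
print(minkowski(x, y, 1), np.sum(np.abs(x - y)))    # Manhattan: 7
print(minkowski(x, y, 2), np.linalg.norm(x - y))    # Euclidean: 5
print(minkowski(x, y, 100), np.max(np.abs(x - y)))  # -> supremum: ~4
```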
Misclassification error
\(\displaystyle \text{Error}(S) = 1 - \max_i(p_i)\)
- \(p_i\) proportion of \(S\) belonging to class \(i\)
- min: 0, max: \(1 - 1/c\)
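For example, with made-up class proportions at a node:

```python
p = [0.3, 0.6, 0.1]  # class proportions at a node
print(1 - max(p))    # 0.4: error of always predicting the majority class
```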
Silhouette Coefficient
\(\displaystyle s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\)
- for a point \(i\) in a cluster
- \(a_i\) is the average distance from \(i\) to the other points in its own cluster
- \(b_i\) is the minimum, over clusters not containing \(i\), of the average distance from \(i\) to the points of that cluster
- range: [-1, 1]
- negative values are bad; values close to 1 are good (ideally \(a_i = 0\))
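A numpy sketch for a single point, assuming Euclidean distance; the helper name and clusters are illustrative:

```python
import numpy as np

def silhouette(point, own, others):
    """s_i for one point: own = rest of its cluster, others = other clusters."""
    a = np.mean([np.linalg.norm(point - p) for p in own])
    b = min(np.mean([np.linalg.norm(point - p) for p in c]) for c in others)
    return (b - a) / max(a, b)

i = np.array([0.0, 0.0])
own = np.array([[0.0, 1.0], [1.0, 0.0]])
others = [np.array([[5.0, 5.0], [6.0, 5.0]])]
print(silhouette(i, own, others))  # ~0.87: close to 1, well clustered
```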
Standard deviation
\(\displaystyle \text{SD}(x) = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n}(x_k-\overline{x})^2}\)
where \(\overline{x}\) is the mean of \(x\).
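The same formula in numpy, checked against `x.std(ddof=1)` (which also uses the \(n-1\) denominator); illustrative data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
sd = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
print(sd, x.std(ddof=1))  # both ~2.14
```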
Sum of squared errors
\(\displaystyle \text{SSE} = \frac{1}{2}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2\)
- \(\hat{y}_i\) is the predicted and \(y_i\) the actual value for instance \(i\)
- the \(\frac{1}{2}\) factor is a convention that simplifies the derivative
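A direct numpy translation with made-up predictions and targets:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # actual values
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # predicted values
sse = 0.5 * np.sum((y_hat - y) ** 2)
print(sse)  # 0.75
```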