Formulas

Bayes rule

P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
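
A minimal sketch of plugging numbers into Bayes' rule; the prior, likelihood, and evidence values are made up:

```python
# Bayes' rule: P(h|D) = P(D|h) * P(h) / P(D)
p_h = 0.01          # prior P(h)
p_d_given_h = 0.9   # likelihood P(D|h)
p_d = 0.05          # evidence P(D)

p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)  # ≈ 0.18
```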


Bitvector similarity

Simple matching and Jaccard coefficients

SMC = \frac{f_{11} + f_{00}}{f_{00} + f_{01} + f_{10} + f_{11}} \qquad J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}
  • f_{ab} is the number of attributes where x takes value a and y takes value b
  • measure similarity between two binary vectors
  • range: [0, 1], where 1 is highly similar and 0 is dissimilar
  • see: example
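
A minimal Python sketch of both coefficients on two hypothetical binary vectors:

```python
# Simple matching (SMC) and Jaccard coefficients for binary vectors.
x = [1, 0, 0, 1, 1, 0]
y = [1, 1, 0, 0, 1, 0]

# f_ab = number of positions where x == a and y == b
f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
print(smc, jaccard)  # ≈ 0.667, 0.5
```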

Centroid

c_i = \frac{1}{m_i} \sum_{x \in C_i} x

  • C_i is the i-th cluster
  • c_i is the centroid of cluster C_i
  • m_i is the number of objects in cluster C_i
  • x is an object in cluster C_i
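
A minimal NumPy sketch; the points are hypothetical:

```python
import numpy as np

# Centroid of a cluster = mean of its member points, taken per dimension.
cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
centroid = cluster.mean(axis=0)  # (1/m_i) * sum over x in C_i
print(centroid)                  # [3. 2.]
```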

Correlation

corr(x, y) = \frac{cov(x, y)}{SD(x)\, SD(y)}

  • where cov is covariance and SD is standard deviation.
  • range: [-1, 1]
  • the closer to −1 or 1, the stronger the correlation between the variables
  • +1: perfect direct (increasing) linear relationship
  • −1: perfect inverse (decreasing) linear relationship (anti-correlation)
  • invariant both to scaling (multiplication) and translation (constant offset)
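
A minimal NumPy sketch with hypothetical data; ddof=1 matches the n − 1 sample formulas used on this page:

```python
import numpy as np

# Pearson correlation: covariance over the product of standard deviations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # y = 2x: perfect increasing relationship

corr = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(corr)  # 1.0
```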

Covariance

cov(x, y) = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})

where \bar{x} and \bar{y} are the means of x and y.
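
A minimal NumPy sketch with hypothetical data, checked against np.cov:

```python
import numpy as np

# Sample covariance, dividing by n - 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

n = len(x)
cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
print(cov, np.cov(x, y)[0, 1])  # both ≈ 1.833
```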


Cosine similarity

\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert\, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\, \sqrt{\sum_{i=1}^{n} y_i^2}}

  • measures similarity between vectors
  • range: [-1, 1]
  • −1 => exactly opposite, 1 => exactly the same, 0 => orthogonal
  • in-between values indicate intermediate similarity or dissimilarity
  • invariant to scaling (multiplication) but not to translation (constant offset)
  • see: example
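
A minimal NumPy sketch illustrating the scale invariance noted above; the vectors are hypothetical:

```python
import numpy as np

# Cosine similarity: dot product over the product of vector norms.
x = np.array([3.0, 4.0])
y = np.array([6.0, 8.0])  # same direction as x, different scale

cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos)  # 1.0 -- scaling does not change the angle
```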

Entropy

Entropy(S) = -\sum_{i=0}^{c-1} p_i \log_2 p_i

  • p_i is the proportion of S belonging to class i
  • c is the total number of classes
  • by convention, 0 \log_2 0 = 0
  • min: 0, max: \log_2(c)
  • binary: -P(\oplus) \log_2 P(\oplus) - P(\ominus) \log_2 P(\ominus)
  • for multiple children, compute weighted entropy
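
A minimal NumPy sketch implementing the 0 log 0 = 0 convention:

```python
import numpy as np

# Entropy of a class distribution p (proportions summing to 1).
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability classes: 0 * log2(0) := 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0  (maximum for c = 2 classes)
print(entropy([1.0, 0.0]))  # 0.0  (pure node)
```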

Euclidean distance

d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

  • n is the number of dimensions/attributes
  • x_k and y_k are the k-th attributes of data objects x and y
  • measures distance between instances
  • not invariant to scaling (multiplication) or translation (constant offset)
  • special case of Minkowski distance where r = 2
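
A minimal NumPy sketch on a hypothetical 3-4-5 triangle:

```python
import numpy as np

# Euclidean distance between two points.
x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

d = np.sqrt(np.sum((x - y) ** 2))
print(d, np.linalg.norm(x - y))  # both 5.0
```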

Gini index

Gini(S) = 1 - \sum_{i=0}^{c-1} p_i^2

  • p_i is the proportion of S belonging to class i (relative frequency of training instances)
  • c is the total number of classes
  • range: min 0, max 1 - 1/c
  • binary classification: 1 - P(\oplus)^2 - P(\ominus)^2
  • for multiple children, compute the weighted Gini index
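
A minimal NumPy sketch:

```python
import numpy as np

# Gini index of a class distribution p (proportions summing to 1).
def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(gini([0.5, 0.5]))  # 0.5  (maximum for c = 2 is 1 - 1/2)
print(gini([1.0, 0.0]))  # 0.0  (pure node)
```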

Group average

proximity(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} proximity(x, y)}{m_i \times m_j}

  • where m_k is the number of objects in cluster C_k
  • distance measure for hierarchical clustering
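
A minimal NumPy sketch with two hypothetical clusters, using Euclidean distance as the pairwise proximity (any proximity function would do):

```python
import numpy as np

# Group-average proximity: average pairwise proximity across clusters.
ci = np.array([[0.0, 0.0], [1.0, 0.0]])
cj = np.array([[0.0, 3.0], [0.0, 4.0]])

total = sum(np.linalg.norm(x - y) for x in ci for y in cj)
prox = total / (len(ci) * len(cj))  # divide by m_i * m_j
print(prox)  # ≈ 3.57
```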

Information gain

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

  • S_v is the subset of S for which attribute A has value v
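
A minimal NumPy sketch with hypothetical class counts, reusing the entropy definition above:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Parent S: 10 positive, 10 negative; attribute A splits it 12 / 8.
parent = entropy([10 / 20, 10 / 20])
left = entropy([9 / 12, 3 / 12])   # S_v1: 9 positive, 3 negative
right = entropy([1 / 8, 7 / 8])    # S_v2: 1 positive, 7 negative

gain = parent - (12 / 20) * left - (8 / 20) * right
print(gain)  # ≈ 0.296
```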


Mahalanobis distance

d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}

  • Σ is the covariance matrix
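
A minimal NumPy sketch; the data is hypothetical and Σ is estimated from it:

```python
import numpy as np

# Mahalanobis distance between two points given a covariance matrix.
data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
sigma = np.cov(data.T)           # 2x2 covariance matrix of the attributes
sigma_inv = np.linalg.inv(sigma)

diff = data[0] - data[2]
d = np.sqrt(diff @ sigma_inv @ diff)
print(d)  # ≈ 1.73
```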

Manhattan distance

d(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{n} |p_i - q_i|

  • L_1 distance or \ell_1 norm
  • special case of Minkowski distance where r = 1
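
A minimal NumPy sketch with hypothetical points:

```python
import numpy as np

# Manhattan (L1) distance: sum of absolute coordinate differences.
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

d = np.sum(np.abs(p - q))
print(d)  # 5.0
```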

Mean

\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k
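
A minimal NumPy sketch with hypothetical data:

```python
import numpy as np

# Mean: sum of the values divided by their count.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(x.sum() / len(x), x.mean())  # both 2.5
```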


Minkowski distance

d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}

  • r is a parameter
  • n is the number of dimensions/attributes
  • x_k and y_k are the k-th attributes of data objects x and y
  • generalization of Euclidean distance
  • when r = 1 => Manhattan distance; r = 2 => Euclidean distance
  • when r → ∞ => "supremum"/L_∞ norm distance: d(x, y) = \max_k |x_k - y_k|
  • note: do not confuse r with n; all these distances are defined for all numbers of dimensions
  • weighted Minkowski: d(x, y) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^r \right)^{1/r}
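
A minimal NumPy sketch showing the three named special cases from the list above; the points are hypothetical:

```python
import numpy as np

# Minkowski distance with parameter r.
def minkowski(x, y, r):
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

print(minkowski(x, y, 1))     # 7.0  (Manhattan, r = 1)
print(minkowski(x, y, 2))     # 5.0  (Euclidean, r = 2)
print(np.max(np.abs(x - y)))  # 4.0  (supremum / L_inf limit)
```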

Misclassification error

Error(S) = 1 - \max_i p_i
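
A minimal NumPy sketch:

```python
import numpy as np

# Misclassification error of a node's class distribution p.
def error(p):
    return 1.0 - np.max(np.asarray(p, dtype=float))

print(error([0.5, 0.5]))  # 0.5  (worst case for 2 classes)
print(error([0.9, 0.1]))  # 0.1
```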


Silhouette Coefficient

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}

  • for some point i in a cluster
  • a_i is the average distance from i to the other points in its own cluster
  • b_i is the minimum, over clusters not containing i, of the average distance from i to the points of that cluster
  • range: [-1, 1]
  • a negative value is bad; values as close to 1 as possible are good (the ideal case is a_i = 0)
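
A minimal NumPy sketch for one point, with two tiny hypothetical clusters (with only one other cluster, the minimum in b_i is over that single cluster):

```python
import numpy as np

# Silhouette coefficient for the first point of `own`.
own = np.array([[0.0, 0.0], [0.0, 1.0]])    # cluster containing point i
other = np.array([[5.0, 5.0], [5.0, 6.0]])  # the only other cluster

i = own[0]
a_i = np.mean([np.linalg.norm(i - p) for p in own[1:]])  # within-cluster avg
b_i = np.mean([np.linalg.norm(i - p) for p in other])    # avg to other cluster

s_i = (b_i - a_i) / max(a_i, b_i)
print(s_i)  # ≈ 0.87 -> well clustered
```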

Standard deviation

SD(x) = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}

where \bar{x} is the mean of x.
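
A minimal NumPy sketch with hypothetical data; the n − 1 denominator corresponds to NumPy's ddof=1:

```python
import numpy as np

# Sample standard deviation (n - 1 in the denominator).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sd = np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))
print(sd, np.std(x, ddof=1))  # both ≈ 2.14
```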


Sum of squared errors

\frac{1}{2} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2
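
A minimal NumPy sketch with hypothetical predictions and targets:

```python
import numpy as np

# Sum of squared errors between predictions y_hat and targets y,
# with the 1/2 factor used above.
y_hat = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 2.0])

sse = 0.5 * np.sum((y_hat - y) ** 2)
print(sse)  # 0.625
```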