
Detection Methods

Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 9.3-9.8

Statistical Methods

  • use statistical approaches to model normal class => the result is a probability score

    • instance \(x\) is flagged as anomalous if its probability under the model falls below a chosen constant threshold \(c\)
  • two subclasses: parametric and non-parametric models

    • parametric: use well-known families of statistical distributions that require estimating parameters from the data
    • non-parametric: more flexible; learn the distribution of the normal class directly from the available data
  • When outliers are prevalent in training data, they may distort the notion of normal => consider alternative techniques able to account for this

Parametric models

  • common types: Gaussian, Poisson, and binomial distributions
  • parameters need to be learned from the available data
  • when normal class follows chosen distribution, these methods are effective in detecting anomalies

Example: Gaussian distribution

  • two parameters: mean \(\mu\) and standard deviation \(\sigma\), both estimated from the data
  • probability density function \(f(x)\), for some point \(x\):

    \[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
  • in multivariate case, use multivariate Gaussian distribution

    • in multivariate case, distance from center does not give viable anomaly score, because distribution is not symmetrical wrt. center if attributes are correlated
    • instead use e.g. the Mahalanobis distance, because it accounts for the shape (covariance) of the data (see the sketch after this list)
  • if normal class is not a single Gaussian distribution (multiple normal classes with different \(\mu\) and \(\sigma\)), use Gaussian mixture models instead
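
As an illustration (not from the book), a minimal numpy sketch of the parametric approach above: estimate \(\mu\) and the covariance matrix from normal data, then score new points by their Mahalanobis distance. The function name and toy data are hypothetical.

```python
import numpy as np

def mahalanobis_scores(X_train, X_test):
    """Fit a multivariate Gaussian to the normal (training) data and
    score test points by their Mahalanobis distance from its mean."""
    mu = X_train.mean(axis=0)                    # estimated mean
    cov = np.cov(X_train, rowvar=False)          # estimated covariance
    cov_inv = np.linalg.inv(cov)                 # assumes covariance is invertible
    diff = X_test - mu
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu), row by row
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)

# toy usage: correlated 2-D normal data plus one far-away point
rng = np.random.default_rng(0)
normal = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
test = np.vstack([normal[:5], [[4.0, -4.0]]])
print(mahalanobis_scores(normal, test))          # the last point scores much higher
```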

Non-parametric models

  • use kernel function to approximate density of normal class from available data

    • regions with a dense occurrence of normal instances have high probability
    • anomaly score is the inverse of its probability wrt. learned density
    • no assumption is made about data conforming to some known distribution
  • example approach: histogram

    • build a histogram of normal data, to represent different ranges of attribute values
    • flag an instance as anomalous if it falls outside the bins of the histogram (or into a bin with very low frequency)
    • right choice of bin size for the histogram is important
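
A minimal sketch of the histogram approach, assuming numpy; the bin count, toy data, and inverse-frequency scoring are illustrative choices, not the book's prescription.

```python
import numpy as np

def histogram_scores(train, test, bins=20):
    """Build a histogram of the normal data and score each test instance by the
    inverse of its bin's relative frequency; values outside every bin (or in an
    empty bin) get an infinite score."""
    counts, edges = np.histogram(train, bins=bins)
    freq = counts / counts.sum()
    bin_idx = np.searchsorted(edges, test, side="right") - 1
    scores = np.empty(len(test))
    for i, b in enumerate(bin_idx):
        if b < 0 or b >= len(counts) or freq[b] == 0:
            scores[i] = np.inf                   # falls outside the histogram
        else:
            scores[i] = 1.0 / freq[b]            # low-frequency bin => high score
    return scores

rng = np.random.default_rng(1)
normal = rng.normal(loc=5.0, scale=1.0, size=1000)
print(histogram_scores(normal, np.array([5.0, 6.5, 12.0])))   # 12.0 is flagged
```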

Analysis

Advantages

  • statistical methods are effective and justifiable when there is sufficient knowledge of the data and of its suitability for the selected technique

  • produce confidence scores which are more informative than binary labels

Disadvantages

  • not guaranteed to work if the wrong statistical model is applied

  • not many methods exist for multivariate and high-dimensional data


Proximity-based Methods

  • also: distance-based outlier detection techniques

  • identify anomalies as those instances that are most distant from the other objects

  • base assumptions:

    • normal instances are related and appear close to each other

    • anomalies are different and relatively far from other instances

  • model-free and local techniques => no explicit model to determine anomaly score

Distance and density

Distance-based:

  • Compute distance of object \(x\) to its \(k^{th}\) nearest neighbor: \(dist(x, k)\)
    • when \(x\) is close to other instances, its score is low
    • anomalous instance will be far from others and has high score
  • this method is sensitive to choice of \(k\)
    • more robust approach is to use average distance to \(k\) nearest neighbors

Density-based:

  • density represents number of instances \(n\) within a fixed distance \(d\)
  • anomalies are instances that exist in regions of low density
  • this approach is sensitive to the right choice of \(d\) => again, averaging makes it more robust


Distance and density-based approaches are related:

  • \(\text{density}(x, k) = 1/\text{dist}(x,k)\)
  • \(\text{avg.density}(x, k) = 1/\text{avg.dist}(x,k)\)

Both score an instance by the absolute distance or density of its locality
=> when data contains regions of varying density, these two methods are not suitable, because no single threshold fits both dense and sparse regions
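
A small sketch of \(\text{avg.dist}(x, k)\) and \(\text{avg.density}(x, k)\), assuming numpy and scikit-learn's NearestNeighbors; the function name, \(k\), and toy data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, k=5):
    """Average distance to the k nearest neighbors (anomaly score) and its
    inverse (density), per the relations above."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    dist, _ = nn.kneighbors(X)
    avg_dist = dist[:, 1:].mean(axis=1)               # drop the zero self-distance
    return avg_dist, 1.0 / avg_dist

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])   # one obvious outlier
avg_dist, avg_density = knn_scores(X, k=5)
print(np.argmax(avg_dist))   # 200: the injected point has the largest average distance
```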

Relative density

This approach is for data with varying densities: the notion of density is relative to the densities of neighboring instances. One example of relative density:

\[ \text{relative density}(x, k) = \frac{\sum_{i=1}^{k} \text{density}(y_i, k)/k}{\text{density}(x,k)} \]

where \(y_1, \dots, y_k\) are the \(k\)-nearest neighbors of point \(x\).

  • relative density is high when average density of points in its neighborhood is higher than density of point \(x\)
  • more robust measure if using average density
  • Also see Local Outlier Factor which is a widely used relative density-based metric for anomaly detection
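
A sketch of the relative density formula above, using the more robust average-distance variant of density; the helper name and toy clusters are hypothetical. scikit-learn also ships LocalOutlierFactor for the LOF metric mentioned above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def relative_density(X, k=5):
    """Ratio of the average density of a point's k nearest neighbors to the
    point's own density, with density(x, k) = 1 / avg. distance to neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)                  # column 0 is the point itself
    density = 1.0 / dist[:, 1:].mean(axis=1)
    neighbor_density = density[idx[:, 1:]].mean(axis=1)
    return neighbor_density / density             # values well above 1 suggest an outlier

rng = np.random.default_rng(3)
dense = rng.normal(0.0, 0.3, size=(200, 2))       # tight cluster
sparse = rng.normal(5.0, 2.0, size=(50, 2))       # loose cluster
X = np.vstack([dense, sparse, [[1.5, 1.5]]])      # isolated point between the clusters
print(relative_density(X, k=5)[-1])               # scored relative to its own neighborhood
```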

Analysis

Advantages

  • more general than statistical approaches, because it is often easier to define a distance between objects than to determine their distribution
  • non-parametric => not restricted to specific models or distributions
  • widely applicable if a reasonable distance metric can be established
  • can be visualized

Disadvantages

  • sensitive to choice of distance metric
  • difficult to define distance in high-dimensional spaces
  • high complexity: \(O(n^2)\) time for \(n\) points
  • choice of right parameters (\(d, k\), etc.) is challenging

Clustering Methods

  • clusters represent the normal class
  • base assumption: normal instances are close and can be clustered
  • anomalies do not fit clusters or appear in small clusters far apart from normal clusters
  • two categories: (1) small clusters are anomalous; (2) individual points are anomalous if they do not fit any cluster
  • strategies for avoiding outliers strongly impacting the clustering:
    • cluster once, remove outliers, then cluster again
    • maintain set of potential outliers during clustering, and update this set as clusters change => instances remaining at the end are the outliers

Anomalous clusters

  • assumes clustered anomalies exist in the data => appear in tight small groups
  • these anomalous clusters are distant from clusters of normal class
  • basic approach: cluster all data and flag those that are too small or too distant from other clusters

Anomalous Instances

  • find singular instances not well-explained by clusters
  • basic approach: first cluster all data, then assess degree of how well instances belong to clusters
    • distance from the cluster prototype (computed with an appropriate choice of metric) yields the anomaly score
    • instances sufficiently far from their centroid are flagged as anomalous
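
A minimal sketch of the second strategy, assuming scikit-learn's KMeans: cluster the data, use the distance to the assigned centroid as the anomaly score, and flag the most distant instances. The cluster count, threshold, and toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def centroid_distance_scores(X, n_clusters=2):
    """Cluster the data with K-means and score each instance by its distance
    to the centroid of the cluster it was assigned to."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(150, 2)),   # first normal cluster
    rng.normal(5.0, 0.5, size=(150, 2)),   # second normal cluster
    [[10.0, 10.0]],                        # a point fitting neither cluster
])
scores = centroid_distance_scores(X, n_clusters=2)
threshold = np.percentile(scores, 99)      # illustrative cut-off
print(np.where(scores > threshold)[0])     # includes index 300, the injected point
```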

Analysis

Advantages

  • can operate without labelled data, and learn from only normal instances
  • some algorithms, e.g. K-means, have linear time and space complexity

Disadvantages

  • clustering algorithms are sensitive to noise => anomalies can distort and degrade quality of formed clusters
  • sensitive to choice of number of clusters => whether instance is an outlier may depend on this choice
  • care is needed in choosing the right clustering algorithm for specific data

Reconstruction Methods

  • There are patterns in the distribution of the normal class that can be captured using lower-dimensional representations, e.g., by using dimensionality reduction techniques

  • If there is a hidden structure in the normal class, we can expect to obtain a good approximation using a smaller number of features

  • use e.g. principal components analysis (PCA) to obtain the principal components representing the normal class

  • Once we have derived a smaller set of \(k\) features, we can project any new data instance \(x\) to its \(k\)-dimensional representation \(y\)

    • also possible to generate reconstructed \(x\), denoted as \(\hat{x}\), from \(y\)
  • Reconstruction error(\(x\)) = \(|| x - \hat{x}||^2\)

    • the lower-dimensional features explain the normal data
    • For normal instances, the reconstruction error is low
    • For anomalous instances, the reconstruction error is high
  • PCA works well when the normal class is well described by linear combinations of the original attributes => it cannot represent nonlinear patterns in the normal class

  • ANN autoencoders can be used for nonlinear patterns
  • Autoencoders can similarly measure reconstruction error to provide anomaly score
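
A minimal sketch of the PCA-based variant, assuming scikit-learn: project onto \(k\) principal components learned from normal data, reconstruct, and use the squared error as the anomaly score. The helper name and synthetic data are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_errors(X_train, X_test, k=2):
    """Fit PCA with k components on normal data, project test points to the
    k-dimensional representation y, reconstruct x-hat, and return ||x - x_hat||^2."""
    pca = PCA(n_components=k).fit(X_train)
    y = pca.transform(X_test)
    x_hat = pca.inverse_transform(y)
    return np.sum((X_test - x_hat) ** 2, axis=1)

rng = np.random.default_rng(5)
# normal data lies close to a 2-D subspace embedded in 5-D space
basis = rng.normal(size=(2, 5))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 5))
outlier = 3.0 * rng.normal(size=(1, 5))            # not constrained to the subspace
errors = reconstruction_errors(normal, np.vstack([normal[:5], outlier]), k=2)
print(errors)                                      # the last error is typically far larger
```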

Analysis

Advantages

  • no assumptions made about distribution of normal class
  • can represent rich variety of patterns of normal class
  • robust against irrelevant attributes

Disadvantages

  • computing the reconstruction error becomes costly for high-dimensional data

One-class SVM

  • encloses normal class within a single decision boundary => outside the boundary means anomalous

  • the focus is on modeling the boundary of the normal class

  • Uses training instances from the normal class to learn its decision boundary

  • Transform data to a higher dimensional space where the normal class can be separated using a linear hyperplane

    • use applicable kernel function for this transformation

    • in the transformed space, training instances can be separated with a linear hyperplane

    • ideally want hyperplane where all normal instances are on one side of the plane

  • One-class SVM differs from regular SVM in that the constraints are only defined for the normal class but not the anomaly class

    • it is challenging to learn nonlinear boundaries in one class setting, when no information about anomalies is available

    • "the origin trick" helps to maximize the boundary in one class scenario and learn the separating hyperplane (the origin works as a surrogate second class)

    • hyperparameter \(\nu\), an upper bound on the fraction of training instances treated as outliers, impacts the learned boundary => a high value means the hyperplane is robust to a large number of outliers during training
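
A minimal usage sketch with scikit-learn's OneClassSVM (RBF kernel); the \(\nu\) value and toy data are illustrative, not recommendations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(0.0, 1.0, size=(500, 2))            # normal class only
X_test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)), [[5.0, 5.0]]])

# nu bounds the fraction of training points allowed outside the boundary;
# the RBF kernel maps the data to a space where a linear separator is learned
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
print(ocsvm.predict(X_test))             # +1 = inside the boundary, -1 = anomalous
print(ocsvm.decision_function(X_test))   # lower values => more anomalous
```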

Analysis

Advantages

  • leverage the principle of structural risk minimization => strong theoretical foundations
  • balance between simplicity of model and effectiveness of the boundary in enclosing the distribution of the normal class
  • hyper-parameter \(\nu\) provides a built-in mechanism for tolerating outliers in the training data, which are common in real-world problems

Disadvantages

  • choice of \(\nu\) significantly impacts the properties of the learned decision boundary
  • use of a Gaussian kernel requires a relatively large training size to effectively learn nonlinear decision boundaries
  • high computational complexity

Information Theory

  • basic idea: consider the amount of bits needed to encode an instance

    • if normal class shows some structural pattern, fewer bits are needed to encode it
    • anomalous instances require more bits
  • some approaches:

    • Kolmogorov complexity
    • compression using standard compression techniques => use the size of file as a measure of information content
  • In general: \(\text{Gain}(x) = \text{Info}(D) - \text{Info}(D \setminus x)\), where \(\text{Info}(D)\) is the information content of dataset \(D\), measured before and after removing \(x\)

    • the value is high for an anomaly
    • \(\text{Gain}(x)\) works as an anomaly score
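
A toy sketch of the compression-based variant, using zlib as a crude stand-in for information content; the record format and helper names are hypothetical.

```python
import zlib

def info(records):
    """Approximate information content: size of the compressed serialization
    (a crude stand-in for Kolmogorov complexity)."""
    return len(zlib.compress("\n".join(records).encode()))

def gain(records, i):
    """Gain(x) = Info(D) - Info(D \\ x): bytes saved by removing instance i."""
    return info(records) - info(records[:i] + records[i + 1:])

# mostly repetitive records compress well; the odd one out contributes more bits
D = ["status=ok user=alice"] * 50 + ["FATAL stack overflow at 0xdeadbeef"]
print(gain(D, 0), gain(D, len(D) - 1))   # removing the unusual record saves more bytes
```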

Analysis

Advantages

  • unsupervised methods: do not require a separate training set of normal-only instances
  • no assumptions are made about the structure of the normal class
  • generic enough to be applied with data sets of varying types and properties

Disadvantages

  • high complexity makes them expensive to apply to large data sets
  • performance depends heavily on the choice of the measure used for capturing the information content