Detection Methods

Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 9.3-9.8

Statistical Methods

  • use statistical approaches to model normal class => the result is a probability score

    • instance x is flagged as anomalous if its anomaly score exceeds a chosen constant threshold c
  • two subclasses: parametric and non-parametric models

    • parametric: use well-known families of statistical distributions that require estimating parameters from the data
    • non-parametric: more flexible; learn the distribution of the normal class directly from the available data
  • When outliers are prevalent in training data, they may distort the notion of normal => consider alternative techniques able to account for this

Parametric models

  • common types: Gaussian, Poisson, and binomial distributions
  • parameters need to be learned from the available data
  • when normal class follows chosen distribution, these methods are effective in detecting anomalies

Example: Gaussian distribution

  • two parameters: mean μ and standard deviation σ, both learned from the data
  • probability density function f(x), for some point x:

    f(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
  • in multivariate case, use multivariate Gaussian distribution

    • in multivariate case, plain (Euclidean) distance from the center does not give a viable anomaly score, because the distribution is not symmetrical wrt. the center if attributes are correlated
    • instead use e.g. Mahalanobis distance because it considers shape of the data
  • if normal class is not a single Gaussian distribution (multiple normal classes with different μ and σ), use Gaussian mixture models instead
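
Below is a minimal sketch of the parametric approach, assuming the normal class follows a single multivariate Gaussian: the mean vector and covariance matrix are estimated from the data and the Mahalanobis distance serves as the anomaly score. The synthetic data, covariance values, and the threshold of 3.0 are illustrative assumptions, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
# "Normal class" training data: correlated 2-D Gaussian (illustrative)
normal_data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=500)

# Estimate the parameters of the model from the data
mu = normal_data.mean(axis=0)
cov = np.cov(normal_data, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    """Mahalanobis distance of x from the fitted Gaussian centre."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Score two test points: a large distance suggests an anomaly
for x in [np.array([0.1, -0.2]), np.array([4.0, -4.0])]:
    score = mahalanobis(x)
    print(x, "score:", round(score, 2), "anomalous:", score > 3.0)  # c = 3.0 chosen ad hoc
```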

Non-parametric models

  • use kernel function to approximate density of normal class from available data

    • regions with a dense occurrence of normal instances have high probability
    • anomaly score is the inverse of its probability wrt. learned density
    • no assumption is made about data conforming to some known distribution
  • example approach: histogram

    • build a histogram of normal data, to represent different ranges of attribute values
    • an instance is flagged as anomalous if it does not fall within any bin of the histogram built from the normal data
    • right choice of bin size for the histogram is important
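
A rough sketch of the histogram idea for a single attribute follows; the example data, the bin count of 20, and the "empty bin" rule are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
normal_values = rng.normal(loc=10.0, scale=2.0, size=1000)  # normal class, one attribute

# Build a histogram of the normal data
counts, edges = np.histogram(normal_values, bins=20)

def is_anomalous(value, min_count=1):
    """Anomalous if the value falls outside every bin, or in a bin with too few normal instances."""
    idx = np.searchsorted(edges, value, side="right") - 1
    if idx < 0 or idx >= len(counts):   # outside the histogram's range
        return True
    return counts[idx] < min_count

print(is_anomalous(10.5))   # typical value -> False
print(is_anomalous(25.0))   # far outside the observed range -> True
```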

Analysis

Advantages

  • statistical methods are effective and justifiable when sufficient knowledge about the data and its suitability for the selected technique is available

  • produce confidence scores which are more informative than binary labels

Disadvantages

  • not guaranteed to work if the wrong statistical model is applied

  • not many methods exist for multivariate and high-dimensional data


Proximity-based Methods

  • also: distance-based outlier detection techniques

  • identify anomalies as those instances that are most distant from the other objects

  • base assumptions:

    • normal instances are related and appear close to each other

    • anomalies are different and relatively far from other instances

  • model-free and local techniques => no explicit model to determine anomaly score

Distance and density

Distance-based:

  • Compute distance of object x to its kth nearest neighbor: dist(x,k)
    • when x is close to other instances, its score is low
    • anomalous instance will be far from others and has high score
  • this method is sensitive to choice of k
    • more robust approach is to use average distance to k nearest neighbors

Density-based:

  • density represents number of instances n within a fixed distance d
  • anomalies are instances that exist in regions of low density
  • this approach is sensitive to the right choice of d => again can take average


Distance and density-based approaches are related:

  • density(x,k)=1/dist(x,k)
  • avg.density(x,k)=1/avg.dist(x,k)

Both consider only the absolute locality of an instance when computing the anomaly score
=> when data contains regions of varying density, these two methods are not suitable
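
The sketch below (assuming scikit-learn and NumPy are available) computes dist(x,k), the average distance to the k nearest neighbors, and the corresponding density scores on a small synthetic data set; k = 5 and the injected outlier are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))              # normal instances
X = np.vstack([X, [[6.0, 6.0]]])           # plus one obvious outlier

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own nearest neighbour
distances, _ = nn.kneighbors(X)

knn_dist = distances[:, -1]                # dist(x,k): distance to the k-th neighbour
avg_dist = distances[:, 1:].mean(axis=1)   # average distance to the k nearest neighbours

density = 1.0 / knn_dist                   # density(x,k)     = 1 / dist(x,k)
avg_density = 1.0 / avg_dist               # avg.density(x,k) = 1 / avg.dist(x,k)

# The injected outlier should get the largest distance score and the smallest density
print("highest avg. distance at index:", int(np.argmax(avg_dist)))
print("lowest avg. density at index:  ", int(np.argmin(avg_density)))
```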

Relative density

This approach is for data with varying densities: the notion of density is relative to the densities of neighboring instances. One example of relative density:

relative density(x,k) = ( Σᵢ density(yᵢ,k) / k ) / density(x,k)

where yᵢ, i = 1,…,k, are the k nearest neighbors of point x.

  • relative density is high when the average density of the points in its neighborhood is higher than the density of point x itself
  • as before, the measure is more robust if the average density over the k nearest neighbors is used
  • Also see Local Outlier Factor which is a widely used relative density-based metric for anomaly detection
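
As a concrete relative-density measure, the sketch below applies scikit-learn's LocalOutlierFactor to data with two regions of very different density; the cluster positions, spreads, and n_neighbors = 20 are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
dense = rng.normal(0.0, 0.3, size=(200, 2))    # dense region
sparse = rng.normal(5.0, 2.0, size=(200, 2))   # sparse region with larger spread
X = np.vstack([dense, sparse, [[0.0, 2.0]]])   # appended point: outlying relative to the dense cluster

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# negative_outlier_factor_ is (roughly) -LOF; scores well above 1 indicate anomalies
scores = -lof.negative_outlier_factor_
print("top-3 anomaly indices:", np.argsort(scores)[-3:])   # the appended point should rank highly
```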

Analysis

Advantages

  • more general than statistical approaches because it is easier to determine a distance between objects than their distribution
  • non-parametric => not restricted to specific models or distributions
  • widely applicable if a reasonable distance metric can be established
  • can be visualized

Disadvantages

  • sensitive to choice of distance metric
  • difficult to define distance in high-dimensional spaces
  • high complexity: O(n²) time for n points
  • choice of right parameters (d,k, etc.) is challenging

Clustering Methods

  • clusters represent the normal class
  • base assumption: normal instances are close and can be clustered
  • anomalies do not fit clusters or appear in small clusters far apart from normal clusters
  • two categories: (1) small clusters are anomalous; (2) individual points are anomalous if they do not fit any cluster
  • strategies for avoiding outliers strongly impacting the clustering:
    • cluster once, remove outliers, then cluster again
    • maintain set of potential outliers during clustering, and update this set as clusters change => instances remaining at the end are the outliers

Anomalous clusters

  • assumes clustered anomalies exist in the data => appear in tight small groups
  • these anomalous clusters are distant from clusters of normal class
  • basic approach: cluster all data and flag those that are too small or too distant from other clusters

Anomalous Instances

  • find singular instances not well-explained by clusters
  • basic approach: first cluster all data, then assess how well each instance belongs to its cluster
    • distance from the cluster prototype (computed with an appropriately chosen metric) yields the anomaly score
    • instances sufficiently far from their centroid are flagged as anomalous
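
A sketch of both clustering-based strategies using k-means (scikit-learn assumed): flag very small clusters, and score individual instances by their distance to the assigned centroid. The number of clusters, the size cutoff, and the 3-sigma threshold are ad-hoc illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
normal = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(8, 1, (300, 2))])
tiny_cluster = rng.normal([20, 20], 0.2, (5, 2))        # small, distant group of anomalies
X = np.vstack([normal, tiny_cluster, [[4.0, 15.0]]])    # plus one isolated point

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels, centers = km.labels_, km.cluster_centers_

# (1) anomalous clusters: flag clusters that contain very few instances
sizes = np.bincount(labels, minlength=4)
print("cluster sizes:", sizes, "-> suspiciously small:", np.where(sizes < 10)[0])

# (2) anomalous instances: distance to the assigned centroid as the anomaly score
dist_to_centroid = np.linalg.norm(X - centers[labels], axis=1)
threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
print("flagged instances:", np.where(dist_to_centroid > threshold)[0])
```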

Analysis

Advantages

  • can operate without labelled data, and learn from only normal instances
  • some algorithms, e.g. K-means, have linear time and space complexity

Disadvantages

  • clustering algorithms are sensitive to noise => anomalies can distort and degrade quality of formed clusters
  • sensitive to choice of number of clusters => whether instance is an outlier may depend on this choice
  • care is needed in choosing the right clustering algorithm for specific data

Reconstruction Methods

  • There are patterns in the distribution of the normal class that can be captured using lower-dimensional representations, e.g., by using dimensionality reduction techniques

  • If there is a hidden structure in the normal class, we can expect to obtain a good approximation using a smaller number of features

  • use e.g. principal components analysis (PCA) to obtain the principal components representing the normal class

  • Once we have derived a smaller set of k features, we can project any new data instance x to its k-dimensional representation y

    • also possible to generate reconstructed x, denoted as x̂, from y
  • Reconstruction error(x) = ||x − x̂||²

    • the lower-dimensional features explain the normal data
    • For normal instances, the reconstruction error is low
    • For anomalous instances, the reconstruction error is high
  • PCA works well when the derived features are linear combinations of the original attributes => it cannot represent nonlinear patterns in the normal class

  • ANN autoencoders can be used for nonlinear patterns
  • Autoencoders can similarly measure reconstruction error to provide anomaly score
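
A minimal sketch of the PCA variant (scikit-learn assumed): the normal class is generated with a hidden one-dimensional structure inside three attributes, a single principal component is kept (k = 1), and the squared reconstruction error is the anomaly score. The data-generating pattern and test points are invented for the example; an autoencoder could replace PCA for nonlinear structure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Normal class: 3 attributes that are (noisy) linear functions of one hidden factor t
t = rng.normal(size=(500, 1))
X_normal = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(500, 3))

pca = PCA(n_components=1).fit(X_normal)        # keep k = 1 lower-dimensional feature

def reconstruction_error(x):
    y = pca.transform(x.reshape(1, -1))        # project x to its k-dimensional representation y
    x_hat = pca.inverse_transform(y)           # reconstruct x̂ from y
    return float(np.sum((x - x_hat) ** 2))     # ||x - x̂||²

print(reconstruction_error(np.array([1.0, 2.0, -1.0])))   # follows the normal pattern -> small error
print(reconstruction_error(np.array([1.0, -2.0, 1.0])))   # violates the pattern -> large error
```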

Analysis

Advantages

  • no assumptions made about distribution of normal class
  • can represent rich variety of patterns of normal class
  • robust against irrelevant attributes

Disadvantages

  • computing the reconstruction error is costly for high-dimensional data

One-class SVM

  • encloses normal class within a single decision boundary => outside the boundary means anomalous

  • the focus is on modeling the boundary of the normal class

  • Uses training instances from the normal class to learn its decision boundary

  • Transform data to a higher dimensional space where the normal class can be separated using a linear hyperplane

    • use applicable kernel function for this transformation

    • in the transformed space, training instances can be separated with a linear hyperplane

    • ideally want hyperplane where all normal instances are on one side of the plane

  • One-class SVM differs from regular SVM in that the constraints are only defined for the normal class but not the anomaly class

    • it is challenging to learn nonlinear boundaries in one class setting, when no information about anomalies is available

    • "the origin trick" helps to maximize the boundary in one class scenario and learn the separating hyperplane (the origin works as a surrogate second class)

    • hyperparameter ν, an upper bound on the fraction of tolerated anomalies, impacts the learned boundary => a high value means the hyperplane is robust to a large number of outliers during training
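
A short sketch with scikit-learn's OneClassSVM, trained only on normal-class instances; the RBF kernel, ν = 0.05, and the test points are illustrative choices.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X_train = rng.normal(0.0, 1.0, size=(500, 2))   # training data: normal class only

# nu: upper bound on the fraction of training instances tolerated as outliers
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_test = np.array([[0.2, -0.1],                 # inside the learned boundary
                   [5.0, 5.0]])                 # far outside it
print(ocsvm.predict(X_test))                    # +1 = normal, -1 = anomalous
print(ocsvm.decision_function(X_test))          # signed distance to the boundary as a score
```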

Analysis

Advantages

  • leverage the principle of structural risk minimization => strong theoretical foundations
  • balance between simplicity of model and effectiveness of the boundary in enclosing the distribution of the normal class
  • hyperparameter ν provides a built-in mechanism to account for outliers in the training data, which are common in real-world problems

Disadvantages

  • choice of ν significantly impacts the properties of the learned decision boundary
  • use of a Gaussian kernel requires a relatively large training size to effectively learn nonlinear decision boundaries
  • high computational complexity

Information Theory

  • basic idea: consider the number of bits needed to encode an instance

    • if normal class shows some structural pattern, fewer bits are needed to encode it
    • anomalous instances require more bits
  • some approaches:

    • Kolmogorov complexity
    • compression using standard compression techniques => use the size of the compressed file as a measure of information content
  • In general: Gain(x) = Info(D) − Info(D \ x), where Info(D) is the information content of the dataset and Info(D \ x) the content after removing x

    • the value is high for an anomaly
    • Gain(x) works as an anomaly score
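
A rough, purely illustrative sketch of the compression idea using Python's zlib: the information content of the data set is approximated by its compressed size, and Gain(x) is the drop in that size when x is removed. The record format and the regular/irregular strings are invented for the example.

```python
import zlib

def info(records):
    """Approximate the information content of a dataset by its compressed size in bytes."""
    return len(zlib.compress("\n".join(records).encode("utf-8"), 9))

# Mostly regular records plus one structurally different instance
D = ["ABABABAB"] * 50 + ["QZXWVKPT"]
info_D = info(D)

def gain(i):
    """Gain(x_i) = Info(D) - Info(D \\ x_i): information saved by removing the i-th record."""
    return info_D - info(D[:i] + D[i + 1:])

print("gain of a regular record:    ", gain(0))            # small: it adds little new information
print("gain of the irregular record:", gain(len(D) - 1))   # larger: it is expensive to encode
```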

Analysis

Advantages

  • unsupervised methods: they do not require a separate training set of normal-only instances
  • no assumptions are made about the structure of the normal class
  • generic enough to be applied with data sets of varying types and properties

Disadvantages

  • high complexity makes them expensive to apply to large data sets
  • performance depends heavily on the choice of the measure used for capturing the information content