Detection Methods
Summarized notes from Introduction to Data Mining (Int'l ed.), Sections 9.3-9.8
Statistical Methods
- use statistical approaches to model the normal class => the result is a probability score
- instance \(x\) is anomalous if its anomaly score (e.g. the inverse of its probability under the model) exceeds a chosen constant threshold \(c\)
- two subclasses: parametric and non-parametric models
- parametric: use well-known families of statistical distributions that require estimating parameters from the data
- non-parametric: more flexible; learn the distribution of the normal class directly from the available data
- when outliers are prevalent in the training data, they may distort the notion of normal => consider alternative techniques able to account for this
Parametric models
- common types: Gaussian, Poisson, and binomial distributions
- parameters need to be learned from the available data
- when normal class follows chosen distribution, these methods are effective in detecting anomalies
Example: Gaussian distribution
- two parameters: mean \(\mu\) and standard deviation \(\sigma\), will be learned from the data
- probability density function \(f(x)\), for some point \(x\):
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
- in multivariate case, use multivariate Gaussian distribution
- in multivariate case, distance from center does not give viable anomaly score, because distribution is not symmetrical wrt. center if attributes are correlated
- instead use e.g. Mahalanobis distance because it considers shape of the data
- if normal class is not a single Gaussian distribution (multiple normal classes with different \(\mu\) and \(\sigma\)), use Gaussian mixture models instead
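A minimal sketch of the parametric approach, assuming the normal class is a single univariate Gaussian: estimate \(\mu\) and \(\sigma\) from training data, then score new points by their density under the fitted model. The toy data, variable names, and the 3\(\sigma\)-based threshold are illustrative choices, not prescribed by the book.

```python
import numpy as np

def fit_gaussian(x_train):
    """Estimate mu and sigma of the normal class from training data."""
    return x_train.mean(), x_train.std()

def gaussian_pdf(x, mu, sigma):
    """Probability density under the fitted Gaussian."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# illustrative data: normal class drawn from a single Gaussian
x_train = np.random.normal(loc=5.0, scale=1.0, size=1000)
mu, sigma = fit_gaussian(x_train)

x_new = np.array([5.2, 4.7, 12.0])
density = gaussian_pdf(x_new, mu, sigma)

# low density => anomalous; the threshold c is a user choice
c = gaussian_pdf(mu + 3 * sigma, mu, sigma)   # e.g. flag points beyond ~3 sigma
is_anomaly = density < c
```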
Non-parametric models
- use kernel function to approximate density of normal class from available data
- regions with a dense occurrence of normal instances have high probability
- anomaly score is the inverse of its probability wrt. learned density
- no assumption is made about data conforming to some known distribution
- example approach: histogram
- build a histogram of normal data, to represent different ranges of attribute values
- an instance is detected as anomalous if it does not fall within any bin of the histogram (or falls into a bin with very low frequency)
- right choice of bin size for the histogram is important
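A small sketch of the histogram idea for one-dimensional data: bin the normal training data, then score a new instance by the relative frequency of the bin it falls into; instances outside all bins (or in empty bins) get the highest score. The bin count and the inversion used for the score are illustrative choices.

```python
import numpy as np

# build a histogram of the normal class (bin count is a tuning choice)
x_train = np.random.normal(loc=0.0, scale=1.0, size=5000)
counts, edges = np.histogram(x_train, bins=30)
probs = counts / counts.sum()          # relative frequency per bin

def histogram_score(x):
    """Anomaly score = inverse of the estimated probability of x's bin."""
    idx = np.searchsorted(edges, x, side="right") - 1
    if idx < 0 or idx >= len(probs) or probs[idx] == 0:
        return np.inf                  # outside the histogram => highly anomalous
    return 1.0 / probs[idx]

print(histogram_score(0.1))   # low score: falls in a dense bin
print(histogram_score(6.0))   # inf: outside the range seen in training
```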
Analysis
Advantages
- statistical methods are effective and justifiable when sufficient knowledge about the data and its suitability for the selected technique is available
- produce confidence scores which are more informative than binary labels
Disadvantages
- not guaranteed to work if the wrong statistical model is applied
- not many methods exist for multivariate and high-dimensional data
Proximity-based Methods
- also: distance-based outlier detection techniques
- identify anomalies as those instances that are most distant from the other objects
- base assumptions:
  - normal instances are related and appear close to each other
  - anomalies are different and relatively far from other instances
- model-free and local techniques => no explicit model to determine anomaly score
Distance and density
Distance-based:
- Compute distance of object \(x\) to its \(k^{th}\) nearest neighbor: \(dist(x, k)\)
- when \(x\) is close to other instances, its score is low
- anomalous instance will be far from others and has high score
- this method is sensitive to choice of \(k\)
- more robust approach is to use average distance to \(k\) nearest neighbors
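A sketch of the distance-based scores using scikit-learn's NearestNeighbors; the two-dimensional toy data and \(k = 5\) are illustrative. Both \(dist(x, k)\) and the more robust average distance to the \(k\) nearest neighbors are computed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.normal(size=(500, 2))               # normal class
X = np.vstack([X, [[6.0, 6.0]]])                  # one far-away instance

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dists, _ = nn.kneighbors(X)
dists = dists[:, 1:]                              # drop the zero self-distance

score_kth = dists[:, -1]        # dist(x, k): distance to the k-th nearest neighbor
score_avg = dists.mean(axis=1)  # more robust: average distance to k nearest neighbors

print(np.argmax(score_avg))     # index of the most anomalous point (here, the appended outlier)
```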
Density-based:
- density represents number of instances \(n\) within a fixed distance \(d\)
- anomalies are instances that exist in regions of low density
- this approach is sensitive to the right choice of \(d\) => again can take average
Distance and density-based approaches are related:
- \(\text{density}(x, k) = 1/\text{dist}(x,k)\)
- \(\text{avg.density}(x, k) = 1/\text{avg.dist}(x,k)\)
Both consider the locality of an instance in computing anomaly score
=> when data contains varying densities, these two methods are not suitable
Relative density
This approach is for data with varying densities: the notion of density is relative to the densities of neighboring instances. One example of relative density:
\[ \text{relative density}(x, k) = \frac{\frac{1}{k}\sum_{i=1}^{k} \text{density}(y_i, k)}{\text{density}(x, k)} \]
where \(y_i\) are the \(k\)-nearest neighbors of point \(x\).
- relative density is high when average density of points in its neighborhood is higher than density of point \(x\)
- more robust measure if using average density
- Also see Local Outlier Factor which is a widely used relative density-based metric for anomaly detection
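A sketch of the relative density score following the formula above, with \(\text{density}(x, k)\) taken as the inverse of the average distance to the \(k\) nearest neighbors; the toy data and \(k\) are illustrative. The Local Outlier Factor mentioned above (LocalOutlierFactor in scikit-learn) implements a closely related measure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.normal(size=(300, 2))

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, idx = nn.kneighbors(X)
dists, idx = dists[:, 1:], idx[:, 1:]          # drop the self-neighbor

density = 1.0 / dists.mean(axis=1)             # avg. density(x, k)

# relative density: average density of the k nearest neighbors over the point's own density
relative_density = density[idx].mean(axis=1) / density

# high relative density => the point is much sparser than its neighborhood => anomalous
print(np.argsort(relative_density)[-3:])       # indices of the 3 strongest outliers
```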
Analysis
Advantages
- more general than statistical approaches because it is easier to determine a meaningful distance between objects than their statistical distribution
- non-parametric => not restricted to specific models or distributions
- widely applicable if a reasonable distance metric can be established
- can be visualized
Disadvantages
- sensitive to choice of distance metric
- difficult to define distance in high-dimensional spaces
- high complexity: \(O(n^2)\) time for \(n\) points
- choice of right parameters (\(d, k\), etc.) is challenging
Clustering Methods
- clusters represent the normal class
- base assumption: normal instances are close and can be clustered
- anomalies do not fit clusters or appear in small clusters far apart from normal clusters
- two categories: (1) small clusters are anomalous, (2) individual points are anomalous if they do not fit any cluster
- strategies for avoiding outliers strongly impacting the clustering:
- cluster once, remove outliers, then cluster again
- maintain set of potential outliers during clustering, and update this set as clusters change => instances remaining at the end are the outliers
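A sketch of the first strategy (cluster once, remove likely outliers, then cluster again) using k-means with distance to the nearest centroid as the outlier criterion; the cluster count and the 95th-percentile cutoff are illustrative choices, not prescribed values.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.normal(size=(400, 2))
X = np.vstack([X, np.random.uniform(low=8, high=10, size=(5, 2))])   # a few outliers

# first pass: cluster everything, score points by distance to their nearest centroid
km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist_to_centroid = km1.transform(X).min(axis=1)

# drop the most distant points (cutoff is a tuning choice), then cluster again
keep = dist_to_centroid < np.percentile(dist_to_centroid, 95)
km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[keep])
```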
Anomalous clusters
- assumes clustered anomalies exist in the data => appear in tight small groups
- these anomalous clusters are distant from clusters of normal class
- basic approach: cluster all data and flag those that are too small or too distant from other clusters
Anomalous Instances
- find singular instances not well-explained by clusters
- basic approach: first cluster all data, then assess degree of how well instances belong to clusters
- distance from prototype (as computed by an appropriate choice of metric) yields the anomaly score
- instances sufficiently far away from the centroids are flagged as anomalous
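A sketch of scoring individual instances against fitted clusters: the distance from a point to its nearest k-means centroid serves as the anomaly score, and points beyond a cutoff learned from the normal data are flagged; the cluster count and cutoff are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.normal(size=(500, 2))                 # normal class
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

def cluster_anomaly_score(X):
    """Distance to the nearest centroid; large values => poorly explained by the clusters."""
    return km.transform(X).min(axis=1)

X_new = np.array([[0.2, -0.1], [7.0, 7.0]])
scores = cluster_anomaly_score(X_new)
threshold = np.percentile(cluster_anomaly_score(X_train), 99)   # cutoff from normal data
print(scores > threshold)                                       # e.g. [False  True]
```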
Analysis
Advantages
- can operate without labelled data, and learn from only normal instances
- some algorithms, e.g. K-means, have linear time and space complexity
Disadvantages
- clustering algorithms are sensitive to noise => anomalies can distort and degrade quality of formed clusters
- sensitive to choice of number of clusters => whether instance is an outlier may depend on this choice
- care is needed in choosing the right clustering algorithm for specific data
Reconstruction Methods
- There are patterns in the distribution of the normal class that can be captured using lower-dimensional representations, e.g., by using dimensionality reduction techniques
- If there is a hidden structure in the normal class, we can expect to obtain a good approximation using a smaller number of features
- use e.g. principal components analysis (PCA) to obtain the principal components representing the normal class
- Once we have derived a smaller set of \(k\) features, we can project any new data instance \(x\) to its \(k\)-dimensional representation \(y\)
- also possible to generate reconstructed \(x\), denoted as \(\hat{x}\), from \(y\)
- Reconstruction error(\(x\)) = \(|| x - \hat{x}||^2\)
- the lower-dimensional features explain the normal data
- For normal instances, the reconstruction error is low
- For anomalous instances, the reconstruction error is high
- PCA features are linear combinations of the original attributes => cannot represent nonlinear patterns in the normal class
- ANN autoencoders can be used for nonlinear patterns
- Autoencoders can similarly measure reconstruction error to provide anomaly score
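A sketch of the reconstruction approach using scikit-learn's PCA: fit \(k\) principal components on normal data, project and reconstruct new instances, and use \(|| x - \hat{x}||^2\) as the anomaly score; the synthetic data and \(k = 1\) are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# normal class: points close to a 1-D linear structure embedded in 3-D
t = np.random.normal(size=(500, 1))
X_train = np.hstack([t, 2 * t, -t]) + 0.05 * np.random.normal(size=(500, 3))

pca = PCA(n_components=1).fit(X_train)           # k = 1 lower-dimensional representation

def reconstruction_error(X):
    """||x - x_hat||^2 after projecting onto the principal components and back."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

X_new = np.array([[1.0, 2.0, -1.0],              # fits the normal pattern => low error
                  [1.0, -2.0, 1.0]])             # violates the pattern => high error
print(reconstruction_error(X_new))
```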
Analysis
Advantages
- no assumptions made about distribution of normal class
- can represent rich variety of patterns of normal class
- robust against irrelevant attributes
Disadvantages
- computing the reconstruction error is costly for high-dimensional data
One-class SVM
- encloses normal class within a single decision boundary => outside the boundary means anomalous
- the focus is on modeling the boundary of the normal class
- Uses training instances from the normal class to learn its decision boundary
- Transform data to a higher dimensional space where the normal class can be separated using a linear hyperplane
- use applicable kernel function for this transformation
- in the transformed space, training instances can be separated with a linear hyperplane
- ideally want hyperplane where all normal instances are on one side of the plane
- One-class SVM differs from regular SVM in that the constraints are only defined for the normal class but not the anomaly class
- it is challenging to learn nonlinear boundaries in a one-class setting, when no information about anomalies is available
- "the origin trick" helps to maximize the boundary in the one-class scenario and learn the separating hyperplane (the origin works as a surrogate second class)
- hyperparameter \(ν\), the upper bound on the fraction of tolerated anomalies, impacts the learned boundary => a high value means the hyperplane is robust to a large number of outliers during training
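A sketch using scikit-learn's OneClassSVM with an RBF kernel, trained only on normal instances; \(ν\) bounds the fraction of training points treated as outliers, and the values used here are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.normal(size=(500, 2))        # normal class only

# nu bounds the fraction of training points allowed to fall outside the boundary
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_new = np.array([[0.1, -0.2], [5.0, 5.0]])
print(oc_svm.predict(X_new))            # +1 = inside the boundary (normal), -1 = anomalous
print(oc_svm.decision_function(X_new))  # signed distance to the boundary as a score
```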
Analysis
Advantages
- leverage the principle of structural risk minimization => strong theoretical foundations
- balance between simplicity of model and effectiveness of the boundary in enclosing the distribution of the normal class
- hyper-parameter \(ν\) provides a built-in mechanism to tolerate outliers in the training data, which are common in real-world problems
Disadvantages
- choice of \(ν\) significantly impacts the properties of the learned decision boundary
- use of a Gaussian kernel requires a relatively large training size to effectively learn nonlinear decision boundaries
- high computational complexity
Information Theory
- basic idea: consider the number of bits needed to encode an instance
- if normal class shows some structural pattern, fewer bits are needed to encode it
- anomalous instances require more bits
- some approaches:
- Kolmogorov complexity
- compression using standard compression techniques => use the size of file as a measure of information content
- in general: \(\text{Gain}(x) = \text{Info}(D) - \text{Info}(D \setminus x)\), where \(\text{Info}(D)\) is the information content of dataset \(D\) and \(D \setminus x\) is the dataset with \(x\) removed
- the value is high for an anomaly
- \(\text{Gain}(x)\) works as an anomaly score
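A sketch of the compression-based variant: the compressed size (via zlib) of a serialized data set stands in for \(\text{Info}(D)\), and \(\text{Gain}(x)\) is the drop in compressed size when \(x\) is removed; the serialization and toy data are illustrative choices.

```python
import zlib

def info(dataset):
    """Proxy for information content: compressed size of the serialized data set."""
    serialized = "\n".join(",".join(map(str, row)) for row in dataset).encode()
    return len(zlib.compress(serialized))

# a structured (normal) data set plus one instance that breaks the pattern
D = [(i, 2 * i) for i in range(100)] + [(17, 9999)]

gains = []
for i, x in enumerate(D):
    D_without_x = D[:i] + D[i + 1:]
    gains.append(info(D) - info(D_without_x))   # Gain(x) = Info(D) - Info(D \ x)

# index of the instance with the highest gain (typically the one breaking the pattern)
print(max(range(len(D)), key=lambda i: gains[i]))
```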
Analysis
Advantages
- unsupervised methods: do not require a separate training set of normal-only instances
- no assumptions are made about the structure of the normal class
- generic enough to be applied with data sets of varying types and properties
Disadvantages
- high complexity makes them expensive to apply to large data sets
- performance depends heavily on the choice of the measure used for capturing the information content