Detection Methods
Summarized notes from Introduction to Data Mining (Int'l ed.), Sections 9.3-9.8
Statistical Methods
- use statistical approaches to model the normal class => the result is a probability score
- instance \(x\) is anomalous if its anomaly score (e.g. the inverse of its probability under the model) exceeds a chosen constant threshold \(c\)
- two subclasses: parametric and non-parametric models
- parametric: use well-known families of statistical distributions that require estimating parameters from the data
- non-parametric: more flexible; learn the distribution of the normal class directly from the available data
- when outliers are prevalent in the training data, they may distort the notion of normal => consider alternative techniques able to account for this
Parametric models
- common types: Gaussian, Poisson, and binomial distributions
- parameters need to be learned from the available data
- when normal class follows chosen distribution, these methods are effective in detecting anomalies
Example: Gaussian distribution
- two parameters: mean \(\mu\) and standard deviation \(\sigma\), will be learned from the data
- probability density function \(f(x)\), for some point \(x\):
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
- in multivariate case, use multivariate Gaussian distribution
- in multivariate case, distance from center does not give viable anomaly score, because distribution is not symmetrical wrt. center if attributes are correlated
- instead use e.g. Mahalanobis distance because it considers shape of the data
- if normal class is not a single Gaussian distribution (multiple normal classes with different \(\mu\) and \(\sigma\)), use Gaussian mixture models instead
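A minimal sketch of the parametric approach, assuming the normal class is a single univariate Gaussian: estimate \(\mu\) and \(\sigma\) from training data, then score new points by their density under the fitted model. The toy data, variable names, and the 3\(\sigma\)-based threshold are illustrative choices, not prescribed by the book.

```python
import numpy as np

def fit_gaussian(x_train):
    """Estimate mu and sigma of the normal class from training data."""
    return x_train.mean(), x_train.std()

def gaussian_pdf(x, mu, sigma):
    """Probability density under the fitted Gaussian."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# illustrative data: normal class drawn from a single Gaussian
x_train = np.random.normal(loc=5.0, scale=1.0, size=1000)
mu, sigma = fit_gaussian(x_train)

x_new = np.array([5.2, 4.7, 12.0])
density = gaussian_pdf(x_new, mu, sigma)

# low density => anomalous; the threshold c is a user choice
c = gaussian_pdf(mu + 3 * sigma, mu, sigma)   # e.g. flag points beyond ~3 sigma
is_anomaly = density < c
```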
Non-parametric models
- use kernel function to approximate density of normal class from available data
- regions with a dense occurrence of normal instances have high probability
- anomaly score is the inverse of its probability wrt. learned density
- no assumption is made about data conforming to some known distribution
- example approach: histogram
- build a histogram of normal data, to represent different ranges of attribute values
- an instance is detected as anomalous if it does not fall within any bin of the histogram (or falls into a bin with very low frequency)
- right choice of bin size for the histogram is important
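A small sketch of the histogram idea for one-dimensional data: bin the normal training data, then score a new instance by the relative frequency of the bin it falls into; instances outside all bins (or in empty bins) get the highest score. The bin count and the inversion used for the score are illustrative choices.

```python
import numpy as np

# build a histogram of the normal class (bin count is a tuning choice)
x_train = np.random.normal(loc=0.0, scale=1.0, size=5000)
counts, edges = np.histogram(x_train, bins=30)
probs = counts / counts.sum()          # relative frequency per bin

def histogram_score(x):
    """Anomaly score = inverse of the estimated probability of x's bin."""
    idx = np.searchsorted(edges, x, side="right") - 1
    if idx < 0 or idx >= len(probs) or probs[idx] == 0:
        return np.inf                  # outside the histogram => highly anomalous
    return 1.0 / probs[idx]

print(histogram_score(0.1))   # low score: falls in a dense bin
print(histogram_score(6.0))   # inf: outside the range seen in training
```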
Analysis
Advantages
- statistical methods are effective and justifiable when sufficient knowledge about the data and its suitability for the selected technique is available
- produce confidence scores which are more informative than binary labels
Disadvantages
- not guaranteed to work if the wrong statistical model is applied
- not many methods exist for multivariate and high-dimensional data
Proximity-based Methods
- also: distance-based outlier detection techniques
- identify anomalies as those instances that are most distant from the other objects
- base assumptions:
  - normal instances are related and appear close to each other
  - anomalies are different and relatively far from other instances
- model-free and local techniques => no explicit model to determine anomaly score
Distance and density
Distance-based:
- Compute distance of object \(x\) to its \(k^{th}\) nearest neighbor: \(dist(x, k)\)
- when \(x\) is close to other instances, its score is low
- anomalous instance will be far from others and has high score
- this method is sensitive to choice of \(k\)
- more robust approach is to use average distance to \(k\) nearest neighbors
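A sketch of the distance-based scores using scikit-learn's NearestNeighbors; the two-dimensional toy data and \(k = 5\) are illustrative. Both \(dist(x, k)\) and the more robust average distance to the \(k\) nearest neighbors are computed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.normal(size=(500, 2))               # normal class
X = np.vstack([X, [[6.0, 6.0]]])                  # one far-away instance

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dists, _ = nn.kneighbors(X)
dists = dists[:, 1:]                              # drop the zero self-distance

score_kth = dists[:, -1]        # dist(x, k): distance to the k-th nearest neighbor
score_avg = dists.mean(axis=1)  # more robust: average distance to k nearest neighbors

print(np.argmax(score_avg))     # index of the most anomalous point (here, the appended outlier)
```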
Density-based:
- density represents number of instances \(n\) within a fixed distance \(d\)
- anomalies are instances that exist in regions of low density
- this approach is sensitive to the right choice of \(d\) => again can take average
Distance and density-based approaches are related:
- \(\text{density}(x, k) = 1/\text{dist}(x,k)\)
- \(\text{avg.density}(x, k) = 1/\text{avg.dist}(x,k)\)
Both consider the locality of an instance in computing anomaly score
=> when data contains varying densities, these two methods are not suitable
Relative density
This approach is for data with varying densities: the notion of density is relative to the densities of neighboring instances. One example of relative density:
\[ \text{relative density}(x, k) = \frac{\frac{1}{k}\sum_{i=1}^{k} \text{density}(y_i, k)}{\text{density}(x, k)} \]
where \(y_i\) are the \(k\)-nearest neighbors of point \(x\).
- relative density is high when average density of points in its neighborhood is higher than density of point \(x\)
- more robust measure if using average density
- Also see Local Outlier Factor which is a widely used relative density-based metric for anomaly detection
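A sketch of the relative density score following the formula above, with \(\text{density}(x, k)\) taken as the inverse of the average distance to the \(k\) nearest neighbors; the toy data and \(k\) are illustrative. The Local Outlier Factor mentioned above (LocalOutlierFactor in scikit-learn) implements a closely related measure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.normal(size=(300, 2))

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, idx = nn.kneighbors(X)
dists, idx = dists[:, 1:], idx[:, 1:]          # drop the self-neighbor

density = 1.0 / dists.mean(axis=1)             # avg. density(x, k)

# relative density: average density of the k nearest neighbors over the point's own density
relative_density = density[idx].mean(axis=1) / density

# high relative density => the point is much sparser than its neighborhood => anomalous
print(np.argsort(relative_density)[-3:])       # indices of the 3 strongest outliers
```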
Analysis
Advantages
- more general than statistical approaches because it is easier to determine a meaningful distance between objects than their statistical distribution
- non-parametric => not restricted to specific models or distributions
- widely applicable if a reasonable distance metric can be established
- can be visualized
Disadvantages
- sensitive to choice of distance metric
- difficult to define distance in high-dimensional spaces
- high complexity: \(O(n^2)\) time for \(n\) points
- choice of right parameters (\(d, k\), etc.) is challenging
Clustering Methods
- clusters represent the normal class
- base assumption: normal instances are close and can be clustered
- anomalies do not fit clusters or appear in small clusters far apart from normal clusters
- two categories: (1) small clusters are anomalous, (2) individual points are anomalous if they do not fit any cluster
- strategies for avoiding outliers strongly impacting the clustering:
- cluster once, remove outliers, then cluster again
- maintain set of potential outliers during clustering, and update this set as clusters change => instances remaining at the end are the outliers
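A sketch of the first strategy (cluster once, remove likely outliers, then cluster again) using k-means with distance to the nearest centroid as the outlier criterion; the cluster count and the 95th-percentile cutoff are illustrative choices, not prescribed values.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.normal(size=(400, 2))
X = np.vstack([X, np.random.uniform(low=8, high=10, size=(5, 2))])   # a few outliers

# first pass: cluster everything, score points by distance to their nearest centroid
km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist_to_centroid = km1.transform(X).min(axis=1)

# drop the most distant points (cutoff is a tuning choice), then cluster again
keep = dist_to_centroid < np.percentile(dist_to_centroid, 95)
km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[keep])
```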
Anomalous clusters
- assumes clustered anomalies exist in the data => appear in tight small groups
- these anomalous clusters are distant from clusters of normal class
- basic approach: cluster all data and flag those that are too small or too distant from other clusters
Anomalous Instances
- find singular instances not well-explained by clusters
- basic approach: first cluster all data, then assess degree of how well instances belong to clusters
- distance from prototype (as computed by an appropriate choice of metric) yields the anomaly score
- instances sufficiently far away from the centroids are flagged as anomalous
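A sketch of scoring individual instances against fitted clusters: the distance from a point to its nearest k-means centroid serves as the anomaly score, and points beyond a cutoff learned from the normal data are flagged; the cluster count and cutoff are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.normal(size=(500, 2))                 # normal class
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

def cluster_anomaly_score(X):
    """Distance to the nearest centroid; large values => poorly explained by the clusters."""
    return km.transform(X).min(axis=1)

X_new = np.array([[0.2, -0.1], [7.0, 7.0]])
scores = cluster_anomaly_score(X_new)
threshold = np.percentile(cluster_anomaly_score(X_train), 99)   # cutoff from normal data
print(scores > threshold)                                       # e.g. [False  True]
```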
Analysis
Advantages
- can operate without labelled data, and learn from only normal instances
- some algorithms, e.g. K-means, have linear time and space complexity
Disadvantages
- clustering algorithms are sensitive to noise => anomalies can distort and degrade quality of formed clusters
- sensitive to choice of number of clusters => whether instance is an outlier may depend on this choice
- care is needed in choosing the right clustering algorithm for specific data
Reconstruction Methods
- There are patterns in the distribution of the normal class that can be captured using lower-dimensional representations, e.g., by using dimensionality reduction techniques
- If there is a hidden structure in the normal class, we can expect to obtain a good approximation using a smaller number of features
- use e.g. principal components analysis (PCA) to obtain the principal components representing the normal class
- Once we have derived a smaller set of \(k\) features, we can project any new data instance \(x\) to its \(k\)-dimensional representation \(y\)
- also possible to generate reconstructed \(x\), denoted as \(\hat{x}\), from \(y\)
- Reconstruction error(\(x\)) = \(|| x - \hat{x}||^2\)
- the lower-dimensional features explain the normal data
- For normal instances, the reconstruction error is low
- For anomalous instances, the reconstruction error is high
- PCA features are linear combinations of the original attributes => cannot represent nonlinear patterns in the normal class
- ANN autoencoders can be used for nonlinear patterns
- Autoencoders can similarly measure reconstruction error to provide anomaly score
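A sketch of the reconstruction approach using scikit-learn's PCA: fit \(k\) principal components on normal data, project and reconstruct new instances, and use \(|| x - \hat{x}||^2\) as the anomaly score; the synthetic data and \(k = 1\) are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# normal class: points close to a 1-D linear structure embedded in 3-D
t = np.random.normal(size=(500, 1))
X_train = np.hstack([t, 2 * t, -t]) + 0.05 * np.random.normal(size=(500, 3))

pca = PCA(n_components=1).fit(X_train)           # k = 1 lower-dimensional representation

def reconstruction_error(X):
    """||x - x_hat||^2 after projecting onto the principal components and back."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

X_new = np.array([[1.0, 2.0, -1.0],              # fits the normal pattern => low error
                  [1.0, -2.0, 1.0]])             # violates the pattern => high error
print(reconstruction_error(X_new))
```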
Analysis
Advantages
- no assumptions made about distribution of normal class
- can represent rich variety of patterns of normal class
- robust against irrelevant attributes
Disadvantages
- computing the reconstruction error is costly for high-dimensional data
One-class SVM
- encloses normal class within a single decision boundary => outside the boundary means anomalous
- the focus is on modeling the boundary of the normal class
- Uses training instances from the normal class to learn its decision boundary
- Transform data to a higher dimensional space where the normal class can be separated using a linear hyperplane
- use applicable kernel function for this transformation
- in the transformed space, training instances can be separated with a linear hyperplane
- ideally want hyperplane where all normal instances are on one side of the plane
- One-class SVM differs from regular SVM in that the constraints are only defined for the normal class but not the anomaly class
- it is challenging to learn nonlinear boundaries in a one-class setting, when no information about anomalies is available
- "the origin trick" helps to maximize the boundary in the one-class scenario and learn the separating hyperplane (the origin works as a surrogate second class)
- hyperparameter \(ν\), the upper bound on the fraction of tolerated anomalies, impacts the learned boundary => a high value means the hyperplane is robust to a large number of outliers during training
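A sketch using scikit-learn's OneClassSVM with an RBF kernel, trained only on normal instances; \(ν\) bounds the fraction of training points treated as outliers, and the values used here are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.normal(size=(500, 2))        # normal class only

# nu bounds the fraction of training points allowed to fall outside the boundary
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_new = np.array([[0.1, -0.2], [5.0, 5.0]])
print(oc_svm.predict(X_new))            # +1 = inside the boundary (normal), -1 = anomalous
print(oc_svm.decision_function(X_new))  # signed distance to the boundary as a score
```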
Analysis
Advantages
- leverage the principle of structural risk minimization => strong theoretical foundations
- balance between simplicity of model and effectiveness of the boundary in enclosing the distribution of the normal class
- hyper-parameter \(ν\) provides a built-in mechanism to tolerate outliers in the training data, which are common in real-world problems
Disadvantages
- choice of \(ν\) significantly impacts the properties of the learned decision boundary
- use of a Gaussian kernel requires a relatively large training size to effectively learn nonlinear decision boundaries
- high computational complexity
Information Theory
- basic idea: consider the number of bits needed to encode an instance
- if normal class shows some structural pattern, fewer bits are needed to encode it
- anomalous instances require more bits
- some approaches:
- Kolmogorov complexity
- compression using standard compression techniques => use the size of file as a measure of information content
- in general: \(\text{Gain}(x) = \text{Info}(D) - \text{Info}(D \setminus x)\), where \(\text{Info}(D)\) is the information content of dataset \(D\) and \(D \setminus x\) is the dataset with \(x\) removed
- the value is high for an anomaly
- \(\text{Gain}(x)\) works as an anomaly score
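A sketch of the compression-based variant: the compressed size (via zlib) of a serialized data set stands in for \(\text{Info}(D)\), and \(\text{Gain}(x)\) is the drop in compressed size when \(x\) is removed; the serialization and toy data are illustrative choices.

```python
import zlib

def info(dataset):
    """Proxy for information content: compressed size of the serialized data set."""
    serialized = "\n".join(",".join(map(str, row)) for row in dataset).encode()
    return len(zlib.compress(serialized))

# a structured (normal) data set plus one instance that breaks the pattern
D = [(i, 2 * i) for i in range(100)] + [(17, 9999)]

gains = []
for i, x in enumerate(D):
    D_without_x = D[:i] + D[i + 1:]
    gains.append(info(D) - info(D_without_x))   # Gain(x) = Info(D) - Info(D \ x)

# index of the instance with the highest gain (typically the one breaking the pattern)
print(max(range(len(D)), key=lambda i: gains[i]))
```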
Analysis
Advantages
- unsupervised methods: do not require a separate training set of normal-only instances
- no assumptions are made about the structure of the normal class
- generic enough to be applied with data sets of varying types and properties
Disadvantages
- high complexity makes them expensive to apply to large data sets
- performance depends heavily on the choice of the measure used for capturing the information content