Detection Methods
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 9.3-9.8
Statistical Methods
- use statistical approaches to model the normal class => the result is a probability score
- an instance is anomalous if its score exceeds a chosen constant threshold
- two subclasses: parametric and non-parametric models
- parametric: use well-known families of statistical distributions that require estimating parameters from the data
- non-parametric: more flexible; learn the distribution of the normal class directly from the available data
- When outliers are prevalent in training data, they may distort the notion of normal => consider alternative techniques able to account for this
Parametric models
- common types: Gaussian, Poisson, and binomial distributions
- parameters need to be learned from the available data
- when normal class follows chosen distribution, these methods are effective in detecting anomalies
Example: Gaussian distribution
- two parameters: mean $\mu$ and standard deviation $\sigma$, will be learned from the data
- probability density function, for some point $x$: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
in multivariate case, use multivariate Gaussian distribution
- in multivariate case, distance from center does not give viable anomaly score, because distribution is not symmetrical wrt. center if attributes are correlated
- instead use e.g. Mahalanobis distance because it considers the shape of the data (see the sketch after this list)
- if the normal class is not a single Gaussian distribution (multiple normal classes with different $\mu$ and $\sigma$), use Gaussian mixture models instead
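A minimal sketch (not from the book) of Mahalanobis-distance scoring under a multivariate Gaussian model of the normal class; the toy data and function names are illustrative:

```python
import numpy as np

def mahalanobis_scores(X_train, X_test):
    """Fit mean and covariance on normal data, score test points by Mahalanobis distance."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diff = X_test - mu
    # squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)  # higher score => more anomalous

rng = np.random.default_rng(0)
normal = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
test = np.array([[0.1, 0.2], [3.0, -3.0]])
print(mahalanobis_scores(normal, test))  # the second point scores much higher
```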
Non-parametric models
- use a kernel function to approximate the density of the normal class from the available data (see the sketch after this list)
- regions with a dense occurrence of normal instances have high probability
- anomaly score is the inverse of its probability wrt. learned density
- no assumption is made about data conforming to some known distribution
- example approach: histogram
- build a histogram of normal data, to represent different ranges of attribute values
- an instance is detected as anomalous if it does not fall within any bin of the histogram
- right choice of bin size for the histogram is important
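A minimal sketch (not from the book) of the kernel-based variant using scikit-learn's KernelDensity; the negative log-density under the learned model serves as the anomaly score:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 1))   # toy 1-D normal class

# learn the density of the normal class with a Gaussian kernel
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(normal)

test = np.array([[0.1], [4.0]])
log_density = kde.score_samples(test)   # log p(x) under the learned density
anomaly_score = -log_density            # low density => high anomaly score
print(anomaly_score)                    # the point at 4.0 scores much higher
```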
Analysis
Advantages
- statistical methods are effective and justifiable when sufficient knowledge about the data and its suitability for the selected technique is available
- produce confidence scores, which are more informative than binary labels
Disadvantages
- not guaranteed to work if the wrong statistical model is applied
- not many methods exist for multivariate and high-dimensional data
Proximity-based Methods
- also: distance-based outlier detection techniques
- identify anomalies as those instances that are most distant from the other objects
- base assumptions:
  - normal instances are related and appear close to each other
  - anomalies are different and relatively far from other instances
- model-free and local techniques => no explicit model to determine anomaly score
Distance and density
Distance-based:
- compute the distance of object $x$ to its $k$-th nearest neighbor, $d_k(x)$
- when $x$ is close to other instances, its score is low; an anomalous instance will be far from others and has a high score
- this method is sensitive to the choice of $k$
- a more robust approach is to use the average distance to the $k$ nearest neighbors (see the sketch below)
Density-based:
- density represents the number of instances within a fixed distance $d$
- anomalies are instances that exist in regions of low density
- this approach is sensitive to the right choice of $d$ => again, an average (e.g. over the $k$ nearest neighbors) can be used
Distance- and density-based approaches are related: both consider the locality of an instance when computing the anomaly score.
=> when the data contains regions of varying density, these two methods are not suitable
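A minimal sketch (not from the book) of the average k-nearest-neighbor distance as an anomaly score; the toy data and names are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_scores(X, k=5):
    """Average distance to the k nearest neighbors; higher => more anomalous."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # k + 1: each point's nearest neighbor is itself
    dist, _ = nn.kneighbors(X)
    return dist[:, 1:].mean(axis=1)                   # drop the self-distance, average the rest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),       # normal cluster
               [[6.0, 6.0]]])                         # one far-away point
scores = knn_distance_scores(X, k=5)
print(scores.argmax())                                # index 200, the far-away point
```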
Relative density
This approach is for data with varying densities: the notion of density is relative to the densities of neighboring instances. One example of relative density:

$$\text{relative density}(x, k) = \frac{\frac{1}{k}\sum_{y \in N(x, k)} \text{density}(y, k)}{\text{density}(x, k)}$$

where $N(x, k)$ is the set of $k$ nearest neighbors of $x$.
- relative density is high when the average density of the points in its neighborhood is higher than the density of the point itself
- the measure is more robust when using the average density of the neighborhood
- Also see Local Outlier Factor (LOF), which is a widely used relative density-based metric for anomaly detection
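A minimal sketch (not from the book) using scikit-learn's LocalOutlierFactor as a relative density-based detector on toy data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.3, size=(200, 2))     # dense normal region
sparse = rng.normal(5, 1.5, size=(100, 2))    # sparser but still normal region
X = np.vstack([dense, sparse, [[2.5, 2.5]]])  # one point between the regions

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_        # higher => more anomalous
print(scores.argmax())                        # index of the most anomalous point
```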
Analysis
Advantages
- more general than statistical approaches because it is easier to determine a distance between objects than their distribution
- non-parametric => not restricted to specific models or distributions
- widely applicable if a reasonable distance metric can be established
- can be visualized
Disadvantages
- sensitive to choice of distance metric
- difficult to define distance in high-dimensional spaces
- high complexity: $O(n^2)$ time for $n$ points
- choice of the right parameters ($k$, $d$, etc.) is challenging
Clustering Methods
- clusters represent the normal class
- base assumption: normal instances are close and can be clustered
- anomalies do not fit clusters or appear in small clusters far apart from normal clusters
- two categories: (1) small clusters are anomalous; (2) individual points are anomalous if they do not fit any cluster
- strategies for avoiding outliers strongly impacting the clustering:
- cluster once, remove outliers, then cluster again
- maintain set of potential outliers during clustering, and update this set as clusters change => instances remaining at the end are the outliers
Anomalous clusters
- assumes clustered anomalies exist in the data => appear in tight small groups
- these anomalous clusters are distant from clusters of normal class
- basic approach: cluster all data and flag those that are too small or too distant from other clusters
Anomalous Instances
- find singular instances not well-explained by clusters
- basic approach: first cluster all data, then assess degree of how well instances belong to clusters
- distance from the cluster prototype (as computed by an appropriate choice of metric) yields the anomaly score
- instances sufficiently far away from the centroids are flagged as anomalous
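A minimal sketch (not from the book) of the distance-to-nearest-centroid score with K-means; the toy data and names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(8, 1, size=(150, 2)),
               [[4.0, 20.0]]])                       # a point far from both clusters

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# distance of each instance to its assigned centroid is the anomaly score
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
scores = np.linalg.norm(X - assigned_centroids, axis=1)
print(scores.argmax())                               # index 300, the far-away point
```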
Analysis
Advantages
- can operate without labelled data, and learn from only normal instances
- some algorithms, e.g. K-means, have linear time and space complexity
Disadvantages
- clustering algorithms are sensitive to noise => anomalies can distort and degrade quality of formed clusters
- sensitive to choice of number of clusters => whether instance is an outlier may depend on this choice
- care is needed in choosing the right clustering algorithm for specific data
Reconstruction Methods
- There are patterns in the distribution of the normal class that can be captured using lower-dimensional representations, e.g. by using dimensionality reduction techniques
- If there is a hidden structure in the normal class, we can expect to obtain a good approximation using a smaller number of features
- use e.g. principal components analysis (PCA) to obtain the principal components representing the normal class
- Once we have derived a smaller set of $k$ features, we can project any new data instance $\mathbf{x}$ to its $k$-dimensional representation
- also possible to generate a reconstructed instance, denoted as $\hat{\mathbf{x}}$, from the $k$-dimensional representation
- Reconstruction error($\mathbf{x}$) $= \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2$
- the lower-dimensional features explain the normal data
- For normal instances, the reconstruction error is low
- For anomalous instances, the reconstruction error is high
- PCA works well when features form linear combinations of the original attributes => cannot represent nonlinear patterns in the normal class (see the sketch after this list)
- ANN autoencoders can be used for nonlinear patterns
- Autoencoders can similarly measure reconstruction error to provide anomaly score
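A minimal sketch (not from the book) of the PCA variant, using the reconstruction error as the anomaly score; the toy data is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# normal class: points lying roughly on a 1-D line embedded in 3-D
t = rng.normal(size=(300, 1))
X_train = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(300, 3))

pca = PCA(n_components=1).fit(X_train)                 # keep k = 1 lower-dimensional feature

def reconstruction_error(X):
    X_hat = pca.inverse_transform(pca.transform(X))    # project down, then reconstruct
    return np.sum((X - X_hat) ** 2, axis=1)            # ||x - x_hat||^2

test = np.array([[1.0, 2.0, -1.0],    # fits the linear pattern => low error
                 [1.0, -2.0, 1.0]])   # violates the pattern => high error
print(reconstruction_error(test))
```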
Analysis
Advantages
- no assumptions made about distribution of normal class
- can represent rich variety of patterns of normal class
- robust against irrelevant attributes
Disadvantages
- computing the reconstruction error is costly for high-dimensional data
One-class SVM
- encloses the normal class within a single decision boundary => outside the boundary means anomalous
- the focus is on modeling the boundary of the normal class
- uses training instances from the normal class to learn its decision boundary
- transform the data to a higher-dimensional space where the normal class can be separated using a linear hyperplane
- use an applicable kernel function for this transformation
- in the transformed space, training instances can be separated with a linear hyperplane
- ideally want a hyperplane where all normal instances are on one side of the plane
- One-class SVM differs from regular SVM in that the constraints are only defined for the normal class, not the anomaly class
- it is challenging to learn nonlinear boundaries in the one-class setting, when no information about anomalies is available
- "the origin trick" helps to maximize the boundary in the one-class scenario and learn the separating hyperplane (the origin works as a surrogate second class)
- hyperparameter $\nu$, the upper bound on the fraction of tolerated anomalies, impacts the learned boundary => a high value means the hyperplane is robust to a large number of outliers during training
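A minimal sketch (not from the book) of a one-class SVM with an RBF kernel in scikit-learn; nu is the tolerated fraction of training outliers:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))    # training data: normal class only

# nu is the upper bound on the fraction of training outliers tolerated
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

test = np.array([[0.2, -0.1], [5.0, 5.0]])
print(ocsvm.predict(test))             # +1 = inside the boundary, -1 = anomalous
print(ocsvm.decision_function(test))   # more negative => further outside the boundary
```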
Analysis
Advantages
- leverage the principle of structural risk minimization => strong theoretical foundations
- balance between simplicity of model and effectiveness of the boundary in enclosing the distribution of the normal class
- hyper-parameter $\nu$ provides a built-in mechanism to avoid outliers in the training data, which are common in real-world problems
Disadvantages
- choice of $\nu$ significantly impacts the properties of the learned decision boundary
- use of a Gaussian kernel requires a relatively large training set to effectively learn nonlinear decision boundaries
- high computational complexity
Information Theory
- basic idea: consider the number of bits needed to encode an instance
- if normal class shows some structural pattern, fewer bits are needed to encode it
- anomalous instances require more bits
- some approaches:
  - Kolmogorov complexity
  - compression using standard compression techniques => use the size of the compressed file as a measure of information content
- In general: $\text{Gain}(x) = \text{Info}(D) - \text{Info}(D \setminus \{x\})$, where $\text{Info}(D)$ is the information content of dataset $D$, measured before and after removing $x$
- the value is high for an anomaly => works as an anomaly score
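A minimal sketch (not from the book) of a compression-based information measure: the drop in compressed size after removing an instance acts as its anomaly score; the toy records are illustrative:

```python
import zlib

def info(records):
    """Information content of a dataset, approximated by its compressed size in bytes."""
    return len(zlib.compress("\n".join(records).encode()))

data = ["red,apple,round"] * 20 + ["blue,cube,metallic"]  # one odd record

total = info(data)
# Gain(x) = Info(D) - Info(D \ {x}); a large drop means x is costly to encode => anomalous
scores = [total - info(data[:i] + data[i + 1:]) for i in range(len(data))]
print(scores.index(max(scores)))   # index 20, the odd record
```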
Analysis
Advantages
- unsupervised methods: do not require a separate training set of normal-only instances
- no assumptions are made about the structure of the normal class
- generic enough to be applied with data sets of varying types and properties
Disadvantages
- high complexity makes them expensive to apply to large data sets
- performance depends heavily on the choice of the measure used for capturing the information content