Basics
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 9
Definition. An anomaly is an observation that doesn't fit the distribution of the data for normal instances, i.e., is unlikely under the distribution of the majority of instances.
- other names: deviation detection, exception mining
- some applications: fraud detection, intrusion detection, aviation safety, medicine, ecosystem monitoring, etc.
  - also useful as a pre-processing step when outliers are not wanted in the data
- many techniques exist => the type of input data impacts the choice of technique
- number of attributes: in the multivariate case, no single attribute may be anomalous on its own, but their combination may be
- data representation: anomaly detection is concerned with differences between instances, which can be captured in a proximity matrix of pairwise distances between instances
- presence of labels: if labels are available (rarely), supervised techniques can be used; more commonly, anomaly detection relies on unsupervised methods
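
The point about multivariate data can be made concrete with a small sketch. It assumes a hypothetical toy data set of two strongly correlated attributes and compares per-attribute z-scores with the Mahalanobis distance, which is one common way (not necessarily the one the book uses at this point) to account for the relationship between attributes. All names and values below are illustrative.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy data: the two attributes are strongly correlated (y is close to x).
normal = [(-2.0, -1.9), (-1.5, -1.6), (-1.0, -0.9), (-0.5, -0.6), (0.0, 0.1),
          (0.5, 0.4), (1.0, 1.1), (1.5, 1.4), (2.0, 2.1)]


def z_scores(point, data):
    """Per-attribute z-scores of `point` relative to `data` (population std)."""
    zs = []
    for dim in range(len(point)):
        column = [row[dim] for row in data]
        zs.append((point[dim] - mean(column)) / pstdev(column))
    return zs


def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D `point` from the mean of `data`."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = mean(xs), mean(ys)
    n = len(data)
    # Entries of the 2x2 population covariance matrix.
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form with the inverse covariance (1/det) * [[syy, -sxy], [-sxy, sxx]].
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# Each coordinate of the candidate is within the usual range of its attribute,
# but the combination (high x together with low y) breaks the correlation.
candidate = (1.0, -1.0)
print("per-attribute z-scores:", z_scores(candidate, normal))
print("Mahalanobis distance (candidate):", mahalanobis_2d(candidate, normal))
print("Mahalanobis distance (typical point):", mahalanobis_2d((1.0, 1.1), normal))
```

In this toy setting the candidate looks unremarkable attribute by attribute, while its Mahalanobis distance is far larger than that of a typical point.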
Characteristics of detection methods
Category | Type | Description |
---|---|---|
Model | Model-based | represents normal objects in a model and detects instances that do not fit it; sometimes both normal and anomalous objects are modelled |
Model | Model-free | identifies anomalous instances without learning a model from the input data |
Perspective | Global | anomalies are detected with respect to the global context, e.g. a model |
Perspective | Local | the result for an object does not change if instances outside its local neighborhood are removed or changed |
Output | Label | the result is a binary label, either anomalous or normal |
Output | Score | a numeric value expresses how likely an object is to be anomalous |
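
The two output types are related: a score can be turned into a label by thresholding. The sketch below assumes higher scores mean "more anomalous"; the function name, the scores, and the threshold are hypothetical, and in practice the threshold is application-dependent.

```python
def scores_to_labels(scores, threshold):
    """Turn numeric anomaly scores into binary labels.

    Assumes higher scores mean "more anomalous"; the threshold is a
    hypothetical, application-specific choice.
    """
    return ["anomalous" if s > threshold else "normal" for s in scores]


# Illustrative scores only: most are low, one stands out.
scores = [0.1, 0.3, 0.2, 4.7, 0.25]
labels = scores_to_labels(scores, threshold=1.0)
print(list(zip(scores, labels)))
```

Keeping the score instead of the label preserves the ranking of instances, so the threshold can be adjusted later without rerunning the detector.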
Detection methods
Approach | Category | Description |
---|---|---|
Clustering methods | Model-based, score | Normal class represented as cluster |
Info-theory methods | Model-free, score | Anomalous instances require more bits for their representation |
One-class SVM | Model-based, global, label | Encloses normal class within a single decision boundary |
Proximity-based methods | Model-free, local, score | Anomalous instances are far from most other instances under a distance measure |
Reconstruction-based methods | Model-based, global, score | Normal class resides in a space of lower dimensionality than original space of attributes |
Statistical methods | Model-based, global, score | Normal class is modelled with a probability distribution; instances unlikely under it are anomalous |
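
As an illustration of the proximity-based row, here is a minimal sketch that builds a proximity matrix of pairwise Euclidean distances and scores each instance by the distance to its k-th nearest neighbor. This is one simple proximity-based score chosen for illustration; the data, the function names, and the choice of `k` are assumptions, not taken from the book.

```python
import math


def distance_matrix(points):
    """Proximity matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]


def knn_scores(dist, k):
    """Anomaly score of each instance: distance to its k-th nearest neighbor."""
    scores = []
    for i, row in enumerate(dist):
        # Exclude the instance's zero distance to itself, then sort ascending.
        others = sorted(d for j, d in enumerate(row) if j != i)
        scores.append(others[k - 1])
    return scores


# Hypothetical toy data: four points in a tight group and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = knn_scores(distance_matrix(points), k=2)
for point, score in zip(points, scores):
    print(point, round(score, 3))
```

No model is fitted here: the scores come directly from the proximity matrix, which is why such methods are classed as model-free.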
Evaluation
- supervised learning evaluation methods can be used when labelled data is available
  - since anomalous instances are rare, precision, recall, and false positive rate are more appropriate measures than accuracy
  - the false positive rate is especially useful, since too many false alarms render an anomaly detection system useless
- for model-based approaches, the effectiveness of outlier detection can be judged by the improvement in the model's goodness of fit once anomalies are eliminated
- for information-theoretic approaches, the information gain gives a measure of effectiveness
- for reconstruction-based approaches, the reconstruction error provides a measure that can be used for evaluation
- in general: evaluate the distribution of anomaly scores => the majority should be low, since anomalies are rare
  - visualize the distribution of the scores to assess whether the approach generates scores that behave reasonably
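
To see why accuracy is a poor measure when anomalies are rare, the sketch below computes precision, recall, false positive rate, and accuracy for two hypothetical detectors on made-up labels (10 anomalies among 1000 instances). The function name and the convention of returning 0.0 for an undefined ratio are assumptions made for this example.

```python
def evaluate(y_true, y_pred):
    """Precision, recall, false positive rate and accuracy for binary labels.

    1 means anomalous, 0 means normal. An undefined ratio (zero denominator)
    is reported as 0.0, which is a convention chosen for this sketch.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, len(y_true)),
    }


# Hypothetical ground truth: 10 anomalies followed by 990 normal instances.
y_true = [1] * 10 + [0] * 990

# Detector A never raises an alarm.
never_flag = [0] * 1000
# Detector B finds 8 of the 10 anomalies and raises 20 false alarms.
detector_b = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

result_a = evaluate(y_true, never_flag)
result_b = evaluate(y_true, detector_b)
print("A:", result_a)
print("B:", result_b)
```

Detector A scores higher on accuracy even though it detects nothing, while recall, precision, and false positive rate show what detector B actually achieves and what it costs in false alarms.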