Basics

Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 9

Definition. An anomaly is an observation that doesn't fit the distribution of the data for normal instances, i.e., is unlikely under the distribution of the majority of instances.

  • other names for anomaly detection: deviation detection, exception mining

  • some applications: fraud detection, intrusion detection, aviation safety, medicine, ecosystem disturbances

    • also useful as a pre-processing step, when outliers are not wanted in data
  • many techniques exist; the nature of the input data influences which one is appropriate

    • number of attributes: in the multivariate case, no single attribute value may be anomalous on its own, yet the combination of values may be

    • data representation: anomaly detection relies on how different instances are from one another, and these differences can be captured in a proximity matrix that holds the pairwise difference between every two instances

    • presence of labels: if labels are available (rarely), supervised techniques can be used. More commonly anomaly detection uses unsupervised methods.
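
The multivariate point above can be illustrated with a small sketch. It assumes two strongly correlated synthetic attributes and a hypothetical query point, and it uses per-attribute z-scores and the Mahalanobis distance as illustrative measures. The data, the point, and the choice of measures are assumptions made for this example, not something taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two strongly correlated attributes.
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)
data = np.column_stack([x, y])

# Hypothetical query point: each coordinate is ordinary, the combination is not.
point = np.array([1.5, -1.5])

mean = data.mean(axis=0)
std = data.std(axis=0)

# Per-attribute view: z-score of each coordinate taken on its own.
z_scores = np.abs((point - mean) / std)

# Joint view: Mahalanobis distance, which accounts for the correlation.
cov = np.cov(data, rowvar=False)
diff = point - mean
mahalanobis = float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

print("per-attribute |z|:", z_scores)
print("Mahalanobis distance:", mahalanobis)
```

Each coordinate lies well within the usual range of its attribute, while the joint distance is large because the point contradicts the correlation between the two attributes.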

Characteristics of detection methods

| Category | Type | Description |
|---|---|---|
| Model | Model-based | represents normal objects in a model and detects instances that do not fit it; sometimes both normal and anomalous instances are modelled |
| Model | Model-free | identifies anomalous instances without learning a model from the input data |
| Perspective | Global | anomalies are detected with respect to a global context, e.g. a model |
| Perspective | Local | the detection result for an object does not change if instances outside its local neighborhood are removed or changed |
| Output | Label | the detection result is a binary label, either anomalous or normal |
| Output | Score | a numeric value expresses how likely an object is to be anomalous |
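
The two output types are related: a score output can be turned into a label output by applying a cut-off. The sketch below assumes made-up scores where higher means more anomalous, and a threshold of 0.5 chosen only for illustration.

```python
import numpy as np

# Hypothetical anomaly scores for six instances (higher = more anomalous).
scores = np.array([0.10, 0.20, 0.15, 0.90, 0.05, 0.30])

# Assumed cut-off; in practice it would be chosen per application.
threshold = 0.5

# Score output -> label output.
labels = np.where(scores > threshold, "anomalous", "normal")
print(labels.tolist())
```

Keeping the scores has the advantage that instances can still be ranked, and the threshold can be moved later without rerunning the detector.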

Detection methods

| Approach | Category | Description |
|---|---|---|
| Clustering methods | Model-based, score | Normal class is represented as clusters |
| Info-theory methods | Model-free, score | Anomalous instances require more bits for their representation |
| One-class SVM | Model-based, global, label | Encloses the normal class within a single decision boundary |
| Proximity-based methods | Model-free, local, score | A distance metric is used to detect anomalous instances |
| Reconstruction-based methods | Model-based, global, score | Normal class resides in a space of lower dimensionality than the original attribute space |
| Statistical methods | Model-based, global, score | Normal class is described by a statistical model |
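
As one possible instance of a proximity-based method, the sketch below scores each instance by the distance to its k-th nearest neighbour, computed from a pairwise Euclidean proximity matrix. The function name `knn_distance_scores`, the choice of k, and the toy points are assumptions made for this example; other proximity-based scores exist.

```python
import numpy as np

def knn_distance_scores(points: np.ndarray, k: int = 2) -> np.ndarray:
    """Score each instance by its distance to its k-th nearest neighbour."""
    # Pairwise Euclidean distances: the proximity matrix.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sort each row; column 0 is the zero distance of an instance to itself,
    # so column k holds the distance to the k-th nearest neighbour.
    return np.sort(dist, axis=1)[:, k]

# Toy data (assumed): four points close together and one far away.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
scores = knn_distance_scores(points, k=2)
print(scores)
```

No model is learned from the data and each score depends only on the nearest neighbours of the instance, which matches the model-free, local, score categorisation in the table.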

Evaluation

  • supervised learning evaluation methods can be used when labelled data is available

    • since anomalous instances are rare, precision, recall, and false positive rate are more appropriate measures than accuracy
    • in particular, the false positive rate is a useful measure, since too many false alarms render an anomaly detection system useless
  • for model-based approaches, the effectiveness of outlier detection can be judged by how much the goodness of fit of the model improves once the anomalies are eliminated

  • for information-theoretic approaches, the information gain gives a measure of effectiveness

  • for reconstruction-based approaches, the reconstruction error provides a measure that can be used for evaluation

  • in general, examine the distribution of anomaly scores: the majority should be low, since anomalies are rare

    • visualize the distribution of the scores to assess whether the approach generates scores behaving in a reasonable manner
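
The point about accuracy in the list above can be made concrete with a small sketch. It assumes a made-up labelled set of 1000 instances with 10 anomalies (1 = anomalous, 0 = normal) and two hypothetical detectors: one that never raises an alarm, and one that finds 8 anomalies at the cost of 20 false alarms. Returning 0.0 when a ratio is undefined is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = anomalous and 0 = normal."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def safe_div(a, b):
    # Assumed convention: report 0.0 when the ratio is undefined.
    return a / b if b else 0.0

def evaluate(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "false_positive_rate": safe_div(fp, fp + tn),
    }

# Made-up ground truth: 10 anomalies followed by 990 normal instances.
y_true = [1] * 10 + [0] * 990

# Hypothetical detector A: labels everything as normal.
pred_all_normal = [0] * 1000
# Hypothetical detector B: finds 8 of 10 anomalies, raises 20 false alarms.
pred_detector = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

m_all_normal = evaluate(y_true, pred_all_normal)
m_detector = evaluate(y_true, pred_detector)
print(m_all_normal)
print(m_detector)
```

Detector A reaches a higher accuracy than detector B even though it detects nothing, which is why precision, recall, and false positive rate are the more informative measures here.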