Anomaly Detection

What

Finding data points that differ significantly from the majority. Typical applications: fraud, defects, intrusions, unusual behavior.

Approaches

Statistical

  • Points beyond 3 standard deviations
  • IQR method (used in Data Cleaning)
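Both rules fit in a few lines; the data below is synthetic, with two injected outliers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 1000), [120, -30])  # two injected outliers

# z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```

Note that outliers inflate the mean and standard deviation they are measured against, which is why the IQR rule (based on quartiles) is more robust.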

Isolation Forest

Randomly partition the data with random axis-aligned splits. Anomalies are isolated in fewer splits than normal points, so a short average path length marks an anomaly.

from sklearn.ensemble import IsolationForest
 
model = IsolationForest(contamination=0.05)  # expect 5% anomalies
model.fit(X)
predictions = model.predict(X)  # 1 = normal, -1 = anomaly

One-Class SVM

Learn a tight boundary around the normal data in feature space; anything outside the boundary is an anomaly. The kernel trick (e.g. RBF) allows nonlinear boundaries. Scale features first, since SVMs are sensitive to feature ranges.

from sklearn.svm import OneClassSVM
 
model = OneClassSVM(kernel="rbf", nu=0.05)  # nu ≈ expected anomaly fraction
model.fit(X_train)
predictions = model.predict(X_test)  # 1 = normal, -1 = anomaly

Local Outlier Factor (LOF)

Compares the local density of a point to the density of its neighbors. A point in a sparse region surrounded by dense regions is an outlier.

from sklearn.neighbors import LocalOutlierFactor

# with the default novelty=False, LOF only supports fit_predict on the
# training data; set novelty=True to score new points with predict()
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
predictions = lof.fit_predict(X)  # 1 = normal, -1 = anomaly

DBSCAN for anomaly detection

Density-based clustering. Points that don’t belong to any cluster (label = -1) are treated as anomalies. Useful when anomalies form no structure but normal data clusters naturally.
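A minimal sketch on synthetic 2-D data (the cluster and outlier coordinates are made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
normal = rng.normal(0, 0.3, size=(200, 2))       # one dense cluster
outliers = np.array([[4.0, 4.0], [-5.0, 3.0]])   # far from the cluster
X = np.vstack([normal, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
anomalies = labels == -1  # DBSCAN marks noise points with label -1
```

Unlike the methods above, DBSCAN has no contamination parameter; the anomaly count falls out of `eps` and `min_samples` instead.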

Autoencoders (deep learning)

Train a neural net to compress and reconstruct normal data. Anomalies have high reconstruction error.
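A minimal sketch using scikit-learn's MLPRegressor as a stand-in autoencoder (real autoencoders are usually built in PyTorch or Keras); the data and threshold choice are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 8))  # "normal" data only
X_test = np.vstack([
    rng.normal(0, 1, size=(50, 8)),   # normal points
    rng.normal(10, 1, size=(5, 8)),   # anomalies, far from training data
])

# bottleneck layer (4 units) forces a compressed representation of 8 features
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
ae.fit(X_train, X_train)  # target = input: learn to reconstruct

# per-point reconstruction error (MSE); anomalies reconstruct poorly
errors = ((X_test - ae.predict(X_test)) ** 2).mean(axis=1)
train_errors = ((X_train - ae.predict(X_train)) ** 2).mean(axis=1)
threshold = np.percentile(train_errors, 99)  # illustrative cutoff
flagged = errors > threshold
```

The threshold is set from the training-set error distribution, since in practice you have no labeled test anomalies to tune against.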

Choosing an approach

Situation                          | Method                     | Why
Small data, few features           | Statistical (z-score, IQR) | Simple, interpretable
Tabular data, moderate size        | Isolation Forest           | Fast, handles high dimensions
Dense clusters of normal data      | LOF or DBSCAN              | Catches local anomalies
High-dimensional, complex patterns | Autoencoder                | Learns nonlinear structure
Need a decision boundary           | One-Class SVM              | Good with kernel for nonlinear shapes

Evaluation challenges

Anomaly detection is usually unsupervised — you rarely have labeled anomalies. This makes evaluation tricky:

  • No labels: can’t compute precision/recall directly. Domain experts review flagged points
  • Class imbalance: even with labels, accuracy is misleading (99% normal → 99% accuracy by predicting all normal)
  • Use precision@k: rank points by anomaly score, check how many of the top-k are real anomalies
  • Contamination parameter: most methods need you to guess the anomaly fraction. Get it wrong and you miss anomalies or flag too many normals
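Precision@k is easy to sketch given anomaly scores and a small hand-labeled sample (the scores and labels below are made up):

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring points that are labeled anomalies."""
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return labels[top_k].mean()

scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7])  # higher = more anomalous
labels = np.array([1, 0, 1, 0, 0])            # 1 = expert-confirmed anomaly
precision_at_k(scores, labels, k=3)  # top 3 contain 2 true anomalies -> 2/3
```

One caveat when ranking: scikit-learn's `score_samples` returns lower values for more anomalous points, so negate it before treating it as an anomaly score.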