Anomaly Detection
What
Finding data points that are significantly different from the majority. Fraud, defects, intrusions, unusual behavior.
Approaches
Statistical
- Z-score: flag points more than 3 standard deviations from the mean
- IQR method (used in Data Cleaning)
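Both rules are a few lines of NumPy. A minimal sketch (the 3-sigma and 1.5×IQR thresholds are the conventional defaults; the data here is synthetic for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 1000), [8.0, -9.0]])  # two injected outliers

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
```

Both rules catch the injected points; the IQR rule is more robust because quartiles, unlike the mean and standard deviation, are barely affected by the outliers themselves.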
Isolation Forest
Randomly partition data. Anomalies need fewer splits to isolate.
```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05)  # expect 5% anomalies
model.fit(X)
predictions = model.predict(X)  # 1 = normal, -1 = anomaly
```
One-Class SVM
Learn a boundary around normal data in high-dimensional space. Anything outside the boundary is an anomaly. Works well with kernel trick for nonlinear boundaries.
```python
from sklearn.svm import OneClassSVM

model = OneClassSVM(kernel="rbf", nu=0.05)  # nu ≈ expected anomaly fraction
model.fit(X_train)
predictions = model.predict(X_test)  # 1 = normal, -1 = anomaly
```
Local Outlier Factor (LOF)
Compares the local density of a point to the density of its neighbors. A point in a sparse region surrounded by dense regions is an outlier.
```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
predictions = lof.fit_predict(X)  # 1 = normal, -1 = anomaly
```
DBSCAN for anomaly detection
Density-based clustering. Points that don’t belong to any cluster (label = -1) are treated as anomalies. Useful when anomalies form no structure but normal data clusters naturally.
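A minimal sketch with scikit-learn's DBSCAN (the `eps` and `min_samples` values and the synthetic data are illustrative, not tuned recommendations):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, size=(100, 2)),  # dense normal cluster
    rng.normal(5, 0.3, size=(100, 2)),  # second normal cluster
    [[2.5, 2.5]],                       # isolated point far from both
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
anomalies = X[labels == -1]  # noise points (label -1) are the anomalies
```

Note that DBSCAN has no contamination parameter: the anomaly fraction falls out of `eps` and `min_samples` rather than being specified up front.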
Autoencoders (deep learning)
Train a neural net to compress and reconstruct normal data. Anomalies have high reconstruction error.
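As a toy sketch of the idea, scikit-learn's `MLPRegressor` can stand in for a real autoencoder by using the input as its own target through a bottleneck hidden layer (a real setup would use PyTorch or Keras; the layer size, threshold, and synthetic data are all illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 8))   # "normal" training data
X_anom = rng.normal(0, 1, size=(5, 8)) + 6   # shifted points the net never saw

# Bottleneck of 3 units forces compression; target = input
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
ae.fit(X_normal, X_normal)

def reconstruction_error(X):
    return np.mean((ae.predict(X) - X) ** 2, axis=1)

# Flag anything reconstructed worse than the 95th percentile of normal data
threshold = np.percentile(reconstruction_error(X_normal), 95)
flags = reconstruction_error(X_anom) > threshold
```

The network only learns to reconstruct the region it was trained on, so the shifted points come back with much larger error and get flagged.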
Choosing an approach
| Situation | Method | Why |
|---|---|---|
| Small data, few features | Statistical (z-score, IQR) | Simple, interpretable |
| Tabular data, moderate size | Isolation Forest | Fast, handles high dimensions |
| Dense clusters of normal data | LOF or DBSCAN | Catches local anomalies |
| High-dimensional, complex patterns | Autoencoder | Learns nonlinear structure |
| Need a decision boundary | One-Class SVM | Good with kernel for nonlinear shapes |
Evaluation challenges
Anomaly detection is usually unsupervised — you rarely have labeled anomalies. This makes evaluation tricky:
- No labels: can’t compute precision/recall directly. Domain experts review flagged points
- Class imbalance: even with labels, accuracy is misleading (99% normal → 99% accuracy by predicting all normal)
- Use precision@k: rank points by anomaly score, check how many of the top-k are real anomalies
- Contamination parameter: most methods need you to guess the anomaly fraction. Get it wrong and you miss anomalies or flag too many normals
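Precision@k is straightforward to compute once an expert has labeled the flagged points; the scores and labels below are made up for illustration:

```python
import numpy as np

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7, 0.2])  # higher = more anomalous
labels = np.array([1,   0,   0,   0,   1,   0])    # expert labels: 1 = true anomaly

def precision_at_k(scores, labels, k):
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return labels[top_k].mean()

p3 = precision_at_k(scores, labels, 3)  # top-3 are indices 0, 2, 4 -> 2 of 3 are real
```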
Links
- Supervised vs Unsupervised Learning
- K-Means Clustering
- Support Vector Machines — One-Class SVM uses the same kernel trick
- Deep Learning Roadmap