Anomaly Detection

What

Finding data points that differ significantly from the majority. Typical applications: fraud, defects, intrusions, unusual behavior.

Approaches

Statistical

  • Points beyond 3 standard deviations
  • IQR method (used in Data Cleaning)
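Both rules fit in a few lines; the data below is synthetic, with two injected outliers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 1000), [120, -30])  # two injected outliers

# z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```

Note that outliers inflate the mean and standard deviation they are measured against, which is why the IQR rule (based on quartiles) is more robust.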

Isolation Forest

Randomly partition the data with random axis-aligned splits. Anomalies are isolated in fewer splits than normal points, so a short average path length marks an anomaly.

from sklearn.ensemble import IsolationForest
 
model = IsolationForest(contamination=0.05)  # expect 5% anomalies
model.fit(X)
predictions = model.predict(X)  # 1 = normal, -1 = anomaly

One-Class SVM

Learn a tight boundary around the normal data in feature space; anything outside the boundary is an anomaly. The kernel trick (e.g. RBF) allows nonlinear boundaries. Scale features first, since SVMs are sensitive to feature ranges.

from sklearn.svm import OneClassSVM
 
model = OneClassSVM(kernel="rbf", nu=0.05)  # nu ≈ expected anomaly fraction
model.fit(X_train)
predictions = model.predict(X_test)  # 1 = normal, -1 = anomaly

Local Outlier Factor (LOF)

Compares the local density of a point to the density of its neighbors. A point in a sparse region surrounded by dense regions is an outlier.

from sklearn.neighbors import LocalOutlierFactor

# with the default novelty=False, LOF only supports fit_predict on the
# training data; set novelty=True to score new points with predict()
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
predictions = lof.fit_predict(X)  # 1 = normal, -1 = anomaly

DBSCAN for anomaly detection

Density-based clustering. Points that don’t belong to any cluster (label = -1) are treated as anomalies. Useful when anomalies form no structure but normal data clusters naturally.
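A minimal sketch on synthetic 2-D data (the cluster and outlier coordinates are made up for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
normal = rng.normal(0, 0.3, size=(200, 2))       # one dense cluster
outliers = np.array([[4.0, 4.0], [-5.0, 3.0]])   # far from the cluster
X = np.vstack([normal, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
anomalies = labels == -1  # DBSCAN marks noise points with label -1
```

Unlike the methods above, DBSCAN has no contamination parameter; the anomaly count falls out of `eps` and `min_samples` instead.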

Autoencoders (deep learning)

Train a neural net to compress and reconstruct normal data. Anomalies have high reconstruction error.
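A minimal sketch using scikit-learn's MLPRegressor as a stand-in autoencoder (real autoencoders are usually built in PyTorch or Keras); the data and threshold choice are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 8))  # "normal" data only
X_test = np.vstack([
    rng.normal(0, 1, size=(50, 8)),   # normal points
    rng.normal(10, 1, size=(5, 8)),   # anomalies, far from training data
])

# bottleneck layer (4 units) forces a compressed representation of 8 features
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
ae.fit(X_train, X_train)  # target = input: learn to reconstruct

# per-point reconstruction error (MSE); anomalies reconstruct poorly
errors = ((X_test - ae.predict(X_test)) ** 2).mean(axis=1)
train_errors = ((X_train - ae.predict(X_train)) ** 2).mean(axis=1)
threshold = np.percentile(train_errors, 99)  # illustrative cutoff
flagged = errors > threshold
```

The threshold is set from the training-set error distribution, since in practice you have no labeled test anomalies to tune against.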

Choosing an approach

Situation                          | Method                     | Why
Small data, few features           | Statistical (z-score, IQR) | Simple, interpretable
Tabular data, moderate size        | Isolation Forest           | Fast, handles high dimensions
Dense clusters of normal data      | LOF or DBSCAN              | Catches local anomalies
High-dimensional, complex patterns | Autoencoder                | Learns nonlinear structure
Need a decision boundary           | One-Class SVM              | Good with kernel for nonlinear shapes

Evaluation challenges

Anomaly detection is usually unsupervised — you rarely have labeled anomalies. This makes evaluation tricky:

  • No labels: can’t compute precision/recall directly. Domain experts review flagged points
  • Class imbalance: even with labels, accuracy is misleading (99% normal → 99% accuracy by predicting all normal)
  • Use precision@k: rank points by anomaly score, check how many of the top-k are real anomalies
  • Contamination parameter: most methods need you to guess the anomaly fraction. Get it wrong and you miss anomalies or flag too many normals
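Precision@k is easy to sketch given anomaly scores and a small hand-labeled sample (the scores and labels below are made up):

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring points that are labeled anomalies."""
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return labels[top_k].mean()

scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7])  # higher = more anomalous
labels = np.array([1, 0, 1, 0, 0])            # 1 = expert-confirmed anomaly
precision_at_k(scores, labels, k=3)  # top 3 contain 2 true anomalies -> 2/3
```

One caveat when ranking: scikit-learn's `score_samples` returns lower values for more anomalous points, so negate it before treating it as an anomaly score.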