Scikit-Learn Overview

What

The standard Python library for classical (non-deep-learning) ML. It exposes one consistent API for preprocessing, models, and evaluation.

The API pattern

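Every supervised model follows the same interface (transformers and clusterers use the same fit step, but expose transform or fit_predict instead of predict and score):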

from sklearn.some_module import SomeModel
 
model = SomeModel(hyperparameters)
model.fit(X_train, y_train)         # train
predictions = model.predict(X_test)  # predict
score = model.score(X_test, y_test)  # evaluate (mean accuracy for classifiers, R^2 for regressors)
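
As a concrete instance of the pattern, the sketch below runs two different classifiers through the same fit / predict / score calls. The Iris dataset, the two models chosen, and the random_state and max_iter values are assumptions picked for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is a stand-in dataset (assumption); any numeric X, y would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Two unrelated model families, driven by identical calls.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # train
    predictions = model.predict(X_test)          # predict
    scores[name] = model.score(X_test, y_test)   # mean accuracy
    print(f"{name}: {scores[name]:.3f}")
```

Because only the constructor differs between the two models, swapping in another estimator usually means changing a single line.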

Key modules

| Module | What | Examples |
|---|---|---|
| preprocessing | Scale, encode, transform features | StandardScaler, LabelEncoder, OneHotEncoder |
| model_selection | Split data, tune hyperparameters | train_test_split, cross_val_score, GridSearchCV |
| linear_model | Linear/logistic regression | LinearRegression, LogisticRegression, Ridge, Lasso |
| tree | Decision trees | DecisionTreeClassifier, DecisionTreeRegressor |
| ensemble | Combine models | RandomForestClassifier, GradientBoostingClassifier |
| svm | Support vector machines | SVC, SVR |
| cluster | Unsupervised clustering | KMeans, DBSCAN |
| decomposition | Dimensionality reduction | PCA, NMF |
| metrics | Evaluate performance | accuracy_score, f1_score, confusion_matrix |
| pipeline | Chain preprocessing + model | Pipeline, make_pipeline |
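
Not every module in the table is used through predict and score. As a rough sketch, transformers such as StandardScaler and PCA expose fit_transform, and clusterers such as KMeans expose fit_predict and need no labels. The Iris data and the parameter values (n_components=2, n_clusters=3, n_init=10) are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Labels are discarded: everything below is unsupervised.
X, _ = load_iris(return_X_y=True)

# preprocessing: standardize each feature to zero mean, unit variance
X_scaled = StandardScaler().fit_transform(X)

# decomposition: project the 4 features onto 2 principal components
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# cluster: assign each sample to one of 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print(X_2d.shape, labels.shape)
```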

Typical workflow

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# load data (Iris is only a stand-in; use your own X, y)
X, y = load_iris(return_X_y=True)

# split (fixed seed for reproducibility, stratified to keep class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# preprocess (tree ensembles do not need scaling; shown for the pattern)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit + transform on train
X_test = scaler.transform(X_test)          # only transform on test!

# train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
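
The "fit on train, only transform on test" rule in the workflow above can also be enforced automatically by wrapping the scaler and the model in a pipeline. The sketch below is one possible way to do that, combined with cross-validation and a small grid search; the Iris dataset, cv=5, and the candidate n_estimators values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption); replace with your own X, y.
X, y = load_iris(return_X_y=True)

# The pipeline refits the scaler on the training part of every fold,
# so no statistics from held-out data leak into preprocessing.
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# 5-fold cross-validated accuracy of the whole pipeline
cv_scores = cross_val_score(pipe, X, y, cv=5)
print("mean CV accuracy:", cv_scores.mean())

# make_pipeline names each step after its lowercased class name,
# so step parameters are addressed as "<step>__<parameter>".
param_grid = {"randomforestclassifier__n_estimators": [50, 100]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Treating the pipeline as a single estimator means the same fit / predict / score calls apply to it as to any individual model.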