Model Serving

What

Deploying a trained model as a service that accepts inputs and returns predictions in real time. This is the bridge between a notebook experiment and a product feature. Model serving covers the protocols (REST, gRPC), inference patterns (online, batch, streaming), infrastructure (model registries, serving frameworks), and deployment strategies (canary, A/B) needed to run ML in production.

Why It Matters

A model that only runs in a notebook has zero business value. Serving is where ML meets reality:

  • Product teams need predictions at low latency (< 100ms for user-facing features)
  • Batch pipelines need offline scoring over millions of rows
  • Bad deployments can silently degrade — you need rollback, monitoring, and traffic splitting
  • The serving layer is often the bottleneck, not the model itself

How It Works

Inference Patterns

| Pattern | When to use | Latency | Example |
| --- | --- | --- | --- |
| Online (real-time) | User-facing features | < 100ms | Fraud detection at payment time |
| Batch | Periodic scoring, recommendations | Minutes–hours | Nightly churn predictions |
| Streaming | Event-driven, near-real-time | Seconds | Anomaly detection on sensor data |
| Embedded | Edge/mobile, no network | < 10ms | On-device autocorrect |
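The batch pattern above can be sketched in a few lines: instead of one call per row, score rows in fixed-size chunks to bound memory. The `model` function here is a placeholder for any loaded model's predict method, not a real model.

```python
def model(batch):
    # Placeholder scoring logic: a real model would run a forward pass here.
    return [sum(row) for row in batch]

def batch_score(rows, batch_size=1024):
    """Score all rows offline, chunk by chunk, to bound memory use."""
    scores = []
    for i in range(0, len(rows), batch_size):
        chunk = rows[i : i + batch_size]
        scores.extend(model(chunk))
    return scores

print(batch_score([[1, 2], [3, 4], [5, 6]], batch_size=2))  # [3, 7, 11]
```

The same loop structure applies whether the chunks go to a local GPU or a remote scoring endpoint; only the `model` call changes.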

REST vs gRPC

REST (HTTP + JSON): universal, easy to debug with curl, good for prototyping. Overhead from JSON serialization makes it slower for large tensors.

gRPC (HTTP/2 + Protobuf): binary serialization, 2-10x faster than REST for tensor data, supports streaming. Preferred for model-to-model communication and high-throughput serving.

REST:  Client --HTTP/JSON--> Server       # simple, debuggable
gRPC:  Client --HTTP2/Protobuf--> Server  # fast, typed, streaming

Rule of thumb: REST for external APIs, gRPC for internal model services.
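The serialization overhead is easy to see with plain Python: compare a JSON-encoded float vector against the same vector packed as raw binary. The `struct` packing is a stand-in for Protobuf's wire format, not Protobuf itself, but it illustrates why binary protocols win for large tensors.

```python
import json
import struct

# A 1,000-float feature vector, as a client might send to a model server.
features = [float(i) * 0.5 for i in range(1000)]

# JSON: every float becomes a decimal string plus delimiters.
json_bytes = json.dumps({"features": features}).encode("utf-8")

# Binary: each float32 is exactly 4 bytes, no parsing needed server-side.
binary_bytes = struct.pack(f"<{len(features)}f", *features)

print(len(json_bytes), len(binary_bytes))
# The binary payload is a flat 4,000 bytes; the JSON payload is larger
# and costs string parsing on every request.
```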

Model Registries

A model registry tracks trained models with metadata, versioning, and lineage. It answers: “which model version is in production, who trained it, on what data?”

Key registries: MLflow Model Registry, Weights & Biases Registry, SageMaker Model Registry, Vertex AI Model Registry.

A registered model has: name, version, stage (staging/production/archived), metrics, training run link, artifact URI.
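A minimal sketch of what one registry entry carries, using a plain dataclass. The field names mirror the list above; the identifier values are hypothetical, and any real registry (MLflow, SageMaker, etc.) defines its own schema.

```python
from dataclasses import dataclass

@dataclass
class RegisteredModel:
    """One registry entry: enough metadata to answer 'what is in production?'"""
    name: str
    version: int
    stage: str          # "staging" | "production" | "archived"
    metrics: dict       # evaluation metrics recorded at registration time
    training_run: str   # link back to the run that produced the model
    artifact_uri: str   # where the serialized weights live

entry = RegisteredModel(
    name="churn-classifier",
    version=3,
    stage="production",
    metrics={"auc": 0.91},
    training_run="runs/2024-06-01/abc123",       # hypothetical run ID
    artifact_uri="s3://models/churn/v3/model.pt",  # hypothetical URI
)
```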

Serving Frameworks

| Framework | Best for | Key feature |
| --- | --- | --- |
| TorchServe | PyTorch models | Handler API, batching, multi-model |
| TF Serving | TensorFlow/Keras | SavedModel format, gRPC native |
| Triton (NVIDIA) | Multi-framework GPU serving | Dynamic batching, ensemble pipelines |
| vLLM | LLM serving | PagedAttention, continuous batching |
| BentoML | Framework-agnostic | Python-native, easy packaging |

Deployment Strategies

Blue-green: two identical environments. Switch traffic from blue (old) to green (new) instantly. Rollback = switch back.

Canary: route a small percentage of traffic (e.g., 5%) to the new model. Monitor metrics. Gradually increase if healthy.

A/B testing: route traffic by user segment. Compare business metrics (not just model metrics) between versions. Requires statistical rigor — decide sample size and significance threshold before starting.

Shadow mode: new model receives real traffic but its predictions are discarded. Compare against production model offline. Zero risk.
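The canary strategy reduces to a few lines of routing logic. This is a sketch of the core idea; in practice the split lives in a load balancer or service mesh, not application code.

```python
import random

def route(request, canary_fraction=0.05, rng=random):
    """Send a small slice of traffic to the new model, the rest to stable."""
    if rng.random() < canary_fraction:
        return "v2"  # canary model
    return "v1"      # production model

# Rough check: over many requests, roughly 5% should land on the canary.
rng = random.Random(0)  # fixed seed for reproducibility
hits = sum(route(None, 0.05, rng) == "v2" for _ in range(10_000))
print(hits)  # close to 500
```

To "increase if healthy," you raise `canary_fraction` in steps (5% → 25% → 100%) while watching error rates and latency at each step.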

Latency Optimization

  1. Model optimization: ONNX Runtime, TensorRT, quantization (float32 → int8)
  2. Batching: accumulate requests and run inference in one GPU call. Triton and TorchServe support dynamic batching (wait up to N ms to fill a batch)
  3. Caching: cache predictions for repeated inputs (Redis/memcached)
  4. Hardware: GPU for throughput, CPU for simplicity, dedicated inference chips (AWS Inferentia, Google TPU)
  5. Async processing: return request ID immediately, poll for result — for heavy models
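Caching (item 3 above) can be sketched in-process with `functools.lru_cache`; a production setup would key Redis or memcached on a hash of the features, but the idea is the same. The scoring logic here is a placeholder, and inputs must be hashable, so features become a tuple.

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how often the "real" model actually runs

@lru_cache(maxsize=10_000)
def predict_cached(features: tuple) -> float:
    CALLS["n"] += 1              # a cache miss triggers a real forward pass
    return sum(features) * 0.1   # placeholder for real inference

predict_cached((5.1, 3.5, 1.4, 0.2))
predict_cached((5.1, 3.5, 1.4, 0.2))  # identical input: served from cache
print(CALLS["n"])  # 1
```

Caching only pays off when inputs repeat exactly (or can be bucketed so they do); for continuous features, consider rounding before hashing.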

Code Example

FastAPI Serving with PyTorch

import torch
import torch.nn as nn
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn
 
# --- Define model (same architecture as training) ---
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )
 
    def forward(self, x):
        return self.net(x)
 
# --- Load trained model ---
MODEL_PATH = "classifier.pt"
model = SimpleClassifier(input_dim=4, hidden_dim=32, output_dim=3)
model.load_state_dict(torch.load(MODEL_PATH, map_location="cpu"))
model.eval()
 
# --- API ---
app = FastAPI(title="Iris Classifier")
 
class PredictRequest(BaseModel):
    features: list[float]  # [sepal_l, sepal_w, petal_l, petal_w]
 
class PredictResponse(BaseModel):
    predicted_class: int
    probabilities: list[float]
 
@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    x = torch.tensor([req.features], dtype=torch.float32)
    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()
    return PredictResponse(
        predicted_class=pred_class,
        probabilities=probs[0].tolist(),
    )
 
@app.get("/health")
def health():
    return {"status": "ok"}
 
# Run: uvicorn serve:app --host 0.0.0.0 --port 8000

Test with curl

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
# {"predicted_class": 0, "probabilities": [0.95, 0.03, 0.02]}

Key Tradeoffs

| Decision | Option A | Option B |
| --- | --- | --- |
| Protocol | REST (simple, debuggable) | gRPC (fast, typed) |
| Batching | No batching (lowest latency per request) | Dynamic batching (highest throughput) |
| Optimization | Serve raw PyTorch (flexible) | ONNX/TensorRT (2-5x faster, less flexible) |
| Deployment | Canary (safe, gradual) | Blue-green (instant, all-or-nothing) |
| Infrastructure | Self-hosted (control) | Managed (SageMaker/Vertex, less ops) |

Common Pitfalls

  • Training-serving skew: preprocessing differs between training and serving. Fix: share the same preprocessing code or use a feature store
  • No health checks: the model process crashes silently. Always expose /health
  • Ignoring cold start: first request after deploy loads model into GPU memory — can take 10-30s. Pre-warm models on deploy
  • Unbounded request size: one huge input can OOM the server. Validate input dimensions
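Guarding against the last pitfall takes only a small validation function run before the model sees any input. This sketch assumes a fixed 4-feature model like the Iris example; the expected dimensionality is an assumption to adjust per model.

```python
EXPECTED_DIM = 4  # input dimensionality for this model (assumption)

def validate_features(features):
    """Reject malformed or oversized inputs before they reach the model."""
    if not isinstance(features, list):
        raise ValueError("features must be a list")
    if len(features) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} features, got {len(features)}")
    if not all(isinstance(x, (int, float)) for x in features):
        raise ValueError("features must be numeric")
    return features
```

In the FastAPI example above, the same checks can live in the Pydantic request model so malformed payloads are rejected with a 422 before reaching the handler.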

Exercises

  1. Extend the FastAPI example to serve two model versions behind /v1/predict and /v2/predict. Add a header X-Model-Version to the response
  2. Add dynamic batching: accumulate requests for 10ms, run a single forward pass, return individual results. Measure throughput improvement
  3. Export the PyTorch model to ONNX and serve it with ONNX Runtime. Compare latency against raw PyTorch
  4. Implement a canary deployment with nginx: 95% traffic to model v1, 5% to model v2. Write the nginx config

Self-Test Questions

  1. When would you choose gRPC over REST for model serving? What about the reverse?
  2. What is dynamic batching and why does it improve GPU utilization?
  3. How does a canary deployment differ from A/B testing? When would you use each?
  4. What is training-serving skew and how do you prevent it?
  5. Why is ONNX Runtime faster than serving a raw PyTorch model?