Model Serving
What
Deploying a trained model as a service that accepts inputs and returns predictions in real time. This is the bridge between a notebook experiment and a product feature. Model serving covers the protocols (REST, gRPC), inference patterns (online, batch, streaming), infrastructure (model registries, serving frameworks), and deployment strategies (canary, A/B) needed to run ML in production.
Why It Matters
A model that only runs in a notebook has zero business value. Serving is where ML meets reality:
- Product teams need predictions at low latency (< 100ms for user-facing features)
- Batch pipelines need offline scoring over millions of rows
- Bad deployments can silently degrade — you need rollback, monitoring, and traffic splitting
- The serving layer is often the bottleneck, not the model itself
How It Works
Inference Patterns
| Pattern | When to use | Latency | Example |
|---|---|---|---|
| Online (real-time) | User-facing features | < 100ms | Fraud detection at payment time |
| Batch | Periodic scoring, recommendations | Minutes–hours | Nightly churn predictions |
| Streaming | Event-driven, near-real-time | Seconds | Anomaly detection on sensor data |
| Embedded | Edge/mobile, no network | < 10ms | On-device autocorrect |
REST vs gRPC
REST (HTTP + JSON): universal, easy to debug with curl, good for prototyping. Overhead from JSON serialization makes it slower for large tensors.
gRPC (HTTP/2 + Protobuf): binary serialization, 2-10x faster than REST for tensor data, supports streaming. Preferred for model-to-model communication and high-throughput serving.
```
REST: Client --HTTP/JSON-->      Server  # simple, debuggable
gRPC: Client --HTTP2/Protobuf--> Server  # fast, typed, streaming
```
Rule of thumb: REST for external APIs, gRPC for internal model services.
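The serialization overhead is easy to see directly. A minimal sketch comparing the wire size of the same float vector encoded as JSON text versus packed binary (roughly what Protobuf does for a packed repeated float field, ignoring message framing) — the values and vector length are illustrative:

```python
import json
import struct

# A "tensor": 1,000 float values, as a model input batch might be.
values = [0.123456 * i for i in range(1000)]

# REST-style: JSON text encoding of the list.
json_bytes = json.dumps(values).encode("utf-8")

# gRPC-style: packed binary, 4 bytes per float32.
binary_bytes = struct.pack(f"{len(values)}f", *values)

print(len(json_bytes), len(binary_bytes))
# The binary payload is several times smaller, and it also skips
# the float<->text parsing cost on both ends.
```

The size gap widens with precision: JSON prints full decimal expansions, while float32 is always 4 bytes regardless of the value.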
Model Registries
A model registry tracks trained models with metadata, versioning, and lineage. It answers: “which model version is in production, who trained it, on what data?”
Key registries: MLflow Model Registry, Weights & Biases Registry, SageMaker Model Registry, Vertex AI Model Registry.
A registered model has: name, version, stage (staging/production/archived), metrics, training run link, artifact URI.
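As a sketch of what a registry entry carries — the field names here are illustrative, not any specific registry's schema (MLflow, SageMaker, etc. each define their own):

```python
from dataclasses import dataclass

@dataclass
class RegisteredModel:
    # Illustrative fields mirroring the list above.
    name: str
    version: int
    stage: str          # "staging" | "production" | "archived"
    metrics: dict       # evaluation metrics from the training run
    training_run: str   # link back to the experiment run for lineage
    artifact_uri: str   # where the weights actually live

churn_v3 = RegisteredModel(
    name="churn-classifier",
    version=3,
    stage="production",
    metrics={"auc": 0.91},
    training_run="run-1a2b3c",
    artifact_uri="s3://models/churn/v3/model.pt",
)
print(churn_v3.name, churn_v3.version, churn_v3.stage)
```

The `artifact_uri` plus `training_run` pair is what lets you answer "which model is in production, and what produced it?" without guessing.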
Serving Frameworks
| Framework | Best for | Key feature |
|---|---|---|
| TorchServe | PyTorch models | Handler API, batching, multi-model |
| TF Serving | TensorFlow/Keras | SavedModel format, gRPC native |
| Triton (NVIDIA) | Multi-framework GPU serving | Dynamic batching, ensemble pipelines |
| vLLM | LLM serving | PagedAttention, continuous batching |
| BentoML | Framework-agnostic | Python-native, easy packaging |
Deployment Strategies
Blue-green: two identical environments. Switch traffic from blue (old) to green (new) instantly. Rollback = switch back.
Canary: route a small percentage of traffic (e.g., 5%) to the new model. Monitor metrics. Gradually increase if healthy.
A/B testing: route traffic by user segment. Compare business metrics (not just model metrics) between versions. Requires statistical rigor — decide sample size and significance threshold before starting.
Shadow mode: new model receives real traffic but its predictions are discarded. Compare against production model offline. Zero risk.
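At the application layer, a canary split is just weighted routing. A minimal sketch — the model names and the 5% fraction are illustrative:

```python
import random

def route(canary_fraction: float = 0.05) -> str:
    """Send canary_fraction of requests to the new model, rest to prod."""
    return "model-v2" if random.random() < canary_fraction else "model-v1"

random.seed(0)  # seeded only to make the demo repeatable
counts = {"model-v1": 0, "model-v2": 0}
for _ in range(10_000):
    counts[route()] += 1
print(counts)  # roughly 5% of traffic lands on the canary
```

Note the difference from A/B testing: a canary can use per-request randomness, but an A/B test should hash a stable user ID so each user consistently sees one version — otherwise the business-metric comparison is contaminated.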
Latency Optimization
- Model optimization: ONNX Runtime, TensorRT, quantization (float32 → int8)
- Batching: accumulate requests and run inference in one GPU call. Triton and TorchServe support dynamic batching (wait up to N ms to fill a batch)
- Caching: cache predictions for repeated inputs (Redis/memcached)
- Hardware: GPU for throughput, CPU for simplicity, dedicated inference chips (AWS Inferentia, Google TPU)
- Async processing: return request ID immediately, poll for result — for heavy models
Code Example
FastAPI Serving with PyTorch
```python
import torch
import torch.nn as nn
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

# --- Define model (same architecture as training) ---
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

# --- Load trained model ---
MODEL_PATH = "classifier.pt"
model = SimpleClassifier(input_dim=4, hidden_dim=32, output_dim=3)
model.load_state_dict(torch.load(MODEL_PATH, map_location="cpu"))
model.eval()

# --- API ---
app = FastAPI(title="Iris Classifier")

class PredictRequest(BaseModel):
    features: list[float]  # [sepal_l, sepal_w, petal_l, petal_w]

class PredictResponse(BaseModel):
    predicted_class: int
    probabilities: list[float]

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    x = torch.tensor([req.features], dtype=torch.float32)
    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()
    return PredictResponse(
        predicted_class=pred_class,
        probabilities=probs[0].tolist(),
    )

@app.get("/health")
def health():
    return {"status": "ok"}

# Run: uvicorn serve:app --host 0.0.0.0 --port 8000
```
Test with curl
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
# {"predicted_class": 0, "probabilities": [0.95, 0.03, 0.02]}
```
Key Tradeoffs
| Decision | Option A | Option B |
|---|---|---|
| Protocol | REST (simple, debuggable) | gRPC (fast, typed) |
| Batching | No batching (lowest latency per request) | Dynamic batching (highest throughput) |
| Optimization | Serve raw PyTorch (flexible) | ONNX/TensorRT (2-5x faster, less flexible) |
| Deployment | Canary (safe, gradual) | Blue-green (instant, all-or-nothing) |
| Infrastructure | Self-hosted (control) | Managed (SageMaker/Vertex, less ops) |
Common Pitfalls
- Training-serving skew: preprocessing differs between training and serving. Fix: share the same preprocessing code or use a feature store
- No health checks: the model process crashes silently. Always expose a `/health` endpoint
- Ignoring cold start: the first request after deploy loads the model into GPU memory — can take 10-30s. Pre-warm models on deploy
- Unbounded request size: one huge input can OOM the server. Validate input dimensions
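The last pitfall is cheap to guard against. Pydantic already type-checks the request body, but dimension and size limits need explicit checks — a sketch, with limits that are illustrative (pick them from your model's actual input shape):

```python
MAX_FEATURES = 4  # the model's expected input dimension

def validate(features: list[float]) -> None:
    # Reject wrong-sized inputs before they reach the model,
    # instead of letting a huge tensor OOM the server.
    if len(features) != MAX_FEATURES:
        raise ValueError(
            f"expected {MAX_FEATURES} features, got {len(features)}"
        )
    if not all(isinstance(v, (int, float)) for v in features):
        raise ValueError("features must be numeric")

validate([5.1, 3.5, 1.4, 0.2])   # ok, returns None
try:
    validate([0.0] * 10_000)      # oversized input rejected, not OOM
except ValueError as e:
    print(e)
```

In FastAPI the same effect can be had declaratively, e.g. constraining the list length on the pydantic model, so malformed requests fail with a 422 before your handler runs.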
Exercises
- Extend the FastAPI example to serve two model versions behind `/v1/predict` and `/v2/predict`. Add a header `X-Model-Version` to the response
- Add dynamic batching: accumulate requests for 10ms, run a single forward pass, return individual results. Measure throughput improvement
- Export the PyTorch model to ONNX and serve it with ONNX Runtime. Compare latency against raw PyTorch
- Implement a canary deployment with nginx: 95% traffic to model v1, 5% to model v2. Write the nginx config
Self-Test Questions
- When would you choose gRPC over REST for model serving? What about the reverse?
- What is dynamic batching and why does it improve GPU utilization?
- How does a canary deployment differ from A/B testing? When would you use each?
- What is training-serving skew and how do you prevent it?
- Why is ONNX Runtime faster than serving a raw PyTorch model?
Links
- MLOps Roadmap
- ML Pipelines
- Experiment Tracking — tracking the models you serve
- Model Monitoring — detecting when served models degrade
- Feature Stores — consistent features between training and serving