Model Serving
What
Deploying a trained model as a service that accepts inputs and returns predictions in real time. This is the bridge between a notebook experiment and a product feature. Model serving covers the protocols (REST, gRPC), inference patterns (online, batch, streaming), infrastructure (model registries, serving frameworks), and deployment strategies (canary, A/B) needed to run ML in production.
Why It Matters
A model that only runs in a notebook has zero business value. Serving is where ML meets reality:
- Product teams need predictions at low latency (< 100ms for user-facing features)
- Batch pipelines need offline scoring over millions of rows
- Bad deployments can silently degrade — you need rollback, monitoring, and traffic splitting
- The serving layer is often the bottleneck, not the model itself
How It Works
Inference Patterns
| Pattern | When to use | Latency | Example |
|---|---|---|---|
| Online (real-time) | User-facing features | < 100ms | Fraud detection at payment time |
| Batch | Periodic scoring, recommendations | Minutes–hours | Nightly churn predictions |
| Streaming | Event-driven, near-real-time | Seconds | Anomaly detection on sensor data |
| Embedded | Edge/mobile, no network | < 10ms | On-device autocorrect |
REST vs gRPC
REST (HTTP + JSON): universal, easy to debug with curl, good for prototyping. Overhead from JSON serialization makes it slower for large tensors.
gRPC (HTTP/2 + Protobuf): binary serialization, 2-10x faster than REST for tensor data, supports streaming. Preferred for model-to-model communication and high-throughput serving.
```
REST: Client --HTTP/JSON-->      Server  # simple, debuggable
gRPC: Client --HTTP2/Protobuf--> Server  # fast, typed, streaming
```
Rule of thumb: REST for external APIs, gRPC for internal model services.
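The serialization overhead is easy to see directly. A minimal sketch comparing the wire size of the same float vector encoded as JSON text versus packed binary (roughly what Protobuf does for a packed repeated float field, ignoring message framing) — the values and vector length are illustrative:

```python
import json
import struct

# A "tensor": 1,000 float values, as a model input batch might be.
values = [0.123456 * i for i in range(1000)]

# REST-style: JSON text encoding of the list.
json_bytes = json.dumps(values).encode("utf-8")

# gRPC-style: packed binary, 4 bytes per float32.
binary_bytes = struct.pack(f"{len(values)}f", *values)

print(len(json_bytes), len(binary_bytes))
# The binary payload is several times smaller, and it also skips
# the float<->text parsing cost on both ends.
```

The size gap widens with precision: JSON prints full decimal expansions, while float32 is always 4 bytes regardless of the value.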
Model Registries
A model registry tracks trained models with metadata, versioning, and lineage. It answers: “which model version is in production, who trained it, on what data?”
Key registries: MLflow Model Registry, Weights & Biases Registry, SageMaker Model Registry, Vertex AI Model Registry.
A registered model has: name, version, stage (staging/production/archived), metrics, training run link, artifact URI.
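As a sketch of what a registry entry carries — the field names here are illustrative, not any specific registry's schema (MLflow, SageMaker, etc. each define their own):

```python
from dataclasses import dataclass

@dataclass
class RegisteredModel:
    # Illustrative fields mirroring the list above.
    name: str
    version: int
    stage: str          # "staging" | "production" | "archived"
    metrics: dict       # evaluation metrics from the training run
    training_run: str   # link back to the experiment run for lineage
    artifact_uri: str   # where the weights actually live

churn_v3 = RegisteredModel(
    name="churn-classifier",
    version=3,
    stage="production",
    metrics={"auc": 0.91},
    training_run="run-1a2b3c",
    artifact_uri="s3://models/churn/v3/model.pt",
)
print(churn_v3.name, churn_v3.version, churn_v3.stage)
```

The `artifact_uri` plus `training_run` pair is what lets you answer "which model is in production, and what produced it?" without guessing.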
Serving Frameworks
| Framework | Best for | Key feature |
|---|---|---|
| TorchServe | PyTorch models | Handler API, batching, multi-model |
| TF Serving | TensorFlow/Keras | SavedModel format, gRPC native |
| Triton (NVIDIA) | Multi-framework GPU serving | Dynamic batching, ensemble pipelines |
| vLLM | LLM serving | PagedAttention, continuous batching |
| BentoML | Framework-agnostic | Python-native, easy packaging |
Deployment Strategies
Blue-green: two identical environments. Switch traffic from blue (old) to green (new) instantly. Rollback = switch back.
Canary: route a small percentage of traffic (e.g., 5%) to the new model. Monitor metrics. Gradually increase if healthy.
A/B testing: route traffic by user segment. Compare business metrics (not just model metrics) between versions. Requires statistical rigor — decide sample size and significance threshold before starting.
Shadow mode: new model receives real traffic but its predictions are discarded. Compare against production model offline. Zero risk.
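At the application layer, a canary split is just weighted routing. A minimal sketch — the model names and the 5% fraction are illustrative:

```python
import random

def route(canary_fraction: float = 0.05) -> str:
    """Send canary_fraction of requests to the new model, rest to prod."""
    return "model-v2" if random.random() < canary_fraction else "model-v1"

random.seed(0)  # seeded only to make the demo repeatable
counts = {"model-v1": 0, "model-v2": 0}
for _ in range(10_000):
    counts[route()] += 1
print(counts)  # roughly 5% of traffic lands on the canary
```

Note the difference from A/B testing: a canary can use per-request randomness, but an A/B test should hash a stable user ID so each user consistently sees one version — otherwise the business-metric comparison is contaminated.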
Latency Optimization
- Model optimization: ONNX Runtime, TensorRT, quantization (float32 → int8)
- Batching: accumulate requests and run inference in one GPU call. Triton and TorchServe support dynamic batching (wait up to N ms to fill a batch)
- Caching: cache predictions for repeated inputs (Redis/memcached)
- Hardware: GPU for throughput, CPU for simplicity, dedicated inference chips (AWS Inferentia, Google TPU)
- Async processing: return request ID immediately, poll for result — for heavy models
Code Example
FastAPI Serving with PyTorch
```python
import torch
import torch.nn as nn
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

# --- Define model (same architecture as training) ---
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

# --- Load trained model ---
MODEL_PATH = "classifier.pt"
model = SimpleClassifier(input_dim=4, hidden_dim=32, output_dim=3)
model.load_state_dict(torch.load(MODEL_PATH, map_location="cpu"))
model.eval()

# --- API ---
app = FastAPI(title="Iris Classifier")

class PredictRequest(BaseModel):
    features: list[float]  # [sepal_l, sepal_w, petal_l, petal_w]

class PredictResponse(BaseModel):
    predicted_class: int
    probabilities: list[float]

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    x = torch.tensor([req.features], dtype=torch.float32)
    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()
    return PredictResponse(
        predicted_class=pred_class,
        probabilities=probs[0].tolist(),
    )

@app.get("/health")
def health():
    return {"status": "ok"}

# Run: uvicorn serve:app --host 0.0.0.0 --port 8000
```
Test with curl
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
# {"predicted_class": 0, "probabilities": [0.95, 0.03, 0.02]}
```
Key Tradeoffs
| Decision | Option A | Option B |
|---|---|---|
| Protocol | REST (simple, debuggable) | gRPC (fast, typed) |
| Batching | No batching (lowest latency per request) | Dynamic batching (highest throughput) |
| Optimization | Serve raw PyTorch (flexible) | ONNX/TensorRT (2-5x faster, less flexible) |
| Deployment | Canary (safe, gradual) | Blue-green (instant, all-or-nothing) |
| Infrastructure | Self-hosted (control) | Managed (SageMaker/Vertex, less ops) |
Common Pitfalls
- Training-serving skew: preprocessing differs between training and serving. Fix: share the same preprocessing code or use a feature store
- No health checks: the model process crashes silently. Always expose a `/health` endpoint
- Ignoring cold start: the first request after deploy loads the model into GPU memory — can take 10-30s. Pre-warm models on deploy
- Unbounded request size: one huge input can OOM the server. Validate input dimensions
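The last pitfall is cheap to guard against. Pydantic already type-checks the request body, but dimension and size limits need explicit checks — a sketch, with limits that are illustrative (pick them from your model's actual input shape):

```python
MAX_FEATURES = 4  # the model's expected input dimension

def validate(features: list[float]) -> None:
    # Reject wrong-sized inputs before they reach the model,
    # instead of letting a huge tensor OOM the server.
    if len(features) != MAX_FEATURES:
        raise ValueError(
            f"expected {MAX_FEATURES} features, got {len(features)}"
        )
    if not all(isinstance(v, (int, float)) for v in features):
        raise ValueError("features must be numeric")

validate([5.1, 3.5, 1.4, 0.2])   # ok, returns None
try:
    validate([0.0] * 10_000)      # oversized input rejected, not OOM
except ValueError as e:
    print(e)
```

In FastAPI the same effect can be had declaratively, e.g. constraining the list length on the pydantic model, so malformed requests fail with a 422 before your handler runs.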
Exercises
- Extend the FastAPI example to serve two model versions behind `/v1/predict` and `/v2/predict`. Add a header `X-Model-Version` to the response
- Add dynamic batching: accumulate requests for 10ms, run a single forward pass, return individual results. Measure throughput improvement
- Export the PyTorch model to ONNX and serve it with ONNX Runtime. Compare latency against raw PyTorch
- Implement a canary deployment with nginx: 95% traffic to model v1, 5% to model v2. Write the nginx config
Self-Test Questions
- When would you choose gRPC over REST for model serving? What about the reverse?
- What is dynamic batching and why does it improve GPU utilization?
- How does a canary deployment differ from A/B testing? When would you use each?
- What is training-serving skew and how do you prevent it?
- Why is ONNX Runtime faster than serving a raw PyTorch model?
Links
- MLOps Roadmap
- ML Pipelines
- Experiment Tracking — tracking the models you serve
- Model Monitoring — detecting when served models degrade
- Feature Stores — consistent features between training and serving