Feature Stores
What
A central repository for ML features — the computed values your models actually train on and serve from. Think of it as a data warehouse specifically designed for ML workflows.
Why you need one
The core problem: features computed during training must match features computed during serving. Without a feature store:
- Training uses a Pandas pipeline, serving uses a different SQL query — subtle bugs
- Teams recompute the same features independently — wasted work
- No record of how a feature was computed — debugging nightmares
Feature stores solve training-serving skew, the silent killer of ML in production.
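The fix for skew is a single feature definition shared by both paths. A minimal stdlib sketch (the purchase records, user IDs, and the `avg_purchase_last_30d` helper are all hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical purchase records: (user_id, amount, timestamp).
PURCHASES = [
    ("u1", 30.0, datetime(2024, 1, 5)),
    ("u1", 50.0, datetime(2024, 1, 20)),
    ("u1", 10.0, datetime(2023, 11, 1)),  # outside the 30-day window
]

def avg_purchase_last_30d(purchases, user_id, as_of):
    """One definition, called by BOTH the training pipeline and the serving path."""
    window_start = as_of - timedelta(days=30)
    amounts = [amt for uid, amt, ts in purchases
               if uid == user_id and window_start <= ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

# Training and serving invoke the same function, so the values cannot drift apart.
train_value = avg_purchase_last_30d(PURCHASES, "u1", datetime(2024, 1, 31))
serve_value = avg_purchase_last_30d(PURCHASES, "u1", datetime(2024, 1, 31))
assert train_value == serve_value == 40.0  # Jan 5 and Jan 20 purchases, averaged
```

This is exactly the guarantee a feature store gives you at scale: the Pandas pipeline and the SQL query collapse into one registered definition.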
Architecture
| Component | Purpose | Backed by |
|---|---|---|
| Offline store | Batch features for training | Data warehouse (BigQuery, Snowflake, Parquet) |
| Online store | Low-latency features for serving | Redis, DynamoDB, Cassandra |
| Feature registry | Metadata, lineage, documentation | Catalog database |
| Transformation engine | Compute features from raw data | Spark, SQL, Python |
The flow: raw data → transformations → offline store (for training) → materialized to online store (for serving). Same feature definition, two storage backends.
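Materialization is just "copy the latest value per entity key from the offline store into a key-value online store." A minimal sketch with hypothetical rows, using a dict where production systems would use Redis or DynamoDB:

```python
from datetime import datetime

# Hypothetical offline store: full history, one row per (entity, timestamp).
offline_store = [
    {"user_id": "u1", "avg_purchase_30d": 38.0, "ts": datetime(2024, 1, 1)},
    {"user_id": "u1", "avg_purchase_30d": 42.5, "ts": datetime(2024, 2, 1)},
    {"user_id": "u2", "avg_purchase_30d": 12.0, "ts": datetime(2024, 2, 1)},
]

def materialize(offline_rows):
    """Copy the latest value per entity into the online store.
    Sorting by timestamp means later rows overwrite earlier ones."""
    online = {}
    for row in sorted(offline_rows, key=lambda r: r["ts"]):
        online[row["user_id"]] = row["avg_purchase_30d"]
    return online

online_store = materialize(offline_store)
assert online_store == {"u1": 42.5, "u2": 12.0}
```

Training reads the full `offline_store` history; serving does a single key lookup in `online_store` — same definition, two backends.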
What to store
- Features: the computed values (user_avg_purchase_last_30d, text_embedding_v2)
- Metadata: data type, owner, description, freshness requirements
- Lineage: which raw data sources and transforms produced this feature
- Timestamps: point-in-time correctness to avoid future data leaking into training
Key tools
| Tool | Notes |
|---|---|
| Feast | Open-source, lightweight, Python-native. Good starting point |
| Tecton | Managed, production-grade service; its team also stewards the Feast project |
| Hopsworks | Open-source platform with built-in feature store |
| Vertex AI Feature Store | GCP-native, integrates with Vertex pipelines |
| SageMaker Feature Store | AWS-native |
For learning and small projects, Feast is a solid default. For production at scale, evaluate the managed options.
When you don’t need one
If you have one model, one team, and batch-only serving — a feature store is overkill. A well-organized Parquet file and a documented transformation script will do. Feature stores pay off when you have multiple models sharing features, or real-time serving requirements.
Links
- ML Pipelines — pipelines that produce and consume features
- Model Serving — where online features get consumed
- MLOps Roadmap — the bigger picture