Feature Engineering & Feature Stores at Scale
Session 38 of the 48-session learning series.
Date: Thu, 2026-07-09 · Time: 18:00–20:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 13 · Est. read: 2 h
Why this session matters
This is Session 38 of 48 in the ML track. Modern ML is often described as 80% feature engineering and 20% modelling, and at scale, inconsistency between training and serving features is a leading source of silent model degradation. Feature stores were invented to fix exactly that. Understanding the architecture, not just the buzzword, is critical.
Agenda
- What a feature actually is (and where definitions go to die)
- Feature stores — offline + online + the consistency contract
- Training/serving skew — the silent killer
- Feature pipelines — point-in-time correctness, backfills, freshness
- Real-time features — streaming aggregations, materialisation latency
Pre-read (skim before the session)
- Uber — Michelangelo platform
- Feast — Feature store concepts
- Tecton blog — point-in-time correctness
- Eugene Yan — Feature stores explained
Deep dive
1. The feature definition problem
A feature is a deterministic function of raw data:
user_clicks_last_7d(user_id, ts) = COUNT(clicks WHERE user=user_id AND click_ts BETWEEN ts-7d AND ts)
Definitions live in three places without a feature store:
- The training notebook (Python + Spark).
- The online serving code (Go service).
- The eval script (different Python script).
Three implementations of the same logic. Three sources of drift. Three places to fix when the definition changes.
2. What a feature store gives you
- One definition for a feature → used in both training and serving.
- Offline store (warehouse) for training; bulk historical reads.
- Online store (low-latency KV) for serving; single-key fast reads.
- Materialisation — pipeline copies pre-computed features from offline → online.
- Discovery — UI / catalog of available features.
- Lineage — feature → upstream tables → owners.
- Monitoring — drift, freshness, distribution.
It's data infrastructure with an ML flavour.
3. Offline vs online stores
| Layer | Tech | Latency | Use |
|---|---|---|---|
| Offline | Parquet on S3, Snowflake, BigQuery, Delta | seconds-minutes | Training, backfill, batch eval |
| Online | Redis, DynamoDB, Cassandra, ScyllaDB | < 10 ms | Real-time inference |
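The store must guarantee that, for any entity and timestamp t, the value training reads from the offline store equals the value serving got from the online store at t: feature(user=42, time=t) at training time = feature(user=42, time=t) at serving time.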
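The offline/online split and the consistency contract can be sketched with a toy in-memory store. Everything here (the class name `TinyFeatureStore`, its methods, integer timestamps) is a hypothetical illustration, not the API of Feast or any other product.

```python
from bisect import bisect_right

def _latest_at(rows, ts):
    """Return the last (event_ts, value) row with event_ts <= ts, or None."""
    idx = bisect_right([r[0] for r in rows], ts)
    return rows[idx - 1] if idx else None

class TinyFeatureStore:
    """Toy stand-in: offline keeps full history, online keeps one value per entity."""

    def __init__(self):
        self.offline = {}  # entity -> list of (event_ts, value), sorted by event_ts
        self.online = {}   # entity -> (event_ts, value)

    def write_offline(self, entity, event_ts, value):
        rows = self.offline.setdefault(entity, [])
        rows.append((event_ts, value))
        rows.sort(key=lambda r: r[0])

    def materialise(self, as_of):
        """Copy the latest value at or before as_of from offline to online."""
        for entity, rows in self.offline.items():
            row = _latest_at(rows, as_of)
            if row is not None:
                self.online[entity] = row

    def get_historical(self, entity, ts):
        """Training-side read: value as of ts."""
        row = _latest_at(self.offline.get(entity, []), ts)
        return row[1] if row else None

    def get_online(self, entity):
        """Serving-side read: single-key lookup."""
        row = self.online.get(entity)
        return row[1] if row else None

store = TinyFeatureStore()
store.write_offline(42, 100, 3)
store.write_offline(42, 200, 5)
store.materialise(as_of=150)
# The contract: what serving saw at t=150 is what training reads for t=150.
print(store.get_online(42) == store.get_historical(42, 150))
```

Real systems add TTLs, batching, and schema checks, but the invariant being tested is the same.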
4. Training/serving skew — the bug class
Symptoms:
- Model AUC=0.85 offline, behaves like AUC=0.65 in production.
- Bugs manifest months later, when a feature pipeline silently drifts.
Causes:
- Feature computed differently in train vs serve (different SQL, different timezone, different null handling).
- Training read the feature from a table that lands 24h late, so it saw data that was not yet available at prediction time (look-ahead bias), while serving reads live values.
- Online feature has different freshness than training timestamps assumed.
Fixes (all of them, layered):
- Single source of truth for feature logic (feature store).
- Point-in-time joins (next section).
- Log live features → reuse for retraining.
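Logging live features also makes skew measurable: recompute the same feature offline for the logged (entity, request_ts) keys and compare. A minimal sketch, assuming both sides are available as dictionaries; the function name `skew_report` and the data layout are hypothetical.

```python
import math

def skew_report(served, recomputed, tol=1e-9):
    """Compare values logged at serving time with an offline recomputation.

    served / recomputed: dict mapping (entity, request_ts) -> value, None meaning null.
    Returns (mismatch_rate, mismatching_keys) over the keys present in both.
    """
    shared = sorted(served.keys() & recomputed.keys())
    mismatches = []
    for key in shared:
        a, b = served[key], recomputed[key]
        if a is None or b is None:
            same = a is None and b is None  # null vs 0.0 counts as skew
        else:
            same = math.isclose(a, b, abs_tol=tol)
        if not same:
            mismatches.append(key)
    rate = len(mismatches) / len(shared) if shared else 0.0
    return rate, mismatches

# Hypothetical logs: serving returned null for u2, offline SQL filled 0.0.
served = {("u1", 100): 3.0, ("u2", 100): None, ("u3", 100): 1.0}
recomputed = {("u1", 100): 3.0, ("u2", 100): 0.0, ("u3", 100): 1.0, ("u4", 100): 2.0}
rate, bad_keys = skew_report(served, recomputed)
print(rate, bad_keys)
```

Treating null and 0.0 as different is deliberate here, since differing null handling is one of the causes listed above.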
5. Point-in-time correctness
This is the hard part. When you build training data, for each row (entity, label_ts) you want the features as they were at label_ts — not now.
-- WRONG: uses the current feature value and ignores label_ts
SELECT l.*, f.value
FROM labels AS l
JOIN features AS f ON l.user = f.user;
-- RIGHT: temporal join, point-in-time (ASOF JOIN as in DuckDB; syntax varies by engine)
SELECT l.*, f.value
FROM labels AS l
ASOF JOIN features AS f
ON l.user = f.user
AND f.event_ts <= l.label_ts;
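Without point-in-time joins you leak the future into the past and your offline metrics are fiction. Support varies by tool: pandas has merge_asof, Polars has join_asof, DuckDB and Snowflake have an ASOF JOIN clause, and Feast performs a point-in-time join when it builds historical training data. In Spark SQL and dbt you typically express it with a window function or a range join.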
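The semantics of the temporal join can be sketched in plain Python: for each label row, pick the latest feature row for the same user with event_ts <= label_ts. This is an illustrative sketch with hypothetical tuple layouts and integer timestamps, not how any particular engine implements it.

```python
from bisect import bisect_right
from collections import defaultdict

def point_in_time_join(labels, features):
    """Attach to each label the latest feature value with event_ts <= label_ts.

    labels:   list of (user, label_ts, label)
    features: list of (user, event_ts, value)
    Returns a list of (user, label_ts, label, value); value is None if no
    feature row existed yet at label_ts.
    """
    by_user = defaultdict(list)
    for user, event_ts, value in features:
        by_user[user].append((event_ts, value))
    for rows in by_user.values():
        rows.sort(key=lambda r: r[0])

    joined = []
    for user, label_ts, label in labels:
        rows = by_user.get(user, [])
        idx = bisect_right([ts for ts, _ in rows], label_ts)
        value = rows[idx - 1][1] if idx else None
        joined.append((user, label_ts, label, value))
    return joined

features = [("u1", 10, 1), ("u1", 20, 2), ("u1", 30, 3)]
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 25, 1)]
for row in point_in_time_join(labels, features):
    print(row)
```

For the label at ts=25 the join returns the value written at ts=20, not the later value from ts=30 that a naive "latest value" join would leak in.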
6. Streaming features
The features that drove the field forward:
- clicks_last_5min(user): recency matters.
- merchant_velocity_last_hour(merchant): fraud detection.
- session_pages_so_far(session): recommendation context.
Compute via:
- Flink / Spark Structured Streaming → write to online store every N seconds.
- Materialised in Redis with TTL.
- Reads at serve = single Redis GET.
Trade-offs: freshness vs query simplicity vs cost. Fresher values need more frequent writes and more streaming compute.
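The core of a "count in the last N seconds" feature is a sliding-window aggregation. Below is a minimal single-process sketch using a deque per key; it assumes events arrive in timestamp order and stands in for what a Flink or Spark Structured Streaming job would do at scale. The class name `SlidingWindowCounter` is hypothetical.

```python
from collections import deque

class SlidingWindowCounter:
    """Count events per key within the last window_s seconds (in-order event time)."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = {}  # key -> deque of event timestamps, oldest first

    def add(self, key, ts):
        self.events.setdefault(key, deque()).append(ts)

    def count(self, key, now):
        q = self.events.get(key)
        if not q:
            return 0
        # Evict timestamps that have fallen out of the window (ts <= now - window_s).
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q)

clicks_last_5min = SlidingWindowCounter(window_s=300)
for ts in (0, 100, 290):
    clicks_last_5min.add("u1", ts)
print(clicks_last_5min.count("u1", now=300))
```

In a deployed pipeline the result of count would be written to the online store every N seconds, so that serving remains a single key lookup.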
7. Feature freshness budget
Per feature, declare:
- Staleness SLA — "this feature must be < N seconds old".
- Alert when freshness exceeds SLA.
Production pattern:
- Real-time fraud features: < 5 s.
- Recommendation features: < 60 s.
- Personalisation embeddings: < 1 day.
- Demographic / static: weekly is fine.
Pay for compute matching the SLA. Don't run sub-second updates on weekly features.
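A freshness check reduces to comparing each feature's age against its declared SLA. The sketch below assumes last-write timestamps are available as epoch seconds; the feature names and the function `stale_features` are hypothetical, and the SLA values follow the tiers listed above.

```python
def stale_features(last_updated, sla_seconds, now):
    """Return {feature: age_seconds} for features whose age exceeds their SLA.

    last_updated: feature name -> epoch seconds of the last successful write
    sla_seconds:  feature name -> maximum allowed age in seconds
    A feature that was never written is treated as infinitely stale.
    """
    breaches = {}
    for name, sla in sla_seconds.items():
        updated = last_updated.get(name)
        age = float("inf") if updated is None else now - updated
        if age > sla:
            breaches[name] = age
    return breaches

SLA = {
    "fraud_merchant_velocity_1h": 5,   # real-time fraud: < 5 s
    "reco_session_pages": 60,          # recommendations: < 60 s
    "user_embedding": 24 * 3600,       # personalisation embeddings: < 1 day
    "user_country": 7 * 24 * 3600,     # static: weekly
}

now = 1_000_000
last_updated = {
    "fraud_merchant_velocity_1h": now - 3,
    "reco_session_pages": now - 120,
    "user_embedding": now - 3600,
}
print(stale_features(last_updated, SLA, now))
```

The returned dictionary is what an alerting job would page on; here the recommendation feature is two minutes old and the static feature has never been written.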
8. Backfill
When you add a new feature you need historical values to train on. Backfill = compute the feature for every row of historical data.
Patterns:
- Run a Spark job over all historical raw data.
- Use the same code path as the streaming compute (Kappa architecture) — single source of truth.
- Idempotent writes; can re-run on failure.
- Watch the cost: backfilling 2 years of raw event data can reach petabyte scale.
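The idempotency requirement can be sketched as "one partition per day, written by key". In this hypothetical example a dictionary stands in for the offline store and `compute_partition` stands in for the shared feature code path.

```python
from datetime import date, timedelta

def backfill(compute_partition, sink, start, end):
    """Compute one feature partition per day in [start, end] and upsert it by key.

    compute_partition(day) -> {entity: value}
    sink: dict keyed by (day, entity); overwriting by key, rather than
    appending, is what makes a re-run after failure safe.
    """
    day = start
    while day <= end:
        for entity, value in compute_partition(day).items():
            sink[(day, entity)] = value
        day += timedelta(days=1)
    return sink

def compute_partition(day):
    # Hypothetical stand-in for the real feature computation.
    return {"u1": day.day, "u2": day.day * 2}

sink = {}
backfill(compute_partition, sink, date(2026, 7, 1), date(2026, 7, 3))
rows_after_first_run = len(sink)
backfill(compute_partition, sink, date(2026, 7, 2), date(2026, 7, 3))  # simulated retry
print(rows_after_first_run, len(sink))
```

The retry over an overlapping date range leaves the row count unchanged, which is the property to verify before launching an expensive job.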
9. Embedding features
Embeddings (S17) are a special kind of feature:
- High-dim numeric vectors.
- Stored in vector DB or as bytes in KV store.
- Updated when the model that produced them is retrained.
- Need version tracking (item_emb_v3).
Versioning is critical. Serving an old model with a new embedding format = silent garbage.
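One way to avoid the silent-garbage case is to put the version in the key and fail loudly on a mismatch. This is a sketch under assumptions: the key scheme, `EXPECTED_VERSION`, and the dict used as a KV store are all hypothetical; only the item_emb_v3 naming comes from the text above.

```python
EXPECTED_VERSION = "v3"  # hypothetical: the version the deployed model was trained against

def embedding_key(item_id, version):
    """Build a versioned key such as 'item_emb_v3:7'."""
    return f"item_emb_{version}:{item_id}"

def get_embedding(kv, item_id, version=EXPECTED_VERSION, dim=4):
    """Fetch a versioned embedding; raise instead of serving a mismatched vector."""
    vec = kv.get(embedding_key(item_id, version))
    if vec is None:
        raise KeyError(f"no {version} embedding for item {item_id}")
    if len(vec) != dim:
        raise ValueError(f"expected dim {dim}, got {len(vec)}")
    return vec

# A dict stands in for the online KV store.
kv = {
    "item_emb_v3:7": [0.1, 0.2, 0.3, 0.4],
    "item_emb_v2:8": [0.5, 0.6],  # older format, different dimension
}
print(get_embedding(kv, 7))
```

Item 8 only has a v2 vector, so a v3 lookup raises KeyError rather than quietly returning a vector in the old format.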
10. Feature governance
- Owner — engineer responsible.
- Description — what does it actually mean.
- Source — upstream tables / streams.
- Tags — sensitive (PII), expensive, slow-to-refresh.
- Usage — which models consume it; auto-discovered.
- Quality stats — null rate, distribution, drift score.
Same hygiene as data governance (S32). A feature without an owner is a feature you'll delete during the next outage.
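The governance fields above map naturally onto a small metadata record that a registry can validate at registration time. A minimal sketch; the `FeatureMeta` class and `governance_issues` function are hypothetical and not taken from any specific feature store.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureMeta:
    name: str
    owner: str
    description: str
    sources: list                                 # upstream tables / streams
    tags: set = field(default_factory=set)        # e.g. {"pii", "expensive"}
    consumers: list = field(default_factory=list) # models that read the feature

def governance_issues(meta):
    """Return a list of human-readable problems; an empty list means the record passes."""
    issues = []
    if not meta.owner.strip():
        issues.append("missing owner")
    if not meta.description.strip():
        issues.append("missing description")
    if not meta.sources:
        issues.append("missing source")
    return issues

good = FeatureMeta(
    name="user_clicks_last_7d",
    owner="alice@example.com",
    description="Clicks by the user in the 7 days before the request timestamp.",
    sources=["raw.clicks"],
)
orphan = FeatureMeta(name="legacy_score", owner="", description="", sources=[])
print(governance_issues(good), governance_issues(orphan))
```

Rejecting registrations with a non-empty issue list is a cheap way to keep ownerless features from entering the catalog in the first place.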
11. The build/buy decision
Feature store tools:
- Feast — open source; you bring the infra. Most popular OSS.
- Tecton — SaaS; biggest commercial; richest features.
- SageMaker Feature Store — AWS-native.
- Vertex AI Feature Store — GCP-native.
- Databricks Feature Engineering — built into Lakehouse.
- Build your own — fine for a single ML team; quickly painful for multi-team.
Rule of thumb:
- 1–2 ML engineers, 1–3 models: skip the feature store. Use Spark + Redis.
- 5+ ML engineers, 10+ models: feature store is mandatory.
12. Reality check
A first-feature-store rollout:
- Audit current features — find the duplicates.
- Move top-10 highest-traffic features into Feast (or chosen platform).
- Implement point-in-time training data generation.
- Wire monitoring: freshness, null rate, drift.
- Deprecate the duplicates; force-migrate model owners.
Plan for 3 months minimum. The benefit lands as the second and third model consume the same definitions — that's when you stop reinventing.
Reading material
- Eugene Yan — Feature stores explained
- Designing Machine Learning Systems (Chip Huyen) — Feature engineering chapter
- Tecton — point-in-time correctness
- Uber — Michelangelo
In-depth research material
Video reference
▶︎ Feature Stores Explained (Made With ML)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Subarray Sum Equals K
- Link: https://leetcode.com/problems/subarray-sum-equals-k/
- Difficulty: Medium
- Why this problem: Running prefix sums with a hash map of counts, the algorithmic heart of "events in last X minutes" features.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Define training/serving skew and 3 root causes.
- Explain point-in-time joins and write the ASOF join syntax.
- Sketch the offline/online architecture of a feature store.
- Pick a freshness SLA appropriate to the use case.
- Decide when a team needs a feature store vs Spark + Redis.
- Solve subarray-sum-equals-k: prefix-sum + hashmap, the same shape as streaming-window aggregations.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.