ai ml · intermediate · 12m · 2026-06-09

Feature Engineering & Feature Stores at Scale

Session 38 of the 48-session learning series.

Date: Thu, 2026-07-09 · Time: 18:00–20:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 13 · Est. read: 2 h

Why this session matters

This is Session 38 of 48 in the ML track. A common rule of thumb says modern ML is 80% feature engineering and 20% modelling, and at scale, inconsistency between training and serving features is a leading source of silent model degradation. Feature stores were built to fix exactly that, so it pays to understand the architecture, not just the buzzword.

Agenda

What a feature actually is (and where definitions go to die)
Feature stores — offline + online + the consistency contract
Training/serving skew — the silent killer
Feature pipelines — point-in-time correctness, backfills, freshness
Real-time features — streaming aggregations, materialisation latency

Pre-read (skim before the session)

Deep dive

1. The feature definition problem

A feature is a deterministic function of raw data:

user_clicks_last_7d(user_id, ts) = COUNT(clicks WHERE user=user_id AND click_ts BETWEEN ts-7d AND ts)

Definitions live in three places without a feature store:

The training notebook (Python + Spark).
The online serving code (Go service).
The eval script (different Python script).

Three implementations of the same logic. Three sources of drift. Three places to fix when the definition changes.
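
As a minimal sketch of what "one definition" could look like, the pseudo-formula above can be written as a single Python function that training, serving and evaluation code all import. The click log, its tuple layout and the sample dates are made up for illustration; only the counting rule comes from the formula.

```python
from datetime import datetime, timedelta

# Hypothetical raw click log: (user_id, click_ts) pairs.
CLICKS = [
    (42, datetime(2026, 7, 1, 12, 0)),
    (42, datetime(2026, 7, 5, 9, 30)),
    (42, datetime(2026, 7, 8, 23, 59)),
    (7, datetime(2026, 7, 6, 8, 0)),
]


def user_clicks_last_7d(clicks, user_id, ts):
    """Count clicks by user_id with ts - 7d <= click_ts <= ts (inclusive, like BETWEEN)."""
    window_start = ts - timedelta(days=7)
    return sum(
        1
        for uid, click_ts in clicks
        if uid == user_id and window_start <= click_ts <= ts
    )


as_of = datetime(2026, 7, 9, 0, 0)
clicks_42 = user_clicks_last_7d(CLICKS, 42, as_of)
print(clicks_42)
```

Because the timestamp is an explicit argument, the same function answers both "what is the value now" (serving) and "what was the value at label time" (training).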

2. What a feature store gives you

One definition for a feature → used in both training and serving.
Offline store (warehouse) for training; bulk historical reads.
Online store (low-latency KV) for serving; single-key fast reads.
Materialisation — pipeline copies pre-computed features from offline → online.
Discovery — UI / catalog of available features.
Lineage — feature → upstream tables → owners.
Monitoring — drift, freshness, distribution.

It's data infrastructure with an ML flavour.

3. Offline vs online stores

Layer	Tech	Latency	Use
Offline	Parquet on S3, Snowflake, BigQuery, Delta	seconds-minutes	Training, backfill, batch eval
Online	Redis, DynamoDB, Cassandra, ScyllaDB	< 10 ms	Real-time inference

The store must guarantee: feature(user=42, time=t) as read from the offline store for training = feature(user=42, time=t) as the online store served it at time t.
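
The offline/online split and the consistency contract can be illustrated with a toy, in-memory sketch. Here a list of rows stands in for the warehouse and a dict stands in for the KV store; the row layout, integer timestamps and function names are assumptions for illustration, not any particular product's API.

```python
# Offline store: full history of (user_id, event_ts, value) rows.
# Timestamps are plain integers to keep the example small.
OFFLINE = [
    (42, 100, 3),
    (42, 200, 5),
    (7, 150, 1),
    (42, 300, 8),
]


def offline_lookup(rows, user_id, ts):
    """Latest value for user_id with event_ts <= ts, or None if there is none."""
    candidates = [
        (event_ts, value)
        for uid, event_ts, value in rows
        if uid == user_id and event_ts <= ts
    ]
    return max(candidates)[1] if candidates else None


def materialise(rows, as_of):
    """Copy the latest value per user (as of `as_of`) into a dict acting as the online store."""
    online = {}
    for uid in {row[0] for row in rows}:
        value = offline_lookup(rows, uid, as_of)
        if value is not None:
            online[uid] = value
    return online


online_store = materialise(OFFLINE, as_of=250)

# Consistency contract: a single-key online read matches the offline value at the same time.
consistent = all(
    online_store[uid] == offline_lookup(OFFLINE, uid, 250) for uid in online_store
)
print(online_store, consistent)
```

The online side keeps only the latest value per key, which is what makes single-key reads cheap; history stays in the offline store.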

4. Training/serving skew — the bug class

Symptoms:

Model AUC=0.85 offline, behaves like AUC=0.65 in production.
Bugs manifest months later when a feature pipeline silently drifts.

Causes:

Feature computed differently in train vs serve (different SQL, different timezone, different null handling).
Training read the feature from a table that lands 24 h late and already includes late-arriving data (look-ahead bias), while serving reads live values.
Online feature has different freshness than training timestamps assumed.

Fixes (all of them, layered):

Single source of truth for feature logic (feature store).
Point-in-time joins (next section).
Log live features → reuse for retraining.
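
One way to catch skew early is to compare the feature values logged at serving time with the values the offline pipeline recomputes for the same keys. The sketch below assumes both sides are available as dicts keyed by (entity_id, request_ts); that layout, the function name and the sample values are hypothetical.

```python
import math


def skew_report(served, recomputed, tolerance=1e-9):
    """Compare served vs recomputed features; return (mismatch_rate, mismatches).

    A key missing from `recomputed` is treated as None. None on one side only
    counts as a mismatch, which surfaces differences in null handling.
    """
    mismatches = []
    for key, served_value in served.items():
        offline_value = recomputed.get(key)
        if served_value is None or offline_value is None:
            same = served_value is offline_value
        else:
            same = math.isclose(served_value, offline_value, abs_tol=tolerance)
        if not same:
            mismatches.append((key, served_value, offline_value))
    rate = len(mismatches) / len(served) if served else 0.0
    return rate, mismatches


# Serving defaulted a missing value to 0.0 while the offline job kept it as null.
served = {(42, 1000): 3.0, (42, 2000): 0.0, (7, 1500): None}
recomputed = {(42, 1000): 3.0, (42, 2000): None, (7, 1500): None}

rate, mismatches = skew_report(served, recomputed)
print(rate, mismatches)
```

Running a check like this on a sample of logged requests turns a silent degradation into an alert on the mismatch rate.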

5. Point-in-time correctness

This is the hard part. When you build training data, for each row (entity, label_ts) you want the features as they were at label_ts — not now.

-- WRONG: ignores time, so each label sees feature values from after label_ts
SELECT label.*, feat.value
FROM labels AS label
JOIN features AS feat ON label.user_id = feat.user_id

-- RIGHT: temporal join, point-in-time (DuckDB-style ASOF JOIN; other engines differ)
SELECT label.*, feat.value
FROM labels AS label
ASOF JOIN features AS feat
  ON label.user_id = feat.user_id
  AND feat.event_ts <= label.label_ts

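Without point-in-time joins you leak the future into the past and your offline metrics are fiction. Support varies by tool: Polars (join_asof), pandas (merge_asof), DuckDB and Snowflake (ASOF JOIN) and Feast (get_historical_features) provide them directly, while in Spark SQL and dbt models you typically write them as a range join plus a window function, depending on the engine.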
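
To make the semantics concrete without tying them to one engine, here is a small pure-Python sketch of a point-in-time join using binary search. The tuple layouts and sample rows are assumptions for illustration; a real pipeline would use the engine's own as-of join.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(labels, features):
    """Attach to each (user_id, label_ts, label) the latest feature value
    with event_ts <= label_ts, or None if no such value exists yet."""
    history = defaultdict(list)
    for user_id, event_ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        history[user_id].append((event_ts, value))

    joined = []
    for user_id, label_ts, label in labels:
        rows = history.get(user_id, [])
        timestamps = [event_ts for event_ts, _ in rows]
        idx = bisect_right(timestamps, label_ts)  # rows with event_ts <= label_ts
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((user_id, label_ts, label, value))
    return joined


# (user_id, event_ts, value) and (user_id, label_ts, label)
features = [(42, 100, 3), (42, 200, 5), (42, 300, 8), (7, 150, 1)]
labels = [(42, 250, 1), (42, 90, 0), (7, 150, 1)]

training_rows = point_in_time_join(labels, features)
print(training_rows)
```

The label at ts=250 gets the value written at ts=200, not the later one at ts=300, and the label at ts=90 gets no value because none existed yet.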

6. Streaming features

The features that drove the field forward:

clicks_last_5min(user) — recency matters.
merchant_velocity_last_hour(merchant) — fraud detection.
session_pages_so_far(session) — recommendation context.

Compute via:

Flink / Spark Structured Streaming → write to online store every N seconds.
Materialised in Redis with TTL.
Reads at serve = single Redis GET.

Trade-offs: freshness vs read-path simplicity vs cost. Fresher features need more frequent writes and more streaming compute.
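
The core of a feature like clicks_last_5min is a sliding-window count. The sketch below is a single-process stand-in for what a streaming job would maintain, assuming events arrive in time order and using integer seconds; the class name and the dict acting as the online store are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Per-key event count over the window (now - window_s, now].

    Assumes events for a key arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = defaultdict(deque)

    def add(self, key, ts):
        self.events[key].append(ts)

    def count(self, key, now):
        queue = self.events[key]
        # Evict events that have fallen out of the window.
        while queue and queue[0] <= now - self.window_s:
            queue.popleft()
        return len(queue)


counter = SlidingWindowCounter(window_s=300)
for ts in (0, 100, 290):
    counter.add("u1", ts)

# The streaming job would write this value to the online store every N seconds.
online_store = {"clicks_last_5min:u1": counter.count("u1", now=300)}
later = counter.count("u1", now=400)
print(online_store, later)
```

Serving then needs only one key lookup; all the windowing cost is paid on the write path.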

7. Feature freshness budget

Per feature, declare:

Staleness SLA — "this feature must be < N seconds old".
Alert when freshness exceeds SLA.

Production pattern:

Real-time fraud features: < 5 s.
Recommendation features: < 60 s.
Personalisation embeddings: < 1 day.
Demographic / static: weekly is fine.

Pay for compute matching the SLA. Don't run sub-second updates on weekly features.
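
A freshness budget can be declared as data and checked mechanically. This is a minimal sketch: the feature names are invented, the SLA values mirror the production pattern above, and timestamps are plain epoch seconds.

```python
# Maximum allowed age in seconds per feature (names are hypothetical).
FRESHNESS_SLA_S = {
    "fraud_velocity": 5,
    "reco_context": 60,
    "user_embedding": 24 * 3600,
    "demographics": 7 * 24 * 3600,
}


def stale_features(last_updated, now, sla=FRESHNESS_SLA_S):
    """Return {feature: age_s} for features that violate "must be < N seconds old".

    A feature with no recorded update counts as infinitely stale.
    """
    stale = {}
    for name, max_age in sla.items():
        updated = last_updated.get(name)
        age = float("inf") if updated is None else now - updated
        if age >= max_age:
            stale[name] = age
    return stale


now = 1_000_000
last_updated = {
    "fraud_velocity": now - 2,
    "reco_context": now - 90,
    "user_embedding": now - 3600,
}

alerts = stale_features(last_updated, now)
print(alerts)
```

Treating a missing timestamp as stale is a deliberate choice here: a feature that has never been written should alert, not pass silently.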

8. Backfill

When you add a new feature you need historical values to train on. Backfill = compute the feature for every row of historical data.

Patterns:

Run a Spark job over all historical raw data.
Use the same code path as the streaming compute (Kappa architecture) — single source of truth.
Idempotent writes; can re-run on failure.
Watch the cost: backfilling 2 years of data can reach petabyte scale, depending on event volume.
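
The idempotency pattern can be sketched as "overwrite one partition at a time": each run replaces everything stored for a day, so re-running after a failure gives the same result. The daily click-count feature, the tuple layout and the dict acting as the feature table are assumptions for illustration.

```python
def compute_daily_clicks(raw_events, day):
    """Feature logic shared by backfill and incremental runs: clicks per user for one day."""
    counts = {}
    for user_id, event_day in raw_events:
        if event_day == day:
            counts[user_id] = counts.get(user_id, 0) + 1
    return counts


def backfill(raw_events, days, store):
    """Rewrite one partition per day, keyed by (day, user_id)."""
    for day in days:
        # Drop anything a previous, possibly partial, run wrote for this day.
        for key in [k for k in store if k[0] == day]:
            del store[key]
        for user_id, value in compute_daily_clicks(raw_events, day).items():
            store[(day, user_id)] = value
    return store


raw = [(42, "2026-07-01"), (42, "2026-07-01"), (7, "2026-07-02")]
days = ["2026-07-01", "2026-07-02"]

# Start with a leftover row from a failed run to show that it gets cleaned up.
store = {("2026-07-01", 99): 5}
backfill(raw, days, store)
first_run = dict(store)
backfill(raw, days, store)  # re-run: same result
print(store == first_run, store)
```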

9. Embedding features

Embeddings (S17) are a special kind of feature:

High-dim numeric vectors.
Stored in vector DB or as bytes in KV store.
Updated when the model that produced them is retrained.
Need version tracking (item_emb_v3).

Versioning is critical. Serving an old model with a new embedding format = silent garbage.
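
One way to keep model and embedding versions aligned is to put the version in the storage key and have the reader request only the version it was trained with. The sketch below stores vectors as float32 bytes in a dict standing in for the KV store; the key format, helper names and sample vector are hypothetical.

```python
import struct


class EmbeddingVersionError(Exception):
    """Raised when the requested embedding version is not in the store."""


def embedding_key(name, version, entity_id):
    return f"{name}_v{version}:{entity_id}"


def encode(vector):
    """Pack floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def decode(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def fetch_embedding(store, name, entity_id, expected_version):
    """Read only the version the serving model expects; fail loudly otherwise."""
    key = embedding_key(name, expected_version, entity_id)
    if key not in store:
        raise EmbeddingVersionError(f"missing {key}")
    return decode(store[key])


kv_store = {embedding_key("item_emb", 3, "sku1"): encode([0.5, -1.25, 2.0])}

vector = fetch_embedding(kv_store, "item_emb", "sku1", expected_version=3)
print(vector)
```

A model trained against v2 asking for `expected_version=2` gets an explicit error here instead of silently reading v3 vectors.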

10. Feature governance

Owner — engineer responsible.
Description — what does it actually mean.
Source — upstream tables / streams.
Tags — sensitive (PII), expensive, slow-to-refresh.
Usage — which models consume it; auto-discovered.
Quality stats — null rate, distribution, drift score.

Same hygiene as data governance (S32). A feature without an owner is a feature you'll delete during the next outage.
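
The metadata fields above map naturally onto a small registry that refuses features without an owner or description. This is a sketch only: the class names, the validation rules and the sample entries are assumptions, and real feature stores expose their own catalog APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMetadata:
    name: str
    owner: str
    description: str
    source: str
    tags: tuple = ()


class FeatureRegistry:
    def __init__(self):
        self._features = {}
        self._consumers = {}

    def register(self, meta):
        if not meta.owner.strip():
            raise ValueError(f"{meta.name}: owner is required")
        if not meta.description.strip():
            raise ValueError(f"{meta.name}: description is required")
        self._features[meta.name] = meta
        self._consumers.setdefault(meta.name, set())

    def record_usage(self, feature_name, model_name):
        if feature_name not in self._features:
            raise KeyError(feature_name)
        self._consumers[feature_name].add(model_name)

    def with_tag(self, tag):
        return sorted(n for n, m in self._features.items() if tag in m.tags)

    def unused(self):
        """Features no model consumes: candidates for deprecation."""
        return sorted(n for n, models in self._consumers.items() if not models)


registry = FeatureRegistry()
registry.register(FeatureMetadata(
    "user_clicks_last_7d", "alice", "Clicks per user, trailing 7 days", "events.clicks"))
registry.register(FeatureMetadata(
    "user_home_city", "bob", "Self-reported home city", "crm.users", tags=("pii",)))
registry.record_usage("user_clicks_last_7d", "ranker_v1")

print(registry.with_tag("pii"), registry.unused())
```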

11. The build/buy decision

Feature store tools:

Feast — open source; you bring the infra. Most popular OSS.
Tecton — SaaS; biggest commercial; richest features.
SageMaker Feature Store — AWS-native.
Vertex AI Feature Store — GCP-native.
Databricks Feature Engineering — built into Lakehouse.
Build your own — fine for a single ML team; quickly painful for multi-team.

Rule of thumb:

1–2 ML engineers, 1–3 models: skip the feature store. Use Spark + Redis.
5+ ML engineers, 10+ models: feature store is mandatory.

12. Reality check

A first-feature-store rollout:

Audit current features — find the duplicates.
Move top-10 highest-traffic features into Feast (or chosen platform).
Implement point-in-time training data generation.
Wire monitoring: freshness, null rate, drift.
Deprecate the duplicates; force-migrate model owners.

Plan for 3 months minimum. The benefit lands as the second and third model consume the same definitions — that's when you stop reinventing.

Link: https://leetcode.com/problems/subarray-sum-equals-k/
Difficulty: Medium
Why this problem: Counting subarrays with a running prefix sum and a hash map of earlier sums is the same shape as "events in last X minutes" features, where a window total is the difference of two running totals.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Define training/serving skew and 3 root causes.
Explain point-in-time joins and write the ASOF join syntax.
Sketch the offline/online architecture of a feature store.
Pick a freshness SLA appropriate to the use case.
Decide when a team needs a feature store vs Spark + Redis.
Solve subarray-sum-equals-k — prefix-sum + hashmap, the same shape as streaming-window aggregations.

