ai ml · intermediate · 12m · 2026-06-09

Feature Engineering & Feature Stores at Scale

Session 38 of the 48-session learning series.

Date: Thu, 2026-07-09 · Time: 18:00–20:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 13 · Est. read: 2 h

Why this session matters

This is Session 38 of 48 in the ML track. By a common rule of thumb, modern ML is 80% feature engineering and 20% modelling, and at scale any inconsistency between training and serving features is a leading source of silent model degradation. Feature stores were built to fix exactly that. Understanding the architecture, not just the buzzword, is critical.

Agenda

  • What a feature actually is (and where definitions go to die)
  • Feature stores — offline + online + the consistency contract
  • Training/serving skew — the silent killer
  • Feature pipelines — point-in-time correctness, backfills, freshness
  • Real-time features — streaming aggregations, materialisation latency

Deep dive

1. The feature definition problem

A feature is a deterministic function of raw data:

user_clicks_last_7d(user_id, ts) = COUNT(clicks WHERE user=user_id AND click_ts BETWEEN ts-7d AND ts)
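
As a minimal sketch, the definition above can be written as a plain Python function. The in-memory `clicks` list of `(user_id, click_ts)` pairs is a hypothetical stand-in for the raw click table, and the window is inclusive at both ends to mirror SQL `BETWEEN`.

```python
from datetime import datetime, timedelta


def user_clicks_last_7d(clicks, user_id, ts):
    """Count clicks by user_id with click_ts in [ts - 7d, ts], both ends inclusive."""
    window_start = ts - timedelta(days=7)
    return sum(
        1
        for click_user, click_ts in clicks
        if click_user == user_id and window_start <= click_ts <= ts
    )


# Hypothetical raw data: (user_id, click_ts) pairs.
clicks = [
    (42, datetime(2026, 7, 1, 9, 0)),   # older than 7 days at ts, excluded
    (42, datetime(2026, 7, 8, 9, 0)),   # inside the window
    (42, datetime(2026, 7, 9, 19, 0)),  # after ts, excluded
    (7, datetime(2026, 7, 8, 12, 0)),   # different user
]

ts = datetime(2026, 7, 9, 18, 0)
print(user_clicks_last_7d(clicks, 42, ts))  # prints 1
```

Because the function depends only on its arguments, calling it twice with the same raw data and the same `ts` always gives the same value, which is the property the rest of the session relies on.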

Definitions live in three places without a feature store:

  • The training notebook (Python + Spark).
  • The online serving code (Go service).
  • The eval script (different Python script).

Three implementations of the same logic. Three sources of drift. Three places to fix when the definition changes.

2. What a feature store gives you

  • One definition for a feature → used in both training and serving.
  • Offline store (warehouse) for training; bulk historical reads.
  • Online store (low-latency KV) for serving; single-key fast reads.
  • Materialisation — pipeline copies pre-computed features from offline → online.
  • Discovery — UI / catalog of available features.
  • Lineage — feature → upstream tables → owners.
  • Monitoring — drift, freshness, distribution.

It's data infrastructure with an ML flavour.

3. Offline vs online stores

Layer    | Tech                                      | Latency            | Use
Offline  | Parquet on S3, Snowflake, BigQuery, Delta | seconds to minutes | Training, backfill, batch eval
Online   | Redis, DynamoDB, Cassandra, ScyllaDB      | < 10 ms            | Real-time inference

The store must guarantee that, for any timestamp t, feature(user=42, time=t) read at training time equals feature(user=42, time=t) read at serving time.
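
The offline/online split and the consistency contract can be illustrated with a toy in-memory store. This is only a sketch: `ToyFeatureStore` and its method names are hypothetical and are not the API of Feast or any other product, and timestamps are plain integers for brevity.

```python
class ToyFeatureStore:
    """Offline store keeps full history; online store keeps only the latest value per key."""

    def __init__(self):
        self.offline = {}  # (feature, entity) -> [(event_ts, value), ...]
        self.online = {}   # (feature, entity) -> (event_ts, value)

    def write_offline(self, feature, entity, event_ts, value):
        self.offline.setdefault((feature, entity), []).append((event_ts, value))

    def materialise(self):
        """Copy the most recent offline value of every key into the online store."""
        for key, history in self.offline.items():
            self.online[key] = max(history, key=lambda row: row[0])

    def get_historical(self, feature, entity, as_of_ts):
        """Offline read: latest value with event_ts <= as_of_ts, or None."""
        history = self.offline.get((feature, entity), [])
        eligible = [row for row in history if row[0] <= as_of_ts]
        return max(eligible, key=lambda row: row[0])[1] if eligible else None

    def get_online(self, feature, entity):
        """Online read: a single-key lookup."""
        row = self.online.get((feature, entity))
        return row[1] if row is not None else None


store = ToyFeatureStore()
store.write_offline("user_clicks_last_7d", 42, event_ts=100, value=3)
store.write_offline("user_clicks_last_7d", 42, event_ts=200, value=5)
store.materialise()

# The contract: the online value matches the offline value as of the same time.
online = store.get_online("user_clicks_last_7d", 42)
offline = store.get_historical("user_clicks_last_7d", 42, as_of_ts=200)
print(online == offline)  # prints True
```

In this sketch both reads come from the same written rows, so they cannot disagree unless materialisation lags, which is exactly what the freshness monitoring later in the session is meant to catch.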

4. Training/serving skew — the bug class

Symptoms:

  • Model AUC=0.85 offline, behaves like AUC=0.65 in production.
  • Bugs manifest months later when a feature pipeline silently drifts.

Causes:

  • Feature computed differently in train vs serve (different SQL, different timezone, different null handling).
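  • Training read the feature from a snapshot table that lands 24 h late, so training rows saw data that was not yet available at prediction time (look-ahead bias), while serving reads live values.
  • The online feature is staler (or fresher) than the training timestamps assumed.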

Fixes (all of them, layered):

  • Single source of truth for feature logic (feature store).
  • Point-in-time joins (next section).
  • Log live features → reuse for retraining.
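
One way to make skew visible is to compare the feature values logged at serving time with the values recomputed offline for the same requests. The sketch below assumes both are available as dictionaries keyed by `(entity, request_ts)`; the function name `skew_report` and the tolerance are illustrative choices, not part of any library.

```python
import math


def skew_report(served, recomputed, rel_tol=1e-6):
    """Return (mismatch_rate, mismatching_keys) over keys present in both inputs.

    served / recomputed: dict mapping (entity, request_ts) -> numeric value or None.
    """
    shared = served.keys() & recomputed.keys()
    mismatches = []
    for key in sorted(shared):
        a, b = served[key], recomputed[key]
        if a is None or b is None:
            same = a is None and b is None  # None vs 0 counts as a mismatch
        else:
            same = math.isclose(a, b, rel_tol=rel_tol)
        if not same:
            mismatches.append(key)
    rate = len(mismatches) / len(shared) if shared else 0.0
    return rate, mismatches


served = {(42, 100): 3, (42, 200): 5, (7, 100): None, (7, 200): 0}
recomputed = {(42, 100): 3, (42, 200): 6, (7, 100): 0, (7, 200): 0}

rate, bad_keys = skew_report(served, recomputed)
print(rate, bad_keys)  # prints 0.5 [(7, 100), (42, 200)]
```

The `(7, 100)` mismatch is the null-handling case from the list of causes: serving returned nothing where the offline job produced 0.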

5. Point-in-time correctness

This is the hard part. When you build training data, for each row (entity, label_ts) you want the features as they were at label_ts — not now.

-- WRONG: no time condition, so each label picks up the current feature value
SELECT label.*, feat.value
FROM labels AS label
JOIN features AS feat
  ON label.user_id = feat.user_id

-- RIGHT: temporal join, point-in-time
-- (DuckDB-style ASOF JOIN; other engines use different syntax)
SELECT label.*, feat.value
FROM labels AS label
ASOF JOIN features AS feat
  ON label.user_id = feat.user_id
  AND label.label_ts >= feat.event_ts

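Without point-in-time joins you leak the future into the past and your offline metrics are fiction. Support varies by tool: pandas (merge_asof) and Polars (join_asof) expose as-of joins directly, Feast applies a point-in-time join when it builds historical training data, and in Spark SQL or dbt you usually write one with window functions or the warehouse's own ASOF syntax.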
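
The temporal join can also be sketched in plain Python with a binary search per label row. This is an illustration of the technique, not how any particular engine implements it; the tuple layouts and the name `point_in_time_join` are assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(labels, features):
    """Attach to each label the latest feature value with event_ts <= label_ts.

    labels:   iterable of (user, label_ts, label)
    features: iterable of (user, event_ts, value)
    Returns a list of (user, label_ts, label, value); value is None when the
    user has no feature row at or before label_ts.
    """
    by_user = defaultdict(list)
    for user, event_ts, value in features:
        by_user[user].append((event_ts, value))

    # Per user: parallel lists of sorted timestamps and their values.
    index = {}
    for user, rows in by_user.items():
        rows.sort(key=lambda row: row[0])
        index[user] = ([ts for ts, _ in rows], [v for _, v in rows])

    joined = []
    for user, label_ts, label in labels:
        timestamps, values = index.get(user, ([], []))
        pos = bisect_right(timestamps, label_ts)  # rows with event_ts <= label_ts
        joined.append((user, label_ts, label, values[pos - 1] if pos else None))
    return joined


features = [(42, 10, 1), (42, 20, 4), (42, 30, 9)]
labels = [(42, 25, 1), (42, 5, 0), (42, 30, 1), (7, 25, 0)]

for row in point_in_time_join(labels, features):
    print(row)
```

The label at `label_ts=25` gets the value written at `event_ts=20`, never the one written at 30, which is the whole point: a feature row from the future of the label is invisible to it.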

6. Streaming features

The features that drove the field forward:

  • clicks_last_5min(user) — recency matters.
  • merchant_velocity_last_hour(merchant) — fraud detection.
  • session_pages_so_far(session) — recommendation context.

Compute via:

  • Flink / Spark Structured Streaming → write to online store every N seconds.
  • Materialised in Redis with TTL.
  • Reads at serve = single Redis GET.

Trade-offs: freshness vs query simplicity vs cost. Fresher features need more frequent writes and more streaming compute.
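
The core of a feature like clicks_last_5min is a sliding-window count. The sketch below keeps the window in process memory with a deque per user; in the pipeline described above, Flink or Spark would hold this state and write the result to the online store. The class name and the assumption that events arrive in timestamp order are simplifications for illustration.

```python
from collections import deque

WINDOW_SECONDS = 5 * 60


class ClicksLast5Min:
    """Per-user count of clicks in the window (now - 5 min, now]."""

    def __init__(self):
        self._events = {}  # user -> deque of click timestamps, oldest first

    def record(self, user, click_ts):
        # Assumes click_ts values arrive in non-decreasing order per user.
        self._events.setdefault(user, deque()).append(click_ts)

    def value(self, user, now):
        events = self._events.get(user)
        if not events:
            return 0
        cutoff = now - WINDOW_SECONDS
        # Evict clicks that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


feature = ClicksLast5Min()
for click_ts in (0, 100, 290):
    feature.record(42, click_ts)

print(feature.value(42, now=300))  # prints 2: the click at t=0 has just expired
print(feature.value(42, now=400))  # prints 1: only the click at t=290 remains
```

Each click is appended once and evicted once, so the cost per event is constant regardless of window size.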

7. Feature freshness budget

Per feature, declare:

  • Staleness SLA — "this feature must be < N seconds old".
  • Alert when freshness exceeds SLA.

Production pattern:

  • Real-time fraud features: < 5 s.
  • Recommendation features: < 60 s.
  • Personalisation embeddings: < 1 day.
  • Demographic / static: weekly is fine.

Pay for compute matching the SLA. Don't run sub-second updates on weekly features.
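
A staleness SLA is easy to check once each feature records when it was last written. The sketch below uses the budgets from the list above; the feature names, the `stale_features` helper, and the integer-second timestamps are hypothetical.

```python
# Staleness SLA per feature, in seconds (feature names are illustrative).
STALENESS_SLA_SECONDS = {
    "merchant_velocity_last_hour": 5,      # real-time fraud
    "session_pages_so_far": 60,            # recommendations
    "user_embedding": 24 * 3600,           # personalisation embeddings
    "user_demographics": 7 * 24 * 3600,    # static, weekly
}


def stale_features(last_updated, now, sla=STALENESS_SLA_SECONDS):
    """Return {feature: age_seconds} for every feature that breaches its SLA.

    last_updated maps feature name -> timestamp of its latest write.
    A feature that was never written is treated as infinitely stale.
    """
    breaches = {}
    for feature, limit in sla.items():
        updated_at = last_updated.get(feature)
        age = float("inf") if updated_at is None else now - updated_at
        if age >= limit:  # the SLA says "must be < N seconds old"
            breaches[feature] = age
    return breaches


last_updated = {
    "merchant_velocity_last_hour": 998,
    "session_pages_so_far": 900,
    "user_embedding": 0,
}
print(stale_features(last_updated, now=1000))
```

Here session_pages_so_far is 100 s old against a 60 s budget and user_demographics has never been written, so both would trigger an alert.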

8. Backfill

When you add a new feature you need historical values to train on. Backfill = compute the feature for every row of historical data.

Patterns:

  • Run a Spark job over all historical raw data.
  • Use the same code path as the streaming compute (Kappa architecture) — single source of truth.
  • Idempotent writes; can re-run on failure.
  • Watch the cost: backfilling 2 years of data can reach petabyte scale.
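
Idempotent writes are what make a backfill safe to re-run. A minimal sketch, assuming events are partitioned by day and the sink is a keyed store (a dict here): each `(entity, day)` key is overwritten rather than appended, so a retry after a failure produces the same result. The `backfill` and `daily_event_count` names are hypothetical, and in a Kappa-style setup the same feature function would also back the streaming path.

```python
def daily_event_count(entity, events):
    """Feature logic shared by backfill and (in principle) the streaming job."""
    return sum(1 for event in events if event["entity"] == entity)


def backfill(raw_events, feature_fn, partitions, sink):
    """Compute feature_fn per (entity, day) and upsert into sink."""
    for day in partitions:
        rows = [event for event in raw_events if event["day"] == day]
        for entity in sorted({event["entity"] for event in rows}):
            # Keyed overwrite: re-running a partition cannot create duplicates.
            sink[(entity, day)] = feature_fn(entity, rows)


raw_events = [
    {"entity": "u1", "day": "2026-01-01"},
    {"entity": "u1", "day": "2026-01-01"},
    {"entity": "u2", "day": "2026-01-01"},
    {"entity": "u1", "day": "2026-01-02"},
]

sink = {}
backfill(raw_events, daily_event_count, ["2026-01-01", "2026-01-02"], sink)
first_run = dict(sink)

# Simulate a retry of one partition after a failure.
backfill(raw_events, daily_event_count, ["2026-01-02"], sink)
print(sink == first_run)  # prints True
```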

9. Embedding features

Embeddings (S17) are a special kind of feature:

  • High-dim numeric vectors.
  • Stored in vector DB or as bytes in KV store.
  • Updated when the model that produced them is retrained.
  • Need version tracking (item_emb_v3).

Versioning is critical. Serving an old model with a new embedding format = silent garbage.
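
One way to avoid silent garbage is to put the version into the storage key and have the serving code fail loudly when the expected version or dimension is missing. The key layout (`item_emb_v3:<id>`), the error class, and the helper names below are assumptions for illustration; `kv` is a plain dict standing in for the online store.

```python
class EmbeddingVersionError(LookupError):
    """Raised when the requested embedding version is missing or malformed."""


def embedding_key(name, version, entity_id):
    # e.g. embedding_key("item_emb", 3, "sku-1") -> "item_emb_v3:sku-1"
    return f"{name}_v{version}:{entity_id}"


def fetch_embedding(kv, name, expected_version, expected_dim, entity_id):
    """Read an embedding and refuse to return one the model was not trained on."""
    key = embedding_key(name, expected_version, entity_id)
    vector = kv.get(key)
    if vector is None:
        raise EmbeddingVersionError(f"no embedding stored under {key!r}")
    if len(vector) != expected_dim:
        raise EmbeddingVersionError(
            f"{key!r} has dimension {len(vector)}, expected {expected_dim}"
        )
    return vector


kv = {"item_emb_v3:sku-1": [0.1, 0.2, 0.3]}

print(fetch_embedding(kv, "item_emb", 3, 3, "sku-1"))

try:
    # A model trained on v2 embeddings must not silently read v3 vectors.
    fetch_embedding(kv, "item_emb", 2, 3, "sku-1")
except EmbeddingVersionError as err:
    print("refused:", err)
```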

10. Feature governance

  • Owner — engineer responsible.
  • Description — what does it actually mean.
  • Source — upstream tables / streams.
  • Tags — sensitive (PII), expensive, slow-to-refresh.
  • Usage — which models consume it; auto-discovered.
  • Quality stats — null rate, distribution, drift score.

Same hygiene as data governance (S32). A feature without an owner is a feature you'll delete during the next outage.
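
A registry entry covering the fields above can be as small as a dataclass that refuses to exist without an owner. This is a sketch only: `FeatureRecord`, its field names, and the example values are hypothetical, and a real catalog would discover usage and compute quality stats automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRecord:
    name: str
    owner: str
    description: str
    sources: tuple               # upstream tables / streams
    tags: frozenset = frozenset()  # e.g. {"pii", "expensive"}

    def __post_init__(self):
        if not self.owner.strip():
            raise ValueError(f"feature {self.name!r} has no owner")
        if not self.sources:
            raise ValueError(f"feature {self.name!r} has no upstream source")


def null_rate(values):
    """Fraction of None values; one of the quality stats worth tracking."""
    values = list(values)
    return sum(v is None for v in values) / len(values) if values else 0.0


record = FeatureRecord(
    name="user_clicks_last_7d",
    owner="ml-platform@example.com",
    description="Clicks by the user in the 7 days up to the request time.",
    sources=("raw.clicks",),
    tags=frozenset({"slow-to-refresh"}),
)
print(record.owner, null_rate([3, None, 5, None]))
```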

11. The build/buy decision

Feature store tools:

  • Feast — open source; you bring the infra. Most popular OSS.
  • Tecton — SaaS; biggest commercial; richest features.
  • SageMaker Feature Store — AWS-native.
  • Vertex AI Feature Store — GCP-native.
  • Databricks Feature Engineering — built into Lakehouse.
  • Build your own — fine for a single ML team; quickly painful for multi-team.

Rule of thumb:

  • 1–2 ML engineers, 1–3 models: skip the feature store. Use Spark + Redis.
  • 5+ ML engineers, 10+ models: feature store is mandatory.

12. Reality check

A first-feature-store rollout:

  1. Audit current features — find the duplicates.
  2. Move top-10 highest-traffic features into Feast (or chosen platform).
  3. Implement point-in-time training data generation.
  4. Wire monitoring: freshness, null rate, drift.
  5. Deprecate the duplicates; force-migrate model owners.

Plan for 3 months minimum. The benefit lands as the second and third model consume the same definitions — that's when you stop reinventing.

Video reference

▶︎ Feature Stores Explained (Made With ML)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Subarray Sum Equals K

  • Link: https://leetcode.com/problems/subarray-sum-equals-k/
  • Difficulty: Medium
  • Why this problem: Running aggregates with a hash map (prefix sums keyed by value), the algorithmic heart of "events in last X minutes" features.
  • Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

  • Define training/serving skew and 3 root causes.
  • Explain point-in-time joins and write the ASOF join syntax.
  • Sketch the offline/online architecture of a feature store.
  • Pick a freshness SLA appropriate to the use case.
  • Decide when a team needs a feature store vs Spark + Redis.
  • Solve subarray-sum-equals-k — prefix-sum + hashmap, the same shape as streaming-window aggregations.
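
For checking your own attempt once the time-box is up, here is one possible prefix-sum + hashmap solution. It is a sketch written for this session, not the official editorial.

```python
def subarray_sum(nums, k):
    """Count contiguous subarrays of nums whose elements sum to k."""
    seen = {0: 1}  # prefix sum -> how many times it has occurred so far
    prefix = 0
    total = 0
    for x in nums:
        prefix += x
        # A subarray ending here sums to k iff an earlier prefix equals prefix - k.
        total += seen.get(prefix - k, 0)
        seen[prefix] = seen.get(prefix, 0) + 1
    return total


print(subarray_sum([1, 1, 1], 2))   # prints 2
print(subarray_sum([1, 2, 3], 3))   # prints 2
print(subarray_sum([1, -1, 0], 0))  # prints 3
```

The running `prefix` plays the same role as the running count in a streaming aggregation: one pass over the events, constant work per event, and a lookup table in place of rescanning history.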
