Data Infrastructure for AI & Experimentation at Scale
A complete deep-dive into Data Engineering + Machine Learning + Ads Systems + Identity Graphs + GenAI + Experimentation — the foundational pillars behind modern streaming and content platforms.
What Does a Data & AI Platform Power?
Understanding the bigger picture — on any large-scale streaming platform (video, music, podcasts, live content), the data infrastructure powers:
- Identity systems — who is the user across devices
- Device graph — connect smart TV, mobile, tablet, laptop to same person/household
- Audience platform — build segments like "sports lovers" or "sci-fi bingers"
- Recommendation & personalization — what to show each user on the home feed
- Measurement & attribution — did a promotion lead to engagement or subscription?
- Experimentation — A/B testing thousands of changes simultaneously
- Reporting & insights — dashboards + GenAI agents for analytics
- Ad systems — targeting, bidding, measurement for ad-supported tiers
This is Applied ML + Data Platform + Experimentation + GenAI — not pure ML research.
Table of Contents
- The Event Backbone — Ingestion & Streaming
- Machine Learning Fundamentals
- The Feature Platform
- Ads / Audience / Recommendation Systems
- Identity Resolution & Device Graph
- Data Engineering at Scale
- ML System Design
- The Experimentation Platform
- Gen AI / LLM / Agents
- SQL & Data Modeling
- KPIs / Objective Functions
- MLOps / Production ML
- Data Quality & Observability
- End-to-End Architecture
1. The Event Backbone — Ingestion & Streaming
What Needs to Be Captured
On a streaming platform, user interactions generate a staggering volume of telemetry:
| Event Type | Examples | Volume (large platform) |
|---|---|---|
| Playback events | play, pause, seek, buffer, complete | ~50B/day |
| Navigation events | impression, click, scroll, search | ~30B/day |
| Engagement signals | like, save, share, add-to-list | ~2B/day |
| Device/context | device type, OS, network, geo, time | Attached to every event |
| Content metadata | genre, duration, cast, release date | Updated in batch |
Streaming Architecture — Apache Kafka at the Core
┌──────────────┐ ┌──────────────────┐ ┌────────────────────┐
│ Client SDKs │────▶│ Kafka Clusters │────▶│ Stream Processors │
│ (mobile, │ │ (partitioned by │ │ (Flink / Spark │
│ web, TV, │ │ user_id or │ │ Structured │
│ console) │ │ device_id) │ │ Streaming) │
└──────────────┘ └──────────────────┘ └────────────────────┘
│ │
▼ ▼
┌────────────────┐ ┌────────────────────┐
│ Raw Event │ │ Derived Streams │
│ Data Lake │ │ (sessionized, │
│ (S3/ADLS/GCS) │ │ enriched, │
│ │ │ aggregated) │
└────────────────┘ └────────────────────┘
Key Design Decisions
Schema Registry (Avro/Protobuf): Every event must conform to a registered schema. Without this, downstream consumers break constantly. Use Apache Avro with Confluent Schema Registry or Protobuf definitions checked into version control. Schema evolution rules (backward/forward compatibility) are critical.
Partitioning Strategy: Partition by user_id for user-centric analytics (session analysis, feature computation). Partition by content_id for content-centric tasks (popularity counters, trending). Some platforms maintain dual topics — one partitioned each way.
Exactly-Once Semantics: Kafka supports idempotent producers and transactional writes. For streaming jobs consuming from Kafka → writing to a data lake, use Flink checkpointing with exactly-once sink connectors to avoid duplicate or missing events.
Backpressure & Dead Letter Queues: When a consumer can't keep up, events must be buffered, not dropped. Dead letter queues capture malformed events for later reprocessing rather than silently discarding them.
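Below is a minimal sketch of the dead-letter-queue pattern using kafka-python: consume, validate the payload's required fields, and route anything malformed to a DLQ topic instead of dropping it. The topic names, consumer group, and is_valid check are illustrative, not from any real deployment.

# Minimal DLQ sketch: malformed events are parked on a side topic for reprocessing.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "playback-events",                      # hypothetical source topic
    bootstrap_servers="localhost:9092",
    group_id="feature-pipeline",
)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

REQUIRED_FIELDS = {"event_id", "event_type", "timestamp_ms", "user_id"}

def is_valid(payload: dict) -> bool:
    # Cheap structural check; a schema registry would enforce this properly.
    return REQUIRED_FIELDS.issubset(payload)

for message in consumer:
    try:
        event = json.loads(message.value)
        if not is_valid(event):
            raise ValueError("missing required fields")
        # ... hand the event to the normal processing path ...
    except ValueError:
        # Malformed event: send to the DLQ topic instead of silently discarding.
        producer.send("playback-events-dlq", message.value)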
Event Schema Design — Example
{
"event_id": "uuid-v4",
"event_type": "playback.started",
"timestamp_ms": 1712419200000,
"user_id": "u_abc123",
"session_id": "s_xyz789",
"content_id": "c_movie_456",
"device": {
"type": "smart_tv",
"os": "tvOS",
"app_version": "5.12.0"
},
"context": {
"page": "home_feed",
"row": "continue_watching",
"position": 3,
"algorithm": "collab_filter_v2",
"experiment_id": "exp_2026_q2_reco_v3",
"variant": "treatment_b"
},
"content_metadata": {
"genre": ["sci-fi", "thriller"],
"duration_sec": 7200,
"release_year": 2025
}
}
Why context.algorithm and context.experiment_id matter: They tie every interaction back to the model and experiment that produced the recommendation. Without this, you can't measure model performance or run valid A/B tests.
2. Machine Learning Fundamentals
Supervised Learning
- Regression — predict continuous values (e.g., predicted watch time)
- Linear Regression — fit a line using ordinary least squares. Assumes a linear relationship between features and target. Closed-form solution: $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$.
- Polynomial Regression — extend linear regression with polynomial features for non-linear relationships
- Classification — predict categories (e.g., will user click?)
- Logistic Regression — binary classification using the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$. Despite the name, it's a classifier, not a regressor. Outputs probabilities; threshold at 0.5 by default.
- Decision Trees — recursive splitting on feature thresholds to minimize impurity (Gini or entropy). Human-interpretable but prone to overfitting.
- Random Forest — ensemble of decision trees trained on bootstrap samples (bagging). Each tree sees a random subset of features. Reduces variance while maintaining low bias. Typically 100-500 trees.
- Gradient Boosting (XGBoost, LightGBM, CatBoost) — sequential trees where each new tree corrects the errors of the ensemble so far. Minimizes a loss function via gradient descent in function space. XGBoost uses a regularized objective: $\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k)$, where $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$ penalizes the number of leaves and the leaf weights. LightGBM is faster with histogram-based splitting and leaf-wise growth.
- Neural Networks — layers of neurons with non-linear activations (ReLU, sigmoid, tanh). Universal approximators — can learn any function given sufficient depth and width.
Unsupervised Learning
- Clustering (KMeans) — partition data into K groups by minimizing within-cluster sum of squares. Sensitive to initialization (use KMeans++). Requires choosing K (elbow method, silhouette score).
- DBSCAN — density-based clustering. Finds arbitrarily shaped clusters. No need to specify K. Parameters: epsilon (neighborhood radius), min_samples.
- PCA (Principal Component Analysis) — project data onto orthogonal directions of maximum variance. Used for dimensionality reduction, visualization, and denoising. Keep components that explain >95% variance.
- Embeddings — dense vector representations of entities (users, items, words). Learned via neural networks. Similar entities have similar vectors in embedding space.
- Similarity search — find nearest neighbors in embedding space using cosine similarity, Euclidean distance, or dot product. Approximate Nearest Neighbor (ANN) algorithms: FAISS, ScaNN, HNSW for sub-linear search.
Important Concepts
- Bias vs Variance tradeoff — high bias = underfitting (model too simple), high variance = overfitting (model too complex). Goal: minimize total error = bias² + variance + irreducible noise.
- Overfitting — model memorizes training data, performs poorly on unseen data. Signs: training accuracy >> validation accuracy. Fix: more data, regularization, simpler model, early stopping, dropout.
- Regularization
- L1 (Lasso): adds $\lambda \sum_j |w_j|$ to the loss. Encourages sparsity — some weights become exactly zero. Good for feature selection.
- L2 (Ridge): adds $\lambda \sum_j w_j^2$ to the loss. Shrinks weights evenly toward zero. Prevents any single feature from dominating.
- Elastic Net: combines L1 + L2 regularization.
- Dropout (neural networks): randomly zero out neurons during training. Acts as ensemble of sub-networks.
- Cross Validation — evaluate model on multiple train/test splits. K-fold: split data into K parts, train on K-1, validate on 1, rotate. Stratified K-fold preserves class distribution.
- Feature Engineering — creating useful input features from raw data. Often the #1 differentiator in applied ML. Examples: time-since-last-interaction, rolling averages, ratio features, cross features.
- Feature Scaling — StandardScaler (zero mean, unit variance) or MinMaxScaler (0 to 1). Critical for distance-based algorithms (KNN, SVM) and gradient descent optimization.
- Handling missing values — imputation (mean, median, mode), indicator columns ("is_missing"), or model-based (KNN imputation, iterative imputer). Never silently drop rows in production.
- Class imbalance — when positive class is rare (e.g., 1% conversion rate). Techniques: SMOTE (synthetic oversampling), class weights in loss function, undersampling majority class, focal loss.
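A small scikit-learn sketch tying the last two points together (stratified K-fold plus class weights) on synthetic data with a roughly 1% positive rate; the dataset and hyperparameters are illustrative.

# Class imbalance handled with class weights, evaluated with stratified K-fold CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# ~1% positive rate, similar to a conversion-prediction problem
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)

# class_weight="balanced" up-weights the rare positive class in the loss
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified folds preserve the 99/1 class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(f"PR AUC per fold: {np.round(scores, 3)}")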
Metrics
| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | % of correct predictions | Balanced classes only |
| Precision | Of predicted positive, how many are actually positive | When false positives are costly (spam detection) |
| Recall | Of actual positives, how many did we catch | When false negatives are costly (fraud detection) |
| F1 Score | Harmonic mean of precision and recall | Imbalanced classes, need balance |
| ROC AUC | Area under ROC curve — tradeoff between TPR and FPR | Binary classification, threshold-independent |
| PR AUC | Area under Precision-Recall curve | Highly imbalanced datasets |
| Log Loss | Penalizes confident wrong predictions | CTR prediction, calibrated probabilities |
| RMSE | Root Mean Squared Error | Regression, penalizes large errors |
| MAE | Mean Absolute Error | Regression, robust to outliers |
| NDCG | Normalized Discounted Cumulative Gain | Ranking quality |
| MAP | Mean Average Precision | Information retrieval |
| MRR | Mean Reciprocal Rank | First relevant result position |
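To make the ranking rows concrete, here is a minimal NDCG@k helper; the graded relevance labels in the example are made up.

# NDCG@k: graded relevances (e.g., 0 = skip, 1 = click, 2 = complete) in ranked order.
import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(rank + 1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Items as the model ranked them, with their true relevance grades
print(ndcg_at_k([2, 0, 1, 0, 2], k=5))   # ≈ 0.87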
3. The Feature Platform
The Feature Engineering Problem
ML models need features. Features come from raw events. The challenge:
- Training needs features computed over historical data (batch)
- Serving needs features computed in real-time (online)
- The features must be identical — otherwise you get training-serving skew, the silent killer of ML systems
Feature Store Architecture
(Kafka / Data Lake)"] --> Batch["⏱️ Batch Pipeline
(Spark / dbt)"] Raw --> Stream["⚡ Stream Pipeline
(Flink / Spark Streaming)"] Batch --> Offline["🗄️ Offline Store
(Hive / Delta Lake / BigQuery)"] Stream --> Online["⚡ Online Store
(Redis / DynamoDB / Bigtable)"] Offline --> Training["🤖 Model Training"] Online --> Serving["🔮 Model Serving"] Offline -.->|"point-in-time join"| Training Batch -.->|"backfill"| Online style Online fill:#ff9800,color:#fff style Offline fill:#2196f3,color:#fff
Types of Features on a Streaming Platform
User-Level Features (computed per user)
user_total_watch_hours_7d — rolling 7-day watch time
user_genre_affinity_vector — softmax over genre engagement
user_avg_session_length_30d — average session in last 30 days
user_skip_rate_7d — fraction of content abandoned < 30s
user_search_to_play_ratio_7d — how often search leads to a play
user_time_of_day_distribution — histogram of activity by hour
user_device_preference — primary device type
user_content_completion_rate_30d — fraction of content watched to end
user_days_since_signup — account age
user_subscription_tier — free / basic / premium
Content-Level Features (computed per item)
content_total_plays_7d — popularity signal
content_avg_completion_rate — quality signal
content_genre_tags — categorical
content_release_recency_days — freshness
content_avg_rating — explicit quality
content_play_to_impression_ratio — CTR proxy
Cross Features (user × content interaction)
user_has_watched_same_genre_7d — genre relevance
user_watched_same_creator — creator affinity
user_x_content_genre_overlap — cosine similarity of genre vectors
Point-in-Time Joins — The Most Critical Concept
When training a model, you must join features as they existed at the time of the event, not as they exist now. Otherwise you leak future information into training data.
# WRONG — uses current features for historical events
features = feature_store.get_latest("user_123")
# RIGHT — uses features as-of the event timestamp
features = feature_store.get_as_of("user_123", timestamp="2026-03-15T14:00:00Z")
Frameworks like Feast, Tecton, and Hopsworks handle point-in-time correctness automatically when you define feature views with timestamps.
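As a concrete illustration of point-in-time correctness, here is a small pandas sketch using merge_asof, which matches each training event to the most recent feature value at or before the event timestamp, never a later one. The column names and values are illustrative.

# Point-in-time join: each event picks up the latest feature value as of its timestamp.
import pandas as pd

events = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "event_ts": pd.to_datetime(["2026-03-10", "2026-03-20", "2026-03-15"]),
    "label": [1, 0, 1],
})
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "feature_ts": pd.to_datetime(["2026-03-01", "2026-03-15", "2026-03-01"]),
    "user_total_watch_hours_7d": [4.2, 9.8, 1.1],
})

# merge_asof requires both frames sorted by their time keys
training_set = pd.merge_asof(
    events.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts", right_on="feature_ts",
    by="user_id", direction="backward",       # only look backward in time
)
print(training_set[["user_id", "event_ts", "user_total_watch_hours_7d", "label"]])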
Online Feature Serving — Latency Matters
Model inference at serving time must complete within 50-100ms (including feature fetch + model forward pass). This means:
- Redis Cluster or DynamoDB for sub-5ms feature lookups
- Feature caching at the application layer for session-stable features
- Batch precomputation of expensive features (e.g., run nightly, push to online store)
- Feature embedding lookups — precomputed user/content embeddings stored in vector stores
4. Ads / Audience / Recommendation Systems
Ads System Architecture
The full flow of an ad system on a streaming platform runs from bid request through auction and ad selection to impression tracking, broken down below.
How Ad Auctions Work
When a user opens a streaming app and hits an ad break, the platform runs an auction in real-time:
- Bid Request — platform sends user context (anonymized) to demand-side platforms or internal ad server
- Bid Response — advertisers bid for the impression, specifying CPM (cost per 1000 impressions)
- Auction — typically a second-price auction (winner pays $0.01 above second-highest bid) or increasingly first-price auction
- Ad Selection — rank by $\text{eCPM} = \text{predicted CTR} \times \text{bid}$, also factoring in relevance and user experience
- Impression & Tracking — serve the ad, fire tracking pixels for viewability, completion, clicks
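A toy sketch of the selection step: rank bids by eCPM (predicted CTR × bid) and charge a generalized second price. The advertisers, numbers, and one-cent increment are made up; real auctions add quality scores, floor prices, and pacing.

# Rank by eCPM, charge the winner just enough per click to beat the runner-up's eCPM.
def ecpm(bid_cpc: float, pctr: float) -> float:
    # Expected revenue per 1000 impressions for a cost-per-click bid
    return bid_cpc * pctr * 1000

bids = [
    {"advertiser": "A", "bid_cpc": 1.20, "pctr": 0.010},
    {"advertiser": "B", "bid_cpc": 0.80, "pctr": 0.020},
    {"advertiser": "C", "bid_cpc": 2.00, "pctr": 0.004},
]

ranked = sorted(bids, key=lambda b: ecpm(b["bid_cpc"], b["pctr"]), reverse=True)
winner, runner_up = ranked[0], ranked[1]

# Generalized second price: pay the minimum CPC that still matches the runner-up's eCPM, plus one cent
price_per_click = ecpm(runner_up["bid_cpc"], runner_up["pctr"]) / (winner["pctr"] * 1000) + 0.01
print(winner["advertiser"], round(price_per_click, 2))   # B wins, pays less than its 0.80 bid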
Models Used in Ads
- CTR prediction (Click Through Rate) — probability user clicks on ad. Typically uses gradient boosted trees (XGBoost/LightGBM) or deep learning with embedding layers for sparse features. Training data: billions of impression-click pairs.
- CVR prediction (Conversion Rate) — probability of purchase after click. Harder than CTR due to sparser signal and longer attribution windows.
- Ranking models — order ads by expected value. eCPM = CTR × bid. Multi-objective ranking can also factor in user experience.
- Lookalike modeling — given a seed audience (e.g., people who bought product X), find similar users in the broader population. Typically uses embedding similarity or supervised classification.
- Audience segmentation — cluster users into meaningful groups using behavioral features. KMeans, DBSCAN, or learned segment embeddings.
- Collaborative filtering — recommend based on similar users' behavior. Matrix factorization (ALS), neural collaborative filtering.
- Content-based filtering — recommend based on item features (genre, tags, description embeddings).
- Embeddings for users & items — learn dense vector representations via two-tower models or autoencoders. User embedding captures preferences, item embedding captures attributes.
- Multi-armed bandits — explore vs exploit for ad/content selection. Thompson Sampling, Upper Confidence Bound (UCB), epsilon-greedy.
- Budget pacing — spend advertiser budget evenly over campaign duration. Proportional-integral-derivative (PID) controllers are common.
Recommendation Model Architectures
Two-Tower Model (Retrieval Stage)
User Tower Item Tower
┌─────────────────┐ ┌─────────────────┐
│ user features │ │ item features │
│ watch history │ │ genre, tags │
│ demographics │ │ popularity │
└────────┬────────┘ └────────┬────────┘
│ │
[Dense Layers] [Dense Layers]
│ │
▼ ▼
user_embedding item_embedding
(128-dim) (128-dim)
│ │
└──────── dot product ───────┘
│
similarity score
- Training: In-batch negatives or hard negative mining. Loss: softmax cross-entropy or sampled softmax.
- Serving: Precompute all item embeddings → build ANN index (FAISS, ScaNN, HNSW). At request time, compute user embedding → ANN lookup for top-K candidates in <10ms.
- Scale: Retrieve ~500 candidates from millions of items in under 10ms.
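A minimal PyTorch sketch of the two-tower setup with in-batch negatives; the feature dimensions, layer sizes, and temperature are illustrative.

# Two towers produce normalized embeddings; in-batch negatives via softmax cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, input_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so the dot product equals cosine similarity
        return F.normalize(self.net(x), dim=-1)

user_tower, item_tower = Tower(input_dim=64), Tower(input_dim=48)

# One training step: the i-th user's positive item is row i;
# every other item in the batch serves as a negative.
users = torch.randn(32, 64)          # batch of user feature vectors
items = torch.randn(32, 48)          # their clicked/watched items
u, v = user_tower(users), item_tower(items)
logits = u @ v.T / 0.05              # similarity matrix with temperature
labels = torch.arange(32)            # diagonal entries are the positives
loss = F.cross_entropy(logits, labels)
loss.backward()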
Deep Ranking Model (Ranking Stage)
Takes ~500 candidates from retrieval and scores them:
Input Features:
- User features (dense + sparse)
- Item features (dense + sparse)
- Cross features (user × item)
- Context features (time, device, page)
│
[Embedding Layer — sparse features]
│
[Concatenation with dense features]
│
[Multi-Layer Perceptron or DCN-v2]
│
[Multi-Task Heads]
├── P(click)
├── P(watch > 50%)
├── P(complete)
├── P(like)
└── P(add to list)
│
[Weighted combination → final score]
Multi-Task Learning is essential because optimizing for clicks alone leads to clickbait. The final score is a weighted combination of the task heads: $\text{score} = w_1\,P(\text{click}) + w_2\,P(\text{watch} > 50\%) + w_3\,P(\text{complete}) + w_4\,P(\text{like}) + w_5\,P(\text{add to list})$, where the weights $w_k$ encode how much each engagement type is valued.
Sequence Models for Watch History
User watch history is sequential — order matters:
- Transformer-based: Self-attention over recent N items watched. Captures temporal patterns (binge-watching a series).
- SASRec (Self-Attentive Sequential Recommendation): Causal attention mask on item sequence to predict next interaction.
- GRU4Rec: RNN-based session recommendation. Lighter weight, still effective for session-based reco.
Content Cold Start — The Chicken-and-Egg Problem
New content has zero engagement signal. How do you recommend something nobody has watched?
| Strategy | How It Works | Limitations |
|---|---|---|
| Content-based features | Use metadata (genre, cast, director, synopsis embedding) to find similar existing content | Ignores personal taste |
| Explore/exploit | Thompson Sampling or epsilon-greedy — intentionally show new content to a small % of users | Hurts short-term metrics |
| Contextual bandits | Use contextual features (user segment, time of day) to decide who sees new content | Requires fast feedback loop |
| Creator-based transfer | If a creator has a track record, use their historical performance as a prior | Only for known creators |
| Editorial boost | Curators manually promote content for initial impressions | Doesn't scale |
Best systems combine content-based warm-starting + explore/exploit for initial signals, then transition to collaborative filtering once enough data exists.
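A toy Thompson Sampling loop for the explore/exploit step described above: each new title keeps a Beta posterior over its play-through rate, and impressions go to the title with the highest sampled rate. The titles, rates, and counts are synthetic.

# Thompson Sampling over three cold-start titles with Beta(1, 1) priors.
import numpy as np

rng = np.random.default_rng(0)
true_rates = {"new_show_a": 0.12, "new_show_b": 0.07, "new_show_c": 0.18}
posteriors = {c: {"alpha": 1.0, "beta": 1.0} for c in true_rates}

for _ in range(5_000):
    # Sample a plausible engagement rate for each candidate and show the best one
    sampled = {c: rng.beta(p["alpha"], p["beta"]) for c, p in posteriors.items()}
    chosen = max(sampled, key=sampled.get)
    engaged = rng.random() < true_rates[chosen]
    posteriors[chosen]["alpha"] += engaged        # success
    posteriors[chosen]["beta"] += 1 - engaged     # failure

for c, p in posteriors.items():
    print(c, round(p["alpha"] / (p["alpha"] + p["beta"]), 3))   # posterior mean rate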
Handling Bias in Recommendation Data
All training data is biased — users can only engage with what was shown.
Position Bias: Items in position 1 get more clicks regardless of relevance. Solution: train a position bias model separately and debias during training.
Selection Bias: Training data only contains items the model chose to show. Solution: inverse propensity scoring (IPS) — weight training examples by $1 / P(\text{item was shown to the user})$.
Popularity Bias: Popular items get shown more → get more engagement → appear more popular. Rich-get-richer feedback loop. Solution: diversity objectives + calibration to match user interest distribution.
5. Identity Resolution & Device Graph
The Fundamental Problem
How do you know that a smart TV, a mobile phone, a laptop, and a tablet belong to the same user or household? This is the identity resolution problem.
Methods
- Deterministic matching — exact match on email, login, phone number, account ID. High precision, low recall. Requires authenticated sessions.
- Probabilistic matching — statistical models using IP address, Wi-Fi network, location patterns, timing correlation. Lower precision, higher recall.
- Graph-based identity resolution — build a graph of devices and signals, run connected components or community detection.
- Embedding similarity — learn device/user embeddings from interaction patterns, cluster by cosine similarity.
- Graph Neural Networks (GNN) — use message passing on the device graph for more sophisticated entity resolution.
Device Graph Structure
A graph where nodes are devices/users/households and edges represent observed relationships:
Key Graph Algorithms
- Connected Components — find clusters of related devices. Simple BFS/DFS or union-find (see the sketch after this list). At scale, use graph-parallel frameworks (GraphX, Pregel) for distributed connected components.
- PageRank — rank importance of nodes. The "primary device" in a household often has the highest PageRank.
- Node Embeddings (Node2Vec) — random walks on the graph → skip-gram model → dense vector per node. Similar devices end up close in embedding space.
- GraphSAGE — inductive GNN that learns node representations by aggregating features from local neighborhoods. Scales to large graphs because it samples neighborhoods rather than using the full graph.
- Link Prediction — predict missing edges (e.g., does this phone belong to this household?). Train a classifier on node-pair features: common neighbors, Jaccard similarity, learned embeddings.
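A minimal single-process union-find sketch for the connected-components step above, collapsing device and account nodes into household clusters from pairwise match edges. The edges are illustrative; at real scale this runs distributed (e.g., on Spark), not in one process.

# Union-find with path halving; edges come from deterministic and probabilistic matchers.
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

edges = [("tv_1", "acct_42"), ("phone_7", "acct_42"), ("laptop_3", "phone_7"),
         ("tablet_9", "acct_99")]

uf = UnionFind()
for a, b in edges:
    uf.union(a, b)

clusters = defaultdict(set)
for node in {n for e in edges for n in e}:
    clusters[uf.find(node)].add(node)
print(dict(clusters))   # two clusters: the acct_42 household and the acct_99 household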
Identity Graph at Scale
At streaming platform scale (100M+ households), the identity graph has:
- Billions of nodes (devices, IPs, accounts, cookies)
- Hundreds of billions of edges
- Continuous updates — new devices appear, IPs rotate, accounts merge
This requires:
- Graph databases: Neo4j, JanusGraph, Amazon Neptune, or custom solutions on Spark GraphX
- Incremental graph updates — don't recompute from scratch; process new edges as they arrive
- Privacy-compliant design — hash/anonymize PII, respect opt-outs, comply with GDPR/CCPA
6. Data Engineering at Scale
Core Concepts Deep Dive
- ETL vs ELT — Traditional ETL transforms before loading to warehouse. Modern ELT loads raw data to lake first, transforms in-place using Spark or dbt. ELT is dominant because storage is cheap, and you preserve raw data.
- Data Lakes — store raw data in any format (S3, ADLS Gen2, GCS). Use open table formats (Delta Lake, Apache Iceberg, Apache Hudi) for ACID transactions, time travel, schema evolution on the lake.
- Data Warehouse — structured, optimized for analytics. Columnar storage, query optimization. Snowflake, BigQuery, Redshift, Synapse, Databricks SQL.
- Lakehouse — hybrid: data lake storage + warehouse query performance. Delta Lake on Databricks, Apache Iceberg on Trino. Eliminates the need to maintain separate lake and warehouse.
- Batch vs Streaming — batch processes data in scheduled chunks (hourly/daily). Streaming processes events as they arrive (Kafka + Flink). Lambda architecture runs both; Kappa architecture uses streaming for everything.
- Apache Spark — distributed compute engine. RDDs → DataFrames → Structured Streaming. Key concepts: lazy evaluation, partitioning, shuffle, catalyst optimizer, adaptive query execution (AQE in Spark 3+).
- Apache Flink — true stream processing with event-time semantics, watermarks, exactly-once guarantees. Better than Spark Streaming for low-latency use cases.
- Presto / Trino — distributed SQL query engine for interactive analytics on data lakes. Federated queries across multiple data sources.
- Parquet — columnar storage format. Efficient for analytics (read only needed columns). Supports predicate pushdown, dictionary encoding, run-length encoding.
- Feature Store — centralized repository for ML features (Feast, Tecton, Hopsworks). Ensures consistency between training and serving.
- Apache Airflow / Dagster / Prefect — workflow orchestration. Define DAGs (directed acyclic graphs) of tasks with dependencies. Airflow is most popular but Dagster has better dev experience and asset-based paradigm.
- Kafka — distributed event streaming. Partitioned, replicated, durable. Key concepts: topics, partitions, consumer groups, offset management, exactly-once semantics.
Data at Scale — Common Challenges
- Data partitioning — split data by date/region for faster queries. Partition pruning avoids scanning irrelevant data.
- Data skew — uneven data distribution across partitions. One partition with 100x more data than others. Fix: salting keys (see the sketch after this list), repartitioning, broadcast joins for small tables.
- Joins at scale — broadcast join (small table fits in memory), sort-merge join (both tables large, co-partitioned), shuffle hash join (default, most expensive).
- Window functions — compute aggregates across rows related to current row. Essential for sessionization, rolling features, ranking.
- Late-arriving data — events that arrive after the processing window has closed. Handle with watermarks (Flink), merge-on-read (Iceberg/Hudi), or reprocessing.
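A sketch of key salting in PySpark, assuming a SparkSession named spark and two DataFrames, events (large, skewed) and content (small), are already loaded; column names are illustrative.

# Break up a hot join key by spreading it over NUM_SALTS synthetic sub-keys.
from pyspark.sql import functions as F

NUM_SALTS = 16

# Add a random salt to the skewed side (billions of events for a few hot titles)
events_salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Explode the small side so every (content_id, salt) combination exists
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
content_salted = content.crossJoin(salts)

# Join on (content_id, salt): the hot key is now spread across 16 partitions
joined = events_salted.join(content_salted, on=["content_id", "salt"], how="inner")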
Typical Data Pipeline
Data Lake (Parquet/Iceberg) ──▶ Spark / dbt Transform ──▶ Feature Store ──▶ ML Model Training ──▶ Model Serving ──▶ Predictions API
                                        └──▶ Analytics Warehouse ──▶ Dashboards
Sessionization — Deceptively Hard
A "session" isn't straightforward. Is it timeout-based (30 min inactivity)? Content-boundary-based (new show = new session)? Device-specific?
-- Sessionization using inactivity gap (30 minutes)
WITH events_with_gap AS (
SELECT
user_id,
event_timestamp,
LAG(event_timestamp) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) AS prev_timestamp,
CASE
WHEN EXTRACT(EPOCH FROM event_timestamp - LAG(event_timestamp)
OVER (PARTITION BY user_id ORDER BY event_timestamp)) > 1800
THEN 1
ELSE 0
END AS new_session_flag
FROM raw_events
)
SELECT
user_id,
event_timestamp,
SUM(new_session_flag) OVER (
PARTITION BY user_id ORDER BY event_timestamp
) AS session_id
FROM events_with_gap;
Real-time sessionization in Flink uses session windows with a gap timeout — but requires careful handling of late-arriving events and cross-device sessions.
7. ML System Design
System Design Framework
When designing any ML system, structure it as:
- Problem definition — what are we solving?
- Metrics — how do we measure success? (offline + online)
- Data sources — what data do we have?
- Feature engineering — what features do we build?
- Model choice — what algorithm fits?
- Training pipeline — how do we train at scale?
- Serving architecture — batch vs real-time?
- Monitoring — what do we track in production?
- Retraining — when and how do we update?
- Experimentation — how do we A/B test?
ML System Pipeline
Real-Time vs Batch Training
| Aspect | Batch Training | Real-Time Training |
|---|---|---|
| Freshness | Hours to days stale | Minutes stale |
| Use case | Stable preferences | Trending content, viral items |
| Infrastructure | Spark + GPU cluster | Flink + parameter server |
| Complexity | Lower | Much higher |
| Typical approach | Daily/weekly retrain | Continuous embedding updates |
Most platforms use a hybrid: batch-train the full model daily/weekly, but update embedding tables in near-real-time for new content and shifting user preferences (as described in ByteDance's Monolith paper).
Model Serving Infrastructure
For a platform serving millions of concurrent users:
- Model format: TensorFlow SavedModel, ONNX, or TorchScript
- Serving framework: TensorFlow Serving, Triton Inference Server, BentoML, or custom gRPC services
- Dynamic batching: Group multiple requests into one GPU batch for throughput
- Caching: LRU cache on popular content embeddings; session-level cache for user context
- Fallback: If model inference fails, fall back to a rule-based ranker (popularity-based)
- Latency budget: <50ms p99 for the entire recommend-and-rank pipeline
Example Systems to Understand
- Content recommendation system (home feed ranking)
- Ad targeting and ranking system
- Audience segmentation system
- Search ranking system
- Attribution and measurement system
- Identity resolution system
- Reporting insights AI agent
8. The Experimentation Platform
Why Experimentation Infrastructure Is as Important as ML
A model is only as good as your ability to measure its impact. Streaming platforms run hundreds to thousands of concurrent experiments:
- New ranking models
- UI layout changes
- Content promotion strategies
- Notification timing
- Pricing experiments
- Ad targeting algorithms
Core Components
Experiment Config (variants, allocation, guardrails)
   ──▶ Assignment Service
   ──▶ Client SDK (get variant)
   ──▶ Events (tagged with experiment_id + variant)
   ──▶ Metrics Pipeline
   ──▶ Statistical Analysis
   ──▶ Experiment Dashboard
Assignment — Deterministic Hashing
import hashlib
def get_variant(user_id: str, experiment_id: str, num_variants: int) -> int:
"""Deterministic assignment: same user always gets same variant."""
hash_input = f"{user_id}:{experiment_id}"
hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
return hash_value % num_variants
Properties:
- Deterministic — same user always sees the same variant (no flickering)
- Uniform — SHA-256 gives a near-perfect uniform distribution
- Independent — different experiments use different hash inputs, so assignments are uncorrelated
- Stateless — no need to store assignments; recompute on every request
Core Statistical Concepts
- A/B testing — compare control (A) vs treatment (B) with random assignment
- Hypothesis testing — formulate the null hypothesis ($H_0$: no difference) and the alternative ($H_1$: the treatment is better)
- P-value — probability of observing the result (or one more extreme) under $H_0$. Typically reject $H_0$ if $p < 0.05$.
- Confidence interval — range where true value likely falls (typically 95%)
- Statistical significance — result unlikely due to chance
- Power analysis — determine required sample size before experiment
Sample Size Calculation
$n = \dfrac{2\,(z_{\alpha/2} + z_{\beta})^{2}\,\sigma^{2}}{\delta^{2}}$ per variant
Where:
- $z_{\alpha/2}$ = z-score for the significance level (1.96 for 95%)
- $z_{\beta}$ = z-score for power (0.84 for 80% power)
- $\sigma^2$ = variance of the metric
- $\delta$ = minimum detectable effect (MDE)
Typical MDEs for streaming platforms:
- Engagement metrics (watch hours): ±0.5%
- Retention metrics (day-7 retention): ±0.2%
- Revenue metrics (ARPU): ±1.0%
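A small calculation matching the formula above (two-sided test, equal-sized control and treatment), using SciPy; the baseline retention rate and MDE are illustrative.

# Per-variant sample size: n = 2 * (z_alpha/2 + z_beta)^2 * sigma^2 / delta^2
from scipy.stats import norm

def sample_size_per_variant(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)     # 1.96 for 95% confidence
    z_beta = norm.ppf(power)              # 0.84 for 80% power
    return int(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

# Day-7 retention ~ Bernoulli(0.40); detect an absolute lift of 0.2 points
p = 0.40
sigma2 = p * (1 - p)
print(sample_size_per_variant(sigma=sigma2 ** 0.5, mde=0.002))   # ≈ 940,000 users per variant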
Traffic Allocation & Mutual Exclusion
Total Traffic (100%)
├── Layer: Ranking (40%)
│ ├── Experiment A: New model v3 (50% control / 50% treatment)
│ └── Experiment B: Feature expansion (50/50)
│ Note: A and B are mutually exclusive within this layer
├── Layer: UI (30%)
│ ├── Experiment C: Card size (33/33/33)
│ └── Experiment D: Autoplay threshold (50/50)
├── Layer: Notifications (20%)
│ └── Experiment E: Send time optimization
└── Holdout (10%)
└── No experiments — clean baseline
Within a layer, experiments are mutually exclusive. Across layers, they are orthogonal (independent). This is Google's overlapping experiments architecture.
Variance Reduction Techniques
Raw metrics have high variance (some users binge 8 hours, most watch 30 minutes):
CUPED (Controlled-experiment Using Pre-Experiment Data): adjust each user's metric by their pre-experiment value, $Y_{\text{cuped}} = Y - \theta\,(X - \bar{X})$ with $\theta = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}$. Where $X$ is the pre-experiment metric value. Reduces variance by 30-50%.
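A minimal CUPED sketch in NumPy on synthetic data: fit theta from the pooled pre/post values, then adjust each user's metric by their pre-experiment value.

# CUPED adjustment: variance of the adjusted metric drops roughly by corr(X, Y)^2.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
pre = rng.gamma(shape=2.0, scale=1.5, size=n)            # pre-period watch hours
post = 0.8 * pre + rng.normal(0, 1.0, size=n) + 0.05     # experiment-period metric

theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

print("variance before:", round(post.var(), 3))
print("variance after: ", round(adjusted.var(), 3))      # substantially lower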
Stratified sampling: Bucket users by activity level, compute within-stratum estimates.
Winsorization: Cap extreme values at 99th percentile to reduce outlier influence.
Multiple Testing Correction
With hundreds of experiments, false positives are inevitable:
- Bonferroni: use a per-test threshold of $\alpha / m$ for $m$ tests (conservative)
- Benjamini-Hochberg (FDR control): rank p-values ascending, reject where $p_{(i)} \le \frac{i}{m}\,\alpha$ (sketch below)
- Always-valid sequential testing — allows peeking at results without inflating false positives
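A small sketch of the Benjamini-Hochberg procedure over a batch of experiment p-values at alpha = 0.05; the p-values are made up.

# Benjamini-Hochberg: reject everything up to the largest rank that passes its threshold.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresholds
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

pvals = [0.001, 0.008, 0.012, 0.04, 0.20, 0.65]
print(benjamini_hochberg(pvals))   # first three survive; 0.04 does not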
Guardrail Metrics
Every experiment must monitor guardrails — things that must NOT degrade:
| Guardrail | Threshold | Why |
|---|---|---|
| App crash rate | +0.0% | Reliability |
| Page load latency p99 | +50ms | Performance |
| Error rate | +0.1% | Stability |
| Customer support contacts | +5% | UX quality |
| Subscription cancellation rate | +0.5% | Revenue |
If any guardrail is violated, auto-pause the experiment.
Metrics for Experiments
- CTR — click through rate
- Conversion rate — purchases / impressions
- Revenue / ARPU — average revenue per user
- ROAS — Return on Ad Spend (revenue / ad spend)
- Engagement — time spent, interactions, completion rate
- Retention — day-1, day-7, day-28 return rate
9. Gen AI / LLM / Agents
LLM Core Topics
- Transformers — the architecture behind modern LLMs. Self-attention computes relevance between all token pairs: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$. Encoder-decoder (T5, BART), decoder-only (GPT, Llama), encoder-only (BERT).
- Embeddings — convert text/entities to dense vectors. Sentence transformers (all-MiniLM, E5, BGE) produce fixed-size embeddings for semantic similarity.
- Tokenization — BPE (Byte-Pair Encoding), SentencePiece, tiktoken. Subword tokenization balances vocabulary size and coverage.
- Vector databases — store and search embeddings at scale. Pinecone, Weaviate, Milvus, Qdrant, pgvector, Chroma. Support approximate nearest neighbor (ANN) search with HNSW or IVF indexes.
- RAG (Retrieval Augmented Generation) — retrieve relevant context from a knowledge base, inject into LLM prompt, then generate answer. Reduces hallucination and keeps knowledge current without retraining.
- Prompt engineering — few-shot examples, chain-of-thought reasoning, system prompts, output formatting. Temperature controls randomness.
- Fine-tuning — adapt a pre-trained LLM to a specific domain or task. LoRA (Low-Rank Adaptation) and QLoRA achieve this efficiently without full parameter updates.
- Evaluation — BLEU, ROUGE (automated), human eval, LLM-as-judge, factual accuracy, faithfulness to retrieved context.
- Hallucination — model generates plausible but incorrect information. Mitigate with RAG, grounding, and retrieval verification.
- Guardrails — input/output validation, content filtering, PII detection. Libraries: Guardrails AI, NeMo Guardrails.
- Agents — LLMs that can take actions using tools. ReAct pattern: Reason about what to do → Act using a tool → Observe the result → Repeat.
- Tool calling / Function calling — LLM generates structured output to invoke external functions (APIs, databases, code interpreters).
- Semantic search — search by meaning rather than keywords using embedding similarity.
- NL→SQL systems — convert natural language questions to SQL queries. Text2SQL with LLMs + schema awareness.
- Knowledge graphs + LLM — structured data enhancing LLM reasoning. GraphRAG combines graph traversal with retrieval.
RAG Architecture
User Question ──▶ Embed Query ──▶ Similarity Search in Vector DB ──▶ Top-K Relevant Documents ──▶ Construct Prompt (Question + Context) ──▶ LLM ──▶ Answer
RAG Pipeline Details
- Indexing Phase (offline):
- Chunk documents (512-1024 tokens, with overlap)
- Generate embeddings per chunk
- Store in vector database with metadata
- Query Phase (online):
- Embed the user query
- Retrieve top-K similar chunks (cosine similarity)
- (Optional) Rerank retrieved chunks with a cross-encoder
- Inject context + query into LLM prompt
- Generate answer with citations
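A compact sketch of the query phase using sentence-transformers for embeddings and plain cosine similarity in NumPy; the model name, chunk texts, and the generate() placeholder are illustrative stand-ins for whatever embedding model and LLM endpoint you use.

# Embed the question, score chunks by cosine similarity, build a grounded prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Day-7 retention is computed over users whose first session was 7 days ago.",
    "Quality watch hours exclude background and autoplay playback.",
    "The experiment dashboard refreshes metrics hourly.",
]
chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)

question = "How are quality watch hours defined?"
q_vec = encoder.encode([question], normalize_embeddings=True)[0]

scores = chunk_vecs @ q_vec                     # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:2]
context = "\n".join(chunks[i] for i in top_k)

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = generate(prompt)   # placeholder for the LLM call of your choice
print(prompt)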
Reporting Agent Architecture
This is increasingly common on streaming platforms — analysts ask questions in natural language and get SQL-backed answers with visualizations.
10. SQL & Data Modeling
SQL Deep Dive
- Joins — INNER (matching rows), LEFT (all from left + matching right), RIGHT (all from right + matching left), FULL OUTER (all rows), CROSS (cartesian product). In distributed systems, join order matters enormously for performance.
- Group by + Aggregations — SUM, COUNT, AVG, MIN, MAX. GROUP BY with HAVING for filtered aggregates.
- Window functions — the most powerful SQL feature for analytics (example after this list):
  - ROW_NUMBER() — sequential numbering within a partition
  - RANK() / DENSE_RANK() — rank with tie handling
  - LAG() / LEAD() — access the previous/next row
  - SUM() OVER (ORDER BY ... ROWS BETWEEN ...) — running totals
  - NTILE(n) — divide rows into n equal groups
- CTEs (Common Table Expressions) — WITH clause for readable, composable queries. Recursive CTEs for hierarchical data.
- Indexing — B-tree (default, range queries), hash (equality lookups), GiST/GIN (full-text search, arrays). Index selection is critical for query performance.
- Query optimization — EXPLAIN ANALYZE, predicate pushdown, partition pruning, broadcast hints.
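A small example tying several of the window functions above together; the table and column names are illustrative.

-- Per-user ranking by watch time, previous session start, and a running total
SELECT
    user_id,
    content_id,
    watch_seconds,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY watch_seconds DESC) AS rank_for_user,
    LAG(session_start) OVER (PARTITION BY user_id ORDER BY session_start) AS prev_session_start,
    SUM(watch_seconds) OVER (
        PARTITION BY user_id ORDER BY session_start
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_watch_seconds
FROM user_content_sessions;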
Data Modeling for Analytics
Star Schema — central fact table surrounded by dimension tables. Denormalized for query performance.
- Fact table — stores measurable events (impressions, plays, clicks, conversions) with foreign keys to dimensions and numeric measures.
- Dimension table — descriptive attributes (user demographics, content metadata, device info). Slowly changing dimensions (SCD Type 1/2) for tracking history.
- OLAP vs OLTP — analytics (column-oriented, aggregations, full scans) vs transactions (row-oriented, lookups, updates).
11. KPIs / Objective Functions
Choosing the right objective function for each system is critical:
| System | Objective | Why |
|---|---|---|
| CTR model | Log Loss (Binary Cross-Entropy) | Penalizes confident wrong predictions, produces calibrated probabilities |
| Content ranking | NDCG (Normalized Discounted Cumulative Gain) | Measures quality of ordered list, weights top positions higher |
| Recommendation | MAP / Recall@K | Evaluates relevance in top-K results |
| Audience segmentation | Silhouette Score / Calinski-Harabasz | Measures cluster quality and separation |
| Attribution | Incrementality (causal lift) | Measures true causal impact, not just correlation |
| Budget pacing | Spend vs Target deviation | Minimize under/over-delivery |
| Identity resolution | Precision / Recall / F1 of matches | Correct device-to-user mapping accuracy |
| Engagement | Composite: quality watch hours + diversity | Prevents Goodhart's Law from gaming single metrics |
The North Star Metric Problem
Streaming platforms have many metrics that often conflict:
| Metric | Optimizes For | Risk |
|---|---|---|
| Watch hours | Engagement | Autoplay addiction, low-quality binges |
| DAU / MAU | Retention | Doesn't capture depth |
| Content starts | Discovery | Doesn't measure satisfaction |
| Completion rate | Satisfaction | Penalizes long content |
| Revenue / ARPU | Business | May sacrifice long-term engagement |
Solution: define a composite metric that combines quality watch hours with diversity (as in the objective-function table above), where QualityWatchHours filters out background/autoplay playback — only engaged viewing counts.
12. MLOps / Production ML
Core MLOps Concepts
- Model deployment — serving predictions via API (real-time) or batch job (scheduled)
- Batch vs real-time inference — precompute predictions for all users nightly vs compute on-demand per request
- Feature store — consistent features for training and serving (prevents skew)
- Model versioning — track model artifacts, hyperparameters, training data version, metrics. Tools: MLflow, Weights & Biases, Neptune.
- Monitoring
- Prediction quality — track accuracy/AUC on incoming data vs hold-out
- Data drift — input feature distributions shift over time (KL divergence, PSI; a PSI sketch follows this list). Seasonal changes, new content catalog.
- Concept drift — relationship between features and target changes. User behavior evolves.
- Latency & throughput — p50/p95/p99 inference latency, requests per second
- Retraining pipelines — automated model refresh on new data. Can be scheduled (daily/weekly) or triggered by drift detection.
- Deployment strategies
- Shadow deployment — run new model alongside old, log predictions but don't serve to users. Compare outputs.
- Canary deployment — roll out to 1-5% of traffic first. Monitor guardrails. Gradually increase.
- Blue-green — two identical environments. Switch traffic entirely from blue (old) to green (new).
- A/B model testing — statistically rigorous comparison on live traffic with proper sample size.
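A minimal Population Stability Index (PSI) sketch for the data-drift check noted above: bin the training-time feature distribution, compare live traffic against it, and alert past a threshold (0.2 is a common rule of thumb). The data here is synthetic.

# PSI = sum((actual_frac - expected_frac) * ln(actual_frac / expected_frac)) over bins.
import numpy as np

def psi(expected, actual, bins=10):
    # Bin edges come from the expected (training-time) distribution
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_counts = np.bincount(np.digitize(expected, cuts), minlength=bins)
    a_counts = np.bincount(np.digitize(actual, cuts), minlength=bins)
    e_frac = np.clip(e_counts / len(expected), 1e-6, None)
    a_frac = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(1)
train_watch_hours = rng.gamma(2.0, 2.0, 500_000)       # distribution at training time
live_watch_hours = rng.gamma(2.0, 2.6, 50_000)         # shifted distribution in serving

value = psi(train_watch_hours, live_watch_hours)
print(round(value, 3), "→ investigate" if value > 0.2 else "→ OK")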
MLOps Architecture
Model Training (GPU Cluster) ──▶ Model Registry (MLflow / W&B) ──▶ Offline Evaluation (AUC, NDCG) ──▶ Deployment
Deployment paths: Shadow (log only) · Canary (5% traffic) · A/B Test (experiment) · Full (100% traffic)
All paths ──▶ Monitoring (drift + latency + quality) ──▶ on drift detected ──▶ Auto-Retrain ──▶ back to training
13. Data Quality & Observability
Why Data Quality Is the Biggest Risk
ML models are only as good as their input data. Common issues:
| Issue | Impact | Detection |
|---|---|---|
| Missing events (SDK bug) | Undercounting → wrong experiment results | Volume anomaly detection |
| Duplicate events (retry storms) | Overcounting → inflated metrics | Dedup by event_id |
| Schema changes (unannounced) | Pipeline breakages | Schema registry enforcement |
| Clock skew (device time wrong) | Feature computation errors | Server-side timestamp validation |
| Bot/fraud traffic | Pollutes training data | Behavioral anomaly detection |
| Late-arriving data | Incomplete aggregations | Watermark-based processing |
Data Quality Framework
Validation"] Schema --> Volume["📊 Volume
Checks"] Volume --> Freshness["⏱️ Freshness
SLAs"] Freshness --> Distribution["📈 Distribution
Checks"] Distribution --> Alert["🚨 Alert &
Circuit Breaker"] style Alert fill:#f44336,color:#fff
Tools: Great Expectations, dbt tests, Monte Carlo, Anomalo, Soda, or custom solutions on statistical process control.
Circuit Breakers: If input data quality drops below a threshold, automatically stop the ML training pipeline from consuming bad data. Better to serve a slightly stale model than one trained on corrupted data.
Data Lineage & Cataloging
- Lineage tracking: From raw event → derived table → feature → model → experiment metric. Tools: OpenLineage, DataHub, Amundsen.
- Data catalog: Searchable inventory of all datasets with ownership, freshness, schema, and usage statistics.
- SLA monitoring: Each critical table has a freshness SLA. Breaches trigger alerts.
14. End-to-End Architecture
Putting It All Together
┌─────────────────────────────────────────────────────────────────────────┐
│ CLIENT DEVICES │
│ (Smart TV, Mobile, Web, Console, Set-Top Box, Gaming Console) │
└─────────────┬───────────────────────────────────────────┬───────────────┘
│ events │ API calls
▼ ▼
┌─────────────────────────┐ ┌───────────────────────────────┐
│ EVENT INGESTION │ │ API GATEWAY │
│ ┌─────────────────┐ │ │ ┌────────────────────────┐ │
│ │ Schema Registry │ │ │ │ Experiment Assignment │ │
│ │ Kafka Clusters │ │ │ │ (deterministic hash) │ │
│ │ (multi-region) │ │ │ └────────────────────────┘ │
│ └────────┬────────┘ │ │ ┌────────────────────────┐ │
│ │ │ │ │ Recommendation API │ │
└────────────┼─────────────┘ │ │ retrieve → rank → │ │
│ │ │ rerank → serve │ │
▼ │ └────────────────────────┘ │
┌─────────────────────────────┐ └──────────────┬────────────────┘
│ STREAM PROCESSING │ │
│ (Flink / Spark Streaming) │ │
│ ┌───────────────────────┐ │ ┌──────────────▼────────────────┐
│ │ Sessionization │ │ │ ML SERVING │
│ │ Real-time features │ │ │ ┌────────────────────────┐ │
│ │ Real-time metrics │ │ │ │ Retrieval (ANN index) │ │
│ │ Anomaly detection │ │ │ │ Ranking (GPU/CPU) │ │
│ └───────────┬───────────┘ │ │ │ Feature Store (Redis) │ │
└───────────────┼─────────────┘ │ │ Model Registry │ │
│ │ └────────────────────────┘ │
▼ └───────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA LAKE │
│ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Raw Events │ │ Sessionized │ │ Feature │ │ Experiment │ │
│ │ (Parquet) │ │ Events │ │ Tables │ │ Results │ │
│ └──────────────┘ └───────────────┘ └──────────────┘ └─────────────┘ │
│ │
│ Storage: S3 / ADLS / GCS — Format: Delta Lake / Iceberg / Hudi │
│ Compute: Spark / Trino / dbt │
└─────────────────────────────────────────────────────────────────────────┘
Technology Choices — Practical Summary
| Component | Open Source | Managed Service |
|---|---|---|
| Event streaming | Apache Kafka | Confluent Cloud, Amazon MSK, Azure Event Hubs |
| Stream processing | Apache Flink, Spark Structured Streaming | Kinesis Data Analytics, Dataflow |
| Data lake storage | HDFS, MinIO | S3, ADLS Gen2, GCS |
| Table format | Delta Lake, Apache Iceberg, Apache Hudi | Databricks, Snowflake |
| Batch compute | Apache Spark, Trino/Presto | Databricks, EMR, Synapse, BigQuery |
| Feature store | Feast, Hopsworks | Tecton, SageMaker Feature Store, Vertex AI |
| ML training | PyTorch, TensorFlow, XGBoost | SageMaker, Vertex AI, Azure ML |
| Model serving | TF Serving, Triton, BentoML | SageMaker Endpoints, Vertex AI |
| Experiment platform | Custom (most common) | Eppo, Statsig, LaunchDarkly, Optimizely |
| Data quality | Great Expectations, dbt tests, Soda | Monte Carlo, Anomalo |
| Orchestration | Apache Airflow, Dagster, Prefect | MWAA, Cloud Composer, Astronomer |
| Data catalog | DataHub, Amundsen, OpenMetadata | Collibra, Alation |
Learning Path
Phase 1: Foundations
- ML basics — regression, classification, trees, neural networks
- Metrics — precision, recall, F1, AUC, NDCG
- Feature engineering techniques
- SQL deep dive — window functions, CTEs, optimization
Phase 2: Domain Knowledge
- Ads systems — CTR, ranking, auctions, attribution
- Recommendation systems — collaborative & content-based filtering, two-tower, sequence models
- Identity resolution — deterministic, probabilistic, graph-based
- A/B testing — hypothesis testing, power analysis, CUPED
Phase 3: Infrastructure
- Data engineering — Spark, Flink, Kafka, data lakes, Iceberg/Delta
- ML system design patterns
- Feature store architecture
- MLOps & deployment strategies
Phase 4: GenAI
- RAG architecture and implementation
- Agents & tool calling (ReAct pattern)
- NL to SQL
- Embeddings & vector databases
- LLM evaluation and guardrails
Priority Topics (Focus These First)
If limited on time, prioritize:
- ML basics — regression, classification, trees, metrics
- Feature engineering — the #1 differentiator in applied ML
- Recommendation systems — two-tower, ranking, cold start
- A/B testing & experimentation — fundamental to data-driven culture
- Data pipelines — Spark, Kafka, ETL, data lakes
- Identity resolution / device graph — unique cross-device challenge
- ML system design — end-to-end thinking
- RAG / LLM / Agents — the GenAI wave
- SQL & data modeling — the universal data language
- Metrics / KPIs — knowing what to optimize
Further Reading
- Google's Overlapping Experiments Paper — Layered experimentation architecture
- Monolith: Real-Time Reco System (ByteDance) — Real-time embedding updates
- Deep Neural Networks for YouTube Recommendations — Two-stage retrieve + rank
- CUPED: Variance Reduction
- Feast Feature Store — Open-source feature store
- Trustworthy Online Controlled Experiments (Kohavi et al.) — The bible of A/B testing
This is a living document — I'll keep going deeper into each topic as I learn more.