Data Infrastructure for AI & Experimentation at Scale

A comprehensive deep-dive into the data backbone powering ML, personalization, experimentation, and GenAI on modern streaming platforms

It covers Data Engineering + Machine Learning + Ads Systems + Identity Graphs + GenAI + Experimentation — the foundational pillars behind modern streaming and content platforms.


What Does a Data & AI Platform Power?

Understanding the bigger picture — on any large-scale streaming platform (video, music, podcasts, live content), the data infrastructure powers recommendations, ads targeting, experimentation, identity resolution, and GenAI features.

This is Applied ML + Data Platform + Experimentation + GenAI — not pure ML research.


Table of Contents

  1. The Event Backbone — Ingestion & Streaming
  2. Machine Learning Fundamentals
  3. The Feature Platform
  4. Ads / Audience / Recommendation Systems
  5. Identity Resolution & Device Graph
  6. Data Engineering at Scale
  7. ML System Design
  8. The Experimentation Platform
  9. Gen AI / LLM / Agents
  10. SQL & Data Modeling
  11. KPIs / Objective Functions
  12. MLOps / Production ML
  13. Data Quality & Observability
  14. End-to-End Architecture

1. The Event Backbone — Ingestion & Streaming

What Needs to Be Captured

On a streaming platform, user interactions generate a staggering volume of telemetry:

| Event Type | Examples | Volume (large platform) |
|---|---|---|
| Playback events | play, pause, seek, buffer, complete | ~50B/day |
| Navigation events | impression, click, scroll, search | ~30B/day |
| Engagement signals | like, save, share, add-to-list | ~2B/day |
| Device/context | device type, OS, network, geo, time | Attached to every event |
| Content metadata | genre, duration, cast, release date | Updated in batch |

Streaming Architecture — Apache Kafka at the Core

┌──────────────┐     ┌──────────────────┐     ┌────────────────────┐
│ Client SDKs  │────▶│  Kafka Clusters  │────▶│ Stream Processors  │
│ (mobile,     │     │  (partitioned by │     │ (Flink / Spark     │
│  web, TV,    │     │   user_id or     │     │  Structured        │
│  console)    │     │   device_id)     │     │  Streaming)        │
└──────────────┘     └──────────────────┘     └─────────┬──────────┘
                                                        │
                              ┌─────────────────────────┤
                              ▼                         ▼
                     ┌────────────────┐       ┌────────────────────┐
                     │ Raw Event      │       │ Derived Streams    │
                     │ Data Lake      │       │ (sessionized,      │
                     │ (S3/ADLS/GCS)  │       │  enriched,         │
                     │                │       │  aggregated)       │
                     └────────────────┘       └────────────────────┘

Key Design Decisions

Schema Registry (Avro/Protobuf): Every event must conform to a registered schema. Without this, downstream consumers break constantly. Use Apache Avro with Confluent Schema Registry or Protobuf definitions checked into version control. Schema evolution rules (backward/forward compatibility) are critical.

Partitioning Strategy: Partition by user_id for user-centric analytics (session analysis, feature computation). Partition by content_id for content-centric tasks (popularity counters, trending). Some platforms maintain dual topics — one partitioned each way.

Exactly-Once Semantics: Kafka supports idempotent producers and transactional writes. For streaming jobs consuming from Kafka → writing to a data lake, use Flink checkpointing with exactly-once sink connectors to avoid duplicate or missing events.

Backpressure & Dead Letter Queues: When a consumer can't keep up, events must be buffered, not dropped. Dead letter queues capture malformed events for later reprocessing rather than silently discarding them.
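
A minimal producer sketch (Python, confluent-kafka) illustrating two of the decisions above: keying by user_id so a user's events stay ordered within one partition, and enabling idempotence on the producer side. The topic name and broker address are placeholders, and full exactly-once delivery additionally needs transactional writes on the consumer/processor side.

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-broker:9092",  # placeholder address
    "enable.idempotence": True,                # broker dedups producer retries
    "acks": "all",                             # wait for the full ISR to ack
    "compression.type": "zstd",
})

def send_event(event_json: bytes, user_id: str) -> None:
    # Keying by user_id keeps one user's events ordered within a partition
    producer.produce("playback-events", key=user_id, value=event_json)

producer.flush()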

Event Schema Design — Example

{
  "event_id": "uuid-v4",
  "event_type": "playback.started",
  "timestamp_ms": 1712419200000,
  "user_id": "u_abc123",
  "session_id": "s_xyz789",
  "content_id": "c_movie_456",
  "device": {
    "type": "smart_tv",
    "os": "tvOS",
    "app_version": "5.12.0"
  },
  "context": {
    "page": "home_feed",
    "row": "continue_watching",
    "position": 3,
    "algorithm": "collab_filter_v2",
    "experiment_id": "exp_2026_q2_reco_v3",
    "variant": "treatment_b"
  },
  "content_metadata": {
    "genre": ["sci-fi", "thriller"],
    "duration_sec": 7200,
    "release_year": 2025
  }
}

Why context.algorithm and context.experiment_id matter: They tie every interaction back to the model and experiment that produced the recommendation. Without this, you can't measure model performance or run valid A/B tests.


2. Machine Learning Fundamentals

Supervised Learning

Unsupervised Learning

Important Concepts

Metrics

| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | % of correct predictions | Balanced classes only |
| Precision | Of predicted positives, how many are actually positive | When false positives are costly (spam detection) |
| Recall | Of actual positives, how many did we catch | When false negatives are costly (fraud detection) |
| F1 Score | Harmonic mean of precision and recall | Imbalanced classes, need balance |
| ROC AUC | Area under ROC curve — tradeoff between TPR and FPR | Binary classification, threshold-independent |
| PR AUC | Area under Precision-Recall curve | Highly imbalanced datasets |
| Log Loss | Penalizes confident wrong predictions | CTR prediction, calibrated probabilities |
| RMSE | Root Mean Squared Error | Regression, penalizes large errors |
| MAE | Mean Absolute Error | Regression, robust to outliers |
| NDCG | Normalized Discounted Cumulative Gain | Ranking quality |
| MAP | Mean Average Precision | Information retrieval |
| MRR | Mean Reciprocal Rank | First relevant result position |
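
A small, self-contained sketch of one ranking metric, NDCG@k, using the linear-gain formulation; in practice you would use a library implementation (e.g. scikit-learn's ndcg_score).

import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k for one ranked list; relevances are graded relevance labels
    in the order the model ranked the items."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Example: model ranked four items whose true relevances are [3, 2, 0, 1]
print(ndcg_at_k([3, 2, 0, 1], k=4))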

3. The Feature Platform

The Feature Engineering Problem

ML models need features, and features come from raw events. The challenge: the same feature values must be computed consistently for offline training and low-latency online serving, without leaking future information.

Feature Store Architecture

flowchart TD
    Raw["📊 Raw Events<br/>(Kafka / Data Lake)"] --> Batch["⏱️ Batch Pipeline<br/>(Spark / dbt)"]
    Raw --> Stream["⚡ Stream Pipeline<br/>(Flink / Spark Streaming)"]
    Batch --> Offline["🗄️ Offline Store<br/>(Hive / Delta Lake / BigQuery)"]
    Stream --> Online["⚡ Online Store<br/>(Redis / DynamoDB / Bigtable)"]
    Offline --> Training["🤖 Model Training"]
    Online --> Serving["🔮 Model Serving"]
    Offline -.->|"point-in-time join"| Training
    Batch -.->|"backfill"| Online
    style Online fill:#ff9800,color:#fff
    style Offline fill:#2196f3,color:#fff

Types of Features on a Streaming Platform

User-Level Features (computed per user)

user_total_watch_hours_7d           rolling 7-day watch time
user_genre_affinity_vector          softmax over genre engagement
user_avg_session_length_30d         average session in last 30 days
user_skip_rate_7d                   fraction of content abandoned < 30s
user_search_to_play_ratio_7d       how often search leads to a play
user_time_of_day_distribution       histogram of activity by hour
user_device_preference              primary device type
user_content_completion_rate_30d    fraction of content watched to end
user_days_since_signup              account age
user_subscription_tier              free / basic / premium

Content-Level Features (computed per item)

content_total_plays_7d              popularity signal
content_avg_completion_rate         quality signal
content_genre_tags                  categorical
content_release_recency_days        freshness
content_avg_rating                  explicit quality
content_play_to_impression_ratio    CTR proxy

Cross Features (user × content interaction)

user_has_watched_same_genre_7d     — genre relevance
user_watched_same_creator          — creator affinity
user_x_content_genre_overlap       — cosine similarity of genre vectors

Point-in-Time Joins — The Most Critical Concept

When training a model, you must join features as they existed at the time of the event, not as they exist now. Otherwise you leak future information into training data.

# WRONG — uses current features for historical events
features = feature_store.get_latest("user_123")

# RIGHT — uses features as-of the event timestamp
features = feature_store.get_as_of("user_123", timestamp="2026-03-15T14:00:00Z")

Frameworks like Feast, Tecton, and Hopsworks handle point-in-time correctness automatically when you define feature views with timestamps.
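
A point-in-time join can also be sketched with pandas' merge_asof, which attaches the most recent feature value computed before each event timestamp; the column names and values here are illustrative.

import pandas as pd

# Training events and their timestamps
events = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "event_ts": pd.to_datetime(["2026-03-15 14:00", "2026-03-20 09:00"]),
}).sort_values("event_ts")

# Feature snapshots with the time at which each value became known
features = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "feature_ts": pd.to_datetime(["2026-03-10", "2026-03-18"]),
    "user_total_watch_hours_7d": [6.5, 9.0],
}).sort_values("feature_ts")

training = pd.merge_asof(
    events, features,
    left_on="event_ts", right_on="feature_ts",
    by="user_id", direction="backward",  # only feature values known at event time
)
print(training)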

Online Feature Serving — Latency Matters

Model inference at serving time must complete within 50-100ms (including feature fetch + model forward pass). This means features must be precomputed and read from a low-latency online store (Redis, DynamoDB, Bigtable) at request time, not derived from raw events on the fly.
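
A sketch of what an online feature read often looks like, assuming one Redis hash per user holding JSON-encoded feature values (the key layout and host are hypothetical):

import json
import redis

r = redis.Redis(host="feature-store.internal", port=6379, decode_responses=True)

def get_user_features(user_id: str) -> dict:
    """Single round-trip hash read; typically a few milliseconds in-region."""
    raw = r.hgetall(f"user_features:{user_id}")
    return {name: json.loads(value) for name, value in raw.items()}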


4. Ads / Audience / Recommendation Systems

Ads System Architecture

The full flow of an ad system on a streaming platform:

graph LR A[User] --> B[Device] B --> C[Identity Resolution] C --> D[Audience Segment] D --> E[Ad Selection] E --> F[Auction] F --> G[Impression] G --> H[Click] H --> I[Conversion] I --> J[Attribution] J --> K[Reporting]

How Ad Auctions Work

When a user opens a streaming app and hits an ad break, the platform runs an auction in real-time:

  1. Bid Request — platform sends user context (anonymized) to demand-side platforms or internal ad server
  2. Bid Response — advertisers bid for the impression, specifying CPM (cost per 1000 impressions)
  3. Auction — typically a second-price auction (winner pays $0.01 above second-highest bid) or increasingly first-price auction
  4. Ad Selection — rank by eCPM (for CPC bids, roughly eCPM = CPC bid × P(click) × 1000; for CPM bids, the bid itself), also factoring in relevance and user experience (see the auction sketch after this list)
  5. Impression & Tracking — serve the ad, fire tracking pixels for viewability, completion, clicks
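
A toy version of steps 3-4, assuming CPC bids scored by eCPM = CPC × P(click) × 1000 and a second-price clearing rule; real ad servers add floors, pacing, brand safety, and quality terms.

def run_second_price_auction(bids):
    """bids: list of (advertiser_id, cpc_bid_dollars, p_click).
    Rank by eCPM; the winner pays the smallest CPC that would still have
    beaten the runner-up's eCPM, plus one cent."""
    scored = sorted(
        ((cpc * p * 1000, adv, cpc, p) for adv, cpc, p in bids), reverse=True
    )
    win_ecpm, winner, win_cpc, win_p = scored[0]
    runner_ecpm = scored[1][0]
    clearing_cpc = runner_ecpm / (win_p * 1000) + 0.01
    return winner, round(clearing_cpc, 2)

# Hypothetical bids: (advertiser, CPC bid, predicted click-through rate)
print(run_second_price_auction([
    ("brand_a", 2.00, 0.010),   # eCPM = 20.0
    ("brand_b", 3.00, 0.004),   # eCPM = 12.0
    ("brand_c", 1.20, 0.012),   # eCPM = 14.4
]))  # -> ('brand_a', 1.45): pays just enough to beat brand_c's eCPM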

Models Used in Ads

Recommendation Model Architectures

Two-Tower Model (Retrieval Stage)

          User Tower                    Item Tower
    ┌─────────────────┐          ┌─────────────────┐
    │ user features   │          │ item features   │
    │ watch history   │          │ genre, tags     │
    │ demographics    │          │ popularity      │
    └────────┬────────┘          └────────┬────────┘
             │                            │
       [Dense Layers]               [Dense Layers]
             │                            │
      user_embedding               item_embedding
        (128-dim)                    (128-dim)
             │                            │
             └──────── dot product ───────┘
                           │
                    similarity score
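
A minimal two-tower sketch in PyTorch, assuming dense feature vectors and an in-batch-negatives softmax loss (a common retrieval training setup); dimensions are illustrative. The trained item embeddings would then be indexed in an ANN index for retrieval.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Separate MLPs map user and item features into one embedding space;
    the dot product of the normalized embeddings is the affinity score."""
    def __init__(self, user_dim: int, item_dim: int, emb_dim: int = 128):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, user_feats, item_feats):
        u = F.normalize(self.user_tower(user_feats), dim=-1)
        v = F.normalize(self.item_tower(item_feats), dim=-1)
        return (u * v).sum(dim=-1)  # cosine similarity per (user, item) pair

# In-batch negatives: each user's positive item serves as a negative
# for every other user in the batch.
model = TwoTower(user_dim=64, item_dim=48)
users = torch.randn(32, 64)
items = torch.randn(32, 48)
u = F.normalize(model.user_tower(users), dim=-1)
v = F.normalize(model.item_tower(items), dim=-1)
logits = u @ v.T                   # 32x32 similarity matrix
labels = torch.arange(32)          # diagonal entries are the true pairs
loss = F.cross_entropy(logits, labels)
loss.backward()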

Deep Ranking Model (Ranking Stage)

Takes ~500 candidates from retrieval and scores them:

Input Features:
  - User features (dense + sparse)
  - Item features (dense + sparse)
  - Cross features (user × item)
  - Context features (time, device, page)
       │
  [Embedding Layer — sparse features]
       │
  [Concatenation with dense features]
       │
  [Multi-Layer Perceptron or DCN-v2]
       │
  [Multi-Task Heads]
       ├── P(click)
       ├── P(watch > 50%)
       ├── P(complete)
       ├── P(like)
       └── P(add to list)
       │
  [Weighted combination → final score]

Multi-Task Learning is essential because optimizing for clicks alone leads to clickbait. The final score is a weighted combination of the task heads, e.g. score = w1·P(click) + w2·P(watch > 50%) + w3·P(complete) + w4·P(like) + w5·P(add to list), with the weights tuned toward long-term engagement rather than raw clicks.

Sequence Models for Watch History

User watch history is sequential — order matters: models that encode the ordered sequence of watches (RNN/GRU or Transformer-style architectures) generally outperform those that treat history as an unordered bag of items.

Content Cold Start — The Chicken-and-Egg Problem

New content has zero engagement signal. How do you recommend something nobody has watched?

| Strategy | How It Works | Limitations |
|---|---|---|
| Content-based features | Use metadata (genre, cast, director, synopsis embedding) to find similar existing content | Ignores personal taste |
| Explore/exploit | Thompson Sampling or epsilon-greedy — intentionally show new content to a small % of users | Hurts short-term metrics |
| Contextual bandits | Use contextual features (user segment, time of day) to decide who sees new content | Requires fast feedback loop |
| Creator-based transfer | If a creator has a track record, use their historical performance as a prior | Only for known creators |
| Editorial boost | Curators manually promote content for initial impressions | Doesn't scale |
Best systems combine content-based warm-starting + explore/exploit for initial signals, then transition to collaborative filtering once enough data exists.
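
A sketch of the explore/exploit idea from the table above, using Thompson Sampling with a Beta posterior over each item's play rate; the counts and item IDs are illustrative.

import numpy as np

def thompson_pick(items, rng=np.random.default_rng()):
    """items: dict item_id -> (plays, impressions). Sample a plausible
    play-rate per item from a Beta posterior and show the argmax, so new
    items with few impressions still get a chance to be explored."""
    best, best_score = None, -1.0
    for item_id, (plays, impressions) in items.items():
        score = rng.beta(plays + 1, impressions - plays + 1)
        if score > best_score:
            best, best_score = item_id, score
    return best

print(thompson_pick({"new_show": (2, 10), "catalog_hit": (900, 10_000)}))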

Handling Bias in Recommendation Data

All training data is biased — users can only engage with what was shown.

Position Bias: Items in position 1 get more clicks regardless of relevance. Solution: train a position bias model separately and debias during training.

Selection Bias: Training data only contains items the model chose to show. Solution: inverse propensity scoring (IPS) — weight training examples by the inverse of the propensity, 1 / P(item was shown).

Popularity Bias: Popular items get shown more → get more engagement → appear more popular. Rich-get-richer feedback loop. Solution: diversity objectives + calibration to match user interest distribution.


5. Identity Resolution & Device Graph

The Fundamental Problem

How do you know that a smart TV, a mobile phone, a laptop, and a tablet belong to the same user or household? This is the identity resolution problem.

Methods

Device Graph Structure

A graph where nodes are devices/users/households and edges represent observed relationships:

graph TD H[Household] --> TV[Smart TV] H --> M[Mobile Phone] H --> L[Laptop] H --> T[Tablet] TV -.->|same IP / WiFi| M M -.->|same login| L L -.->|co-occurrence| T

Key Graph Algorithms

Identity Graph at Scale

At streaming platform scale (100M+ households), the identity graph easily reaches hundreds of millions of nodes and billions of edges.

This requires distributed graph processing, incremental updates as new edges arrive (rather than full recomputation), and tight precision/recall monitoring, since a wrong merge contaminates personalization and measurement downstream.
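
One building block for this is connected components over the edge list, sketched here with a tiny union-find; production graphs use distributed equivalents (e.g. connected components in Spark GraphFrames), and the edges shown are hypothetical.

class UnionFind:
    """Tiny union-find for clustering device IDs into households from
    pairwise 'same household' edges (same login, shared IP, etc.)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
edges = [("tv_1", "phone_7"), ("phone_7", "laptop_3")]  # hypothetical edges
for a, b in edges:
    uf.union(a, b)
print(uf.find("tv_1") == uf.find("laptop_3"))  # True -> same household cluster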


6. Data Engineering at Scale

Core Concepts Deep Dive

Data at Scale — Common Challenges

Typical Data Pipeline

graph LR
    A[Client Events] --> B[Kafka]
    B --> C[Raw Data Lake<br/>Parquet/Iceberg]
    C --> D[Spark / dbt<br/>Transform]
    D --> E[Feature Store]
    E --> F[ML Model Training]
    D --> G[Analytics<br/>Warehouse]
    G --> H[Dashboards]
    F --> I[Model Serving]
    I --> J[Predictions API]

Sessionization — Deceptively Hard

A "session" isn't straightforward. Is it timeout-based (30 min inactivity)? Content-boundary-based (new show = new session)? Device-specific?

-- Sessionization using inactivity gap (30 minutes)
WITH events_with_gap AS (
    SELECT
        user_id,
        event_timestamp,
        LAG(event_timestamp) OVER (
            PARTITION BY user_id ORDER BY event_timestamp
        ) AS prev_timestamp,
        CASE
            WHEN EXTRACT(EPOCH FROM event_timestamp - LAG(event_timestamp)
                 OVER (PARTITION BY user_id ORDER BY event_timestamp)) > 1800
            THEN 1
            ELSE 0
        END AS new_session_flag
    FROM raw_events
)
SELECT
    user_id,
    event_timestamp,
    SUM(new_session_flag) OVER (
        PARTITION BY user_id ORDER BY event_timestamp
    ) AS session_id
FROM events_with_gap;

Real-time sessionization in Flink uses session windows with a gap timeout — but requires careful handling of late-arriving events and cross-device sessions.


7. ML System Design

System Design Framework

When designing any ML system, structure it as:

  1. Problem definition — what are we solving?
  2. Metrics — how do we measure success? (offline + online)
  3. Data sources — what data do we have?
  4. Feature engineering — what features do we build?
  5. Model choice — what algorithm fits?
  6. Training pipeline — how do we train at scale?
  7. Serving architecture — batch vs real-time?
  8. Monitoring — what do we track in production?
  9. Retraining — when and how do we update?
  10. Experimentation — how do we A/B test?

ML System Pipeline

graph TD A[Data Ingestion] --> B[Data Cleaning] B --> C[Feature Engineering] C --> D[Feature Store] D --> E[Model Training] E --> F[Model Evaluation] F --> G[Model Registry] G --> H{Deployment Strategy} H -->|Shadow| I1[Log predictions, don't serve] H -->|Canary| I2[5% traffic, monitor] H -->|A/B Test| I3[50/50 with old model] H -->|Full| I4[100% traffic] I1 --> J[Monitoring & Alerting] I2 --> J I3 --> J I4 --> J J -->|"drift / degradation"| K[Retrain Trigger] K --> E

Real-Time vs Batch Training

| Aspect | Batch Training | Real-Time Training |
|---|---|---|
| Freshness | Hours to days stale | Minutes stale |
| Use case | Stable preferences | Trending content, viral items |
| Infrastructure | Spark + GPU cluster | Flink + parameter server |
| Complexity | Lower | Much higher |
| Typical approach | Daily/weekly retrain | Continuous embedding updates |

Most platforms use a hybrid: batch-train the full model daily/weekly, but update embedding tables in near-real-time for new content and shifting user preferences (as described in ByteDance's Monolith paper).

Model Serving Infrastructure

For a platform serving millions of concurrent users, serving typically combines an ANN retrieval index for candidate generation, a low-latency online feature store, autoscaled ranking services on CPU/GPU, and a model registry for versioned rollouts (see the end-to-end architecture in section 14).

Example Systems to Understand


8. The Experimentation Platform

Why Experimentation Infrastructure Is as Important as ML

A model is only as good as your ability to measure its impact. Streaming platforms run hundreds to thousands of concurrent experiments:

Core Components

flowchart LR
    Config["🎛️ Experiment Config<br/>(variants, allocation, guardrails)"] --> Assign["👤 Assignment Service"]
    Assign --> Client["📱 Client SDK<br/>(get variant)"]
    Client --> Events["📊 Events<br/>(tagged with experiment_id + variant)"]
    Events --> Pipeline["⚙️ Metrics Pipeline"]
    Pipeline --> Stats["📈 Statistical Analysis"]
    Stats --> Dashboard["📊 Experiment Dashboard"]
    style Config fill:#9c27b0,color:#fff
    style Stats fill:#4caf50,color:#fff

Assignment — Deterministic Hashing

import hashlib

def get_variant(user_id: str, experiment_id: str, num_variants: int) -> int:
    """Deterministic assignment: same user always gets same variant."""
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    return hash_value % num_variants

Properties:
- Deterministic — same user always sees same variant (no flickering)
- Uniform — SHA256 gives near-perfect uniform distribution
- Independent — different experiments use different hash inputs, so assignments are uncorrelated
- Stateless — no need to store assignments; recompute on every request

Core Statistical Concepts

Sample Size Calculation

n per variant ≈ 2 · (z_{1-α/2} + z_{1-β})² · σ² / δ²

Where:
- z_{1-α/2} = z-score for the significance level (1.96 for 95%)
- z_{1-β} = z-score for power (0.84 for 80% power)
- σ² = variance of the metric
- δ = minimum detectable effect (MDE)

Typical MDEs for streaming platforms:
- Engagement metrics (watch hours): ±0.5%
- Retention metrics (day-7 retention): ±0.2%
- Revenue metrics (ARPU): ±1.0%
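
A small helper that plugs numbers into the formula above, using the standard normal quantile from Python's statistics module; the example values (σ = 40 minutes, MDE = 0.5 minutes) are illustrative.

from statistics import NormalDist

def sample_size_per_variant(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """n ≈ 2 · (z_{1-α/2} + z_{1-β})² · σ² / δ² for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
    return int(n) + 1

# e.g. daily watch minutes with std dev 40, detecting a 0.5-minute lift
print(sample_size_per_variant(sigma=40, mde=0.5))  # ~100k users per variant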

Traffic Allocation & Mutual Exclusion

Total Traffic (100%)
├── Layer: Ranking (40%)
│   ├── Experiment A: New model v3 (50% control / 50% treatment)
│   └── Experiment B: Feature expansion (50/50)
│       Note: A and B are mutually exclusive within this layer
├── Layer: UI (30%)
│   ├── Experiment C: Card size (33/33/33)
│   └── Experiment D: Autoplay threshold (50/50)
├── Layer: Notifications (20%)
│   └── Experiment E: Send time optimization
└── Holdout (10%)
    └── No experiments — clean baseline

Within a layer, experiments are mutually exclusive. Across layers, they are orthogonal (independent). This is Google's overlapping experiments architecture.

Variance Reduction Techniques

Raw metrics have high variance (some users binge 8 hours, most watch 30 minutes):

CUPED (Controlled-experiment Using Pre-Experiment Data):

Y_adjusted = Y − θ · (X − mean(X)), where X is the user's pre-experiment metric value and θ = Cov(X, Y) / Var(X). Reduces variance by 30-50%.
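
A sketch of the CUPED adjustment on synthetic data; in practice X is the same metric measured over a pre-experiment window for each user.

import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Remove the component of the experiment metric y that is predictable
    from the pre-experiment covariate x (theta = Cov(x, y) / Var(x))."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(0)
x = rng.gamma(2.0, 2.0, size=10_000)             # pre-period watch hours
y = 0.8 * x + rng.normal(0, 1.0, size=10_000)    # in-experiment watch hours
print(np.var(y), np.var(cuped_adjust(y, x)))     # adjusted variance is much lower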

Stratified sampling: Bucket users by activity level, compute within-stratum estimates.

Winsorization: Cap extreme values at 99th percentile to reduce outlier influence.

Multiple Testing Correction

With hundreds of experiments, false positives are inevitable:
- Bonferroni: test each hypothesis at α/m (conservative)
- Benjamini-Hochberg (FDR control): rank p-values p(1) ≤ … ≤ p(m), reject where p(i) ≤ (i/m)·α
- Always Valid Sequential Testing — allows peeking at results without inflating false positives

Guardrail Metrics

Every experiment must monitor guardrails — things that must NOT degrade:

| Guardrail | Threshold | Why |
|---|---|---|
| App crash rate | +0.0% | Reliability |
| Page load latency p99 | +50ms | Performance |
| Error rate | +0.1% | Stability |
| Customer support contacts | +5% | UX quality |
| Subscription cancellation rate | +0.5% | Revenue |

If any guardrail is violated, auto-pause the experiment.

Metrics for Experiments


9. Gen AI / LLM / Agents

LLM Core Topics

RAG Architecture

graph TD
    A[User Question] --> B[Embedding Model]
    B --> C[Vector Search<br/>in Vector DB]
    C --> D[Top-K Relevant<br/>Documents]
    D --> E[Construct Prompt<br/>Question + Context]
    E --> F[LLM]
    F --> G[Answer]
    style C fill:#ff9800,color:#fff
    style F fill:#4caf50,color:#fff

RAG Pipeline Details

  1. Indexing Phase (offline):
     - Chunk documents (512-1024 tokens, with overlap)
     - Generate embeddings per chunk
     - Store in vector database with metadata
  2. Query Phase (online):
     - Embed the user query
     - Retrieve top-K similar chunks (cosine similarity; sketched below)
     - (Optional) Rerank retrieved chunks with a cross-encoder
     - Inject context + query into LLM prompt
     - Generate answer with citations
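
A minimal sketch of the query-phase retrieval step, assuming the chunk embeddings are already computed, using brute-force cosine similarity; a vector database (FAISS, pgvector, etc.) replaces this at scale.

import numpy as np

def retrieve_top_k(query_vec, chunk_vecs, chunk_texts, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                              # cosine similarity per chunk
    top = np.argsort(scores)[::-1][:k]
    return [(chunk_texts[i], float(scores[i])) for i in top]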

Reporting Agent Architecture

graph TD A[User: Natural Language Query] --> B[Agent / Orchestrator] B --> C{Route} C -->|SQL Query| D[NL-to-SQL Engine] C -->|Dashboard| E[Visualization Tool] C -->|Analysis| F[Analytics Engine] D --> G[Data Warehouse] G --> H[Results] H --> I[LLM Summarization] I --> J[Response to User]

This is increasingly common on streaming platforms — analysts ask questions in natural language and get SQL-backed answers with visualizations.


10. SQL & Data Modeling

SQL Deep Dive

Data Modeling for Analytics

Star Schema — central fact table surrounded by dimension tables. Denormalized for query performance.

graph TD F[Fact: Impressions / Plays] --> D1[Dim: User] F --> D2[Dim: Content] F --> D3[Dim: Device] F --> D4[Dim: Time] F --> D5[Dim: Campaign] F --> D6[Dim: Geography]

11. KPIs / Objective Functions

Choosing the right objective function for each system is critical:

| System | Objective | Why |
|---|---|---|
| CTR model | Log Loss (Binary Cross-Entropy) | Penalizes confident wrong predictions, produces calibrated probabilities |
| Content ranking | NDCG (Normalized Discounted Cumulative Gain) | Measures quality of ordered list, weights top positions higher |
| Recommendation | MAP / Recall@K | Evaluates relevance in top-K results |
| Audience segmentation | Silhouette Score / Calinski-Harabasz | Measures cluster quality and separation |
| Attribution | Incrementality (causal lift) | Measures true causal impact, not just correlation |
| Budget pacing | Spend vs target deviation | Minimize under/over-delivery |
| Identity resolution | Precision / Recall / F1 of matches | Correct device-to-user mapping accuracy |
| Engagement | Composite: quality watch hours + diversity | Prevents gaming a single metric (Goodhart's Law) |

The North Star Metric Problem

Streaming platforms have many metrics that often conflict:

| Metric | Optimizes For | Risk |
|---|---|---|
| Watch hours | Engagement | Autoplay addiction, low-quality binges |
| DAU / MAU | Retention | Doesn't capture depth |
| Content starts | Discovery | Doesn't measure satisfaction |
| Completion rate | Satisfaction | Penalizes long content |
| Revenue / ARPU | Business | May sacrifice long-term engagement |

Solution: define a composite metric, for example a weighted combination along the lines of:

NorthStar = w1 · QualityWatchHours + w2 · Retention + w3 · ARPU

Where QualityWatchHours filters out background/autoplay — only counting engaged viewing.


12. MLOps / Production ML

Core MLOps Concepts

MLOps Architecture

graph TD
    A[Data Pipeline] --> B[Feature Store]
    B --> C[Training Pipeline<br/>GPU Cluster]
    C --> D[Model Registry<br/>MLflow / W&B]
    D --> E[Offline Evaluation<br/>AUC, NDCG]
    E --> F{Deployment}
    F -->|Shadow| G1[Log Only]
    F -->|Canary| G2[5% Traffic]
    F -->|A/B Test| G3[Experiment]
    F -->|Full| G4[100% Traffic]
    G1 --> H[Monitoring<br/>Drift + Latency + Quality]
    G2 --> H
    G3 --> H
    G4 --> H
    H -->|"Drift Detected"| I[Auto-Retrain]
    I --> C
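
One common check behind the "Drift Detected" edge above is the Population Stability Index (PSI) between the training-time and live distributions of a feature or score; the ~0.2 alert threshold is a common rule of thumb, and the data here is synthetic.

import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time distribution and the live serving
    distribution of a continuous feature; ~0.2+ is often treated as drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

train_scores = np.random.default_rng(1).normal(0.30, 0.10, 50_000)
live_scores = np.random.default_rng(2).normal(0.38, 0.12, 50_000)
print(population_stability_index(train_scores, live_scores))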

13. Data Quality & Observability

Why Data Quality Is the Biggest Risk

ML models are only as good as their input data. Common issues:

| Issue | Impact | Detection |
|---|---|---|
| Missing events (SDK bug) | Undercounting → wrong experiment results | Volume anomaly detection |
| Duplicate events (retry storms) | Overcounting → inflated metrics | Dedup by event_id |
| Schema changes (unannounced) | Pipeline breakages | Schema registry enforcement |
| Clock skew (device time wrong) | Feature computation errors | Server-side timestamp validation |
| Bot/fraud traffic | Pollutes training data | Behavioral anomaly detection |
| Late-arriving data | Incomplete aggregations | Watermark-based processing |

Data Quality Framework

flowchart LR
    Ingest["📥 Ingestion"] --> Schema["📋 Schema Validation"]
    Schema --> Volume["📊 Volume Checks"]
    Volume --> Freshness["⏱️ Freshness SLAs"]
    Freshness --> Distribution["📈 Distribution Checks"]
    Distribution --> Alert["🚨 Alert & Circuit Breaker"]
    style Alert fill:#f44336,color:#fff

Tools: Great Expectations, dbt tests, Monte Carlo, Anomalo, Soda, or custom solutions built on statistical process control.

Circuit Breakers: If input data quality drops below a threshold, automatically stop the ML training pipeline from consuming bad data. Better to serve a slightly stale model than one trained on corrupted data.
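
A toy sketch of two of the checks above (duplicate event_ids and a crude volume check) in pandas; the thresholds and the expected-volume input are placeholders for what a real framework like Great Expectations or dbt tests would manage.

import pandas as pd

def run_basic_checks(events: pd.DataFrame, expected_daily_volume: int) -> list:
    """Return a list of human-readable issues found in a day's events."""
    issues = []
    dupes = int(events["event_id"].duplicated().sum())
    if dupes > 0:
        issues.append(f"{dupes} duplicate event_ids (possible retry storm)")
    if len(events) < 0.5 * expected_daily_volume:
        issues.append("volume < 50% of expected (possible missing events)")
    return issues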

Data Lineage & Cataloging


14. End-to-End Architecture

Putting It All Together

┌─────────────────────────────────────────────────────────────────────────┐
                         CLIENT DEVICES                                   
   (Smart TV, Mobile, Web, Console, Set-Top Box, Gaming Console)         
└─────────────┬───────────────────────────────────────────┬───────────────┘
               events                                      API calls
                                                          
┌─────────────────────────┐              ┌───────────────────────────────┐
   EVENT INGESTION                         API GATEWAY                  
   ┌─────────────────┐                    ┌────────────────────────┐  
     Schema Registry                       Experiment Assignment   
     Kafka Clusters                        (deterministic hash)    
     (multi-region)                      └────────────────────────┘  
   └────────┬────────┘                    ┌────────────────────────┐  
                                            Recommendation API      
└────────────┼─────────────┘                   retrieve  rank       
                                              rerank  serve         
                                            └────────────────────────┘  
┌─────────────────────────────┐          └──────────────┬────────────────┘
   STREAM PROCESSING                                   
   (Flink / Spark Streaming)                           
   ┌───────────────────────┐           ┌──────────────▼────────────────┐
     Sessionization                     ML SERVING                   
     Real-time features                 ┌────────────────────────┐  
     Real-time metrics                    Retrieval (ANN index)   
     Anomaly detection                    Ranking (GPU/CPU)       
   └───────────┬───────────┘                Feature Store (Redis)   
└───────────────┼─────────────┘               Model Registry          
                                           └────────────────────────┘  
                                        └───────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
                          DATA LAKE                                       
   ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ ┌─────────────┐ 
     Raw Events     Sessionized     Feature        Experiment  
     (Parquet)      Events          Tables         Results     
   └──────────────┘ └───────────────┘ └──────────────┘ └─────────────┘ 
                                                                         
   Storage: S3 / ADLS / GCS  Format: Delta Lake / Iceberg / Hudi       
   Compute: Spark / Trino / dbt                                          
└─────────────────────────────────────────────────────────────────────────┘

Technology Choices — Practical Summary

| Component | Open Source | Managed Service |
|---|---|---|
| Event streaming | Apache Kafka | Confluent Cloud, Amazon MSK, Azure Event Hubs |
| Stream processing | Apache Flink, Spark Structured Streaming | Kinesis Data Analytics, Dataflow |
| Data lake storage | HDFS, MinIO | S3, ADLS Gen2, GCS |
| Table format | Delta Lake, Apache Iceberg, Apache Hudi | Databricks, Snowflake |
| Batch compute | Apache Spark, Trino/Presto | Databricks, EMR, Synapse, BigQuery |
| Feature store | Feast, Hopsworks | Tecton, SageMaker Feature Store, Vertex AI |
| ML training | PyTorch, TensorFlow, XGBoost | SageMaker, Vertex AI, Azure ML |
| Model serving | TF Serving, Triton, BentoML | SageMaker Endpoints, Vertex AI |
| Experiment platform | Custom (most common) | Eppo, Statsig, LaunchDarkly, Optimizely |
| Data quality | Great Expectations, dbt tests, Soda | Monte Carlo, Anomalo |
| Orchestration | Apache Airflow, Dagster, Prefect | MWAA, Cloud Composer, Astronomer |
| Data catalog | DataHub, Amundsen, OpenMetadata | Collibra, Alation |

Learning Path

Phase 1: Foundations

Phase 2: Domain Knowledge

Phase 3: Infrastructure

Phase 4: GenAI


Priority Topics (Focus These First)

If limited on time, prioritize:

  1. ML basics — regression, classification, trees, metrics
  2. Feature engineering — the #1 differentiator in applied ML
  3. Recommendation systems — two-tower, ranking, cold start
  4. A/B testing & experimentation — fundamental to data-driven culture
  5. Data pipelines — Spark, Kafka, ETL, data lakes
  6. Identity resolution / device graph — unique cross-device challenge
  7. ML system design — end-to-end thinking
  8. RAG / LLM / Agents — the GenAI wave
  9. SQL & data modeling — the universal data language
  10. Metrics / KPIs — knowing what to optimize

Further Reading


This is a living document — I'll keep going deeper into each topic as I learn more. Stay focused while studying — try AstroYuga for mindful focus sessions. 🧘
