Data Infrastructure for AI and Experimentation

A comprehensive deep-dive into ML, Ads Systems, Identity Graphs, Data Engineering, GenAI, and Experimentation

A complete learning roadmap covering Data Engineering + Machine Learning + Ads Systems + Identity Graphs + GenAI — the foundational pillars behind modern ad data activation platforms.


What Does an Ad Data Activation Team Build?

Understanding the bigger picture first:

This is Applied ML + Data Platform + Ads + GenAI — not pure ML research.


Topics Map (High Level)

The core areas to understand:

  1. Python for ML
  2. Machine Learning Fundamentals
  3. Feature Engineering
  4. Experimentation / A/B Testing
  5. Recommendation / Ads / Ranking Systems
  6. Identity Resolution & Graph ML
  7. Data Engineering (Spark, Data Lakes, ETL)
  8. ML System Design
  9. Gen AI / LLM Systems / RAG / Agents
  10. Metrics / KPIs / Objective Functions
  11. Model Deployment / MLOps
  12. SQL & Data Modeling
  13. Distributed Systems Basics

Machine Learning Fundamentals

Supervised Learning

Unsupervised Learning

Important Concepts

Metrics

| Metric | What It Measures |
|---|---|
| Accuracy | % of correct predictions |
| Precision | Out of predicted positives, how many are actually positive |
| Recall | Out of actual positives, how many did we catch |
| F1 Score | Harmonic mean of precision and recall |
| ROC AUC | Area under the ROC curve (tradeoff between TPR and FPR) |
| Log Loss | Penalizes confident wrong predictions (used in CTR) |
| RMSE | Root Mean Squared Error (regression metric) |
| MAE | Mean Absolute Error (regression metric) |
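
To make the definitions above concrete, here is a minimal sketch that computes most of these metrics by hand using only the standard library. The function names and the sample labels and scores are made up for illustration; in practice a library such as scikit-learn would usually be used instead.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, F1 and log loss for binary labels (0/1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Clip probabilities so log(0) never occurs.
    eps = 1e-15
    log_loss = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": log_loss,
    }

def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error for regression."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels (1 = click) and predicted click probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = classification_metrics(y_true, y_prob)
```

Note that log loss is computed from the raw probabilities, while the other classification metrics depend on the chosen threshold.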

Ads / Audience / Recommendation Systems

Ads System Architecture

The full flow of an ad system:

graph LR
    A[User] --> B[Device]
    B --> C[Identity Resolution]
    C --> D[Audience Segment]
    D --> E[Ad Selection]
    E --> F[Auction]
    F --> G[Impression]
    G --> H[Click]
    H --> I[Conversion]
    I --> J[Attribution]
    J --> K[Reporting]
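
The tail end of this flow (impression, click, conversion, reporting) can be illustrated with a small sketch that rolls an event log up into funnel rates. The event schema and the `funnel_rates` helper are assumptions made for this example, not part of any real ad platform API.

```python
from collections import Counter

def funnel_rates(events):
    """Count funnel events and derive click-through and conversion rates."""
    counts = Counter(e["type"] for e in events)
    impressions = counts["impression"]
    clicks = counts["click"]
    conversions = counts["conversion"]
    return {
        "impressions": impressions,
        "clicks": clicks,
        "conversions": conversions,
        # CTR = clicks / impressions, CVR = conversions / clicks
        "ctr": clicks / impressions if impressions else 0.0,
        "cvr": conversions / clicks if clicks else 0.0,
    }

# Hypothetical event log: one record per funnel event.
events = [
    {"user": "u1", "type": "impression"},
    {"user": "u1", "type": "click"},
    {"user": "u1", "type": "conversion"},
    {"user": "u2", "type": "impression"},
    {"user": "u2", "type": "click"},
    {"user": "u3", "type": "impression"},
    {"user": "u4", "type": "impression"},
]
report = funnel_rates(events)
```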

Models Used in Ads


Identity Resolution & Device Graph

Identity Resolution

The fundamental problem: how do you know that this phone, this TV, and this laptop belong to the same user or household?

Methods:

Device Graph

A graph structure where:

Nodes = devices / users / households
Edges = relationships (same IP, same login, co-occurrence)

graph TD
    H[Household] --> TV[Roku TV]
    H --> M[Mobile Phone]
    H --> L[Laptop]
    H --> T[Tablet]
    TV -.->|same IP| M
    M -.->|same login| L
    L -.->|co-occurrence| T
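
One simple way to turn pairwise edges like these into household clusters is to take connected components of the graph. The sketch below does this with union-find; it assumes every edge is trustworthy, which real systems would not (they typically score or filter edges first). The device IDs are hypothetical.

```python
def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_devices(edges):
    """Group devices into clusters: any two linked devices share a cluster."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for device in parent:
        clusters.setdefault(find(parent, device), set()).add(device)
    # Sort for a stable, readable result.
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical edges, e.g. same IP, same login, co-occurrence.
edges = [
    ("roku_tv", "phone"),
    ("phone", "laptop"),
    ("laptop", "tablet"),
    ("other_tv", "other_phone"),
]
households = cluster_devices(edges)
```

Here the first four devices collapse into one cluster because they are linked transitively, even though the TV and the tablet share no direct edge.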

Key Algorithms:


Data Engineering

Core Concepts

Typical Data Pipeline

graph LR
    A[Logs / Events] --> B[Kafka]
    B --> C[Data Lake]
    C --> D[Spark ETL]
    D --> E[Feature Store]
    E --> F[ML Model]
    F --> G[Predictions]
    G --> H[Reporting DB]
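
As a rough, single-machine stand-in for the ETL and feature store stages, the sketch below cleans raw events and aggregates them into per-user features. In a real pipeline this logic would run in Spark over the data lake; the record fields, function names, and the plain dict used as a "feature store" are all assumptions for illustration.

```python
from collections import defaultdict

def clean(raw_events):
    """Drop malformed records (missing user or type) and exact duplicates."""
    seen = set()
    for e in raw_events:
        user, etype, ts = e.get("user"), e.get("type"), e.get("ts")
        if not user or not etype:
            continue
        key = (user, etype, ts)
        if key in seen:
            continue
        seen.add(key)
        yield {"user": user, "type": etype, "ts": ts}

def build_features(events):
    """Aggregate cleaned events into one feature row per user."""
    features = defaultdict(lambda: {"impressions": 0, "clicks": 0, "last_seen_ts": None})
    for e in events:
        row = features[e["user"]]
        if e["type"] == "impression":
            row["impressions"] += 1
        elif e["type"] == "click":
            row["clicks"] += 1
        if row["last_seen_ts"] is None or e["ts"] > row["last_seen_ts"]:
            row["last_seen_ts"] = e["ts"]
    return dict(features)

# Hypothetical raw log with one duplicate and one malformed record.
raw = [
    {"user": "u1", "type": "impression", "ts": 1},
    {"user": "u1", "type": "impression", "ts": 1},  # duplicate
    {"user": "u1", "type": "impression", "ts": 2},
    {"user": "u1", "type": "click", "ts": 3},
    {"user": None, "type": "click", "ts": 4},       # malformed
    {"user": "u2", "type": "impression", "ts": 5},
]
feature_store = build_features(clean(raw))
```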

ML System Design

ML System Pipeline

When designing any ML system, follow this structure:

graph TD
    A[Data Ingestion] --> B[Data Cleaning]
    B --> C[Feature Engineering]
    C --> D[Feature Store]
    D --> E[Model Training]
    E --> F[Model Evaluation]
    F --> G[Model Deployment]
    G --> H1[Batch Predictions]
    G --> H2[Online Predictions]
    H1 --> I[Monitoring]
    H2 --> I
    I --> J[Retraining]
    J --> E

System Design Framework

Always structure system designs as:

  1. Problem definition — what are we solving?
  2. Metrics — how do we measure success?
  3. Data sources — what data do we have?
  4. Feature engineering — what features do we build?
  5. Model choice — what algorithm fits?
  6. Training pipeline — how do we train at scale?
  7. Serving architecture — batch vs real-time?
  8. Monitoring — what do we track in production?
  9. Retraining — when and how do we update?
  10. Experimentation — how do we A/B test?

Example Systems to Understand


Experimentation / A/B Testing

Core Concepts

Metrics for Ads Experiments
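
For a rate metric such as CTR, one common way to compare control and treatment is a two-proportion z-test. The sketch below is a minimal version using only the standard library; the function name and the sample counts are made up, and it assumes independent users and samples large enough for the normal approximation.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in rates between groups A and B."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both groups share one CTR.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control CTR 2.0%, treatment CTR 2.6%.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

A small p-value only says the difference is unlikely under the null; whether the lift is worth shipping is a separate product decision.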


Gen AI / LLM / Agents

LLM Core Topics

Typical GenAI Pipeline

graph LR
    A[User Question] --> B[Embedding Model]
    B --> C[Vector Search]
    C --> D[Retrieve Relevant Data]
    D --> E[LLM]
    E --> F[Answer]
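
The retrieval half of this pipeline can be sketched without any external services. Below, a toy bag-of-words vector stands in for the embedding model and a brute-force cosine ranking stands in for vector search; both are deliberate simplifications. The LLM call itself is left out, so the sketch stops at building the prompt. All names and documents are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=1):
    """Brute-force 'vector search': rank all documents by similarity."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

documents = [
    "CTR is clicks divided by impressions",
    "A device graph links devices to households",
    "Spark runs distributed ETL jobs",
]
question = "what is a device graph"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
```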

Reporting Agent Architecture

graph TD
    A[User: Natural Language Query] --> B[Agent / Orchestrator]
    B --> C{Route}
    C -->|SQL Query| D[NL-to-SQL Engine]
    C -->|Dashboard| E[Visualization Tool]
    C -->|Analysis| F[Analytics Engine]
    D --> G[Data Warehouse]
    G --> H[Results]
    H --> I[LLM Summarization]
    I --> J[Response to User]

SQL & Data Modeling

SQL Essentials

Data Modeling for Ads

graph TD
    F[Fact: Ad Impressions] --> D1[Dim: User]
    F --> D2[Dim: Campaign]
    F --> D3[Dim: Creative]
    F --> D4[Dim: Device]
    F --> D5[Dim: Time]
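
A cut-down version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch below keeps only two of the five dimensions for brevity, and every table name, column name, and row is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_campaign (campaign_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_device   (device_id   INTEGER PRIMARY KEY, device_type TEXT);

-- The fact table holds one row per impression plus foreign keys to dimensions.
CREATE TABLE fact_ad_impressions (
    impression_id INTEGER PRIMARY KEY,
    campaign_id   INTEGER REFERENCES dim_campaign(campaign_id),
    device_id     INTEGER REFERENCES dim_device(device_id),
    clicked       INTEGER
);

INSERT INTO dim_campaign VALUES (1, 'spring_sale'), (2, 'brand_awareness');
INSERT INTO dim_device   VALUES (1, 'tv'), (2, 'mobile');
INSERT INTO fact_ad_impressions VALUES
    (1, 1, 1, 0), (2, 1, 2, 1), (3, 1, 2, 1), (4, 2, 1, 0);
""")

# Typical star-schema query: join the fact to a dimension, then aggregate.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS impressions, SUM(f.clicked) AS clicks
    FROM fact_ad_impressions AS f
    JOIN dim_campaign AS c ON c.campaign_id = f.campaign_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

Swapping `dim_campaign` for `dim_device` in the join gives the same report sliced by device type, which is the main appeal of this layout.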

KPIs / Objective Functions

Understanding which objective function to use for each system:

| System | Objective | Why |
|---|---|---|
| CTR model | Log Loss | Penalizes confident wrong predictions |
| Ranking | NDCG | Measures quality of ranked list |
| Recommendation | MAP (Mean Average Precision) | Evaluates precision at each relevant item |
| Segmentation | Silhouette Score | Measures cluster quality |
| Attribution | Incrementality | Measures true causal impact of ad |
| Budget pacing | Spend vs Target | Minimize deviation from planned spend |
| Identity resolution | Match accuracy / precision-recall | Correct device-to-user mapping |
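
Since NDCG is less self-explanatory than log loss, here is a minimal sketch of how it can be computed. It uses the linear-gain form of DCG (relevance divided by log2 of position + 1); an exponential-gain variant also exists. The function names are chosen for this example.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG of the given ranking divided by DCG of the ideal ranking."""
    ranked = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    best = dcg(ideal)
    return dcg(ranked) / best if best else 0.0

# Relevance labels in the order the model ranked the items.
perfect = ndcg([3, 2, 1])   # already in ideal order
reversed_order = ndcg([1, 2, 3])
```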

MLOps / Production ML

Core MLOps Concepts

MLOps Architecture

graph TD
    A[Data Pipeline] --> B[Feature Store]
    B --> C[Training Pipeline]
    C --> D[Model Registry]
    D --> E{Deployment Strategy}
    E -->|Canary| F1[5% Traffic - New Model]
    E -->|Shadow| F2[Log Only - New Model]
    E -->|Full| F3[100% Traffic - New Model]
    F1 --> G[Monitoring]
    F2 --> G
    F3 --> G
    G -->|Drift Detected| H[Retrain Trigger]
    H --> C
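
Two pieces of this diagram are easy to sketch in code: the canary split and the drift check. The example below routes a fixed share of users to the new model by hashing the user ID, and uses the Population Stability Index (PSI) as one possible drift signal. Both function names, the model labels, and the choice of PSI are assumptions for illustration, not a prescribed setup.

```python
import hashlib
import math

def route_model(user_id, canary_percent=5):
    """Deterministically send about canary_percent% of users to the new model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "new_model" if bucket < canary_percent else "current_model"

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a live sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical baseline feature values and a shifted live sample.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]
psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
```

Hashing the user ID keeps each user on the same model across requests, which makes canary metrics easier to interpret. A PSI above a chosen threshold could serve as the "Drift Detected" trigger; the threshold itself is something to tune per feature.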

Learning Path

Phase 1: Foundations

Phase 2: Domain Knowledge

Phase 3: Infrastructure

Phase 4: GenAI


Most Important Topics (Focus on These First)

If limited on time, prioritize in this order:

  1. ML basics — regression, classification, trees, metrics
  2. Feature engineering — the #1 differentiator in applied ML
  3. Ads ranking / CTR prediction — core of ad tech ML
  4. Identity resolution / device graph — unique to this domain
  5. A/B testing — fundamental to experimentation culture
  6. Data pipelines — Spark, ETL, data lakes
  7. ML system design — end-to-end thinking
  8. RAG / LLM / Agents — the GenAI wave in ads
  9. SQL — the universal data language
  10. Metrics / KPIs — knowing what to optimize

This is a living document. I'll keep updating it as I dive deeper into each topic.
