Data Infrastructure for AI and Experimentation
A complete learning roadmap covering Data Engineering + Machine Learning + Ads Systems + Identity Graphs + GenAI — the foundational pillars behind modern ad data activation platforms.
What Does an Ad Data Activation Team Build?
Understanding the bigger picture first:
- Identity systems — who is the user across devices
- Device graph — connect TV, mobile, tablet to same person/household
- Audience platform — segments like "sports lovers"
- Measurement & attribution — did an ad lead to a purchase?
- Reporting & insights
- Gen AI agents for reporting & analytics
This is Applied ML + Data Platform + Ads + GenAI — not pure ML research.
Topics Map (High Level)
The core areas to understand:
- Python for ML
- Machine Learning Fundamentals
- Feature Engineering
- Experimentation / A/B Testing
- Recommendation / Ads / Ranking Systems
- Identity Resolution & Graph ML
- Data Engineering (Spark, Data Lakes, ETL)
- ML System Design
- Gen AI / LLM Systems / RAG / Agents
- Metrics / KPIs / Objective Functions
- Model Deployment / MLOps
- SQL & Data Modeling
- Distributed Systems Basics
Machine Learning Fundamentals
Supervised Learning
- Regression — predict continuous values
- Classification — predict categories
- Logistic Regression — binary classification using sigmoid function
- Linear Regression — fit a line to data using least squares
- Decision Trees — tree-based splitting on feature thresholds
- Random Forest — ensemble of decision trees (bagging)
- Gradient Boosting (XGBoost, LightGBM) — sequential trees correcting errors of previous ones
- Neural Networks basics — layers of neurons with activation functions
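To make the supervised-learning ideas concrete, here is a minimal sketch of logistic regression trained with per-sample gradient descent on log loss. The data, learning rate, and epoch count are all illustrative toy values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights + bias by stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy 1-D data: label is 1 when the feature exceeds ~0.5
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
predict = lambda x: sigmoid(w[0] * x + b)
```

The same `err = p - y` gradient shape is what makes logistic regression the standard baseline for CTR prediction.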
Unsupervised Learning
- Clustering (KMeans) — group similar data points
- PCA — dimensionality reduction via principal components
- Embeddings — dense vector representations of entities
- Similarity search — find nearest neighbors in embedding space
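A sketch of similarity search over embeddings, using brute-force cosine similarity. The embedding values and segment names are made up; at scale this lookup is replaced by an approximate nearest-neighbor index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query, embeddings):
    """Return the id of the most similar stored vector."""
    return max(embeddings, key=lambda k: cosine(query, embeddings[k]))

# Hypothetical user-interest embeddings (values invented for illustration)
embeddings = {
    "sports_fan":  [0.9, 0.1, 0.0],
    "movie_buff":  [0.1, 0.9, 0.1],
    "news_reader": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # a new user's embedding
```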
Important Concepts
- Bias vs Variance — underfitting vs overfitting tradeoff
- Overfitting — model memorizes training data, performs poorly on unseen data
- Regularization (L1, L2) — penalize model complexity to prevent overfitting
  - L1 (Lasso): encourages sparsity (some weights become zero)
  - L2 (Ridge): shrinks weights evenly
- Cross Validation — evaluate model on multiple train/test splits (e.g., k-fold)
- Feature Engineering — creating useful input features from raw data
- Feature Scaling — normalize/standardize features so they're on same scale
- Handling missing values — imputation (mean, median, mode), indicator columns, or model-based
- Class imbalance — techniques like SMOTE, class weights, undersampling, oversampling
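The k-fold idea above can be sketched as a plain index splitter (no libraries assumed); each index appears in exactly one test fold:

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    # Distribute n indices across k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))  # 5 folds of 2 test indices each
```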
Metrics
| Metric | What It Measures |
|---|---|
| Accuracy | % of correct predictions |
| Precision | Out of predicted positive, how many are actually positive |
| Recall | Out of actual positives, how many did we catch |
| F1 Score | Harmonic mean of precision and recall |
| ROC AUC | Area under ROC curve — how well the model ranks positives above negatives across all thresholds (TPR vs FPR) |
| Log Loss | Penalizes confident wrong predictions (used in CTR) |
| RMSE | Root Mean Squared Error — regression metric |
| MAE | Mean Absolute Error — regression metric |
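The classification metrics in the table follow directly from confusion-matrix counts; a small helper makes the definitions explicit (counts below are invented):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many we caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=86)
```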
Ads / Audience / Recommendation Systems
Ads System Architecture
The full flow of an ad system (typical stages):

Ad request → Targeting / candidate retrieval → CTR & CVR scoring → Auction (rank by eCPM) → Ad served → Impression/click/conversion logging → Feedback into model training
Models Used in Ads
- CTR prediction (Click Through Rate) — probability user clicks on ad
- CVR prediction (Conversion Rate) — probability of purchase after click
- Ranking models — order ads by expected value (eCPM = predicted CTR × bid × 1000)
- Lookalike modeling — find users similar to seed audience
- Audience segmentation — cluster users into meaningful groups
- Recommendation systems — suggest relevant content/ads
- Collaborative filtering — recommend based on similar users' behavior
- Content-based filtering — recommend based on item features
- Embeddings for users & items — learn dense representations for matching
- Multi-armed bandits — explore vs exploit for ad selection
- Auction systems — second-price auction, VCG, first-price auction
- Budget pacing — spend advertiser budget evenly over campaign duration
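Ranking by eCPM and the second-price rule can be sketched together. The ads, bids, and predicted CTRs below are invented, and the pricing formula is the simplified textbook version (real auctions add quality scores and reserve prices):

```python
def rank_by_ecpm(ads):
    """Rank ads by eCPM = predicted CTR x CPC bid x 1000."""
    return sorted(ads, key=lambda ad: ad["pctr"] * ad["bid"] * 1000, reverse=True)

def second_price(ranked):
    """Winner pays the per-click price that just matches the runner-up's eCPM."""
    winner, runner_up = ranked[0], ranked[1]
    return runner_up["pctr"] * runner_up["bid"] / winner["pctr"]

ads = [
    {"id": "a", "pctr": 0.02, "bid": 1.00},  # eCPM = 20
    {"id": "b", "pctr": 0.05, "bid": 0.50},  # eCPM = 25
    {"id": "c", "pctr": 0.01, "bid": 2.00},  # eCPM = 20
]
ranked = rank_by_ecpm(ads)
```

Note that ad "b" wins despite the lowest bid: a high predicted CTR lifts its expected value, which is exactly why CTR models sit at the heart of the auction.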
Identity Resolution & Device Graph
Identity Resolution
The fundamental problem: how do you know that this phone, this TV, and this laptop belong to the same user or household?
Methods:
- Deterministic matching — exact match on email, login, phone number
- Probabilistic matching — statistical models using IP address, location, timing patterns
- Graph-based identity resolution — build a graph, run connected components
- Similarity models — embeddings + cosine similarity
- Graph ML — use graph neural networks for entity resolution
Device Graph
A graph structure where:
- Nodes = devices / users / households
- Edges = relationships (same IP, same login, co-occurrence)
Key Algorithms:
- Connected components — find clusters of related devices
- PageRank — rank importance of nodes
- Node embeddings — learn vector representations of graph nodes (Node2Vec, GraphSAGE)
- Graph neural networks (GNN) — neural networks that operate on graph structure
- Link prediction — predict missing edges (e.g., does this phone belong to this household?)
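Connected components is the workhorse here: given identity edges, cluster devices into households. A union-find sketch with path compression (the device ids and edges are invented):

```python
def connected_components(edges):
    """Cluster node ids into components via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Hypothetical identity edges: same login, same household IP
edges = [("phone_1", "tv_1"), ("tv_1", "laptop_1"), ("phone_2", "tablet_2")]
components = connected_components(edges)
```

At production scale the same algorithm runs distributed (e.g., in Spark GraphFrames), but the clustering logic is identical.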
Data Engineering
Core Concepts
- ETL pipelines — Extract, Transform, Load
- Data lakes — store raw data in any format (S3, ADLS, GCS)
- Data warehouse — structured, optimized for analytics (Snowflake, BigQuery, Redshift)
- Batch vs Streaming — scheduled processing vs real-time processing
- Spark — distributed data processing engine
- Presto / Trino — distributed SQL query engines
- Parquet — columnar storage format, efficient for analytics
- Feature Store — centralized repository for ML features (Feast, Tecton)
- Airflow — workflow orchestration (DAGs)
- Kafka — distributed event streaming platform
- Data partitioning — split data by date/region for faster queries
- Data skew — uneven distribution of data across partitions
- Joins at scale — broadcast joins, sort-merge joins, shuffle hash joins
- Window functions — compute across rows related to current row
- Aggregations at scale — group by with distributed compute
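The broadcast-join idea can be illustrated without Spark: the small dimension table is turned into an in-memory lookup so the large fact stream never needs a shuffle. The tables and column names below are invented:

```python
def broadcast_join(fact_rows, dim_table, key):
    """Join a large fact stream against a small in-memory dimension table.

    Mirrors a Spark broadcast join: the small side is shipped to every
    worker, so the big side streams through without shuffling.
    """
    lookup = {row[key]: row for row in dim_table}  # the "broadcast" side
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is not None:  # inner-join semantics
            yield {**fact, **dim}

facts = [{"campaign_id": 1, "clicks": 10}, {"campaign_id": 2, "clicks": 5}]
dims = [{"campaign_id": 1, "name": "spring_sale"},
        {"campaign_id": 2, "name": "brand"}]
joined = list(broadcast_join(facts, dims, "campaign_id"))
```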
Typical Data Pipeline

Event sources (impressions, clicks, conversions) → Ingestion (Kafka) → Data lake (raw Parquet on S3) → Spark ETL → Warehouse / feature store → ML training & reporting
ML System Design
ML System Pipeline
When designing any ML system, follow this structure:

Data collection → Feature engineering → Model training → Offline evaluation → Deployment → Online monitoring → Retraining
System Design Framework
Always structure system designs as:
- Problem definition — what are we solving?
- Metrics — how do we measure success?
- Data sources — what data do we have?
- Feature engineering — what features do we build?
- Model choice — what algorithm fits?
- Training pipeline — how do we train at scale?
- Serving architecture — batch vs real-time?
- Monitoring — what do we track in production?
- Retraining — when and how do we update?
- Experimentation — how do we A/B test?
Example Systems to Understand
- Ad ranking system
- Recommendation system
- Audience segmentation system
- Attribution system
- Reporting insights AI agent
- Identity resolution system
Experimentation / A/B Testing
Core Concepts
- A/B testing — compare control (A) vs treatment (B) with random assignment
- Hypothesis testing — formulate null and alternative hypothesis
- P-value — probability of a result at least as extreme as the observed one, assuming the null hypothesis is true; compared against a significance level (commonly 0.05)
- Confidence interval — range where true value likely falls (typically 95%)
- Statistical significance — result unlikely due to chance
- Power analysis — determine required sample size before experiment
- Online experiments — experiments running on live traffic
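A standard way to test a CTR difference is the two-proportion z-test with a pooled standard error. The click counts below are illustrative, and 1.96 is the usual two-sided 95% threshold:

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic for comparing two proportions (pooled standard error)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative: treatment CTR 2.4% vs control 2.0%, 100k impressions each
z = two_proportion_z(clicks_a=2000, n_a=100_000, clicks_b=2400, n_b=100_000)
significant = abs(z) > 1.96  # two-sided test at the 95% level
```

In practice you would run the power analysis first to pick the 100k sample size, not check significance after the fact.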
Metrics for Ads Experiments
- CTR — click through rate
- Conversion rate — conversions / clicks (post-click CVR; sometimes measured per impression instead)
- Revenue — total revenue generated
- ROAS — Return on Ad Spend (revenue / ad spend)
- Engagement — time spent, interactions, completions
Gen AI / LLM / Agents
LLM Core Topics
- Transformers — self-attention mechanism; encoder, decoder, and encoder-decoder variants
- Embeddings — convert text/entities to dense vectors
- Vector databases — store and search embeddings (Pinecone, Weaviate, Milvus, pgvector)
- RAG (Retrieval Augmented Generation) — retrieve relevant context, then generate answer
- Prompt engineering — crafting effective prompts (few-shot, chain-of-thought, system prompts)
- Evaluation of LLMs — BLEU, ROUGE, human eval, LLM-as-judge
- Hallucination — model generates plausible but incorrect information
- Guardrails — input/output validation, content filtering
- Safety — preventing harmful outputs, jailbreak detection
- Agents — LLMs that can take actions using tools
- Tool calling / Function calling — LLM invokes external functions
- Semantic search — search by meaning rather than keywords
- NL → SQL systems — convert natural language questions to SQL queries
- Knowledge graphs + LLM — structured data enhancing LLM reasoning
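The retrieve-then-generate shape of RAG can be sketched end to end, substituting word overlap for real embedding search and stopping at prompt assembly (no actual LLM call). The documents and query are invented:

```python
def retrieve(query, docs, k=2):
    """Rank docs by word overlap with the query (stand-in for embedding search)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    """Assemble a RAG prompt: retrieved context first, question last."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Campaign spring_sale had a CTR of 2.4% last week.",
    "The device graph links phones and TVs by household IP.",
    "ROAS is revenue divided by ad spend.",
]
prompt = build_prompt("What was the CTR of spring_sale?", docs)
```

In a real system, `retrieve` becomes a vector-database query over embeddings and `prompt` is sent to the LLM; grounding the answer in retrieved context is what reduces hallucination.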
Typical GenAI Pipeline

User query → Embed query → Vector DB retrieval → Assemble context → LLM generation → Guardrails / validation → Response
Reporting Agent Architecture

Natural language question → LLM agent → Tool calls (e.g., NL → SQL) → Query warehouse → Summarize results → Insight / report
SQL & Data Modeling
SQL Essentials
- Joins — INNER, LEFT, RIGHT, FULL OUTER, CROSS
- Group by — aggregate rows by column values
- Window functions — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER
- Partition by — window function partitioning
- Ranking — order results within groups
- CTEs — Common Table Expressions (WITH clause) for readable queries
- Indexing — speed up queries on specific columns
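A runnable window-function example using Python's built-in sqlite3 (the table and values are invented): rank ads by clicks within each campaign with `ROW_NUMBER() OVER (PARTITION BY ...)`:

```python
import sqlite3

# In-memory demo table of per-ad click counts
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (campaign TEXT, ad TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [("c1", "a1", 50), ("c1", "a2", 30), ("c2", "b1", 70)],
)

# Rank ads by clicks within each campaign
rows = conn.execute("""
    SELECT campaign, ad, n,
           ROW_NUMBER() OVER (PARTITION BY campaign ORDER BY n DESC) AS rnk
    FROM clicks
""").fetchall()
```

The same pattern (swap `ROW_NUMBER` for `LAG`/`LEAD`/`SUM OVER`) covers most analytics interview questions on window functions.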
Data Modeling for Ads
- Star schema — central fact table surrounded by dimension tables
- Fact table — stores measurable events (impressions, clicks, conversions)
- Dimension table — stores descriptive attributes (user, campaign, creative, device)
- OLAP vs OLTP — analytics vs transactions
KPIs / Objective Functions
Understanding which objective function to use for each system:
| System | Objective | Why |
|---|---|---|
| CTR model | Log Loss | Penalizes confident wrong predictions |
| Ranking | NDCG | Measures quality of ranked list |
| Recommendation | MAP (Mean Average Precision) | Evaluates precision at each relevant item |
| Segmentation | Silhouette Score | Measures cluster quality |
| Attribution | Incrementality | Measures true causal impact of ad |
| Budget pacing | Spend vs Target | Minimize deviation from planned spend |
| Identity resolution | Match accuracy / precision-recall | Correct device-to-user mapping |
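NDCG from the ranking row above is worth knowing by formula: discounted gain of the list as ranked, divided by the gain of the ideal ordering. A minimal implementation:

```python
import math

def ndcg(relevances, k=None):
    """NDCG: DCG of the ranked list divided by DCG of the ideal ordering."""
    rel = relevances[:k] if k else relevances
    # DCG discounts gains by log2 of (1-based) position + 1
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    ideal = sorted(relevances, reverse=True)[:len(rel)]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

perfect = ndcg([3, 2, 1])   # already ideally ordered -> 1.0
swapped = ndcg([1, 2, 3])   # best item ranked last -> below 1.0
```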
MLOps / Production ML
Core MLOps Concepts
- Model deployment — serving model predictions via API or batch job
- Batch vs real-time inference — scheduled predictions vs on-demand
- Feature store — consistent features for training and serving
- Model versioning — track model artifacts and lineage
- Monitoring — track prediction quality, latency, throughput
- Data drift — input data distribution changes over time
- Concept drift — relationship between features and target changes
- Retraining pipelines — automated model refresh on new data
- A/B model testing — compare old model vs new model on live traffic
- Shadow deployment — run new model alongside old without serving to users
- Canary deployment — roll out new model to small % of traffic first
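Data drift monitoring is often done with the Population Stability Index (PSI) between training-time and live feature distributions. A sketch with invented score samples; the PSI > 0.2 drift threshold is a common rule of thumb, not a universal standard:

```python
import math

def psi(expected, actual, buckets=None):
    """Population Stability Index between two samples of a numeric feature."""
    if buckets is None:  # default: 5 equal-width bins over the expected sample
        lo, hi = min(expected), max(expected)
        step = (hi - lo) / 5
        buckets = [lo + step * i for i in range(1, 5)]

    def frac(sample):
        counts = [0] * (len(buckets) + 1)
        for x in sample:
            counts[sum(x > b for b in buckets)] += 1
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # scores at training time
live_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # current production scores
drift = psi(train_scores, live_scores)  # identical samples -> ~0, no drift
```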
MLOps Architecture

Data → Feature store → Training pipeline → Model registry → Deployment (batch or real-time) → Monitoring → Retraining trigger
Learning Path
Phase 1: Foundations
- ML basics — regression, classification, trees
- Metrics — precision, recall, F1, AUC
- Feature engineering techniques
Phase 2: Domain Knowledge
- Ads systems — CTR, ranking, auctions
- Recommendation systems — collaborative & content-based filtering
- Identity resolution — deterministic, probabilistic, graph-based
- Device graph algorithms
- A/B testing fundamentals
Phase 3: Infrastructure
- Data engineering — Spark, data pipelines, data lakes
- ML system design patterns
- Feature store architecture
- MLOps & deployment strategies
Phase 4: GenAI
- RAG architecture
- Agents & tool calling
- NL to SQL
- Embeddings & vector databases
- LLM evaluation methods
Most Important Topics (Focus These First)
If limited on time, prioritize in this order:
- ML basics — regression, classification, trees, metrics
- Feature engineering — the #1 differentiator in applied ML
- Ads ranking / CTR prediction — core of ad tech ML
- Identity resolution / device graph — unique to this domain
- A/B testing — fundamental to experimentation culture
- Data pipelines — Spark, ETL, data lakes
- ML system design — end-to-end thinking
- RAG / LLM / Agents — the GenAI wave in ads
- SQL — the universal data language
- Metrics / KPIs — knowing what to optimize
This is a living document. I'll keep updating as I dive deeper into each topic.