Apache Airflow - an open-source orchestration engine

The architecture of Apache Airflow, how DAGs help design complex flows and dependencies, and how we can leverage Apache Airflow to train an ML model and monitor it.

Apache Airflow - Going deep


Background & Prerequisites — What You Need to Know Before Writing This Blog

Before writing this blog, the following topics need to be studied and understood in depth. Each section below explains what the topic is, why it matters for this blog, and what you need to learn.


1. Python Fundamentals (Intermediate Level)

Why: Airflow DAGs are written in Python. You need solid Python skills to write operators, hooks, and custom logic.

- Decorators — Airflow 2.x uses the @dag and @task decorators extensively. Understand how Python decorators wrap functions, accept arguments, and return callables.
- Context managers — Used for resource management (DB connections, file handles) inside tasks.
- Generators & iterators — Useful for processing large datasets in tasks without loading everything into memory.
- Type hints — Airflow's TaskFlow API uses them for automatic XCom serialization.
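To make the decorator point concrete, here is a small self-contained sketch (plain Python, not Airflow code) of a decorator that, like @task, works both bare and with arguments, and wraps the function with extra behavior:

```python
import functools

def task(func=None, *, retries=0):
    """Toy decorator mirroring the shape of Airflow's @task:
    usable bare (@task) or with arguments (@task(retries=2))."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            # A real operator would also log, push XComs, etc.
            last_err = None
            for _ in range(retries + 1):
                try:
                    return f(*args, **kwargs)
                except Exception as err:
                    last_err = err
            raise last_err
        return wrapper
    # Support both the @task and @task(...) call styles.
    return decorate if func is None else decorate(func)

@task
def extract():
    return [1, 2, 3]

@task(retries=2)
def transform(rows):
    return [r * 10 for r in rows]

print(transform(extract()))  # [10, 20, 30]
```

The `func=None` trick is what lets the same decorator be applied with or without parentheses, which is exactly the ergonomics @task offers.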

2. Airflow Core Architecture

Why: You cannot write about Airflow without understanding its internal components and how they interact.

- Scheduler — The brain of Airflow. It parses DAG files, determines which tasks are ready to run, and places them in a queue. Understand how the scheduler loop works, how it reads DAG files from the dags_folder, and its heartbeat mechanism.
- Webserver — Flask-based UI that shows DAG status, task logs, Gantt charts, and allows manual triggers. Know how it connects to the metadata DB.
- Executor — The execution engine. Different types:
  - SequentialExecutor — runs one task at a time (dev only)
  - LocalExecutor — runs tasks as local subprocesses (good for single-machine)
  - CeleryExecutor — distributes tasks across worker machines using Celery + a message broker (Redis/RabbitMQ)
  - KubernetesExecutor — spins up a new Kubernetes pod per task (best for cloud-native)
- Metadata Database — PostgreSQL/MySQL database storing DAG definitions, task states, XCom values, connections, and variables. Every component talks to this DB.
- Workers — Processes that actually execute the tasks (relevant for Celery/K8s executors).
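The scheduler's core decision can be sketched in a few lines of plain Python: a task becomes ready when all of its upstream tasks have finished. This is a toy model, not the real scheduler loop (which also handles timetables, pools, and heartbeats):

```python
from collections import deque

def schedule(dependencies):
    """dependencies maps task_id -> set of upstream task_ids.
    Returns task_ids in a valid execution order, like one
    idealized scheduler pass over a single DAG run."""
    remaining = {t: set(ups) for t, ups in dependencies.items()}
    # Tasks with no upstreams are ready immediately.
    queue = deque(t for t, ups in remaining.items() if not ups)
    order = []
    while queue:
        task_id = queue.popleft()   # "queued" -> an executor would pick it up
        order.append(task_id)
        for t, ups in remaining.items():
            if task_id in ups:
                ups.remove(task_id)
                if not ups:         # all upstreams done: task becomes ready
                    queue.append(t)
    return order

dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
print(schedule(dag))  # ['extract', 'transform', 'load']
```

Everything the real scheduler queues is handed off to the configured executor; the separation between "deciding what is ready" and "actually running it" is the key architectural split.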

3. DAGs (Directed Acyclic Graphs)

Why: The central concept in Airflow — every workflow is a DAG.

- What is a DAG — A collection of tasks with directed dependencies and no cycles. If Task A → Task B → Task C, then C cannot depend back on A.
- DAG definition — Writing a Python file that instantiates a DAG object with dag_id, schedule_interval, start_date, catchup, and default_args.
- Schedule intervals — Cron expressions (0 2 * * *), presets (@daily, @hourly), timedelta objects, or dataset-triggered schedules (Airflow 2.4+).
- Catchup & Backfill — When catchup=True, Airflow runs all missed intervals since start_date. Understand when to enable/disable this.
- DAG dependencies — Using TriggerDagRunOperator or Datasets to create cross-DAG dependencies.
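The catchup behavior above is easy to misjudge, so here is a small stand-alone illustration (not Airflow's actual timetable code) of which logical dates catchup=True would backfill for a daily schedule:

```python
from datetime import datetime, timedelta

def missed_intervals(start_date, interval, now):
    """Toy illustration of catchup=True: every completed interval
    between start_date and now gets its own DAG run, identified by
    the logical (execution) date at which the interval begins."""
    runs = []
    logical_date = start_date
    # A run for an interval fires only once that interval has ended.
    while logical_date + interval <= now:
        runs.append(logical_date)
        logical_date += interval
    return runs

runs = missed_intervals(datetime(2024, 1, 1), timedelta(days=1), datetime(2024, 1, 5))
print([r.day for r in runs])  # [1, 2, 3, 4]
```

Note the off-by-one that surprises newcomers: the run with logical date Jan 1 actually executes after Jan 2 begins, because its data interval must be complete first.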

4. Operators, Sensors & Hooks

Why: These are the building blocks of tasks in Airflow.

- Operators — Pre-built task templates:
  - BashOperator — runs shell commands
  - PythonOperator — runs Python callables
  - EmailOperator — sends emails
  - Cloud-specific: BigQueryOperator, S3ToGCSOperator, AzureDataFactoryRunPipelineOperator
- Sensors — Special operators that wait for a condition:
  - FileSensor — waits for a file to appear
  - HttpSensor — polls an HTTP endpoint
  - ExternalTaskSensor — waits for a task in another DAG
  - Poke mode vs Reschedule mode — Poke holds the worker slot; Reschedule frees it between checks.
- Hooks — Interfaces to external systems (databases, APIs, cloud services). They handle authentication and connection management. Connections are stored in the metadata DB.
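Poke mode is simple enough to sketch in plain Python. This toy sensor (not Airflow code) blocks in a loop, which is exactly why poke mode ties up a worker slot for the whole wait:

```python
import time

def poke_sensor(condition, poke_interval=1.0, timeout=5.0):
    """Toy poke-mode sensor: blocks, holding its worker slot, and
    re-checks `condition` until it returns True or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        # Reschedule mode would instead exit here and free the slot,
        # letting the scheduler re-queue the sensor after the interval.
        time.sleep(poke_interval)
    raise TimeoutError("condition never became true")

# Example: a condition that flips to True on the third check.
checks = iter([False, False, True])
print(poke_sensor(lambda: next(checks), poke_interval=0.01, timeout=1.0))  # True
```

For long waits (hours), reschedule mode is almost always the right choice; poke mode only pays off when the condition is expected to clear within a few poke intervals.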

5. TaskFlow API (Airflow 2.x)

Why: The modern way of writing Airflow DAGs, cleaner and more Pythonic.

- @task decorator — Turns a Python function into an Airflow task with automatic XCom push/pull.
- @dag decorator — Defines the DAG as a decorated function.
- Automatic dependency inference — When you pass the output of one @task to another, Airflow infers the dependency automatically.
- XCom serialization — TaskFlow uses XCom to pass data between tasks. Understand size limits and serialization formats.
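The "automatic dependency inference" can feel magical, so here is a minimal toy model of how it works (this reimplements the idea in plain Python; in real Airflow the lazy reference is an XComArg and results travel through XCom):

```python
class TaskOutput:
    """Stand-in for Airflow's XComArg: a lazy reference to a task's result."""
    def __init__(self, name, func, args):
        self.name, self.func, self.args = name, func, args
        # Passing one task's output into another records the edge.
        self.upstream = [a.name for a in args if isinstance(a, TaskOutput)]

    def resolve(self):
        # Pull upstream results (Airflow would read them from XCom).
        vals = [a.resolve() if isinstance(a, TaskOutput) else a for a in self.args]
        return self.func(*vals)

def task(func):
    """Toy @task: calling the decorated function builds a TaskOutput
    instead of running it, so data flow doubles as the dependency graph."""
    return lambda *args: TaskOutput(func.__name__, func, args)

@task
def extract():
    return [1, 2, 3]

@task
def load(rows):
    return sum(rows)

out = load(extract())   # dependency inferred from the data flow
print(out.upstream)     # ['extract']
print(out.resolve())    # 6
```

This is the key shift from Airflow 1.x: instead of wiring tasks with explicit `>>` operators, the function-call graph itself defines the DAG.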

6. XComs (Cross-Communication)

Why: The mechanism for tasks to share data.

- Push/Pull — Tasks push values to XCom (return value or xcom_push()); others pull with xcom_pull().
- Limitations — XComs are stored in the metadata DB. Don't pass large datasets through XCom (use external storage instead). The effective size limit varies by backend, but staying under ~64 KB is a safe rule of thumb.
- Custom XCom backends — You can configure S3, GCS, or Azure Blob as XCom backends for larger payloads.
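The small-vs-large payload decision that a custom XCom backend automates can be sketched like this (toy code; the ~64 KB threshold is an illustrative guideline, not an Airflow constant, and the dict stands in for object storage):

```python
import json

XCOM_SOFT_LIMIT = 64 * 1024  # illustrative ~64 KB rule of thumb

def push_value(value, store):
    """Toy XCom push: small payloads stay inline (metadata DB),
    large ones go to external storage with only a reference kept."""
    payload = json.dumps(value)
    if len(payload.encode()) <= XCOM_SOFT_LIMIT:
        return {"type": "inline", "data": payload}
    key = f"xcom/{len(store)}.json"
    store[key] = payload                  # stands in for an S3/GCS/Blob upload
    return {"type": "ref", "key": key}

store = {}
print(push_value({"rows": 10}, store)["type"])          # inline
print(push_value(list(range(50_000)), store)["type"])   # ref
```

A real custom XCom backend does exactly this inside `serialize_value`/`deserialize_value`, so downstream tasks are unaware of where the payload physically lives.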

7. Connections, Variables & Secrets

Why: How Airflow manages credentials and configuration.

- Connections — Stored in the metadata DB or a secrets backend. Each connection has a conn_id, type (e.g., aws, azure, postgres), host, port, login, password, and extras (JSON).
- Variables — Key-value pairs for configuration. Accessed via Variable.get('key').
- Secrets Backends — Airflow can pull connections/variables from AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, etc.
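Connections can also be supplied as URIs via environment variables named `AIRFLOW_CONN_<CONN_ID>`. The fields listed above map directly onto the parts of such a URI; the sample URI and credentials below are made up:

```python
from urllib.parse import urlsplit

# Example value for an AIRFLOW_CONN_MY_DB environment variable.
uri = "postgresql://app_user:s3cret@db.internal:5432/analytics"

parts = urlsplit(uri)
conn = {
    "conn_type": parts.scheme,
    "login": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "schema": parts.path.lstrip("/"),  # Airflow calls the database name "schema"
}
print(conn["host"], conn["port"])  # db.internal 5432
```

Environment-variable connections never touch the metadata DB, which makes them handy for containerized deployments where secrets are injected at runtime.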

8. ML Pipeline Integration with Airflow

Why: The blog's core goal — orchestrating ML training and monitoring.

- Data ingestion tasks — Pulling training data from external sources and validating its schema.
- Feature engineering — Transformation tasks that prepare features.
- Model training — Triggering training jobs (local Python, Spark, or cloud ML services like Azure ML, SageMaker, Vertex AI).
- Model evaluation — Computing metrics, comparing against a baseline, and deciding whether to deploy.
- Model registry — Logging the trained model to MLflow, Azure ML, or a custom registry.
- Monitoring tasks — Checking for model drift, data drift, and performance degradation.
- Retraining triggers — Dataset-driven or scheduled retraining pipelines.
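The "deciding whether to deploy" step typically becomes a branching task in the DAG. A minimal sketch of such a gate, written as the kind of callable you might hand to a BranchPythonOperator (the metric, threshold, and downstream task ids are hypothetical):

```python
def evaluation_gate(candidate_auc, baseline_auc, min_gain=0.005):
    """Toy deploy/skip decision: returns the task_id the pipeline
    should follow next, as a branching callable would in Airflow."""
    if candidate_auc >= baseline_auc + min_gain:
        return "register_model"   # hypothetical downstream task id
    return "skip_deployment"      # hypothetical downstream task id

print(evaluation_gate(0.91, 0.89))  # register_model
print(evaluation_gate(0.89, 0.89))  # skip_deployment
```

Requiring a minimum gain over the baseline (rather than any improvement) avoids churning the model registry over noise-level metric differences.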

9. Deployment & Infrastructure

Why: Practical knowledge needed for running Airflow in production.

- Docker Compose — The standard way to run Airflow locally (official docker-compose.yaml).
- Helm Chart — For Kubernetes deployments (official Airflow Helm chart).
- Managed Airflow — Cloud-managed options: Google Cloud Composer, AWS MWAA, Astronomer.
- DAG deployment — Git-sync, CI/CD pipelines, or a shared NFS mount.
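As a quick reference for the Docker Compose route, the local setup roughly follows the official quick-start steps below (the pinned version in the URL is an example; substitute the release you want):

```shell
# Fetch the official compose file for a given Airflow version
curl -LfO "https://airflow.apache.org/docs/apache-airflow/2.9.3/docker-compose.yaml"

# Folders mounted into the containers for DAGs, logs, and plugins
mkdir -p ./dags ./logs ./plugins ./config

# Run containers as your user so mounted files stay writable
echo "AIRFLOW_UID=$(id -u)" > .env

docker compose up airflow-init   # one-time DB migration + admin user
docker compose up                # start scheduler, webserver, workers
```

After startup the webserver is reachable on localhost (port 8080 by default), and anything dropped into ./dags is picked up by the scheduler.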


TODO / Remaining Work

Coming soon.
