Apache Airflow - an open-source orchestration engine

The architecture of Apache Airflow, how DAGs help design complex flows and dependencies, and how we can leverage Apache Airflow to train an ML model and monitor it.

Apache Airflow - Going deep


Background & Prerequisites — What You Need to Know Before Writing This Blog

Before writing this blog, the following topics need to be studied and understood in depth. Each section below explains what the topic is, why it matters for this blog, and what you need to learn.


1. Python Fundamentals (Intermediate Level)

Why: Airflow DAGs are written in Python. You need solid Python skills to write operators, hooks, and custom logic.

- Decorators — Airflow 2.x uses the @dag and @task decorators extensively. Understand how Python decorators wrap functions, accept arguments, and return callables.
- Context managers — Used for resource management (DB connections, file handles) inside tasks.
- Generators & iterators — Useful for processing large datasets in tasks without loading everything into memory.
- Type hints — Airflow's TaskFlow API uses them for automatic XCom serialization.
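To make the decorator point concrete, here is a small self-contained sketch (plain Python, not Airflow code) of a decorator that, like @task, works both bare and with arguments, and wraps the function with extra behavior:

```python
import functools

def task(func=None, *, retries=0):
    """Toy decorator mirroring the shape of Airflow's @task:
    usable bare (@task) or with arguments (@task(retries=2))."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            # A real operator would also log, push XComs, etc.
            last_err = None
            for _ in range(retries + 1):
                try:
                    return f(*args, **kwargs)
                except Exception as err:
                    last_err = err
            raise last_err
        return wrapper
    # Support both the @task and @task(...) call styles.
    return decorate if func is None else decorate(func)

@task
def extract():
    return [1, 2, 3]

@task(retries=2)
def transform(rows):
    return [r * 10 for r in rows]

print(transform(extract()))  # [10, 20, 30]
```

The `func=None` trick is what lets the same decorator be applied with or without parentheses, which is exactly the ergonomics @task offers.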

2. Airflow Core Architecture

Why: You cannot write about Airflow without understanding its internal components and how they interact.

- Scheduler — The brain of Airflow. It parses DAG files, determines which tasks are ready to run, and places them in a queue. Understand how the scheduler loop works, how it reads DAG files from the dags_folder, and its heartbeat mechanism.
- Webserver — Flask-based UI that shows DAG status, task logs, Gantt charts, and allows manual triggers. Know how it connects to the metadata DB.
- Executor — The execution engine. Different types:
  - SequentialExecutor — runs one task at a time (dev only)
  - LocalExecutor — runs tasks as local subprocesses (good for single-machine)
  - CeleryExecutor — distributes tasks across worker machines using Celery + a message broker (Redis/RabbitMQ)
  - KubernetesExecutor — spins up a new Kubernetes pod per task (best for cloud-native)
- Metadata Database — PostgreSQL/MySQL database storing DAG definitions, task states, XCom values, connections, and variables. Every component talks to this DB.
- Workers — Processes that actually execute the tasks (relevant for Celery/K8s executors).
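The scheduler's core decision can be sketched in a few lines of plain Python: a task becomes ready when all of its upstream tasks have finished. This is a toy model, not the real scheduler loop (which also handles timetables, pools, and heartbeats):

```python
from collections import deque

def schedule(dependencies):
    """dependencies maps task_id -> set of upstream task_ids.
    Returns task_ids in a valid execution order, like one
    idealized scheduler pass over a single DAG run."""
    remaining = {t: set(ups) for t, ups in dependencies.items()}
    # Tasks with no upstreams are ready immediately.
    queue = deque(t for t, ups in remaining.items() if not ups)
    order = []
    while queue:
        task_id = queue.popleft()   # "queued" -> an executor would pick it up
        order.append(task_id)
        for t, ups in remaining.items():
            if task_id in ups:
                ups.remove(task_id)
                if not ups:         # all upstreams done: task becomes ready
                    queue.append(t)
    return order

dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
print(schedule(dag))  # ['extract', 'transform', 'load']
```

Everything the real scheduler queues is handed off to the configured executor; the separation between "deciding what is ready" and "actually running it" is the key architectural split.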

3. DAGs (Directed Acyclic Graphs)

Why: The central concept in Airflow — every workflow is a DAG.

- What is a DAG — A collection of tasks with directed dependencies and no cycles. If Task A → Task B → Task C, then C cannot depend back on A.
- DAG definition — Writing a Python file that instantiates a DAG object with dag_id, schedule_interval, start_date, catchup, and default_args.
- Schedule intervals — Cron expressions (0 2 * * *), presets (@daily, @hourly), timedelta objects, or dataset-triggered schedules (Airflow 2.4+).
- Catchup & Backfill — When catchup=True, Airflow runs all missed intervals since start_date. Understand when to enable/disable this.
- DAG dependencies — Using TriggerDagRunOperator or Datasets to create cross-DAG dependencies.
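The catchup behavior above is easy to misjudge, so here is a small stand-alone illustration (not Airflow's actual timetable code) of which logical dates catchup=True would backfill for a daily schedule:

```python
from datetime import datetime, timedelta

def missed_intervals(start_date, interval, now):
    """Toy illustration of catchup=True: every completed interval
    between start_date and now gets its own DAG run, identified by
    the logical (execution) date at which the interval begins."""
    runs = []
    logical_date = start_date
    # A run for an interval fires only once that interval has ended.
    while logical_date + interval <= now:
        runs.append(logical_date)
        logical_date += interval
    return runs

runs = missed_intervals(datetime(2024, 1, 1), timedelta(days=1), datetime(2024, 1, 5))
print([r.day for r in runs])  # [1, 2, 3, 4]
```

Note the off-by-one that surprises newcomers: the run with logical date Jan 1 actually executes after Jan 2 begins, because its data interval must be complete first.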

4. Operators, Sensors & Hooks

Why: These are the building blocks of tasks in Airflow.

- Operators — Pre-built task templates:
  - BashOperator — runs shell commands
  - PythonOperator — runs Python callables
  - EmailOperator — sends emails
  - Cloud-specific: BigQueryOperator, S3ToGCSOperator, AzureDataFactoryRunPipelineOperator
- Sensors — Special operators that wait for a condition:
  - FileSensor — waits for a file to appear
  - HttpSensor — polls an HTTP endpoint
  - ExternalTaskSensor — waits for a task in another DAG
  - Poke mode vs Reschedule mode — Poke holds the worker slot; Reschedule frees it between checks.
- Hooks — Interfaces to external systems (databases, APIs, cloud services). They handle authentication and connection management. Connections are stored in the metadata DB.
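Poke mode is simple enough to sketch in plain Python. This toy sensor (not Airflow code) blocks in a loop, which is exactly why poke mode ties up a worker slot for the whole wait:

```python
import time

def poke_sensor(condition, poke_interval=1.0, timeout=5.0):
    """Toy poke-mode sensor: blocks, holding its worker slot, and
    re-checks `condition` until it returns True or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        # Reschedule mode would instead exit here and free the slot,
        # letting the scheduler re-queue the sensor after the interval.
        time.sleep(poke_interval)
    raise TimeoutError("condition never became true")

# Example: a condition that flips to True on the third check.
checks = iter([False, False, True])
print(poke_sensor(lambda: next(checks), poke_interval=0.01, timeout=1.0))  # True
```

For long waits (hours), reschedule mode is almost always the right choice; poke mode only pays off when the condition is expected to clear within a few poke intervals.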

5. TaskFlow API (Airflow 2.x)

Why: The modern way of writing Airflow DAGs, cleaner and more Pythonic.

- @task decorator — Turns a Python function into an Airflow task with automatic XCom push/pull.
- @dag decorator — Defines the DAG as a decorated function.
- Automatic dependency inference — When you pass the output of one @task to another, Airflow infers the dependency automatically.
- XCom serialization — TaskFlow uses XCom to pass data between tasks. Understand size limits and serialization formats.
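The "automatic dependency inference" can feel magical, so here is a minimal toy model of how it works (this reimplements the idea in plain Python; in real Airflow the lazy reference is an XComArg and results travel through XCom):

```python
class TaskOutput:
    """Stand-in for Airflow's XComArg: a lazy reference to a task's result."""
    def __init__(self, name, func, args):
        self.name, self.func, self.args = name, func, args
        # Passing one task's output into another records the edge.
        self.upstream = [a.name for a in args if isinstance(a, TaskOutput)]

    def resolve(self):
        # Pull upstream results (Airflow would read them from XCom).
        vals = [a.resolve() if isinstance(a, TaskOutput) else a for a in self.args]
        return self.func(*vals)

def task(func):
    """Toy @task: calling the decorated function builds a TaskOutput
    instead of running it, so data flow doubles as the dependency graph."""
    return lambda *args: TaskOutput(func.__name__, func, args)

@task
def extract():
    return [1, 2, 3]

@task
def load(rows):
    return sum(rows)

out = load(extract())   # dependency inferred from the data flow
print(out.upstream)     # ['extract']
print(out.resolve())    # 6
```

This is the key shift from Airflow 1.x: instead of wiring tasks with explicit `>>` operators, the function-call graph itself defines the DAG.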

6. XComs (Cross-Communication)

Why: The mechanism for tasks to share data.

- Push/Pull — Tasks push values to XCom (return value or xcom_push()); others pull with xcom_pull().
- Limitations — XComs are stored in the metadata DB. Don't pass large datasets through XCom (use external storage instead). The effective size limit varies by backend, but staying under ~64 KB is a safe rule of thumb.
- Custom XCom backends — You can configure S3, GCS, or Azure Blob as XCom backends for larger payloads.
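The small-vs-large payload decision that a custom XCom backend automates can be sketched like this (toy code; the ~64 KB threshold is an illustrative guideline, not an Airflow constant, and the dict stands in for object storage):

```python
import json

XCOM_SOFT_LIMIT = 64 * 1024  # illustrative ~64 KB rule of thumb

def push_value(value, store):
    """Toy XCom push: small payloads stay inline (metadata DB),
    large ones go to external storage with only a reference kept."""
    payload = json.dumps(value)
    if len(payload.encode()) <= XCOM_SOFT_LIMIT:
        return {"type": "inline", "data": payload}
    key = f"xcom/{len(store)}.json"
    store[key] = payload                  # stands in for an S3/GCS/Blob upload
    return {"type": "ref", "key": key}

store = {}
print(push_value({"rows": 10}, store)["type"])          # inline
print(push_value(list(range(50_000)), store)["type"])   # ref
```

A real custom XCom backend does exactly this inside `serialize_value`/`deserialize_value`, so downstream tasks are unaware of where the payload physically lives.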

7. Connections, Variables & Secrets

Why: How Airflow manages credentials and configuration.

- Connections — Stored in the metadata DB or a secrets backend. Each connection has a conn_id, type (e.g., aws, azure, postgres), host, port, login, password, and extras (JSON).
- Variables — Key-value pairs for configuration. Accessed via Variable.get('key').
- Secrets Backends — Airflow can pull connections/variables from AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, etc.
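Connections can also be supplied as URIs via environment variables named `AIRFLOW_CONN_<CONN_ID>`. The fields listed above map directly onto the parts of such a URI; the sample URI and credentials below are made up:

```python
from urllib.parse import urlsplit

# Example value for an AIRFLOW_CONN_MY_DB environment variable.
uri = "postgresql://app_user:s3cret@db.internal:5432/analytics"

parts = urlsplit(uri)
conn = {
    "conn_type": parts.scheme,
    "login": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "schema": parts.path.lstrip("/"),  # Airflow calls the database name "schema"
}
print(conn["host"], conn["port"])  # db.internal 5432
```

Environment-variable connections never touch the metadata DB, which makes them handy for containerized deployments where secrets are injected at runtime.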

8. ML Pipeline Integration with Airflow

Why: The blog's core goal — orchestrating ML training and monitoring.

- Data ingestion tasks — Pulling training data from external sources and validating its schema.
- Feature engineering — Transformation tasks that prepare features.
- Model training — Triggering training jobs (local Python, Spark, or cloud ML services like Azure ML, SageMaker, Vertex AI).
- Model evaluation — Computing metrics, comparing against a baseline, and deciding whether to deploy.
- Model registry — Logging the trained model to MLflow, Azure ML, or a custom registry.
- Monitoring tasks — Checking for model drift, data drift, and performance degradation.
- Retraining triggers — Dataset-driven or scheduled retraining pipelines.
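The "deciding whether to deploy" step typically becomes a branching task in the DAG. A minimal sketch of such a gate, written as the kind of callable you might hand to a BranchPythonOperator (the metric, threshold, and downstream task ids are hypothetical):

```python
def evaluation_gate(candidate_auc, baseline_auc, min_gain=0.005):
    """Toy deploy/skip decision: returns the task_id the pipeline
    should follow next, as a branching callable would in Airflow."""
    if candidate_auc >= baseline_auc + min_gain:
        return "register_model"   # hypothetical downstream task id
    return "skip_deployment"      # hypothetical downstream task id

print(evaluation_gate(0.91, 0.89))  # register_model
print(evaluation_gate(0.89, 0.89))  # skip_deployment
```

Requiring a minimum gain over the baseline (rather than any improvement) avoids churning the model registry over noise-level metric differences.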

9. Deployment & Infrastructure

Why: Practical knowledge needed for running Airflow in production.

- Docker Compose — The standard way to run Airflow locally (official docker-compose.yaml).
- Helm Chart — For Kubernetes deployments (official Airflow Helm chart).
- Managed Airflow — Cloud-managed options: Google Cloud Composer, AWS MWAA, Astronomer.
- DAG deployment — Git-sync, CI/CD pipelines, or a shared NFS mount.
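As a quick reference for the Docker Compose route, the local setup roughly follows the official quick-start steps below (the pinned version in the URL is an example; substitute the release you want):

```shell
# Fetch the official compose file for a given Airflow version
curl -LfO "https://airflow.apache.org/docs/apache-airflow/2.9.3/docker-compose.yaml"

# Folders mounted into the containers for DAGs, logs, and plugins
mkdir -p ./dags ./logs ./plugins ./config

# Run containers as your user so mounted files stay writable
echo "AIRFLOW_UID=$(id -u)" > .env

docker compose up airflow-init   # one-time DB migration + admin user
docker compose up                # start scheduler, webserver, workers
```

After startup the webserver is reachable on localhost (port 8080 by default), and anything dropped into ./dags is picked up by the scheduler.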


TODO / Remaining Work

Coming soon.
