
airflow

npx machina-cli add skill G1Joshi/Agent-Skills/airflow --openclaw

Airflow

Apache Airflow is the de facto standard for orchestrating data engineering pipelines. Version 3.0 (2025) introduces Event-driven Triggers and a modern React UI.

When to Use

  • ETL/ELT: Scheduling nightly data warehouse loads.
  • ML Ops: Retraining models when new data arrives.
  • Dependency Management: "Run Task B only if Task A succeeds".

Core Concepts

DAGs (Directed Acyclic Graphs)

Defined in Python.

Task SDK

New in v3.0. Allows writing tasks in any language, not just Python.

Edge Executor

Run tasks on remote edge devices.

Best Practices (2025)

Do:

  • Use the TaskFlow API: @task decorators are cleaner than PythonOperator.
  • Use Datasets: Define data-aware scheduling (schedule=[Dataset("s3://bucket/file")]).

Don't:

  • Don't put top-level code in DAG files: it runs every time the scheduler re-parses the file, which happens continuously.

References

Source

View on GitHub: https://github.com/G1Joshi/Agent-Skills/blob/main/skills/ai-ml/airflow/SKILL.md

Overview

Apache Airflow is the de facto standard for orchestrating data engineering pipelines: it schedules ETL/ELT workflows, ML model retraining, and inter-task dependencies reliably. Airflow 3.0 introduces Event-driven Triggers and a modern React UI.

How This Skill Works

Pipelines are defined as DAGs written in Python. The Task SDK in v3.0 lets you write tasks in languages other than Python, expanding what you can run in a workflow. The Edge Executor enables executing tasks on remote edge devices, broadening where tasks can run.

When to Use It

  • ETL/ELT: Scheduling nightly data warehouse loads.
  • ML Ops: Retraining models when new data arrives.
  • Dependency management: Run Task B only if Task A succeeds.
  • Edge computing: Execute tasks on remote edge devices with the Edge Executor.
  • Event-driven workflows: Trigger pipelines in response to data events.

Quick Start

  1. Install Airflow and initialize the metadata database.
  2. Create a DAG using the TaskFlow API with @task-decorated functions.
  3. Configure a Dataset or an event trigger to enable data-aware or event-driven scheduling.

Best Practices

  • Use the TaskFlow API: @task decorators provide cleaner DAGs than traditional operators.
  • Use Datasets: Define data-aware scheduling to react to data presence.
  • Don't put top-level code in DAG files: it runs every time the scheduler re-parses the file.
  • Explore the Task SDK to write tasks in languages other than Python.
  • Leverage Event-driven Triggers in v3.0 to start jobs when data events occur.

Example Use Cases

  • Nightly ETL for a data warehouse to refresh dashboards.
  • ML model retraining triggered by the arrival of new training data.
  • Coordinate Task B to run only after Task A completes successfully.
  • Running analytics tasks on edge devices using the Edge Executor.
  • Event-driven pipelines that kick off when new data lands in a data lake.
