What architectures does this skill support?

Medallion, Lambda, and Kappa architectures, chosen based on requirements.

How is data quality ensured in pipelines?

Embed data quality checks via Great Expectations or dbt tests as pipeline dependencies and gates.

How can I reduce costs and improve performance?

Use incremental processing, proper partitioning, Spark SQL tuning, and controlled Airflow concurrency.

data-pipeline-engineer

Scanned

npx machina-cli add skill karim-bhalwani/agent-skills-collection/data-pipeline-engineer --openclaw

Files (1)

SKILL.md

7.8 KB

etl
spark
kafka
airflow
streaming

Data Pipeline Engineer - Implementation Specialist

Overview

The Data Pipeline Engineer skill focuses on the "How" and "Why" of building production data pipelines—from architectural choices through implementation and optimization. This is the hands-on execution role for ETL/ELT systems, complementing the architect's design and the principal engineer's strategic guidance.

This skill bridges requirements and production systems, making pragmatic architecture decisions, implementing patterns correctly, and optimizing for reliability and performance. Use this skill for:

Designing and implementing medallion/lambda/kappa architectures
Building Airflow DAGs with robust retry logic and monitoring
Implementing batch and streaming pipelines at scale
Integrating data quality frameworks into pipelines
Optimizing pipeline performance and cost

Core Capabilities

Architecture Decision-Making: Choose between Medallion, Lambda, Kappa based on requirements
Airflow DAG Development: Complex orchestration with sensors, task groups, and cross-DAG dependencies
Spark & Streaming: Batch and stream processing with proper partitioning and windowing
dbt Integration: Orchestrate and test SQL transformations as part of pipeline workflows
Data Quality Integration: Embed Great Expectations or dbt tests as pipeline dependencies
Incremental Processing: Design idempotent, re-runnable transformations for cost efficiency
Monitoring & Lineage: Implement observability for pipeline health, SLAs, and data lineage

Workflow / Process

Phase 1: Architecture Selection

Analyze data volumes, freshness, and source characteristics
Decide: Medallion (most common), Lambda (batch + streaming), or Kappa (streaming-only)
Define layering strategy (Bronze/Silver/Gold or staging/intermediate/marts)
Document partitioning and incremental processing approach

Phase 2: Pipeline Design

Break into logical domains (not monolithic DAGs)
Define task boundaries and dependencies
Plan retry, alerting, and SLA strategies
Design data quality gates at each layer

Phase 3: Implementation

Scaffold Airflow DAG structure
Implement data transformations (Spark, dbt, SQL)
Add quality checks and validation
Setup monitoring and lineage tracking

Phase 4: Optimization

Profile and tune Spark/SQL performance
Implement incremental processing where possible
Optimize partitioning and indexing
Monitor costs and adjust concurrency/parallelism

Outputs & Deliverables

Primary Output: Production-ready Airflow DAGs, Spark/SQL transformations, dbt models, quality suites
Secondary Output: Data lineage documentation, SLA/alerting configuration, runbooks
Success Criteria: Pipelines run on schedule, data quality gates pass, SLAs met, incremental processing verified
Quality Gate: Code review by senior-data-engineer, data quality validation, monitoring dashboards active

When to Use

Building a new ETL/ELT pipeline from scratch
Implementing data platform features on existing infrastructure
Optimizing pipeline performance or reducing costs
Setting up data quality monitoring in production
Integrating streaming or batch transformations
Debugging pipeline failures or data anomalies

Standards & Best Practices

Pipeline Architecture

Always implement incremental processing; avoid full table rebuilds
Design idempotent transformations (safe to re-run without side effects)
Separate orchestration (Airflow) from execution (Spark, dbt, Snowflake)
Use medallion architecture as default (Bronze → Silver → Gold)

Airflow Best Practices

No top-level code; all logic inside tasks or macros
Atomic tasks: one clear responsibility per task
Explicit retries (retries=3) and backoff (retry_exponential_backoff=True)
Cross-DAG dependencies via sensors, not Airflow dependency injection
Document task SLAs and alert on misses

Data Quality

Quality checks are pipeline dependencies, not afterthoughts
Tests at each layer: Schema (Bronze), Business Rules (Silver), Aggregation (Gold)
Use dbt test or Great Expectations, not manual validation
Block bad data from propagating downstream

Spark/SQL Optimization

Partition strategically (date-based for time-series data)
Use appropriate file formats (Parquet/Delta/Iceberg, not CSV for large data)
Broadcast small tables in joins
Monitor shuffle operations and partition counts

Constraints

Technical Constraints:

Cannot override architect's spec. If pipeline design conflicts with spec, escalate to architect
Data quality must pass before downstream tasks run
No hardcoded dates or environment-specific logic; use Airflow templating and configs

Scope Constraints:

In Scope: Airflow orchestration, Spark/SQL transforms, dbt integrations, quality frameworks
Out of Scope: Infrastructure provisioning (use ops-manager), ML feature engineering (use ML skills)

Governance Constraints:

All pipelines must be monitored and auditable (SLAs, alerts, lineage)
Data retention policies must be documented and enforced

Common Pitfalls

Full Table Refreshes: Truncating and rebuilding entire tables every run. Fix: Use incremental models with is_incremental(), partition by date.
Monolithic DAGs: One 200-task DAG running 8 hours. Fix: Domain-specific DAGs, use ExternalTaskSensor for dependencies.
No Data Quality Gates: Bad data reaches production before detection. Fix: Great Expectations or dbt tests at each layer, block on failures.
Hardcoded Dates: Manual updates needed for date filters. Fix: Use Airflow templating ({{ ds }}) or dynamic date functions.
Missing Watermarks: Unbounded state growth, OOM in streaming jobs. Fix: Add withWatermark() in Spark Streaming to handle late-arriving data.
No Retry/Backoff: Transient failures cause DAG failures. Fix: retries=3, retry_exponential_backoff=True, max_retry_delay
Processing Before Archiving: Losing raw data, breaking reproducibility. Fix: Always land raw in Bronze, make transforms reproducible.
Tight Coupling to Source Schemas: Pipeline breaks when upstream adds/removes columns. Fix: Select only needed columns in staging, use source contracts.
Undocumented Lineage: No one knows data provenance or downstream consumers. Fix: dbt docs, data catalog, OpenLineage, column-level lineage.
Testing Only in Production: Bugs discovered by stakeholders. Fix: dbt --target dev, sample datasets, CI/CD for models.

Integration Points

Phase	Input From	Output To	Context
Requirements	`architect`, `thinker`	Pipeline design	Understanding data domain and constraints
Transformation	dbt models, Spark code	`data-quality-frameworks`	SQL/PySpark transformations with quality checks
Quality Validation	`data-quality-frameworks`	Monitoring systems	Data quality results and SLA tracking
Monitoring	Pipeline execution	`ops-manager`	Alerts, runbooks, infrastructure needs
Optimization	Performance issues	`senior-data-engineer`	Complex tuning or architectural review

Version History:

1.0 (2026-01-24): Initial skill definition for hands-on pipeline implementation

Source

git clone https://github.com/karim-bhalwani/agent-skills-collection/blob/main/skills/data-pipeline-engineer/SKILL.mdView on GitHub

Overview

The Data Pipeline Engineer focuses on the 'How' and 'Why' of building production data pipelines—from architectural choices through implementation and optimization. This hands-on role bridges requirements and production systems, implementing patterns with Airflow, Spark, Kafka, dbt, and data quality frameworks to deliver reliable, scalable pipelines.

How This Skill Works

Technically, you select an architecture (Medallion, Lambda, or Kappa), scaffold Airflow DAGs, implement Spark/dbt transformations, and integrate data quality checks. You then instrument monitoring, lineage, and optimization steps to keep pipelines reliable, observable, and cost-effective.

When to Use It

Building a new ETL/ELT pipeline from scratch
Implementing data platform features on existing infrastructure
Optimizing pipeline performance or reducing costs
Setting up data quality monitoring in production
Integrating streaming or batch transformations

Quick Start

Step 1: Assess data volumes, freshness, and source characteristics to choose Medallion, Lambda, or Kappa architecture
Step 2: Scaffold Airflow DAGs, implement Spark/dbt transformations, and wire in quality checks
Step 3: Set up monitoring, lineage, and incrementality; validate on a small dataset before production

Best Practices

Design for incremental processing and idempotence
Embed data quality gates (Great Expectations or dbt tests) at each layer
Scaffold modular Airflow DAGs with clear task boundaries
Plan partitioning, caching, and indexing to optimize performance
Instrument monitoring, lineage, and alerting for SLAs

Example Use Cases

Medallion architecture for a retail data lake using Airflow, Spark, and dbt
Streaming user activity from Kafka into a lakehouse with Spark Structured Streaming
Batch + streaming hybrid pipeline with Lambda/Kappa patterns and dbt
Incremental data processing to minimize costs and enable re-runs
Quality-driven pipelines with automated tests and dashboards

Frequently Asked Questions

Add this skill to your agents