etl-pipeline-builder
npx machina-cli add skill a5c-ai/babysitter/etl-pipeline-builder --openclaw

ETL Pipeline Builder Skill
Builds and manages ETL (Extract, Transform, Load) pipelines for data migration, supporting incremental loads, CDC, and comprehensive monitoring.
Purpose
Enable data pipeline creation for:
- Source-to-target mapping
- Transformation definition
- Incremental load setup
- CDC configuration
- Pipeline monitoring
Capabilities
1. Source-to-Target Mapping
- Define column mappings
- Handle schema differences
- Configure data type conversions
- Manage derived columns
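A source-to-target mapping can be expressed as a small, declarative data structure. The sketch below is illustrative only (the field names and helper are not part of the skill's API, and the actual generated artifacts may differ); it shows column renames, type conversions, and a derived column rendered into a SQL select list.

```python
# Hypothetical mapping spec: column-level source-to-target mapping with
# type casts and one derived column. Names are illustrative only.
mapping = [
    {"source": "cust_id", "target": "customer_id", "cast": "BIGINT"},
    {"source": "created", "target": "created_at", "cast": "TIMESTAMP"},
    # Derived column: computed from an expression, no single source field.
    {"source": None, "target": "full_name",
     "expression": "first_name || ' ' || last_name"},
]

def to_select_list(mapping):
    """Render the mapping as a SQL SELECT list."""
    parts = []
    for m in mapping:
        if m.get("expression"):
            parts.append(f"{m['expression']} AS {m['target']}")
        elif m.get("cast"):
            parts.append(f"CAST({m['source']} AS {m['cast']}) AS {m['target']}")
        else:
            parts.append(f"{m['source']} AS {m['target']}")
    return ", ".join(parts)
```

Keeping the mapping as data (rather than hand-written SQL) makes schema differences explicit and lets the same spec drive validation.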
2. Transformation Definition
- Data type transformations
- Value mappings
- Aggregations
- Lookups and enrichments
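Value mappings and lookup enrichments from the list above can be sketched as pure row-level functions; the tables and field names here are assumptions for illustration, not the skill's actual transformation format.

```python
# Hypothetical transformation rules: a value map and a lookup enrichment.
STATUS_MAP = {"A": "active", "I": "inactive", "P": "pending"}
COUNTRY_LOOKUP = {"US": "United States", "DE": "Germany"}

def transform_row(row):
    """Apply a value mapping and a lookup enrichment to one record."""
    out = dict(row)
    # Value mapping: normalize coded statuses, defaulting unknown codes.
    out["status"] = STATUS_MAP.get(row["status"], "unknown")
    # Lookup/enrichment: add a human-readable country name.
    out["country_name"] = COUNTRY_LOOKUP.get(row["country"], row["country"])
    return out
```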
3. Incremental Load Setup
- Define watermarks
- Configure incremental columns
- Handle deletes
- Manage merge logic
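A watermark-based incremental extract can be sketched as follows. This is a minimal illustration, assuming a monotonically increasing incremental column such as an `updated_at` timestamp; persisting the new watermark and the merge/delete handling are noted in comments rather than implemented.

```python
def build_incremental_query(table, incremental_col, last_watermark):
    """Extract only rows changed since the stored watermark.

    After a successful load, the new watermark would typically be taken
    as MAX(incremental_col) from the loaded batch and persisted; deletes
    need separate handling (soft-delete flags or CDC), since they never
    appear in a watermark-filtered extract.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {incremental_col} > '{last_watermark}' "
        f"ORDER BY {incremental_col}"
    )
```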
4. CDC Configuration
- Log-based CDC
- Trigger-based CDC
- Timestamp-based CDC
- Full load comparison
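As a representative example of log-based CDC, a Debezium MySQL connector registration might look like the fragment below. Hostnames, credentials, and table names are placeholders, and the required properties vary by Debezium version (for example, newer releases also require schema-history settings); treat this as a sketch, not a complete configuration.

```json
{
  "name": "inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "change-me",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.orders,inventory.customers"
  }
}
```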
5. Error Handling
- Define retry policies
- Configure dead letter queues
- Handle data quality issues
- Implement alerting
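The retry and dead-letter behavior above can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: exponential backoff between attempts, and failed records parked in a dead-letter queue for inspection instead of blocking the pipeline.

```python
import time

def run_with_retry(task, record, max_retries=3, base_delay=1.0,
                   dead_letter=None):
    """Run `task` on `record`, retrying transient failures with
    exponential backoff; exhausted records go to a dead-letter queue."""
    for attempt in range(max_retries):
        try:
            return task(record)
        except Exception:
            if attempt < max_retries - 1:
                # Backoff: base_delay, 2*base_delay, 4*base_delay, ...
                time.sleep(base_delay * (2 ** attempt))
    if dead_letter is not None:
        dead_letter.append(record)  # park for inspection / later replay
    return None
```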
6. Pipeline Monitoring
- Track pipeline metrics
- Monitor data volumes
- Alert on failures
- Generate SLA reports
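A simple SLA check over per-run metrics might look like this; the metric names and thresholds are hypothetical, chosen only to show how duration and volume alerts could be derived from pipeline telemetry.

```python
def check_sla(run_metrics, max_duration_s, min_rows):
    """Return the list of SLA violations for one pipeline run."""
    violations = []
    if run_metrics["duration_s"] > max_duration_s:
        violations.append("duration")  # run exceeded its time budget
    if run_metrics["rows_loaded"] < min_rows:
        violations.append("volume")    # suspiciously low data volume
    return violations
```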
Tool Integrations
| Tool | Type | Integration Method |
|---|---|---|
| Apache Airflow | Orchestration | Python |
| dbt | Transformation | CLI |
| Airbyte | Data integration | API |
| Fivetran | SaaS ETL | API |
| AWS DMS | Cloud migration | CLI |
| Debezium | CDC | Config |
Output Schema

```json
{
  "pipelineId": "string",
  "timestamp": "ISO8601",
  "pipeline": {
    "name": "string",
    "source": {},
    "target": {},
    "mappings": [],
    "transformations": [],
    "schedule": "string"
  },
  "artifacts": {
    "dagFile": "string",
    "configFile": "string",
    "sqlFiles": []
  },
  "deployment": {
    "status": "string",
    "url": "string"
  }
}
```
Integration with Migration Processes
- database-schema-migration: Data movement
- cloud-migration: Cloud data pipelines
- data-format-migration: Format transformation
Related Skills
- data-migration-validator: Validation
- schema-comparator: Schema mapping
Related Agents
- database-migration-orchestrator: Pipeline orchestration
- data-architect-agent: Pipeline design
Source
https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/code-migration-modernization/skills/etl-pipeline-builder/SKILL.md
Overview
ETL Pipeline Builder creates and manages end-to-end ETL pipelines for data migration, including incremental loads, CDC, and robust monitoring. It supports source-to-target mappings, transformation definitions, and error handling, with integrations to Airflow, dbt, Airbyte, Fivetran, AWS DMS, and Debezium.
How This Skill Works
Users define source-to-target mappings, transformation rules, and incremental load settings. The skill orchestrates pipeline components, configures CDC modes, handles errors with retries and dead-letter queues, and exposes monitoring and SLA reporting through integrated tools.
When to Use It
- Migrating data from on-premises sources to the cloud with incremental loads
- Needing near-real-time replication via CDC (log-based, trigger-based, or timestamp-based)
- Applying transformations and enrichments during migration with lookups and aggregations
- Setting up end-to-end monitoring, alerts, and SLA reports for data pipelines
- Handling data quality issues and implementing configurable retry and dead-letter strategies
Quick Start
- Step 1: Define the source and target schemas and mapping rules
- Step 2: Add transformations, set incremental load (watermarks, incremental columns), and configure CDC
- Step 3: Enable monitoring and error handling (retry/DLQ), then deploy to your orchestrator (e.g., Airflow) and data integration tools
Best Practices
- Clearly define source-target mappings and handle schema differences up front
- Persist watermarks and choose reliable incremental columns for incremental loads
- Configure retry policies and dead-letter queues to handle transient failures
- Choose appropriate CDC mode per source and monitor full-load vs incremental behavior
- Test end-to-end pipelines with representative data and validate outputs against schemas
Example Use Cases
- Migrate a customer database from on-prem Oracle to Snowflake with Airflow orchestration and dbt transformations
- CDC-based replication from MySQL to BigQuery using Debezium and Airbyte
- Incremental ETL for a SaaS product feeding a data lake with lookups and enrichments
- Full-load migration with Delta checks and SLA reporting for regulatory data
- Data-format migration converting heterogeneous sources into a unified schema with transformation rules