etl-pipeline-builder
npx machina-cli add skill a5c-ai/babysitter/etl-pipeline-builder --openclaw

ETL Pipeline Builder Skill
Builds and manages ETL (Extract, Transform, Load) pipelines for data migration, supporting incremental loads, CDC, and comprehensive monitoring.
Purpose
Enable data pipeline creation for:
- Source-to-target mapping
- Transformation definition
- Incremental load setup
- CDC configuration
- Pipeline monitoring
Capabilities
1. Source-to-Target Mapping
- Define column mappings
- Handle schema differences
- Configure data type conversions
- Manage derived columns
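A source-to-target mapping can be expressed as a small, declarative data structure. The sketch below is illustrative only (the field names and helper are not part of the skill's API, and the actual generated artifacts may differ); it shows column renames, type conversions, and a derived column rendered into a SQL select list.

```python
# Hypothetical mapping spec: column-level source-to-target mapping with
# type casts and one derived column. Names are illustrative only.
mapping = [
    {"source": "cust_id", "target": "customer_id", "cast": "BIGINT"},
    {"source": "created", "target": "created_at", "cast": "TIMESTAMP"},
    # Derived column: computed from an expression, no single source field.
    {"source": None, "target": "full_name",
     "expression": "first_name || ' ' || last_name"},
]

def to_select_list(mapping):
    """Render the mapping as a SQL SELECT list."""
    parts = []
    for m in mapping:
        if m.get("expression"):
            parts.append(f"{m['expression']} AS {m['target']}")
        elif m.get("cast"):
            parts.append(f"CAST({m['source']} AS {m['cast']}) AS {m['target']}")
        else:
            parts.append(f"{m['source']} AS {m['target']}")
    return ", ".join(parts)
```

Keeping the mapping as data (rather than hand-written SQL) makes schema differences explicit and lets the same spec drive validation.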
2. Transformation Definition
- Data type transformations
- Value mappings
- Aggregations
- Lookups and enrichments
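Value mappings and lookup enrichments from the list above can be sketched as pure row-level functions; the tables and field names here are assumptions for illustration, not the skill's actual transformation format.

```python
# Hypothetical transformation rules: a value map and a lookup enrichment.
STATUS_MAP = {"A": "active", "I": "inactive", "P": "pending"}
COUNTRY_LOOKUP = {"US": "United States", "DE": "Germany"}

def transform_row(row):
    """Apply a value mapping and a lookup enrichment to one record."""
    out = dict(row)
    # Value mapping: normalize coded statuses, defaulting unknown codes.
    out["status"] = STATUS_MAP.get(row["status"], "unknown")
    # Lookup/enrichment: add a human-readable country name.
    out["country_name"] = COUNTRY_LOOKUP.get(row["country"], row["country"])
    return out
```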
3. Incremental Load Setup
- Define watermarks
- Configure incremental columns
- Handle deletes
- Manage merge logic
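A watermark-based incremental extract can be sketched as follows. This is a minimal illustration, assuming a monotonically increasing incremental column such as an `updated_at` timestamp; persisting the new watermark and the merge/delete handling are noted in comments rather than implemented.

```python
def build_incremental_query(table, incremental_col, last_watermark):
    """Extract only rows changed since the stored watermark.

    After a successful load, the new watermark would typically be taken
    as MAX(incremental_col) from the loaded batch and persisted; deletes
    need separate handling (soft-delete flags or CDC), since they never
    appear in a watermark-filtered extract.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {incremental_col} > '{last_watermark}' "
        f"ORDER BY {incremental_col}"
    )
```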
4. CDC Configuration
- Log-based CDC
- Trigger-based CDC
- Timestamp-based CDC
- Full load comparison
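As a representative example of log-based CDC, a Debezium MySQL connector registration might look like the fragment below. Hostnames, credentials, and table names are placeholders, and the required properties vary by Debezium version (for example, newer releases also require schema-history settings); treat this as a sketch, not a complete configuration.

```json
{
  "name": "inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "change-me",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.orders,inventory.customers"
  }
}
```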
5. Error Handling
- Define retry policies
- Configure dead letter queues
- Handle data quality issues
- Implement alerting
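The retry and dead-letter behavior above can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: exponential backoff between attempts, and failed records parked in a dead-letter queue for inspection instead of blocking the pipeline.

```python
import time

def run_with_retry(task, record, max_retries=3, base_delay=1.0,
                   dead_letter=None):
    """Run `task` on `record`, retrying transient failures with
    exponential backoff; exhausted records go to a dead-letter queue."""
    for attempt in range(max_retries):
        try:
            return task(record)
        except Exception:
            if attempt < max_retries - 1:
                # Backoff: base_delay, 2*base_delay, 4*base_delay, ...
                time.sleep(base_delay * (2 ** attempt))
    if dead_letter is not None:
        dead_letter.append(record)  # park for inspection / later replay
    return None
```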
6. Pipeline Monitoring
- Track pipeline metrics
- Monitor data volumes
- Alert on failures
- Generate SLA reports
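A simple SLA check over per-run metrics might look like this; the metric names and thresholds are hypothetical, chosen only to show how duration and volume alerts could be derived from pipeline telemetry.

```python
def check_sla(run_metrics, max_duration_s, min_rows):
    """Return the list of SLA violations for one pipeline run."""
    violations = []
    if run_metrics["duration_s"] > max_duration_s:
        violations.append("duration")  # run exceeded its time budget
    if run_metrics["rows_loaded"] < min_rows:
        violations.append("volume")    # suspiciously low data volume
    return violations
```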
Tool Integrations
| Tool | Type | Integration Method |
|---|---|---|
| Apache Airflow | Orchestration | Python |
| dbt | Transformation | CLI |
| Airbyte | Data integration | API |
| Fivetran | SaaS ETL | API |
| AWS DMS | Cloud migration | CLI |
| Debezium | CDC | Config |
Output Schema

```json
{
  "pipelineId": "string",
  "timestamp": "ISO8601",
  "pipeline": {
    "name": "string",
    "source": {},
    "target": {},
    "mappings": [],
    "transformations": [],
    "schedule": "string"
  },
  "artifacts": {
    "dagFile": "string",
    "configFile": "string",
    "sqlFiles": []
  },
  "deployment": {
    "status": "string",
    "url": "string"
  }
}
```
Integration with Migration Processes
- database-schema-migration: Data movement
- cloud-migration: Cloud data pipelines
- data-format-migration: Format transformation
Related Skills
- data-migration-validator: Validation
- schema-comparator: Schema mapping
Related Agents
- database-migration-orchestrator: Pipeline orchestration
- data-architect-agent: Pipeline design
Source
https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/code-migration-modernization/skills/etl-pipeline-builder/SKILL.md
Overview
ETL Pipeline Builder creates and manages end-to-end ETL pipelines for data migration, including incremental loads, CDC, and robust monitoring. It supports source-to-target mappings, transformation definitions, and error handling, with integrations to Airflow, dbt, Airbyte, Fivetran, AWS DMS, and Debezium.
How This Skill Works
Users define source-to-target mappings, transformation rules, and incremental load settings. The skill orchestrates pipeline components, configures CDC modes, handles errors with retries and dead-letter queues, and exposes monitoring and SLA reporting through integrated tools.
When to Use It
- Migrating data from on-premises sources to the cloud with incremental loads
- Needing near-real-time replication via CDC (log-based, trigger-based, or timestamp-based)
- Applying transformations and enrichments during migration with lookups and aggregations
- Setting up end-to-end monitoring, alerts, and SLA reports for data pipelines
- Handling data quality issues and implementing configurable retry and dead-letter strategies
Quick Start
- Step 1: Define the source and target schemas and mapping rules
- Step 2: Add transformations, set incremental load (watermarks, incremental columns), and configure CDC
- Step 3: Enable monitoring and error handling (retry/DLQ), then deploy to your orchestrator (e.g., Airflow) and data integration tools
Best Practices
- Clearly define source-target mappings and handle schema differences up front
- Persist watermarks and choose reliable incremental columns for incremental loads
- Configure retry policies and dead-letter queues to handle transient failures
- Choose appropriate CDC mode per source and monitor full-load vs incremental behavior
- Test end-to-end pipelines with representative data and validate outputs against schemas
Example Use Cases
- Migrate a customer database from on-prem Oracle to Snowflake with Airflow orchestration and dbt transformations
- CDC-based replication from MySQL to BigQuery using Debezium and Airbyte
- Incremental ETL for a SaaS product feeding a data lake with lookups and enrichments
- Full-load migration with Delta checks and SLA reporting for regulatory data
- Data-format migration converting heterogeneous sources into a unified schema with transformation rules