# DataEng - Data Engineer

Install: npx machina-cli add skill javalenciacai/develop-skills/dataeng --openclaw
## Role
Builds pipelines and processes data. Reports to DataLead.
## Responsibilities
- Data pipeline design and construction
- ETL processes (Extract, Transform, Load)
- Data warehousing and data lakes
- Batch and streaming processing
- Data source integration
- **Critical restriction:** this skill is a role only; it cannot perform tasks directly and must always act through one of its associated skills, where the actual capability resides.
## Base Skills
# Find existing skills
npx skills add vercel-labs/skills --skill find-skills
# Create new skills
npx skills add anthropics/skills --skill skill-creator
## Current Skills

<!-- Add here each skill you use with: npx skills add <owner/repo> --skill <name> -->

### Base Skills (All Data Engineers)
| Skill | Purpose | Installation command |
|---|---|---|
| find-skills | Find existing skills | `npx skills add vercel-labs/skills --skill find-skills` |
| skill-creator | Create new skills | `npx skills add anthropics/skills --skill skill-creator` |
### Data Engineering Skills 🔴 High Priority
| Skill | Purpose | Installation command |
|---|---|---|
| doc-coauthoring | Data pipeline docs, ETL documentation, data architecture specs, schema docs | `npx skills add anthropics/skills --skill doc-coauthoring` |
| xlsx | Pipeline inventory, data quality metrics, ETL schedules, data lineage tracking | `npx skills add anthropics/skills --skill xlsx` |
| data-visualization | Pipeline monitoring, data quality dashboards, ETL performance metrics | `npx skills add 1nference-sh/skills --skill data-visualization` |
### Documentation Skills 🟡 Medium Priority
| Skill | Purpose | Installation command |
|---|---|---|
| technical-blog-writing | Data engineering best practices, ETL patterns, pipeline optimization guides | `npx skills add 1nference-sh/skills --skill technical-blog-writing` |
## Rule: Add Used Skills
Every time you use a new skill, add it to the "Current Skills" table.
Examples of skills to search for:
npx skills find etl
npx skills find data-pipeline
npx skills find apache-spark
## Source

View on GitHub: https://github.com/javalenciacai/develop-skills/blob/main/.agents/skills/dataeng/SKILL.md

## Overview
DataEng is a coordinating role for building data pipelines, ETL processes, and data warehousing or data lake architectures. It also covers batch and streaming workloads and data integration across sources, with an emphasis on data quality and modeling. Note: this role does not execute tasks directly; it relies on associated skills to perform the work.
## How This Skill Works
As a role, DataEng defines pipeline designs, data models, and orchestration contracts, then delegates actual implementation to its associated skills. It leverages tools like Airflow or Dagster for orchestration, integrates diverse data sources, and enforces data quality and schema consistency through its collaborators.
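The delegation-through-orchestration pattern described above can be sketched in pure Python: tasks declare their dependencies and run in topological order, which is the core contract that tools like Airflow and Dagster provide (on top of scheduling, retries, and observability). All task names and data here are illustrative assumptions, not part of the skill itself.

```python
# Minimal orchestration-contract sketch: tasks run in dependency order.
from graphlib import TopologicalSorter

def extract():
    # Stand-in for a source connector (database, API, file drop).
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]

def transform(rows):
    # Stand-in for a cleaning/validation step.
    return [r for r in rows if r["amount"] > 0]

def load(rows):
    # Stand-in for writing to a warehouse or lake.
    warehouse.extend(rows)

warehouse = []
results = {}

# Dependency graph: load depends on transform, transform on extract.
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
tasks = {
    "extract": lambda: extract(),
    "transform": lambda: transform(results["extract"]),
    "load": lambda: load(results["transform"]),
}

for name in TopologicalSorter(dag).static_order():
    results[name] = tasks[name]()
```

In a real deployment each function body would be replaced by a call into the relevant associated skill, while the graph itself stays the role's responsibility.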
## When to Use It
- Building data pipelines or ETL processes
- Data warehousing or data lake architecture
- Batch processing or streaming data (Kafka, Spark)
- Data integration from multiple sources
- Data quality validation or data cleaning
## Quick Start
- Step 1: Identify the associated skills to activate (ETL, orchestration, data modeling) and align with data sources and targets
- Step 2: Draft the pipeline architecture including sources, transforms, and sink (warehouse or lake) and outline the data models
- Step 3: Configure orchestration (Airflow, Dagster) and implement the ETL using the chosen skills
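The three steps above can be condensed into a small pipeline spec: declare sources and a sink, define the transform, then run. Every name here (the `orders` source, the field names, the in-memory sink) is a hypothetical stand-in for real connectors and a real warehouse table.

```python
# Hypothetical pipeline spec mirroring the Quick Start steps.
PIPELINE = {
    "sources": {"orders": [{"order_id": 1, "total": "19.99"},
                           {"order_id": 2, "total": "5.00"}]},
    "sink": [],  # stands in for a warehouse or lake table
}

def transform(row):
    # Cast types to match the target schema; a real pipeline would
    # enforce a full schema contract here.
    return {"order_id": row["order_id"], "total": float(row["total"])}

def run(pipeline):
    for _name, rows in pipeline["sources"].items():
        pipeline["sink"].extend(transform(r) for r in rows)
    return pipeline["sink"]

run(PIPELINE)
```

Step 3's orchestration layer would then schedule `run` (or its per-source pieces) as tasks in Airflow or Dagster rather than calling it inline.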
## Best Practices
- Define clear ETL contracts and data contracts between sources and targets
- Design idempotent ETL steps and maintain data lineage
- Choose a robust orchestration tool (Airflow or Dagster) and implement monitoring
- Incorporate data quality checks and alerts before loading to the warehouse
- Document data models, transformations, and dependencies for maintainability
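Two of the practices above, idempotent ETL steps and quality checks before loading, can be sketched together. The dict-based "warehouse" and the specific rules are illustrative assumptions; the point is that an upsert keyed on a primary key makes replays safe, and a quality gate rejects a batch before anything is written.

```python
# Sketch: pre-load data quality gate + idempotent key-based upsert.
def quality_check(rows):
    # Reject the whole batch before loading if any row violates the contract.
    bad = [r for r in rows if r.get("id") is None or r.get("amount", 0) < 0]
    if bad:
        raise ValueError(f"{len(bad)} rows failed quality checks")
    return rows

def idempotent_load(warehouse, rows):
    # Upsert by primary key: re-running the same batch leaves the
    # warehouse unchanged instead of duplicating rows.
    for r in quality_check(rows):
        warehouse[r["id"]] = r
    return warehouse

wh = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 3}]
idempotent_load(wh, batch)
idempotent_load(wh, batch)  # replaying the batch is safe
```

Failing loudly before the load (rather than filtering silently) also preserves lineage: the bad batch can be quarantined and traced back to its source.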
## Example Use Cases
- Coordinate a multi-source customer data pipeline feeding a data lake and data warehouse
- Ingest streaming data from Kafka into Spark structured streaming with Airflow orchestration
- Develop data quality dashboards to monitor ETL health and data lineage
- Manage schema evolution and data modeling for a centralized warehouse
- Orchestrate ingestion from SaaS APIs and internal databases using standardized ETL templates
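The Kafka-to-Spark use case above boils down to micro-batch aggregation over event-time windows. A dependency-free sketch of that semantics, with illustrative events and a hypothetical 60-second window, looks like this; Spark Structured Streaming applies the same pattern at scale against a Kafka topic.

```python
# Pure-Python sketch of windowed micro-batch aggregation.
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative window size

def window_start(ts):
    # Align an event timestamp to the start of its tumbling window.
    return ts - (ts % WINDOW_SECONDS)

def aggregate(counts, micro_batch):
    # Fold one micro-batch of events into the running per-window counts,
    # the way a streaming engine updates state between triggers.
    for event in micro_batch:
        counts[(window_start(event["ts"]), event["user"])] += 1
    return counts

counts = defaultdict(int)
aggregate(counts, [{"ts": 5, "user": "a"}, {"ts": 61, "user": "a"}])
aggregate(counts, [{"ts": 10, "user": "a"}])
```

A production version would also need watermarking for late events and checkpointed state, which the orchestration and streaming skills would supply.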