# DataEng - Data Engineer

Install: npx machina-cli add skill javalenciacai/develop-skills/dataeng --openclaw
## Role
Builds pipelines and processes data. Reports to DataLead.
## Responsibilities
- Data pipeline design and construction
- ETL processes (Extract, Transform, Load)
- Data warehousing and data lakes
- Batch and streaming processing
- Data source integration
- **Critical restriction:** this skill is a role only; it cannot perform tasks directly and must always act through one of its associated skills, where the actual capability resides.
## Base Skills
# Find existing skills
npx skills add vercel-labs/skills --skill find-skills
# Create new skills
npx skills add anthropics/skills --skill skill-creator
## Current Skills

<!-- Add here each skill you use with: npx skills add <owner/repo> --skill <name> -->

### Base Skills (All Data Engineers)
| Skill | Purpose | Installation command |
|---|---|---|
| find-skills | Find existing skills | `npx skills add vercel-labs/skills --skill find-skills` |
| skill-creator | Create new skills | `npx skills add anthropics/skills --skill skill-creator` |
### Data Engineering Skills 🔴 High Priority
| Skill | Purpose | Installation command |
|---|---|---|
| doc-coauthoring | Data pipeline docs, ETL documentation, data architecture specs, schema docs | `npx skills add anthropics/skills --skill doc-coauthoring` |
| xlsx | Pipeline inventory, data quality metrics, ETL schedules, data lineage tracking | `npx skills add anthropics/skills --skill xlsx` |
| data-visualization | Pipeline monitoring, data quality dashboards, ETL performance metrics | `npx skills add 1nference-sh/skills --skill data-visualization` |
### Documentation Skills 🟡 Medium Priority
| Skill | Purpose | Installation command |
|---|---|---|
| technical-blog-writing | Data engineering best practices, ETL patterns, pipeline optimization guides | `npx skills add 1nference-sh/skills --skill technical-blog-writing` |
## Rule: Add Used Skills
Every time you use a new skill, add it to the "Current Skills" table.
Examples of skills to search for:
npx skills find etl
npx skills find data-pipeline
npx skills find apache-spark
## Source

View on GitHub: https://github.com/javalenciacai/develop-skills/blob/main/.agents/skills/dataeng/SKILL.md

## Overview
DataEng is a coordinating role for building data pipelines, ETL processes, and data warehousing or data lake architectures. It also covers batch and streaming workloads and data integration across sources, with an emphasis on data quality and modeling. Note: this role does not execute tasks directly; it relies on associated skills to perform the work.
## How This Skill Works
As a role, DataEng defines pipeline designs, data models, and orchestration contracts, then delegates actual implementation to its associated skills. It leverages tools like Airflow or Dagster for orchestration, integrates diverse data sources, and enforces data quality and schema consistency through its collaborators.
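The delegation-through-orchestration pattern described above can be sketched in pure Python: tasks declare their dependencies and run in topological order, which is the core contract that tools like Airflow and Dagster provide (on top of scheduling, retries, and observability). All task names and data here are illustrative assumptions, not part of the skill itself.

```python
# Minimal orchestration-contract sketch: tasks run in dependency order.
from graphlib import TopologicalSorter

def extract():
    # Stand-in for a source connector (database, API, file drop).
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]

def transform(rows):
    # Stand-in for a cleaning/validation step.
    return [r for r in rows if r["amount"] > 0]

def load(rows):
    # Stand-in for writing to a warehouse or lake.
    warehouse.extend(rows)

warehouse = []
results = {}

# Dependency graph: load depends on transform, transform on extract.
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
tasks = {
    "extract": lambda: extract(),
    "transform": lambda: transform(results["extract"]),
    "load": lambda: load(results["transform"]),
}

for name in TopologicalSorter(dag).static_order():
    results[name] = tasks[name]()
```

In a real deployment each function body would be replaced by a call into the relevant associated skill, while the graph itself stays the role's responsibility.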
## When to Use It
- Building data pipelines or ETL processes
- Data warehousing or data lake architecture
- Batch processing or streaming data (Kafka, Spark)
- Data integration from multiple sources
- Data quality validation or data cleaning
## Quick Start
- Step 1: Identify the associated skills to activate (ETL, orchestration, data modeling) and align with data sources and targets
- Step 2: Draft the pipeline architecture including sources, transforms, and sink (warehouse or lake) and outline the data models
- Step 3: Configure orchestration (Airflow, Dagster) and implement the ETL using the chosen skills
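The three steps above can be condensed into a small pipeline spec: declare sources and a sink, define the transform, then run. Every name here (the `orders` source, the field names, the in-memory sink) is a hypothetical stand-in for real connectors and a real warehouse table.

```python
# Hypothetical pipeline spec mirroring the Quick Start steps.
PIPELINE = {
    "sources": {"orders": [{"order_id": 1, "total": "19.99"},
                           {"order_id": 2, "total": "5.00"}]},
    "sink": [],  # stands in for a warehouse or lake table
}

def transform(row):
    # Cast types to match the target schema; a real pipeline would
    # enforce a full schema contract here.
    return {"order_id": row["order_id"], "total": float(row["total"])}

def run(pipeline):
    for _name, rows in pipeline["sources"].items():
        pipeline["sink"].extend(transform(r) for r in rows)
    return pipeline["sink"]

run(PIPELINE)
```

Step 3's orchestration layer would then schedule `run` (or its per-source pieces) as tasks in Airflow or Dagster rather than calling it inline.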
## Best Practices
- Define clear ETL contracts and data contracts between sources and targets
- Design idempotent ETL steps and maintain data lineage
- Choose a robust orchestration tool (Airflow or Dagster) and implement monitoring
- Incorporate data quality checks and alerts before loading to the warehouse
- Document data models, transformations, and dependencies for maintainability
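Two of the practices above, idempotent ETL steps and quality checks before loading, can be sketched together. The dict-based "warehouse" and the specific rules are illustrative assumptions; the point is that an upsert keyed on a primary key makes replays safe, and a quality gate rejects a batch before anything is written.

```python
# Sketch: pre-load data quality gate + idempotent key-based upsert.
def quality_check(rows):
    # Reject the whole batch before loading if any row violates the contract.
    bad = [r for r in rows if r.get("id") is None or r.get("amount", 0) < 0]
    if bad:
        raise ValueError(f"{len(bad)} rows failed quality checks")
    return rows

def idempotent_load(warehouse, rows):
    # Upsert by primary key: re-running the same batch leaves the
    # warehouse unchanged instead of duplicating rows.
    for r in quality_check(rows):
        warehouse[r["id"]] = r
    return warehouse

wh = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 3}]
idempotent_load(wh, batch)
idempotent_load(wh, batch)  # replaying the batch is safe
```

Failing loudly before the load (rather than filtering silently) also preserves lineage: the bad batch can be quarantined and traced back to its source.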
## Example Use Cases
- Coordinate a multi-source customer data pipeline feeding a data lake and data warehouse
- Ingest streaming data from Kafka into Spark structured streaming with Airflow orchestration
- Develop data quality dashboards to monitor ETL health and data lineage
- Manage schema evolution and data modeling for a centralized warehouse
- Orchestrate ingestion from SaaS APIs and internal databases using standardized ETL templates
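The Kafka-to-Spark use case above boils down to micro-batch aggregation over event-time windows. A dependency-free sketch of that semantics, with illustrative events and a hypothetical 60-second window, looks like this; Spark Structured Streaming applies the same pattern at scale against a Kafka topic.

```python
# Pure-Python sketch of windowed micro-batch aggregation.
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative window size

def window_start(ts):
    # Align an event timestamp to the start of its tumbling window.
    return ts - (ts % WINDOW_SECONDS)

def aggregate(counts, micro_batch):
    # Fold one micro-batch of events into the running per-window counts,
    # the way a streaming engine updates state between triggers.
    for event in micro_batch:
        counts[(window_start(event["ts"]), event["user"])] += 1
    return counts

counts = defaultdict(int)
aggregate(counts, [{"ts": 5, "user": "a"}, {"ts": 61, "user": "a"}])
aggregate(counts, [{"ts": 10, "user": "a"}])
```

A production version would also need watermarking for late events and checkpointed state, which the orchestration and streaming skills would supply.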