
Installation

npx machina-cli add skill javalenciacai/develop-skills/dataeng --openclaw

DataEng - Data Engineer

Role

Builds pipelines and processes data. Reports to DataLead.

Responsibilities

  • Data pipeline design and construction
  • ETL processes (Extract, Transform, Load)
  • Data warehousing and data lakes
  • Batch and streaming processing
  • Data source integration
  • Critical Restriction: This skill defines a role only; it cannot perform tasks directly and must always act through one of its associated skills, where the actual capability resides.

Base Skills

# Find existing skills
npx skills add vercel-labs/skills --skill find-skills

# Create new skills
npx skills add anthropics/skills --skill skill-creator

Current Skills

<!-- Add here each skill you use with: npx skills add <owner/repo> --skill <name> -->

Base Skills (All Data Engineers)

| Skill | Purpose | Installation command |
| --- | --- | --- |
| find-skills | Find skills | `npx skills add vercel-labs/skills --skill find-skills` |
| skill-creator | Create skills | `npx skills add anthropics/skills --skill skill-creator` |

Data Engineering Skills 🔴 High Priority

| Skill | Purpose | Installation command |
| --- | --- | --- |
| doc-coauthoring | Data pipeline docs, ETL documentation, data architecture specs, schema docs | `npx skills add anthropics/skills --skill doc-coauthoring` |
| xlsx | Pipeline inventory, data quality metrics, ETL schedules, data lineage tracking | `npx skills add anthropics/skills --skill xlsx` |
| data-visualization | Pipeline monitoring, data quality dashboards, ETL performance metrics | `npx skills add 1nference-sh/skills --skill data-visualization` |

Documentation Skills 🟡 Medium Priority

| Skill | Purpose | Installation command |
| --- | --- | --- |
| technical-blog-writing | Data engineering best practices, ETL patterns, pipeline optimization guides | `npx skills add 1nference-sh/skills --skill technical-blog-writing` |

Rule: Add Used Skills

Every time you use a new skill, add it to the "Current Skills" table.

Examples of skills to search for:

  • npx skills find etl
  • npx skills find data-pipeline
  • npx skills find apache-spark

Source

git clone https://github.com/javalenciacai/develop-skills
# SKILL.md: .agents/skills/dataeng/SKILL.md

Overview

DataEng is a coordinating role for building data pipelines, ETL processes, and data warehousing or data lake architectures. It also covers batch and streaming workloads and data integration across sources, with an emphasis on data quality and modeling. Note: this role does not execute tasks directly; it relies on associated skills to perform the work.

How This Skill Works

As a role, DataEng defines pipeline designs, data models, and orchestration contracts, then delegates actual implementation to its associated skills. It leverages tools like Airflow or Dagster for orchestration, integrates diverse data sources, and enforces data quality and schema consistency through its collaborators.
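The delegation pattern described above can be sketched in plain Python. All names here (`Role`, `register`, `perform`, the skill callables) are illustrative assumptions for the sketch, not part of the actual skill framework:

```python
# Minimal sketch of a role that owns no capabilities itself and
# delegates every task to a registered skill (illustrative only).

class Role:
    def __init__(self, name):
        self.name = name
        self.skills = {}  # task name -> callable provided by a skill

    def register(self, task, skill_fn):
        """Associate a skill's capability with a task name."""
        self.skills[task] = skill_fn

    def perform(self, task, *args, **kwargs):
        """The role never executes work directly: it looks up a skill."""
        if task not in self.skills:
            raise LookupError(f"{self.name} has no skill for {task!r}")
        return self.skills[task](*args, **kwargs)

# Example: DataEng delegates pipeline design to a (hypothetical) skill.
dataeng = Role("DataEng")
dataeng.register("design_pipeline", lambda src, dst: f"{src} -> {dst}")
print(dataeng.perform("design_pipeline", "kafka", "warehouse"))
# prints "kafka -> warehouse"
```

Asking the role to perform a task it has no skill for raises an error, which mirrors the Critical Restriction: capability lives in the skills, not the role.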

When to Use It

  • Building data pipelines or ETL processes
  • Data warehousing or data lake architecture
  • Batch processing or streaming data (Kafka, Spark)
  • Data integration from multiple sources
  • Data quality validation or data cleaning

Quick Start

  1. Identify the associated skills to activate (ETL, orchestration, data modeling) and align with data sources and targets
  2. Draft the pipeline architecture, including sources, transforms, and sink (warehouse or lake), and outline the data models
  3. Configure orchestration (Airflow, Dagster) and implement the ETL using the chosen skills
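The steps above can be sketched as a minimal extract/transform/load skeleton. The source rows, field names, and in-memory "warehouse" are illustrative assumptions standing in for real sources and sinks:

```python
# Minimal ETL skeleton mirroring the Quick Start steps:
# extract from a source, transform rows, load into a sink.

def extract(source_rows):
    """Read raw records from a source (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Normalize fields and drop records that fail a quality gate."""
    out = []
    for row in rows:
        if row.get("email"):  # simple validation gate
            out.append({"email": row["email"].strip().lower(),
                        "amount": float(row.get("amount", 0))})
    return out

def load(rows, sink):
    """Append transformed rows to the target (warehouse stand-in)."""
    sink.extend(rows)
    return len(rows)

raw = [{"email": " Alice@Example.com ", "amount": "10"},
       {"email": None, "amount": "5"}]  # second row fails the gate
warehouse = []
loaded = load(transform(extract(raw)), warehouse)
print(loaded, warehouse[0]["email"])  # 1 alice@example.com
```

In a real pipeline each stage would be a task in the orchestrator (Airflow, Dagster), with the gate's rejects routed to a dead-letter store rather than silently dropped.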

Best Practices

  • Define clear ETL contracts and data contracts between sources and targets
  • Design idempotent ETL steps and maintain data lineage
  • Choose a robust orchestration tool (Airflow or Dagster) and implement monitoring
  • Incorporate data quality checks and alerts before loading to the warehouse
  • Document data models, transformations, and dependencies for maintainability
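The "idempotent ETL steps" practice above can be illustrated with an upsert keyed on a natural key, so replaying a batch converges to the same state. `sqlite3` stands in for the warehouse here, and the `orders` schema is an assumption for the sketch:

```python
import sqlite3

# Idempotent load sketch: re-running the same batch must not
# duplicate rows. Upsert by primary key makes replays safe.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, total REAL)")

def load_batch(conn, batch):
    """Upsert by natural key so replays converge to the same state."""
    conn.executemany(
        "INSERT INTO orders (order_id, total) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET total = excluded.total",
        batch,
    )
    conn.commit()

batch = [("o-1", 10.0), ("o-2", 25.5)]
load_batch(conn, batch)
load_batch(conn, batch)  # replay: no duplicates, same final state
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2
```

The same property is what makes orchestrator retries safe: a failed-then-retried load task leaves the warehouse in the state a single successful run would have produced.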

Example Use Cases

  • Coordinate a multi-source customer data pipeline feeding a data lake and data warehouse
  • Ingest streaming data from Kafka into Spark structured streaming with Airflow orchestration
  • Develop data quality dashboards to monitor ETL health and data lineage
  • Manage schema evolution and data modeling for a centralized warehouse
  • Orchestrate ingestion from SaaS APIs and internal databases using standardized ETL templates
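The multi-source merge with lineage tracking mentioned above can be sketched as follows; the sources, field names, and `_lineage` convention are all illustrative assumptions:

```python
# Sketch of per-record lineage: each output row carries the ids of
# the source rows it was derived from (field names illustrative).

def merge_customers(crm_rows, billing_rows):
    """Join two sources by email, recording lineage per output row."""
    billing_by_email = {r["email"]: r for r in billing_rows}
    merged = []
    for crm in crm_rows:
        bill = billing_by_email.get(crm["email"])
        merged.append({
            "email": crm["email"],
            "plan": bill["plan"] if bill else None,
            "_lineage": {"crm_id": crm["id"],
                         "billing_id": bill["id"] if bill else None},
        })
    return merged

crm = [{"id": "c1", "email": "a@x.com"}]
billing = [{"id": "b9", "email": "a@x.com", "plan": "pro"}]
out = merge_customers(crm, billing)
print(out[0]["_lineage"])  # {'crm_id': 'c1', 'billing_id': 'b9'}
```

Carrying lineage alongside the data makes the quality dashboards in the use cases above traceable: a bad output row can be walked back to the exact source records that produced it.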
