What is data-engineer best used for?

Use this skill for SQL queries, database schema design, ETL pipelines, data transformations, and data validation.

How do I ensure pipelines are idempotent?

Design scripts to be idempotent, deduplicate where needed, and maintain thorough logs of processed and failed rows.

How should NULLs be handled?

Handle NULLs with COALESCE or fillna, and validate constraints to enforce data quality at ingestion.

data-engineer

npx machina-cli add skill k1lgor/virtual-company/11-data-engineer --openclaw

Files (1)

SKILL.md

1.8 KB

Data Engineer

You handle data with precision, focusing on efficiency, correctness, and type safety.

When to use

"Write a SQL query to..."
"Design the database schema for..."
"Clean/transform this dataset."
"Set up an ETL job."

Instructions

Schema Design:
- Normalize where appropriate to reduce redundancy.
- Use appropriate data types (INT, VARCHAR, TIMESTAMP, DECIMAL).
- Define indexes on columns frequently used in WHERE or JOIN clauses.
SQL Efficiency:
- Avoid SELECT *; specify columns.
- Watch for N+1 query problems if generating code.
- Use CTEs (Common Table Expressions) for readability.
Transformations:
- Handle NULLs explicitly (COALESCE, IFNULL, fillna).
- Validate data constraints (no negative prices, valid emails).
Pipelines:
- Ensure idempotency (running the script twice is safe).
- Log rows processed/failed.

Examples

User Request: "Write a query to find the top 5 users by spend in 2023."

Agent Action/Output:

SELECT
    user_id,
    SUM(amount) AS total_spend
FROM orders
WHERE
    order_date >= '2023-01-01'
    AND order_date < '2024-01-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;

User Request: "Design a schema for a book library."

Agent Action/Output: "Entities: Books, Authors, Members, Loans. Relationships:

Book belongs to Author (N:1)
Member borrows Book (N:M via Loans) Schema:"

CREATE TABLE authors (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE books (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    author_id INT REFERENCES authors(id),
    isbn VARCHAR(13) UNIQUE
);

Source

git clone https://github.com/k1lgor/virtual-company/blob/main/skills/11-data-engineer/SKILL.mdView on GitHub

Overview

Data Engineer focuses on precise data handling with efficiency, correctness, and type safety. It covers writing SQL queries, designing robust database schemas, building ETL pipelines, performing data transformations with pandas or Spark, and validating data quality. This skill helps ensure scalable, reliable data systems.

How This Skill Works

Work through a structured flow: design normalized schemas with approved data types and indexing, then implement SQL using explicit column lists and CTEs for readability. In transformations, handle NULLs explicitly and validate constraints (e.g., non-negative values, valid emails). In pipelines, ensure idempotent runs and log rows processed and failed.

When to Use It

Write a SQL query to extract insights or build reports.
Design the database schema for a system with normalized tables and appropriate indexes.
Clean and transform a dataset, handling NULLs, type conversions, and validations.
Set up an ETL job with idempotent data loads and robust logging.
Perform data validation and quality checks to enforce business rules.

Quick Start

Step 1: Identify whether you need a schema, a query, an ETL, or a transformation.
Step 2: Draft schema with normalized tables and typed columns, then write a representative SQL or pipeline skeleton.
Step 3: Run with test data, verify results, and add logging and idempotency safeguards.

Best Practices

Normalize schemas where appropriate to reduce redundancy.
Use explicit data types (INT, VARCHAR, TIMESTAMP, DECIMAL).
Define indexes on columns frequently used in WHERE or JOIN clauses.
Avoid SELECT *; specify columns and use CTEs for readability.
Ensure idempotent ETL runs with logging and error handling.

Example Use Cases

Write a query to find the top 5 users by spend in 2023.
Design a schema for a book library with entities Books, Authors, Members, Loans and their relationships.
Create an idempotent ETL pipeline that ingests daily sales data and loads it into a warehouse while logging processing stats.
Transform a dataset containing NULLs and invalid emails by applying COALESCE and format validation.
Rewrite a slow query using explicit column selection and a CTE to improve performance.

Frequently Asked Questions

Add this skill to your agents