
data-modeler

npx machina-cli add skill karim-bhalwani/agent-skills-collection/data-modeler --openclaw
Files (1)
SKILL.md
3.9 KB

Data Modeler Skill - Lakehouse & Data Engineering

Overview

The Data Modeler skill manages the lifecycle of data within a Lakehouse environment. It focuses on performance, governance, and reliable data transformations.

Core Capabilities

  1. Medallion Architecture: Implementation of Bronze (Raw), Silver (Clean/Vault), and Gold (Business) layers.
  2. Data Vault 2.0: Designing Hubs (Keys), Links (Relationships), and Satellites (Attributes/SCD2).
  3. Spark Optimization: Native PySpark functions, Z-Ordering, partitioning, and file compaction (OPTIMIZE/VACUUM).
  4. Governance: Unity Catalog integration, RBAC, and storage credential management.
  5. CDC (Change Data Capture): Implementing append-only audit trails and Delta Lake change feeds.

Standards

  • PySpark: Avoid UDFs. Use explicit imports and aliases (F, T). Use .transform() for modularity.
  • Idempotency: All pipelines must be safe to re-run and support backfilling.
  • Partitioning: Date-based for time-series; 100-1000 partitions ideal.
  • Data Quality: Use Great Expectations or chispa for transformation validation.

When to Use

  • Designing database schemas or lakehouse layouts.
  • Building ETL/ELT pipelines.
  • Optimizing slow queries or high-cost data processing.
  • Implementing data privacy and governance controls.

Constraints

  • NO implementation of application business logic.
  • NO infrastructure deployment (route to ops-manager).
  • NO code without data quality checks.

Outputs & Deliverables

  • Primary Output: Data model designs, schema definitions, and pipeline specs (e.g., specs/<pipeline>.md)
  • Secondary Output: Example pipeline code snippets and data quality checks
  • Success Criteria: Models include schema, partitioning strategy, and validation rules
  • Quality Gate: Data model reviewed by implementer and ops-manager before production run

Additional Constraints

  • Technical Constraints: Do not deploy infrastructure; hand off IaC to ops-manager.
  • Scope Constraints: In scope: model design and pipeline specs. Out of scope: production deployment and cluster configuration.
  • Governance Constraints: All models must include data quality tests and backfill strategy.

Common Pitfalls

  • Over-Partitioning: Excessive partitions increase query latency and metadata overhead. 100-1000 is ideal; validate before deployment.
  • Missing Backfill Strategy: Not planning how to reload historical data leads to operational nightmares. Every model needs a backfill procedure.
  • Ignoring Data Quality: Garbage in = garbage out. Define validation rules before implementation, not after bugs appear.
  • Wrong Medallion Layer: Mixing business logic into Bronze or Silver layers defeats Medallion purpose. Keep layers clean and separated.
  • No SCD (Slowly Changing Dimensions): Not tracking how attributes change over time breaks historical analysis. Use SCD2 for audit trails.
  • Skipping Schema Evolution Planning: Schemas change; not planning migration paths causes production incidents.
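The SCD2 pitfall above comes down to versioning rows instead of overwriting them. A minimal sketch of the close-out logic in plain Python for clarity (at scale a Delta Lake MERGE would do this; the field names are illustrative):

```python
from datetime import date

def apply_scd2(current_rows, incoming, today):
    """Close out changed rows and append new versions (SCD Type 2).

    current_rows: list of dicts with keys: key, value, valid_from, valid_to.
    An open row has valid_to=None. incoming: {key: new_value}.
    """
    out = []
    seen = set()
    for row in current_rows:
        if row["valid_to"] is None and row["key"] in incoming:
            seen.add(row["key"])
            if incoming[row["key"]] != row["value"]:
                # Close the old version instead of overwriting it...
                out.append({**row, "valid_to": today})
                # ...and open a new one, preserving full history.
                out.append({"key": row["key"], "value": incoming[row["key"]],
                            "valid_from": today, "valid_to": None})
                continue
        out.append(row)
    # Brand-new keys get an open row.
    for key, value in incoming.items():
        if key not in seen and all(r["key"] != key for r in current_rows):
            out.append({"key": key, "value": value,
                        "valid_from": today, "valid_to": None})
    return out

dim = [{"key": "c1", "value": "bronze", "valid_from": date(2024, 1, 1), "valid_to": None}]
updated = apply_scd2(dim, {"c1": "gold", "c2": "silver"}, date(2024, 6, 1))
print(len(updated))  # 3 rows: closed c1, new c1 version, new c2
```

Historical analysis then filters on `valid_from`/`valid_to` instead of losing the old value.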

Integration Points

| Phase | Input From | Output To | Context |
| --- | --- | --- | --- |
| Design | architect, requirements | Schema specification | Understand business primitives and data flows |
| Pipeline Dev | Schema specs | implementer | Hand off for ETL/ELT implementation |
| Optimization | Performance metrics | ops-manager | Request infrastructure scaling if needed |
| Quality | Data rules | Monitoring/alerts | Set up data quality checks and validations |

Source

git clone https://github.com/karim-bhalwani/agent-skills-collection

The skill file lives at skills/data-modeler/SKILL.md (view on GitHub).

Overview

The Data Modeler skill manages the lifecycle of data within a Lakehouse environment, focusing on performance, governance, and reliable transformations. It combines Medallion architecture with Data Vault 2.0 to structure Bronze, Silver, and Gold layers and enable scalable, auditable data pipelines.

How This Skill Works

It implements Bronze (Raw), Silver (Clean/Vault), and Gold (Business) layers using Medallion architecture, applying Data Vault 2.0 constructs (Hubs, Links, Satellites) to model keys, relationships, and attributes. Spark optimization techniques (native PySpark, Z-Ordering, partitioning, and OPTIMIZE/VACUUM) drive performance, while governance is enforced via Unity Catalog, RBAC, and storage credential management. CDC support provides append-only audit trails and Delta Lake change feeds.
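For instance, Data Vault hubs are typically keyed by a deterministic hash of the business key. A sketch in plain Python mirroring what `F.sha2(F.concat_ws(...))` would do in PySpark (the delimiter and normalization rules here are assumptions, not part of the skill):

```python
import hashlib

def hub_hash_key(*business_key_parts, delimiter="||"):
    # Normalize parts so the same business key always hashes identically,
    # regardless of source-system casing or padding.
    normalized = delimiter.join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same customer arriving from two sources lands on one hub row.
k1 = hub_hash_key(" cust-42 ")
k2 = hub_hash_key("CUST-42")
print(k1 == k2)  # True
```

The same hash key then joins Hubs to their Links and Satellites without exposing raw business keys.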

When to Use It

  • Designing database schemas or lakehouse layouts
  • Building ETL/ELT pipelines
  • Optimizing slow queries or high-cost data processing
  • Implementing data privacy and governance controls
  • Architecting lakehouse solutions with Medallion architecture and Data Vault 2.0

Quick Start

  1. Map requirements and design Medallion layers plus Data Vault hubs/links/satellites
  2. Implement PySpark transformations with explicit imports and no UDFs, adding data quality checks
  3. Validate governance (Unity Catalog, RBAC) and performance, then run backfills
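The quality checks in the second step can be sketched as Great Expectations-style expectations in plain Python (the function names mimic, but are not, the real Great Expectations API; the dataset is hypothetical):

```python
def expect_no_nulls(rows, column):
    # Flag row indexes where the column is missing.
    failures = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

def expect_values_between(rows, column, low, high):
    # Flag non-null values outside the allowed range.
    failures = [i for i, r in enumerate(rows)
                if r.get(column) is not None and not (low <= r[column] <= high)]
    return {"success": not failures, "failed_rows": failures}

silver = [
    {"order_id": "o1", "amount": 19.99},
    {"order_id": "o2", "amount": -5.00},   # should fail the range check
]

checks = [
    expect_no_nulls(silver, "order_id"),
    expect_values_between(silver, "amount", 0, 10_000),
]
# Gate promotion to the next layer on all checks passing.
promote = all(c["success"] for c in checks)
print(promote)  # False: the negative amount blocks promotion
```

The pattern is the point: validation rules are defined up front and act as a gate between layers, rather than being bolted on after bugs appear.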

Best Practices

  • Follow PySpark standards: explicit imports (F, T), avoid UDFs, and use .transform() for modularity
  • Design for idempotency and backfill safety; pipelines should re-run cleanly
  • Plan date-based partitioning for time-series data, targeting 100-1000 partitions; assess the overhead before deployment
  • Incorporate data quality checks with Great Expectations or Chispa before moving to next layer
  • Keep Medallion layers clean (Bronze/Silver/Gold) and enforce governance integration (Unity Catalog, RBAC) plus backfill strategy
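The partition-count guidance above can be sanity-checked with simple arithmetic before deployment. The 128 MB target file size below is a common Spark rule of thumb, not a value from this skill:

```python
def estimated_partitions(total_bytes, target_file_bytes=128 * 1024**2):
    # Ceiling division: at least one partition even for tiny tables.
    return max(1, -(-total_bytes // target_file_bytes))

# A hypothetical two-year daily-partitioned table: 730 date partitions,
# comfortably inside the 100-1000 sweet spot.
daily_partitions = 2 * 365
print(daily_partitions)  # 730

# A 50 GB table at ~128 MB per file.
print(estimated_partitions(50 * 1024**3))  # 400
```

If the estimate falls far outside 100-1000, coarsen or refine the partition key (e.g. month vs. day) before the first production run.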

Example Use Cases

  • Implement Bronze/Silver/Gold for a retail dataset with Hub/Link/Satellites in a lakehouse
  • CDC-enabled Delta Lake pipeline capturing changes into Gold layer with audit trails
  • Spark optimization for clickstream data using Z-ordering and optimized partitions
  • Data governance integration using Unity Catalog and RBAC in a lakehouse
  • Backfill-ready pipelines with idempotent loads and schema evolution handling
