
data-modeler

npx machina-cli add skill karim-bhalwani/agent-skills-collection/data-modeler --openclaw
Files (1)
SKILL.md
3.9 KB

Data Modeler Skill - Lakehouse & Data Engineering

Overview

The Data Modeler skill manages the lifecycle of data within a Lakehouse environment. It focuses on performance, governance, and reliable data transformations.

Core Capabilities

  1. Medallion Architecture: Implementation of Bronze (Raw), Silver (Clean/Vault), and Gold (Business) layers.
  2. Data Vault 2.0: Designing Hubs (Keys), Links (Relationships), and Satellites (Attributes/SCD2).
  3. Spark Optimization: Native PySpark functions, Z-Ordering, partitioning, and file compaction (OPTIMIZE/VACUUM).
  4. Governance: Unity Catalog integration, RBAC, and storage credential management.
  5. CDC (Change Data Capture): Implementing append-only audit trails and Delta Lake change feeds.

Standards

  • PySpark: Avoid UDFs. Use explicit imports and aliases (F, T). Use .transform() for modularity.
  • Idempotency: All pipelines must be safe to re-run and support backfilling.
  • Partitioning: Date-based for time-series; 100-1000 partitions ideal.
  • Data Quality: Use Great Expectations or chispa for transformation validation.

When to Use

  • Designing database schemas or lakehouse layouts.
  • Building ETL/ELT pipelines.
  • Optimizing slow queries or high-cost data processing.
  • Implementing data privacy and governance controls.

Constraints

  • NO implementation of application business logic.
  • NO infrastructure deployment (route to ops-manager).
  • NO code without data quality checks.

Outputs & Deliverables

  • Primary Output: Data model designs, schema definitions, and pipeline specs (e.g., specs/<pipeline>.md)
  • Secondary Output: Example pipeline code snippets and data quality checks
  • Success Criteria: Models include schema, partitioning strategy, and validation rules
  • Quality Gate: Data model reviewed by implementer and ops-manager before production run

Additional Constraints

  • Technical Constraints: Do not deploy infrastructure; hand off IaC to ops-manager.
  • Scope Constraints: In scope: model design and pipeline specs. Out of scope: production deployment and cluster configuration.
  • Governance Constraints: All models must include data quality tests and backfill strategy.

Common Pitfalls

  • Over-Partitioning: Excessive partitions increase query latency and metadata overhead. 100-1000 is ideal; validate before deployment.
  • Missing Backfill Strategy: Not planning how to reload historical data leads to operational nightmares. Every model needs a backfill procedure.
  • Ignoring Data Quality: Garbage in = garbage out. Define validation rules before implementation, not after bugs appear.
  • Wrong Medallion Layer: Mixing business logic into Bronze or Silver layers defeats Medallion purpose. Keep layers clean and separated.
  • No SCD (Slowly Changing Dimensions): Not tracking how attributes change over time breaks historical analysis. Use SCD2 for audit trails.
  • Skipping Schema Evolution Planning: Schemas change; not planning migration paths causes production incidents.
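The SCD2 pitfall above comes down to versioning rows instead of overwriting them. A minimal sketch of the close-out logic in plain Python for clarity (at scale a Delta Lake MERGE would do this; the field names are illustrative):

```python
from datetime import date

def apply_scd2(current_rows, incoming, today):
    """Close out changed rows and append new versions (SCD Type 2).

    current_rows: list of dicts with keys: key, value, valid_from, valid_to.
    An open row has valid_to=None. incoming: {key: new_value}.
    """
    out = []
    seen = set()
    for row in current_rows:
        if row["valid_to"] is None and row["key"] in incoming:
            seen.add(row["key"])
            if incoming[row["key"]] != row["value"]:
                # Close the old version instead of overwriting it...
                out.append({**row, "valid_to": today})
                # ...and open a new one, preserving full history.
                out.append({"key": row["key"], "value": incoming[row["key"]],
                            "valid_from": today, "valid_to": None})
                continue
        out.append(row)
    # Brand-new keys get an open row.
    for key, value in incoming.items():
        if key not in seen and all(r["key"] != key for r in current_rows):
            out.append({"key": key, "value": value,
                        "valid_from": today, "valid_to": None})
    return out

dim = [{"key": "c1", "value": "bronze", "valid_from": date(2024, 1, 1), "valid_to": None}]
updated = apply_scd2(dim, {"c1": "gold", "c2": "silver"}, date(2024, 6, 1))
print(len(updated))  # 3 rows: closed c1, new c1 version, new c2
```

Historical analysis then filters on `valid_from`/`valid_to` instead of losing the old value.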

Integration Points

| Phase | Input From | Output To | Context |
| --- | --- | --- | --- |
| Design | architect, requirements | Schema specification | Understand business primitives and data flows |
| Pipeline Dev | Schema specs | implementer | Hand off for ETL/ELT implementation |
| Optimization | Performance metrics | ops-manager | Request infrastructure scaling if needed |
| Quality | Data rules | Monitoring/alerts | Set up data quality checks and validations |

Source

git clone https://github.com/karim-bhalwani/agent-skills-collection

The skill file lives at skills/data-modeler/SKILL.md (view on GitHub).

Overview

The Data Modeler skill manages the lifecycle of data within a Lakehouse environment, focusing on performance, governance, and reliable transformations. It combines Medallion architecture with Data Vault 2.0 to structure Bronze, Silver, and Gold layers and enable scalable, auditable data pipelines.

How This Skill Works

It implements Bronze (Raw), Silver (Clean/Vault), and Gold (Business) layers using Medallion architecture, applying Data Vault 2.0 constructs (Hubs, Links, Satellites) to model keys, relationships, and attributes. Spark optimization techniques (native PySpark, Z-Ordering, partitioning, and OPTIMIZE/VACUUM) drive performance, while governance is enforced via Unity Catalog, RBAC, and storage credential management. CDC support provides append-only audit trails and Delta Lake change feeds.
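For instance, Data Vault hubs are typically keyed by a deterministic hash of the business key. A sketch in plain Python mirroring what `F.sha2(F.concat_ws(...))` would do in PySpark (the delimiter and normalization rules here are assumptions, not part of the skill):

```python
import hashlib

def hub_hash_key(*business_key_parts, delimiter="||"):
    # Normalize parts so the same business key always hashes identically,
    # regardless of source-system casing or padding.
    normalized = delimiter.join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The same customer arriving from two sources lands on one hub row.
k1 = hub_hash_key(" cust-42 ")
k2 = hub_hash_key("CUST-42")
print(k1 == k2)  # True
```

The same hash key then joins Hubs to their Links and Satellites without exposing raw business keys.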

When to Use It

  • Designing database schemas or lakehouse layouts
  • Building ETL/ELT pipelines
  • Optimizing slow queries or high-cost data processing
  • Implementing data privacy and governance controls
  • Architecting lakehouse solutions with Medallion architecture and Data Vault 2.0

Quick Start

  1. Map requirements and design Medallion layers plus Data Vault hubs/links/satellites
  2. Implement PySpark transformations with explicit imports and no UDFs, adding data quality checks
  3. Validate governance (Unity Catalog, RBAC) and performance, then run backfills
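The quality checks in the second step can be sketched as Great Expectations-style expectations in plain Python (the function names mimic, but are not, the real Great Expectations API; the dataset is hypothetical):

```python
def expect_no_nulls(rows, column):
    # Flag row indexes where the column is missing.
    failures = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

def expect_values_between(rows, column, low, high):
    # Flag non-null values outside the allowed range.
    failures = [i for i, r in enumerate(rows)
                if r.get(column) is not None and not (low <= r[column] <= high)]
    return {"success": not failures, "failed_rows": failures}

silver = [
    {"order_id": "o1", "amount": 19.99},
    {"order_id": "o2", "amount": -5.00},   # should fail the range check
]

checks = [
    expect_no_nulls(silver, "order_id"),
    expect_values_between(silver, "amount", 0, 10_000),
]
# Gate promotion to the next layer on all checks passing.
promote = all(c["success"] for c in checks)
print(promote)  # False: the negative amount blocks promotion
```

The pattern is the point: validation rules are defined up front and act as a gate between layers, rather than being bolted on after bugs appear.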

Best Practices

  • Follow PySpark standards: explicit imports (F, T), avoid UDFs, and use .transform() for modularity
  • Design for idempotency and backfill safety; pipelines should re-run cleanly
  • Plan date-based partitioning for time-series data, targeting 100-1000 partitions; assess the overhead before deployment
  • Incorporate data quality checks with Great Expectations or Chispa before moving to next layer
  • Keep Medallion layers clean (Bronze/Silver/Gold) and enforce governance integration (Unity Catalog, RBAC) plus backfill strategy
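The partition-count guidance above can be sanity-checked with simple arithmetic before deployment. The 128 MB target file size below is a common Spark rule of thumb, not a value from this skill:

```python
def estimated_partitions(total_bytes, target_file_bytes=128 * 1024**2):
    # Ceiling division: at least one partition even for tiny tables.
    return max(1, -(-total_bytes // target_file_bytes))

# A hypothetical two-year daily-partitioned table: 730 date partitions,
# comfortably inside the 100-1000 sweet spot.
daily_partitions = 2 * 365
print(daily_partitions)  # 730

# A 50 GB table at ~128 MB per file.
print(estimated_partitions(50 * 1024**3))  # 400
```

If the estimate falls far outside 100-1000, coarsen or refine the partition key (e.g. month vs. day) before the first production run.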

Example Use Cases

  • Implement Bronze/Silver/Gold for a retail dataset with Hub/Link/Satellites in a lakehouse
  • CDC-enabled Delta Lake pipeline capturing changes into Gold layer with audit trails
  • Spark optimization for clickstream data using Z-ordering and optimized partitions
  • Data governance integration using Unity Catalog and RBAC in a lakehouse
  • Backfill-ready pipelines with idempotent loads and schema evolution handling
