data-quality-frameworks
npx machina-cli add skill karim-bhalwani/agent-skills-collection/data-quality-frameworks --openclaw
Data Quality Frameworks
Expert in embedding quality validation into data pipelines as a dependency, not an afterthought.
When to Use This Skill
Use when:
- Building comprehensive data quality validation into pipelines
- Setting up Great Expectations suites and automated checkpoints
- Creating dbt test suites (schema tests, relationship tests, custom tests)
- Establishing data contracts between producer and consumer teams
- Monitoring data quality metrics, SLAs, and anomalies
- Debugging data quality failures or regressions
- Implementing layer-based validation (Bronze schema, Silver business rules, Gold aggregations)
- Blocking bad data from proceeding downstream
Core Capabilities
- Great Expectations - Build expectation suites, checkpoints, automated validation
- dbt Testing - Schema, relationship, and custom test strategies
- Data Contracts - Producer/consumer agreements (ODCS, datacontract-cli)
- Quality Monitoring - Continuous validation, metrics tracking, alerting
- Debugging - Root-cause analysis of data quality anomalies
- Layer-Based Validation - Schema (Bronze), rules (Silver), aggregation checks (Gold)
Framework References
For detailed implementation guidance, see:
Great Expectations
Use when: Building GE validation suites and checkpoints
Covers:
- Building comprehensive expectation suites
- Checkpoint configuration and automation
- Running validations and handling failures
- Integration patterns and alerting
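The suite-plus-checkpoint pattern above can be sketched in plain Python. This is an illustrative sketch of the structure, not the Great Expectations API; the expectation names merely mirror GE's naming convention.

```python
# Illustrative sketch of the expectation-suite + checkpoint pattern that
# Great Expectations implements; plain Python, not the GE API.

def expect_column_values_not_null(rows, column):
    failed = [r for r in rows if r.get(column) is None]
    return {"expectation": f"{column} not_null",
            "success": not failed, "failed_count": len(failed)}

def expect_column_values_between(rows, column, low, high):
    failed = [r for r in rows
              if r.get(column) is not None and not (low <= r[column] <= high)]
    return {"expectation": f"{column} between {low} and {high}",
            "success": not failed, "failed_count": len(failed)}

def run_checkpoint(rows, suite):
    """Run every expectation in the suite and report overall success."""
    results = [check(rows) for check in suite]
    return {"success": all(r["success"] for r in results), "results": results}

orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},    # violates not_null
    {"order_id": 3, "amount": -5.0},    # violates range check
]
suite = [
    lambda rows: expect_column_values_not_null(rows, "order_id"),
    lambda rows: expect_column_values_not_null(rows, "amount"),
    lambda rows: expect_column_values_between(rows, "amount", 0, 10_000),
]
report = run_checkpoint(orders, suite)
print(report["success"])  # overall failure: one null amount, one negative amount
```

A real checkpoint does the same thing at pipeline runtime: evaluate every expectation in the suite against a batch, then hand the aggregate result to alerting or a blocking step.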
dbt Testing
Use when: Creating dbt test suites
Covers:
- Schema tests (unique, not_null, accepted_values, relationships)
- Custom generic tests (reusable across models)
- Singular tests (specific business rules)
- Test coverage best practices
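The schema and relationship tests above are declared in YAML alongside the model. A minimal sketch, where the model and column names are illustrative:

```yaml
# models/schema.yml -- model and column names are illustrative
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
```

`dbt test` then compiles each entry into a SQL query that must return zero rows; the `relationships` test is the cross-table foreign-key check.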
Data Contracts
Use when: Establishing producer/consumer agreements
Covers:
- Data contract specification format
- Schema definitions with PII classification
- Quality expectations and SLA definitions
- Contract versioning and evolution
- Validation against contracts
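A contract of this shape can be sketched in YAML. The field names below are simplified illustrations of the concepts listed above, not an exact ODCS or datacontract-cli schema; consult the specification you adopt for the real field names.

```yaml
# Illustrative data contract sketch; simplified field names, not an
# exact ODCS or datacontract-cli schema.
id: orders-contract
version: 1.2.0             # bump on any breaking schema change
owner: checkout-team       # producer team accountable for the data
schema:
  orders:
    fields:
      order_id: {type: bigint, required: true, unique: true}
      email:    {type: string, pii: true}    # PII classification
      amount:   {type: decimal, required: true}
quality:
  - description: amount is non-negative
    rule: amount >= 0
sla:
  freshness: 2h            # data no older than 2 hours
  availability: 99.5%
```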
Automated Quality Pipeline
Use when: Orchestrating end-to-end quality validation
Covers:
- Building orchestrated quality pipelines
- Multi-table validation workflows
- Quality reporting and metrics
- Integration with Airflow/orchestrators
- Blocking pipelines on failures
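The blocking behavior can be sketched as follows; the check functions and table names are hypothetical stand-ins for real validations (GE checkpoints, `dbt test`, etc.).

```python
# Minimal sketch of an orchestrated quality gate; checks and table names
# are hypothetical stand-ins for real validations.

class DataQualityError(Exception):
    """Raised to block the pipeline when a critical check fails."""

def validate_tables(checks):
    """Run every (table, name, check) triple and build a quality report."""
    return [{"table": table, "check": name, "success": check()}
            for table, name, check in checks]

def quality_gate(report):
    """Raise on any failure so the orchestrator marks the task failed."""
    failures = [r for r in report if not r["success"]]
    if failures:
        raise DataQualityError(f"{len(failures)} check(s) failed: {failures}")
    return True

checks = [
    ("orders",    "row_count > 0", lambda: True),
    ("customers", "no null ids",   lambda: True),
]
report = validate_tables(checks)
quality_gate(report)   # raises DataQualityError if any check failed
```

In Airflow, `quality_gate` would be the body of a task: an uncaught exception fails the task, which in turn blocks all downstream tasks from running on bad data.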
Quick Decision Guide
| Goal | Reference |
|---|---|
| Build GE validation suite | Great Expectations |
| Add dbt tests to models | dbt Testing |
| Define producer/consumer contract | Data Contracts |
| Orchestrate multi-table validation | Automated Quality Pipeline |
Quality Strategy
Layer-Based Testing
- Bronze (Schema): Validate schema, data types, null constraints
- Silver (Business Rules): Test foreign keys, categorical values, ranges
- Gold (Aggregations): Verify aggregation logic, metric calculations
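One check per layer can be sketched as below; the specific rules are illustrative examples of each layer's concern.

```python
# Sketch of layer-based checks; the rules shown are illustrative.

def bronze_schema_check(row):
    """Bronze: expected columns, expected types, required fields present."""
    return (isinstance(row.get("order_id"), int)
            and isinstance(row.get("status"), str)
            and row.get("amount") is not None)

def silver_rule_check(row, valid_statuses=frozenset({"placed", "shipped", "returned"})):
    """Silver: business rules -- categorical values and numeric ranges."""
    return row["status"] in valid_statuses and 0 <= row["amount"] <= 10_000

def gold_aggregation_check(rows, reported_total):
    """Gold: a recomputed metric must match the reported aggregate."""
    return abs(sum(r["amount"] for r in rows) - reported_total) < 1e-6

rows = [
    {"order_id": 1, "status": "placed",  "amount": 20.0},
    {"order_id": 2, "status": "shipped", "amount": 30.0},
]
assert all(bronze_schema_check(r) for r in rows)
assert all(silver_rule_check(r) for r in rows)
assert gold_aggregation_check(rows, reported_total=50.0)
```

Each layer only runs on data that passed the previous one, so a Gold-level failure can be debugged knowing schema and business rules already held.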
Test Pyramid
- Most tests: Single column validations (fast, focused)
- Fewer tests: Cross-table relationships (slower, broader)
- Blocking vs Warning: Block bad data; warn on minor issues
Best Practices
Do's:
- ✅ Test early - Validate source data before transformations
- ✅ Test incrementally - Add tests as you find issues
- ✅ Document expectations - Clear descriptions for each test
- ✅ Alert on failures - Integrate with monitoring
- ✅ Version contracts - Track schema changes
Don'ts:
- ❌ Don't test everything - Focus on critical columns
- ❌ Don't ignore warnings - They often precede failures
- ❌ Don't skip freshness - Stale data is bad data
- ❌ Don't hardcode thresholds - Use dynamic baselines
- ❌ Don't test in isolation - Test relationships too
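The dynamic-baseline point can be sketched as a rolling mean-plus-sigma row-count check: rather than a hardcoded minimum, flag today's count when it falls outside a band derived from recent history. The daily counts below are made-up sample data.

```python
# Dynamic baseline instead of a hardcoded threshold: flag today's row count
# if it deviates more than `sigmas` standard deviations from the trailing
# window. Sample counts are made up.
import statistics

def is_anomalous(history, today, sigmas=3.0):
    """Return True if `today` is outside mean +/- sigmas * stdev of history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) > sigmas * stdev

daily_row_counts = [10_120, 9_980, 10_340, 10_050, 10_200, 9_910, 10_150]
print(is_anomalous(daily_row_counts, today=10_100))  # normal day
print(is_anomalous(daily_row_counts, today=2_300))   # likely upstream outage
```

The baseline moves with the data, so seasonal growth or shrinkage does not require retuning a fixed threshold.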
Common Pitfalls & Fixes
| Pitfall | Fix |
|---|---|
| Testing only prod | Run tests on dev first: dbt test --target dev |
| Generic thresholds | Tailor tests to data characteristics |
| No alerting | Integrate with monitoring; block failures |
| Outdated expectations | Review and refresh expectations quarterly |
| Too many tests | Focus on business-critical quality dimensions |
| Ignoring false positives | Configure expectations to handle edge cases |
Dependencies
- data-pipeline-engineer - For pipeline orchestration and debugging
- dbt-transformation-patterns - For dbt project integration
- ops-manager - For monitoring dashboards and alerting (out of scope for this skill)
Source
https://github.com/karim-bhalwani/agent-skills-collection/blob/main/skills/data-quality-frameworks/SKILL.md
Overview
Specialist in embedding quality validation into data pipelines as a dependency, not an afterthought. This skill covers building and enforcing data quality gates using Great Expectations, dbt tests, and data contracts to ensure reliable, trustworthy analytics.
How This Skill Works
You design expectation suites in Great Expectations, implement dbt schema, relationship, and custom tests, and codify producer/consumer contracts. Automated checks run at defined pipeline stages (Bronze, Silver, Gold) with metrics, alerting, and failure blocking so that bad data cannot advance.
Quick Start
- Step 1: Define Great Expectations suites for critical pipelines
- Step 2: Add dbt schema/relationship tests to models
- Step 3: Establish data contracts and wire up automated validations
Best Practices
- Test early: validate source data before transformations
- Test incrementally: add tests as issues arise
- Document expectations: provide clear descriptions for each test
- Alert on failures: integrate with monitoring and alerts
- Apply layer-based validation: Bronze (schema), Silver (rules), Gold (aggregations)
Example Use Cases
- A retail data lake implementing GE suites to validate daily sales feeds
- dbt tests covering model schemas, relationships, and custom validations
- Data contracts between data producer teams and downstream analytics consumers
- An automated quality pipeline orchestrated by Airflow with multi-table checks
- Layer-based validation enforcing Bronze, Silver, and Gold checks before reporting