
JSON Data File Validation Test Design

Extracted: 2026-02-11
Context: Validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules.

Problem

JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:

  • Stray characters from OCR/copy-paste (e.g., ß mixed into Japanese text)
  • Schema violations that the app silently swallows
  • Cross-reference mismatches (source data vs generated output)
  • Missing or duplicate entries
  • Business rule violations (e.g., correct answer not in choices)

Manual review of large files (60+ entries, 3000+ lines) is unreliable.
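The business-rule bullet above (correct answer must appear in the choices) can be sketched as a plain helper that a pytest test then asserts empty; the `answer` and `choices` field names are assumptions for illustration:

```python
def find_answers_not_in_choices(entries):
    """Business-rule check: every entry's correct answer must be one of its choices.

    Returns the ids of offending entries (hypothetical `answer`/`choices` fields),
    so a single run lists every violation instead of stopping at the first.
    """
    return [e["id"] for e in entries if e["answer"] not in e["choices"]]
```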

Solution: Layered Pytest Validation

Structure tests in layers from structural to semantic:

Layer 1: Top-level structure

class TestTopLevelStructure:
    def test_required_fields(self, data):
        missing = {"totalItems", "items"} - data.keys()
        assert not missing, f"Missing top-level fields: {missing}"

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])

Layer 2: Per-entry schema validation

class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES, f"Entry {e['id']}: unknown type {e['type']!r}"
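The required-fields loop above can also be written in the collect-all-errors shape (see Key Design Decisions below), so one run reports every incomplete entry; the `REQUIRED` set here is a hypothetical example:

```python
REQUIRED = {"id", "type", "text"}  # hypothetical required keys for this domain

def find_missing_fields(entries, required=REQUIRED):
    """Return one message per entry with missing keys, rather than failing fast."""
    issues = []
    for e in entries:
        missing = required - e.keys()
        if missing:
            issues.append(f"Entry {e.get('id', '?')}: missing {sorted(missing)}")
    return issues
```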

Layer 3: Cross-entry consistency

class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        assert len(ids) == len(set(ids)), "Duplicate ids present"

    def test_references_resolve(self, entries, categories):
        # Every entry's category must exist in the categories list
        unknown = [e["id"] for e in entries if e["category"] not in categories]
        assert not unknown, f"Entries with unknown category: {unknown}"
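The length comparison above only detects that duplicates exist; a sketch using `collections.Counter` that also names them, so the failure message points at the culprit entries:

```python
from collections import Counter

def find_duplicate_ids(entries):
    """Return the ids that occur more than once, in sorted order."""
    counts = Counter(e["id"] for e in entries)
    return sorted(i for i, n in counts.items() if n > 1)
```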

Layer 4: Source cross-reference

class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                mismatches.append(f"{e['id']}: {e['answer']!r} != {source_data[e['id']]!r}")
        assert not mismatches, f"{len(mismatches)} mismatches: {mismatches}"
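A standalone sketch of the cross-reference comparison, assuming `source_data` is a dict mapping entry id to the expected answer; unlike a bare equality check, it also catches entries that are absent from the source instead of raising `KeyError`:

```python
def find_source_mismatches(entries, source_data):
    """Compare each generated answer against the source; report all problems."""
    mismatches = []
    for e in entries:
        expected = source_data.get(e["id"])
        if expected is None:
            mismatches.append(f"{e['id']}: not found in source")
        elif e["answer"] != expected:
            mismatches.append(f"{e['id']}: generated {e['answer']!r} != source {expected!r}")
    return mismatches
```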

Layer 5: Content quality heuristics

class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"ß", "€", "£"}  # Characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues, f"Stray characters found: {issues}"
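The stray-character scan can likewise be a plain helper. The character set is domain-specific; the ß/€/£ set below just mirrors the examples above and would be tuned per dataset:

```python
STRAY_CHARS = {"ß", "€", "£"}  # characters unlikely to be legitimate in this domain

def find_stray_characters(entries, stray=STRAY_CHARS):
    """Flag every entry whose text contains a suspicious character."""
    issues = []
    for e in entries:
        for ch in sorted(stray):
            if ch in e["text"]:
                issues.append(f"{e['id']}: {ch!r}")
    return issues
```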

Key Design Decisions

  • Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading per test
  • Collect-all-errors pattern: accumulate issues in a list, assert at end, so one test run shows all problems
  • Graceful degradation: source cross-reference tests skip with pytest.skip() if source files are absent
  • Domain-aware thresholds: min length for text depends on the domain (e.g., 2 chars for Japanese terms like "過学習")
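The first and third bullets might look like this in a `conftest.py`; the `data.json` and `source/` paths are assumptions, and the source-parsing body is deliberately left elided:

```python
import json
from pathlib import Path

import pytest

DATA_PATH = Path("data.json")  # assumed location of the generated file
SOURCE_DIR = Path("source")    # assumed location of the original source files

def load_json(path):
    # Plain loader so it can be reused (and tested) outside pytest
    return json.loads(Path(path).read_text(encoding="utf-8"))

@pytest.fixture(scope="module")
def data():
    # scope="module": the file is parsed once and shared by every test in the module
    return load_json(DATA_PATH)

@pytest.fixture(scope="module")
def source_data():
    # Graceful degradation: cross-reference tests skip when sources are absent
    if not SOURCE_DIR.exists():
        pytest.skip(f"source files not found under {SOURCE_DIR}")
    ...  # parse the original source files here
```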

When to Use

  • After generating/rebuilding JSON data files from external sources
  • As a CI gate for data files that feed into apps
  • When a data file is too large for manual review
  • When data is parsed from inconsistent sources (OCR, PDF export, manual entry)

Source

View on GitHub: https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/json-data-validation-test-design/SKILL.md

Overview

This skill describes a layered pytest-based approach to validating large auto-generated JSON data files against their schema, source data, and business rules. It targets files too big for manual review by structuring tests from structural checks to semantic quality.

How This Skill Works

Tests are organized into five layers: top-level structure, per-entry schema validation, cross-entry consistency, source cross-reference, and content quality. It uses module-scoped fixtures to parse JSON once and a collect-all-errors pattern to accumulate issues before asserting. It also supports graceful skipping when source data is unavailable, and promotes domain-aware thresholds.


Quick Start

  1. Structure tests into the five layers (top-level, per-entry, cross-entry, source cross-reference, content quality)
  2. Use a module-scoped fixture to load and share the parsed JSON, and collect all errors before asserting
  3. Run pytest in CI to gate changes to the data files

Best Practices

  • Use module-scoped fixtures to parse JSON once per test module
  • Adopt collect-all-errors to report all issues in a single run
  • Gracefully skip tests if source files are missing
  • Apply domain-aware thresholds, e.g., minimum text length
  • Follow the layered test design from structure to content quality

Example Use Cases

  • Validating a large exam questions JSON file generated by a build script
  • QA for a product catalog feed imported from CSV or API
  • Cross-checking source answers against generated question data
  • Spotting stray characters introduced by OCR in multilingual datasets
  • Enforcing that every entry's answer is present in its choices
