
JSON Data File Validation Test Design

Extracted: 2026-02-11
Context: Validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules.

Problem

JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:

  • Stray characters from OCR/copy-paste (e.g., ß mixed into Japanese text)
  • Schema violations that the app silently swallows
  • Cross-reference mismatches (source data vs generated output)
  • Missing or duplicate entries
  • Business rule violations (e.g., correct answer not in choices)

Manual review of large files (60+ entries, 3000+ lines) is unreliable.
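The business-rule bullet above (correct answer must appear in the choices) can be sketched as a plain helper that a pytest test then asserts empty; the `answer` and `choices` field names are assumptions for illustration:

```python
def find_answers_not_in_choices(entries):
    """Business-rule check: every entry's correct answer must be one of its choices.

    Returns the ids of offending entries (hypothetical `answer`/`choices` fields),
    so a single run lists every violation instead of stopping at the first.
    """
    return [e["id"] for e in entries if e["answer"] not in e["choices"]]
```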

Solution: Layered Pytest Validation

Structure tests in layers from structural to semantic:

Layer 1: Top-level structure

class TestTopLevelStructure:
    def test_required_fields(self, data):
        missing = {"totalItems", "items"} - data.keys()
        assert not missing, f"Missing top-level fields: {missing}"

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])

Layer 2: Per-entry schema validation

class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES, f"Entry {e['id']}: unknown type {e['type']!r}"
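The required-fields loop above can also be written in the collect-all-errors shape (see Key Design Decisions below), so one run reports every incomplete entry; the `REQUIRED` set here is a hypothetical example:

```python
REQUIRED = {"id", "type", "text"}  # hypothetical required keys for this domain

def find_missing_fields(entries, required=REQUIRED):
    """Return one message per entry with missing keys, rather than failing fast."""
    issues = []
    for e in entries:
        missing = required - e.keys()
        if missing:
            issues.append(f"Entry {e.get('id', '?')}: missing {sorted(missing)}")
    return issues
```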

Layer 3: Cross-entry consistency

class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        assert len(ids) == len(set(ids)), "Duplicate ids present"

    def test_references_resolve(self, entries, categories):
        # Every entry's category must exist in the categories list
        unknown = [e["id"] for e in entries if e["category"] not in categories]
        assert not unknown, f"Entries with unknown category: {unknown}"
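The length comparison above only detects that duplicates exist; a sketch using `collections.Counter` that also names them, so the failure message points at the culprit entries:

```python
from collections import Counter

def find_duplicate_ids(entries):
    """Return the ids that occur more than once, in sorted order."""
    counts = Counter(e["id"] for e in entries)
    return sorted(i for i, n in counts.items() if n > 1)
```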

Layer 4: Source cross-reference

class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                mismatches.append(f"{e['id']}: {e['answer']!r} != {source_data[e['id']]!r}")
        assert not mismatches, f"{len(mismatches)} mismatches: {mismatches}"
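A standalone sketch of the cross-reference comparison, assuming `source_data` is a dict mapping entry id to the expected answer; unlike a bare equality check, it also catches entries that are absent from the source instead of raising `KeyError`:

```python
def find_source_mismatches(entries, source_data):
    """Compare each generated answer against the source; report all problems."""
    mismatches = []
    for e in entries:
        expected = source_data.get(e["id"])
        if expected is None:
            mismatches.append(f"{e['id']}: not found in source")
        elif e["answer"] != expected:
            mismatches.append(f"{e['id']}: generated {e['answer']!r} != source {expected!r}")
    return mismatches
```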

Layer 5: Content quality heuristics

class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"ß", "€", "£"}  # Characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues, f"Stray characters found: {issues}"
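The stray-character scan can likewise be a plain helper. The character set is domain-specific; the ß/€/£ set below just mirrors the examples above and would be tuned per dataset:

```python
STRAY_CHARS = {"ß", "€", "£"}  # characters unlikely to be legitimate in this domain

def find_stray_characters(entries, stray=STRAY_CHARS):
    """Flag every entry whose text contains a suspicious character."""
    issues = []
    for e in entries:
        for ch in sorted(stray):
            if ch in e["text"]:
                issues.append(f"{e['id']}: {ch!r}")
    return issues
```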

Key Design Decisions

  • Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading per test
  • Collect-all-errors pattern: accumulate issues in a list, assert at end, so one test run shows all problems
  • Graceful degradation: source cross-reference tests skip with pytest.skip() if source files are absent
  • Domain-aware thresholds: min length for text depends on the domain (e.g., 2 chars for Japanese terms like "過学習")
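The first and third bullets might look like this in a `conftest.py`; the `data.json` and `source/` paths are assumptions, and the source-parsing body is deliberately left elided:

```python
import json
from pathlib import Path

import pytest

DATA_PATH = Path("data.json")  # assumed location of the generated file
SOURCE_DIR = Path("source")    # assumed location of the original source files

def load_json(path):
    # Plain loader so it can be reused (and tested) outside pytest
    return json.loads(Path(path).read_text(encoding="utf-8"))

@pytest.fixture(scope="module")
def data():
    # scope="module": the file is parsed once and shared by every test in the module
    return load_json(DATA_PATH)

@pytest.fixture(scope="module")
def source_data():
    # Graceful degradation: cross-reference tests skip when sources are absent
    if not SOURCE_DIR.exists():
        pytest.skip(f"source files not found under {SOURCE_DIR}")
    ...  # parse the original source files here
```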

When to Use

  • After generating/rebuilding JSON data files from external sources
  • As a CI gate for data files that feed into apps
  • When a data file is too large for manual review
  • When data is parsed from inconsistent sources (OCR, PDF export, manual entry)

Source

View on GitHub: https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/json-data-validation-test-design/SKILL.md

Overview

This skill describes a layered pytest-based approach to validating large auto-generated JSON data files against their schema, source data, and business rules. It targets files too big for manual review by structuring tests from structural checks to semantic quality.

How This Skill Works

Tests are organized into five layers: top-level structure, per-entry schema validation, cross-entry consistency, source cross-reference, and content quality. It uses module-scoped fixtures to parse JSON once and a collect-all-errors pattern to accumulate issues before asserting. It also supports graceful skipping when source data is unavailable, and promotes domain-aware thresholds.


Quick Start

  1. Structure tests into the five layers (top-level, per-entry, cross-entry, source cross-reference, content quality)
  2. Use a module-scoped fixture to load and share the parsed JSON, and collect all errors before asserting
  3. Run pytest in CI to gate changes to the data files

Best Practices

  • Use module-scoped fixtures to parse JSON once per test module
  • Adopt collect-all-errors to report all issues in a single run
  • Gracefully skip tests if source files are missing
  • Apply domain-aware thresholds, e.g., minimum text length
  • Follow the layered test design from structure to content quality

Example Use Cases

  • Validating a large exam questions JSON file generated by a build script
  • QA for a product catalog feed imported from CSV or API
  • Cross-checking source answers against generated question data
  • Spotting stray characters introduced by OCR in multilingual datasets
  • Enforcing that every entry's answer is present in its choices
