behavior-preservation-checker

Install: npx machina-cli add skill ArabelaTso/Skills-4-SE/behavior-preservation-checker --openclaw

Behavior Preservation Checker

Overview

Validate that a migrated or refactored codebase preserves the original behavior by automatically comparing runtime behavior, test results, execution traces, and observable outputs between two repository versions.

Core Workflow

1. Setup Repositories

Prepare both repositories for comparison:

# Clone or locate repositories
ORIGINAL_REPO=/path/to/original
MIGRATED_REPO=/path/to/migrated

# Ensure both are at comparable states
cd $ORIGINAL_REPO && git checkout main
cd $MIGRATED_REPO && git checkout main

2. Run Behavior Comparison

Use the comparison script to analyze behavioral differences:

python scripts/behavior_checker.py \
    --original $ORIGINAL_REPO \
    --migrated $MIGRATED_REPO \
    --output behavior_report.json

3. Review Results

Examine the generated report for:

  • Test result differences
  • Execution trace divergences
  • Output mismatches
  • Performance regressions
  • API contract violations

4. Fix Deviations

Follow actionable guidance to resolve behavioral differences.

Comparison Methods

Method 1: Test-Based Comparison

Run the same test suite on both repositories and compare results:

Workflow:

  1. Identify common test suite (or create equivalent tests)
  2. Run tests on original repository
  3. Run tests on migrated repository
  4. Compare pass/fail status, assertions, and outputs

Example:

# Run on original
cd $ORIGINAL_REPO
pytest tests/ --json-report --json-report-file=original_results.json

# Run on migrated
cd $MIGRATED_REPO
pytest tests/ --json-report --json-report-file=migrated_results.json

# Compare
python scripts/compare_test_results.py \
    original_results.json \
    migrated_results.json
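The comparison step is handled by the skill's compare_test_results.py; as a rough sketch of the idea, the following maps each test's nodeid to its outcome (assuming pytest-json-report's schema, where the report holds a "tests" list with "nodeid" and "outcome" fields) and reports tests whose outcome changed between runs:

```python
import json

def load_outcomes(path):
    """Map each test's nodeid to its outcome ("passed", "failed", ...)."""
    with open(path) as f:
        report = json.load(f)
    return {t["nodeid"]: t["outcome"] for t in report.get("tests", [])}

def diff_outcomes(original, migrated):
    """Return tests whose outcome changed between the two runs."""
    changed = {}
    for nodeid in original.keys() & migrated.keys():
        if original[nodeid] != migrated[nodeid]:
            changed[nodeid] = (original[nodeid], migrated[nodeid])
    return changed
```

Tests present in only one report (added or deleted tests) are worth flagging separately, since they cannot be compared at all.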

Method 2: Execution Trace Comparison

Capture and compare execution traces:

Workflow:

  1. Instrument code to capture function calls, arguments, and return values
  2. Run identical inputs through both versions
  3. Compare execution traces for divergences

Example:

# Trace original
python scripts/trace_execution.py \
    --repo $ORIGINAL_REPO \
    --input test_inputs.json \
    --output original_trace.json

# Trace migrated
python scripts/trace_execution.py \
    --repo $MIGRATED_REPO \
    --input test_inputs.json \
    --output migrated_trace.json

# Compare traces
python scripts/compare_traces.py \
    original_trace.json \
    migrated_trace.json
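trace_execution.py is the skill's script; the core mechanism can be sketched with Python's built-in sys.settrace, which fires a "call" event for every function invoked while the traced code runs. This minimal version records function names only; a real tracer would also capture arguments and return values:

```python
import sys

def capture_trace(func, *args):
    """Run func(*args) and record the name of every function called."""
    trace = []

    def tracer(frame, event, arg):
        if event == "call":
            trace.append(frame.f_code.co_name)
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always uninstall the tracer
    return result, trace
```

Running the same inputs through both versions and diffing the two recorded lists surfaces call-sequence divergences directly.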

Method 3: Observable Output Comparison

Compare program outputs for identical inputs:

Workflow:

  1. Define test inputs (API requests, CLI commands, function calls)
  2. Capture outputs from both versions (stdout, files, API responses)
  3. Compare outputs for differences

Example:

# Test API endpoints
python scripts/compare_api_outputs.py \
    --original-url http://localhost:8000 \
    --migrated-url http://localhost:8001 \
    --test-cases api_test_cases.json
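compare_api_outputs.py is part of the skill; one building block such a script needs is response normalization, so that fields which legitimately differ per run do not register as behavioral differences. A sketch (the VOLATILE_FIELDS set is a made-up example, not something defined by the skill):

```python
import json

# Hypothetical fields expected to differ on every run
VOLATILE_FIELDS = {"timestamp", "request_id"}

def normalize(payload, volatile=VOLATILE_FIELDS):
    """Recursively drop volatile fields so only meaningful content is compared."""
    if isinstance(payload, dict):
        return {k: normalize(v, volatile)
                for k, v in payload.items() if k not in volatile}
    if isinstance(payload, list):
        return [normalize(v, volatile) for v in payload]
    return payload

def responses_match(original_body, migrated_body):
    """Compare two JSON response bodies, ignoring volatile fields and formatting."""
    return normalize(json.loads(original_body)) == normalize(json.loads(migrated_body))
```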

Method 4: Property-Based Testing

Use property-based testing to find behavioral differences:

Workflow:

  1. Define behavioral properties (invariants, contracts)
  2. Generate random inputs
  3. Verify properties hold for both versions
  4. Report any property violations

Example:

# Property: sorting must produce the same result in both versions.
# original_sort and migrated_sort are placeholders for the two
# implementations under comparison.
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_equivalence(input_list):
    original_result = original_sort(input_list)
    migrated_result = migrated_sort(input_list)
    assert original_result == migrated_result

Difference Detection

Test Result Differences

What to check:

  • Tests that pass in original but fail in migrated
  • Tests that fail in original but pass in migrated
  • New test failures
  • Changed assertion messages

Severity levels:

  • Critical: Core functionality tests fail
  • High: Integration tests fail
  • Medium: Edge case tests fail
  • Low: Flaky tests or timing-dependent failures

Execution Trace Differences

What to check:

  • Different function call sequences
  • Different argument values
  • Different return values
  • Missing or extra function calls

Example divergence:

Original trace:
  calculate(x=10) -> 20
  validate(20) -> True
  save(20) -> Success

Migrated trace:
  calculate(x=10) -> 21  # ← Difference!
  validate(21) -> True
  save(21) -> Success
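Given two such traces as lists of step strings, the first point of divergence can be located mechanically; a minimal sketch:

```python
def first_divergence(original_trace, migrated_trace):
    """Return (index, original_step, migrated_step) for the first differing
    entry, using None for a missing step; return None if traces are identical."""
    length = max(len(original_trace), len(migrated_trace))
    for i in range(length):
        orig = original_trace[i] if i < len(original_trace) else None
        migr = migrated_trace[i] if i < len(migrated_trace) else None
        if orig != migr:
            return i, orig, migr
    return None
```

For the traces above, the first divergence is at index 0, which pinpoints calculate as the function to inspect; everything after it merely propagates the wrong value.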

Output Differences

What to check:

  • Different stdout/stderr
  • Different file contents
  • Different API response bodies
  • Different status codes
  • Different error messages

Tolerance levels:

# Exact match required
assert original_output == migrated_output

# Numerical tolerance
assert abs(original_value - migrated_value) < 0.001

# Structural equivalence (ignore formatting)
assert json.loads(original) == json.loads(migrated)
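The three levels above can be folded into a single comparator that picks its strategy from the operand types; a sketch, falling back to exact match for strings that are not valid JSON:

```python
import json

def outputs_equivalent(original, migrated, tolerance=0.001):
    """Exact match by default, numeric tolerance for floats,
    structural equivalence for JSON strings."""
    if isinstance(original, float) or isinstance(migrated, float):
        return abs(original - migrated) < tolerance
    if isinstance(original, str) and isinstance(migrated, str):
        try:
            return json.loads(original) == json.loads(migrated)
        except json.JSONDecodeError:
            return original == migrated  # not JSON: require exact match
    return original == migrated
```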

Actionable Guidance

Pattern 1: Logic Error

Symptom: Different outputs for same inputs

Diagnosis:

python scripts/isolate_difference.py \
    --original $ORIGINAL_REPO \
    --migrated $MIGRATED_REPO \
    --failing-test test_calculation

Guidance:

  1. Identify the diverging function
  2. Compare implementations side-by-side
  3. Check for off-by-one errors, operator changes, or logic inversions
  4. Add unit test for the specific case

Pattern 2: Missing Functionality

Symptom: Tests pass in the original but fail in the migrated version with NotImplementedError or AttributeError

Diagnosis:

python scripts/find_missing_functions.py \
    --original $ORIGINAL_REPO \
    --migrated $MIGRATED_REPO

Guidance:

  1. List all missing functions/methods
  2. Implement missing functionality
  3. Verify with targeted tests
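find_missing_functions.py is the skill's script; the underlying check can be sketched with Python's ast module, comparing the function and method names defined in two versions of a source file (per-file here for brevity, where a real tool would walk both repositories):

```python
import ast

def function_names(source):
    """Collect all function and method names defined in a source string."""
    tree = ast.parse(source)
    return {node.name for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}

def missing_functions(original_source, migrated_source):
    """Names defined in the original but absent from the migrated version."""
    return function_names(original_source) - function_names(migrated_source)
```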

Pattern 3: API Contract Violation

Symptom: Different response structure or status codes

Diagnosis:

python scripts/compare_api_contracts.py \
    --original-spec openapi_original.yaml \
    --migrated-spec openapi_migrated.yaml

Guidance:

  1. Document API contract differences
  2. Update migrated API to match original contract
  3. Add contract tests to prevent future violations
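compare_api_contracts.py is referenced above; at its simplest, the contract diff reduces to comparing the (method, path) sets of two OpenAPI-style specs (assumed already parsed into dicts, e.g. with a YAML loader):

```python
def endpoint_set(spec):
    """Flatten an OpenAPI-style spec dict into a set of (METHOD, path) pairs."""
    return {(method.upper(), path)
            for path, operations in spec.get("paths", {}).items()
            for method in operations}

def contract_diff(original_spec, migrated_spec):
    """Endpoints removed from or added to the migrated API."""
    orig, migr = endpoint_set(original_spec), endpoint_set(migrated_spec)
    return {"removed": orig - migr, "added": migr - orig}
```

A full contract comparison would also diff parameters, response schemas, and status codes per endpoint; this only catches missing or extra routes.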

Pattern 4: Performance Regression

Symptom: Migrated version is significantly slower

Diagnosis:

python scripts/benchmark_comparison.py \
    --original $ORIGINAL_REPO \
    --migrated $MIGRATED_REPO \
    --iterations 100

Guidance:

  1. Profile both versions to identify bottlenecks
  2. Check for algorithmic changes (O(n) → O(n²))
  3. Look for missing optimizations or caching
  4. Verify database query efficiency
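benchmark_comparison.py is the skill's script; a minimal stand-in can be built on the standard timeit module. The 1.5x slowdown threshold here is a hypothetical default, not a value the skill defines:

```python
import timeit

def compare_runtime(original_fn, migrated_fn, iterations=100, threshold=1.5):
    """Time both callables; flag a regression if the migrated version is
    more than `threshold` times slower than the original."""
    t_orig = timeit.timeit(original_fn, number=iterations)
    t_migr = timeit.timeit(migrated_fn, number=iterations)
    ratio = t_migr / t_orig if t_orig > 0 else float("inf")
    return {"original_s": t_orig, "migrated_s": t_migr,
            "ratio": ratio, "regression": ratio > threshold}
```

For stable numbers, run on a quiet machine, pin dependency versions, and compare medians over several repetitions rather than a single pair of timings.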

Pattern 5: State Management Issues

Symptom: Tests fail intermittently or depend on execution order

Diagnosis:

python scripts/detect_state_issues.py \
    --repo $MIGRATED_REPO \
    --test-suite tests/

Guidance:

  1. Identify shared state between tests
  2. Add proper setup/teardown
  3. Ensure test isolation
  4. Check for global variable usage
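As a concrete illustration of the failure mode, a module-level cache is shared state that leaks between tests unless it is cleared in setup/teardown:

```python
# Module-level cache: shared state that makes tests order-dependent.
_cache = {}

def cached_lookup(key, compute):
    """Return a cached value, computing it on first access."""
    if key not in _cache:
        _cache[key] = compute(key)
    return _cache[key]

def reset_cache():
    """Call from each test's setup/teardown to guarantee isolation."""
    _cache.clear()
```

Without the reset, a test that populates the cache changes what a later test observes, which is exactly the intermittent, order-dependent behavior described above.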

Report Format

The behavior checker generates a comprehensive JSON report:

{
  "summary": {
    "total_tests": 150,
    "passed_both": 140,
    "failed_both": 2,
    "passed_original_failed_migrated": 5,
    "failed_original_passed_migrated": 3,
    "behavioral_equivalence": "94.7%"
  },
  "differences": [
    {
      "type": "test_failure",
      "test_name": "test_user_authentication",
      "severity": "critical",
      "original_result": "passed",
      "migrated_result": "failed",
      "error_message": "AssertionError: Expected 200, got 401",
      "guidance": "Check authentication logic in migrated version",
      "affected_files": ["auth/login.py"]
    }
  ],
  "recommendations": [
    "Fix 5 critical test failures before deployment",
    "Review 3 output differences for correctness"
  ]
}
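Consumers of the report can recompute the headline number and filter by severity; a sketch, taking behavioral equivalence to mean the share of tests with the same outcome in both versions:

```python
def summarize_report(report):
    """Recompute behavioral equivalence and extract critical differences."""
    summary = report["summary"]
    same_outcome = summary["passed_both"] + summary["failed_both"]
    equivalence = 100.0 * same_outcome / summary["total_tests"]
    critical = [d for d in report.get("differences", [])
                if d.get("severity") == "critical"]
    return equivalence, critical
```

This is also a natural CI gate: fail the build when equivalence drops below a threshold or any critical difference is present.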

Best Practices

  1. Start with tests: Ensure comprehensive test coverage before migration
  2. Incremental validation: Check behavior after each migration step
  3. Document intentional changes: Mark expected behavioral differences
  4. Use multiple comparison methods: Combine tests, traces, and outputs
  5. Automate the process: Integrate into CI/CD pipeline
  6. Set tolerance thresholds: Define acceptable differences (e.g., timing, formatting)

Resources

  • references/comparison_techniques.md: Detailed comparison methodologies
  • references/difference_patterns.md: Common behavioral difference patterns
  • scripts/behavior_checker.py: Main comparison orchestrator
  • scripts/compare_test_results.py: Test result comparison
  • scripts/trace_execution.py: Execution trace capture
  • scripts/compare_traces.py: Trace comparison and analysis

Source

git clone https://github.com/ArabelaTso/Skills-4-SE

The skill file lives at skills/behavior-preservation-checker/SKILL.md in that repository.


How This Skill Works

Set up both repositories (original and migrated) and run the analysis script to generate behavior_report.json. The tool compares test results, execution traces, and observable outputs, highlights differences, and surfaces concrete guidance for fixing deviations and restoring behavioral equivalence.

When to Use It

  • Validating code migrations to ensure behavior is preserved.
  • Assessing refactorings that should be behaviorally equivalent.
  • Porting code to a new language version or runtime.
  • Upgrading frameworks or dependencies with backward-compat changes.
  • Transformations where observable behavior must remain the same.

Quick Start

  1. Define the ORIGINAL_REPO and MIGRATED_REPO paths and ensure both are at comparable states.
  2. Run the checker: python scripts/behavior_checker.py --original $ORIGINAL_REPO --migrated $MIGRATED_REPO --output behavior_report.json
  3. Open behavior_report.json, review the differences, and follow the guidance to fix deviations.

Additional Best Practices

  • Define deterministic inputs and fixtures to enable reproducible comparisons.
  • Run comparisons in isolated environments with locked dependencies.
  • Ensure test suites in both versions cover equivalent functionality.
  • Compare not only test results but also execution traces and API responses.
  • Automate reporting and integrate the checker into CI with deviation thresholds.

Example Use Cases

  • Porting Python 2 code to Python 3 while preserving outputs.
  • Migrating a microservice from Flask to FastAPI without changing behavior.
  • Upgrading a data processing pipeline with API changes but identical results.
  • Refactoring a core module to improve structure while keeping semantics.
  • Replacing a REST API client with a vendored version without breaking contracts.
