Get the FREE Ultimate OpenClaw Setup Guide →

forge

npx machina-cli add skill quantsquirrel/claude-forge-smith/forge --openclaw
Files (1)
SKILL.md
12.1 KB

Forge Skill

Systematically improve a skill through test-driven evaluation and statistical validation.

REQUIRED BACKGROUND: First invoke forge:smelt skill for TDD methodology applied to skills. See testing-skills-with-subagents.md in that skill's directory for pressure scenario templates.

Quick Reference

StepAction
1Trial Branch 생성
2스킬 찾기 및 읽기
3Pressure Scenario 생성
4기준선 평가 (3회) - evaluator 사용
5Discoverability 평가 (CSO 체크)
6개선 사항 식별
7개선 적용 (GREEN Phase)
8개선 후 평가 (3회)
9신뢰구간 분리 확인
10Trial Branch 결과 처리 (병합/폐기)
11Stats 업데이트 (upgraded: true)

Upgrade Mode Selection

스킬 업그레이드 시작 전, 적합한 모드를 자동 선택합니다.

Mode Decision Flow

스킬 분석 (get_upgrade_mode 호출)
    │
    ├─ "TDD_FIT" ──→ TDD Mode (기존 워크플로우)
    │                 - Trial Branch
    │                 - 3x 평가 + 95% CI (또는 n=5 고정밀)
    │                 - 통계적 검증
    │
    └─ "HEURISTIC" ──→ Heuristic Mode (신규)
                       - Usage 데이터 분석
                       - 구조 품질 평가
                       - 자동 개선 제안

Mode Detection

업그레이드 시작 시 다음 bash 함수를 호출하여 모드 결정:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/storage-local.sh"
MODE=$(get_upgrade_mode "$skill_name")

TDD Mode Options

OptionSample SizeCI WidthWhen to Use
Standardn=3Wider빠른 피드백, 대부분의 경우
High Precisionn=5Narrower미묘한 개선 검증, 중요한 스킬

사용자가 /forge --precision=high 또는 /forge -n5로 n=5 모드 선택 가능

TDD Mode (기존)

  • 조건: 테스트 파일 또는 pressure-scenarios.md 존재
  • 검증: check_skill_has_test() → true
  • 워크플로우: Step 1-11 (기존 그대로)

Heuristic Mode (신규)

  • 조건: 테스트 파일 없음
  • 검증: get_upgrade_mode() → "HEURISTIC"
  • 워크플로우:
    1. Usage 데이터 로드 (get_all_skills_summary())
    2. 서브에이전트 호출: Task(subagent_type="forge:heuristic-evaluator", prompt="Evaluate skill: <skill-name>")
    3. 점수 60 미만 → 자동 개선 제안 적용
    4. Trial Branch에서 개선 적용
    5. 1주일 후 사용량 변화로 검증 (get_usage_trend())

Hybrid Fallback

TDD Mode에서 신뢰구간 분리 실패 시 → Heuristic Mode로 전환 옵션 제공

When to Use

Use this skill when:

  • A skill has low discoverability or unclear instructions
  • You want to improve skill quality with measurable metrics
  • You need to upgrade a skill with verified improvements (confidence interval separation)

Do NOT use when:

  • The skill doesn't exist yet (create it first)
  • Quick one-off fixes are needed (just edit directly)

Arguments

ArgumentRequiredDescription
skill-nameYesThe skill to upgrade (e.g., superpowers:tdd, forge:monitor)
--iterations NNoMaximum upgrade iterations (default: 6)
--dry-runNoEvaluate only, don't apply changes

Example: /forge:forge superpowers:tdd --iterations 3

Prerequisites

Before using this skill, ensure:

  1. Required files exist:

    • ~/.claude/plugins/local/forge/hooks/lib/statistics.sh
    • ~/.claude/plugins/local/forge/hooks/lib/trial-branch.sh
    • forge:evaluator subskill
  2. Git initialized in target directory (for Trial Branch)

  3. First-time setup: Invoke forge:smelt once to understand TDD methodology

Workflow

1. Trial Branch 생성

스킬 변경을 격리된 브랜치에서 진행합니다.

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/trial-branch.sh"

# 현재 브랜치 저장 후 Trial Branch 생성
ORIGINAL=$(git branch --show-current)
TRIAL=$(create_trial_branch "skill-name")

Trial Branch 목적:

  • 스킬 변경을 실험적으로 격리
  • 향상 실패 시 깔끔하게 폐기
  • 향상 성공 시 원본 브랜치에 병합

2. Locate Skill

Skill locations (check in order):
~/.claude/skills/{skill-name}/SKILL.md
~/.claude/plugins/**/skills/{skill-name}/SKILL.md
.claude/skills/{skill-name}/SKILL.md

If skill not found, report and exit.

3. Create Pressure Scenario Test

Generate 2-3 realistic scenarios that would trigger this skill:

Good scenario template:

SCENARIO: [Realistic situation where skill should apply]

Context:
- [Specific constraint 1]
- [Specific constraint 2]
- [Time/resource pressure]

Task: [What agent must do]

Success criteria:
- [ ] Agent recognizes skill applies
- [ ] Agent follows skill correctly
- [ ] Agent produces expected outcome

Pressure types to combine (pick 3+):

  • Time pressure
  • Sunk cost
  • Authority override
  • Exhaustion
  • Pragmatic shortcuts

4. 기준선 평가 (3회)

evaluator 서브스킬 사용 - critic 에이전트로 평가 실행:

Task(subagent_type="oh-my-claudecode:critic",
     model="opus",
     prompt="forge:evaluator 스킬을 사용하여 {skill-name} 평가.
             시나리오: {scenario}
             rubric.md의 기준 참조.")

3회 반복하여 scores 배열 수집:

# scores.txt에 한 줄씩 저장
echo "85" >> baseline-scores.txt
echo "88" >> baseline-scores.txt
echo "82" >> baseline-scores.txt

신뢰구간 계산:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/statistics.sh"

# 기준선 평균 및 95% CI
BASELINE_MEAN=$(calc_mean baseline-scores.txt)
read BASELINE_LOWER BASELINE_UPPER < <(calc_ci baseline-scores.txt)

echo "기준선: 평균=$BASELINE_MEAN, CI=[$BASELINE_LOWER, $BASELINE_UPPER]"

5. Evaluate Discoverability (CSO Check)

Check skill's search optimization:

CheckPass/Fail
Description starts with "Use when..."
Description has specific triggers
Keywords match search terms
Name is verb-first, descriptive
No workflow summary in description

6. Identify Improvements

Based on test results:

If agent didn't find skill:

  • Improve description triggers
  • Add error message keywords
  • Add symptom keywords

If agent found but didn't follow:

  • Add explicit counters for rationalizations
  • Create red flags section
  • Add foundational principle

If skill is unclear:

  • Simplify structure
  • Add concrete examples
  • Remove ambiguity

7. Apply Improvements (GREEN Phase)

Edit the skill with identified improvements:

  • Keep changes minimal and targeted
  • Address specific test failures
  • Don't add hypothetical content

Trial Branch에서 커밋:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/trial-branch.sh"

commit_trial "Improve discoverability with CSO keywords"

8. 개선 후 평가 (3회)

동일한 시나리오로 개선된 스킬을 3회 재평가:

Task(subagent_type="oh-my-claudecode:critic",
     model="opus",
     prompt="forge:evaluator 스킬을 사용하여 개선된 {skill-name} 평가.
             시나리오: {scenario}
             rubric.md의 기준 참조.")
# improved-scores.txt에 수집
echo "90" >> improved-scores.txt
echo "92" >> improved-scores.txt
echo "89" >> improved-scores.txt

# 신뢰구간 계산
IMPROVED_MEAN=$(calc_mean improved-scores.txt)
read IMPROVED_LOWER IMPROVED_UPPER < <(calc_ci improved-scores.txt)

echo "개선 후: 평균=$IMPROVED_MEAN, CI=[$IMPROVED_LOWER, $IMPROVED_UPPER]"

9. 향상 판단 (신뢰구간 분리)

신뢰구간이 분리되면 유의미한 향상으로 판단:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/statistics.sh"

# 기준선 CI 상한 < 개선 CI 하한 확인
if ci_separated "$BASELINE_UPPER" "$IMPROVED_LOWER"; then
  echo "✓ 유의미한 향상 인정 (신뢰구간 분리됨)"
  echo "  기준선 CI 상한: $BASELINE_UPPER"
  echo "  개선 CI 하한: $IMPROVED_LOWER"
  APPROVED=true
else
  echo "✗ 향상 미달 (신뢰구간 중복)"
  echo "  기준선 CI 상한: $BASELINE_UPPER"
  echo "  개선 CI 하한: $IMPROVED_LOWER"
  APPROVED=false
fi

10. Trial Branch 결과 처리

성공 (신뢰구간 분리됨):

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/trial-branch.sh"

merge_trial_success "$ORIGINAL" "$TRIAL"
# → Trial Branch를 원본에 병합 후 삭제

실패 (신뢰구간 중복):

discard_trial "$ORIGINAL" "$TRIAL"
# → Trial Branch 폐기, 원본 브랜치로 복귀

11. Mark as Upgraded

Update the skill stats file to set upgraded: true:

MONTH=$(date +%Y-%m)
SKILL_NAME="forge:forge"  # Replace with actual skill name
STATS_FILE="$HOME/.claude/.skill-evaluator/skills/${MONTH}.json"

python3 -c "
import json
with open('$STATS_FILE', 'r') as f:
    data = json.load(f)
if '$SKILL_NAME' in data.get('skills', {}):
    data['skills']['$SKILL_NAME']['upgraded'] = True
    with open('$STATS_FILE', 'w') as f:
        json.dump(data, f, indent=2)
    print('Marked $SKILL_NAME as upgraded')
else:
    print('Skill not found in stats')
"

This enables:

  • "강화 완료" badge in upgrade history
  • SSS/SS grade bonuses tracked in evaluation stats

Output Format

After upgrade, report:

## Upgrade Complete: {skill-name}

### 기준선 평가 (3회)
- 점수: [78, 82, 80]
- 평균: 80.0, CI: [73.5, 86.5]

### 개선 후 평가 (3회)
- 점수: [90, 92, 89]
- 평균: 90.3, CI: [85.8, 94.8]

### 판단
- 기준선 CI 상한: 86.5
- 개선 CI 하한: 85.8
- 신뢰구간 분리: NO (중복 있음)
- 결과: Trial Branch 폐기

---OR---

### 판단
- 기준선 CI 상한: 86.5
- 개선 CI 하한: 88.2
- 신뢰구간 분리: YES
- 결과: Trial Branch 병합 성공

### CSO Improvements
- [What was improved for discoverability]

### Verification
- [x] Baseline test showed failure
- [x] Improved skill passes tests
- [x] Confidence intervals separated
- [x] Stats updated with upgraded: true

Example

User: /forge superpowers:tdd

  1. Trial Branch 생성: forge/superpowers-tdd/20260128-143022
  2. Invoke forge:smelt to load TDD methodology
  3. Read the target skill: ~/.claude/skills/superpowers/tdd/SKILL.md
  4. Create scenario: "You wrote 200 lines, forgot TDD, dinner at 6:30pm..."
  5. 기준선 평가 (3회): [65, 70, 68] → 평균 67.7, CI: [61.2, 74.2]
  6. CSO: Description missing "Use when..." (FAIL)
  7. Fix: Add explicit "Delete code, start over" section
  8. 개선 후 평가 (3회): [85, 88, 86] → 평균 86.3, CI: [81.2, 91.4]
  9. 신뢰구간 분리: 기준선 상한 74.2 < 개선 하한 81.2 → YES
  10. 병합 성공: Trial Branch를 main에 병합
  11. Update stats: upgraded: true

Common Issues

IssueSolution
Skill not foundCheck all locations, suggest creation
No clear test scenarioUse skill's "When to Use" section
Agent always passesAdd more pressure (3+ combined)
Improvements don't helpMeta-test: ask agent what would make it clearer
CI 분리 실패 (중복)더 큰 개선 필요 또는 샘플 수 증가 (n=5)
Trial Branch 충돌수동으로 해결 후 merge_trial_success 재실행

Statistical Notes

신뢰구간 분리의 의미:

  • 기준선 점수의 최대 예상 범위 상한보다
  • 개선 후 점수의 최소 예상 범위 하한이 더 크면
  • 통계적으로 유의미한 향상

95% 신뢰구간:

  • 표본의 평균이 진짜 평균의 95% 확률로 포함되는 범위
  • 작은 n(=3)에서는 t-분포 사용 (정규분포보다 넓음)

Trial Branch의 장점:

  • 실패한 실험이 메인 브랜치를 오염시키지 않음
  • 여러 개선 방향을 독립적으로 실험 가능
  • 성공한 개선만 선택적으로 병합

Source

git clone https://github.com/quantsquirrel/claude-forge-smith/blob/main/skills/forge/SKILL.mdView on GitHub

Overview

Forge systematically improves a skill's quality using test-driven evaluation and statistical validation. It guides you from creating a Trial Branch to measured improvements, validating changes with confidence intervals. A mandatory TDD background via forge:smelt underpins the workflow, with an optional heuristic mode if tests are missing.

How This Skill Works

Forge automates a TDD-driven upgrade workflow: start a Trial Branch, locate and read the target skill, generate pressure scenarios, and perform baseline evaluation (3 runs) before identifying improvements and applying them in a GREEN phase. After implementing changes, re-run evaluations to confirm confidence-interval separation, then decide to merge or discard the Trial Branch and update the upgraded flag.

When to Use It

  • When a skill has low discoverability or unclear instructions
  • When you want to improve skill quality with measurable metrics
  • When you need upgrades with verified improvements (confidence interval separation)
  • When the existing skill has TDD coverage and a clear upgrade path
  • When you want changes isolated in a Trial Branch for safe merging or discard

Quick Start

  1. Step 1: Create and switch to a Trial Branch for the skill upgrade
  2. Step 2: Locate the target skill, read SKILL.md, and generate pressure scenarios
  3. Step 3: Run baseline evaluations, apply improvements, re-evaluate, verify CI separation, then merge or discard

Best Practices

  • Ensure all prerequisite files exist and git is initialized before starting
  • Always start changes in a Trial Branch to isolate experiments
  • Prefer TDD mode when tests/pressure scenarios exist; switch to HEURISTIC if not
  • Aim for statistical validation with 3x or 5x evaluations and CI checks
  • Update skill metadata (upgraded: true) after a successful merge

Example Use Cases

  • Upgrade a skill like superpowers:tdd using Standard TDD (n=3, 95% CI) and merge after CI separation
  • Apply High Precision mode (n=5) to validate subtle improvements in a critical skill
  • When tests are missing, switch to HEURISTIC mode, evaluate usage data, and propose automatic improvements
  • Create a Trial Branch, implement improvements, and re-run evaluations before deciding to merge
  • Verify CI separation post-improvement to ensure statistically significant gains

Frequently Asked Questions

Add this skill to your agents
Sponsor this space

Reach thousands of developers