What are the forge upgrade modes, and when should I use each?

TDD_FIT uses the traditional TDD workflow with trial branches and statistical validation; HEURISTIC analyzes usage data and suggests automated improvements when tests are absent.

How do I control precision of the upgrade?

Use /forge --precision=high or /forge -n5 to select high-precision mode, which runs 5 evaluations instead of the standard 3 and yields a narrower confidence interval.

Do I need to run forge:smelt first?

Yes. forge:smelt provides the TDD methodology foundation needed to apply the upgrade workflow properly.

forge

npx machina-cli add skill quantsquirrel/claude-forge-smith/forge --openclaw


Forge Skill

Systematically improve a skill through test-driven evaluation and statistical validation.

REQUIRED BACKGROUND: First invoke forge:smelt skill for TDD methodology applied to skills. See testing-skills-with-subagents.md in that skill's directory for pressure scenario templates.

Quick Reference

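Step	Action
1	Create Trial Branch
2	Locate and read the skill
3	Create pressure scenarios
4	Baseline evaluation (3 runs) using the evaluator
5	Evaluate discoverability (CSO check)
6	Identify improvements
7	Apply improvements (GREEN Phase)
8	Post-improvement evaluation (3 runs)
9	Check confidence-interval separation
10	Handle Trial Branch result (merge/discard)
11	Update stats (upgraded: true)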

Upgrade Mode Selection

Before a skill upgrade starts, the appropriate mode is selected automatically.

Mode Decision Flow

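Analyze skill (call get_upgrade_mode)
    │
    ├─ "TDD_FIT" ──→ TDD Mode (existing workflow)
    │                 - Trial Branch
    │                 - 3x evaluation + 95% CI (or n=5 high precision)
    │                 - Statistical validation
    │
    └─ "HEURISTIC" ──→ Heuristic Mode (new)
                       - Usage data analysis
                       - Structural quality assessment
                       - Automated improvement suggestions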

Mode Detection

When an upgrade starts, call the following bash function to determine the mode:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/storage-local.sh"
MODE=$(get_upgrade_mode "$skill_name")
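
The internals of get_upgrade_mode are not shown here. As a rough illustration of the decision it is described as making (tests or pressure-scenarios.md present means TDD_FIT, otherwise HEURISTIC), a minimal Python sketch might look like the following. The function name choose_upgrade_mode and the "test*" file-name convention are assumptions for illustration, not part of the plugin.

```python
from pathlib import Path


def choose_upgrade_mode(skill_dir: Path) -> str:
    """Return "TDD_FIT" when the skill directory has tests, else "HEURISTIC".

    Hypothetical stand-in for get_upgrade_mode; the real function lives in
    storage-local.sh and may apply different rules.
    """
    has_scenarios = (skill_dir / "pressure-scenarios.md").is_file()
    # Assumed convention: any file whose name starts with "test" counts as a test file.
    has_tests = any(p.is_file() for p in skill_dir.glob("test*"))
    return "TDD_FIT" if (has_scenarios or has_tests) else "HEURISTIC"
```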

TDD Mode Options

Option	Sample Size	CI Width	When to Use
Standard	n=3	Wider	Fast feedback; suitable for most cases
High Precision	n=5	Narrower	Validating subtle improvements; important skills

Users can select the n=5 mode with /forge --precision=high or /forge -n5.
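
To see why n=5 gives a narrower interval, the sketch below compares the half-width of a textbook 95% t-interval for the same sample standard deviation at n=3 and n=5. It assumes the interval is computed as t × s / √n; whether statistics.sh uses exactly this formula is not confirmed by this document. The helper name ci_half_width is illustrative.

```python
import math

# Two-sided 95% critical values of Student's t, keyed by sample size n (df = n - 1).
T_CRIT_95 = {3: 4.303, 5: 2.776}


def ci_half_width(sample_sd: float, n: int) -> float:
    """Half-width of a 95% t-interval for the mean of n scores."""
    return T_CRIT_95[n] * sample_sd / math.sqrt(n)


for n in (3, 5):
    # With a sample standard deviation of 3 points: n=3 -> 7.45, n=5 -> 3.72
    print(n, round(ci_half_width(3.0, n), 2))
```

Both the smaller critical value and the larger √n shrink the interval, so two extra runs roughly halve the margin in this example.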

TDD Mode (existing)

Condition: a test file or pressure-scenarios.md exists
Check: check_skill_has_test() → true
Workflow: Steps 1-11 (unchanged)

Heuristic Mode (new)

Condition: no test file exists
Check: get_upgrade_mode() → "HEURISTIC"
Workflow:
1. Load usage data (get_all_skills_summary())
2. Invoke the subagent: Task(subagent_type="forge:heuristic-evaluator", prompt="Evaluate skill: <skill-name>")
3. If the score is below 60, apply the suggested improvements automatically
4. Apply the improvements on a Trial Branch
5. After one week, validate using the change in usage (get_usage_trend())

Hybrid Fallback

If confidence-interval separation fails in TDD Mode, offer the option to switch to Heuristic Mode.

When to Use

Use this skill when:

A skill has low discoverability or unclear instructions
You want to improve skill quality with measurable metrics
You need to upgrade a skill with verified improvements (confidence interval separation)

Do NOT use when:

The skill doesn't exist yet (create it first)
Quick one-off fixes are needed (just edit directly)

Arguments

Argument	Required	Description
`skill-name`	Yes	The skill to upgrade (e.g., `superpowers:tdd`, `forge:monitor`)
`--iterations N`	No	Maximum upgrade iterations (default: 6)
`--dry-run`	No	Evaluate only, don't apply changes

Example: /forge:forge superpowers:tdd --iterations 3

Prerequisites

Before using this skill, ensure:

Required files exist:
- ~/.claude/plugins/local/forge/hooks/lib/statistics.sh
- ~/.claude/plugins/local/forge/hooks/lib/trial-branch.sh
- forge:evaluator subskill
Git initialized in target directory (for Trial Branch)
First-time setup: Invoke forge:smelt once to understand TDD methodology

Workflow

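1. Create Trial Branch

Make skill changes on an isolated branch.

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/trial-branch.sh"

# Save the current branch, then create the Trial Branch
ORIGINAL=$(git branch --show-current)
TRIAL=$(create_trial_branch "skill-name")

Purpose of the Trial Branch:

Isolate skill changes as an experiment
Discard cleanly if the improvement fails
Merge into the original branch if the improvement succeeds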
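
The Example section of this skill shows a Trial Branch named forge/superpowers-tdd/20260128-143022. Assuming that pattern is forge/<skill name with ":" replaced by "-">/<timestamp>, which is inferred from that single example rather than from trial-branch.sh itself, the naming can be sketched as follows. The function trial_branch_name is hypothetical.

```python
from datetime import datetime


def trial_branch_name(skill_name: str, now: datetime) -> str:
    """Build a branch name in the style of the documented example.

    Assumption: ':' in the skill name becomes '-', and the timestamp
    is formatted as YYYYMMDD-HHMMSS.
    """
    slug = skill_name.replace(":", "-")
    return f"forge/{slug}/{now.strftime('%Y%m%d-%H%M%S')}"


print(trial_branch_name("superpowers:tdd", datetime(2026, 1, 28, 14, 30, 22)))
# prints forge/superpowers-tdd/20260128-143022
```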

2. Locate Skill

Skill locations (check in order):
~/.claude/skills/{skill-name}/SKILL.md
~/.claude/plugins/**/skills/{skill-name}/SKILL.md
.claude/skills/{skill-name}/SKILL.md

If skill not found, report and exit.
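
The lookup order above can be sketched in Python as below. This is a minimal illustration, not the plugin's implementation: locate_skill is a hypothetical helper, the home and project roots are passed in explicitly, and the skill name is used as a directory name as-is (how namespaced names such as superpowers:tdd map to directories is not addressed).

```python
from pathlib import Path
from typing import Optional


def locate_skill(skill_name: str, home: Path, project: Path) -> Optional[Path]:
    """Return the first SKILL.md found, checking the three locations in order."""
    # 1. User-level skills
    direct = home / ".claude" / "skills" / skill_name / "SKILL.md"
    if direct.is_file():
        return direct

    # 2. Any plugin under ~/.claude/plugins (sorted for a deterministic pick)
    plugins = home / ".claude" / "plugins"
    if plugins.is_dir():
        for match in sorted(plugins.glob(f"**/skills/{skill_name}/SKILL.md")):
            if match.is_file():
                return match

    # 3. Project-level skills
    local = project / ".claude" / "skills" / skill_name / "SKILL.md"
    return local if local.is_file() else None
```

A caller would report and exit when the function returns None, for example locate_skill("forge", Path.home(), Path.cwd()).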

3. Create Pressure Scenario Test

Generate 2-3 realistic scenarios that would trigger this skill:

Good scenario template:

SCENARIO: [Realistic situation where skill should apply]

Context:
- [Specific constraint 1]
- [Specific constraint 2]
- [Time/resource pressure]

Task: [What agent must do]

Success criteria:
- [ ] Agent recognizes skill applies
- [ ] Agent follows skill correctly
- [ ] Agent produces expected outcome

Pressure types to combine (pick 3+):

Time pressure
Sunk cost
Authority override
Exhaustion
Pragmatic shortcuts
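
One way to keep scenarios consistent with the template and the "pick 3+" rule is to hold them in a small data structure and validate them before use. The sketch below is illustrative only; the PressureScenario class and its problems method are assumptions, not part of forge.

```python
from dataclasses import dataclass, field

# The pressure types listed above, lower-cased for comparison.
PRESSURE_TYPES = {
    "time pressure",
    "sunk cost",
    "authority override",
    "exhaustion",
    "pragmatic shortcuts",
}


@dataclass
class PressureScenario:
    situation: str
    task: str
    constraints: list = field(default_factory=list)
    pressures: list = field(default_factory=list)

    def problems(self) -> list:
        """Return a list of human-readable issues; empty means the scenario is usable."""
        issues = []
        unknown = [p for p in self.pressures if p.lower() not in PRESSURE_TYPES]
        if unknown:
            issues.append(f"unknown pressure types: {unknown}")
        known = {p.lower() for p in self.pressures} & PRESSURE_TYPES
        if len(known) < 3:
            issues.append("combine at least 3 pressure types")
        return issues
```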

4. Baseline Evaluation (3 runs)

Use the evaluator subskill, running the evaluation with the critic agent:

Task(subagent_type="oh-my-claudecode:critic",
     model="opus",
     prompt="Evaluate {skill-name} using the forge:evaluator skill.
             Scenario: {scenario}
             Refer to the criteria in rubric.md.")

Repeat 3 times and collect the scores:

# Append one score per line to baseline-scores.txt
echo "85" >> baseline-scores.txt
echo "88" >> baseline-scores.txt
echo "82" >> baseline-scores.txt

Compute the confidence interval:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/statistics.sh"

# Baseline mean and 95% CI
BASELINE_MEAN=$(calc_mean baseline-scores.txt)
read BASELINE_LOWER BASELINE_UPPER < <(calc_ci baseline-scores.txt)

echo "Baseline: mean=$BASELINE_MEAN, CI=[$BASELINE_LOWER, $BASELINE_UPPER]"
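
For readers who want to check the numbers by hand, here is a minimal Python sketch of a 95% confidence interval for the three baseline scores above. It assumes a standard t-interval (mean ± t × s / √n with the sample standard deviation); calc_mean and calc_ci in statistics.sh may differ in detail, and mean_and_ci is a hypothetical name.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}


def mean_and_ci(scores):
    """Return (mean, lower, upper) for a 95% t-interval over the scores."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    margin = T_CRIT_95[n - 1] * sd / math.sqrt(n)
    return mean, mean - margin, mean + margin


mean, lower, upper = mean_and_ci([85, 88, 82])
print(f"{mean:.1f} [{lower:.1f}, {upper:.1f}]")  # prints 85.0 [77.5, 92.5]
```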

5. Evaluate Discoverability (CSO Check)

Check skill's search optimization:

Check	Pass/Fail
Description starts with "Use when..."
Description has specific triggers
Keywords match search terms
Name is verb-first, descriptive
No workflow summary in description
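
Two of these checks are mechanical enough to automate. The sketch below covers only the "Use when..." prefix and keyword coverage; the remaining rows need human or agent judgment. The function cso_report and its return shape are assumptions for illustration.

```python
def cso_report(description: str, search_terms: list) -> dict:
    """Check the description prefix and report search terms it does not mention."""
    text = description.strip().lower()
    missing = [term for term in search_terms if term.lower() not in text]
    return {
        "starts_with_use_when": text.startswith("use when"),
        "missing_keywords": missing,
    }


report = cso_report(
    "Use when tests fail intermittently or a flaky test blocks CI",
    ["flaky", "timeout"],
)
print(report)
# prints {'starts_with_use_when': True, 'missing_keywords': ['timeout']}
```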

6. Identify Improvements

Based on test results:

If agent didn't find skill:

Improve description triggers
Add error message keywords
Add symptom keywords

If agent found but didn't follow:

Add explicit counters for rationalizations
Create red flags section
Add foundational principle

If skill is unclear:

Simplify structure
Add concrete examples
Remove ambiguity

7. Apply Improvements (GREEN Phase)

Edit the skill with identified improvements:

Keep changes minimal and targeted
Address specific test failures
Don't add hypothetical content

Commit on the Trial Branch:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/trial-branch.sh"

commit_trial "Improve discoverability with CSO keywords"

8. Post-Improvement Evaluation (3 runs)

Re-evaluate the improved skill 3 times with the same scenarios:

Task(subagent_type="oh-my-claudecode:critic",
     model="opus",
     prompt="Evaluate the improved {skill-name} using the forge:evaluator skill.
             Scenario: {scenario}
             Refer to the criteria in rubric.md.")

# Collect into improved-scores.txt
echo "90" >> improved-scores.txt
echo "92" >> improved-scores.txt
echo "89" >> improved-scores.txt

# Compute the confidence interval (statistics.sh provides calc_mean and calc_ci)
source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/statistics.sh"
IMPROVED_MEAN=$(calc_mean improved-scores.txt)
read IMPROVED_LOWER IMPROVED_UPPER < <(calc_ci improved-scores.txt)

echo "Improved: mean=$IMPROVED_MEAN, CI=[$IMPROVED_LOWER, $IMPROVED_UPPER]"

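9. Decide on Improvement (CI Separation)

If the confidence intervals are separated, treat the change as a meaningful improvement:

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/statistics.sh"

# Check that baseline CI upper bound < improved CI lower bound
if ci_separated "$BASELINE_UPPER" "$IMPROVED_LOWER"; then
  echo "✓ Meaningful improvement confirmed (confidence intervals separated)"
  echo "  Baseline CI upper bound: $BASELINE_UPPER"
  echo "  Improved CI lower bound: $IMPROVED_LOWER"
  APPROVED=true
else
  echo "✗ Improvement insufficient (confidence intervals overlap)"
  echo "  Baseline CI upper bound: $BASELINE_UPPER"
  echo "  Improved CI lower bound: $IMPROVED_LOWER"
  APPROVED=false
fi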
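
The separation test itself is a single strict comparison. A Python equivalent of the rule stated in the comment above, applied to the two cases shown elsewhere in this skill, might look like this (the Python function is a sketch mirroring the bash helper's described behavior, not its source):

```python
def ci_separated(baseline_upper: float, improved_lower: float) -> bool:
    """True when the baseline CI ends strictly below where the improved CI begins."""
    return baseline_upper < improved_lower


# Case from the Example section: 74.2 < 81.2, so the intervals are separated.
print(ci_separated(74.2, 81.2))  # prints True

# Case from the Output Format section: 86.5 is not below 85.8, so they overlap.
print(ci_separated(86.5, 85.8))  # prints False
```

Requiring non-overlapping intervals is a conservative criterion: intervals can overlap slightly even when a direct two-sample test would detect a difference.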

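10. Handle Trial Branch Result

Success (confidence intervals separated):

source "${CLAUDE_PLUGIN_ROOT}/hooks/lib/trial-branch.sh"

merge_trial_success "$ORIGINAL" "$TRIAL"
# → Merge the Trial Branch into the original branch, then delete it

Failure (confidence intervals overlap):

discard_trial "$ORIGINAL" "$TRIAL"
# → Discard the Trial Branch and return to the original branch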

11. Mark as Upgraded

Update the skill stats file to set upgraded: true:

MONTH=$(date +%Y-%m)
SKILL_NAME="forge:forge"  # Replace with actual skill name
STATS_FILE="$HOME/.claude/.skill-evaluator/skills/${MONTH}.json"

python3 -c "
import json
with open('$STATS_FILE', 'r') as f:
    data = json.load(f)
if '$SKILL_NAME' in data.get('skills', {}):
    data['skills']['$SKILL_NAME']['upgraded'] = True
    with open('$STATS_FILE', 'w') as f:
        json.dump(data, f, indent=2)
    print('Marked $SKILL_NAME as upgraded')
else:
    print('Skill not found in stats')
"

This enables:

"Upgrade complete" (강화 완료) badge in upgrade history
SSS/SS grade bonuses tracked in evaluation stats
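
The inline python3 -c snippet in step 11 interpolates shell variables directly into Python source, which can break if a path or skill name contains a quote. A standalone variant that takes both as arguments is sketched below. It assumes the same JSON layout as that snippet (a top-level "skills" object keyed by skill name); mark_upgraded is a hypothetical helper.

```python
import json
from pathlib import Path


def mark_upgraded(stats_file: Path, skill_name: str) -> bool:
    """Set upgraded=true for the skill; return False if the file or skill is missing."""
    if not stats_file.is_file():
        return False
    data = json.loads(stats_file.read_text(encoding="utf-8"))
    skill = data.get("skills", {}).get(skill_name)
    if skill is None:
        return False
    skill["upgraded"] = True  # mutates the entry inside data
    stats_file.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return True
```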

Output Format

After upgrade, report:

## Upgrade Complete: {skill-name}

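### Baseline Evaluation (3 runs)
- Scores: [78, 82, 80]
- Mean: 80.0, CI: [73.5, 86.5]

### Post-Improvement Evaluation (3 runs)
- Scores: [90, 92, 89]
- Mean: 90.3, CI: [85.8, 94.8]

### Decision
- Baseline CI upper bound: 86.5
- Improved CI lower bound: 85.8
- CI separation: NO (overlap)
- Result: Trial Branch discarded

---OR---

### Decision
- Baseline CI upper bound: 86.5
- Improved CI lower bound: 88.2
- CI separation: YES
- Result: Trial Branch merged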

### CSO Improvements
- [What was improved for discoverability]

### Verification
- [x] Baseline test showed failure
- [x] Improved skill passes tests
- [x] Confidence intervals separated
- [x] Stats updated with upgraded: true

Example

User: /forge superpowers:tdd

Create Trial Branch: forge/superpowers-tdd/20260128-143022
Invoke forge:smelt to load TDD methodology
Read the target skill: ~/.claude/skills/superpowers/tdd/SKILL.md
Create scenario: "You wrote 200 lines, forgot TDD, dinner at 6:30pm..."
Baseline evaluation (3 runs): [65, 70, 68] → mean 67.7, CI: [61.2, 74.2]
CSO: Description missing "Use when..." (FAIL)
Fix: Add explicit "Delete code, start over" section
Post-improvement evaluation (3 runs): [85, 88, 86] → mean 86.3, CI: [81.2, 91.4]
CI separation: baseline upper bound 74.2 < improved lower bound 81.2 → YES
Merge succeeded: Trial Branch merged into main
Update stats: upgraded: true

Common Issues

Issue	Solution
Skill not found	Check all locations, suggest creation
No clear test scenario	Use skill's "When to Use" section
Agent always passes	Add more pressure (3+ combined)
Improvements don't help	Meta-test: ask agent what would make it clearer
CI separation fails (overlap)	A larger improvement is needed, or increase the sample size (n=5)
Trial Branch conflict	Resolve manually, then re-run `merge_trial_success`

Statistical Notes

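Meaning of confidence-interval separation:

If the lower bound of the improved scores' interval is greater than
the upper bound of the baseline scores' interval,
→ the improvement is treated as statistically significant

95% confidence interval:

A range built from the sample so that, across repeated sampling, about 95% of such intervals contain the true mean
For small n (=3), the t-distribution is used (wider than the normal distribution)

Benefits of the Trial Branch:

Failed experiments do not pollute the main branch
Multiple improvement directions can be tested independently
Only successful improvements are merged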

Source

https://github.com/quantsquirrel/claude-forge-smith/blob/main/skills/forge/SKILL.md

Overview

Forge systematically improves a skill's quality using test-driven evaluation and statistical validation. It guides you from creating a Trial Branch to measured improvements, validating changes with confidence intervals. A mandatory TDD background via forge:smelt underpins the workflow, with an optional heuristic mode if tests are missing.

How This Skill Works

Forge automates a TDD-driven upgrade workflow: start a Trial Branch, locate and read the target skill, generate pressure scenarios, and perform baseline evaluation (3 runs) before identifying improvements and applying them in a GREEN phase. After implementing changes, re-run evaluations to confirm confidence-interval separation, then decide to merge or discard the Trial Branch and update the upgraded flag.

When to Use It

When a skill has low discoverability or unclear instructions
When you want to improve skill quality with measurable metrics
When you need upgrades with verified improvements (confidence interval separation)
When the existing skill has TDD coverage and a clear upgrade path
When you want changes isolated in a Trial Branch for safe merging or discard

Quick Start

Step 1: Create and switch to a Trial Branch for the skill upgrade
Step 2: Locate the target skill, read SKILL.md, and generate pressure scenarios
Step 3: Run baseline evaluations, apply improvements, re-evaluate, verify CI separation, then merge or discard

Best Practices

Ensure all prerequisite files exist and git is initialized before starting
Always start changes in a Trial Branch to isolate experiments
Prefer TDD mode when tests/pressure scenarios exist; switch to HEURISTIC if not
Aim for statistical validation with 3x or 5x evaluations and CI checks
Update skill metadata (upgraded: true) after a successful merge

Example Use Cases

Upgrade a skill like superpowers:tdd using Standard TDD (n=3, 95% CI) and merge after CI separation
Apply High Precision mode (n=5) to validate subtle improvements in a critical skill
When tests are missing, switch to HEURISTIC mode, evaluate usage data, and propose automatic improvements
Create a Trial Branch, implement improvements, and re-run evaluations before deciding to merge
Verify CI separation post-improvement to ensure statistically significant gains
