# review-skill-improver

```bash
npx machina-cli add skill existential-birds/beagle/review-skill-improver --openclaw
```
## Purpose
Analyzes structured feedback logs to:
- Identify rules that produce false positives (high REJECT rate)
- Identify missing rules (issues that should have been caught)
- Suggest specific skill modifications
## Input
Feedback log in enhanced schema format (see review-feedback-schema skill).
## Analysis Process
### Step 1: Aggregate by Rule Source
For each unique rule_source:
- Count total issues flagged
- Count ACCEPT vs REJECT
- Calculate rejection rate
- Extract rejection rationales
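A minimal sketch of this aggregation in Python, assuming the feedback log is a CSV with `rule_source`, `verdict`, and `rationale` columns as in the example later in this document (the function and field names are illustrative, not a fixed API):

```python
import csv
from collections import defaultdict

def aggregate_by_rule(feedback_path: str) -> dict[str, dict]:
    """Group feedback entries by rule_source and compute rejection rates."""
    stats: dict[str, dict] = defaultdict(
        lambda: {"total": 0, "rejected": 0, "rationales": []}
    )
    with open(feedback_path, newline="") as f:
        for row in csv.DictReader(f):
            entry = stats[row["rule_source"]]
            entry["total"] += 1
            if row["verdict"] == "REJECT":
                entry["rejected"] += 1
                entry["rationales"].append(row["rationale"])
    # Rejection rate is rejected / total for each rule.
    for entry in stats.values():
        entry["rejection_rate"] = entry["rejected"] / entry["total"]
    return dict(stats)
```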
### Step 2: Identify High-Rejection Rules
Rules with >30% rejection rate warrant investigation:
- Read the rejection rationales
- Identify common themes
- Determine if rule needs refinement or exception
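Continuing the sketch above, the candidates are a simple filter over the aggregated stats (the 30% default mirrors the rule of thumb above):

```python
def high_rejection_rules(
    stats: dict[str, dict], threshold: float = 0.30
) -> dict[str, dict]:
    """Return rules whose rejection rate exceeds the threshold."""
    return {
        rule: entry
        for rule, entry in stats.items()
        if entry["rejection_rate"] > threshold
    }
```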
### Step 3: Pattern Analysis
Group rejections by rationale theme:
- "Linter already handles this" -> Add linter verification step
- "Framework supports this pattern" -> Add exception to skill
- "Intentional design decision" -> Add codebase context check
- "Wrong code path assumed" -> Add code tracing step
### Step 4: Generate Improvement Recommendations
For each identified issue, produce:
## Recommendation: [SHORT_TITLE]
**Affected Skill:** `skill-name/SKILL.md` or `skill-name/references/file.md`
**Problem:** [What's causing false positives]
**Evidence:**
- [X] rejections with rationale "[common theme]"
- Example: [file:line] - [issue] - [rationale]
**Proposed Fix:**
```markdown
[Exact text to add/modify in the skill]
```
**Expected Impact:** Reduce false positive rate for [rule] from X% to Y%
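A small renderer can turn the aggregated evidence into this template; the parameter names below carry over from the earlier sketches and are assumptions, not a fixed interface:

```python
def render_recommendation(title: str, skill_path: str, problem: str,
                          rejections: int, theme: str, fix: str) -> str:
    """Format one improvement recommendation as a markdown block."""
    return (
        f"## Recommendation: {title}\n"
        f"**Affected Skill:** `{skill_path}`\n"
        f"**Problem:** {problem}\n"
        f"**Evidence:**\n"
        f'- {rejections} rejections with rationale "{theme}"\n'
        f"**Proposed Fix:**\n"
        f"{fix}\n"
    )
```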
## Output Format
```markdown
# Review Skill Improvement Report
## Summary
- Feedback entries analyzed: [N]
- Unique rules triggered: [N]
- High-rejection rules identified: [N]
- Recommendations generated: [N]
## High-Rejection Rules
| Rule Source | Total | Rejected | Rate | Theme |
|-------------|-------|----------|------|-------|
| ... | ... | ... | ... | ... |
## Recommendations
[Numbered list of recommendations in format above]
## Rules Performing Well
[Rules with <10% rejection rate - preserve these]
```
## Usage

```
# Analyze feedback and generate improvement report
/review-skill-improver --output improvement-report.md
```
## Example Analysis
Given this feedback data:
```csv
rule_source,verdict,rationale
python-code-review:line-length,REJECT,ruff check passes
python-code-review:line-length,REJECT,no E501 violation
python-code-review:line-length,REJECT,linter config allows 120
python-code-review:line-length,ACCEPT,fixed long line
pydantic-ai-common-pitfalls:tool-decorator,REJECT,docs support raw functions
python-code-review:type-safety,ACCEPT,added type annotation
python-code-review:type-safety,ACCEPT,fixed Any usage
```
Analysis output:
```markdown
# Review Skill Improvement Report
## Summary
- Feedback entries analyzed: 7
- Unique rules triggered: 3
- High-rejection rules identified: 2
- Recommendations generated: 2
## High-Rejection Rules
| Rule Source | Total | Rejected | Rate | Theme |
|-------------|-------|----------|------|-------|
| python-code-review:line-length | 4 | 3 | 75% | linter handles this |
| pydantic-ai-common-pitfalls:tool-decorator | 1 | 1 | 100% | framework supports pattern |
## Recommendations
### 1. Add Linter Verification for Line Length
**Affected Skill:** `commands/review-python.md`
**Problem:** Flagging line length issues that linters confirm don't exist
**Evidence:**
- 3 rejections with rationale "linter passes/handles this"
- Example: amelia/drivers/api/openai.py:102 - Line too long - ruff check passes
**Proposed Fix:**
Add step to run `ruff check` before manual review. If linter passes for line length, do not flag manually.
**Expected Impact:** Reduce false positive rate for line-length from 75% to <10%
### 2. Add Raw Function Tool Registration Exception
**Affected Skill:** `skills/pydantic-ai-common-pitfalls/SKILL.md`
**Problem:** Flagging valid pydantic-ai pattern as error
**Evidence:**
- 1 rejection with rationale "docs support raw functions"
**Proposed Fix:**
Add "Valid Patterns" section documenting that passing functions with RunContext to Agent(tools=[...]) is valid.
**Expected Impact:** Eliminate false positives for this pattern
## Rules Performing Well
| Rule Source | Total | Accepted | Rate |
|-------------|-------|----------|------|
| python-code-review:type-safety | 2 | 2 | 100% |
```
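As a concrete illustration of the linter-verification step from Recommendation 1, a pre-review check might look like this sketch (`ruff check --select E501` restricts ruff to its line-length rule; the function name is illustrative):

```python
import subprocess

def line_length_passes(path: str) -> bool:
    """True when ruff reports no E501 (line-too-long) violations,
    in which case line length should not be flagged manually."""
    result = subprocess.run(
        ["ruff", "check", "--select", "E501", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```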
## Future: Automated Skill Updates
Once confidence is high, this skill can:
- Generate PRs to beagle with skill improvements
- Track improvement impact over time
- A/B test rule variations
## Feedback Loop

```
Review Code -> Log Outcomes -> Analyze Patterns -> Improve Skills -> Better Reviews
     ^                                                                    |
     +--------------------------------------------------------------------+
```
This creates a continuous improvement cycle where review quality improves based on empirical data rather than guesswork.
## Source

[SKILL.md](https://github.com/existential-birds/beagle/blob/main/plugins/beagle-core/skills/review-skill-improver/SKILL.md)

## Overview
Review Skill Improver analyzes structured feedback logs to identify high-rejection rules, missing checks, and concrete skill changes. It guides teams to improve review accuracy by surfacing patterns and actionable recommendations.
## How This Skill Works
It aggregates feedback by rule_source, computes rejection rates, and flags rules whose rate exceeds 30%. It then clusters rejections by rationale theme and generates targeted improvement recommendations with exact edits to apply.
## When to Use It
- You have accumulated feedback data showing false positives that need reduction
- You need to refine rules whose findings are frequently rejected or that generate unnecessary flags
- You seek to surface missing rules or patterns that should be caught
- You want structured guidance to modify skills and deployment with traceable evidence
- You are preparing an improvement report and need concrete, testable edits
## Quick Start
- Step 1: Run /review-skill-improver to analyze your feedback logs and generate an improvement report
- Step 2: Inspect high-rejection rules and their themes to identify patterns
- Step 3: Apply the Proposed Fixes to the relevant skill and validate with a new feedback batch
## Best Practices
- Use enhanced feedback schema logs as input
- Prioritize rules with >30% rejection rate
- Review rejection rationales for consistent themes
- Test proposed edits in a controlled feedback batch before rollout
- Document changes in the skill file and maintain changelogs
## Example Use Cases
- python-code-review:line-length showing high rejection; add linter verification step
- pydantic-ai-common-pitfalls:tool-decorator rejected due to docs support for raw functions
- python-code-review:type-safety accepted, used to inform pattern analysis
- framework-patterns:resource-usage flagged; add exception to skill when framework supports pattern
- improvement report generated with example analysis (line-length, tool-decorator, type-safety)