llm-judge
npx machina-cli add skill existential-birds/beagle/llm-judge --openclaw
Compare code implementations across 2+ repositories using structured evaluation.
Overview
This skill implements a two-phase LLM-as-judge evaluation:
- Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts
- Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
Reference Files
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Dimensions
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with the following prompt:

```
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH

**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
```
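As a concrete illustration, the per-repo prompts can be filled from this template and dispatched concurrently. The sketch below is not part of the skill: `spawn_task` is a hypothetical caller-supplied callable that runs one agent and returns its raw text output, and each repo is assumed to be a dict with `label` and `path` keys.

```python
import json
from concurrent.futures import ThreadPoolExecutor

PHASE1_TEMPLATE = """You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** {label} at {path}
**Spec Document:**
{spec}
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations."""

def gather_facts(repos, spec, spawn_task):
    """Run one Phase 1 agent per repo in parallel; returns {label: facts dict}.

    repos: list of {"label": str, "path": str}. spawn_task is a hypothetical
    helper (not part of this skill) that runs one agent prompt and returns text.
    """
    def run(repo):
        prompt = PHASE1_TEMPLATE.format(label=repo["label"], path=repo["path"], spec=spec)
        return repo["label"], json.loads(spawn_task(prompt))  # agents return bare JSON
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run, repos))
```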
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents (one per dimension) with the following prompt:

```
You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
```
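Judges can be dispatched the same way, one per dimension. This sketch makes the same assumptions as the Phase 1 one: `spawn_task` is a hypothetical caller-supplied helper, and each judge returns bare JSON.

```python
import json
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ["Functionality", "Security", "Test Quality", "Overengineering", "Dead Code"]

JUDGE_TEMPLATE = """You are the {dimension} Judge for the LLM Judge evaluation.
**Spec Document:**
{spec}
**Facts from all repos:**
{facts_json}
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on {dimension} using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema."""

def run_judges(all_facts, spec, spawn_task):
    """Run the five dimension judges in parallel; returns {dimension: judge output}."""
    facts_json = json.dumps(all_facts, indent=2)
    def run(dimension):
        prompt = JUDGE_TEMPLATE.format(dimension=dimension, spec=spec, facts_json=facts_json)
        return dimension, json.loads(spawn_task(prompt))
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run, DIMENSIONS))
```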
Aggregation
After Phase 2 completes:
- Collect scores from all 5 judges
- For each repo, compute weighted total:
  weighted_total = sum(score[dim] * weight[dim]) / 100
- Rank repos by weighted total (descending)
- Generate verdict explaining the ranking
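Because the default weights sum to 100, the weighted total stays on the 1-5 scale. A small sketch of this step, assuming each judge's output maps repo labels to scores (the actual judge output schema is defined in references/judge-agents.md):

```python
WEIGHTS = {
    "Functionality": 30,
    "Security": 25,
    "Test Quality": 20,
    "Overengineering": 15,
    "Dead Code": 10,
}

def rank_repos(judge_scores):
    """judge_scores: {dimension: {repo_label: 1-5 score}} -> [(label, total)], best first."""
    labels = next(iter(judge_scores.values()))
    totals = {
        label: sum(judge_scores[dim][label] * weight for dim, weight in WEIGHTS.items()) / 100
        for label in labels
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# A repo scoring 4 on every dimension totals (4*30 + 4*25 + 4*20 + 4*15 + 4*10) / 100 = 4.0
```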
Output
Write results to .beagle/llm-judge-report.json and display markdown summary.
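For illustration, writing the report might look like the sketch below; the report layout shown is an assumption for this example, not the skill's documented schema.

```python
import json
from pathlib import Path

def write_report(ranking, judge_scores, verdict):
    """Write .beagle/llm-judge-report.json; this layout is illustrative only."""
    report = {
        "ranking": [{"repo": label, "weighted_total": total} for label, total in ranking],
        "scores_by_dimension": judge_scores,
        "verdict": verdict,
    }
    path = Path(".beagle/llm-judge-report.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
```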
Dependencies
@beagle:llm-artifacts-detection - Reused by repo agents for dead code/overengineering analysis
Source
https://github.com/existential-birds/beagle/blob/main/plugins/beagle-analysis/skills/llm-judge/SKILL.md
Overview
LLM Judge compares code implementations across two or more repositories in two phases: Phase 1 gathers structured facts from each repo, and Phase 2 uses five dimension-specific judges with weighted rubrics to produce per-repo scores and a final ranking.
How This Skill Works
Phase 1 spawns a repo agent for each repository to extract facts according to a defined schema. Phase 2 runs five parallel judge agents (one per dimension) to score each repository against the rubrics. The aggregation step applies the dimension weights to compute a weighted total per repo, generates a verdict, and writes .beagle/llm-judge-report.json.
When to Use It
- When comparing multiple repository implementations of the same feature.
- When you need objective, rubric-driven scoring across dimensions like functionality and security.
- When identifying dead code and overengineering across forks or branches.
- When ensuring test quality and coverage across competing solutions.
- When you want a ranked verdict with a transparent, reusable report.
Quick Start
- Step 1: Spawn a Phase 1 Repo Agent for each repository and gather facts using the defined fact schema.
- Step 2: After facts are collected, spawn 5 Phase 2 Judge Agents (one per dimension) to score each repository with the rubrics.
- Step 3: Run the Aggregation to compute weighted totals, rank repositories, and output .beagle/llm-judge-report.json with a concise verdict.
Best Practices
- Clearly label each repository and provide a precise spec document for consistent fact gathering.
- Run Phase 1 fact collection to completion before Phase 2 scoring so judges work from reliable data.
- Use the provided fact schema and scoring rubrics to maintain comparable evaluations.
- Leverage artifacts-detection for dead code and overengineering signals to inform scoring.
- Review the final weighted ranking and verdict with a clear justification summary.
Example Use Cases
- Compare two OSS implementations of a REST API across repositories A and B.
- Audit security patterns and vulnerability signals across forks of a project.
- Assess test quality, coverage, and mock boundaries for competing libraries.
- Identify dead code and unnecessary complexity in multiple service implementations.
- Rank several microservice implementations by a weighted score to decide an optimal refactor target.