llm-judge
npx machina-cli add skill existential-birds/beagle/llm-judge --openclaw
Compare code implementations across 2+ repositories using structured evaluation.
Overview
This skill implements a two-phase LLM-as-judge evaluation:
- Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts
- Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
Reference Files
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Dimensions
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with the following prompt:

```
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH

**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
```
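As a concrete illustration, the per-repo prompts can be filled from this template and dispatched concurrently. The sketch below is not part of the skill: `spawn_task` is a hypothetical caller-supplied callable that runs one agent and returns its raw text output, and each repo is assumed to be a dict with `label` and `path` keys.

```python
import json
from concurrent.futures import ThreadPoolExecutor

PHASE1_TEMPLATE = """You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** {label} at {path}
**Spec Document:**
{spec}
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations."""

def gather_facts(repos, spec, spawn_task):
    """Run one Phase 1 agent per repo in parallel; returns {label: facts dict}.

    repos: list of {"label": str, "path": str}. spawn_task is a hypothetical
    helper (not part of this skill) that runs one agent prompt and returns text.
    """
    def run(repo):
        prompt = PHASE1_TEMPLATE.format(label=repo["label"], path=repo["path"], spec=spec)
        return repo["label"], json.loads(spawn_task(prompt))  # agents return bare JSON
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run, repos))
```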
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents (one per dimension) with the following prompt:

```
You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
```
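Judges can be dispatched the same way, one per dimension. This sketch makes the same assumptions as the Phase 1 one: `spawn_task` is a hypothetical caller-supplied helper, and each judge returns bare JSON.

```python
import json
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ["Functionality", "Security", "Test Quality", "Overengineering", "Dead Code"]

JUDGE_TEMPLATE = """You are the {dimension} Judge for the LLM Judge evaluation.
**Spec Document:**
{spec}
**Facts from all repos:**
{facts_json}
**Instructions:** Read @beagle:llm-judge references/judge-agents.md
Score each repo on {dimension} using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema."""

def run_judges(all_facts, spec, spawn_task):
    """Run the five dimension judges in parallel; returns {dimension: judge output}."""
    facts_json = json.dumps(all_facts, indent=2)
    def run(dimension):
        prompt = JUDGE_TEMPLATE.format(dimension=dimension, spec=spec, facts_json=facts_json)
        return dimension, json.loads(spawn_task(prompt))
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run, DIMENSIONS))
```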
Aggregation
After Phase 2 completes:
- Collect scores from all 5 judges
- For each repo, compute weighted total:
  weighted_total = sum(score[dim] * weight[dim]) / 100
- Rank repos by weighted total (descending)
- Generate verdict explaining the ranking
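Because the default weights sum to 100, the weighted total stays on the 1-5 scale. A small sketch of this step, assuming each judge's output maps repo labels to scores (the actual judge output schema is defined in references/judge-agents.md):

```python
WEIGHTS = {
    "Functionality": 30,
    "Security": 25,
    "Test Quality": 20,
    "Overengineering": 15,
    "Dead Code": 10,
}

def rank_repos(judge_scores):
    """judge_scores: {dimension: {repo_label: 1-5 score}} -> [(label, total)], best first."""
    labels = next(iter(judge_scores.values()))
    totals = {
        label: sum(judge_scores[dim][label] * weight for dim, weight in WEIGHTS.items()) / 100
        for label in labels
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# A repo scoring 4 on every dimension totals (4*30 + 4*25 + 4*20 + 4*15 + 4*10) / 100 = 4.0
```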
Output
Write results to .beagle/llm-judge-report.json and display markdown summary.
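For illustration, writing the report might look like the sketch below; the report layout shown is an assumption for this example, not the skill's documented schema.

```python
import json
from pathlib import Path

def write_report(ranking, judge_scores, verdict):
    """Write .beagle/llm-judge-report.json; this layout is illustrative only."""
    report = {
        "ranking": [{"repo": label, "weighted_total": total} for label, total in ranking],
        "scores_by_dimension": judge_scores,
        "verdict": verdict,
    }
    path = Path(".beagle/llm-judge-report.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
```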
Dependencies
@beagle:llm-artifacts-detection - Reused by repo agents for dead code/overengineering analysis
Source
https://github.com/existential-birds/beagle/blob/main/plugins/beagle-analysis/skills/llm-judge/SKILL.md
Overview
LLM Judge compares code implementations across two or more repositories in two phases: Phase 1 gathers structured facts from each repo, and Phase 2 uses five dimension-specific judges with weighted rubrics to produce per-repo scores and a final ranking.
How This Skill Works
Phase 1 spawns a repo agent for each repository to extract facts according to a defined schema. Phase 2 runs five parallel judge agents (one per dimension) to score each repository against the rubrics. The aggregation step applies the dimension weights to compute a weighted total per repo, generates a verdict, and writes .beagle/llm-judge-report.json.
When to Use It
- When comparing multiple repository implementations of the same feature.
- When you need objective, rubric-driven scoring across dimensions like functionality and security.
- When identifying dead code and overengineering across forks or branches.
- When ensuring test quality and coverage across competing solutions.
- When you want a ranked verdict with a transparent, reusable report.
Quick Start
- Step 1: Spawn a Phase 1 Repo Agent for each repository and gather facts using the defined fact schema.
- Step 2: After facts are collected, spawn 5 Phase 2 Judge Agents (one per dimension) to score each repository with the rubrics.
- Step 3: Run the Aggregation to compute weighted totals, rank repositories, and output .beagle/llm-judge-report.json with a concise verdict.
Best Practices
- Clearly label each repository and provide a precise spec document for consistent fact gathering.
- Run Phase 1 fact collection to completion before Phase 2 scoring so judges work from reliable data.
- Use the provided fact schema and scoring rubrics to maintain comparable evaluations.
- Leverage artifacts-detection for dead code and overengineering signals to inform scoring.
- Review the final weighted ranking and verdict with a clear justification summary.
Example Use Cases
- Compare two OSS implementations of a REST API across repositories A and B.
- Audit security patterns and vulnerability signals across forks of a project.
- Assess test quality, coverage, and mock boundaries for competing libraries.
- Identify dead code and unnecessary complexity in multiple service implementations.
- Rank several microservice implementations by a weighted score to decide an optimal refactor target.