# keyword-based-llm-eval

```shell
npx machina-cli add skill shimo4228/claude-code-learned-skills/keyword-based-llm-eval --openclaw
```

Keyword-Based LLM Output Evaluation

Extracted: 2026-02-10
Context: Evaluating structured LLM output (cards, summaries, extractions) against expected results without exact match or expensive semantic similarity.
## Problem

An LLM generates structured output (e.g., Anki cards with front/back text, card types, and tags). You need to measure prompt quality quantitatively, but:
- Exact match is too strict (LLM wording varies)
- Semantic similarity (embeddings) is expensive and adds dependency
- Manual review doesn't scale
## Solution

Lightweight matching based on keyword presence, with greedy best-match pairing:

### 1. Define Expected Output as Keywords

```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class ExpectedCard:
    front_keywords: list[str]           # must appear in the generated front
    back_keywords: list[str]            # must appear in the generated back
    card_type: CardType | None = None   # optional type constraint
```
### 2. Keyword Similarity Score

```python
def _keyword_similarity(keywords: list[str], text: str) -> float:
    if not keywords:
        return 0.0
    found = sum(1 for kw in keywords if kw in text)
    return found / len(keywords)
```
### 3. Weighted Pair Scoring

```python
def _score_pair(expected, card) -> float:
    front_sim = _keyword_similarity(expected.front_keywords, card.front)
    back_sim = _keyword_similarity(expected.back_keywords, card.back)
    # Optional type bonus (20% weight when a type is specified)
    if expected.card_type is not None:
        type_bonus = 1.0 if card.card_type == expected.card_type else 0.0
        return front_sim * 0.4 + back_sim * 0.4 + type_bonus * 0.2
    return front_sim * 0.5 + back_sim * 0.5
```
### 4. Greedy Best-Match Algorithm

```python
def match_cards(expected, generated, threshold=0.3) -> CaseResult:
    used_indices: set[int] = set()
    for ec in expected:
        # Find the highest-scoring unused generated card
        best_score, best_idx = 0.0, -1
        for i, card in enumerate(generated):
            if i in used_indices:
                continue
            score = _score_pair(ec, card)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx >= 0 and best_score >= threshold:
            used_indices.add(best_idx)
            # Record match
    # Remaining generated cards = unmatched (extra output)
```
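To see the greedy pairing in isolation, here is a self-contained sketch that works on a precomputed score matrix instead of card objects (`greedy_match` and its return shape are simplifications of the source's `match_cards`, not its actual API):

```python
def greedy_match(score_matrix: list[list[float]], threshold: float = 0.3):
    """Greedily pair each expected row with its best unused generated column."""
    used: set[int] = set()
    matches: list[tuple[int, int, float]] = []  # (expected_idx, generated_idx, score)
    for e, row in enumerate(score_matrix):
        best_score, best_idx = 0.0, -1
        for g, score in enumerate(row):
            if g in used:
                continue
            if score > best_score:
                best_score, best_idx = score, g
        if best_idx >= 0 and best_score >= threshold:
            used.add(best_idx)
            matches.append((e, best_idx, best_score))
    return matches


# Expected card 0 takes generated card 1; card 1 then takes its best *unused* option.
scores = [
    [0.2, 0.9],  # expected 0 vs generated 0, 1
    [0.8, 0.7],  # expected 1
]
print(greedy_match(scores))  # [(0, 1, 0.9), (1, 0, 0.8)]
```

Greedy matching is order-dependent and not globally optimal: an earlier expected card can "steal" a generated card that a later one fits better. Optimal assignment (e.g. the Hungarian algorithm) would fix that at the cost of extra complexity, which threshold-gated keyword scoring rarely needs.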
### 5. Aggregate Metrics

- Recall = matched / total_expected (coverage)
- Precision = matched / total_generated (relevance)
- F1 = harmonic mean of precision and recall
- Avg Similarity = mean similarity score of matched pairs
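These reduce to a few lines. The `calculate_metrics` below is a sketch of what the source's `metrics.py` computes; the exact signature and return type are assumptions:

```python
def calculate_metrics(n_matched: int, n_expected: int, n_generated: int,
                      matched_scores: list[float]) -> dict[str, float]:
    # Guard against empty datasets / empty generations to avoid ZeroDivisionError.
    recall = n_matched / n_expected if n_expected else 0.0
    precision = n_matched / n_generated if n_generated else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    avg_sim = sum(matched_scores) / len(matched_scores) if matched_scores else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "avg_similarity": avg_sim}


# 3 of 4 expected cards matched; the model generated 6 cards total.
m = calculate_metrics(3, 4, 6, [0.9, 0.8, 0.7])
print(m)  # recall 0.75, precision 0.5, f1 0.6, avg_similarity ~0.8
```

A low precision with high recall typically signals the prompt over-generates (extra cards), while the reverse signals missing coverage.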
### 6. YAML Dataset Format

```yaml
name: "eval-dataset"
version: "1.0"
cases:
  - id: "case-01"
    text: "Input text for LLM..."
    expected_cards:
      - front_keywords: ["key concept"]
        back_keywords: ["expected answer part"]
        card_type: qa
```
## Architecture

4-module separation (each module independently testable):

```
dataset.py → ExpectedCard/EvalCase/EvalDataset + YAML loader
matcher.py → _keyword_similarity + _score_pair + match_cards (greedy)
metrics.py → EvalMetrics (Recall/Precision/F1) + calculate_metrics
report.py  → Rich table + JSON report + comparison report
```
## When to Use
- Building eval harness for LLM-generated structured output
- Measuring prompt quality changes (A/B comparison)
- CI integration for prompt regression detection
- Any scenario where output has identifiable keywords but not exact text
## Trade-offs
- Pro: Zero additional dependencies, fast (~ms), language-agnostic keywords
- Pro: Easy to maintain YAML datasets, human-readable
- Con: Keyword presence != semantic understanding (false positives possible)
- Con: Order-insensitive (can't verify sequence constraints)
- Future: Add semantic similarity tier using embeddings for higher precision
## Source

https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/keyword-based-llm-eval/SKILL.md

## Overview
Evaluates structured LLM output (cards, summaries, extractions) against expected results using keyword presence and F1 metrics. It avoids strict exact matches and expensive semantic similarity by using a lightweight, greedy best-match approach.
## How This Skill Works
Define the expected output as keywords (`front_keywords`, `back_keywords`, and an optional `card_type`). Compute keyword similarity as the fraction of keywords found in the generated text. Score pairs with a weighted scheme (front 0.4, back 0.4, optional type bonus 0.2). Use a greedy best-match algorithm to pair expected items with generated cards, then compute Recall, Precision, F1, and Avg Similarity from the matches.
## When to Use It
- Building an eval harness for LLM-generated structured output (e.g., flashcards, summaries, extractions)
- Measuring prompt quality changes with A/B comparisons
- CI integration for prompt regression detection
- Scenarios where outputs have identifiable keywords but not exact wording
- Evaluating card-like outputs with optional type constraints (card_type) to enforce structure
## Quick Start
- Step 1: Define `ExpectedCard(front_keywords=[...], back_keywords=[...], card_type=...)`.
- Step 2: Run `match_cards(expected, generated, threshold=0.3)` to pair cards greedily.
- Step 3: Compute Recall, Precision, F1, and Avg Similarity from the matches.
## Best Practices
- Clearly define ExpectedCard with front_keywords, back_keywords, and optional card_type
- Keep keyword sets representative but concise to reduce false positives
- Use an appropriate threshold to balance precision and recall
- Document the dataset format (e.g., YAML) and match expectations to real outputs
- Monitor and iterate on keywords when prompt changes affect wording
## Example Use Cases
- Evaluating Anki-style cards where fronts must include key concepts and backs must include definitions
- Assessing structured summaries where certain tags must appear in the body text
- QA-style card generation with optional card_type constraints (e.g., qa) to enforce format
- CI checks that flag degraded keyword coverage after prompt tweaks
- Benchmarking different prompts using a YAML-encoded eval dataset with front/back keywords