
keyword-based-llm-eval

npx machina-cli add skill shimo4228/claude-code-learned-skills/keyword-based-llm-eval --openclaw

Keyword-Based LLM Output Evaluation

Extracted: 2026-02-10
Context: Evaluating structured LLM output (cards, summaries, extractions) against expected results without exact matching or expensive semantic similarity.

Problem

The LLM generates structured output (e.g., Anki cards with front/back text, card types, tags). We need to measure prompt quality quantitatively, but:

  • Exact match is too strict (LLM wording varies)
  • Semantic similarity (embeddings) is expensive and adds dependency
  • Manual review doesn't scale

Solution

Keyword presence-based lightweight matching with greedy best-match:

1. Define Expected Output as Keywords

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class ExpectedCard:
    front_keywords: list[str]  # must appear in generated front
    back_keywords: list[str]   # must appear in generated back
    card_type: CardType | None = None  # optional type constraint (enum defined elsewhere)

2. Keyword Similarity Score

def _keyword_similarity(keywords: list[str], text: str) -> float:
    # Fraction of keywords found in text (case-sensitive substring match)
    if not keywords:
        return 0.0
    found = sum(1 for kw in keywords if kw in text)
    return found / len(keywords)

3. Weighted Pair Scoring

def _score_pair(expected, card) -> float:
    front_sim = _keyword_similarity(expected.front_keywords, card.front)
    back_sim = _keyword_similarity(expected.back_keywords, card.back)
    # Optional type bonus (20% weight when type specified)
    if expected.card_type is not None:
        type_bonus = 1.0 if card.card_type == expected.card_type else 0.0
        return front_sim * 0.4 + back_sim * 0.4 + type_bonus * 0.2
    return front_sim * 0.5 + back_sim * 0.5

4. Greedy Best-Match Algorithm

def match_cards(expected, generated, threshold=0.3) -> CaseResult:
    used_indices: set[int] = set()
    matches = []  # (expected_card, generated_card, score) triples
    for ec in expected:
        # Find highest-scoring unused generated card
        best_score, best_idx = 0.0, -1
        for i, card in enumerate(generated):
            if i in used_indices:
                continue
            score = _score_pair(ec, card)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx >= 0 and best_score >= threshold:
            used_indices.add(best_idx)
            matches.append((ec, generated[best_idx], best_score))
    # Remaining generated cards = unmatched (extra output);
    # assemble the CaseResult from matches and those unmatched extras

5. Aggregate Metrics

  • Recall = matched / total_expected (coverage)
  • Precision = matched / total_generated (relevance)
  • F1 = harmonic mean of Precision and Recall
  • Avg Similarity = mean similarity of matched pairs
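The aggregate metrics above can be sketched as follows (a minimal version of what `calculate_metrics` in metrics.py presumably computes; the `EvalMetrics` shape here is illustrative, not the skill's actual class):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalMetrics:  # illustrative shape; the real EvalMetrics lives in metrics.py
    recall: float
    precision: float
    f1: float
    avg_similarity: float

def calculate_metrics(matched_scores: list[float],
                      total_expected: int,
                      total_generated: int) -> EvalMetrics:
    matched = len(matched_scores)
    recall = matched / total_expected if total_expected else 0.0
    precision = matched / total_generated if total_generated else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    avg_sim = sum(matched_scores) / matched if matched else 0.0
    return EvalMetrics(recall, precision, f1, avg_sim)
```

The zero-denominator guards matter in CI: a case with no expected cards or no generated cards should report 0.0 rather than raise.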

6. YAML Dataset Format

name: "eval-dataset"
version: "1.0"
cases:
  - id: "case-01"
    text: "Input text for LLM..."
    expected_cards:
      - front_keywords: ["key concept"]
        back_keywords: ["expected answer part"]
        card_type: qa

Architecture

4-module separation (each independently testable):

dataset.py  → ExpectedCard/EvalCase/EvalDataset + YAML loader
matcher.py  → _keyword_similarity + _score_pair + match_cards (greedy)
metrics.py  → EvalMetrics (Recall/Precision/F1) + calculate_metrics
report.py   → Rich table + JSON report + comparison report

When to Use

  • Building eval harness for LLM-generated structured output
  • Measuring prompt quality changes (A/B comparison)
  • CI integration for prompt regression detection
  • Any scenario where output has identifiable keywords but not exact text

Trade-offs

  • Pro: Zero additional dependencies, fast (~ms), language-agnostic keywords
  • Pro: Easy to maintain YAML datasets, human-readable
  • Con: Keyword presence != semantic understanding (false positives possible)
  • Con: Order-insensitive (can't verify sequence constraints)
  • Future: Add semantic similarity tier using embeddings for higher precision

Source

https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/keyword-based-llm-eval/SKILL.md (View on GitHub)

Overview

Evaluates structured LLM output (cards, summaries, extractions) against expected results using keyword presence and F1 metrics. It avoids strict exact matches and expensive semantic similarity by using a lightweight, greedy best-match approach.

How This Skill Works

Define the expected output as keywords (front_keywords, back_keywords, and an optional card_type). Compute keyword similarity as the fraction of keywords found in the generated text. Score pairs with a weighted scheme (front 0.4, back 0.4, optional type bonus 0.2). Use a greedy best-match algorithm to pair expected items with generated cards, then compute Recall, Precision, F1, and Avg Similarity from the matches.

When to Use It

  • Building an eval harness for LLM-generated structured output (e.g., flashcards, summaries, extractions)
  • Measuring prompt quality changes with A/B comparisons
  • CI integration for prompt regression detection
  • Scenarios where outputs have identifiable keywords but not exact wording
  • Evaluating card-like outputs with optional type constraints (card_type) to enforce structure

Quick Start

  1. Define ExpectedCard(front_keywords=[...], back_keywords=[...], card_type=...).
  2. Run match_cards(expected, generated, threshold=0.3) to pair cards greedily.
  3. Compute Recall, Precision, F1, and Avg Similarity from the matches.
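Putting the three steps together, a self-contained sketch (the card classes and the omitted type bonus are simplified stand-ins for the skill's actual dataclasses):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpectedCard:  # simplified: no card_type constraint in this sketch
    front_keywords: list[str]
    back_keywords: list[str]

@dataclass(frozen=True)
class GeneratedCard:  # stand-in for whatever the LLM pipeline emits
    front: str
    back: str

def keyword_similarity(keywords: list[str], text: str) -> float:
    if not keywords:
        return 0.0
    return sum(1 for kw in keywords if kw in text) / len(keywords)

def score_pair(expected: ExpectedCard, card: GeneratedCard) -> float:
    return (keyword_similarity(expected.front_keywords, card.front) * 0.5
            + keyword_similarity(expected.back_keywords, card.back) * 0.5)

def match_cards(expected, generated, threshold=0.3):
    used, matches = set(), []
    for ec in expected:
        best_score, best_idx = 0.0, -1
        for i, card in enumerate(generated):
            if i in used:
                continue
            s = score_pair(ec, card)
            if s > best_score:
                best_score, best_idx = s, i
        if best_idx >= 0 and best_score >= threshold:
            used.add(best_idx)
            matches.append((ec, generated[best_idx], best_score))
    return matches

# Step 1: expected keywords; Step 2: greedy match; Step 3: metrics
expected = [ExpectedCard(["photosynthesis"], ["chlorophyll"])]
generated = [GeneratedCard("What is photosynthesis?", "Plants use chlorophyll ...")]
matches = match_cards(expected, generated)
recall = len(matches) / len(expected)
precision = len(matches) / len(generated)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```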

Best Practices

  • Clearly define ExpectedCard with front_keywords, back_keywords, and optional card_type
  • Keep keyword sets representative but concise to reduce false positives
  • Use an appropriate threshold to balance precision and recall
  • Document the dataset format (e.g., YAML) and match expectations to real outputs
  • Monitor and iterate on keywords when prompt changes affect wording

Example Use Cases

  • Evaluating Anki-style cards where fronts must include key concepts and backs must include definitions
  • Assessing structured summaries where certain tags must appear in the body text
  • QA-style card generation with optional card_type constraints (e.g., qa) to enforce format
  • CI checks that flag degraded keyword coverage after prompt tweaks
  • Benchmarking different prompts using a YAML-encoded eval dataset with front/back keywords
