# keyword-based-llm-eval

```shell
npx machina-cli add skill shimo4228/claude-code-learned-skills/keyword-based-llm-eval --openclaw
```

Keyword-Based LLM Output Evaluation

Extracted: 2026-02-10
Context: Evaluating structured LLM output (cards, summaries, extractions) against expected results without exact match or expensive semantic similarity.
## Problem

An LLM generates structured output (e.g., Anki cards with front/back text, card types, and tags). You need to measure prompt quality quantitatively, but:
- Exact match is too strict (LLM wording varies)
- Semantic similarity (embeddings) is expensive and adds dependency
- Manual review doesn't scale
## Solution

Lightweight matching based on keyword presence, with greedy best-match pairing:

### 1. Define Expected Output as Keywords

```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class ExpectedCard:
    front_keywords: list[str]           # must appear in the generated front
    back_keywords: list[str]            # must appear in the generated back
    card_type: CardType | None = None   # optional type constraint
```
### 2. Keyword Similarity Score

```python
def _keyword_similarity(keywords: list[str], text: str) -> float:
    if not keywords:
        return 0.0
    found = sum(1 for kw in keywords if kw in text)
    return found / len(keywords)
```
### 3. Weighted Pair Scoring

```python
def _score_pair(expected, card) -> float:
    front_sim = _keyword_similarity(expected.front_keywords, card.front)
    back_sim = _keyword_similarity(expected.back_keywords, card.back)
    # Optional type bonus (20% weight when a type is specified)
    if expected.card_type is not None:
        type_bonus = 1.0 if card.card_type == expected.card_type else 0.0
        return front_sim * 0.4 + back_sim * 0.4 + type_bonus * 0.2
    return front_sim * 0.5 + back_sim * 0.5
```
### 4. Greedy Best-Match Algorithm

```python
def match_cards(expected, generated, threshold=0.3) -> CaseResult:
    used_indices: set[int] = set()
    for ec in expected:
        # Find the highest-scoring unused generated card
        best_score, best_idx = 0.0, -1
        for i, card in enumerate(generated):
            if i in used_indices:
                continue
            score = _score_pair(ec, card)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx >= 0 and best_score >= threshold:
            used_indices.add(best_idx)
            # Record match
    # Remaining generated cards = unmatched (extra output)
```
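To see the greedy pairing in isolation, here is a self-contained sketch that works on a precomputed score matrix instead of card objects (`greedy_match` and its return shape are simplifications of the source's `match_cards`, not its actual API):

```python
def greedy_match(score_matrix: list[list[float]], threshold: float = 0.3):
    """Greedily pair each expected row with its best unused generated column."""
    used: set[int] = set()
    matches: list[tuple[int, int, float]] = []  # (expected_idx, generated_idx, score)
    for e, row in enumerate(score_matrix):
        best_score, best_idx = 0.0, -1
        for g, score in enumerate(row):
            if g in used:
                continue
            if score > best_score:
                best_score, best_idx = score, g
        if best_idx >= 0 and best_score >= threshold:
            used.add(best_idx)
            matches.append((e, best_idx, best_score))
    return matches


# Expected card 0 takes generated card 1; card 1 then takes its best *unused* option.
scores = [
    [0.2, 0.9],  # expected 0 vs generated 0, 1
    [0.8, 0.7],  # expected 1
]
print(greedy_match(scores))  # [(0, 1, 0.9), (1, 0, 0.8)]
```

Greedy matching is order-dependent and not globally optimal: an earlier expected card can "steal" a generated card that a later one fits better. Optimal assignment (e.g. the Hungarian algorithm) would fix that at the cost of extra complexity, which threshold-gated keyword scoring rarely needs.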
### 5. Aggregate Metrics

- Recall = matched / total_expected (coverage)
- Precision = matched / total_generated (relevance)
- F1 = harmonic mean of precision and recall
- Avg Similarity = mean similarity score of matched pairs
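These reduce to a few lines. The `calculate_metrics` below is a sketch of what the source's `metrics.py` computes; the exact signature and return type are assumptions:

```python
def calculate_metrics(n_matched: int, n_expected: int, n_generated: int,
                      matched_scores: list[float]) -> dict[str, float]:
    # Guard against empty datasets / empty generations to avoid ZeroDivisionError.
    recall = n_matched / n_expected if n_expected else 0.0
    precision = n_matched / n_generated if n_generated else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    avg_sim = sum(matched_scores) / len(matched_scores) if matched_scores else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "avg_similarity": avg_sim}


# 3 of 4 expected cards matched; the model generated 6 cards total.
m = calculate_metrics(3, 4, 6, [0.9, 0.8, 0.7])
print(m)  # recall 0.75, precision 0.5, f1 0.6, avg_similarity ~0.8
```

A low precision with high recall typically signals the prompt over-generates (extra cards), while the reverse signals missing coverage.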
### 6. YAML Dataset Format

```yaml
name: "eval-dataset"
version: "1.0"
cases:
  - id: "case-01"
    text: "Input text for LLM..."
    expected_cards:
      - front_keywords: ["key concept"]
        back_keywords: ["expected answer part"]
        card_type: qa
```
## Architecture

4-module separation (each module independently testable):

```
dataset.py → ExpectedCard/EvalCase/EvalDataset + YAML loader
matcher.py → _keyword_similarity + _score_pair + match_cards (greedy)
metrics.py → EvalMetrics (Recall/Precision/F1) + calculate_metrics
report.py  → Rich table + JSON report + comparison report
```
## When to Use
- Building eval harness for LLM-generated structured output
- Measuring prompt quality changes (A/B comparison)
- CI integration for prompt regression detection
- Any scenario where output has identifiable keywords but not exact text
## Trade-offs
- Pro: Zero additional dependencies, fast (~ms), language-agnostic keywords
- Pro: Easy to maintain YAML datasets, human-readable
- Con: Keyword presence != semantic understanding (false positives possible)
- Con: Order-insensitive (can't verify sequence constraints)
- Future: Add semantic similarity tier using embeddings for higher precision
## Source

https://github.com/shimo4228/claude-code-learned-skills/blob/main/skills/keyword-based-llm-eval/SKILL.md

## Overview
Evaluates structured LLM output (cards, summaries, extractions) against expected results using keyword presence and F1 metrics. It avoids strict exact matches and expensive semantic similarity by using a lightweight, greedy best-match approach.
## How This Skill Works
Define the expected output as keywords (`front_keywords`, `back_keywords`, and an optional `card_type`). Compute keyword similarity as the fraction of keywords found in the generated text. Score pairs with a weighted scheme (front 0.4, back 0.4, optional type bonus 0.2). Use a greedy best-match algorithm to pair expected items with generated cards, then compute Recall, Precision, F1, and Avg Similarity from the matches.
## When to Use It
- Building an eval harness for LLM-generated structured output (e.g., flashcards, summaries, extractions)
- Measuring prompt quality changes with A/B comparisons
- CI integration for prompt regression detection
- Scenarios where outputs have identifiable keywords but not exact wording
- Evaluating card-like outputs with optional type constraints (card_type) to enforce structure
## Quick Start
- Step 1: Define `ExpectedCard(front_keywords=[...], back_keywords=[...], card_type=...)`.
- Step 2: Run `match_cards(expected, generated, threshold=0.3)` to pair cards greedily.
- Step 3: Compute Recall, Precision, F1, and Avg Similarity from the matches.
## Best Practices
- Clearly define ExpectedCard with front_keywords, back_keywords, and optional card_type
- Keep keyword sets representative but concise to reduce false positives
- Use an appropriate threshold to balance precision and recall
- Document the dataset format (e.g., YAML) and match expectations to real outputs
- Monitor and iterate on keywords when prompt changes affect wording
## Example Use Cases
- Evaluating Anki-style cards where fronts must include key concepts and backs must include definitions
- Assessing structured summaries where certain tags must appear in the body text
- QA-style card generation with optional card_type constraints (e.g., qa) to enforce format
- CI checks that flag degraded keyword coverage after prompt tweaks
- Benchmarking different prompts using a YAML-encoded eval dataset with front/back keywords