From llm-application-dev
Evaluates LLM applications with automated metrics (BLEU, ROUGE, BERTScore, accuracy), human feedback, and LLM-as-judge. Use for measuring model performance, comparing prompts, and detecting regressions.
How this skill is triggered — by the user, by Claude, or both
Slash command
/llm-application-dev:llm-evaluationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
Classification:
Retrieval (RAG):
Manual assessment for quality aspects difficult to automate.
Dimensions:
Use stronger LLMs to evaluate weaker model outputs.
Approaches:
from dataclasses import dataclass
from typing import Callable
import numpy as np
@dataclass
class Metric:
name: str
fn: Callable
@staticmethod
def accuracy():
return Metric("accuracy", calculate_accuracy)
@staticmethod
def bleu():
return Metric("bleu", calculate_bleu)
@staticmethod
def bertscore():
return Metric("bertscore", calculate_bertscore)
@staticmethod
def custom(name: str, fn: Callable):
return Metric(name, fn)
class EvaluationSuite:
def __init__(self, metrics: list[Metric]):
self.metrics = metrics
async def evaluate(self, model, test_cases: list[dict]) -> dict:
results = {m.name: [] for m in self.metrics}
for test in test_cases:
prediction = await model.predict(test["input"])
for metric in self.metrics:
score = metric.fn(
prediction=prediction,
reference=test.get("expected"),
context=test.get("context")
)
results[metric.name].append(score)
return {
"metrics": {k: np.mean(v) for k, v in results.items()},
"raw_scores": results
}
# Usage
suite = EvaluationSuite([
Metric.accuracy(),
Metric.bleu(),
Metric.bertscore(),
Metric.custom("groundedness", check_groundedness)
])
test_cases = [
{
"input": "What is the capital of France?",
"expected": "Paris",
"context": "France is a country in Europe. Paris is its capital."
},
]
results = await suite.evaluate(model=your_model, test_cases=test_cases)
Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.
npx claudepluginhub wshobson/agents --plugin llm-application-devEvaluates LLM apps using automated metrics (BLEU, ROUGE, BERTScore, MRR), human feedback, and LLM-as-judge. For testing performance, benchmarking, and regressions.
Evaluates LLM application performance using automated metrics (BLEU, ROUGE, BERTScore), human evaluation, and LLM-as-judge techniques. Helps detect regressions and validate improvements.