Skill

llm-evaluation

Evaluates LLM applications with automated metrics (BLEU, ROUGE, BERTScore, accuracy), human feedback, and LLM-as-judge. Use for measuring model performance, comparing prompts, and detecting regressions.

OpenAI

Popularity

Parent stars

36,816

Parent forks

3,983

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/llm-application-dev:llm-evaluation

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

Supporting Files

references/details.md

SKILL.md

146 lines · ~983 tokens

Stats

LanguagePython

Parent stars36,816

Parent forks3,983

MaintenanceExcellent

Last CommitJun 12, 2026

Actions

View Source View Plugin View on GitHub View README

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

Measuring LLM application performance systematically
Comparing different models or prompts
Detecting performance regressions before deployment
Validating improvements from prompt changes
Building confidence in production systems
Establishing baselines and tracking progress over time
Debugging unexpected model behavior

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

Text Generation:

BLEU: N-gram overlap (translation)
ROUGE: Recall-oriented (summarization)
METEOR: Semantic similarity
BERTScore: Embedding-based similarity
Perplexity: Language model confidence

Classification:

Accuracy: Percentage correct
Precision/Recall/F1: Class-specific performance
Confusion Matrix: Error patterns
AUC-ROC: Ranking quality

Retrieval (RAG):

MRR: Mean Reciprocal Rank
NDCG: Normalized Discounted Cumulative Gain
Precision@K: Relevant in top K
Recall@K: Coverage in top K

2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

Dimensions:

Accuracy: Factual correctness
Coherence: Logical flow
Relevance: Answers the question
Fluency: Natural language quality
Safety: No harmful content
Helpfulness: Useful to the user

3. LLM-as-Judge

Use stronger LLMs to evaluate weaker model outputs.

Approaches:

Pointwise: Score individual responses
Pairwise: Compare two responses
Reference-based: Compare to gold standard
Reference-free: Judge without ground truth

Quick Start

from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }

# Usage
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness)
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
]

results = await suite.evaluate(model=your_model, test_cases=test_cases)

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.

llm-evaluation

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

llm-evaluation

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

LLM Evaluation

When to Use This Skill

Core Evaluation Types

1. Automated Metrics

2. Human Evaluation

3. LLM-as-Judge

Quick Start

Detailed patterns and worked examples

Similar Skills

LLM Evaluation

When to Use This Skill

Core Evaluation Types

1. Automated Metrics

2. Human Evaluation

3. LLM-as-Judge

Quick Start

Detailed patterns and worked examples

Similar Skills