Skill

audit-llm-output

Evaluates LLM outputs for factual accuracy, relevance, safety, and alignment using RAGAS metrics and LLM-as-judge patterns. Useful for auditing responses or improving quality assurance pipelines.

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/grimoire:audit-llm-output

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Systematically evaluate LLM outputs for factual accuracy, relevance, safety, and alignment using structured frameworks and automated metrics.

SKILL.md

58 lines · ~1k tokens

Stats

Language: Shell

Stars: 12

Forks: 1

Maintenance: Excellent

Last Commit: Jun 17, 2026

Audit LLM Output

Systematically evaluate LLM outputs for factual accuracy, relevance, safety, and alignment using structured frameworks and automated metrics.

Why This Is Best Practice

Adopted by: OpenAI (evals framework), Anthropic (Constitutional AI red-teaming), Stanford (HELM benchmark), Google DeepMind

Impact: RAGAS studies show that naive RAG pipelines have faithfulness scores of 0.6-0.7 out of 1.0, meaning 30-40% of LLM statements are unsupported by the retrieved context; systematic auditing identifies and resolves these gaps.

Why best: LLM outputs are probabilistic — the same prompt can produce different quality responses. Without structured auditing, quality regressions go undetected when models are updated, prompts change, or knowledge bases evolve. Automated metrics provide continuous quality signals; human evaluation sets the ground truth.

Steps

Define evaluation dimensions — Select dimensions relevant to the use case: Faithfulness (claims supported by context), Answer Relevancy (answers the question asked), Context Precision (retrieved context is relevant), Hallucination Rate, Toxicity, Coherence.
Build a golden evaluation set — Curate 100-500 question/ground-truth-answer pairs covering the full range of use cases, including edge cases and adversarial inputs. This is the most valuable investment in LLM quality.
Apply RAGAS for RAG systems — Run RAGAS metrics (faithfulness, answer relevancy, context recall, context precision) against the golden set. Target: faithfulness >0.85, answer relevancy >0.80.
Detect hallucinations — Use an LLM-as-judge pattern: prompt a separate model to assess whether each claim in the output is supported by the provided context. Flag unsupported claims for human review.
Run safety evaluations — Test for: harmful content, prompt injection, PII leakage, jailbreak susceptibility. Use structured red-teaming: role-play, indirect instruction, context manipulation.
Establish a regression baseline — Record metric scores for the current system; treat any significant degradation (>5% on primary metrics) after a change as a regression requiring investigation.
Log and sample production outputs — Sample 1-5% of live queries for human evaluation; flag low-confidence outputs (low logprob, long generation time) for priority review.
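
The regression baseline step can be sketched as a small comparison between stored and fresh metric scores. This is a minimal illustration, not part of the skill itself: the function name `find_regressions`, the example scores, and the reading of ">5%" as a relative drop (rather than an absolute one) are all assumptions.

```python
def find_regressions(baseline, current, threshold=0.05):
    """Return metrics whose score fell by more than `threshold`, as a relative drop.

    `baseline` and `current` map metric names to scores in [0, 1].
    Interpreting the 5% rule as relative is an assumption of this sketch.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None or base_score <= 0:
            continue  # metric missing or baseline unusable; review by hand
        relative_drop = (base_score - new_score) / base_score
        if relative_drop > threshold:
            regressions[metric] = relative_drop
    return regressions


# Made-up scores for illustration only.
baseline = {"faithfulness": 0.88, "answer_relevancy": 0.84}
current = {"faithfulness": 0.80, "answer_relevancy": 0.83}

for metric, drop in find_regressions(baseline, current).items():
    print(f"REGRESSION {metric}: down {drop:.1%} vs baseline")
```

Running a check like this in CI after each model, prompt, or knowledge base change gives an early signal before the change reaches production.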

Rules

Never rely solely on automated metrics — they measure correlation with quality, not quality itself; human evaluation sets the ground truth.
Separate evaluation of retrieval quality from generation quality in RAG systems — a good generator cannot compensate for irrelevant retrieved context.
Include adversarial examples in the golden set — benign-only evals miss the failure modes that matter most.
Re-evaluate after every model version change, prompt change, or knowledge base update.
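
One way to keep retrieval and generation apart, as the rules above require, is to score retrieval on document IDs alone and only then look at the generation metric. The sketch below is hypothetical: `retrieval_recall`, `diagnose`, the sample records, and the 0.8 recall threshold are assumptions made for illustration; only the 0.85 faithfulness target comes from the steps above.

```python
def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of known-relevant documents that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # no ground-truth documents for this sample
    return len(relevant & set(retrieved_ids)) / len(relevant)


def diagnose(recall, faithfulness, recall_min=0.8, faithfulness_min=0.85):
    """Attribute a failing sample to retrieval, generation, or both."""
    bad_retrieval = recall < recall_min
    bad_generation = faithfulness < faithfulness_min
    if bad_retrieval and bad_generation:
        return "retrieval and generation"
    if bad_retrieval:
        return "retrieval"
    if bad_generation:
        return "generation"
    return "ok"


# Made-up samples: document IDs plus a per-sample faithfulness score.
samples = [
    {"retrieved": ["d1", "d7", "d9"], "relevant": ["d1", "d2"], "faithfulness": 0.95},
    {"retrieved": ["d3", "d4"], "relevant": ["d3", "d4"], "faithfulness": 0.60},
]

for sample in samples:
    recall = retrieval_recall(sample["retrieved"], sample["relevant"])
    if recall is not None:
        print(diagnose(recall, sample["faithfulness"]))
```

The first sample points at the retriever (a relevant document was missed even though the answer stayed faithful to what was retrieved); the second points at the generator.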

Examples

RAGAS evaluation:

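from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# golden_dataset is assumed to be a Hugging Face `datasets.Dataset` built from
# the golden set, with question, answer, contexts, and ground_truth columns
# (the ragas 0.1.x schema; column names and metric imports differ in later releases).
# The metrics call an LLM, so credentials for the configured provider must be set.
results = evaluate(
    dataset=golden_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)  # aggregate score per metric
# Targets: faithfulness > 0.85, answer relevancy > 0.80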
Hallucination detection prompt: "Given the context: [context]. Is the following claim stated in the context, or can it be directly inferred from it? Claim: [claim]. Answer: YES / NO / PARTIAL."
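
A prompt like this can be wired into a per-claim audit loop. The following is a rough sketch under stated assumptions: the `judge` argument stands in for whatever model client is in use (no real API is called here), `stub_judge` is a fake used only to make the example runnable, and treating every non-YES verdict as "flag for human review" is one possible policy.

```python
import re
from typing import Callable, List, Tuple

JUDGE_TEMPLATE = (
    "Given the context: {context}. Is the following claim stated in the context, "
    "or can it be directly inferred from it? Claim: {claim}. "
    "Answer: YES / NO / PARTIAL."
)


def parse_verdict(raw: str) -> str:
    """Extract the first YES / NO / PARTIAL token from a judge reply."""
    match = re.search(r"\b(YES|NO|PARTIAL)\b", raw.upper())
    return match.group(1) if match else "UNPARSEABLE"


def audit_claims(context: str, claims: List[str], judge: Callable[[str], str]):
    """Judge each claim against the context and collect the unsupported ones."""
    verdicts: List[Tuple[str, str]] = []
    for claim in claims:
        reply = judge(JUDGE_TEMPLATE.format(context=context, claim=claim))
        verdicts.append((claim, parse_verdict(reply)))
    flagged = [claim for claim, verdict in verdicts if verdict != "YES"]
    supported = (len(claims) - len(flagged)) / len(claims) if claims else None
    return {"verdicts": verdicts, "flagged": flagged, "supported_fraction": supported}


def stub_judge(prompt: str) -> str:
    """Fake judge for demonstration only; replace with a real model call."""
    claim = prompt.split("Claim: ", 1)[1]
    return "Answer: YES" if "1889" in claim else "Answer: NO."


report = audit_claims(
    context="The Eiffel Tower was completed in 1889.",
    claims=["The tower was completed in 1889.", "The tower is 500 metres tall."],
    judge=stub_judge,
)
print(report["flagged"])
```

Replies that cannot be parsed are flagged rather than counted as supported, so a malformed judge response never silently passes a claim.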

Common Mistakes

Evaluating on the training set — questions used to tune the system are not representative of production distribution.
LLM-as-judge without calibration — judge models have their own biases; validate judge scores against human labels before trusting them.
Ignoring latency and cost as quality dimensions — a correct answer delivered in 30 seconds may be worse than a slightly less precise answer in 2 seconds.
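
Calibrating a judge against human labels can be as simple as measuring agreement on a shared sample. The sketch below uses raw agreement plus Cohen's kappa, which corrects for chance agreement; kappa is one common choice and is an assumption here, not something the skill prescribes. The labels are made up for illustration.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected by chance, given each rater's label frequencies.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label
    return observed, (observed - expected) / (1 - expected)


# Made-up labels for eight audited claims.
human_labels = ["YES", "YES", "NO", "NO", "YES", "NO", "YES", "NO"]
judge_labels = ["YES", "YES", "NO", "YES", "YES", "NO", "YES", "YES"]

agreement, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

In this example the judge agrees with humans 75% of the time, but it over-predicts YES, so kappa is noticeably lower than raw agreement. A gap like that is a reason to adjust the judge prompt before trusting its scores.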
