From grimoire
Evaluates LLM outputs for factual accuracy, relevance, safety, and alignment using RAGAS metrics and LLM-as-judge patterns. Useful for auditing responses or improving quality assurance pipelines.
Slash command
/grimoire:audit-llm-output
Systematically evaluate LLM outputs for factual accuracy, relevance, safety, and alignment using structured frameworks and automated metrics.
Adopted by: OpenAI (evals framework), Anthropic (Constitutional AI red-teaming), Stanford (HELM benchmark), Google DeepMind
Impact: RAGAS studies show that naive RAG pipelines have faithfulness scores of 0.6-0.7 out of 1.0, meaning 30-40% of LLM statements are unsupported by the retrieved context; systematic auditing identifies and resolves these gaps.
Why best: LLM outputs are probabilistic, so the same prompt can produce responses of varying quality. Without structured auditing, quality regressions go undetected when models are updated, prompts change, or knowledge bases evolve. Automated metrics provide continuous quality signals; human evaluation sets the ground truth.
RAGAS evaluation:
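from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# golden_dataset is not defined in this snippet; it is assumed to be an
# evaluation dataset (questions, answers, retrieved contexts) built beforehand.
results = evaluate(
    dataset=golden_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)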
# Target: faithfulness > 0.85
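One way to act on a target like this is a small quality gate that compares metric scores against minimum values. The following is only a sketch: the function name `check_quality_gate`, the `THRESHOLDS` dictionary, and the example scores are hypothetical, and it assumes the scores have already been pulled out of the evaluation results into a plain dictionary. Only the 0.85 faithfulness target comes from the text above.

```python
# Hypothetical thresholds; only the faithfulness target comes from the skill text.
THRESHOLDS = {"faithfulness": 0.85}


def check_quality_gate(scores, thresholds=THRESHOLDS):
    """Return (metric, score, minimum) for every metric that misses its target."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        # A missing metric counts as a failure so gaps are not silently ignored.
        # The target is "greater than", so a score equal to the minimum also fails.
        if score is None or score <= minimum:
            failures.append((metric, score, minimum))
    return failures


# Illustrative values only, not real measurements.
example_scores = {"faithfulness": 0.72, "answer_relevancy": 0.91}
failures = check_quality_gate(example_scores)
for metric, score, minimum in failures:
    print(f"FAIL {metric}: {score} (target > {minimum})")
```

In a CI setting, a non-empty `failures` list could be turned into a non-zero exit code so that a model, prompt, or knowledge-base change that lowers faithfulness blocks the merge.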
Hallucination detection prompt: "Given the context: [context]. Does the following claim appear in the context, or can it be directly inferred from it? Claim: [claim]. Answer: YES / NO / PARTIAL."
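The prompt above can be wrapped in a small LLM-as-judge loop that checks each claim and reports the fraction that is supported. This is a minimal sketch, not the skill's own implementation: `build_prompt`, `parse_verdict`, `supported_fraction`, and the `judge` callable are hypothetical names, and `judge` stands in for whatever function sends a prompt to a model and returns its text reply. Counting only a clean YES as supported is also an assumption.

```python
from typing import Callable, List

# Template modeled on the hallucination detection prompt in the text.
PROMPT_TEMPLATE = (
    "Given the context: {context}. Does the following claim appear in the "
    "context, or can it be directly inferred from it? Claim: {claim}. "
    "Answer: YES / NO / PARTIAL."
)
VALID_VERDICTS = ("YES", "NO", "PARTIAL")


def build_prompt(context: str, claim: str) -> str:
    """Fill the template with one context and one claim."""
    return PROMPT_TEMPLATE.format(context=context, claim=claim)


def parse_verdict(reply: str) -> str:
    """Map the first word of a judge reply onto YES / NO / PARTIAL, else UNKNOWN."""
    words = reply.upper().replace(".", " ").replace(",", " ").split()
    first = words[0] if words else ""
    return first if first in VALID_VERDICTS else "UNKNOWN"


def supported_fraction(
    context: str, claims: List[str], judge: Callable[[str], str]
) -> float:
    """Fraction of claims the judge marks as YES (0.0 for an empty claim list)."""
    verdicts = [parse_verdict(judge(build_prompt(context, c))) for c in claims]
    # Assumption: PARTIAL and UNKNOWN are treated as unsupported.
    supported = sum(v == "YES" for v in verdicts)
    return supported / len(claims) if claims else 0.0


# Stand-in judge with canned replies; a real judge would call a model here.
canned = iter(["YES", "no.", "Partial, the date is missing"])
score = supported_fraction(
    "some retrieved context",
    ["claim one", "claim two", "claim three"],
    lambda prompt: next(canned),
)
print(f"supported fraction: {score:.2f}")
```

Passing the judge in as a callable keeps the parsing and scoring logic testable without network access, and replies that do not start with one of the three verdicts are surfaced as UNKNOWN rather than guessed.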
npx claudepluginhub jeffreytse/grimoire --plugin grimoire
Implements evaluation strategies and quality gates for LLM outputs: structural validation, semantic checks, LLM-as-judge with bias mitigations, prompt testing, and guardrails. Use for evals, CI gates, quality measurement, regressions.
Use this skill when the user asks to "set up LLM as a judge", "write an LLM judge prompt", "automate quality evaluation", "use Claude to evaluate outputs", "build an automated eval", "LLM-based evaluation", or wants to create a scalable automated evaluation system where one LLM grades the outputs of another LLM.