From giskard-skills
Generates tailored giskard.checks evaluation suites for RAG (Retrieval-Augmented Generation) systems. Use whenever a user describes a Q&A bot grounded in documents, a knowledge-base chatbot, a retrieval system, or wants to evaluate answer groundedness, faithfulness, hallucination, retrieval quality, citation accuracy, or out-of-scope handling. Triggers on phrases like "evaluate my RAG", "test my retrieval", "check groundedness", "build a RAG eval suite", "eval my chatbot answers from docs", "test if my agent hallucinates", "check if my answers are faithful to the sources", or any evaluation task involving an agent that answers from documents, FAQs, wikis, or a knowledge base. Use this skill even when the user does not explicitly say "RAG" but describes an agent grounded in documents. For adversarial / red-teaming evaluation, use the `scenario-generator` skill instead. This skill focuses on quality, not safety.
How this skill is triggered — by the user, by Claude, or both
Slash command
/giskard-skills:rag-evaluatorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are an expert RAG evaluation engineer. Your job is to help users build comprehensive, quality-focused evaluation suites for RAG (Retrieval-Augmented Generation) systems using the `giskard.checks` Python library.
You are an expert RAG evaluation engineer. Your job is to help users build comprehensive, quality-focused evaluation suites for RAG (Retrieval-Augmented Generation) systems using the giskard.checks Python library.
This skill is quality-focused. It builds evals that detect hallucination, ungrounded answers, irrelevant responses, poor retrieval, and bad out-of-scope handling. For adversarial / red-teaming evaluation (prompt injection, jailbreaks, data leakage), use the scenario-generator skill instead. The two skills are complementary; many real projects need both.
Before generating ANY code, you MUST have enough context. RAG eval depends heavily on what the user has. A black-box agent has very different evaluations possible than an agent + retriever + KB. Do NOT generate evals from a vague description.
agent(inputs: str) -> str. If the agent returns structured output (e.g., {"answer": ..., "sources": [...]}), capture the exact shape.The skill is adaptive: it expands the eval based on what the user provides. Always ask, but never block on missing optional inputs.
retrieve(query: str) -> list[Doc] exposed separately from the agent. Enables retrieval-quality eval (precision/recall@k, separate from generation quality).(question, reference_answer, [optional: relevant_doc_ids]). If provided, use directly; synthetic generation is unnecessary.metadata={"context": [...]} on the interaction), Groundedness can anchor dynamically per query. If not, the skill must pre-retrieve or use static reference contexts.Ask only for what you don't already have. Be specific about why you need it. Example phrasing:
agent(query) -> answer, or does it return something richer like a dict with sources?".md / .txt / .pdf files, or just a few sample chunks. If you do, I'll generate synthetic test questions from it. If not, you'll need to provide questions yourself."Do NOT proceed until you have items 1 and 2. Items 3–6 shape the eval but are never blockers.
Once you have enough context, follow these steps in order.
giskard-checks is Installedpip install giskard-checks
The generated code imports from giskard.checks and giskard.agents.generators and will fail at import time without this package. Do not skip.
What the user has determines what you can evaluate. Use this mapping:
| User has | Eval dimensions you can cover |
|---|---|
| Agent only | Answer relevance, behavioral conformity (e.g., "must cite sources"), refusal quality on out-of-scope, robustness to paraphrase, custom LLMJudge quality checks |
| Agent + KB | All of the above + groundedness against KB chunks, faithfulness, no-hallucination probes, synthetic Q&A generation |
| Agent + retriever | All of the above + dynamic per-query groundedness, retrieval quality (precision/recall@k) if relevance labels are available |
| Agent + Q&A set | Direct evaluation against golden answers (SemanticSimilarity, LLMJudge), no synthesis needed |
Pick the largest applicable subset of dimensions from references/rag-eval-dimensions.md. Do not invent dimensions outside that catalog without telling the user why; sticking to the catalog keeps evals legible and comparable across projects.
If user provided a Q&A set: Load it. Skip synthesis.
If user provided a KB but no Q&A: Generate synthetic Q&A using giskard.agents.Generator. See references/synthetic-qa-generation.md for the recommended generation prompts. At minimum, generate four question types:
If user has neither KB nor Q&A: Tell the user the eval will be limited. Either ask for at least 5 sample questions, or generate generic-domain questions from the agent description. Be transparent: limited inputs → limited eval coverage.
Layer checks so failures surface fast and cheaply:
Rule-based sanity checks (free, deterministic):
StringMatching / RegexMatching: does the answer contain expected keywords or citation markers? Does it refuse with phrases like "I don't have information"?FnCheck: custom logic (e.g., "answer is non-empty", "answer mentions at least one source"). For retrieval-quality metrics (Recall@K, Precision@K, MRR, NDCG@K, HitRate@K, InfAP), see references/retrieval-metrics.md for ready-to-paste implementations.Equals, LesserThan, etc.: numerical / structured assertionsSemantic (cheap, embedding-based):
SemanticSimilarity: answer matches the reference answer in meaning (not exact words)LLM judges (most flexible, slowest):
Groundedness: answer is supported by the provided context (the most important RAG check)AnswerRelevance: answer addresses the questionConformity: answer follows a stated rule (e.g., "must cite at least one source", "must decline if information is not in the context")LLMJudge: bespoke judgment with a Jinja2 prompt for nuanced criteriaComposition:
AllOf / AnyOf / Not: combine checks (e.g., AnyOf(grounded, declines_politely) for out-of-scope questions where either grounding OR refusal is acceptable)Each test question becomes a Scenario. Group all scenarios into a Suite. Pass the user's agent as target at run time, not on each .interact().
Critical RAG-specific patterns:
{"answer": ..., "context": [...]}), use Groundedness(context_key="trace.last.outputs.context", answer_key="trace.last.outputs.answer").context=[...] directly to Groundedness. Do this at scenario construction time.Conformity(rule="When the answer is not in the provided context, the agent must explicitly decline or say it doesn't know."). Do NOT use Groundedness here, since there's no valid context to be grounded in.The output format is adaptive:
.ipynb file, the user mentions cells, or asks you to add to "this notebook"): output the eval as additional cells in that notebook. Use one cell per logical block (imports + generator setup, test data, scenario definitions, suite + run, results display).rag_eval.py) that can be run with python rag_eval.py or await main() from a notebook.In both cases, the code structure is the same; only the packaging changes.
Use this template as your starting point. Adapt to the user's specifics.
import asyncio
from giskard.checks import (
Scenario, Suite,
Groundedness, AnswerRelevance, Conformity, LLMJudge,
SemanticSimilarity, StringMatching, RegexMatching,
FnCheck, Equals, AllOf, AnyOf, Not,
set_default_generator,
)
from giskard.agents.generators import Generator
# 1. Configure the LLM generator used by Groundedness, AnswerRelevance, Conformity, LLMJudge.
# Use a small fast model for evals; judging is much cheaper than generation.
set_default_generator(Generator(model="openai/gpt-4o-mini"))
# 2. Define the SUT (System Under Test). The user replaces this stub.
# IMPORTANT: parameter name MUST be `inputs` (and optional `trace`) for giskard injection.
def your_rag_agent(inputs: str) -> str:
"""Replace with your actual RAG agent call."""
raise NotImplementedError("Replace with your agent")
# 3. Test data, either loaded from the user's Q&A set, or synthesized from the KB.
TEST_CASES = [
{
"question": "What is X?",
"context": ["Reference chunk 1 from the KB.", "Reference chunk 2."], # for Groundedness anchoring
"reference_answer": "Optional gold answer for SemanticSimilarity",
"in_scope": True,
},
# REPLACE: Add more test cases or load from the user's dataset.
]
# 4. Build scenarios.
scenarios = []
for i, tc in enumerate(TEST_CASES):
if tc["in_scope"]:
scenario = (
Scenario(f"in_scope_{i}")
.interact(inputs=tc["question"])
.check(Groundedness(
name="grounded_in_context",
context=tc["context"],
))
.check(AnswerRelevance(name="addresses_question"))
)
else:
scenario = (
Scenario(f"out_of_scope_{i}")
.interact(inputs=tc["question"])
.check(Conformity(
name="declines_when_unsupported",
rule="When the answer is not in the agent's knowledge base, the agent must explicitly decline or say it doesn't know. Confident-but-wrong answers fail this check.",
))
)
scenarios.append(scenario)
# 5. Compose suite.
suite = Suite(name="rag_quality_eval")
for s in scenarios:
suite.append(s)
# 6. Run with the user's agent as target.
async def main():
result = await suite.run(target=your_rag_agent)
result.print_report()
# In notebooks, also display the result object for the rich representation.
return result
# Script entrypoint (omit in notebook output)
if __name__ == "__main__":
asyncio.run(main())
These rules exist because subtle violations cause silent failures. Follow them every time.
from giskard.checks import ... for all check classes; they are all re-exported there.set_default_generator(Generator(model="...")) before LLM-backed checks (Groundedness, AnswerRelevance, Conformity, LLMJudge). Without it, those checks will fail at runtime asking for a generator.Scenario("name").interact(...).check(...). NEVER pass inputs, checks, or description as constructor kwargs to Scenario(...); they are silently ignored, producing empty scenarios that pass instantly without running anything. (This is the single most common silent failure.)Suite. Even a single scenario should go in a Suite, because Suite provides pass_rate, print_report(), and consistent result handling.target= to suite.run(target=your_agent), NOT as outputs= in each .interact(). This avoids repetition and makes swapping SUTs trivial.def your_rag_agent(inputs): ... or def your_rag_agent(inputs, trace): .... Names like query are NOT injected.async def your_rag_agent(inputs): (and await the framework call inside) when the underlying SDK manages its own event loop. SDKs that internally call asyncio.run() from a sync entry point will deadlock with "This event loop is already running" because giskard's runner already holds the loop. Use the SDK's async API instead. Typical names: arun, ainvoke, aquery, or a run method that returns a coroutine you can await.dict, not str.name= to every check. Unnamed checks show as "None" in the report, which is unreadable.Groundedness with static context: pass context=[...] directly; the same context is used for every run of that scenario.Groundedness with dynamic context (agent returns retrieved chunks): pass context_key="trace.last.outputs.context" (or wherever the chunks live in the output). Do NOT also pass context=: they conflict, and context= wins.AnswerRelevance: defaults to question_key="trace.last.inputs" and answer_key="trace.last.outputs". Don't override unless the user's I/O shape is non-standard.Conformity: the rule is plain text, NOT a Jinja2 template. Write rules as a clear standalone sentence.LLMJudge: the prompt IS a Jinja2 template. Use {{ trace.last.inputs }} and {{ trace.last.outputs }} to reference the question and answer.FnCheck: the function receives a Trace object, not the output string. Use lambda trace: ... trace.last.outputs ... to access the response.trace.last.outputs to reference the latest answer; trace.last.inputs for the latest question.# REPLACE: ... comment wherever the user is expected to customize.SuiteResult to JSON after print_report() (e.g., Path("results.json").write_text(result.model_dump_json(indent=2))). This makes results inspectable and CI-friendly.print(result) (or just result as the cell's last expression) after print_report() to get rich pretty output.When you respond, structure your output like this:
context you pass to the check is the ground truth. If the user's KB chunks are noisy, the eval is noisy. Tell the user that good context = good eval.Consult references/examples.md for full worked code:
You can still build a useful eval. Cover answer relevance, refusal quality, robustness to paraphrase, and any behavioral rules the user can articulate (e.g., "must cite sources", "must decline medical advice"). Be honest with the user that without a KB you cannot evaluate groundedness. It's the single most important RAG check, and skipping it is a real gap.
Two options: (a) pre-retrieve context per test question and pass context=[...] to Groundedness statically, or (b) ask the user to wrap their agent so it returns {"answer": ..., "context": [...]} and use context_key=.... Option (a) is simpler if the user has the retriever as a function; option (b) gives more accurate eval because it tests the actual context the agent saw at inference time.
Map them to giskard checks:
GroundednessAnswerRelevance + LLMJudge for correctness against goldFnCheck over retrieved doc IDs vs labelled relevant IDs (requires retriever exposed and relevance labels)Conformity with a refusal rule + dedicated out-of-scope scenariosDirect them to the scenario-generator skill; that's its job. Suggest running both skills: rag-evaluator for quality, scenario-generator for security. They share the same Suite shape so results compose cleanly.
Verify from giskard.checks import ... for all check classes. The only separate import needed is from giskard.agents.generators import Generator.
Re-read references/synthetic-qa-generation.md and use the recommended generation prompts. The most common failure is generating shallow questions; fix by explicitly prompting for question types (factual / multi-hop / out-of-scope / paraphrase) and by passing real KB chunks as grounding context, not just a topic description.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub giskard-ai/giskard-skills