Skill

rag-evaluator

Generates tailored giskard.checks evaluation suites for RAG (Retrieval-Augmented Generation) systems. Use whenever a user describes a Q&A bot grounded in documents, a knowledge-base chatbot, a retrieval system, or wants to evaluate answer groundedness, faithfulness, hallucination, retrieval quality, citation accuracy, or out-of-scope handling. Triggers on phrases like "evaluate my RAG", "test my retrieval", "check groundedness", "build a RAG eval suite", "eval my chatbot answers from docs", "test if my agent hallucinates", "check if my answers are faithful to the sources", or any evaluation task involving an agent that answers from documents, FAQs, wikis, or a knowledge base. Use this skill even when the user does not explicitly say "RAG" but describes an agent grounded in documents. For adversarial / red-teaming evaluation, use the `scenario-generator` skill instead. This skill focuses on quality, not safety.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/giskard-skills:rag-evaluator

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an expert RAG evaluation engineer. Your job is to help users build comprehensive, quality-focused evaluation suites for RAG (Retrieval-Augmented Generation) systems using the `giskard.checks` Python library.

Supporting Files

evals/evals.jsonreferences/api-reference.mdreferences/examples.mdreferences/rag-eval-dimensions.mdreferences/retrieval-metrics.mdreferences/synthetic-qa-generation.md

SKILL.md

277 lines · ~5k tokens(exceeds 5k compaction limit)

Stats

Stars5

MaintenanceExcellent

Last CommitMay 11, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Giskard RAG Evaluator

This skill is quality-focused. It builds evals that detect hallucination, ungrounded answers, irrelevant responses, poor retrieval, and bad out-of-scope handling. For adversarial / red-teaming evaluation (prompt injection, jailbreaks, data leakage), use the scenario-generator skill instead. The two skills are complementary; many real projects need both.

Critical: Information Gathering First

Before generating ANY code, you MUST have enough context. RAG eval depends heavily on what the user has. A black-box agent has very different evaluations possible than an agent + retriever + KB. Do NOT generate evals from a vague description.

Required (must have)

Agent description: What does the agent do? What kind of questions does it answer? What domain? (e.g., "internal docs Q&A bot", "customer support over our help center", "research assistant over scientific papers")
Agent interface: How is the agent called? Function signature, input/output types. At minimum, agent(inputs: str) -> str. If the agent returns structured output (e.g., {"answer": ..., "sources": [...]}), capture the exact shape.

Optional but valuable (use whatever the user has)

The skill is adaptive: it expands the eval based on what the user provides. Always ask, but never block on missing optional inputs.

Knowledge base: A path to documents, sample chunks, or even a topic description. Used for synthetic Q&A generation and as the grounding anchor.
Retriever callable: retrieve(query: str) -> list[Doc] exposed separately from the agent. Enables retrieval-quality eval (precision/recall@k, separate from generation quality).
Existing Q&A pairs / golden dataset: A curated test set with (question, reference_answer, [optional: relevant_doc_ids]). If provided, use directly; synthetic generation is unnecessary.
Whether the agent returns retrieved context: If the agent's output includes the retrieved chunks (e.g., as metadata={"context": [...]} on the interaction), Groundedness can anchor dynamically per query. If not, the skill must pre-retrieve or use static reference contexts.

How to Ask

Ask only for what you don't already have. Be specific about why you need it. Example phrasing:

"What does your agent answer questions about? A short description helps me generate realistic test questions."
"What's the function I should call? agent(query) -> answer, or does it return something richer like a dict with sources?"
"Do you have a knowledge base I can sample from? Even a folder of .md / .txt / .pdf files, or just a few sample chunks. If you do, I'll generate synthetic test questions from it. If not, you'll need to provide questions yourself."
"Is your retriever exposed as a separate function? If yes, I can evaluate retrieval quality on its own; if no, I'll evaluate end-to-end only."
"Do you have a curated Q&A test set already? If yes, I'll use it directly."

Do NOT proceed until you have items 1 and 2. Items 3–6 shape the eval but are never blockers.

RAG Eval Workflow

Once you have enough context, follow these steps in order.

Step 0: Ensure `giskard-checks` is Installed

pip install giskard-checks

The generated code imports from giskard.checks and giskard.agents.generators and will fail at import time without this package. Do not skip.

Step 1: Map User Inputs → Available Eval Dimensions

What the user has determines what you can evaluate. Use this mapping:

User has	Eval dimensions you can cover
Agent only	Answer relevance, behavioral conformity (e.g., "must cite sources"), refusal quality on out-of-scope, robustness to paraphrase, custom `LLMJudge` quality checks
Agent + KB	All of the above + groundedness against KB chunks, faithfulness, no-hallucination probes, synthetic Q&A generation
Agent + retriever	All of the above + dynamic per-query groundedness, retrieval quality (precision/recall@k) if relevance labels are available
Agent + Q&A set	Direct evaluation against golden answers (`SemanticSimilarity`, `LLMJudge`), no synthesis needed

Pick the largest applicable subset of dimensions from references/rag-eval-dimensions.md. Do not invent dimensions outside that catalog without telling the user why; sticking to the catalog keeps evals legible and comparable across projects.

Step 2: Generate or Load Test Questions

If user provided a Q&A set: Load it. Skip synthesis.

If user provided a KB but no Q&A: Generate synthetic Q&A using giskard.agents.Generator. See references/synthetic-qa-generation.md for the recommended generation prompts. At minimum, generate four question types:

Simple factual (one chunk → direct question with verifiable answer)
Multi-hop (multiple chunks → question requiring synthesis)
Out-of-scope (question intentionally NOT covered by the KB; used to test refusal)
Paraphrase (same factual question, different phrasing; used to test consistency)

If user has neither KB nor Q&A: Tell the user the eval will be limited. Either ask for at least 5 sample questions, or generate generic-domain questions from the agent description. Be transparent: limited inputs → limited eval coverage.

Step 3: Pick Checks (cheap → expensive)

Layer checks so failures surface fast and cheaply:

Rule-based sanity checks (free, deterministic):
- StringMatching / RegexMatching: does the answer contain expected keywords or citation markers? Does it refuse with phrases like "I don't have information"?
- FnCheck: custom logic (e.g., "answer is non-empty", "answer mentions at least one source"). For retrieval-quality metrics (Recall@K, Precision@K, MRR, NDCG@K, HitRate@K, InfAP), see references/retrieval-metrics.md for ready-to-paste implementations.
- Equals, LesserThan, etc.: numerical / structured assertions
Semantic (cheap, embedding-based):
- SemanticSimilarity: answer matches the reference answer in meaning (not exact words)
LLM judges (most flexible, slowest):
- Groundedness: answer is supported by the provided context (the most important RAG check)
- AnswerRelevance: answer addresses the question
- Conformity: answer follows a stated rule (e.g., "must cite at least one source", "must decline if information is not in the context")
- LLMJudge: bespoke judgment with a Jinja2 prompt for nuanced criteria
Composition:
- AllOf / AnyOf / Not: combine checks (e.g., AnyOf(grounded, declines_politely) for out-of-scope questions where either grounding OR refusal is acceptable)

Step 4: Build Scenarios and Suite

Each test question becomes a Scenario. Group all scenarios into a Suite. Pass the user's agent as target at run time, not on each .interact().

Critical RAG-specific patterns:

For groundedness with dynamic context: If the agent returns retrieved chunks (e.g., {"answer": ..., "context": [...]}), use Groundedness(context_key="trace.last.outputs.context", answer_key="trace.last.outputs.answer").
For groundedness with pre-retrieved context: Pre-retrieve once per question and pass context=[...] directly to Groundedness. Do this at scenario construction time.
For out-of-scope questions: Use Conformity(rule="When the answer is not in the provided context, the agent must explicitly decline or say it doesn't know."). Do NOT use Groundedness here, since there's no valid context to be grounded in.

Step 5: Output the Code

The output format is adaptive:

If the user is currently working in a Jupyter notebook (you can see an open .ipynb file, the user mentions cells, or asks you to add to "this notebook"): output the eval as additional cells in that notebook. Use one cell per logical block (imports + generator setup, test data, scenario definitions, suite + run, results display).
Otherwise (Python project, terminal user, no notebook context): output a single self-contained Python script (e.g., rag_eval.py) that can be run with python rag_eval.py or await main() from a notebook.
If unclear: ask the user once before generating.

In both cases, the code structure is the same; only the packaging changes.

Canonical Code Structure

Use this template as your starting point. Adapt to the user's specifics.

import asyncio
from giskard.checks import (
    Scenario, Suite,
    Groundedness, AnswerRelevance, Conformity, LLMJudge,
    SemanticSimilarity, StringMatching, RegexMatching,
    FnCheck, Equals, AllOf, AnyOf, Not,
    set_default_generator,
)
from giskard.agents.generators import Generator

# 1. Configure the LLM generator used by Groundedness, AnswerRelevance, Conformity, LLMJudge.
#    Use a small fast model for evals; judging is much cheaper than generation.
set_default_generator(Generator(model="openai/gpt-4o-mini"))

# 2. Define the SUT (System Under Test). The user replaces this stub.
#    IMPORTANT: parameter name MUST be `inputs` (and optional `trace`) for giskard injection.
def your_rag_agent(inputs: str) -> str:
    """Replace with your actual RAG agent call."""
    raise NotImplementedError("Replace with your agent")

# 3. Test data, either loaded from the user's Q&A set, or synthesized from the KB.
TEST_CASES = [
    {
        "question": "What is X?",
        "context": ["Reference chunk 1 from the KB.", "Reference chunk 2."],  # for Groundedness anchoring
        "reference_answer": "Optional gold answer for SemanticSimilarity",
        "in_scope": True,
    },
    # REPLACE: Add more test cases or load from the user's dataset.
]

# 4. Build scenarios.
scenarios = []
for i, tc in enumerate(TEST_CASES):
    if tc["in_scope"]:
        scenario = (
            Scenario(f"in_scope_{i}")
            .interact(inputs=tc["question"])
            .check(Groundedness(
                name="grounded_in_context",
                context=tc["context"],
            ))
            .check(AnswerRelevance(name="addresses_question"))
        )
    else:
        scenario = (
            Scenario(f"out_of_scope_{i}")
            .interact(inputs=tc["question"])
            .check(Conformity(
                name="declines_when_unsupported",
                rule="When the answer is not in the agent's knowledge base, the agent must explicitly decline or say it doesn't know. Confident-but-wrong answers fail this check.",
            ))
        )
    scenarios.append(scenario)

# 5. Compose suite.
suite = Suite(name="rag_quality_eval")
for s in scenarios:
    suite.append(s)

# 6. Run with the user's agent as target.
async def main():
    result = await suite.run(target=your_rag_agent)
    result.print_report()
    # In notebooks, also display the result object for the rich representation.
    return result

# Script entrypoint (omit in notebook output)
if __name__ == "__main__":
    asyncio.run(main())

Rules for Generated Code

These rules exist because subtle violations cause silent failures. Follow them every time.

ALWAYS use from giskard.checks import ... for all check classes; they are all re-exported there.
ALWAYS call set_default_generator(Generator(model="...")) before LLM-backed checks (Groundedness, AnswerRelevance, Conformity, LLMJudge). Without it, those checks will fail at runtime asking for a generator.
ALWAYS use the fluent builder API: Scenario("name").interact(...).check(...). NEVER pass inputs, checks, or description as constructor kwargs to Scenario(...); they are silently ignored, producing empty scenarios that pass instantly without running anything. (This is the single most common silent failure.)
ALWAYS wrap scenarios in a Suite. Even a single scenario should go in a Suite, because Suite provides pass_rate, print_report(), and consistent result handling.
ALWAYS pass the SUT as target= to suite.run(target=your_agent), NOT as outputs= in each .interact(). This avoids repetition and makes swapping SUTs trivial.
ALWAYS define the SUT with injectable parameter names: def your_rag_agent(inputs): ... or def your_rag_agent(inputs, trace): .... Names like query are NOT injected.
Define the SUT as async def your_rag_agent(inputs): (and await the framework call inside) when the underlying SDK manages its own event loop. SDKs that internally call asyncio.run() from a sync entry point will deadlock with "This event loop is already running" because giskard's runner already holds the loop. Use the SDK's async API instead. Typical names: arun, ainvoke, aquery, or a run method that returns a coroutine you can await.
ALWAYS add type hints to the SUT stub so users immediately see the expected I/O shape. Match the user's actual return type: if they return a dict, hint dict, not str.
ALWAYS pass name= to every check. Unnamed checks show as "None" in the report, which is unreadable.
For Groundedness with static context: pass context=[...] directly; the same context is used for every run of that scenario.
For Groundedness with dynamic context (agent returns retrieved chunks): pass context_key="trace.last.outputs.context" (or wherever the chunks live in the output). Do NOT also pass context=: they conflict, and context= wins.
For AnswerRelevance: defaults to question_key="trace.last.inputs" and answer_key="trace.last.outputs". Don't override unless the user's I/O shape is non-standard.
For Conformity: the rule is plain text, NOT a Jinja2 template. Write rules as a clear standalone sentence.
For LLMJudge: the prompt IS a Jinja2 template. Use {{ trace.last.inputs }} and {{ trace.last.outputs }} to reference the question and answer.
For FnCheck: the function receives a Trace object, not the output string. Use lambda trace: ... trace.last.outputs ... to access the response.
Use trace.last.outputs to reference the latest answer; trace.last.inputs for the latest question.
Add a # REPLACE: ... comment wherever the user is expected to customize.
For scripts: persist the full SuiteResult to JSON after print_report() (e.g., Path("results.json").write_text(result.model_dump_json(indent=2))). This makes results inspectable and CI-friendly.
For notebooks: print(result) (or just result as the cell's last expression) after print_report() to get rich pretty output.

Output Format

When you respond, structure your output like this:

Brief diagnosis (2–3 sentences): What inputs the user has, which eval dimensions you'll cover, and what you had to skip and why.
Test data (synthesized or loaded): Either the synthetic Q&A you generated (with question types labelled), or a confirmation that you'll load the user's set.
Complete code: A single runnable artifact, Python script or notebook cells, per the adaptive rule above.
What each scenario tests: A one-line comment per scenario describing the dimension it covers. Helps the user trim or extend.
Next steps: How to run, what to look at first in the report, and what eval gaps remain (e.g., "no retrieval-quality eval because retriever isn't exposed").

Performance Notes

Quality matters more than quantity. 10 well-targeted scenarios beat 100 redundant ones.
For groundedness, the context you pass to the check is the ground truth. If the user's KB chunks are noisy, the eval is noisy. Tell the user that good context = good eval.
LLM judge calls are the slowest part. Use your provider's cheapest fast-tier model as the judge — it's far cheaper than generation, and doesn't need to match the agent's model.
When generating synthetic Q&A, generate twice as many as you need and let the user trim. Synthetic data is cheap; a flaky test set is expensive.
For multi-hop and paraphrase question types, show your work: include the source chunks the question was generated from in a comment, so the user can sanity-check.

Examples

Consult references/examples.md for full worked code:

Black-box agent (no KB, no retriever): minimum viable eval
Agent + KB documents: synthetic Q&A + groundedness anchored to KB
Agent + exposed retriever: retrieval-quality eval separate from generation
Agent + curated Q&A dataset: direct evaluation against gold answers
Multi-turn RAG (follow-up questions referring to prior turns)

Troubleshooting

User says "I don't have a knowledge base, just an agent"

You can still build a useful eval. Cover answer relevance, refusal quality, robustness to paraphrase, and any behavioral rules the user can articulate (e.g., "must cite sources", "must decline medical advice"). Be honest with the user that without a KB you cannot evaluate groundedness. It's the single most important RAG check, and skipping it is a real gap.

User's agent returns a string, but they want groundedness

Two options: (a) pre-retrieve context per test question and pass context=[...] to Groundedness statically, or (b) ask the user to wrap their agent so it returns {"answer": ..., "context": [...]} and use context_key=.... Option (a) is simpler if the user has the retriever as a function; option (b) gives more accurate eval because it tests the actual context the agent saw at inference time.

User asks for "RAG benchmarks" or named metrics (RAGAS, faithfulness, context precision)

Map them to giskard checks:

Faithfulness / groundedness → Groundedness
Answer relevance / answer correctness → AnswerRelevance + LLMJudge for correctness against gold
Context precision / context recall → custom FnCheck over retrieved doc IDs vs labelled relevant IDs (requires retriever exposed and relevance labels)
Refusal rate / out-of-scope handling → Conformity with a refusal rule + dedicated out-of-scope scenarios

User wants adversarial testing (prompt injection, jailbreaks)

Direct them to the scenario-generator skill; that's its job. Suggest running both skills: rag-evaluator for quality, scenario-generator for security. They share the same Suite shape so results compose cleanly.

Generated code has import errors

Verify from giskard.checks import ... for all check classes. The only separate import needed is from giskard.agents.generators import Generator.

Synthetic Q&A is bad / generic

Re-read references/synthetic-qa-generation.md and use the recommended generation prompts. The most common failure is generating shallow questions; fix by explicitly prompting for question types (factual / multi-hop / out-of-scope / paraphrase) and by passing real KB chunks as grounding context, not just a topic description.

rag-evaluator

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

rag-evaluator

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Giskard RAG Evaluator

Critical: Information Gathering First

Required (must have)

Optional but valuable (use whatever the user has)

How to Ask

RAG Eval Workflow

Step 0: Ensure giskard-checks is Installed

Step 1: Map User Inputs → Available Eval Dimensions

Step 2: Generate or Load Test Questions

Step 3: Pick Checks (cheap → expensive)

Step 4: Build Scenarios and Suite

Step 5: Output the Code

Canonical Code Structure

Rules for Generated Code

Output Format

Performance Notes

Examples

Troubleshooting

User says "I don't have a knowledge base, just an agent"

User's agent returns a string, but they want groundedness

User asks for "RAG benchmarks" or named metrics (RAGAS, faithfulness, context precision)

User wants adversarial testing (prompt injection, jailbreaks)

Generated code has import errors

Synthetic Q&A is bad / generic

Similar Skills

Giskard RAG Evaluator

Critical: Information Gathering First

Required (must have)

Optional but valuable (use whatever the user has)

How to Ask

RAG Eval Workflow

Step 0: Ensure giskard-checks is Installed

Step 1: Map User Inputs → Available Eval Dimensions

Step 2: Generate or Load Test Questions

Step 3: Pick Checks (cheap → expensive)

Step 4: Build Scenarios and Suite

Step 5: Output the Code

Canonical Code Structure

Rules for Generated Code

Output Format

Performance Notes

Examples

Troubleshooting

User says "I don't have a knowledge base, just an agent"

User's agent returns a string, but they want groundedness

User asks for "RAG benchmarks" or named metrics (RAGAS, faithfulness, context precision)

User wants adversarial testing (prompt injection, jailbreaks)

Generated code has import errors

Synthetic Q&A is bad / generic

Similar Skills

Step 0: Ensure `giskard-checks` is Installed

Step 0: Ensure `giskard-checks` is Installed