Skill

deepeval

End-to-end LLM eval workflow: instrument AI agents, chatbots, and RAG pipelines; generate test suites; run evals; iterate on failures; and report to Confident AI.

Python

Pytest

ai-ml

testing

Popularity

Stars

16,267

Forks

1,544

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/deepeval:deepeval

User invocable

Model invocable

Inline context

Default effort

Supporting Files

LICENSE
references/artifact-contracts.md
references/choose-use-case.md
references/confident-ai.md
references/datasets.md
references/intake.md
references/iteration-loop.md
references/metrics.md
references/pytest-e2e-evals.md
references/synthetic-data.md
references/traced-evals.md
templates/metrics.py
templates/test_multi_turn_e2e.py
templates/test_single_turn_no_tracing.py
templates/test_single_turn_tracing.py

SKILL.md

179 lines · ~2.2k tokens

Stats

Language: Python

Stars: 16,267

Forks: 1,544

Maintenance: Excellent

Last Commit: Jun 18, 2026

DeepEval

Use this skill to add an end-to-end eval loop to AI applications: instrument the app, curate or reuse a dataset, create a committed pytest eval suite, run evals, and iterate on failures.

Prerequisites

Requires Python 3.9+ with deepeval installed in the target project (pip install deepeval). Metrics and synthetic generation need model credentials. Confident AI reporting, hosted traces, and online evals require running deepeval login first.
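
The first two prerequisites can be checked before any eval work starts. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and its injectable parameters are hypothetical and exist only to make the check easy to exercise. It does not verify model credentials or the `deepeval login` state.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the prerequisites


def check_prerequisites(version_info=sys.version_info,
                        find_spec=importlib.util.find_spec):
    """Return a list of human-readable problems; an empty list means OK.

    Both parameters are injectable (a hypothetical convenience) so the
    check can be tried without changing the real environment.
    """
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python 3.9+ is required")
    # find_spec returns None when the package cannot be imported.
    if find_spec("deepeval") is None:
        problems.append("deepeval is not installed; run: pip install deepeval")
    return problems


# Report problems for the interpreter running this snippet.
for problem in check_prerequisites():
    print(problem)
```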

Workflow Summary

Inspect the target app and existing DeepEval usage.
Ask the required intake questions.
Reuse existing metrics and datasets when available.
Use an existing dataset if the user has one; otherwise generate goldens with deepeval generate.
Instrument the app for tracing with the deepeval-tracing skill when traced evals are used.
Run deepeval test run.
Iterate for the requested number of rounds, defaulting to 5.

Core Principles

Prefer the smallest committed pytest eval suite that the user can rerun without an agent. Do not hide goldens or tests in throwaway scripts.
Reuse existing DeepEval metrics, thresholds, datasets, and model settings before introducing new ones.
Prefer traced single-turn evals when the app can be instrumented. Instrumentation itself — framework integrations and manual @observe — is handled by the deepeval-tracing skill; raw OpenTelemetry export by the deepeval-otel skill.
Use deepeval generate for dataset generation. Use deepeval test run for pytest eval execution. Do not default to the raw pytest command.
Keep metrics in a separate metrics.py module for committed eval suites.
Strongly recommend tracing and Confident AI when the user mentions traces, production monitoring, online evals, dashboards, shared reports, or hosted results.
Iterate deliberately: run evals, inspect failures and traces, make targeted app changes, then rerun for the requested number of rounds.

Required Workflow

Inspect the codebase for app type and existing DeepEval usage.
- For classification guidance, read references/choose-use-case.md.
- Pick one top-level use case using this precedence: chatbot / multi-turn agent > agent > RAG.
- If an app is both RAG and agentic, treat it as agent. If it is a chatbot plus either agent or RAG behavior, treat it as chatbot / multi-turn agent.
- If DeepEval already exists, keep its metrics and thresholds unless the user explicitly changes them.
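
The precedence rule above can be written down as a small decision function. This is an illustrative sketch only: `pick_use_case` and its boolean flags are hypothetical names, and apps that match none of the three categories are left to references/choose-use-case.md.

```python
from typing import Optional

# Highest precedence first, as stated in the workflow.
PRECEDENCE = ("chatbot / multi-turn agent", "agent", "RAG")


def pick_use_case(is_chatbot: bool, is_agent: bool, is_rag: bool) -> Optional[str]:
    """Pick exactly one top-level use case for an app.

    The flags are hypothetical inputs a reviewer would fill in after
    inspecting the codebase.
    """
    if is_chatbot:
        # Chatbot wins even when agent or RAG behavior is also present.
        return PRECEDENCE[0]
    if is_agent:
        # An app that is both RAG and agentic is treated as agent.
        return PRECEDENCE[1]
    if is_rag:
        return PRECEDENCE[2]
    return None  # not covered by the precedence rule


print(pick_use_case(is_chatbot=False, is_agent=True, is_rag=True))  # agent
```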
Ask the intake questions before editing application code.
- Read references/intake.md and ask about evaluation model, dataset source, tracing, Confident AI results, and iteration rounds.
Choose test shape, metrics, and artifacts.
- Read references/pytest-e2e-evals.md.
- Read references/metrics.md.
- Read references/artifact-contracts.md for expected file locations.
- Use templates/test_multi_turn_e2e.py for chatbot / multi-turn agent.
- Use templates/test_single_turn_tracing.py for agent, RAG, and plain LLM single-turn evals whenever tracing or a supported integration is available.
- Use templates/test_single_turn_no_tracing.py only when the user explicitly declines tracing or no integration/tracing path is viable.
- Put metric instances in templates/metrics.py or the project's existing metrics module, not inline in the eval file.
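
As a rough illustration of the template rules in this step, the sketch below maps a use case and the tracing situation to one of the three test templates. The function name and parameters are hypothetical; only the template paths and the selection rules come from the text above.

```python
MULTI_TURN = "templates/test_multi_turn_e2e.py"
SINGLE_TURN_TRACING = "templates/test_single_turn_tracing.py"
SINGLE_TURN_NO_TRACING = "templates/test_single_turn_no_tracing.py"


def choose_template(use_case: str,
                    user_declined_tracing: bool = False,
                    tracing_path_viable: bool = True) -> str:
    """Return the starting template for the eval suite.

    `use_case` is expected to be "chatbot / multi-turn agent", "agent",
    "RAG", or a plain LLM label; anything that is not the chatbot case
    is treated as single-turn.
    """
    if use_case == "chatbot / multi-turn agent":
        return MULTI_TURN
    # The no-tracing template is the fallback, not the default.
    if user_declined_tracing or not tracing_path_viable:
        return SINGLE_TURN_NO_TRACING
    return SINGLE_TURN_TRACING


print(choose_template("RAG"))  # templates/test_single_turn_tracing.py
```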
Prepare the dataset.
- For existing datasets, read references/datasets.md.
- For synthetic data, read references/synthetic-data.md.
- First ask whether the user already has a dataset.
- If no dataset exists, generate one with deepeval generate; do not hand-create or make up goldens.
- Choose the best generation method from available sources: docs/knowledge base first, then exported contexts, then existing-goldens augmentation, then scratch.
- Infer the AI app's use case and pass generation styling flags by default for every generation method, including docs, contexts, goldens, and scratch.
- Target about 30-50 generated goldens for a useful first eval dataset.
- For chatbot / multi-turn agent use cases, use multi-turn conversational goldens unless the user explicitly asks for QA pairs as a temporary testing shortcut.
- For local or Confident AI datasets, follow references/datasets.md.
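
The source-priority rule for synthetic generation (docs, then exported contexts, then existing goldens, then scratch) could be sketched as follows. The function and its flags are hypothetical; the returned strings follow the method names used in this step, of which only `docs` is confirmed as a `--method` value by the commands in this document.

```python
def choose_generation_method(has_docs: bool,
                             has_contexts: bool,
                             has_goldens: bool) -> str:
    """Return the preferred generation method for the available sources."""
    if has_docs:
        return "docs"      # docs / knowledge base come first
    if has_contexts:
        return "contexts"  # exported contexts
    if has_goldens:
        return "goldens"   # augment existing goldens
    return "scratch"       # nothing to ground on


def is_useful_first_dataset(golden_count: int) -> bool:
    """True when the count falls in the suggested 30-50 range."""
    return 30 <= golden_count <= 50


print(choose_generation_method(has_docs=False, has_contexts=True, has_goldens=True))  # contexts
```

This only applies once the user has confirmed that no dataset exists; an existing dataset always takes priority over generation.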
Instrument the app and choose the traced eval shape.
- Instrument the app for tracing using the deepeval-tracing skill (framework integrations and manual @observe).
- Read references/traced-evals.md for the traced eval shapes and span metrics.
- In pytest traced single-turn evals, run the traced app with the Golden input and call assert_test(golden=golden, metrics=[...]).
- In script-based traced single-turn evals, use for golden in dataset.evals_iterator(metrics=[...]).
- Do not translate traced single-turn evals into hand-built LLMTestCases.
- Add component/span-level metrics only where diagnostics are useful.
Create the pytest eval suite.
- Read references/pytest-e2e-evals.md.
- Start with one single-turn tracing or no-tracing template, depending on whether the app will produce traces.
- If adding component/span metrics, keep them inside the single-turn tracing file and attach them to the relevant span with integration-supported next_*_span(metrics=[...]) or @observe(metrics=[...]).
- Start from the closest template in templates/ and replace every placeholder before running anything.
Run and iterate.
- Use deepeval test run tests/evals/test_<app>.py.
- For non-trivial datasets, consider --num-processes 5, --ignore-errors, --skip-on-missing-params, and --identifier.
- Follow references/iteration-loop.md for the requested number of rounds.
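
To keep run identifiers consistent across rounds, the command line for each round can be assembled programmatically. The sketch below only builds argument lists and does not execute anything; the helper names and the `purpose` parameter are hypothetical, while the flags and the identifier pattern are the ones shown in this document.

```python
import shlex

DEFAULT_ROUNDS = 5  # default number of iteration rounds


def build_test_run_command(test_file, purpose, round_number,
                           num_processes=5,
                           ignore_errors=False,
                           skip_on_missing_params=False):
    """Build the argv list for one `deepeval test run` round."""
    cmd = [
        "deepeval", "test", "run", test_file,
        "--num-processes", str(num_processes),
        "--identifier", f"iterating-on-{purpose}-round-{round_number}",
    ]
    if ignore_errors:
        cmd.append("--ignore-errors")
    if skip_on_missing_params:
        cmd.append("--skip-on-missing-params")
    return cmd


def plan_rounds(test_file, purpose, rounds=DEFAULT_ROUNDS):
    """Return one command per round, numbered from 1."""
    return [build_test_run_command(test_file, purpose, n)
            for n in range(1, rounds + 1)]


# Print the planned commands; "test_app.py" and "relevancy" are placeholders.
for command in plan_rounds("tests/evals/test_app.py", "relevancy"):
    print(shlex.join(command))
```

In practice each round is run on its own, with failures and traces inspected and targeted app changes made before the next command is executed, so the list is a plan rather than a batch script.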

Common Commands

Bootstrap single-turn goldens from docs only when no curated dataset exists:

deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset

Run the eval suite:

deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"

Open the latest hosted report when Confident AI is enabled:

deepeval view

References

Topic	File
Intake questions and branching	`references/intake.md`
Use case selection	`references/choose-use-case.md`
Dataset loading	`references/datasets.md`
Synthetic data generation	`references/synthetic-data.md`
Metrics	`references/metrics.md`
Pytest E2E evals	`references/pytest-e2e-evals.md`
Traced evals and span metrics	`references/traced-evals.md`
Confident AI	`references/confident-ai.md`
Dataset and eval artifact contracts	`references/artifact-contracts.md`
Iteration loop	`references/iteration-loop.md`

Templates

Eval shape	Template
Single-turn tracing	`templates/test_single_turn_tracing.py`
Single-turn no tracing	`templates/test_single_turn_no_tracing.py`
Multi-turn E2E	`templates/test_multi_turn_e2e.py`
Shared metric lists	`templates/metrics.py`
