From DeepEval
End-to-end LLM eval workflow: instrument AI agents, chatbots, and RAG pipelines; generate test suites; run evals; iterate on failures; and report to Confident AI.
Slash command: /deepeval:deepeval

Files: LICENSE, references/artifact-contracts.md, references/choose-use-case.md, references/confident-ai.md, references/datasets.md, references/intake.md, references/iteration-loop.md, references/metrics.md, references/pytest-e2e-evals.md, references/synthetic-data.md, references/traced-evals.md, templates/metrics.py, templates/test_multi_turn_e2e.py, templates/test_single_turn_no_tracing.py, templates/test_single_turn_tracing.py

Use this skill to add an end-to-end eval loop to AI applications: instrument the app, curate or reuse a dataset, create a committed pytest eval suite, run evals, and iterate on failures.
Requires Python 3.9+ and pip install deepeval in the target project. Metrics
and synthetic generation need model credentials. Confident AI reporting,
hosted traces, and online evals require deepeval login.
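
The prerequisites above can be checked before any eval work starts. The following is a minimal sketch using only the standard library; the function name `missing_prerequisites` and its two override parameters are hypothetical, added here for illustration and testing. It does not check model credentials or `deepeval login` state.

```python
import importlib.util
import sys


def missing_prerequisites(version_info=None, has_deepeval=None):
    """Return a list of human-readable problems; an empty list means ready.

    Both parameters are hypothetical hooks for testing: leave them as None
    to inspect the running interpreter and environment.
    """
    if version_info is None:
        version_info = sys.version_info
    if has_deepeval is None:
        # find_spec returns None when the package cannot be found.
        has_deepeval = importlib.util.find_spec("deepeval") is not None

    problems = []
    if tuple(version_info[:2]) < (3, 9):
        problems.append("Python 3.9+ is required")
    if not has_deepeval:
        problems.append("deepeval is not installed; run: pip install deepeval")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```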
- […] deepeval generate.
- […] deepeval-tracing skill when traced evals are used.
- […] deepeval test run.
- […] @observe, is handled by the deepeval-tracing skill; raw OpenTelemetry export by the deepeval-otel skill.
- Use deepeval generate for dataset generation. Use deepeval test run for pytest eval execution. Do not default to the raw pytest command.
- […] metrics.py module for committed eval suites.
- […] references/choose-use-case.md.
- […] references/intake.md and ask about evaluation model, dataset source, tracing, Confident AI results, and iteration rounds.
- […] references/pytest-e2e-evals.md.
- […] references/metrics.md.
- […] references/artifact-contracts.md for expected file locations.
- […] templates/test_multi_turn_e2e.py for chatbot / multi-turn agent.
- […] templates/test_single_turn_tracing.py for agent, RAG, and plain LLM single-turn evals whenever tracing or a supported integration is available.
- […] templates/test_single_turn_no_tracing.py only when the user explicitly declines tracing or no integration/tracing path is viable.
- […] templates/metrics.py or the project's existing metrics module, not inline in the eval file.
- […] references/datasets.md.
- […] references/synthetic-data.md.
- […] deepeval generate; do not hand-create or make up goldens.
- […] references/datasets.md.
- […] deepeval-tracing skill (framework integrations and manual @observe).
- […] references/traced-evals.md for the traced eval shapes and span metrics.
- […] Golden input and call assert_test(golden=golden, metrics=[...]).
- […] for golden in dataset.evals_iterator(metrics=[...]).
- […] LLMTestCases.
- […] references/pytest-e2e-evals.md.
- […] next_*_span(metrics=[...]) or @observe(metrics=[...]).
- […] templates/ and replace every placeholder before running anything.
- […] deepeval test run tests/evals/test_<app>.py.
- […] --num-processes 5, --ignore-errors, --skip-on-missing-params, and --identifier.
- […] references/iteration-loop.md for the requested number of rounds.

Bootstrap single-turn goldens from docs only when no curated dataset exists:
deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
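
The "only when no curated dataset exists" condition can be made explicit in a small guard. This is a sketch under a stated assumption: that an existing dataset lives in the output directory under a name beginning with `.dataset`, mirroring the `--file-name .dataset` flag in the command above. The helper name `bootstrap_command` is hypothetical, and the on-disk naming should be verified against `references/artifact-contracts.md`.

```python
import shlex
from pathlib import Path


def bootstrap_command(evals_dir="./tests/evals", docs_dir="./docs"):
    """Return the generate command, or None if a dataset file already exists.

    Assumption: a curated dataset, if present, is stored in evals_dir under
    a name starting with ".dataset".
    """
    evals_path = Path(evals_dir)
    if evals_path.is_dir() and any(evals_path.glob(".dataset*")):
        return None  # reuse the existing goldens instead of regenerating

    args = [
        "deepeval", "generate",
        "--method", "docs",
        "--variation", "single-turn",
        "--documents", docs_dir,
        "--output-dir", evals_dir,
        "--file-name", ".dataset",
    ]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)
```

The function only builds the command string; running it (for example with `subprocess.run`) is left to the caller so the decision stays visible and reviewable.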
Run the eval suite:
deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
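
When iterating over several rounds, the run command can be assembled per round so each run gets a distinct identifier. The sketch below follows the `iterating-on-<purpose>-round-<n>` pattern from the command above; the helper name `build_test_run_command` is hypothetical, and treating `--ignore-errors` and `--skip-on-missing-params` as value-less switches is an assumption to confirm with `deepeval test run --help`.

```python
import shlex


def build_test_run_command(test_file, purpose, round_number, num_processes=5,
                           ignore_errors=False, skip_on_missing_params=False):
    """Build a `deepeval test run` command string for one iteration round."""
    if round_number < 1:
        raise ValueError("round_number starts at 1")

    args = ["deepeval", "test", "run", test_file,
            "--num-processes", str(num_processes)]
    # Assumed to be boolean switches that take no value.
    if ignore_errors:
        args.append("--ignore-errors")
    if skip_on_missing_params:
        args.append("--skip-on-missing-params")

    identifier = f"iterating-on-{purpose}-round-{round_number}"
    args += ["--identifier", identifier]
    return shlex.join(args)


if __name__ == "__main__":
    # Hypothetical test file and purpose, for illustration only.
    for n in (1, 2):
        print(build_test_run_command("tests/evals/test_app.py", "faithfulness", n))
```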
Open the latest hosted report when Confident AI is enabled:
deepeval view
| Topic | File |
|---|---|
| Intake questions and branching | references/intake.md |
| Use case selection | references/choose-use-case.md |
| Dataset loading | references/datasets.md |
| Synthetic data generation | references/synthetic-data.md |
| Metrics | references/metrics.md |
| Pytest E2E evals | references/pytest-e2e-evals.md |
| Traced evals and span metrics | references/traced-evals.md |
| Confident AI | references/confident-ai.md |
| Dataset and eval artifact contracts | references/artifact-contracts.md |
| Iteration loop | references/iteration-loop.md |

| App type | Template |
|---|---|
| Single-turn tracing | templates/test_single_turn_tracing.py |
| Single-turn no tracing | templates/test_single_turn_no_tracing.py |
| Multi-turn E2E | templates/test_multi_turn_e2e.py |
| Shared metric lists | templates/metrics.py |
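
The template choice described earlier (multi-turn apps use the multi-turn template; single-turn apps use the tracing template unless tracing is declined or not viable) reduces to a small decision function. This is an illustrative sketch; the function and parameter names are hypothetical and not part of the skill.

```python
def choose_template(multi_turn, tracing_available, user_declined_tracing=False):
    """Pick the eval template path for an app.

    multi_turn: True for chatbots and multi-turn agents.
    tracing_available: True when tracing or a supported integration is viable.
    user_declined_tracing: True only when the user explicitly opts out.
    """
    if multi_turn:
        return "templates/test_multi_turn_e2e.py"
    if tracing_available and not user_declined_tracing:
        return "templates/test_single_turn_tracing.py"
    # Fallback: tracing was declined or no tracing path is viable.
    return "templates/test_single_turn_no_tracing.py"
```

In every case the metrics themselves belong in templates/metrics.py or the project's existing metrics module rather than in the chosen eval file.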
npx claudepluginhub confident-ai/deepeval --plugin deepeval