Provides LLM and AI testing patterns including mock responses, DeepEval/RAGAS evaluation, structured output validation, and agentic tests (generator, healer, planner). Use for testing AI features and evaluation pipelines.
Slash command: /ork:testing-llm
Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).
| Area | File | Purpose |
|---|---|---|
| Rules | rules/llm-evaluation.md | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| Rules | rules/llm-mocking.md | Mock LLM responses, VCR.py recording, custom request matchers |
| Reference | references/deepeval-ragas-api.md | Full API reference for DeepEval and RAGAS metrics |
| Reference | references/generator-agent.md | Transforms Markdown specs into Playwright tests |
| Reference | references/healer-agent.md | Auto-fixes failing tests (selectors, waits, dynamic content) |
| Reference | references/planner-agent.md | Explores app and produces Markdown test plans |
| Checklist | checklists/llm-test-checklist.md | Complete LLM testing checklist (setup, coverage, CI/CD) |
| Example | examples/llm-test-patterns.md | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |
Mock LLM responses for fast, deterministic unit tests:
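```python
from unittest.mock import AsyncMock, patch

import pytest

# synthesize_findings and sample_findings come from the application under test.


@pytest.fixture
def mock_llm():
    # Awaiting the mock returns a canned payload instead of calling a provider.
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock


@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # Patch the model factory so the code under test receives the mock.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None
```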
Key rule: NEVER call live LLM APIs in CI. Use mocks for unit tests, VCR.py for integration tests.
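Mocks can cover failure paths as well as the happy path. The sketch below assumes a hypothetical `complete_with_retry` helper and a hypothetical `ProviderError` exception (neither is part of this skill) and uses `AsyncMock` with a `side_effect` list so the first call fails and the second succeeds:

```python
import asyncio
from unittest.mock import AsyncMock


class ProviderError(Exception):
    """Hypothetical stand-in for a transient LLM provider failure."""


async def complete_with_retry(model, prompt, attempts=3):
    """Hypothetical helper: call the model, retrying on ProviderError."""
    last_error = None
    for _ in range(attempts):
        try:
            return await model(prompt)
        except ProviderError as exc:
            last_error = exc
    raise last_error


# With a list as side_effect, each await consumes the next item:
# exception instances are raised, any other value is returned.
model = AsyncMock(side_effect=[
    ProviderError("rate limited"),
    {"content": "Mocked response", "confidence": 0.85},
])

result = asyncio.run(complete_with_retry(model, "Summarize the findings"))
assert result["content"] == "Mocked response"
assert model.await_count == 2  # one failed call, one successful retry
```

No network access is involved, so the retry path stays deterministic in CI.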
Validate LLM output quality with multi-dimensional metrics:
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```
| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | >= 0.7 | Response addresses question |
| Faithfulness | >= 0.8 | Output matches context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts relevant |
| Context Recall | >= 0.7 | All relevant contexts retrieved |
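The thresholds in the table point in two directions: most metrics are minimums, while Hallucination is a maximum. A minimal sketch of a quality gate that respects both directions, assuming scores arrive as a plain dict keyed by hypothetical metric names (this is not DeepEval's or RAGAS's own API):

```python
# Thresholds copied from the table above as (bound, direction).
THRESHOLDS = {
    "answer_relevancy": (0.7, "min"),
    "faithfulness": (0.8, "min"),
    "hallucination": (0.3, "max"),
    "context_precision": (0.7, "min"),
    "context_recall": (0.7, "min"),
}


def failing_metrics(scores):
    """Return the sorted names of metrics that miss their threshold."""
    failures = []
    for name, score in scores.items():
        bound, direction = THRESHOLDS[name]
        ok = score >= bound if direction == "min" else score <= bound
        if not ok:
            failures.append(name)
    return sorted(failures)


scores = {"answer_relevancy": 0.91, "faithfulness": 0.75, "hallucination": 0.4}
print(failing_metrics(scores))  # ['faithfulness', 'hallucination']
```

Returning the list of failing metrics, rather than a single boolean, keeps CI failures readable when several dimensions regress at once.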
Always validate LLM output with Pydantic schemas:
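```python
import pytest
from pydantic import BaseModel, Field


class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)


@pytest.mark.asyncio
async def test_structured_output():
    # get_llm_response is the application's own LLM call (mocked in CI).
    result = await get_llm_response("test query")
    # model_validate raises ValidationError on missing or out-of-range fields.
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0
```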
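Schema validation is only useful if malformed output is actually rejected, so the negative path deserves a test too. The sketch below assumes Pydantic v2 and repeats the `LLMResponse` model so it runs on its own; `validation_error_fields` is a hypothetical helper added for illustration:

```python
from pydantic import BaseModel, Field, ValidationError


class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)


def validation_error_fields(payload):
    """Return the sorted field names that fail validation (empty if valid)."""
    try:
        LLMResponse.model_validate(payload)
    except ValidationError as exc:
        return sorted({str(err["loc"][0]) for err in exc.errors()})
    return []


# An empty answer and an out-of-range confidence should both be rejected.
bad_payload = {"answer": "", "confidence": 1.7}
assert validation_error_fields(bad_payload) == ["answer", "confidence"]
```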
Record and replay LLM API calls for deterministic integration tests:
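```python
import os

import pytest


@pytest.fixture(scope="module")
def vcr_config():
    return {
        # Replay only in CI; record new interactions locally.
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        # Keep credentials out of recorded cassettes.
        "filter_headers": ["authorization", "x-api-key"],
    }


@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration():
    # llm_client is the application's own client.
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()
```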
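Replay can fail when request bodies carry values that change on every run. A custom request matcher is one way around that. The sketch below is illustrative only: the volatile field names are assumptions, and `SimpleNamespace` objects stand in for the request objects VCR.py would pass to the matcher.

```python
import json
from types import SimpleNamespace

# Hypothetical fields that differ between recording and replay.
VOLATILE_KEYS = {"request_id", "timestamp"}


def _stable_body(request):
    """Parse the JSON body and drop the volatile fields."""
    body = request.body or b"{}"
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    payload = json.loads(body)
    return {k: v for k, v in payload.items() if k not in VOLATILE_KEYS}


def llm_body_matcher(r1, r2):
    """Match two requests when their JSON bodies agree outside VOLATILE_KEYS."""
    return _stable_body(r1) == _stable_body(r2)


# Stand-ins for a recorded and a live request.
recorded = SimpleNamespace(body=b'{"prompt": "Say hello", "request_id": "a1"}')
live = SimpleNamespace(body=b'{"prompt": "Say hello", "request_id": "b2"}')
assert llm_body_matcher(recorded, live)
```

In VCR.py a function like this is registered on a `vcr.VCR` instance with `register_matcher(name, fn)` and then named in `match_on`. How you reach that instance depends on the pytest plugin in use.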
The three-agent pattern for end-to-end test automation:
Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)
Planner (references/planner-agent.md): Explores your app, produces Markdown test plans from PRDs or natural language requests. Requires seed.spec.ts for app context.
Generator (references/generator-agent.md): Converts Markdown specs into Playwright tests. Actively validates selectors against the running app. Uses semantic locators (getByRole, getByLabel, getByText).
Healer (references/healer-agent.md): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.
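The cap of three healing attempts amounts to a bounded repair loop. The sketch below is not the Healer agent's implementation: `run_test` and `apply_fix` are hypothetical callables, and only the retry budget is taken from the description above.

```python
MAX_HEALING_ATTEMPTS = 3  # limit stated above


def heal(run_test, apply_fix, max_attempts=MAX_HEALING_ATTEMPTS):
    """Re-run a failing test, patching between runs, within a fix budget.

    run_test() returns None on success or a failure description.
    apply_fix(failure) patches the test. Both are hypothetical callables.
    Returns the number of fixes applied; raises once the budget is spent.
    """
    for attempt in range(max_attempts + 1):
        failure = run_test()
        if failure is None:
            return attempt
        if attempt < max_attempts:
            apply_fix(failure)
    raise RuntimeError(
        f"still failing after {max_attempts} healing attempts: {failure}"
    )


# Simulated run: two failures, then a pass.
failures = iter(["selector not found", "timeout waiting for element", None])
applied = []
fixes = heal(lambda: next(failures), applied.append)
assert fixes == 2
```

Raising once the budget is spent surfaces a persistent failure to a human instead of letting the loop patch indefinitely.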
For every LLM integration, work through the coverage paths in checklists/llm-test-checklist.md, which holds the complete checklist.
| Anti-Pattern | Correct Approach |
|---|---|
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set < 1s timeout in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only is not None | Schema validation + quality metrics |
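The timeout row can be exercised without a real provider. A minimal sketch, assuming a hypothetical `complete_or_fallback` wrapper and a stand-in coroutine that never finishes in time; the 0.05 s budget is an arbitrary value below the 1 s guideline:

```python
import asyncio


async def slow_llm(prompt):
    """Stand-in for a hung provider call."""
    await asyncio.sleep(5)
    return {"content": "too late"}


async def complete_or_fallback(model, prompt, timeout=0.05):
    """Hypothetical wrapper: return a fallback when the model is too slow."""
    try:
        return await asyncio.wait_for(model(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        return {"content": None, "error": "timeout"}


result = asyncio.run(complete_or_fallback(slow_llm, "Say hello"))
assert result == {"content": None, "error": "timeout"}
```

`asyncio.wait_for` cancels the slow call when the budget expires, so the test finishes in milliseconds rather than waiting out the simulated delay.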
- ork:testing-unit - Unit testing fundamentals, AAA pattern
- ork:testing-integration - Integration testing for AI pipelines
- ork:golden-dataset - Evaluation dataset management

npx claudepluginhub yonatangross/orchestkit --plugin ork