From LLM Testing Expert
Use this skill when user needs to design LLM evaluation strategies, build test datasets, or ensure model quality. Trigger keywords: LLM testing, model evaluation, prompt engineering, regression testing, red teaming, hallucination detection, RAG testing, Agent testing, A/B testing, LLM-as-a-Judge. Applicable to quality assurance in model development and application deployment.
How this skill is triggered — by the user, by Claude, or both
Slash command
/llm-testing-expert:llm-testing-expertThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Provide LLM test strategy design, test dataset construction, automated evaluation, and security red teaming recommendations to ensure models meet production requirements for functionality, performance, robustness, and safety.
Provide LLM test strategy design, test dataset construction, automated evaluation, and security red teaming recommendations to ensure models meet production requirements for functionality, performance, robustness, and safety.
{
testTarget: {
type: string // Test object type (base-model/fine-tuned-model/rag-system/agent/prompt-template)
modelInfo?: {
name: string // Model name (e.g., gpt-4, claude-3.5-sonnet)
version?: string // Version
deployment?: string // Deployment method (API/self-hosted)
}
applicationContext?: string // Application scenario (e.g., customer service, code generation, document summarization)
}
testObjectives: {
functional?: boolean // Functional correctness testing
performance?: boolean // Performance testing (latency, throughput, cost)
robustness?: boolean // Robustness testing (adversarial samples, edge cases)
safety?: boolean // Safety testing (jailbreak, injection, PII leakage)
userExperience?: boolean // User experience testing
}
constraints: {
budget?: string // Testing budget (API call cost/manual annotation cost)
timeline?: string // Testing timeline
existingTestAssets?: string[] // Existing test assets (test sets, annotated data)
complianceRequirements?: string // Compliance requirements (GDPR, industry regulations)
}
riskAreas?: string[] // Known risk areas (e.g., hallucination, bias, privacy leakage)
existingMetrics?: string // Existing evaluation metrics and baselines
}
{
testStrategy: {
scope: string // Test scope definition
approach: string // Test approach (black-box/white-box/gray-box)
testLevels: {
unit?: string // Unit testing (single prompt/single tool call)
integration?: string // Integration testing (multi-turn dialogue/tool chain)
system?: string // System testing (end-to-end scenarios)
acceptance?: string // Acceptance testing (user scenario coverage)
}
}
testPlan: {
functional: {
testCases: {
id: string
scenario: string // Test scenario
input: string // Input sample
expectedOutput: string // Expected output (can use fuzzy rules)
passCriteria: string // Pass criteria
}[]
coverage: string[] // Coverage dimensions (instruction following/format output/reasoning ability, etc.)
}
performance: {
metrics: {
name: string // Metric name (latency/throughput/token-cost)
target: string // Target value (e.g., p95 < 2s)
measurement: string // Measurement method
}[]
loadProfile: string // Load model (concurrent users, request pattern)
}
robustness: {
adversarialCases: string[] // Adversarial sample design
edgeCases: string[] // Edge cases
stressScenarios: string[] // Stress scenarios (extra-long input, extreme parameters)
}
safety: {
redTeamScenarios: {
type: string // Attack type (jailbreak/injection/data-extraction)
technique: string // Attack technique
expectedDefense: string // Expected defense measures
}[]
harmfulContentCategories: string[] // Harmful content categories (violence/discrimination/privacy)
}
}
testDataset: {
sources: string[] // Data sources (public benchmarks/domain data/synthetic data)
composition: {
positive: number // Positive case ratio
negative: number // Negative/adversarial case ratio
edge: number // Edge case ratio
}
sampleSize: string // Sample size (calculated by statistical significance)
labelingStrategy: string // Labeling strategy (manual/automated/hybrid)
}
evaluationMethod: {
automated: {
metrics: string[] // Automated metrics (BLEU/ROUGE/exact-match/regex)
tools: string[] // Evaluation tools
}
humanEval: {
criteria: string[] // Human evaluation criteria
raterGuidelines: string // Annotator guidelines
interRaterAgreement: string // Consistency requirement (e.g., Kappa > 0.7)
}
llmAsJudge?: {
judgeModel: string // Judge model
rubric: string // Scoring rules
calibration: string // Calibration method (align with human annotation)
}
}
regressionPlan: {
triggerConditions: string[] // Conditions triggering regression (model update/prompt change)
baselineVersion: string // Baseline version
comparisonMetrics: string[] // Comparison metrics
reportFormat: string // Report format
}
cicdIntegration?: string // CI/CD integration solution
specializedTests?: {
rag?: {
retrievalQuality: string // Retrieval quality testing (recall/ranking)
citationAccuracy: string // Citation accuracy testing
faithfulness: string // Faithfulness testing (based only on retrieved content)
}
agent?: {
toolCallCorrectness: string // Tool call parameter correctness
planningRationality: string // Planning rationality
errorRecovery: string // Error recovery capability
}
}
}
Copy the following checklist before starting, and explicitly mark status after completing each step.
Feedback Loop: If test objectives are unclear or conflicting (e.g., comprehensive coverage with extremely low cost), MUST align priorities with user.
Test Pyramid Principle:
Feedback Loop: If user budget or timeline is extremely limited, prioritize designing high-priority smoke tests rather than pursuing comprehensive coverage.
Case Design Principles:
Feedback Loop: If existing test assets are insufficient, prioritize sampling real cases from production logs rather than fully synthetic data.
Evaluation Method Selection:
Feedback Loop: If automated metrics are inconsistent with manual evaluation, MUST recalibrate or adjust metric weights.
Feedback Loop: If RAG/Agent testing reveals systematic issues (e.g., poor retrieval quality, high tool call failure rate), should return to system design level for optimization, not just adjust testing.
Regression Testing Principles:
Feedback Loop: If regression testing finds performance degradation, MUST analyze root cause (model issue/prompt issue/test set issue), not just rollback directly.
Red Team Testing Methods:
Feedback Loop: If red team testing finds serious vulnerabilities, MUST prioritize fixing and retesting, not cover up or ignore.
deep-learning-expert — training and fine-tuning the models under test.python-expert — building evaluation harnesses, fixtures, and pytest-based regression suites.software-architect — designing observable, A/B-testable LLM application infrastructure.Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
npx claudepluginhub miaoge-ge/coding-agent-skills --plugin llm-testing-expert