From abstract
Tests skills via RED/GREEN/REFACTOR TDD with fresh subagents to validate behavior and prevent priming bias. Use for skill validation and isolation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/abstract:subagent-testingThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.
Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.
Fresh instances prevent priming: Each test uses a new Claude conversation to verify the skill's impact is measured, not conversation history effects.
Running tests in the same conversation creates bias:
Three-phase TDD-style approach:
Test without skill to establish baseline behavior.
Test with skill loaded to measure improvements.
Test skill's anti-rationalization guardrails.
# 1. Create baseline tests (without skill)
# Use 5 diverse scenarios
# Document full responses
# 2. Create with-skill tests (fresh instances)
# Load skill explicitly
# Use identical prompts
# Compare to baseline
# 3. Create rationalization tests
# Test anti-rationalization patterns
# Verify guardrails work
For complete testing patterns, examples, and templates:
npx claudepluginhub athola/claude-night-market --plugin abstractTests Claude Code skills before deployment using TDD RED-GREEN-REFACTOR: baseline failures without skill, write fixes, iterate pressure scenarios to resist rationalizations.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.