From dstoic
Scaffolds pytest smoke tests and runs behavioral tests for Claude Code skills in a Docker harness. Generates golden files, runs pytest, and reports LLM verdicts and costs.
How this skill is triggered — by the user, by Claude, or both
Slash command
/dstoic:test-skill
Model: sonnet
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Scaffold + run behavioral tests for a skill in the Docker test harness.
CRITICAL: After EVERY AskUserQuestion call, check if answers are empty/blank. Known Claude Code bug: outside Plan Mode, AskUserQuestion silently returns empty answers without showing UI.
If answers are empty: DO NOT proceed with assumptions. Instead:
From $ARGUMENTS: skill_name (positional, kebab-case), --run-only, --scaffold-only.
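The argument handling above could be sketched in Python roughly as follows. This is an illustration only: the function name `parse_arguments` and the kebab-case pattern (lowercase letters and digits separated by single hyphens) are assumptions, not part of the skill.

```python
import argparse
import re
import shlex

# Assumption: "kebab-case" means lowercase alphanumerics joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def parse_arguments(raw: str) -> argparse.Namespace:
    """Split the raw $ARGUMENTS string into skill_name and the two flags."""
    parser = argparse.ArgumentParser(prog="test-skill")
    parser.add_argument("skill_name")
    parser.add_argument("--run-only", action="store_true")
    parser.add_argument("--scaffold-only", action="store_true")
    args = parser.parse_args(shlex.split(raw))
    if not KEBAB_CASE.match(args.skill_name):
        raise ValueError(f"skill_name must be kebab-case: {args.skill_name!r}")
    return args
```

For example, `parse_arguments("my-skill --run-only")` yields `skill_name="my-skill"`, `run_only=True`, `scaffold_only=False` (`my-skill` is a hypothetical skill name).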
Derive: snake_name = skill_name with each - replaced by _, test_file = test/tests/test_{snake_name}.py, golden_file = test/fixtures/golden/{skill_name}-smoke.md.
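As a minimal sketch of that derivation, assuming a hypothetical helper named `derive_paths` (the path templates are taken from the line above, the helper itself is not part of the skill):

```python
def derive_paths(skill_name: str) -> dict:
    """Derive snake_name, test_file and golden_file from a kebab-case skill name."""
    snake_name = skill_name.replace("-", "_")
    return {
        "snake_name": snake_name,
        # The test file uses the snake_case name ...
        "test_file": f"test/tests/test_{snake_name}.py",
        # ... while the golden file keeps the original kebab-case name.
        "golden_file": f"test/fixtures/golden/{skill_name}-smoke.md",
    }
```

With the hypothetical skill name `my-skill`, this gives `test/tests/test_my_skill.py` and `test/fixtures/golden/my-skill-smoke.md`.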
1. Validate — dstoic/skills/{skill_name}/SKILL.md must exist. Error if not.
2. Scaffold test (skip if --run-only) — If test_file exists → skip to 4. Otherwise AskUserQuestion:
Generate test_file from template in reference.md. Derive prompt from scenario.
3. Scaffold golden (skip if --run-only) — If golden_file exists → skip. Read source SKILL.md → generate simplified version: frontmatter (name+description) + minimal body. Under 250 tokens.
4. Run (skip if --scaffold-only)
docker compose -f test/docker-compose.test.yml run --rm skill-tester pytest tests/test_{snake_name}.py -v -s
5. Report — Parse test/output/{snake_name}_smoke.yaml: status, judge verdict/reason, cost USD. Also show latest trace file test/output/{snake_name}_smoke_trace_*.yaml.
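Steps 4 and 5 could be driven from Python along these lines. This is a sketch under assumptions: `build_run_command` and `latest_trace` are hypothetical helper names, and "latest" is taken to mean the most recently modified file, which the skill does not specify. Parsing the YAML report itself is left out because its key names are not given here.

```python
from pathlib import Path
from typing import List, Optional


def build_run_command(snake_name: str) -> List[str]:
    """Build the step 4 docker compose invocation as an argument list."""
    return [
        "docker", "compose", "-f", "test/docker-compose.test.yml",
        "run", "--rm", "skill-tester",
        "pytest", f"tests/test_{snake_name}.py", "-v", "-s",
    ]


def latest_trace(snake_name: str, output_dir: str = "test/output") -> Optional[Path]:
    """Return the newest trace file for the skill, or None if there is none.

    Assumption: "latest" means highest modification time.
    """
    traces = list(Path(output_dir).glob(f"{snake_name}_smoke_trace_*.yaml"))
    return max(traces, key=lambda p: p.stat().st_mtime, default=None)
```

The argument list can be passed to `subprocess.run(build_run_command(snake_name))` from the repository root, so no shell quoting is needed.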
npx claudepluginhub digital-stoic-org/agent-skills --plugin dstoic