From claude-skill-tester
Test and improve Claude Code skills. Use when user says test skill, eval skill, run evals, test my skill, does this skill work, integration test, skill feedback loop, improve skill, fine tune skill. Unit test with eval-skill (LLM-judged assertions). Integration test with claude-skill-tester (Claude-to-Claude conversation against real infrastructure).
How this skill is triggered — by the user, by Claude, or both
Slash command
/claude-skill-tester:test-skillThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You have two testing tools bundled at `$SKILL_DIR/`. Your job is to test a skill, figure out what's wrong, fix it, and confirm the fix works.
You have two testing tools bundled at $SKILL_DIR/. Your job is to test a skill, figure out what's wrong, fix it, and confirm the fix works.
eval-skill.sh — unit tests. Sends test prompts to a Claude with the skill loaded, then a judge checks if assertions pass. Fast, no real infrastructure. Needs evals/evals.json in the skill directory — if it doesn't exist, write one before testing.
bash "$SKILL_DIR/eval-skill.sh" /path/to/skill/
claude-skill-tester.sh — integration tests. Spawns another Claude session with the skill loaded and real credentials. You send it prompts like a user would and see every tool call, agent spawn, and error streamed back. Use this when unit tests pass but you need to see how the skill actually behaves.
bash "$SKILL_DIR/claude-skill-tester.sh" start --dir /path/to/repo --env "KEY=VAL"
bash "$SKILL_DIR/claude-skill-tester.sh" say "natural user prompt"
bash "$SKILL_DIR/claude-skill-tester.sh" say "follow-up if skill asks"
bash "$SKILL_DIR/claude-skill-tester.sh" history
bash "$SKILL_DIR/claude-skill-tester.sh" cost
bash "$SKILL_DIR/claude-skill-tester.sh" kill
Unit tests and integration tests catch different things. Unit tests catch logic — does the skill trigger, does it refuse destructive requests, does it redirect out-of-scope? Integration tests catch architecture — does the DA actually review findings, do agents communicate correctly, does the validator run?
If you're not sure which to run, start with unit tests. They're faster and cheaper. If they pass and the user still reports problems, move to integration.
When a test fails, read the failure carefully. The problem is usually in the skill, not the assertion. Read the skill definition, understand why it produced the wrong behavior, fix it, and rerun. Don't weaken assertions to make them pass — fix the skill.
If the skill doesn't have evals/evals.json, write one. Think about what matters for this specific skill — not a generic checklist, but what would actually go wrong if the skill misbehaves.
{
"skill_name": "my-skill",
"evals": [
{
"id": 1,
"prompt": "Natural prompt a real user would say",
"assertions": [
"What the skill should do",
"What the skill should NOT do"
]
}
]
}
Good assertions test behavior and judgment, not exact wording. "Asks for clarification before acting" not "Says the words 'could you clarify'". Test what matters — if the skill audits security, test that it catches real risks AND that it doesn't cry wolf on intentional exposure.
After an integration test, read the history and the stream. The stream shows raw events — tool calls, errors, agent spawns. Look for patterns:
The history shows the conversation. Read it like you're the user — does the skill's behavior make sense? Would you trust its output? Would you know what to do with the report?
Fix what you find, retest until clean.
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
npx claudepluginhub facets-cloud/claude-skill-tester --plugin claude-skill-tester