From kraftwork-intel
Run quality evaluations against skills using recorded session interactions. Use when you want to check if a skill is performing well, after modifying a skill, or to identify degrading skills.
How this skill is triggered — by the user, by Claude, or both
Slash command
/kraftwork-intel:intel-evalThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Run evaluations against skills to measure quality using real interaction data.
Run evaluations against skills to measure quality using real interaction data.
bun installed~/.claude/kraftwork-intel/ initializedllama3.2:3b modelOptions:
For a single skill:
bun run ~/.claude/kraftwork-intel/src/cli.ts eval <skill-name>
For all skills:
bun run ~/.claude/kraftwork-intel/src/cli.ts eval --all
For flagged skills (low success rate):
bun run ~/.claude/kraftwork-intel/src/cli.ts eval --flagged
To include Ollama LLM scoring (slower but more nuanced):
bun run ~/.claude/kraftwork-intel/src/cli.ts eval <skill-name> --llm
Format the output as a readable summary. Each eval result includes:
If no interactions found for the requested skill, report that clearly.
Heuristic scorers (always run):
responseLength — penalizes too-short or too-long responsesaskedClarifyingQuestions — brainstorming skill: did it ask questions?followedTDD — TDD skill: did it write tests before implementation?noComments — did code blocks avoid adding comments? (project convention)LLM scorer (opt-in with --llm):
npx claudepluginhub filipeestacio/kraftwork --plugin kraftwork-intelEvaluates and benchmarks Agent Skills using static analysis and A/B testing. Measures activation accuracy, quality scorecards, and description optimization.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Creates evals for skills and runs the benchmark harness to measure whether a skill improves model behavior. Use when testing, benchmarking, or evaluating a skill's quality.