ClaudePluginHub

Community directory for discovering and installing Claude Code plugins.


© 2025 ClaudePluginHub

Community Maintained · Not affiliated with Anthropic


skill-selection-evals | crucible


Skill

skill-selection-evals

Contains evaluation data for measuring skill selection accuracy, including direct, negative, context-dependent, and cascade-ordering tests.

developer-tools

Popularity

Stars: 10
Forks: 2

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/crucible:skill-selection-evals

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.

Supporting Files

GRADING.md
evals/evals.json
scripts/run_selection_eval.py

SKILL.md

38 lines · ~433 tokens

Stats

Language: Python
Stars: 10
Forks: 2
Maintenance: Excellent
Last Commit: Jun 17, 2026


Skill-Selection Evals

This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.

Purpose

Crucible's 49 execution evals measure quality once a skill is invoked. Selection evals measure whether the right skill gets invoked in the first place.

Eval Types

Direct selection: Given a prompt, does the agent pick the correct skill?
Negative selection: Given a prompt that sounds like skill X but is not, does the agent avoid the false positive?
Context-dependent: When the same verb appears in different contexts, does the agent pick the skill that fits each context?
Cascade ordering: For a task that needs several skills, does the agent invoke them in the correct order?
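
The four eval types above could be graded along the following lines. This is a minimal sketch, not the grader that ships with Crucible: the SelectionEval record, its field names, the passes function, and the prompts are all hypothetical, and the real schema is whatever evals/evals.json defines. Only the skill names are taken from this document.

```python
from dataclasses import dataclass, field


@dataclass
class SelectionEval:
    """Hypothetical eval record; the real schema lives in evals/evals.json."""
    prompt: str
    kind: str  # "direct", "negative", "context", or "cascade"
    expected: list[str] = field(default_factory=list)   # skills that should be chosen, in order
    forbidden: list[str] = field(default_factory=list)  # skills that must not be chosen


def passes(case: SelectionEval, selected: list[str]) -> bool:
    """Return True if the agent's selected skills satisfy the eval case."""
    # Any forbidden skill is a false positive, whatever the eval type.
    if any(skill in case.forbidden for skill in selected):
        return False
    if case.kind == "cascade":
        # Cascade ordering: the full sequence must match, in order.
        return selected == case.expected
    if case.kind == "negative" and not case.expected:
        # Pure negative case: avoiding the forbidden skill is enough.
        return True
    # Direct and context-dependent: the first selected skill must be the expected one.
    return selected[:1] == case.expected[:1]


# Invented prompts, for illustration only.
direct = SelectionEval("Confirm the fix actually works.", "direct", expected=["verify"])
negative = SelectionEval("Audit my calendar for conflicts.", "negative", forbidden=["audit"])
cascade = SelectionEval(
    "Plan the feature, then build it test-first.",
    "cascade",
    expected=["planning", "test-driven-development"],
)

print(passes(direct, ["verify"]))                                # True
print(passes(negative, ["audit"]))                               # False
print(passes(cascade, ["test-driven-development", "planning"]))  # False
```

Treating the context-dependent type like a direct case is a simplification: in this sketch the context is assumed to be part of the prompt, so two cases sharing a verb simply carry different expected skills.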

Boundaries Tested

test-methodology — TDD vs test-coverage vs adversarial-tester
review-direction — temper vs review-feedback
adversarial-scope — red-team vs inquisitor vs audit vs siege
completion-claims — verify vs finish
bug-handling — debugging vs verify vs audit
build-vs-raw-dispatch — build (full idea→PR pipeline) vs a single-skill dispatch (planning, test-driven-development, …)
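
One way to see which side of a boundary the agent leans toward is to tally the wrong picks that stay inside the same boundary. The sketch below assumes results arrive as (expected, selected) pairs, which is an assumption about the runner's output rather than a documented format; the confusions helper is hypothetical. The build-vs-raw-dispatch boundary is left out because its member list is elided in the source.

```python
from collections import Counter

# Boundary names and member skills copied from the list above.
# build-vs-raw-dispatch is omitted: its member list is elided in the source.
BOUNDARIES = {
    "test-methodology": {"TDD", "test-coverage", "adversarial-tester"},
    "review-direction": {"temper", "review-feedback"},
    "adversarial-scope": {"red-team", "inquisitor", "audit", "siege"},
    "completion-claims": {"verify", "finish"},
    "bug-handling": {"debugging", "verify", "audit"},
}


def confusions(boundary, results):
    """Count (expected, selected) pairs where the agent picked the wrong
    skill from within the given boundary.

    results is an iterable of (expected_skill, selected_skill) pairs;
    this shape is an assumption, not the runner's documented output.
    """
    members = BOUNDARIES[boundary]
    counts = Counter()
    for expected, selected in results:
        if expected in members and selected in members and expected != selected:
            counts[(expected, selected)] += 1
    return counts


# Made-up results, for illustration only.
sample = [
    ("verify", "finish"),
    ("verify", "verify"),
    ("finish", "verify"),
    ("verify", "finish"),
    ("verify", "debugging"),  # wrong, but outside completion-claims
]
print(confusions("completion-claims", sample))
```

A lopsided tally, such as verify being mistaken for finish far more often than the reverse, points at which skill description is over-claiming.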

Difficulty Ratings

Each eval is rated easy, medium, or hard according to how ambiguous the routing decision is. The ratings enable stratified baseline measurement: they separate improvements that lift hard cases (high value) from results that merely confirm that easy cases already work (low signal).
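
Stratified measurement can be sketched as accuracy computed per difficulty tier, compared between a baseline run and a candidate run. The function names and the (difficulty, passed) input shape are assumptions made for illustration, and the numbers are invented; GRADING.md defines the actual protocol.

```python
from collections import defaultdict


def accuracy_by_difficulty(results):
    """Map each difficulty tier to its pass rate.

    results is an iterable of (difficulty, passed) pairs; this shape
    is a hypothetical stand-in for the real eval output.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for difficulty, ok in results:
        totals[difficulty] += 1
        passed[difficulty] += int(ok)
    return {d: passed[d] / totals[d] for d in totals}


def lift(baseline, candidate):
    """Per-tier accuracy change from baseline to candidate."""
    return {d: candidate[d] - baseline[d] for d in baseline if d in candidate}


# Invented results, for illustration only.
baseline = accuracy_by_difficulty([
    ("easy", True), ("easy", True),
    ("hard", True), ("hard", False), ("hard", False), ("hard", False),
])
candidate = accuracy_by_difficulty([
    ("easy", True), ("easy", True),
    ("hard", True), ("hard", True), ("hard", True), ("hard", False),
])

print(lift(baseline, candidate))  # {'easy': 0.0, 'hard': 0.5}
```

In this made-up run the overall pass rate moves from 3/6 to 5/6, but the per-tier view shows that the entire gain comes from hard cases, which is the signal the difficulty ratings are meant to expose.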

See Also

evals/evals.json — the eval data
GRADING.md — grading criteria and baseline measurement protocol

$ npx claudepluginhub raddue/crucible
