From ork
Runs isolated claude --bare mode calls for skill evaluation, output grading, trigger testing, and prompt benchmarking without plugin interference.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ork:bare-evalThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Run `claude -p --bare` for fast, clean eval/grading without plugin overhead.
Run claude -p --bare for fast, clean eval/grading without plugin overhead.
CC 2.1.81 required. The --bare flag skips hooks, LSP, plugin sync, and skill directory walks.
-p call that doesn't need plugins--plugin-dir)# --bare requires ANTHROPIC_API_KEY (OAuth/keychain disabled)
export ANTHROPIC_API_KEY="sk-ant-..."
# Verify CC version
claude --version # Must be >= 2.1.81
| Call Type | Command Pattern |
|---|---|
| Grading | claude -p "$prompt" --bare --max-turns 1 --output-format text |
| Trigger | claude -p "$prompt" --bare --json-schema "$schema" --output-format json |
| Optimize | echo "$prompt" | claude -p --bare --max-turns 1 --output-format text |
| Force-skill | claude -p "$prompt" --bare --print --append-system-prompt "$content" |
Load detailed patterns and examples:
Read("${CLAUDE_SKILL_DIR}/references/invocation-patterns.md")
JSON schemas for structured eval output:
Read("${CLAUDE_SKILL_DIR}/references/grading-schemas.md")
OrchestKit's eval scripts (npm run eval:skill) auto-detect bare mode:
# eval-common.sh detects ANTHROPIC_API_KEY → sets BARE_MODE=true
# Scripts add --bare to all non-plugin calls automatically
Bare calls: Trigger classification, force-skill, baseline, all grading.
Never bare: run_with_skill (needs plugin context for routing tests).
| Scenario | Without --bare | With --bare | Savings |
|---|---|---|---|
| Single grading call | ~3-5s startup | ~0.5-1s | 2-4x |
| Trigger (per prompt) | ~3-5s | ~0.5-1s | 2-4x |
| Full eval (50 calls) | ~150-250s overhead | ~25-50s | 3-5x |
Read("${CLAUDE_SKILL_DIR}/rules/_sections.md")
Read("${CLAUDE_SKILL_DIR}/references/troubleshooting.md")
eval:skill npm script — unified skill evaluation runnereval:trigger — trigger accuracy testingeval:quality — A/B quality comparisonoptimize-description.sh — iterative description improvementdoctor/references/version-compatibility.mdnpx claudepluginhub yonatangross/orchestkit --plugin orkRuns evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Runs evaluations for skills across baseline, variant, quality, and trigger modes to benchmark and validate behavior.
Tests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.