From mycelium
Runs evaluation scenarios to benchmark agent performance via reflexion loops, validates success criteria, records metrics, generates reports, and proposes new evals from logs.
How this skill is triggered — by the user, by Claude, or both
Slash command
/mycelium:eval-runnerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.
Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.
run <category/name>.claude/evals/scenarios/<category>/<name>.yml.claude/evals/results/<timestamp>-<name>.jsonrun-all [category].claude/evals/scenarios/**/*.ymlstatus: retired.claude/evals/pass-history.json with each resultrun-split <optimization|holdout>.claude/evals/scenarios/**/*.ymlsplit field matching the requested setstatus: retired.claude/evals/pass-history.json with each resultreport.claude/evals/results/| Category | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery | ... | ... | ... | |
| delivery | ... | ... | ... | |
| integration | ... | ... | ... | |
| **Overall** | ... | ... | ... | |
prune.claude/evals/pass-history.jsonlast_5 is all-pass (saturated) or all-fail (broken)stale)status: retired in scenario YAML, update pass-history.json, log in .claude/harness/decision-log.mdmineAnalyze audit logs to propose new eval scenarios from observed failure patterns.
.claude/state/change-log.jsonl (last 100 entries).claude/state/diamond-state-audit.jsonl (all entries)session_id, identify:
a. Correction clusters: 3+ edits to same file in one session (agent struggled)
b. Skill friction: edits to .claude/skills/*/SKILL.md during a session (instructions unclear)
c. Missing test coverage: 5+ files changed with no test file editssource: trace-mining and originating session_idSee .claude/evals/schema.md §Trace Mining Heuristics for pattern-to-eval mappings.
See .claude/evals/schema.md for YAML scenario and JSON result formats.
After writing each result JSON (step 8 of run), also update .claude/evals/pass-history.json:
runs and passes (if passed) for the evaltrue/false to last_5 (trim to keep only last 5)last_run timestampPlace YAML files in .claude/evals/scenarios/<category>/. Define task_prompt, success_criteria, and budget. Set split, status, and source fields per schema.
npx claudepluginhub haabe/mycelium --plugin myceliumEvaluates and improves GenAI agent output quality using MLflow's native APIs for datasets, scorers, and tracing. Covers end-to-end evaluation workflow or individual components.
Runs evaluations on ADK agents: writing eval datasets, analyzing failures, comparing results, and optimizing agents using the Quality Flywheel methodology.
Executes skill evaluations against test cases, scores outputs with judges, and reports results. Use when testing a skill, benchmarking, detecting regressions, or verifying changes.