From grok-workflows
Lightweight eval harness — runs the SAME task N independent ways, each in its own isolated git worktree and fresh context window, then grades the candidates with separate evaluator agents (per-candidate rubric scoring + a pairwise tournament) to pick and explain the best. Use when the user wants to try a task several ways and compare, asks to "run this N ways and pick the best", wants an A/B/N bake-off of approaches, wants an impartial eval of candidate solutions against a rubric, or asks for /eval-skill.
Slash command: /grok-workflows:eval-skill
Runs the bundled grok-workflows harness, which produces N independent candidate solutions to the SAME task (each in its own isolated git worktree, each a fresh context window), then grades them with a SEPARATE set of evaluator agents: a per-candidate absolute rubric score plus a pairwise tournament. The producers never grade their own work, which structurally defeats self-preferential bias. Nothing is auto-applied — candidate worktrees are left intact for review. You do not re-implement any of this; you invoke the harness and act on its JSON.
/eval-skill <task to run N ways> [-- N] [:: rubric]
-- N sets the number of candidates (default 3, clamped to 2..8).
:: rubric sets the grading rubric (free text).
Both are optional.
Example: /eval-skill implement a fizzbuzz function -- 4 :: correctness, simplicity
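One way to read the argument grammar above is as a small parser. This is only an illustrative sketch, not the harness's own code: the name `parseEvalArgs`, the choice to split on the first `::`, and the requirement that `-- N` sits at the end of the task text are all assumptions.

```javascript
// Hypothetical parser for "<task> [-- N] [:: rubric]" (not the harness's real parser).
function parseEvalArgs(input) {
  let rest = input;
  let rubric = null;

  // Assumption: everything after the first "::" is the free-text rubric.
  const rubricAt = rest.indexOf("::");
  if (rubricAt !== -1) {
    rubric = rest.slice(rubricAt + 2).trim() || null;
    rest = rest.slice(0, rubricAt);
  }

  // Assumption: a trailing "-- <integer>" sets the candidate count.
  let n = 3; // documented default
  const countMatch = rest.match(/\s--\s*(\d+)\s*$/);
  if (countMatch) {
    n = Number.parseInt(countMatch[1], 10);
    rest = rest.slice(0, countMatch.index);
  }

  // Clamp to the documented range 2..8.
  n = Math.min(8, Math.max(2, n));

  return { task: rest.trim(), n, rubric };
}
```

Under this reading, a task that itself contains `::` would be split early, so the real harness may well handle the separators differently.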
This skill bundles an entrypoint at <skill-dir>/scripts/run.mjs, a thin delegator
to the centralized launcher logic in src/launcher.mjs. The delegator passes its
own import.meta.url to the launcher, so self-location still works.
<skill-dir> is this skill's own directory, whose absolute path is announced in
your system context when the skill loads. Derive the entrypoint path from that
announced SKILL.md path and inline the absolute path into a single
run_terminal_cmd call (don't rely on the working directory or a shell variable).
The launcher locates its bundled harness itself (via the delegator), so no
repository path is needed:
node <skill-dir>/scripts/run.mjs "<task to run N ways> [-- N] [:: rubric]"
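As a rough illustration of deriving the entrypoint from the announced SKILL.md path and inlining it into one command string, consider the sketch below. The helper names are hypothetical, the path handling assumes POSIX-style separators, and it quotes with single quotes rather than the double quotes shown above so that the task text cannot be expanded by the shell.

```javascript
// Hypothetical helpers; the SKILL.md path is whatever the system context announces.
function entrypointFromSkillMd(skillMdPath) {
  // <skill-dir> is the directory containing SKILL.md (POSIX-style path assumed).
  const skillDir = skillMdPath.slice(0, skillMdPath.lastIndexOf("/"));
  return `${skillDir}/scripts/run.mjs`;
}

// Quote a value for a POSIX shell: wrap in single quotes, escaping embedded ones.
function shellQuote(value) {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Build the single command string with the absolute entrypoint path inlined.
function buildCommand(skillMdPath, args) {
  return `node ${shellQuote(entrypointFromSkillMd(skillMdPath))} ${shellQuote(args)}`;
}
```

The resulting string carries the absolute path itself, so it does not depend on the working directory or on any shell variable being set.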
The harness prints a single JSON object to stdout (progress logs go to stderr):
{
  "winner": 3,
  "why": "Candidate #3 (\"...\") ranked first with rubric score 80/100 and won the pairwise tournament. ...",
  "ranking": [
    { "rank": 1, "candidate": 3, "approach": "...", "rubricScore": 80, "wonTournament": true }
  ],
  "rubric": "...",
  "scores": [ { "candidate": 1, "score": 60, "justification": "..." } ],
  "candidates": [ { "candidate": 1, "approach": "...", "summary": "..." } ],
  "tournamentWinner": 3,
  "requested": 3,
  "produced": 3,
  "worktreesLeftForReview": true,
  "note": "Outputs were NOT auto-applied. ..."
}
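Given the JSON shape above, turning the harness's stdout into a headline plus a short ranking table might look like the following. This is a minimal sketch: `formatVerdict` is a hypothetical name, the pipe-separated layout is one arbitrary choice, and it assumes `winner` is not null.

```javascript
// Sketch: parse the harness's stdout and build a headline plus a plain-text table.
function formatVerdict(stdoutText) {
  const result = JSON.parse(stdoutText);
  const header = "rank | candidate | approach | rubricScore | wonTournament";
  const rows = result.ranking.map(
    (r) => `${r.rank} | #${r.candidate} | ${r.approach} | ${r.rubricScore} | ${r.wonTournament}`
  );
  // Headline first (winner and why), then the table.
  return [`Winner: #${result.winner}`, result.why, header, ...rows].join("\n");
}
```

Only stdout is parsed here because the progress logs go to stderr and would break `JSON.parse` if mixed in.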
Present winner and the why explanation as the headline answer, then show the ranking (rank, candidate #, approach, rubricScore, wonTournament) as a short table.
If the user wants more detail, show each candidate's approach and summary from candidates, and the rubric scores with justifications.
Compare produced vs requested: if produced < requested, some candidate producers failed and were dropped (nothing was silently capped).
Remind the user that outputs were not auto-applied (see the note field). To apply the winner, they inspect the worktrees (git worktree list) and cherry-pick the winning candidate's changes. Only apply changes if the user explicitly asks.
If winner is null (e.g. produced is 0), tell the user all candidate producers failed and suggest re-running or simplifying the task.
Do not re-run the task yourself or re-grade the candidates. The harness already ran the producers and the impartial evaluators; your job is to relay and act on its verdict.
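The follow-up checks described above (a null winner, fewer candidates produced than requested, worktrees left for review) could be gathered into one helper. This is a sketch under the documented field names only; `followUpNotes` and the message wording are hypothetical.

```javascript
// Sketch: derive follow-up messages from the parsed harness result.
function followUpNotes(result) {
  const notes = [];

  // A null winner means no candidate survived; nothing else is worth reporting.
  if (result.winner === null) {
    notes.push("All candidate producers failed; try re-running or simplifying the task.");
    return notes;
  }

  // Fewer produced than requested means some producers failed and were dropped.
  if (result.produced < result.requested) {
    const dropped = result.requested - result.produced;
    notes.push(`${dropped} of ${result.requested} candidate producers failed and were dropped.`);
  }

  // Nothing is auto-applied; the user reviews the worktrees themselves.
  if (result.worktreesLeftForReview) {
    notes.push("Outputs were not auto-applied; inspect the worktrees with `git worktree list`.");
  }

  return notes;
}
```

The helper only reports; applying the winner is left as a separate, explicitly requested step.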
npx claudepluginhub lswank/grok-workflows