By robertnowell
Bakes eval-first thinking into model-output work as a habit, not a gate. Forces every plan to name what success looks like and where things can go wrong, and to propose simple evals that validate key assumptions.
A lightweight Claude Code skill that bakes eval-first thinking into every model-output change — without ceremony, without refusing work, without asking the user to label cases.
For any work involving prompts, agents, skills, judges, or model outputs, the skill nudges Claude to:
Run as part of the plan. Show verdict first, evidence second. Mechanical changes (typos, formatting) skip the protocol.
Most prompt/agent/skill changes ship without anyone asking "how would we know this worked?" Production failures are then caught by users instead of by the developer. This skill makes verification a natural part of doing the work — Claude generates the cases, picks the cheapest fitting check, and surfaces the comparison as the answer.
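As a rough illustration of "generate the cases, pick the cheapest fitting check, surface the comparison as the answer", the sketch below runs a handful of cases against a before and after variant and prints the verdict ahead of the evidence. It is not the skill's own code: `Case`, `run_cases`, `compare`, `old_variant`, and `new_variant` are hypothetical names, and the two variants are plain string functions standing in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One eval case: an input plus a cheap pass/fail check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_cases(generate: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run every case through one variant and record pass/fail by case name."""
    return {case.name: case.check(generate(case.prompt)) for case in cases}


def compare(before: dict[str, bool], after: dict[str, bool]) -> str:
    """Build a verdict-first report: one verdict line, then per-case evidence."""
    regressions = [n for n in before if before[n] and not after[n]]
    fixes = [n for n in before if not before[n] and after[n]]
    if regressions:
        verdict = "REGRESSION"
    elif fixes:
        verdict = "IMPROVED"
    else:
        verdict = "NO CHANGE"
    lines = [
        f"Verdict: {verdict} "
        f"({sum(after.values())}/{len(after)} pass, "
        f"was {sum(before.values())}/{len(before)})"
    ]
    for name in before:
        old = "pass" if before[name] else "fail"
        new = "pass" if after[name] else "fail"
        lines.append(f"  {name}: {old} -> {new}")
    return "\n".join(lines)


# Hypothetical stand-ins for the old and new prompt; a real check would call the model.
def old_variant(prompt: str) -> str:
    return prompt.upper()


def new_variant(prompt: str) -> str:
    return prompt.strip().upper()


cases = [
    Case("plain", "hello", lambda out: out == "HELLO"),
    Case("padded", "  hello ", lambda out: out == "HELLO"),
]

report = compare(run_cases(old_variant, cases), run_cases(new_variant, cases))
print(report)
```

Exact-match checks like these are only one possible cheap check; the point of the sketch is the shape of the report, with the verdict on the first line and the per-case evidence underneath.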
/plugin marketplace add robertnowell/eval-first
/plugin install eval-first@eval-first-marketplace
Or pinned to a release:
/plugin marketplace add robertnowell/[email protected]
Three layers, classic progressive disclosure:
| Layer | What loads | When |
|---|---|---|
| L1 (description) | The load-bearing sentence + skip rule | Always in context (every Claude Code session) |
| L2 (SKILL.md body) | Full protocol + reference pointers | When the skill is invoked for a non-trivial change |
| L3 (references/) | Anchor patterns, cheap-check snippets, comparison templates | When SKILL.md tells Claude to read them |
The ambient cost (L1 only) is ~200 tokens. The full skill (L1+L2) costs ~600 tokens. References (L3) load on demand at ~190 lines total.
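The arithmetic behind those figures can be sketched as follows. The 200 and 600 token values are the approximate numbers quoted above; the ~400 token size of the L2 body is derived from them rather than measured, and the helper names (`context_cost`, `total_cost`) are hypothetical. References (L3) are left out because their size is given in lines, not tokens.

```python
# Approximate figures quoted in the text above.
L1_DESCRIPTION_TOKENS = 200   # always in context
L1_PLUS_L2_TOKENS = 600       # description + SKILL.md body

# Derived, not measured: the SKILL.md body alone.
L2_BODY_TOKENS = L1_PLUS_L2_TOKENS - L1_DESCRIPTION_TOKENS


def context_cost(skill_invoked: bool) -> int:
    """Approximate tokens the skill occupies in one session, excluding L3."""
    return L1_PLUS_L2_TOKENS if skill_invoked else L1_DESCRIPTION_TOKENS


def total_cost(sessions: int, invoked_sessions: int) -> int:
    """Approximate tokens across sessions, given how many invoked the skill."""
    if not 0 <= invoked_sessions <= sessions:
        raise ValueError("invoked_sessions must be between 0 and sessions")
    idle_sessions = sessions - invoked_sessions
    return (idle_sessions * context_cost(False)
            + invoked_sessions * context_cost(True))


# Example: 10 sessions, 3 of which involve a non-trivial model-output change.
example_total = total_cost(sessions=10, invoked_sessions=3)
print(L2_BODY_TOKENS, example_total)
```

Under these assumptions, a session that never touches model-output work pays only the ~200 token description, which is the case the layering is designed to keep cheap.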
- skills/eval-first/references/anchor-patterns.md: 4 patterns for generating eval cases without asking the user (delta-driven, F×S×P tuples, synthetic-Q per chunk, 3-slot fallback).
- skills/eval-first/references/cheap-checks.md: Failure-type → minimal-check decision table with concrete snippets.
- skills/eval-first/references/comparison-templates.md: Verdict-first comparison templates A-E.

Built by @robertnowell for the StartX founder community. Draws on canonical work by Anthropic (Demystifying Evals for AI Agents), Hamel Husain (Field Guide), Eugene Yan (Product Evals), Shreya Shankar (SPADE / EvalGen), Jason Liu (RAG flywheel), and the Promptfoo / Inspect AI / OpenAI Evals frameworks.
MIT
npx claudepluginhub robertnowell/eval-first --plugin eval-first