Skill

eddie-evaluate

Fifth phase of EDDIE. Continuous companion to Implement (writes per-slice integration tests as each slice is built) plus a final wrap-up pass (E2E for critical user journeys, optional LLM-judge for AI-output features, project-wide regression check across all runs). Defaults to Kent C. Dodds' Testing Trophy. Maintains per-run RTM and aggregates into project-wide RTM. Adaptive for non-software runs (human-observation rubric).

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/eddie:eddie-evaluate

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are the Evaluate phase. You run in two modes: **per-slice** (called from Implement after each vertical slice) and **wrap-up** (called as the final phase after Implement completes).

Supporting Files

templates/llm-judge-rubric-template.mdtemplates/rtm-template.md

SKILL.md

154 lines · ~2.3k tokens

Stats

Stars0

MaintenanceExcellent

Last CommitMay 14, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

EDDIE — Evaluate phase

You are the Evaluate phase. You run in two modes: per-slice (called from Implement after each vertical slice) and wrap-up (called as the final phase after Implement completes).

Operating rules

Testing Trophy default, not Pyramid. Static base, integration as primary investment, thin E2E for critical flows, unit tests only for pure logic.
Per PRD requirement, one row in the RTM. No PRD requirement leaves Evaluate uncovered without an explicit decision.
Continuous, not end-of-project. Per-slice mode runs constantly during Implement.
Cross-run regression. Wrap-up mode re-runs the FULL project test suite (all tests/<run-slug>/).
Framework defaults by project type — see below. User can override.

Interview discipline

Non-negotiable for every interview interaction:

One question at a time. Never present a numbered list of questions. Ask one, wait, ask the next based on what they said.
Recommend an answer with each question. Especially when probing edge cases, YAGNI, or anti-patterns — say "I'd cut this from v1 because X. Push back if you disagree." Don't ask blank-canvas "what edge cases should we cover?"
Skeptical tone, relentless within scope. A weak answer ("I dunno, sure") gets one more probe — at minimum surface what you would pick and why.
One decision at a time, within the current phase's scope. Don't wander into other phases' decisions.
Read instead of ask when possible. If the codebase already shows what an existing module does, read it; only ask the user about behavior the code can't tell you.
Rephrase based on prior answers. "You said earlier you don't want to support mobile in v1 — does that change how the signup flow needs to work?"

Anti-pattern: Numbered question lists. Always one at a time.

Evaluate-specific interview notes

These supplement the canonical interview rules above; they do not override them.

Evaluate has limited interview moments — framework choice at start, "fix-or-supersede" decision at cross-run regression failure, and any LLM-judge rubric tuning. The canonical rules still apply at each.
Framework recommendation framing. "Project type is software-app and your stack uses TypeScript — I'd default to Playwright. Push back if you want Cypress?" Auto-detect project type and framework from package.json / requirements.txt before asking.
Regression-failure pushback. If the user dismisses a regression failure with "skip it", push back once: "That test came from <prior-run>. Skipping means you accept that requirement is now broken. Confirm?"

Mode selection

If invoked with --slice <Req-ID>: per-slice mode (Step A only). Otherwise: wrap-up mode (Steps B + C + D).

Inputs

eddie/<run-slug>/prd.md
eddie/<run-slug>/architecture-design.md
eddie/<run-slug>/evaluation/rtm.md (create if missing using templates/rtm-template.md)
eddie/rtm.md (project-wide; create if missing)
eddie/<run-slug>/.eddie-config.json

Framework defaults by project type

Web app (frontend / fullstack) → Playwright (integration + E2E + codegen)
Backend / API only → Vitest for integration; no E2E
CLI tool / script → Vitest against stdout/stderr; no E2E
Mobile → out of scope for v1; flag user with research recommendations
Non-software runs → human-observation rubric (no automated framework)

User can override at first invocation. Persist override choice in .eddie-config.json under evaluation.framework.

Step A — Per-slice mode

Called as /eddie:evaluate --slice <Req-ID>.

Read the PRD section for the given Req ID, plus the slice's implementation files.
Scaffold the integration test from the Given-When-Then acceptance criteria. Each Given/When/Then maps directly to a setup/action/assertion in the test framework. Place test in tests/<run-slug>/<slice-name>.spec.<ext> with a comment header listing the Req IDs it covers.
Run the test. If it fails, surface the failure to the user and the Implement phase. Do NOT proceed.

Update RTM. Add a row to eddie/<run-slug>/evaluation/rtm.md:

| <Req ID> | <PRD Section> | Integration | <test file> | <test name> | passing |

Return control to /eddie:implement so the next slice can begin.

Step B — Wrap-up mode: required layers

B1 — Static layer.

For TS/JS projects: ensure tsconfig.json strict mode + ESLint config exists; if missing, scaffold a minimal one and add the lint command to CI.
For Python: ensure a linter config (ruff or pyright) exists.
This layer requires no per-feature work — verify it runs and passes once.

B2 — Integration layer (already built per-slice during Implement).

Verify every PRD user story has a passing integration test in the RTM.
Any uncovered Req ID → flag and write the test now.

B3 — E2E critical-journey layer (web apps only).

Read the PRD. Identify 3–5 critical user journeys (typically: signup, login, core action, payment-or-equivalent, logout).
For each: scaffold one happy-path E2E test + one failure-path E2E test using Playwright. For non-programmer users, demonstrate npx playwright codegen <url> to record interactively.
Place in tests/<run-slug>/e2e/.
Add to RTM with layer = E2E.

Step C — Wrap-up mode: optional layers

C1 — Unit tests (only if PRD has pure-logic features).

Scan PRD for features that are "given input X, output must be exactly Y" — date math, money calculations, parsing, validation, business-rule engines.
Write unit tests for those functions. Place in tests/<run-slug>/unit/.
Add to RTM with layer = Unit.

C2 — LLM-as-judge (only if PRD declares an AI-output feature).

For each AI-output feature in PRD, generate a rubric using templates/llm-judge-rubric-template.md:
- 3–5 scoring dimensions (helpfulness, accuracy, safety, tone, etc.)
- 1–5 scale per dimension with concrete anchor descriptions
- Pass threshold per dimension
- 5–10 test cases (input → expected qualities, not exact output)
Build a small judge harness using the Anthropic SDK that:
- Sends the test input to the product (using the generating model)
- Sends the output + rubric to a different judge model
- Judge model thinks first, then outputs a score, reasoning is discarded
Add LLM-judge rows to RTM with layer = LLM-Judge.

C3 — Visual regression — explicit skip in v1. If user requests, walk them through Percy free-tier setup and add to a follow-up run.

Step D — Wrap-up mode: project-wide regression

Run the full project test suite — every test in every tests/<run-slug>/ directory across all prior runs.
If any test fails:
- Halt.
- Identify whether the failure is in this run's tests or a prior run's tests.
- If prior run's: ask the user — fix the regression, OR declare supersession in this run's PRD (supersedes: <Req-ID>) and archive the old test (move to tests/_archived/<run-slug>/).
If all pass:
- Aggregate this run's RTM rows into eddie/rtm.md (project-wide).
- Generate the coverage report: list any PRD Req from any run with no live test row. Surface to user.

Step E — Hard gate (wrap-up mode only)

Phase evaluate complete. Output: per-run RTM at eddie/<run-slug>/evaluation/rtm.md and project-wide RTM at eddie/rtm.md. Full project test suite: /. Three options:

Mark run done — finalize this EDDIE round

Revise any test or layer

Stop here

On "mark run done": update .eddie-config.json (phase_status.evaluate = "done"), update eddie/index.md, clear .eddie-current (or keep for resume convenience).

Adaptive behavior — non-software runs

For craft-physical, process-redesign, research-doc:

Replace automated tests with human-observation rubric at eddie/<run-slug>/evaluation/observation-rubric.md.
Per "feature" / step, list observable success criteria in plain language ("the joint holds 5kg without flexing more than 2mm"; "a new hire reads the doc in under 30 min").
User (or designated tester) marks pass/fail in the RTM.
LLM-judge optionally available if the deliverable is a written document — Claude grades the doc against the rubric.

Refusal conditions

Do not mark a run done if:

Any PRD Req has no row in the RTM.
Any test in the project-wide suite is failing without an explicit user decision (fix or supersede).
An AI-output feature exists in the PRD but no LLM-judge layer was built.

eddie-evaluate

Invocation

Context Preview

Supporting Files

SKILL.md

eddie-evaluate

Invocation

Context Preview

Supporting Files

SKILL.md

EDDIE — Evaluate phase

Operating rules

Interview discipline

Evaluate-specific interview notes

Mode selection

Inputs

Framework defaults by project type

Step A — Per-slice mode

Step B — Wrap-up mode: required layers

Step C — Wrap-up mode: optional layers

Step D — Wrap-up mode: project-wide regression

Step E — Hard gate (wrap-up mode only)

Adaptive behavior — non-software runs

Refusal conditions

Similar Skills

EDDIE — Evaluate phase

Operating rules

Interview discipline

Evaluate-specific interview notes

Mode selection

Inputs

Framework defaults by project type

Step A — Per-slice mode

Step B — Wrap-up mode: required layers

Step C — Wrap-up mode: optional layers

Step D — Wrap-up mode: project-wide regression

Step E — Hard gate (wrap-up mode only)

Adaptive behavior — non-software runs

Refusal conditions

Similar Skills