Skill

start-evals

Generates 20 test cases (15 happy path + 5 edge cases) for an AI feature in spreadsheet format, using the PM-Friendly Evals approach. Launches a simple eval workflow, with an optional Linear project for tracking.

ai-ml

testing

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/bette-think:start-evals

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Launch your AI evaluation process using the **PM-Friendly Evals approach** (Aman Khan + Hamel Husain).

SKILL.md

135 lines · ~1k tokens

Stats

Parent stars: 15

Parent forks: 3

Maintenance: Good

Last commit: Mar 8, 2026

Start Evals

Launch your AI evaluation process using the PM-Friendly Evals approach (Aman Khan + Hamel Husain).

Start with 20 test cases in a spreadsheet. Scale when ready. Error analysis > automation.

Entry Point

When this skill is invoked, start with:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 START EVALS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Start with 20 test cases. Scale when ready.

What AI feature are you evaluating?

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Usage

/start-evals [feature-name]

Examples:

/start-evals "AI product recommendations" - Generate test cases
/start-evals --create-project - Create Linear project for tracking
/start-evals "customer support AI" --count 50 - Generate 50 test cases

What Happens

1. Invokes the eval-generator agent
2. Asks about your AI feature and quality criteria
3. Generates 20 test cases (15 happy path + 5 edge cases)
4. Provides spreadsheet template and workflow
5. Optionally creates Linear project for tracking

The Philosophy

Good -> Better -> Best progression:

Stage	Test Cases	Process	Tool
Good (Week 1)	20	Manual review	Spreadsheet
Better (Month 1-2)	50-100	Weekly reviews	LLM-as-judge
Best (Month 3+)	200+	Automated	CI/CD integration

Start here. You're at "Good." Don't jump to automation.

What You'll Get

AI Evals Starter Kit: Product Recommendations

HAPPY PATH (15 cases):

1. Input: "Recommend a laptop under $800 for college"
   Expected: Mid-range laptops with student-friendly features, under budget
   Pass criteria: All recommendations < $800, suitable for students

2. Input: "Best phone for photography"
   Expected: High-end phones with excellent cameras
   Pass criteria: Focus on camera quality, not price

...

EDGE CASES (5 cases):

16. Input: "Phone for elderly person"
    Expected: Simple, large screen, easy to use
    Pass criteria: Prioritizes simplicity over features
    Why it's tricky: Must understand implicit needs

...
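The starter kit above maps naturally onto spreadsheet rows. The following is a minimal sketch, not the skill's actual output format: the `EvalCase` class, its column names, and the empty `actual_output` and `result` columns are assumptions made for illustration. It writes the three cases shown above to CSV text that can be pasted or imported into any spreadsheet tool.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalCase:
    """One spreadsheet row. Column names are hypothetical, not defined by the skill."""
    case_id: int
    category: str            # "happy_path" or "edge_case"
    input_text: str
    expected: str
    pass_criteria: str
    why_tricky: str = ""     # only filled in for edge cases
    actual_output: str = ""  # left blank; recorded by hand during review
    result: str = ""         # left blank; marked "pass" or "fail" by hand


# The three cases quoted in the starter kit above.
CASES = [
    EvalCase(1, "happy_path",
             "Recommend a laptop under $800 for college",
             "Mid-range laptops with student-friendly features, under budget",
             "All recommendations < $800, suitable for students"),
    EvalCase(2, "happy_path",
             "Best phone for photography",
             "High-end phones with excellent cameras",
             "Focus on camera quality, not price"),
    EvalCase(16, "edge_case",
             "Phone for elderly person",
             "Simple, large screen, easy to use",
             "Prioritizes simplicity over features",
             "Must understand implicit needs"),
]


def to_csv(cases):
    """Serialize cases to CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalCase)])
    writer.writeheader()
    for case in cases:
        writer.writerow(asdict(case))
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(CASES))
```

Keeping `actual_output` and `result` as blank columns leaves room for the manual review step, which is the point of the "Good" stage.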

Week 1 Workflow (2-3 hours)

1. Copy test cases to spreadsheet (10 min)
2. Run your AI against each input (1-2 hours)
3. Record actual outputs
4. Mark pass/fail
5. Look for patterns in failures (30 min)
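Once the pass/fail column is filled in, the last step (looking for patterns) can be done by eye or with a few lines of code. The sketch below assumes each row has been exported as a dict with a `result` key and a short free-text `failure_note` key; both names are hypothetical and not part of the skill's template.

```python
from collections import Counter


def summarize(rows):
    """Return (pass_rate, failure_patterns) for manually graded rows.

    rows: list of dicts with 'result' ("pass" or "fail") and an optional
    'failure_note' describing why a failing case failed.
    Ungraded rows (any other 'result' value) are ignored.
    """
    graded = [r for r in rows if r.get("result") in ("pass", "fail")]
    if not graded:
        raise ValueError("no graded rows to summarize")

    passed = sum(1 for r in graded if r["result"] == "pass")
    pass_rate = passed / len(graded)

    # Count identical failure notes so the most frequent pattern comes first.
    notes = Counter(
        r["failure_note"]
        for r in graded
        if r["result"] == "fail" and r.get("failure_note")
    )
    return pass_rate, notes.most_common()
```

Counting notes only works if similar failures are labeled with the same wording, so it helps to reuse a small set of labels while grading.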

After 1-2 Weeks

Signal	Action
Pass rate 80%+	Add 10 more test cases
Pass rate <80%	Fix issues, rerun
Suite has grown to 50-100 cases	Graduate to "Better" approach
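The table above can be read as a small decision rule. The sketch below is one possible reading, not something the skill specifies: it assumes the pass rate is checked first (so a failing suite is fixed before it grows or graduates) and treats 50 cases as the graduation threshold. The function name `next_action` is hypothetical.

```python
def next_action(pass_rate: float, total_cases: int) -> str:
    """Map a pass rate (0.0 to 1.0) and suite size to the table's action.

    Assumption: pass rate takes precedence over suite size.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0.0 and 1.0")
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")

    if pass_rate < 0.80:
        return "Fix issues, rerun"
    if total_cases >= 50:
        return 'Graduate to "Better" approach'
    return "Add 10 more test cases"
```

For example, `next_action(0.85, 20)` returns `"Add 10 more test cases"`, while the same pass rate with 50 cases returns the graduation action.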

Common Questions

Q: 20 seems like too few. Should I start with 100?
A: No. 20 cases covering your core use case > 100 cases you never run.

Q: How long does running 20 tests take?
A: First time: 30-60 min. After that: 15-20 min per run.

Q: Do I need special tools?
A: No. Spreadsheet works great. Graduate to tools when manual gets painful.

Ready to Scale?

Signal	Next Step
You have 50+ test cases or see production failures	`/upgrade-evals` — Systematic error analysis on real traces
You need more diverse test inputs	`/generate-test-data` — Dimension-based synthetic data
Your AI feature uses retrieval (search, knowledge base)	`/eval-rag` — Separate retrieval from generation evaluation

Related Commands

/upgrade-evals - Error analysis on real traces (next step after this)
/build-judge - LLM-as-Judge for subjective failure modes
/generate-test-data - Diverse synthetic test inputs
/eval-rag - RAG-specific retrieval + generation evaluation
/calibrate - Ongoing post-launch calibration
/ai-health-check - Full pre-launch readiness audit
/ai-cost-check - Economic validation

Framework: PM-Friendly Evals (Aman Khan + Hamel Husain)

Key insight: "Error analysis is the most important activity. Start with 20 cases in a spreadsheet."
