From bette-think
Generates 20 test cases (15 happy path + 5 edge) for AI features in spreadsheet format using PM-Friendly Evals. Launches simple eval workflow with optional Linear project.
Slash command: /bette-think:start-evals
Launch your AI evaluation process using the **PM-Friendly Evals approach** (Aman Khan + Hamel Husain).
Start with 20 test cases in a spreadsheet. Scale when ready. Error analysis > automation.
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
START EVALS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Start with 20 test cases. Scale when ready.
What AI feature are you evaluating?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/start-evals [feature-name]
Examples:
/start-evals "AI product recommendations" - Generate test cases
/start-evals --create-project - Create Linear project for tracking
/start-evals "customer support AI" --count 50 - Generate 50 test cases

Good -> Better -> Best progression:
| Stage | Test Cases | Process | Tool |
|---|---|---|---|
| Good (Week 1) | 20 | Manual review | Spreadsheet |
| Better (Month 1-2) | 50-100 | LLM-as-judge | Weekly reviews |
| Best (Month 3+) | 200+ | Automated | CI/CD integration |
Start here. You're at "Good." Don't jump to automation.
AI Evals Starter Kit: Product Recommendations
HAPPY PATH (15 cases):
1. Input: "Recommend a laptop under $800 for college"
Expected: Mid-range laptops with student-friendly features, under budget
Pass criteria: All recommendations < $800, suitable for students
2. Input: "Best phone for photography"
Expected: High-end phones with excellent cameras
Pass criteria: Focus on camera quality, not price
...
EDGE CASES (5 cases):
16. Input: "Phone for elderly person"
Expected: Simple, large screen, easy to use
Pass criteria: Prioritizes simplicity over features
Why it's tricky: Must understand implicit needs
...
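The sample above can be put into spreadsheet form with a few lines of Python. This is a minimal sketch, not the skill's actual output format: the column names (`id`, `type`, `input`, `expected`, `pass_criteria`, `why_tricky`, `result`) and the `cases_to_csv` helper are assumptions made for illustration, and only the three cases spelled out above are included; the elided ones are left out rather than guessed.

```python
import csv
import io

# Hypothetical column layout; the source only says "spreadsheet format".
COLUMNS = ["id", "type", "input", "expected", "pass_criteria", "why_tricky", "result"]

# Only the cases written out in the sample above; elided cases are not invented.
CASES = [
    {
        "id": 1,
        "type": "happy",
        "input": "Recommend a laptop under $800 for college",
        "expected": "Mid-range laptops with student-friendly features, under budget",
        "pass_criteria": "All recommendations < $800, suitable for students",
    },
    {
        "id": 2,
        "type": "happy",
        "input": "Best phone for photography",
        "expected": "High-end phones with excellent cameras",
        "pass_criteria": "Focus on camera quality, not price",
    },
    {
        "id": 16,
        "type": "edge",
        "input": "Phone for elderly person",
        "expected": "Simple, large screen, easy to use",
        "pass_criteria": "Prioritizes simplicity over features",
        "why_tricky": "Must understand implicit needs",
    },
]


def cases_to_csv(cases):
    """Serialize test cases to CSV text; missing columns are left blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(cases)
    return buffer.getvalue()


if __name__ == "__main__":
    # Paste the printed text into a file and open it in any spreadsheet tool.
    print(cases_to_csv(CASES), end="")
```

The empty `result` column is where a reviewer would record pass or fail by hand during the manual review stage.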
| Pass Rate / Suite Size | Action |
|---|---|
| 80%+ | Add 10 more test cases |
| <80% | Fix issues, rerun |
| 50-100 cases | Graduate to "Better" approach |
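The decision table above can be sketched as a small helper. The function names `pass_rate` and `next_action` are hypothetical, and the ordering of the checks is an assumption: the table does not say what to do when a suite has 50+ cases but is below 80%, so this sketch fixes failures first and treats 50 cases as the point to graduate.

```python
def pass_rate(results):
    """Fraction of test cases marked as passing (True)."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def next_action(results):
    """Map one eval run to the action in the table above.

    `results` is a list of booleans, one per test case.
    Assumed precedence: a pass rate below 80% always means "fix first".
    """
    rate = pass_rate(results)
    if rate < 0.80:
        return "Fix issues, rerun"
    if len(results) >= 50:
        return 'Graduate to "Better" approach'
    return "Add 10 more test cases"


if __name__ == "__main__":
    run = [True] * 17 + [False] * 3  # 17 of 20 cases passed
    print(f"{pass_rate(run):.0%}: {next_action(run)}")
```

In a spreadsheet the same check is a pass count divided by the row count; the code form is only useful once results are exported.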
Q: 20 seems like too few. Should I start with 100? A: No. 20 cases covering your core use case > 100 cases you never run.
Q: How long does running 20 tests take? A: First time: 30-60 min. After that: 15-20 min per run.
Q: Do I need special tools? A: No. Spreadsheet works great. Graduate to tools when manual gets painful.
| Signal | Next Step |
|---|---|
| You have 50+ test cases or see production failures | /upgrade-evals — Systematic error analysis on real traces |
| You need more diverse test inputs | /generate-test-data — Dimension-based synthetic data |
| Your AI feature uses retrieval (search, knowledge base) | /eval-rag — Separate retrieval from generation evaluation |
/upgrade-evals - Error analysis on real traces (next step after this)
/build-judge - LLM-as-Judge for subjective failure modes
/generate-test-data - Diverse synthetic test inputs
/eval-rag - RAG-specific retrieval + generation evaluation
/calibrate - Ongoing post-launch calibration
/ai-health-check - Full pre-launch readiness audit
/ai-cost-check - Economic validation

Framework: PM-Friendly Evals (Aman Khan + Hamel Husain)
Key insight: "Error analysis is the most important activity. Start with 20 cases in a spreadsheet."
npx claudepluginhub breethomas/bette-think --plugin bette-think