From bette-think
Guides post-launch AI feature calibration: document production error patterns, review eval performance, decide agency promotion. Uses CC/CD loop with /calibrate shortcuts.
How this skill is triggered — by the user, by Claude, or both
Slash command
/bette-think:calibrateThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
**Calibration happens after launch, not before.**
Calibration happens after launch, not before.
The mistake: Building elaborate systems to perfectly calibrate AI behavior before launch. The reality: You learn what quality means by shipping to users and seeing what they actually need.
The Calibration Loop:
When this skill is invoked, start with:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CALIBRATE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Calibration happens after launch, not before.
What do you need?
1. Document error patterns
→ Analyze failures, categorize, plan fixes
2. Review eval performance
→ Are evals catching real issues? Missing patterns?
3. Agency promotion decision
→ Is this feature ready for more autonomy?
4. Quick calibration check
→ Is the system behaving as expected?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Parse intent from context:
Command-line shortcuts:
/calibrate → Show entry menu/calibrate --errors → Flow 1 (error patterns)/calibrate --evals → Flow 2 (eval review)/calibrate --promote → Flow 3 (agency promotion)/calibrate --quick → Flow 4 (quick check)━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR PATTERN DOCUMENTATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Let's catalog what's going wrong.
Where are you seeing errors?
• User feedback / complaints
• Support tickets
• Monitoring alerts
• Manual review
• User corrections / overrides
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Questions to ask:
Categories must emerge from the errors you actually observe — not from a pre-defined list. Read the failures first, then group similar ones together. Pre-defined categories cause confirmation bias.
For the systematic process (reading traces, emergent categorization, failure rates), run /upgrade-evals. Flow 1 here is for quick ad-hoc error documentation when you don't need the full analysis.
For each error pattern, determine:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ERROR PATTERN ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Analysis Date: [date]
Data Source: [where errors were observed]
| Error Pattern | Category | Likely Reason | Potential Fix | Priority |
|---------------|----------|---------------|---------------|----------|
| [description] | [type] | [why] | [how to fix] | P1 |
| [description] | [type] | [why] | [how to fix] | P2 |
| [description] | [type] | [why] | [how to fix] | P3 |
PATTERN ANALYSIS:
- Most common category: [X]
- Emerging pattern: [Y]
- Regression from last period: [Z]
RECOMMENDED ACTIONS:
1. [P1 action]
2. [P2 action]
3. [P3 action]
ADD TO EVALS:
- [ ] Add test case for [error pattern 1]
- [ ] Add test case for [error pattern 2]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL PERFORMANCE REVIEW
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Let's see if your evals are working.
Current state:
• How many test cases do you have?
• What's your pass rate?
• When did you last update evals?
• Are you seeing failures in prod that evals missed?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Questions to ask:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL GAP ANALYSIS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Check coverage across categories:
□ Happy path - Common successful scenarios
□ Edge cases - Unusual but valid inputs
□ Adversarial - Intentional misuse attempts
□ Boundary - Out of scope handling
□ Regression - Previously fixed issues
□ Production errors - Real failures observed
Missing categories = gaps in coverage
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EVAL ASSESSMENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Test Cases: [count]
Pass Rate: [%]
Last Updated: [date]
COVERAGE ASSESSMENT:
| Category | Coverage | Status |
|----------|----------|--------|
| Happy path | [X] cases | ✅ Good |
| Edge cases | [X] cases | ⚠️ Needs work |
| Adversarial | [X] cases | ❌ Missing |
| Boundary | [X] cases | ✅ Good |
| Regression | [X] cases | ⚠️ Needs work |
EFFECTIVENESS:
- Catching real issues? [Yes/No/Partially]
- False positive rate: [%]
- Prod errors missed: [list]
RECOMMENDATIONS:
1. Add [X] test cases for [gap]
2. Update [Y] tests that are stale
3. Remove [Z] tests that are redundant
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
AGENCY PROMOTION CHECK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Considering: V[current] → V[target]
Let's verify readiness.
QUALITY METRICS
□ Accuracy/quality stable for 4+ weeks?
□ No new error patterns emerging?
□ User corrections decreasing?
SAFETY & TRUST
□ Confident in known failure modes?
□ Override mechanism working well?
□ User feedback positive?
OPERATIONAL READINESS
□ Monitoring in place for new level?
□ Rollback plan ready?
□ Team aligned on promotion?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
For each item, ask:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROMOTION VERDICT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Current: V[n] | Target: V[n+1]
VERDICT: [READY ✅ / NOT READY ❌ / NEEDS WORK ⚠️]
✅ PASSING:
- [criteria met with evidence]
- [criteria met with evidence]
❌ BLOCKING:
- [criteria not met + what's needed]
- [criteria not met + what's needed]
⚠️ RISKS IF PROMOTED NOW:
- [risk + mitigation needed]
RECOMMENDATION:
[Clear recommendation with reasoning]
NEXT STEPS:
1. [action if ready]
2. [action if not ready]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
QUICK CALIBRATION CHECK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Fast health check for [feature name]
• Current quality metric: [X%]
• Trend: [↑ improving / → stable / ↓ degrading]
• Any alerts triggered? [Y/N]
• User feedback signals: [positive/neutral/negative]
• Override rate: [X%] (is this expected?)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CALIBRATION STATUS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: [name]
Agency Level: V[n]
Check Date: [date]
STATUS: [HEALTHY ✅ / ATTENTION ⚠️ / DEGRADED ❌]
METRICS:
| Metric | Value | Trend | Status |
|--------|-------|-------|--------|
| Quality | [X%] | [→] | ✅ |
| Override rate | [X%] | [↓] | ✅ |
| User satisfaction | [X] | [→] | ⚠️ |
ALERTS:
- [any triggered alerts]
ACTION NEEDED:
- [none / specific action]
NEXT CHECK: [recommended cadence]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Weekly: Quick check (Flow 4)
Monthly: Eval review (Flow 2)
Quarterly: Deep calibration
"Quality looks fine" → "What's your correction rate? What's the trend? Fine compared to what baseline?"
"No complaints" → "Are users actually using it? Are they silently working around it?"
"Ready to promote" → "Show me the data. How long has quality been stable? What failure modes have you validated?"
"Evals are passing" → "100% pass rate? That might mean evals are too easy. When did you last add new test cases?"
Before /calibrate:
/agency-ladder - Define the ladder first/spec --ai - Ensure spec includes calibration planRelated:
/ai-health-check - Pre-launch validation/start-evals - Set up eval infrastructure/upgrade-evals - Systematic error analysis on real traces/build-judge - LLM-as-Judge for automating subjective evals/eval-rag - RAG-specific retrieval + generation evaluationFramework: CC/CD (Continuous Calibration/Continuous Development) Source: Aishwarya Naresh Reganti & Kiriti Badam (Lenny's Newsletter) Adaptation: Post-launch calibration workflows
npx claudepluginhub breethomas/bette-think --plugin bette-thinkGenerates synthetic problems with quasi-ground-truth outcomes to test agents and skills, measuring recall, precision, and confidence calibration. Use for validating routing accuracy, A/B testing changes.
Audits pre-launch AI features across 6 dimensions—model selection, data quality, cost, monitoring, failure UX, optimization—grading readiness and blocking shipment of broken products.
Use this skill when the user asks about "continuous improvement for AI", "AI quality flywheel", "how do we keep improving our AI feature", "closing the eval feedback loop", "systematic AI improvement process", or wants to build a repeating process that continuously improves AI product quality over time rather than doing one-off fixes.