Defines success metrics, targets, and experiments for product initiatives using input/output metrics, guardrails, and counter-metrics. Useful during PR/FAQ review or pre-build planning.
Slash command: /build-like-amazon:wb-test-and-iterate
Test & Iterate defines HOW you'll know whether the product delivers on the promise made in the PR/FAQ. It establishes metrics, targets, and experiments BEFORE building — not after. This prevents the failure mode where teams build something, find metrics that make it look good in retrospect, and declare victory regardless of actual customer impact.
The key distinction in this stage: input metrics (what you control and do) vs output metrics (what customers experience and value). Input metrics are leading indicators — they move first. Output metrics are lagging indicators — they move later as a consequence of input metrics moving.
Your job is to identify the input metrics you can drive, predict which output metrics should respond, set specific numeric targets for both, and design experiments to validate the causal chain.
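One rough way to check whether an input metric really leads an output metric is to compare their correlation at different time lags. The sketch below is illustrative only: the weekly series are hypothetical data, `lagged_correlation` is a helper invented for this example, and correlation alone does not establish the causal chain (that is what the experiments are for).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(input_series, output_series, lag_weeks):
    """Correlate the input at week t with the output at week t + lag_weeks."""
    if lag_weeks < 0 or lag_weeks > len(input_series) - 2:
        raise ValueError("lag_weeks must leave at least two paired points")
    if lag_weeks == 0:
        return pearson(input_series, output_series)
    return pearson(input_series[:-lag_weeks], output_series[lag_weeks:])


# Hypothetical weekly data (assumption, not from any real product):
# documentation completeness (%) and time-to-first-value (minutes).
docs_completeness = [40, 50, 45, 60, 70, 65, 80, 90]
time_to_first_value = [62, 61, 60, 50, 55, 40, 30, 35]

for lag in (0, 1, 2):
    r = lagged_correlation(docs_completeness, time_to_first_value, lag)
    print(f"lag {lag} weeks: r = {r:.2f}")
```

In this made-up data the relationship is strongest at a two-week lag, which is the pattern you would expect if the input moves first and the output follows.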
At Amazon, every significant initiative has a metrics deck that is reviewed weekly. Metrics are not chosen after launch to make the team look good — they are committed to before building, published in the PR/FAQ's Internal FAQ ("How will we measure success?"), and reviewed rigorously after launch.
Amazon's metrics philosophy:
Amazon also distinguishes between:
The Weekly Business Review (WBR) mechanism enforces accountability: metrics are reviewed every week, deviations from target trigger investigation, and teams must explain why inputs aren't producing expected outputs.
Map the causal chain from what you DO (inputs) to what customers EXPERIENCE (outputs):
| Input metrics (you control these) | Output metrics (customers experience these) |
|---|---|
| Actions your team takes | Direct customer outcomes |

Direct customer outcomes in turn drive business results (revenue, retention, NPS).
Example causal chain:
| Input metric | Output metric |
|---|---|
| Feature availability (ship date) | Customer adoption rate |
| Documentation completeness | Time-to-first-value |
| API response time (p99) | Customer task completion rate |
| Bug fix response time | Customer satisfaction (CSAT) |

These output metrics in turn drive business results: retention rate, revenue per customer, and Net Promoter Score.
For YOUR initiative, identify:
Every metric needs a target. Without targets, metrics are just monitoring — you can't declare success or failure.
Target-setting framework:
| Metric | Current Baseline | Target (90-day) | Target (6-month) | Target (12-month) | Source of Target |
|---|---|---|---|---|---|
| [Metric 1] | [Current value] | [90-day target] | [6-mo target] | [12-mo target] | [Why this number] |
| [Metric 2] | [Current value] | [90-day target] | [6-mo target] | [12-mo target] | [Why this number] |
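
Each row of the target table can be treated as a small record: a baseline, a target, and a way to say how much of the gap has been closed. The following is a minimal sketch under that assumption; `MetricTarget` and its methods are hypothetical names, and metrics with no baseline yet (new measurements) would need separate handling.

```python
from dataclasses import dataclass


@dataclass
class MetricTarget:
    name: str
    baseline: float
    target: float

    def progress(self, current: float) -> float:
        """Fraction of the baseline-to-target gap closed so far.

        Works for higher-is-better and lower-is-better metrics alike,
        because the sign of the gap cancels out in the division.
        """
        gap = self.target - self.baseline
        if gap == 0:
            raise ValueError("target equals baseline: no improvement committed")
        return (current - self.baseline) / gap

    def met(self, current: float) -> bool:
        """True once the current value has reached or passed the target."""
        return self.progress(current) >= 1.0


# Hypothetical lower-is-better metric, in minutes.
diagnosis_time = MetricTarget("Median diagnosis time", baseline=240, target=60)
print(diagnosis_time.progress(150))  # halfway from 240 to 60
print(diagnosis_time.met(45))        # already past the target
```

A negative progress value means the metric has moved away from the target since the baseline was recorded.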
Sources for targets:
Anti-patterns in target-setting:
For every success metric, identify what could go wrong:
| Success metric (must improve) | Guardrail metric (must not degrade) | Counter-metric (watch for unintended harm) |
|---|---|---|
| Task completion speed | Error rate | Accuracy of completed tasks |
| User adoption rate | System reliability | Performance for existing users |
| Feature usage frequency | Page load time | Support ticket volume |
| New customer acquisition | Existing customer retention | Cost per acquisition |

Rule: You cannot declare success if a guardrail metric degrades, even if the success metric improves. A 50% speed improvement that introduces 5x more errors is not a success.
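The rule above can be expressed as a simple gate: check guardrails first, and only then look at the success metric. This is a sketch, not a prescribed implementation. `Guardrail` and `verdict` are hypothetical names, and the code assumes guardrails where a higher value is worse (error rate, latency).

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    baseline: float
    current: float
    max_degradation: float  # largest acceptable increase over baseline

    def breached(self) -> bool:
        return self.current - self.baseline > self.max_degradation


def verdict(success_metric_improved: bool, guardrails: list) -> str:
    """Guardrails are checked first; a breach overrides any improvement."""
    breached = [g.name for g in guardrails if g.breached()]
    if breached:
        return "NOT A SUCCESS: guardrail breached (" + ", ".join(breached) + ")"
    if success_metric_improved:
        return "SUCCESS"
    return "NOT A SUCCESS: success metric did not improve"


# Hypothetical numbers mirroring the rule: speed improved, errors rose 5x.
error_rate = Guardrail("error rate", baseline=0.02, current=0.10,
                       max_degradation=0.01)
print(verdict(success_metric_improved=True, guardrails=[error_rate]))
```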
Before building the full product, design experiments to validate your riskiest assumptions:
Experiment Design Template:
EXPERIMENT: [Name]
==================
Hypothesis: "If we [action], then [predicted outcome] because [rationale]"
Riskiest Assumption Being Tested: [What this experiment validates]
Success Criteria: [Specific metric + threshold that confirms hypothesis]
Failure Criteria: [Specific metric + threshold that rejects hypothesis]
Duration: [How long to run for statistical confidence]
Sample Size: [Minimum N for statistical significance]
Methodology: [A/B test / before-after / cohort / prototype test / wizard of oz]
DECISION FRAMEWORK:
- If success criteria met → Proceed to full build
- If failure criteria met → Pivot to alternative approach or kill
- If inconclusive → Extend duration or increase sample size (define limit)
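To fill in the Sample Size field for an A/B test on a rate metric, a common starting point is the normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test at 5% significance and 80% power; those defaults, the function name, and the example rates are assumptions rather than part of the template.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_target, alpha=0.05, power=0.80):
    """Approximate N per group to detect a change from p_baseline to p_target.

    Uses the standard normal approximation for two independent proportions.
    """
    if p_baseline == p_target:
        raise ValueError("baseline and target rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical example: task completion rate expected to move from 10% to 15%.
print(sample_size_per_group(0.10, 0.15))
```

For these example rates the result is on the order of several hundred users per group. Larger expected effects shrink the required sample quickly, because the effect size enters the formula squared.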
Experiment types by stage:
| Product Stage | Experiment Type | What It Validates |
|---|---|---|
| Pre-build | Fake door test | Demand exists (customers click) |
| Pre-build | Wizard of Oz | Solution concept works (manual simulation) |
| Pre-build | Concierge MVP | Customers get value (hands-on with small N) |
| Alpha | Prototype usability test | Customers can use it (task completion) |
| Beta | Limited release A/B test | Metrics improve with real usage |
| GA | Full rollout with holdback | Causal impact at scale |
Define when and how you'll review metrics and make decisions:
Weekly rhythm:
Monthly rhythm:
Decision points:
Pivot/persevere framework:
METRIC ASSESSMENT AT DECISION POINT
====================================
Inputs improving AND Outputs improving → PERSEVERE (scaling)
Inputs improving AND Outputs flat → INVESTIGATE (broken chain)
Inputs flat AND Outputs flat → PIVOT (approach isn't working)
Inputs degrading AND Outputs degrading → KILL or RESTART (fundamental problem)
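The four rows above amount to a lookup table keyed on input and output trends. The sketch below encodes exactly those four cases. The `Trend` enum and the "REVIEW MANUALLY" fallback for combinations the framework does not list are assumptions added for illustration.

```python
from enum import Enum


class Trend(Enum):
    IMPROVING = "improving"
    FLAT = "flat"
    DEGRADING = "degrading"


# (input trend, output trend) -> decision, as listed in the framework.
DECISIONS = {
    (Trend.IMPROVING, Trend.IMPROVING): "PERSEVERE",
    (Trend.IMPROVING, Trend.FLAT): "INVESTIGATE",
    (Trend.FLAT, Trend.FLAT): "PIVOT",
    (Trend.DEGRADING, Trend.DEGRADING): "KILL or RESTART",
}


def assess(inputs: Trend, outputs: Trend) -> str:
    """Return the framework's decision, or a fallback for unlisted cases."""
    # Fallback is an assumption: the framework does not cover every pairing.
    return DECISIONS.get((inputs, outputs), "REVIEW MANUALLY")


print(assess(Trend.IMPROVING, Trend.FLAT))  # INVESTIGATE
print(assess(Trend.FLAT, Trend.IMPROVING))  # REVIEW MANUALLY
```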
SUCCESS FRAMEWORK
=================
Initiative: [Name from PR/FAQ]
Problem Solved: [One-line from Stage 2]
Solution: [One-line from Stage 3]
INPUT METRICS (Team Controls)
| Metric | Baseline | 90-Day Target | 6-Month Target | Owner |
|--------|----------|---------------|----------------|-------|
| [Metric 1] | [Value] | [Target] | [Target] | [Name] |
| [Metric 2] | [Value] | [Target] | [Target] | [Name] |
| [Metric 3] | [Value] | [Target] | [Target] | [Name] |
OUTPUT METRICS (Customer Experiences)
| Metric | Baseline | 90-Day Target | 6-Month Target | Measurement Method |
|--------|----------|---------------|----------------|--------------------|
| [Metric 1] | [Value] | [Target] | [Target] | [How measured] |
| [Metric 2] | [Value] | [Target] | [Target] | [How measured] |
GUARDRAIL METRICS (Must Not Degrade)
| Metric | Current Value | Maximum Acceptable Degradation |
|--------|---------------|-------------------------------|
| [Metric 1] | [Value] | [Threshold] |
| [Metric 2] | [Value] | [Threshold] |
COUNTER-METRICS (Watch for Harm)
| Metric | Current Value | Alert Threshold |
|--------|---------------|-----------------|
| [Metric 1] | [Value] | [Threshold] |
VALIDATION EXPERIMENTS (Pre-build)
| Experiment | Hypothesis | Success Criteria | Duration | Start Date |
|------------|-----------|-----------------|----------|-----------|
| [Exp 1] | [If/then] | [Metric > X] | [Weeks] | [Date] |
| [Exp 2] | [If/then] | [Metric > X] | [Weeks] | [Date] |
DECISION CADENCE
- Weekly: Input metric review (team)
- Monthly: Output metric review (team + PM)
- Quarterly: Business metric review (team + leadership)
DECISION POINTS
| Date | Decision | Criteria |
|------|----------|----------|
| [Launch + 4 weeks] | Continue/Adjust | [Specific criteria] |
| [Launch + 8 weeks] | Investigate/Persist | [Specific criteria] |
| [Launch + 12 weeks] | Scale/Pivot/Kill | [Specific criteria] |
KILL CRITERIA (When to stop):
- [Specific condition that triggers stopping the initiative]
- [Specific condition that triggers stopping the initiative]
Present your complete Success Framework to the user. Ask:
Wait for the user to respond with explicit approval or requested changes.
⛔ DO NOT finalize the Working Backwards process until the user explicitly approves or says to continue.
SUCCESS FRAMEWORK
=================
Initiative: Deployment Telescope
Problem Solved: Platform engineers spend a median of 4.2 hours diagnosing each pipeline failure
Solution: Automated correlation of deployment events with root cause ranking
INPUT METRICS (Team Controls)
| Metric | Baseline | 90-Day Target | 6-Month Target | Owner |
|--------|----------|---------------|----------------|-------|
| Root cause accuracy (top-3) | N/A (new) | 70% | 85% | ML Team |
| Correlation coverage (% of failure types) | N/A | 60% | 90% | Platform Team |
| Time to display correlated timeline | N/A | <30 seconds | <10 seconds | Backend Team |
| CI/CD integrations supported | 0 | 3 (Jenkins, GHA, GitLab) | 7 | Integrations |
OUTPUT METRICS (Customer Experiences)
| Metric | Baseline | 90-Day Target | 6-Month Target | Measurement |
|--------|----------|---------------|----------------|-------------|
| Median time-to-root-cause | 4.2 hours | 45 minutes | 15 minutes | In-product timer |
| % of failures resolved without escalation | 34% | 60% | 80% | Resolution tracking |
| Engineer satisfaction with debugging (1-10) | 3.2 | 6.0 | 8.0 | Monthly survey |
GUARDRAIL METRICS (Must Not Degrade)
| Metric | Current Value | Maximum Acceptable Degradation |
|--------|---------------|-------------------------------|
| Pipeline execution time (added overhead) | 0ms | +500ms (0.5 second max added latency) |
| False positive rate (incorrect root cause in top-1) | N/A | <30% (must not erode trust) |
| Data security (no credential exposure) | 0 incidents | 0 incidents (zero tolerance) |
COUNTER-METRICS (Watch for Harm)
| Metric | Current Value | Alert Threshold |
|--------|---------------|-----------------|
| Engineers skipping manual investigation (over-relying on tool) | N/A | >20% of cases where tool is wrong AND engineer doesn't verify |
| Alert fatigue from correlation notifications | 0 | >3 non-actionable notifications per engineer per day |
VALIDATION EXPERIMENTS (Pre-build)
| Experiment | Hypothesis | Success Criteria | Duration | Start |
|------------|-----------|-----------------|----------|-------|
| Retrospective correlation | If we run the correlation algorithm against 50 historical failures, it identifies the correct root cause in the top 3 for ≥60% of them | Top-3 accuracy ≥60% on 50 historical cases | 3 weeks | Week 1 |
| Wizard of Oz (manual analysis) | If we manually provide correlated timelines to 5 engineers during real failures, they resolve >50% faster | Time-to-resolution <2 hours for 4/5 cases | 4 weeks | Week 2 |
DECISION CADENCE
- Weekly: Root cause accuracy and coverage metrics (team standup)
- Monthly: Customer time-to-resolution and satisfaction (PM review)
- Quarterly: Retention impact and expansion metrics (leadership review)
DECISION POINTS
| Date | Decision | Criteria |
|------|----------|----------|
| Launch + 4 weeks | Continue / Adjust algorithm | Accuracy ≥50% on live failures AND 10+ active teams |
| Launch + 8 weeks | Invest in V2 / Investigate stall | Median time-to-resolution ≤1 hour |
| Launch + 12 weeks | Scale to GA / Pivot approach | 3+ integrations stable AND engineer satisfaction ≥ 7 (1-10 scale) from active users |
KILL CRITERIA:
- Root cause accuracy plateaus below 50% after 8 weeks of iteration (algorithm approach fundamentally flawed)
- Fewer than 5 teams actively using after 90 days despite free access (value prop not compelling)
- Security incident involving customer pipeline data (trust destroyed, must rebuild from scratch)
| Intention | Mechanism |
|---|---|
| "We'll track metrics after launch" | Success Framework written and approved BEFORE engineering begins. Metrics instrumented in sprint 1 |
| "We'll know success when we see it" | Specific numeric targets committed in writing. No retroactive redefinition of success |
| "We'll iterate based on feedback" | Decision points with specific dates and criteria. Not vague "we'll see how it goes" |
| "We won't let the metrics game us" | Counter-metrics and guardrail metrics defined alongside success metrics |
| "We'll pivot if it's not working" | Kill criteria explicitly defined. Prevents sunk-cost fallacy from keeping a failing initiative alive |
| What They Say | Why It's Wrong | What To Do Instead |
|---|---|---|
| "It's too early to set targets — we don't know enough" | You know the problem statement (Stage 2) and the current baseline. That's enough to set a directional target | Set targets with confidence ranges. "We expect 60-80% improvement" is better than no target. Refine as you learn |
| "Our metrics won't be statistically significant with small N" | With small N you can only detect large effects. If your product can't produce a large, observable effect with early adopters, the value prop is weak | Design for large effect detection first. If you need 10,000 users to see a difference, the per-user value may be too small |
| "Vanity metrics are fine for early stage — we need traction" | Vanity metrics (signups, page views) don't predict business success. Teams that optimize for vanity metrics build popular-but-unprofitable products | Track vanity metrics for momentum, but DECIDE based on value metrics (retention, engagement depth, task completion) |
| "We can't measure customer satisfaction quantitatively" | You can always find proxies: repeat usage, support ticket reduction, time-on-task improvement, NPS, feature-specific CSAT | Pick 2-3 proxy metrics. Imperfect measurement > no measurement. Triangulate across multiple imperfect signals |
| "Setting kill criteria is defeatist" | Kill criteria are the opposite of defeatist — they're courageous. They commit you to intellectual honesty over sunk-cost rationalization | Define kill criteria as a sign of rigor. Teams that can't name their failure conditions can't recognize failure when it arrives |
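
The small-N point in the table can be made concrete by asking the reverse of the usual sample-size question: given the users you have, how large must an effect be before you can detect it? The sketch below uses a normal approximation for a rate metric and assumes both groups have roughly the baseline variance. The function name, defaults (5% significance, 80% power), and example numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(p_baseline, n_per_group, alpha=0.05, power=0.80):
    """Smallest absolute change in a rate detectable with n_per_group users.

    Approximation: treats both groups as having the baseline variance.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sqrt(2 * p_baseline * (1 - p_baseline) / n_per_group)


# Hypothetical 10% baseline rate at two very different sample sizes.
for n in (50, 10_000):
    mde = minimum_detectable_effect(0.10, n)
    print(f"N = {n:>6} per group: detectable change of about {mde:.1%}")
```

With 50 users per group, only a shift of well over ten percentage points is detectable; with 10,000 it drops to roughly one point. If the product can only move the metric by a point or so, the early-adopter stage will not show it.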
Before beginning engineering/execution:
npx claudepluginhub robisson/build-like-amazon-agent-skills