Skill

kill-or-ship

Forces a decision (KILL, PIVOT, RECOMMIT, REFINE, SHIP) on a hypothesis using repository evidence. Use at falsification completion, adversary verdict, or cost cap breach.

developer-tools

Slash command

/epistemic-skills:kill-or-ship

> **Related skills:** `/skill:research-question`, `/skill:experiment-execution`, `/skill:falsification-review`, `/skill:surprise-triage`, `/skill:verification-before-publication`

SKILL.md

498 lines · ~5.5k tokens (exceeds 5k compaction limit)

Stats

Language: TypeScript

Stars: 7

Maintenance: Excellent

Last commit: Jun 4, 2026

Kill or Ship

Overview

This is the decision phase. Not the coping phase. Not the one-more-run phase. Not the place where sunk cost gets a vote.

You must choose exactly one branch:

KILL — the current claim dies.
PIVOT — the current claim dies, but the failure teaches a different claim worth registering as a new hypothesis.
RECOMMIT — same claim, same method, bounded extra budget or time under a written override.
REFINE — same claim, changed method, written override, explicit refinement count, then rerun from execution.
SHIP — the claim survived the gates and is ready for publication verification.

Two distinctions are non-negotiable:

PIVOT is still a kill of the old hypothesis.
REFINE is not RECOMMIT. RECOMMIT keeps the method. REFINE changes it.

COST_OVERRUN is not a sixth branch. It is the LessonEntry.outcome you write when budget pressure forced the decision.

Current repo reality matters:

src/state/repo.ts is the canonical state surface.
src/adversary/dispatch.ts is the adversary entrypoint.
src/index.ts still shows registerKillCriteria(...) as planned, not active.

So no live gate is going to save you from a sentimental decision. This skill is the gate.

Quick Reference

| Branch | Same claim? | Same method? | Required writes | Lesson outcome |
| --- | --- | --- | --- | --- |
| `KILL` | No future work on this claim | n/a | `HYPOTHESES.md` -> `KILLED`, `killReason`, `experiments/{id}/KILLED.md` | `"KILLED"` or `"COST_OVERRUN"` |
| `PIVOT` | No | No | old entry `KILLED`, kill reason points to new id, `experiments/{id}/KILLED.md`, new hypothesis entry | `"PIVOT"` |
| `RECOMMIT` | Yes | Yes | `OVERRIDES.md`, possible cap or window change, status stays `RUNNING` | `"COST_OVERRUN"` only if budget forced it |
| `REFINE` | Yes | No | `OVERRIDES.md`, increment `Refinement count`, rerun | none |
| `SHIP` | Yes | Yes | confirmed result on disk, status `CONFIRMED` | none |

The Iron Law

5:1 kill-to-ship is normal.
Pivots count as kills of the old claim.
Cost already spent still does not vote.

Most ideas should die. Some should pivot. A few should survive long enough to ship. Anything softer becomes zombie research.

When to Use

Use this skill when:

falsification review is complete and the next move must be explicit
surprise triage is complete and the anomaly is now explained, downgraded, or fatal
an adversary verdict came back falsified-or-unreproducible
spend is near or above the hypothesis costCap
the result is clean enough that SHIP might be available
the same claim may need either a bounded recommit or a methodological refinement
a failed claim appears to suggest a better new claim
you are about to change hypothesis status, write KILLED.md, or write an override
you are tempted to quietly stop talking about a weak run instead of deciding it

Use it especially when the decision feels awkward. That usually means emotion is trying to outvote evidence.

When NOT to Use

Do not use this skill:

before experiments/{id}/prereg.md exists
before the run produced any real evidence
instead of /skill:falsification-review
instead of /skill:surprise-triage
to publish anything that still lives only in smokes/
to silently revive a KILLED record
to launder a changed claim under the old hypothesis id
to excuse missing judge.lock, stale baselines, or missing falsifier files
to backfill method changes after you already shipped the result

If the idea itself is changing before the evidence exists, use /skill:research-question or /skill:preregistration instead.

Decision Tree

Did the claim actually fail?
├─ yes
│  ├─ Ask first: "What does this failure teach us that we didn't know before?"
│  ├─ Concrete new claim, new contract, new id? -> PIVOT
│  └─ No concrete new claim? -> KILL
└─ no
   ├─ Same claim, same method, bounded extra budget or time? -> RECOMMIT
   ├─ Same claim, changed method? -> REFINE
   └─ All gates clean, confirmed result on disk, no unresolved overrun? -> SHIP

REFINE is not a loophole after a real falsifier kill. If the claim died, kill it or pivot it. REFINE is for the same claim when the method changed and the claim itself is still live.
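The tree above can be sketched as a small pure function. This is an illustration only, not code from the repository: `DecisionFacts`, its field names, and `chooseBranch` are hypothetical, and the `null` result is an assumption for the case where no branch is available yet.

```typescript
// Hypothetical sketch of the decision tree; none of these names exist in the repo.
type Branch = "KILL" | "PIVOT" | "RECOMMIT" | "REFINE" | "SHIP";

interface DecisionFacts {
  claimFailed: boolean;           // a real falsifier hit landed on the claim as stated
  hasConcreteNewClaim: boolean;   // new claim, new contract, and new id are written down
  methodChanged: boolean;         // same claim, but the method is being changed
  needsBoundedExtension: boolean; // same claim and method, bounded extra budget or time
  allGatesClean: boolean;         // gates clean, confirmed result on disk, no open overrun
}

function chooseBranch(facts: DecisionFacts): Branch | null {
  if (facts.claimFailed) {
    // A failed claim can only pivot or die; REFINE is never reachable from here.
    return facts.hasConcreteNewClaim ? "PIVOT" : "KILL";
  }
  // A method change rules out RECOMMIT, so it is checked first.
  if (facts.methodChanged) return "REFINE";
  if (facts.needsBoundedExtension) return "RECOMMIT";
  if (facts.allGatesClean) return "SHIP";
  return null; // assumption: nothing is available yet, so the decision is not ready
}
```

Note that `methodChanged` is ignored once `claimFailed` is true, which mirrors the rule that a real falsifier kill cannot be relabelled as a refinement.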

State Surface

Read the actual repo state before deciding anything.

| Surface | Why it matters |
| --- | --- |
| `HYPOTHESES.md` | canonical branch record, `killReason`, new hypothesis entry, refinement counter |
| `.epistemic/cost-ledger.jsonl` | total spend and spend composition |
| `.epistemic/lessons.jsonl` | cross-run memory via `appendLesson()` |
| `OVERRIDES.md` | mandatory authorization for `RECOMMIT` and `REFINE` |
| `experiments/{id}/prereg.md` | `SHIP` eligibility and method contract |
| `experiments/{id}/judge.lock` | proof the judge did not drift |
| `experiments/{id}/smokes/` | provisional evidence only |
| `experiments/{id}/RESULTS.md` | confirmed result required for `SHIP` |
| `experiments/{id}/KILLED.md` | terminal artifact for `KILL` and the old side of a `PIVOT` |
| `experiments/{id}/falsifiers/` | why the claim survived or died |
| `BASELINES.md` and `experiments/repro_{name}/prereg.md` | freshness and reproduction for comparison claims |
| `src/state/repo.ts` | canonical helpers and types |
| `src/adversary/dispatch.ts` | adversary verdict source |
| `src/index.ts` | proves the kill gate is still planned, not enforced |

State helpers you will actually use here:

loadRepoState(cwd)
loadHypotheses(cwd), getActiveHypothesis(entries), parseHypotheses(content)
hypothesisToMarkdown(entry), saveHypotheses(cwd, entries), updateHypothesisStatus(cwd, id, status)
fileExists(path)
getHypothesisSpend(cwd, id), getHypothesisSpendByCategory(cwd, id), getAllHypothesisSpends(cwd)
loadBaselines(cwd), getBaselineAgeDays(entry)
getJudgeLock(cwd, id), computeJudgeHash(judgeRef, id)
appendLesson(cwd, lesson)
runFalsificationAdversary({ claim, context, cwd }) if the decision depends on missing or stale adversary output

Current repo reality:

HypothesisEntry supports killReason.
LessonEntry.outcome supports "KILLED", "PIVOT", "COST_OVERRUN", and "UNREPRODUCIBLE_BASELINE".
HypothesisEntry does not currently carry a refinement counter.

So REFINE needs a visible `- **Refinement count:** N` line in the hypothesis block. Preserve it deliberately. Do not assume saveHypotheses(...) will keep unknown fields.

The Process

1. Load the real decision state

Start from repo state, not memory.
Call loadRepoState(cwd) for the top-level scaffold.
Call loadHypotheses(cwd) and identify the active hypothesis.
If several hypotheses could match, resolve the exact id explicitly.
Read the active HypothesisEntry closely enough to answer:
- what is the claim
- what falsifies it
- what is the best-case conclusion
- what is the cost cap
- what compute target is expected
- what judge is locked
- what baseline is being compared
- what the current status says
Locate experiments/{id}/.
Check prereg.md, judge.lock, RESULTS.md, and KILLED.md with fileExists(...).
Inspect smokes/, falsifiers/, and baselines/.
Pull total spend with getHypothesisSpend(cwd, id).
Pull the spend split with getHypothesisSpendByCategory(cwd, id).
If shared budget matters, inspect getAllHypothesisSpends(cwd).
For comparison claims, load the relevant baseline metadata and freshness.
Do not decide from memory.
Do not decide from the last encouraging run.
Do not decide from the loudest person in the room.

2. Read the money as diagnosis, not decoration

Total spend is not enough.
You must read the split from getHypothesisSpendByCategory(cwd, id).
Record both numbers: llm and compute.
The split changes the story.
A hypothesis that spent $10 on LLM and $200 on Modal is not failing the same way as one that spent $180 on judge calls and $5 on compute.
Interpret the split before you write the reason:
- llm >> compute often means the hypothesis, judge, prompt, or search loop consumed the budget.
- compute >> llm often means the substrate, orchestration path, or execution economics consumed the budget.
- low spend with a decisive falsifier means kill quickly instead of defending the sunk cost.
Compare total spend against costCap.
If spend is greater than 1.5 × costCap, treat it as a forced decision point.
SHIP is closed until the overrun is explicitly resolved.
If budget pressure drove the outcome, the lesson you write later uses outcome: "COST_OVERRUN".
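As a rough sketch of how the split and the 1.5 × costCap rule might be read together: the function below takes the two category totals and the cap and reports the dominant pattern and whether the decision is forced. `diagnoseSpend`, `SpendSplit`, and especially `DOMINANCE_RATIO` are assumptions for illustration; the skill says "llm >> compute" without giving a number, so the factor of 3 is arbitrary.

```typescript
// Hypothetical sketch; the real split comes from getHypothesisSpendByCategory(cwd, id).
interface SpendSplit {
  llm: number;
  compute: number;
}

type SpendPattern = "llm-dominant" | "compute-dominant" | "mixed";

interface SpendDiagnosis {
  total: number;
  pattern: SpendPattern;
  overCap: boolean;        // total spend is above costCap
  forcedDecision: boolean; // total spend is above 1.5 x costCap
}

// Assumption: "much greater than" is read as a 3x ratio. Tune or replace as needed.
const DOMINANCE_RATIO = 3;

function diagnoseSpend(split: SpendSplit, costCap: number): SpendDiagnosis {
  const total = split.llm + split.compute;
  let pattern: SpendPattern = "mixed";
  if (split.llm > DOMINANCE_RATIO * split.compute) {
    pattern = "llm-dominant"; // hypothesis, judge, prompt, or search loop ate the budget
  } else if (split.compute > DOMINANCE_RATIO * split.llm) {
    pattern = "compute-dominant"; // substrate, orchestration, or execution economics ate it
  }
  return {
    total,
    pattern,
    overCap: total > costCap,
    forcedDecision: total > 1.5 * costCap,
  };
}
```

With a cap of 100, a split of 10 on LLM and 200 on compute comes back as compute-dominant with `forcedDecision` set, which is the case where SHIP stays closed until the overrun is resolved.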

3. Close branches that are not legally available

SHIP is closed if experiments/{id}/prereg.md is missing.
SHIP is closed if judge.lock is missing or does not match computeJudgeHash(h.judgeRef, id).
SHIP is closed if the claim still depends on smokes/.
SHIP is closed if comparison language depends on a stale or unreproduced baseline.
SHIP is closed if the falsifier files show unresolved falsified-or-unreproducible or cannot-audit verdicts.
SHIP is closed if cost overrun was never explicitly resolved.
RECOMMIT is closed if the claim changed.
RECOMMIT is closed if the method changed.
REFINE is closed if the claim changed.
REFINE is closed if you cannot describe the old method, the new method, and why the claim itself still deserves to live.
PIVOT is closed if you do not have a concrete new hypothesis.
A killed hypothesis cannot be reopened in place.
If the old idea deserves another life, it gets a new id.
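The SHIP closures above lend themselves to a checklist that returns reasons rather than a bare boolean. The sketch below is hypothetical: `ShipGateFacts` and `shipBlockers` are not repository names, and the caller is assumed to have gathered the facts already with helpers such as fileExists(...), getJudgeLock(...), and computeJudgeHash(...).

```typescript
// Hypothetical sketch; field names are illustrative, not repo types.
interface ShipGateFacts {
  preregExists: boolean;
  judgeLockHash: string | null; // null when judge.lock is missing
  expectedJudgeHash: string;    // value the caller got from computeJudgeHash(...)
  dependsOnSmokes: boolean;
  baselineStaleOrUnreproduced: boolean;
  unresolvedFalsifierVerdicts: number; // falsified-or-unreproducible or cannot-audit
  costOverrunUnresolved: boolean;
}

function shipBlockers(facts: ShipGateFacts): string[] {
  const blockers: string[] = [];
  if (!facts.preregExists) blockers.push("prereg.md is missing");
  if (facts.judgeLockHash === null) {
    blockers.push("judge.lock is missing");
  } else if (facts.judgeLockHash !== facts.expectedJudgeHash) {
    blockers.push("judge.lock does not match the computed judge hash");
  }
  if (facts.dependsOnSmokes) blockers.push("claim still depends on smokes/");
  if (facts.baselineStaleOrUnreproduced) {
    blockers.push("comparison relies on a stale or unreproduced baseline");
  }
  if (facts.unresolvedFalsifierVerdicts > 0) {
    blockers.push("falsifier files show unresolved verdicts");
  }
  if (facts.costOverrunUnresolved) blockers.push("cost overrun was never resolved");
  return blockers; // an empty list means SHIP is not closed by these checks
}
```

Returning every blocker at once, instead of stopping at the first, keeps the written reason complete when SHIP is refused.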

4. When the adversary says `falsified`, ask the pivot question first

Treat any falsified-or-unreproducible verdict as a real falsifier hit for this phase.

Ask this exact question before you even think about KILL:

What does this failure teach us that we didn't know before?

Then decide honestly:

If the answer yields a concrete new claim, new boundary condition, or new comparator that the old evidence actually revealed, choose PIVOT.
If the answer is just a plea for more effort, choose KILL.
If the answer is "same claim, but we need a different method," that is REFINE only when the claim itself survived and only the method is changing.
If the falsifier killed the claim as stated, REFINE is not available.
PIVOT comes before KILL in this branch because learning is the only honest rescue.
No new learning, no pivot.

5. Separate `RECOMMIT` from `REFINE`

This is where people lie to themselves.

Choose RECOMMIT only when all of these are true:

same hypothesis id
same claim
same method
same success condition
one bounded extra window of time or budget is justified by concrete new information

Choose REFINE only when all of these are true:

same hypothesis id
same claim
same success or failure boundary
the method changed
the change is written explicitly
the claim is still worth testing after the method change

If the claim changed, it is not RECOMMIT. If the claim changed, it is not REFINE. It is either PIVOT or KILL.
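The two checklists reduce to three questions about what changed. A minimal sketch, with hypothetical names (`ChangeSet`, `classifyContinuation`) that do not come from the repository:

```typescript
// Hypothetical sketch of the RECOMMIT / REFINE boundary.
interface ChangeSet {
  claimChanged: boolean;
  successBoundaryChanged: boolean; // success condition or failure boundary moved
  methodChanged: boolean;
}

type Continuation = "RECOMMIT" | "REFINE" | "PIVOT_OR_KILL";

function classifyContinuation(change: ChangeSet): Continuation {
  // Any change to the claim or to what counts as success ends the old hypothesis.
  if (change.claimChanged || change.successBoundaryChanged) return "PIVOT_OR_KILL";
  // Same claim, same boundary: the only remaining question is the method.
  return change.methodChanged ? "REFINE" : "RECOMMIT";
}
```

This covers only the classification. RECOMMIT still needs a bounded window justified by concrete new information, and REFINE still needs the method change written down.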

6. Execute the chosen branch exactly

If the answer is `KILL`

KILL means the current claim is dead and there is no concrete better claim to register right now.
Call updateHypothesisStatus(cwd, id, "KILLED").
Reload the entries with loadHypotheses(cwd).
Set the matching entry's killReason.
Save the entries back with saveHypotheses(cwd, entries).
Write experiments/{id}/KILLED.md.
Include:
- hypothesis id
- claim
- total spend
- spend split (llm, compute)
- compute target
- time spent
- decision: KILL
- root cause
If budget pressure drove the kill, say so plainly and include the split.
Preserve smokes/, falsifier files, and ledger history.
Do not reopen the same id later.
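One way to keep KILLED.md complete is to render it from a typed record so no required field can be skipped. The sketch below is an assumption-heavy illustration: `KillNote` and `renderKilledNote` are hypothetical, and the label wording loosely follows the pivot example in Good vs Bad rather than any format the repository enforces.

```typescript
// Hypothetical sketch; the repo does not define this type or function.
interface KillNote {
  hypothesisId: string;
  claim: string;
  llmSpend: number;
  computeSpend: number;
  computeTarget: string;
  timeSpent: string;     // free text, for example "3 days"
  rootCause: string;
  budgetDriven: boolean; // true when budget pressure forced the kill
}

function renderKilledNote(note: KillNote): string {
  const usd = (amount: number): string => `$${amount.toFixed(2)}`;
  const total = note.llmSpend + note.computeSpend;
  const lines = [
    "# KILLED",
    `- Hypothesis ID: ${note.hypothesisId}`,
    `- Claim: ${note.claim}`,
    `- Spend: ${usd(total)} total (${usd(note.llmSpend)} llm, ${usd(note.computeSpend)} compute)`,
    `- Compute target: ${note.computeTarget}`,
    `- Time spent: ${note.timeSpent}`,
    "- Decision: KILL",
    `- Root cause: ${note.rootCause}`,
  ];
  if (note.budgetDriven) {
    lines.push("- Budget pressure: yes, the spend split above drove this decision");
  }
  return lines.join("\n") + "\n";
}
```

The returned string would then be written to experiments/{id}/KILLED.md by whatever file helper the caller already uses.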

If the answer is `PIVOT`

PIVOT means the old claim died.
Start there.
Update the old hypothesis to KILLED.
Set killReason so it points to the new hypothesis id and the lesson learned.
Write experiments/{oldId}/KILLED.md.
The pivot note must include:
- old hypothesis id
- old claim
- why the old claim failed
- what the failure taught
- total spend and spend split
- new hypothesis id
Then create a brand-new hypothesis entry.
Use loadHypotheses(cwd), append the new HypothesisEntry, then persist with saveHypotheses(cwd, entries).
New id. New timestamp. New contract.
The new entry must include id, claim, falsifier, bestCaseConclusion, n, judgeRef, baselineRef, costCap, computeTarget, status: OPEN, and timestamp.
If you cannot write the new claim concretely, then you do not have a pivot yet.
Kill the old hypothesis honestly and return to /skill:research-question instead of faking specificity.
The pivot rationale must explain what was learned, not merely what you want to try next.

If the answer is `RECOMMIT`

RECOMMIT is same claim, same method, tighter remaining work.
Write the override in OVERRIDES.md.
The reason must be at least 50 characters long.
Include:
- date
- hypothesis id
- trigger being overridden
- old cap or window
- new cap or window
- exact remaining experiment
- what changed since the original plan
If the remaining work is not specific, do not recommit.
Update the live hypothesis entry only as needed:
- adjusted costCap
- status RUNNING
Keep the same id.
Keep the same claim.
Keep the same method.
If budget pressure forced the recommit, append a COST_OVERRUN lesson.
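The override requirements can be checked mechanically before anything is appended to OVERRIDES.md. This is a sketch under assumptions: `RecommitOverride` and `overrideProblems` are hypothetical names, and the length check counts trimmed characters, which the skill does not specify.

```typescript
// Hypothetical sketch; OVERRIDES.md has no enforced schema in this document.
const MIN_REASON_LENGTH = 50;

interface RecommitOverride {
  date: string;
  hypothesisId: string;
  trigger: string;             // the trigger being overridden
  oldCapOrWindow: string;
  newCapOrWindow: string;
  remainingExperiment: string; // the exact remaining work
  whatChanged: string;         // what changed since the original plan
  reason: string;
}

function overrideProblems(override: RecommitOverride): string[] {
  const problems: string[] = [];
  const required: Array<[string, string]> = [
    ["date", override.date],
    ["hypothesis id", override.hypothesisId],
    ["trigger", override.trigger],
    ["old cap or window", override.oldCapOrWindow],
    ["new cap or window", override.newCapOrWindow],
    ["remaining experiment", override.remainingExperiment],
    ["what changed", override.whatChanged],
  ];
  for (const [label, value] of required) {
    if (value.trim() === "") problems.push(`missing ${label}`);
  }
  // Assumption: whitespace padding does not count toward the minimum.
  if (override.reason.trim().length < MIN_REASON_LENGTH) {
    problems.push(`reason must be at least ${MIN_REASON_LENGTH} characters`);
  }
  return problems;
}
```

A length check cannot tell a specific reason from a padded one, so it is a floor, not a substitute for reading the override.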

If the answer is `REFINE`

REFINE keeps the claim and changes the method.
Write the override in OVERRIDES.md.
The reason must be at least 50 characters long.
The override must name:
- the old method
- the new method
- why the old method failed
- why the same claim still deserves a test
Increment the refinement counter in the hypothesis entry.
Write it as an explicit `- **Refinement count:** N` line in that hypothesis block.
Because the current state serializer does not round-trip that field, preserve the line manually when you edit the block.
If you must update supported fields in the same pass, re-read the raw markdown and make one careful edit instead of helper-round-tripping the block and dropping the counter.
Keep the same id.
Do not create a new hypothesis entry.
If the change alters the claim, comparator, metric, baseline target, or the meaning of success, it is not REFINE.
It is PIVOT.
Once the override, method record, and counter are written, route the work back to /skill:experiment-execution.
Do not jump from REFINE straight to publication.
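Because the counter lives in raw markdown, the increment can be done as a text edit on the hypothesis block instead of a parse-and-serialize round trip. A minimal sketch follows; `bumpRefinementCount` is a hypothetical helper, and starting the counter at 1 when the line is absent is an assumption, not a documented rule.

```typescript
// Hypothetical sketch: edits the raw hypothesis block text directly.
// Matches a whole line such as "- **Refinement count:** 2" (LF line endings assumed).
const REFINEMENT_LINE = /^- \*\*Refinement count:\*\* (\d+)[ \t]*$/m;

function bumpRefinementCount(block: string): string {
  const match = REFINEMENT_LINE.exec(block);
  if (match) {
    const next = Number(match[1]) + 1;
    // Only the counter line is rewritten; every other line is left untouched.
    return block.replace(REFINEMENT_LINE, `- **Refinement count:** ${next}`);
  }
  // Assumption: a block with no counter yet is on its first refinement.
  const trimmed = block.replace(/\s+$/, "");
  return `${trimmed}\n- **Refinement count:** 1\n`;
}
```

The caller is still responsible for locating the right hypothesis block in HYPOTHESES.md and writing the edited text back in one pass.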

If the answer is `SHIP`

SHIP is the rare branch.
Before you take it, the repo must already look ship-ready.
Confirm:
- prereg exists and still matches the claim
- judge.lock matches computeJudgeHash(...)
- the result survived falsification review
- any required baseline is fresh and reproduced
- the confirmed number lives in experiments/{id}/RESULTS.md
- the claim no longer depends on smokes/
- no unresolved cost overrun remains
If any of that is false, SHIP is not available.
If all of it is true, call updateHypothesisStatus(cwd, id, "CONFIRMED").
SHIP does not skip publication verification.
It earns the right to start it.

Cross-Run Lessons Are Mandatory

On KILL, PIVOT, or budget-driven overrun decisions, append a LessonEntry through appendLesson() from src/state/repo.ts. Do not hand-edit .epistemic/lessons.jsonl.

Use the real fields:

hypothesisId
outcome
summary
costSpent
rootCause

Canonical shape:

await appendLesson(cwd, {
  timestamp: new Date().toISOString(),
  hypothesisId: id,
  outcome,
  summary,
  costSpent: totalSpend,
  rootCause,
});

Decision-to-lesson mapping:

KILL -> outcome: "KILLED" unless budget pressure was the forcing reason
PIVOT -> outcome: "PIVOT"
budget-driven KILL or RECOMMIT -> outcome: "COST_OVERRUN"
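The mapping can be written as a lookup so the outcome label is never chosen by feel. This is a sketch: `lessonOutcomeFor` is a hypothetical helper, and the outcome strings are the ones listed for LessonEntry.outcome earlier in this document.

```typescript
// Hypothetical sketch of the decision-to-lesson mapping.
type DecisionBranch = "KILL" | "PIVOT" | "RECOMMIT" | "REFINE" | "SHIP";
type LessonOutcome = "KILLED" | "PIVOT" | "COST_OVERRUN" | "UNREPRODUCIBLE_BASELINE";

function lessonOutcomeFor(
  branch: DecisionBranch,
  budgetForced: boolean,
): LessonOutcome | null {
  switch (branch) {
    case "KILL":
      return budgetForced ? "COST_OVERRUN" : "KILLED";
    case "PIVOT":
      return "PIVOT";
    case "RECOMMIT":
      // A recommit only writes a lesson when budget pressure forced it.
      return budgetForced ? "COST_OVERRUN" : null;
    default:
      // REFINE and SHIP have no lesson outcome in the Quick Reference.
      return null;
  }
}
```

"UNREPRODUCIBLE_BASELINE" is part of the type but is never returned here, because this section does not map any branch to it.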

Write the lesson like an adult:

summary says what was learned or why the line stopped
rootCause names the mechanism, not the mood
costSpent is the real total from getHypothesisSpend(...)

Good rootCause:

Modal compute burn dominated the run and no stable gain survived the locked judge.
Falsification showed the gain existed only on long-context tasks, so the general claim died.

Bad rootCause:

Not feeling it
Maybe later
Too messy

Close the Loop

The decision is not done until the repository tells one story without you present.

After KILL or PIVOT:

HYPOTHESES.md says KILLED
killReason is present
experiments/{id}/KILLED.md exists
.epistemic/lessons.jsonl has the lesson row

After RECOMMIT:

OVERRIDES.md has the override entry
the remaining work is bounded
any budget-driven exception has a COST_OVERRUN lesson
status stays RUNNING for a real reason, not habit

After REFINE:

OVERRIDES.md has the override entry
Refinement count incremented
the same claim still exists
the next step is /skill:experiment-execution

After SHIP:

experiments/{id}/RESULTS.md is the authoritative artifact
status is CONFIRMED
nothing quotable still depends on smokes/

If the files disagree, the decision is not finished.
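The per-branch checklists above can be sketched as a function that reports what is still open. Everything here is hypothetical: `RepoSnapshot` is an illustrative summary the caller would build from the real files, and the budget-driven COST_OVERRUN lesson for RECOMMIT is left out for brevity.

```typescript
// Hypothetical sketch; the snapshot is assembled by the caller from repo state.
type DecidedBranch = "KILL" | "PIVOT" | "RECOMMIT" | "REFINE" | "SHIP";

interface RepoSnapshot {
  hypothesisStatus: string;
  killReason: string | null;
  killedNoteExists: boolean;   // experiments/{id}/KILLED.md
  lessonRowExists: boolean;    // row in .epistemic/lessons.jsonl
  overrideEntryExists: boolean;
  refinementCountIncremented: boolean;
  resultsFileExists: boolean;  // experiments/{id}/RESULTS.md
}

function openLoopItems(branch: DecidedBranch, snap: RepoSnapshot): string[] {
  const open: string[] = [];
  const need = (ok: boolean, item: string): void => {
    if (!ok) open.push(item);
  };
  if (branch === "KILL" || branch === "PIVOT") {
    need(snap.hypothesisStatus === "KILLED", "HYPOTHESES.md does not say KILLED");
    need(snap.killReason !== null && snap.killReason.trim() !== "", "killReason is missing");
    need(snap.killedNoteExists, "KILLED.md is missing");
    need(snap.lessonRowExists, "lesson row is missing");
  } else if (branch === "RECOMMIT") {
    need(snap.overrideEntryExists, "override entry is missing");
    need(snap.hypothesisStatus === "RUNNING", "status is not RUNNING");
  } else if (branch === "REFINE") {
    need(snap.overrideEntryExists, "override entry is missing");
    need(snap.refinementCountIncremented, "Refinement count was not incremented");
  } else {
    need(snap.resultsFileExists, "RESULTS.md is missing");
    need(snap.hypothesisStatus === "CONFIRMED", "status is not CONFIRMED");
  }
  return open; // an empty list means the files tell one story for this branch
}
```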

Common Rationalizations

| Excuse | Reality |
| --- | --- |
| `We already spent too much to stop now.` | Prior spend is not evidence. It is exactly why you need a decision. |
| `Pivot is basically the same as keeping it alive.` | No. `PIVOT` kills the old claim and creates a new id. |
| `Falsified means kill immediately.` | First ask what the failure taught. `PIVOT` comes before `KILL` when the evidence supports a new claim. |
| `Refine and recommit are basically the same.` | No. `RECOMMIT` keeps the method. `REFINE` changes it. |
| `The total cost is enough.` | No. Read the split. `$10` LLM + `$200` Modal is a different failure mode from `$180` of judge calls. |
| `We can write the lesson later.` | Unwritten lessons are forgotten failures. Use `appendLesson()` now. |
| `We can reopen the killed hypothesis if the new idea works.` | Silent revival is method fraud. New id required. |
| `The smokes look great, so ship is fine.` | `smokes/` is provisional. It does not authorize `SHIP`. |
| `The override can be one sentence.` | Short excuses are why the 50-character minimum exists. |
| `Refinement count is bookkeeping.` | It is churn accounting. If the same claim needed three method rewrites, that matters. |

Red Flags - STOP

Stop and restart the decision if:

you want to ship from smokes/
you want to ignore the spend split
you want to treat COST_OVERRUN like a branch instead of a lesson label
you want to pivot without a concrete new hypothesis id
you want to refine without naming the old and new method
you want to recommit even though the claim changed
you want to call a real falsifier hit a refinement
you want to keep the old id after changing the claim
you want to skip .epistemic/lessons.jsonl because the failure feels embarrassing
HYPOTHESES.md, KILLED.md, RESULTS.md, and OVERRIDES.md tell different stories

All of those mean the same thing: stop, reread the artifacts, and let the repository win.

Good vs Bad

Good: pivot from real learning

# KILLED
- Hypothesis ID: h-017
- Claim: Router A improves answer quality over Router B across the full eval set.
- Decision: PIVOT
- Why old claim died: Falsification showed the gain vanished on short-context tasks under the locked judge.
- What we learned: The effect appears limited to long-context routing.
- Successor hypothesis: h-044
- Spend: $210.14 total ($12.08 llm, $198.06 compute)

Good because the old claim is dead, the lesson is explicit, and the new claim is narrower.

Bad: sentimental pivot

- Status: RUNNING
- Note: same idea, just with a slightly smarter framing

Bad because nothing died, nothing was learned, and the new contract is hidden.

Good: refine the method without changing the claim

## 2026-05-31 — Refine h-024
- Reason: The claim is unchanged, but the extraction parser was dropping valid answers and contaminating the score. We are keeping the same claim, comparator, metric, and judge, updating only the parser, and rerunning the full preregistered sample.
- Method change: parser v1 -> parser v2
- Hypothesis entry: Refinement count 2

Good because the claim stayed put, the method change is explicit, and the churn is counted.

Bad: hide a method rewrite inside recommit

## Override h-024
- Reason: Want a few more runs and some evaluation cleanup

Bad because "evaluation cleanup" is a method change disguised as a budget extension.

Good: cost-overrun lesson with diagnosis

await appendLesson(cwd, {
  timestamp: "2026-05-31T18:04:11.233Z",
  hypothesisId: "h-031",
  outcome: "COST_OVERRUN",
  summary: "Killed after compute burn exceeded the budget without stable improvement.",
  costSpent: 210.14,
  rootCause: "Compute spend on Modal dominated the run while the locked-judge win rate stayed flat.",
});

Good because the lesson says why the budget mattered, not just that the number was large.

Bad: vague kill reason

# Maybe dead
Spent a lot.
Might revisit later.

Bad because it preserves deniability instead of recording a decision.

After SHIP, the next required skill is /skill:verification-before-publication.
