From formal-methods
Structured bug investigation using formal models for hypothesis-driven debugging. Builds scoped Alloy models (normative rules, data constraints, causal chains, observability) to generate distinguishing experiments and narrow the causality cone around a symptom. Use whenever the user reports a bug, incident, or unexpected behavior and wants rigorous investigation. Trigger on "investigate", "root cause", "debug this", "why does this happen", "find the bug", "postmortem", or any unexplained symptom. Valuable for: business-rule bugs, state machine errors, workflow inconsistencies, cross-service disagreements, investigations where the obvious explanation was already checked. Not for simple stack traces or typos — use when the problem space is large enough that unstructured search would waste time or miss the real cause.
How this skill is triggered — by the user, by Claude, or both
Slash command
/formal-methods:formal-debuggerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Requires `formal-modeling` skill. Hypothesis accepted only when no other compatible hypothesis
scripts/check_pw0_live.pyscripts/check_rejection_reasons.pyscripts/check_workspace_clean.shscripts/evidence_hash.pyscripts/iso_to_filename.pyscripts/now_iso.pyscripts/sha256_file.pyscripts/time_delta.pyself-models/fdp_baseline_comparability.alsself-models/fdp_dynamic_data.alsself-models/fdp_evidence_quality.alsself-models/fdp_fact_ordering.alsself-models/fdp_intervention_ordering.alsself-models/fdp_protocol.alsself-models/fdp_protocol.dfyself-models/fdp_rejection_reasons.alsself-models/fdp_skip_protocol.alsself-models/fdp_source_classification.alsself-models/fdp_structured_chain.alsself-models/fdp_symptom_proximity.alsRequires formal-modeling skill. Hypothesis accepted only when no other compatible hypothesis
remains undistinguished. Production first — prefer direct production queries over
interpreted code reading at every step.
Create ./investigations/<slug>/ (kebab-case from symptom) with this layout:
investigation-report-<N>_<timestamp>.md — versioned narrative snapshots (PW1).
Multiple files: investigation-report-1_<ts>.md at Step 0 (symptom only),
investigation-report-2_<ts>.md after Step 0b, additional versions as the
investigation progresses, and a final version at termination.evidence/ — one file per evidence entry E (PW1)hypothesis/ — one file per hypothesis event H- (PW3)model-changes/ — one file per model change M (PW2)Formal model files (.dfy, .als) go at the investigation root.
Report versioning. Reports are immutable once written; new information produces a new
versioned file rather than an edit in place. Each report from version 2 onward carries a
PrevReportHash: field = sha256(previous-report-file-bytes), forming a report chain
parallel to the hypothesis and model chains. The first report (investigation-report-1_*)
has no PrevReportHash:; it is the genesis anchor for all three chains — hypothesis, model,
and report.
Each record is its OWN file — no monolithic append logs. Filename format:
E1_2026-04-25T10-30-45-123Z.md, H1-2_2026-04-25T10-31-00-456Z.md,
M1_2026-04-25T10-40-00-789Z.md. The suffix is an ISO 8601 timestamp with millisecond
precision, using - in place of : and . (filesystem-safe, sortable, greppable). Each
record also carries a Timestamp: field with the full ISO 8601 value (with colons and
decimal ms) INSIDE the file, e.g., Timestamp: 2026-04-25T10:30:45.123Z.
Chain structure — three relationships, not one flat chain.
PrevHypHash: = sha256(previous-hypothesis-event-file),
where "previous" is the H event with the next-lower Timestamp:. The first H event's
PrevHypHash: anchors to sha256(investigation-report-1_*.md) — the first (genesis) report.ParentHypEvent: H<id>-<N> identifying the
hypothesis event it was collected for, and ParentHypHash: sha256(parent-file) snapshotting
that parent at collection time. Evidence does NOT chain to other evidence.PrevModelHash:
(sha256 of previous M, anchored on sha256(investigation-report-1_*.md) for the first)
AND ParentHypEvent: / ParentHypHash: identifying which hypothesis context triggered
the model iteration.investigation-report-<N>_*.md with N≥2 carries
PrevReportHash: = sha256(investigation-report-<N-1>_*.md). The first report has no
PrevReportHash:; it is the genesis for all three parallel chains.State-change events freeze their evidence. When a hypothesis record is a status-changed
event (to compatible, weakened, rejected, undistinguished, accepted), it carries
Evidence: [E<N>, E<M>, ...] listing the specific evidence it rests on, plus EvidenceHash:
= sha256(sorted-concat-of-evidence-file-sha256s). Any later edit to any referenced
evidence record invalidates the state-change's EvidenceHash — evidence becomes immutable
once cited in a state transition.
Symptom-verification anchor. At Step 0b, before collecting E1, write H0-1: symptom-claimed
as the root anchor event. Its PrevHypHash: = sha256(investigation-report-1_*.md). E1
then links to H0-1 as its ParentHypEvent:. This keeps the "evidence must attach to a
hypothesis event" rule consistent from the very first evidence.
Precondition. Before writing any record, its parent (or chain predecessor) must exist
on disk with ALL required fields. You cannot write H1-2 without H1-1 being complete; you
cannot write E3 without its parent H event being complete; you cannot write a status-changed
event without its cited evidence being complete. This makes construction sequential by
necessity and makes the dependency graph explicit.
Why this works instead of forcing sleeps: batch-flushing the whole structure at session end
requires either (a) honest Timestamp: values matching wall-clock during the flush — caught
by filesystem-provenance — or (b) backdated timestamps — also caught by filesystem-provenance.
Meanwhile, the chain and reference hashes prevent any retroactive insert, reorder, or edit
of a record that has already been cited or chained from.
Blocking precondition (PW0-init). Before proceeding to Step 0a, Claude MUST create
investigation-report-1_<timestamp>.md (the first report, with just the symptom sketch)
plus the three subdirectories (evidence/, hypothesis/, model-changes/) on disk.
Claude MUST NOT collect evidence, read code, or run queries before these exist on disk.
Verify with a directory listing and show it to the user. Retroactive stub creation does
not satisfy this rule.
Pin down the symptom — ask and record: What (exact wrong behavior), Where (service,
endpoint), When (always? intermittent?), For whom (all users? specific?), What is NOT
the symptom (adjacent things that work). Assess severity: data exposure → security-critical,
financial impact → business-critical, data loss → data-integrity-critical, compliance (GDPR,
SOC2, PCI-DSS) → note explicitly. Write ## Symptom + **Severity:** into the report.
Check investigations/tooling-inventory.md first (template at <skill-dir>/templates/tooling-inventory.md).
Determine: production DB, logs, metrics, tracing, error tracking, live API, config, queues,
repo, CI/CD. Write ## Tooling noting direct vs inferred per tool.
S0-Obs — observability-first pre-flight. Before Step 0b, name the highest-resolution
layer where the symptom is directly observable today. Resolution ladder, highest first:
production logs of the failing call → production DB row that captured the failure →
metric/trace covering the failure window → derived report → user description. Append
## Observability ceiling to tooling-inventory.md listing:
response.model) is missing, write Gap: <field/log/metric> not captured.If a gap is named, Step 0b's E1 MUST come from the next-best layer AND H0-1 MUST include "observability gap may be hiding the real cause." Closing the gap (adding the missing log/field) is a valid first hypothesis-driving action — don't defer it to Step 7. The investigation does not move past Step 0b on a layer worse than what production already supports.
S0-V — symptom verification required. Confirm the symptom with at least one direct fact
(production DB query, log, or live API observation) before Step 1. interpreted sources (reports,
specs, prior investigations) must be re-verified against production.
First write hypothesis/H0-1_<timestamp>.md with Event: symptom-claimed and
PrevHypHash: sha256(investigation-report-1_*.md) — this anchors the hypothesis chain. Then
write evidence/E1_<timestamp>.md with ParentHypEvent: H0-1 and ParentHypHash: set to
the sha256 of H0-1's file bytes. Then write investigation-report-2_<timestamp>.md (the
updated narrative now including **Verified by:** E1) with PrevReportHash: = sha256 of
investigation-report-1_*.md. Do NOT edit the first report — write a new version.
S0-V.1 — Symptom proximity check. If the symptom is transport-shaped (DNS failure,
gaierror, connection refused, socket timeout, 5xx, health-check fail, "host not found"),
the first hypothesis MUST include "target process never started or crashed during startup."
Before investigating the transport layer, gather direct evidence of upstream liveness:
startup logs, container state, readiness probe, process list. If that evidence is
unavailable, Step 1 is observability (expose startup output), not topology.
Requires formal-modeling skill. Check production logs/traces for the actual execution path
before writing model code. Append findings to evidence-log.md.
Locating modeling tooling. The formal-modeling skill ships the Alloy/Dafny runners
(alloy_run.sh, dafny_run.sh, verify.sh) and reference .als/.dfy examples. When
installed as a plugin, look under ~/.claude/plugins/marketplaces/*/skills/formal-modeling/
or ~/.claude/plugins/cache/*/formal-methods/*/skills/formal-modeling/
(scripts/ for runners, references/ for example models). If not found there, check
~/.claude/skills/formal-modeling/ or ask the user for the install path before proceeding.
Alloy availability — DO NOT short-circuit on which alloy. Alloy is not expected
to be on PATH as a standalone binary. The skill ships a self-contained runner
(alloy_run.sh) that downloads alloy.jar on first use into a sibling .alloy/ cache and
executes it via any local Java 17+ install (Homebrew openjdk, system OpenJDK, or
JAVA_HOME); it falls back to Docker if no Java 17+ is found. Therefore:
which alloy / command -v alloy returning "not found" is expected and
carries zero signal about whether Alloy can run. Never use it as a tool-availability
check, and never write reports that say "Alloy is not installed" based on it.alloy_run.sh exists in <formal-modeling>/scripts/ (locate per the paths above)./opt/homebrew/opt/openjdk/bin/java -version,
$JAVA_HOME/bin/java -version, or java -version parsed for major version ≥ 17), OR
docker is on PATH (the runner falls back to a containerised JDK).bash <path>/alloy_run.sh <path>/references/<example>.als on
a known-good example completes with a Solver result: line.java/docker lookups were tried — not just
"alloy not found".which alloy — that is a bug, not a defaulting rule.Default: create a formal model file AND run the solver in the same step. Use .dfy for
fast iteration, .als for counterexamples. Building the model and running the solver are
one atomic step — a .dfy/.als file without a Solver result: field in model-change-log.md
does NOT satisfy TC28. "Model exists but solver deferred pending evidence" is NOT an allowed
state.
Consequences:
Solver result: field per the hypothesis-entry format.
An M1 entry without a solver result is a protocol violation caught by the termination gate.Model skip — requires explicit user acknowledgement before writing the skip entry.
Protocol (MUST follow in order):
<X>. Here are the edge cases narrative reasoning
misses: <Y>."M1: Skipped (user-acknowledged) to model-change-log.md.
The entry MUST quote the user's verbatim acknowledgement string under an
Acknowledgement: field.Forbidden pattern: writing M1: Skipped (user-acknowledged) in the same turn that proposes
the skip. The skip entry and the proposal MUST be in different turns separated by a user
reply.
Model only the nearest causal layer. Build four layers:
Invariant kind tagging. When an invariant is captured directly in prose rather than
the formal model (e.g., "production used model X", "the replay tool reaches the same
endpoint as production"), tag it with one of protocol | data | observability | setup in
the M entry. protocol invariants govern the system under study; data invariants
govern stored values; observability invariants govern what should be visible; setup
invariants bind one experimental/operational frame to another — including production-state
vs replay-state, experiment vs production, AND multi-version system pairings (e.g.,
V_old peer ↔ V_new peer in distributed systems, legacy ↔ current schema, old client ↔ new
server). Every setup invariant MUST also appear as an A-row in the Step 2 assumptions
audit — setup is the strongest place for an unstated premise to hide.
RE — Replay-environment audit (required when M uses replay or counterfactual evidence). A replay tool is any mechanism that re-runs a production-shaped operation outside production: LLM prompt replay, query replay, integration-test replay, recorded-and-replayed HTTP traffic. When the model relies on such evidence, append an RE-table to the M entry:
| id | dimension | production value (source) | replay value (source) | match? |
|-----|--------------------------------------|---------------------------|------------------------------|--------|
| RE1 | model / version / endpoint | (logs / config) | (replay tool config) | ✓ / ✗ |
| RE2 | input parameters (temperature, …) | (production code path) | (replay tool defaults) | ✓ / ✗ |
| RE3 | identity (user, tenant, flag cohort) | (logs / DB) | (replay setup) | ✓ / ✗ |
| RE4 | data snapshot | (production timestamp) | (replay snapshot timestamp) | ✓ / ✗ |
Add domain-specific rows as needed (auth scope, region, feature flags, schema version).
Any ✗ row downgrades the replay's reliability tag to interpreted and MUST appear as an
A-row in the Step 2 assumptions audit.
Append M1 to model-change-log.md with trigger, what was created, solver results.
A1 — Assumptions audit (precedes hypothesis enumeration). Before writing any hypothesis, list the assumptions under which the symptom is anomalous. An assumption is a premise that, if false, dissolves the entire hypothesis space rather than eliminating one branch within it. Typical kinds:
Write hypothesis/A_<timestamp>.md (event: assumptions-audited, chained on the latest
report) before any H<N>_*.md files. Its body MUST contain a table:
| id | assumption | falsifier (concrete query/command) | evidence-id | reliability |
|----|------------|------------------------------------|-------------|-------------|
| A1 | … | … | E<N> | direct/… |
Every row MUST cite an Evidence record (collect them now if needed; this is a
Step-4-shaped sub-loop scoped to assumption verification, not hypothesis verification). A
row without direct or inferred evidence blocks hypothesis enumeration — the assumption
itself becomes the next thing to verify.
A2 — Negation hypothesis. For each assumption, the hypothesis space MUST include at least one hypothesis that is the negation of the assumption (e.g., assumption "production used model X" → hypothesis "production used a different/lesser model"). These are full H1-shaped hypotheses with mechanism and counterfactual; they are not skipped on grounds of "obviously true."
Then: extract hypotheses from the model. Each must follow H1:
[condition] -> [mechanism] -> [state change] -> [symptom].
For each, state the counterfactual (FZ1). Append created, mechanism-stated,
counterfactual-stated entries to hypothesis-log.md.
For each hypothesis pair: what check distinguishes them? Strong = confirms one, excludes another. Weak = compatible with multiple. Order by max information gain.
Each fact gets a reliability tag:
| Source | Reliability |
|---|---|
| Production DB query | direct |
| Production logs (<7d) | direct |
| Production logs (>7d) | inferred |
| Live API response | direct |
| Deployed config / env vars | direct |
| Repo code | interpreted |
| Git history (local) | interpreted |
| Prior reports / specs / docs | interpreted |
| Alloy model results | inferred |
| User verbal description | interpreted |
| Mobile app code | unreliable-source |
| Third-party docs | interpreted |
| User reports | inferred |
Repo code is not production truth. Always tag code reading as interpreted.
F4 — Fix tasks: first fact must be direct production observation of current behavior.
F3 — Dynamic data: collect (1) current value, (2) change history, (3) timeline coverage.
F6 — Zero-result queries: list ALL sources, query each. Record Absence sources: N/M.
F7 — Wrong values: trace WRITE paths, not read. Record Analysis type: write-path.
F8 — Numeric discrepancy: compute exact locally before estimating. Record Computation method.
F9 — Snapshot fields: check temporality before trusting. Record Field temporality.
F10 — Baseline comparability: differential vs "last known good" requires same repo, trigger,
and config. Record Baseline: <id> | repo=X trigger=Y config-diff=Z. Mismatched baseline = interpreted.
F11 — Workspace contamination: local/CI investigations must check for untracked/gitignored
files that mask CI failure. Run <skill-dir>/scripts/check_workspace_clean.sh [paths] (which
wraps git ls-files --others --exclude-standard + gitignored enumeration). Use --source-only
to filter common source extensions when noise is high. Record Workspace clean: yes/no on
evidence derived from local state.
Append E entries to evidence-log.md immediately as facts are gathered.
Add confirmed facts as model constraints. Re-run solver. Append M to model-change-log.md.
Valid status transitions:
| From | Allowed transitions |
|---|---|
active | compatible, weakened, rejected, undistinguished |
compatible | accepted, weakened, rejected, undistinguished |
weakened | rejected, compatible |
undistinguished | compatible, rejected |
rejected | terminal — no transitions |
accepted | terminal — no transitions |
Append status-changed entries to hypothesis-log.md with evidence reference.
If hypotheses explain the same facts and predict the same for all checks — do NOT accept any,
do NOT pick by "likelihood." Proceed to Step 7. Append equivalence-checked and
observability-assessed (FZ2) entries to hypothesis-log.md.
Expand only where undistinguished hypotheses live. Directions: depth (code), depth (data), depth (type contracts — check types at call boundary when function returns default value), breadth (observability), breadth (concurrency). Append M. Go back to Step 2.
Termination conditions (ALL must be true):
compatible (U1)direct evidence supports conclusion (PV1)direct evidence (F5)direct entry (PW1)mechanism-stated logged (PW3/H1)counterfactual-stated logged (PW3/H2)observability-assessed logged with "observable" (PW3/FZ2)alternative-considered logged (PW3/M2)equivalence-checked logged (PW3/U1)compatible/accepted, rest rejected (PW3/U1)direct/inferred (TC19)rejected→anything)direct (F4)inferred (F6)M1: Skipped (user-acknowledged) with an Acknowledgement: field quoting the user's verbatim affirmative reply from a turn AFTER the skip was proposed (PV2)investigation-report.md + three subdirectories (evidence/, hypothesis/, model-changes/) exist on disk. Filesystem ctime of each subdirectory precedes the earliest Timestamp: field value of any record inside it. Retroactive rebuild fails this check.Timestamp: field in ISO 8601 UTC with millisecond precision; (b) investigation-report-<N>_*.md files chain via PrevReportHash: (first report has none); (c) hypothesis records form a valid SHA-256 chain via PrevHypHash: anchored on sha256(investigation-report-1_*.md); (d) evidence records have a valid ParentHypEvent: + ParentHypHash: pointing to an existing hypothesis event; (e) model-change records chain via PrevModelHash: and have valid ParentHypEvent: + ParentHypHash:; (f) every status-changed/accepted hypothesis record has a matching EvidenceHash: over the cited Evidence: list; (g) each record's in-field Timestamp: matches filesystem ctime within 60 seconds.direct evidence before transport investigation (S0-V.1)git status --ignored (F11)direct evidence of the changed state (OB1)rejected hypothesis has a valid Reason: (evidence-based with Evidence: E<N>, or preference-based with an allowed Priority: + Rationale:) in its status-changed log entry (U2-doc)Write call, preceded by Bash calls to now_iso.py and sha256_file.py for fresh values; no multi-file write/repair helper was used; pre-write narration was present for each chain recordassumptions-audited event present with ≥1 row, each row citing an Evidence record with direct or inferred reliability (A1).✓ rows are anchored to a direct or inferred evidence record (RE).If any fails, iterate. Before acceptance, write an alternative-considered and a
status-changed event file under hypothesis/. Assemble report: Symptom, Conclusion,
Hypothesis History, Evidence Log, Hypothesis Log, Model Change Log, Model Coverage,
Remaining Uncertainties, Next Steps. The report may reference the individual record files
by name (e.g., "E1, E4, and H2-3 together rule out ...").
Three per-record subdirectories, one file per event. PW1: evidence/ needs ≥1 direct
entry. PW2: model-changes/ needs ≥1 file if a model is built, re-run after fact
integration. PW3: hypothesis/ needs TC13-18 events across the hypothesis lifecycle.
PW0-init — stub layout is a blocking precondition. investigation-report.md plus the
three subdirectories (evidence/, hypothesis/, model-changes/) MUST exist on disk
before Step 0a begins. No evidence collection, code reading, or queries may precede their
creation. Verify via directory listing shown to the user.
PW0-live — per-record files with structured hash integrity. Each E, H, M event is its
OWN file with a timestamp suffix in the filename, a Timestamp: field inside, and
reference-hash fields tying it to its dependencies. Enforcement (TC30) is four-part:
Timestamp: field in ISO 8601
UTC form with millisecond precision (e.g., Timestamp: 2026-04-25T10:30:45.123Z).Timestamp: form a valid
SHA-256 chain. Each H record's PrevHypHash: equals sha256(previous-H-file-bytes).
First H record's PrevHypHash: equals sha256(investigation-report-1_*.md).ParentHypEvent: H<id>-<N> pointing
to an existing H event, and ParentHypHash: equals sha256(parent-H-file-bytes).
Evidence is not cross-chained. Model-change records have ParentHypEvent: +
ParentHypHash: the same way, and additionally chain via PrevModelHash: to the
previous M (or sha256(investigation-report-1_*.md) for the first M).
Reports with N≥2 chain via PrevReportHash: = sha256(investigation-report-<N-1>_*.md).Event: status-changed
(or accepted) include Evidence: [E<N>, ...] and EvidenceHash: = SHA-256 over
the sorted concatenation of the cited evidence files' own SHA-256 hashes. Any later
edit to any cited evidence invalidates this hash. Evidence records become immutable
once cited in a state transition.Timestamp: must match the file's creation
time within 60 seconds. Large disagreement = backdated record.These checks together make the following attacks detectable:
ParentHypEvent: or mismatched ParentHypHash:Records can be written at whatever pace the investigation runs — the structure enforces sequential dependencies without requiring artificial sleeps.
Before acceptance, run <skill-dir>/scripts/check_pw0_live.py <investigation-dir> — exit 0
is required.
PW0-strict — single-record write discipline. This is the binding rule that prevents batch-fix shortcuts. The structural integrity of the chain only holds if each record was actually constructed sequentially against the live state of its predecessor.
One Write per record. Each E/H/M record is written by exactly one Write tool call,
immediately preceded by Bash calls that read the current wall-clock time and the
predecessor's SHA-256. Use scripts/now_iso.py and scripts/sha256_file.py. The
Timestamp: and PrevHypHash:/ParentHypHash:/PrevModelHash:/PrevReportHash:/
EvidenceHash: values MUST be the values returned by those calls — never synthesized,
never derived from a multi-file write helper.
No multi-file write helpers. Read-only helpers that compute hashes or timestamps
for a single record (sha256_file.py, evidence_hash.py, now_iso.py) are allowed
and encouraged. Any Bash/Python invocation that writes, edits, or rechains two
or more record files in a single pass is forbidden — including for "fixups" after a
check fails, including for setup of test fixtures, including for "convenience."
Append-only forever. Never delete or rewrite a record. Once an E/H/M/Report file is written, its bytes are immutable — and its existence on disk is permanent. Deletion of a broken record is NOT a permitted "fix." Renaming, moving, or rewriting in place is also forbidden. If a record is wrong, the only remedy is to write a NEW record that supersedes it (see rule 6 below). The broken record stays on disk as a visible historical fact — the audit trail must record the mistake, not pretend it never happened.
If a checker fails, repair file-by-file. When check_pw0_live.py or
check_rejection_reasons.py reports violations across multiple files, the fix is to
repair each affected record individually with a fresh Bash call to now_iso.py
and a fresh Write. Do not write a script to fix multiple files at once. Do not
propose one. If a "single-pass repair" feels like the right move, that is the signal
that PW0-strict is binding — write the next single file instead.
Cost is not a justification. Permission prompts, tool round-trips, and context size are costs the user accepts in exchange for the integrity the protocol provides. "Batching saves tokens / round-trips / prompts" is not a valid reason to violate any of the rules above.
Supersedes mechanism for repairing broken records. When a record needs correction
(e.g., a state-change with stale EvidenceHash: after evidence was tampered with), the
repair is a NEW state-change record carrying a Supersedes: H<id>-<N> field naming the
record it replaces. The new record has its own fresh Timestamp:, PrevHypHash: (linking
to whichever record currently has the highest Timestamp), Evidence:, and EvidenceHash:
values. The original broken record stays on disk untouched. The supersession itself is
part of the audit trail.
check_pw0_live.py excuses a superseded record from per-record validity checks where
"the broken state IS the audit trail":
EvidenceHash: validity) — skipped for superseded recordsTimestamp: vs ctime within 60s) — skipped for
superseded recordsOther checks still apply transitively because they're about the record's relationship to OTHER records, not the record's own validity:
PrevHypHash:) — successors of a superseded record
still chain to its content hash; supersession doesn't break the chainParentHypHash: reflects what the parent's content
was at attachment time, which doesn't change under supersessionSo if you supersede a state-change whose ctime has drifted (e.g., the file was touched
long after its in-field Timestamp:), the check passes once the superseder is in place.
The drift on the historical record is acknowledged, not erased.
Pre-Write narration. Before each Write of a chain record, state in one sentence:
the file being written, the Timestamp: value just read from now_iso.py, and the
PrevHash/ParentHash/EvidenceHash value just computed. This narration is the audit
trail of construction discipline; skipping it is itself a PW0-strict violation.
The forbidden patterns:
Report record format: A file at investigation-report-<N>_<timestamp-suffix>.md where
<N> starts at 1 and increments with each new version. Required fields in the body:
Timestamp: (ISO 8601 UTC ms), and PrevReportHash: (sha256 of the previous report file)
for N≥2. The first report has no PrevReportHash:. Narrative content (Symptom, Severity,
Tooling, and later Conclusion / Hypothesis History / etc.) follows.
Evidence record format: A file at evidence/E<N>_<timestamp-suffix>.md with first line
E<N>: [description], followed by required fields: Timestamp: (ISO 8601 UTC ms),
ParentHypEvent: (the H record it serves, e.g., H1-2 or H0-1 for symptom verification),
ParentHypHash: (sha256 of the parent H file at collection time), Step, Source, Reliability,
Raw observation, Interpretation, Integrated?, Hypotheses affected, Verification query.
F6-F9 optional fields: Absence sources: N/M + verdict (F6), Analysis type: write-path +
Producer identified (F7), Computation method + Residual (F8), Field temporality +
Last written (F9). These are the audit trail TC24-27 check at termination.
F5 staleness: direct evidence goes stale after deploy/migration. Re-verify before acceptance.
If investigation spans sessions, re-verify all prior direct evidence.
Model record format: A file at model-changes/M<N>_<timestamp-suffix>.md with first
line M<N>: [description], followed by required fields: Timestamp: (ISO 8601 UTC ms),
PrevModelHash: (sha256 of previous M, or sha256 of investigation-report.md for the first),
ParentHypEvent: (triggering H event), ParentHypHash: (sha256 of that parent H at the
time of this model change), Step, Trigger, What changed, Solver result.
Hypothesis record format: A file at hypothesis/H<id>-<N>_<timestamp-suffix>.md with
first line H<id>-<N>: [event], followed by required fields: Timestamp: (ISO 8601 UTC ms),
PrevHypHash: (sha256 of previous H record — or sha256 of investigation-report.md for the
first H, which is H0-1 at Step 0b), Step, Hypothesis, Event (symptom-claimed | created |
mechanism-stated | counterfactual-stated | observability-assessed | alternative-considered |
status-changed | equivalence-checked), Detail.
For Event: status-changed (any target status) or Event: accepted, additionally include:
Evidence: [E<N>, E<M>, ...] listing the specific evidence that justifies this transition,
and EvidenceHash: = SHA-256 over the sorted concatenation of the cited evidence files'
own SHA-256 hashes. For status-changed to rejected, also add Reason: — either
evidence + Evidence: E<N> (the evidence list also populates the EvidenceHash) OR
preference + Priority: (from the allowed set) + Rationale: (non-empty text; a
preference-based rejection still needs Evidence: [] + EvidenceHash: = sha256 of empty
concatenation).
Pre-acceptance log checklist (mirrors TC1-36):
direct entry (PW1)mechanism-stated logged (H1)counterfactual-stated logged (H2)observability-assessed = observable (FZ2)direct evidenceequivalence-checked logged (U1)alternative-considered logged (M2)compatible/accepted, rest rejected (U1)direct evidence (F5)direct/inferred (TC19)rejected→other)direct (F4)Absence sources N/M with N=M (F6)write-path + Producer identified (F7)exact-local + Residual ≤5% (F8)Field temporality: live for current-state (F9)M1: Skipped (user-acknowledged) with rationale (TC28/PV2)rejected hypothesis has a documented Reason: + backing field (TC35/U2-doc)Write per record, fresh Bash calls for time and predecessor hash, no multi-file write/repair helpers, pre-write narration presentIf any fails, iterate.
H1 — Every hypothesis: [condition] -> [mechanism] -> [state change] -> [symptom].
H2 — State what observation would make it false. No counterfactual = too vague.
T1 — Every check: which hypotheses does it distinguish? Compatible with all = zero value.
U1 — Accept only if no other hypothesis is compatible. If undistinguished → deepen, don't pick.
U2 — Multiple compatible hypotheses: keep all active, do not collapse.
U2-doc — Every status-changed entry to rejected MUST carry a Reason: field:
Reason: evidence + Evidence: E<N> (cite the specific entry) OR
Reason: preference + Priority: <name> + Rationale: <text>.
Allowed Priority: values: Occam, BlastRadius, Severity, RecencyOfDeploy,
Reproducibility, FixCost. Any other priority must be raised to the user first.
M2 — Before accepting, name ≥1 alternative mechanism.
M1 — blind spot checklist (all must be reviewed before acceptance): Concurrency, Shared mutable state, Object lifecycle, Caching, Async boundaries, External systems, Partial observability, Config/feature flags, Data migration, Tenant isolation, Auth state, Deployment drift, Multi-artifact versions, Build pipeline divergence.
F1 — Tag every fact by source reliability. Never tag interpreted as direct.
F2 — Absence ≠ evidence of absence. Could it be absent from your view but present in production?
F3 — Dynamic data: check (1) current value, (2) change history, (3) timeline coverage.
F6 — Zero-result query: list ALL sources, query each. Conclude absence only when all agree.
F7 — Wrong value: trace WRITE paths (INSERT/UPDATE), not read paths. Match value to producer.
F8 — Numeric discrepancy: compute exact locally before estimating. Residual >5% = compute instead.
F9 — DB field as current state: check temporality (live/snapshot/scheduled). Snapshot = inferred.
F10 — Baseline comparability: differential requires matching repo/trigger/config. Record the diff.
F11 — Workspace contamination: check git status --ignored + git ls-files --others on local/CI mixes.
OB1 — Observability before intervention. Don't change topology/config/code under investigation
until you have direct evidence of the state being changed. Blind intervention moves the target.
PW0-strict — One Write per record, with fresh Bash calls for time and predecessor hash.
No multi-file write/repair helpers. If a checker fails, fix each affected file individually with
a new Bash + Write per file — do NOT write a script that batches the fix. Cost (tokens,
prompts, round-trips) is not a justification. Pre-write narration: state file, Timestamp,
PrevHash before each Write.
PV1 — ≥1 direct fact must support acceptance. Code reading alone is insufficient.
PV2 — Formal model required. Skip requires: (a) Claude asked, (b) user replied affirmatively in a later turn, (c) the user's verbatim reply is quoted in the skip entry. Inferring acknowledgement from silence, prior preferences, or memory is forbidden.
FZ1 — State counterfactual for each hypothesis.
FZ2 — Unobservable counterfactual blocks acceptance. Deepen observability, don't assume true.
FM1 — ≥1 direct fact before building model. Sequence: verify → code → hypothesize → model → verify fix.
direct evidence before proposing a system-level mechanism.templates/tooling-inventory.md — Template for tooling inventory. Copy into the project's
investigations/ directory and fill in. The skill reads it at Step 0a to avoid
re-enumerating tools each investigation.scripts/check_pw0_live.py — TC30/PW0-live enforcement. Run against an investigation
directory before acceptance: python3 scripts/check_pw0_live.py investigations/<slug>.
Fails if any entry lacks a Turn: field or if two consecutive entries share the
termination turn (the latest Turn: value across all three logs).scripts/check_workspace_clean.sh — TC33/F11 enforcement. Run against the paths
relevant to the investigation: scripts/check_workspace_clean.sh src/ bot/. Fails
if any untracked or gitignored files exist under those paths. --source-only filters
to common source extensions when the noise floor is high (e.g., .DS_Store, build
artifacts). Exit 0 = clean, 1 = contamination.scripts/check_rejection_reasons.py — TC35/U2-doc enforcement. Run against an
investigation directory: python3 scripts/check_rejection_reasons.py investigations/<slug>.
Flags status-changed-to-rejected entries missing Reason:, using preference
without an allowed Priority:, or missing Evidence:/Rationale: backing fields.scripts/sha256_file.py <path> — print SHA-256 hex of one file. Use for computing
PrevHypHash:, PrevModelHash:, PrevReportHash:, and ParentHypHash: values.scripts/evidence_hash.py <file> [<file> ...] — compute EvidenceHash: for a
state-change record. Hashes each input individually, sorts the hex digests, concatenates,
then takes sha256 of the result. Order-independent (same inputs → same hash regardless
of argument order).scripts/now_iso.py [--filename] — current UTC timestamp with millisecond precision.
Default form is the in-field Timestamp: value (2026-04-25T10:30:45.123Z); --filename
gives the filename-safe form (2026-04-25T10-30-45-123Z) with : and . replaced by -.scripts/iso_to_filename.py <iso> / --reverse <filename> — convert between the two
timestamp forms.scripts/time_delta.py <earlier> <later> — signed time delta in seconds with ms
precision. Accepts either canonical or filename-safe ISO form. Useful for verifying
the 60s filesystem-vs-field tolerance and for computing investigation span.Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub in8finity/claude-plugin --plugin formal-methods