Use when auditing a large multi-section technical document (design doc, spec, patent, report) — especially LLM-written or edited many times — for internal inconsistency and concept drift. Symptoms: the same quantity carries different values in different sections (e.g. throughput 20,000 vs 24,000); a count in a bullet list disagrees with the nearby table; a concept is defined or framed one way early and used differently later (meandering trains of thought); cross-references like "§4.2" or "Invention #N" point at a section or name that no longer exists; post-pivot residue, where new-architecture prose coexists with old-architecture reasoning; a prose cleanup pass already ran but missed cross-section contradictions. Not for fresh documents, pure spec/API reference with no reasoning to track, or short docs one careful read covers — though a small but heavily-pivoted doc still benefits.
Slash command: /trains-of-thought-audit:trains-of-thought-audit
Long multi-section documents — especially LLM-written or many-times-edited ones — drift. A concept is defined one way in §1 and used differently in §9; a number stated early disagrees with the same number later. The root cause: **the document's core specifications are never written down and tracked**, so each section re-derives them from fading memory and lands somewhere slightly different.
No single reading pass catches everything (this was measured; see Evidence). The audit therefore runs two complementary passes, a typed claim registry and a linear read pass, and feeds both into one verification gate.
The registry gives completeness and trustworthy precision; the read pass gives recall on the unanticipated; the gate makes both safe to ship. Run all three and union the confirmed findings. (The old framing "structure surfaces the problem; prose doesn't" is wrong — a plain read caught the single worst error a numbers-only registry missed.)
The trigger is multi-pivot / LLM-drift damage, not size. Size decides only how you dispatch the registry pass:
digraph scope {
"Doc fits in one context window (~5k lines / 50 pages)?" [shape=diamond];
"Single agent: registry + read pass itself" [shape=box];
"Parallel per-section agents emit registry rows → merge; read pass split across slices" [shape=box];
"Doc fits in one context window (~5k lines / 50 pages)?" -> "Single agent: registry + read pass itself" [label="yes"];
"Doc fits in one context window (~5k lines / 50 pages)?" -> "Parallel per-section agents emit registry rows → merge; read pass split across slices" [label="no"];
}
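The dispatch decision in the diagram can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: `plan_registry_pass`, the `(heading, line_count)` input shape, and the hard 5000-line cutoff (taken from the diagram's approximate "~5k lines") are all assumptions.

```python
from typing import List, Tuple

# Approximate threshold from the "~5k lines / 50 pages" rule of thumb (assumed exact here).
WINDOW_LINES = 5000


def plan_registry_pass(
    sections: List[Tuple[str, int]], window_lines: int = WINDOW_LINES
) -> List[List[str]]:
    """Return work units for the registry pass.

    sections: (heading, line_count) pairs from the shared section index.
    One unit holding every heading means a single agent does the whole pass;
    otherwise each section becomes its own unit for a parallel agent.
    """
    total_lines = sum(count for _, count in sections)
    if total_lines <= window_lines:
        return [[heading for heading, _ in sections]]
    return [[heading] for heading, _ in sections]
```

Size only changes how the work is split; whether to run the audit at all is still decided by the drift symptoms.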
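Symptoms that trigger this audit:
- The same quantity carries different values in different sections (e.g. throughput 20,000 vs 24,000).
- A count in a bullet list disagrees with the nearby table.
- A concept is defined or framed one way early and used differently later (meandering trains of thought).
- Cross-references like "§4.2" or "Invention #N" point at a section or name that no longer exists.
- Post-pivot residue: new-architecture prose coexists with old-architecture reasoning.
- A prose cleanup pass already ran but missed cross-section contradictions.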
Don't use for: fresh documents; pure spec / API-reference with no reasoning to track; a short doc one careful read covers (a small but heavily-pivoted doc still earns the audit — use one agent).
The registry's "canonical" value cannot come from the document alone: the document is the thing that's inconsistent. Fix the tie-breaker up front: (a) ask the user for the current truth; (b) take it from the newest / most-recently-edited section, or from an ADR / changelog / decision log; (c) if nothing is authoritative, the concept's canonical slot stays UNRESOLVED and that conflict is itself the finding. Never silently pick one value and flag the rest as drift: a wrong guess blesses the error and flags the correct mentions. Non-negotiable.
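A minimal sketch of that tie-breaker, intended for a concept that already shows competing values. The names `Canonical` and `resolve_canonical`, and the two optional inputs standing in for sources (a) and (b), are hypothetical; the only point illustrated is that the function never chooses among the document's own values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Canonical:
    value: Optional[str]   # None when nothing authoritative exists
    source: str            # "user", "authoritative", or "UNRESOLVED"
    competing: List[str]   # distinct values observed in the document


def resolve_canonical(
    seen_values: List[str],
    user_value: Optional[str] = None,
    authoritative_value: Optional[str] = None,
) -> Canonical:
    """Apply the tie-breaker: user truth, then an authoritative source, else UNRESOLVED."""
    competing = sorted(set(seen_values))
    if user_value is not None:
        return Canonical(user_value, "user", competing)
    if authoritative_value is not None:
        # e.g. newest section, ADR, changelog, or decision log
        return Canonical(authoritative_value, "authoritative", competing)
    # No guess is made: the conflict itself is reported as the finding.
    return Canonical(None, "UNRESOLVED", competing)
```

With no external authority, `resolve_canonical(["20,000", "24,000"])` leaves the slot empty and carries both competing values forward to the report.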
Extract exhaustively BY TYPE; filter for conflict at reconcile, not at extraction. The failure mode that wrecks recall is pre-judging which claims are "important enough" to register — the claims that drift are usually the ones that look minor. Register every claim that fits a type; reconciliation (cheap) decides what conflicts.
Per section (one agent per file, or per heading-range for a monolithic file — hand each the shared section index), emit a row per claim:
- type: quantity | definition_scope | repudiation | identity_numbering |
benchmark_analogy | derived_arithmetic | justification_count
- subject: <the entity/concept the claim is about — for grouping>
- statement: <the claim, normalized>
- value: <number+unit, if any>
- sectionRef + quote: <verbatim, so another agent can re-locate it>
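The row schema above could be represented as a small validated record. This is a sketch under assumptions: the class name `RegistryRow`, the snake_case field names, and the validation rules are illustrative and not defined by the skill.

```python
from dataclasses import dataclass
from typing import Optional

# The seven claim types listed in the schema.
CLAIM_TYPES = {
    "quantity", "definition_scope", "repudiation", "identity_numbering",
    "benchmark_analogy", "derived_arithmetic", "justification_count",
}


@dataclass(frozen=True)
class RegistryRow:
    type: str
    subject: str            # entity/concept the claim is about, used for grouping
    statement: str          # the claim, normalized
    section_ref: str        # where the claim appears
    quote: str              # verbatim text so another agent can re-locate it
    value: Optional[str] = None  # number + unit, if any

    def __post_init__(self) -> None:
        if self.type not in CLAIM_TYPES:
            raise ValueError(f"unknown claim type: {self.type!r}")
        if not self.quote.strip():
            raise ValueError("quote must be a non-empty verbatim excerpt")
```

Rejecting unknown types at construction keeps per-section agents from inventing categories that the reconcile step would then silently ignore.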
The type list is the recall contract: each type is a class the audit must reconcile.
A linear read (split across slices for a big doc) that holds cross-section context and hunts contradictions the registry under-registers — especially a benchmark/approach repudiated in one place but used in another, an internal sum whose parts don't add up, and a justification count that differs from the same count elsewhere. Quote both sides. This pass exists because a registry only diffs what it registered; a read notices the unanticipated. Its candidates go through the same gate.
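One of the read-pass targets, an internal sum whose parts don't add up, is mechanical once the parts and the stated total have been pulled out of the text. A small sketch, assuming the reader supplies both; the function name `sum_mismatch` and the tolerance parameter are hypothetical, and the numbers in the usage note are made up for illustration.

```python
from typing import Sequence, Tuple


def sum_mismatch(
    parts: Sequence[float], stated_total: float, tolerance: float = 0.0
) -> Tuple[bool, float]:
    """Return (is_mismatch, computed_total) for a stated breakdown.

    tolerance absorbs rounding in the document (e.g. percentages rounded to integers).
    """
    computed = sum(parts)
    return abs(computed - stated_total) > tolerance, computed
```

For example, a breakdown of 40, 35 and 20 against a stated total of 100 yields `(True, 95)`. That result is only a candidate and still goes through the same gate as everything else.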
Run all of these over the merged registry + read-pass candidates, not just value grouping:
- Group quantity/justification_count rows by (subject, attribute), normalizing aliases; >1 distinct value = candidate.
- For each repudiation row, search whether the rejected thing is asserted/used elsewhere.

Every candidate is handed to a fresh agent that re-reads both cited locations in the document and confirms a genuine same-thing-same-sense contradiction. Reject if: the quote isn't found (hallucination), the two figures are different things sharing a name, or both are legitimately true in context (conservative vs aggressive scenario, gross vs net, different region/year). Confirm only if airtight, with verbatim quotes. This gate is what makes the audit shippable without re-checking: in testing it held false positives at zero.
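The value-grouping step and the cheapest gate check (is the quote actually in the document?) can be sketched as follows. Assumptions are labeled: rows are plain dicts, `attribute` is a hypothetical extra field not present in the row schema, the alias map is supplied by hand, and value strings are compared as written with no unit normalization. The semantic checks (same thing, same sense) still need the fresh agent.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(subject: str, aliases: Dict[str, str]) -> str:
    """Lower-case the subject and map known aliases to one canonical name."""
    key = subject.strip().lower()
    return aliases.get(key, key)


def value_candidates(
    rows: List[dict], aliases: Dict[str, str]
) -> Dict[Tuple[str, str], List[str]]:
    """Group quantity/justification_count rows; keep groups with >1 distinct value."""
    groups = defaultdict(set)
    for row in rows:
        if row["type"] in ("quantity", "justification_count") and row.get("value"):
            key = (normalize(row["subject"], aliases), row.get("attribute", ""))
            groups[key].add(row["value"])
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


def quote_found(document_text: str, quote: str) -> bool:
    """Whitespace-insensitive verbatim check; False means reject as hallucinated."""
    def collapse(text: str) -> str:
        return " ".join(text.split())
    return collapse(quote) in collapse(document_text)
```

Grouping is deliberately permissive: it only nominates candidates, and the gate is what decides whether two values are a real contradiction.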
Per concept, a card — canonical entry (or UNRESOLVED: <competing options>), every deviation with both quotes and a one-line fix, and a drift measure (distinct values/framings × section span × resolved?). Order by severity (UNRESOLVED canonical > confirmed value drift > numbering/repudiation > framing > stale ref). Then ask the user which to remediate. Read-only throughout — the audit never edits; the user decides the fixes.
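The report ordering can be sketched as a two-level sort. The severity labels below are shorthand for the five classes named above. The tie-breaker inside a class (distinct values times section span) is one assumed reading of the drift measure, not the skill's definition, and the "resolved?" factor is covered here only by the top severity class.

```python
from typing import List

# Highest severity first, mirroring the ordering stated in the text.
SEVERITY_ORDER = [
    "unresolved_canonical",
    "value_drift",
    "numbering_repudiation",
    "framing",
    "stale_ref",
]
RANK = {name: index for index, name in enumerate(SEVERITY_ORDER)}


def drift_score(distinct_values: int, section_span: int) -> int:
    """Assumed drift measure: more distinct values over a wider span is worse."""
    return distinct_values * section_span


def order_cards(cards: List[dict]) -> List[dict]:
    """Sort by severity class, then by descending drift score within a class."""
    return sorted(
        cards,
        key=lambda c: (
            RANK[c["severity"]],
            -drift_score(c["distinct_values"], c["section_span"]),
        ),
    )
```

Each card is a dict with `concept`, `severity`, `distinct_values` and `section_span` keys, all hypothetical names. The function only orders the report and leaves the document untouched, consistent with the read-only rule.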
UNRESOLVED beats a confident wrong canonical.

Evidence: measured on a real 75k-word, 22-section technical + investment report (not a synthetic fixture; synthetic tests where the same model plants and "finds" a defect prove nothing).
npx claudepluginhub jihlenburg/mad-skills --plugin trains-of-thought-audit