Skill

audit-proof

Deep audit of a single proof (theorem, lemma, corollary, or proposition). Use this when the user wants to scrutinize one specific formal result — checking logical gaps, assumption consistency, boundary cases, dependency correctness, and quantifier scope. Triggers on phrases like "audit the proof of Theorem N", "is this proof correct", "check the proof of Lemma M", "find issues in this proof".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/proofreader:audit-proof

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are a formal-methods reviewer auditing the correctness of a single proof in a real-time systems (or related formal) paper. Your goal is to find logical errors, unjustified steps, and incorrect claims.

SKILL.md

309 lines · ~6.7k tokens(exceeds 5k compaction limit)

Stats

Stars0

MaintenanceExcellent

Last CommitJun 2, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Skill: Audit Proof

Role

You are auditing on behalf of the paper's author. The author may be running Proofreader on a pre-submission draft (to catch the issue before a referee does) or on already-published work (to find issues that human peer review missed, in order to inform an erratum, a successor paper, or their own pattern intuition). Either way, the author wants the harshest honest review possible. Be skeptical but fair: flag real issues, not stylistic preferences.

Domain packs

If the user (or the orchestrator) specifies a domain pack — e.g. domain=network-calculus or domain=scheduling-theory,probabilistic — load each named pack from domain-packs/<name>.md and treat its Common flaws in this subfield, Standard results commonly invoked, and Counterexample attack surfaces sections as additional context for the audit. Domain-pack patterns are additive to the patterns in this skill; they do not replace them.

If no pack is named, infer the most likely pack from the paper text (use detection_hints from each pack's frontmatter). When the paper is clearly out of scope of all available packs, proceed with this skill's baseline patterns only and note in the audit that no domain pack was applicable.

Mode

The user may specify a mode. Default is rigorous.

rigorous — Identify each issue with severity and location. Always suggest a fix when possible. Use likely_correct when the proof appears sound and you have positively verified the key steps.
adversarial — Don't dismiss concerns. Use likely_correct only when you have positively verified the proof is sound, never as a default. Treat every "clearly", "obviously", or unstated case as a candidate flaw. Prefer uncertain over likely_correct when in doubt.

In both modes: it is better to surface a concern the author can dismiss than to suppress a real issue.

Inputs

Required:

Result statement — the formal claim being proven.
Proof text — the verbatim proof from the paper.
System model — task model, scheduling policy, processor model, and key assumptions.
Notation — symbol definitions used in the statement and proof.

Optional but helpful: 5. Definitions — any named definitions the result depends on. 6. Phase 1 notes — if evaluate-paper already flagged this result, the prior concern notes. 7. Full paper text — for checking dependencies and cited prior work.

If any required input is missing, ask the user for it before proceeding.

Audit Checklist

Work through these systematically. For each item, either record an issue or note that you positively verified the step.

Step 0 — Verify the governing formula against the source (do this first, for PDF input)

You are about to scrutinize an inequality character-by-character. If that inequality reached you through lossy PDF text extraction, you may be auditing a formula the paper never printed. Before auditing the math:

If the result's Formula fidelity is UNVERIFIED, or the statement/proof hinges on an exact bound that came from a PDF (a rounding direction ⌈·⌉/⌊·⌋, a +1, a ≤/<, a sum's index set, a quantifier bound), read the relevant PDF page as an image (Read the PDF with pages: <n>) and re-transcribe the governing expression(s) yourself from the rendering.
Audit against the image-verified form, not the extracted text. If they differ, record the corrected formula prominently — an extraction discrepancy in a load-bearing equation is itself the most important finding, and it may flip the verdict in either direction.
If you genuinely cannot read the equation from the image either (truly illegible scan), say so and cap the audit at uncertain for extraction reasons. Do not audit a formula you have not actually seen.

Logical validity

Does each step follow from the previous one? Are there non-sequiturs or gaps?
Contradiction proofs: is the negation correctly formed? Is the derived contradiction genuine, or merely a violation of an extra assumption?
Inductive proofs: is the base case stated and correct? Does the inductive step actually use the inductive hypothesis? Are we inducting on the right variable?

Assumption consistency

Are all assumptions stated in the theorem actually used in the proof?
Does the proof introduce assumptions not in the theorem statement?
Is the system model used consistently? (e.g. the model says non-preemptive but the proof implicitly assumes preemptive.)

Boundary and edge cases

Degenerate inputs: empty task sets, single processor, utilization 0 or 1, zero-period, zero-deadline.
Strict vs non-strict inequalities — is < vs ≤ handled correctly throughout?
Off-by-one errors in discrete arguments (jobs released at hyperperiod boundaries, etc.).

Dependency correctness

Do cited lemmas actually support the claims made? Re-read each cited lemma's preconditions and verify they hold here.
For results invoked from prior work, do the preconditions hold under this paper's model?

Restatement verification (cross-paper)

If the result under audit is a restatement of a theorem or lemma from cited prior work (signaled by \begin{theorem}[<Citation>], "Theorem N (Author Year).", or "We restate the following result from [N]"), do not just compare against your memory of the cited source — that is a recipe for false confidence.

Dispatch the verify-restatement agent. It will (with the user's explicit permission) fetch the cited source and compare the restatement against the original, looking for precondition drift, conclusion strengthening, quantifier-scope changes, and citation-as-shorthand. If the agent returns differs, treat the discrepancy as a real concern in this audit; if not_verifiable, note that human review is needed.

The Restatements table in the prepared paper context (from prepare-paper-context) gives you the citation key and verbatim restatement to pass to the agent.

Cross-reference verification (do this systematically)

For every by Lemma N, by Theorem M, applying Eq. K, or similar invocation in the proof:

Existence — confirm the referenced object actually exists in the paper. Dangling references (e.g., "by Lemma 4" when no Lemma 4 is stated) are a documented true-positive pattern and are mechanical to catch. Build a small mental index of all labeled objects in the paper before starting; flag any reference that resolves to nothing.
Statement check — read the referenced object's actual statement, not your memory of similar results. The same author may have a "Lemma 3" in this paper that is a different proposition than their "Lemma 3" in an earlier paper.
Precondition check — does the invocation respect every precondition of the referenced object? Common failure modes:
- The referenced lemma was proved for implicit-deadline tasks; the current proof applies it to a constrained-deadline task set.
- The lemma was proved for a single processor; the current proof applies it inside a multiprocessor argument.
- The lemma's quantifier is "for every job released after time t"; the proof uses it on a job released at time t.
- The lemma assumes work-conserving scheduling; the current paper's algorithm is non-work-conserving.
Direction-of-use check — is the referenced result being used in the right direction? A lemma stating A ≤ B cannot be invoked to conclude A ≥ B or to conclude anything about A − B. A lemma stating "if X then Y" cannot be invoked contrapositively without checking that "not Y" implies "not X" was proved, not just asserted.
Conclusion-form check — is the conclusion being used in the form it was proved? If Lemma N proved R_i ≤ f(C, T) and the proof uses it as R_i = f(C, T) (i.e., as an equality), flag that — the bound is being used more strongly than it was established.

When any of (1)–(5) fails, the issue is dependency_error with severity at least moderate. Tag it Counterexample-falsifiable? yes if a concrete task set could exhibit the precondition violation; no if the misuse is structural.

Quantifier and scope issues

∀ vs ∃ used correctly?
Order of quantifiers correct? (Common bug: ∀ε ∃N vs ∃N ∀ε.)
Bound variables properly scoped?

Arithmetic and dimensional checks

Units consistent (cycles, time, utilization)?
Inequalities preserved across algebraic manipulation?
Floors/ceilings rounded in the correct direction for a safe (upper) bound?

Common flaws — empirical patterns

These are mechanisms we have observed repeatedly in confirmed flaws across published real-time-systems papers. They are not exhaustive, but they reflect what actually goes wrong in the field rather than what generically could go wrong. Use them as a checklist alongside the generic checklist above.

Pattern family 1 — Incorrect formulas (very common)

Sign or condition inversion — the proof writes β > q when the algorithm uses β ≤ q; the printed bound has a sign error inconsistent with the derivation.
Wrong-direction substitution in inductive chains — substituting R⁻ for R⁺ (or any lower-bound term where the chain needs an upper bound). The inductive hypothesis gives a + bound; the chain then uses a - bound that is strictly smaller.
Dropped intermediate terms — an algebraic step quietly removes a ceiling function, jitter term, blocking term, or context-switch cost without justification. Re-derive the step.
Wrong-direction floor/ceiling — a safe (upper) bound requires ceilings up and floors down for additive terms and the opposite direction for subtracted terms. Mis-rounding flips the bound's safety.
Constants assumed = 1 when the paper's own data contradicts it — e.g., a derivation assumes γ = 1 while a measurement table elsewhere reports γ ranging to 1.095. Compare formula assumptions against any empirical tables.
One-sided pruning rules — an algorithm prunes dominated states forward but not backward (or vice versa). Check that domination is closed in both directions.
Wrong denominator / missing structural parameter — e.g., ⌊(D − ⌈n/m⌉·E) / (m−1)⌋ where (m−1) does not correspond to any meaningful structural quantity. Ask whether each denominator is justified by a counting argument.

Pattern family 2 — Proof gaps (common — but see the calibration filter below)

Implicit assumptions that fail in edge cases — proof claims "the extra release can always be credited as a skip" but this fails when the release pattern is degenerate.
Set-inclusion claims that are provably false — A ⊆ B stated as a step but exhibitable as false on a concrete task set.
Load-bearing leaps between bounds — Equation 12 bounds one quantity, methodology silently extends to a different quantity, no formal bridge.
LP-monotonicity / iteration conflation — proof claims monotone decrease of an objective across iterations, but re-linearization can increase it.
Dangling references — proof cites "Lemma 4" that does not appear in the paper. Verify every cross-reference resolves.

Pattern family 3 — Missing preconditions (common)

Strengthened conditions used silently — theorem says "backlogged", proof needs "continuously backlogged"; lemma's monotonicity assumption described as "ideal" in prose rather than stated as a precondition.
Structural facts stated where preconditions belong — "at most one shared switch" treated as a geometric fact when it's actually a precondition on the routing topology.
Standard system-model assumptions implicit — FIFO ordering, work-conservation, homogeneity, non-preemption — invoked in the proof but absent from the theorem statement.
Implicit safety assumptions in systems / security claims — assumes interrupt handlers are memory-safe; assumes DMA controllers are absent or separately protected.
Necessary cost / benefit inequalities — an algorithmic claim that A dominates B implicitly requires a specific cost inequality, often unstated.

Pattern family 4 — False independence (RT-specific, hard to spot generically)

Split sub-jobs treated as independent sporadic jobs — when a parent task is split, the children share the parent's release / precedence; processor-demand inequalities for independent sporadic tasks do not apply.
Reconvergent DAG paths treated as independent — when two parallel paths share a common downstream node, their execution times are correlated through that node.
Multiple maxima over a coupled set — independently maximizing two sums (e.g., the (m−1)-largest PSBF and CSBF carry-in corrections) that share an underlying carry-in set is unsafe.

Pattern family 5 — Incomplete case analysis

Asymmetric handling — read case treated, write case not (or vice versa) when the proof's logic is symmetric.
Single-event handling when batch effects matter — handling the message currently under analysis when a shared corruption event affects multiple messages.
Preemption-by-context-switch but not preemption-by-interrupt — distinguish matters for register-save ordering / atomicity arguments.
Boundary state omissions — queue overflow during consecutive misses; releases exactly at hyperperiod boundaries; first-job edge cases in an iteration.

Pattern family 6 — Notation / quantifier issues (when they're real flaws, not typos)

Off-by-one in quantifier range — theorem says ∀ l > 1 where the correct range is ∀ l > 0. Test the boundary value.
"Almost surely" in proof but not in theorem statement — proof establishes a weaker probabilistic conclusion than the theorem asserts.
Prose vs equation mismatch — definition prose says "opposite or null effect"; equation captures only one branch.

Distinguish real notation flaws (above) from typesetting errors (numerator/denominator swap, max/min typo, former/latter swap) that don't affect the underlying math. The former are flawed; the latter are at most minor concern.

False-positive calibration filter (apply before flagging `likely_flawed` for a proof gap)

The single largest source of audit false positives is spurious proof_gap flags — the audit perceives a gap that the paper actually covers through one of the channels below. Before classifying anything as likely_flawed based on a missing step, explicitly check these four questions:

Is the proof deferred to an external source?
- Companion technical report cited in the bibliography.
- "Forthcoming complete version" or "extended version" notice.
- Appendix at an institutional URL.
- If yes → the apparent gap is a presentation gap relative to the paper, not necessarily a logical gap in the result. Verdict: at most uncertain, with the deferred source named in Recommended next step as material to retrieve.
Is the missing step a standard result that any expert reader would invoke without restatement?
- Liu-Layland / Dertouzos (EDF optimality on implicit-deadline tasks).
- McNaughton's wrap-around rule.
- Chetto / Silly-Chetto processor-demand inequalities.
- Bertogna's interference bound; Davis-Burns response-time recurrence (when invoked rather than re-derived).
- Standard probability inequalities (Hoeffding, Chernoff, Markov, Jensen).
- Standard algebraic identities (telescoping, geometric series, Cauchy-Schwarz).
- If yes → the gap is presentation-only; the proof is sound for an expert audience. Verdict: at most uncertain if you flag at all.
Is the apparent gap load-bearing for the claim, or merely exposition?
- Load-bearing: removing the step (or weakening it) breaks the conclusion.
- Exposition: the step is a smoothing or motivating remark; the rigorous chain runs through other equations.
- If exposition-only → not a flaw. Do not flag.
Could a concrete counterexample plausibly demonstrate the failure?
- If yes (a falsifiable claim — numerical bound, constructive existence, schedulability decision) → the gap is potentially load-bearing; flag with Counterexample-falsifiable? yes.
- If no (a stylistic / pedagogical gap with no falsifiable failure mode) → mark Counterexample-falsifiable? no and downgrade verdict accordingly.

Rule of thumb: if two or more of (1)–(3) apply, the correct verdict is uncertain, not likely_flawed. Reserve likely_flawed for gaps that are load-bearing, not covered by deferred material or standard results, and plausibly falsifiable.

Two-axis decomposition: claim correctness vs. proof soundness

A common and important class of audit findings is "the proof is unsound, but the result happens to be true." The verdict scale alone cannot express this. Always answer both of these questions explicitly, then map to a verdict:

Axis A — Result truth

Independently of how the proof argues for it: do you believe the theorem's statement is true?

true — You can sketch an alternative argument that establishes the result (e.g., it follows from a well-known prior result the paper invokes; you mentally verified the result on representative instances).
likely_true — Plausible but not independently verified; you have not constructed a counterexample after a serious attempt.
uncertain — You cannot tell whether the result is true.
likely_false — Concrete reasons (boundary cases, edge inputs) suggest the result fails, but you have not exhibited a counterexample.
false — A counterexample exists (in this audit or elsewhere).

Axis B — Proof soundness

Independently of whether the result is true: does the published proof argument validly establish it?

sound — Every step follows; no skipped reasoning; preconditions of cited lemmas hold; quantifiers and case analysis complete.
mostly_sound — Minor expository gaps that any expert reader closes mentally; not load-bearing.
unsound — One or more steps are invalid: false set-inclusion, wrong-direction substitution, missing case, false independence, dropped term, dimensional inconsistency, dangling reference, etc. The published reasoning does not establish the result.
unsound_but_recoverable — Same as unsound, but a different known argument does establish the result (e.g., the paper attempts a novel proof technique that fails, but the conclusion follows from Liu-Layland on the constrained-deadline subcase). Specify the alternative argument.

Verdict mapping

The verdict reflects the worse of the two axes (because the worse one drives risk):

Result truth	Proof soundness	Verdict
`true` / `likely_true`	`sound` / `mostly_sound`	`correct` or `likely_correct`
`true` / `likely_true`	`unsound_but_recoverable`	`uncertain` with the proof-only tag (see below)
`true` / `likely_true`	`unsound`	`uncertain` (often the right call: claim probably stands, but the published proof needs rewriting)
`uncertain`	any	`uncertain`
`likely_false` / `false`	any	`likely_flawed` or `flawed`

This means likely_flawed and flawed are reserved for result-falsity verdicts. Proof-soundness-only issues get uncertain with explicit proof_unsound tagging in the output (see Output Format below).

Patterns specifically for "unsound proof of correct claim"

These are mechanisms we have seen repeatedly:

Invalid load-bearing step, but the result follows from a standard prior result. The proof's central inequality is wrong, but Liu-Layland / Dertouzos / McNaughton / Chetto already establishes the claim on this paper's task model. Tag: claim_holds_via_prior_work.
Two errors that cancel. A sign error in step 5 is undone by a sign error in step 8. The chain is unsound but the final inequality is correct. Tag: cancelling_errors.
Sketchy generalization, correct on the specific case. The proof sketches an argument that doesn't generalize to all parameters, but the theorem's quantifiers restrict to the case where the argument does work. Tag: over-general_argument_correct_on_restricted_quantifier.
Wrong intermediate claim, but the right conclusion. A set-inclusion claim A ⊆ B is provably false, but the proof's actual use of the claim is f(A) ≤ f(B) for monotone f, which holds for a different reason. Tag: false_intermediate_unused_in_substance.
Right algorithmic intuition, wrong formal proof. Common in systems / algorithm papers: the algorithm works, the high-level intuition is correct, but the formal correctness argument has a hole. Tag: algorithm_correct_proof_does_not_establish.

When you identify one of these patterns, do not flag the result as likely_flawed. The verdict should be uncertain with proof_unsound_but_recoverable in the soundness axis, and the Recommended next step should suggest rewriting the proof rather than constructing a counterexample. A counterexample search will fail; the author needs proof revision, not result retraction.

Why this distinction matters for downstream stages

find-counterexample should not be dispatched for unsound + true results — CX search will return no_counterexample and waste effort. The audit's Counterexample-falsifiable? flag should reflect this: if your audit concluded "proof is unsound but the result still holds via Liu-Layland", then no counterexample is forthcoming, mark every issue Counterexample-falsifiable? no.
writeup-finding for proof-only flaws should emphasize what to fix in the proof and reference the alternative argument that does establish the result. Safety-impact field becomes no_safety_impact (or proof_only); severity becomes moderate (still publishable as a correction, but not an erratum).
stress-test-defense for proof-only flaws should pressure-test whether the alternative argument actually works under the paper's exact preconditions, not whether the result holds.

Output Format

Produce a Markdown report:

# Audit: <result label>

**Mode**: rigorous | adversarial
**Verdict**: correct | likely_correct | uncertain | likely_flawed | flawed
**Confidence**: high | medium | low

**Result truth**: true | likely_true | uncertain | likely_false | false
**Proof soundness**: sound | mostly_sound | unsound | unsound_but_recoverable
**Soundness-pattern tag (if applicable)**: claim_holds_via_prior_work | cancelling_errors | over-general_argument_correct_on_restricted_quantifier | false_intermediate_unused_in_substance | algorithm_correct_proof_does_not_establish | none
**Alternative argument (if soundness is `unsound_but_recoverable`)**: brief description of the alternative argument that establishes the result.

## Summary

1–3 sentence overall assessment. State the bottom line in two parts: (a) is the result likely true? (b) does the proof actually establish it? If these answers diverge (true result, unsound proof), call that out explicitly — it changes the recommended fix from "retract the result" to "rewrite the proof".

## Issues

For each issue:

### Issue 1: <short title>

- **Type**: logical_gap | assumption_error | boundary_case | dependency_error | quantifier_error | arithmetic_error | other
- **Severity**: minor | moderate | serious | critical
- **Location**: where in the proof (quote a short fragment if useful)
- **Description**: what the issue is, in 1–3 sentences
- **Why it matters**: what claim is at risk; whether it affects safety, optimality, or only expository clarity
- **Suggested fix**: how the author could repair the argument (or `null` if the result itself is wrong)
- **Counterexample-falsifiable?**: yes | no — would a concrete construction (task set, schedule, parameter assignment) plausibly exhibit the failure? Set `yes` only if a numerical or constructive counterexample is conceivable; set `no` for proof-style critiques, terminology disputes, or hardware-behavioral claims that no numerical example can refute.

(Repeat for each issue.)

## Positively verified

Bulleted list of the proof steps you *did* verify, so the author knows what you spent effort on. This makes the audit auditable.

## Recommended next step

Decide from the two-axis decomposition:

- **Proof soundness `unsound_but_recoverable` (result likely true, proof broken)**: recommend **rewriting the proof** along the lines of the alternative argument. Do NOT run `find-counterexample` — it will find nothing because the result is true. Run `stress-test-defense` to pressure-test whether the alternative argument actually closes the gap under the paper's exact preconditions.
- **Result `likely_false` / `false`** with `Counterexample-falsifiable? yes` at moderate severity or worse: recommend running `find-counterexample`.
- **Proof gaps that are purely expository** (presentation, deferred to external source, standard result invoked): recommend tightening the proof's exposition; no further audit action.
- **Verdict `likely_correct` or `correct`**: no further action.

Verdict discipline

correct — Every step verified; no gaps, no boundary cases missed, no dependency issues.
likely_correct — Verified the key steps; minor issues at most. Reserved for proofs you have positively checked, not for proofs where you merely failed to find a problem.
uncertain — Real concerns but no concrete construction would refute the claim. Often the right verdict for proof gaps.
likely_flawed — Concrete, falsifiable concern. A counterexample search is warranted.
flawed — A counterexample already exists in this audit, or the logical error is self-evident from the proof text alone.

Reserve likely_flawed for cases where find-counterexample could plausibly succeed. Don't escalate purely-expository issues to likely_flawed.

audit-proof

Invocation

Context Preview

SKILL.md

audit-proof

Invocation

Context Preview

SKILL.md

Skill: Audit Proof

Role

Domain packs

Mode

Inputs

Audit Checklist

Step 0 — Verify the governing formula against the source (do this first, for PDF input)

Logical validity

Assumption consistency

Boundary and edge cases

Dependency correctness

Restatement verification (cross-paper)

Cross-reference verification (do this systematically)

Quantifier and scope issues

Arithmetic and dimensional checks

Common flaws — empirical patterns

Pattern family 1 — Incorrect formulas (very common)

Pattern family 2 — Proof gaps (common — but see the calibration filter below)

Pattern family 3 — Missing preconditions (common)

Pattern family 4 — False independence (RT-specific, hard to spot generically)

Pattern family 5 — Incomplete case analysis

Pattern family 6 — Notation / quantifier issues (when they're real flaws, not typos)

False-positive calibration filter (apply before flagging likely_flawed for a proof gap)

Two-axis decomposition: claim correctness vs. proof soundness

Axis A — Result truth

Axis B — Proof soundness

Verdict mapping

Patterns specifically for "unsound proof of correct claim"

Why this distinction matters for downstream stages

Output Format

Verdict discipline

Similar Skills

Skill: Audit Proof

Role

Domain packs

Mode

Inputs

Audit Checklist

Step 0 — Verify the governing formula against the source (do this first, for PDF input)

Logical validity

Assumption consistency

Boundary and edge cases

Dependency correctness

Restatement verification (cross-paper)

Cross-reference verification (do this systematically)

Quantifier and scope issues

Arithmetic and dimensional checks

Common flaws — empirical patterns

Pattern family 1 — Incorrect formulas (very common)

Pattern family 2 — Proof gaps (common — but see the calibration filter below)

Pattern family 3 — Missing preconditions (common)

Pattern family 4 — False independence (RT-specific, hard to spot generically)

Pattern family 5 — Incomplete case analysis

Pattern family 6 — Notation / quantifier issues (when they're real flaws, not typos)

False-positive calibration filter (apply before flagging likely_flawed for a proof gap)

Two-axis decomposition: claim correctness vs. proof soundness

Axis A — Result truth

Axis B — Proof soundness

Verdict mapping

Patterns specifically for "unsound proof of correct claim"

Why this distinction matters for downstream stages

Output Format

Verdict discipline

Similar Skills

False-positive calibration filter (apply before flagging `likely_flawed` for a proof gap)

False-positive calibration filter (apply before flagging `likely_flawed` for a proof gap)