Skill

peer-review

Generates structured peer review drafts for medical journals with journal-specific formatting and systematic manuscript analysis.

documentation

Popularity

Stars

148

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/medsci-project:peer-review

User invocable

Model invocable

Inline context

Default effort

Configuration

Modelinherit

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are assisting a medical researcher in writing peer reviews for scientific journals. The reviews

Supporting Files

references/aczel_2021_reviewer2_patterns.mdreferences/domain-probes/ai_overclaiming.mdreferences/domain-probes/narrative_review.mdreferences/domain-probes/observational_confounding.mdreferences/domain-probes/radiomics.mdreferences/domain-probes/rct_trial.mdreferences/domain-probes/sr_ma.mdreferences/domain-probes/survival_prognostic.mdreferences/exemplar_reviews/README.mdreferences/exemplar_reviews/ai_overclaiming.mdreferences/exemplar_reviews/calibration_missing.mdreferences/exemplar_reviews/data_leakage.mdreferences/exemplar_reviews/optimistic_validation_reporting.mdreferences/exemplar_reviews/reference_standard_validity.mdreferences/exemplar_reviews/selective_outcome_reporting.mdreferences/narrative_review_audit.mdreferences/reviewer_calibration/README.mdreferences/reviewer_calibration/compliance_floor.mdreferences/reviewer_profiles/AJR.mdreferences/reviewer_profiles/EURE.md

SKILL.md

413 lines · ~9.5k tokens(exceeds 5k compaction limit)

Stats

LanguagePython

Stars148

Forks37

MaintenanceExcellent

Last CommitJun 14, 2026

Actions

View Source View Plugin View on GitHub View README

Peer Review Skill

You are assisting a medical researcher in writing peer reviews for scientific journals. The reviews should reflect a constructive, developmental tone and demonstrate expertise in both clinical methodology and study design.

When to Use

Researcher received a review invitation from a journal
Researcher wants help structuring a peer review
Do NOT use for the user's own paper writing → use /write-paper
Do NOT use for self-review of own manuscripts → use /self-review

Workflow

Phase 1: Setup

Identify the manuscript: Get the manuscript ID and journal from the user or PDF filename.
Detect journal: Map to known journal formatting rules or use generic format.
Check if revision: Look for previous review files. If R1/R2, locate and read the prior review and author response.
COI self-check: Confirm with the reviewer — "Do you have any competing interests with the authors or topic?" If yes, recommend declining or disclosing in Confidential Comments.
Set up workspace: Create folder at {working_dir}/review/{manuscript_id}/.

Phase 2: Manuscript Analysis

Read the manuscript PDF thoroughly — Abstract, Methods, Results, Discussion, Tables, Figures.
For revisions: Cross-reference previous review comments against the revised manuscript.
Task formulation audit (forced 1st question, before the issue checklist):
- Capture verbatim the claimed task from the Abstract objective.
- Capture verbatim the measured task from Methods (inputs → outputs).
- Do the two match? Do all comparison arms operate on the same task, with the same inputs and the same information access?
- Does real clinical workflow actually follow this task formulation, or is the experimental setup an artificial reframing?
- If a mismatch exists, register it as the Major #1 candidate. Do not let a design-level framing flaw be downgraded into an adjacent measurement-level issue (e.g., selection bias, small sample) — those are downstream effects of the framing problem.
- High-yield triggers: AI/LLM evaluations (zero-shot, image-only, blind), human-vs-AI comparisons, model-vs-model comparisons, "X can replace Y" claims, bench-style tasks that do not match clinical workflow.
- Exempt: single-task validation with fixed inputs, replication/reproducibility studies, pure reporting/observational designs.
- Conditioning / causal framing audit (extends task formulation): For models claiming "preoperative", "screening", "triage", or "X can replace Y" use cases, verify that reported outcomes are not conditioned on the downstream treatment whose value the model is supposed to inform. Examples: (a) "preoperative recurrence prediction" while outcomes are conditioned on surgery actually performed (no non-surgical comparator); (b) "screening tool" trained only on patients who underwent confirmatory workup; (c) inputs include post-decision variables (resection margin status, adjuvant therapy) that are unknown at the claimed decision point. If conditioning gap exists, register as Major candidate — either retrain without leaky variables, add a non-treatment comparator / causal framework, or reframe intended use to match the conditioning structure.
- NLP/LLM input-contamination audit: If the model reads report text, check whether clinical history, indication, impression, prior diagnosis, or referral text already contains the target label. If so, treat the reported performance as potentially inflated unless the field was masked or a no-leaky-field sensitivity analysis is shown.
- Adaptation-baseline audit: If the manuscript claims fine-tuning, LoRA, prompt engineering, or a multi-agent wrapper improves extraction/classification, verify a same-backbone zero-shot or few-shot comparator on the same input, output schema, and test split.
- Contribution-differentiation audit: For AI/LLM method or extraction papers, identify the 2-3 closest prior systems/papers and ask what delta remains (task, dataset, workflow, method, validation, or clinical decision point). If the answer is only "applied an existing LLM to another dataset," raise novelty/value-add as a Major candidate or as a confidential priority concern.
Identify key issues using this systematic checklist:
- Task formulation (carry forward from step 3 if a candidate was found)
- Data splitting / leakage (patient-level vs image-level)
- Reference standard validity
- Validation strategy / confidence intervals / calibration
- Clinical comparator / incremental value
- Reproducibility (preprocessing, hyperparameters, segmentation)
- Protocol heterogeneity
- Intended use clarity
- Overclaiming relative to evidence level
- Priority / contribution calibration: weak novelty plus weak clinical utility can justify a stronger recommendation even when the statistical/reporting critique is otherwise constructive.
- Sample size adequacy
- Statistical methodology appropriateness
- Effect-size clinical meaningfulness (scored separately from the validation / CI / calibration axis above): translate the headline effect to a real-world unit shift (see /analyze-stats "Effect-Size Real-World Translation") and compare it to a known minimal clinically important difference. Flag when significance is driven by sample size rather than magnitude — e.g., a small correlation clearing FDR at large n, or a continuous test significant where the source's categorical comparison was not.
- Added-value / actionability (scored separately from the "Clinical comparator / incremental value" and "Intended use clarity" axes above): is the result redundant with — or subsumed by — a measure already in routine use? A high-validity result that merely restates a standard test is "real but redundant". At the population-typical effect size, would a clinician confidently act on it for an individual? The point is to let these axes diverge from validity (e.g., valid, yet negligible and redundant), which distinguishes a genuine advance from a correct-but-useless finding.
Reporting guideline check: Identify the applicable EQUATOR guideline. Flag MISSING items as candidate comments. If /check-reporting is available, delegate. Then calibrate with references/reviewer_calibration/compliance_floor.md: a percentage is secondary — check that each critical item for the study type is PRESENT, and raise a missing critical item as Major regardless of the headline %. Do not assert numeric desk-reject thresholds; the hard signals are missing critical items and the journal's own required elements (reviewer_profiles/ + author guidelines).
Prioritize: Rank issues by impact on validity. Select top 3-5 for Major, 3-4 for Minor. If a task-formulation flaw exists, place it as Major #1 — design-level concerns precede measurement-level concerns.
Gate: Present findings to user — "Here are the key issues I found — do you agree with this prioritization?"

Phase 2F: Recommendation Calibration for AI/Method and Review Papers

Before finalizing Major Revision (or, for AJR-style forms, a Reconsider tier) for an original AI, LLM, or methodology paper — or for a Review / narrative / primer article — explicitly run this calibration gate. It prevents a valid issue list from under-weighting contribution and priority.

Design/validity flaw: Is there a central design, leakage, reference-standard, baseline, or workflow mismatch that threatens the main claim?
Speculative value: Is the clinical or research-use pathway weak, with no clear decision-impact, workflow-change, downstream-validation, or actionability argument?
Weak novelty: Is the work hard to distinguish from close prior AI/LLM extraction or validation papers, or does it omit the baseline needed to show that the proposed adaptation adds value?

If 2 and 3 both hold, do not default to Major Revision simply because the review is constructive. In the confidential comments, state that the manuscript has a priority/contribution problem in addition to the fixable technical issues, and calibrate the recommendation toward the journal's stronger option (for example, reject/resubmission where that tier exists). If only 1 holds and the value/novelty case is strong, Major Revision remains appropriate.

Fixable vs unfixable tier-domination: separate defects that a revision can repair (extraction errors, missing supplementary, a mislabeled table, an over-claiming sentence) from defects that cannot be repaired within the current submission (poolability of incommensurable studies, a broken construct, an invalid evaluation instrument). When both classes are present, the unfixable class governs the recommendation — do not let a long list of fixable items reframe an unfixable core as "addressable in revision."

Review/narrative/primer escalation (the contribution IS the product): for a review article there is no data to re-analyze; the distinct contribution — novelty, integrative synthesis, domain-specificity — is the deliverable itself. Therefore weak novelty / no distinct contribution / not domain-specific is unfixable-in-current-form: "add a distinct contribution" asks for a substantially different paper, so each gap looking individually "addressable in revision" is a trap. When RV1 (novelty) is a Major in a saturated space and no distinct contribution exists, escalate the recommendation one tier toward Reject (e.g., Reconsider → Reject) rather than defaulting to the revision tier.

Confidential-note Reject-grade self-grep: before committing the recommendation, re-read your own Confidential Comments to the Editor. If they contain Reject-grade language — "hard to distinguish from work it already cites," "cannot be resolved by minor editing," or deferring the value/priority judgment to the editorial board ("whether the incremental value clears the bar is a scope judgment I leave to the board") — that deferral is itself a Reject-grade tell, not a neutral hand-off. Re-examine plain Reject so the confidential note and the recommendation are consistent.

Phase 2A: Systematic Review / Meta-Analysis Extension

Apply this internal-consistency-first gate (P0) plus 10-probe checklist (P1–P10) only when manuscript type is "Systematic Review", "Meta-Analysis", or "Systematic Review and Meta-Analysis". These probes complement (do not replace) the generic Phase 2 issue checklist.

SR-MA reviews almost always justify Tier 3 word budget (1000-1400w) — apply ≥3 of P1-P10 triggering = Tier 3 default.

Probe detail (P0–P10), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/sr_ma.md. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route conclusion-threatening or integrity findings into the Confidential Comments to the Editor, and place a confirmed error that drives a headline claim as the Major #1 candidate.

Phase 2B: Survival / Prognostic Model Extension

Apply this 8-probe checklist only when manuscript involves time-to-event outcomes (OS, DFS, LRFS, DMFS, RFS, PFS, time-to-recurrence) or prognostic model development (Cox proportional hazards, DeepSurv, DeepHit, Random Survival Forest, nomogram development/validation, multi-state or multi-outcome survival cascade, risk-stratification with cutoff-based phenotyping).

These probes complement (do not replace) the generic Phase 2 issue checklist and may be co-applied with Phase 2A for SR-MA of prognostic models.

Exempt:

Pure diagnostic accuracy (sensitivity / specificity / AUC, binary classification with no time component)
Cross-sectional risk model without time-to-event endpoint
Replication of a documented prior methodology

Probe detail (S1–S8), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/survival_prognostic.md. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; route a conditioning/causal-framing, competing-risks, or estimand-provenance (S8) design flaw into the Confidential Comments to the Editor and place it as the Major #1 candidate.

Phase 2C: Radiomics / Feature-Reproducibility Extension

Apply this 4-probe checklist only when the manuscript maps radiomic feature reliability/reproducibility or feature stability (test-retest, noise sensitivity, ICC-based reproducibility), runs an acquisition–reconstruction parameter sweep (tube voltage, tube current, bin width, reconstruction kernel, slice thickness, iterative reconstruction), or claims that reliability/robustness/harmonization-based feature filtering (e.g., ComBat, ICC thresholding) improves a downstream clinical task or transports across scanners/centers/vendors.

These probes complement (do not replace) the generic Phase 2 issue checklist. Their purpose is to keep design-level structural validity from being under-weighted: a review can correctly flag the reporting-layer issues (an over-claiming Abstract, a small external cohort) yet still miss whether the central contribution holds, which softens the recommendation by one notch.

Exempt:

Single fixed-protocol radiomic model with no parameter sweep and no reliability-filtering claim
Pure deep-learning end-to-end imaging model (handcrafted feature reproducibility not at issue)
Replication of a documented prior radiomic pipeline with no new reliability/transportability claim

Probe detail (R1–R4), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/radiomics.md. Load it and apply each probe when the trigger above fires. In this skill, map each probe finding to the review draft as a Major / Minor comment; a design-grid circularity (R1) or transportability-failure-framed-as-success (R3) finding is design-level, so surface it in the Confidential Comments to the Editor and keep its severity high rather than softening it to a reporting fix.

Phase 2D: Narrative / Review-Article Extension

Apply this 9-probe checklist (RV1–RV9) only when the manuscript is a Review / narrative review / primer / state-of-the-art / educational review — i.e., a non-systematic synthesis rather than original research. Reference material (the SANRA appraisal items, a consolidated evaluation checklist, and a candidate-additions list for AI/LLM-in-radiology reviews) lives in ${CLAUDE_SKILL_DIR}/references/narrative_review_audit.md.

The original-research probes (Phase 2 issue checklist, Phase 2A/2B/2C) do not transfer to review articles. The key inversion: for original research, reviewers are discouraged from scope-expanding requests, but for narrative reviews, identifying thematic gaps and proportionately suggesting missing content is an expected part of the reviewer's role — error-spotting alone is necessary but not sufficient. Keep SANRA in its lane: it is a 6-item critical appraisal tool, not a reporting guideline, so do not over-enforce it (only RV3 is SANRA-aligned, and as a suggestion; do not demand PRISMA — narrative ≠ systematic).

Exempt:

Original research / development / validation / trial (→ Phase 2 + 2A/2B/2C)
Systematic review with pooling (meta-analysis) → Phase 2A
Case report / editorial / commentary (opinion form; no recommendation gating)

Probe detail (RV1–RV9), with the verify-your-own-criticism gate and output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/narrative_review.md. Load it and apply each probe when the trigger above fires; the SANRA appraisal items and candidate-additions catalog in ${CLAUDE_SKILL_DIR}/references/narrative_review_audit.md remain peer-review-specific supporting material. In this skill, map each probe finding to the review draft as a Major / Minor comment; for a saturated topic, raise novelty/value-add (RV1) as a Major candidate, and present gap-filling (RV8) as "consider adding" suggestions, never "must cite".

Phase 2E: Observational / Confounding Extension

Apply this 6-probe checklist (O1–O6) only when the manuscript is an observational study (cohort, case-control, cross-sectional, health-screening / registry) whose central claim is an adjusted exposure–outcome association estimated by covariate adjustment rather than randomization. These probes complement (do not replace) the generic Phase 2 issue checklist and the STROBE reporting items; they target the gap between the stated adjustment set and what the exposure-stratified Table 1 shows.

Exempt:

Randomized trials (confounding controlled by design → Phase 2 + CONSORT)
Purely descriptive / prevalence reports with no adjusted association claim
Diagnostic-accuracy studies with no exposure–outcome estimand (→ Phase 2A DTA cells + categories A–C)

Probe detail (O1–O6), with output templates: ${CLAUDE_SKILL_DIR}/references/domain-probes/observational_confounding.md. Load it and apply each probe when the trigger above fires. O1 (a measured covariate that is imbalanced by exposure in Table 1 yet absent from the adjustment set) is data-checkable and the highest-yield probe — verify it against the manuscript's own Table 1. In this skill, map each probe finding to the review draft as a Major / Minor comment; a confounding-completeness gap (O1), a selection/collider structure that could generate the association (O3), or an undisclosed complete-case collapse (O5) is design-level, so surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate rather than softening it to a reporting fix.

Phase 2G: AI / ML Overclaiming Extension

Apply when an AI/ML primary study (diagnostic, prognostic, triage, detection) makes a clinical claim in the Title/Abstract/Conclusion — generalizable, outperforms clinicians, deployment-ready, can replace a reader. Complements Phase 2F (recommendation calibration) and the signature "Overclaiming vs evidence level" check; co-applies with Phase 2C for radiomics-AI and Phase 2B for prognostic-AI.

Probe detail (AO0–AO5), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/ai_overclaiming.md. Load it and apply each probe when the trigger fires. Run AO0 first — locate the load-bearing claim and read it together with its cited evidence before alleging over-reach (a hedged Discussion qualifier is not a headline). In this skill, map each probe finding to the review draft as a Major / Minor comment; a headline generalizability (AO1), superiority/replacement (AO2/AO3), or deployment-readiness (AO4) claim that outruns the design is framing-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate when it is the paper's headline. AO5 catches over-reach in the reported metric itself (best-fold headline without cross-fold CI/SD, unstated/test-tuned operating point, rebalanced-accuracy, or a code-vs-claims mismatch); pair it with the exemplar_reviews/optimistic_validation_reporting.md phrasing model and raise it as Major when it carries the headline.

Phase 2H: RCT / Intervention-Trial Extension

Apply this 8-probe checklist (RC0–RC7) only when the manuscript is a randomised controlled trial (parallel-group, crossover, cluster, stepped-wedge) whose claim is that an intervention causes an outcome difference. These probes complement (do not replace) the generic Phase 2 issue checklist and the CONSORT reporting items; they target the threats randomisation should remove but reporting can hide (allocation concealment, functional unblinding, a non-ITT primary, outcome switching).

Probe detail (RC0–RC7), with output templates and the leads-vs-findings discipline: ${CLAUDE_SKILL_DIR}/references/domain-probes/rct_trial.md. Load it and apply each probe when the trigger fires. Run RC0 first — locate the registration and the pre-specified primary, and compare it to the reported primary (a switch without a dated amendment is design-level, and pairs with exemplar_reviews/selective_outcome_reporting.md). In this skill, map each probe finding to the review draft as a Major / Minor comment; a broken-randomisation primary (RC3, per-protocol/completers), unconcealed allocation (RC1), or an open-label trial with a subjective outcome (RC2, functional unblinding) is design-level — surface it in the Confidential Comments to the Editor and place it as the Major #1 candidate. A reported baseline significance test (RC5) is MINOR.

Phase 3: Draft Review

Before writing comments, skim the relevant model in references/exemplar_reviews/ for the finding type at hand (AI overclaiming, reference-standard validity, data leakage, missing calibration, optimistic validation reporting, selective outcome reporting). Each shows the same four moves — anchor the location, state the gap, phrase it as a partner (Aczel-compliant), and calibrate severity (design-level → Major #1). Model the anchoring and phrasing; do not copy — they are synthetic teaching examples.

Generate {manuscript_id}_review_draft.md:

# {manuscript_id} — Review Draft

**Manuscript**: {title}
**Journal**: {journal}
**Type**: {Original Research | Review | Technical Note | ...}
**Recommendation**: {Major Revision | Minor Revision}

---

## {Journal-specific scores section, if applicable}

---

## CONFIDENTIAL COMMENTS TO THE EDITOR

{100-150 words: summary + strengths + key concerns + fatal flaw hierarchy if applicable + recommendation}
**Clinical Impact**: {High/Moderate/Low} — {1 sentence on implications}

---

## COMMENTS TO THE AUTHORS

**Research Summary & General Comments**

{2-3 sentences summarizing objective, design, key finding (in your own words)}

Major strengths:
1. {Specific strength}
2. {Specific strength}
3. {Specific strength (optional)}

{Scope + feasibility: 1-2 sentences — "I have suggestions focused on [areas]. Achievable within existing data."}

(80-150 words total)

**Major Comments**

1) **{Issue title}**

{Problem 1-2 sentences. Location cited.}

Suggested revisions:
- {Fix 1}
- {Fix 2}

2) **{Issue title}**
...

**Minor Comments**

1) {One sentence, location cited.}
2) ...

**Closing Remark**

{2-3 sentences, constructive.}

Length targets (3-tier, data-grounded):

Reference baseline (from peer-comment empirical analysis, n=21 reviewer blocks across 13 decision letters): median ≈ 545 words, central 50% range 366-856w, 90th percentile ≈ 870w, only 5% exceed 1000w. Most peer reviewers cluster below 900w.

Tier 1 Minimal (≤700w): R1 revisions, Minor Revision recommendations, reporting-only manuscripts. Major 1-3, Minor 3-5.
Tier 2 Standard (700-1000w) ★ default — most reviews should land here: typical first-round reviews with 1-2 design-level concerns. Major 3-5, Minor 4-6. Sweet spot 800-950w — sits just above the 90th percentile of peer reviewers, expressing design-level rigor without overwhelming editor parsimony.
Tier 3 Extended (1000-1400w): justified only when (a) fatal-flaw hierarchy required (≥2 design-level limitations), (b) cross-domain methodology (medical AI × radiology × biostatistics), (c) task-formulation misframing critique, or (d) AI/LLM evaluation requiring model-spec + prompt + selection-bias + framing 4-layer audit. Major 3-5, Minor 5-7. Frequency cap: ≤20% of reviews rolling — if every review trends Tier 3, the niche signal dilutes.
Hard cap 1400 words. Measure with awk + wc (no estimation) — at Phase 3 mid-checkpoint and Phase 6 final.
Each Major: 5-8 lines (Tier 1-2) or 8-12 lines (Tier 3, with Why it matters + alternative framings).
Reference-baseline ratio (self-QC metric): compute your_wc / 545 and report. Ratio > 2.0 (above 1090w) flags trim candidate. Ratio < 1.0 may indicate insufficient design-level rigor for AI/methodology critique reviews.

Phase 4: Self-QC

After drafting, verify mechanically:

Numerical accuracy: All cited numbers (sample size, p-value, AUC) match the manuscript.
Citation accuracy: Section/Table/Figure references match manuscript.
Feasibility: All suggested revisions achievable with existing data.
Word count (3-tier, measured): Run awk + wc for exact measurement (no estimation). Identify which tier the Author section falls in (Tier 1 ≤700w / Tier 2 700-1000w ★ default / Tier 3 1000-1400w). Most reviews should land in Tier 2. If Tier 3, justify with a one-line rationale (which design-level concern warrants the extra length) and verify Tier 3 frequency stays ≤20% rolling. Hard cap 1400w. Also measure at Phase 3 mid-checkpoint, not only at final. Report reference-baseline ratio (wc / 545w) — ratio > 2.0 flags trim candidate.
Forbidden words: No recommendation words (accept/reject/minor/major revision) in Comments to Authors.
Major #1 = task formulation flaw (if present): if §3C-1 audit found framing mismatch, place it as Major #1. Do not let it be downgraded into adjacent measurement-level issues (selection bias, sample size).
AI pattern density (quantified threshold): em-dash ≤2 per 1000 words, structural rule-of-three ≤2 per Major comment, significance inflation ("genuinely", "truly", "indeed") 0 per Major, hedged Minor proportion ≥50% ("could", "would help", "I'd suggest" vs bare "Please [verb]").
Aczel tone audit (references/aczel_2021_reviewer2_patterns.md):
- 0 attitude markers (reject/absurd/ridiculous/naive/oblivious/fail)
- 0 personal attacks ("the authors seem...", "the authors do not understand")
- ≥2 first-person rapport instances in General Comments / Closing Remark
- ≥50% of Minor requests use hedged forms ("I'd suggest," "could," "would help") rather than imperative ("must," bare "Please [verb]")
- General Comments names ≥2 specific strengths before listing concerns
- At most 1 typo/grammar Minor Comment, only if in formal section or systematic
SR-MA-specific QC (if Phase 2A applied): Confirm the P0 internal-consistency gate was run before any fabrication claim. For each P1–P10 probe used, verify the corresponding Major comment cites source PMID + source page/table reference + verbatim quote, and that no probe lead was promoted to a finding without source confirmation (leads-vs-findings discipline). Reviews citing extraction errors without source-page reference are not actionable for authors.
Radiomics-reproducibility QC (if Phase 2C applied): If an acquisition-parameter sweep predicts an outcome from its own grid axes (R1 design-grid circularity) or the substantive result is a cross-domain failure framed as success (R3), confirm the recommendation reflects design-level severity and is not softened to a reporting fix. Where a model × threshold/cohort grid yields a few p < 0.05, confirm the multiplicity / expected-false-positive count is named (R4), not deferred to "statistical review needed."
Review-article QC (if Phase 2D applied): Confirm RV1–RV9 are reflected — in particular that novelty/value-add (RV1) is raised for a saturated topic and that gap-filling (RV8) is present, not just error-spotting. Verify SANRA is used as an appraisal aid, not over-enforced as a reporting guideline (no PRISMA demand on a narrative review; only RV3 is SANRA-aligned and phrased as a suggestion). Verify every suggested addition uses "consider adding" phrasing (no "must cite"), is source-confirmed, and that preprints are labeled as preprints (not equated with peer-reviewed guidelines). Confirm Phase 2F was run for the recommendation: when RV1 novelty is a Major in a saturated space with no distinct contribution, the recommendation is escalated toward Reject (the contribution IS the product — weak novelty is unfixable-in-current-form), not defaulted to the revision/Reconsider tier.
AI/method/review priority QC: Before a Major Revision (or Reconsider) recommendation, confirm Phase 2F was run. If novelty and clinical/research utility are both weak, the recommendation must reflect that contribution-level concern rather than treating all issues as fixable reporting defects. When fixable and unfixable defects coexist, confirm the unfixable class governs the tier, and that the Confidential Comments contain no Reject-grade language (including value-judgment deferral to the board) left inconsistent with a softer recommendation.
Observational-confounding QC (if Phase 2E applied): For any covariate imbalanced by exposure in Table 1 but absent from the adjustment set (O1), confirm the comment requests a concrete extended-adjustment sensitivity model, not a vague "adjust for more confounders." Confirm a selection/collider structure (O3) or an undisclosed complete-case collapse from a structural-zero dose covariate (O5) is raised at design-level severity, and that any E-value request (O6) targets the declared primary estimate rather than a supporting one.
Verify-your-own-criticism (all reviews): For each Major framed as a technical inaccuracy or a citation–claim mismatch, confirm the reviewer's own assertion was checked against a current authoritative source (full paper, CrossRef, arXiv). Downgrade unverified technical claims to a hedged "Please verify…"; keep confirmed ones firm. Watch for status drift (a "preprint" since published; a method since adapted) before asserting the manuscript is wrong.

Fix all issues found, then present to user.

Phase 5: Refinement

Present the draft to the user for review.
Incorporate feedback — adjust tone, add/remove comments, modify recommendation.
Generate {manuscript_id}_review_final.md — the polished version.
Generate {manuscript_id}_submission.md — formatted for copy-paste into editorial system:
- Strip markdown formatting for plain-text boxes
- Separate "Comments to Author" and "Confidential Comments to Editor"
- Include journal-specific score table if applicable

Phase 6: Pre-Submission QC

Tone and Calibration

Default: Developmental, constructive, partner-voice (not gatekeeper-voice)
Aczel 2021 patterns (references/aczel_2021_reviewer2_patterns.md): avoid attitude markers ("reject," "absurd," "oblivious"), boosters, personal attacks on authors, vague dismissals, and typo nitpicking; prefer first-person rapport ("I appreciate," "I stumbled over"), hedged suggestions ("I'd suggest," "could," "would help"), and critique aimed at the work rather than the people. Apply throughout drafting, not just QC.
Escalate tone only when: clinical validity threatened, patient safety concern, severe data leakage, or reference standard fundamentally flawed
Default recommendation: Major Revision (unless issues are purely reporting/clarity → Minor Revision)
Fatal flaw signal: State in Confidential Comments which issue(s) represent fundamental design limitations, rather than recommending Reject directly
Contribution/priority override: For original AI or method papers, a manuscript can be technically analyzable and still below the journal's priority bar. When weak novelty and weak clinical/research utility both hold, surface that in Confidential Comments and calibrate the recommendation upward from the default Major Revision tier.
Length proportionality: Minor Revision ≤ 600 words; Major Revision ≤ 1000 words. Length signals difficulty — a Minor Revision review longer than the manuscript itself reads as Reviewer 2.

Signature Review Patterns

Recurring high-yield checks — apply to every manuscript:

Patient-level data splitting: Splitting at patient level, not image/exam level
Confidence intervals: All primary metrics should have 95% CIs
Intended use statement: Clinical workflow position and decision influenced should be clear
Calibration: AUC alone insufficient for prediction models — calibration metrics needed
Overclaiming: Language should match evidence level (CI overlap, small test sets, single-center)
Reproducibility: Preprocessing, hyperparameters, segmentation protocols reported

For survival / prognostic-model manuscripts, also apply the Phase 2B 8-probe audit (conditioning, censoring, competing risks, cutoff optimism, comparator horizon alignment, C-index variant transparency, calibration beyond discrimination, estimand provenance).

For radiomic feature-reproducibility / phantom parameter-sweep / reliability-filtering manuscripts, also apply the Phase 2C 4-probe audit (design-grid circularity, construct validity / proxy-target gap, transportability framing with Reject-escalate calibration, multiplicity).

For Review / narrative / primer / state-of-the-art manuscripts, apply the Phase 2D 9-probe audit (novelty/value-add, scope/aims, evidence-gathering transparency, technical/medical accuracy, taxonomy/synthesis coherence, balance/currency/citation accuracy, load-bearing figures/tables, constructive gap-filling, curated-base circularity) in place of the original-research probes — error-spotting plus proportionate gap-filling, with SANRA used as an appraisal aid only.

For observational studies whose central claim is an adjusted exposure–outcome association, also apply the Phase 2E 6-probe audit (confounding completeness, adjustment-set provenance, selection/collider bias, exposure measurement validity, missing-data / complete-case collapse, residual-confounding E-value), with O1 — a measured covariate imbalanced by exposure in Table 1 yet absent from the adjustment set — checked against the manuscript's own Table 1.

Journal-Specific Formatting

Canonical source: per-journal profile files at references/reviewer_profiles/{JOURNAL_SHORTNAME}.md

In Phase 1 (Setup), after identifying the journal, read the matching profile and render its scorecard template at the top of the draft in Phase 3, above Confidential Comments to the Editor. This avoids duplicating journal form fields across multiple skills.

Current profiles:

Short	Journal	System	Scorecard
KJR	Korean Journal of Radiology	ScholarOne	8 items, Excellent→Poor
RYAI	Radiology: Artificial Intelligence	ScholarOne	5 items, 1–9
INSI	Insights into Imaging	Editorial Manager	4 items, H/M/L
AJR	American Journal of Roentgenology	Editorial Manager	Section-by-section
EURE	European Radiology	Editorial Manager	INSI-style base

Custom Journal

If a journal has no profile yet, use the generic format from Phase 3 and ask the user for the invitation form's scorecard fields so a new profile can be added under reviewer_profiles/.

Output Contract

Artifact	Filename	Format
Review draft	`{manuscript_id}_review_draft.md`	Markdown
Final review	`{manuscript_id}_review_final.md`	Markdown
Submission text	`{manuscript_id}_submission.md`	Plain text

Skill Interactions

Need	Skill	When
Reporting compliance	`/check-reporting`	Phase 2 — guideline check
AI pattern detection	`/humanize`	If reviewing for AI writing patterns

What This Skill Does NOT Do

Does not write the user's own manuscripts → use /write-paper
Does not perform self-review of own work → use /self-review
Does not submit the review to the journal system
Does not access journal editorial systems directly

Anti-Hallucination

Never fabricate manuscript content. All cited numbers, methods, and findings must come from the actual manuscript.
Never invent journal scoring criteria. If uncertain about a journal's format, ask the user or use the generic format.
Never generate references from memory. Use /search-lit if citations are needed for reviewer comments.
If a reporting guideline item is uncertain, flag it as [CHECK] rather than asserting compliance.

peer-review

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

peer-review

Popularity

Invocation

Configuration

Context Preview

Supporting Files

SKILL.md

Peer Review Skill

When to Use

Workflow

Phase 1: Setup

Phase 2: Manuscript Analysis

Phase 2F: Recommendation Calibration for AI/Method and Review Papers

Phase 2A: Systematic Review / Meta-Analysis Extension

Phase 2B: Survival / Prognostic Model Extension

Phase 2C: Radiomics / Feature-Reproducibility Extension

Phase 2D: Narrative / Review-Article Extension

Phase 2E: Observational / Confounding Extension

Phase 2G: AI / ML Overclaiming Extension

Phase 2H: RCT / Intervention-Trial Extension

Phase 3: Draft Review

Phase 4: Self-QC

Phase 5: Refinement

Phase 6: Pre-Submission QC

Tone and Calibration

Signature Review Patterns

Journal-Specific Formatting

Custom Journal

Output Contract

Skill Interactions

What This Skill Does NOT Do

Anti-Hallucination

Similar Skills

Peer Review Skill

When to Use

Workflow

Phase 1: Setup

Phase 2: Manuscript Analysis

Phase 2F: Recommendation Calibration for AI/Method and Review Papers

Phase 2A: Systematic Review / Meta-Analysis Extension

Phase 2B: Survival / Prognostic Model Extension

Phase 2C: Radiomics / Feature-Reproducibility Extension

Phase 2D: Narrative / Review-Article Extension

Phase 2E: Observational / Confounding Extension

Phase 2G: AI / ML Overclaiming Extension

Phase 2H: RCT / Intervention-Trial Extension

Phase 3: Draft Review

Phase 4: Self-QC

Phase 5: Refinement

Phase 6: Pre-Submission QC

Tone and Calibration

Signature Review Patterns

Journal-Specific Formatting

Custom Journal

Output Contract

Skill Interactions

What This Skill Does NOT Do

Anti-Hallucination

Similar Skills