From gw
Multi-pass hallucination and factual accuracy checker with mm-ask multi-model consensus as the default path. Verifies citations, external claims, and claim-source alignment using journalism-grade methodology.
Slash command: /gw:fact-check
You are a rigorous fact-checker verifying the factual accuracy of a grant proposal before peer review. You apply **journalism-grade verification methodology** — every claim gets checked, every verification gets documented with evidence, and every source gets rated for credibility. Fabricated content in a grant proposal can end careers.
Multi-model consensus via mm-ask is the DEFAULT verification path. It dispatches fact-checking prompts to 3 external models in parallel (gpt-5.4-high via codex, gemini-3.1-pro via gemini, grok-4-20-thinking via cursor-agent) across 3 separate rate buckets. Each model brings its own grounding capabilities, and Claude synthesizes the 3-way output looking for consensus (≥2 of 3 reviewers agreeing = high confidence) and single-reviewer edge catches. Only when no provider CLIs are installed does this skill fall back to sequential Claude + WebSearch + CrossRef/Semantic Scholar.
- --proposal-dir <path>: Proposal directory (required)
- --no-multi-model: Force single-Claude mode even if provider CLIs are installed

Parse from the user's message.
| Rating | Meaning | Action |
|---|---|---|
| VERIFIED | Confirmed by authoritative primary source | None |
| MOSTLY ACCURATE | Substantially correct, minor imprecision | Fix imprecision |
| MISLEADING | Contains truth but lacks context or exaggerates | Qualify or contextualize |
| INACCURATE | Contradicted by evidence | Must correct |
| FABRICATED | No evidence exists anywhere | Must remove |
| UNVERIFIABLE | Cannot confirm or deny with available tools | Flag for PI |
| Tier | Type | Examples | Weight |
|---|---|---|---|
| T1 | Primary official | Eurostat, WHO, NIH Reporter, EU Official Journal | Highest |
| T2 | Primary academic | PubMed, Semantic Scholar with DOI, CrossRef | High |
| T3 | Institutional | Official agency websites, university pages | Medium-high |
| T4 | Quality secondary | Reuters, Nature News, Wikipedia w/ citations | Medium |
| T5 | Unvetted secondary | Blogs, social media, non-peer-reviewed preprints | Low |
Rule: A FABRICATED or INACCURATE rating requires at least one T1-T2 source contradicting the claim. Don't downgrade on T5 evidence alone.
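The rule above can be written as a small guard. This is a minimal sketch, not part of the skill: `enforce_tier_rule` and its input shape are hypothetical, and falling back to UNVERIFIABLE is an assumption, since the rule only says when a negative rating is not allowed.

```python
STRONG_TIERS = {"T1", "T2"}
NEGATIVE_RATINGS = {"FABRICATED", "INACCURATE"}


def enforce_tier_rule(rating, contradicting_tiers):
    """Return the rating to record, given the tiers of the contradicting sources."""
    if rating not in NEGATIVE_RATINGS:
        return rating  # the rule only constrains FABRICATED / INACCURATE
    if STRONG_TIERS & set(contradicting_tiers):
        return rating  # at least one T1-T2 source contradicts the claim
    # Assumption: without T1-T2 contradiction, hand the claim to the PI
    # as UNVERIFIABLE instead of recording a negative rating.
    return "UNVERIFIABLE"
```

A claim contradicted only by a blog post (T5) would therefore stay out of the blocking categories until a primary source is found.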
PROP="$proposal_dir"
mkdir -p "$PROP/review"
Read:
- "$PROP/final/proposal.md": assembled proposal
- "$PROP/sections/bibliography.md": reference list
- "$PROP/config.yaml": agency and multi_model settings
- "$PROP/budget/budget.md": for budget cross-checks
- "$PROP/sections/*.md": individual sections

MM_AVAILABLE=0
if uv run mm-detect --json 2>/dev/null | python3 -c "
import json, sys
d = json.load(sys.stdin)
sys.exit(0 if any(v.get('installed') for v in d.get('providers', {}).values()) else 1)
"; then
MM_AVAILABLE=1
fi
When --no-multi-model is passed, force MM_AVAILABLE=0.
Scan the entire proposal and extract every factual claim into a structured log. A "verifiable claim" is any statement asserting something about the real world that could be true or false.
Claim categories to extract:
| Category | Examples | Verification path |
|---|---|---|
| Citations | "[1] Smith et al. 2024..." | mm-ask → S2/CrossRef/DOI |
| Named entities | "Company X", "Prof. Y at University Z" | mm-ask → WebSearch/OpenCorporates |
| Statistics | "Market worth EUR 3B", "affects 500M people" | mm-ask → WHO/Eurostat |
| Performance claims | "Current methods achieve 85% accuracy" | mm-ask → benchmark papers |
| State-of-the-art claims | "No existing approach combines X and Y" | mm-ask → comprehensive search |
| Historical claims | "Since the discovery of X in 2015..." | mm-ask |
| Regulatory claims | "EU regulation 2024/XXX requires..." | mm-ask → EUR-Lex |
| Epidemiological | "Diabetes affects 10% of Europeans" | mm-ask → WHO/Eurostat |
| Causal claims | "X has been shown to cause Y" | mm-ask → source alignment |
| Budget claims | "Total budget EUR 2.5M" | Cross-check vs budget.md |
| Timeline claims | "Completed in 36 months" | Cross-check vs work_plan.md |
| Internal consistency | "5 work packages" | Cross-check across sections |
Don't check (opinions, not facts):
Save the extracted claims to <proposal_dir>/review/claims_log.json:
[
{
"id": 1,
"text": "The global AI drug discovery market is projected to reach EUR 4B by 2028",
"category": "statistics",
"section": "excellence",
"citation": null,
"priority": "high"
}
]
Prioritize: High (central claim, easily checkable, high consequence), Medium (supporting detail), Low (peripheral).
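As an illustration of how the priority field might drive the later passes, the sketch below orders extracted claims so that high-priority ones are verified first. The helper name `order_claims` is hypothetical, and the lowercase values "medium" and "low" are assumed by analogy with the "high" shown in the example entry.

```python
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def order_claims(claims):
    """Sort claims by priority (high first), then by id for a stable order."""
    return sorted(
        claims,
        key=lambda c: (
            # Unknown or missing priorities sort after "low".
            PRIORITY_ORDER.get(c.get("priority"), len(PRIORITY_ORDER)),
            c["id"],
        ),
    )
```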
Parse bibliography.md. For each entry extract: title, authors, year, venue, DOI.
If MM_AVAILABLE=1 (default path):
Compose a citation verification prompt and dispatch to 3 external models in parallel:
cat > /tmp/gw_fc_citations.txt <<EOF
Verify these grant proposal citations. For EACH entry, check:
(1) the paper exists with the given title and authors
(2) the venue and year are correct
(3) the DOI (if present) resolves to the same paper
Use CrossRef, Semantic Scholar, PubMed, and Google Scholar — whichever
sources you have access to. Be strict: if you cannot find corroborating
evidence for a paper, rate it FABRICATED or UNVERIFIABLE (do not guess).
Return a JSON array, one entry per citation:
[
{
"ref_id": "[1]",
"verified": true/false,
"metadata_match": "exact" | "approximate" | "mismatch" | "not_found",
"source_urls": ["https://..."],
"rating": "VERIFIED" | "MOSTLY_ACCURATE" | "SUSPICIOUS" | "FABRICATED" | "UNVERIFIABLE",
"notes": "..."
}
]
Return ONLY the JSON array. No prose. No markdown fence.
--- CITATIONS ---
$(cat "$PROP/sections/bibliography.md")
EOF
uv run mm-ask \
--models gpt-5.4-high,gemini-3.1-pro,grok-4-20-thinking \
--prompt-file /tmp/gw_fc_citations.txt \
--output "$PROP/review/pass1_citations.json" \
--timeout 600 \
--verbose
Merge the 3 reviewers' outputs: a citation rated FABRICATED by ≥2 of 3 reviewers is high-confidence fabricated; any citation on which one reviewer disagrees with the other two is flagged for manual review.
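The merge step can be sketched as a vote count per citation. This is an illustration under stated assumptions, not the skill's actual merge code: `merge_ratings` is a hypothetical helper, each reviewer contributes one rating string, and a reviewer with no usable output is passed as `None`.

```python
from collections import Counter


def merge_ratings(ratings, quorum=2):
    """Combine per-reviewer ratings for one citation into a consensus record."""
    votes = Counter(r for r in ratings if r is not None)
    if not votes:
        return {"rating": None, "agreeing": 0, "dissent": [], "manual_review": True}
    top, count = votes.most_common(1)[0]
    has_quorum = count >= quorum
    return {
        # Only a rating backed by the quorum counts as high confidence.
        "rating": top if has_quorum else None,
        "agreeing": count if has_quorum else 0,
        # Ratings outside the consensus are kept so single-reviewer catches stay visible.
        "dissent": sorted(r for r in votes if not (has_quorum and r == top)),
        # Any disagreement, or no quorum at all, goes to manual review.
        "manual_review": (not has_quorum) or len(votes) > 1,
    }
```

With ratings `["FABRICATED", "FABRICATED", "VERIFIED"]` this yields a FABRICATED consensus with two agreeing reviewers, and the lone VERIFIED vote is kept as dissent for manual review.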
If MM_AVAILABLE=0, fall back:
- curl -sI "https://doi.org/<DOI>" → 302 = valid, 404 = fabricated
- CrossRef (mcp__crossref__searchByTitle) and Semantic Scholar (mcp__semantic-scholar__search_papers)

Rate each reference: VERIFIED / MOSTLY_ACCURATE / SUSPICIOUS / FABRICATED / UNVERIFIABLE.
If MM_AVAILABLE=1:
cat > /tmp/gw_fc_facts.txt <<EOF
You are a journalism-trained fact-checker verifying claims in a grant
proposal. For EACH claim listed below:
1. Find the PRIMARY source (government data, peer-reviewed paper,
official registry — NOT a blog or news retelling).
2. Rate source credibility T1-T5:
T1 = primary official (Eurostat, WHO, NIH, EU Official Journal)
T2 = primary academic (peer-reviewed + DOI-verified)
T3 = institutional (official org website)
T4 = quality secondary (major news outlet)
T5 = unvetted (blog, social media, preprint)
3. Compare the claim to the source.
4. Rate the claim: VERIFIED | MOSTLY_ACCURATE | MISLEADING | INACCURATE | FABRICATED | UNVERIFIABLE
5. Be strict — reviewers check citations, and a fabricated fact can end a career.
Return a JSON array, one entry per claim:
[
{
"claim_id": 1,
"claim_text": "...",
"sources_checked": [
{"name": "WHO Global Health Observatory", "url": "https://...", "tier": "T1", "finding": "Actual figure is 9.2%"}
],
"rating": "INACCURATE",
"evidence_summary": "Proposal says 6%, WHO says 9.2% (2024 data)",
"severity": "WARNING",
"suggested_fix": "Update to 9.2% and cite WHO 2024"
}
]
Return ONLY the JSON array. No prose. No markdown fence.
--- CLAIMS ---
$(python3 -c "
import json
claims = json.load(open('$PROP/review/claims_log.json'))
hi = [c for c in claims if c['category'] != 'citations' and c.get('priority') == 'high']
print(json.dumps(hi, indent=2))
")
EOF
uv run mm-ask \
--models gpt-5.4-high,gemini-3.1-pro,grok-4-20-thinking \
--prompt-file /tmp/gw_fc_facts.txt \
--output "$PROP/review/pass2_facts.json" \
--timeout 600 \
--verbose
Merge with the same consensus logic as Pass 1. Then classify each flagged issue via gw-classify:
echo "<issue description>" | uv run gw-classify classify
This returns the category (e.g. FACT_OUTDATED, CITATION_HALLUCINATED) and a recommendation.
If MM_AVAILABLE=0, fall back to 3 parallel Claude Agent subagents each handling a batch of claims and using WebSearch + MCP database tools. Output shape is the same JSON.
For each claim that cites a specific reference, verify the cited source actually supports the claim.
If MM_AVAILABLE=1:
First, fetch abstracts for the cited references from Pass 1's verified set (using mcp__semantic-scholar__get_paper or mcp__crossref__getWorkByDOI). Then dispatch:
cat > /tmp/gw_fc_alignment.txt <<EOF
For each cited claim below, check whether the cited paper actually
supports it.
Rate alignment on this scale:
ALIGNED — claim accurately reflects the cited source
MOSTLY_ALIGNED — substantially correct, minor imprecision (rounded number, slightly different wording)
EXAGGERATED — paper shows modest effect, proposal claims strong effect
MISATTRIBUTED — paper is about something else entirely
UNVERIFIABLE — cannot confirm from abstract alone
Return a JSON array:
[
{
"claim_id": N,
"claim_text": "...",
"cited_ref": "[3]",
"abstract_says": "...",
"rating": "EXAGGERATED",
"evidence": "Paper says 89% accuracy, proposal says 95%, a 6-percentage-point inflation",
"severity": "WARNING",
"suggested_fix": "Correct to 89% accuracy"
}
]
Return ONLY the JSON array. No prose. No markdown fence.
--- CITED CLAIMS ---
<paste cited claims from claims_log.json>
--- BIBLIOGRAPHY ABSTRACTS ---
<paste abstracts fetched from S2/CrossRef>
EOF
uv run mm-ask \
--models gpt-5.4-high,gemini-3.1-pro,grok-4-20-thinking \
--prompt-file /tmp/gw_fc_alignment.txt \
--output "$PROP/review/pass3_alignment.json" \
--timeout 600 \
--verbose
If MM_AVAILABLE=0, run 3 parallel Claude Agent subagents with the same task.
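The cited-claims placeholder in the alignment prompt has to be filled from claims_log.json. A minimal sketch of that selection follows. The helper names are hypothetical, and it assumes the `citation` field holds a reference id such as "[3]" when a claim cites a source and `null` otherwise, as in the example entry.

```python
import json


def cited_claims(claims):
    """Keep only the claims that point at a bibliography entry."""
    return [c for c in claims if c.get("citation")]


def render_cited_claims(claims):
    """Serialize the cited claims for pasting into the prompt file."""
    return json.dumps(cited_claims(claims), indent=2)
```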
Final sweep without external tools — Claude does this inline:
- ![...]() points to a file

For each inconsistency found, classify via gw-classify:
echo "Budget lists 2 postdocs but methodology describes 3 PhD students" | uv run gw-classify classify
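One check in this sweep, that each `![...]()` image reference points to a file, is mechanical enough to script. The sketch below is an illustration only: `missing_images` is a hypothetical helper, remote URLs are skipped by assumption, and the regular expression covers plain inline image syntax, not reference-style images.

```python
import re
from pathlib import Path

# Captures the target of ![alt](target) up to the first space or ")".
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def missing_images(markdown_text, base_dir):
    """Return image targets that do not exist relative to base_dir."""
    missing = []
    for target in IMAGE_REF.findall(markdown_text):
        if target.startswith(("http://", "https://")):
            continue  # assumption: remote images are out of scope for this check
        if not (Path(base_dir) / target).is_file():
            missing.append(target)
    return missing
```

Each returned path could then be phrased as an issue description and piped to gw-classify in the same way as the personnel example.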
Write <proposal_dir>/review/claim_verification.json:
{
"checked_at": "<ISO timestamp>",
"multi_model_used": true,
"models_dispatched": ["gpt-5.4-high", "gemini-3.1-pro", "grok-4-20-thinking"],
"total_claims_checked": 47,
"citations": [
{
"ref_id": "[1]",
"claimed": {"title": "...", "authors": "...", "year": 2024, "doi": "10.1234/..."},
"verification": {
"method": "mm_ask_3model + DOI_check",
"found_in": ["Semantic Scholar", "CrossRef"],
"doi_resolves": true,
"metadata_match": "exact",
"reviewers_consensus": 3,
"source_urls": ["https://api.semanticscholar.org/..."]
},
"rating": "VERIFIED",
"source_tier": "T2"
}
],
"factual_claims": [
{
"claim_id": 1,
"text": "...",
"category": "statistics",
"rating": "INACCURATE",
"reviewers_flagging": 3,
"sources_checked": [
{"name": "WHO", "url": "...", "tier": "T1", "finding": "Actual figure is 9.2%"}
],
"error_category": "FACT_OUTDATED",
"evidence_summary": "Proposal says 6%, WHO says 9.2% (2024)",
"severity": "WARNING",
"suggested_fix": "Update to 9.2% and cite WHO 2024"
}
],
"claim_source_alignment": [],
"internal_consistency": [
{
"issue": "Budget lists 2 postdocs, methodology describes 3 PhD students",
"rating": "INACCURATE",
"error_category": "INTERNAL_INCONSISTENCY",
"suggested_fix": "Align budget personnel with described research activities"
}
]
}
Also save per-section checkpoints via gw-state so downstream revision skills can pick up the verification results:
cat "$PROP/review/claim_verification.json" | uv run gw-state save-checkpoint "$PROP" fact_check global verification
Run the error classifier against the fact_check checkpoints:
uv run gw-classify analyze "$PROP" fact_check
This returns the dominant error category and a recommendation. Use it to decide:
- CITATION_HALLUCINATED → the literature phase was weak, consider re-running /gw:literature
- INTERNAL_INCONSISTENCY → route to /gw:revision for a cross-section sweep
- FACT_OUTDATED → route to a targeted fact update round

Rescue-on-stuck (if multi_model.rescue_on_stuck: true in config AND MM_AVAILABLE=1): When the error analysis shows should_escalate=true (≥50% of issues are the same category AND ≥3 total issues), the same error keeps recurring and single-model fixes likely won't break the pattern. Dispatch the error context to mm-council for a multi-model diagnosis:
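ANALYSIS=$(uv run gw-classify analyze "$PROP" fact_check)
# Pass the JSON on stdin: interpolating it into the Python source breaks on quotes and newlines.
SHOULD_RESCUE=$(printf '%s' "$ANALYSIS" | python3 -c "
import json, sys
a = json.load(sys.stdin)
print('yes' if a.get('dominant_pct', 0) >= 0.5 and a.get('total_issues', 0) >= 3 else 'no')
")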
if [ "$SHOULD_RESCUE" = "yes" ] && [ "$MM_AVAILABLE" = "1" ]; then
cat > /tmp/gw_fc_rescue.txt <<EOF
The fact-check phase of a grant proposal keeps hitting the same error
pattern. Error analysis:
$(echo "$ANALYSIS" | python3 -m json.tool)
Dominant error: $(echo "$ANALYSIS" | python3 -c "import json,sys; print(json.load(sys.stdin).get('dominant_error','unknown'))")
Recommendation: $(echo "$ANALYSIS" | python3 -c "import json,sys; print(json.load(sys.stdin).get('recommendation',''))")
Diagnose the root cause. Is this a systematic issue in how the proposal
was written, a literature-phase failure, or a false-positive pattern in
our classifier? Propose a concrete fix strategy — which gw phase should
be re-run and with what changes?
Return a JSON object:
{"root_cause": "...", "fix_strategy": "re-run /gw:literature with broader queries", "affected_phase": "literature|proposal_writing|...", "confidence": 1-5}
EOF
uv run mm-council \
--panel gpt-5.4-high,gemini-3.1-pro,grok-4-20-thinking \
--chairman gpt-5.4-high \
--prompt-file /tmp/gw_fc_rescue.txt \
--output "$PROP/review/rescue_diagnosis.json" \
--timeout 300 \
--verbose
echo "Rescue diagnosis saved to review/rescue_diagnosis.json"
fi
Read the rescue_diagnosis.json — the chairman's synthesis tells you which
phase to re-run. Surface this to the PI before acting on it.
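The escalation threshold used for rescue (at least 50% of issues in one category and at least 3 issues in total) can be stated as a short function. This is a sketch of the stated rule, not gw-classify's implementation: `should_escalate` here is a hypothetical helper that takes a flat list of error-category strings.

```python
from collections import Counter


def should_escalate(categories, min_share=0.5, min_total=3):
    """True when one error category dominates a large enough set of issues."""
    total = len(categories)
    if total < min_total:
        return False  # too few issues to call it a pattern
    _, top_count = Counter(categories).most_common(1)[0]
    return top_count / total >= min_share
```

Two FACT_OUTDATED issues out of three would trigger escalation, while four issues spread over four categories would not.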
Compile all findings into <proposal_dir>/review/fact_check.md with these sections:
- claim_verification.json
- verified_by: "mm_ask_3model" or verified_by: "claude_alone"

| Situation | Action |
|---|---|
| Any FABRICATED or INACCURATE claims with ≥2-of-3 reviewer consensus | BLOCK review phase. Must fix before proceeding. |
| Only MISLEADING / MOSTLY_ACCURATE | Proceed to review. Flag prominently. PI should fix. |
| All VERIFIED | Proceed to review. Clean proposal. |
Human checkpoint: Present the report. For each critical issue, show the claim, the evidence, the reviewer consensus count, and the suggested fix. Ask the PI: fix it, override with justification, or investigate further.
if [ "$CRITICAL_ISSUES" -gt 0 ]; then
uv run gw-state update "$PROP" --phase fact_check --status in_progress
else
uv run gw-state update "$PROP" --phase fact_check --status complete
fi
If failed (critical issues present), /gw routes back to the offending phase (/gw:literature for citation fabrication, /gw:revision for inconsistency, etc.), then re-assembles and re-checks.
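The gate above reads a `CRITICAL_ISSUES` count whose computation is not shown. One way it might be derived from claim_verification.json is sketched below. `count_critical` is a hypothetical helper, and the field names follow the example report (`reviewers_consensus` for citations, `reviewers_flagging` for factual claims). Entries without a reviewer count, such as inline internal-consistency findings, are left out by assumption.

```python
CRITICAL = {"FABRICATED", "INACCURATE"}


def count_critical(report, quorum=2):
    """Count FABRICATED / INACCURATE findings backed by at least `quorum` reviewers."""
    count = 0
    for citation in report.get("citations", []):
        agreeing = citation.get("verification", {}).get("reviewers_consensus", 0)
        if citation.get("rating") in CRITICAL and agreeing >= quorum:
            count += 1
    for claim in report.get("factual_claims", []):
        if claim.get("rating") in CRITICAL and claim.get("reviewers_flagging", 0) >= quorum:
            count += 1
    return count
```

In single-Claude mode there are no reviewer counts, so a caller would need a different rule there, for example passing `quorum=0`; that choice is outside what the source specifies.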
- multi_model_used: false in claim_verification.json.
- /gw:literature with mm-ask to regenerate a real bibliography, then re-run fact-check.
- mm-ask returns a Response with exit_code=124 for that model. The other reviewers still complete. Proceed with the successful workers; flag the partial dispatch in the report.
- claim_verification.json with pi_override: true and a justification. Future review skills should surface overrides so they don't get silently lost.
- uv run gw-state validate-resume "$PROP" and re-run fact-check: this pass is idempotent.
- parse_failed and exclude from consensus. The remaining reviewers still vote.

npx claudepluginhub stmailabs/gw --plugin gw