From idea-forge
Shared quality framework inherited by ALL builder skills. Defines the Mid-Draft Quality Check, Quality Dimensions scoring, Refine-or-Pivot decision, weakness-weighted review tiers, and optional document evaluator integration. Builder skills reference this instead of duplicating quality logic.
How this skill is triggered — by the user, by Claude, or both
Slash command
/idea-forge:builder-foundationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
This is NOT a standalone skill. It is a shared foundation that all builder skills inherit. It defines quality gates, scoring dimensions, and review patterns that apply to every document type.
This is NOT a standalone skill. It is a shared foundation that all builder skills inherit. It defines quality gates, scoring dimensions, and review patterns that apply to every document type.
All builder skills MUST reference this foundation. When creating a new builder skill, add this line near the top:
> **Foundation:** This skill inherits the quality framework from `skills/builders/builder-foundation/SKILL.md`. All Mid-Draft Quality Checks, Quality Dimensions, and Review Tiers defined there apply here.
skills/builders/skills/presentations/ that generates content (presentation-builder, presentation-planner)Does NOT apply to:
The North Star is the spec. Mid-Draft Quality Check + Completion Gate ARE evals against the spec - they're not separate artifacts, they're one continuous loop. Treat the spec as the highest-leverage input: clarity at the North Star step multiplies through everything downstream.
"Skip-the-spec is skip-the-eval is dark code." When a builder skips the North Star or hand-waves a section, it removes the eval that would have caught the gap before draft. Builders that produce surprisingly bad output usually trace back to surprisingly thin North Stars.
The North Star is the spec; the Source Inventory is the substrate. A clean spec drafted over a contradictory, stale, or duplicate source set still hallucinates structurally - and a plausibly-sourced wrong number can pass the Completion Gate's claim-honesty check because the citation looks fine. Input-side preparation (the Source Inventory + Phase 1.5 Source Reconciliation in north-star-builder - conflict log, missing-context list, duplicates flag) and output-side honesty (the Completion Gate, below) are complementary, not redundant: the inventory prevents the dirty source from entering the draft; the gate catches what slips through. When a builder draws on multiple sources of uncertain authority, confirm the upstream reconciliation ran before trusting the substrate.
After completing the first full draft, BEFORE presenting to the user, every builder skill must run this internal quality gate.
Rate the draft 1-5 on each dimension. Accurate self-assessment is essential - avoiding self-evaluation bias produces the most useful scores.
| Dimension | Score Range | What It Measures |
|---|---|---|
| Depth | 1-5 | Does the content teach something specific, or just describe at surface level? Would a domain expert learn something new? |
| Originality | 1-5 | Does the content offer non-obvious insights, or is it the first thing anyone would say about this topic? Could this have been the top Google result? Compression test (external-facing content): if an AI agent summarized this to three sentences for someone asking "should I trust/use this?", would a provable, differentiated point of view survive - or would it flatten into the category average? (N/A for tutorials/specs/code docs.) |
| Coherence | 1-5 | Does the narrative build logically toward the North Star's stated outcome? Could you remove any section without breaking the argument? |
| Completeness | 1-5 | Are all North Star must-covers addressed with substance (not just mentioned in passing)? |
Scoring guide:
| Score | Meaning |
|---|---|
| 5 | Exceptional - a domain expert would be impressed |
| 4 | Strong - solid content with genuine insights |
| 3 | Adequate - meets expectations but nothing surprising |
| 2 | Weak - surface-level, generic, or missing key details |
| 1 | Poor - does not address this dimension |
Thresholds:
Eval Gold Standard: A scoring criterion is well-written when two independent reviewers reading the same criterion would reach the same pass/fail (or same numerical score) on the same draft. If two reviewers might disagree, the criterion is too vague - sharpen it.
When any quality dimension scores <= 2, or when a dimension scores 3 on a high-stakes document:
This check is internal - do not show the user the intermediate scores or state.
Each builder skill should define skill-specific prompts for each dimension. The foundation provides the framework; the skill provides the specifics.
Example for whitepaper-builder:
Depth: Do deep-dive sections teach specific mechanics, or just describe feature existence?
Originality: Does competitive analysis reveal non-obvious gaps?
Example for blog-builder:
Depth: Do examples use specific details (names, numbers, dates) or generic placeholders?
Originality: Does the hook avoid cliched openings? Does the post offer a unique perspective?
All review checklists in builder skills must be split into two tiers:
These are dimensions where Claude naturally underperforms. Spend most of your review time here.
Universal Critical Items (apply to ALL builder skills):
Skill-specific Critical Items are defined in each builder skill and address weaknesses particular to that document type.
These are dimensions where Claude naturally excels. Quick verification is sufficient.
Universal Standard Items (apply to ALL builder skills):
Skill-specific Standard Items are defined in each builder skill.
The Weakness-Weighted Review Tiers (above) weight review by Claude's skill at the task. This gradient weights by a second, orthogonal axis: the consequence of the claim being wrong. Both apply at once - a claim can be a Claude strong-spot and high-consequence (a cleanly-formatted board number), and it still earns the full gate.
Review burden is not flat. Before the Completion Gate, tag each claim or element:
| Consequence | Examples | Review burden |
|---|---|---|
| Low | Layout, formatting, chart drafting, summary wording, consistency checks | Trust the model; spot-check only |
| Medium | Source attribution, data extraction, paraphrase of a source | Verify the link, not the whole chain |
| High | Numerical synthesis, financial calculations, regulatory/compliance language, any number that travels to a decision-maker (investor, board, senior leadership) | Mandatory human-verifiable trace; never ship un-spot-checked |
One undefendable high-consequence number in an otherwise finished-looking file survives quick review precisely because everything else looks finished - concentrate scarce verification attention where being wrong is expensive.
Calibration guard: the gradient only helps if Low and Medium genuinely mean "lighter touch." If everything gets tagged High, the signal is lost - re-read the Low/Medium examples and demote anything that does not actually drive a decision.
Mid-Draft Quality Check evaluates content quality. The Completion Gate evaluates completion honesty. Both run before presenting to the user.
The premise: Claude's most common document-generation failure is declaring a draft complete while obvious gaps remain - placeholder text, unverified numerical claims, broken cross-references, missing must-cover topics, fabricated facts. These are not quality issues (covered by Mid-Draft Quality Check); they are honesty issues. The cost of catching them after the user reviews is high - it erodes trust in every claim the document makes.
Before declaring v1 complete, every builder must answer each of these. Empty checkmarks are not allowed - every item gets a real disposition (VERIFIED, SKIPPED: <why>, or BLOCKED: <reason>).
TODO, TBD, <...>, Lorem ipsum, [INSERT, XXX, and FIXME. Every hit must be resolved or removed.Section \d, Step \d, Figure \d, Table \d and confirm each has a descriptive name and the referenced content is present.This section applies only when the deliverable is a binary file the agent cannot directly read (PPTX, DOCX, PDF, MP4, PNG-from-render). Skip if the output is text/Markdown/HTML.
10.5. Render and read. Before declaring v1 complete, convert the binary to a readable form and visually verify at least 3 representative samples (e.g., title slide, mid-document slide, conclusion slide for a deck; first/middle/last page for a doc). Read the rendered images via Claude Code's image-reading capability. Verify: diagrams render fully (no \n line breaks, no truncation); text fits its container (no overflow, no autofit collapse); colors match plan (no hex format mismatch).
The premise: when the agent writes code that emits a binary, the agent never sees the rendered output during the build. The user becomes the renderer-of-last-resort, and visible bugs (Mermaid \n, autofit, color formats) only surface after presentation. Rendering one sample mid-build catches these before they ship.
If your skill produces binary output, document the rendering pipeline in the skill's "Visual Self-QA" subsection (e.g., presentation-builder uses python scripts/office/soffice.py --headless --convert-to pdf + pdftoppm -jpeg).
This section applies when the deliverable references external entities readers may want to follow up on - GitHub repos/PRs/issues/releases, Notion pages, web URLs, papers, standards, products. Skip if the deliverable has no external references.
[name](url) is trivial compared to a reader manually searching for the thing you just named. Subsequent in-section mentions may be plain text if context is unambiguous.solo, the local Hiero/Hedera dev network deployment tool"). The expansion source is, in priority order: (a) a Project Glossary in the source profile/spec if one exists, (b) your verified knowledge written at that level of detail. If you cannot confidently expand a term, hedge ("an internal SDK abstraction") or skip the term - parroting jargon you don't understand is worse than verbose plain English. This rule originated in the team-digest sa-digest skill (May 2026 friction: digest entries like "PR #1328 deprecating local-node in favor of solo" forced readers to ask "what's solo?").Surface the gate result as a brief inline summary when presenting v1: "Completion Gate: 11/11 verified" or "Completion Gate: 9/11 verified (2 skipped: )". Do not paste the full checklist - the act of running it is the value, not a wall of paperwork.
Provenance: the 2026-04-06 through 2026-04-29 Lessons Log rows map one-to-one to the gate items above.
For complex or high-stakes documents, builder skills can invoke the [DOC_EVAL] agent (.claude/agents/document-evaluator.md) as an independent quality check. This provides a separate evaluation context, partially mitigating self-evaluation bias.
After the Mid-Draft Quality Check, if the skill wants an independent evaluation:
[DOC_EVAL] agent with:
The document evaluator agent is tuned to be skeptical. It has explicit anti-leniency rules. However, it is still Claude evaluating Claude - this mitigates but does not eliminate self-evaluation bias. For truly independent evaluation, use the cross-agent review skill with external models.
When the user wants the quality cycle to run hands-off (or pairs the build with /goal), wire the evaluator as an iterate-until-pass loop instead of a one-shot check. The grader runs in an independent context window, which outperforms self-critique:
[DOC_EVAL] on the full draft (North Star + draft path + document type).Each round reports: verdict, dimension scores, what changed. The loop's stop condition is the evaluator's verdict - never the drafting context's own judgment. Goal-ready pairing: /goal "the document evaluator sub-agent returns PASS, or 3 evaluation rounds have completed".
For documents expected to exceed 500 lines (whitepapers, tech-arch docs), builder skills should optionally use the sprint contract pattern before drafting:
A brief agreement (written as a comment in the output file or held in working memory) that specifies:
Every builder skill encodes assumptions about what Claude must be told explicitly; they go stale as models improve. Recurring suspects: output chunking caps, explicit Mermaid formatting rules, detailed template section ordering, review-checklist items that always pass. The test-and-remove procedure (ablate one constraint, build, compare) lives in the Skill Health Audit (docs/skill-health-audit.md). Default to the simplest rule set that demonstrably holds.
Quarterly or after a model upgrade, review each builder skill per the Skill Health Audit (rules that never trigger, boilerplate template sections, always-pass checklist items, merge candidates). Document findings in workspace/idea-forge-meta/ using the learn-and-improve skill format.
When processing long content (500+ lines) during any phase of document generation:
These rules apply to all builder skills when processing North Star docs, plans, research briefs, or reference material during document generation.
When using MCP tools that return lists (e.g., search_thoughts, list_thoughts):
capture_thought), collect all items first, then dedup in a single pass using search_thoughts with the most distinctive term from each itemTwo rules apply to every document produced by any builder skill. Both exist because real Catalyst v3 review surfaced violations that confused readers and caused downstream AI tools to produce wrong answers when queried about the document.
Problem (Catalyst v3): the same base pool was named "USDC/catUSD", "USDC/USDT", and "USDC/USDT0" across three sections - and survived into v2 because no single source of truth existed.
Rule: Before drafting any document with 3+ sections, create a Canonical Glossary at the top of the working file (can be removed before final version, OR kept as an appendix). The glossary locks every key noun to one canonical form:
Every mention of a key noun in the document must match the glossary exactly. No variation, no synonyms, no "USDT0" in one paragraph and "USDT" in the next unless they are genuinely different things.
Self-check before declaring the draft complete:
Why beyond aesthetics: inconsistent nouns also break AI-assisted querying of the doc - the tool cannot tell which form is authoritative.
Problem (Catalyst v3): "Catalyst solves for steps 2-4", "Engine 2", "Section 5 and Section 6 explain this" - forces flip-back lookups and goes stale on reorder.
Rule: Never reference another part of the document by number alone. Always use a descriptive name:
| Wrong | Right |
|---|---|
| "Section 5 and 6 explain this" | "the Liquidity Bootstrapping and Revenue Model sections explain this" |
| "Engine 2 handles CDP issuance" | "the CDP Issuance Engine handles this" |
| "Catalyst solves for steps 2-4" | "Catalyst solves for liquidity bootstrapping, market-making, and credit market access" |
| "See Figure 3" | "See the Metapool Architecture diagram" |
If a section or concept is important enough to reference, it's important enough to have a name the reader will remember. Numbers force lookups; names teach.
Exception: Formal numbered references (e.g., "HIP-991", "GENIUS Act Section 3(a)") are fine because those numbers are external canonical identifiers, not internal document structure.
Self-check: Grep the draft for Section \d, Step \d, Engine \d, Figure \d, Table \d. Every hit must have a descriptive name alongside or replacing the number.
Every builder skill should maintain a dated lessons log. When a skill execution surfaces a problem, add a row. This makes skills smarter over time from real usage.
## Lessons Log
| Date | What Happened | What Changed |
|------|--------------|-------------|
| YYYY-MM-DD | <specific failure description> | <rule or process that was updated> |
| Date | What Happened | What Changed |
|---|---|---|
| 2026-04-04 | Skills had static "Lessons Learned" sections with no dates or failure context. Impossible to assess recency or whether lessons were from real failures. | Added this dated Lessons Log format to builder-foundation. All builder skills now inherit it. |
| 2026-04-04 | Self-evaluation bias: builder skills checked their own review checklists and always "passed." No quality scoring or thresholds. | Added Mid-Draft Quality Check with 4-dimension scoring (Depth/Originality/Coherence/Completeness) and Refine-or-Pivot decision. Source: Anthropic harness design article. |
| 2026-04-04 | Review checklists treated all items equally. Claude spent equal time on structure (strong) and depth (weak). | Added weakness-weighted review tiers: 70% effort on Critical (Claude weak spots), 30% on Standard (Claude strong spots). Source: Anthropic harness design article. |
| 2026-04-06 | Catalyst v3 whitepaper: research agents hallucinated dates and inflated numbers. SaucerSwap TVL reported as $140M (actual: $78M), USDT0 launch reported as 2025 (actual: March 2026), AUDD TVL reported as $5.2M (actual: ~$180K), staking yield reported as 6.5% (actual: 2.5%). | Added "Numerical Claim Verification" section to research skill requiring primary source verification for all numbers. Added numerical verification as top Critical Item in whitepaper-builder review checklist. |
| 2026-04-06 | Catalyst v3 Mid-Draft Quality Check: Originality scored 3/5 (borderline) but didn't trigger refinement because threshold was <= 2. A score of 3 means "adequate but not surprising" which is insufficient for investor-facing whitepapers. Also found: HIP-1215 repeated 4-6 times across sections (redundancy), no quantitative financial model despite describing revenue sources. | Added borderline threshold (dimension = 3 recommends refinement for high-stakes docs). Added no-redundancy rule and financial model requirement to whitepaper-builder. Added both to Critical Items checklist. |
| 2026-04-08 | Catalyst v3 whitepaper v2 user review: base pool composition described inconsistently across sections (USDC/catUSD in one section, USDC/USDT in another, USDC/USDT0 in a third). Caused reader confusion and made AI-assisted summaries unreliable because the tool could not identify the canonical form. | Added Rule 1 (Naming Consistency / Canonical Glossary) to builder-foundation. Every builder must lock key nouns before drafting and self-check via grep before declaring complete. |
| 2026-04-08 | Catalyst v3 whitepaper v2 user review: vague cross-references throughout ("Section 5 and 6 explain this", "Engine 2 handles CDP", "Catalyst solves for steps 2-4"). Forced readers to flip between sections and became stale when sections were reordered. | Added Rule 2 (Descriptive References / No Naked Numbers) to builder-foundation. Every builder must use descriptive names for internal references, not section/step/engine numbers. |
| 2026-04-29 | Claude Code usage report (1,020 messages, 39 sessions) showed 38 friction events (20 wrong_approach + 18 buggy_code) tracing to premature completion claims - shipping work that looks done but isn't (stderr leakage missed, percentages exceeding bounds, surface-level tests missing real bugs, commit messages overstating scope, fabricated technical claims like fictional HIP-904 indefinite rent). Mid-Draft Quality Check evaluates content quality but did not catch completion-honesty issues. | Added Completion Gate (mandatory before presenting v1) covering Claim Honesty, Structural Honesty, Scope Honesty, and the "5 more minutes" reflection. Counters the premature-completion failure mode at the document level. |
| 2026-04-29 | The Smart Ape "Opus 4.7 punishes bad prompting" article (Apr 17 2026) flagged that 4.7 follows instructions literally - vague templates that produced acceptable output on 4.6 leak common AI patterns on 4.7 because the model no longer fills in unstated intent. Universal Critical Items listed positives only. | Added "Negative constraints stated explicitly" as a Universal Critical Item. Builder templates must name what the document should NOT include, not just what it should. References a prose-cleanup pass rather than restating the patterns. |
| 2026-06-07 | learn-and-improve audit of the Nate Jones data-room video surfaced that idea-forge's hallucination defenses were entirely output-side (Completion Gate, Fact Bank, numerical verification). A plausibly-sourced wrong number drafted from a contradictory or duplicate source set passes claim-honesty and ships. No input-side source-prep existed. | Added "Source-Side Honesty" to the Spec-Becomes-Eval Discipline, pointing builders at the upstream Source Inventory + Phase 1.5 Source Reconciliation in north-star-builder as the input-side complement to this gate. |
| 2026-06-08 | learn-and-improve audit of the Nate Jones agentic-office-pipeline video: review effort was weighted only by Claude's skill at the task (Weakness-Weighted Tiers), never by the consequence of a claim being wrong. A board-bound valuation number and a layout tweak got the same (flat) scrutiny - "a financial model in a costume" ships when a high-consequence number looks finished. | Added the Task Risk Gradient (Low/Medium/High consequence, orthogonal to the skill-based tiers) and tied it into Completion Gate Section 1: High-consequence numbers need a human-verifiable trace, not just a plausible citation. |
When creating a new builder skill:
The foundation handles the framework; the skill handles the specifics.
Version: 2.7.0 - leanness trim after the model-upgrade ablation review: provenance prose compressed (rules unchanged), Failure-Patterns table folded into a provenance line, Harness-Assumption/Simplification bodies now point at the Skill Health Audit. No behavioral rule added or removed.
2.6.0 - Autonomous Refine Loop (prior baseline).
npx claudepluginhub kpachhai/idea-forge --plugin idea-forgeGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.