Skill

grade-spec

Grade a completed OpenSpec change with the Spec Quality Index (SQI): a 0–100 score plus a letter grade measuring how cheaply the spec was implemented (escalations, rework, gaps, corrections), NOT whether the feature or code is good. Use when the user wants to grade, score, or measure the quality of an OpenSpec change/spec after it was implemented; asks "how good was that spec", "run the SQI", "grade add-foo", "what's the spec quality index", or invokes /grade-spec on a named change; or has just finished applying/archiving an OpenSpec change and wants its spec→impl fitness measured. Reads the spec, the implementation diff, and the OpenSpec journal (run log), classifies every divergence, and emits the machine-comparable SQI JSON plus write-back recommendations.

Slash command

/grade-openspec:grade-spec

Supporting Files

references/spec-quality-index.md
references/sqi-report.md
scripts/sqi.py

SKILL.md

188 lines · ~2.6k tokens

Stats

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: Jun 13, 2026

Grade Spec (Spec Quality Index)

Measure how good an OpenSpec change was as an implementation contract, judged by how cheaply it was implemented. Output is the SQI JSON from §10 of the methodology: two sub-scores (Implementability I, Fidelity F), a blended SQI 0–100, a letter grade, a confidence level, and concrete write-back recommendations.

Canonical methodology: docs/kpi/spec-quality-index.md. This skill operationalizes it. Read that doc when you need calibration rationale (§8), anti-gaming pitfalls (§11), grade-band interpretation (§6), or the full worked example (§7). The per-run essentials are inline below.

SQI measures the spec's fitness, not whether the feature is good or the code is good. Keep that framing in the output.

Workflow

1. Resolve the change → its directory + impl diff + journal
2. Count requirements R          (deterministic — script)
3. Mine the journal for I-signals (deterministic — script, then classify)
4. Read spec + diff, enumerate every divergence, classify each S0..S7 (judgment)
5. Score                          (deterministic — script)
6. Persist: write `<change-dir>/sqi.md`, append `./sqi.jsonl`, then present

1. Gather the three inputs

Per methodology §3. Missing inputs lower confidence (§9), not the score.

| Input | Where | Used for |
|---|---|---|
| Spec change | `openspec/changes/<name>/{proposal,design,specs/**}.md`, `tasks.md` | count `R`, know what spec claimed |
| Impl diff | `git diff` of the apply (or, if archived, the diff that landed it) | see what was actually built |
| Run log | `openspec/changes/<name>/journal.jsonl` (archived: `openspec/changes/archive/<date>-<name>/journal.jsonl`) | score Implementability |

If the change is archived, look under openspec/changes/archive/. If no journal exists, Implementability is unobserved — set "run_log_present": false in the scorer input; it reports Fidelity-only at partial confidence (§9). Never treat an absent run log as a perfect Implementability score.
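The active-then-archive lookup described above can be sketched as follows. This is a minimal illustration, not code from `scripts/sqi.py`: the helper names `resolve_change_dir` and `journal_path` are hypothetical, and the archive match assumes directory names of the form `<date>-<name>` as shown in the table.

```python
from pathlib import Path
from typing import Optional


def resolve_change_dir(repo_root: Path, name: str) -> Optional[Path]:
    """Return the change directory for `name`, or None if it cannot be found.

    Hypothetical helper: checks the active location first, then the archive,
    where directories are assumed to be named <date>-<name>.
    """
    changes = repo_root / "openspec" / "changes"
    active = changes / name
    if active.is_dir():
        return active
    archive = changes / "archive"
    if not archive.is_dir():
        return None
    # Limitation: "*-<name>" also matches a different change whose name merely
    # ends in "-<name>", so check the result by eye.
    matches = sorted(p for p in archive.glob(f"*-{name}") if p.is_dir())
    return matches[-1] if matches else None


def journal_path(change_dir: Path) -> Optional[Path]:
    """Return the journal path if it exists; None means Implementability is unobserved."""
    path = change_dir / "journal.jsonl"
    return path if path.is_file() else None
```

When `journal_path` returns `None`, the grader would set `"run_log_present": false` in the scorer input rather than assume a clean run.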

2. Count requirements

python3 skills/grade-spec/scripts/sqi.py count-reqs openspec/changes/<name>/specs

Counts ### Requirement: headers across specs/**.md. R normalizes the score so large and small specs are comparable — a 30-requirement spec is allowed more absolute divergence than a 3-requirement one. If specs live elsewhere, point the command at the right directory; cross-check the count by eye.
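For the by-eye cross-check, a rough equivalent of the count can be sketched in a few lines. This is an assumption about how such a count might work, not the script's actual code; `count_requirement_headers` is a hypothetical name, and the sketch would also count a header that happens to sit inside a fenced code block.

```python
from pathlib import Path
from typing import Dict


def count_requirement_headers(specs_dir: Path) -> Dict[str, int]:
    """Count lines starting with '### Requirement:' in each .md file under specs_dir."""
    counts: Dict[str, int] = {}
    for md_file in sorted(specs_dir.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        n = sum(1 for line in text.splitlines() if line.startswith("### Requirement:"))
        # Key each count by the file's path relative to the specs directory.
        counts[str(md_file.relative_to(specs_dir))] = n
    return counts
```

`R` is then `sum(count_requirement_headers(specs_dir).values())`; if it disagrees with the script's total, trust neither until the spec files have been read.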

3. Mine the journal for Implementability signals

python3 skills/grade-spec/scripts/sqi.py scan-journal openspec/changes/<name>/journal.jsonl

This surfaces candidates — you still classify them:

| Journal signal | Candidate class | Note |
|---|---|---|
| `verifier.result note=fail` | S5 rework | counted per extra round beyond the first fail per task ref (the script does this in `extra_rounds`) |
| `task.blocked` | S7 blocker | only if a spec defect halted impl until the spec changed; a block on an external dep is not an S7 |
| `decision` | S1 / S3 / S4 | read `input`/`output`: was it a deferred TBD (S1), a silent gap filled (S3), or a wrong assertion fixed (S4)? |
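The "extra round beyond the first fail per task ref" rule can be illustrated with a short sketch. The journal entry shape used here is an assumption made for illustration (an `event` or `type` key, plus `note` and `ref` fields); the real OpenSpec journal schema and the script's own `extra_rounds` logic may differ.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def extra_rework_rounds(lines: Iterable[str]) -> Dict[str, int]:
    """Count verifier fails beyond the first, per task ref.

    Assumed (hypothetical) entry shape:
    {"event": "verifier.result", "note": "fail", "ref": "<task id>"}
    with "type" accepted in place of "event".
    """
    fails: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines instead of aborting the scan
        if not isinstance(entry, dict):
            continue
        kind = entry.get("event") or entry.get("type")
        if kind == "verifier.result" and entry.get("note") == "fail":
            fails[str(entry.get("ref", "?"))] += 1
    # The first fail per ref is free; only the rounds after it are S5 candidates.
    return {ref: n - 1 for ref, n in fails.items() if n > 1}
```

A task that failed verification three times would yield two S5 candidates under this reading; a task that failed once yields none.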

Escalation (S6) is not journaled. Read the run transcript / handoff notes for moments where impl had to ask a stakeholder or the spec author to proceed. S6 is the cost the KPI most wants to penalize — do not miss it.

4. Enumerate and classify divergences

Read spec + diff side by side. Every place impl deviated from, resolved, or corrected the spec is a divergence. Assign exactly one class. When two classes fit, take the higher-cost one and count it once (an S1 that's also an S2 is one divergence, classified at its highest cost).

| Class | Name | Definition | `w` |
|---|---|---|---|
| S0 | Clean | Implemented as written; no divergence. | 0 |
| S1 | Resolved-defer | Spec explicitly deferred this (open Q / TBD / "confirm at apply"); impl resolved it via introspection, no human. Spec did its job by naming the unknown. | 0.5 |
| S2 | Write-back | Impl learned a concrete fact the spec should now record; non-blocking, no human. (Often the same divergence as an S1; don't double-count.) | 0.5 |
| S3 | Gap | Spec was silent on something impl needed (tooling flag, env detail, teardown, unstated infra assumption). Impl decided alone. | 1 |
| S4 | Correction | Spec asserted something impl found wrong or oversimplified. Fixed without a human. | 2 |
| S5 | Rework | Ambiguity/under-spec caused a verifier FAIL→fix cycle. Count per extra round beyond the first. | 3 each |
| S6 | Escalation | Impl had to ask a stakeholder / spec author to proceed. | 6 each |
| S7 | Blocker | A spec defect halted impl until the spec itself was changed. | 10 each |
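The "take the higher-cost class and count it once" rule amounts to picking the maximum weight among the classes that fit. A minimal sketch, with the weights copied from the table and a hypothetical helper name `pick_class`:

```python
# Per-divergence weights w, copied from the class table.
WEIGHTS = {"S0": 0, "S1": 0.5, "S2": 0.5, "S3": 1, "S4": 2, "S5": 3, "S6": 6, "S7": 10}


def pick_class(candidates):
    """Return the single class to record when several fit one divergence."""
    if not candidates:
        return "S0"  # nothing diverged: implemented as written
    # max() keeps the first candidate on a tie, e.g. S1 vs S2 at w = 0.5,
    # so the divergence is still recorded exactly once.
    return max(candidates, key=WEIGHTS.__getitem__)
```

For example, `pick_class(["S3", "S4"])` records the divergence as S4, and `pick_class(["S1", "S2"])` records a single S1 rather than two half-point entries.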

Decision rules (apply top-down; first match wins):

1. Impl had to ask a human? → S6 (or S7 if it also halted).
2. Verifier failed and forced a fix loop? → S5 per extra round.
3. Spec said something that turned out wrong? → S4.
4. Spec was simply silent and impl filled it in? → S3.
5. Spec explicitly flagged it TBD and impl resolved it? → S1.
6. None of the above (built as written)? → S0.
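Because the rules are first-match-wins, they map naturally onto an ordered chain of checks. The sketch below is illustrative only: the `Observation` fields are hypothetical yes/no facts a grader would fill in by judgment, and S2 never appears because the decision rules as written do not produce it.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical yes/no facts a grader records about one divergence."""
    asked_human: bool = False
    halted_until_spec_changed: bool = False
    verifier_fix_loop: bool = False
    spec_was_wrong: bool = False
    spec_was_silent: bool = False
    spec_flagged_tbd: bool = False


def classify(obs: Observation) -> str:
    """Apply the decision rules top-down; the first match wins."""
    if obs.asked_human:
        return "S7" if obs.halted_until_spec_changed else "S6"
    if obs.verifier_fix_loop:
        return "S5"  # counted once per extra round by the caller
    if obs.spec_was_wrong:
        return "S4"
    if obs.spec_was_silent:
        return "S3"
    if obs.spec_flagged_tbd:
        return "S1"
    return "S0"
```

The ordering matters: a divergence where the spec was wrong and impl also had to ask someone is an S6, not an S4.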

Anti-gaming (§11): don't reward silence — a spec that says nothing dodges S4 but racks up S3 and (on wrong guesses) S5/S6. Actively hunt the diff and run log for undocumented "I decided X myself" moments; a strong implementer who silently absorbs ambiguity will hide S3/S4 unless you look. Vagueness is penalized.

5. Score

Write a JSON file with requirements, the classified divergences, optional change/graded_at/writeback_recommendations, and run_log_present (false if no journal). Then:

python3 skills/grade-spec/scripts/sqi.py score /path/to/divergences.json

The script applies §5 exactly: score = 100 × (1 − min(1, penalty / (k·R))) with k=4, blend=0.5. Implementability uses {S5,S6,S7}; Fidelity uses {S1,S2,S3,S4}. Do not hand-compute — let the script do the arithmetic. Override defaults only with explicit reason: --k <n> (lower = stricter) or --blend <0..1> (toward 1.0 weights "buildable without help" over "factually complete").
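To make the formula concrete for sanity-checking the script's output (not for replacing it), here is a sketch of the arithmetic. The sub-score formula and class sets come from the text above; the blend step, `SQI = blend * I + (1 - blend) * F`, is an assumption inferred from the description of `--blend`, and the function names are hypothetical.

```python
WEIGHTS = {"S1": 0.5, "S2": 0.5, "S3": 1, "S4": 2, "S5": 3, "S6": 6, "S7": 10}
IMPLEMENTABILITY = {"S5", "S6", "S7"}
FIDELITY = {"S1", "S2", "S3", "S4"}


def sub_score(counts, classes, requirements, k=4.0):
    """100 * (1 - min(1, penalty / (k * R))), restricted to the given classes."""
    if requirements <= 0:
        raise ValueError("requirements (R) must be positive")
    # counts maps class name to number of occurrences; other classes are ignored.
    penalty = sum(WEIGHTS[c] * n for c, n in counts.items() if c in classes)
    return 100.0 * (1.0 - min(1.0, penalty / (k * requirements)))


def sqi(counts, requirements, k=4.0, blend=0.5):
    """Return (I, F, SQI). The blend formula is an assumption, not confirmed by the source."""
    i_score = sub_score(counts, IMPLEMENTABILITY, requirements, k)
    f_score = sub_score(counts, FIDELITY, requirements, k)
    return i_score, f_score, blend * i_score + (1.0 - blend) * f_score
```

With R = 10 and one S1, two S3, one S4 and one S5, the Fidelity penalty is 4.5 and the Implementability penalty is 3 against a budget of k·R = 40, so this sketch gives F of about 88.75, I of about 92.5 and a blended SQI of about 90.6. If the script reports something materially different, check the input JSON before suspecting the script.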

Grade bands: A ≥90, B ≥80, C ≥70, D ≥60, F <60. A partial run (no run log) gets no letter — the scorer sets grade: null; report it as a Fidelity-only SQI number.
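The band lookup, including the withheld letter on partial runs, can be sketched as below. `letter_grade` is a hypothetical name used for illustration; the thresholds are the ones listed above.

```python
from typing import Optional


def letter_grade(sqi_value: float, run_log_present: bool = True) -> Optional[str]:
    """Map an SQI to a letter A-F, or None when the run is Fidelity-only."""
    if not run_log_present:
        return None  # partial run: the letter is withheld
    for floor, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if sqi_value >= floor:
            return letter
    return "F"
```

Note that the bands are inclusive at the lower edge, so exactly 90 is an A and exactly 60 is a D.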

6. Persist and present the result

Every run MUST produce two durable artifacts — this is not optional. Write them before presenting, then read the result back in plain language.

6a. Markdown report — always written, colocated with the change. Write the human report to <change-dir>/sqi.md, where <change-dir> is the directory resolved in step 1 — openspec/changes/<name>/ for an active change, or openspec/changes/archive/<date>-<name>/ for an archived one. One report per change; overwrite on re-grade. Use the structure in references/sqi-report.md.

Because active and archived changes sit at different directory depths, the report references the methodology and ledger as repo-root-relative path text, not clickable relative links (a ../../ link would break for one depth):

> Methodology: `docs/kpi/spec-quality-index.md`. Machine ledger: `sqi.jsonl` (repo root).

6b. Machine ledger — always appended at the repo root. Append the scorer's §10 JSON as exactly one line to ./sqi.jsonl at the repo root (create it if absent). One line per grading run; never rewrite prior lines. After appending, verify every line still parses:

python3 -c "import json; [json.loads(l) for l in open('sqi.jsonl') if l.strip()]; print('ok')"
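The append-then-verify step can also be done in one place. This is a sketch under stated assumptions: `append_ledger_line` is a hypothetical helper, `record` stands for whatever JSON object the scorer printed, and the existing ledger is assumed to end with a newline.

```python
import json
from pathlib import Path


def append_ledger_line(ledger: Path, record: dict) -> int:
    """Append `record` as one JSON line, then re-parse the whole ledger.

    Returns the number of non-blank lines. Raises json.JSONDecodeError if any
    line, old or new, fails to parse.
    """
    # json.dumps emits no raw newlines by default, so the record stays on one line.
    # Mode "a" creates the file if absent and never rewrites prior lines.
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    with ledger.open(encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return len(rows)
```

Calling it as `append_ledger_line(Path("sqi.jsonl"), record)` from the repo root keeps the ledger append-only and surfaces a corrupted earlier line immediately instead of at the next comparison.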

6c. Present. Show the SQI JSON (the scorer prints the §10 shape). Always include:

Both sub-scores, never just the blend — I and F fail independently and a single number hides the diagnosis.
confidence — full (all three inputs), partial (no run log → I unobserved, SQI is Fidelity-only — say so explicitly), or low (only the spec, pre-impl → SQI can't be measured; offer a predictive lint instead, clearly labeled a prediction, never a measurement).
No letter on partial runs. When run_log_present is false the scorer emits grade: null and a grade_basis of "fidelity-only". Present it as "grade withheld — Fidelity-only SQI = NN (Implementability unobserved)". Never paste an A–F letter for a partial run: it cannot lose S5/S6/S7 points, so a letter would overstate the spec's fitness.
writeback_recommendations — one concrete spec edit per S1/S2/S4 divergence. This closes the loop: it's how the next spec scores higher. The KPI is only useful if it feeds back.

Then read the result in plain language (e.g. "perfect implementability, good fidelity — a few self-resolved gaps and one oversimplified claim") and stop. This skill measures, persists, and recommends; it does not edit the graded spec unless asked.

Script reference

scripts/sqi.py — three subcommands, all deterministic:

score <json> — compute I/F/SQI/grade, emit §10 JSON. Reads k/blend from the JSON if present, else flags/defaults. Honors run_log_present:false.
count-reqs <specs-dir> — count ### Requirement: headers; returns total + per-file breakdown.
scan-journal <journal.jsonl> — surface S5/S7/decision candidates from an OpenSpec journal. Tolerates malformed lines; matches event or type keys.
