From grade-openspec
Grade a completed OpenSpec change with the Spec Quality Index (SQI) — a 0–100 score + letter grade measuring how cheaply the spec implemented (escalations, rework, gaps, corrections), NOT whether the feature or code is good. Use when the user wants to grade, score, or measure the quality of an OpenSpec change/spec after it was implemented; asks "how good was that spec", "run the SQI", "grade add-foo", "what's the spec quality index", or invokes /grade-spec on a named change; or just finished applying/archiving an OpenSpec change and wants its spec→impl fitness measured. Reads the spec, the implementation diff, and the OpenSpec journal (run log), classifies every divergence, and emits the machine-comparable SQI JSON plus write-back recommendations.
Slash command: /grade-openspec:grade-spec
Measure how good an OpenSpec change was as an implementation contract — by how
cheaply it implemented. Output is the SQI JSON from §10 of the methodology:
two sub-scores (Implementability I, Fidelity F), a blended SQI 0–100, a
letter grade, a confidence level, and concrete write-back recommendations.
Canonical methodology: docs/kpi/spec-quality-index.md. This skill operationalizes
it. Read that doc when you need calibration rationale (§8), anti-gaming pitfalls
(§11), grade-band interpretation (§6), or the full worked example (§7). The
per-run essentials are inline below.
SQI measures the spec's fitness, not whether the feature is good or the code is good. Keep that framing in the output.
1. Resolve the change → its directory + impl diff + journal
2. Count requirements R (deterministic — script)
3. Mine the journal for I-signals (deterministic — script, then classify)
4. Read spec + diff, enumerate every divergence, classify each S0..S7 (judgment)
5. Score (deterministic — script)
6. Persist: write `<change-dir>/sqi.md`, append `./sqi.jsonl`, then present
Per methodology §3. Missing inputs lower confidence (§9), not the score.
| Input | Where | Used for |
|---|---|---|
| Spec change | openspec/changes/<name>/{proposal,design,specs/**}.md, tasks.md | count R, know what spec claimed |
| Impl diff | git diff of the apply (or, if archived, the diff that landed it) | see what was actually built |
| Run log | openspec/changes/<name>/journal.jsonl (archived: openspec/changes/archive/<date>-<name>/journal.jsonl) | score Implementability |
If the change is archived, look under openspec/changes/archive/. If no journal
exists, Implementability is unobserved — set "run_log_present": false in the
scorer input; it reports Fidelity-only at partial confidence (§9). Never treat
an absent run log as a perfect Implementability score.
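The lookup described above can be sketched as follows. This is an illustration only, not part of `sqi.py`: the helper names `resolve_change_dir` and `run_log_present` are hypothetical, and the `<date>` prefix is assumed to be an ISO date (`YYYY-MM-DD`), which the text does not specify.

```python
import re
from pathlib import Path
from typing import Optional


def resolve_change_dir(repo_root: Path, name: str) -> Optional[Path]:
    """Return the active change directory, else the archived one, else None."""
    changes = repo_root / "openspec" / "changes"
    active = changes / name
    if active.is_dir():
        return active
    archive = changes / "archive"
    if not archive.is_dir():
        return None
    # Assumption: archived directories are named "<YYYY-MM-DD>-<name>".
    pattern = re.compile(r"\d{4}-\d{2}-\d{2}-" + re.escape(name))
    matches = sorted(
        p for p in archive.iterdir() if p.is_dir() and pattern.fullmatch(p.name)
    )
    # If the same name was archived more than once, take the latest date.
    return matches[-1] if matches else None


def run_log_present(change_dir: Path) -> bool:
    """True only when journal.jsonl exists inside the change directory."""
    return (change_dir / "journal.jsonl").is_file()
```

The boolean from `run_log_present` is what would go into the scorer input, so a missing journal is recorded as unobserved rather than silently scored as perfect.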
python3 skills/grade-spec/scripts/sqi.py count-reqs openspec/changes/<name>/specs
Counts ### Requirement: headers across specs/**.md. R normalizes the score
so large and small specs are comparable — a 30-requirement spec is allowed more
absolute divergence than a 3-requirement one. If specs live elsewhere, point the
command at the right directory; cross-check the count by eye.
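For the by-eye cross-check, a minimal sketch of the same count is shown here. It is not the `count-reqs` implementation; the function name `count_requirements` is hypothetical, and it deliberately counts every line that starts with `### Requirement:`, including any that sit inside fenced examples.

```python
import re
from pathlib import Path

# A requirement is any line that starts with the literal header text.
REQ_HEADER = re.compile(r"^### Requirement:", re.MULTILINE)


def count_requirements(specs_dir: Path) -> dict:
    """Count requirement headers in every .md file under specs_dir."""
    per_file = {}
    for md in sorted(specs_dir.rglob("*.md")):
        n = len(REQ_HEADER.findall(md.read_text(encoding="utf-8")))
        if n:
            per_file[md.relative_to(specs_dir).as_posix()] = n
    return {"total": sum(per_file.values()), "per_file": per_file}
```

If this count and the script's count disagree, the per-file breakdown shows which spec file to inspect.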
python3 skills/grade-spec/scripts/sqi.py scan-journal openspec/changes/<name>/journal.jsonl
This surfaces candidates — you still classify them:
| Journal signal | Candidate class | Note |
|---|---|---|
| verifier.result note=fail | S5 rework | counted per extra round beyond the first fail per task ref (the script does this in extra_rounds) |
| task.blocked | S7 blocker | only if a spec defect halted impl until the spec changed; a block on an external dep is not an S7 |
| decision | S1 / S3 / S4 | read input/output: was it a deferred TBD (S1), a silent gap filled (S3), or a wrong assertion fixed (S4)? |
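The S5 row's "extra round beyond the first fail per task ref" rule can be sketched like this. The journal field names `note` and `ref` are assumptions inferred from the table, and `extra_fail_rounds` is a hypothetical helper, not the script's own code.

```python
import json
from collections import Counter


def extra_fail_rounds(lines) -> dict:
    """Map each task ref to its verifier fails beyond the first one."""
    fails = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed journal lines
        if not isinstance(entry, dict):
            continue
        # The event name may live under either key.
        kind = entry.get("event") or entry.get("type")
        # Assumption: the verdict is in "note" and the task id in "ref".
        if kind == "verifier.result" and entry.get("note") == "fail":
            fails[entry.get("ref", "<no-ref>")] += 1
    # The first fail per ref is free; each further fail is one S5 candidate.
    return {ref: n - 1 for ref, n in fails.items() if n > 1}
```

These are still only candidates: each extra round counts as S5 only if the fail traces back to ambiguity or under-specification in the spec.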
Escalation (S6) is not journaled. Read the run transcript / handoff notes for moments where impl had to ask a stakeholder or the spec author to proceed. S6 is the cost the KPI most wants to penalize — do not miss it.
Read spec + diff side by side. Every place impl deviated from, resolved, or corrected the spec is a divergence. Assign exactly one class. When two classes fit, take the higher-cost one and count it once (an S1 that's also an S2 is one divergence, classified at its highest cost).
| Class | Name | Definition | w |
|---|---|---|---|
| S0 | Clean | Implemented as written; no divergence. | 0 |
| S1 | Resolved-defer | Spec explicitly deferred this (open Q / TBD / "confirm at apply"); impl resolved it via introspection, no human. Spec did its job by naming the unknown. | 0.5 |
| S2 | Write-back | Impl learned a concrete fact the spec should now record; non-blocking, no human. (Often the same divergence as an S1 — don't double-count.) | 0.5 |
| S3 | Gap | Spec was silent on something impl needed (tooling flag, env detail, teardown, unstated infra assumption). Impl decided alone. | 1 |
| S4 | Correction | Spec asserted something impl found wrong or oversimplified. Fixed without a human. | 2 |
| S5 | Rework | Ambiguity/under-spec caused a verifier FAIL→fix cycle. Count per extra round beyond the first. | 3 each |
| S6 | Escalation | Impl had to ask a stakeholder / spec author to proceed. | 6 each |
| S7 | Blocker | A spec defect halted impl until the spec itself was changed. | 10 each |
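The "one class per divergence, highest cost wins" rule can be expressed against the weights in the table. This is a sketch: `pick_class` is a hypothetical helper, and breaking a weight tie (S1 vs S2) by the higher label is an assumption, harmless because both carry the same weight.

```python
# Weights copied from the class table.
WEIGHTS = {"S0": 0, "S1": 0.5, "S2": 0.5, "S3": 1, "S4": 2, "S5": 3, "S6": 6, "S7": 10}


def pick_class(candidates):
    """Return the single highest-cost class among those that fit."""
    # Ties on weight fall back to the label (assumption, see lead-in).
    return max(candidates, key=lambda c: (WEIGHTS[c], c))
```

A divergence that fits both S3 and S4 is recorded once, as S4; one that fits both S1 and S2 is recorded once at weight 0.5.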
Decision rules (apply top-down; first match wins):
Anti-gaming (§11): don't reward silence — a spec that says nothing dodges S4 but racks up S3 and (on wrong guesses) S5/S6. Actively hunt the diff and run log for undocumented "I decided X myself" moments; a strong implementer who silently absorbs ambiguity will hide S3/S4 unless you look. Vagueness is penalized.
Write a JSON file with requirements, the classified divergences, optional
change/graded_at/writeback_recommendations, and run_log_present (false if
no journal). Then:
python3 skills/grade-spec/scripts/sqi.py score /path/to/divergences.json
The script applies §5 exactly: score = 100 × (1 − min(1, penalty / (k·R))) with
k=4, blend=0.5. Implementability uses {S5,S6,S7}; Fidelity uses {S1,S2,S3,S4}.
Do not hand-compute — let the script do the arithmetic. Override defaults only
with explicit reason: --k <n> (lower = stricter) or --blend <0..1> (toward 1.0
weights "buildable without help" over "factually complete").
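To make the formula concrete, here is a sketch of the arithmetic for checking your understanding, not a replacement for the script. Two details are assumptions the text does not spell out: both sub-scores are normalized by the same k·R, and the blend is linear with `blend` as the weight on Implementability.

```python
WEIGHTS = {"S0": 0, "S1": 0.5, "S2": 0.5, "S3": 1, "S4": 2, "S5": 3, "S6": 6, "S7": 10}
I_CLASSES = {"S5", "S6", "S7"}        # Implementability
F_CLASSES = {"S1", "S2", "S3", "S4"}  # Fidelity


def sub_score(classes, members, r, k=4.0):
    """100 * (1 - min(1, penalty / (k * R))) over the classes in `members`."""
    if r <= 0:
        raise ValueError("R must be positive")
    penalty = sum(WEIGHTS[c] for c in classes if c in members)
    return 100.0 * (1.0 - min(1.0, penalty / (k * r)))


def sqi(classes, r, k=4.0, blend=0.5):
    """Return I, F and the blended SQI for one classified change."""
    i = sub_score(classes, I_CLASSES, r, k)
    f = sub_score(classes, F_CLASSES, r, k)
    # Assumption: blend weights I, and (1 - blend) weights F.
    return {"I": i, "F": f, "SQI": blend * i + (1.0 - blend) * f}


# One list entry per divergence (S5 entries are extra rounds).
example = sqi(["S1", "S1", "S3", "S3", "S3", "S4", "S5"], r=10)
```

With R = 10 and k = 4 the budget k·R is 40. The Fidelity penalty is 0.5 + 0.5 + 1 + 1 + 1 + 2 = 6, so F = 100 × (1 − 6/40) = 85. The Implementability penalty is 3, so I = 92.5, and under the assumed linear blend the SQI is 88.75.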
Grade bands: A ≥90, B ≥80, C ≥70, D ≥60, F <60. A partial run (no run log)
gets no letter — the scorer sets grade: null; report it as a Fidelity-only
SQI number.
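The grade bands, including the withheld letter for a partial run, map to a small lookup. This is a sketch with a hypothetical function name; the scorer itself remains the source of the grade.

```python
from typing import Optional


def letter_grade(sqi: float, run_log_present: bool = True) -> Optional[str]:
    """Map an SQI to A-F, or None when Implementability was unobserved."""
    if not run_log_present:
        return None  # partial run: report Fidelity-only, no letter
    for floor, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if sqi >= floor:
            return letter
    return "F"
```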
Every run MUST produce two durable artifacts — this is not optional. Write them before presenting, then read the result back in plain language.
6a. Markdown report — always written, colocated with the change.
Write the human report to <change-dir>/sqi.md, where <change-dir> is the
directory resolved in step 1 — openspec/changes/<name>/ for an active change, or
openspec/changes/archive/<date>-<name>/ for an archived one. One report per change;
overwrite on re-grade. Use the structure in references/sqi-report.md.
Because active and archived changes sit at different directory depths, the report
references the methodology and ledger as repo-root-relative path text, not
clickable relative links (a ../../ link would break for one depth):
> Methodology: `docs/kpi/spec-quality-index.md`. Machine ledger: `sqi.jsonl` (repo root).
6b. Machine ledger — always appended at the repo root.
Append the scorer's §10 JSON as exactly one line to ./sqi.jsonl at the repo
root (create it if absent). One line per grading run; never rewrite prior lines.
After appending, verify every line still parses:
python3 -c "import json; [json.loads(l) for l in open('sqi.jsonl') if l.strip()]; print('ok')"
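The append-then-verify step can also be done in one place. The sketch below is an assumption about how one might wrap it, with a hypothetical helper name `append_to_ledger`; it appends only, never rewrites earlier lines, and re-parses the whole file afterwards.

```python
import json
from pathlib import Path


def append_to_ledger(ledger: Path, result: dict) -> int:
    """Append result as one JSON line, re-parse every line, return the count."""
    # json.dumps without indent never emits a raw newline, so one run = one line.
    line = json.dumps(result, ensure_ascii=False)
    # Guard against a prior last line that lacks its trailing newline.
    needs_newline = (
        ledger.exists()
        and ledger.stat().st_size > 0
        and not ledger.read_bytes().endswith(b"\n")
    )
    with ledger.open("a", encoding="utf-8") as fh:
        if needs_newline:
            fh.write("\n")
        fh.write(line + "\n")
    count = 0
    with ledger.open(encoding="utf-8") as fh:
        for raw in fh:
            if raw.strip():
                json.loads(raw)  # raises json.JSONDecodeError on a corrupt line
                count += 1
    return count
```

Calling it with the scorer's output dict and `Path("sqi.jsonl")` from the repo root returns the number of grading runs recorded so far.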
6c. Present. Show the SQI JSON (the scorer prints the §10 shape). Always include:
- Both sub-scores, I and F: they fail independently, and a single number hides
  the diagnosis.
- confidence: full (all three inputs), partial (no run log → I unobserved, SQI
  is Fidelity-only; say so explicitly), or low (only the spec, pre-impl → SQI
  can't be measured; offer a predictive lint instead, clearly labeled a
  prediction, never a measurement).
- grade: when run_log_present is false, the scorer emits grade: null and a
  grade_basis of "fidelity-only". Present it as
  "grade withheld — Fidelity-only SQI = NN (Implementability unobserved)". Never
  paste an A–F letter for a partial run: it cannot lose S5/S6/S7 points, so a
  letter would overstate the spec's fitness.
- writeback_recommendations: one concrete spec edit per S1/S2/S4 divergence.
  This closes the loop: it's how the next spec scores higher. The KPI is only
  useful if it feeds back.

Then read the result in plain language (e.g. "perfect implementability, good fidelity: a few self-resolved gaps and one oversimplified claim") and stop. This skill measures, persists, and recommends; it does not edit the graded spec unless asked.
scripts/sqi.py has three subcommands, all deterministic:
- score <json>: compute I/F/SQI/grade, emit §10 JSON. Reads k/blend
  from the JSON if present, else flags/defaults. Honors run_log_present:false.
- count-reqs <specs-dir>: count ### Requirement: headers; returns total +
  per-file breakdown.
- scan-journal <journal.jsonl>: surface S5/S7/decision candidates from an
  OpenSpec journal. Tolerates malformed lines; matches event or type keys.
npx claudepluginhub goern/grade-openspec --plugin grade-openspec