Assess the current verification maturity of a codebase and identify gaps. Trigger phrases: "assess verification", "verification maturity", "how is my testing", "verification strategy assessment", "what is my verification coverage", "audit my tests"
Slash command: `/software-verification:assess-verification [path-to-codebase]` (defaults to current directory)
Assess the current verification maturity of a codebase. Produce a `verification-report.md` with maturity tier, component breakdown, missing oracles, exactness analysis, human review requirements, autonomy candidates, feedback loop completeness, workflow gate assessment, and shift-left positioning.
- references/agentops-telemetry-assessment.md
- references/agentops-telemetry.md
- references/decision-framework.md
- references/documentation-verification.md
- references/feedback-loop-model.md
- references/maturity-model.md
- references/method-failure-modes.md
- references/report-template.md
- references/shift-left-model.md
- references/traceability-model.md
- references/verification-taxonomy.md

Search for all verification-related artifacts:
Testing:
- `*_test.*`, `*_spec.*`, `test_*.*`, `tests/`, `__tests__/`
- `jest.config.*`, `pytest.ini`, `pyproject.toml [tool.pytest]`, `vitest.config.*`
- `hypothesis`, `fast-check`, `proptest`, `QuickCheck`
- `fuzz/`, `*_fuzz_test.go`, cargo-fuzz config, AFL configs
- `mutmut`, `stryker.conf.*`, cargo-mutants config

Static analysis:
- `.eslintrc*`, `ruff.toml`, `.golangci.yml`, `clippy.toml`
- `tsconfig.json`, `mypy.ini`, `pyrightconfig.json`
- `.semgrep/`, `codeql-config.yml`
- `pprof`, `perf`, `py-spy` configurations or scripts

Contracts and schemas:
- `*.schema.json`, `*.proto`, `openapi.*`, `*.graphql`
- `icontract`, contracts library imports, invariant
- `pact/`, `pacts/`, Pact broker config, Spring Cloud Contract stubs
- `*.tla`, `*.cfg` (TLC), `*.als`, `*.dfy`

CI and operational:
- `.github/workflows/`, `.gitlab-ci.yml`
- `codecov.yml`, coverage report configs

Identify distinct components or modules. For each, determine:
Load references/decision-framework.md for classification guidance. Load references/method-failure-modes.md to understand risks of current methods and gaps in assurance.
| Property | Options |
|---|---|
| Archetype | Deterministic library, CRUD/API service, Distributed/stateful, Safety/security kernel, ML-backed, Agent-written |
| Criticality | High (safety, security, money, core data), Medium (business logic), Low (UI, utilities) |
| Determinism | Deterministic, Concurrent/distributed, Probabilistic/learned |
| Current verification | List which methods are already applied |
Load references/maturity-model.md and assign a tier (0-5):
Score both the overall codebase and each individual component.
For each component, answer: "Can we determine if the output is correct?"
Load references/verification-taxonomy.md for oracle types.
| Oracle type | When applicable |
|---|---|
| Exact expected output | Deterministic, well-specified inputs/outputs |
| Metamorphic relations | Output hard to predict but transformations have known effects |
| Differential oracle | Multiple implementations or versions to compare |
| Statistical threshold | Stochastic outputs with bounded distributions |
| Performance envelope | Measurable load/latency/resource bounds that must hold |
| Replay/held-out data | Historical inputs with known-good outputs |
| Behavioral twin | Third-party integrations where interface mocks can't verify real behavior |
| LLM-as-Judge | Non-deterministic output where human review is too slow/costly |
| Human judgment | Ambiguous outputs requiring domain expertise |
For components with third-party integrations, specifically assess:
Flag components with no oracle at all as critical gaps. Flag third-party integrations verified only via interface mocks as oracle weakness — agents can fabricate plausible API behavior that mocks will not catch.
For each oracle that exists, also rate its strength, not just its presence (see the "Oracle
strength" section in references/verification-taxonomy.md):
This matters because autonomy is capped by oracle strength: a component can have an oracle and still be unsafe for autonomous change if that oracle is weak. Flag oracle rot — tests that pass while the requirement has drifted, so the oracle no longer asserts the right thing — as a distinct gap from "no oracle"; it is more dangerous because it gives false confidence.
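The flagging rules above can be sketched as a small classifier. This is a minimal illustration, not the skill's implementation: the `Component` fields and the gap labels are hypothetical names chosen for the example, and the actual strength scale is the one defined in references/verification-taxonomy.md.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    # All field names are illustrative assumptions, not part of the skill.
    name: str
    oracles: List[str] = field(default_factory=list)  # e.g. ["exact expected output"]
    third_party: bool = False  # talks to a third-party integration
    mock_only: bool = False    # that integration is verified only via interface mocks
    oracle_rot: bool = False   # tests pass, but the requirement has drifted


def oracle_gaps(c: Component) -> List[str]:
    """Return the oracle gaps to flag for one component."""
    gaps = []
    if not c.oracles:
        gaps.append("critical: no oracle")
    if c.third_party and c.mock_only:
        gaps.append("oracle weakness: third-party verified only via mocks")
    if c.oracle_rot:
        gaps.append("oracle rot: passing tests give false confidence")
    return gaps


def autonomy_candidate(c: Component) -> bool:
    """Having an oracle is necessary but not sufficient: any flagged gap blocks autonomy."""
    return bool(c.oracles) and not oracle_gaps(c)
```

Note that a component with an oracle can still fail `autonomy_candidate`, which mirrors the point that presence of an oracle alone does not make autonomous change safe.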
For each component, determine:
Components that require human review:
Components where AI agents could iterate autonomously:
Evaluate whether verification outputs can be consumed by agents for self-correction.
Load references/feedback-loop-model.md for maturity levels and assessment criteria.
Search for indicators:
For each verification method found in Step 1, classify its feedback loop level (0-3):
| Method | Output format | Routable? | Feedback loop level |
|---|---|---|---|
| ... | ... | ... | ... |
Flag methods at Level 0-1 as gaps: verification exists but agents cannot act on its results.
Identify all human review and approval checkpoints in the development workflow:
Branch protection rules and required-approval counts live in the hosting platform's settings, not
the working tree. Read them with gh api (e.g. gh api repos/{owner}/{repo}/branches/{branch}/protection)
when available; if gh/network is unavailable, mark these "not inspectable from working tree" rather
than assuming they are absent.
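As a rough sketch of that fallback behavior, the helper below wraps the `gh api` call and returns a "not inspectable" marker instead of assuming protection is absent. The function names, the `runner` parameter, and the shape of the returned dict are assumptions made for this example; the two JSON fields read here (`required_pull_request_reviews.required_approving_review_count` and `required_status_checks.contexts`) come from GitHub's branch protection response.

```python
import json
import subprocess

NOT_INSPECTABLE = "not inspectable from working tree"


def run_gh(args):
    """Run the gh CLI and return its stdout; raises if gh is missing or exits non-zero."""
    result = subprocess.run(
        ["gh", *args], capture_output=True, text=True, check=True, timeout=30
    )
    return result.stdout


def branch_protection(owner, repo, branch, runner=run_gh):
    """Summarize branch protection, or report that it cannot be inspected.

    `runner` is injectable so the logic can be exercised without network access.
    """
    try:
        raw = runner(["api", f"repos/{owner}/{repo}/branches/{branch}/protection"])
        data = json.loads(raw)
    except (OSError, subprocess.SubprocessError, json.JSONDecodeError):
        # Covers a missing gh binary, timeouts, and non-zero exits. A non-zero exit
        # can also mean "branch not protected" or "insufficient permissions"; this
        # sketch does not distinguish those cases.
        return {"status": NOT_INSPECTABLE}

    reviews = data.get("required_pull_request_reviews") or {}
    checks = data.get("required_status_checks") or {}
    return {
        "status": "inspected",
        "required_approvals": reviews.get("required_approving_review_count", 0),
        "required_checks": checks.get("contexts", []),
    }
```

A fuller version would read stderr to separate an unprotected branch from an unreachable API before recording the result in the report.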
For each gate, evaluate:
| Gate | Risk class served | Rejection rate (if estimable) | Could agent pre-review replace? | Produces actionable feedback? |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
Identify:
Bypass paths that skip a gate (e.g. `git commit --no-verify`, push without CI).
A gate that only fires client-side is not enforceable — the authoritative gate must be server-side
(branch protection + required status checks). Flag any check that exists only as a skippable local hook.

Evaluate whether checks run at the earliest possible point in the development loop.
Load references/shift-left-model.md for the tier model and indicators.
Search for shift-left indicators:
- `.pre-commit-config.yaml`, `.husky/`, `lefthook.yml`, `.git/hooks/`
- `jest --watch`, `vitest`, `cargo-watch`, scripts for changed-files-only test runs

For each verification method found in Step 1, classify its current execution tier (T1-T4):
| Check | Current tier | Ideal tier | Shift-left gap? |
|---|---|---|---|
| ... | ... | ... | ... |
Flag checks that run only at T3-T4 but could run at T1-T2 (e.g., type checking only in CI, no pre-commit hooks, no focused test mode).
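The indicator search and the gap rule can be sketched together. This is an illustrative outline under stated assumptions: the function names and the `(name, current_tier, ideal_tier)` tuple shape are hypothetical, tiers are written as integers 1-4 standing in for T1-T4, and the tier definitions themselves come from references/shift-left-model.md.

```python
from pathlib import Path

# Hook-related paths taken from the indicator list above.
HOOK_INDICATORS = [".pre-commit-config.yaml", ".husky", "lefthook.yml", ".git/hooks"]


def find_hook_indicators(repo_root):
    """Return the hook indicators that exist under repo_root, in list order.

    A .git/hooks directory alone is weak evidence: a fresh clone normally
    contains only inactive *.sample files there.
    """
    root = Path(repo_root)
    return [p for p in HOOK_INDICATORS if (root / p).exists()]


def shift_left_gaps(checks):
    """Flag checks that run only at T3-T4 but could run at T1-T2.

    checks: iterable of (name, current_tier, ideal_tier) with integer tiers 1-4.
    """
    return [name for name, current, ideal in checks if current >= 3 and ideal <= 2]
```

For example, a type check that currently runs only in CI (tier 3) with an ideal tier of 1 would be returned by `shift_left_gaps`, while an end-to-end suite whose ideal tier is also 4 would not.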
Evaluate whether documentation stays synchronized with code through automated checks.
Load references/documentation-verification.md for assessment dimensions and maturity levels.
Search for indicators:
- `protoc-gen-doc`
- `markdown-link-check`, `linkinator`, internal cross-ref checks
- `doctest` (Python/Rust), tested code blocks in Markdown, `markdown-exec`

Classify overall documentation verification level (0-3):
Flag Level 0-1 as a risk: stale docs become a fabrication vector for agents relying on them for context.
Evaluate whether verification outputs are observable and measurable at the operational level.
Load references/agentops-telemetry.md and references/agentops-telemetry-assessment.md for telemetry streams and assessment criteria.
Search for indicators of operational visibility into verification:
For each telemetry stream, classify the current level (0-3):
| Stream | Current level | Key gaps | Impact on verification improvement |
|---|---|---|---|
| Trajectory | ... | ... | ... |
| Cost | ... | ... | ... |
| Quality | ... | ... | ... |
| Autonomy compliance | ... | ... | ... |
| Domain | ... | ... | ... |
Flag critical gaps: verification that cannot be improved because there is no measurement of its effectiveness.
Evaluate whether intent traces to evidence: requirement → acceptance criterion → test → code → result.
Load references/traceability-model.md before scoring — it defines the 0-3 maturity levels and the
three RTM properties; do not invent your own scale.
Search for these signals:
- `docs/requirements/`, `docs/specs/`, EARS-style statements, structured issue templates
- `.feature` files and BDD step definitions (Cucumber, Behave, SpecFlow)
- `git log` (e.g. `git log --oneline -50`) and, if available, `gh pr list`. A valid link is a closing keyword (`Closes #123`, `Fixes #123`), a bare `#123`/`Refs:`, or a tracker key (e.g. `PROJ-123`). Sample recent merges and report the fraction lacking any link rather than asserting a number you cannot compute. If git history or gh is unavailable, mark this signal "not inspectable" rather than guessing.
- trace, AI-DLC

Classify the overall level (0-3). Assess the three RTM properties (scope verification, impact analysis, test sufficiency) and flag the failure each gap allows. Flag Level 0-1 as critical: intent drift is caught only by slow, fully-human review, which cannot keep pace with agents.
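The commit-link sampling described above could be computed along these lines. This is a heuristic sketch, not a prescribed method: the function names and regular expressions are assumptions for illustration, and the tracker-key pattern will also match strings such as `UTF-8`, so sampled results deserve a quick manual check.

```python
import re

# Heuristic patterns for the link forms named above (assumed, not exhaustive).
LINK_PATTERNS = [
    # Bare issue reference; also covers closing keywords such as "Closes #123".
    re.compile(r"(?<![\w&])#\d+"),
    # "Refs:" trailer.
    re.compile(r"\bRefs:", re.IGNORECASE),
    # Tracker key such as PROJ-123 (may false-positive on e.g. "UTF-8").
    re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),
]


def has_link(message):
    """True if the commit or PR message contains any recognized link form."""
    return any(p.search(message) for p in LINK_PATTERNS)


def fraction_unlinked(messages):
    """Fraction of messages lacking any link, or None when there is nothing to sample.

    None corresponds to marking the signal "not inspectable" instead of guessing.
    """
    messages = list(messages)
    if not messages:
        return None
    unlinked = sum(1 for m in messages if not has_link(m))
    return unlinked / len(messages)
```

The input would typically be the subject lines from `git log --oneline -50`, split one message per entry; `Refs:` trailers live in commit bodies, so subject-only sampling can undercount them.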
Load references/report-template.md for the output structure.
Write verification-report.md following the template, including all sections assessed in Steps 1-13 (maturity tier through requirement traceability).
npx claudepluginhub krokoko/cairn --plugin software-verification