From cli-printing-press
Internal sub-skill that reviews printed CLI output for plausibility bugs (semantic mismatches, silent source drops, ranking failures, format bugs) that rule-based checks miss. Invoked by parent skills, not for direct use.
How this skill is triggered — by the user, by Claude, or both
Slash command
/cli-printing-press:printing-press-output-reviewThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Review the sampled outputs from a printed CLI for plausibility bugs that dogfood, verify, and the rule-based `scorecard --live-check` rules can't catch. Wave B policy: all findings surface as warnings, never errors.
Review the sampled outputs from a printed CLI for plausibility bugs that dogfood, verify, and the rule-based scorecard --live-check rules can't catch. Wave B policy: all findings surface as warnings, never errors.
This skill is internal-only (user-invocable: false). It's invoked by parents — main printing-press skill at shipcheck Phase 4.85, polish skill during its diagnostic loop. Running it standalone would produce floating findings text with no ship verdict, no fixes applied, no publish offer; the actionable wrappers are /printing-press and /printing-press-polish. The skill carries context: fork so the reviewer agent's diagnostic chatter stays isolated from the calling skill's context.
The caller passes $CLI_DIR as the argument: an absolute path to the printed CLI's working directory.
Bugs that rule-based checks miss, typically surfaced by 5 minutes of hands-on testing but slipping past dogfood, verify, and scorecard --live-check rules:
# Locate research.json. Adjacent to the binary covers the post-promote
# layout (standalone polish, shipcheck against the library copy). The
# grandparent fallback covers mid-pipeline invocations where $CLI_DIR is
# $PRESS_RUNSTATE/runs/<id>/working/<cli> and research.json lives at
# $PRESS_RUNSTATE/runs/<id>/research.json. Without the fallback, scorecard
# reports `unable: true` mid-pipeline and we SKIP the most informative review.
# Use a bash array so the flag survives paths with spaces.
RESEARCH_ARGS=()
if [ ! -f "$CLI_DIR/research.json" ]; then
_grandparent="$(dirname "$(dirname "$CLI_DIR")")"
if [ -f "$_grandparent/research.json" ]; then
RESEARCH_ARGS=(--research-dir "$_grandparent")
fi
fi
cli-printing-press scorecard --dir "$CLI_DIR" "${RESEARCH_ARGS[@]}" --live-check --json > /tmp/output-review-livecheck.json 2>&1 || true
If the scorecard call fails or /tmp/output-review-livecheck.json is empty, return the SKIP result (Step 3) without dispatching the reviewer.
Use the Agent tool (general-purpose) with this prompt contract:
Review the sampled outputs from the shipped CLI at
$CLI_DIR. You have these ground-truth sources:
- Sampled command output: read
/tmp/output-review-livecheck.jsonand inspect thelive_check.features[]array. Each entry has the command, example invocation, redacted stdout evidence (inoutput_sample, bounded to ~4 KiB), the redacted pass/fail reason, and awarningsarray (populated by rule-based checks like the raw-HTML-entity detector). Treat<redacted>markers as privacy scrubbed values, not format bugs.- Review only
status: passentries. Entries withstatus: faileither crashed, timed out, or had placeholder args (<id>,<url>) that never produced real output — their sample is empty and there's nothing for you to judge. Phase 5 dogfood handles test-coverage and exit-code concerns.$CLI_DIR/research.jsonnovel_features(planned behavior per feature) andnovel_features_built(verified built commands).- The CLI binary at
$CLI_DIR/<cli-name>-pp-cli— you may invoke additional commands to gather more output when a finding needs verification.For each of these checks, report findings under 50 words each. Only report issues a human user would notice in 5 minutes of hands-on testing — not every edge case a thorough QA pass might find:
- Output semantically matches query intent. For sampled novel features with a query argument, judge relevance beyond what the mechanical query-token check in live-check already enforced. A feature that passed live-check's
outputMentionsQuerytest still contains some query token somewhere — but "buttermilk" appearing as a substring of "butter" results, or "brownies" returning a chili recipe because the extractor fell back to adjacent content, both slip past the mechanical check. Only flag when a human user would look at the top results and say "this isn't what I asked for." Skip this check when the example has no query argument.- No obvious format bugs. Does the output contain raw HTML entities, mojibake (question marks or replacement chars in titles), or malformed URLs (pointing at category index pages, feed endpoints, or random-selector routes rather than canonical content permalinks)? Rule-based live-check catches numeric entities; this layer catches the broader class.
- Aggregation commands show all requested sources. For commands with a
--source/--site/--regionCSV flag: if the user requested N sources, does output show N, or does stderr explain the missing ones? Silent drops of failed sources are a top failure mode for fan-out commands.- Result ordering/ranking makes sense. For commands that claim to rank or sort, does the top result look plausibly best given the query? Watch for broken score weights, off-by-one sort bugs, and silent fallback to recency when relevance computation fails.
Return a list of findings. For each: check name, severity (
warningin Wave B;errorreserved for Wave C), one-line description, one-sentence fix suggestion. If the CLI passes all four checks, return "PASS — no findings."
End the skill response with a ---OUTPUT-REVIEW-RESULT--- block the parent parses:
On clean pass:
---OUTPUT-REVIEW-RESULT---
status: PASS
findings: []
---END-OUTPUT-REVIEW-RESULT---
On warnings:
---OUTPUT-REVIEW-RESULT---
status: WARN
findings:
- check: <check-name>
severity: warning
description: <one-line>
suggestion: <one-sentence>
- ...
---END-OUTPUT-REVIEW-RESULT---
On reviewer failure (timeout, agent-budget exhaustion, missing live-check data):
---OUTPUT-REVIEW-RESULT---
status: SKIP
reason: <one-line description>
findings: []
---END-OUTPUT-REVIEW-RESULT---
warning — never error. Shipcheck proceeds regardless.manuscripts/<api>/<run>/proofs/phase-4.85-findings.md) and surfaces them to the user. Findings are not persisted to scorecard.json — that path is reserved for Wave C.Non-interactive contract (CI, cron, batch regeneration):
status: SKIP (reviewer crash, timeout, missing data) is informational — shipcheck does not block on it.--auto-approve-warnings flag yet. The policy is already "warnings don't block" in Wave B, so the flag has no effect to gate.Wave C (separate future PR) will flip error-severity findings to blocking after calibration data across the library shows false-positive rate below 10%.
Output-plausibility questions are not pattern-matchable against source. Rule-based live-check rules cover what regexes can (numeric HTML entities, query-token absence). Everything else — "are these substitution results plausibly correct for the query?", "does the top search result look related?" — is an LLM-shaped question. The token cost is bounded (once per run, not per command) and the catch rate against the bug classes that motivated this phase justifies the dispatch.
live_check.features[]. Full command-tree coverage belongs in Phase 5 dogfood.npx claudepluginhub mvanhorn/cli-printing-press --plugin cli-printing-pressAnalyzes a Printing Press session to identify systemic improvements to templates, Go binary, skills, and catalog. Creates a GitHub issue with actionable findings for the next CLI generation.
Performs Phase 4 quality gate for cli-web Python CLIs: 3-agent implementation review, 75-check checklist, pip package publishing, and read/write smoke tests.
Probes docs, skills, plans, or claims for weaknesses, gaps, or unstated assumptions before shipping. Returns structured verdicts.