From starry-harness
This skill should be used when the user asks to "run the harness autonomously", "evolve", "auto-hunt", "autonomous mode", "self-evolving loop", "sweep syscalls", "deep dive", "what should I work on next", "pick next target", "run a sweep", "continuous improvement", or wants the starry-harness to autonomously select targets, run analysis cycles, and track progress.
How this skill is triggered — by the user, by Claude, or both
Slash command
/starry-harness:evolveThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Orchestrates autonomous and human-driven kernel improvement cycles. Maintains persistent strategy state, selects targets based on coverage gaps and effectiveness history, alternates between broad sweeps and deep investigations, and enforces a mandatory multi-agent review pipeline.
Orchestrates autonomous and human-driven kernel improvement cycles. Maintains persistent strategy state, selects targets based on coverage gaps and effectiveness history, alternates between broad sweeps and deep investigations, and enforces a mandatory multi-agent review pipeline.
These are hard constraints. Violating any one of them invalidates the entire round.
Linux defines correctness. Linux return values, errno, output, side effects, blocking semantics, concurrency semantics, and resource cleanup semantics are the baseline. Never let StarryOS's current behavior retroactively define what is "correct."
Test before fix. Write a test that proves the bug exists on StarryOS and passes on Linux BEFORE modifying any kernel code. A fix without a pre-existing failing test is not a verified fix.
Evidence before claims. Any finding without tier 1-4 evidence is a "pending hypothesis," not a confirmed bug. Mark it explicitly as such. The reviewer will reject unsubstantiated claims.
One bug per round. Each cycle fixes one bug or investigates one target. No bundled changes. No drive-by refactors. No "while I'm here" additions.
Harness before patch. Every round must produce at least one reusable test asset (a test case, a pattern scanner rule, a regression check). The test outlives the fix.
Deterministic tools first. Run lock-order-graph.py, pattern-scanner.py, kernel-graph.py, change-tracker.py BEFORE applying LLM reasoning. Their output is ground truth that cannot hallucinate.
Reviewer has veto power. If the reviewer (kernel-reviewer agent, Codex, or human) says REVISE or REJECT, the round is not done. Address every specific objection.
Autonomous: The system picks targets, runs cycles, and stops when the session budget is exhausted or no targets remain above minimum value. Invoke with "run autonomously" or "auto-hunt."
Human-driven: The system presents priorities and recommendations; the human picks the target. Invoke with "what should I work on next" or "pick next target."
At the start of every evolve session:
docs/starry-reports/strategy.json — if it doesn't exist, generate it from os/StarryOS/tests/known.json and the kernel sourcepython3 ${CLAUDE_PLUGIN_ROOT}/scripts/change-tracker.py — check what kernel files changed since last runpython3 ${CLAUDE_PLUGIN_ROOT}/scripts/pattern-scanner.py — any new pattern hits?python3 ${CLAUDE_PLUGIN_ROOT}/scripts/abi-check.py — any syscall arg count mismatches vs Linux?docs/starry-reports/journal.md for recent activitycheck-upstream to deprioritize bugs already fixed or claimed in upstream PRsCompute priorities in this order:
needs_deep from prior sweepsknown.jsonWithin each tier, prefer techniques with higher historical yield (from strategy.json effectiveness tracking).
Scan 5-10 syscall handlers quickly per batch. For each:
os/StarryOS/kernel/src/syscall/Ok(0) without logic), catch-all match arm, ignored parameters, TODO/FIXMEswept_clean in strategyswept_suspicious with reasonneeds_deepBudget: ~5 minutes per target. Skip full review pipeline — sweep is discovery, not fix.
Pick one target from needs_deep or swept_suspicious. Execute the full cycle:
${CLAUDE_PLUGIN_ROOT}/scripts/man-lookup.sh${CLAUDE_PLUGIN_ROOT}/scripts/linux-ref-test.sh — test MUST pass on Linux${CLAUDE_PLUGIN_ROOT}/scripts/stress-test.sh with SMP sweepingBudget: ~30 minutes per target. Full review pipeline required.
StarryOS supports 4 architectures: riscv64 (primary test target), aarch64, x86_64, loongarch64. The xtask build system handles all four.
Default: Test on riscv64 first (fastest iteration, most tooling support).
Cross-arch verification (after a fix passes review on riscv64):
# Build and test on other architectures
cargo starry build --arch aarch64
cargo starry build --arch x86_64
# Run QEMU tests on each
cargo starry test qemu --target aarch64
cargo starry test qemu --target x86_64
When cross-arch testing is mandatory:
os/StarryOS/kernel/src/config/ (per-arch config)os/arceos/modules/axhal/ (hardware abstraction)When to skip (single-arch is sufficient):
Always note in the bug report whether the fix was verified single-arch or multi-arch, and flag any cross-arch risks.
Every fix MUST go through this pipeline. No exceptions. No shortcuts. The Stop hook enforces this.
Re-read the fix against the man page and the test output. Does it address the root cause? Does it handle error paths? Does it break adjacent behavior?
Dispatch the kernel-reviewer agent with fresh context. It reviews Rust idioms, safety, code reuse, API consistency. If it finds critical issues → revise the fix, restart from Step 1.
Run ${CLAUDE_PLUGIN_ROOT}/scripts/regression-check.sh to verify no existing tests broke. Run cargo xtask clippy --package starry-kernel and cargo fmt --check. Any regression → fix must be revised.
Dispatch the Codex agent (via the codex:rescue skill or codex:codex-rescue agent) with:
If Codex says REVISE → address each point and re-submit. If Codex says REJECT → reconsider the approach.
Dispatch a separate agent with ONLY the bug description + man page. NOT the proposed fix. Compare independently-derived fix with the proposed one. If they disagree → reconciliation round.
Record all review rounds in strategy.json reviews section.
The evolve skill relies on deterministic scripts for analysis — the LLM interprets results, but the scanning itself is reproducible and hallucination-free.
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/lock-order-graph.py --json /tmp/lock-order.json
Rust ownership-aware analysis: distinguishes let guard = x.lock() (held) from x.lock().method() (temporary dropped at semicolon). Detects drop() calls. Cycles in the graph are concrete deadlock evidence.
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/pattern-scanner.py --json /tmp/pattern-hits.json
Reads rules from docs/starry-reports/patterns.json. Default 9 patterns including negative-to-unsigned casts, Ok(0) stubs, AB/BA lock patterns. Pattern evolution: when a new bug class is found, add a grep rule to patterns.json. The scanner finds new instances deterministically.
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/kernel-graph.py --json /tmp/kernel-graph.json
Maps all 204 syscalls to subsystems, files, locks, unsafe blocks. Shows which untested syscalls touch the most shared state.
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/abi-check.py --json /tmp/abi-check.json
Compares StarryOS's uctx.argN() usage per syscall against Linux kernel SYSCALL_DEFINE arities. Catches mismatches where StarryOS reads the wrong number of arguments (e.g., 5 args when Linux passes 6). Run at startup and before writing any new syscall test.
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/change-tracker.py --json /tmp/changes.json
Git-aware: identifies which tests need re-running based on file changes since last run.
Every 3-5 runs within a session, the loop pauses to reflect:
as _ casts in different syscalls")docs/starry-reports/patterns.json with new rulesdocs/starry-reports/insights.mdBudget: ~2K tokens. Saves tokens downstream by improving target selection.
Load strategy + run change-tracker + run pattern-scanner
│
▼
Compute priorities (incorporating deterministic scan results)
│
├─ autonomous → pick top target
└─ human → present top 5, ask
│
▼
Is target a sweep or deep?
├─ sweep batch → run sweep mode on 5-10 targets
└─ deep target → run deep mode on 1 target (includes MANDATORY review pipeline)
│
▼
Update strategy.json (coverage, effectiveness, queue, review rounds)
│
▼
Every 3-5 runs → REFLECT (run scanners, synthesize, update patterns)
│
▼
Check stopping conditions:
├─ session budget exhausted (default 5 cycles) → stop
├─ no targets above minimum value → stop
├─ human requests stop → stop
└─ otherwise → loop back to "Compute priorities"
references/review-pipeline.md — Full adaptive review protocol with convergence rulesreferences/strategy-schema.md — Complete strategy.json schema and field definitions${CLAUDE_PLUGIN_ROOT}/scripts/lock-order-graph.py — Static lock ordering + cycle detection (Rust ownership-aware)${CLAUDE_PLUGIN_ROOT}/scripts/pattern-scanner.py — Regex-based bug pattern scanner with evolving rule set${CLAUDE_PLUGIN_ROOT}/scripts/kernel-graph.py — Kernel architecture graph (204 syscalls mapped)${CLAUDE_PLUGIN_ROOT}/scripts/change-tracker.py — Git-aware change detection${CLAUDE_PLUGIN_ROOT}/scripts/stress-test.sh — Multi-run SMP-sweeping test runner${CLAUDE_PLUGIN_ROOT}/scripts/regression-check.sh — Full regression suite${CLAUDE_PLUGIN_ROOT}/scripts/strace-profiler.sh — Application syscall profiling${CLAUDE_PLUGIN_ROOT}/scripts/draft-pr.sh — PR draft generator (never auto-submits)Before presenting results to the user, self-check:
Provides a checklist for code reviews covering functionality, security, performance, maintainability, tests, and quality. Use for pull requests, audits, team standards, and developer training.
npx claudepluginhub josephjoshua/starry-harness --plugin starry-harness