Skill

evolve

This skill should be used when the user asks to "run the harness autonomously", "evolve", "auto-hunt", "autonomous mode", "self-evolving loop", "sweep syscalls", "deep dive", "what should I work on next", "pick next target", "run a sweep", "continuous improvement", or wants the starry-harness to autonomously select targets, run analysis cycles, and track progress.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/starry-harness:evolve

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Orchestrates autonomous and human-driven kernel improvement cycles. Maintains persistent strategy state, selects targets based on coverage gaps and effectiveness history, alternates between broad sweeps and deep investigations, and enforces a mandatory multi-agent review pipeline.

Supporting Files

references/review-pipeline.md
references/strategy-schema.md

SKILL.md

262 lines · ~3.5k tokens

Stats

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: Apr 19, 2026


StarryOS Self-Evolving Harness

Non-Negotiable Principles

These are hard constraints. Violating any one of them invalidates the entire round.

Linux defines correctness. Linux return values, errno, output, side effects, blocking semantics, concurrency semantics, and resource cleanup semantics are the baseline. Never let StarryOS's current behavior retroactively define what is "correct."
Test before fix. Write a test that proves the bug exists on StarryOS and passes on Linux BEFORE modifying any kernel code. A fix without a pre-existing failing test is not a verified fix.
Evidence before claims. Any finding without tier 1-4 evidence is a "pending hypothesis," not a confirmed bug. Mark it explicitly as such. The reviewer will reject unsubstantiated claims.
One bug per round. Each cycle fixes one bug or investigates one target. No bundled changes. No drive-by refactors. No "while I'm here" additions.
Harness before patch. Every round must produce at least one reusable test asset (a test case, a pattern scanner rule, a regression check). The test outlives the fix.
Deterministic tools first. Run lock-order-graph.py, pattern-scanner.py, kernel-graph.py, change-tracker.py BEFORE applying LLM reasoning. Their output is ground truth that cannot hallucinate.
Reviewer has veto power. If the reviewer (kernel-reviewer agent, Codex, or human) says REVISE or REJECT, the round is not done. Address every specific objection.

Modes

Autonomous: The system picks targets, runs cycles, and stops when the session budget is exhausted or no targets remain above minimum value. Invoke with "run autonomously" or "auto-hunt."

Human-driven: The system presents priorities and recommendations; the human picks the target. Invoke with "what should I work on next" or "pick next target."

Startup: Load Strategy

At the start of every evolve session:

Read docs/starry-reports/strategy.json — if it doesn't exist, generate it from os/StarryOS/tests/known.json and the kernel source
Run python3 ${CLAUDE_PLUGIN_ROOT}/scripts/change-tracker.py — check what kernel files changed since last run
Run python3 ${CLAUDE_PLUGIN_ROOT}/scripts/pattern-scanner.py — any new pattern hits?
Run python3 ${CLAUDE_PLUGIN_ROOT}/scripts/abi-check.py — any syscall arg count mismatches vs Linux?
Read docs/starry-reports/journal.md for recent activity
If convenient (not every session), run check-upstream to deprioritize bugs already fixed or claimed in upstream PRs
Present current status: category gaps, analysis queue, ABI mismatches, change-tracker findings, upstream overlap
In human-driven mode: present the top 5 recommended targets and ask
In autonomous mode: pick the top target and begin

Target Selection

Compute priorities in this order:

Category gaps and unexplored areas — bug categories with 0 coverage, no benchmarks yet, no app-compat yet
Change-tracker findings — files modified since last run that affect tested syscalls → re-verify
Analysis queue — targets flagged needs_deep from prior sweeps
High-value untested syscalls — used by target apps but not yet in known.json
Coverage expansion — untested syscalls in order of estimated importance

Within each tier, prefer techniques with higher historical yield (from strategy.json effectiveness tracking).
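
The ordering above can be read as a two-key sort: priority tier first, then the historical yield of the technique within a tier. The sketch below illustrates that idea only. The tier labels, the `Target` fields, and the yield table are assumptions made for illustration; the real layout lives in strategy.json and is described in references/strategy-schema.md.

```python
from dataclasses import dataclass

# Tier order mirrors the priority list above (lower index = higher priority).
# The labels themselves are hypothetical.
TIERS = [
    "category_gap",
    "change_tracker",
    "analysis_queue",
    "high_value_untested",
    "coverage_expansion",
]


@dataclass
class Target:
    name: str       # e.g. a syscall name
    tier: str       # one of TIERS
    technique: str  # hypothetical label for the analysis technique


def rank_targets(targets, yield_by_technique):
    """Sort by tier, then by descending historical yield within a tier."""
    def key(t):
        return (TIERS.index(t.tier), -yield_by_technique.get(t.technique, 0.0))
    return sorted(targets, key=key)


if __name__ == "__main__":
    targets = [
        Target("mmap", "coverage_expansion", "sweep"),
        Target("futex", "analysis_queue", "stress"),
        Target("epoll_wait", "analysis_queue", "linux_diff"),
    ]
    yields = {"linux_diff": 0.6, "stress": 0.3, "sweep": 0.1}  # made-up numbers
    for t in rank_targets(targets, yields):
        print(t.tier, t.name)
```

In human-driven mode the first five entries of the ranked list would be the ones presented; in autonomous mode only the first is used.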

Sweep Mode (broad, shallow)

Scan 5-10 syscall handlers quickly per batch. For each:

Read the handler source in os/StarryOS/kernel/src/syscall/
Check for obvious patterns: stub (Ok(0) without logic), catch-all match arm, ignored parameters, TODO/FIXME
If suspicious: generate a minimal test (2-3 test cases), run Linux comparison
Classify result:
- Clean: 0 divergences → mark swept_clean in strategy
- Suspicious: 1+ divergences or pattern matches → add to swept_suspicious with reason
- Needs deep: ≥2 bugs, touches shared state, or concurrency-relevant → add to needs_deep

Budget: ~5 minutes per target. Skip full review pipeline — sweep is discovery, not fix.
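
The classification rules above can be sketched as a small decision function. This is one possible reading, not the harness's actual logic: it treats the divergence count as the bug count, and it assumes a handler with no findings at all stays clean even if it touches shared state. The `SweepResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SweepResult:
    divergences: int                   # test cases that differ from Linux
    pattern_hits: int                  # stub / catch-all / TODO style matches
    touches_shared_state: bool = False
    concurrency_relevant: bool = False


def classify(r: SweepResult) -> str:
    """Return the strategy bucket for one swept handler."""
    # Nothing found at all: the handler is clean (assumption, see lead-in).
    if r.divergences == 0 and r.pattern_hits == 0:
        return "swept_clean"
    # Something was found and it meets an escalation criterion.
    if r.divergences >= 2 or r.touches_shared_state or r.concurrency_relevant:
        return "needs_deep"
    # Something was found, but nothing that warrants a deep dive yet.
    return "swept_suspicious"
```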

Deep Mode (narrow, thorough)

Pick one target from needs_deep or swept_suspicious. Execute the full cycle:

Fetch man page via ${CLAUDE_PLUGIN_ROOT}/scripts/man-lookup.sh
Establish Linux baseline: Document expected behavior for normal input, invalid input, boundary conditions, errno values, blocking/concurrency semantics, side effects, resource cleanup
Generate comprehensive test case (all documented behaviors, error codes, edge cases)
Run Linux comparison via ${CLAUDE_PLUGIN_ROOT}/scripts/linux-ref-test.sh — test MUST pass on Linux
Run StarryOS pipeline — capture divergences
For concurrency targets: run ${CLAUDE_PLUGIN_ROOT}/scripts/stress-test.sh with SMP sweeping
Root cause analysis: Locate the exact source file:line. Identify which category: missing implementation, semantic divergence, error path bug, boundary handling, concurrency defect. Cite actual code — no guessing.
Implement fix — minimal, local, no unrelated changes
Run mandatory review pipeline (see below)
Report: bug report + journal entry + strategy update

Budget: ~30 minutes per target. Full review pipeline required.

Multi-Architecture Awareness

StarryOS supports 4 architectures: riscv64 (primary test target), aarch64, x86_64, loongarch64. The xtask build system handles all four.

Default: Test on riscv64 first (fastest iteration, most tooling support).

Cross-arch verification (after a fix passes review on riscv64):

# Build and test on other architectures
cargo starry build --arch aarch64
cargo starry build --arch x86_64
# Run QEMU tests on each
cargo starry test qemu --target aarch64
cargo starry test qemu --target x86_64

When cross-arch testing is mandatory:

Fixes touching os/StarryOS/kernel/src/config/ (per-arch config)
Fixes touching os/arceos/modules/axhal/ (hardware abstraction)
Fixes involving inline assembly, page table manipulation, or signal trampolines
Any fix where the root cause is arch-dependent (different struct layouts, endianness, syscall numbers)

When to skip (single-arch is sufficient):

Pure syscall logic bugs (wrong errno, missing check) — these are arch-independent
File system bugs — arch-independent
Most semantic/correctness bugs

Always note in the bug report whether the fix was verified single-arch or multi-arch, and flag any cross-arch risks.
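
The mandatory/skip rules above partly depend on which paths a fix touches and partly on a judgment about the root cause. A minimal sketch of that split follows; the function name and the `arch_sensitive` flag are assumptions, and the example file names in the usage are hypothetical.

```python
# Path prefixes taken from the "mandatory" list above.
MANDATORY_PREFIXES = (
    "os/StarryOS/kernel/src/config/",
    "os/arceos/modules/axhal/",
)


def needs_cross_arch(changed_files, arch_sensitive=False):
    """Decide whether a fix must be verified on more than riscv64.

    arch_sensitive is set by the analyst when the fix involves inline
    assembly, page table manipulation, signal trampolines, or an
    arch-dependent root cause. That part cannot be derived from paths.
    """
    if arch_sensitive:
        return True
    return any(f.startswith(MANDATORY_PREFIXES) for f in changed_files)


if __name__ == "__main__":
    print(needs_cross_arch(["os/arceos/modules/axhal/src/lib.rs"]))
    print(needs_cross_arch(["os/StarryOS/kernel/src/syscall/fs.rs"]))
```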

Mandatory Review Pipeline

Every fix MUST go through this pipeline. No exceptions. No shortcuts. The Stop hook enforces this.

Step 1: Self-check (always)

Re-read the fix against the man page and the test output. Does it address the root cause? Does it handle error paths? Does it break adjacent behavior?

Step 2: kernel-reviewer agent (always)

Dispatch the kernel-reviewer agent with fresh context. It reviews Rust idioms, safety, code reuse, API consistency. If it finds critical issues → revise the fix, restart from Step 1.

Step 3: Regression check (always)

Run ${CLAUDE_PLUGIN_ROOT}/scripts/regression-check.sh to verify no existing tests broke. Run cargo xtask clippy --package starry-kernel and cargo fmt --check. Any regression → fix must be revised.

Step 4: Codex independent review (for P0/P1 bugs, or if codex plugin is available)

Dispatch the Codex agent (via the codex:rescue skill or codex:codex-rescue agent) with:

The bug description and man page
The proposed fix
Ask for PASS / REVISE / REJECT with specific reasoning

If Codex says REVISE → address each point and re-submit. If Codex says REJECT → reconsider the approach.

Step 5: Independent re-derivation (for P0 bugs)

Dispatch a separate agent with ONLY the bug description + man page. NOT the proposed fix. Compare independently-derived fix with the proposed one. If they disagree → reconciliation round.

Step 6: Convergence assessment

All steps pass + 0 regressions → high confidence → commit (autonomous) or present (human)
Partial agreement after reconciliation → medium confidence → flag for human review
Cannot converge after max_rounds → low confidence → do NOT commit, escalate

Record all review rounds in strategy.json reviews section.
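
The three convergence outcomes in Step 6 can be sketched as a single decision function. The parameter names and the returned labels are assumptions for illustration, and the final "pending" branch is an addition that covers the case where review is simply not finished yet; the authoritative rules are in references/review-pipeline.md.

```python
def assess(all_steps_pass, regressions, partial_agreement,
           rounds_used, max_rounds, autonomous):
    """Map review results to (confidence, action)."""
    if all_steps_pass and regressions == 0:
        # High confidence: commit in autonomous mode, otherwise present.
        return ("high", "commit" if autonomous else "present")
    if partial_agreement and regressions == 0:
        # Reviewers only partly agree after reconciliation.
        return ("medium", "flag_for_human_review")
    if rounds_used >= max_rounds:
        # Could not converge within the round limit: never commit.
        return ("low", "escalate")
    # Assumption: anything else means the fix goes back for another round.
    return ("pending", "revise")
```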

Deterministic Tooling

The evolve skill relies on deterministic scripts for analysis — the LLM interprets results, but the scanning itself is reproducible and hallucination-free.

Lock Order Graph

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/lock-order-graph.py --json /tmp/lock-order.json

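The analysis is Rust ownership-aware: it distinguishes let guard = x.lock() (guard held until the end of its scope or an explicit drop) from x.lock().method() (temporary guard dropped at the semicolon), and it detects drop() calls. A cycle in the graph is concrete evidence of a potential deadlock, because it shows two code paths acquiring the same locks in conflicting order.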
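
To make the two ideas concrete, here is a minimal sketch of (a) telling a bound guard from a temporary one with a regular expression and (b) finding a cycle in a lock-order graph with depth-first search. It is not the code of lock-order-graph.py; the regex, the function names, and the lock names in the usage are assumptions. Note that `let _ = x.lock();` is deliberately excluded, since in Rust that form drops the guard immediately.

```python
import re

# Matches `let guard = a.b.lock();` but not `a.lock().method();`
# and not `let _ = a.lock();` (which drops the guard at once).
HELD = re.compile(r"let\s+(?:mut\s+)?(?!_\s*=)\w+\s*=\s*([\w.]+)\.lock\(\)\s*;")


def held_lock(line):
    """Return the lock expression if the guard is bound to a name, else None."""
    m = HELD.search(line)
    return m.group(1) if m else None


def find_cycle(edges):
    """edges: (held, acquired) pairs. Return one cycle as a list, or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    stack = []

    def dfs(n):
        color[n] = GREY
        stack.append(n)
        for m in sorted(graph[n]):
            if color[m] == GREY:          # back edge: m is on the current path
                return stack[stack.index(m):] + [m]
            if color[m] == WHITE:
                found = dfs(m)
                if found:
                    return found
        stack.pop()
        color[n] = BLACK
        return None

    for n in sorted(graph):
        if color[n] == WHITE:
            found = dfs(n)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(held_lock("let guard = self.inner.lock();"))
    print(find_cycle([("FD_TABLE", "PROC"), ("PROC", "FD_TABLE")]))
```

An edge `(A, B)` here means "B was acquired while a guard on A was still held", which is why only bound guards contribute edges.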

Pattern Scanner

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/pattern-scanner.py --json /tmp/pattern-hits.json

Reads rules from docs/starry-reports/patterns.json. Ships with 9 default patterns, including negative-to-unsigned casts, Ok(0) stubs, and AB/BA lock patterns. The rule set evolves: when a new bug class is found, add a grep rule to patterns.json so that the scanner finds further instances deterministically.
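
A rule-driven scanner of this kind can be sketched in a few lines. The JSON rule shape used here (`id` and `regex` keys) is an assumption, not the actual patterns.json schema, and the two example rules are simplified stand-ins for the default patterns.

```python
import json
import re

# Hypothetical rule file content. In a raw string, "\\s" is what JSON
# needs on disk to produce the regex token \s after decoding.
RULES_JSON = r'''
[
  {"id": "ok0-stub",      "regex": "^\\s*Ok\\(0\\)\\s*$"},
  {"id": "cast-to-usize", "regex": "\\bas\\s+usize\\b"}
]
'''


def load_rules(text):
    """Parse the rule list and compile each regex once."""
    return [(r["id"], re.compile(r["regex"])) for r in json.loads(text)]


def scan(source, rules):
    """Return (rule_id, line_number, stripped_line) for every match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule_id, rx in rules:
            if rx.search(line):
                hits.append((rule_id, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "fn f(len: isize) -> usize {\n    let n = len as usize;\n    Ok(0)\n}\n"
    for hit in scan(sample, load_rules(RULES_JSON)):
        print(hit)
```

Because a rule is only data, adding a newly discovered bug class means appending one entry to the rule file, with no change to the scanning code.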

Kernel Graph

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/kernel-graph.py --json /tmp/kernel-graph.json

Maps all 204 syscalls to subsystems, files, locks, unsafe blocks. Shows which untested syscalls touch the most shared state.

ABI Arg Count Checker

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/abi-check.py --json /tmp/abi-check.json

Compares StarryOS's uctx.argN() usage per syscall against Linux kernel SYSCALL_DEFINE arities. Catches mismatches where StarryOS reads the wrong number of arguments (e.g., 5 args when Linux passes 6). Run at startup and before writing any new syscall test.
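
The comparison can be illustrated as follows. This sketch assumes the accessors are written `uctx.arg0()`, `uctx.arg1()`, and so on (zero-based), and that the number of arguments read equals the highest index plus one; both are assumptions about the handler code, and the handler snippets in the usage are invented. The Linux arities shown (6 for mmap, 3 for read) are the real ones.

```python
import re

ARG_CALL = re.compile(r"uctx\.arg(\d)\(\)")


def args_read(handler_src):
    """Highest uctx.argN() index used, plus one (0 if none are used)."""
    idx = [int(n) for n in ARG_CALL.findall(handler_src)]
    return max(idx) + 1 if idx else 0


def abi_mismatches(handlers, linux_arity):
    """Return {syscall: (args_read, linux_arity)} for every disagreement."""
    out = {}
    for name, src in handlers.items():
        used = args_read(src)
        expected = linux_arity.get(name)
        if expected is not None and used != expected:
            out[name] = (used, expected)
    return out


if __name__ == "__main__":
    handlers = {
        # Hypothetical handler bodies, reduced to their argument reads.
        "mmap": "sys_mmap(uctx.arg0(), uctx.arg1(), uctx.arg2(), uctx.arg3(), uctx.arg4())",
        "read": "sys_read(uctx.arg0(), uctx.arg1(), uctx.arg2())",
    }
    print(abi_mismatches(handlers, {"mmap": 6, "read": 3}))
```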

Change Tracker

python3 ${CLAUDE_PLUGIN_ROOT}/scripts/change-tracker.py --json /tmp/changes.json

Git-aware: identifies which tests need re-running based on file changes since last run.

Reflect Phase (cross-run synthesis)

Every 3-5 runs within a session, the loop pauses to reflect:

Run the pattern scanner — any new hits since last reflect?
Run the lock order graph — any new cycles since last reflect?
Read the last N runs' results from strategy.json
Identify cross-cutting patterns (e.g., "3 bugs all involve as _ casts in different syscalls")
Generate new pattern scanner rules from discovered bugs (deterministic grep rules, not LLM guesses)
Update docs/starry-reports/patterns.json with new rules
Update priorities based on what techniques actually worked
Append insights to docs/starry-reports/insights.md

Budget: ~2K tokens. Saves tokens downstream by improving target selection.

Session Flow

Load strategy + run change-tracker + run pattern-scanner + run abi-check
    │
    ▼
Compute priorities (incorporating deterministic scan results)
    │
    ├─ autonomous → pick top target
    └─ human → present top 5, ask
    │
    ▼
Is target a sweep or deep?
    ├─ sweep batch → run sweep mode on 5-10 targets
    └─ deep target → run deep mode on 1 target (includes MANDATORY review pipeline)
    │
    ▼
Update strategy.json (coverage, effectiveness, queue, review rounds)
    │
    ▼
Every 3-5 runs → REFLECT (run scanners, synthesize, update patterns)
    │
    ▼
Check stopping conditions:
    ├─ session budget exhausted (default 5 cycles) → stop
    ├─ no targets above minimum value → stop
    ├─ human requests stop → stop
    └─ otherwise → loop back to "Compute priorities"
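
The flow above reduces to a bounded loop with a periodic reflect step. The skeleton below is a sketch under stated assumptions: the callables, the `value` field on a target, and the fixed `reflect_every=3` interval (the text says every 3-5 runs) are all placeholders, and the human-requested stop is omitted.

```python
def run_session(pick_target, run_cycle, reflect,
                budget=5, reflect_every=3, min_value=0.0):
    """Run cycles until the budget is spent or no target is worth running.

    pick_target() returns a dict with a "value" key, or None when the
    queue is empty. Returns the number of completed cycles.
    """
    done = 0
    while done < budget:
        target = pick_target()
        if target is None or target["value"] <= min_value:
            break                      # no targets above minimum value
        run_cycle(target)              # sweep batch or deep cycle
        done += 1
        if done % reflect_every == 0:
            reflect()                  # cross-run synthesis
    return done


if __name__ == "__main__":
    queue = iter([{"name": "futex", "value": 3}, {"name": "brk", "value": 1}])
    log = []
    n = run_session(lambda: next(queue, None),
                    lambda t: log.append(t["name"]),
                    lambda: log.append("reflect"))
    print(n, log)
```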

Token Budget

Sweep: ~2K tokens per target (read handler, quick pattern check)
Deep: ~15K tokens per target (full cycle with review pipeline)
Reflect: ~2K tokens (run deterministic tools, synthesize)
Default session budget: 5 deep cycles or 2 sweeps + 3 deeps
Early termination: if a target shows 0 divergences in sweep, skip it in <500 tokens
Deterministic tools (pattern scanner, lock graph, etc.) cost 0 LLM tokens
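
As a worked example of the figures above, the estimate for a session is a weighted sum of the per-item costs. The sketch assumes a "sweep" in the default budget means one batch of 5 targets (the low end of the 5-10 range); that reading is an assumption, as is the function name.

```python
SWEEP_PER_TARGET = 2_000   # ~2K tokens per swept target
DEEP_PER_TARGET = 15_000   # ~15K tokens per deep target
REFLECT = 2_000            # ~2K tokens per reflect phase


def session_tokens(deeps=0, sweep_targets=0, reflects=0):
    """Rough LLM token estimate; deterministic tools add nothing."""
    return (deeps * DEEP_PER_TARGET
            + sweep_targets * SWEEP_PER_TARGET
            + reflects * REFLECT)


if __name__ == "__main__":
    print(session_tokens(deeps=5))                    # 75000
    print(session_tokens(deeps=3, sweep_targets=10))  # 65000
    print(session_tokens(deeps=5, reflects=1))        # 77000
```

Under these assumptions the two default session shapes land in the same range, roughly 65K to 75K tokens before any reflect phase.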

Additional Resources

Reference Files

references/review-pipeline.md — Full adaptive review protocol with convergence rules
references/strategy-schema.md — Complete strategy.json schema and field definitions

Deterministic Scripts

${CLAUDE_PLUGIN_ROOT}/scripts/lock-order-graph.py — Static lock ordering + cycle detection (Rust ownership-aware)
${CLAUDE_PLUGIN_ROOT}/scripts/pattern-scanner.py — Regex-based bug pattern scanner with evolving rule set
${CLAUDE_PLUGIN_ROOT}/scripts/kernel-graph.py — Kernel architecture graph (204 syscalls mapped)
${CLAUDE_PLUGIN_ROOT}/scripts/change-tracker.py — Git-aware change detection
${CLAUDE_PLUGIN_ROOT}/scripts/stress-test.sh — Multi-run SMP-sweeping test runner
${CLAUDE_PLUGIN_ROOT}/scripts/regression-check.sh — Full regression suite
${CLAUDE_PLUGIN_ROOT}/scripts/strace-profiler.sh — Application syscall profiling
${CLAUDE_PLUGIN_ROOT}/scripts/draft-pr.sh — PR draft generator (never auto-submits)

Before Finishing

Before presenting results to the user, self-check:

Findings are backed by evidence (tier 1-4 only). Pending hypotheses are clearly marked.
Any proposed fix went through the review pipeline. Incomplete reviews are finished, not skipped.
State files are updated: strategy.json, known.json, journal.md.
If in autonomous mode: the cycle completed fully or was stopped at a clean boundary, not mid-fix.
