codegen-validation | claude-for-hardware

Codegen Validation

Overview

A codegen backend is correct when the code it emits computes the right answer on a real machine. Reading the assembly proves nothing; a plausible-looking instruction sequence with a wrong ABI detail or a clobbered callee-saved register can pass every review by eye and still fail on hardware.

Core principle: Execution-validate. Compile a known program, run the output on a real CPU or a fast emulator, and assert the observed result. If you didn't run it, you don't know it works.

When to Use

Bringing up a new codegen target or instruction selection
Implementing calling conventions, stack frames, register spilling, relocations
Debugging "the assembly looks right but the answer is wrong"
Adding an optimization pass and needing evidence that it preserves behavior

The Validation Loop

known program (expected result known)
   -> codegen -> machine code
   -> run on real CPU / fast emulator
   -> assert observed result == expected

Pick programs with known answers. Start tiny: return a constant, add two args, a call to a leaf function, a loop that sums. Each isolates one capability (literals, ABI, calls, control flow).
Run on something real and fast. A native emulator for your target CPU closes the loop in milliseconds, so it can run on every build. The point is real execution semantics, not a model of what you think the instruction does.
Assert the observed value, not "it didn't crash." Read the result register or memory and compare to the expected answer.
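
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Vulcan's harness: the `run` function is a toy stand-in for a real emulator and handles only three RV32I instructions (ADDI, ADD, JALR), the hand-encoded instruction words stand in for the backend's output, and the `HALT` sentinel return address is an assumption made for this sketch.

```python
# Toy stand-in for a real emulator: executes only ADDI, ADD, and JALR (RV32I).
ZERO, RA, A0, A1 = 0, 1, 10, 11   # RISC-V integer register numbers
MASK = 0xFFFFFFFF
HALT = 0xFFFFFFFC                 # hypothetical sentinel: "returned to the harness"

def addi(rd, rs1, imm):
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13

def add(rd, rs1, rs2):
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | 0x33

def ret():
    return (RA << 15) | 0x67      # jalr x0, 0(ra)

def sext12(value):
    """Sign-extend a 12-bit immediate."""
    return value - 0x1000 if value & 0x800 else value

def run(code, args=(), max_steps=1000):
    """Execute 32-bit words loaded at address 0 and return the value in a0."""
    x = [0] * 32
    x[RA] = HALT                  # returning from the entry function ends the run
    for i, value in enumerate(args):
        x[A0 + i] = value & MASK  # arguments arrive in a0, a1, ...
    pc = 0
    for _ in range(max_steps):
        if pc == HALT:
            return x[A0]
        insn = code[pc // 4]
        opcode = insn & 0x7F
        funct3 = (insn >> 12) & 7
        rd, rs1, rs2 = (insn >> 7) & 31, (insn >> 15) & 31, (insn >> 20) & 31
        imm = sext12(insn >> 20)
        next_pc = pc + 4
        if opcode == 0x13 and funct3 == 0:                       # ADDI
            result = (x[rs1] + imm) & MASK
        elif opcode == 0x33 and funct3 == 0 and insn >> 25 == 0:  # ADD
            result = (x[rs1] + x[rs2]) & MASK
        elif opcode == 0x67:                                     # JALR
            result = next_pc
            next_pc = (x[rs1] + imm) & MASK & ~1
        else:
            raise ValueError(f"unsupported instruction {insn:#010x} at pc={pc:#x}")
        if rd != ZERO:            # x0 is hard-wired to zero
            x[rd] = result
        pc = next_pc
    raise RuntimeError("step limit exceeded")

# Known programs with known answers; in a real harness the code comes from the backend.
CASES = [
    ("return constant", [addi(A0, ZERO, 42), ret()], (), 42),
    ("add two args", [add(A0, A0, A1), ret()], (3, 4), 7),
]

for name, code, args, expected in CASES:
    observed = run(code, args)
    assert observed == expected, f"{name}: expected {expected}, observed {observed}"
```

The assertion compares the value read back from a0, so a program that runs to completion with the wrong answer still fails the test.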

This Catches Bugs Review Misses

Execution validation reliably catches the codegen bugs that look fine on paper:

A select/conditional-move lowering that picks the wrong operand.
Critical-edge splitting that drops or duplicates a value.
A prologue/epilogue that fails to save/restore a callee-saved register, or a non-leaf function that fails to preserve its return address (the corruption shows up only when the function actually makes a call).
ABI mistakes: argument in the wrong register, stack misaligned, return value in the wrong place.
Relocations that resolve to the wrong address.

These are exactly the bugs that turn into silicon respins or weeks of "intermittent" debugging if they escape. A handful of executed tests finds them in seconds.
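
As an illustration of the first item, here is a sketch of a select lowering with its operands swapped and an executed test that exposes it. Everything in it is hypothetical: the two-instruction toy ISA (`mv`, `beqz`), the `lower_select_*` functions, and the register names are made up for the example and are not taken from any real backend.

```python
def run(program, regs):
    """Toy interpreter for a made-up ISA; returns the final value of a0."""
    regs = dict(regs)
    pc = 0
    while pc < len(program):
        op, a, b = program[pc]
        if op == "mv":            # mv rd, rs
            regs[a] = regs[b]
            pc += 1
        elif op == "beqz":        # beqz rs, n: skip the next n instructions if rs == 0
            pc += 1 + (b if regs[a] == 0 else 0)
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs["a0"]

# select(cond, t, f) must yield t when cond != 0, else f.
def lower_select_buggy(cond, t, f):
    # Operands swapped: reads plausibly in a listing, computes the wrong value.
    return [("mv", "a0", t), ("beqz", cond, 1), ("mv", "a0", f)]

def lower_select_fixed(cond, t, f):
    return [("mv", "a0", f), ("beqz", cond, 1), ("mv", "a0", t)]

def check_select(lower):
    """Execute the lowering for both condition values; return the mismatches."""
    program = lower("a1", "a2", "a3")
    failures = []
    for cond in (0, 1):
        expected = 111 if cond else 222   # distinct operands so a swap is visible
        observed = run(program, {"a0": 0, "a1": cond, "a2": 111, "a3": 222})
        if observed != expected:
            failures.append((cond, expected, observed))
    return failures

assert check_select(lower_select_fixed) == []
assert check_select(lower_select_buggy) == [(0, 222, 111), (1, 111, 222)]
```

The test uses distinct operand values and exercises both condition values; with equal operands, or with only one condition tested, the swapped version could pass.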

Build It Up By Capability

Order tests so each new one depends only on capabilities already validated:

Return a constant (codegen + run harness works at all).
Arithmetic on arguments (ABI in, result out).
Stack frame (alloca, alignment, prologue/epilogue).
Calls (the full ABI, callee-saved save/restore, return address).
Control flow (branches, loops, phi/select).
Spilling (more live values than registers).

When a higher test fails and the lower ones pass, the bug is in the new capability or in how it interacts with the ones already validated. That ordering is the debugger.
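
One way to make the ordering do that work is a runner that stops at the first failure and names the capability. The sketch below assumes each test is a callable that raises `AssertionError` on a mismatch; the `expect` helper, the runner name, and the stubbed observed values are hypothetical placeholders for tests that would really compile and execute a program.

```python
def expect(observed, expected):
    assert observed == expected, f"expected {expected}, observed {observed}"

def first_failing_capability(tests):
    """Run (capability, test) pairs in order; return (capability, message) for the first failure, or None."""
    for capability, test in tests:
        try:
            test()
        except AssertionError as exc:
            return capability, str(exc)
    return None

# Stubs: a real test would run generated code and pass the observed result to expect().
TESTS = [
    ("constant", lambda: expect(42, 42)),
    ("arithmetic", lambda: expect(7, 7)),
    ("stack frame", lambda: expect(1, 1)),
    ("calls", lambda: expect(0, 5)),          # simulated failure
    ("control flow", lambda: expect(10, 10)),
    ("spilling", lambda: expect(36, 36)),
]

assert first_failing_capability(TESTS) == ("calls", "expected 5, observed 0")
```

Stopping at the first failure keeps later tests, which depend on the broken capability, from burying the signal under secondary failures.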

Red Flags

Smell	Do instead
"The disassembly looks correct"	Execute it and assert the result
Test asserts "no crash"	Assert the actual computed value
Slow full-system sim per test	Fast target emulator, run every build
One big program as the only test	Capability-ordered tests, smallest first
Skipping ABI/spill tests	Those are exactly where the bugs hide

Midstall House Style

Vulcan is the reference: a reusable codegen system whose RISC-V output is execution-validated on Midstall's River CPU via a fast river-emulator. That harness caught real codegen bugs (select lowering, critical-edge splitting, return-address save) that passed inspection.
Targets nest under vulcan-target/<arch>; validation is a first-class part of bring-up, not an afterthought.
No em dashes, no emoji. The compare-against-truth loop is the same shape as differential-verification.