experiment-planning | nerd

Stats

Actions

Tags

experiment-planning | nerd

Experiment Planning for Nerd

Every Plan Needs

Competing theories (3+) — not just "is X optimal?" but "what's really going on?"
Testable predictions — what we'd observe if each theory is correct
Ablation baseline — what happens if we remove the feature entirely?
Metric — specific, computable (F1, nDCG, latency, token count)
Sweep spec — ranges, steps, --max-combos cap
Ground truth — how "correct" is defined, with circularity caveats
Theory-linked acceptance criteria — each theory can be confirmed or rejected

Theory Types

Parameter is wrong: a different value would improve the metric
Model is wrong: a different model (not just parameters) would fit better
Feature is unnecessary: removing it entirely causes no degradation
Metric is wrong: we're optimizing the wrong thing
Data is the bottleneck: the parameter doesn't matter because the input data is the real problem
Architecture is the bottleneck: no parameter value can fix this

Ground Truth Strategies

Auto-resolved data: measures sensitivity, not absolute quality (circular)
User feedback: clicks/engagement (sparse, position-biased)
Hand-labeled: gold standard (expensive, small samples)
Synthetic: tests metric computation, not real-world quality

Feasibility Check

combos = product(range_sizes)
time = combos * data_size * cost_per_eval
if time > 1 hour: reduce ranges or use random search

Analytical vs Experimental

Not all parameters can be swept. Parameters in non-executable files (markdown agent prompts, documentation, config comments) or those requiring human judgment are analytical — they can be reasoned about with competing theories but cannot be iterated in a nerd-loop.

For analytical parameters:

Still generate competing theories and testable predictions
Recommendations come from reasoning and analogies, not measured data
Do NOT propose sweep harnesses or eval commands
Mark as experiment_type: "analytical" in the plan

For experimentable parameters:

Full sweep harness with automated metric
The metric must be: automated (shell command → number), deterministic, sensitive to changes, and fast (<5 min)

Anti-Patterns

Single-hypothesis experiments (only testing "is the parameter optimal?")
Sweeping dead code parameters (verify the hot path first)
Optimizing metrics that don't correlate with user satisfaction
No ablation (never tested if the feature matters at all)
Insufficient data (<30 points per config = noise)
No baseline (always include current values as comparison)
Proposing a nerd-loop on parameters that have no automated metric — use analytical batch analysis instead