research-principles | clab

Research Principles

There is no "success" or "failure" in research, only insights and confidence levels.

Hypothesis-Driven Exploration

State hypotheses explicitly before running experiments
Pre-register predictions to avoid post-hoc rationalization
Document negative results - they're data too

Red-team Your Results

Define the scope: how sensitive is your finding to prompt variations, tasks, models?
Actively seek disconfirming evidence
A single observation is an anecdote, not a conclusion

Documentation Standards

Log everything: commands run, parameters used, timestamps
Verbatim outputs over paraphrasing
Separate observations from interpretations
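One way to follow these standards is to append a structured record for every run. The sketch below is only an illustration: the function name `log_run`, the field names, and the JSON Lines file format are assumptions, not something this page prescribes.

```python
import json
import time
from pathlib import Path


def log_run(log_path, command, params, output, interpretation=None):
    """Append one experiment record to a JSON Lines file (hypothetical helper)."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),  # local time plus UTC offset
        "command": command,                # the exact command that was run
        "params": params,                  # parameters used for this run
        "output": output,                  # verbatim output, not a paraphrase
        "interpretation": interpretation,  # kept separate from the observation
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode, so earlier records are never overwritten.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Keeping `output` and `interpretation` in separate fields makes it possible to revisit the raw observation later without the original reading attached to it.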

Rigor Over Speed

Quick proxies (keyword grep, eyeballing samples) are fine for early triage — deciding what's worth investigating. But any result that feeds into hypothesis updates or gets reported must be analyzed rigorously.
Use LLM judges for subjective classification, not regex/keyword heuristics. Regex misses nuance and produces misleading stats. Reserve regex only for purely mechanical checks (e.g. "contains non-ASCII characters").
Audit before scaling: run judges on a small batch first, verify the scores match your intuition, then scale.
Report effect sizes with context: sample sizes, variance, whether the effect is prompt-specific or general.
Include verbatim examples alongside aggregates — numbers without examples are uninterpretable.
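As a rough illustration of reporting an effect size with context, the sketch below summarizes two groups of numeric scores (for example, judge scores under two prompts). Cohen's d with a pooled standard deviation is one common choice and is an assumption here, as is the function name `effect_summary`. Each group needs at least two values.

```python
import math
import statistics


def effect_summary(a, b):
    """Summarize two samples: sizes, means, sample SDs, and Cohen's d."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variance
    # Pooled standard deviation, weighting each group by its degrees of freedom.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    # With zero variance in both groups the ratio is undefined.
    d = (mean_a - mean_b) / pooled_sd if pooled_sd > 0 else float("nan")
    return {
        "n_a": n_a, "n_b": n_b,
        "mean_a": mean_a, "mean_b": mean_b,
        "sd_a": math.sqrt(var_a), "sd_b": math.sqrt(var_b),
        "cohens_d": d,
    }
```

The dictionary is meant to be reported whole, next to a few verbatim examples from each group, rather than quoting `cohens_d` on its own.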

Reduce Uncertainty at the Fastest Possible Rate

The goal of research is not to run experiments — it's to update your beliefs. Every decision should optimize for information gain per unit time.

Before launching a heavy experiment, ask: is there a cheaper way to get the same signal?
A single message to a model via API call can sometimes resolve a question that would otherwise take a day-long experiment.
Prefer many small, fast experiments over one large, slow one.

The Escalation Ladder

When trying to get a model to do something, try approaches in this order. Only escalate when simpler methods fail or plateau:

Zero-shot prompting — Try it in a chat interface. Send 10-100 messages, iterate on the prompt.
Few-shot prompting — Add 1-10 gold examples of what you want.
Many-shot prompting — Fill the context window with labeled examples.
Best-of-N sampling — Sample N times, pick the best via a judge or reward model. No training needed.
Supervised fine-tuning — Only when prompting hits a ceiling. Start with an API (e.g. OpenAI) for fast iteration.
RL/RLHF — Last resort. Slower iteration, more complex code, harder to debug.

Each step is roughly an order of magnitude more expensive in time and complexity. Don't skip steps.
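The Best-of-N step in the ladder above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `generate` and `score` are hypothetical callables standing in for whatever model call and judge or reward model you use.

```python
def best_of_n(generate, score, prompt, n=8):
    """Sample n candidates for a prompt and return the highest-scoring one.

    generate(prompt) -> str            (hypothetical: one sampled completion)
    score(prompt, candidate) -> float  (hypothetical: judge or reward model)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    # Index of the highest score; ties resolve to the earliest candidate.
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index], candidates
```

Returning all candidates as well as the winner keeps the rejected samples available for inspection, which helps when auditing whether the judge is picking sensibly.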

Cache LLM Responses

Any experiment involving LLM API calls should cache responses to disk. This lets you:

Kill and restart scripts without losing progress
Tweak analysis code without re-running inference
Resume from where you left off after errors

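Use one file per response, keyed by a deterministic hash of the full request (model, prompt, temperature, etc.). Use hashlib.md5 (or another hashlib digest), not Python's built-in hash(): hash() of a string is salted per process, so its value changes between runs unless PYTHONHASHSEED is fixed.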
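A minimal sketch of such a cache follows, assuming the request is a JSON-serializable dict and the response is JSON-serializable too. The names `request_key`, `cached_call`, `call_model`, and the `llm_cache` directory are hypothetical; `call_model` stands in for your actual API call.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("llm_cache")  # hypothetical cache location


def request_key(request):
    """Deterministic key: MD5 of the request serialized with sorted keys."""
    canonical = json.dumps(request, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def cached_call(request, call_model, cache_dir=CACHE_DIR):
    """Return the cached response for a request, calling the model only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{request_key(request)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = call_model(request)  # hypothetical: performs the real API call
    payload = json.dumps({"request": request, "response": response}, ensure_ascii=False)
    # Write to a temporary file and rename, so a killed script never leaves
    # a half-written cache entry behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(payload, encoding="utf-8")
    tmp.replace(path)
    return response
```

Serializing with `sort_keys=True` means two requests with the same fields in a different order share one cache entry. If you sample the same prompt several times at nonzero temperature, include a sample index in the request dict so each sample gets its own file.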

Avoid Common Pitfalls

Confirmation bias: Actively seek disconfirming evidence
Cherry-picking: Don't ignore results that don't fit
Over-interpreting: Single observations are anecdotes, not conclusions
Blind fixing: Don't try random fixes without understanding the root cause
False precision: A regex classifier giving "43.1% human fabrication" looks precise but the methodology is sloppy — prefer proper LLM evaluation with transparent criteria