baseline-first-modeling | datapowers

Baseline-First Modeling

Overview

A baseline is the floor every other model must beat. Without one, "85% accuracy" is a number, not a result.

Core principle: If a trivial baseline already hits the deploy threshold, you don't need a model — you need a rule.
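
To make the floor concrete, the sketch below computes the accuracy of always predicting the most common label. The label counts and the function name `majority_class_accuracy` are illustrative assumptions, not part of the original workflow.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical validation labels: 90 negatives, 10 positives.
y_val = [0] * 90 + [1] * 10
floor = majority_class_accuracy(y_val)
print(f"majority-class floor: {floor:.2f}")
```

On this made-up label set the floor is 0.90, so a model reporting 85% accuracy would be worse than predicting the majority class every time.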

The Iron Law

NO COMPLEX MODEL (XGBoost, deep net, fine-tune) WITHOUT A DOCUMENTED BASELINE FIRST

The baseline lives in code, in the experiment spec §5, and as a tracked MLflow run.

When to Use

Start of every modeling phase.
Before tuning anything.
After a major data refresh (re-baseline).
When a stakeholder asks "is this good?", only the baseline lets you answer.

Baselines by Problem Type

Problem	Trivial baseline	Strong baseline
Binary classification	Majority class	Logistic regression on numeric features
Multi-class	Class prior	Logistic regression / linear SVM
Regression	Mean / median	Ridge regression on numeric features
Time-series forecast	Last value, seasonal naive	Exponential smoothing / ARIMA
Ranking / recsys	Popularity, recency	BPR / ALS / LightFM
NLP classification	Class prior	TF-IDF + logistic regression
NLP generation	Retrieve nearest training example	Small fine-tune of base model
Vision classification	Class prior	Linear probe on a pretrained backbone
Anomaly detection	Rate threshold on a single feature	IsolationForest
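
As a rough illustration of the regression and forecasting rows, the trivial baselines can be written in a few lines of plain Python. The function names, the toy series, and the choice of mean absolute error as the metric are assumptions made for this sketch.

```python
from statistics import median


def median_prediction(y_train, n):
    """Regression baseline: predict the training median for all n rows."""
    return [median(y_train)] * n


def last_value_forecast(history, horizon):
    """Forecast baseline: repeat the final observed value."""
    return [history[-1]] * horizon


def seasonal_naive_forecast(history, horizon, season_length):
    """Forecast baseline: reuse the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical series with a period of 4, plus the next four true values.
history = [10, 20, 30, 40, 12, 22, 32, 42]
actual = [14, 24, 34, 44]

naive = last_value_forecast(history, horizon=4)
seasonal = seasonal_naive_forecast(history, horizon=4, season_length=4)

print("last value MAE:", mean_absolute_error(actual, naive))
print("seasonal naive MAE:", mean_absolute_error(actual, seasonal))
```

On this toy series the last-value forecast has an MAE of 14.0 and the seasonal naive forecast an MAE of 2.0, which shows why a seasonal series should be judged against the seasonal baseline rather than the weaker one.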

The Process

Step 1: Implement the trivial baseline

Ten lines of code or fewer.
Tracked in MLflow.
Score noted in spec §5.

Step 2: Implement the strong baseline

Use the same validation strategy that the future complex model will use.
Same data hash, same pre-processing.
Tracked in MLflow with confidence intervals.
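
The source does not say how the data hash is computed. One possible sketch, assuming the data fits in memory as rows of simple Python values and that pre-processing carries a version tag, is shown below; `data_fingerprint` and `preprocessing_version` are hypothetical names.

```python
import hashlib


def data_fingerprint(rows, preprocessing_version):
    """SHA-256 over a preprocessing version tag plus every row, in order."""
    h = hashlib.sha256()
    h.update(preprocessing_version.encode("utf-8"))
    for row in rows:
        # repr() of a tuple of ints, floats, and strings is deterministic,
        # so identical rows always contribute identical bytes.
        h.update(repr(tuple(row)).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


# Hypothetical rows; in practice these would come from the training split.
rows = [(1, 2.0, "x"), (2, 3.5, "y")]
DATA_HASH = data_fingerprint(rows, preprocessing_version="v1")
print(DATA_HASH[:12])
```

Because row order feeds into the digest, a reshuffled split yields a different hash, which is usually what you want when comparing runs. For a pandas DataFrame, `pandas.util.hash_pandas_object` is one alternative building block.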

Step 3: Decide

Strong baseline beats deploy threshold? → Stop. Ship the baseline. Open a ticket for re-evaluation if context changes.
Strong baseline misses threshold? → Authorize complex modeling. To count as an improvement, the complex model must beat the strong baseline by more than the width of the baseline's confidence interval.
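
The decision in Step 3 can be sketched as two small functions. This assumes a higher-is-better metric, reads "beats the threshold" as "meets or exceeds", and reads "CI" as the full width of the baseline's interval; the `Score` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Score:
    value: float    # point estimate of the primary metric
    ci_low: float   # lower bound of the confidence interval
    ci_high: float  # upper bound of the confidence interval

    @property
    def ci_width(self):
        return self.ci_high - self.ci_low


def decide_after_baseline(strong, deploy_threshold):
    """Ship the baseline if it reaches the threshold, else go complex."""
    if strong.value >= deploy_threshold:
        return "ship_baseline"
    return "authorize_complex_modeling"


def complex_model_counts(complex_value, strong):
    """True only if the gain over the baseline exceeds its CI width."""
    return (complex_value - strong.value) > strong.ci_width


# Hypothetical numbers for illustration only.
strong = Score(value=0.80, ci_low=0.78, ci_high=0.82)
print(decide_after_baseline(strong, deploy_threshold=0.85))
print(complex_model_counts(0.83, strong))
print(complex_model_counts(0.86, strong))
```

A stricter variant would compare `ci_low` rather than `value` against the threshold, so that the baseline ships only when even the pessimistic estimate clears the bar.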

Pattern

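from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
import mlflow

# Assumed to be defined elsewhere: DATA_HASH, X_train, y_train, X_val, y_val,
# and log_metrics_with_ci (presumably a project helper that logs each metric
# together with its confidence interval).

# Trivial baseline: always predict the most frequent class.
with mlflow.start_run(run_name="baseline_majority"):
    mlflow.log_param("data_hash", DATA_HASH)
    mlflow.log_param("kind", "majority")
    m = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    log_metrics_with_ci(m, X_val, y_val)

# Strong baseline: logistic regression on the same data and split.
with mlflow.start_run(run_name="baseline_logreg"):
    mlflow.log_param("data_hash", DATA_HASH)
    mlflow.log_param("kind", "logreg")
    m = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    log_metrics_with_ci(m, X_val, y_val)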

Log: primary metric ± CI, segment metrics, calibration, prediction latency.
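
The pattern's `log_metrics_with_ci` helper is not defined in the source. One piece it could plausibly compute is a percentile bootstrap interval for the primary metric, sketched here in plain Python. The function names, the default of 1000 resamples, and the 95% level are assumptions.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric=accuracy,
                 n_resamples=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) via the percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample row indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return metric(y_true, y_pred), lo, hi


# Hypothetical predictions: 80 of 100 correct.
y_true = [1] * 100
y_pred = [1] * 80 + [0] * 20
point, lo, hi = bootstrap_ci(y_true, y_pred)
print(f"accuracy = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The three numbers could then be passed to `mlflow.log_metrics` as a dict, for example under keys such as `accuracy`, `accuracy_ci_low`, and `accuracy_ci_high` (the key names are illustrative).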

Anti-Patterns

Anti-pattern	Cost
Skip to XGBoost "to save time"	No reference; cannot tell if features help
Baseline on different data than the model	Comparison is meaningless
Baseline once, never re-run	Data drifts; old baseline lies
Reporting "model A vs. model B" without baseline	Neither may beat trivial
Baseline only on global metric	Segment performance unknown
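
The last row is easy to guard against with a per-segment breakdown. The sketch below assumes each validation row carries a segment label; the segment names and data are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_segment(y_true, y_pred, segments):
    """Accuracy computed separately for each segment label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, s in zip(y_true, y_pred, segments):
        totals[s] += 1
        hits[s] += int(t == p)
    return {s: hits[s] / totals[s] for s in totals}


# Hypothetical validation set: 8 returning users, 2 new users.
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]
segments = ["returning"] * 8 + ["new"] * 2

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_segment = accuracy_by_segment(y_true, y_pred, segments)
print("overall:", overall)
print("per segment:", per_segment)
```

Here the overall accuracy is 0.8 while the "new" segment scores 0.0, a gap the global number alone would hide.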

Red Flags

Strong baseline beats deploy threshold and you're still building a deep model "because it's interesting."
Complex model beats baseline by less than the CI width.
You can't show the baseline run in MLflow.

Verification Checklist

Trivial baseline implemented and tracked.
Strong baseline implemented and tracked.
Both use the same data hash and validation strategy as planned complex models.
Scores ± CI documented in spec §5.
Decision recorded: ship baseline, or authorize complex modeling.

Cross-References

REQUIRED BEFORE: datapowers:test-driven-modeling
REQUIRED WITH: datapowers:validation-strategy-design, datapowers:experiment-tracking, datapowers:model-evaluation-rigorously