casetrack
Lifecycle data management for computational biology pipelines on HPC.
Answers two questions about a multi-patient, multi-specimen, multi-assay
cohort: "is this analysis complete?" and "is this sample usable?"
New to casetrack? Start with the 15-minute Quickstart —
clone → init → register a 3-patient demo cohort → record an analysis →
query for pending work. No prior knowledge assumed.
Storage layers, one CLI:
- v0.10 (current, alpha): register-cohort loads patients + specimens
+ assays from one schema-native wide sample sheet in a single transaction
(proposal 0012). Builds on the additive sibling-table layers added since
v0.6: cohort-level artifacts (joint VCFs / PoNs / matrices, proposal 0009),
versioned reference artifacts with downstream staleness (proposal 0010),
and artifact-to-artifact lineage with transitive derived_stale
(proposal 0011). The three-level core is untouched by all of these.
- v0.6: identity layer on top of v0.4. Every project gets a
project_id slug at init, persisted in TOML + project_meta
SQLite table + ~/.casetrack/registry.json, so commands can address a
project by name (casetrack --project hgsoc-2026 query "...") instead
of a fragile path. Hierarchy IDs (patient_id, specimen_id,
assay_id) are now validated against an ASCII regex at insert time —
typos in samplesheets fail loudly at register, not silently
downstream. Per-level escape hatches via [levels.<level>] id_pattern
for legacy LIMS IDs. See proposal 0005.
- v0.4: QC / censoring / consent subsystem. Every read path
(status, rerun, export, query, dashboard) filters out
QC-failed and consent-revoked entities by default. SLURM summary TSVs
auto-flag via qc_pass / qc_fail_reason / qc_warn columns.
Paired-design readiness via casetrack cohort --pair-by.
- v0.3 (project mode): SQLite-backed project directory with normalized
patient → specimen → assay tables, enforced foreign keys,
typed columns, and DuckDB-powered SQL queries. Survives DB corruption
— everything is regenerable from casetrack.toml + provenance.jsonl.
- v0.2 (flat mode — deprecated): one TSV manifest per project, one
row per sample. Still works, loud deprecation warning, removed in v1.0.
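The insert-time ID validation described for v0.6 can be sketched roughly as follows. This is an illustration only, not casetrack's actual code: the default pattern, the `validate_id` helper, and the shape of the `overrides` mapping (standing in for per-level `id_pattern` settings) are all assumptions.

```python
import re

# Hypothetical default: ASCII letters, digits, '_', '.', '-', starting with
# a letter or digit. The real casetrack pattern may differ.
DEFAULT_ID_PATTERN = r"[A-Za-z0-9][A-Za-z0-9_.-]*"


def validate_id(level, value, overrides=None):
    """Return value if it fully matches the pattern for its level.

    `overrides` is a hypothetical stand-in for per-level id_pattern
    settings, e.g. {"specimen_id": r"[A-Z]+/\\d+"} for legacy LIMS IDs.
    Raises ValueError so that a typo fails loudly at registration.
    """
    pattern = (overrides or {}).get(level, DEFAULT_ID_PATTERN)
    if re.fullmatch(pattern, value, flags=re.ASCII) is None:
        raise ValueError(
            f"invalid {level}: {value!r} does not match {pattern!r}"
        )
    return value
```

Under these assumptions, `validate_id("patient_id", "PT 001")` raises because of the embedded space, while a per-level override lets an ID such as `LIMS/42` through for that level only.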
Upgrade paths:
- v0.2 → v0.3 via casetrack migrate (guide).
- v0.3 → v0.4 via casetrack migrate-qc (guide).
- v0.4 → v0.6 is automatic for new projects (init writes project_id); legacy projects continue to work without one until v0.6 final ships casetrack migrate-project-id.
cohort_v3/
├── casetrack.toml # declared schema — git-tracked source of truth
├── casetrack.db # SQLite, WAL + busy_timeout=30000 + FK enforcement
├── provenance.jsonl # append-only audit log (git-trackable)
├── .gitignore # excludes casetrack.db, casetrack.db-wal/-shm, exports/
└── sandbox/ # preserved source TSVs (migration artifact)
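For readers unfamiliar with the SQLite settings named in the tree above (WAL, busy_timeout=30000, FK enforcement), here is a minimal sketch using Python's standard sqlite3 module. The `open_project_db` helper and the table columns are assumptions made for illustration; casetrack's real schema is declared in casetrack.toml and will differ.

```python
import sqlite3


def open_project_db(path):
    """Open a SQLite DB with the three settings listed for casetrack.db."""
    conn = sqlite3.connect(path)
    # Write-ahead logging lets readers proceed while one writer appends.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to 30 s for a lock instead of failing immediately.
    conn.execute("PRAGMA busy_timeout=30000")
    # SQLite does not enforce foreign keys unless asked, per connection.
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


# Hypothetical minimal three-level schema: patient -> specimen -> assay.
SCHEMA = """
CREATE TABLE IF NOT EXISTS patient (
    patient_id TEXT PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS specimen (
    specimen_id TEXT PRIMARY KEY,
    patient_id  TEXT NOT NULL REFERENCES patient(patient_id)
);
CREATE TABLE IF NOT EXISTS assay (
    assay_id    TEXT PRIMARY KEY,
    specimen_id TEXT NOT NULL REFERENCES specimen(specimen_id)
);
"""
```

With foreign keys on, inserting a specimen whose patient_id has no matching patient row raises `sqlite3.IntegrityError` rather than leaving an orphan. The 30-second busy timeout is the kind of setting that helps when many concurrent jobs write to the same file.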
How people actually use this
casetrack is a CLI that wraps a SQLite DB. It's installed once (globally
or per-env) and used against many projects. Three layers — keep them separate:
| Layer | Where it lives | How many | What it is |
|---|---|---|---|
| 1. casetrack package | Wherever pip put it | One per env | The CLI itself; install once with pip install casetrack |
| 2. Casetrack projects | Your data filesystem (/data1/.../cohort_X/) | Many per user, one per cohort | A directory with casetrack.toml, casetrack.db, provenance.jsonl |
| 3. Your pipeline code | Your own git repo (Snakemake / Nextflow / bash / etc.) | Many per user, one per pipeline | Orchestration + summary scripts — ends each job with casetrack append --project-dir ... |
Users do not clone this repo to use casetrack — they install it once, create
project directories wherever their data lives, and call it from their own
pipeline code. The examples/giab_chr21/ directory is a demo and
reference for the three-phase SLURM pattern; it's not a template you need
to copy wholesale.
Three recommended patterns by user shape: