From sdrf-skills
Looks up cell line metadata (HeLa, MCF-7, etc.) from Cellosaurus and enriches SDRF files with organism, disease, sex, sampling site, ancestry, and age.
Slash command: /sdrf-skills:sdrf-cellline [cell line name | CVCL_XXXX | path/to/file.sdrf.tsv]
You are translating cell line identity into the SDRF columns required by the
cell-lines template. The source of truth is Cellosaurus (SIB / Expasy) —
it is not hosted on EBI OLS, so this skill queries Cellosaurus directly. OLS
is still used for the target ontologies the SDRF columns reference
(NCBITaxon, MONDO, UBERON, HANCESTRO, CLO/BTO/EFO). This skill encodes the
rules for the translation — it does not ship a local database.
| Mode | Endpoint / file | Notes |
|---|---|---|
| Web (browse) | https://www.cellosaurus.org/<CVCL_id> | Human-readable record. |
| REST API (JSON) | https://api.cellosaurus.org/cell-line/<CVCL_id>?format=json | Single-entry fetch; preferred for accession lookups. |
| REST search | https://api.cellosaurus.org/search/cell-line?q=<query>&format=json | Free-text / field-qualified search (id:HeLa, sy:HeLa-S3). |
| Bulk download | https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt (flat) https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo (OBO) https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml (XML) | Offline / batch use only. Re-download monthly to stay current. |
Use the REST API by default. Drop to the bulk file only when the user asks for offline mode or needs to enrich many SDRFs in one pass.
Use this skill when:
- An SDRF has characteristics[cell line] filled but the Cellosaurus-derivable columns (organism, disease, sampling site, sex, ancestry, age, developmental stage, cellosaurus accession/name) are blank, generic, or inconsistent.
- A cell line needs to be resolved while running /sdrf:annotate, /sdrf:validate, or /sdrf:fix.

For pure ontology-term lookup unrelated to cell lines, use /sdrf:terms.
Always read the template first — column names, requirement levels, and target ontologies must come from the spec, never from memory.
Read: spec/sdrf-proteomics/sdrf-templates/cell-lines/{version}/cell-lines.yaml
Required columns supplied by Cellosaurus (most current spec):
Flat-file field codes are the two-letter codes in cellosaurus.txt (ID, AC,
SY, OX, DI, DR, SX, AG, …); JSON field names from the REST API are
listed in Step 2a.
| SDRF column | Source field in Cellosaurus (flat / JSON) | Target ontology |
|---|---|---|
| characteristics[cell line] | ID / identifier | CLO / BTO / EFO |
| characteristics[cellosaurus accession] | AC / accession | Cellosaurus (CVCL_XXXX) |
| characteristics[cellosaurus name] | ID / identifier | Cellosaurus |
| characteristics[disease] | DI / disease-list (NCIt / ORDO) | MONDO / EFO / DOID (translate via OLS xrefs) |
| characteristics[sampling site] | CC "Derived from site" comment / derived-from | UBERON / BTO |
| characteristics[ancestry category] | CC "Population" comment / species-list population annotation | HANCESTRO |
| characteristics[developmental stage] | AG / age (donor age class) | EFO |
The cell-lines template also requires an organism layer
(human / vertebrates / invertebrates), which contributes
characteristics[organism]. Take the species from Cellosaurus's OX line
(taxon ID) and verify against the organism template's NCBITaxon column.
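As a rough illustration of the organism step, the sketch below parses an OX value and derives the NCBITaxon accession. The function names are hypothetical. Telling vertebrates from invertebrates needs a lineage lookup that is not attempted here, so the hint only separates human from everything else, and the canonical label should still come from OLS rather than from the OX text.

```python
import re

# Shape of the OX value as described in this document: "NCBI_TaxID=<n>; ! <species>".
OX_PATTERN = re.compile(r"NCBI_TaxID=(\d+);\s*!\s*(.+)")


def parse_ox(ox_value: str) -> tuple[str, str]:
    """Return (NCBITaxon accession, species text) from a Cellosaurus OX value."""
    match = OX_PATTERN.search(ox_value)
    if match is None:
        raise ValueError(f"unrecognized OX value: {ox_value!r}")
    return f"NCBITaxon:{match.group(1)}", match.group(2).strip()


def organism_layer_hint(accession: str) -> str:
    """Human is taxon 9606; any other taxon needs a lineage check (not done here)."""
    if accession == "NCBITaxon:9606":
        return "human"
    return "vertebrates or invertebrates (check lineage)"


accession, species = parse_ox("NCBI_TaxID=9606; ! Homo sapiens (Human)")
print(accession, "|", species, "|", organism_layer_hint(accession))
```

If the hint disagrees with the organism layer already attached to the SDRF, that is the point at which to warn the user.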
Cell line names in the wild are messy. Apply this pipeline before any lookup:
1. Strip list artifacts: "['HeLa']" → HeLa. (The /sdrf:fix artifact rule handles this; rerun if dirty.)
2. Detect an accession: ^CVCL_[A-Z0-9]{4,}$ → skip name lookup, fetch the accession.
3. Build the normalized key: lower(input) with [\s\-_.]+ collapsed away. So HeLa-S3, hela s3, HELA_S3, hela.s3 all key to helas3.
4. Reject non-cell-lines: reserved values (not available, not applicable), tissue names without a clonal identifier, and primary tissue codes are not cell lines. Return early and tell the user.

By accession (CVCL_XXXX):
GET https://api.cellosaurus.org/cell-line/<CVCL_id>?format=json
By name (exact, then synonyms):
GET https://api.cellosaurus.org/search/cell-line?q=id:<name>&format=json
GET https://api.cellosaurus.org/search/cell-line?q=sy:<name>&format=json
By normalized key (last resort, broad search):
GET https://api.cellosaurus.org/search/cell-line?q=<normalized>&format=json
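The normalization rules and the request order above can be sketched as a pure function that returns the URLs to try, without performing any network call. The helper names are hypothetical. The separator class includes the dot so that hela.s3 reaches the same key as the other spellings, and tissue-name rejection is left out because it needs a vocabulary this sketch does not have.

```python
import re
from urllib.parse import quote

API = "https://api.cellosaurus.org"
ACCESSION = re.compile(r"^CVCL_[A-Z0-9]{4,}$")
RESERVED = {"not available", "not applicable"}


def clean(raw: str) -> str:
    """Strip list-literal debris such as "['HeLa']" plus surrounding whitespace."""
    return raw.strip().strip("[]'\" ").strip()


def normalized_key(name: str) -> str:
    """Lowercase, then drop whitespace, hyphens, underscores and dots."""
    return re.sub(r"[\s\-_.]+", "", name.lower())


def lookup_plan(raw: str) -> list[str]:
    """Return request URLs in the order given above; empty list means 'not a cell line'."""
    name = clean(raw)
    if not name or name.lower() in RESERVED:
        return []
    if ACCESSION.match(name):
        return [f"{API}/cell-line/{name}?format=json"]
    queries = [f"id:{name}", f"sy:{name}", normalized_key(name)]
    # Keep ':' readable so the field-qualified form matches the examples above.
    return [f"{API}/search/cell-line?q={quote(q, safe=':')}&format=json" for q in queries]


for url in lookup_plan("['HeLa-S3']"):
    print(url)
```

The caller would stop at the first URL that yields a hit and move to the next one otherwise.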
A JSON response carries a Cellosaurus.cell-line-list array. For each hit, read
identifier (recommended name), accession, name-list (synonyms with
type=synonym), species-list (NCBI taxon), disease-list (NCIt / ORDO
xrefs), derived-from, category, sex, age, and xref-list (which
includes CLO / BTO / EFO cross-references — see Step 4.1).
If the API is unreachable (network blocked, rate-limited), fall back to 2b.
If the user has downloaded the bulk release, point them to one of:
~/cellosaurus.txt # flat-file format, grep-friendly
~/cellosaurus.obo # OBO, parsable with standard tools
~/cellosaurus.xml # XML, machine-readable
Download command (only suggest if the user has no copy):
curl -sSLO https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt
For ad-hoc lookups in the flat file:
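# By accession (entries end with a line holding only "//"; field codes are followed by spaces). Uses gawk for the multi-character RS.
gawk -v RS='\n//\n' '/(^|\n)AC +CVCL_0030(\n|$)/' cellosaurus.txt
# By name (case-insensitive; exact ID line, or one whole ";"-separated synonym on the SY line)
gawk -v RS='\n//\n' 'BEGIN{IGNORECASE=1} /(^|\n)ID +HeLa(\n|$)/ || /(^|\n)SY +([^\n]*; )?HeLa(;|\n|$)/' cellosaurus.txt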
Never paste large excerpts of the file into the SDRF — extract only the fields needed for Step 4.
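For scripted use, the same flat-file lookup could be written in Python. This is a minimal sketch under two assumptions: entries end with a line containing only `//`, and each data line starts with a two-letter code followed by spaces. The `SAMPLE` text is a hand-written, abbreviated stand-in for the real file, and the function names are hypothetical.

```python
import re

# Abbreviated, hand-written stand-in for cellosaurus.txt (not real file content).
SAMPLE = """\
ID   HeLa
AC   CVCL_0030
SY   HELA; Hela; He La
SX   Female
//
ID   HeLa S3
AC   CVCL_0058
SY   HeLa-S3; HeLaS3
SX   Female
//
"""


def iter_records(text: str):
    """Yield one dict per entry, mapping each two-letter code to its list of values."""
    record: dict[str, list[str]] = {}
    for line in text.splitlines():
        if line.strip() == "//":  # entry terminator
            if record:
                yield record
            record = {}
        elif re.match(r"^[A-Z]{2}\s+\S", line):
            record.setdefault(line[:2], []).append(line[2:].strip())


def find(text: str, query: str):
    """Match an accession exactly, or a name against ID / SY ignoring case."""
    wanted = query.casefold()
    for rec in iter_records(text):
        synonyms = [s.strip() for value in rec.get("SY", []) for s in value.split(";")]
        names = rec.get("ID", []) + synonyms
        if query in rec.get("AC", []) or wanted in (n.casefold() for n in names):
            yield rec


print([r["AC"][0] for r in find(SAMPLE, "hela")])     # ['CVCL_0030']
print([r["AC"][0] for r in find(SAMPLE, "HeLa-S3")])  # ['CVCL_0058']
```

Only the codes needed later (ID, AC, SY, OX, DI, SX, AG) have to be read from each matching record; the rest of the entry can be ignored.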
When Step 2 yields more than one candidate, pick in this order:
1. Exact match on identifier (case-sensitive).
2. Exact match on one of the name-list entries with type=synonym.
3. Parent versus subclone: when both a parent line (HeLa, CVCL_0030) and a subclone (HeLa-S3, CVCL_0058) match, prefer the parent only when the user input has no qualifier. If the input contains a digit, letter suffix, or - segment (e.g. HeLa-S3, K562/ADR), prefer the subclone.
4. Unusual categories (Hybridoma, Patient-derived xenograft cell line) are valid hits. Flag them in the report so reviewers know the donor metadata semantics differ.
5. Do not guess on ambiguous fragments such as 293, SK, HCT, HEK, T-47; ask the user.

If nothing matches:
- Run
  /sdrf:terms cell line "<name>"
  for a broader CLO/BTO/EFO search.
- If that also finds nothing, set characteristics[cell line] to the user's input verbatim and the rest of the cell-line columns to not available (never N/A, never unknown).

For each matched cell line, fill columns from these rules. CVCL accessions are
verified against Cellosaurus (Step 2). Every other accession written to the
SDRF (NCBITaxon:*, MONDO:*, UBERON:*, HANCESTRO:*, EFO:*) must be
verified via OLS before writing.
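Before any column is filled, one hit has to be chosen when the lookup returned several. The candidate ordering described earlier (exact identifier, then exact synonym, then the parent/subclone preference) could be sketched as below. The candidate dictionaries are a simplified, hypothetical shape rather than the API's response format, the `parent` key stands in for whatever hierarchy information is available, and `has_qualifier` is a crude reading of "digit or `-` segment".

```python
import re


def has_qualifier(user_input: str) -> bool:
    """Crude reading of the rule: a digit or a '-' / '/' segment counts as a qualifier."""
    return bool(re.search(r"\d|[-/]\S", user_input))


def pick(user_input: str, candidates: list[dict]) -> dict:
    """Order by match quality, then by the parent/subclone preference."""
    want_subclone = has_qualifier(user_input)

    def sort_key(candidate: dict) -> tuple[int, int]:
        if candidate["identifier"] == user_input:      # case-sensitive identifier
            match_rank = 0
        elif user_input in candidate["synonyms"]:      # exact synonym
            match_rank = 1
        else:                                          # only a broad-search hit
            match_rank = 2
        is_subclone = candidate["parent"] is not None
        return match_rank, 0 if is_subclone == want_subclone else 1

    return min(candidates, key=sort_key)


# Simplified, hypothetical candidate records for illustration.
HITS = [
    {"accession": "CVCL_0030", "identifier": "HeLa", "synonyms": ["HELA", "Hela"], "parent": None},
    {"accession": "CVCL_0058", "identifier": "HeLa S3", "synonyms": ["HeLa-S3"], "parent": "CVCL_0030"},
]

for text in ("HeLa", "HeLa-S3", "hela", "hela s3"):
    print(text, "->", pick(text, HITS)["accession"])
```

The sketch always returns a candidate; a real implementation would still need a path that stops and asks the user when the input is too ambiguous to rank.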
| SDRF column | Rule |
|---|---|
| characteristics[cellosaurus accession] | CVCL_XXXX from the primary accession. |
| characteristics[cellosaurus name] | Recommended name (identifier field) exactly as Cellosaurus returns it. |
| characteristics[cell line] | Same as recommended name unless Cellosaurus's xref-list has a CLO / BTO / EFO cross-reference whose label is preferred by the lab. Verify any such alias resolves in OLS (searchClasses(query="<alias>", ontologyId="clo") etc.) before writing it. |
Cellosaurus OX gives NCBI_TaxID=<n>; ! <species>. Translate:
- Resolve NCBITaxon:<n> via OLS. Use the canonical label (e.g. Homo sapiens, not human) for characteristics[organism].
- If the taxon does not fit the selected organism layer (human = 9606, vertebrates = non-human vertebrates, invertebrates = the rest), warn: the user picked the wrong organism template.

Cellosaurus DI lines reference NCIt (e.g. NCIt; C27677; …). The SDRF
disease column wants MONDO / EFO / DOID / PATO (per TERMS.tsv).
Translation steps:
1. Look up the NCIt term: searchClasses(query="<NCIt id>", ontologyId="ncit").
2. Read its cross_references. Prefer in this order: MONDO > EFO > DOID.
3. If no usable cross-reference exists, search by label: searchClasses(query="<label>", ontologyId="mondo"). Choose the closest match by exact label, then by synonym.
4. No disease (DI absent or "Normal tissue"): set characteristics[disease] to normal (per the cell-lines template guidance), not "not applicable".
5. Multiple DI lines → use the most specific one; record the others in comment[disease history] only if the template extends that column.

In the flat file, the CC comments "Derived from site" and "Cell type" describe origin tissue/cell type.
- characteristics[sampling site] ← UBERON term for the tissue. Use OLS searchClasses(query="<site>", ontologyId="uberon"). Fall back to BTO if UBERON has no exact match.
- For embryo- or fetus-derived lines, also set characteristics[developmental stage] to embryonic / fetal (EFO terms) alongside the sampling site.

Cellosaurus SX field (Sex: Female | Male | Mixed sex | Sex unspecified):
- Female → female
- Male → male
- Mixed sex → mixed
- Sex unspecified / absent → not available

Lowercase always. Never M/F. The cell-lines template inherits
characteristics[sex] from the organism layer.
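Two of the translations above are plain lookups and can be sketched without any network access: the sex mapping, and the MONDO > EFO > DOID preference among a disease term's cross-references. The function names are hypothetical, and the sketch assumes cross-references arrive as `PREFIX:ID` strings, which may not match what OLS actually returns. The accessions in the example are placeholders, not real terms.

```python
SEX_MAP = {"female": "female", "male": "male", "mixed sex": "mixed"}
XREF_PREFERENCE = ("MONDO", "EFO", "DOID")


def map_sex(sx):
    """Translate a Cellosaurus SX value; unspecified or missing becomes 'not available'."""
    if not sx:
        return "not available"
    return SEX_MAP.get(sx.strip().lower(), "not available")


def pick_disease_xref(cross_references):
    """Return the first cross-reference from the most preferred ontology, else None."""
    for prefix in XREF_PREFERENCE:
        for xref in cross_references:
            if xref.upper().startswith(prefix + ":"):
                return xref
    return None


print(map_sex("Female"), "|", map_sex("Sex unspecified"))
# Placeholder accessions, for illustration only.
print(pick_disease_xref(["DOID:0000000", "EFO:0000000", "MONDO:0000000"]))
```

A `None` result from `pick_disease_xref` corresponds to the label-search fallback; whichever term is chosen still has to be verified in OLS before it is written.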
Cellosaurus may record the donor population (in the flat file, a CC "Population" comment, e.g. European).
- Map the population to HANCESTRO and verify the term via OLS: European → HANCESTRO:0005, African → HANCESTRO:0010, East Asian → HANCESTRO:0009.
- If no population is recorded, set characteristics[ancestry category] to not available. Do not infer ancestry from the disease or organism part.

Cellosaurus AG (donor age) and category give:
- Exact age (e.g. 31Y) → characteristics[age] formatted as <n>Y / <n>M / <n>W / <n>D (SDRF rule). Reject free text like 31 years and fix it.
- Age range (e.g. 30Y-35Y) → keep the range, hyphen only.
- Adult, Embryo, Fetus, Newborn → characteristics[developmental stage] using the EFO term, not characteristics[age].

The cell-lines template also defines passage number, biorepository,
cell line authentication, culture medium, and sample storage temperature.
Cellosaurus has no values for these — they are study-specific. Either:
- take them from the paper via /sdrf:annotate, or
- set them to not available if the paper does not state them.

When the input is a .sdrf.tsv file:
1. Collect the unique cell line values (column characteristics[cell line], case-insensitive trim).
2. Fill only empty or not available Cellosaurus-derivable columns. Do not overwrite existing values that disagree with Cellosaurus. Instead, surface them as conflicts and ask the user, exactly the way /sdrf:review does.
3. If a column is missing (e.g. characteristics[cellosaurus accession]), insert it adjacent to characteristics[cell line] and re-emit the full TSV.

Cell line annotation report
Unique cell lines: 4
HeLa → CVCL_0030 (matched: exact)
MCF-7 → CVCL_0031 (matched: synonym "MCF7")
HEK 293T → CVCL_0063 (matched: normalized "hek293t")
in-house ABC-1 → unmatched (kept verbatim, others = not available)
Conflicts: 0
Filled cells: 18 across 12 rows
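The fill-without-overwriting behaviour behind the "Filled cells" and "Conflicts" counts could look roughly like this. It is a sketch under simplifying assumptions: `lookup` is a hypothetical, already-resolved mapping from trimmed, lowercased cell line names to column values, duplicate column names (which SDRF allows) are not handled by `csv.DictReader`, and the rows are returned as dictionaries rather than re-emitted as TSV with new columns placed next to characteristics[cell line].

```python
import csv
import io

EMPTY_VALUES = {"", "not available"}


def enrich(tsv_text: str, lookup: dict[str, dict[str, str]]):
    """Fill empty cells from `lookup`; never overwrite, report disagreements instead.

    Returns (rows, filled_count, conflicts); each conflict is
    (row_number, column, existing_value, cellosaurus_value).
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    filled, conflicts = 0, []
    for number, row in enumerate(rows, start=1):
        key = (row.get("characteristics[cell line]") or "").strip().lower()
        for column, value in lookup.get(key, {}).items():
            current = (row.get(column) or "").strip()
            if current.lower() in EMPTY_VALUES:
                row[column] = value
                filled += 1
            elif current != value:
                conflicts.append((number, column, current, value))
    return rows, filled, conflicts


TSV = "\n".join([
    "\t".join(["source name", "characteristics[cell line]", "characteristics[sex]"]),
    "\t".join(["s1", "HeLa", ""]),
    "\t".join(["s2", "hela ", "male"]),
    "\t".join(["s3", "ABC-1", "not available"]),
]) + "\n"
LOOKUP = {"hela": {"characteristics[sex]": "female",
                   "characteristics[cellosaurus accession]": "CVCL_0030"}}

rows, filled, conflicts = enrich(TSV, LOOKUP)
print("Filled cells:", filled)
print("Conflicts:", conflicts)
```

Unmatched names such as `ABC-1` pass through untouched, which mirrors the "kept verbatim" line in the report.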
After enrichment, validate against the combined templates:
parse_sdrf validate-sdrf \
--sdrf_file <enriched>.sdrf.tsv \
--template cell-lines
Then run /sdrf:validate for ontology-level checks. Round-trip rules:
- CVCL_* accessions must resolve via the Cellosaurus REST API (/cell-line/<CVCL_id>).
- NCBITaxon:*, MONDO:*, UBERON:*, HANCESTRO:*, EFO:* must resolve via OLS.
- Anything that does not resolve is written as not available.
- Reserved words: not available for unknown, not applicable for inapplicable. Never N/A, NA, unknown, blank.
- Write cellosaurus name exactly as Cellosaurus returns it (it is a proper noun). Free-text cell line may be the lab's preferred alias.
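A cheap offline pre-check can catch two of these problems before the validators run: forbidden placeholders and values that cannot be a Cellosaurus accession. This sketch only checks shape. It does not replace resolution against the Cellosaurus API or OLS, and `check_cell` is a hypothetical helper name.

```python
import re

CVCL = re.compile(r"^CVCL_[A-Z0-9]{4,}$")
RESERVED = {"not available", "not applicable"}
FORBIDDEN = {"", "n/a", "na", "unknown"}


def check_cell(column: str, value: str) -> list[str]:
    """Return human-readable problems for one SDRF cell; an empty list means it looks fine."""
    text = value.strip()
    problems = []
    if text.lower() in FORBIDDEN:
        problems.append("use 'not available' or 'not applicable' instead")
    elif column == "characteristics[cellosaurus accession]":
        if text not in RESERVED and not CVCL.match(text):
            problems.append("does not look like a CVCL_XXXX accession")
    return problems


print(check_cell("characteristics[sex]", "N/A"))
print(check_cell("characteristics[cellosaurus accession]", "CVCL_0030"))
```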
npx claudepluginhub bigbio/sdrf-skills