Classify ENA microbiome samples into labeled cohorts (disease, control, or other) given a paper ID and ENA project accession. Invoke as: /microbiome-harmonization:harmonize <PMID_or_PMCID> <ENA_ACCESSION>
Parse `$ARGUMENTS` as two space-separated tokens:
1. PMID (e.g. 38243197) or PMCID (e.g. PMC10797958)
2. ENA project accession (e.g. PRJEB46665)

If either token is missing or malformed, stop and ask the user to provide both.
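The token check could be sketched as follows. The regular expressions are assumptions about identifier shape (PMIDs as plain digits, PMCIDs as `PMC` plus digits, ENA accessions as `PRJ`-style project or `ERP`/`SRP`/`DRP`-style study accessions); the skill does not define them, and `parse_arguments` is a hypothetical helper name.

```python
import re

# Assumed identifier shapes; the skill text does not define them precisely.
PMID_RE = re.compile(r"^\d+$")
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
ENA_RE = re.compile(r"^(PRJ[EDN][A-Z]\d+|[EDS]RP\d{6,})$")


def parse_arguments(arguments: str) -> tuple[str, str]:
    """Split the argument string into (paper_id, ena_accession) or raise ValueError."""
    tokens = arguments.split()
    if len(tokens) != 2:
        raise ValueError("expected two tokens: <PMID_or_PMCID> <ENA_ACCESSION>")
    paper_id, accession = tokens
    if not (PMID_RE.match(paper_id) or PMCID_RE.match(paper_id)):
        raise ValueError(f"malformed paper ID: {paper_id!r}")
    if not ENA_RE.match(accession):
        raise ValueError(f"malformed ENA accession: {accession!r}")
    return paper_id, accession


print(parse_arguments("PMC10797958 PRJEB46665"))
```

A `ValueError` here corresponds to the "stop and ask the user" branch rather than a hard failure.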
All scripts are at ${CLAUDE_PLUGIN_ROOT}/scripts/. Run every script with:
uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/<script>
usage: get_abstracts.py [-h] [-o OUTPUT] ids [ids ...]
Fetch abstract text for PMID/PMCID input(s).
positional arguments:
ids One or more PMIDs or PMCIDs (e.g. 27102758 or PMC3531190)
options:
-h, --help show this help message and exit
-o OUTPUT Write CSV output to this path. If omitted, CSV is printed to stdout for multi-ID input.
Single ID: prints abstract text directly to stdout.
Multiple IDs: produces CSV with columns: requested_id, input_type, pmid, pmcid, abstract, error.
usage: get_disease_entities.py [-h] [-o OUTPUT] ids [ids ...]
Fetch PubTator3 disease entities for PMID/PMCID input(s).
positional arguments:
ids One or more PMIDs or PMCIDs (e.g. 38243197 or PMC10797958)
options:
-h, --help show this help message and exit
-o OUTPUT Write CSV output to this path. If omitted, CSV is printed to stdout for multi-ID input.
Single ID: prints one disease per line as name<TAB>mesh_id.
Multiple IDs: produces CSV with columns: requested_id, input_type, pmid, pmcid, disease_names, disease_ids, error.
disease_names and disease_ids are semicolon-separated when multiple entities exist.
usage: get_ena_project_samples.py [-h] [-o OUTPUT] [--max-samples N] project_accession
Fetch ENA sample metadata for a project/study accession and flatten it into CSV.
positional arguments:
project_accession ENA project/study accession (e.g. PRJEB46665)
options:
-h, --help show this help message and exit
-o OUTPUT Write CSV to this path. If omitted, CSV is printed to stdout.
--max-samples N Only fetch the first N sample accessions. Use for smoke tests on large studies.
Output columns: project_accession, sample_accession, sample_alias, sample_title, center_name,
primary_id, secondary_id, taxon_id, scientific_name, description, error, then any additional
SAMPLE_ATTRIBUTE fields found in the XML (normalized to snake_case).
Run these steps in order. Save outputs to temp files with -o so you can probe them with shell tools.
Step 1 — Fetch ENA sample metadata
uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/get_ena_project_samples.py <ENA_ACCESSION> -o /tmp/samples.csv
For studies with more than 200 samples, first run with --max-samples 10 to inspect the column structure before fetching all samples.
Step 2 — Fetch abstract
uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/get_abstracts.py <PMID_or_PMCID>
Read the abstract carefully. Note: stated group sizes, cohort names, inclusion/exclusion criteria, and any sample count that can be used to verify the split.
Step 3 — Fetch MeSH disease entities
uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/get_disease_entities.py <PMID_or_PMCID>
Use these to set mesh_term and inform canonical_disease in the output.
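As a rough sketch, the single-ID output of get_disease_entities.py (one `name<TAB>mesh_id` pair per line) could be folded into a semicolon-separated `mesh_term` value like this. The function name and the sample text are hypothetical placeholders, not real PubTator3 output.

```python
def mesh_term_from_output(stdout_text: str) -> str:
    """Join the disease names from name<TAB>mesh_id lines with semicolons."""
    names = []
    for line in stdout_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name = line.split("\t", 1)[0].strip()
        if name and name not in names:  # keep first occurrence, preserve order
            names.append(name)
    return ";".join(names)


# Placeholder text standing in for captured script output.
example = "Disease A\tID_A\nDisease B\tID_B\nDisease A\tID_A\n"
print(mesh_term_from_output(example))
```

The `canonical_disease` value still needs a judgment call against the abstract; this helper only prepares the raw term list.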
Step 4 — Probe the sample CSV for case/control signal
IMPORTANT: The sample CSV can be very large (100s of samples, 100s of columns). Do NOT read the full file directly with the Read tool.
First, always inspect the raw data structure:
wc -l /tmp/samples.csv
head -5 /tmp/samples.csv
This shows you the number of rows, the header, and the first few records, including whether any fields are quoted.
Then, choose your probing tool based on what you see:
For simple CSVs (no quoted fields with commas):
# List all columns
head -1 /tmp/samples.csv | tr ',' '\n' | nl
# Count unique values in a column (e.g., column 5)
cut -d, -f5 /tmp/samples.csv | tail -n +2 | sort | uniq -c
# Find disease-related column names, prefixed with their column numbers
head -1 /tmp/samples.csv | tr ',' '\n' | grep -inE 'disease|health|case|control'
For complex CSVs (quoted fields with commas inside):
# Use Python's CSV reader to handle quoting correctly; list columns with zero-based indexes
python3 -c "import csv; header=next(csv.reader(open('/tmp/samples.csv', newline=''))); print('\\n'.join(f'{i}: {name}' for i, name in enumerate(header)))"
# Extract and analyze a specific column (e.g., zero-based index 5, i.e. the sixth column)
python3 -c "import csv; from collections import Counter; r=csv.reader(open('/tmp/samples.csv', newline='')); next(r); print(Counter(row[5] for row in r if len(row)>5).most_common(20))"
# Extract multiple columns for analysis (zero-based indexes; rows that are too short are skipped)
python3 -c "import csv; r=csv.reader(open('/tmp/samples.csv', newline='')); [print(row[1], row[4], row[5]) for row in r if len(row)>5]"
Never use cut -d',' on CSVs with quoted fields — it breaks on embedded commas. Always preview with head -5 first.
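When the one-liners get unwieldy, the same probing can be written as a short script. This is a minimal sketch: the keyword list mirrors the grep pattern above, the helper names are hypothetical, and the in-memory rows are made up to stand in for /tmp/samples.csv.

```python
import csv
import io
from collections import Counter

KEYWORDS = ("disease", "health", "case", "control")


def candidate_columns(header):
    """Return (zero-based index, name) pairs whose name contains a keyword."""
    return [(i, name) for i, name in enumerate(header)
            if any(k in name.lower() for k in KEYWORDS)]


def value_counts(rows, index):
    """Count values in one column, skipping rows that are too short."""
    return Counter(row[index] for row in rows if len(row) > index)


# Made-up rows; note the quoted field with an embedded comma.
sample = io.StringIO(
    'sample_accession,description,host_disease\n'
    'S1,"stool, adult",CRC\n'
    'S2,"stool, adult",healthy\n'
    'S3,"stool, child",CRC\n'
)
reader = csv.reader(sample)
header = next(reader)
rows = list(reader)
for index, name in candidate_columns(header):
    print(index, name, value_counts(rows, index).most_common(20))
```

To run it against the real file, replace the `io.StringIO` object with `open('/tmp/samples.csv', newline='')`.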
Step 5 — Assign labels and produce output CSV
Apply the labeling rules, then write a CSV with the Output Schema below.
Stop at the first probe level that assigns a confident label to most samples. Record which level resolved the label in label_source.
Level 1 — Explicit metadata columns (label_source: explicit_field)
Check for columns named (case-insensitive, partial match acceptable):
disease, health_state, host_disease, host_phenotype, diagnosis, condition, case_control, treatment, phenotype, subject_disease_status, disease_status
If found, use the column values directly. Common value patterns:
- Disease: case, disease, patient, affected, positive, column named after a disease
- Control: control, healthy, HC, HV, normal, unaffected, negative

Level 2 — Sample alias / title patterns (label_source: alias_pattern)
Inspect sample_alias and sample_title for structured prefixes or suffixes. Common examples:
- CRC_01, CRC01 → colorectal cancer case
- HC_01, HV_01, NC_01 → healthy control
- CD_, UC_, IBD_ → inflammatory bowel disease subtypes
- T2D_, HbA1c_ → metabolic disease
- BRC_, BRCA_ → breast cancer

Extract the prefix/suffix pattern; apply it uniformly to all samples with that pattern.
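The first two probe levels might look roughly like this in code. This is a sketch under assumptions: the value sets copy the Level 1 patterns, the control prefixes copy the Level 2 examples, and the function names and the prefix regex are hypothetical. It handles only prefixes followed by digits. Suffix patterns and columns named after a disease would need extra handling.

```python
import re

# Value sets from the Level 1 patterns, compared case-insensitively.
DISEASE_VALUES = {"case", "disease", "patient", "affected", "positive"}
CONTROL_VALUES = {"control", "healthy", "hc", "hv", "normal", "unaffected", "negative"}
# Control prefixes from the Level 2 examples.
CONTROL_PREFIXES = {"HC", "HV", "NC"}
# Assumed alias shape: letters/digits, optional separator, trailing sample number.
ALIAS_RE = re.compile(r"^([A-Za-z][A-Za-z0-9]*?)[_\-]?\d+$")


def label_from_value(value):
    """Level 1: map an explicit metadata value to a label, or None if unrecognized."""
    v = value.strip().lower()
    if v in DISEASE_VALUES:
        return "disease"
    if v in CONTROL_VALUES:
        return "control"
    return None


def alias_prefix(alias):
    """Level 2: return the upper-cased prefix of an alias such as CRC_01, else None."""
    match = ALIAS_RE.match(alias.strip())
    return match.group(1).upper() if match else None


def label_from_alias(alias, disease_prefixes):
    """Level 2: label by prefix; disease_prefixes is chosen per study by the caller."""
    prefix = alias_prefix(alias)
    if prefix in CONTROL_PREFIXES:
        return "control"
    if prefix in disease_prefixes:
        return "disease"
    return None


for alias in ("CRC_01", "CRC01", "HC_01", "T2D_05"):
    print(alias, alias_prefix(alias), label_from_alias(alias, {"CRC", "T2D"}))
```

Passing `disease_prefixes` explicitly keeps the study-specific decision (which prefixes mean disease) visible instead of guessing from an unknown prefix.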
Level 3 — Free-text fields (label_source: free_text)
Scan description, sample_title, and any other string columns for disease/control keywords. Use this level only when Levels 1–2 yield no signal. Assign confidence: low for any label derived here.
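A free-text scan could be sketched as below. The keyword patterns are hypothetical and deliberately small; in practice they would be extended with the disease terms of the study at hand. Conflicting or absent signals yield no label, and any label found carries `confidence: low` as required above.

```python
import re

# Hypothetical keyword lists; extend per study.
CONTROL_PATTERN = re.compile(r"\b(healthy|control|normal)\b", re.IGNORECASE)
DISEASE_PATTERN = re.compile(r"\b(patient|case|disease)\b", re.IGNORECASE)


def label_from_free_text(*fields):
    """Level 3: return a low-confidence label dict, or None when there is no clear signal."""
    text = " ".join(field for field in fields if field)
    has_control = bool(CONTROL_PATTERN.search(text))
    has_disease = bool(DISEASE_PATTERN.search(text))
    if has_control == has_disease:  # neither matched, or both matched (conflict)
        return None
    label = "control" if has_control else "disease"
    return {"label": label, "label_source": "free_text", "confidence": "low"}


print(label_from_free_text("stool from healthy volunteer", ""))
print(label_from_free_text("stool sample"))
```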
Level 4 — Abstract reconciliation (label_source: abstract_reconciliation)
Use the abstract to infer group structure when metadata alone is insufficient:
Assign confidence: low to any label derived this way and record the reasoning in notes.

Beyond cohort labels, probe the sample CSV for phenotypic metadata: age, sex, disease. Most ENA studies lack this, but extract when present.
Age: Look for columns named age, age_at_collection, host_age, or numeric values in units (years, months). Record as <value> or <min>-<max> if a range.
Sex: Check columns sex, gender, host_sex. Map values: M / male / boy → male; F / female / girl → female; other / not_determined → blank.
Disease: Already extracted via signal probing, but capture the specific disease term if available (e.g. colorectal cancer, inflammatory bowel disease).
Leave blank if not found.
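The age and sex rules above can be sketched as two small normalizers. The function names and the regular expressions are assumptions. Units such as months are not converted, so a value like "6 months" would come back as "6" and may need a note.

```python
import re

SEX_MAP = {
    "m": "male", "male": "male", "boy": "male",
    "f": "female", "female": "female", "girl": "female",
}
RANGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def normalize_sex(value):
    """Map raw sex/gender values to male/female; anything else becomes blank."""
    return SEX_MAP.get(value.strip().lower(), "")


def normalize_age(value):
    """Return '<min>-<max>' for a range, '<value>' for a single number, else blank."""
    match = RANGE_RE.search(value)
    if match:
        return f"{match.group(1)}-{match.group(2)}"
    match = NUMBER_RE.search(value)
    return match.group(0) if match else ""


print(normalize_sex("F"), normalize_age("18 to 65 years"), normalize_age("42"))
```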
Produce a CSV with exactly these columns in this order:
| Column | Values | Notes |
|---|---|---|
| sample_accession | e.g. SAMEA12345 | from ENA |
| study_accession | e.g. PRJEB46665 | the input accession |
| pmid | numeric string | resolved from input; blank if unavailable |
| mesh_term | raw MeSH disease term | from PubTator3; semicolon-separated if multiple |
| canonical_disease | e.g. colorectal cancer, IBD | standardized label derived from mesh_term and abstract |
| label | disease / control / other / unresolved | |
| label_source | explicit_field / alias_pattern / free_text / abstract_reconciliation | the probe level that assigned this label |
| control_type | e.g. healthy_volunteer, adjacent_normal, antibiotic_naive | blank when label is not control |
| confidence | high / medium / low | |
| age | e.g. 42, 18-65; blank if not found | extracted from ENA metadata |
| sex | male / female; blank if not found | extracted and normalized |
| disease | e.g. colorectal cancer | phenotypic disease term, blank if not found |
| separable | true / false | whether the dataset can be reliably split into cohorts |
| notes | free text | required when separable=false, confidence=low, or when ENA metadata and abstract disagree |
- Label a sample unresolved when no probe level can classify it.
- When ENA metadata and the abstract disagree, set confidence: low on affected rows and explain in notes.
- separable: false applies at the dataset level when metadata is too sparse or inconsistent to support reliable cohort splitting. Always explain why in notes.
- Fill in notes for any confidence: medium or lower assignment.
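One way to keep the output aligned with the schema is to write it through `csv.DictWriter` with a fixed column list and a small validator. This is a sketch: the helper names are hypothetical, the example row is made up, and the checks cover only the table's rules for label values, control_type, and required notes.

```python
import csv
import io

OUTPUT_COLUMNS = [
    "sample_accession", "study_accession", "pmid", "mesh_term",
    "canonical_disease", "label", "label_source", "control_type",
    "confidence", "age", "sex", "disease", "separable", "notes",
]
LABELS = {"disease", "control", "other", "unresolved"}


def validate_row(row):
    """Return a list of schema problems for one output row (empty if none)."""
    problems = []
    if row.get("label") not in LABELS:
        problems.append("label must be one of disease/control/other/unresolved")
    if row.get("label") != "control" and row.get("control_type"):
        problems.append("control_type must be blank unless label is control")
    needs_notes = row.get("separable") == "false" or row.get("confidence") == "low"
    if needs_notes and not row.get("notes"):
        problems.append("notes required when separable=false or confidence=low")
    return problems


def write_output(rows, handle):
    """Write rows in schema order; missing columns are written as blank."""
    writer = csv.DictWriter(handle, fieldnames=OUTPUT_COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        problems = validate_row(row)
        if problems:
            raise ValueError(f"{row.get('sample_accession')}: {problems}")
        writer.writerow(row)


# Made-up example row for illustration only.
example_rows = [{
    "sample_accession": "SAMEA12345", "study_accession": "PRJEB46665",
    "label": "control", "label_source": "alias_pattern",
    "control_type": "healthy_volunteer", "confidence": "high",
    "separable": "true",
}]
buffer = io.StringIO()
write_output(example_rows, buffer)
print(buffer.getvalue())
```

The case where ENA metadata and the abstract disagree is not checked here, because that condition is not recorded in any output column.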
npx claudepluginhub alanthisis/ena-metadata-harmonization --plugin microbiome-harmonization