Harmonize Microbiome Cohorts

Classify ENA microbiome samples into labeled cohorts (disease, control, or other) given a paper ID and ENA project accession. Invoke as: /microbiome-harmonization:harmonize <PMID_or_PMCID> <ENA_ACCESSION>

SKILL.md: 244 lines, ~2.6k tokens

Stats

Language: Python
Parent stars: 0
Maintenance: Excellent
Last commit: May 26, 2026

Parse $ARGUMENTS as two space-separated tokens:

Token 1: a PMID (digits only, e.g. 38243197) or PMCID (e.g. PMC10797958)
Token 2: an ENA project or study accession (e.g. PRJEB46665)

If either token is missing or malformed, stop and ask the user to provide both.
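
The token check above can be sketched in Python as follows. This is a minimal sketch, not part of the skill: the function name `parse_arguments` is hypothetical, and the ENA regular expression (project accessions such as PRJEB/PRJNA/PRJDB plus ERP/SRP/DRP study accessions) is an assumption about which accession shapes should be accepted.

```python
import re

# Assumed patterns: PMIDs are digits only and PMCIDs are "PMC" plus digits,
# as described above. The ENA pattern is an assumption of this sketch.
PMID_RE = re.compile(r"^\d+$")
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
ENA_RE = re.compile(r"^(PRJ[EDN][A-Z]\d+|[EDS]RP\d{6,})$", re.IGNORECASE)


def parse_arguments(raw: str) -> tuple[str, str]:
    """Split the raw argument string into (paper_id, ena_accession).

    Raises ValueError when a token is missing or malformed, which is the
    point at which the skill should stop and ask the user for both values.
    """
    tokens = raw.split()
    if len(tokens) != 2:
        raise ValueError("expected two tokens: <PMID_or_PMCID> <ENA_ACCESSION>")
    paper_id, accession = tokens
    if not (PMID_RE.match(paper_id) or PMCID_RE.match(paper_id)):
        raise ValueError(f"malformed PMID/PMCID: {paper_id!r}")
    if not ENA_RE.match(accession):
        raise ValueError(f"malformed ENA accession: {accession!r}")
    return paper_id, accession.upper()
```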

Scripts

All scripts are at ${CLAUDE_PLUGIN_ROOT}/scripts/. Run every script with:

uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/<script>
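
If the scripts are driven from Python rather than typed by hand, the same command line can be assembled as an argument list. The helper name `script_command` and its `plugin_root` parameter are hypothetical; only the command shape comes from the line above.

```python
import os
from typing import Optional


def script_command(script: str, *args: str, plugin_root: Optional[str] = None) -> list[str]:
    """Build the `uv run` argument list for one of the plugin scripts."""
    # Fall back to the CLAUDE_PLUGIN_ROOT environment variable when no root is given.
    root = plugin_root if plugin_root is not None else os.environ.get("CLAUDE_PLUGIN_ROOT", "")
    return ["uv", "run", "--with", "requests", "python3", f"{root}/scripts/{script}", *args]


# Example (not executed here):
# subprocess.run(script_command("get_abstracts.py", "38243197"), capture_output=True, text=True)
```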

get_abstracts.py

usage: get_abstracts.py [-h] [-o OUTPUT] ids [ids ...]

Fetch abstract text for PMID/PMCID input(s).

positional arguments:
  ids          One or more PMIDs or PMCIDs (e.g. 27102758 or PMC3531190)

options:
  -h, --help   show this help message and exit
  -o OUTPUT    Write CSV output to this path. If omitted, CSV is printed to stdout for multi-ID input.

Single ID: prints abstract text directly to stdout.
Multiple IDs: produces CSV with columns: requested_id, input_type, pmid, pmcid, abstract, error.

get_disease_entities.py

usage: get_disease_entities.py [-h] [-o OUTPUT] ids [ids ...]

Fetch PubTator3 disease entities for PMID/PMCID input(s).

positional arguments:
  ids          One or more PMIDs or PMCIDs (e.g. 38243197 or PMC10797958)

options:
  -h, --help   show this help message and exit
  -o OUTPUT    Write CSV output to this path. If omitted, CSV is printed to stdout for multi-ID input.

Single ID: prints one disease per line as name<TAB>mesh_id.
Multiple IDs: produces CSV with columns: requested_id, input_type, pmid, pmcid, disease_names, disease_ids, error.
disease_names and disease_ids are semicolon-separated when multiple entities exist.
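
Both output shapes described above can be parsed with a few lines of Python. This is a sketch under one assumption that the usage text does not state: that the semicolon-separated `disease_names` and `disease_ids` lists are aligned position by position. The function names are hypothetical.

```python
import csv
import io


def parse_single_id_output(text: str) -> list[tuple[str, str]]:
    """Parse single-ID output: one name<TAB>mesh_id pair per line."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, mesh_id = line.partition("\t")
        pairs.append((name.strip(), mesh_id.strip()))
    return pairs


def parse_multi_id_csv(text: str) -> dict[str, list[tuple[str, str]]]:
    """Parse multi-ID CSV output into {requested_id: [(name, mesh_id), ...]}."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        names = [n.strip() for n in row["disease_names"].split(";") if n.strip()]
        ids = [i.strip() for i in row["disease_ids"].split(";") if i.strip()]
        # Assumption: names and ids are listed in the same order.
        result[row["requested_id"]] = list(zip(names, ids))
    return result
```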

get_ena_project_samples.py

usage: get_ena_project_samples.py [-h] [-o OUTPUT] [--max-samples N] project_accession

Fetch ENA sample metadata for a project/study accession and flatten it into CSV.

positional arguments:
  project_accession   ENA project/study accession (e.g. PRJEB46665)

options:
  -h, --help          show this help message and exit
  -o OUTPUT           Write CSV to this path. If omitted, CSV is printed to stdout.
  --max-samples N     Only fetch the first N sample accessions. Use for smoke tests on large studies.

Output columns: project_accession, sample_accession, sample_alias, sample_title, center_name,
primary_id, secondary_id, taxon_id, scientific_name, description, error, then any additional
SAMPLE_ATTRIBUTE fields found in the XML (normalized to snake_case).
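
To anticipate what the extra column names may look like, here is one plausible snake_case normalization. It is an illustration only: the script's actual rules are not shown in this document, so treat `to_snake_case` as a hypothetical stand-in and check real column names with the header probes in Step 4.

```python
import re


def to_snake_case(tag: str) -> str:
    """Lowercase a tag and collapse runs of non-alphanumerics into single underscores."""
    return re.sub(r"[^0-9a-z]+", "_", tag.strip().lower()).strip("_")


print(to_snake_case("Host Disease"))
print(to_snake_case("geographic location (country and/or sea)"))
```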

Workflow

Run these steps in order. Save outputs to temp files with -o so you can probe them with shell tools.

Step 1 — Fetch ENA sample metadata

uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/get_ena_project_samples.py <ENA_ACCESSION> -o /tmp/samples.csv

For studies with more than 200 samples, first run with --max-samples 10 to inspect the column structure before fetching all samples.

Step 2 — Fetch abstract

uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/get_abstracts.py <PMID_or_PMCID>

Read the abstract carefully and note the stated group sizes, cohort names, inclusion/exclusion criteria, and any sample counts that can be used to verify the split.

Step 3 — Fetch MeSH disease entities

uv run --with requests python3 ${CLAUDE_PLUGIN_ROOT}/scripts/get_disease_entities.py <PMID_or_PMCID>

Use these to set mesh_term and inform canonical_disease in the output.

Step 4 — Probe the sample CSV for case/control signal

IMPORTANT: The sample CSV can be very large (hundreds of samples and hundreds of columns). Do NOT read the full file directly with the Read tool.

First, always inspect the raw data structure:

wc -l /tmp/samples.csv
head -5 /tmp/samples.csv

This shows you:

File size in lines (header row included)
Raw data including quoted fields and commas inside values
Whether the CSV is simple (no quotes) or complex (quoted fields with embedded commas)
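
The simple-versus-complex decision can also be made programmatically. The sketch below uses a deliberately crude heuristic, which is an assumption of this example: any double quote in the first lines marks the file as complex. The name `looks_complex` is hypothetical.

```python
from itertools import islice


def looks_complex(lines, sample_size: int = 200) -> bool:
    """Return True if any of the first `sample_size` lines contains a double quote."""
    return any('"' in line for line in islice(lines, sample_size))


# Example (not executed here):
# with open("/tmp/samples.csv", newline="") as handle:
#     print(looks_complex(handle))
```

When in doubt, treat the file as complex: the CSV-aware probes also work on simple files.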

Then, choose your probing tool based on what you see:

For simple CSVs (no quoted fields with commas):

# List all columns
head -1 /tmp/samples.csv | tr ',' '\n' | nl

# Count unique values in a column (e.g., column 5)
cut -d, -f5 /tmp/samples.csv | tail -n +2 | sort | uniq -c

# Search the header for disease-related column names (prints column number and name)
head -1 /tmp/samples.csv | tr ',' '\n' | grep -inE 'disease|health|case|control'

For complex CSVs (quoted fields with commas inside):

# Use Python's CSV reader to handle quoting correctly; prints each column with its 0-based index
python3 -c "import csv; header=next(csv.reader(open('/tmp/samples.csv', newline=''))); print('\\n'.join(f'{i}: {name}' for i, name in enumerate(header)))"

# Extract and analyze a specific column (e.g., 0-based index 5, i.e. the sixth column)
python3 -c "import csv; from collections import Counter; r=csv.reader(open('/tmp/samples.csv', newline='')); next(r); print(Counter(row[5] for row in r if len(row)>5).most_common(20))"

# Extract multiple columns for analysis (header row included; short rows are skipped)
python3 -c "import csv; r=csv.reader(open('/tmp/samples.csv', newline='')); [print(row[1], row[4], row[5]) for row in r if len(row)>5]"

Never use cut -d',' on CSVs with quoted fields — it breaks on embedded commas. Always preview with head -5 first.
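
For repeated probing, addressing columns by name is less error-prone than counting indices. A minimal sketch using the standard library follows; the function name `column_counts` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def column_counts(handle, column: str, top: int = 20) -> list[tuple[str, int]]:
    """Count the values of one named column, handling quoted fields correctly."""
    reader = csv.DictReader(handle)
    # Missing values (short rows or absent column) are counted as empty strings.
    counts = Counter((row.get(column) or "").strip() for row in reader)
    return counts.most_common(top)


# Illustrative data only; real files come from get_ena_project_samples.py.
sample = io.StringIO(
    'sample_accession,description,host_disease\n'
    'S1,"stool, adult",CRC\n'
    'S2,"stool, adult",healthy\n'
    'S3,biopsy,CRC\n'
)
print(column_counts(sample, "host_disease"))
```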

Step 5 — Assign labels and produce output CSV

Apply the labeling rules, then write a CSV with the Output Schema below.

Signal Probe Order

Stop at the first probe level that assigns a confident label to most samples. Record which level resolved the label in label_source.

Level 1 — Explicit metadata columns (label_source: explicit_field)

Check for columns named (case-insensitive, partial match acceptable): disease, health_state, host_disease, host_phenotype, diagnosis, condition, case_control, treatment, phenotype, subject_disease_status, disease_status

If found, use the column values directly. Common value patterns:

Disease: case, disease, patient, affected, positive, or a column named after a specific disease
Control: control, healthy, HC, HV, normal, unaffected, negative
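
Level 1 can be sketched as a column finder plus a value mapper. The column hints and value sets are copied from the lists above; the function names and the exact-match behavior on values are assumptions of this sketch. Values that name a specific disease come back as `unresolved` and need a per-study decision.

```python
LABEL_COLUMN_HINTS = (
    "disease", "health_state", "host_disease", "host_phenotype", "diagnosis",
    "condition", "case_control", "treatment", "phenotype",
    "subject_disease_status", "disease_status",
)
DISEASE_VALUES = {"case", "disease", "patient", "affected", "positive"}
CONTROL_VALUES = {"control", "healthy", "hc", "hv", "normal", "unaffected", "negative"}


def find_label_columns(header: list[str]) -> list[str]:
    """Return columns whose name contains any hint (case-insensitive, partial match)."""
    return [col for col in header if any(hint in col.lower() for hint in LABEL_COLUMN_HINTS)]


def label_from_value(value: str) -> str:
    """Map a raw metadata value to disease / control / unresolved (exact match only)."""
    normalized = value.strip().lower()
    if normalized in CONTROL_VALUES:
        return "control"
    if normalized in DISEASE_VALUES:
        return "disease"
    return "unresolved"
```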

Level 2 — Sample alias / title patterns (label_source: alias_pattern)

Inspect sample_alias and sample_title for structured prefixes or suffixes. Common examples:

CRC_01, CRC01 → colorectal cancer case
HC_01, HV_01, NC_01 → healthy control
CD_, UC_, IBD_ → inflammatory bowel disease subtypes
T2D_, HbA1c_ → metabolic disease
BRC_, BRCA_ → breast cancer

Extract the prefix/suffix pattern; apply it uniformly to all samples with that pattern.
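
A small regular expression is usually enough to pull the prefix from aliases shaped like the examples above. This sketch handles only the prefix-then-number shape (`CRC_01`, `CRC01`); suffix patterns and multi-part aliases return `None` and need their own pattern. The names `alias_prefix` and `prefix_counts` are hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Prefix starting with a letter, an optional separator, then a trailing number.
PREFIX_RE = re.compile(r"^([A-Za-z][A-Za-z0-9]*?)[_\-.]?\d+$")


def alias_prefix(alias: str) -> Optional[str]:
    """Return the upper-cased alias prefix, or None if the alias has another shape."""
    match = PREFIX_RE.match(alias.strip())
    return match.group(1).upper() if match else None


def prefix_counts(aliases: list[str]) -> Counter:
    """Count samples per detected prefix, ignoring aliases without one."""
    return Counter(p for p in map(alias_prefix, aliases) if p)
```

Reviewing the prefix counts against the group sizes in the abstract is a quick sanity check before applying a prefix uniformly.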

Level 3 — Free-text fields (label_source: free_text)

Scan description, sample_title, and any other string columns for disease/control keywords. Use this level only when Levels 1–2 yield no signal. Assign confidence: low for any label derived here.
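
A free-text scan might look like the sketch below. The keyword lists are illustrative assumptions, not an exhaustive vocabulary, and text that matches both groups or neither is left `unresolved` rather than guessed. The function name `free_text_label` is hypothetical.

```python
import re

# Illustrative keyword lists (assumptions of this sketch); extend per study.
CONTROL_RE = re.compile(r"\b(healthy|controls?|unaffected)\b", re.IGNORECASE)
DISEASE_RE = re.compile(r"\b(patients?|cases?|disease|cancer|tumou?r)\b", re.IGNORECASE)


def free_text_label(*fields) -> tuple[str, str]:
    """Return (label, confidence) from free-text fields; confidence is always low."""
    text = " ".join(f for f in fields if f)
    has_control = bool(CONTROL_RE.search(text))
    has_disease = bool(DISEASE_RE.search(text))
    if has_control == has_disease:
        # Neither keyword group matched, or both did: do not force a label.
        return ("unresolved", "low")
    return ("control" if has_control else "disease", "low")
```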

Level 4 — Abstract reconciliation (label_source: abstract_reconciliation)

Use the abstract to infer group structure when metadata alone is insufficient:

Extract stated group sizes and labels from the abstract
Map unresolved samples by count: if the abstract reports N=40 CRC and N=40 healthy, and Levels 1–3 label one group of 40 but leave the other 40 samples unlabeled, assign the remaining samples by process of elimination with confidence: low
If sample count in ENA does not match stated N in abstract, note the discrepancy in notes
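
The count-based reconciliation can be sketched as follows. It is a conservative reading of the steps above: samples are reassigned only when the ENA total matches the abstract and exactly one cohort is short by exactly the number of unresolved samples. The function name and return shape are hypothetical; the caller still sets `confidence` to low and `label_source` to `abstract_reconciliation`.

```python
from collections import Counter


def reconcile_by_count(labels: list[str], abstract_sizes: dict[str, int]) -> tuple[list[str], str]:
    """Assign unresolved samples by elimination; return (new_labels, note)."""
    counts = Counter(labels)
    unresolved = counts.get("unresolved", 0)
    stated_total = sum(abstract_sizes.values())
    if len(labels) != stated_total:
        # Count mismatch: change nothing, report the discrepancy for `notes`.
        return list(labels), f"ENA has {len(labels)} samples but abstract states {stated_total}"
    # Cohorts that still lack samples relative to the abstract.
    gaps = {c: n - counts.get(c, 0) for c, n in abstract_sizes.items() if n > counts.get(c, 0)}
    if unresolved and len(gaps) == 1:
        (cohort, gap), = gaps.items()
        if gap == unresolved:
            new_labels = [cohort if label == "unresolved" else label for label in labels]
            return new_labels, f"{unresolved} samples assigned to {cohort} by elimination"
    return list(labels), ""
```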

Phenotypic data extraction

Beyond cohort labels, probe the sample CSV for phenotypic metadata: age, sex, disease. Most ENA studies lack this, but extract when present.

Age: Look for columns named age, age_at_collection, or host_age, or for numeric values with units (years, months). Record as <value>, or as <min>-<max> if a range.

Sex: Check columns sex, gender, host_sex. Map values: M / male / boy → male; F / female / girl → female; other / not_determined → blank.

Disease: Already extracted via signal probing, but capture the specific disease term if available (e.g. colorectal cancer, inflammatory bowel disease).

Leave blank if not found.
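
The sex and age mappings above can be sketched as two small normalizers. The sex map follows the rules listed above. The age parser is an assumption of this sketch: it accepts only a bare number or range, optionally followed by a years unit, and returns blank for anything else (including month values, which it does not convert).

```python
import re

SEX_MAP = {
    "m": "male", "male": "male", "boy": "male",
    "f": "female", "female": "female", "girl": "female",
}
AGE_RE = re.compile(
    r"\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*(?:years?|yrs?|y)?\s*",
    re.IGNORECASE,
)


def normalize_sex(value) -> str:
    """Map raw sex/gender values to male / female, blank for anything else."""
    return SEX_MAP.get((value or "").strip().lower(), "")


def normalize_age(value) -> str:
    """Return '<value>' or '<min>-<max>', blank if the text is not a plain age in years."""
    match = AGE_RE.fullmatch(value or "")
    if not match:
        return ""
    low, high = match.groups()
    return f"{low}-{high}" if high else low
```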

Output Schema

Produce a CSV with exactly these columns in this order:

| Column | Values | Notes |
| --- | --- | --- |
| `sample_accession` | e.g. `SAMEA12345` | from ENA |
| `study_accession` | e.g. `PRJEB46665` | the input accession |
| `pmid` | numeric string | resolved from input; blank if unavailable |
| `mesh_term` | raw MeSH disease term | from PubTator3; semicolon-separated if multiple |
| `canonical_disease` | e.g. `colorectal cancer`, `IBD` | standardized label derived from mesh_term and abstract |
| `label` | `disease` / `control` / `other` / `unresolved` | |
| `label_source` | `explicit_field` / `alias_pattern` / `free_text` / `abstract_reconciliation` | the probe level that assigned this label |
| `control_type` | e.g. `healthy_volunteer`, `adjacent_normal`, `antibiotic_naive` | blank when label is not `control` |
| `confidence` | `high` / `medium` / `low` | |
| `age` | e.g. `42`, `18-65`, blank if not found | extracted from ENA metadata |
| `sex` | `male` / `female` / blank if not found | extracted and normalized |
| `disease` | e.g. `colorectal cancer` | phenotypic disease term, blank if not found |
| `separable` | `true` / `false` | whether the dataset can be reliably split into cohorts |
| `notes` | free text | required when `separable=false`, `confidence=low`, or when ENA metadata and abstract disagree |
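
To keep the column order exact, the output can be written with `csv.DictWriter` and a fixed field list. A minimal sketch: the column names come from the schema above, the function name `write_output` is hypothetical, and missing keys are written as blanks.

```python
import csv
import io

OUTPUT_COLUMNS = [
    "sample_accession", "study_accession", "pmid", "mesh_term",
    "canonical_disease", "label", "label_source", "control_type",
    "confidence", "age", "sex", "disease", "separable", "notes",
]


def write_output(rows, handle) -> None:
    """Write rows (dicts) as CSV with the schema's columns in schema order."""
    # Unknown keys raise ValueError (DictWriter default), catching schema typos early.
    writer = csv.DictWriter(handle, fieldnames=OUTPUT_COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_output([{"sample_accession": "SAMEA12345", "study_accession": "PRJEB46665"}], buffer)
print(buffer.getvalue())
```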

Rules

Never force a label onto an ambiguous sample. Use unresolved.
If the study has more than two cohorts, preserve all of them — do not collapse to disease vs control.
If ENA metadata and abstract disagree on group structure or sample counts, set confidence: low on affected rows and explain in notes.
separable: false applies at the dataset level when metadata is too sparse or inconsistent to support reliable cohort splitting. Always explain why in notes.
Quote the exact column name or text snippet that justified each label in notes for any confidence: medium or lower assignment.
All three scripts hit external APIs (NCBI Entrez, PubTator3, ENA). If any call fails, report the error and continue with available data rather than aborting.
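
Several of these rules are mechanical enough to check before writing the output. The sketch below covers three of them: a blank `control_type` for non-control labels, and required `notes` for `separable=false` and for medium or low confidence. The function name `rule_violations` is hypothetical, and the check cannot judge whether the notes actually quote the justifying column or snippet.

```python
def rule_violations(row: dict) -> list[str]:
    """Return human-readable rule violations for one output row (empty if none)."""
    problems = []
    notes = (row.get("notes") or "").strip()
    confidence = row.get("confidence", "")
    if row.get("label") != "control" and row.get("control_type"):
        problems.append("control_type must be blank when label is not control")
    if row.get("separable") == "false" and not notes:
        problems.append("notes required when separable=false")
    if confidence in {"medium", "low"} and not notes:
        problems.append(f"notes required when confidence={confidence}")
    return problems
```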
