From ai-hub-skill-set
Use when the user wants to pull a dataset from TCGA / GDC, Kaggle (competition or dataset), HuggingFace, or Google Drive, or asks for an sbatch script for a long download. Triggers on phrases like "grab the CESC slides", "download BraTS from huggingface", "pull RSNA pneumonia from kaggle", "GDC manifest", "gdown", "sbatch for this download", and on the tools gdc-client, kaggle, huggingface_hub, snapshot_download, gdown. Does NOT cover DICOM→NIfTI conversion or nnUNet formatting — hand off to the dicom-converter or nnunet-converter skill.
Slash command: /ai-hub-skill-set:dataset-acquisition
Generate the right commands (and sbatch scripts) to download medical imaging and genomics datasets from public sources. The skill does not kick off long-running downloads itself — it produces the commands and the sbatch script, and the user submits them.
This SKILL.md is intentionally compact. Source-specific details live in
references/*.md and are loaded on demand. The pointer table at the end
tells you which reference to read for a given source. Mandatory pre-reads are
flagged with MUST read — they are non-negotiable.
This skill handles:
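- TCGA / GDC manifests and gdc-client downloads.
- Kaggle competition and dataset downloads.
- HuggingFace snapshot_download with revision pinning.
- gdown for shared Google Drive folders / files.

This skill does NOT handle:

- DICOM→NIfTI conversion: use the dicom-converter skill.
- nnUNet formatting: use the nnunet-converter skill.
- … --access open.
- … huggingface-cli login.

If the user asks for any of the NOT handled items, tell them which sibling skill to use (or that the workflow doesn't exist) and stop.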
Decide which of the four sources the user actually wants. Most requests are single-source; some are "grab data from X then preprocess with Y" — handle the acquisition part here and hand off to the appropriate sibling skill for the preprocessing.
Archive inventory rule. For any archive you download, receive, or unpack
(.zip, .tar, .tar.gz, .tgz, .7z, etc.), list the full archive contents
or extracted folder tree before choosing the downstream conversion input. Inspect
every candidate folder, including names that look processed (_preprocessed,
preprocessed, processed, derived, etc.). "Prefer least processed data" does
not mean ignoring processed-looking folders; inspect everything first, then pick
the correct source for dicom-converter or nnunet-converter.
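
The inventory rule above can be sketched with the Python standard library. This is a minimal illustration, not part of the skill: the function names list_archive and top_level_folders are hypothetical, and .7z archives are not covered because the standard library has no reader for them.

```python
import tarfile
import zipfile
from pathlib import Path


def list_archive(path):
    """Return every member name in a .zip or tar-family archive, sorted."""
    path = str(Path(path))
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return sorted(zf.namelist())
    if tarfile.is_tarfile(path):
        # Default mode "r" auto-detects gzip/bz2/xz compression.
        with tarfile.open(path) as tf:
            return sorted(tf.getnames())
    raise ValueError(f"unsupported archive type: {path}")


def top_level_folders(names):
    """Collect first path components so no candidate folder is skipped."""
    return sorted({n.split("/", 1)[0] for n in names if "/" in n})
```

Listing every top-level folder first, including ones named like _preprocessed, makes the choice of conversion input an explicit decision rather than a default.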
For each source, you MUST read the corresponding reference and confirm its prerequisite checks pass. Do not try to work around missing auth — tell the user what's missing and stop.
| Source | Mandatory pre-read |
|---|---|
| TCGA / GDC | MUST read references/tcga_gdc.md before generating any manifest or running gdc-client. |
| Kaggle competitions or datasets | MUST read references/kaggle.md before running kaggle competitions download or kaggle datasets download. |
| HuggingFace dataset or model | MUST read references/huggingface.md before running snapshot_download. |
| Google Drive (gdown) | MUST read references/google_drive.md before running gdown. |
You MUST read references/sbatch_template.md before generating any sbatch
script. Use sbatch when any of these apply:
- … $SCRATCH, $SLURM_TMPDIR, a SLURM account, or a cluster.

Otherwise, run the download command directly in the user's session.
For live runs, execute the command and report the result. For sbatch jobs,
print the generated script to the user with the suggested filename and tell
them to sbatch it themselves — never sbatch it for them.
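
As a rough sketch of the "print the script, let the user submit it" step: the function below renders a minimal SLURM script around one download command. It does not reproduce references/sbatch_template.md, which is not shown here. The function name render_sbatch, the default job name, and the default time limit are all assumptions; only the generic #SBATCH options (--job-name, --time, --output) are standard SLURM.

```python
import shlex


def render_sbatch(command, job_name="dataset-download", time_limit="24:00:00"):
    """Render a minimal sbatch script that runs one download command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=%x-%j.out",  # %x = job name, %j = job ID
        "set -euo pipefail",
        shlex.join(command),  # quote each argument safely for the shell
        "",
    ]
    return "\n".join(lines)


# The script is only printed; submitting it is left to the user.
script = render_sbatch(["gdown", "--folder", "<url>"])
print(script)
```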
After the data lands, recommend writing a provenance manifest:
- Use the nnunet-converter skill's scripts/write_manifest.py once the dataset has
  been formatted. That writes _manifest.json with a file-list checksum and
  source metadata.

This step is strongly recommended but not enforced: dataset-acquisition
does not, by itself, produce a structured manifest. Reproducibility comes from
pinning revisions / GDC manifests at acquisition time and recording them.
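
To make "file-list checksum and source metadata" concrete, here is a small stand-in. It is not the nnunet-converter skill's write_manifest.py, whose actual fields are not shown here; the keys source, n_files, and file_list_sha256 are assumptions chosen for illustration.

```python
import hashlib
import json
from pathlib import Path


def write_manifest(root, source, out_name="_manifest.json"):
    """Write a small provenance manifest next to the downloaded files."""
    root = Path(root)
    # Relative POSIX paths, sorted, so the checksum is stable across machines.
    files = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.name != out_name
    )
    digest = hashlib.sha256("\n".join(files).encode("utf-8")).hexdigest()
    manifest = {
        "source": source,  # e.g. repo id plus pinned revision, or GDC manifest name
        "n_files": len(files),
        "file_list_sha256": digest,
    }
    (root / out_name).write_text(json.dumps(manifest, indent=2) + "\n")
    return manifest
```

The checksum here covers file names only, not file contents, which keeps it cheap for large imaging datasets; hashing contents would be a stricter alternative.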
| Source | Command shape | Notes |
|---|---|---|
| TCGA / GDC | scripts/gdc_manifest.py … -o m.txt then gdc-client download -m m.txt -d <dest> | Open data only by default; controlled needs token. |
| Kaggle competition | kaggle competitions download -c <slug> -p <dest> | User must accept rules on kaggle.com first. |
| Kaggle dataset | kaggle datasets download -d <owner>/<name> -p <dest> --unzip | --unzip only works for datasets. |
| HuggingFace | python scripts/hf_download.py --repo-id … --revision <SHA> --local-dir <dest> | Always pin the revision. |
| Google Drive | gdown <url> or gdown --folder <url> | Public/shared only. No reliable resume. |
The detailed flags, common cohorts, gating notes, and rate-limit behaviours are in the per-source references.
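
The command shapes in the table can be collected in one helper that returns an argument list per source. This is a sketch only: the function build_command, its source keys ("gdc", "kaggle-competition", and so on), and its keyword names are hypothetical, while the commands and flags are the ones listed in the table.

```python
def build_command(source, **kw):
    """Return the download command for one source as an argv list."""
    if source == "gdc":
        return ["gdc-client", "download", "-m", kw["manifest"], "-d", kw["dest"]]
    if source == "kaggle-competition":
        return ["kaggle", "competitions", "download",
                "-c", kw["slug"], "-p", kw["dest"]]
    if source == "kaggle-dataset":
        # --unzip is only valid for datasets, not competitions.
        return ["kaggle", "datasets", "download",
                "-d", kw["dataset"], "-p", kw["dest"], "--unzip"]
    if source == "huggingface":
        return ["python", "scripts/hf_download.py",
                "--repo-id", kw["repo_id"],
                "--revision", kw["revision"],
                "--local-dir", kw["dest"]]
    if source == "gdrive":
        if kw.get("folder"):
            return ["gdown", "--folder", kw["url"]]
        return ["gdown", kw["url"]]
    raise ValueError(f"unknown source: {source}")
```

Returning a list rather than a single string avoids shell-quoting problems when the command is later run directly or embedded in an sbatch script.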
| Situation | Action |
|---|---|
| Pulling a TCGA cohort, generating a GDC manifest, controlled vs open data | MUST read references/tcga_gdc.md. |
| Pulling a Kaggle competition or dataset | MUST read references/kaggle.md. |
| Pulling a HuggingFace dataset or model, gated repos, revision pinning | MUST read references/huggingface.md. |
| Pulling from Google Drive via gdown | MUST read references/google_drive.md. |
| Generating an sbatch script for any download | MUST read references/sbatch_template.md. |
| Any downloaded, user-provided, or extracted archive | List the full archive or extracted-folder contents before choosing the downstream conversion input. |
| Preprocessing the downloaded data | Hand off to dicom-converter (DICOM→NIfTI) or nnunet-converter (nnUNet formatting). |
dataset-acquisition/
├── SKILL.md # This file (entry point)
├── references/
│ ├── tcga_gdc.md # TCGA / GDC: filters, common cohorts, controlled vs open
│ ├── kaggle.md # Kaggle competitions + datasets, prerequisite checks
│ ├── huggingface.md # HF snapshot_download, revision pinning, gated repos
│ ├── google_drive.md # gdown, public/shared limits, rate limits
│ └── sbatch_template.md # SLURM sbatch template for downloads
├── scripts/
│ ├── gdc_manifest.py # Generate a GDC manifest via the GDC REST API
│ └── hf_download.py # snapshot_download with mandatory revision pinning
└── README.md
gdc_manifest.py and hf_download.py were adapted from
ryanwangk/medimg_skills under MIT.

Pin HuggingFace revisions to a commit SHA: main moves; commit SHAs don't. The
hf_download.py helper enforces this.
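
One way such enforcement could look is sketched below. This is an assumption about the approach, not the contents of scripts/hf_download.py: the names require_pinned_revision and download_pinned are hypothetical, and the check simply requires a full 40-character hexadecimal commit hash before calling huggingface_hub.snapshot_download.

```python
import re

# A full git commit hash: 40 lowercase hexadecimal characters.
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def require_pinned_revision(revision):
    """Reject branch or tag names such as "main"; accept only a full commit SHA."""
    if not revision or not _SHA_RE.fullmatch(revision):
        raise ValueError(
            f"revision must be a 40-character commit SHA, got {revision!r}"
        )
    return revision


def download_pinned(repo_id, revision, local_dir, repo_type="dataset"):
    """Download a snapshot only after the revision passes the pin check."""
    revision = require_pinned_revision(revision)
    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,
        revision=revision,
        local_dir=local_dir,
    )
```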