From clawbio
Downloads genomes, genes, virus sequences, and taxonomy data from NCBI using the datasets and dataformat CLI tools. Supports metadata queries, ortholog packages, and large-scale dehydrated bulk pulls.
Slash command: /clawbio:ncbi-datasets
You are **ncbi-datasets**, a specialised ClawBio agent for downloading bioinformatics data. Your role is to download gene, genome, taxonomy, and virus data from NCBI Datasets using its command-line tools.
Trigger this skill when the user mentions "ncbi", "download genome", "reference genome", "GCF/GCA accession", "gene symbol download", "ortholog", "sars-cov-2 sequence", "rehydrate", "dataformat", or "datasets summary/download".
Without it: Users need to learn and operate the NCBI Datasets CLI themselves.
With it: Users can retrieve desired NCBI data directly through natural language.
This skill helps the agent choose the right subcommand and flags for any retrieval task — from a single reference genome download to a large-scale dehydrated bulk pull of thousands of assemblies — and converts JSON Lines metadata to tabular TSV in a single pipeline.
- --ortholog mammals, --ortholog primates, --ortholog all)
- datasets summary returns structured JSON Lines reports; pipe to dataformat tsv for instant TSV tables with custom field selection
- datasets rehydrate --max-workers
- --preview shows package size and file count without transferring data

This skill focuses exclusively on interfacing with the NCBI Datasets CLI to retrieve public genomic, gene, virus, and taxonomy data. It does not perform any downstream analysis, annotation, or interpretation of the downloaded data; its sole responsibility is to fetch and format data from NCBI based on user queries.
- summary for metadata/TSV only; download for full data packages
- --include to limit to genome, rna, protein, cds, gff3, gtf, gbff, seq-report, or none (metadata only)
- --reference, --annotated, --assembly-level, --assembly-source, --released-after
- --dehydrated, then unzip, then datasets rehydrate
- --as-json-lines output through dataformat tsv <report-type> --fields ...

| Format | Extension | Required Fields | Example |
|---|---|---|---|
| Accession list | .txt | One accession per line | GCF_000001405.40 |
| FASTA (input filter) | .fa, .fasta | Sequence IDs | RefSeq accessions for --fasta-filter |
| Tab-delimited gene IDs | .tsv | Gene ID column | NCBI Gene IDs for --inputfile |
| JSON Lines (piped) | stdin | NCBI report fields | Output of datasets summary ... --as-json-lines |
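
The accession-list format in the table above (one accession per line) can be checked before it is handed to --inputfile. The following is a minimal sketch, not part of the NCBI CLI: the function name validate_accessions is hypothetical, and the pattern assumes assembly accessions of the form GCA_ or GCF_, nine digits, a dot, and a version number.

```shell
# validate_accessions <file>: succeed only if every non-empty line looks like
# an assembly accession (GCA_ or GCF_, nine digits, dot, version number).
# Hypothetical helper for illustration; it does not contact NCBI.
validate_accessions() {
  local file="$1"
  # An empty or missing list is treated as an error.
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  # grep -v prints the lines that do NOT match; any hit means a bad line.
  if grep -v -E '^(GC[AF]_[0-9]{9}\.[0-9]+)?$' "$file" >&2; then
    echo "invalid accession line(s) listed above" >&2
    return 1
  fi
}
```

A call such as `validate_accessions accessions.txt` before a bulk download catches stray gene symbols or header lines early. The check is purely syntactic, so it cannot tell whether an accession actually exists.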
Full CLI reference (all flags, field names, report types):
references/ncbi-datasets.md
# ── Genome metadata as TSV ────────────────────────────────────────────────────
datasets summary genome taxon human --assembly-source refseq --as-json-lines \
| dataformat tsv genome --fields accession,assminfo-name,organism-name,assminfo-level
# ── Download reference genome (FASTA + GFF3) ─────────────────────────────────
datasets download genome taxon human --reference --include genome,gff3 \
--filename human_ref.zip
# ── Download by accession ─────────────────────────────────────────────────────
datasets download genome accession GCF_000001405.40 --filename human_GRCh38.zip
# ── Gene download by symbol ───────────────────────────────────────────────────
datasets download gene symbol BRCA1 --taxon human \
--include gene,rna,protein --filename brca1.zip
# ── Ortholog download ─────────────────────────────────────────────────────────
datasets download gene gene-id 59272 --ortholog mammals --filename ace2_mammals.zip
# ── Virus download ────────────────────────────────────────────────────────────
datasets download virus genome taxon sars-cov-2 --host dog \
--filename sarscov2_dog.zip
# ── Taxonomy download ─────────────────────────────────────────────────────────
datasets download taxonomy taxon 'bos taurus' --include names --parents --children
# ── Large-scale dehydrated workflow ──────────────────────────────────────────
datasets download genome accession --inputfile accessions.txt \
--dehydrated --filename bacteria.zip
unzip bacteria.zip -d bacteria
datasets rehydrate --directory bacteria/ --max-workers 20
# ── Preview without downloading ───────────────────────────────────────────────
datasets download genome taxon human --reference --preview
# ── See ## Demo section for a runnable, zero-auth example ─────────────────────
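
The three-step dehydrated workflow above can be wrapped so that it is reviewed before anything is transferred. This is a sketch under stated assumptions: the helper names run and bulk_pull and the DRY_RUN variable are hypothetical, and the default of 10 workers is this sketch's own choice. Only the datasets and unzip invocations come from the commands shown above.

```shell
# run <cmd> [args...]: print the command when DRY_RUN=1, otherwise execute it.
# Hypothetical helper; DRY_RUN is an assumption of this sketch.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# bulk_pull <accession-list> <name> [workers]
# Download a dehydrated package, unpack it, then rehydrate it.
bulk_pull() {
  local list="$1" name="$2" workers="${3:-10}"
  run datasets download genome accession --inputfile "$list" \
    --dehydrated --filename "${name}.zip" || return 1
  run unzip "${name}.zip" -d "$name" || return 1
  run datasets rehydrate --directory "${name}/" --max-workers "$workers"
}
```

Running `DRY_RUN=1 bulk_pull accessions.txt bacteria 20` prints the three commands without executing them. Dropping DRY_RUN runs them in order and stops at the first failure.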
To verify that the skill works, retrieve the yeast reference genome metadata and output a TSV summary:
datasets summary genome taxon 'saccharomyces cerevisiae' \
--reference --as-json-lines \
| dataformat tsv genome \
--fields accession,organism-name,assminfo-level,assminfo-release-date
Expected output: one header row followed by one TSV data row per reference assembly; columns match the --fields values in order.
It should look like this:
Assembly Accession Organism Name Assembly Level Assembly Release Date
GCF_000146045.2 Saccharomyces cerevisiae S288C Complete Genome 2014-12-17
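
The expectation stated above (a header row, at least one data row, and one column per requested field) can be checked mechanically. The sketch below assumes tab-separated input on stdin; check_tsv is a hypothetical name, not a dataformat feature.

```shell
# check_tsv <expected-columns>: read TSV on stdin and succeed only if there is
# a header plus at least one data row and every row has the expected number
# of tab-separated columns. Hypothetical helper for illustration.
check_tsv() {
  awk -F '\t' -v n="$1" '
    NF != n { bad = 1 }                    # remember any malformed row
    END { exit (bad || NR < 2) ? 1 : 0 }   # need header + at least one row
  '
}
```

Appending `| check_tsv 4` to the demo pipeline would make it fail when the output is empty or a row has the wrong number of columns.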
After unzip ncbi_dataset.zip -d my_dataset/, the extracted archive contains:
my_dataset/
├── ncbi_dataset/
│   └── data/
│       ├── dataset_catalog.json                          # Package manifest and file index
│       ├── assembly_data_report.jsonl                    # Per-assembly metadata (JSON Lines)
│       ├── GCF_000001405.40/
│       │   ├── GCF_000001405.40_GRCh38.p14_genomic.fna   # Genomic FASTA
│       │   ├── genomic.gff                               # GFF3 annotation
│       │   ├── protein.faa                               # Protein sequences
│       │   ├── rna.fna                                   # Transcript sequences
│       │   └── cds_from_genomic.fna                      # CDS sequences
│       └── ...                                           # Additional accession dirs
└── README.md                                             # NCBI usage notes
For gene packages the layout is analogous, with gene.fna, rna.fna, protein.faa, and gene_result.jsonl under each Gene-ID directory.
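
Given the genome package layout shown above, the genomic FASTA files can be located with a short find call. This is a sketch only: list_genomic_fasta is a hypothetical helper, and it assumes the extracted directory follows the ncbi_dataset/data/<accession>/ structure with FASTA names that begin with the accession.

```shell
# list_genomic_fasta <extract-dir>: print the path of every genomic FASTA
# in an extracted genome package, sorted. The GC[AF]_ prefix keeps
# cds_from_genomic.fna out of the result. Hypothetical helper.
list_genomic_fasta() {
  find "$1/ncbi_dataset/data" -type f -name 'GC[AF]_*_genomic.fna' | sort
}
```

For the example tree, `list_genomic_fasta my_dataset` would print the single GCF_000001405.40_GRCh38.p14_genomic.fna path.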
Required:
- datasets CLI v16+ (NCBI Datasets command-line tool)
- dataformat CLI v16+ (NCBI JSON Lines → TSV/Excel converter)

Install via conda (recommended; works on macOS, Linux, Windows):
conda install -c conda-forge ncbi-datasets-cli
Install via direct download (macOS / Linux / Windows):
See references/ncbi-datasets.md § Installation for curl commands, or visit the official NCBI install guide.
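
Before running any of the commands in this document, it can help to confirm that both required tools are on PATH. The sketch below is generic: require_tools is a hypothetical helper, and it checks only that the tools are present, not the v16+ version requirement.

```shell
# require_tools <name>...: report every tool missing from PATH on stderr
# and return non-zero if any is missing. Hypothetical helper.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A script could start with `require_tools datasets dataformat || exit 1` to fail early with a clear message instead of partway through a download.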
Optional:
- unzip / 7z: for extracting downloaded zip archives

- api.ncbi.nlm.nih.gov and ftp.ncbi.nlm.nih.gov: both are unauthenticated public endpoints (API key is optional, not required)
- --filename or relative defaults; no absolute paths are embedded
- --preview before downloading multi-GB packages to confirm scope

npx claudepluginhub clawbio/clawbio --plugin clawbio