From encode-toolkit
Downloads ENCODE genomics files (BED, FASTQ, BAM, bigWig) to local machine. Use for specific accessions or batch downloads by criteria with MD5 verification and organization options.
Slash command: /encode-toolkit:download-encode
- User wants to download ENCODE data files to their local machine
Help the user download ENCODE data files to their local machine.
Specific files by accession: Use encode_download_files with file accession IDs (e.g., "ENCFF635JIA").
Batch download by criteria: Use encode_batch_download to search and download in one step.
- Use dry_run=True (default) to preview what will be downloaded
- Set dry_run=False after the user confirms

Download organization options:
- "flat": All files in one directory
- "experiment": Organized by experiment accession (recommended)
- "format": Organized by file format
- "experiment_format": Organized by experiment, then format

- Keep MD5 verification enabled (verify_md5=True).
- Restricted files require credentials; manage them with encode_manage_credentials.
- Do not set verify_md5=False unless the user explicitly requests it and understands the risk.
- Use preferred_default=True to get ENCODE's recommended files, or filter by output_type (e.g., "IDR thresholded peaks", "fold change over control") to avoid downloading raw data unnecessarily.
- Run encode_manage_credentials(action="check") to verify credentials are configured before attempting to download restricted data.
- Always specify an assembly filter (e.g., "GRCh38") in batch downloads. Without it, you may download files aligned to different genome assemblies (hg19, GRCh38, mm10), making downstream analysis impossible.
- encode_batch_download handles retries and concurrent downloads better than individual encode_download_files calls. The default limit of 100 files provides a safety cap.

When users request "files" without specifying a type, use this priority to suggest the right output_type:
1. output_type="IDR thresholded peaks" (most stringent, recommended for ChIP-seq/ATAC-seq)
2. file_format="bigWig", output_type="fold change over control" (for genome browser tracks)
3. output_type="gene quantifications" (for RNA-seq TPM/FPKM tables)
4. file_format="fastq" (only when user needs to run their own pipeline)
5. preferred_default=True (ENCODE's recommended files for any experiment)

| Analysis Goal | File Format | Output Type | Why This File |
|---|---|---|---|
| Peak locations (ChIP/ATAC) | bed narrowPeak | IDR thresholded peaks | Gold-standard replicated peaks passing irreproducibility threshold |
| Broad domain marks (H3K27me3) | bed broadPeak | replicated peaks | Broad marks need broadPeak format, not narrowPeak |
| Signal visualization | bigWig | fold change over control | Normalized signal track for genome browser display |
| Signal statistics | bigWig | signal p-value | Statistical significance of signal over background |
| Raw data reprocessing | fastq | reads | Starting from scratch with your own pipeline |
| Alignment inspection | bam | alignments | Check read mapping quality, fragment sizes, duplicates |
| Browser-compatible peaks | bigBed | peaks | UCSC/IGV-compatible binary peak format |
| Gene expression levels | tsv | gene quantifications | TPM/FPKM tables for RNA-seq differential expression |
| Transcript isoforms | tsv | transcript quantifications | Isoform-level expression for splicing analysis |
| 3D genome contacts | hic | contact matrix | Hi-C interaction matrices for loop/TAD calling |
| Methylation levels | bed | methylation state at CpG | Per-CpG methylation fractions for WGBS |

Recommended primary and secondary downloads by assay:

| Assay | Primary Download | Secondary Download |
|---|---|---|
| Histone ChIP-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| TF ChIP-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| ATAC-seq | IDR thresholded peaks (bed) | fold change over control (bigWig) |
| DNase-seq | peaks (bed) | signal of unique reads (bigWig) |
| RNA-seq | gene quantifications (tsv) | signal of unique reads (bigWig) |
| WGBS | methylation state at CpG (bed) | signal (bigWig) |
| Hi-C | contact matrix (hic) | contact domains (bed) |
| CUT&RUN | peaks (bed) | fold change over control (bigWig) |
| CUT&Tag | peaks (bed) | fold change over control (bigWig) |
| eCLIP | peaks (bed) | signal of unique reads (bigWig) |
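
The assay table above can be captured as a small lookup. This is a minimal sketch rather than part of encode-toolkit: `ASSAY_DEFAULTS` and `suggest_filters` are hypothetical names, and the returned dictionary only mirrors the `assay_title`, `output_type`, `file_format`, and `preferred_default` filter names used elsewhere in this document.

```python
# Hypothetical lookup built from the assay table above.
# Each entry: (primary (output_type, file_format), secondary (output_type, file_format))
IDR = ("IDR thresholded peaks", "bed")
PEAKS = ("peaks", "bed")
FOLD = ("fold change over control", "bigWig")
UNIQUE = ("signal of unique reads", "bigWig")

ASSAY_DEFAULTS = {
    "Histone ChIP-seq": (IDR, FOLD),
    "TF ChIP-seq": (IDR, FOLD),
    "ATAC-seq": (IDR, FOLD),
    "DNase-seq": (PEAKS, UNIQUE),
    "RNA-seq": (("gene quantifications", "tsv"), UNIQUE),
    "WGBS": (("methylation state at CpG", "bed"), ("signal", "bigWig")),
    "Hi-C": (("contact matrix", "hic"), ("contact domains", "bed")),
    "CUT&RUN": (PEAKS, FOLD),
    "CUT&Tag": (PEAKS, FOLD),
    "eCLIP": (PEAKS, UNIQUE),
}


def suggest_filters(assay_title, secondary=False):
    """Return filter keyword arguments for the given assay."""
    if assay_title not in ASSAY_DEFAULTS:
        # Assay not in the table: fall back to ENCODE's recommended files.
        return {"assay_title": assay_title, "preferred_default": True}
    primary, second = ASSAY_DEFAULTS[assay_title]
    output_type, file_format = second if secondary else primary
    return {
        "assay_title": assay_title,
        "output_type": output_type,
        "file_format": file_format,
    }


print(suggest_filters("DNase-seq"))
```

Falling back to `preferred_default=True` for assays missing from the table follows the priority list earlier in this document, which names it as the option that applies to any experiment.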
When multiple files exist for the same experiment, choose files in this priority order:
1. preferred_default=True: ENCODE curators mark recommended files. Always prefer these when available. Use encode_list_files(experiment_accession="ENCSR...", preferred_default=True) to find them.
2. Peak file hierarchy (most to least stringent):
3. Signal track hierarchy:
4. Assembly preference: GRCh38 for human, mm10 for mouse.
5. Replicate preference:
6. Status preference: released files first. Archived files may have been superseded, and revoked files had quality issues discovered after release.
Plan disk space before downloading. Use dry_run=True to get exact sizes for your query.
| File Type | Typical Size per File | 10 Experiments | 50 Experiments |
|---|---|---|---|
| BED peaks (narrowPeak) | 1-10 MB | 10-100 MB | 50-500 MB |
| BED peaks (broadPeak) | 5-50 MB | 50-500 MB | 250 MB - 2.5 GB |
| bigWig signal tracks | 200 MB - 2 GB | 2-20 GB | 10-100 GB |
| bigBed peaks | 1-20 MB | 10-200 MB | 50 MB - 1 GB |
| TSV quantifications | 5-50 MB | 50-500 MB | 250 MB - 2.5 GB |
| BAM alignments | 2-50 GB | 20-500 GB | 100 GB - 2.5 TB |
| FASTQ reads | 5-100 GB | 50 GB - 1 TB | 250 GB - 5 TB |
| HiC contact matrices | 500 MB - 5 GB | 5-50 GB | 25-250 GB |
Rule of thumb: Peak files and quantifications are small (MB). Signal tracks are medium (hundreds of MB). Alignments and raw reads are large (GB to tens of GB). Always preview with dry_run=True before committing to a large download.
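
The arithmetic behind the size table can be sketched as follows. This is a rough planning aid only, assuming one file per experiment and decimal units (1 GB = 1000 MB), which is how the table's columns are derived; `SIZE_RANGES_MB`, `estimate_mb`, and `human` are hypothetical names. A `dry_run=True` preview remains the way to get exact sizes.

```python
# Typical per-file size ranges in MB, copied from the table above.
SIZE_RANGES_MB = {
    "narrowPeak": (1, 10),
    "broadPeak": (5, 50),
    "bigWig": (200, 2_000),
    "bigBed": (1, 20),
    "tsv": (5, 50),
    "bam": (2_000, 50_000),
    "fastq": (5_000, 100_000),
    "hic": (500, 5_000),
}


def estimate_mb(file_type, n_files):
    """Return the (low, high) total size in MB for n_files of one type."""
    low, high = SIZE_RANGES_MB[file_type]
    return low * n_files, high * n_files


def human(mb):
    """Format a size in MB using decimal units."""
    for unit, factor in (("TB", 1_000_000), ("GB", 1_000)):
        if mb >= factor:
            return f"{mb / factor:g} {unit}"
    return f"{mb:g} MB"


low, high = estimate_mb("bigWig", 10)
print(human(low), "-", human(high))  # 2 GB - 20 GB
```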
This walkthrough shows the full process for downloading H3K27ac ChIP-seq data from human pancreas tissue.
encode_search_experiments(
assay_title="Histone ChIP-seq",
target="H3K27ac",
organ="pancreas",
biosample_type="tissue",
assembly="GRCh38"
)
-> Returns experiments, e.g., ENCSR831JOY
encode_list_files(
experiment_accession="ENCSR831JOY",
assembly="GRCh38"
)
-> Returns all files: FASTQs, BAMs, peaks, signals
-> Note the file accessions for the files you need
Filter to what you actually need — usually peaks + signal tracks:
encode_list_files(
experiment_accession="ENCSR831JOY",
assembly="GRCh38",
preferred_default=True
)
-> Returns ENCODE-recommended files only
-> Typically: IDR peaks (bed) + fold change signal (bigWig)
Or be specific about output types:
encode_list_files(
experiment_accession="ENCSR831JOY",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
-> Returns only IDR peak files, e.g., ENCFF635JIA
encode_download_files(
file_accessions=["ENCFF635JIA", "ENCFF388RZD"],
download_dir="/Users/you/data/encode/h3k27ac_pancreas",
organize_by="experiment",
verify_md5=True
)
-> Downloads files with integrity verification
-> Creates: download_dir/ENCSR831JOY/ENCFF635JIA.bed.gz
download_dir/ENCSR831JOY/ENCFF388RZD.bigWig
Check the returned JSON for:
- summary.successful: number of files downloaded
- summary.failed: should be 0
- summary.total_size_human: total bytes downloaded
- md5_verified: should be True for all files

If any file fails MD5 verification, re-download that specific file. Do not proceed with a corrupted file.
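
These checks could be automated along the following lines. The exact response layout is an assumption: this sketch supposes a `summary` dictionary plus a `files` list whose entries carry `accession` and `md5_verified` keys, which may not match the real tool output. `check_download_result` is a hypothetical helper and the example values are made up for illustration.

```python
def check_download_result(result):
    """Return a list of problems found in a download response (empty if clean)."""
    problems = []
    summary = result.get("summary", {})
    failed = summary.get("failed", 0)
    if failed != 0:
        problems.append(f"{failed} file(s) failed to download")
    # ASSUMPTION: per-file entries live under "files", each with an
    # "accession" and an "md5_verified" flag.
    for entry in result.get("files", []):
        if entry.get("md5_verified") is not True:
            accession = entry.get("accession", "?")
            problems.append(f"{accession}: MD5 not verified, re-download")
    return problems


# Illustrative response, not real tool output.
example = {
    "summary": {"successful": 2, "failed": 0},
    "files": [
        {"accession": "ENCFF635JIA", "md5_verified": True},
        {"accession": "ENCFF388RZD", "md5_verified": False},
    ],
}
print(check_download_result(example))
```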
Track the experiment and log where the data came from:
encode_track_experiment(
accession="ENCSR831JOY",
notes="H3K27ac ChIP-seq, pancreas tissue, downloaded for enhancer analysis"
)
-> Stores experiment metadata, publications, and pipeline info locally
If you create derived files later (e.g., filtered peaks), log them too:
encode_log_derived_file(
file_path="/Users/you/data/encode/h3k27ac_pancreas/filtered_peaks.bed",
source_accessions=["ENCSR831JOY", "ENCFF635JIA"],
description="H3K27ac peaks filtered against ENCODE Blacklist v2",
file_type="filtered_peaks",
tool_used="bedtools intersect v2.31.0",
parameters="bedtools intersect -v -a ENCFF635JIA.bed.gz -b hg38-blacklist.v2.bed"
)
Use encode_batch_download when downloading data across multiple experiments, such as collecting all H3K4me3 peaks across many tissues.
Always start with dry_run=True to see what will be downloaded:
encode_batch_download(
download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
output_type="IDR thresholded peaks",
target="H3K4me3",
assembly="GRCh38",
assay_title="Histone ChIP-seq",
biosample_type="tissue",
organize_by="experiment",
dry_run=True
)
-> Preview: 42 files, 180MB total, from 42 experiments
-> Shows file list with accessions, sizes, and experiment info
Present the dry run results to the user (file count, total size, and file list) and wait for confirmation.
If the count is too large, narrow with additional filters (e.g., add organ="pancreas").
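
One way to keep the preview and the real download in sync is to reuse a single set of filters for both calls. The sketch below assumes the batch tool is reachable through some callable; `preview_then_download`, the injected `batch_download` and `confirm` callables, and the `file_count` key are all hypothetical stand-ins, not part of encode-toolkit.

```python
def preview_then_download(batch_download, confirm, **filters):
    """Run a dry run, ask for confirmation, then repeat with identical filters."""
    preview = batch_download(dry_run=True, **filters)
    if not confirm(preview):
        return None  # user declined, so nothing is downloaded
    return batch_download(dry_run=False, **filters)


# Stand-in for the real tool call, used only to show the call order.
calls = []


def fake_batch_download(**kwargs):
    calls.append(kwargs["dry_run"])
    return {"file_count": 42}


result = preview_then_download(
    fake_batch_download,
    confirm=lambda preview: preview["file_count"] <= 100,
    target="H3K4me3",
    assembly="GRCh38",
)
print(calls)  # [True, False]
```

Passing the filters once means the confirmed preview and the actual download cannot drift apart through a retyped argument.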
encode_batch_download(
download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
output_type="IDR thresholded peaks",
target="H3K4me3",
assembly="GRCh38",
assay_title="Histone ChIP-seq",
biosample_type="tissue",
organize_by="experiment",
dry_run=False
)
-> Downloads all 42 files with MD5 verification
-> Creates: download_dir/ENCSR.../ENCFF....bed.gz (one per experiment)
If some files fail:
- Check the errors array in the response for specific failure reasons
- Retry the failed files by accession with encode_download_files:

encode_download_files(
file_accessions=["ENCFF_FAILED_1", "ENCFF_FAILED_2"],
download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
organize_by="experiment",
verify_md5=True
)
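
Collecting the failed accessions for such a retry might look like the sketch below. The shape of the `errors` array is an assumption (each entry is taken to be a dictionary with an `accession` key), and `build_retry_call` is a hypothetical helper; the accession strings are the same placeholders used above.

```python
def build_retry_call(result, download_dir, organize_by="experiment"):
    """Build keyword arguments for a retry of every failed file."""
    # ASSUMPTION: each entry in result["errors"] is a dict with an "accession" key.
    accessions = []
    for error in result.get("errors", []):
        accession = error.get("accession")
        if accession and accession not in accessions:
            accessions.append(accession)  # keep order, drop duplicates
    if not accessions:
        return None  # nothing to retry
    return {
        "file_accessions": accessions,
        "download_dir": download_dir,
        "organize_by": organize_by,
        "verify_md5": True,
    }


response = {
    "errors": [
        {"accession": "ENCFF_FAILED_1", "reason": "timeout"},
        {"accession": "ENCFF_FAILED_2", "reason": "md5 mismatch"},
        {"accession": "ENCFF_FAILED_1", "reason": "timeout"},
    ]
}
print(build_retry_call(response, "/Users/you/data/encode/h3k4me3_multi_tissue"))
```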
Choose organize_by based on your analysis plan:
| Strategy | Directory Structure | Best For |
|---|---|---|
| "experiment" | download_dir/ENCSR.../files | Comparing files within experiments |
| "format" | download_dir/bed/files, download_dir/bigWig/files | Running format-specific pipelines |
| "experiment_format" | download_dir/ENCSR.../bed/files | Large multi-format downloads |
| "flat" | download_dir/files | Small downloads, quick access |
Assembly mismatch: Always specify assembly="GRCh38" for human or assembly="mm10" for mouse. Omitting this in batch downloads can produce a mix of GRCh38 and hg19 files that cannot be compared. There is no automated liftover in the download tools — you must use the liftover-coordinates skill separately if you need to convert between assemblies.
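
A quick guard against mixed assemblies can be run on any file list before downloading. This sketch assumes each file record is a dictionary with an `assembly` key, which is an assumption about the listing output; `assembly_counts` and `require_single_assembly` are hypothetical names.

```python
from collections import Counter


def assembly_counts(files):
    """Count file records per genome assembly."""
    # ASSUMPTION: each record is a dict with an "assembly" key.
    return Counter(record.get("assembly", "unknown") for record in files)


def require_single_assembly(files):
    """Return the one assembly in use, or raise if the list mixes several."""
    counts = assembly_counts(files)
    if len(counts) > 1:
        raise ValueError(f"Mixed assemblies in file list: {dict(counts)}")
    return next(iter(counts), None)  # None for an empty list


files = [
    {"accession": "ENCFF635JIA", "assembly": "GRCh38"},
    {"accession": "ENCFF388RZD", "assembly": "GRCh38"},
]
print(require_single_assembly(files))  # GRCh38
```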
File status matters: Only status="released" files are fully validated by ENCODE. Archived files may have been superseded by newer processing. Revoked files had quality issues discovered after release. Always check file status before using data in analysis.
MD5 verification is not optional: Corrupted files produce silent errors in downstream analysis — wrong peak counts, shifted signal tracks, truncated alignments. The few extra seconds for MD5 verification prevents hours of debugging. Only disable with verify_md5=False if you are re-downloading a file you already verified.
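
For files obtained outside the download tools, the same kind of check can be done by hand with Python's standard library. This is a generic sketch of streamed MD5 hashing, not the toolkit's own implementation; `md5_of` and `verify_md5` are hypothetical helpers, and the expected checksum is assumed to come from the file's published metadata.

```python
import hashlib


def md5_of(path, chunk_size=1024 * 1024):
    """Compute a file's MD5 hex digest, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large BAM/FASTQ files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of(path) == expected.strip().lower()
```

A call such as `verify_md5("ENCFF635JIA.bed.gz", expected_checksum)` returns `False` for a truncated or corrupted file, which is the signal to re-download it.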
Streaming for large files: BAM and FASTQ files are downloaded with streaming to avoid loading entire files into memory. The encode_batch_download tool handles this automatically. If a download is interrupted, re-running the same command will skip already-completed files (idempotent).
The 100-file safety limit: encode_batch_download defaults to limit=100 to prevent accidentally downloading thousands of files. If your query returns more than 100 files, narrow your filters or run multiple targeted batches. You can increase the limit explicitly if you have confirmed the download is intentional.
preferred_default may return nothing: Not all experiments have files marked as preferred_default=True. If this filter returns empty results, fall back to filtering by specific output_type and assembly instead.
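
The fallback described here can be sketched as a two-step lookup. The `list_files` argument stands in for whatever callable invokes encode_list_files, and `list_with_fallback` is a hypothetical helper; only the keyword names come from the calls shown in this document.

```python
def list_with_fallback(list_files, experiment_accession, output_type, assembly):
    """Try preferred_default files first, then fall back to an explicit output_type."""
    files = list_files(
        experiment_accession=experiment_accession,
        assembly=assembly,
        preferred_default=True,
    )
    if files:
        return files
    # No curated defaults for this experiment: filter explicitly instead.
    return list_files(
        experiment_accession=experiment_accession,
        assembly=assembly,
        output_type=output_type,
    )


# Stand-in that simulates an experiment without preferred_default files.
def fake_list_files(**kwargs):
    if kwargs.get("preferred_default"):
        return []
    return [{"accession": "ENCFF635JIA", "output_type": kwargs["output_type"]}]


print(list_with_fallback(fake_list_files, "ENCSR831JOY", "IDR thresholded peaks", "GRCh38"))
```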
Credential requirements: Files with status "in progress" or "submitted" require ENCODE DCC credentials. Use encode_manage_credentials(action="check") before attempting restricted downloads. Contact the ENCODE DCC for access to unreleased data.
Duplicate files across experiments: When downloading the same file type across many experiments, some control files (e.g., input ChIP-seq) may be shared between experiments. The download tool skips already-existing files, so shared controls will not be downloaded twice.
Step 1: Preview with dry run
encode_batch_download(
download_dir="/Users/you/data/encode",
output_type="IDR thresholded peaks",
target="H3K4me3",
assembly="GRCh38",
assay_title="Histone ChIP-seq",
dry_run=True
)
-> Shows: 18 files, 45MB total (peak files are small)
-> Present file list to user for confirmation
Step 2: Confirm and download
encode_batch_download(
download_dir="/Users/you/data/encode",
output_type="IDR thresholded peaks",
target="H3K4me3",
assembly="GRCh38",
assay_title="Histone ChIP-seq",
dry_run=False
)
-> Downloads with MD5 verification, skips already-downloaded files
Step 1: Dry run to see what's available
encode_batch_download(
download_dir="/Users/you/data/encode/atac_pancreas",
file_format="bigWig",
assay_title="ATAC-seq",
organ="pancreas",
assembly="GRCh38",
organize_by="experiment",
dry_run=True
)
-> Review: 24 files, 8.3GB total, from 6 experiments
Step 2: Download after user confirms
(same call with dry_run=False)
Step 1: Download specific files with organization
encode_download_files(
file_accessions=["ENCFF635JIA", "ENCFF388RZD", "ENCFF901ABC"],
download_dir="/Users/you/data/encode",
organize_by="experiment_format"
)
-> Creates: download_dir/ENCSR.../bed/file.bed.gz
download_dir/ENCSR.../bigWig/file.bigWig
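
The four organize_by layouts can be expressed as a small path-building function. This is an illustrative sketch of the directory structures listed in the strategy table, not the tool's actual code; `destination` is a hypothetical name, and `PurePosixPath` is used so the printed result is the same on every platform.

```python
from pathlib import PurePosixPath


def destination(download_dir, experiment, file_format, filename, organize_by="experiment"):
    """Return where a file would land under a given organize_by strategy."""
    base = PurePosixPath(download_dir)
    layouts = {
        "flat": base,
        "experiment": base / experiment,
        "format": base / file_format,
        "experiment_format": base / experiment / file_format,
    }
    if organize_by not in layouts:
        raise ValueError(f"Unknown organize_by value: {organize_by!r}")
    return layouts[organize_by] / filename


print(destination("/data/encode", "ENCSR831JOY", "bed", "ENCFF635JIA.bed.gz", "experiment_format"))
# /data/encode/ENCSR831JOY/bed/ENCFF635JIA.bed.gz
```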
encode_batch_download(
download_dir="/Users/you/data/encode/defaults",
preferred_default=True,
assembly="GRCh38",
assay_title="Histone ChIP-seq",
target="H3K27me3",
organ="liver",
organize_by="experiment",
dry_run=True
)
-> Returns only ENCODE-curated default files
-> Typically the most useful subset for standard analyses
encode_batch_download(
download_dir="/Users/you/data/encode/rnaseq_brain",
output_type="gene quantifications",
assay_title="total RNA-seq",
organ="brain",
assembly="GRCh38",
organize_by="experiment",
dry_run=True
)
-> Preview: TSV files with TPM/FPKM values, typically 5-20MB each
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Downloaded FASTQ files | pipeline-chipseq through pipeline-cutandrun | Raw data for pipeline processing |
| Downloaded BED peak files | peak-annotation | Peak files for gene assignment |
| Downloaded bigWig signals | visualization-workflow | Signal tracks for genome browser |
| MD5-verified files | data-provenance | Verified file acquisition for audit trail |
| Downloaded BED files | histone-aggregation | Peak files for cross-experiment merge |
| Downloaded methylation files | methylation-aggregation | CpG methylation data for aggregation |
| File download metadata | track-experiments | Record which files were downloaded |
| Downloaded reference data | bioinformatics-installer | Reference genomes and annotations |
When presenting download results to the user:
- For dry_run=True, present what WOULD be downloaded with total size estimate and file count
- Suggest tracking the downloaded experiments (encode_track_experiment)
- Suggest logging any derived files (encode_log_derived_file)

| Skill | When to Use Instead/Additionally |
|---|---|
| search-encode | Finding experiments and files before downloading |
| track-experiments | Tracking downloaded experiments locally |
| data-provenance | Logging derived files created from downloaded data |
| quality-assessment | Evaluating experiment quality before downloading |
| publication-trust | Evaluating the provenance and trustworthiness of linked publications |
| liftover-coordinates | Converting between genome assemblies if you downloaded hg19 data |
| batch-analysis | Running analyses across multiple downloaded experiments |

npx claudepluginhub ammawla/encode-toolkit