Skill

download-encode

Downloads ENCODE genomics files (BED, FASTQ, BAM, bigWig) to local machine. Use for specific accessions or batch downloads by criteria with MD5 verification and organization options.

data-engineering

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/encode-toolkit:download-encode

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- User wants to download ENCODE data files to their local machine

Supporting Files

references/literature.md

SKILL.md

477 lines · ~5.4k tokens (exceeds 5k compaction limit)

Stats

Language: Python

Stars: 35

Forks: 5

Maintenance: Excellent

Last Commit: Jun 14, 2026

Download ENCODE Files

When to Use

User wants to download ENCODE data files to their local machine
User asks to "download", "get", or "fetch" ENCODE files
User needs specific file formats (BED, FASTQ, BAM, bigWig) from experiments
User wants to batch download files matching search criteria
User needs to verify file integrity after download (MD5 checksums)
User asks about organizing downloaded files by experiment or format

Help the user download ENCODE data files to their local machine.

Download Strategy

Specific files by accession: Use encode_download_files with file accession IDs (e.g., "ENCFF635JIA").
Batch download by criteria: Use encode_batch_download to search and download in one step.
- Always start with dry_run=True (default) to preview what will be downloaded
- Show the user the file count, total size, and file list
- Only proceed with dry_run=False after user confirms
Download organization options:
- "flat": All files in one directory
- "experiment": Organized by experiment accession (recommended)
- "format": Organized by file format
- "experiment_format": Organized by experiment, then format

Important Notes

All downloads include MD5 verification by default (verify_md5=True)
Ask the user for a download directory if not specified
Warn about large downloads (>1GB total or >50 files)
Files already downloaded will be skipped (idempotent)
For restricted files, credentials must be configured first via encode_manage_credentials
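
The size and count thresholds above can be expressed as a small check. This is a minimal sketch rather than part of the toolkit: the function name `should_warn` is hypothetical, and treating 1 GB as 10**9 bytes is an assumption.

```python
# Thresholds taken from the notes above; 1 GB = 10**9 bytes is an assumption.
MAX_TOTAL_BYTES = 10**9
MAX_FILE_COUNT = 50


def should_warn(file_sizes_bytes):
    """Return True when a planned download exceeds either threshold."""
    total = sum(file_sizes_bytes)
    return total > MAX_TOTAL_BYTES or len(file_sizes_bytes) > MAX_FILE_COUNT


print(should_warn([5_000_000] * 10))  # ten small peak files
print(should_warn([2_000_000_000]))   # a single 2 GB file
```

The file sizes would normally come from the dry_run preview, so the warning can be shown before anything is written to disk.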

Pitfalls & Edge Cases

Disk space: BAM files can be 5-50GB each; FASTQ files 1-20GB. Before any batch download, warn the user about estimated total size from the dry_run preview. A single ChIP-seq experiment can produce 10-30GB of raw data files.
MD5 verification failures: If MD5 verification fails, the file may be corrupted or incompletely downloaded. Always re-download rather than skipping verification. Never set verify_md5=False unless the user explicitly requests it and understands the risk.
Downloading too much data: Users often request BAM files when they only need peak calls or signal tracks. Suggest preferred_default=True to get ENCODE's recommended files, or filter by output_type (e.g., "IDR thresholded peaks", "fold change over control") to avoid downloading raw data unnecessarily.
Restricted/unreleased data: Files with status other than "released" may require ENCODE credentials. Use encode_manage_credentials(action="check") to verify credentials are configured before attempting to download restricted data.
Mixed assemblies in batch download: Always specify the assembly filter (e.g., "GRCh38") in batch downloads. Without it, you may download files aligned to different genome assemblies (hg19, GRCh38, mm10), whose coordinates cannot be compared directly in downstream analysis.
Timeout on large files: For downloading many files or very large files, encode_batch_download handles retries and concurrent downloads better than individual encode_download_files calls. The default limit of 100 files provides a safety cap.
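
For readers who want to see what MD5 verification involves, here is a minimal sketch using Python's standard `hashlib`. It is not the toolkit's implementation; the helper names `md5_of_file` and `md5_matches` are hypothetical, and the expected checksum is assumed to come from the file's ENCODE metadata.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-GB files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Compare the computed digest with the expected one, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch means the local copy differs from the published file, so the safe response is to delete it and download again.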

File Type Guide

When users request "files" without specifying a type, use this priority to suggest the right output_type:

Peak analysis: output_type="IDR thresholded peaks" (most stringent, recommended for ChIP-seq/ATAC-seq)
Signal visualization: file_format="bigWig", output_type="fold change over control" (for genome browser tracks)
Gene expression: output_type="gene quantifications" (for RNA-seq TPM/FPKM tables)
Raw data reprocessing: file_format="fastq" (only when user needs to run their own pipeline)
Quick defaults: preferred_default=True (ENCODE's recommended files for any experiment)

What to Download for Each Analysis

Analysis Goal	File Format	Output Type	Why This File
Peak locations (ChIP/ATAC)	bed narrowPeak	IDR thresholded peaks	Gold-standard replicated peaks passing irreproducibility threshold
Broad domain marks (H3K27me3)	bed broadPeak	replicated peaks	Broad marks need broadPeak format, not narrowPeak
Signal visualization	bigWig	fold change over control	Normalized signal track for genome browser display
Signal statistics	bigWig	signal p-value	Statistical significance of signal over background
Raw data reprocessing	fastq	reads	Starting from scratch with your own pipeline
Alignment inspection	bam	alignments	Check read mapping quality, fragment sizes, duplicates
Browser-compatible peaks	bigBed	peaks	UCSC/IGV-compatible binary peak format
Gene expression levels	tsv	gene quantifications	TPM/FPKM tables for RNA-seq differential expression
Transcript isoforms	tsv	transcript quantifications	Isoform-level expression for splicing analysis
3D genome contacts	hic	contact matrix	Hi-C interaction matrices for loop/TAD calling
Methylation levels	bed	methylation state at CpG	Per-CpG methylation fractions for WGBS

Assay-Specific Recommendations

Assay	Primary Download	Secondary Download
Histone ChIP-seq	IDR thresholded peaks (bed)	fold change over control (bigWig)
TF ChIP-seq	IDR thresholded peaks (bed)	fold change over control (bigWig)
ATAC-seq	IDR thresholded peaks (bed)	fold change over control (bigWig)
DNase-seq	peaks (bed)	signal of unique reads (bigWig)
RNA-seq	gene quantifications (tsv)	signal of unique reads (bigWig)
WGBS	methylation state at CpG (bed)	signal (bigWig)
Hi-C	contact matrix (hic)	contact domains (bed)
CUT&RUN	peaks (bed)	fold change over control (bigWig)
CUT&Tag	peaks (bed)	fold change over control (bigWig)
eCLIP	peaks (bed)	signal of unique reads (bigWig)

File Selection Priority

When multiple files exist for the same experiment, choose files in this priority order:

preferred_default=True: ENCODE curators mark recommended files. Always prefer these when available. Use encode_list_files(experiment_accession="ENCSR...", preferred_default=True) to find them.
Peak file hierarchy (most to least stringent):
- IDR thresholded peaks: replicated, irreproducibility-filtered (gold standard)
- Conservative IDR thresholded peaks: peaks passing the IDR threshold between true biological replicates
- Optimal IDR thresholded peaks: the larger of the true-replicate and pooled-pseudoreplicate IDR peak sets
- Pseudoreplicated peaks: peaks from pooled pseudoreplicates
- Replicated peaks: peaks found in both replicates (broad marks)
Signal track hierarchy:
- fold change over control — normalized signal, best for comparing across experiments
- signal p-value — statistical significance of enrichment
- signal of unique reads — uniquely mapped read signal
- signal of all reads — includes multi-mapped reads (noisier)
Assembly preference:
- GRCh38 for human (current standard): use this by default
- hg19 for human (legacy) — only if collaborators require it
- mm10 for mouse (current standard)
- Never mix assemblies within an analysis
Replicate preference:
- Replicated files (combined replicates) over single-replicate files
- Biological replicates over technical replicates
- Isogenic replication over anisogenic
Status preference:
- released — fully validated, use these
- archived — older versions, avoid unless specifically needed
- revoked — quality issues found, never use
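
Part of this priority order can be sketched as a sort key. The example below is illustrative only: `pick_best` is a hypothetical helper, the records are made-up dictionaries that reuse the field names mentioned in this document (status, assembly, preferred_default), and it covers only the status, assembly, and preferred_default rules.

```python
# released ranks before archived; revoked files are dropped entirely.
STATUS_RANK = {"released": 0, "archived": 1}


def pick_best(files, assembly="GRCh38"):
    """Return usable files for one assembly, best candidates first."""
    usable = [
        f for f in files
        if f.get("status") != "revoked" and f.get("assembly") == assembly
    ]
    return sorted(
        usable,
        key=lambda f: (
            not f.get("preferred_default", False),  # preferred defaults first
            STATUS_RANK.get(f.get("status"), 2),    # then released, then archived
        ),
    )


# Hypothetical records, not real ENCODE accessions.
candidates = [
    {"accession": "FILE_A", "status": "archived", "assembly": "GRCh38"},
    {"accession": "FILE_B", "status": "released", "assembly": "GRCh38",
     "preferred_default": True},
    {"accession": "FILE_C", "status": "revoked", "assembly": "GRCh38"},
    {"accession": "FILE_D", "status": "released", "assembly": "hg19"},
    {"accession": "FILE_E", "status": "released", "assembly": "GRCh38"},
]
print([f["accession"] for f in pick_best(candidates)])
```

Filtering by assembly before sorting keeps the "never mix assemblies" rule from depending on the sort order.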

Storage Estimates

Plan disk space before downloading. Use dry_run=True to get exact sizes for your query.

File Type	Typical Size per File	10 Experiments	50 Experiments
BED peaks (narrowPeak)	1-10 MB	10-100 MB	50-500 MB
BED peaks (broadPeak)	5-50 MB	50-500 MB	250 MB - 2.5 GB
bigWig signal tracks	200 MB - 2 GB	2-20 GB	10-100 GB
bigBed peaks	1-20 MB	10-200 MB	50 MB - 1 GB
TSV quantifications	5-50 MB	50-500 MB	250 MB - 2.5 GB
BAM alignments	2-50 GB	20-500 GB	100 GB - 2.5 TB
FASTQ reads	5-100 GB	50 GB - 1 TB	250 GB - 5 TB
HiC contact matrices	500 MB - 5 GB	5-50 GB	25-250 GB

Rule of thumb: Peak files and quantifications are small (MB). Signal tracks are medium (hundreds of MB to a few GB). Alignments and raw reads are large (GB to tens of GB, with FASTQ files reaching around 100 GB). Always preview with dry_run=True before committing to a large download.
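
The multi-experiment columns in the table follow from simple multiplication. The sketch below reproduces that arithmetic under two assumptions that the table appears to make: one file per experiment, and decimal units (1 GB = 1000 MB). The function name `estimate_range_mb` is hypothetical.

```python
def estimate_range_mb(per_file_min_mb, per_file_max_mb, n_experiments,
                      files_per_experiment=1):
    """Return (low, high) total size in MB for a planned download."""
    n_files = n_experiments * files_per_experiment
    return per_file_min_mb * n_files, per_file_max_mb * n_files


# bigWig signal tracks: 200 MB - 2 GB per file, 10 experiments.
low, high = estimate_range_mb(200, 2000, 10)
print(f"{low / 1000:.0f}-{high / 1000:.0f} GB")  # prints "2-20 GB"
```

Experiments often carry several files of the same type, so raising `files_per_experiment` gives a more cautious estimate than the table.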

Walkthrough: Downloading a Complete ChIP-seq Dataset

This walkthrough shows the full process for downloading H3K27ac ChIP-seq data from human pancreas tissue.

Step 1: Find the experiment

encode_search_experiments(
  assay_title="Histone ChIP-seq",
  target="H3K27ac",
  organ="pancreas",
  biosample_type="tissue",
  assembly="GRCh38"
)
  -> Returns experiments, e.g., ENCSR831JOY

Step 2: List available files

encode_list_files(
  experiment_accession="ENCSR831JOY",
  assembly="GRCh38"
)
  -> Returns all files: FASTQs, BAMs, peaks, signals
  -> Note the file accessions for the files you need

Step 3: Identify the right files

Filter to what you actually need — usually peaks + signal tracks:

encode_list_files(
  experiment_accession="ENCSR831JOY",
  assembly="GRCh38",
  preferred_default=True
)
  -> Returns ENCODE-recommended files only
  -> Typically: IDR peaks (bed) + fold change signal (bigWig)

Or be specific about output types:

encode_list_files(
  experiment_accession="ENCSR831JOY",
  output_type="IDR thresholded peaks",
  assembly="GRCh38"
)
  -> Returns only IDR peak files, e.g., ENCFF635JIA

Step 4: Download with MD5 verification

encode_download_files(
  file_accessions=["ENCFF635JIA", "ENCFF388RZD"],
  download_dir="/Users/you/data/encode/h3k27ac_pancreas",
  organize_by="experiment",
  verify_md5=True
)
  -> Downloads files with integrity verification
  -> Creates: download_dir/ENCSR831JOY/ENCFF635JIA.bed.gz
              download_dir/ENCSR831JOY/ENCFF388RZD.bigWig

Step 5: Verify the download results

Check the returned JSON for:

summary.successful — number of files downloaded
summary.failed — should be 0
summary.total_size_human — total size downloaded, in human-readable form
Each file's md5_verified — should be True for all files

If any file fails MD5 verification, re-download that specific file. Do not proceed with a corrupted file.
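
The checks above can be collected into one helper. This is a sketch under assumptions: the `summary` fields and `md5_verified` flag are named in this document, but the top-level `files` list and the `accession` key on each entry are guesses about the response layout, and `check_download_result` is a hypothetical name.

```python
def check_download_result(result):
    """Return a list of human-readable problems found in a download response."""
    problems = []
    summary = result.get("summary", {})
    failed = summary.get("failed", 0)
    if failed > 0:
        problems.append(f"{failed} file(s) failed to download")
    # Assumption: per-file entries live under a "files" key.
    for entry in result.get("files", []):
        if not entry.get("md5_verified", False):
            problems.append(f"{entry.get('accession', '?')}: MD5 not verified")
    return problems


# Made-up response for illustration.
example = {
    "summary": {"successful": 1, "failed": 1, "total_size_human": "12.4 MB"},
    "files": [
        {"accession": "FILE_OK", "md5_verified": True},
        {"accession": "FILE_BAD", "md5_verified": False},
    ],
}
for problem in check_download_result(example):
    print(problem)
```

An empty list means the download can be used; anything else names the files to re-download.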

Step 6: Log provenance

Track the experiment and log where the data came from:

encode_track_experiment(
  accession="ENCSR831JOY",
  notes="H3K27ac ChIP-seq, pancreas tissue, downloaded for enhancer analysis"
)
  -> Stores experiment metadata, publications, and pipeline info locally

If you create derived files later (e.g., filtered peaks), log them too:

encode_log_derived_file(
  file_path="/Users/you/data/encode/h3k27ac_pancreas/filtered_peaks.bed",
  source_accessions=["ENCSR831JOY", "ENCFF635JIA"],
  description="H3K27ac peaks filtered against ENCODE Blacklist v2",
  file_type="filtered_peaks",
  tool_used="bedtools intersect v2.31.0",
  parameters="bedtools intersect -v -a ENCFF635JIA.bed.gz -b hg38-blacklist.v2.bed"
)

Walkthrough: Batch Download for Multi-Experiment Analysis

Use encode_batch_download when downloading data across multiple experiments, such as collecting all H3K4me3 peaks across many tissues.

Step 1: Preview with dry run

Always start with dry_run=True to see what will be downloaded:

encode_batch_download(
  download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
  output_type="IDR thresholded peaks",
  target="H3K4me3",
  assembly="GRCh38",
  assay_title="Histone ChIP-seq",
  biosample_type="tissue",
  organize_by="experiment",
  dry_run=True
)
  -> Preview: 42 files, 180MB total, from 42 experiments
  -> Shows file list with accessions, sizes, and experiment info

Step 2: Review and confirm

Present the dry run results to the user:

Total file count and size
Breakdown by experiment or tissue
Any unexpected files (wrong assembly, archived status)

If the count is too large, narrow with additional filters (e.g., add organ="pancreas").
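
A review like this can be prepared from the preview's file list. The sketch below is illustrative only: the record keys (accession, experiment, size_bytes, assembly, status) are assumed, not taken from the tool's documented output, and `summarize_preview` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_preview(files, expected_assembly="GRCh38"):
    """Group preview records by experiment and flag unexpected files."""
    per_experiment = defaultdict(lambda: {"count": 0, "bytes": 0})
    unexpected = []
    for f in files:
        bucket = per_experiment[f["experiment"]]
        bucket["count"] += 1
        bucket["bytes"] += f["size_bytes"]
        # Flag anything on another assembly or not in released status.
        if f["assembly"] != expected_assembly or f["status"] != "released":
            unexpected.append(f["accession"])
    return {
        "file_count": len(files),
        "total_bytes": sum(b["bytes"] for b in per_experiment.values()),
        "per_experiment": dict(per_experiment),
        "unexpected": unexpected,
    }


# Made-up preview records.
preview = [
    {"accession": "FILE_1", "experiment": "EXP_A", "size_bytes": 100,
     "assembly": "GRCh38", "status": "released"},
    {"accession": "FILE_2", "experiment": "EXP_A", "size_bytes": 200,
     "assembly": "hg19", "status": "released"},
    {"accession": "FILE_3", "experiment": "EXP_B", "size_bytes": 50,
     "assembly": "GRCh38", "status": "archived"},
]
print(summarize_preview(preview)["unexpected"])
```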

Step 3: Execute the download

encode_batch_download(
  download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
  output_type="IDR thresholded peaks",
  target="H3K4me3",
  assembly="GRCh38",
  assay_title="Histone ChIP-seq",
  biosample_type="tissue",
  organize_by="experiment",
  dry_run=False
)
  -> Downloads all 42 files with MD5 verification
  -> Creates: download_dir/ENCSR.../ENCFF....bed.gz (one per experiment)

Step 4: Handle failed downloads

If some files fail:

Check the errors array in the response for specific failure reasons
Network timeouts: retry the failed accessions with encode_download_files
MD5 mismatches: re-download the specific files
403/404 errors: the file may be restricted or withdrawn from ENCODE

encode_download_files(
  file_accessions=["ENCFF_FAILED_1", "ENCFF_FAILED_2"],
  download_dir="/Users/you/data/encode/h3k4me3_multi_tissue",
  organize_by="experiment",
  verify_md5=True
)
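
The failure handling above amounts to sorting errors into a few actions. Here is a minimal sketch, assuming each entry in the errors array is a dictionary with `accession` and `error` keys and that the message text mentions the cause; both are assumptions, and `triage_error` and `plan_retries` are hypothetical names.

```python
def triage_error(message):
    """Map an error message to one of: skip, redownload, retry, inspect."""
    text = message.lower()
    if "403" in text or "404" in text:
        return "skip"        # restricted or withdrawn; retrying will not help
    if "md5" in text:
        return "redownload"  # local copy is corrupt
    if "timeout" in text or "timed out" in text:
        return "retry"       # transient network problem
    return "inspect"         # unknown cause; show it to the user


def plan_retries(errors):
    """Return accessions worth passing to a follow-up download call."""
    return [
        e["accession"] for e in errors
        if triage_error(e["error"]) in ("retry", "redownload")
    ]


# Made-up errors array.
errors = [
    {"accession": "FILE_1", "error": "Read timed out after 60s"},
    {"accession": "FILE_2", "error": "MD5 mismatch"},
    {"accession": "FILE_3", "error": "HTTP 403 Forbidden"},
]
print(plan_retries(errors))
```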

Organization strategies

Choose organize_by based on your analysis plan:

Strategy	Directory Structure	Best For
`"experiment"`	`download_dir/ENCSR.../files`	Comparing files within experiments
`"format"`	`download_dir/bed/files`, `download_dir/bigWig/files`	Running format-specific pipelines
`"experiment_format"`	`download_dir/ENCSR.../bed/files`	Large multi-format downloads
`"flat"`	`download_dir/files`	Small downloads, quick access
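
The four layouts in the table can be illustrated with a small path builder. This is a sketch of the directory structures described above, not the tool's code; the function name `destination` and its parameters are hypothetical.

```python
from pathlib import PurePosixPath


def destination(download_dir, filename, experiment, file_format,
                organize_by="experiment"):
    """Return where a file would land under each organize_by strategy."""
    base = PurePosixPath(download_dir)
    if organize_by == "flat":
        return base / filename
    if organize_by == "experiment":
        return base / experiment / filename
    if organize_by == "format":
        return base / file_format / filename
    if organize_by == "experiment_format":
        return base / experiment / file_format / filename
    raise ValueError(f"unknown organize_by value: {organize_by!r}")


print(destination("/data/encode", "ENCFF635JIA.bed.gz",
                  "ENCSR831JOY", "bed", "experiment_format"))
```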

Gotchas

Assembly mismatch: Always specify assembly="GRCh38" for human or assembly="mm10" for mouse. Omitting this in batch downloads can produce a mix of GRCh38 and hg19 files that cannot be compared. There is no automated liftover in the download tools — you must use the liftover-coordinates skill separately if you need to convert between assemblies.
File status matters: Only status="released" files are fully validated by ENCODE. Archived files may have been superseded by newer processing. Revoked files had quality issues discovered after release. Always check file status before using data in analysis.
MD5 verification is not optional: Corrupted files produce silent errors in downstream analysis, such as wrong peak counts, shifted signal tracks, or truncated alignments. The few extra seconds spent on MD5 verification prevent hours of debugging. Set verify_md5=False only when the user explicitly asks for it and understands the risk.
Streaming for large files: BAM and FASTQ files are downloaded with streaming to avoid loading entire files into memory. The encode_batch_download tool handles this automatically. If a download is interrupted, re-running the same command will skip already-completed files (idempotent).
The 100-file safety limit: encode_batch_download defaults to limit=100 to prevent accidentally downloading thousands of files. If your query returns more than 100 files, narrow your filters or run multiple targeted batches. You can increase the limit explicitly if you have confirmed the download is intentional.
preferred_default may return nothing: Not all experiments have files marked as preferred_default=True. If this filter returns empty results, fall back to filtering by specific output_type and assembly instead.
Credential requirements: Files with status "in progress" or "submitted" require ENCODE DCC credentials. Use encode_manage_credentials(action="check") before attempting restricted downloads. Contact the ENCODE DCC for access to unreleased data.
Duplicate files across experiments: When downloading the same file type across many experiments, some control files (e.g., input ChIP-seq) may be shared between experiments. The download tool skips already-existing files, so shared controls will not be downloaded twice.

Code Examples

1. Smart download: "Download IDR thresholded peaks for H3K4me3 ChIP-seq in GRCh38"

Step 1: Preview with dry run
  encode_batch_download(
    download_dir="/Users/you/data/encode",
    output_type="IDR thresholded peaks",
    target="H3K4me3",
    assembly="GRCh38",
    assay_title="Histone ChIP-seq",
    dry_run=True
  )
  -> Shows: 18 files, 45MB total (peak files are small)
  -> Present file list to user for confirmation

Step 2: Confirm and download
  encode_batch_download(
    download_dir="/Users/you/data/encode",
    output_type="IDR thresholded peaks",
    target="H3K4me3",
    assembly="GRCh38",
    assay_title="Histone ChIP-seq",
    dry_run=False
  )
  -> Downloads with MD5 verification, skips already-downloaded files

2. Batch download with preview: "Download all ATAC-seq bigWig signal tracks for pancreas"

Step 1: Dry run to see what's available
  encode_batch_download(
    download_dir="/Users/you/data/encode/atac_pancreas",
    file_format="bigWig",
    assay_title="ATAC-seq",
    organ="pancreas",
    assembly="GRCh38",
    organize_by="experiment",
    dry_run=True
  )
  -> Review: 24 files, 8.3GB total, from 6 experiments

Step 2: Download after user confirms
  (same call with dry_run=False)

3. Organized download: "Download files organized by experiment and format"

Step 1: Download specific files with organization
  encode_download_files(
    file_accessions=["ENCFF635JIA", "ENCFF388RZD", "ENCFF901ABC"],
    download_dir="/Users/you/data/encode",
    organize_by="experiment_format"
  )
  -> Creates: download_dir/ENCSR.../bed/file.bed.gz
             download_dir/ENCSR.../bigWig/file.bigWig

4. ENCODE-recommended defaults: "Just get the recommended files for this experiment"

encode_batch_download(
  download_dir="/Users/you/data/encode/defaults",
  preferred_default=True,
  assembly="GRCh38",
  assay_title="Histone ChIP-seq",
  target="H3K27me3",
  organ="liver",
  organize_by="experiment",
  dry_run=True
)
  -> Returns only ENCODE-curated default files
  -> Typically the most useful subset for standard analyses

5. RNA-seq expression data: "Download gene quantification tables"

encode_batch_download(
  download_dir="/Users/you/data/encode/rnaseq_brain",
  output_type="gene quantifications",
  assay_title="total RNA-seq",
  organ="brain",
  assembly="GRCh38",
  organize_by="experiment",
  dry_run=True
)
  -> Preview: TSV files with TPM/FPKM values, typically 5-20MB each

Integration

This skill produces...	Feed into...	Purpose
Downloaded FASTQ files	pipeline-chipseq through pipeline-cutandrun	Raw data for pipeline processing
Downloaded BED peak files	peak-annotation	Peak files for gene assignment
Downloaded bigWig signals	visualization-workflow	Signal tracks for genome browser
MD5-verified files	data-provenance	Verified file acquisition for audit trail
Downloaded BED files	histone-aggregation	Peak files for cross-experiment merge
Downloaded methylation files	methylation-aggregation	CpG methylation data for aggregation
File download metadata	track-experiments	Record which files were downloaded
Downloaded reference data	bioinformatics-installer	Reference genomes and annotations

Presenting Results

When presenting download results to the user:

Show a download summary table: filename | size | format | MD5 status | path
For dry_run=True, present what WOULD be downloaded with total size estimate and file count
Report any failures separately with error messages
After successful downloads, suggest next steps:
- "Would you like to log these as tracked experiments?" (use encode_track_experiment)
- "Would you like to log any derived files for provenance?" (use encode_log_derived_file)
For large batch downloads, summarize by experiment and format rather than listing every file
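
The summary table described above could be rendered along these lines. This is a minimal sketch: the row keys are assumptions chosen to match the column names, and `format_summary` is a hypothetical helper.

```python
def format_summary(rows):
    """Render download results as a pipe-separated summary table."""
    lines = ["filename | size | format | MD5 status | path"]
    for row in rows:
        md5_status = "verified" if row["md5_verified"] else "FAILED"
        lines.append(" | ".join(
            [row["filename"], row["size"], row["format"], md5_status, row["path"]]
        ))
    return "\n".join(lines)


# Made-up row for illustration.
rows = [
    {"filename": "ENCFF635JIA.bed.gz", "size": "1.2 MB", "format": "bed",
     "md5_verified": True, "path": "/data/encode/ENCSR831JOY"},
]
print(format_summary(rows))
```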

Key Literature

ENCODE Phase 3: ENCODE Project Consortium 2020 (Nature, ~2,000 citations) DOI: 10.1038/s41586-020-2493-4 — Source catalog for all downloadable genomic data.
FAIR Principles: Wilkinson et al. 2016 (Scientific Data, ~5,000 citations) DOI: 10.1038/sdata.2016.18 — Findable, Accessible, Interoperable, Reusable data principles that ENCODE's download infrastructure supports.
IDR Framework: Li et al. 2011 (Annals of Applied Statistics) DOI: 10.1214/11-AOAS466 — Irreproducible Discovery Rate method used for peak thresholding in ENCODE.
ENCODE Blacklist: Amemiya et al. 2019 (Scientific Reports, ~1,400 citations) DOI: 10.1038/s41598-019-45839-z — Regions to exclude from downloaded peak files before analysis.

Related Skills

Skill	When to Use Instead/Additionally
`search-encode`	Finding experiments and files before downloading
`track-experiments`	Tracking downloaded experiments locally
`data-provenance`	Logging derived files created from downloaded data
`quality-assessment`	Evaluating experiment quality before downloading
`publication-trust`	Evaluating the provenance and trustworthiness of linked publications
`liftover-coordinates`	Converting between genome assemblies if you downloaded hg19 data
`batch-analysis`	Running analyses across multiple downloaded experiments


For the request: "$ARGUMENTS"
