From encode-toolkit
Executes ENCODE ATAC-seq pipeline from FASTQ to peaks and signal tracks using Nextflow with Docker containers. Handles Tn5 offset correction, mitochondrial removal, nucleosome-free fragments, and TSS enrichment.
Slash command
/encode-toolkit:pipeline-atacseq
- User wants to run an ATAC-seq processing pipeline from FASTQ to peaks and signal tracks
Execute the ENCODE ATAC-seq processing pipeline from raw FASTQ files through Tn5 offset correction, peak calling, IDR analysis, and signal track generation. This skill provides a complete Nextflow DSL2 implementation following ENCODE uniform analysis standards.
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) uses the Tn5 transposase to probe open chromatin regions. The ENCODE pipeline processes ATAC-seq data through quality control, alignment with Bowtie2, Tn5 insertion site correction (+4/-5 bp offset), mitochondrial read removal, nucleosome-free fragment selection, peak calling with MACS2, and IDR-based replicate consistency analysis.
Key differences from ChIP-seq: Bowtie2 aligner (optimized for short fragments), Tn5 transposase shift correction, aggressive mitochondrial read filtering (can be 30-80% of reads), nucleosomal fragment size distribution as a QC metric, and TSS enrichment score as the primary quality indicator.
| Reference | Journal | Year | DOI | Relevance |
|---|---|---|---|---|
| Buenrostro et al. "Transposition of native chromatin (ATAC-seq)" | Nature Methods | 2013 | 10.1038/nmeth.2688 | Original ATAC-seq method (~5,000 citations) |
| Corces et al. "An improved ATAC-seq protocol" | Nature Methods | 2017 | 10.1038/nmeth.4396 | Omni-ATAC improvements (~2,500 citations) |
| ENCODE Project Consortium "Expanded encyclopaedias" | Nature | 2020 | 10.1038/s41586-020-2493-4 | ENCODE Phase 3 standards |
| Amemiya et al. "ENCODE Blacklist" | Scientific Reports | 2019 | 10.1038/s41598-019-45839-z | Artifact regions (~1,372 citations) |
| Langmead & Salzberg "Fast gapped-read alignment with Bowtie 2" | Nature Methods | 2012 | 10.1038/nmeth.1923 | Aligner (~30,000 citations) |
| Yan et al. "From reads to insight: ATAC-seq analysis" | Genome Biology | 2020 | 10.1186/s13059-020-1929-3 | Analysis best practices |
    FASTQ ──> FastQC / Trim Galore ──> Bowtie2 ──> Mito Removal + Tn5 Shift
      │                                                       │
      │         ┌─────────────────────────────────────────────┘
      │         v
      │  Picard MarkDup ──> Blacklist Filter ──> Size Selection
      │                                                │
      │           ┌─────────────────┬──────────────────┘
      │           v                 v
      │     NFR Fragments    Mono-Nucleosome
      │           │
      │           v
      │  MACS2 Peak Calling ──> IDR Analysis
      │           │                  │
      │           v                  v
      │     Signal Tracks   QC Report (MultiQC + ataqv)
      v
    Raw QC Report
| Stage | Tool | Input | Output | Reference |
|---|---|---|---|---|
| 1. QC & Trimming | FastQC, Trim Galore | Raw FASTQ | Trimmed FASTQ | references/01-qc-trimming.md |
| 2. Alignment | Bowtie2 | Trimmed FASTQ | Sorted BAM | references/02-alignment.md |
| 3. Tn5 Shift & Filtering | Samtools, bedtools, Picard | Sorted BAM | Shifted, filtered BAM | references/03-tn5-filtering.md |
| 4. Peak Calling & IDR | MACS2, IDR | Filtered BAM | Peaks (narrowPeak) | references/04-peak-calling.md |
| 5. QC & Signal | deeptools, ataqv, MultiQC | Filtered BAM, Peaks | bigWig, QC report | references/05-qc-metrics.md |
sample_id,read1,read2,replicate
SAMPLE1_rep1,atac_rep1_R1.fq.gz,atac_rep1_R2.fq.gz,1
SAMPLE1_rep2,atac_rep2_R1.fq.gz,atac_rep2_R2.fq.gz,2
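Before launching a run, it can help to sanity-check the samplesheet. The sketch below is a minimal example, not part of the toolkit: `validate_samplesheet` is a hypothetical helper, and the file names in `example` are placeholders. It assumes the four columns shown above and flags any FASTQ path that appears more than once, which usually means two replicates point at the same files.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = ["sample_id", "read1", "read2", "replicate"]


def validate_samplesheet(text):
    """Parse samplesheet CSV text and return (rows, problems).

    Hypothetical helper: checks only the column layout shown in this document.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], [f"missing columns: {', '.join(missing)}"]

    rows = list(reader)
    problems = []

    # Each FASTQ path should be listed once; repeats suggest a copy-paste error.
    fastq_counts = Counter(
        path for row in rows for path in (row["read1"], row["read2"]) if path
    )
    for path, count in fastq_counts.items():
        if count > 1:
            problems.append(f"{path} is listed {count} times")

    # Replicate numbers are expected to be plain integers.
    for row in rows:
        if not (row["replicate"] or "").strip().isdigit():
            problems.append(f"{row['sample_id']}: replicate must be an integer")

    return rows, problems


# Placeholder file names, for illustration only.
example = """sample_id,read1,read2,replicate
SAMPLE1_rep1,rep1_R1.fq.gz,rep1_R2.fq.gz,1
SAMPLE1_rep2,rep2_R1.fq.gz,rep2_R2.fq.gz,2
"""
rows, problems = validate_samplesheet(example)
print(len(rows), problems)  # prints: 2 []
```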
No input control needed: Unlike ChIP-seq, ATAC-seq does not require a separate input or IgG control. MACS2 calls peaks against a local background model.
The Tn5 transposase inserts sequencing adapters with a 9-bp duplication. To center reads on the actual cut site, shift reads on the + strand by +4 bp and reads on the - strand by -5 bp.
This correction is essential for accurate footprinting and motif analysis.
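The +4/-5 bp correction can be sketched as coordinate arithmetic on BED-style intervals. This is an illustrative example under one common convention (adjust only the 5' end of each read); `tn5_shift` and `cut_site` are hypothetical names, and the pipeline's own implementation may differ in detail.

```python
def tn5_shift(start, end, strand):
    """Apply the Tn5 offset to a 0-based, half-open interval [start, end).

    Assumption: only the read's 5' end moves. For a + strand read that is
    `start` (+4 bp); for a - strand read that is `end` (-5 bp).
    """
    if strand == "+":
        return start + 4, end
    if strand == "-":
        return start, end - 5
    raise ValueError(f"unknown strand: {strand!r}")


def cut_site(start, end, strand):
    """Return the 0-based position of the corrected 5' end (the cut site)."""
    shifted_start, shifted_end = tn5_shift(start, end, strand)
    # In a half-open interval, the last base of a - strand read is end - 1.
    return shifted_start if strand == "+" else shifted_end - 1


print(tn5_shift(100, 150, "+"))  # prints: (104, 150)
print(tn5_shift(100, 150, "-"))  # prints: (100, 145)
```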
ATAC-seq produces a characteristic nucleosomal ladder pattern:
| Fragment Class | Size Range | Biological Meaning |
|---|---|---|
| Nucleosome-free (NFR) | <150 bp | Open chromatin / TF binding |
| Mono-nucleosome | 150-300 bp | Single nucleosome wrapping |
| Di-nucleosome | 300-500 bp | Two nucleosomes |
| Tri-nucleosome | 500-700 bp | Three nucleosomes |
For peak calling, use nucleosome-free reads (<150 bp) only.
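The size classes in the table above can be expressed as a simple lookup. This is a minimal sketch: the table does not say which class a boundary value such as 300 bp belongs to, so the half-open ranges used here (for example 150 to 299 bp for mono-nucleosome) are an assumption, and the fragment lengths are made-up values.

```python
from collections import Counter

# Upper bounds (exclusive) for each class, taken from the table above.
FRAGMENT_CLASSES = [
    (150, "nucleosome_free"),
    (300, "mono_nucleosome"),
    (500, "di_nucleosome"),
    (700, "tri_nucleosome"),
]


def classify_fragment(length):
    """Map a fragment length in bp to its nucleosomal class."""
    if length <= 0:
        raise ValueError("fragment length must be positive")
    for upper_bound, name in FRAGMENT_CLASSES:
        if length < upper_bound:
            return name
    return "other"  # longer than the tri-nucleosome range


# Made-up fragment lengths, for illustration only.
lengths = [48, 96, 149, 150, 210, 380, 520, 900]
counts = Counter(classify_fragment(n) for n in lengths)
nfr_fraction = counts["nucleosome_free"] / len(lengths)
print(nfr_fraction)  # prints: 0.375
```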
| Metric | Threshold | Category | Source |
|---|---|---|---|
| Total sequenced reads | >=50M (recommended) | Read depth | ENCODE |
| Mapping rate | >80% | Alignment | ENCODE |
| Mitochondrial fraction | <20% (ideal <5%) | Sample quality | ENCODE |
| NRF (non-redundant fraction) | >=0.8 | Library complexity | ENCODE |
| PBC1 | >=0.8 | Library complexity | ENCODE |
| TSS enrichment score | >=5 | Signal quality | ENCODE standard |
| FRiP | >=0.3 | Peak quality | ENCODE |
| NFR fraction | >0.4 of fragments <150bp | Fragment distribution | Buenrostro 2013 |
| IDR optimal peaks | >50,000 | Reproducibility | ENCODE |
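The thresholds in this table lend themselves to an automated pass/fail check. The sketch below is one possible way to encode them; the metric keys, the `check_qc` helper, and the example values are all hypothetical and do not correspond to any pipeline output format.

```python
import operator

# (comparison, limit) per metric, transcribed from the table above.
THRESHOLDS = {
    "total_reads": (operator.ge, 50_000_000),
    "mapping_rate": (operator.gt, 0.80),
    "mito_fraction": (operator.lt, 0.20),
    "nrf": (operator.ge, 0.8),
    "pbc1": (operator.ge, 0.8),
    "tss_enrichment": (operator.ge, 5),
    "frip": (operator.ge, 0.3),
    "nfr_fraction": (operator.gt, 0.4),
    "idr_optimal_peaks": (operator.gt, 50_000),
}


def check_qc(metrics):
    """Return {metric: passed} for every known metric present in `metrics`."""
    return {
        name: compare(metrics[name], limit)
        for name, (compare, limit) in THRESHOLDS.items()
        if name in metrics
    }


# Made-up values, for illustration only.
example_metrics = {
    "total_reads": 62_000_000,
    "mapping_rate": 0.95,
    "mito_fraction": 0.12,
    "nrf": 0.85,
    "pbc1": 0.78,
    "tss_enrichment": 6.4,
    "frip": 0.31,
}
results = check_qc(example_metrics)
failed = [name for name, passed in results.items() if not passed]
print(failed)  # prints: ['pbc1']
```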
The TSS enrichment score measures the fold enrichment of ATAC-seq signal at transcription start sites compared to flanking regions. It is the single most informative QC metric for ATAC-seq:
| Score | Quality | Interpretation |
|---|---|---|
| >=7 | Excellent | High signal-to-noise |
| 5-7 | Good | Acceptable for most analyses |
| 3-5 | Marginal | Review other metrics carefully |
| <3 | Poor | Likely failed; consider re-doing |
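As a rough illustration of the fold-enrichment idea, the sketch below normalizes an aggregate coverage profile around TSSs by its flanks and grades the result with the table above. It is a simplified sketch and not the pipeline's implementation: the 100 bp flank width, the use of the center position as the score, and the toy profile are all assumptions.

```python
def tss_enrichment(profile, flank=100):
    """Fold enrichment at the window center relative to the flanks.

    `profile` is aggregate read depth per base across a window centered on
    TSSs. The background is the mean depth of the first and last `flank`
    positions (an assumed convention).
    """
    if len(profile) <= 2 * flank:
        raise ValueError("profile is too short for the requested flank width")
    background = (sum(profile[:flank]) + sum(profile[-flank:])) / (2 * flank)
    if background == 0:
        raise ValueError("no reads in the flanking regions")
    return profile[len(profile) // 2] / background


def grade(score):
    """Map a TSS enrichment score to the quality labels in the table above."""
    if score >= 7:
        return "Excellent"
    if score >= 5:
        return "Good"
    if score >= 3:
        return "Marginal"
    return "Poor"


# Toy profile: flat depth of 10 with a triangular bump peaking at 80.
profile = [10.0] * 4001
for offset in range(-200, 201):
    profile[2000 + offset] = 10.0 + 70.0 * (1 - abs(offset) / 200)

score = tss_enrichment(profile)
print(score, grade(score))  # prints: 8.0 Excellent
```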
nextflow run scripts/main.nf \
-profile local \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
nextflow run scripts/main.nf \
-profile slurm \
--reads 'fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir results/
nextflow run scripts/main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 'gs://bucket/results/'
nextflow run scripts/main.nf \
-profile aws \
--reads 's3://bucket/fastq/*_R{1,2}.fq.gz' \
--genome GRCh38 \
--outdir 's3://bucket/results/'
| Platform | Instance | Cost/Sample | Time/Sample | Notes |
|---|---|---|---|---|
| GCP | n1-standard-8 | ~$2-4 | 2-3 hours | Preemptible recommended |
| AWS | m5.2xlarge | ~$2-4 | 2-3 hours | Spot instances recommended |
| Local | 8 cores, 32GB | $0 | 3-5 hours | Docker required |
| SLURM | 8 cores, 32GB | Varies | 2-3 hours | Singularity recommended |
results/
fastqc/ # Raw and trimmed QC reports
trimmed/ # Trimmed FASTQ files
aligned/ # Sorted BAM files (pre-filtering)
filtered/
shifted/ # Tn5-corrected BAM files
nfr/ # Nucleosome-free fragments (<150 bp)
mononuc/ # Mono-nucleosome fragments (150-300 bp)
peaks/
narrow/ # MACS2 narrowPeak files
idr/ # IDR-filtered reproducible peaks
signal/ # bigWig signal tracks
qc/
tss_enrichment/ # TSS enrichment scores and plots
fragment_size/ # Fragment size distribution plots
ataqv/ # Comprehensive ATAC-seq QC (ataqv)
multiqc/ # Aggregated QC report
logs/ # Nextflow execution logs
Mitochondrial DNA is not packaged into nucleosomes, so it is highly accessible to Tn5 and often captures 30-80% of reads. This is the most common ATAC-seq quality issue. Filter chrM reads before analysis. If more than 50% of reads are mitochondrial, consider optimizing the cell lysis step.
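The mitochondrial fraction can be estimated from per-chromosome mapped read counts. The sketch below assumes the tab-separated layout written by `samtools idxstats` (reference name, length, mapped reads, unmapped reads); `mito_fraction` is a hypothetical helper and the read counts are made-up values.

```python
def mito_fraction(idxstats_text, mito_chrom="chrM"):
    """Fraction of mapped reads assigned to the mitochondrial chromosome."""
    mapped = {}
    for line in idxstats_text.strip().splitlines():
        chrom, _length, n_mapped, _n_unmapped = line.split("\t")
        mapped[chrom] = int(n_mapped)
    total = sum(mapped.values())
    return mapped.get(mito_chrom, 0) / total if total else 0.0


# Made-up counts, for illustration only.
example = (
    "chr1\t248956422\t6000000\t0\n"
    "chr2\t242193529\t2000000\t0\n"
    "chrM\t16569\t2000000\t0\n"
    "*\t0\t0\t50000\n"
)
print(mito_fraction(example))  # prints: 0.2
```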
Without the +4/-5 bp offset correction, cut-site positions are off by 4-5 bp. This matters for footprinting and motif analysis but has minimal effect on peak calling. Always apply the shift for publication-quality results.
Bowtie2 handles the short fragments from ATAC-seq (especially NFR <150 bp) better than BWA-MEM. Use Bowtie2 with --very-sensitive for optimal ATAC-seq alignment.
Peak calling on all fragments mixes nucleosome-free signal (TF binding) with nucleosomal signal. Always size-select NFR (<150 bp) for peak calling.
TSS enrichment is the most informative single metric for ATAC-seq quality. A score <5 falls below the ENCODE threshold and calls for careful review of the other metrics; a score <3 likely indicates a failed experiment.
| File | Description | Lines |
|---|---|---|
| scripts/main.nf | Nextflow DSL2 pipeline | ~120 |
| scripts/nextflow.config | Execution profiles (local/slurm/gcp/aws) | ~60 |
| scripts/Dockerfile | Multi-stage Docker build with all tools | ~30 |
After running on your own data, compare with ENCODE reference:
# Find matching ENCODE ATAC-seq experiments
encode_search_experiments(
assay_title="ATAC-seq",
organ="pancreas",
biosample_type="tissue"
)
# Download ENCODE peaks for comparison
encode_batch_download(
download_dir="/data/encode_reference/",
output_type="IDR thresholded peaks",
assay_title="ATAC-seq",
organ="pancreas",
assembly="GRCh38"
)
--nomodel --shift -100 --extsize 200 is standard for ATAC-seq. Do NOT use the ChIP-seq default MACS2 settings; they assume sonicated fragment distributions.
Goal: Process raw ATAC-seq FASTQ files through the ENCODE pipeline to generate nucleosome-free region peaks and signal tracks for chromatin accessibility analysis.
Context: ATAC-seq requires Tn5 transposase insertion site correction (+4/-5 bp shift) and nucleosomal fragment size filtering, handled by the ENCODE ATAC-seq pipeline.
encode_get_experiment(accession="ENCSR637ENO")
Expected output:
{
"accession": "ENCSR637ENO",
"assay_title": "ATAC-seq",
"biosample_summary": "GM12878",
"replicates": 2,
"status": "released"
}
encode_list_files(accession="ENCSR637ENO", file_format="fastq")
Expected output:
{
"files": [
{"accession": "ENCFF100ATQ", "output_type": "reads", "paired_end": "1", "biological_replicates": [1], "file_size_mb": 1800},
{"accession": "ENCFF101ATQ", "output_type": "reads", "paired_end": "2", "biological_replicates": [1], "file_size_mb": 1900}
]
}
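A listing like the one above can be turned into read 1 / read 2 pairs per replicate before building the pipeline command. This is a minimal sketch: `pair_fastqs` is a hypothetical helper, and naming local files `<accession>.fastq.gz` is an assumption based on the file names used elsewhere in this document.

```python
# The expected output shown above, as a Python dict.
listing = {
    "files": [
        {"accession": "ENCFF100ATQ", "output_type": "reads", "paired_end": "1",
         "biological_replicates": [1], "file_size_mb": 1800},
        {"accession": "ENCFF101ATQ", "output_type": "reads", "paired_end": "2",
         "biological_replicates": [1], "file_size_mb": 1900},
    ]
}


def pair_fastqs(files):
    """Group FASTQ files into {replicate: (read1_name, read2_name)}."""
    by_replicate = {}
    for entry in files:
        for replicate in entry["biological_replicates"]:
            ends = by_replicate.setdefault(replicate, {})
            # Assumed local naming convention: <accession>.fastq.gz
            ends[entry["paired_end"]] = entry["accession"] + ".fastq.gz"
    # Keep only replicates where both mates are present.
    return {
        replicate: (ends["1"], ends["2"])
        for replicate, ends in sorted(by_replicate.items())
        if "1" in ends and "2" in ends
    }


pairs = pair_fastqs(listing["files"])
print(pairs)  # prints: {1: ('ENCFF100ATQ.fastq.gz', 'ENCFF101ATQ.fastq.gz')}
```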
nextflow run pipeline-atacseq/main.nf \
--fastq_r1 ENCFF100ATQ.fastq.gz \
--fastq_r2 ENCFF101ATQ.fastq.gz \
--genome GRCh38 \
--blacklist encode_blacklist_v2.bed \
--mitochondrial_chr chrM \
-profile docker
Key pipeline steps:
| Metric | Threshold | Purpose |
|---|---|---|
| TSS enrichment | >= 5 (GRCh38), >= 6 (hg19), >= 10 (mm10) | Signal enrichment at transcription start sites |
| Fragment size distribution | Nucleosomal ladder | ~200bp, ~400bp, ~600bp periodicity |
| Mitochondrial reads | < 20% | Excessive = failed library |
| FRiP | >= 0.3 (>= 0.2 acceptable) | Fraction of reads in peaks |
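FRiP in the table above is the fraction of reads that fall inside called peaks. The sketch below shows the calculation on toy data. It assumes each read is represented by a single position (for example its cut site) and that peaks are non-overlapping half-open intervals. The `frip` helper and all coordinates are hypothetical.

```python
from bisect import bisect_right


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside a peak.

    read_positions: list of (chrom, pos)
    peaks: list of (chrom, start, end), half-open and non-overlapping
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    in_peaks = 0
    for chrom, pos in read_positions:
        intervals = by_chrom.get(chrom, [])
        # Index of the last interval whose start is <= pos.
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions) if read_positions else 0.0


# Toy coordinates, for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [("chr1", 150), ("chr1", 199), ("chr1", 200), ("chr1", 550), ("chr2", 150)]
print(frip(reads, peaks))  # prints: 0.6
```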
encode_track_experiment(accession="ENCSR637ENO", notes="GM12878 ATAC-seq processed through ENCODE pipeline")
encode_search_experiments(
assay_title="ATAC-seq",
organ="pancreas"
)
Expected output:
{
"total": 8,
"experiments": [
{
"accession": "ENCSR789PAN",
"assay_title": "ATAC-seq",
"biosample_summary": "pancreas tissue male adult (44 years)",
"status": "released"
}
]
}
encode_list_files(
accession="ENCSR789PAN",
file_format="fastq"
)
Expected output:
{
"total": 4,
"files": [
{
"accession": "ENCFF100ATQ",
"file_format": "fastq",
"read_length": 50,
"paired_end": "1",
"file_size_mb": 3200.1
}
]
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Accessible chromatin peaks | accessibility-aggregation | Cross-experiment union merge |
| Peak regions (BED) | motif-analysis | TF motif enrichment in open chromatin |
| Signal tracks (bigWig) | visualization-workflow | Genome browser accessibility display |
| Nucleosome-free peaks | regulatory-elements | Classify accessible regions as enhancers/promoters |
| Peak coordinates | variant-annotation | Identify variants in accessible chromatin |
| TSS enrichment scores | quality-assessment | Validate against ENCODE ATAC-seq standards |
| Pipeline parameters | data-provenance | Record Tn5 shift, fragment filters, tool versions |
| Peak files | jaspar-motifs | Scan accessible regions for known TF motifs |
When reporting ATAC-seq pipeline results:
motif-analysis for TF footprinting and de novo motif discovery, or visualization-workflow for genome browser session generation
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit