Skill

pipeline-dnaseseq

Executes ENCODE DNase-seq pipeline from FASTQ to hotspots, DHS peaks, and TF footprints using Nextflow with Docker for local/cloud deployment.

Docker

data-engineering

automation

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/encode-toolkit:pipeline-dnaseseq

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- User wants to run a DNase-seq processing pipeline from FASTQ to hotspots and footprints

Supporting Files

references/01-qc-trimming.mdreferences/02-alignment.mdreferences/03-filtering.mdreferences/04-hotspot-calling.mdreferences/05-footprinting.mdreferences/literature.mdscripts/Dockerfilescripts/main.nfscripts/nextflow.config

SKILL.md

424 lines · ~3.8k tokens

Stats

LanguagePython

Stars35

Forks5

MaintenanceExcellent

Last CommitJun 14, 2026

Actions

View Source View Plugin View on GitHub View README

ENCODE DNase-seq Pipeline: FASTQ to Hotspots and Footprints

When to Use

User wants to run a DNase-seq processing pipeline from FASTQ to hotspots and footprints
User asks about "DNase-seq pipeline", "DNase hypersensitive sites", "Hotspot2", "footprinting", or "DHS"
User needs to process DNase-seq data for chromatin accessibility and TF footprint analysis
Example queries: "process my DNase-seq FASTQs", "call DNase hypersensitive sites", "run footprinting analysis on DNase-seq"

Execute the ENCODE DNase-seq pipeline for chromatin accessibility profiling, producing DNase hypersensitive sites (DHSs) via Hotspot2 and transcription factor footprints.

Pipeline Overview

FASTQ -> Trim -> BWA-MEM align -> Filter/dedup -> Hotspot2 -> DHS peaks
                                       |                        |
                                    Signal track         Footprinting (HINT)

ENCODE Repository

GitHub: ENCODE-DCC/dnase-seq-pipeline
Container: encodedcc/dnase-seq-pipeline
WDL: Available for Cromwell execution
This skill: Nextflow DSL2 reimplementation for portability

Core Tools and Versions

Tool	Version	Purpose	Citation
BWA-MEM	0.7.17	Alignment	Li & Durbin 2009
samtools	1.19	BAM operations	Li et al. 2009
Picard	3.1.1	Duplicate marking	Broad Institute
Hotspot2	2.3.1	DHS calling (ENCODE standard)	John et al. 2011
bedtools	2.31.0	Genomic arithmetic	Quinlan & Hall 2010
HINT-ATAC	0.13.2	TF footprinting	Li et al. 2019
FastQC	0.12.1	Read quality	Andrews (Babraham)
MultiQC	1.21	Aggregated QC	Ewels et al. 2016

Key Literature

John et al. 2011 - "Chromatin accessibility pre-determines glucocorticoid receptor binding patterns" (Nature Genetics, ~600 citations) DOI: 10.1038/ng.759
Thurman et al. 2012 - "The accessible chromatin landscape of the human genome" (Nature, ~3,000 citations) DOI: 10.1038/nature11232
Vierstra et al. 2020 - "Global reference mapping of human transcription factor footprints" (Nature, ~600 citations) DOI: 10.1038/s41586-020-2528-x
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
Li et al. 2019 - "Identification of transcription factor binding sites using ATAC-seq" (Genome Biology) -- HINT-ATAC footprinting DOI: 10.1186/s13059-019-1642-2

Execution

Quick Start (Local)

nextflow run main.nf \
    -profile local \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index '/ref/bwa_index/genome.fa' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --hotspot_index '/ref/hotspot2_index/' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

SLURM HPC

nextflow run main.nf \
    -profile slurm \
    --reads '/data/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index '/ref/bwa_index/genome.fa' \
    --chrom_sizes '/ref/hg38.chrom.sizes' \
    --hotspot_index '/ref/hotspot2_index/' \
    --blacklist '/ref/hg38-blacklist.v2.bed' \
    --outdir results/ \
    -resume

Cloud (GCP / AWS)

nextflow run main.nf \
    -profile gcp \
    --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
    --bwa_index 'gs://bucket/ref/genome.fa' \
    --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
    --hotspot_index 'gs://bucket/ref/hotspot2_index/' \
    --blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
    --outdir 'gs://bucket/results/' \
    -resume

Resource Requirements

Step	CPUs	RAM	Time (per sample)
BWA-MEM align	8	16 GB	1-2 hours
Filter/dedup	4	8 GB	30-60 min
Hotspot2	4	8 GB	30-60 min
Signal generation	2	4 GB	15-30 min
Footprinting	4	8 GB	1-2 hours
Total	8	16 GB	3-6 hours

Pipeline Parameters

Parameter	Default	Description
`--reads`	required	Glob pattern to paired FASTQ files
`--bwa_index`	required	Path to BWA genome index (.fa)
`--chrom_sizes`	required	Chromosome sizes file
`--hotspot_index`	required	Hotspot2 mappability index directory
`--blacklist`	required	ENCODE blacklist BED file
`--outdir`	`./results`	Output directory
`--fdr`	`0.05`	Hotspot2 FDR threshold
`--skip_footprint`	`false`	Skip footprinting analysis
`--motif_db`	`null`	JASPAR motif database for footprinting

Output Files

results/
  fastqc/                          # Raw read quality
  alignment/
    {sample}.filtered.bam          # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  hotspots/
    {sample}.hotspots.fdr0.05.bed  # DHS peaks (primary output)
    {sample}.peaks.narrowPeak      # narrowPeak format
    {sample}.density.bw            # Signal track (bigWig)
    {sample}.allcalls.bed          # All hotspot calls (unfiltered)
    {sample}.SPOT.txt              # SPOT score
  footprints/
    {sample}.footprints.bed        # TF footprints
    {sample}.footprint_scores.txt  # Per-motif footprint scores
  qc/
    {sample}.flagstat.txt
    {sample}.insert_sizes.txt
  multiqc/
    multiqc_report.html

QC Thresholds (ENCODE Standards)

Metric	Pass	Warning	Fail
SPOT score (Signal Portion of Tags)	>0.4	0.2-0.4	<0.2
Hotspot count	>50,000	20,000-50,000	<20,000
Mapping rate	>80%	60-80%	<60%
Duplication rate	<30%	30-50%	>50%
NRF (Non-Redundant Fraction)	>0.8	0.7-0.8	<0.7
PBC1 (PCR Bottleneck Coefficient 1)	>0.9	0.7-0.9	<0.7
Insert size peak	50-150 bp	Variable	Abnormal

SPOT Score

The SPOT score (Signal Portion of Tags) is the fraction of reads falling within hotspots. It is the DNase-seq equivalent of FRiP for ChIP-seq.

Higher SPOT = more enrichment in accessible regions = better library quality.

Hotspot2 vs MACS2

IMPORTANT: ENCODE uses Hotspot2 for DNase-seq, NOT MACS2.

Feature	Hotspot2	MACS2
Designed for	DNase-seq	ChIP-seq
Background model	Local tag density + mappability	Dynamic Poisson
ENCODE standard	Yes (DNase-seq)	Yes (ChIP-seq/ATAC-seq)
Mappability correction	Built-in	Not available
Output	Hotspots + peaks	Peaks only

Hotspot2 accounts for mappability variation across the genome, which is critical for DNase-seq because DNase I cuts accessible chromatin regardless of whether it is uniquely mappable.

Critical Pitfalls

DNase-seq vs ATAC-seq

These are different assays measuring the same biology (chromatin accessibility):

DNase-seq: Uses DNase I enzyme, requires more input material
ATAC-seq: Uses Tn5 transposase, works on fewer cells
Analysis pipelines differ: Hotspot2 for DNase-seq, MACS2 for ATAC-seq
Data are largely concordant but not identical

Fragment Size Distribution

DNase-seq produces a characteristic fragment size distribution:

Peak at ~50-100 bp (sub-nucleosomal fragments at DHS)
Secondary peak at ~150-200 bp (mononucleosomal fragments)
Long tail of larger fragments
If distribution is abnormal, check library preparation protocol

Mappability Index

Hotspot2 requires a pre-computed mappability index. These are read-length and genome-build specific:

hg38 / 36 bp: Use ENCODE-provided index
hg38 / 76 bp: Use ENCODE-provided index
hg38 / 150 bp: May need to generate custom index
Wrong mappability index = incorrect peak calls

Blacklist Filtering

Always filter peaks against the ENCODE blacklist (Amemiya et al. 2019):

bedtools intersect -a hotspots.bed -b hg38-blacklist.v2.bed -v > hotspots_filtered.bed

Blacklist regions produce artifactual signal in accessibility assays.

Footprinting Analysis

Transcription factor footprinting detects bound TFs from DNase-seq signal:

HINT-ATAC Footprinting

rgt-hint footprinting \
    --atac-seq \
    --paired-end \
    --organism hg38 \
    --output-location footprints/ \
    sample.filtered.bam \
    hotspots.narrowPeak

Interpretation

Footprints are depressions in the DNase signal where a bound TF protects DNA
Requires deep sequencing (>100M reads) for reliable footprints
Sensitivity varies by TF: pioneer factors have shallow footprints
Vierstra et al. 2020 provides a global reference map for comparison

Provenance Integration

After pipeline completion, log all outputs:

encode_log_derived_file(
    file_path="/results/hotspots/sample1.hotspots.fdr0.05.bed",
    source_accessions=["ENCSR...", "ENCFF..."],
    description="DNase hypersensitive sites from ENCODE DNase-seq pipeline",
    file_type="DHS_peaks",
    tool_used="BWA 0.7.17 + Hotspot2 2.3.1",
    parameters="FDR 0.05, blacklist filtered, ENCODE hg38 mappability index"
)

Reference Files

Detailed step-by-step documentation is provided in the references/ directory:

01-qc-trimming.md -- Read QC and adapter trimming
02-alignment.md -- BWA-MEM alignment for DNase-seq
03-filtering.md -- BAM filtering, deduplication, blacklist removal
04-hotspot-calling.md -- Hotspot2 DHS detection and signal generation
05-footprinting.md -- TF footprint detection with HINT-ATAC

Walkthrough: Processing ENCODE DNase-seq from FASTQ to Hypersensitive Sites

Goal: Process raw DNase-seq FASTQ files through the ENCODE pipeline to generate DNase I hypersensitive site (DHS) peak calls. Context: DNase-seq identifies open chromatin via DNase I enzyme digestion. The pipeline uses BWA alignment and Hotspot2 for DHS identification.

Step 1: Find DNase-seq experiment

encode_search_experiments(assay_title="DNase-seq", biosample_term_name="K562", organism="Homo sapiens")

Expected output:

{
  "total": 8,
  "results": [
    {"accession": "ENCSR000DNS", "assay_title": "DNase-seq", "biosample_summary": "K562", "status": "released"}
  ]
}

Step 2: List and download FASTQ files

encode_list_files(accession="ENCSR000DNS", file_format="fastq")

Step 3: Run the DNase-seq pipeline

nextflow run pipeline-dnaseseq/main.nf \
  --fastq_r1 ENCFF700DN1.fastq.gz \
  --genome GRCh38 \
  --blacklist encode_blacklist_v2.bed \
  -profile docker

Key pipeline steps:

Quality trimming
BWA-MEM alignment
Duplicate removal
Hotspot2 DHS calling
Signal track generation
Footprint analysis (HINT-ATAC)

Step 4: Validate output quality

Metric	Threshold	Purpose
SPOT score	> 0.4	Signal portion of tags
Hotspot count	> 100,000	Sensitivity
Duplicate rate	< 30%	Library complexity

Step 5: Compare with ATAC-seq

encode_search_experiments(assay_title="ATAC-seq", biosample_term_name="K562", organism="Homo sapiens")

Interpretation: DNase-seq and ATAC-seq both measure accessibility but with different biases. Compare peaks from both assays -- concordant peaks are high confidence.

Integration with downstream skills

DHS peaks feed into -> accessibility-aggregation alongside ATAC-seq peaks
Footprint data feeds into -> motif-analysis for TF binding prediction
Signal tracks feed into -> visualization-workflow
Peaks integrate with -> regulatory-elements for cCRE classification

Code Examples

1. Survey DNase-seq availability

encode_get_facets(assay_title="DNase-seq", facet_field="organ", organism="Homo sapiens")

Expected output:

{
  "facets": {
    "organ": {"blood": 45, "brain": 30, "liver": 20, "heart": 15, "lung": 12}
  }
}

2. Check for existing DHS peaks

encode_list_files(accession="ENCSR000DNS", file_format="bed", output_type="peaks", assembly="GRCh38")

Expected output:

{
  "files": [
    {"accession": "ENCFF800DHS", "output_type": "peaks", "file_format": "bed narrowPeak", "file_size_mb": 1.5}
  ]
}

3. Track DNase-seq experiments

encode_track_experiment(accession="ENCSR000DNS", notes="K562 DNase-seq for accessibility comparison with ATAC-seq")

Expected output:

{
  "status": "tracked",
  "accession": "ENCSR000DNS",
  "notes": "K562 DNase-seq for accessibility comparison with ATAC-seq"
}

Integration

This skill produces...	Feed into...	Purpose
DHS peaks (narrowPeak)	accessibility-aggregation	Union merge with ATAC-seq peaks
TF footprints	motif-analysis	Validate motif predictions with footprint evidence
Signal tracks (bigWig)	visualization-workflow	Genome browser display
Accessible regions	regulatory-elements	cCRE classification
DHS coordinates	variant-annotation	Annotate variants in hypersensitive sites
QC metrics	quality-assessment	Validate SPOT score and sensitivity
Pipeline parameters	data-provenance	Record BWA/Hotspot2 versions
DHS peak regions	jaspar-motifs	Scan accessible sites for known TF motifs

Related Skills

pipeline-guide -- Parent skill with compute resource assessment and cloud setup
accessibility-aggregation -- Aggregate DHS data across samples/tissues
quality-assessment -- Evaluate pipeline output quality metrics
data-provenance -- Track all pipeline inputs, outputs, and parameters
download-encode -- Download ENCODE DNase-seq FASTQ files for pipeline input
publication-trust -- Verify literature claims backing analytical decisions

Presenting Results

When reporting DNase-seq pipeline results:

Hotspot counts: Report total Hotspot2 DHS calls at the specified FDR threshold and the number remaining after blacklist filtering
Signal-to-noise (SPOT score): Report the SPOT score prominently (>0.4 pass, 0.2-0.4 warning, <0.2 fail). This is the DNase-seq equivalent of FRiP
Footprint depth: If footprinting was performed, report the number of TF footprints detected and note the sequencing depth (>100M reads recommended for reliable footprints)
FRiP equivalent: Report the fraction of reads in hotspots as a complementary enrichment metric
Key QC metrics: Present mapping rate (>80%), duplication rate (<30%), NRF (>0.8), PBC1 (>0.9), and insert size peak in a summary table
Output paths: Provide paths to hotspot BED files, narrowPeak files, signal bigWig tracks, and footprint results
Mappability note: Confirm which Hotspot2 mappability index was used and that it matches the read length
Next steps: Suggest motif-analysis for TF motif enrichment in DHS peaks, or accessibility-aggregation for merging DHS data across samples

pipeline-dnaseseq

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

pipeline-dnaseseq

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

ENCODE DNase-seq Pipeline: FASTQ to Hotspots and Footprints

When to Use

Pipeline Overview

ENCODE Repository

Core Tools and Versions

Key Literature

Execution

Quick Start (Local)

SLURM HPC

Cloud (GCP / AWS)

Resource Requirements

Pipeline Parameters

Output Files

QC Thresholds (ENCODE Standards)

SPOT Score

Hotspot2 vs MACS2

Critical Pitfalls

DNase-seq vs ATAC-seq

Fragment Size Distribution

Mappability Index

Blacklist Filtering

Footprinting Analysis

HINT-ATAC Footprinting

Interpretation

Provenance Integration

Reference Files

Walkthrough: Processing ENCODE DNase-seq from FASTQ to Hypersensitive Sites

Step 1: Find DNase-seq experiment

Step 2: List and download FASTQ files

Step 3: Run the DNase-seq pipeline

Step 4: Validate output quality

Step 5: Compare with ATAC-seq

Integration with downstream skills

Code Examples

1. Survey DNase-seq availability

2. Check for existing DHS peaks

3. Track DNase-seq experiments

Integration

Related Skills

Presenting Results

For the request: "$ARGUMENTS"

Similar Skills

ENCODE DNase-seq Pipeline: FASTQ to Hotspots and Footprints

When to Use

Pipeline Overview

ENCODE Repository

Core Tools and Versions

Key Literature

Execution

Quick Start (Local)

SLURM HPC

Cloud (GCP / AWS)

Resource Requirements

Pipeline Parameters

Output Files

QC Thresholds (ENCODE Standards)

SPOT Score

Hotspot2 vs MACS2

Critical Pitfalls

DNase-seq vs ATAC-seq

Fragment Size Distribution

Mappability Index

Blacklist Filtering

Footprinting Analysis

HINT-ATAC Footprinting

Interpretation

Provenance Integration

Reference Files