From encode-toolkit
Executes CUT&RUN/CUT&Tag processing pipeline from FASTQ to peaks and signal tracks using Nextflow, Docker, SEACR peak calling, and spike-in normalization.
Slash command: /encode-toolkit:pipeline-cutandrun
- User wants to run a CUT&RUN or CUT&Tag processing pipeline from FASTQ to peaks
Execute the CUT&RUN/CUT&Tag processing pipeline for targeted chromatin profiling, producing peak calls with SEACR and spike-in normalized signal tracks.
    FASTQ -> Trim -> Bowtie2 align (genome) -> Filter/dedup -> SEACR peaks
                          |                        |                |
              Bowtie2 align (spike-in)     Spike-in normalize  Signal tracks
                          |
              Scale factor calculation
ENCODE-DCC/cutandrun-pipeline
encodedcc/cutandrun-pipeline

| Tool | Version | Purpose | Citation |
|---|---|---|---|
| Bowtie2 | 2.5.3 | Alignment (genome + spike-in) | Langmead & Salzberg 2012 |
| SEACR | 1.3 | Peak calling (CUT&RUN-specific) | Meers et al. 2019 |
| MACS2 | 2.2.9.1 | Alternative peak caller | Zhang et al. 2008 |
| Picard | 3.1.1 | Duplicate marking | Broad Institute |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| bedtools | 2.31.0 | Genomic arithmetic | Quinlan & Hall 2010 |
| deepTools | 3.5.4 | Signal track generation | Ramirez et al. 2016 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |
Skene & Henikoff 2017 - "An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites" (eLife, ~1,500 citations) DOI: 10.7554/eLife.21856
Meers et al. 2019 - "Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling" (Epigenetics & Chromatin, ~800 citations) DOI: 10.1186/s13072-019-0287-4
Kaya-Okur et al. 2019 - "CUT&Tag for efficient epigenomic profiling of small samples and single cells" (Nature Communications, ~1,200 citations) DOI: 10.1038/s41467-019-09982-5
Nordin et al. 2023 - "The CUT&RUN suspect list of problematic regions" (Genome Biology) DOI: 10.1186/s13059-023-02960-3
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
nextflow run main.nf \
  -profile local \
  --reads '/data/fastq/*_R{1,2}.fastq.gz' \
  --bowtie2_index '/ref/bowtie2_index/genome' \
  --spikein_index '/ref/bowtie2_ecoli/ecoli' \
  --chrom_sizes '/ref/hg38.chrom.sizes' \
  --blacklist '/ref/hg38-blacklist.v2.bed' \
  --outdir results/ \
  -resume

nextflow run main.nf \
  -profile slurm \
  --reads '/data/fastq/*_R{1,2}.fastq.gz' \
  --bowtie2_index '/ref/bowtie2_index/genome' \
  --spikein_index '/ref/bowtie2_ecoli/ecoli' \
  --chrom_sizes '/ref/hg38.chrom.sizes' \
  --blacklist '/ref/hg38-blacklist.v2.bed' \
  --outdir results/ \
  -resume

nextflow run main.nf \
  -profile gcp \
  --reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
  --bowtie2_index 'gs://bucket/ref/bowtie2_index/genome' \
  --spikein_index 'gs://bucket/ref/bowtie2_ecoli/ecoli' \
  --chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
  --blacklist 'gs://bucket/ref/hg38-blacklist.v2.bed' \
  --outdir 'gs://bucket/results/' \
  -resume
| Step | CPUs | RAM | Time (per sample) |
|---|---|---|---|
| Bowtie2 align (genome) | 8 | 8 GB | 30-60 min |
| Bowtie2 align (spike-in) | 4 | 4 GB | 10-20 min |
| Filter/dedup | 4 | 8 GB | 15-30 min |
| SEACR peaks | 2 | 4 GB | 10-20 min |
| Signal tracks | 4 | 8 GB | 15-30 min |
| Total | 8 | 8 GB | 1.5-3 hours |
| Parameter | Default | Description |
|---|---|---|
| --reads | required | Glob pattern to paired FASTQ files |
| --bowtie2_index | required | Bowtie2 genome index prefix |
| --spikein_index | required | Bowtie2 E. coli spike-in index prefix |
| --chrom_sizes | required | Chromosome sizes file |
| --blacklist | required | ENCODE blacklist BED file |
| --outdir | ./results | Output directory |
| --seacr_mode | stringent | SEACR mode: stringent or relaxed |
| --seacr_norm | norm | SEACR normalization: norm or non |
| --control | null | IgG control BAM (if available) |
| --peak_caller | seacr | Peak caller: seacr or macs2 or both |
| --skip_spikein | false | Skip spike-in normalization |
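
As a rough illustration of how the parameters in the table fit together, the sketch below assembles a `nextflow run` command line in Python. The helper `build_command` and the `REQUIRED`, `DEFAULTS`, and `ALLOWED` names are hypothetical; only the flag names, defaults, and allowed values come from the table, and the pipeline itself may validate its inputs differently.

```python
import shlex

# Flag names, defaults, and allowed values are copied from the parameter table.
REQUIRED = ("reads", "bowtie2_index", "spikein_index", "chrom_sizes", "blacklist")
DEFAULTS = {"outdir": "./results", "seacr_mode": "stringent",
            "seacr_norm": "norm", "peak_caller": "seacr"}
ALLOWED = {"seacr_mode": {"stringent", "relaxed"},
           "seacr_norm": {"norm", "non"},
           "peak_caller": {"seacr", "macs2", "both"}}


def build_command(params, profile="local"):
    """Hypothetical helper: return the nextflow argv for the given parameters."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    merged = {**DEFAULTS, **params}
    for name, choices in ALLOWED.items():
        if merged[name] not in choices:
            raise ValueError(f"--{name} must be one of {sorted(choices)}")
    argv = ["nextflow", "run", "main.nf", "-profile", profile]
    for name, value in merged.items():
        argv += [f"--{name}", str(value)]
    argv.append("-resume")
    return argv


if __name__ == "__main__":
    cmd = build_command({
        "reads": "/data/fastq/*_R{1,2}.fastq.gz",
        "bowtie2_index": "/ref/bowtie2_index/genome",
        "spikein_index": "/ref/bowtie2_ecoli/ecoli",
        "chrom_sizes": "/ref/hg38.chrom.sizes",
        "blacklist": "/ref/hg38-blacklist.v2.bed",
    })
    # shlex.join quotes the glob so the shell does not expand it early.
    print(shlex.join(cmd))
```

Building the argument list first and quoting it only when printing keeps the glob pattern intact, which matches the single-quoted `--reads` values in the commands above.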
results/
  fastqc/                              # Raw read quality
  alignment/
    {sample}.filtered.bam              # Filtered, deduplicated BAM
    {sample}.filtered.bam.bai
  spikein/
    {sample}.spikein_counts.txt        # Spike-in read counts
    {sample}.scale_factor.txt          # Computed scale factor
  peaks/
    {sample}.seacr.stringent.bed       # SEACR stringent peaks
    {sample}.seacr.relaxed.bed         # SEACR relaxed peaks
    {sample}.macs2_peaks.narrowPeak    # MACS2 peaks (if requested)
  signal/
    {sample}.normalized.bw             # Spike-in normalized signal
    {sample}.fragments.bed             # Fragment BED file
  qc/
    {sample}.flagstat.txt
    {sample}.fragment_sizes.txt
    {sample}.frip.txt
  multiqc/
    multiqc_report.html
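
A quick completeness check after a run could look like the sketch below. It assumes every directory in the listing sits directly under the output directory, and it leaves out the peak files because which ones exist depends on the peak-caller settings. The function names `expected_outputs` and `missing_outputs` are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def expected_outputs(outdir, sample):
    """Per-sample files from the output listing (peak files excluded).

    Assumes alignment/, spikein/, signal/ and qc/ are direct children of outdir.
    """
    root = Path(outdir)
    return [
        root / "alignment" / f"{sample}.filtered.bam",
        root / "alignment" / f"{sample}.filtered.bam.bai",
        root / "spikein" / f"{sample}.spikein_counts.txt",
        root / "spikein" / f"{sample}.scale_factor.txt",
        root / "signal" / f"{sample}.normalized.bw",
        root / "signal" / f"{sample}.fragments.bed",
        root / "qc" / f"{sample}.flagstat.txt",
        root / "qc" / f"{sample}.fragment_sizes.txt",
        root / "qc" / f"{sample}.frip.txt",
    ]


def missing_outputs(outdir, sample):
    """Return the expected files that do not exist yet."""
    return [path for path in expected_outputs(outdir, sample) if not path.is_file()]


if __name__ == "__main__":
    for path in missing_outputs("results", "sample1"):
        print(f"missing: {path}")
```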
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Mapping rate (genome) | >80% | 60-80% | <60% |
| Spike-in reads | 1-10% of total | 0.1-1% or 10-30% | <0.1% or >30% |
| Duplication rate | <20% | 20-40% | >40% |
| FRiP (peaks) | >10% | 5-10% | <5% |
| Peak count | >5,000 | 1,000-5,000 | <1,000 |
| Fragment size | Nucleosomal pattern | Irregular | No pattern |
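
The numeric rows of the table above can be applied programmatically, as in the sketch below. The metric keys (`mapping_rate_pct` and so on) and the function `classify_qc` are hypothetical names. The table does not say how values exactly on a boundary are treated, so this sketch assumes 1% and 10% spike-in count as pass and every other boundary value counts as a warning. Fragment size is a qualitative check and is left out.

```python
def _higher_is_better(value, pass_above, fail_below):
    # Boundary values fall into the warning band (an assumption).
    if value > pass_above:
        return "pass"
    if value < fail_below:
        return "fail"
    return "warning"


def _lower_is_better(value, pass_below, fail_above):
    if value < pass_below:
        return "pass"
    if value > fail_above:
        return "fail"
    return "warning"


def _spikein(value):
    # Pass: 1-10% of total; warning: 0.1-1% or 10-30%; fail otherwise.
    if 1 <= value <= 10:
        return "pass"
    if 0.1 <= value <= 30:
        return "warning"
    return "fail"


# Thresholds copied from the QC table; keys are illustrative names.
CHECKS = {
    "mapping_rate_pct": lambda v: _higher_is_better(v, 80, 60),
    "spikein_pct": _spikein,
    "duplication_rate_pct": lambda v: _lower_is_better(v, 20, 40),
    "frip_pct": lambda v: _higher_is_better(v, 10, 5),
    "peak_count": lambda v: _higher_is_better(v, 5000, 1000),
}


def classify_qc(metrics):
    """Map each known metric to 'pass', 'warning', or 'fail'."""
    return {name: CHECKS[name](value) for name, value in metrics.items()}


if __name__ == "__main__":
    sample = {"mapping_rate_pct": 92.0, "spikein_pct": 0.5,
              "duplication_rate_pct": 45.0, "frip_pct": 7.0, "peak_count": 12000}
    for name, status in classify_qc(sample).items():
        print(f"{name}: {status}")
```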
CUT&RUN produces a characteristic nucleosomal ladder:
Spike-in normalization is CRITICAL for CUT&RUN quantitative comparison.
Sample A: 200,000 spike-in reads -> scale = 1.0 (reference)
Sample B: 400,000 spike-in reads -> scale = 0.5
Sample C: 100,000 spike-in reads -> scale = 2.0
Higher spike-in counts = less target enrichment = lower scale factor.
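
The three values above are consistent with dividing a reference sample's spike-in count by each sample's own count, with Sample A as the reference. The sketch below reproduces them under that assumption. The formula is inferred from the example rather than confirmed against the pipeline, and `scale_factors` is a hypothetical function name.

```python
def scale_factors(spikein_counts, reference):
    """Scale factor per sample = reference spike-in count / sample spike-in count.

    Inferred from the worked example; the pipeline's own formula may differ.
    """
    if reference not in spikein_counts:
        raise KeyError(f"unknown reference sample: {reference}")
    if any(count <= 0 for count in spikein_counts.values()):
        raise ValueError("spike-in counts must be positive")
    ref_count = spikein_counts[reference]
    return {sample: ref_count / count for sample, count in spikein_counts.items()}


if __name__ == "__main__":
    counts = {"A": 200_000, "B": 400_000, "C": 100_000}
    for sample, factor in scale_factors(counts, reference="A").items():
        print(f"Sample {sample}: scale = {factor}")
```

A sample with no spike-in reads cannot be scaled this way, which is why the sketch rejects zero counts instead of returning an infinite factor.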
| Feature | SEACR | MACS2 |
|---|---|---|
| Designed for | CUT&RUN/CUT&Tag | ChIP-seq |
| Background model | Sparse enrichment | Dynamic Poisson |
| Control required | Optional (IgG) | Recommended |
| Low background | Handles well | May overcall |
| Stringent mode | Very conservative | Via q-value |
| ENCODE recommendation | Primary for CUT&RUN | Alternative |
SEACR is specifically designed for the sparse, low-background signal profile of CUT&RUN data. MACS2 may overcall peaks due to the low background.
Without spike-in normalization, quantitative comparisons between samples are unreliable. The amount of pA-MNase (or pA-Tn5) varies between experiments, and spike-in reads provide the internal calibration standard.
In addition to the ENCODE blacklist, filter CUT&RUN peaks against the suspect list (Nordin et al. 2023), which identifies regions with artifactual signal specific to CUT&RUN/CUT&Tag protocols:
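# Download suspect list
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/CUTandRUN.suspectlist.hg38.bed.gz
# Filter peaks: -v keeps only peaks that overlap neither list.
# bedtools reads the gzipped BED directly, so the file name must match the download.
bedtools intersect \
  -a peaks.bed \
  -b hg38-blacklist.v2.bed CUTandRUN.suspectlist.hg38.bed.gz \
  -v \
  > peaks_filtered.bed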
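
To make the effect of `-v` concrete, the sketch below applies the same rule in plain Python: a peak is kept only if it overlaps no excluded region. It assumes BED-style half-open intervals held in memory as `(chrom, start, end)` tuples, and `filter_peaks` is a hypothetical name. It is meant as an illustration of the logic, not a replacement for bedtools on real files.

```python
from collections import defaultdict


def filter_peaks(peaks, excluded):
    """Keep peaks that overlap no excluded region (like `bedtools intersect -v`).

    Intervals are (chrom, start, end) with 0-based, half-open coordinates,
    so regions that merely touch end-to-start do not overlap.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in excluded:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in peaks:
        overlaps = any(start < ex_end and ex_start < end
                       for ex_start, ex_end in by_chrom.get(chrom, []))
        if not overlaps:
            kept.append((chrom, start, end))
    return kept


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    # Blacklist and suspect-list regions are simply concatenated.
    excluded = [("chr1", 150, 160), ("chr2", 200, 300)]
    for peak in filter_peaks(peaks, excluded):
        print(*peak, sep="\t")
```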
Both protocols are supported by this pipeline. Differences:
After pipeline completion, log all outputs:
encode_log_derived_file(
    file_path="/results/peaks/sample1.seacr.stringent.bed",
    source_accessions=["ENCSR...", "ENCFF..."],
    description="CUT&RUN peaks from ENCODE CUT&RUN pipeline",
    file_type="CUT&RUN_peaks",
    tool_used="Bowtie2 2.5.3 + SEACR 1.3",
    parameters="stringent mode, spike-in normalized, blacklist + suspect list filtered"
)
Detailed step-by-step documentation is provided in the references/ directory:
- 01-qc-trimming.md -- Read QC and adapter trimming for CUT&RUN
- 02-bowtie2-alignment.md -- Bowtie2 alignment to genome and spike-in
- 03-filtering-spikein.md -- Filtering, dedup, and spike-in normalization
- 04-seacr-peaks.md -- SEACR peak calling and MACS2 alternative
- 05-qc-metrics.md -- Fragment sizes, FRiP, spike-in QC

Goal: Process CUT&RUN/CUT&Tag FASTQ files through the ENCODE-compatible pipeline to generate peak calls with spike-in normalization.

Context: CUT&RUN uses targeted MNase digestion (lower background than ChIP-seq) but requires different peak calling (SEACR instead of MACS2) and spike-in normalization for quantitative comparisons.
encode_search_experiments(assay_title="CUT&RUN", organism="Homo sapiens")
Expected output:
{
"total": 35,
"results": [
{"accession": "ENCSR900CUR", "assay_title": "CUT&RUN", "target": "H3K27me3", "biosample_summary": "K562", "status": "released"}
]
}
encode_list_files(accession="ENCSR900CUR", file_format="fastq")
Expected output:
{
"files": [
{"accession": "ENCFF900CR1", "output_type": "reads", "paired_end": "1", "file_size_mb": 800},
{"accession": "ENCFF901CR2", "output_type": "reads", "paired_end": "2", "file_size_mb": 850}
]
}
Interpretation: CUT&RUN FASTQ files are typically smaller than ChIP-seq files (~800 MB vs ~2.5 GB) because the lower background means fewer reads are needed.
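
Before launching the pipeline, the read-1 and read-2 files have to be picked out of a listing like the one above. The sketch below does this for the simple case of exactly one file per mate, using the `paired_end` field from the expected output. The function name `pair_fastqs` and the `<accession>.fastq.gz` naming are assumptions for illustration, and listings with several files per mate would need extra pairing information that the output shown here does not include.

```python
import json

# The expected output shown above, reproduced as a JSON string.
LISTING = json.loads("""
{
  "files": [
    {"accession": "ENCFF900CR1", "output_type": "reads", "paired_end": "1", "file_size_mb": 800},
    {"accession": "ENCFF901CR2", "output_type": "reads", "paired_end": "2", "file_size_mb": 850}
  ]
}
""")


def pair_fastqs(listing):
    """Return (read1, read2) file names for a listing with one file per mate."""
    by_end = {"1": [], "2": []}
    for entry in listing["files"]:
        end = entry["paired_end"]
        if end not in by_end:
            raise ValueError(f"unexpected paired_end value: {end!r}")
        by_end[end].append(entry["accession"])
    if len(by_end["1"]) != 1 or len(by_end["2"]) != 1:
        raise ValueError("expected exactly one read-1 and one read-2 file")
    # File naming is assumed: accession plus .fastq.gz.
    return tuple(f"{by_end[end][0]}.fastq.gz" for end in ("1", "2"))


if __name__ == "__main__":
    read1, read2 = pair_fastqs(LISTING)
    print(read1, read2)
```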
nextflow run pipeline-cutandrun/main.nf \
  --fastq_r1 ENCFF900CR1.fastq.gz \
  --fastq_r2 ENCFF901CR2.fastq.gz \
  --genome GRCh38 \
  --spike_in_genome dm6 \
  --target H3K27me3 \
  --peak_caller seacr \
  -profile docker
Key pipeline steps:
| Metric | Threshold | Purpose |
|---|---|---|
| Spike-in alignment | 0.5-5% of reads | Normalization calibration |
| Fragment size | < 150bp majority | CUT&RUN characteristic |
| FRiP (SEACR) | >= 5% | Higher than ChIP-seq due to lower background |
| Duplicate rate | < 20% | Library complexity |
Key difference from ChIP-seq: CUT&RUN has inherently lower background, so peak callers designed for ChIP-seq, such as MACS2, may overcall peaks. Use SEACR (Meers et al. 2019) instead.
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")
Interpretation: CUT&RUN typically identifies fewer but higher-confidence peaks than ChIP-seq. Concordant peaks between both methods are the highest confidence.
encode_get_facets(assay_title="CUT&RUN", facet_field="target.label", organism="Homo sapiens")
Expected output:
{
"facets": {
"target.label": {"H3K27me3": 15, "H3K4me3": 12, "H3K27ac": 8, "CTCF": 5}
}
}
encode_search_experiments(assay_title="Histone ChIP-seq", biosample_term_name="K562", target="H3K27me3", organism="Homo sapiens")
Expected output:
{
"total": 5,
"results": [
{"accession": "ENCSR000CHI", "assay_title": "Histone ChIP-seq", "target": "H3K27me3", "biosample_summary": "K562"}
]
}
encode_track_experiment(accession="ENCSR900CUR", notes="K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq")
Expected output:
{
"status": "tracked",
"accession": "ENCSR900CUR",
"notes": "K562 H3K27me3 CUT&RUN - SEACR peaks for comparison with ChIP-seq"
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| SEACR peaks | histone-aggregation | Cross-experiment comparison (note: different caller than ChIP-seq) |
| Spike-in normalized signal | visualization-workflow | Quantitatively comparable browser tracks |
| Peak regions | regulatory-elements | Chromatin state classification |
| CUT&RUN-specific QC | quality-assessment | Validate with CUT&RUN-appropriate thresholds |
| Peak coordinates | motif-analysis | TF motif discovery at CUT&RUN peaks |
| Pipeline parameters | data-provenance | Record SEACR/spike-in normalization details |
| Peak files | variant-annotation | Identify variants in CUT&RUN peaks |
| Comparison with ChIP-seq | compare-biosamples | Cross-assay concordance analysis |
- pipeline-guide -- Parent skill with compute resource assessment and cloud setup
- histone-aggregation -- Aggregate histone mark data across samples
- quality-assessment -- Evaluate pipeline output quality metrics
- data-provenance -- Track all pipeline inputs, outputs, and parameters
- download-encode -- Download ENCODE CUT&RUN FASTQ files for pipeline input
- publication-trust -- Verify literature claims backing analytical decisions

When reporting CUT&RUN pipeline results:
- peak-annotation for gene association of peaks, or visualization-workflow for genome browser session generation

npx claudepluginhub ammawla/encode-toolkit