From encode-toolkit
Executes ENCODE WGBS pipeline from FASTQ to methylation calls and bedMethyl files using Nextflow with Docker and cloud deployment. For bisulfite-seq data processing.
Slash command: /encode-toolkit:pipeline-wgbs
Auto-load trigger (the summary Claude sees in its skill listing):
- User wants to run a WGBS/bisulfite sequencing pipeline from FASTQ to methylation calls
Execute the ENCODE DNA methylation pipeline for Whole Genome Bisulfite Sequencing data, producing per-CpG methylation levels in bedMethyl format.
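FASTQ -> Trim adapters -> Bismark align -> Deduplicate -> MethylDackel extract -> bedMethyl

| Stage | Tool / role |
|---|---|
| FASTQ | QC |
| Trim adapters | Trim Galore |
| Bismark align | Bismark/bwa-meth |
| Deduplicate | Picard |
| MethylDackel extract | Per-CpG calls |
| bedMethyl | Final output |
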
ENCODE-DCC/dna-me-pipeline
encodedcc/dna-me-pipeline

| Tool | Version | Purpose | Citation |
|---|---|---|---|
| Trim Galore | 0.6.10 | Adapter + quality trimming (bisulfite-aware) | Krueger (Babraham) |
| Bismark | 0.24.2 | Bisulfite-aware alignment + methylation | Krueger & Andrews 2011 |
| bwa-meth | 0.2.7 | Alternative bisulfite aligner (faster) | Pedersen 2014 |
| MethylDackel | 0.6.1 | Methylation extraction from BAM | Ryan (GitHub) |
| Picard | 3.1.1 | Duplicate marking | Broad Institute |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| FastQC | 0.12.1 | Read quality assessment | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC reporting | Ewels et al. 2016 |
Krueger & Andrews 2011 - "Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications" (Bioinformatics, ~4,000 citations) DOI: 10.1093/bioinformatics/btr167
Lister et al. 2009 - "Human DNA methylomes at base resolution show widespread epigenomic differences" (Nature, ~5,000 citations) DOI: 10.1038/nature08514
Schultz et al. 2015 - "Human body epigenome maps reveal noncanonical DNA methylation variation" (Nature, ~1,500 citations) DOI: 10.1038/nature14248
Pedersen et al. 2014 - "Fast and accurate alignment of long bisulfite-seq reads" arXiv:1401.1129 (bwa-meth)
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
nextflow run main.nf \
-profile local \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--genome_dir '/ref/bismark_index' \
--outdir results/ \
-resume
nextflow run main.nf \
-profile slurm \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--genome_dir '/ref/bismark_index' \
--outdir results/ \
-resume
nextflow run main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
--genome_dir 'gs://bucket/ref/bismark_index' \
--outdir 'gs://bucket/results/' \
-resume
| Step | CPUs | RAM | Time (30x human) |
|---|---|---|---|
| Trim Galore | 4 | 4 GB | 1-2 hours |
| Bismark align | 8 | 48 GB | 8-16 hours |
| Deduplication | 2 | 16 GB | 1-2 hours |
| MethylDackel | 4 | 8 GB | 1-2 hours |
| Total | 8 | 48 GB | 12-24 hours |
| Parameter | Default | Description |
|---|---|---|
| --reads | required | Glob pattern to paired FASTQ files |
| --genome_dir | required | Path to Bismark genome index directory |
| --outdir | ./results | Output directory |
| --aligner | bismark | Aligner: bismark or bwameth |
| --min_coverage | 5 | Minimum coverage for CpG reporting |
| --no_overlap | true | Remove overlapping PE reads (avoid double-counting) |
| --lambda_genome | null | Lambda genome index for conversion rate QC |
| --skip_dedup | false | Skip deduplication (for RRBS data) |
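The parameters above can be assembled programmatically before launching a run. The following is a minimal sketch, not part of the pipeline: `build_nextflow_command` is a hypothetical helper, it covers only a subset of the parameters in the table, and it assumes the pipeline entry point is `main.nf` as in the example commands.

```python
import shlex


def build_nextflow_command(
    reads,
    genome_dir,
    outdir="./results",
    profile="local",
    aligner="bismark",
    min_coverage=5,
    skip_dedup=False,
    resume=True,
):
    """Assemble the argument list for `nextflow run main.nf` (hypothetical helper)."""
    # The parameter table lists exactly two aligner choices.
    if aligner not in ("bismark", "bwameth"):
        raise ValueError(f"unsupported aligner: {aligner!r}")
    cmd = [
        "nextflow", "run", "main.nf",
        "-profile", profile,
        "--reads", reads,
        "--genome_dir", genome_dir,
        "--outdir", outdir,
        "--aligner", aligner,
        "--min_coverage", str(min_coverage),
    ]
    if skip_dedup:
        # Intended for RRBS data, per the parameter table.
        cmd += ["--skip_dedup", "true"]
    if resume:
        cmd.append("-resume")
    return cmd


cmd = build_nextflow_command(
    reads="/data/fastq/*_R{1,2}.fastq.gz",
    genome_dir="/ref/bismark_index",
    outdir="results/",
)
# shlex.join quotes the glob so the shell does not expand it before Nextflow sees it.
print(shlex.join(cmd))
```

Passing the returned list directly to `subprocess.run(cmd)` avoids shell expansion altogether, so the glob reaches Nextflow unchanged.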
results/
fastqc/ # Raw read quality
trim_galore/ # Trimmed reads + reports
bismark/
alignments/ # Sorted, deduplicated BAMs
dedup_reports/ # Duplication metrics
methylation/ # bedMethyl files (primary output)
{sample}.CpG.bedMethyl.gz
{sample}.CHG.bedMethyl.gz # Non-CpG contexts
{sample}.CHH.bedMethyl.gz
conversion_rate/ # Lambda/pUC19 conversion QC
coverage/
{sample}.coverage_stats.txt
multiqc/
multiqc_report.html
The primary output is per-CpG methylation in bedMethyl format:
chr1 10468 10470 . 1000 + 10468 10470 0,0,0 12 83.3
Columns: chr, start, end, name, score, strand, thickStart, thickEnd, color, coverage, methylation_percentage
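As an illustration of the column layout above, this Python sketch parses one bedMethyl line into named fields and applies a minimum-coverage check. `BedMethylRecord`, `parse_bedmethyl_line`, and `passes_min_coverage` are hypothetical names introduced here, and whitespace-delimited input with the 11 columns listed above is assumed.

```python
from typing import NamedTuple


class BedMethylRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    coverage: int
    methylation_pct: float


def parse_bedmethyl_line(line: str) -> BedMethylRecord:
    """Parse one whitespace-delimited bedMethyl line (11 columns assumed)."""
    fields = line.split()
    if len(fields) < 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    return BedMethylRecord(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        strand=fields[5],
        coverage=int(fields[9]),            # column 10: read coverage
        methylation_pct=float(fields[10]),  # column 11: percent methylated
    )


def passes_min_coverage(record: BedMethylRecord, min_coverage: int = 5) -> bool:
    """Mirror the pipeline's default --min_coverage 5 as a post-hoc filter."""
    return record.coverage >= min_coverage


example = "chr1 10468 10470 . 1000 + 10468 10470 0,0,0 12 83.3"
record = parse_bedmethyl_line(example)
print(record.chrom, record.coverage, record.methylation_pct)
```

Filtering after parsing keeps the threshold adjustable without rerunning the pipeline, for example when a stricter cutoff is needed downstream.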
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Bisulfite conversion rate | ≥98% | 95-98% | <95% |
| CpG coverage (genome-wide) | >10x | 5-10x | <5x |
| Mapping rate | >70% | 50-70% | <50% |
| Duplication rate | <30% | 30-50% | >50% |
| CpG sites covered (>=5x) | >80% | 60-80% | <60% |
| Lambda spike-in conversion | ≥98% | 95-98% | <95% |
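The thresholds in this table can be encoded as a small lookup for automated checks. The sketch below is one possible encoding, not the pipeline's own QC logic: the metric keys and the `classify_qc` function are hypothetical, and values that fall exactly on a boundary the table leaves ambiguous (for example 10x coverage or 50% duplication) are assumed to be warnings.

```python
# Each metric maps to (pass test, warning test); anything else is a fail.
# Metric keys are illustrative names, not pipeline output fields.
QC_RULES = {
    "bisulfite_conversion_pct": (lambda v: v >= 98, lambda v: v >= 95),
    "lambda_conversion_pct": (lambda v: v >= 98, lambda v: v >= 95),
    "cpg_coverage_x": (lambda v: v > 10, lambda v: v >= 5),
    "mapping_rate_pct": (lambda v: v > 70, lambda v: v >= 50),
    "duplication_rate_pct": (lambda v: v < 30, lambda v: v <= 50),
    "cpg_sites_ge5x_pct": (lambda v: v > 80, lambda v: v >= 60),
}


def classify_qc(metric: str, value: float) -> str:
    """Return 'pass', 'warning', or 'fail' for one metric value."""
    is_pass, is_warning = QC_RULES[metric]
    if is_pass(value):
        return "pass"
    if is_warning(value):
        return "warning"
    return "fail"


sample_metrics = {
    "bisulfite_conversion_pct": 99.1,
    "mapping_rate_pct": 62.0,
    "duplication_rate_pct": 55.0,
}
for name, value in sample_metrics.items():
    print(name, value, classify_qc(name, value))
```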
RRBS (Reduced Representation Bisulfite Sequencing) uses MspI digestion and covers ~10% of CpGs, whereas WGBS covers the full genome. These are different protocols:
- RRBS: skip deduplication (--skip_dedup true) and use different trimming

Bismark reports methylation per strand by default. For most analyses, merge complementary CpG strands:
- MethylDackel's --mergeContext handles this automatically
- Use --mergeContext unless you need strand-specific data

Conversion artifacts produce false methylation calls.

MethylDackel generates M-bias plots showing methylation level by read position:
- Use the --OT and --OB flags to trim affected positions

Regions with <5x coverage have unreliable methylation estimates:
- --min_coverage 5 (default)
- --min_coverage 10

After pipeline completion, log all outputs:
# Log derived bedMethyl files
encode_log_derived_file(
file_path="/results/bismark/methylation/sample1.CpG.bedMethyl.gz",
source_accessions=["ENCSR...", "ENCFF..."],
description="CpG methylation calls from ENCODE WGBS pipeline",
file_type="bedMethyl",
tool_used="Bismark 0.24.2 + MethylDackel 0.6.1",
parameters="bismark --genome /ref -1 R1.fq.gz -2 R2.fq.gz; MethylDackel extract --mergeContext --minDepth 5"
)
Detailed step-by-step documentation is provided in the references/ directory:
- 01-qc-trimming.md -- Bisulfite-specific adapter trimming with Trim Galore
- 02-bismark-alignment.md -- Bismark alignment and bwa-meth alternative
- 03-dedup-filtering.md -- Deduplication and BAM filtering
- 04-methylation-calling.md -- MethylDackel extraction and bedMethyl generation
- 05-qc-metrics.md -- Conversion rate QC, coverage stats, M-bias

Goal: Process whole-genome bisulfite sequencing FASTQ files through the ENCODE pipeline to generate per-CpG methylation calls for epigenomic analysis.

Context: WGBS requires bisulfite-aware alignment (Bismark) and per-CpG methylation extraction (MethylDackel), with ≥98% bisulfite conversion required.
encode_get_experiment(accession="ENCSR765JPC")
Expected output:
{
"accession": "ENCSR765JPC",
"assay_title": "WGBS",
"biosample_summary": "liver",
"replicates": 2,
"status": "released"
}
encode_list_files(accession="ENCSR765JPC", file_format="fastq")
Expected output:
{
"files": [
{"accession": "ENCFF300BS1", "output_type": "reads", "paired_end": "1", "file_size_mb": 45000},
{"accession": "ENCFF301BS2", "output_type": "reads", "paired_end": "2", "file_size_mb": 46000}
]
}
Interpretation: WGBS FASTQ files are very large (~45 GB per read file in this example). Ensure adequate storage (>500 GB for processing).
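A quick pre-flight check along these lines can be sketched in Python. This is only an illustration: `total_fastq_gb` and `has_enough_space` are hypothetical helpers, the listing is the example output shown above, `file_size_mb` is assumed to be decimal megabytes (1 GB = 1000 MB), and the 500 GB threshold is taken from the guidance above.

```python
import json
import shutil

# Example listing copied from the expected output above.
listing = json.loads("""
{
  "files": [
    {"accession": "ENCFF300BS1", "output_type": "reads", "paired_end": "1", "file_size_mb": 45000},
    {"accession": "ENCFF301BS2", "output_type": "reads", "paired_end": "2", "file_size_mb": 46000}
  ]
}
""")


def total_fastq_gb(file_listing: dict) -> float:
    """Sum file sizes in GB, assuming file_size_mb is decimal megabytes."""
    return sum(f["file_size_mb"] for f in file_listing["files"]) / 1000


def has_enough_space(free_bytes: int, required_gb: float = 500.0) -> bool:
    """True if free space exceeds the processing requirement."""
    return free_bytes / 1e9 > required_gb


total_gb = total_fastq_gb(listing)
free_bytes = shutil.disk_usage(".").free  # free space on the current filesystem
print(f"FASTQ input: {total_gb:.0f} GB; enough free space: {has_enough_space(free_bytes)}")
```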
nextflow run pipeline-wgbs/main.nf \
--fastq_r1 ENCFF300BS1.fastq.gz \
--fastq_r2 ENCFF301BS2.fastq.gz \
--genome GRCh38 \
--bismark_index /ref/bismark_index/ \
-profile docker
Key pipeline steps:
| Metric | Threshold | Purpose |
|---|---|---|
| Bisulfite conversion | >= 98% | Library quality |
| CpG coverage | >= 10x for DMR calling | Statistical power |
| Mapping rate | > 40% (BS-aware) | Alignment success |
| Duplication rate | < 30% | Library complexity |
Feed per-CpG methylation into methylation-aggregation for cross-tissue comparison and HMR/UMR/PMD identification.
encode_search_experiments(
assay_title="WGBS",
organ="brain"
)
Expected output:
{
"total": 6,
"experiments": [
{
"accession": "ENCSR321BRN",
"assay_title": "WGBS",
"biosample_summary": "brain tissue female adult (53 years)",
"status": "released"
}
]
}
encode_search_files(
assay_title="WGBS",
organ="brain",
file_format="bed",
output_type="methylation state at CpG",
assembly="GRCh38"
)
Expected output:
{
"total": 3,
"files": [
{
"accession": "ENCFF567MET",
"output_type": "methylation state at CpG",
"assembly": "GRCh38",
"file_size_mb": 845.2
}
]
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Per-CpG methylation (bedMethyl) | methylation-aggregation | Cross-tissue methylation atlas |
| Differentially methylated regions | peak-annotation | Assign DMRs to nearest genes |
| Methylation at regulatory sites | regulatory-elements | Correlate methylation with cCRE activity |
| CpG methylation near variants | variant-annotation | Annotate variants affecting CpG methylation |
| Signal tracks (bigWig) | visualization-workflow | Display methylation signal in genome browser |
| QC metrics | quality-assessment | Validate conversion rate and coverage |
| Pipeline parameters | data-provenance | Record Bismark/MethylDackel versions |
| Methylation at promoters | gtex-expression | Correlate promoter methylation with gene expression |
- pipeline-guide -- Parent skill with compute resource assessment and cloud setup
- methylation-aggregation -- Aggregate methylation data across samples/tissues
- quality-assessment -- Evaluate pipeline output quality metrics
- data-provenance -- Track all pipeline inputs, outputs, and parameters
- download-encode -- Download ENCODE WGBS FASTQ files for pipeline input
- publication-trust -- Verify literature claims backing analytical decisions

When reporting WGBS pipeline results:
- methylation-aggregation for cross-sample averaging and HMR/UMR/PMD identification

npx claudepluginhub ammawla/encode-toolkit