From encode-toolkit
Executes ENCODE Hi-C pipeline from FASTQ to contact matrices, loop calls, and TADs using Nextflow with Docker and cloud deployment. For Hi-C data processing and 3D genome analysis.
Slash command
/encode-toolkit:pipeline-hic
- User wants to run a Hi-C processing pipeline from FASTQ to contact matrices and loop calls
Execute the ENCODE Hi-C pipeline for chromatin conformation capture data, producing multi-resolution contact matrices, loop calls, and compartment annotations.
FASTQ -> Trim -> BWA (per-mate) -> pairtools parse -> dedup -> .pairs
                                                                 |
                                                    +------------+------------+
                                                    |                         |
                                            Juicer pre -> .hic         cooler -> .mcool
                                                    |                         |
                                              HiCCUPS loops             Compartments
ENCODE-DCC/hic-pipeline
encodedcc/hic-pipeline
| Tool | Version | Purpose | Citation |
|---|---|---|---|
| BWA-MEM | 0.7.17 | Alignment (per-mate) | Li & Durbin 2009 |
| pairtools | 1.0.3 | Pair classification, dedup | Open2C |
| Juicer tools | 2.20.00 | .hic generation, HiCCUPS | Durand et al. 2016 |
| cooler | 0.9.3 | .cool/.mcool generation | Abdennur & Mirny 2020 |
| samtools | 1.19 | BAM operations | Li et al. 2009 |
| FastQC | 0.12.1 | Read quality | Andrews (Babraham) |
| MultiQC | 1.21 | Aggregated QC | Ewels et al. 2016 |
Rao et al. 2014 - "A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping" (Cell, ~5,000 citations) DOI: 10.1016/j.cell.2014.11.021
Lieberman-Aiden et al. 2009 - "Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome" (Science, ~6,000 citations) DOI: 10.1126/science.1181369
Durand et al. 2016 - "Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments" (Cell Systems, ~2,000 citations) DOI: 10.1016/j.cels.2016.07.002
Abdennur & Mirny 2020 - "Cooler: scalable storage for Hi-C data and other genomically labeled arrays" (Bioinformatics) DOI: 10.1093/bioinformatics/btz540
Amemiya et al. 2019 - "The ENCODE Blacklist" (Scientific Reports, ~1,372 citations) DOI: 10.1038/s41598-019-45839-z
nextflow run main.nf \
-profile local \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--bwa_index '/ref/bwa_index/genome.fa' \
--chrom_sizes '/ref/hg38.chrom.sizes' \
--restriction_site 'GATC' \
--outdir results/ \
-resume
nextflow run main.nf \
-profile slurm \
--reads '/data/fastq/*_R{1,2}.fastq.gz' \
--bwa_index '/ref/bwa_index/genome.fa' \
--chrom_sizes '/ref/hg38.chrom.sizes' \
--restriction_site 'GATC' \
--outdir results/ \
-resume
nextflow run main.nf \
-profile gcp \
--reads 'gs://bucket/fastq/*_R{1,2}.fastq.gz' \
--bwa_index 'gs://bucket/ref/genome.fa' \
--chrom_sizes 'gs://bucket/ref/hg38.chrom.sizes' \
--restriction_site 'GATC' \
--outdir 'gs://bucket/results/' \
-resume
| Step | CPUs | RAM | Time (2B contacts) |
|---|---|---|---|
| BWA alignment | 8 | 16 GB | 4-6 hours |
| pairtools parse | 4 | 8 GB | 2-3 hours |
| pairtools dedup | 4 | 16 GB | 1-2 hours |
| Juicer pre + hic | 4 | 64 GB | 2-4 hours |
| HiCCUPS | 4 | 16 GB (+ GPU optional) | 1-2 hours |
| Total | 8 | 64 GB | 8-16 hours |
| Parameter | Default | Description |
|---|---|---|
| --reads | required | Glob pattern to paired FASTQ files |
| --bwa_index | required | Path to BWA genome index (.fa with .bwt etc.) |
| --chrom_sizes | required | Chromosome sizes file |
| --restriction_site | GATC | Restriction enzyme site (GATC for MboI/DpnII) |
| --outdir | ./results | Output directory |
| --resolutions | 1000,5000,10000,25000,50000,100000,250000,500000,1000000 | Matrix resolutions |
| --min_mapq | 30 | Minimum MAPQ for pair filtering |
| --assembly | hg38 | Genome assembly name for .hic header |
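The parameter table can be turned into a small command builder. The following is a minimal sketch, not part of the pipeline: `build_command` is a hypothetical helper, and it only uses the flag names and defaults listed in the table (`--resolutions` is left to the pipeline default unless supplied).

```python
import shlex

# Parameter names and defaults are copied from the table; the helper is hypothetical.
REQUIRED = ("reads", "bwa_index", "chrom_sizes")
DEFAULTS = {
    "restriction_site": "GATC",
    "outdir": "./results",
    "min_mapq": 30,
    "assembly": "hg38",
}


def build_command(params, profile="local", resume=True):
    """Return the nextflow invocation as an argument list."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    merged = {**DEFAULTS, **params}  # user values override table defaults
    cmd = ["nextflow", "run", "main.nf", "-profile", profile]
    for name, value in merged.items():
        cmd += [f"--{name}", str(value)]
    if resume:
        cmd.append("-resume")
    return cmd


if __name__ == "__main__":
    cmd = build_command({
        "reads": "/data/fastq/*_R{1,2}.fastq.gz",
        "bwa_index": "/ref/bwa_index/genome.fa",
        "chrom_sizes": "/ref/hg38.chrom.sizes",
    })
    # shlex.join quotes the glob so the shell does not expand it.
    print(shlex.join(cmd))
```

Returning a list rather than a string lets the command be passed directly to `subprocess.run` without shell quoting issues.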
results/
  fastqc/                                # Raw read quality
  alignment/
    {sample}.R1.bam                      # Per-mate alignments
    {sample}.R2.bam
  pairs/
    {sample}.pairs.gz                    # Classified, deduplicated pairs
    {sample}.dedup_stats.txt             # Duplication metrics
    {sample}.pair_stats.txt              # Pair type classification
  matrices/
    {sample}.hic                         # Juicer .hic file (primary output)
    {sample}.mcool                       # Cooler multi-resolution matrix
  loops/
    {sample}.hiccups_loops.bedpe         # Called loops (HiCCUPS)
  qc/
    {sample}.contact_stats.txt           # Contact statistics
  multiqc/
    multiqc_report.html
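A quick completeness check after a run can catch silently failed steps. This is a minimal sketch that assumes the layout shown in the output tree; `missing_outputs` is a hypothetical helper, and the per-sample `fastqc/` contents are not checked because the tree does not name them.

```python
from pathlib import Path

# Relative paths copied from the output tree; {sample} is filled in per sample.
EXPECTED = [
    "alignment/{sample}.R1.bam",
    "alignment/{sample}.R2.bam",
    "pairs/{sample}.pairs.gz",
    "pairs/{sample}.dedup_stats.txt",
    "pairs/{sample}.pair_stats.txt",
    "matrices/{sample}.hic",
    "matrices/{sample}.mcool",
    "loops/{sample}.hiccups_loops.bedpe",
    "qc/{sample}.contact_stats.txt",
    "multiqc/multiqc_report.html",
]


def missing_outputs(outdir, sample):
    """Return expected files that are absent or empty under outdir."""
    root = Path(outdir)
    missing = []
    for template in EXPECTED:
        path = root / template.format(sample=sample)
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.relative_to(root).as_posix())
    return missing
```

For example, `missing_outputs("results", "sample1")` returns an empty list when every listed file exists and is non-empty.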
The .hic format (Juicer) stores multi-resolution contact matrices together with normalization vectors. It can be visualized in Juicebox and loaded with hic-straw in Python or R.
The .mcool format (cooler) is an HDF5-based multi-resolution contact matrix. It is widely supported by cooler, cooltools, HiGlass, and FAN-C.
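Reading one resolution from the .mcool output with the cooler Python package might look like the sketch below. It assumes the file was written in the standard multi-resolution layout (`resolutions/<bin size>` groups) and that balancing weights are present when `balance=True`; the file name and region are illustrative only.

```python
def mcool_uri(path, resolution):
    """Build the cooler URI for one resolution inside a .mcool file."""
    return f"{path}::resolutions/{int(resolution)}"


def fetch_region(path, resolution, region, balance=True):
    """Return a dense contact matrix for one region as a NumPy array."""
    # Third-party dependency, imported lazily so mcool_uri works without it.
    import cooler

    clr = cooler.Cooler(mcool_uri(path, resolution))
    # balance=True applies stored balancing weights; use False for raw counts.
    return clr.matrix(balance=balance).fetch(region)


if __name__ == "__main__":
    # Hypothetical file name following the output tree.
    print(mcool_uri("results/matrices/sample1.mcool", 10000))
```

A call such as `fetch_region("results/matrices/sample1.mcool", 10000, "chr1:1,000,000-3,000,000")` would return the 10 kb matrix for that window, provided 10000 is among the resolutions the file contains.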
| Metric | Pass | Warning | Fail |
|---|---|---|---|
| Valid pair fraction | >40% | 25-40% | <25% |
| Cis contacts (>20kb) | >40% | 25-40% | <25% |
| Cis/trans ratio | >1.5 | 1.0-1.5 | <1.0 |
| Library complexity (unique/total) | >0.7 | 0.5-0.7 | <0.5 |
| Contacts per resolution | See below | - | - |
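The pass/warning/fail bands in the table can be applied programmatically. This is a minimal sketch: the metric key names are hypothetical, percentages are expressed as fractions on a 0-1 scale, and values that fall exactly on a boundary are treated as warnings.

```python
# (pass_above, fail_below) pairs copied from the table; key names are hypothetical.
THRESHOLDS = {
    "valid_pair_fraction": (0.40, 0.25),
    "cis_long_range_fraction": (0.40, 0.25),  # cis contacts >20 kb
    "cis_trans_ratio": (1.5, 1.0),
    "library_complexity": (0.7, 0.5),         # unique / total
}


def classify(metric, value):
    """Return 'pass', 'warning', or 'fail' for one metric value."""
    pass_above, fail_below = THRESHOLDS[metric]
    if value > pass_above:
        return "pass"
    if value < fail_below:
        return "fail"
    return "warning"


def qc_report(metrics):
    """Classify every supplied metric; unknown keys raise KeyError."""
    return {name: classify(name, value) for name, value in metrics.items()}
```

For instance, `qc_report({"valid_pair_fraction": 0.55, "cis_trans_ratio": 1.2})` marks the first metric as a pass and the second as a warning.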
| Resolution | Minimum Contacts Needed | Typical Depth |
|---|---|---|
| 1 kb | >2 billion | Very deep |
| 5 kb | >500 million | Deep |
| 10 kb | >200 million | Standard |
| 25 kb | >50 million | Moderate |
| 100 kb | >10 million | Low |
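The depth table can be used to pick the finest resolution a library supports before calling features. A minimal sketch, using only the bin sizes and minimum contact counts listed above; the function name is hypothetical.

```python
# (bin size in bp, minimum contacts) copied from the table, finest first.
MIN_CONTACTS = [
    (1_000, 2_000_000_000),
    (5_000, 500_000_000),
    (10_000, 200_000_000),
    (25_000, 50_000_000),
    (100_000, 10_000_000),
]


def finest_supported_resolution(n_contacts):
    """Return the smallest bin size whose minimum is exceeded, or None."""
    for bin_size, minimum in MIN_CONTACTS:
        if n_contacts > minimum:
            return bin_size
    return None  # too shallow even for 100 kb by the table's criteria
```

A library with 300 million contacts would resolve to 10 kb under these thresholds, so loop calls at 5 kb should be treated with caution.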
pairtools classifies read pairs into categories:
| Category | Description | Use |
|---|---|---|
| UU | Both uniquely mapped | Valid contact |
| UR/RU | One unique, one rescued | Valid (rescued) |
| UX/XU | One unique, one unmapped | Not used |
| DD | Both duplicate | Removed |
| WW | Walk pair (same strand) | Indicates ligation artifact |
| NR | Null/rescue pair | Not used |
Only UU pairs (and optionally the rescued UR/RU pairs) are used for contact matrices.
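The classification can be tallied directly from a .pairs file. The sketch below assumes a tab-separated file whose `#columns:` header names a `pair_type` column, falling back to the eighth column when the header is absent; the function names are hypothetical.

```python
from collections import Counter

VALID_TYPES = {"UU"}          # always kept, per the text above
RESCUED_TYPES = {"UR", "RU"}  # optionally kept


def count_pair_types(lines):
    """Count pair types in the text lines of a .pairs file."""
    type_col = 7  # assumed default position of pair_type
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#columns:"):
                columns = line.split()[1:]
                if "pair_type" in columns:
                    type_col = columns.index("pair_type")
            continue
        counts[line.split("\t")[type_col]] += 1
    return counts


def valid_fraction(counts, include_rescued=False):
    """Fraction of pairs that would enter the contact matrix."""
    keep = VALID_TYPES | (RESCUED_TYPES if include_rescued else set())
    total = sum(counts.values())
    return sum(counts[t] for t in keep) / total if total else 0.0
```

For a gzipped file, pass `gzip.open(path, "rt")` as `lines`; the file is streamed line by line, so memory use stays flat even for deep libraries.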
The restriction enzyme determines fragment size and resolution:
Different normalization methods yield different results:
Do not call features at resolutions unsupported by sequencing depth:
Same-strand pairs (WW) indicate self-ligation or undigested fragments:
After pipeline completion, log all outputs:
encode_log_derived_file(
file_path="/results/matrices/sample1.hic",
source_accessions=["ENCSR...", "ENCFF..."],
description="Hi-C contact matrix from ENCODE Hi-C pipeline",
file_type="hic",
tool_used="BWA 0.7.17 + pairtools 1.0.3 + Juicer 2.20.00",
parameters="MboI digestion, KR normalization, resolutions 1kb-1Mb"
)
Detailed step-by-step documentation is provided in the references/ directory:
- 01-qc-trimming.md -- Read QC and adapter trimming
- 02-alignment.md -- BWA per-mate alignment strategy
- 03-pair-processing.md -- pairtools parse, sort, and dedup
- 04-matrix-generation.md -- Juicer .hic and cooler .mcool generation
- 05-loop-calling.md -- HiCCUPS loop detection and QC
Goal: Process raw Hi-C FASTQ files through the ENCODE pipeline to generate contact matrices, TAD calls, and chromatin loop predictions.
Context: Hi-C captures 3D chromatin organization. The pipeline uses BWA for chimeric read alignment, pairtools for pair processing, and Juicer/HiCCUPS for loop calling.
encode_get_experiment(accession="ENCSR000AKA")
Expected output:
{
"accession": "ENCSR000AKA",
"assay_title": "Hi-C",
"biosample_summary": "GM12878",
"replicates": 2,
"status": "released"
}
encode_list_files(accession="ENCSR000AKA", file_format="fastq")
Expected output:
{
"files": [
{"accession": "ENCFF500HI1", "output_type": "reads", "paired_end": "1", "file_size_mb": 35000},
{"accession": "ENCFF501HI2", "output_type": "reads", "paired_end": "2", "file_size_mb": 36000}
]
}
Interpretation: Hi-C paired-end reads represent chimeric ligation junctions. Each read pair captures a 3D contact.
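Before launching the pipeline, the file listing can be split into mate 1 and mate 2 inputs. This is a minimal sketch that assumes the response has the shape shown in the expected output above; `split_mates` is a hypothetical helper, and the accessions are the ones from that example.

```python
def split_mates(files):
    """Group file accessions by their paired_end value ("1" or "2")."""
    mates = {"1": [], "2": []}
    for entry in files:
        mates[entry["paired_end"]].append(entry["accession"])
    if len(mates["1"]) != len(mates["2"]):
        raise ValueError("unequal number of mate 1 and mate 2 files")
    return mates


if __name__ == "__main__":
    listing = {
        "files": [
            {"accession": "ENCFF500HI1", "output_type": "reads",
             "paired_end": "1", "file_size_mb": 35000},
            {"accession": "ENCFF501HI2", "output_type": "reads",
             "paired_end": "2", "file_size_mb": 36000},
        ]
    }
    print(split_mates(listing["files"]))
```

The check only confirms that both mates are present in equal numbers; it does not verify which mate 1 file belongs with which mate 2 file when an experiment has several libraries.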
nextflow run pipeline-hic/main.nf \
--fastq_r1 ENCFF500HI1.fastq.gz \
--fastq_r2 ENCFF501HI2.fastq.gz \
--genome GRCh38 \
--restriction_enzyme DpnII \
--resolution 5000,10000,25000 \
-profile docker
Key pipeline steps:
| Metric | Threshold | Purpose |
|---|---|---|
| Cis/trans ratio | > 60% cis | Library quality |
| Long-range cis | > 40% of cis | Useful contacts |
| Valid pairs | > 50% of mapped | Ligation success |
| Duplicate rate | < 30% | Library complexity |
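The four thresholds can be derived from raw pair counts. The sketch below makes two assumptions that the table does not state: the cis percentage is taken over valid pairs, and the duplicate rate is taken over mapped pairs. All function and key names are hypothetical.

```python
# (comparison, limit) pairs copied from the table; key names are hypothetical.
LIMITS = {
    "cis_fraction": (">", 0.60),
    "long_range_cis_fraction": (">", 0.40),
    "valid_fraction": (">", 0.50),
    "duplicate_rate": ("<", 0.30),
}


def hic_metrics(mapped, duplicates, valid, cis, cis_long_range):
    """Derive the table metrics from raw pair counts."""
    return {
        "cis_fraction": cis / valid,               # assumed denominator: valid pairs
        "long_range_cis_fraction": cis_long_range / cis,
        "valid_fraction": valid / mapped,
        "duplicate_rate": duplicates / mapped,     # assumed denominator: mapped pairs
    }


def check_metrics(metrics):
    """Return True for each metric that meets its threshold."""
    result = {}
    for name, (op, limit) in LIMITS.items():
        value = metrics[name]
        result[name] = value > limit if op == ">" else value < limit
    return result
```

With `mapped=1000, duplicates=200, valid=600, cis=420, cis_long_range=210`, every check passes; raising `duplicates` to 400 fails only the duplicate-rate check.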
Download loop calls for downstream analysis:
encode_list_files(accession="ENCSR000AKA", file_format="bedpe", assembly="GRCh38")
encode_search_experiments(
assay_title="Hi-C",
organ="heart"
)
Expected output:
{
"total": 4,
"experiments": [
{
"accession": "ENCSR654HRT",
"assay_title": "Hi-C",
"biosample_summary": "heart left ventricle tissue male adult (51 years)",
"status": "released"
}
]
}
encode_get_experiment(accession="ENCSR654HRT")
Expected output:
{
"accession": "ENCSR654HRT",
"assay_title": "Hi-C",
"replicates": 2,
"biosample_summary": "heart left ventricle tissue male adult (51 years)",
"files_count": 18,
"assembly": "GRCh38",
"audit": {"ERROR": 0, "WARNING": 1}
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Chromatin loops (BEDPE) | hic-aggregation | Cross-tissue loop catalog |
| TAD boundaries | regulatory-elements | Domain-level regulatory architecture |
| Loop anchors (BED) | peak-annotation | Assign genes to loop-connected enhancers |
| Contact matrices (.hic) | visualization-workflow | 3D genome visualization |
| Loop-disrupting coordinates | variant-annotation | Identify variants breaking chromatin contacts |
| QC metrics | quality-assessment | Validate Hi-C library quality |
| Pipeline parameters | data-provenance | Record BWA/pairtools/Juicer versions |
| Loop anchor regions | motif-analysis | Discover CTCF motifs at loop anchors |
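Several rows above start from loop anchors rather than loops. Extracting anchors from a BEDPE loop file can be sketched as follows, assuming tab-separated lines whose first six columns are chrom1, start1, end1, chrom2, start2, end2; header lines (with or without a leading `#`) are skipped. The function names are hypothetical.

```python
def loop_anchors(bedpe_lines):
    """Return sorted, de-duplicated (chrom, start, end) anchors."""
    anchors = set()
    for line in bedpe_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or not fields[1].isdigit():
            continue  # unprefixed header or malformed line
        anchors.add((fields[0], int(fields[1]), int(fields[2])))
        anchors.add((fields[3], int(fields[4]), int(fields[5])))
    return sorted(anchors)


def to_bed(anchors):
    """Format anchors as three-column BED text."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in anchors)
```

Anchors shared by several loops appear once in the output. Sorting is lexicographic on chromosome name, so run the result through a genome-aware sort before tools that require chromosome order.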
- pipeline-guide -- Parent skill with compute resource assessment and cloud setup
- hic-aggregation -- Aggregate Hi-C loops across samples/tissues
- quality-assessment -- Evaluate pipeline output quality metrics
- data-provenance -- Track all pipeline inputs, outputs, and parameters
- download-encode -- Download ENCODE Hi-C FASTQ files for pipeline input
- publication-trust -- Verify literature claims backing analytical decisions
When reporting Hi-C pipeline results:
hic-aggregation for cross-sample loop catalogs, or visualization-workflow for Juicebox/HiGlass session setup