From encode-toolkit
Aggregates Hi-C loop calls (BEDPE) across ENCODE experiments, donors, and labs to build union chromatin contact maps. Handles resolution-aware anchor matching and quality metrics for tissue 3D contacts.
How this skill is triggered — by the user, by Claude, or both
Slash command
/encode-toolkit:hic-aggregationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User wants to build a comprehensive catalog of chromatin loops from multiple Hi-C experiments
Build a comprehensive catalog of chromatin loops for a tissue/cell type by merging BEDPE loop calls from multiple ENCODE Hi-C experiments.
The question: "What regions are in 3D physical contact in my tissue?"
Like histone marks and accessibility, chromatin loops are a detection question. If a loop between Region A and Region B is detected in one donor but not another, the contact is still real — individual variation, sequencing depth, and computational resolution explain absence. We want the union of all detected contacts.
Hi-C data measures pairwise chromatin interactions genome-wide. After processing:
.hic file): Genome-wide interaction frequencies at multiple resolutionsBEDPE format (Paired-End BED):
chr1 start1 end1 chr2 start2 end2 name score strand1 strand2
Each row represents a contact between two genomic anchor regions.
encode_search_experiments(
assay_title="Hi-C",
organ="pancreas", # user's tissue of interest
biosample_type="tissue",
limit=100
)
Present a summary to the user:
Use encode_get_facets to check availability:
encode_get_facets(assay_title="Hi-C", organ="pancreas")
Note: Hi-C data is computationally expensive to produce, so there are typically fewer experiments per tissue than ChIP-seq or ATAC-seq. Even 2-3 experiments can be valuable for union catalogs.
encode_get_experiment(accession="ENCSR...")
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
For each experiment, get BEDPE loop calls:
# Search for loop/interaction files
encode_list_files(
experiment_accession="ENCSR...",
file_format="bedpe",
assembly="GRCh38"
)
# Also check for BED-formatted loop files
encode_list_files(
experiment_accession="ENCSR...",
output_type="chromatin interactions",
assembly="GRCh38"
)
# Or contact domains
encode_list_files(
experiment_accession="ENCSR...",
output_type="contact domains",
assembly="GRCh38"
)
File selection priority:
Prefer preferred_default=True files when available.
encode_download_files(
file_accessions=["ENCFF...", ...],
download_dir="/path/to/data/hic_loops",
organize_by="flat"
)
Hi-C loop anchors are binned regions, not precise positions. The resolution determines anchor size:
| Resolution | Anchor Width | Best For | Typical Loop Count |
|---|---|---|---|
| 5 kb | 5,000 bp | Fine-scale promoter-enhancer loops | More loops |
| 10 kb | 10,000 bp | Standard analysis | Moderate |
| 25 kb | 25,000 bp | Large-scale domain contacts | Fewer loops |
All loops being merged must be at the same resolution, or anchors must be harmonized to a common resolution.
If experiments have loops called at different resolutions:
# Expand 5kb anchors to 10kb resolution
awk -v res=10000 'BEGIN{OFS="\t"} {
# Bin anchor 1
bin1_start = int($2/res) * res
bin1_end = bin1_start + res
# Bin anchor 2
bin2_start = int($5/res) * res
bin2_end = bin2_start + res
print $1, bin1_start, bin1_end, $4, bin2_start, bin2_end, $7, $8, $9, $10
}' fine_res_loops.bedpe > harmonized_loops.bedpe
Remove loops with anchors in artifact-prone regions (download from https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz):
# Filter loops where EITHER anchor overlaps a blocklist region
# First, extract anchor 1 and anchor 2 as separate BED files
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,NR}' sample.bedpe > anchors1.bed
awk 'BEGIN{OFS="\t"} {print $4,$5,$6,NR}' sample.bedpe > anchors2.bed
# Find anchor rows NOT in blocklist
bedtools intersect -a anchors1.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_1.txt
bedtools intersect -a anchors2.bed -b ENCODE_blocklist.bed -v | cut -f4 > clean_rows_2.txt
# Keep only rows where BOTH anchors pass
comm -12 <(sort clean_rows_1.txt) <(sort clean_rows_2.txt) > clean_rows.txt
awk 'NR==FNR{a[$1];next} FNR in a' clean_rows.txt sample.bedpe > sample.filtered.bedpe
Filter by interaction score/significance:
# If BEDPE has a score column (col 8), filter to significant interactions
# Keep top 75% by score (true distribution quantile, not range-based)
TOTAL=$(wc -l < sample.filtered.bedpe)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k8,8n sample.filtered.bedpe | awk -v line="$LINE_25" 'NR==line{print $8}')
awk -v t="$THRESHOLD" '$8 >= t' sample.filtered.bedpe > sample.qfiltered.bedpe
Loops where both anchors are very close are likely artifacts:
# Remove loops where anchors are on same chromosome and < 20kb apart
awk '{
if ($1 != $4) print $0; # inter-chromosomal: keep (rare but real)
else if (($5 - $3) >= 20000) print $0; # > 20kb apart: keep
}' sample.qfiltered.bedpe > sample.clean.bedpe
Unlike peaks (single regions), loops are pairs of regions. Two loops match if both anchors overlap:
Loop 1: [anchor1A]--------[anchor1B]
Loop 2: [anchor2A]------[anchor2B]
These should merge if anchor1A overlaps anchor2A AND anchor1B overlaps anchor2B.
# Concatenate all filtered loops
cat sample1.clean.bedpe sample2.clean.bedpe ... > all_loops.bedpe
# Sort by anchor 1 coordinates
sort -k1,1 -k2,2n -k4,4 -k5,5n all_loops.bedpe > all_loops.sorted.bedpe
# Use a custom merge approach:
# 1. Bin anchors to resolution, creating a loop ID
# 2. Group by loop ID
# 3. Count support
awk -v res=10000 'BEGIN{OFS="\t"} {
# Create binned anchor coordinates as loop identifier
a1_bin = $1 ":" int($2/res)*res
a2_bin = $4 ":" int($5/res)*res
# Canonical order (smaller coordinate first) to handle orientation
if (a1_bin < a2_bin) loop_id = a1_bin "-" a2_bin
else loop_id = a2_bin "-" a1_bin
print loop_id, $0
}' all_loops.sorted.bedpe | \
sort -k1,1 | \
awk 'BEGIN{OFS="\t"} {
if ($1 != prev_id) {
if (NR > 1) print chr1, start1, end1, chr2, start2, end2, count, max_score
prev_id = $1
chr1=$2; start1=$3; end1=$4; chr2=$5; start2=$6; end2=$7
count = 1; max_score = $9
} else {
count++
if ($9 > max_score) max_score = $9
# Expand anchors to encompass all overlapping calls
if ($3 < start1) start1 = $3
if ($4 > end1) end1 = $4
if ($6 < start2) start2 = $6
if ($7 > end2) end2 = $7
}
} END {
print chr1, start1, end1, chr2, start2, end2, count, max_score
}' > union_loops.bedpe
Following the Loop Catalog (Reyna et al. 2025) approach:
# Bin all loop anchors to a fixed resolution
awk -v res=10000 'BEGIN{OFS="\t"} {
a1_start = int($2/res) * res
a1_end = a1_start + res
a2_start = int($5/res) * res
a2_end = a2_start + res
# Canonical ordering
if ($1 < $4 || ($1 == $4 && a1_start <= a2_start))
print $1, a1_start, a1_end, $4, a2_start, a2_end
else
print $4, a2_start, a2_end, $1, a1_start, a1_end
}' all_loops.sorted.bedpe | \
sort -u | \
sort -k1,1 -k2,2n -k4,4 -k5,5n | \
uniq -c | \
awk 'BEGIN{OFS="\t"} {print $2,$3,$4,$5,$6,$7,$1}' > union_loops_binned.bedpe
# Columns: chr1, start1, end1, chr2, start2, end2, n_supporting_samples
mariner (R/Bioconductor):
library(mariner)
# Read BEDPE files as GInteractions
loops <- lapply(bedpe_files, read.table)
# Convert to GInteractions and merge
gi <- as_ginteractions(loops)
merged <- mergePairs(gi, radius = 10000) # 10kb tolerance
AQuA Tools (Python):
# BEDPE union with anchor overlap tolerance
aqua bedpe-union -i sample1.bedpe sample2.bedpe -o union.bedpe --slop 5000
Given N total experiments:
| Confidence | Criteria | Interpretation |
|---|---|---|
| High | Detected in >=50% of samples | Constitutive loop, present across individuals |
| Supported | Detected in 2+ samples | Likely real, some variation |
| Singleton | Detected in 1 sample only | May be individual-specific or depth-dependent |
awk -v N=4 '{
if ($7 >= N*0.5) conf="HIGH";
else if ($7 >= 2) conf="SUPPORTED";
else conf="SINGLETON";
print $0"\t"conf"\t"$7"/"N
}' union_loops_binned.bedpe > union_loops.annotated.bedpe
Context for singletons: Hi-C loop detection is very sensitive to sequencing depth. Many singletons may simply be under-powered in other samples rather than biologically absent. The Loop Catalog found that a union approach captures ~3x more loops than any individual experiment.
TAD boundaries and A/B compartments require different aggregation than loops:
TAD boundaries are single genomic positions. Aggregate like narrow peaks:
# Extract TAD boundary BED from contact domain files
# Each boundary is a narrow region
cat tad_boundaries_sample*.bed | \
bedtools sort -i - | \
bedtools merge -i - -d 40000 -c 1 -o count > union_tad_boundaries.bed
# 40kb gap tolerance because TAD boundaries are resolution-dependent
Compartment calls (eigenvector sign at each bin) should be aggregated by majority vote:
# For each resolution bin, assign A or B based on majority of samples
# This is more complex and typically done in R/Python
encode_log_derived_file(
file_path="/path/to/union_loops.annotated.bedpe",
source_accessions=["ENCSR...", "ENCSR...", ...],
description="Union chromatin loops across N pancreas Hi-C experiments",
file_type="aggregated_loops",
tool_used="bedtools + custom merge at 10kb resolution",
parameters="blocklist filtered, score >= 25th pctl, self-ligation >= 20kb removed, 10kb resolution binning"
)
Report to the user:
Resolution mismatch: Loop calls at 5kb vs 25kb resolution will have very different anchor sizes. Always harmonize to a common resolution before merging.
Sequencing depth sensitivity: Loop calling requires deep sequencing (400M+ valid pairs). Shallowly sequenced experiments will call far fewer loops — this is under-detection, not absence.
Algorithm differences are LARGE: Wolff et al. 2022 (GigaScience) found that HICCUPS, Mustache, Fit-Hi-C, and HiCExplorer loop callers intersect by ~50% at most. Mustache tends to recover more validated loops (Roayaei Ardakany et al. 2020). If mixing callers, note this in provenance — and this discordance is itself a reason to prefer the union approach.
Orientation matters: BEDPE anchors should be canonically ordered (anchor1 < anchor2 by genomic coordinate) before merging to avoid duplicate counting.
Inter-chromosomal contacts: These are rare but real. Handle separately — they cannot be distance-filtered.
Distance distribution: Most loops are 100kb-2Mb. Very short-range contacts (<20kb) are often noise from undigested chromatin. Very long-range (>10Mb) are rare.
Do NOT mix assemblies: All files must be GRCh38 or all hg19. Hi-C resolution binning makes liftOver of loops particularly error-prone.
TADs vs loops: These are different features. TADs are domains (regions), loops are point contacts (pairs). Do not mix them in the same union.
Micro-C as complement: Micro-C achieves higher resolution than Hi-C and can detect sub-TAD loops. Treat Micro-C loops as compatible with Hi-C loops in a union (Mustache works on both).
Goal: Aggregate Hi-C chromatin loops across tissues to identify conserved and tissue-specific 3D contacts at the MYC gene locus. Context: Cancer research — MYC is regulated by distal enhancers via chromatin looping.
encode_search_experiments(assay_title="Hi-C", organism="Homo sapiens", limit=50)
Expected output:
{
"total": 89,
"results": [
{"accession": "ENCSR000AKA", "assay_title": "Hi-C", "biosample_summary": "GM12878", "status": "released"},
{"accession": "ENCSR489OCU", "assay_title": "Hi-C", "biosample_summary": "K562", "status": "released"},
{"accession": "ENCSR382RFU", "assay_title": "Hi-C", "biosample_summary": "liver", "status": "released"}
]
}
Interpretation: 89 Hi-C experiments available. Select 5–10 spanning diverse tissue types for cross-tissue comparison.
encode_list_files(accession="ENCSR000AKA", file_format="bedpe", assembly="GRCh38")
Expected output:
{
"files": [
{"accession": "ENCFF001ABC", "output_type": "contact domains", "file_format": "bedpe", "file_size_mb": 2.4},
{"accession": "ENCFF002DEF", "output_type": "chromatin interactions", "file_format": "bedpe", "file_size_mb": 1.8}
]
}
Interpretation: Use "chromatin interactions" files for loop aggregation. Contact domains are TADs, not loops.
encode_download_files(accessions=["ENCFF002DEF", "ENCFF003GHI", "ENCFF004JKL"], download_dir="/data/hic_loops")
Expected output:
{
"downloaded": 3,
"total_size_mb": 5.6,
"md5_verified": true,
"files": ["/data/hic_loops/ENCFF002DEF.bedpe", "/data/hic_loops/ENCFF003GHI.bedpe", "/data/hic_loops/ENCFF004JKL.bedpe"]
}
Apply union merge across tissues:
# MYC locus: chr8:127,700,000-128,000,000
awk '$1=="chr8" && $2>=127700000 && $3<=128000000' union_loops.bedpe > myc_loops.bedpe
Interpretation: Loops anchored at the MYC promoter connecting to distal enhancers. Conserved loops (≥3 tissues) likely represent fundamental regulatory architecture; tissue-specific loops may drive context-dependent MYC activation.
encode_get_facets(assay_title="Hi-C", facet_field="organ", organism="Homo sapiens")
Expected output:
{
"facets": {
"organ": {"brain": 24, "heart": 12, "liver": 8, "lung": 6, "kidney": 4, "blood": 15}
}
}
encode_get_experiment(accession="ENCSR000AKA")
Expected output:
{
"accession": "ENCSR000AKA",
"assay_title": "Hi-C",
"biosample_summary": "GM12878",
"replicates": 2,
"status": "released",
"lab": "/labs/erez-lieberman-aiden/",
"audit": {"WARNING": 1, "ERROR": 0}
}
encode_compare_experiments(accession_1="ENCSR000AKA", accession_2="ENCSR489OCU")
Expected output:
{
"comparison": {
"shared": {"assay": "Hi-C", "organism": "Homo sapiens", "assembly": "GRCh38"},
"differences": {
"biosample": ["GM12878", "K562"],
"lab": ["/labs/erez-lieberman-aiden/", "/labs/erez-lieberman-aiden/"]
}
}
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Union loop catalog (BEDPE) | peak-annotation | Assign genes to loop anchors |
| Conserved loop coordinates | histone-aggregation | Overlay H3K27ac at anchors to find active enhancer-promoter loops |
| Tissue-specific loops | accessibility-aggregation | Check if loop anchors overlap open chromatin |
| Loop anchor BED intervals | variant-annotation | Find GWAS/clinical variants disrupting loop anchors |
| Loop anchor coordinates | liftover-coordinates | Convert hg19 loops to GRCh38 |
| Aggregated loop statistics | visualization-workflow | Generate loop frequency heatmaps |
| Loop-gene assignments | disease-research | Connect loop disruptions to disease phenotypes |
npx claudepluginhub ammawla/encode-toolkitAggregates Hi-C loop calls (BEDPE) across ENCODE experiments, donors, and labs to build union chromatin contact maps. Handles resolution-aware anchor matching and quality metrics for tissue 3D contacts.
Analyzes chromatin state, histone modifications, ATAC-seq accessibility, and TF binding from ENCODE, Roadmap Epigenomics, and ChIP-Atlas. Use for regulatory landscape mapping and cCRE annotations.
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.