From encode-toolkit
Aggregates narrowPeak ChIP-seq data from multiple ENCODE experiments, donors, and labs into union histone mark maps with confidence annotations, handling batch effects and blacklists.
How this skill is triggered — by the user, by Claude, or both
Slash command
/encode-toolkit:histone-aggregationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User wants to combine histone ChIP-seq peaks across multiple ENCODE experiments for a tissue or cell type
Build a comprehensive map of histone mark binding for a tissue/cell type by merging narrowPeak files from multiple ENCODE experiments into a union peak set.
The question: "Does my tissue have this histone mark, and at what genomic locations?"
This is a detection/cataloging question, not a differential one. Once a histone mark passes noise thresholds (ENCODE IDR, quality metrics), detection is binary — the mark is either bound or not. If detected in one donor but not another, that region is still a real binding site. Individual variation and technical differences (lab, depth, antibody lot) explain absence, not that presence is spurious.
Therefore: we want the UNION of all detections, not a consensus.
Search for all histone ChIP-seq data for the target mark and tissue:
encode_search_experiments(
assay_title="Histone ChIP-seq",
target="H3K4me1", # or H3K27ac, H3K4me3, H3K27me3, etc.
organ="pancreas", # user's tissue of interest
biosample_type="tissue", # or "cell line", "primary cell"
limit=100
)
Present a summary table to the user showing:
Use encode_get_facets first if unsure what's available:
encode_get_facets(assay_title="Histone ChIP-seq", organ="pancreas")
For each experiment, check quality before including:
encode_get_experiment(accession="ENCSR...")
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
For each passing experiment, get the peak files:
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
File selection priority:
Prefer preferred_default=True files when available.
Download all selected files:
encode_download_files(
file_accessions=["ENCFF...", "ENCFF...", ...],
download_dir="/path/to/data/narrowpeaks",
organize_by="flat"
)
IMPORTANT: Filter BEFORE merging, not after.
Remove artifact-prone regions (centromeres, telomeres, rDNA repeats, satellite repeats):
# Download ENCODE blocklist for GRCh38 from:
# https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
# For mm10: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/mm10-blacklist.v2.bed.gz
bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak
Filter each sample's peaks to retain those above the 25th percentile signalValue (column 7 in narrowPeak). The top 75% of peaks by signalValue are the most reliable across processing pipelines:
# Calculate the 25th percentile of the signalValue DISTRIBUTION for this sample
# (This is a true quantile, not 25% of the range)
TOTAL=$(wc -l < sample.filtered.narrowPeak)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}')
awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak
Note: This is a per-sample filter. Each experiment has different signal distributions. Do NOT apply a universal threshold across samples.
Different histone marks have different peak characteristics:
| Mark Class | Examples | Peak Type | Merge Gap (-d) |
|---|---|---|---|
| Narrow/point | H3K4me3, H3K27ac, H3K4me1, H3K9ac | Sharp peaks | 0 (default, overlap only) |
| Broad/domain | H3K27me3, H3K9me3, H3K36me3 | Wide domains | 1000-5000bp |
CRITICAL: To count unique SAMPLES (not overlapping peaks), tag each peak with its sample ID before concatenation:
# Tag each sample's peaks with a unique sample ID (column 4 = name field)
awk -v sid="sample1" 'BEGIN{OFS="\t"} {$4=sid; print}' sample1.qfiltered.narrowPeak > sample1.tagged.bed
awk -v sid="sample2" 'BEGIN{OFS="\t"} {$4=sid; print}' sample2.qfiltered.narrowPeak > sample2.tagged.bed
# ... repeat for all samples
# Concatenate all tagged peaks
cat sample*.tagged.bed > all_peaks.bed
# Sort by coordinate
bedtools sort -i all_peaks.bed > all_peaks.sorted.bed
# Merge overlapping peaks, counting UNIQUE SAMPLES (not peaks)
# For NARROW marks (H3K4me3, H3K27ac, H3K4me1):
bedtools merge \
-i all_peaks.sorted.bed \
-c 4,7,9 \
-o count_distinct,max,max \
> union_peaks.bed
# Output columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue
# For BROAD marks (H3K27me3, H3K9me3, H3K36me3):
bedtools merge \
-i all_peaks.sorted.bed \
-d 1000 \
-c 4,7,9 \
-o count_distinct,max,max \
> union_peaks.bed
Why count_distinct matters: Without it, a sample with 3 overlapping peaks inflates the count to 3 instead of 1, making the confidence annotation wrong.
Note on broad marks: H3K27me3, H3K9me3, and H3K36me3 are typically called as broadPeak (not narrowPeak) in ENCODE. When downloading, check for output_type="replicated peaks" with file_type="bed broadPeak". BroadPeak has the same first 9 columns as narrowPeak minus the summit column.
If you want to know WHICH samples support each region:
bedtools multiIntersect \
-i sample1.qfiltered.narrowPeak sample2.qfiltered.narrowPeak ... \
-header \
-names sample1 sample2 ... \
> multi_intersect.bed
Annotate each merged region by number of supporting samples. Given N total samples:
| Confidence | Criteria | Interpretation |
|---|---|---|
| High | Detected in ≥50% of samples | Robust binding site, consistent across donors/labs |
| Supported | Detected in 2+ samples | Likely real, some individual/technical variation |
| Novel/singleton | Detected in 1 sample only | May be real but could be noise — keep but flag |
# Add confidence column (assuming N=6 total samples, column 4 = support count)
awk -v N=6 '{
if ($4 >= N*0.5) conf="HIGH";
else if ($4 >= 2) conf="SUPPORTED";
else conf="SINGLETON";
print $0"\t"conf"\t"$4"/"N
}' union_peaks.bed > union_peaks.annotated.bed
CRITICAL: Do NOT discard singletons. A peak detected in 1 of 6 donors is still a real binding event in that individual. The question is "can this mark bind here?" — and the answer is yes.
Record the entire analysis chain:
encode_log_derived_file(
file_path="/path/to/union_peaks.annotated.bed",
source_accessions=["ENCSR...", "ENCSR...", ...],
description="Union H3K4me1 peaks across N pancreas samples with confidence annotation",
file_type="aggregated_peaks",
tool_used="bedtools merge v2.31.0",
parameters="blocklist filtered, signalValue >= 25th percentile per sample, bedtools merge -d 0 for narrow marks"
)
Report to the user:
Assembly mismatch: ALL files must use the same assembly (GRCh38 or hg19). Use encode_compare_experiments to verify.
Broad marks need gap tolerance: H3K27me3 domains can span 10-100kb. Adjacent peaks from different samples should be merged with -d 1000 or more.
Don't use consensus for cataloging: Requiring presence in N/M samples discards real biology. Consensus is for high-confidence subsets, not comprehensive catalogs.
Filter noise per-sample, not post-merge: SignalValue thresholds must be applied within each sample because signal distributions differ by library/sequencing depth.
Antibody lot variation: Even same-target experiments can have different peak profiles due to antibody batch effects. This is expected — it's why we use the union.
Peak width variation across labs: Different peak callers and parameters produce different peak widths. bedtools merge handles this naturally by collapsing overlaps.
ChIP-R as alternative: If the user wants a more statistical approach than simple union, recommend ChIP-R (rank-product method on narrowPeak files, no BAMs needed). But for the "is it bound anywhere?" question, union is more appropriate.
Peak summits are lost after merge: NarrowPeak column 10 encodes the peak summit position (offset from start). bedtools merge discards this. If downstream analysis requires summit positions (e.g., motif analysis), extract summits before merging and map back afterward.
CUT&RUN/CUT&Tag data need a separate suspect list: The ENCODE blacklist was designed for ChIP-seq. Nordin et al. 2023 (Genome Biology) showed CUT&RUN has its own problematic regions. If incorporating CUT&RUN/CUT&Tag data, apply both the ENCODE blacklist AND the CUT&RUN suspect list.
For detailed biological meaning of each histone mark (writers, erasers, readers, contradictions, cancer-specific states), consult the comprehensive reference at references/histone-marks-reference.md (co-located in this skill's references directory). This catalog covers 21 individual marks, ChromHMM combinatorial states, functional categories, and 37 key papers.
Key mark-type considerations for aggregation:
Goal: Aggregate H3K27ac peaks from 5 ENCODE experiments into a union peak set for pancreatic islets. Context: User wants to identify all genomic regions with active enhancer marks across multiple donors.
encode_search_experiments(
assay_title="Histone ChIP-seq",
organ="pancreas",
target="H3K27ac"
)
Expected output:
{
"total": 5,
"experiments": [
{"accession": "ENCSR111PAN", "biosample_summary": "pancreas tissue male adult (44 years)"},
{"accession": "ENCSR222PAN", "biosample_summary": "pancreas tissue female adult (51 years)"}
]
}
encode_search_files(
assay_title="Histone ChIP-seq",
organ="pancreas",
target="H3K27ac",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
Expected output:
{
"total": 5,
"files": [
{"accession": "ENCFF100PK1", "output_type": "IDR thresholded peaks", "file_size_mb": 1.2},
{"accession": "ENCFF200PK2", "output_type": "IDR thresholded peaks", "file_size_mb": 1.5}
]
}
cat *.bed | sort -k1,1 -k2,2n | bedtools merge -i - -c 4,5 -o count,mean > union_h3k27ac_pancreas.bed
Interpretation: Peaks present in ≥3 of 5 experiments are high-confidence tissue enhancers. Singleton peaks may reflect donor-specific or noise peaks.
encode_search_files(
assay_title="Histone ChIP-seq",
organ="liver",
target="H3K4me3",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
Expected output:
{
"total": 8,
"files": [{"accession": "ENCFF999LIV", "output_type": "IDR thresholded peaks", "assembly": "GRCh38"}]
}
| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| Union peak set (BED) | Peak annotation with genes | peak-annotation skill |
| Consensus enhancer map | Chromatin state classification | regulatory-elements skill |
| Multi-donor confidence scores | Quality filtering | quality-assessment skill |
| Tissue histone mark catalog | Cross-tissue comparison | compare-biosamples skill |
| Peak coordinates for motif analysis | TF motif enrichment | motif-analysis skill |
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkitAggregates narrowPeak ChIP-seq data from multiple ENCODE experiments, donors, and labs into union histone mark maps with confidence annotations, handling batch effects and blacklists.
Analyzes genomics and epigenomics data: DNA methylation (CpG, bisulfite, RRBS), m6A RNA modification (MeRIP-seq), ChIP-seq peaks, ATAC-seq, histone modifications, chromatin state, and multi-omics integration using pandas/scipy/pysam computation.
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.