From encode-toolkit
Aggregates ATAC-seq and DNase-seq narrowPeak data from multiple ENCODE experiments, donors, and labs into union chromatin accessibility maps for tissues. Handles cross-lab and platform differences with ENCODE blacklist filtering.
How this skill is triggered — by the user, by Claude, or both
Slash command
/encode-toolkit:accessibility-aggregationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User wants to combine ATAC-seq or DNase-seq peaks across multiple experiments for a tissue
Build a comprehensive map of open chromatin for a tissue/cell type by merging ATAC-seq and/or DNase-seq narrowPeak files from multiple ENCODE experiments.
The question: "Where is chromatin accessible in my tissue?"
Like histone marks, chromatin accessibility is a detection question. An open chromatin region detected in one donor but not another is still a real accessible site — individual variation, sequencing depth, and technical factors explain absence. We want the union of all detections.
Both measure open chromatin but with different biases:
| Property | ATAC-seq | DNase-seq |
|---|---|---|
| Method | Tn5 transposase insertion | DNase I hypersensitivity |
| Input required | ~50K cells | ~1M cells |
| Resolution | High | High |
| GC bias | Moderate (Tn5 preference) | Low |
| Mitochondrial reads | High (filter needed) | None |
| ENCODE availability | Newer experiments | Extensive historical catalog |
| Comparability | Generally comparable at open regions |
Recommendation: If combining ATAC-seq and DNase-seq peaks, treat them as equivalent signal sources for accessibility. The union is appropriate because both detect the same biological signal (open chromatin) through different enzymatic mechanisms.
# ATAC-seq
encode_search_experiments(
assay_title="ATAC-seq",
organ="pancreas",
biosample_type="tissue",
limit=100
)
# DNase-seq
encode_search_experiments(
assay_title="DNase-seq",
organ="pancreas",
biosample_type="tissue",
limit=100
)
Present a summary to the user:
Ask the user:
For a comprehensive accessibility catalog, combining both is scientifically justified.
encode_get_experiment(accession="ENCSR...")
Track all included experiments:
encode_track_experiment(accession="ENCSR...")
For each experiment:
# ATAC-seq — IDR thresholded peaks
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
# DNase-seq — may use different output types
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
output_type="peaks",
assembly="GRCh38"
)
Prefer preferred_default=True files.
encode_download_files(
file_accessions=["ENCFF...", ...],
download_dir="/path/to/data/accessibility",
organize_by="flat"
)
# Download from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
bedtools intersect -a sample.narrowPeak -b hg38-blacklist.v2.bed -v > sample.filtered.narrowPeak
Same logic as histone aggregation — filter per-sample to top 75% by signalValue (column 7):
# Per-sample: remove bottom 25% by signalValue (true distribution quantile)
TOTAL=$(wc -l < sample.filtered.narrowPeak)
LINE_25=$(echo "$TOTAL" | awk '{printf "%d", $1 * 0.25}')
THRESHOLD=$(sort -k7,7n sample.filtered.narrowPeak | awk -v line="$LINE_25" 'NR==line{print $7}')
awk -v t="$THRESHOLD" '$7 >= t' sample.filtered.narrowPeak > sample.qfiltered.narrowPeak
For ATAC-seq, very narrow peaks (<50bp) can be Tn5 insertion artifacts:
awk '($3-$2) >= 50' sample.qfiltered.narrowPeak > sample.clean.narrowPeak
Accessibility peaks are narrow/point-source (like H3K4me3). Use default merge (overlap only, no gap tolerance).
CRITICAL: Tag peaks by sample before concatenation to count unique SAMPLES, not overlapping peaks:
# Tag each sample's peaks with a unique sample ID
awk -v sid="atac_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' atac_sample1.qfiltered.narrowPeak > atac_s1.tagged.bed
awk -v sid="dnase_s1" 'BEGIN{OFS="\t"} {$4=sid; print}' dnase_sample1.qfiltered.narrowPeak > dnase_s1.tagged.bed
# ... repeat for all samples
# Concatenate all tagged peaks (ATAC + DNase combined or separate)
cat *.tagged.bed > all_accessibility.bed
# Sort
bedtools sort -i all_accessibility.bed > all_accessibility.sorted.bed
# Union merge — count UNIQUE SAMPLES (not peaks)
bedtools merge \
-i all_accessibility.sorted.bed \
-c 4,7,9 \
-o count_distinct,max,max \
> union_accessible_regions.bed
# Columns: chr, start, end, n_unique_samples, max_signalValue, max_qValue
To annotate whether peaks came from ATAC, DNase, or both:
# Add assay tag to each peak before concatenation
awk '{print $0"\tATAC"}' atac_peaks.bed > tagged.bed
awk '{print $0"\tDNase"}' dnase_peaks.bed >> tagged.bed
# After merge, use bedtools multiIntersect to track sources
bedtools multiIntersect \
-i atac_sample1.bed atac_sample2.bed dnase_sample1.bed ... \
-header \
-names ATAC_1 ATAC_2 DNase_1 ... \
> multi_intersect.bed
Same logic as histone aggregation. Given N total samples:
| Confidence | Criteria | Interpretation |
|---|---|---|
| High | ≥50% of samples | Constitutive accessible region |
| Supported | 2+ samples | Likely real, some variation |
| Singleton | 1 sample only | Keep — may be individual-specific or condition-specific |
awk -v N=8 '{
if ($4 >= N*0.5) conf="HIGH";
else if ($4 >= 2) conf="SUPPORTED";
else conf="SINGLETON";
print $0"\t"conf"\t"$4"/"N
}' union_accessible_regions.bed > union_accessible_regions.annotated.bed
encode_log_derived_file(
file_path="/path/to/union_accessible_regions.annotated.bed",
source_accessions=["ENCSR...", "ENCSR...", ...],
description="Union chromatin accessibility peaks (ATAC-seq + DNase-seq) across N pancreas samples",
file_type="aggregated_accessibility",
tool_used="bedtools merge v2.31.0",
parameters="blocklist filtered, signalValue >= 25th pctl per sample, ATAC min width 50bp, bedtools merge -d 0"
)
Report to the user:
ATAC mitochondrial reads: ENCODE pipeline removes these, but verify in QC metrics. High mitochondrial fraction indicates poor nuclear chromatin enrichment.
Tn5 sequence bias: ATAC-seq Tn5 has mild sequence preference. For union maps this is acceptable — bias affects peak intensity, not presence.
DNase hypersensitivity saturation: Deeply sequenced DNase-seq detects more sites. Shallowly sequenced samples contribute fewer peaks but are not wrong — they just miss weaker sites.
Promoter enrichment: Both assays are enriched at promoters. When comparing accessibility across tissues, note that promoter accessibility is largely constitutive while enhancer accessibility is tissue-specific.
Cell-type heterogeneity in tissue samples: Bulk ATAC/DNase from tissue captures accessibility across ALL cell types. A peak may represent a minor cell population. This is correct for a tissue-level map but important to note.
Do NOT mix assemblies: All files must be GRCh38 or all hg19. Use encode_compare_experiments to verify.
Peak summits lost after merge: NarrowPeak column 10 (summit offset) is discarded by bedtools merge. If you need summits for motif analysis, extract them before merging and map back afterward.
CUT&RUN/CUT&Tag accessibility data: If ENCODE adds CUT&RUN-based accessibility data in the future, apply the CUT&RUN suspect list (Nordin et al. 2023, Genome Biology) in addition to the ENCODE blacklist.
Goal: Merge ATAC-seq peaks from 4 brain cortex experiments into a union accessibility map. Context: User needs comprehensive open chromatin regions for regulatory element discovery.
encode_search_experiments(
assay_title="ATAC-seq",
organ="brain"
)
Expected output:
{
"total": 24,
"experiments": [
{"accession": "ENCSR001BRN", "biosample_summary": "brain cortex tissue male adult (53 years)"},
{"accession": "ENCSR002BRN", "biosample_summary": "brain cortex tissue female adult (49 years)"}
]
}
encode_search_files(
assay_title="ATAC-seq",
organ="brain",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
Expected output:
{
"total": 8,
"files": [{"accession": "ENCFF001ATQ", "output_type": "IDR thresholded peaks", "assembly": "GRCh38"}]
}
cat *.narrowPeak | sort -k1,1 -k2,2n | bedtools merge -i - -c 4,5 -o count,mean > union_atac_brain.bed
Interpretation: Union peaks represent all genomic positions where chromatin is accessible in brain cortex. Peaks found in all 4 donors are constitutive regulatory elements.
encode_get_facets(organ="pancreas", assay_title="ATAC-seq")
Expected output:
{
"facets": {"biosample_term_name": {"pancreas": 4, "pancreatic islet": 3}, "status": {"released": 6}}
}
| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| Union open chromatin map (BED) | Enhancer identification | regulatory-elements skill |
| Accessible regions for motif analysis | TF motif discovery | motif-analysis skill |
| Tissue accessibility catalog | Cross-tissue comparison | compare-biosamples skill |
| Open chromatin at variant sites | Variant functional annotation | variant-annotation skill |
| Accessible peak coordinates | Visualization signal anchors | visualization-workflow skill |
npx claudepluginhub ammawla/encode-toolkitAggregates ATAC-seq and DNase-seq narrowPeak data from multiple ENCODE experiments, donors, and labs into union chromatin accessibility maps for tissues. Handles cross-lab and platform differences with ENCODE blacklist filtering.
Analyzes genomics and epigenomics data: DNA methylation (CpG, bisulfite, RRBS), m6A RNA modification (MeRIP-seq), ChIP-seq peaks, ATAC-seq, histone modifications, chromatin state, and multi-omics integration using pandas/scipy/pysam computation.
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.