From encode-toolkit
Compares ENCODE experiments across biosamples, tissues, or cell lines to identify tissue-specific regulatory patterns and constitutive elements. Useful for cross-tissue comparisons, batch effect detection, and data availability mapping.
Slash command: /encode-toolkit:compare-biosamples
When to use:
- User wants to compare ENCODE experiments across different tissues, cell lines, or biosamples
Help the user systematically compare data availability and experiments across different biosamples to identify tissue-specific regulatory patterns, constitutive elements, and cross-tissue differences.
Cross-biosample comparison is the foundation of understanding tissue-specific gene regulation. Regulatory elements -- particularly enhancers -- are the primary drivers of cell-type identity, with promoters being largely shared across tissues. Comparing the same assay across multiple biosamples reveals which regulatory elements are constitutive (shared) versus tissue-specific (unique to one or few cell types).
The core question: "Which regulatory features distinguish tissue A from tissue B, and which are shared?"
This requires careful matching of datasets, awareness of batch effects, and understanding of the biosample hierarchy to avoid confounding biological signal with technical variation.
| # | Reference | Key Contribution |
|---|---|---|
| 1 | Roadmap Epigenomics Consortium 2015, Nature, DOI:10.1038/nature14248 (~5,810 cit) | Generated 111 reference epigenomes across tissues/cell types; established the framework for cross-tissue epigenomic comparison. Showed that enhancer chromatin states are the most tissue-variable elements. |
| 2 | ENCODE Phase 3 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | Expanded functional annotations to 1.3M candidate cis-regulatory elements (cCREs) across hundreds of biosamples; defined tissue-activity indices for regulatory elements. |
| 3 | Andersson et al. 2014, Nature, DOI:10.1038/nature12787 (~1,500 cit) | FANTOM5 atlas of active enhancers across 808 samples; demonstrated that only ~5% of enhancers are active across all tissues, with the majority being highly tissue-specific. |
| 4 | Heintzman et al. 2009, Nature, DOI:10.1038/nature07917 (~2,200 cit) | Showed histone modifications distinguish cell types: H3K4me1/H3K27ac at enhancers are the most discriminating tissue-specific marks, while H3K4me3 at promoters is largely shared. |
| 5 | Thurman et al. 2012, Nature, DOI:10.1038/nature11232 (~2,000 cit) | Mapped accessible chromatin across 125 cell types; demonstrated that DNase I hypersensitive sites define cell-type identity and that accessibility patterns cluster by tissue of origin. |
| 6 | Leek et al. 2010, Nat Rev Genet, DOI:10.1038/nrg2825 (~1,200 cit) | Comprehensive review of batch effects in genomic data; showed that lab, platform, and processing date can dominate biological variation if not properly controlled. |
| 7 | Forrest et al. 2014, Nature, DOI:10.1038/nature13182 (~1,100 cit) | FANTOM5 promoter-level expression atlas across 975 samples; demonstrated that promoter usage (not just gene expression) is tissue-specific and defines cell identity. |
Understanding what varies across tissues and what does not is essential before designing a comparison.
| Feature | Cross-Tissue Behavior | Implication for Comparison |
|---|---|---|
| Promoters (H3K4me3) | Largely shared (~70% active in most tissues) | Poor discriminators between tissues |
| Enhancers (H3K27ac + H3K4me1) | Highly tissue-specific (~5% shared across all tissues) | Best discriminators; focus comparison here |
| Chromatin accessibility (ATAC/DNase) | Moderate tissue-specificity (~20-30% shared) | Good secondary discriminator; clusters by tissue of origin |
| Polycomb repression (H3K27me3) | Tissue-specific (marks silenced developmental genes) | Useful for identifying repressed lineage programs |
| Gene expression (RNA-seq) | Moderate tissue-specificity | Housekeeping genes shared; tissue-specific TFs are key |
| CTCF binding | Largely constitutive (~70% conserved) | Defines structural boundaries; less tissue-variable |
| DNA methylation | Bimodal; enhancers show tissue-variable methylation | Hypomethylation at active enhancers is tissue-specific |
H3K27ac at enhancers is the single most informative mark for distinguishing tissues (Heintzman et al. 2009, Roadmap 2015). If the user can only compare one mark across tissues, H3K27ac should be the first choice, followed by chromatin accessibility (ATAC-seq or DNase-seq).
| Level | Description | Biological Relevance | Reproducibility | Caveats |
|---|---|---|---|---|
| Tissue | Primary tissue from donor (e.g., pancreas, liver) | Highest -- in vivo biology preserved | Lower -- donor variation, cell-type heterogeneity | Mixed cell populations; composition varies by donor age/sex/health |
| Primary cell | Cells isolated from tissue (e.g., hepatocytes, islets) | High -- enriched for cell type | Moderate -- isolation stress, limited passages | Isolation method alters phenotype; culture conditions matter |
| Cell line | Immortalized cells (e.g., K562, HepG2, GM12878) | Lower -- transformed phenotype | Highest -- clonal, reproducible | May not represent normal tissue biology; passage number matters |
| In vitro differentiated | Cells derived from stem cells (e.g., iPSC-derived cardiomyocytes) | Moderate -- model system | Moderate -- protocol-dependent | Differentiation efficiency varies; often immature phenotype |
| Organoid | 3D self-organizing structures | Moderate-high -- recapitulates tissue architecture | Lower -- heterogeneous | Emerging data type in ENCODE; limited coverage |
| Cell Line | Origin | Cancer/Normal | Best For |
|---|---|---|---|
| K562 | Chronic myelogenous leukemia | Cancer | Hematopoietic chromatin, TF binding, 3D genome |
| GM12878 | Lymphoblastoid (EBV-transformed B cells) | Transformed-normal | Immune regulation, 3D genome (Rao et al. 2014 Hi-C reference) |
| H1-hESC | Human embryonic stem cells | Normal | Developmental regulation, bivalent chromatin |
These three cell lines have the most complete multi-omic profiling in ENCODE. They are excellent positive controls for verifying comparison pipelines before applying to user-specific tissues.
Clarify the comparison type with the user. Each design has different requirements:
| Design | Description | Required Matching | Key Tools | Best File Types |
|---|---|---|---|---|
| Cross-tissue (same assay) | Same mark/assay in different organs | Same assay, same target, same assembly, same biosample_type | encode_search_experiments, encode_get_facets | IDR thresholded peaks, fold change over control |
| Multi-omic (same tissue) | Multiple assays in one biosample | Same biosample_term_name, same assembly | encode_get_facets, encode_search_experiments | Depends on assay |
| Disease vs normal | Pathological vs healthy tissue | Same organ, same assay, matched demographics | encode_search_experiments with biosample filter | IDR thresholded peaks, gene quantifications |
| Developmental time course | Same tissue at different life stages | Same organ, same assay, different life_stage | encode_search_experiments with life_stage filter | Signal tracks, gene quantifications |
| Cell line vs primary tissue | Transformed vs in vivo | Same organ of origin, same assay | encode_search_experiments, encode_compare_experiments | IDR thresholded peaks |
| Cross-species | Human vs mouse homologous tissues | Same organ, same assay, different organism | encode_search_experiments with organism filter | Requires liftOver; use signal tracks |
Ask the user: "What tissues/cell types are you comparing, and what assay are you focusing on?"
Use encode_get_facets to build an availability matrix before searching for specific experiments.
# For each tissue of interest, discover available assays and targets
encode_get_facets(organ="pancreas")
encode_get_facets(organ="liver")
encode_get_facets(organ="brain")
# See which organs have Histone ChIP-seq data
encode_get_facets(assay_title="Histone ChIP-seq")
# See which organs have ATAC-seq data
encode_get_facets(assay_title="ATAC-seq")
Present to the user a matrix like:
| Assay / Target | Pancreas tissue | Liver tissue | Brain tissue | K562 | GM12878 |
|---|---|---|---|---|---|
| H3K27ac ChIP-seq | 3 exp | 5 exp | 8 exp | 12 exp | 10 exp |
| H3K4me3 ChIP-seq | 2 exp | 4 exp | 6 exp | 11 exp | 9 exp |
| ATAC-seq | 1 exp | 3 exp | 5 exp | 4 exp | 3 exp |
| RNA-seq | 4 exp | 6 exp | 10 exp | 15 exp | 8 exp |
| WGBS | 0 | 2 exp | 3 exp | 2 exp | 2 exp |
Highlight gaps: "Pancreas has no WGBS data -- comparison of methylation patterns will be limited to liver and brain."
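A minimal sketch of how per-biosample counts could be turned into the matrix above and scanned for gaps. The input dictionary shape is an assumption made for illustration; it is not the documented return format of encode_get_facets, and the counts are the example values from the table.

```python
from typing import Dict, List, Tuple

# Hypothetical input: experiment counts per biosample, keyed by assay.
# The numbers are the illustrative values from the table above.
counts: Dict[str, Dict[str, int]] = {
    "Pancreas tissue": {"H3K27ac ChIP-seq": 3, "ATAC-seq": 1, "RNA-seq": 4},
    "Liver tissue": {"H3K27ac ChIP-seq": 5, "ATAC-seq": 3, "RNA-seq": 6, "WGBS": 2},
    "Brain tissue": {"H3K27ac ChIP-seq": 8, "ATAC-seq": 5, "RNA-seq": 10, "WGBS": 3},
}


def availability_matrix(
    counts: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (assays, biosamples, rows), filling in 0 where no data exist."""
    biosamples = list(counts)
    assays = sorted({assay for per_sample in counts.values() for assay in per_sample})
    rows = [[counts[b].get(a, 0) for b in biosamples] for a in assays]
    return assays, biosamples, rows


def find_gaps(counts: Dict[str, Dict[str, int]]) -> List[Tuple[str, str]]:
    """List (assay, biosample) pairs that have no experiments."""
    assays, biosamples, rows = availability_matrix(counts)
    return [
        (assay, biosample)
        for assay, row in zip(assays, rows)
        for biosample, n in zip(biosamples, row)
        if n == 0
    ]


gaps = find_gaps(counts)
for assay, biosample in gaps:
    print(f"{biosample} has no {assay} data")
```

Reporting the gaps before any search makes it clear which comparisons are limited to a subset of biosamples.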
For a valid cross-tissue comparison, datasets must be matched on technical parameters. Search each tissue:
encode_search_experiments(
assay_title="Histone ChIP-seq",
target="H3K27ac",
organ="pancreas",
biosample_type="tissue",
limit=50
)
For each experiment pair across tissues, verify:
| Parameter | Must Match? | How to Check |
|---|---|---|
| Assay title | Yes | Search filter |
| Target (for ChIP) | Yes | Search filter |
| Genome assembly | Yes | File metadata; use GRCh38 |
| Biosample type | Recommended | Search filter |
| Organism | Yes | Search filter |
| Life stage | Recommended | Experiment metadata |
| Sex | Preferred | Experiment metadata |
| Pipeline version | Preferred | encode_get_experiment |
| Sequencing depth | Comparable (within 2x) | File metadata |
| Read length | Preferred | File metadata |
Track candidate experiments and use encode_compare_experiments for each cross-tissue pair:
# Track experiments from each tissue
encode_track_experiment(accession="ENCSR_pancreas")
encode_track_experiment(accession="ENCSR_liver")
# Check compatibility
encode_compare_experiments(
accession1="ENCSR_pancreas",
accession2="ENCSR_liver"
)
The compatibility tool checks:
For cross-tissue comparison: Biosample mismatch is expected -- it is the variable of interest. Focus on ensuring all other parameters match.
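The matching rules in the table above can be sketched as a small pairwise check. This is an illustration only: the `ExperimentMeta` record and its field names are assumptions rather than the ENCODE schema, `ENCSR_legacy` is a made-up placeholder, and the code does not reproduce what encode_compare_experiments does internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentMeta:
    """Illustrative metadata record; field names are assumptions."""
    accession: str
    assay_title: str
    target: str
    assembly: str
    organism: str
    biosample_type: str
    depth_millions: float


# Parameters the table marks as "Yes" (must match).
MUST_MATCH = ("assay_title", "target", "assembly", "organism")


def check_pair(a: ExperimentMeta, b: ExperimentMeta) -> List[str]:
    """Return a list of problems; an empty list means the pair looks comparable."""
    problems: List[str] = []
    for name in MUST_MATCH:
        left, right = getattr(a, name), getattr(b, name)
        if left != right:
            problems.append(f"MISMATCH {name}: {left} vs {right}")
    # Biosample type is "Recommended", so report it as a warning only.
    if a.biosample_type != b.biosample_type:
        problems.append(f"WARN biosample_type: {a.biosample_type} vs {b.biosample_type}")
    # Sequencing depth should be comparable (within 2x).
    low, high = sorted((a.depth_millions, b.depth_millions))
    if low <= 0 or high / low > 2.0:
        problems.append(f"WARN depth: {low}M vs {high}M differ by more than 2x")
    return problems


pancreas = ExperimentMeta("ENCSR_pancreas", "Histone ChIP-seq", "H3K27ac",
                          "GRCh38", "Homo sapiens", "tissue", 22.0)
liver = ExperimentMeta("ENCSR_liver", "Histone ChIP-seq", "H3K27ac",
                       "GRCh38", "Homo sapiens", "tissue", 18.0)
legacy = ExperimentMeta("ENCSR_legacy", "Histone ChIP-seq", "H3K27ac",
                        "hg19", "Homo sapiens", "tissue", 60.0)

print(check_pair(pancreas, liver))
for problem in check_pair(pancreas, legacy):
    print(problem)
```

The biosample name itself is deliberately not compared, since it is the variable of interest in a cross-tissue design.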
Batch effects are the most common confounder in cross-biosample comparisons. When experiments come from different labs, platforms, or processing dates, technical variation can dominate biological signal.
| Source | Impact | Detection Method |
|---|---|---|
| Lab of origin | High -- different protocols, antibodies, cell handling | Check lab field; PCA of signal should not cluster by lab |
| Sequencing platform | Moderate -- read quality, GC bias | Check platform in file metadata |
| Library preparation date | Moderate -- reagent lots, operator variation | Check experiment date_released |
| Antibody lot | High for ChIP-seq -- different enrichment profiles | Check antibody_lot_reviews in experiment metadata |
| Pipeline version | Low-moderate -- different peak calling parameters | Check analysis pipeline version |
| Read length | Low-moderate -- affects mappability | Check read_length in file metadata |
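Before running any PCA, a quick check is whether lab and tissue are fully confounded in the selected experiments. The sketch below assumes the experiments have already been reduced to (tissue, lab) pairs; the lab names are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def tissues_per_lab(experiments: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each lab to the set of tissues it contributed."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for tissue, lab in experiments:
        grouped[lab].add(tissue)
    return dict(grouped)


def is_fully_confounded(experiments: Iterable[Tuple[str, str]]) -> bool:
    """True when two or more tissues are compared and no lab covers more than one."""
    by_lab = tissues_per_lab(experiments)
    all_tissues = set().union(*by_lab.values()) if by_lab else set()
    if len(all_tissues) < 2:
        return False  # nothing to confound with a single tissue
    return all(len(tissues) == 1 for tissues in by_lab.values())


# Placeholder design: every tissue comes from its own lab.
confounded = [("pancreas", "Lab A"), ("pancreas", "Lab A"), ("liver", "Lab B")]
# Placeholder design: Lab A profiled both tissues, so lab and tissue can be separated.
bridged = [("pancreas", "Lab A"), ("liver", "Lab A"), ("liver", "Lab B")]

print(is_fully_confounded(confounded))
print(is_fully_confounded(bridged))
```

A fully confounded design does not prove that a batch effect exists, but it means tissue differences cannot be separated from lab differences with these experiments alone.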
For each matched experiment, select files that are directly comparable:
# Get the recommended files for each experiment
encode_list_files(
experiment_accession="ENCSR...",
preferred_default=True,
assembly="GRCh38"
)
ENCODE Blacklist filtering (required before comparison): Before any cross-tissue comparison, remove peaks and signal in ENCODE Blacklist regions (Amemiya et al. 2019, Scientific Reports, 1,372 citations). Blacklisted regions produce artifactual signal that appears consistent across tissues, inflating the count of "constitutive" elements. They can also show variable signal due to copy number differences between cell lines, creating false tissue-specific hits. Filter before comparison:
- hg38-blacklist.v2.bed.gz from Boyle-Lab/Blacklist
- mm10-blacklist.v2.bed.gz
- bedtools intersect -v -a peaks.bed -b blacklist.bed > peaks.filtered.bed

| Comparison Goal | File Type | Output Type | Why |
|---|---|---|---|
| Peak overlap / tissue-specific peaks | bed narrowPeak | IDR thresholded peaks | Binary: present or absent in each tissue |
| Quantitative signal comparison | bigWig | fold change over control | Normalized signal; comparable across experiments |
| Differential expression | tsv | gene quantifications | TPM/FPKM for cross-tissue expression comparison |
| Chromatin state annotation | bed narrowPeak | All histone marks | Required for ChromHMM/chromatin state analysis |
| Visualization / heatmaps | bigWig | signal of unique reads | Raw signal for deepTools or genome browser |
Do NOT mix IDR thresholded peaks from one tissue with pseudoreplicated peaks from another. This introduces systematic differences in peak number and stringency that confound biological comparison.
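The rule against mixing peak types can be enforced with a simple check over the files chosen for each tissue. This is a sketch under the assumption that the selection is held as a biosample-to-output-type mapping; the strings are taken from the text above and may not match ENCODE's exact output type labels.

```python
from typing import Dict, Set


def find_mixed_output_types(selected: Dict[str, str]) -> Set[str]:
    """Return the distinct output types if more than one is in use, else an empty set."""
    types = set(selected.values())
    return types if len(types) > 1 else set()


# Hypothetical selections, one output type per biosample.
consistent = {
    "pancreas": "IDR thresholded peaks",
    "liver": "IDR thresholded peaks",
}
mixed = {
    "pancreas": "IDR thresholded peaks",
    "liver": "pseudoreplicated peaks",
}

for name, selection in (("consistent", consistent), ("mixed", mixed)):
    offending = find_mixed_output_types(selection)
    if offending:
        print(f"{name}: mixed output types {sorted(offending)}")
    else:
        print(f"{name}: OK")
```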
Assemble a structured metadata table for all experiments in the comparison:
| Biosample | Organ | Type | Assay | Target | Accession | Audit | Depth | Lab | Pipeline |
|-----------|-------|------|-------|--------|-----------|-------|-------|-----|----------|
| Pancreas | pancreas | tissue | Histone ChIP | H3K27ac | ENCSR... | clean | 22M | Bernstein | v2.1 |
| Liver | liver | tissue | Histone ChIP | H3K27ac | ENCSR... | warn | 18M | Snyder | v2.1 |
| Brain | brain | tissue | Histone ChIP | H3K27ac | ENCSR... | clean | 25M | Bernstein | v2.1 |
Use encode_summarize_collection after tracking all experiments for a bird's-eye view:
# After tracking all experiments
encode_summarize_collection()
Flag potential issues in the matrix:
Peaks (or signals) present in one tissue but absent in others. The canonical approach:
- bedtools intersect -v to find peaks unique to each tissue

Peaks present across ALL tissues compared:
- bedtools multiinter across all tissue peak files

For continuous signal comparison across tissues:
- multiBigwigSummary to compute genome-wide signal matrix
- plotHeatmap at tissue-specific peak sets to visualize differences
- encode_search_experiments with treatment or biosample filters

Record the entire comparison design and results:
# Track all experiments in the comparison
encode_track_experiment(accession="ENCSR_tissue1", notes="Cross-tissue H3K27ac comparison - pancreas")
encode_track_experiment(accession="ENCSR_tissue2", notes="Cross-tissue H3K27ac comparison - liver")
# Log any derived comparison files
encode_log_derived_file(
file_path="/path/to/tissue_specific_peaks.bed",
source_accessions=["ENCSR_tissue1", "ENCSR_tissue2"],
description="Pancreas-specific H3K27ac peaks not found in liver",
file_type="differential_peaks",
tool_used="bedtools intersect v2.31.0",
parameters="bedtools intersect -a pancreas.bed -b liver.bed -v"
)
# Link relevant publications
encode_link_reference(
experiment_accession="ENCSR_tissue1",
reference_type="doi",
reference_id="10.1038/nature14248",
description="Roadmap Epigenomics reference for cross-tissue comparison methodology"
)
Confounding batch with biology: If all pancreas experiments come from Lab A and all liver experiments from Lab B, you cannot distinguish tissue differences from lab effects. Check lab metadata before interpreting any cross-tissue difference. Leek et al. (2010) showed that batch effects can dominate over 50% of total variation.
Mixing biosample types: Comparing K562 (cell line) H3K27ac with primary liver tissue H3K27ac conflates transformation-driven changes with tissue-specific regulation. Always compare within the same biosample type when possible.
Assembly mismatch: ALL files in a comparison must use the same genome assembly. GRCh38 and hg19 coordinates are NOT compatible. Use encode_compare_experiments to catch this.
Ignoring cell-type heterogeneity: Bulk tissue samples contain mixed cell populations. A "pancreas-specific" peak might actually be present in a minority cell type (e.g., delta cells). Single-cell data (scATAC-seq, scRNA-seq) can deconvolve this, but is not available for all tissues in ENCODE.
Depth-driven false differences: A deeply sequenced tissue will have more peaks called than a shallowly sequenced one, even if the underlying biology is identical. Always check sequencing depth and prefer normalized signal (fold change over control) for quantitative comparisons.
Incomplete panel comparison: Comparing 5 histone marks in tissue A but only 3 in tissue B produces a biased view. Document which marks are available in each tissue and restrict comparison to the intersection of available assays.
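Restricting a comparison to the marks available in every tissue is a set intersection. A minimal sketch, with an illustrative panel (the specific marks listed for each tissue are assumptions, not ENCODE availability data):

```python
from typing import Dict, Set


def common_panel(marks_by_tissue: Dict[str, Set[str]]) -> Set[str]:
    """Marks available in every tissue; empty if no tissues are given."""
    if not marks_by_tissue:
        return set()
    return set.intersection(*marks_by_tissue.values())


def dropped_marks(marks_by_tissue: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Per tissue, the marks excluded because another tissue lacks them."""
    shared = common_panel(marks_by_tissue)
    return {tissue: marks - shared for tissue, marks in marks_by_tissue.items()}


# Illustrative panels: five marks in tissue A, three in tissue B.
marks_by_tissue = {
    "tissue A": {"H3K27ac", "H3K4me1", "H3K4me3", "H3K27me3", "H3K36me3"},
    "tissue B": {"H3K27ac", "H3K4me3", "H3K27me3"},
}

panel = common_panel(marks_by_tissue)
dropped = dropped_marks(marks_by_tissue)
print(sorted(panel))
print({tissue: sorted(marks) for tissue, marks in dropped.items()})
```

Recording the dropped marks alongside the shared panel documents what each tissue had available and why it was excluded.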
Goal: Systematically compare ENCODE epigenomic data between normal tissue and cancer cell lines to identify disease-specific regulatory changes. Context: Comparing biosamples reveals which regulatory elements are gained or lost in disease states.
encode_search_experiments(assay_title="Histone ChIP-seq", organ="liver", target="H3K27ac", organism="Homo sapiens")
Expected output:
{
"total": 8,
"results": [
{"accession": "ENCSR100LIV", "biosample_summary": "liver", "target": "H3K27ac"},
{"accession": "ENCSR200HEP", "biosample_summary": "HepG2", "target": "H3K27ac"}
]
}
encode_compare_experiments(accession_1="ENCSR100LIV", accession_2="ENCSR200HEP")
Expected output:
{
"comparison": {
"shared": {"assay": "Histone ChIP-seq", "target": "H3K27ac", "organism": "Homo sapiens"},
"differences": {
"biosample": ["liver", "HepG2"],
"biosample_type": ["tissue", "cell line"]
}
}
}
encode_download_files(accessions=["ENCFF100LIV", "ENCFF200HEP"], download_dir="/data/comparison")
bedtools intersect -v -a liver_peaks.bed -b hepg2_peaks.bed > liver_specific.bed
bedtools intersect -v -a hepg2_peaks.bed -b liver_peaks.bed > hepg2_specific.bed
bedtools intersect -a liver_peaks.bed -b hepg2_peaks.bed > shared_peaks.bed
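Interpretation: HepG2-specific H3K27ac peaks are candidate cancer-gained enhancers, and liver-specific peaks are candidate enhancers lost in cancer. Because this pair also differs in biosample type (cell line vs. tissue), some differences may reflect transformation, culture conditions, or cell-type composition rather than disease alone.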
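To make the overlap logic behind the bedtools commands concrete, here is a pure-Python sketch of the `-v` behaviour (keep intervals in A with no overlap in B). It assumes BED-style half-open coordinates and uses made-up intervals; it is meant for understanding and small inputs, not as a replacement for bedtools.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open as in BED


def group_by_chrom(intervals: Iterable[Interval]) -> Dict[str, List[Tuple[int, int]]]:
    """Index intervals by chromosome for lookup."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in intervals:
        grouped[chrom].append((start, end))
    return grouped


def split_by_overlap(
    a: Iterable[Interval], b: Iterable[Interval]
) -> Tuple[List[Interval], List[Interval]]:
    """Split A into (no overlap with B, at least one overlap with B)."""
    grouped_b = group_by_chrom(b)
    unique: List[Interval] = []
    shared: List[Interval] = []
    for chrom, start, end in a:
        hit = any(start < b_end and b_start < end
                  for b_start, b_end in grouped_b.get(chrom, []))
        (shared if hit else unique).append((chrom, start, end))
    return unique, shared


# Made-up peaks for illustration only.
liver_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
hepg2_peaks = [("chr1", 150, 250), ("chr2", 150, 300), ("chr3", 10, 20)]

liver_specific, liver_shared = split_by_overlap(liver_peaks, hepg2_peaks)
hepg2_specific, _ = split_by_overlap(hepg2_peaks, liver_peaks)
print(liver_specific)
print(hepg2_specific)
print(liver_shared)
```

Two details are worth noting. Book-ended intervals such as chr2:50-150 and chr2:150-300 do not overlap under half-open coordinates. The shared list here holds whole A intervals, which corresponds to `bedtools intersect -u`; the default `bedtools intersect -a -b` call writes the overlapping sub-intervals instead. Passing blacklist regions as `b` gives the same filtering as the `bedtools intersect -v` blacklist step.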
encode_compare_experiments(accession_1="ENCSR100LIV", accession_2="ENCSR200HEP")
Expected output:
{
"comparison": {
"shared": {"assay": "Histone ChIP-seq", "target": "H3K27ac"},
"differences": {"biosample": ["liver", "HepG2"]}
}
}
encode_get_facets(assay_title="Histone ChIP-seq", facet_field="biosample_ontology.term_name", target="H3K27ac", organism="Homo sapiens")
Expected output:
{
"facets": {
"biosample_ontology.term_name": {"K562": 8, "GM12878": 7, "HepG2": 5, "liver": 4, "brain": 3}
}
}
encode_track_experiment(accession="ENCSR100LIV", notes="Liver H3K27ac - normal tissue control for HepG2 comparison")
Expected output:
{"status": "tracked", "accession": "ENCSR100LIV", "notes": "Liver H3K27ac - normal tissue control for HepG2 comparison"}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Differential peak sets | peak-annotation | Assign biosample-specific peaks to genes |
| Biosample-specific enhancers | disease-research | Identify disease-gained/lost regulatory elements |
| Shared regulatory elements | regulatory-elements | Define constitutive vs. tissue-specific cCREs |
| Cell composition context | cellxgene-context | Deconvolve tissue heterogeneity effects |
| Expression differences | gtex-expression | Validate regulatory changes with expression data |
| Comparison metadata | data-provenance | Document biosample comparison analysis |
| Differential regions | variant-annotation | Find variants in biosample-specific regulatory elements |
npx claudepluginhub ammawla/encode-toolkit