From encode-toolkit
Leverages ENCODE functional genomics data to connect GWAS variants to regulatory elements, annotate disease-associated loci, identify therapeutic targets, and cross-reference clinical trials.
Slash command: /encode-toolkit:disease-research
Auto-load trigger (the summary Claude sees in its skill listing):
- User wants to connect GWAS variants to ENCODE regulatory elements for disease mechanism research
Leverage ENCODE's 926,535 cCREs and multi-layer functional data to understand disease mechanisms, interpret disease-associated variants, identify therapeutic targets, and connect genomic findings to clinical applications.
The question: "How can ENCODE functional genomics help me understand a disease's molecular mechanisms and identify actionable targets?"
Over 90% of disease-associated variants from GWAS fall in non-coding regions (Maurano et al. 2012). Most are thought to act by disrupting regulatory elements that control gene expression rather than by altering protein sequences. ENCODE provides the most comprehensive catalog of these elements across hundreds of cell types and tissues. This skill connects (1) genetic association data, (2) ENCODE functional annotations, and (3) clinical/pharmacological databases to identify druggable targets.
| Reference | Year | Journal | Key Contribution | Citations | DOI |
|---|---|---|---|---|---|
| ENCODE Phase 3 | 2020 | Nature | 926,535 human cCREs across 400+ biosamples, SCREEN portal | ~1,656 | 10.1038/s41586-020-2493-4 |
| Maurano et al. | 2012 | Science | Disease variants enriched in regulatory DNA; 76.6% of noncoding GWAS SNPs lie in a DNase I hypersensitive site or are in complete LD with a SNP in one | ~3,500 | 10.1126/science.1222794 |
| Finucane et al. | 2015 | Nat Genet | S-LDSC partitions heritability into functional annotations; ENCODE categories explain disproportionate heritability | ~2,253 | 10.1038/ng.3404 |
| Nasser et al. | 2021 | Nature | ABC model links enhancers to genes in 131 cell types; connected 5,036 GWAS signals to 2,249 genes | ~468 | 10.1038/s41586-021-03446-x |
| Roadmap Epigenomics | 2015 | Nature | 111 reference epigenomes; tissue-specific chromatin states; disease variant enrichment in tissue-specific marks | ~5,810 | 10.1038/nature14248 |
| Visscher et al. | 2017 | Am J Hum Genet | GWAS review — 10 years of discoveries, statistical frameworks, shift toward functional interpretation | ~2,500 | 10.1016/j.ajhg.2017.06.005 |
| Buniello et al. | 2019 | Nucleic Acids Res | GWAS Catalog — curated repository; >250,000 SNP-trait associations | ~3,000 | 10.1093/nar/gky1120 |
| Ochoa et al. | 2021 | Nucleic Acids Res | Open Targets Platform — integrates GWAS, functional genomics, drugs for systematic target identification | ~600 | 10.1093/nar/gkaa1027 |
ENCODE regulatory elements are highly tissue-specific. Correct tissue mapping is the single most important decision.
| Disease Category | Primary Tissues | Key Cell Types | ENCODE Cell Lines | Example Diseases |
|---|---|---|---|---|
| Neurological | brain (cortex, hippocampus, cerebellum) | neurons, astrocytes, microglia | SK-N-SH, SK-N-DZ, BE2C | Alzheimer's, Parkinson's, schizophrenia |
| Cardiovascular | heart, aorta, blood vessels | cardiomyocytes, endothelial, smooth muscle | HUVEC, HCASMC | coronary artery disease, heart failure |
| Metabolic | pancreas, liver, adipose, muscle | beta cells, hepatocytes, adipocytes | HepG2, Panc1 | type 2 diabetes, NAFLD, obesity |
| Cancer | tissue of origin | tumor cells, microenvironment | K562, HepG2, MCF-7, A549, HCT116, PC-3 | leukemia, breast cancer, lung cancer |
| Autoimmune | blood, immune organs, thymus | T cells, B cells, macrophages | GM12878, Jurkat | RA, lupus, MS, type 1 diabetes |
| Respiratory | lung, trachea | alveolar epithelial, bronchial | A549, IMR-90 | asthma, COPD, pulmonary fibrosis |
| Renal | kidney | podocytes, tubular epithelial | HEK293 | CKD, IgA nephropathy, FSGS |
| Hepatic | liver, bile duct | hepatocytes, cholangiocytes | HepG2, Hep3B | NAFLD, cirrhosis, hepatitis |
| Endocrine | thyroid, adrenal, pituitary, pancreas | thyrocytes, adrenal cortical, beta cells | — | hypothyroidism, Cushing's |
| Musculoskeletal | bone, cartilage, skeletal muscle | osteoblasts, chondrocytes, myocytes | — | osteoarthritis, osteoporosis |
| Gastrointestinal | intestine, colon, stomach | epithelial, goblet, Paneth cells | HCT116, Caco-2 | IBD, Crohn's, celiac |
| Hematological | blood, bone marrow | HSCs, erythrocytes, megakaryocytes | K562, GM12878, CD34+ | sickle cell, thalassemia, AML |
Check ENCODE availability:
encode_get_facets(organ="pancreas")
encode_get_facets(organ="brain")
If the tissue has limited data: Use Tier 1 cell lines (K562, GM12878, H1-hESC) or Roadmap Epigenomics as proxies, and document the tissue mismatch explicitly in any results.
Open chromatin and active enhancers in disease tissue (Maurano et al. 2012):
encode_search_experiments(assay_title="ATAC-seq", organ="...", biosample_type="tissue")
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", organ="...")
encode_search_experiments(assay_title="DNase-seq", organ="...", biosample_type="tissue")
Then use the variant-annotation skill to overlap variants with functional elements.
ENCODE cancer cell lines (NOT tumors — see Cancer Epigenomics section below):
encode_search_experiments(biosample_term_name="K562") # CML
encode_search_experiments(biosample_term_name="HepG2") # Hepatocellular carcinoma
encode_search_experiments(biosample_term_name="MCF-7") # ER+ breast cancer
encode_search_experiments(biosample_term_name="A549") # Lung adenocarcinoma
encode_search_experiments(perturbed=True, organ="...")
encode_search_experiments(assay_title="CRISPR screen", organ="...")
search_articles(query="[DISEASE] AND (ENCODE OR regulatory element OR enhancer)")
Track experiments and link papers:
encode_track_experiment(accession="ENCSR...", notes="Disease research - [disease]")
encode_get_citations(accession="ENCSR...")
encode_link_reference(experiment_accession="ENCSR...", reference_type="pmid", reference_id="12345678")
search_trials(condition="[DISEASE]", intervention="[TARGET_GENE or DRUG]", status=["RECRUITING"])
Link trials: encode_link_reference(experiment_accession="ENCSR...", reference_type="nct_id", reference_id="NCT...")
search_entities(query_strings=["[GENE_NAME]"])
query_open_targets_graphql(
query_string="query target($ensemblId: String!) { target(ensemblId: $ensemblId) { approvedSymbol knownDrugs { rows { drug { name } phase mechanismOfAction } } } }",
variables={"ensemblId": "ENSG..."}
)
search_preprints(category="genetics", recent_days=90)
encode_link_reference(experiment_accession="ENCSR...", reference_type="preprint_doi", reference_id="10.1101/...")
| Overlap Category | ENCODE Data Type | Interpretation |
|---|---|---|
| cCRE-PLS | DNase + H3K4me3 near TSS | Variant in promoter-like element |
| cCRE-dELS | DNase + H3K27ac >2kb from TSS | Variant in distal enhancer |
| cCRE-pELS | DNase + H3K27ac within 2kb of TSS | Variant in proximal enhancer |
| cCRE-CTCF | DNase + CTCF | Variant may disrupt insulator/boundary |
| TF ChIP peak | TF ChIP-seq | Variant at TF binding site |
| No overlap | — | Not active in tested tissues |
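The overlap categories in this table can be expressed as a small decision function. The sketch below is illustrative only: it assumes each signal has already been reduced to a boolean call at the variant position, and the function name, the `promoter_window_bp` cutoff, and the "Unclassified" label are assumptions introduced here, not part of the toolkit. Real cCRE classes should be taken from ENCODE's own annotations.

```python
def classify_overlap(dnase: bool, h3k4me3: bool, h3k27ac: bool, ctcf: bool,
                     tss_distance_bp: int, promoter_window_bp: int = 200) -> str:
    """Map boolean signal calls at a variant to a category from the table.

    Simplified illustration: promoter_window_bp is an assumed cutoff for
    "near TSS"; the 2 kb enhancer split follows the table.
    """
    if not dnase:
        return "No overlap"
    if h3k4me3 and tss_distance_bp <= promoter_window_bp:
        return "cCRE-PLS"
    if h3k27ac:
        # Enhancer-like elements are split at 2 kb from the nearest TSS.
        return "cCRE-pELS" if tss_distance_bp <= 2000 else "cCRE-dELS"
    if ctcf:
        return "cCRE-CTCF"
    # Accessible, but none of the tabulated mark combinations apply.
    return "Unclassified accessible region"


# Hypothetical variant: accessible, H3K27ac-marked, 50 kb from the nearest TSS.
print(classify_overlap(True, False, True, False, tss_distance_bp=50_000))
```

The order of the checks matters: promoter-like signal is tested before enhancer-like signal, so a variant carrying both marks close to a TSS is reported as promoter-like.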
Retrieve functional data files:
encode_list_files(experiment_accession="ENCSR...", file_format="bed", output_type="IDR thresholded peaks", assembly="GRCh38", preferred_default=True)
From GWAS hits to candidate causal variants to target genes:
This reduces a typical locus from 50-200 LD variants to 1-5 high-confidence candidates.
6-layer model combining ENCODE data types:
encode_search_experiments(assay_title="ATAC-seq", organ="...", biosample_type="tissue")
encode_search_experiments(assay_title="Histone ChIP-seq", target="H3K27ac", organ="...")
encode_search_experiments(assay_title="TF ChIP-seq", organ="...")
encode_search_experiments(assay_title="Hi-C", organ="...")
encode_search_experiments(assay_title="total RNA-seq", organ="...", biosample_type="tissue")
Connect ENCODE regulatory elements to druggable targets (Ochoa et al. 2021).
The variant-to-drug pipeline:
search_entities(query_strings=["[GENE_NAME]"])
query_open_targets_graphql(
query_string="query target($ensemblId: String!) { target(ensemblId: $ensemblId) { approvedSymbol tractability { label modality value } knownDrugs { count rows { drug { name } phase status } } } }",
variables={"ensemblId": "ENSG..."}
)
Targets with both ENCODE regulatory evidence AND GWAS genetic association are the strongest candidates — orthogonal lines of support.
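A minimal sketch of the "both lines of evidence" rule, assuming the two gene lists have already been produced by earlier steps. The function name and the placeholder gene symbols are hypothetical.

```python
def strongest_candidates(regulatory_genes, gwas_genes):
    """Return genes supported by BOTH ENCODE regulatory evidence and GWAS."""
    return sorted(set(regulatory_genes) & set(gwas_genes))


# Placeholder symbols, not real findings.
encode_supported = ["GENE_A", "GENE_B", "GENE_D"]
gwas_supported = ["GENE_B", "GENE_C", "GENE_D"]
print(strongest_candidates(encode_supported, gwas_supported))
```

Genes that appear in only one list are not discarded in practice; they simply rank lower until the second line of evidence is found.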
encode_track_experiment(accession="ENCSR...", notes="Validation - independent replicate")
encode_compare_experiments(accession1="ENCSR...", accession2="ENCSR...")
| Level | Method | Strength | In ENCODE? |
|---|---|---|---|
| Strongest | CRISPRi/CRISPRa perturbation | Direct causal test | Yes (ENCODE 4) |
| Strong | MPRA / STARR-seq | High-throughput activity | Yes (ENCODE 4) |
| Moderate | Reporter assays / eQTL coloc | Element activity / statistical | External |
| Supportive | Chromatin annotation / conservation | Biochemical / evolutionary | Yes / External |
encode_search_experiments(perturbed=True, organ="...")
encode_search_experiments(assay_title="CRISPR screen", organ="...")
search_trials(condition="[DISEASE]", intervention="[GENE or DRUG]", status=["RECRUITING"])
search_trials(condition="[DISEASE]", phase=["PHASE2", "PHASE3"])
search_by_sponsor(sponsor_name="[PHARMA]", condition="[DISEASE]", phase=["PHASE3"])
analyze_endpoints(condition="[DISEASE]", phase=["PHASE3"])
Link trials to tracked experiments:
encode_link_reference(experiment_accession="ENCSR...", reference_type="nct_id", reference_id="NCT...", description="Trial targeting [gene] from ENCODE analysis")
encode_track_experiment(accession="ENCSR...", notes="Disease research - [disease] - [layer]")
encode_log_derived_file(
file_path="/path/to/disease_model.bed",
source_accessions=["ENCSR...", "ENCSR..."],
description="Disease regulatory model - active enhancers overlapping GWAS loci",
file_type="disease_model", tool_used="bedtools intersect + ABC model",
parameters="GRCh38, IDR thresholded peaks, ABC score > 0.015"
)
encode_get_citations(export_format="bibtex")
encode_export_data(format="csv")
S-LDSC (Finucane et al. 2015) quantifies how much disease heritability is attributable to ENCODE-defined annotation categories. It is one of the most rigorous statistical checks that the ENCODE elements chosen for a disease model are relevant to that disease.
Why it matters: If heritability is enriched in tissue-specific enhancers (e.g., 15x in pancreatic islet enhancers for T2D), this confirms correct tissue mapping and that regulatory variation drives disease risk.
Requirements: Full GWAS summary statistics, LD score files, ENCODE BED annotations, ldsc software.
Interpretation: Enrichment > 1 = more heritability than expected by annotation size. Compare across tissues to confirm disease-relevant tissue. Significant enrichment (p < 0.05/n_annotations) validates the ENCODE-based disease model.
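The interpretation rules can be sketched in a few lines, assuming S-LDSC has already been run and its per-annotation results have been parsed. Enrichment is computed as the proportion of heritability divided by the proportion of SNPs in the annotation, and significance uses the Bonferroni threshold given above. The class name, field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnnotationResult:
    name: str
    prop_snps: float   # fraction of SNPs inside the annotation
    prop_h2: float     # fraction of heritability assigned to it
    p_value: float     # enrichment p-value reported by S-LDSC


def summarize(results, alpha=0.05):
    """Return (name, enrichment, significant) tuples, highest enrichment first."""
    if not results:
        return []
    threshold = alpha / len(results)  # Bonferroni: 0.05 / n_annotations
    rows = [(r.name, r.prop_h2 / r.prop_snps, r.p_value < threshold)
            for r in results]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up values for two tissue annotations.
demo = [
    AnnotationResult("tissue_A_enhancers", prop_snps=0.01, prop_h2=0.15, p_value=1e-5),
    AnnotationResult("tissue_B_enhancers", prop_snps=0.02, prop_h2=0.03, p_value=0.2),
]
for name, enrichment, significant in summarize(demo):
    print(f"{name}: enrichment={enrichment:.1f}, significant={significant}")
```

Sorting by enrichment makes the cross-tissue comparison direct: the disease-relevant tissue should appear at the top and pass the corrected threshold.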
| Cell Line | Cancer Type | Key Features | Caveats |
|---|---|---|---|
| K562 | CML | Tier 1 — most ENCODE data; BCR-ABL | Transformed hematopoietic |
| HepG2 | Hepatocellular carcinoma | Tier 2 — extensive liver regulatory data | Well-differentiated |
| MCF-7 | ER+ breast cancer | Hormone-responsive model | ER+ only |
| A549 | Lung adenocarcinoma | Epithelial lung cancer model | KRAS-mutant |
| HCT116 | Colorectal carcinoma | MSI-H model | Not MSS CRC |
| PC-3 | Prostate adenocarcinoma | Androgen-independent | Late-stage only |
Cell line vs. tumor caveats: Cell lines are clonal, immortalized, and lack tumor heterogeneity. Regulatory landscapes diverge from primary tumors. High-passage lines accumulate epigenetic drift. Use for hypothesis generation; validate against TCGA/primary tumor data. Comparing cancer cell line vs. normal tissue counterpart (e.g., HepG2 vs. primary liver) can reveal cancer-specific regulatory changes.
Annotating variants with wrong-tissue data is the most common error. Verify tissue selection against published heritability enrichment (Finucane 2015; Roadmap 2015).
Tier 1 cell lines have deepest data but are transformed. Prioritize: primary tissue > primary cells > cell lines. State cell line limitations explicitly.
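One way to apply this priority order is to sort candidate experiments by biosample type before choosing which to use. The sketch assumes each experiment is a dict with a `biosample_type` field holding "tissue", "primary cell", or "cell line". The dict layout and the placeholder identifiers are assumptions.

```python
# Lower rank = preferred. Unknown types sort after all known ones.
BIOSAMPLE_RANK = {"tissue": 0, "primary cell": 1, "cell line": 2}


def prioritize(experiments):
    """Order experiments: primary tissue > primary cells > cell lines."""
    return sorted(
        experiments,
        key=lambda e: BIOSAMPLE_RANK.get(e["biosample_type"], len(BIOSAMPLE_RANK)),
    )


# Placeholder identifiers, not real accessions.
candidates = [
    {"id": "EXP_1", "biosample_type": "cell line"},
    {"id": "EXP_2", "biosample_type": "tissue"},
    {"id": "EXP_3", "biosample_type": "primary cell"},
]
print([e["id"] for e in prioritize(candidates)])
```

Because Python's sort is stable, experiments of the same biosample type keep their original relative order.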
A lead SNP is NOT the causal variant. Always expand to LD proxies (r2 > 0.8) or use fine-mapped credible sets. Annotating only lead SNPs misses the causal variant in most cases.
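The LD-expansion step can be sketched as follows, assuming pairwise r2 values have already been obtained from an LD reference panel. The function name, the tuple layout, and the variant identifiers are hypothetical.

```python
def expand_to_proxies(lead_snps, ld_pairs, r2_min=0.8):
    """Expand each lead SNP to itself plus proxies with r2 > r2_min.

    ld_pairs: iterable of (lead_id, proxy_id, r2) tuples.
    """
    expanded = {lead: {lead} for lead in lead_snps}
    for lead, proxy, r2 in ld_pairs:
        if lead in expanded and r2 > r2_min:
            expanded[lead].add(proxy)
    return expanded


# Placeholder identifiers and r2 values.
ld_table = [
    ("lead_1", "proxy_a", 0.95),
    ("lead_1", "proxy_b", 0.50),   # below threshold, dropped
    ("lead_2", "proxy_c", 0.99),   # lead_2 not requested, ignored
]
print(expand_to_proxies(["lead_1"], ld_table))
```

Every variant in the expanded set, not only the lead, is then carried forward to regulatory annotation.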
Regulatory element overlap is necessary but not sufficient. The variant must DISRUPT the element. CRISPRi/MPRA validation is required for proof. Chromatin overlap is context, not mechanism.
ENCODE is biased toward well-studied tissues. Rare disease-relevant tissues may lack coverage. Absence of data does NOT mean absence of regulatory elements.
Multiple LD variants may overlap regulatory elements by chance. Enrichment testing (S-LDSC, GARFIELD) and fine-mapping distinguish real signal from LD-driven overlap.
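A first sanity check against LD-driven inflation is to count overlaps per locus rather than per variant, so that a single locus with many linked variants contributes once. This is a simplified illustration and not a substitute for S-LDSC, GARFIELD, or fine-mapping. The function name and input layout are assumptions.

```python
def locus_level_overlap(variants):
    """Count loci with at least one overlapping variant.

    variants: iterable of (locus_id, overlaps_element) pairs.
    Returns (n_loci_with_overlap, n_loci_total).
    """
    loci = {}
    for locus_id, overlaps in variants:
        loci[locus_id] = loci.get(locus_id, False) or bool(overlaps)
    return sum(loci.values()), len(loci)


# Hypothetical data: three linked variants in one locus all overlap,
# the single variant in a second locus does not.
calls = [("locus_1", True), ("locus_1", True), ("locus_1", True), ("locus_2", False)]
print(locus_level_overlap(calls))
```

Counted per variant, this example would suggest 3 of 4 overlaps; counted per locus, it is 1 of 2, which is the fairer starting point for any enrichment comparison.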
| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| GWAS hits in ENCODE regulatory elements | Drug target identification | cross-reference → Open Targets |
| Disease-tissue regulatory map | Enhancer annotation | regulatory-elements skill |
| Candidate therapeutic targets | Clinical trial search | cross-reference → ClinicalTrials.gov |
| Variant-regulatory element intersections | Variant annotation | variant-annotation + clinvar-annotation |
| Tissue-specific regulatory network | Expression correlation | gtex-expression skill |
Goal: Use ENCODE regulatory data combined with disease databases to identify candidate regulatory mechanisms underlying Alzheimer's disease risk variants.
Context: Alzheimer's GWAS has identified 75+ risk loci, most in non-coding regions. ENCODE maps the regulatory landscape to interpret these variants.
encode_get_facets(facet_field="assay_title", organ="brain", organism="Homo sapiens")
Expected output:
{
"facets": {
"assay_title": {"Histone ChIP-seq": 120, "TF ChIP-seq": 45, "ATAC-seq": 32, "RNA-seq": 28, "DNase-seq": 22, "Hi-C": 8}
}
}
Interpretation: Extensive brain epigenomic data available. Histone ChIP-seq (120 experiments) and ATAC-seq (32) provide the strongest regulatory element maps.
encode_search_experiments(assay_title="Histone ChIP-seq", organ="brain", target="H3K27ac", organism="Homo sapiens")
Expected output:
{
"total": 24,
"results": [
{"accession": "ENCSR100BRN", "assay_title": "Histone ChIP-seq", "target": "H3K27ac", "biosample_summary": "brain"},
{"accession": "ENCSR101CTX", "assay_title": "Histone ChIP-seq", "target": "H3K27ac", "biosample_summary": "frontal cortex"}
]
}
Using → gwas-catalog skill:
# Query GWAS Catalog for Alzheimer's disease associations
GET https://www.ebi.ac.uk/gwas/rest/api/associations?efoTrait=EFO_0000249
Intersect GWAS variants with ENCODE brain H3K27ac peaks to identify risk variants in active brain enhancers.
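The intersection itself is an interval-overlap test. The sketch below assumes peaks come from a BED file (0-based, half-open intervals) and variant positions are 1-based, as in VCF or GWAS Catalog coordinates. The function name and all coordinates are hypothetical; for genome-scale data a dedicated tool such as bedtools is the usual choice.

```python
def variants_in_peaks(variants, peaks):
    """Return IDs of variants that fall inside at least one peak.

    variants: iterable of (chrom, pos_1based, variant_id)
    peaks:    iterable of (chrom, start_0based, end_exclusive), BED style
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, pos, variant_id in variants:
        pos0 = pos - 1  # convert 1-based position to 0-based
        if any(start <= pos0 < end for start, end in by_chrom.get(chrom, [])):
            hits.append(variant_id)
    return hits


# Made-up coordinates for illustration.
peaks = [("chr1", 100, 200), ("chr2", 500, 600)]
variants = [("chr1", 101, "var_a"), ("chr1", 201, "var_b"), ("chr2", 550, "var_c")]
print(variants_in_peaks(variants, peaks))
```

The coordinate conversion is the step most often gotten wrong: mixing 0-based BED intervals with 1-based variant positions shifts every boundary call by one base.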
Using → variant-annotation pipeline:
For each Alzheimer's risk variant in a brain enhancer:
encode_summarize_collection()
Interpretation: A comprehensive disease research report links ENCODE regulatory data → GWAS variants → annotated variants → candidate genes → therapeutic targets, creating an end-to-end discovery pipeline.
encode_get_facets(facet_field="assay_title", organ="brain", organism="Homo sapiens")
Expected output:
{
"facets": {
"assay_title": {"Histone ChIP-seq": 120, "ATAC-seq": 32, "RNA-seq": 28, "Hi-C": 8}
}
}
encode_search_experiments(assay_title="ATAC-seq", biosample_term_name="microglia", organism="Homo sapiens")
Expected output:
{
"total": 4,
"results": [
{"accession": "ENCSR200MIC", "assay_title": "ATAC-seq", "biosample_summary": "microglia", "status": "released"}
]
}
encode_get_citations(accession="ENCSR100BRN")
Expected output:
{
"citations": [
{"pmid": "31234567", "title": "Brain enhancer atlas reveals Alzheimer regulatory mechanisms", "year": 2023}
]
}
- variant-annotation: Detailed variant-level functional annotation with ENCODE data
- regulatory-elements: Discovering and characterizing regulatory elements independent of disease context
- quality-assessment: Evaluating quality of ENCODE experiments used in disease research
- cross-reference: Connecting ENCODE data to PubMed, bioRxiv, ClinicalTrials.gov
- data-provenance: Tracking the full chain from ENCODE source data to derived disease models
- epigenome-profiling: Build tissue epigenomic profiles as the foundation for disease regulatory models
- single-cell-encode: Cell type-resolved data for dissecting disease mechanisms in heterogeneous tissues
- histone-aggregation: Aggregate histone peaks across donors before disease variant annotation
- accessibility-aggregation: Aggregate ATAC-seq peaks across donors for improved sensitivity
- multi-omics-integration: Combine RNA, ATAC, histone, and TF data for comprehensive disease regulatory landscapes
- pipeline-guide: Guidance for S-LDSC, fine-mapping, and other disease genomics pipelines
- gnomad-variants: Population frequency filtering and gene constraint for disease gene prioritization
- ensembl-annotation: VEP annotation and phenotype associations from Ensembl
- ucsc-browser: UCSC Genome Browser tracks for disease locus visualization
- geo-connector: Find complementary disease expression datasets in GEO
- clinvar-annotation: Clinical significance of disease variants from ClinVar
- gwas-catalog: GWAS trait associations and risk alleles from NHGRI-EBI catalog
- gtex-expression: Tissue expression context for disease genes across 54 GTEx tissues
- publication-trust: Verify literature claims backing analytical decisions

npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit