From encode-toolkit
Cross-references ENCODE data with PubMed, bioRxiv, ClinicalTrials.gov, Open Targets, GTEx, ClinVar, GWAS Catalog, gnomAD, Ensembl for publications, trials, drug targets, and translational pipelines.
How this skill is triggered — by the user, by Claude, or both
Slash command
/encode-toolkit:cross-referenceThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User wants to connect ENCODE data to publications, clinical trials, or other databases
Help the user connect ENCODE genomics data to the broader scientific literature, clinical research, drug target discovery, and variant interpretation. This skill is the central hub for all multi-database workflows in the ENCODE Toolkit.
encode_get_citations or encode_get_references to get PMIDssearch_articles, get_article_metadata, find_related_articles)encode_link_referenceencode_get_references to find linked NCT IDsencode_link_referencesearch_entitiesdbxrefs for GEO accessions (format: GEO:GSExxxxx)encode_link_referencegeo-connector skill for detailed GEO API usageRaw sequencing reads for ENCODE experiments are deposited in NCBI SRA. GEO records link to SRA via E-utilities elink. For reprocessing ENCODE data or accessing raw reads not on the ENCODE Portal, query SRA via:
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gds&db=sra&id=GDS_UID&tool=encode_mcp&email=YOUR_EMAIL"
Use the Ensembl REST API to cross-reference ENCODE targets and regulatory elements with Ensembl annotations. See ensembl-annotation skill for VEP, Regulatory Build, and gene lookup endpoints.
| Identifier | Format | Example | Database | MCP Tool / Skill |
|---|---|---|---|---|
| PMID | numeric string | "35486828" | PubMed | get_article_metadata |
| DOI | 10.xxxx/... | "10.1038/s41586-020-2493-4" | CrossRef | convert_article_ids |
| NCT ID | NCT + 8 digits | "NCT04567890" | ClinicalTrials.gov | get_trial_details |
| GEO Series | GSE + digits | "GSE123456" | GEO/NCBI | geo-connector skill |
| GEO Sample | GSM + digits | "GSM1234567" | GEO/NCBI | geo-connector skill |
| bioRxiv DOI | 10.1101/YYYY.MM.DD.xxx | "10.1101/2024.06.15.598765" | bioRxiv | get_preprint |
| ENCODE Experiment | ENCSR + 6 alphanum | "ENCSR123ABC" | ENCODE | encode_get_experiment |
| ENCODE File | ENCFF + 6 alphanum | "ENCFF001AAA" | ENCODE | encode_get_file_info |
| Ensembl Gene | ENSG + 11 digits | "ENSG00000102974" | Ensembl | ensembl-annotation skill |
| Ensembl Transcript | ENST + 11 digits | "ENST00000264010" | Ensembl | ensembl-annotation skill |
| SRA Study | SRP + digits | "SRP123456" | SRA/NCBI | E-utilities |
| SRA Run | SRR + digits | "SRR1234567" | SRA/NCBI | E-utilities |
| ROR ID | 9 chars | "021nxhr62" | ROR | search_by_funder |
| ChEMBL ID | CHEMBL + digits | "CHEMBL25" | Open Targets | search_entities |
| rsID | rs + digits | "rs7903146" | dbSNP/ClinVar | clinvar-annotation skill |
| ClinVar Accession | RCV + digits | "RCV000012345" | ClinVar | clinvar-annotation skill |
| JASPAR Matrix | MA + digits + version | "MA0139.1" | JASPAR | jaspar-motifs skill |
Step 1: Track the experiment to extract publications
encode_track_experiment(accession="ENCSR133RZO", fetch_publications=True)
-> Extracts PMIDs from experiment metadata
Step 2: Get the PMIDs
encode_get_references(experiment_accession="ENCSR133RZO", reference_type="pmid")
-> Returns: [{"reference_id": "32728249", "reference_type": "pmid"}]
Step 3: Fetch full article metadata from PubMed
get_article_metadata(pmids=["32728249"])
-> Returns title, authors, journal, abstract, DOI
Step 4: Find related papers
find_related_articles(pmids=["32728249"], link_type="pubmed_pubmed", max_results=10)
-> Returns similar articles for further reading
Step 1: Get experiment details
encode_get_experiment(accession="ENCSR000AKS")
-> Note the target (e.g., H3K27me3), biosample, and any linked DOIs
Step 2: Search bioRxiv for related preprints
search_preprints(category="genomics", recent_days=90, limit=20)
-> Review abstracts for mentions of the target or ENCODE accession
Step 3: Link discovered preprint
encode_link_reference(
experiment_accession="ENCSR000AKS",
reference_type="preprint_doi",
reference_id="10.1101/2024.06.15.598765",
description="Preprint analyzing H3K27me3 patterns in same tissue"
)
Step 1: Check experiment metadata for GEO cross-references
encode_get_experiment(accession="ENCSR133RZO")
-> Look in dbxrefs for "GEO:GSExxxxxx"
Step 2: Link the GEO accession
encode_link_reference(
experiment_accession="ENCSR133RZO",
reference_type="geo_accession",
reference_id="GSE125066",
description="GEO submission with same raw data and supplementary files"
)
Step 1: Get the target from an ENCODE experiment
encode_get_experiment(accession="ENCSR...")
-> Target: "TP53" (a TF ChIP-seq experiment)
Step 2: Search ClinicalTrials.gov for trials involving the target
search_trials(condition="cancer", intervention="TP53", status=["RECRUITING"])
-> Returns active trials targeting TP53
Step 3: Link a relevant trial
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="nct_id",
reference_id="NCT04567890",
description="Phase 2 trial targeting TP53 in solid tumors"
)
Goal: Use ENCODE TF ChIP-seq data to identify drug targets via Open Targets.
Biological motivation: Transcription factors mapped by ENCODE ChIP-seq may be druggable targets themselves or may regulate druggable pathways. Open Targets aggregates genetic association, somatic mutation, expression, and drug evidence to score target-disease relationships. By chaining ENCODE binding data with Open Targets, you can move from "where does this TF bind?" to "what diseases is it linked to, and are there drugs in development?"
Step 1: Find ENCODE experiments for a transcription factor
encode_search_experiments(
assay_title="TF ChIP-seq",
target="CTCF",
organism="Homo sapiens"
)
-> Returns experiments where CTCF binding is mapped genome-wide
-> Note the accessions, biosamples, and labs
Step 2: Look up CTCF in Open Targets
search_entities(query_strings=["CTCF"])
-> Returns: {"id": "ENSG00000102974", "entity": "target"}
-> This Ensembl Gene ID is required for all Open Targets GraphQL queries
Step 3: Query Open Targets for disease associations
query_open_targets_graphql(
query_string='query { target(ensemblId: "ENSG00000102974") {
approvedName
approvedSymbol
associatedDiseases(page: {index: 0, size: 10}) {
rows {
disease { id name }
score
datasourceScores { componentId score }
}
}
} }'
)
-> Returns diseases ranked by association score
-> CTCF disruption is linked to various cancers and developmental disorders
-> The datasourceScores breakdown shows which evidence types drive the association
Step 4: Query for drugs targeting CTCF-regulated pathways
query_open_targets_graphql(
query_string='query { target(ensemblId: "ENSG00000102974") {
knownDrugs(size: 10) {
rows { drug { name mechanismOfAction } disease { name } phase status }
}
tractability { label modality value }
} }'
)
-> Tractability scores indicate whether the target is druggable
-> knownDrugs lists any compounds in clinical development
-> Note: many TFs have low tractability -- pathway-level interventions may be more relevant
Step 5: Find and link relevant clinical trials
search_trials(condition="cancer", intervention="CTCF", status=["RECRUITING"])
-> If no direct CTCF trials, broaden to pathway:
search_trials(condition="cancer", intervention="chromatin remodeling", status=["RECRUITING"])
-> Link the most relevant trial:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="nct_id",
reference_id="NCT...",
description="Trial targeting chromatin regulation pathway relevant to CTCF"
)
Interpretation notes: Most transcription factors are not directly druggable. When Open Targets returns low tractability, shift strategy: (a) look for druggable genes regulated by the TF using the ENCODE binding data, (b) explore pathway-level interventions, or (c) consider the TF as a biomarker rather than a direct drug target.
Goal: Check if genes near ENCODE enhancers are expressed in the relevant tissue, validating that regulatory elements are functionally active.
Biological motivation: An H3K27ac peak marks a putative active enhancer, but not every enhancer is active in every cell type within a tissue. By cross-referencing ENCODE histone ChIP-seq with GTEx bulk RNA-seq, you can assess whether the nearest gene is transcribed in that tissue, strengthening the case that the regulatory element is functional.
Step 1: Find ENCODE H3K27ac experiments in liver
encode_search_experiments(
assay_title="Histone ChIP-seq",
target="H3K27ac",
organ="liver",
organism="Homo sapiens"
)
-> Returns liver H3K27ac experiments
-> Prefer experiments with status="released" and audit_category free of ERROR
Step 2: Download the peak file
encode_list_files(
accession="ENCSR...",
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
-> Select the preferred_default file if available
-> Download with encode_download_files for local analysis
Step 3: Annotate peaks with nearest genes
-> Use the peak-annotation skill or run:
bedtools closest -a peaks.bed -b gencode.v43.annotation.bed -d
-> For each peak, record the nearest gene name, Ensembl ID, and distance
-> Peaks overlapping a gene promoter (within 2 kb of TSS) are most interpretable
Step 4: Check expression of nearby genes in GTEx liver
-> Use the gtex-expression skill:
-> GTEx REST API: GET /expression/medianTranscriptExpression
?gencodeId=ENSG00000...&tissueSiteDetailId=Liver
-> For each annotated gene, retrieve median TPM in liver
-> Classification:
TPM > 10 = highly expressed (strong evidence the enhancer is active)
TPM 1-10 = moderately expressed
TPM < 1 = low/absent expression in bulk tissue
TPM = 0 = gene is not expressed in this tissue context
Step 5: Validate and interpret
-> High H3K27ac + high GTEx expression = strong evidence for active enhancer
driving the nearest gene in liver
-> High H3K27ac + low GTEx expression = possible enhancer for a rare cell type
within the liver (e.g., cholangiocytes, stellate cells, immune infiltrates)
Consider single-cell data from CellxGene to resolve cell-type specificity
-> Low H3K27ac + high GTEx expression = gene is transcribed but this specific
enhancer may not be the primary driver; check for other active enhancers nearby
GTEx tissue ID mapping: GTEx uses specific tissueSiteDetailId values that differ from ENCODE organ names. Common mappings:
| ENCODE organ | GTEx tissueSiteDetailId |
|---|---|
| liver | Liver |
| heart | Heart_Left_Ventricle, Heart_Atrial_Appendage |
| brain | Brain_Cortex, Brain_Cerebellum (13 subregions) |
| kidney | Kidney_Cortex, Kidney_Medulla |
| lung | Lung |
| pancreas | Pancreas |
| intestine | Small_Intestine_Terminal_Ileum, Colon_Transverse, Colon_Sigmoid |
| stomach | Stomach |
| skin of body | Skin_Sun_Exposed_Lower_leg, Skin_Not_Sun_Exposed_Suprapubic |
Always check the GTEx tissue list for exact IDs. Note the case sensitivity and underscores.
Goal: Determine if a disease-associated variant falls in an ENCODE regulatory element.
Biological motivation: Over 90% of GWAS hits fall in non-coding regions. Many of these variants likely disrupt regulatory elements mapped by ENCODE. By intersecting GWAS/ClinVar variants with ENCODE peak files and cCREs, you can prioritize variants that may have functional consequences through altered transcription factor binding, enhancer activity, or chromatin accessibility.
Step 1: Get variants from the GWAS Catalog
-> Use the gwas-catalog skill to find variants for a trait
-> Example: type 2 diabetes (EFO_0001360)
-> Retrieve lead SNPs with their coordinates (chr, pos, ref, alt)
-> Note the reported p-value, odds ratio, and mapped genes
-> Key variants for T2D: rs7903146 (TCF7L2), rs1801282 (PPARG), rs5219 (KCNJ11)
Step 2: Check ClinVar for clinical significance
-> Use the clinvar-annotation skill: query ClinVar API with rsID or coordinates
-> Classifications to note:
Pathogenic / Likely pathogenic = strong clinical evidence
Risk factor = population-level association (common for GWAS hits)
Uncertain significance (VUS) = insufficient evidence to classify
Benign / Likely benign = not disease-causing
-> ClinVar also provides the condition and review status (star rating)
Step 3: Find ENCODE data overlapping the variant position
encode_search_experiments(
assay_title="Histone ChIP-seq",
organ="pancreas",
organism="Homo sapiens"
)
-> Download relevant peak files:
encode_search_files(
file_format="bed",
output_type="IDR thresholded peaks",
assembly="GRCh38"
)
-> Check if the variant coordinates fall within any peak
-> Overlap with H3K27ac = active enhancer; H3K4me1 = poised/active enhancer;
H3K4me3 = promoter; H3K27me3 = repressed
Step 4: Check ENCODE cCREs via UCSC
-> Use the ucsc-browser skill to query ENCODE cCRE tracks at the variant position
-> UCSC REST API: GET /getData/track?genome=hg38&track=encodeCcreCombined
&chrom=chr10&start=112998580&end=112998600
-> Determine the cCRE classification:
PLS = promoter-like signature (high H3K4me3, high DNase)
pELS = proximal enhancer-like signature (high H3K27ac, within 2 kb of TSS)
dELS = distal enhancer-like signature (high H3K27ac, distal from TSS)
CTCF-only = CTCF binding without enhancer/promoter marks
DNase-H3K4me3 = DNase signal with H3K4me3 but not matching PLS criteria
Step 5: Link everything back to the tracked experiment
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="other",
reference_id="rs7903146",
description="T2D GWAS lead SNP in TCF7L2 locus, overlapping H3K27ac peak in pancreatic islets"
)
-> Also link the ClinVar accession:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="other",
reference_id="RCV000012345",
description="ClinVar: risk factor for type 2 diabetes"
)
Variant interpretation matrix:
| GWAS signal | ClinVar classification | ENCODE overlap | Interpretation |
|---|---|---|---|
| Significant | Pathogenic | Active enhancer | High-confidence regulatory variant |
| Significant | Risk factor | Active enhancer | Strong candidate for functional follow-up |
| Significant | VUS | CTCF-only | May affect chromatin architecture |
| Significant | Benign | No overlap | Likely in LD with true causal variant |
| Significant | Not in ClinVar | Poised enhancer | Novel candidate -- needs experimental validation |
Goal: Find the most-cited, highest-quality papers about a specific ENCODE assay type for literature reviews and grant writing.
Biological motivation: When writing methods sections, grants, or reviews, you need authoritative references for ENCODE data types. The Consensus academic search tool searches 200M+ papers and provides citation counts and journal quality metrics, helping identify the definitive references.
Step 1: Search Consensus for systematic reviews
search(
query="ATAC-seq chromatin accessibility systematic review",
year_min=2020,
sjr_max=2
)
-> Returns high-quality reviews of ATAC-seq methodology from Q1/Q2 journals
-> Look for papers with high citation counts as authoritative references
Step 2: Find papers that cite ENCODE ATAC-seq data specifically
search(
query="ENCODE project ATAC-seq chromatin accessibility analysis pipeline"
)
-> Identifies how the community uses ENCODE accessibility data
-> Note which analysis tools and pipelines are most commonly applied
Step 3: Search for benchmarking studies comparing methods
search(
query="ATAC-seq peak calling comparison benchmark ENCODE",
year_min=2021,
sjr_max=1
)
-> Finds methodological comparisons useful for justifying analysis choices
-> These papers often include ENCODE data as ground truth or benchmarks
Step 4: Link key papers to tracked experiments
-> Extract PMIDs from the Consensus results
-> For each relevant paper:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="pmid",
reference_id="...",
description="Methodological reference for ATAC-seq analysis"
)
Step 5: Export a complete bibliography
encode_get_citations(
experiment_accession="ENCSR...",
format="bibtex"
)
-> Generates a BibTeX file with all linked references
-> Import directly into reference managers (Zotero, Mendeley, EndNote)
Search strategy by use case:
| Goal | Consensus query pattern | Filters |
|---|---|---|
| Definitive method paper | "[assay] protocol methodology" | sjr_max=1, year_min=2018 |
| Quality metrics reference | "[assay] quality control metrics benchmarks" | sjr_max=2 |
| Comparison with other assays | "[assay1] vs [assay2] comparison" | human=True |
| Disease application | "ENCODE [assay] [disease] regulatory" | year_min=2020 |
| Tool/pipeline reference | "[tool name] [assay] analysis pipeline" | -- |
Goal: Complete translational pipeline from a GWAS variant to clinical application. This is the most advanced cross-referencing workflow, chaining eight databases together.
Biological motivation: Translational genomics aims to connect population genetic associations to therapeutic interventions. A single GWAS hit triggers a cascade of questions: Is the variant in a regulatory element? What gene does it regulate? Is the gene expressed in the relevant tissue? Is it druggable? Are there active clinical trials? This walkthrough answers all of them.
Step 1: Start with a GWAS hit
-> GWAS Catalog: rs7903146 (TCF7L2 locus)
-> This is the strongest known common variant association with type 2 diabetes
-> Coordinates: chr10:112998590 (GRCh38)
-> Risk allele: T, odds ratio ~1.4, p-value < 1e-200
Step 2: Check ENCODE regulatory landscape at the locus
encode_search_experiments(
assay_title="Histone ChIP-seq",
organ="pancreas",
organism="Homo sapiens"
)
-> Find H3K27ac and H3K4me1 experiments covering the TCF7L2 locus
-> Also search for ATAC-seq / DNase-seq for chromatin accessibility:
encode_search_experiments(
assay_title="ATAC-seq",
organ="pancreas",
organism="Homo sapiens"
)
-> Download peak files and check if rs7903146 falls in:
(a) an active enhancer (H3K27ac peak)
(b) an accessible chromatin region (ATAC-seq peak)
(c) a CTCF binding site (CTCF ChIP-seq peak)
-> The variant falls in an islet-specific enhancer marked by H3K27ac and H3K4me1
Step 3: Identify the target gene
-> Peak annotation: which gene does this enhancer regulate?
-> TCF7L2 is the nearest protein-coding gene -- a WNT pathway transcription factor
-> Check GTEx for expression:
GTEx API: tissueSiteDetailId=Pancreas, gencodeId=ENSG00000148737
-> TCF7L2 is expressed in pancreatic islets (median TPM > 5)
-> Also check expression in other T2D-relevant tissues: adipose, liver, skeletal muscle
-> TCF7L2 is broadly expressed but has islet-specific regulatory patterns
Step 4: Assess population frequency and clinical significance
-> gnomAD (via gnomad-variants skill):
rs7903146 T allele frequency: ~30% in European populations, ~25% global
This is a common variant, consistent with a GWAS hit
-> ClinVar (via clinvar-annotation skill):
rs7903146 is classified as "risk factor" for type 2 diabetes
Review status: multiple submitters with criteria provided
-> Also check Ensembl VEP for predicted functional consequences:
Ensembl REST: GET /vep/human/id/rs7903146
-> Regulatory region variant, falls in ENSR00000... (Ensembl Regulatory Build)
Step 5: Evaluate as a drug target via Open Targets
search_entities(query_strings=["TCF7L2"])
-> Returns: {"id": "ENSG00000148737", "entity": "target"}
query_open_targets_graphql(
query_string='query { target(ensemblId: "ENSG00000148737") {
approvedName approvedSymbol
tractability { label modality value }
associatedDiseases(page: {index: 0, size: 5}) {
rows { disease { name } score }
}
knownDrugs(size: 10) {
rows { drug { name } disease { name } phase status }
}
} }'
)
-> Check tractability: is TCF7L2 druggable with small molecules, antibodies, or other modalities?
-> Review disease associations: T2D should appear as a top association
-> If TCF7L2 is not directly druggable, consider WNT pathway targets instead:
search_entities(query_strings=["WNT3A", "CTNNB1", "GSK3B"])
-> Query druggability for upstream/downstream WNT pathway members
Step 6: Find active clinical trials
search_trials(
condition="type 2 diabetes",
intervention="WNT",
status=["RECRUITING"]
)
-> Also search broader:
search_trials(
condition="type 2 diabetes",
phase=["PHASE2", "PHASE3"],
status=["RECRUITING"]
)
-> Look for trials targeting pathways regulated by TCF7L2
-> Link the most relevant trial:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="nct_id",
reference_id="NCT...",
description="Phase 3 T2D trial targeting WNT/TCF7L2 regulatory pathway"
)
Step 7: Document the full chain with provenance
-> Track all ENCODE experiments used:
encode_track_experiment(accession="ENCSR...", fetch_publications=True)
-> Link all references:
encode_link_reference(reference_type="other", reference_id="rs7903146",
description="GWAS lead SNP, TCF7L2 locus, T2D")
encode_link_reference(reference_type="pmid", reference_id="...",
description="Original GWAS discovery paper")
encode_link_reference(reference_type="nct_id", reference_id="NCT...",
description="Relevant clinical trial")
-> Log any derived analysis files:
encode_log_derived_file(
experiment_accession="ENCSR...",
description="Intersection of rs7903146 with H3K27ac peaks in pancreatic islets",
file_path="/path/to/variant_enhancer_overlap.bed"
)
-> Export the complete reference list:
encode_get_citations(experiment_accession="ENCSR...", format="bibtex")
Databases used in this workflow:
| Database | Purpose | Access method |
|---|---|---|
| GWAS Catalog | Trait-variant associations | gwas-catalog skill |
| ENCODE | Regulatory element maps | ENCODE Toolkit tools |
| GTEx | Tissue gene expression | gtex-expression skill |
| ClinVar | Clinical variant classification | clinvar-annotation skill |
| gnomAD | Population allele frequencies | gnomad-variants skill |
| Ensembl | VEP, Regulatory Build | ensembl-annotation skill |
| Open Targets | Drug targets, tractability | query_open_targets_graphql |
| ClinicalTrials.gov | Active trials | search_trials |
| PubMed | Publication links | get_article_metadata |
PMID format mismatches: PMIDs are plain numeric strings (e.g., "35486828"), not prefixed with "PMID:" or "pmid:". When passing PMIDs from ENCODE references to PubMed MCP tools, strip any prefix. The encode_get_references tool returns clean PMIDs.
DOI format inconsistency: Normalize DOIs to the bare format "10.xxxx/..." without the "https://doi.org/" prefix. ENCODE sometimes stores DOIs with the URL prefix; strip it before passing to encode_link_reference. Both formats work for lookups but the bare form is canonical.
Preprint vs published DOI: bioRxiv preprint DOIs (format: "10.1101/YYYY.MM.DD.xxxxxx") differ from the final journal DOIs. A paper may have both. Track both using separate encode_link_reference calls with reference_type="preprint_doi" and reference_type="doi" respectively. Use search_published_preprints on bioRxiv MCP to find if a preprint has been published.
GEO accession confusion: GSE is a series (collection of samples), GSM is a single sample. ENCODE experiments typically map to GSE accessions, not GSM. When linking, use the GSE identifier. You can find GEO accessions in the experiment's dbxrefs field via encode_get_experiment.
NCT ID format: Must be exactly "NCT" followed by 8 digits (e.g., "NCT04567890"). ClinicalTrials.gov tools require this exact format. Do not include spaces or hyphens.
ENCODE dbxrefs parsing: GEO accessions in experiment dbxrefs are prefixed with "GEO:" (e.g., "GEO:GSE125066"). Strip the "GEO:" prefix before passing to GEO tools. Similarly, other dbxref prefixes include "IHEC:" for International Human Epigenome Consortium and "Roadmap:" for Roadmap Epigenomics accessions.
PubMed article vs. preprint: Some ENCODE experiments link to the preprint DOI in their publication field, not the published paper. Always check if a published version exists using search_published_preprints from the bioRxiv MCP or convert_article_ids from PubMed to map between DOI and PMID. The published version is the citable record.
Open Targets gene IDs: Open Targets uses Ensembl Gene IDs (ENSG format), not gene symbols. Always resolve gene symbols to Ensembl IDs first using search_entities(query_strings=["GENE_SYMBOL"]). Passing a gene symbol directly to a GraphQL query will fail silently or return no data.
GTEx tissue naming: GTEx tissue IDs use specific formatting (e.g., "Pancreas", "Heart_Left_Ventricle") that does not match ENCODE organ names (e.g., "pancreas", "heart"). Normalize case and check the GTEx tissue list. Some ENCODE organs map to multiple GTEx tissues (e.g., ENCODE "brain" covers 13 GTEx brain subregions).
Clinical trial linking scope: Not all ENCODE targets have clinical trials. Transcription factors are rarely direct drug targets and often lack clinical trials. In these cases, search for pathway-level interventions (e.g., "WNT pathway" instead of "TCF7L2") or consider the target as a biomarker. Kinases, receptors, and enzymes are far more likely to have active trials.
Assembly coordinate mismatches: When cross-referencing variant coordinates between databases, verify the genome assembly. ENCODE uses GRCh38 for human; GWAS Catalog uses GRCh38 and GRCh37; ClinVar provides both. Never mix assemblies. Use the liftover-coordinates skill to convert between GRCh37 and GRCh38 when needed.
Rate limiting across databases: When chaining multiple external API calls (Open Targets, GTEx, ClinVar, GWAS Catalog), be aware that each service has its own rate limits. If queries fail, add delays between requests. PubMed E-utilities allow 3 requests/second without an API key and 10 requests/second with one.
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| PubMed links | cite-encode | Publication citations for ENCODE experiments |
| GEO dataset links | geo-connector | Connect to complementary GEO data |
| ClinicalTrials.gov links | disease-research | Connect ENCODE data to clinical studies |
| Cross-database identifiers | data-provenance | Full audit trail of external references |
| Publication metadata | publication-trust | Verify cited publications |
| Linked references | scientific-writing | Include cross-references in methods sections |
| External database IDs | track-experiments | Annotate tracked experiments with external links |
When presenting cross-reference results to the user:
encode_get_citations)| Skill | When to Use Instead/Additionally |
|---|---|
track-experiments | Tracking and managing local experiment collection |
data-provenance | Logging derived files and analysis chains |
cite-encode | Exporting citations for tracked experiments |
disease-research | Connecting ENCODE data to disease biology |
publication-trust | Evaluating the provenance and trustworthiness of linked publications |
gtex-expression | Cross-reference with GTEx tissue expression data across 54 tissues |
clinvar-annotation | Cross-reference with ClinVar clinical variant annotations |
gwas-catalog | Cross-reference with NHGRI-EBI GWAS Catalog trait associations |
cellxgene-context | Cross-reference with CellxGene single-cell atlas data |
gnomad-variants | Population allele frequencies and gene constraint scores |
ensembl-annotation | VEP variant annotation, Regulatory Build, coordinate liftover |
ucsc-browser | ENCODE cCRE tracks, TF binding profiles, sequence retrieval |
jaspar-motifs | Transcription factor binding motif analysis at variant positions |
variant-annotation | Comprehensive variant effect prediction workflow |
functional-screen-analysis | CRISPR/MPRA/STARR-seq functional validation of regulatory elements |
geo-connector | Detailed GEO dataset discovery and metadata retrieval |
peak-annotation | Annotating peaks with genes, cCREs, and functional categories |
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkitCross-references ENCODE data with PubMed, bioRxiv, ClinicalTrials.gov, Open Targets, GTEx, ClinVar, GWAS Catalog, gnomAD, Ensembl for publications, trials, drug targets, and translational pipelines.
Searches the NHGRI-EBI GWAS Catalog for SNP-trait associations. Supports search by rs ID, disease/trait, and gene, retrieving p-values and summary statistics. Useful for genetic epidemiology and polygenic risk scores.
Queries the NHGRI-EBI GWAS Catalog REST API for SNP-trait associations by rs ID, disease/trait, or gene, returning p-values and summary statistics.