From encode-toolkit
Searches, queries, and cross-references NCBI GEO datasets with ENCODE experiments. Use to find complementary data, download metadata/series matrices, link accessions, or track provenance.
Slash command: /encode-toolkit:geo-connector
- User wants to find complementary datasets in NCBI GEO to supplement ENCODE data
Query the Gene Expression Omnibus programmatically to find complementary datasets, cross-reference ENCODE experiments, and download metadata.
The question: "What additional expression or epigenomic datasets exist in GEO that complement my ENCODE analysis?"
GEO hosts >200,000 series across all organisms and assay types. Many ENCODE experiments are deposited in GEO as secondary archives (ENCODE Portal is primary). GEO also contains vast amounts of non-ENCODE data — disease cohorts, perturbation experiments, time courses — that complement ENCODE's reference epigenomes.
ENCODE experiments deposited in GEO carry the cross-reference in their dbxrefs field as GEO:GSExxxxx.
Series (GSE): an experiment/study
├── Sample (GSM): individual measurements
│   ├── references → Platform (GPL)
│   ├── has → Supplementary files (raw data)
│   └── has → Data table (normalized values)
│
└── curated into → DataSet (GDS) [not all GSE get curated]
    └── generates → Profiles (gene-level summaries)
ENCODE experiments may have GEO cross-references in their metadata. After tracking an experiment:
encode_track_experiment(accession="ENCSR...")
Check the experiment's dbxrefs field for GEO:GSExxxxx entries. If found, link it:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="geo_accession",
reference_id="GSE12345"
)
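The dbxrefs check described above can be sketched in Python. This is a minimal illustration, assuming dbxrefs arrives as a list of "PREFIX:value" strings; the helper name `geo_accessions_from_dbxrefs` and the sample record are hypothetical and not part of the toolkit.

```python
import re

# Matches entries such as "GEO:GSE12345"; any other dbxrefs prefix is ignored.
GEO_DBXREF = re.compile(r"^GEO:(GSE\d+)$")


def geo_accessions_from_dbxrefs(dbxrefs):
    """Return the GSE accessions found in a list of dbxrefs strings."""
    found = []
    for entry in dbxrefs or []:  # tolerate a missing or empty field
        match = GEO_DBXREF.match(entry.strip())
        if match:
            found.append(match.group(1))
    return found


# Hypothetical experiment record; only the "dbxrefs" key matters here.
experiment = {"accession": "ENCSR...", "dbxrefs": ["GEO:GSE12345", "OTHER:example"]}
print(geo_accessions_from_dbxrefs(experiment["dbxrefs"]))
```

Each accession returned this way is a candidate value for reference_id in encode_link_reference.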
Search GEO for ENCODE-deposited data:
# Via NCBI E-utilities
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=ENCODE[KEYWORD]+AND+gse[ETYP]&retmax=100&usehistory=y&tool=encode_mcp&email=YOUR_EMAIL"
Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
Required parameters: db=gds, term=QUERY, tool=encode_mcp, email=YOUR_EMAIL
Rate limit: 3 req/sec without API key, 10 req/sec with key. Get a key at https://www.ncbi.nlm.nih.gov/account/
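The required parameters and the rate limit above could be wrapped as follows. This is a sketch, not the toolkit's implementation: `esearch_url`, `min_interval`, and `Throttle` are hypothetical helper names, and `api_key` is the standard E-utilities parameter for passing a key.

```python
import time
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, email, tool="encode_mcp", retmax=100, api_key=None):
    """Build an esearch URL for db=gds carrying the required parameters."""
    params = {"db": "gds", "term": term, "retmax": retmax, "tool": tool, "email": email}
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH}?{urlencode(params)}"


def min_interval(api_key=None):
    """Seconds between requests: 3 req/sec without a key, 10 req/sec with one."""
    return 1 / 10 if api_key else 1 / 3


class Throttle:
    """Sleep just long enough to keep calls at least `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


print(esearch_url("ENCODE AND gse[ETYP]", "YOUR_EMAIL"))
```

Calling `Throttle(min_interval()).wait()` before each request keeps a simple loop under the unauthenticated limit.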
| Qualifier | Purpose | Example |
|---|---|---|
| [ETYP] | Entry type | gse[ETYP], gds[ETYP] |
| [ORGN] | Organism | "Homo sapiens"[ORGN] |
| [PDAT] | Publication date | 2024[PDAT] |
| [ACCN] | Accession | GPL96[ACCN] |
| [suppFile] | Supplementary file type | bed[suppFile], bw[suppFile] |
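The qualifiers in the table combine with AND. A small sketch of composing a term from them, assuming a hypothetical helper named `build_term`; only the qualifiers listed above are used.

```python
from urllib.parse import quote_plus


def build_term(keywords, organism=None, entry_type=None, pub_year=None, supp_file=None):
    """Join free-text keywords and field qualifiers into one GEO search term."""
    parts = list(keywords)
    if organism:
        parts.append(f'"{organism}"[ORGN]')  # multi-word organisms need quotes
    if entry_type:
        parts.append(f"{entry_type}[ETYP]")
    if pub_year:
        parts.append(f"{pub_year}[PDAT]")
    if supp_file:
        parts.append(f"{supp_file}[suppFile]")
    return " AND ".join(parts)


term = build_term(["pancreas", "ATAC-seq"], organism="Homo sapiens",
                  entry_type="gse", supp_file="bed")
print(term)
# pancreas AND ATAC-seq AND "Homo sapiens"[ORGN] AND gse[ETYP] AND bed[suppFile]

# URL-encode the term before placing it in the term= parameter.
print(quote_plus(term))
```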
# Human pancreas ATAC-seq datasets with BED files
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=pancreas+AND+ATAC-seq+AND+%22Homo+sapiens%22[ORGN]+AND+gse[ETYP]+AND+bed[suppFile]&retmax=50&tool=encode_mcp&email=YOUR_EMAIL"
# ChIP-seq datasets from a specific year
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=ChIP-seq+AND+H3K27ac+AND+gse[ETYP]+AND+2024[PDAT]&retmax=50&tool=encode_mcp&email=YOUR_EMAIL"
# Datasets associated with a PubMed ID
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gds&id=PMID&tool=encode_mcp&email=YOUR_EMAIL"
# Step 1: Search (returns UIDs, NOT accessions)
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=GSE12345[ACCN]&tool=encode_mcp&email=YOUR_EMAIL"
# Step 2: Get summary (use UID from step 1)
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gds&id=UID&version=2.0&tool=encode_mcp&email=YOUR_EMAIL"
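The two-step lookup above can be sketched in Python without any network call: extract the UIDs from an esearch XML response, then build the esummary URL. The function names are hypothetical, and the sample XML is a trimmed, made-up response that keeps only the elements the parser reads.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def uids_from_esearch_xml(xml_text):
    """Return the UIDs listed under <IdList> in an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


def esummary_url(uids, email, tool="encode_mcp"):
    """Build the step 2 URL; several UIDs are sent as one comma-separated id."""
    params = {"db": "gds", "id": ",".join(uids), "version": "2.0",
              "tool": tool, "email": email}
    return f"{ESUMMARY}?{urlencode(params)}"


# Made-up, trimmed response for illustration only.
sample = "<eSearchResult><Count>1</Count><IdList><Id>200012345</Id></IdList></eSearchResult>"
uids = uids_from_esearch_xml(sample)
print(uids)
print(esummary_url(uids, "YOUR_EMAIL"))
```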
# Get full SOFT-format record
curl "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12345&targ=self&view=full&form=text"
# Get XML (MINiML) format
curl "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12345&targ=self&view=full&form=xml"
# Get all sample metadata for a series
curl "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12345&targ=gsm&view=brief&form=text"
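SOFT text returned by the form=text requests above is line-oriented: entity lines start with `^` and attribute lines with `!`. A minimal parsing sketch, assuming attributes follow the `!Key = value` layout; `parse_soft_attributes` is a hypothetical helper and the sample values are invented.

```python
def parse_soft_attributes(soft_text):
    """Collect '!Key = value' lines into a dict of lists (keys can repeat)."""
    attrs = {}
    for line in soft_text.splitlines():
        # Skip entity lines (^...), table rows, and anything without ' = '.
        if not line.startswith("!") or " = " not in line:
            continue
        key, value = line[1:].split(" = ", 1)
        attrs.setdefault(key.strip(), []).append(value.strip())
    return attrs


# Invented sample in SOFT layout, for illustration only.
sample = """^SERIES = GSE12345
!Series_title = Example title
!Series_sample_id = GSM1
!Series_sample_id = GSM2
"""
attrs = parse_soft_attributes(sample)
print(attrs["Series_sample_id"])
```

Values are kept as lists because attributes such as the sample IDs repeat once per sample.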
GEO uses a "nnn" directory pattern: replace last 3 digits with "nnn".
| Accession | FTP Path |
|---|---|
| GSE12345 | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12345/ |
| GSM575 | ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM575/ |
| Content | Path Under Series Directory |
|---|---|
| Series matrix (expression table) | matrix/GSE12345_series_matrix.txt.gz |
| SOFT metadata | soft/GSE12345_family.soft.gz |
| MINiML (XML) | miniml/GSE12345_family.xml.tgz |
| All supplementary files | suppl/GSE12345_RAW.tar |
| Individual supplementary | suppl/FILENAME.gz |
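The "nnn" rule and the paths in the tables above can be expressed as a short helper. This sketch covers only the two accession types shown (GSE and GSM) and uses the https form of the FTP host; the function names are hypothetical.

```python
import re

GEO_FTP = "https://ftp.ncbi.nlm.nih.gov/geo"
FOLDERS = {"GSE": "series", "GSM": "samples"}


def geo_dir(accession):
    """Return the directory URL for a GSE or GSM accession."""
    match = re.fullmatch(r"(GSE|GSM)(\d+)", accession)
    if not match:
        raise ValueError(f"unsupported accession: {accession!r}")
    prefix, digits = match.groups()
    # Replace the last three digits with "nnn"; short numbers leave only "nnn".
    stub = prefix + digits[:-3] + "nnn"
    return f"{GEO_FTP}/{FOLDERS[prefix]}/{stub}/{accession}/"


def series_matrix_url(accession):
    """Return the series matrix URL under the series directory."""
    return f"{geo_dir(accession)}matrix/{accession}_series_matrix.txt.gz"


print(geo_dir("GSE12345"))
print(geo_dir("GSM575"))
print(series_matrix_url("GSE12345"))
```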
# Download series matrix (fastest for expression data)
wget "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12345/matrix/GSE12345_series_matrix.txt.gz"
# Download all supplementary files
wget "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12345/suppl/GSE12345_RAW.tar"
| Use Case | Format | Speed |
|---|---|---|
| Expression matrix analysis | Series matrix | Fastest (10-100x vs SOFT) |
| Full metadata extraction | SOFT | Complete but slow |
| XML processing | MINiML | Good for programmatic parsing |
| Peak/BED files | Supplementary | Direct download |
| Raw sequencing reads | SRA (not GEO) | Use SRA Toolkit |
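A series matrix file is tab-delimited text: metadata lines start with `!`, and the expression table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`. The sketch below splits the two parts using only the standard library; `parse_series_matrix` is a hypothetical helper and the sample content is invented.

```python
import csv


def split_tsv(line):
    """Split one tab-delimited line, honouring double-quoted fields."""
    return next(csv.reader([line], delimiter="\t"))


def parse_series_matrix(text):
    """Return (metadata dict of lists, table rows including the header row)."""
    metadata, rows, in_table = {}, [], False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            rows.append(split_tsv(line))
        elif line.startswith("!"):
            fields = split_tsv(line[1:])
            metadata.setdefault(fields[0], []).extend(fields[1:])
    return metadata, rows


# Invented sample in series matrix layout, for illustration only.
sample = (
    '!Series_title\t"Example series"\n'
    '!Sample_geo_accession\t"GSM1"\t"GSM2"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"gene_a"\t1.5\t2.5\n'
    '!series_matrix_table_end\n'
)
metadata, rows = parse_series_matrix(sample)
print(metadata["Sample_geo_accession"], rows[0])
```

For a downloaded file, the text can be read with `gzip.open(path, "rt")` before parsing. Table values stay as strings here; convert them to numbers as the analysis requires.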
1. Find ENCODE experiments of interest:
encode_search_experiments(assay_title="total RNA-seq", organ="pancreas")
2. For each experiment, check for GEO accession:
encode_get_experiment(accession="ENCSR...")
→ Look in dbxrefs for "GEO:GSExxxxx"
3. If GEO accession found, link it:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="geo_accession",
reference_id="GSE12345"
)
4. Search GEO for complementary non-ENCODE datasets:
E-utils search for same tissue + different assay or condition
5. Download GEO metadata for comparison:
acc.cgi or E-utils esummary
6. Log the cross-reference:
encode_log_derived_file(
file_path="/path/to/comparison.tsv",
source_accessions=["ENCSR...", "GSE12345"],
description="ENCODE-GEO cross-tissue comparison"
)
For sequencing data, raw reads are in SRA, not GEO:
# Link GEO to SRA
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gds&db=sra&id=GDS_UID&tool=encode_mcp&email=YOUR_EMAIL"
Alternative using the pysradb command-line tool (a Python package):
# Convert GSE to SRP
pysradb gse-to-srp GSE12345
# Get the SRR run accessions for one sample
pysradb gsm-to-srr GSM12345
- The dbxrefs field may be empty. ENCODE Portal is always the canonical source.
- Use usehistory=y and paginate with retstart for large result sets.
- A series can have more than one matrix file (*_series_matrix-1.txt.gz, *_series_matrix-2.txt.gz).
- Include tool= and email= parameters on every request.
Goal: Identify Gene Expression Omnibus (GEO) datasets that complement ENCODE epigenomic experiments, enabling integrative analysis of gene expression with regulatory elements.
Context: ENCODE provides epigenomic maps (ChIP-seq, ATAC-seq), while GEO hosts vast RNA-seq expression datasets. Combining them links regulatory elements to transcriptional output.
encode_get_experiment(accession="ENCSR123HEP")
Expected output:
{
"accession": "ENCSR123HEP",
"assay_title": "Histone ChIP-seq",
"target": "H3K27ac",
"biosample_summary": "HepG2",
"organism": "Homo sapiens",
"status": "released"
}
Interpretation: This is H3K27ac ChIP-seq in HepG2 (liver cancer cell line). We need matching RNA-seq data from the same cell line.
Using NCBI E-utilities (via skill guidance):
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=HepG2+AND+RNA-seq+AND+%22Homo+sapiens%22[ORGN]+AND+gse[ETYP]&retmax=10&retmode=json&tool=encode_mcp&email=YOUR_EMAIL
Expected response:
{
"esearchresult": {
"count": "142",
"idlist": ["200156789", "200145678", "200134567"]
}
}
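The idlist holds UIDs, not accessions. For series records the UID usually looks like the GSE number with a leading 2 and zero padding to nine digits (for example, 200156789 for GSE156789). That is an observed numbering convention and an assumption of this sketch, not something the workflow depends on; esummary remains the dependable way to resolve a UID. The helper name is hypothetical.

```python
def gse_from_gds_uid(uid):
    """Return 'GSE<number>' for a series-style UID, or None if it does not fit.

    Assumes the convention UID = 200000000 + GSE number (see note above).
    """
    text = str(uid)
    if len(text) == 9 and text.startswith("2") and text.isdigit():
        return f"GSE{int(text[1:])}"  # int() drops the zero padding
    return None


for uid in ["200156789", "200145678", "200134567"]:
    print(uid, gse_from_gds_uid(uid))
```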
encode_link_reference(accession="ENCSR123HEP", reference_type="geo_accession", reference_id="GSE156789", notes="HepG2 RNA-seq for enhancer-expression integration")
Expected output:
{
"status": "linked",
"accession": "ENCSR123HEP",
"reference_type": "geo_accession",
"reference_id": "GSE156789"
}
encode_search_experiments(assay_title="total RNA-seq", biosample_term_name="HepG2", organism="Homo sapiens")
Expected output:
{
"total": 15,
"results": [
{"accession": "ENCSR456RNA", "assay_title": "total RNA-seq", "biosample_summary": "HepG2", "status": "released"}
]
}
Interpretation: ENCODE already has 15 HepG2 RNA-seq experiments! Always check ENCODE first before going to GEO. GEO becomes essential when ENCODE lacks expression data for your specific biosample or experimental condition (e.g., drug treatment, knockdown).
encode_link_reference(
accession="ENCSR000AKA",
reference_type="geo_accession",
reference_id="GSE76079",
notes="Complementary RNA-seq from same lab"
)
Expected output:
{
"status": "linked",
"accession": "ENCSR000AKA",
"reference_type": "geo_accession",
"reference_id": "GSE76079"
}
encode_get_references(accession="ENCSR000AKA")
Expected output:
{
"accession": "ENCSR000AKA",
"references": [
{"type": "geo_accession", "id": "GSE76079", "notes": "Complementary RNA-seq"},
{"type": "pmid", "id": "27429435", "notes": "Primary publication"}
]
}
| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| GEO accession (GSE/GSM) | Cross-reference link | encode_link_reference(reference_type="geo_accession") |
| Supplementary expression data | Differential expression | integrative-analysis skill |
| Complementary replicates from GEO | Expanded sample size | download-encode + batch-analysis |
| SRA run accessions | Raw data download | bioinformatics-installer (sra-tools) |
| GEO metadata | Publication cross-reference | cite-encode skill |
| Skill | When to Use Instead/Additionally |
|---|---|
| cross-reference | General external reference linking (PubMed, DOI, GEO, NCT) |
| data-provenance | Logging derived files from ENCODE+GEO combined analyses |
| search-encode | Finding ENCODE experiments (primary source) |
| download-encode | Downloading ENCODE files (preferred over GEO for ENCODE data) |
| track-experiments | Local experiment tracking with GEO cross-references |
| cite-encode | Getting citations for ENCODE experiments found via GEO |
| ucsc-browser | Querying aggregated ENCODE tracks at UCSC |
| publication-trust | Verifying literature claims backing analytical decisions |
npx claudepluginhub ammawla/encode-toolkit