From encode-toolkit
Finds and analyzes ENCODE single-cell genomics data including scRNA-seq and scATAC-seq. Useful for cell type annotation, clustering, deconvolution of bulk signals, and multimodal integration.
Slash command: /encode-toolkit:single-cell-encode
- User wants to find or analyze single-cell data (scRNA-seq, scATAC-seq, snRNA-seq) from ENCODE
Help the user find and work with ENCODE single-cell genomics data, understand quality limitations relative to bulk assays, and integrate single-cell with bulk ENCODE profiles for cell-type-resolved regulatory analysis.
| # | Reference | Key Contribution |
|---|---|---|
| 1 | Mawla & Huising 2019, Endocrinology, DOI:10.1210/en.2018-01037 (~200 cit) | Cross-study scRNA-seq meta-analysis revealing that only ~1-2% of heterogeneity-driving genes replicate across studies; TIN-based quality assessment; detection-limit awareness framework. PMC6609986. |
| 2 | Regev et al. 2017, eLife, DOI:10.7554/eLife.27041 (~1,200 cit) | Human Cell Atlas white paper defining the vision for comprehensive single-cell reference maps of all human cells. Establishes community standards for cell atlas construction. |
| 3 | Stuart et al. 2019, Cell, DOI:10.1016/j.cell.2019.05.031 (~7,000 cit) | Seurat v3 — CCA-based anchor identification for cross-dataset integration. The most widely used scRNA-seq integration framework. |
| 4 | Luecken & Theis 2019, Mol Syst Biol, DOI:10.15252/msb.20188746 (~1,500 cit) | Current best practices for scRNA-seq analysis: QC, normalization, batch correction, feature selection, dimensionality reduction, clustering, and differential expression. |
| 5 | Buenrostro et al. 2015, Nature, DOI:10.1038/nature14590 (~1,800 cit) | Single-cell ATAC-seq method. Established that individual cells yield the same nucleosomal fragment size ladder as bulk ATAC-seq, enabling chromatin accessibility profiling at single-cell resolution. |
| 6 | Granja et al. 2021, Nat Genet, DOI:10.1038/s41588-021-00790-6 (~1,000 cit) | ArchR — scalable framework for scATAC-seq analysis including peak calling, gene activity scoring, trajectory inference, and integration with scRNA-seq. |
| 7 | Luecken et al. 2022, Nat Methods, DOI:10.1038/s41592-021-01336-8 (~800 cit) | Benchmarking atlas-level integration methods across tasks, metrics, and scalability. Establishes evaluation framework (kBET, LISI, ARI, NMI) for comparing integration quality. |
| 8 | Hao et al. 2021, Cell, DOI:10.1016/j.cell.2021.04.048 (~5,000 cit) | Seurat v4 — weighted nearest neighbors (WNN) for multimodal integration of RNA + ATAC (or CITE-seq). Defines the standard for joint profiling analysis. |
| 9 | ENCODE Project Consortium 2020, Nature, DOI:10.1038/s41586-020-2493-4 (~1,656 cit) | ENCODE Phase 3; registry of candidate cis-regulatory elements (cCREs) providing the bulk reference against which single-cell data can be compared. |
| Assay | What It Measures | Key Outputs | Typical Files in ENCODE |
|---|---|---|---|
| scRNA-seq | Single-cell gene expression | Cell-type-specific transcriptomes | FASTQ, gene quantifications (TSV), filtered count matrices, h5ad |
| scATAC-seq | Single-cell chromatin accessibility | Cell-type-specific regulatory elements | FASTQ, fragments (TSV), aggregate peaks (BED), cell-barcode assignments |
Search for scRNA-seq and scATAC-seq experiments in the tissue of interest:
# Single-cell RNA-seq
encode_search_experiments(
assay_title="scRNA-seq",
organ="pancreas", # user's tissue of interest
biosample_type="tissue",
limit=50
)
# Single-cell ATAC-seq
encode_search_experiments(
assay_title="snATAC-seq",
organ="pancreas",
biosample_type="tissue",
limit=50
)
If a search returns no results, try broader search terms:
encode_search_experiments(search_term="single cell RNA", organ="pancreas", limit=50)
encode_search_experiments(search_term="single cell ATAC", organ="pancreas", limit=50)
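The exact-then-broad search pattern above can be wrapped in a small helper. This is a minimal sketch, not part of the toolkit: `search_with_fallback` and `fake_search` are hypothetical names, the `search` argument stands in for whatever callable exposes `encode_search_experiments` in your environment, and the response shape (`total`, `results`) is assumed from the example outputs shown elsewhere in this document.

```python
from typing import Any, Callable, Dict


def search_with_fallback(
    search: Callable[..., Dict[str, Any]],
    assay_title: str,
    fallback_term: str,
    **filters: Any,
) -> Dict[str, Any]:
    """Try an exact assay_title search, then a free-text search if it is empty."""
    response = search(assay_title=assay_title, **filters)
    if response.get("results"):
        return response
    # No hits for the exact assay title: retry with a broader free-text term.
    return search(search_term=fallback_term, **filters)


# Stand-in for the real tool so the sketch runs on its own (hypothetical data).
def fake_search(**kwargs: Any) -> Dict[str, Any]:
    if "search_term" in kwargs:
        return {"total": 1, "results": [{"accession": "ENCSR..."}]}
    return {"total": 0, "results": []}


result = search_with_fallback(
    fake_search, "scRNA-seq", "single cell RNA", organ="pancreas", limit=50
)
print(result["total"])
```

Passing the search function in as an argument keeps the helper independent of how the tool is actually invoked.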
To see which organs have single-cell data before searching, check the facets:
encode_get_facets(assay_title="scRNA-seq")
encode_get_facets(assay_title="snATAC-seq")
Present a summary to the user showing:
Use encode_list_files to see what is available per experiment:
encode_list_files(
experiment_accession="ENCSR...",
assembly="GRCh38",
preferred_default=True
)
Typical file hierarchy:
- output_type="reads": Raw sequencing reads with cell barcodes and UMIs
- output_type="gene quantifications" (format TSV): Count matrices (genes x cells) after ENCODE uniform pipeline processing
- output_type="filtered feature barcode matrix": Post-QC cell-filtered matrices ready for analysis
encode_list_files(
experiment_accession="ENCSR...",
file_format="bed",
assembly="GRCh38"
)
Typical file hierarchy:
- output_type="reads": Raw reads with cell barcodes
Key difference: scATAC-seq data is extremely sparse at the single-cell level. Most analyses operate on the fragment file, not on per-cell peak calls.
ENCODE Blacklist filtering (required for scATAC-seq): Before any downstream analysis of scATAC-seq peaks or fragments, remove reads/peaks overlapping ENCODE Blacklist regions (Amemiya et al. 2019, Scientific Reports, 1,372 citations). These regions produce artifactual signal in chromatin accessibility assays and inflate per-cell quality metrics (TSS enrichment, FRiP). Both ArchR and Signac apply blacklist filtering by default when provided, but verify it is active. Download blacklists from Boyle-Lab/Blacklist:
- hg38-blacklist.v2.bed.gz
- mm10-blacklist.v2.bed.gz
Check experiment-level quality:
encode_get_experiment(accession="ENCSR...")
| Metric | 10X Chromium | Smart-seq2 | Red Flag |
|---|---|---|---|
| Genes per cell (median) | 1,500-4,000 | 4,000-8,000 | <500 |
| UMIs per cell (median) | 3,000-15,000 | N/A (no UMIs) | <1,000 |
| Mitochondrial % | <10-15% | <10-15% | >25% |
| Doublet rate (estimated) | 2-8% (cell-count dependent) | <2% (plate-based) | >10% |
| Mapping rate | >80% | >80% | <60% |
| Saturation | >40% | N/A | <20% |
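The red-flag column of the scRNA-seq table can be applied programmatically once per-cell metrics are available. The sketch below is illustrative only: `scrna_red_flags` is a hypothetical helper, the input values are made up, and it assumes the mitochondrial threshold is applied to the median across cells (the table states medians only for genes and UMIs).

```python
from statistics import median
from typing import List, Optional, Sequence


def scrna_red_flags(
    genes_per_cell: Sequence[int],
    pct_mito_per_cell: Sequence[float],
    umis_per_cell: Optional[Sequence[int]] = None,
) -> List[str]:
    """Return the red flags from the QC table that an experiment trips."""
    flags = []
    if median(genes_per_cell) < 500:
        flags.append("median genes per cell < 500")
    # Smart-seq2 has no UMIs, so this check is skipped when no UMIs are given.
    if umis_per_cell is not None and median(umis_per_cell) < 1000:
        flags.append("median UMIs per cell < 1,000")
    # Assumption: the mitochondrial threshold is applied to the median.
    if median(pct_mito_per_cell) > 25:
        flags.append("mitochondrial % > 25")
    return flags


# Hypothetical per-cell values, for illustration only.
poor = scrna_red_flags([300, 420, 480], [31.0, 27.5, 26.0], umis_per_cell=[700, 850, 990])
good = scrna_red_flags([2100, 2500, 3200], [4.0, 6.5, 8.0], umis_per_cell=[5200, 7400, 9100])
print(poor)
print(good)
```

Returning the list of tripped flags, instead of a single pass/fail value, makes it easier to record why an experiment was excluded.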
| Metric | Acceptable | Red Flag |
|---|---|---|
| Unique fragments per cell | >3,000 | <1,000 |
| TSS enrichment per cell | >5 | <2 |
| Fraction in peaks | >20% | <10% |
| Nucleosomal banding | Clear mono/di/tri pattern | Absent or noisy |
| Doublet rate | <5% | >10% |
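The scATAC-seq thresholds above translate into a three-way call per cell. This is a sketch under stated assumptions: `classify_scatac_cell` is a hypothetical helper, it treats "Fraction in peaks" as a per-cell fraction between 0 and 1, and it labels anything between the two columns as "borderline", a category the table itself does not define.

```python
def classify_scatac_cell(
    unique_fragments: int, tss_enrichment: float, frip: float
) -> str:
    """Classify one cell against the Acceptable / Red Flag columns."""
    # Any single red-flag metric is enough to flag the cell.
    if unique_fragments < 1000 or tss_enrichment < 2 or frip < 0.10:
        return "red_flag"
    # All three metrics must clear the acceptable thresholds.
    if unique_fragments > 3000 and tss_enrichment > 5 and frip > 0.20:
        return "acceptable"
    # Assumption: values between the two columns are reported as borderline.
    return "borderline"


# Hypothetical cells, for illustration only.
print(classify_scatac_cell(8500, 9.2, 0.35))
print(classify_scatac_cell(2000, 4.1, 0.15))
print(classify_scatac_cell(600, 1.4, 0.05))
```

Nucleosomal banding and doublet rate are left out because they are assessed from the fragment-size distribution and from doublet-detection tools, not from a single per-cell number.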
Apply the same audit hierarchy as bulk data:
Track passing experiments:
encode_track_experiment(accession="ENCSR...", notes="scRNA-seq, [tissue], [platform]")
This is the most critical section. Cross-study scRNA-seq comparisons are fraught with technical confounders (Mawla & Huising 2019). Before combining datasets, understand these fundamental limitations:
| Feature | 10X Chromium | Smart-seq2 | Drop-seq |
|---|---|---|---|
| Genes per cell | 1,500-4,000 | 4,000-8,000 | 1,000-3,000 |
| Throughput (cells) | 500-10,000 | 96-384 per plate | 500-5,000 |
| Coverage | 3' biased (polyA capture) | Full-length | 3' biased |
| UMI support | Yes | No | Yes |
| Cost per cell | Low ($0.05-0.10) | High ($1-5) | Low ($0.05-0.15) |
| Transcript detection | Lower per cell, more cells | Higher per cell, fewer cells | Lower per cell |
| Splice variant detection | No (3' only) | Yes (full-length) | No (3' only) |
Never treat 10X and Smart-seq2 cells as equivalent without batch correction. Gene detection rates differ 2-3x and coverage patterns are fundamentally different.
Single-cell findings should always be validated against bulk ENCODE data from the same tissue. This is where ENCODE's bulk catalog becomes invaluable:
# Find bulk RNA-seq for same tissue
encode_search_experiments(
assay_title="total RNA-seq",
organ="pancreas",
biosample_type="tissue",
limit=50
)
# Find bulk ATAC-seq for same tissue
encode_search_experiments(
assay_title="ATAC-seq",
organ="pancreas",
biosample_type="tissue",
limit=50
)
CCA/RPCA (Seurat): Canonical correlation analysis finds shared correlation structure across datasets. Use RPCA (reciprocal PCA) for large datasets (>100k cells) — faster and more memory-efficient. Best when cell types are shared across datasets.
Harmony: Fast iterative soft clustering in PCA space. Recommended as first-line approach by Tran et al. 2020. Works well across platforms.
Evaluation metrics: After integration, assess quality using kBET (batch mixing), iLISI (integration LISI, higher = better mixing), cLISI (cell-type LISI, should remain separated), ARI and NMI (cluster agreement with known labels).
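Of the metrics listed, ARI is simple enough to compute by hand, which helps when interpreting its values. The sketch below is a from-scratch implementation of the standard adjusted Rand index formula using only the standard library; it is not the code used by any benchmarking package, and in practice a maintained implementation such as scikit-learn's `adjusted_rand_score` is the more sensible choice. The label vectors are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """Adjusted Rand index between known labels and cluster assignments."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pairs of cells that share both a true label and a predicted cluster.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical cell-type labels versus cluster IDs after integration.
cell_types = ["alpha", "alpha", "beta", "beta"]
print(adjusted_rand_index(cell_types, [0, 0, 1, 1]))  # clusters match labels
print(adjusted_rand_index(cell_types, [0, 1, 0, 1]))  # clusters cut across labels
```

An ARI near 1 means clusters recover the known labels, values near 0 are what random assignment would give, and negative values indicate agreement worse than chance.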
ArchR: The standard framework for scATAC-seq analysis and integration:
Signac (Seurat ecosystem): Alternative for users in the Seurat workflow. Uses LSI dimensionality reduction and integrates with Seurat's anchor-based methods.
For 10X Multiome (joint RNA + ATAC profiling from same cell):
Weighted Nearest Neighbors (WNN): Constructs a joint graph from both modalities, weighting each modality per cell based on its informativeness. A cell in a region with distinctive chromatin but generic expression will weight ATAC more heavily, and vice versa.
# Conceptual Seurat v4/v5 WNN workflow:
# 1. Process RNA: NormalizeData -> FindVariableFeatures -> ScaleData -> RunPCA
# 2. Process ATAC: RunTFIDF -> FindTopFeatures -> RunSVD (LSI)
# 3. Joint: FindMultiModalNeighbors(reduction.list = list("pca", "lsi"), dims.list = list(1:30, 2:30))
# 4. Cluster on WNN graph: FindClusters(graph.name = "wsnn")
# 5. UMAP: RunUMAP(nn.name = "weighted.nn")
For unpaired RNA and ATAC (from different cells/experiments):
- ArchR: addGeneIntegrationMatrix to transfer RNA labels to ATAC cells
- Signac/Seurat: FindTransferAnchors between RNA reference and ATAC query (using gene activity scores)
Use scRNA-seq references to deconvolve bulk ENCODE data. This assigns bulk signal to cell types without running additional single-cell experiments:
The ultimate value of ENCODE single-cell data is connecting cell-type-resolved expression to the deep bulk epigenomic catalog:
Assign bulk ChIP-seq peaks to cell types: Overlap bulk histone mark peaks with scATAC-seq cell-type accessibility. A bulk H3K27ac peak that overlaps a beta-cell-specific scATAC peak is likely a beta-cell enhancer.
Build cell-type regulatory networks: Combine scRNA-seq (which genes) with scATAC-seq (which regulatory elements) and bulk TF ChIP-seq (which factors bind) to reconstruct cell-type-specific GRNs.
Validate cell-type-specific regulomes: Check that predicted enhancers (from scATAC) overlap expected histone marks in bulk (H3K4me1 + H3K27ac = active enhancer; H3K4me1 + H3K27me3 = poised enhancer).
Chromatin state annotation per cell type: Use ENCODE cCRE registry (ENCODE Phase 3) as the reference and assign each cCRE to contributing cell types based on scATAC-seq accessibility.
# Find ENCODE cCREs for comparison
encode_search_files(
output_type="candidate Cis-Regulatory Elements",
assembly="GRCh38",
organ="pancreas"
)
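The first idea in the list above, assigning bulk histone-mark peaks to cell types by overlap with cell-type-specific scATAC-seq peaks, can be sketched as follows. This is a minimal illustration: `overlaps` and `assign_cell_types` are hypothetical helpers, the coordinates are made up, and intervals are assumed to be BED-style half-open (start inclusive, end exclusive).

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def assign_cell_types(
    bulk_peaks: List[Interval], celltype_peaks: Dict[str, List[Interval]]
) -> Dict[Interval, List[str]]:
    """Map each bulk peak to the cell types whose scATAC peaks overlap it."""
    assignments: Dict[Interval, List[str]] = {}
    for peak in bulk_peaks:
        assignments[peak] = sorted(
            cell_type
            for cell_type, peaks in celltype_peaks.items()
            if any(overlaps(peak, p) for p in peaks)
        )
    return assignments


# Hypothetical coordinates, for illustration only.
bulk_h3k27ac = [("chr11", 2_150_000, 2_151_000), ("chr7", 500, 900)]
scatac_peaks = {
    "beta": [("chr11", 2_150_400, 2_150_800)],
    "alpha": [("chr2", 100, 200)],
}
assigned = assign_cell_types(bulk_h3k27ac, scatac_peaks)
print(assigned)
```

The nested loop is quadratic in the number of peaks, which is fine for a sketch. For genome-wide peak sets, a sorted sweep, an interval tree, or a tool such as bedtools intersect would be the more practical route.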
Log all analyses for provenance:
encode_track_experiment(accession="ENCSR...", notes="scRNA-seq [tissue], included in single-cell analysis")
encode_log_derived_file(
file_path="/path/to/sc_analysis/integrated_atlas.h5ad",
source_accessions=["ENCSR...", "ENCSR..."],
description="Integrated scRNA-seq atlas of [tissue], N donors, N cells, [platform(s)]",
file_type="integrated_atlas",
tool_used="Seurat v5 / ArchR / Scanpy",
parameters="Integration method, HVGs, resolution, batch variables"
)
Link any relevant publications or external datasets:
encode_link_reference(
experiment_accession="ENCSR...",
reference_type="pmid",
reference_id="31305906",
description="Mawla & Huising 2019 - cross-study scRNA-seq reproducibility framework"
)
scATAC-seq produces the same nucleosomal fragment size ladder as bulk ATAC-seq (Buenrostro et al. 2015):
The presence of this banding pattern per cell is a key quality indicator. Cells without clear banding should be filtered.
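One common way to summarize banding per cell is the ratio of mononucleosomal to nucleosome-free fragments. The sketch below is an assumption-laden illustration, not the method prescribed here: `nucleosome_signal` is a hypothetical helper, and the 147 bp and 294 bp cutoffs are assumed values modeled on common practice that should be checked against the tool you actually use.

```python
from typing import Sequence


def nucleosome_signal(
    fragment_lengths: Sequence[int], nfr_max: int = 147, mono_max: int = 294
) -> float:
    """Ratio of mononucleosomal to nucleosome-free fragments for one cell."""
    # Assumed cutoffs: < nfr_max is nucleosome-free, nfr_max..mono_max is mono.
    nucleosome_free = sum(1 for length in fragment_lengths if length < nfr_max)
    mononucleosomal = sum(
        1 for length in fragment_lengths if nfr_max <= length <= mono_max
    )
    if nucleosome_free == 0:
        return float("inf")  # no short fragments at all: no usable signal
    return mononucleosomal / nucleosome_free


# Hypothetical fragment lengths (bp) for a single cell.
lengths = [50, 80, 100, 120, 180, 200, 400]
print(nucleosome_signal(lengths))
```

A single ratio cannot replace looking at the fragment-size histogram, but it gives a per-cell number that can be thresholded alongside TSS enrichment.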
Per-cell TSS enrichment measures signal pileup at transcription start sites. Minimum threshold of 4-5 is standard (ArchR default = 4). Low TSS enrichment indicates poor signal-to-noise in that cell.
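To make the definition concrete, the sketch below computes a simplified TSS enrichment from an aggregate insertion profile centered on TSSs: mean signal near the TSS divided by mean signal in the outer flanks. `tss_enrichment` is a hypothetical helper, the window sizes are assumptions, and this is not ArchR's or Signac's exact calculation (both apply their own windows and smoothing).

```python
from typing import Sequence


def tss_enrichment(
    profile: Sequence[float], center_halfwidth: int = 50, flank_width: int = 100
) -> float:
    """Mean signal around the TSS divided by mean signal in the outer flanks.

    `profile` holds per-base insertion counts across a window centered on
    the TSS, summed over all TSSs for one cell.
    """
    if len(profile) < 2 * flank_width + 2 * center_halfwidth + 1:
        raise ValueError("profile is too short for the chosen window sizes")
    mid = len(profile) // 2
    center = profile[mid - center_halfwidth : mid + center_halfwidth + 1]
    flanks = list(profile[:flank_width]) + list(profile[-flank_width:])
    background = sum(flanks) / len(flanks)
    if background == 0:
        raise ValueError("no background signal in the flanks")
    return (sum(center) / len(center)) / background


# Synthetic profile: flat background of 1.0 with a 6x pileup at the TSS.
profile = [1.0] * 4001
for i in range(2000 - 50, 2000 + 51):
    profile[i] = 6.0

score = tss_enrichment(profile)
print(score, score >= 4)  # compare against the minimum threshold of 4
```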
The peak-cell matrix is extremely sparse (~2-5% non-zero entries). This is fundamentally different from scRNA-seq sparsity:
This extreme sparsity motivates:
Before combining any single-cell datasets, verify each item:
Over-interpreting clusters: Not every cluster represents a biologically distinct cell type. Clusters can be driven by cell cycle, dissociation stress, or ambient RNA contamination. Validate against known markers and orthogonal methods (immunohistochemistry, FISH).
Batch-driven clustering: If UMAP clusters correspond to studies rather than cell types, integration has failed or is insufficient. Always color UMAP by both cell type AND study origin. Technical variables (platform, donor) should not predict cluster membership.
Low gene detection masquerading as heterogeneity: With only 2,000-6,000 genes per cell and TIN <20 for most genes, apparent "heterogeneous expression" may reflect operating at the detection limit. Validate by checking whether detection fraction correlates with average expression (Mawla & Huising 2019).
3' bias in droplet platforms: 10X Chromium and Drop-seq capture only the 3' end of transcripts. Genes with alternative 3' UTRs, short 3' UTRs, or non-polyadenylated transcripts (some lncRNAs, histone mRNAs) may be systematically underdetected or absent. Do not interpret their absence as biological.
Platform-specific artifacts: Smart-seq2 is susceptible to length bias (longer genes detected more readily). 10X is susceptible to ambient RNA contamination and barcode swapping. Drop-seq has higher multiplet rates at high cell loading. Know your platform's failure modes.
Doublets corrupting cluster identity: Doublets can create false intermediate populations or inflate rare cell type counts. Apply computational doublet detection (Scrublet, DoubletFinder) before integration. Expected doublet rates for 10X: ~0.8% per 1,000 cells recovered.
Reference dataset quality propagates: If using scRNA-seq as a reference for deconvolution or label transfer to scATAC-seq, errors in the reference annotations propagate to all downstream analyses. Use well-validated references with marker gene support, not just unsupervised clustering labels.
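Two of the pitfalls above lend themselves to quick numeric checks: whether detection fraction tracks average expression (the detection-limit pitfall), and what doublet rate to expect for a given cell count. The sketch below is illustrative only. All function names are hypothetical, the count matrix is made up, the use of log1p-transformed means and Pearson correlation is an assumption rather than the published procedure, and the doublet function simply treats the ~0.8% per 1,000 cells figure as a linear rule of thumb.

```python
from math import log1p, sqrt
from typing import List, Sequence, Tuple


def detection_profile(
    counts: Sequence[Sequence[int]],
) -> Tuple[List[float], List[float]]:
    """Per-gene detection fraction and log1p mean count (rows are genes)."""
    fractions, log_means = [], []
    for gene_counts in counts:
        n_cells = len(gene_counts)
        fractions.append(sum(1 for c in gene_counts if c > 0) / n_cells)
        log_means.append(log1p(sum(gene_counts) / n_cells))
    return fractions, log_means


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def expected_doublet_rate(n_cells: int, rate_per_thousand: float = 0.008) -> float:
    """Linear rule of thumb: about 0.8% doublets per 1,000 cells."""
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    return rate_per_thousand * n_cells / 1000


# Hypothetical genes x cells count matrix, for illustration only.
toy_counts = [
    [0, 0, 0, 1],
    [0, 2, 0, 3],
    [4, 5, 0, 6],
    [9, 8, 10, 12],
]
fractions, log_means = detection_profile(toy_counts)
r = pearson(fractions, log_means)
print(fractions, round(r, 3))
print(expected_doublet_rate(5000))
```

A correlation close to 1 suggests that which cells "express" a gene is largely explained by how highly the gene is expressed overall, which is the signature of operating at the detection limit.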
Goal: Discover and analyze ENCODE single-cell experiments (scRNA-seq, scATAC-seq) to identify cell-type-specific regulatory programs within complex tissues.
Context: ENCODE's single-cell experiments decompose bulk tissue signals into cell-type-specific profiles, revealing regulatory heterogeneity masked in bulk assays.
encode_get_facets(assay_title="scRNA-seq", facet_field="organ", organism="Homo sapiens")
Expected output:
{
"facets": {
"organ": {"brain": 22, "blood": 15, "lung": 8, "heart": 6, "liver": 4, "kidney": 3}
}
}
Interpretation: Brain has the most scRNA-seq data (22 experiments). Brain's cellular diversity makes it ideal for single-cell analysis.
encode_search_experiments(assay_title="snATAC-seq", organ="brain", organism="Homo sapiens")
Expected output:
{
"total": 12,
"results": [
{"accession": "ENCSR700SCA", "assay_title": "snATAC-seq", "biosample_summary": "brain", "status": "released"},
{"accession": "ENCSR701CTX", "assay_title": "snATAC-seq", "biosample_summary": "cerebral cortex", "status": "released"}
]
}
encode_list_files(experiment_accession="ENCSR700SCA", file_format="h5ad", assembly="GRCh38")
Expected output:
{
"files": [
{"accession": "ENCFF800H5A", "output_type": "gene quantifications", "file_format": "h5ad", "file_size_mb": 320}
]
}
encode_search_experiments(assay_title="ATAC-seq", organ="brain", organism="Homo sapiens")
Expected output:
{
"total": 32,
"results": [
{"accession": "ENCSR800BLK", "assay_title": "ATAC-seq", "biosample_summary": "brain", "status": "released"}
]
}
Interpretation: Compare scATAC-seq cell-type clusters with bulk ATAC-seq peaks. Cell types that dominate the tissue (e.g., neurons) will contribute most to the bulk signal. Rare cell types (e.g., microglia) may have unique regulatory elements invisible in bulk data.
Use → cellxgene-context to access the CellxGene Census for additional single-cell reference data. Use → scrna-meta-analysis for cross-study integration.
encode_get_facets(assay_title="snATAC-seq", facet_field="organ", organism="Homo sapiens")
Expected output:
{
"facets": {
"organ": {"brain": 12, "blood": 8, "lung": 5, "heart": 3}
}
}
encode_compare_experiments(accession_1="ENCSR700SCA", accession_2="ENCSR800BLK")
Expected output:
{
"comparison": {
"shared": {"organ": "brain", "organism": "Homo sapiens"},
"differences": {
"assay": ["snATAC-seq", "ATAC-seq"],
"resolution": ["single-cell", "bulk"]
}
}
}
encode_track_experiment(accession="ENCSR700SCA", notes="Brain scATAC-seq for cell-type-specific regulatory analysis")
Expected output:
{
"status": "tracked",
"accession": "ENCSR700SCA",
"notes": "Brain scATAC-seq for cell-type-specific regulatory analysis"
}
| This skill produces... | Feed into... | Purpose |
|---|---|---|
| Cell-type-specific peaks | regulatory-elements | Classify cCREs by cell type |
| Cell-type marker genes | cellxgene-context | Cross-reference with CellxGene atlas data |
| Cell-type peak sets | peak-annotation | Assign cell-type-specific regulatory elements to genes |
| Cross-study cell annotations | scrna-meta-analysis | Integrate ENCODE scRNA-seq across studies |
| Bulk vs. single-cell comparison | compare-biosamples | Quantify cell-type contributions to bulk signal |
| Cell-type accessibility profiles | motif-analysis | Discover cell-type-specific TF motifs |
| scATAC-seq fragment files | visualization-workflow | Generate cell-type-resolved browser tracks |
| Single-cell QC metrics | quality-assessment | Validate single-cell data quality |
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit