From encode-toolkit
Tracks exact provenance for every operation on ENCODE data — tool versions, references, parameters, and timestamps — and auto-generates publication-ready methods sections from the log.
Slash command: /encode-toolkit:data-provenance
- User wants to track the full analysis chain from ENCODE download through processing to publication figure
Track every operation on ENCODE data with exact tool versions, reference files, scripts, parameters, and timestamps to enable publication-ready methods sections.
The question: "What exactly was done to this data, and can someone else reproduce it identically?"
Reproducibility is the foundation of science. Yet the "Methods" sections of most genomics papers are vague — "reads were aligned with STAR" tells you nothing about which STAR version, which genome index, which parameters, or which annotation version was used. The difference between GENCODE v38 and v39 gene annotations can change thousands of gene assignments.
This skill implements a documentation standard where every operation records the exact tool and version used (bedtools v2.31.0, not just "bedtools"), along with the reference files, scripts, parameters, and timestamps involved.
This creates a complete audit trail, so a methods section can be auto-generated with zero ambiguity.
Consider a simple liftover operation. A vague log says "coordinates were lifted from hg19 to hg38." A comprehensive provenance log says:
"Genomic coordinates were lifted from GRCh37/hg19 to GRCh38/hg38 using UCSC liftOver (v377, Kent et al. 2002, PMID: 12045153). The chain file hg19ToHg38.over.chain.gz was obtained from UCSC Genome Browser (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/, accessed 2024-01-15, MD5: abc123...). Of 45,231 input regions, 44,892 (99.25%) were successfully converted; 339 regions (0.75%) failed to map and were excluded. Unmapped regions were logged to unmapped.bed."
The second version can be reproduced exactly. The first cannot.
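The counts in a sentence like the one above can be derived from the logged numbers rather than typed by hand. The following is a minimal sketch; the function name `liftover_summary` and its arguments are hypothetical and not part of the toolkit.

```python
def liftover_summary(n_input: int, n_mapped: int, unmapped_path: str) -> str:
    """Render the counts portion of a liftover methods sentence (illustrative helper)."""
    if n_input <= 0 or not 0 <= n_mapped <= n_input:
        raise ValueError("expected n_input > 0 and 0 <= n_mapped <= n_input")
    n_failed = n_input - n_mapped
    pct_mapped = 100 * n_mapped / n_input
    pct_failed = 100 * n_failed / n_input
    return (
        f"Of {n_input:,} input regions, {n_mapped:,} ({pct_mapped:.2f}%) were "
        f"successfully converted; {n_failed:,} regions ({pct_failed:.2f}%) failed "
        f"to map and were excluded. Unmapped regions were logged to {unmapped_path}."
    )


# Numbers taken from the example sentence above.
print(liftover_summary(45231, 44892, "unmapped.bed"))
```

Deriving the percentages from the raw counts keeps the text and the log from drifting apart when the analysis is rerun.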
At the start of any analysis session, create an experiment log:
project_dir/
├── experiment_log.json # Machine-readable provenance log
├── scripts/ # All scripts used in this analysis
│ ├── 001_download.sh
│ ├── 002_filter_peaks.sh
│ └── 003_merge_samples.R
├── reference_files/ # Reference files used (or symlinks)
│ ├── GRCh38.chrom.sizes
│ └── gencode.v44.annotation.gtf
├── data/ # ENCODE downloads
│ └── (organized by experiment)
├── derived/ # All derived files
│ ├── filtered_peaks/
│ └── merged_results/
└── methods/ # Auto-generated methods text
└── methods_draft.md
{
"project": "H3K27ac analysis in human pancreas",
"created": "2024-01-15T10:30:00Z",
"analyst": "Dr. A. Mawla",
"organism": "Homo sapiens",
"assembly": "GRCh38",
"gene_annotation": "GENCODE v44",
"operations": [],
"encode_experiments": [],
"software_environment": {},
"reference_files": []
}
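One way to create the layout and the empty log in a single step is sketched below. The helper `init_experiment_log` is hypothetical; the directory names and log keys are copied from the layout and skeleton above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Directory names copied from the project layout above.
SUBDIRS = ("scripts", "reference_files", "data", "derived", "methods")


def init_experiment_log(project_dir, project, analyst, organism, assembly, gene_annotation):
    """Create the project layout and an empty experiment_log.json; return its path."""
    root = Path(project_dir)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    log_path = root / "experiment_log.json"
    if log_path.exists():
        # Refuse to overwrite an existing provenance log.
        raise FileExistsError(f"{log_path} already exists")
    log = {
        "project": project,
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "analyst": analyst,
        "organism": organism,
        "assembly": assembly,
        "gene_annotation": gene_annotation,
        "operations": [],
        "encode_experiments": [],
        "software_environment": {},
        "reference_files": [],
    }
    log_path.write_text(json.dumps(log, indent=2) + "\n", encoding="utf-8")
    return log_path
```

Refusing to overwrite an existing log is a design choice of this sketch: a provenance log that can be silently replaced is not much of an audit trail.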
encode_track_experiment(accession="ENCSR...", notes="Experiment log entry")
For each experiment, record in the log:
| Field | Example | Source |
|---|---|---|
| Accession | ENCSR133RZO | ENCODE portal |
| Assay | Histone ChIP-seq | encode_get_experiment |
| Target | H3K27ac | encode_get_experiment |
| Biosample | pancreas tissue | encode_get_experiment |
| Lab | Bing Ren, UCSD | encode_get_experiment |
| Replicates | 2 biological | encode_get_experiment |
| Sequencer | Illumina HiSeq 4000 | encode_get_experiment |
| Read length | 76bp PE | encode_get_experiment |
| Read count | 42.3M per rep | File metadata |
| Library | TruSeq ChIP | encode_get_experiment |
| Batch/date | 2019-06-15 | encode_get_experiment |
Every processing step creates a log entry with these fields:
{
"operation_id": "op_003",
"timestamp": "2024-01-15T14:22:00Z",
"description": "Filter H3K27ac peaks by signalValue",
"category": "filtering",
"tool": {
"name": "bedtools",
"version": "2.31.0",
"citation": "Quinlan & Hall 2010, Bioinformatics, DOI:10.1093/bioinformatics/btq033"
},
"command": "awk '$7 >= 4.5' ENCFF123ABC.bed | bedtools intersect -a stdin -b blacklist.bed -v > filtered_peaks.bed",
"script_path": "scripts/002_filter_peaks.sh",
"inputs": [
{
"file": "ENCFF123ABC.bed",
"accession": "ENCFF123ABC",
"type": "IDR thresholded peaks",
"md5": "abc123..."
}
],
"reference_files": [
{
"file": "hg38-blacklist.v2.bed.gz",
"source": "ENCODE Blacklist v2 (Amemiya et al. 2019, Sci Rep, DOI:10.1038/s41598-019-45839-z)",
"url": "https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz",
"md5": "def456..."
}
],
"parameters": {
"signalValue_threshold": 4.5,
"blacklist_filter": "exclude overlapping regions"
},
"outputs": [
{
"file": "derived/filtered_peaks/H3K27ac_pancreas_filtered.bed",
"description": "H3K27ac peaks in pancreas, signalValue >= 4.5, blacklist-filtered",
"regions_count": 34521,
"md5": "ghi789..."
}
],
"statistics": {
"input_regions": 45231,
"output_regions": 34521,
"filtered_out": 10710,
"filter_rate": "23.7%"
}
}
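The `statistics` block and the sequential `operation_id` can be computed instead of entered manually. This is a sketch under the assumption that the log is held in memory as a dict shaped like the skeleton above; `filter_statistics` and `append_operation` are hypothetical names.

```python
from datetime import datetime, timezone


def filter_statistics(input_regions, output_regions):
    """Compute the statistics block for a filtering step."""
    if input_regions <= 0 or not 0 <= output_regions <= input_regions:
        raise ValueError("expected input_regions > 0 and 0 <= output_regions <= input_regions")
    filtered_out = input_regions - output_regions
    return {
        "input_regions": input_regions,
        "output_regions": output_regions,
        "filtered_out": filtered_out,
        "filter_rate": f"{100 * filtered_out / input_regions:.1f}%",
    }


def append_operation(log, description, category, tool, command, statistics):
    """Append an operation entry with the next sequential ID and a UTC timestamp."""
    entry = {
        "operation_id": f"op_{len(log['operations']) + 1:03d}",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "description": description,
        "category": category,
        "tool": tool,
        "command": command,
        "statistics": statistics,
    }
    log["operations"].append(entry)
    return entry
```

With the counts from the entry above, `filter_statistics(45231, 34521)` reproduces `filtered_out` of 10710 and a `filter_rate` of "23.7%".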
encode_download_files(
file_accessions=["ENCFF..."],
download_dir="/path/to/data/",
organize_by="experiment",
verify_md5=True
)
Log: file accession, download URL, MD5 verification result, file size, download timestamp.
For liftover, log: liftOver version, chain file (source URL, date accessed), input count, output count, unmapped count, unmapped file location.
For filtering, log: filter criteria (signalValue threshold, p-value cutoff), blacklist used and version, input/output region counts, what was removed.
For merging, log: merge tool + version, merge distance parameter, input files (all accessions), sample tagging method, output count, overlap statistics.
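Most of the items above reduce to two measurable facts about a file: its checksum and its region count. A minimal sketch using only the standard library follows; the helper names are hypothetical, and skipping `track`, `browser`, and `#` lines is an assumption about how header lines in BED files should be treated.

```python
import gzip
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_regions(path):
    """Count data lines in a plain or gzipped BED file, skipping headers and blanks."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1
            for line in handle
            if line.strip() and not line.startswith(("#", "track", "browser"))
        )


def file_record(path):
    """Build an inputs/outputs entry with the fields used in the operation log."""
    return {"file": str(path), "md5": md5sum(path), "regions_count": count_regions(path)}
```

Calling `file_record` on both the input and the output of a step yields the counts needed for the input/output statistics.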
{
"tool": {
"name": "DESeq2",
"version": "1.42.0",
"r_version": "4.3.2",
"bioconductor_version": "3.18",
"citation": "Love et al. 2014, Genome Biology, DOI:10.1186/s13059-014-0550-8"
},
"command": "DESeq2::results(dds, contrast=c('condition','treated','control'), alpha=0.05)",
"parameters": {
"design_formula": "~ batch + condition",
"contrast": ["condition", "treated", "control"],
"alpha": 0.05,
"lfcThreshold": 0
}
}
{
"tool": {
"name": "scanpy",
"version": "1.9.6",
"python_version": "3.11.5",
"anndata_version": "0.10.3",
"citation": "Wolf et al. 2018, Genome Biology, DOI:10.1186/s13059-017-1382-0"
}
}
At the start of each analysis, capture the full environment:
sessionInfo()
# Or more detailed:
devtools::session_info()
Log: R version, platform, attached packages with versions, loaded namespaces.
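from importlib.metadata import distributions
# Map each installed distribution to its version (pkg_resources is deprecated)
{dist.metadata["Name"]: dist.version for dist in distributions()}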
Log: Python version, all installed packages with versions, virtual environment path.
For each tool used, record the version:
bedtools --version # bedtools v2.31.0
samtools --version # samtools 1.19
STAR --version # 2.7.11a
macs2 --version # macs2 2.2.9.1
liftOver # Kent tools (note: no --version flag; record binary date)
uname -a # OS and kernel
nproc # CPU cores
free -h # Memory (Linux)
sysctl -n hw.memsize # Memory (macOS)
nvidia-smi # GPU info (if applicable)
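The version commands above can also be collected programmatically so the strings land in the log verbatim. This sketch uses only `shutil` and `subprocess`; `tool_version` and `capture_versions` are hypothetical names, and taking the first output line is an assumption that holds for the tools listed but not for liftOver, which has no version flag.

```python
import shutil
import subprocess


def tool_version(command):
    """Run a version command; return its first output line, or None if the tool is absent."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False, timeout=30)
    # Some tools print their version to stderr rather than stdout.
    output = result.stdout.strip() or result.stderr.strip()
    return output.splitlines()[0] if output else None


def capture_versions(commands):
    """Map a tool label to its reported version string (None when not installed)."""
    return {label: tool_version(command) for label, command in commands.items()}


if __name__ == "__main__":
    print(capture_versions({
        "bedtools": ["bedtools", "--version"],
        "samtools": ["samtools", "--version"],
        "STAR": ["STAR", "--version"],
    }))
```

Recording `None` for a missing tool, rather than raising, lets the log show explicitly which tools were unavailable on the machine.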
Every custom script used in the analysis should be stored in the scripts/ directory with sequential numbering:
scripts/
├── 001_download_encode_data.sh
├── 002_filter_peaks.sh
├── 003_merge_samples.R
├── 004_chromhmm_segmentation.sh
├── 005_differential_analysis.R
└── 006_visualization.py
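Sequential numbering is easy to break when scripts are added by hand. The sketch below proposes the next script name and reports gaps; both helpers are hypothetical and assume the three-digit `NNN_` prefix shown in the listing above.

```python
import re

PREFIX = re.compile(r"^(\d{3})_")


def script_numbers(names):
    """Extract the numeric prefixes from script file names, sorted ascending."""
    return sorted(int(m.group(1)) for name in names if (m := PREFIX.match(name)))


def next_script_name(names, description, extension):
    """Return the next name in the sequence, e.g. 007_<description>.<extension>."""
    numbers = script_numbers(names)
    return f"{(numbers[-1] if numbers else 0) + 1:03d}_{description}.{extension}"


def numbering_gaps(names):
    """List the missing numbers between 1 and the highest prefix in use."""
    numbers = script_numbers(names)
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(1, numbers[-1] + 1) if n not in present]
```

A gap in the sequence usually means a step was run but its script was not stored, which is worth catching before methods are written.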
Every stored script should include a header:
#!/bin/bash
# Script: 002_filter_peaks.sh
# Project: H3K27ac analysis in human pancreas
# Date: 2024-01-15
# Author: Generated by ENCODE Connector
# Description: Filter H3K27ac peaks by signalValue and remove blacklisted regions
# Dependencies: bedtools v2.31.0, awk (GNU Awk 5.2.1)
# Input: ENCFF123ABC.bed (IDR thresholded peaks, GRCh38)
# Output: derived/filtered_peaks/H3K27ac_pancreas_filtered.bed
# Reference: hg38-blacklist.v2.bed.gz (Amemiya et al. 2019)
After each operation, register the derived file:
encode_log_derived_file(
file_path="/path/to/output.bed",
source_accessions=["ENCSR...", "ENCFF..."],
description="H3K27ac peaks in pancreas, signalValue >= 4.5, blacklist-filtered",
file_type="filtered_peaks",
tool_used="bedtools v2.31.0 + awk",
parameters="awk '$7 >= 4.5' | bedtools intersect -v blacklist"
)
Verify the provenance chain:
encode_get_provenance(file_path="/path/to/output.bed")
If the user tries different parameters or approaches:
{
"operation_id": "op_003a",
"description": "Filter peaks - signalValue >= 4.5",
"status": "alternative",
"note": "Less stringent threshold, more peaks retained"
},
{
"operation_id": "op_003b",
"description": "Filter peaks - signalValue >= 7.0",
"status": "selected",
"note": "More stringent, user chose this for final analysis"
}
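When methods text is generated, only the selected branch should be described. A possible sketch is shown below; it assumes that entries without a `status` field count as selected and that alternatives share an ID stem with a trailing letter (`op_003a`, `op_003b`), as in the example above. Both function names are hypothetical.

```python
import re
from collections import defaultdict


def final_operations(operations):
    """Drop entries marked 'alternative'; entries with no status are kept."""
    return [op for op in operations if op.get("status", "selected") != "alternative"]


def unresolved_alternatives(operations):
    """Return ID stems whose alternatives do not have exactly one 'selected' entry."""
    groups = defaultdict(list)
    for op in operations:
        stem = re.sub(r"[a-z]+$", "", op["operation_id"])  # op_003a -> op_003
        groups[stem].append(op)
    return [
        stem
        for stem, ops in groups.items()
        if len(ops) > 1 and sum(op.get("status") == "selected" for op in ops) != 1
    ]
```

Keeping the rejected alternatives in the log while excluding them from the methods text preserves the record of what was tried.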
When the user requests methods writing, read the experiment log and generate publication-ready text.
Data Acquisition
[Assay] data for [biosample] were obtained from the ENCODE Project (ENCODE Project Consortium 2020) via the ENCODE portal (https://www.encodeproject.org). [N] experiments were included (accessions: [list]). All experiments used [sequencer] with [read length] [SE/PE] reads, generating [N]M reads per replicate across [N] biological replicates. Data were processed by the ENCODE Uniform Processing Pipeline (version [X]).
File Selection
[Output type] files aligned to [assembly] were selected for downstream analysis. Files were selected using ENCODE's preferred default designation. IDR thresholded peaks (Li et al. 2011) were used for [ChIP-seq/ATAC-seq] to ensure replicate concordance.
Quality Assessment
Experiments were assessed for quality using ENCODE audit flags. Experiments with ERROR-level audits were excluded. ChIP-seq quality was evaluated using FRiP (≥[X]%), NSC (>[X]), RSC (>[X]), and NRF (≥[X]) metrics.
Processing Steps
For each operation in the log, generate a sentence:
[Description]. [Tool] (version [X]; [citation]) was used with the following parameters: [parameters]. [Reference files] were obtained from [source] (version [X], accessed [date]). Of [N] input [regions/reads], [N] ([%]) passed filtering.
Data Availability
All source data are available from the ENCODE portal under accessions [list]. Derived files, analysis scripts, and the complete provenance log are available at [repository URL]. Software versions: [list all tools and versions used].
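The Processing Steps template can be filled directly from an operation entry. The following sketch covers only the tool, parameter, and count portions of the template; `methods_sentence` is a hypothetical helper and assumes the entry carries the `tool`, `parameters`, and `statistics` fields shown in the operation log example.

```python
def methods_sentence(op):
    """Render one Processing Steps sentence from an operation log entry."""
    tool = op["tool"]
    stats = op["statistics"]
    parameters = ", ".join(f"{key} = {value}" for key, value in op["parameters"].items())
    pct_passed = 100 * stats["output_regions"] / stats["input_regions"]
    return (
        f"{op['description']}. {tool['name']} (version {tool['version']}; "
        f"{tool['citation']}) was used with the following parameters: {parameters}. "
        f"Of {stats['input_regions']:,} input regions, {stats['output_regions']:,} "
        f"({pct_passed:.1f}%) passed filtering."
    )
```

A generated sentence is a draft: the parameter names come straight from the log keys and will usually need light editing before submission.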
Methods sections MUST follow these principles for every computational step:
Precision over approximation
Complete tool attribution
Full reference specification
Experimental context
Show your filtering work
Statistical rigor
Data accessibility
Orthogonal validation
When generating methods, include proper citations:
| Tool | Citation |
|---|---|
| bedtools | Quinlan & Hall 2010, Bioinformatics |
| samtools | Li et al. 2009, Bioinformatics |
| STAR | Dobin et al. 2013, Bioinformatics |
| featureCounts | Liao et al. 2014, Bioinformatics |
| edgeR | Robinson et al. 2010, Bioinformatics |
| MACS2 | Zhang et al. 2008, Genome Biology |
| DESeq2 | Love et al. 2014, Genome Biology |
| Seurat | Stuart et al. 2019, Cell |
| SCTransform | Hafemeister & Satija 2019, Genome Biology |
| CellRanger | 10x Genomics (cite version used) |
| Scanpy | Wolf et al. 2018, Genome Biology |
| ChromHMM | Ernst & Kellis 2012, Nature Methods |
| liftOver | Kent et al. 2002, Genome Research |
| HOMER | Heinz et al. 2010, Molecular Cell |
| deepTools | Ramírez et al. 2016, Nucleic Acids Research |
| Harmony | Korsunsky et al. 2019, Nature Methods |
| IDR | Li et al. 2011, Annals of Applied Statistics |
| WGCNA | Langfelder & Horvath 2008, BMC Bioinformatics |
| CIBERSORTx | Newman et al. 2019, Nature Biotechnology |
| GSEA | Subramanian et al. 2005, PNAS |
| Gviz | Hahne & Ivanek 2016, Methods in Molecular Biology |
| GraphPad Prism | GraphPad Software (cite version) |
| DAVID | Huang et al. 2009, Nature Protocols |
| Enrichr | Kuleshov et al. 2016, Nucleic Acids Research |
| DEGAS | Li et al. 2022, Genome Biology |
| RRHO | Plaisier et al. 2010, Nucleic Acids Research |
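A lookup over the table above can flag tools that appear in the log but have no citation on record. This is a sketch with a hypothetical `citations_for` helper; the dictionary holds only a subset of the rows, copied from the table, and would need the remaining rows added.

```python
# Subset of the citation table above, keyed by lowercase tool name.
CITATIONS = {
    "bedtools": "Quinlan & Hall 2010, Bioinformatics",
    "samtools": "Li et al. 2009, Bioinformatics",
    "star": "Dobin et al. 2013, Bioinformatics",
    "macs2": "Zhang et al. 2008, Genome Biology",
    "deseq2": "Love et al. 2014, Genome Biology",
    "scanpy": "Wolf et al. 2018, Genome Biology",
    "liftover": "Kent et al. 2002, Genome Research",
}


def citations_for(operations):
    """Return ({tool: citation}, [tools with no known citation]) for logged operations."""
    cited, missing = {}, []
    for op in operations:
        name = op["tool"]["name"]
        citation = CITATIONS.get(name.lower())
        if citation is not None:
            cited[name] = citation
        elif name not in missing:
            missing.append(name)
    return cited, missing
```

Returning the uncited tools separately makes the gap visible instead of silently omitting a tool from the bibliography.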
Generate supplementary tables following the scientific documentation standards above:
Experiments:
| Accession | Assay | Target | Biosample | Lab | Replicates | Sequencer | Read Length | Read Count | Library |
|---|---|---|---|---|---|---|---|---|---|
| ENCSR... | Histone ChIP-seq | H3K27ac | pancreas | Ren | 2 bio | HiSeq 4000 | 76bp PE | 42.3M | TruSeq |

Files:
| File Accession | Experiment | Format | Output Type | Assembly | Pipeline | Size | MD5 |
|---|---|---|---|---|---|---|---|
| ENCFF... | ENCSR... | bed narrowPeak | IDR thresholded peaks | GRCh38 | ENCODE v2.1 | 1.2MB | abc... |

Processing steps:
| Step | Description | Tool | Version | Input | Output | Parameters | Reference Files |
|---|---|---|---|---|---|---|---|
| 1 | Peak filtering | bedtools | 2.31.0 | ENCFF... | filtered.bed | signalValue≥4.5 | blacklist v2 |

Software:
| Software | Version | Citation |
|---|---|---|
| R | 4.3.2 | R Core Team 2023 |
| Bioconductor | 3.18 | Huber et al. 2015 |
| DESeq2 | 1.42.0 | Love et al. 2014 |
Export using:
encode_export_data(format="csv") # For Table S1
encode_get_citations(export_format="bibtex") # For bibliography
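The processing-steps table can also be rendered locally from the operation log, independently of the export tools above. The sketch below emits a reduced four-column markdown table; `steps_table` is a hypothetical helper, not a toolkit function.

```python
def steps_table(operations):
    """Render logged operations as a markdown table (Step, Description, Tool, Version)."""
    header = ["Step", "Description", "Tool", "Version"]
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    for step, op in enumerate(operations, start=1):
        row = [str(step), op["description"], op["tool"]["name"], op["tool"]["version"]]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


example = [{"description": "Peak filtering", "tool": {"name": "bedtools", "version": "2.31.0"}}]
print(steps_table(example))
```

The remaining columns (input, output, parameters, reference files) follow the same pattern once the corresponding fields are read from each entry.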
Record exact tool versions: bedtools v2.30 may produce different results than v2.31.
Goal: Document every step of an ENCODE analysis pipeline with full provenance, from raw data acquisition through processing, analysis, and derived outputs, enabling reproducibility and publication-ready methods.
Context: Reproducibility requires knowing exactly what data, tools, parameters, and versions produced each result. This skill automates provenance tracking.
encode_track_experiment(accession="ENCSR000AKA", notes="H3K27ac ChIP-seq in GM12878 for enhancer analysis")
Expected output:
{
"status": "tracked",
"accession": "ENCSR000AKA",
"notes": "H3K27ac ChIP-seq in GM12878 for enhancer analysis",
"tracked_at": "2025-03-08T10:00:00Z"
}
encode_download_files(file_accessions=["ENCFF001ABC"], download_dir="/data/chipseq")
Expected output:
{
"downloaded": 1,
"md5_verified": true,
"files": ["/data/chipseq/ENCFF001ABC.bed.gz"]
}
encode_log_derived_file(
source_accessions=["ENCFF001ABC", "ENCFF002DEF"],
derived_file="/data/analysis/gm12878_enhancers_filtered.bed",
description="Filtered H3K27ac peaks: removed blacklist regions, merged within 500bp, filtered signalValue > 5",
tool="bedtools v2.31.0",
parameters="intersect -v (blacklist), merge -d 500, filter signalValue > 5"
)
Expected output:
{
"status": "logged",
"derived_file": "/data/analysis/gm12878_enhancers_filtered.bed",
"source_count": 2,
"logged_at": "2025-03-08T11:00:00Z"
}
encode_get_provenance(file_path="/data/analysis/gm12878_enhancers_filtered.bed")
Expected output:
{
"file": "/data/analysis/gm12878_enhancers_filtered.bed",
"description": "Filtered H3K27ac peaks",
"tool": "bedtools v2.31.0",
"sources": [
{"accession": "ENCFF001ABC", "type": "encode_file"},
{"accession": "ENCFF002DEF", "type": "encode_file"}
],
"logged_at": "2025-03-08T11:00:00Z"
}
encode_get_tracking_summary()
Interpretation: The complete provenance chain enables automatic generation of methods sections: "H3K27ac ChIP-seq peaks (ENCFF001ABC) were filtered using ENCODE blacklist v2 (Amemiya et al. 2019) with bedtools v2.31.0..."
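Before methods are generated from a provenance record, it may help to check that the record is complete. The sketch below assumes the record shape shown in the expected output above; the accession pattern (ENC, two letters, three digits, three letters) is inferred from the examples in this document rather than taken from an official specification, and `provenance_problems` is a hypothetical name.

```python
import re

# Pattern inferred from examples such as ENCSR000AKA and ENCFF001ABC (an assumption).
ACCESSION_RE = re.compile(r"^ENC[A-Z]{2}\d{3}[A-Z]{3}$")
# Matches version strings such as "v2.31.0" or "1.19".
VERSION_RE = re.compile(r"\bv?\d+(\.\d+)+")


def provenance_problems(record):
    """Return a list of human-readable problems found in one provenance record."""
    problems = []
    if not VERSION_RE.search(record.get("tool", "")):
        problems.append("tool has no version")
    sources = record.get("sources", [])
    if not sources:
        problems.append("no source accessions")
    for source in sources:
        accession = source.get("accession", "")
        if not ACCESSION_RE.match(accession):
            problems.append(f"malformed accession: {accession!r}")
    return problems
```

An empty list means the record names a versioned tool and at least one well-formed source accession; anything else should be fixed in the log before the text is drafted.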
encode_track_experiment(accession="ENCSR000AKA", notes="GM12878 H3K27ac for enhancer catalog")
Expected output:
{
"status": "tracked",
"accession": "ENCSR000AKA",
"notes": "GM12878 H3K27ac for enhancer catalog"
}
encode_log_derived_file(
source_accessions=["ENCFF001ABC"],
derived_file="/data/peaks_filtered.bed",
description="Blacklist-filtered peaks",
tool="bedtools v2.31.0"
)
Expected output:
{
"status": "logged",
"derived_file": "/data/peaks_filtered.bed",
"source_count": 1
}
encode_get_provenance(file_path="/data/peaks_filtered.bed")
Expected output:
{
"file": "/data/peaks_filtered.bed",
"tool": "bedtools v2.31.0",
"sources": [{"accession": "ENCFF001ABC", "type": "encode_file"}]
}
| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| Provenance chain (accession → derived files) | Methods section generation | scientific-writing skill |
| Logged analysis steps with parameters | Reproducibility audit | publication-trust skill |
| MD5-verified file records | Data availability statement | cite-encode skill |
| Sequential script numbering | Pipeline documentation | pipeline-guide skill |
| Complete tool + version records | Tool citation list | cite-encode → BibTeX export |
- pipeline-guide: ENCODE pipeline execution and monitoring
- cite-encode: Generating citations and bibliography for ENCODE data
- quality-assessment: Evaluating quality of ENCODE experiments
- multi-omics-integration: Multi-omics workflows that generate provenance
- histone-aggregation: Aggregation workflows that produce derived files
- accessibility-aggregation: ATAC-seq aggregation with provenance
- geo-connector: Log cross-references between ENCODE and GEO datasets
- cross-reference: Link experiments to PubMed, DOI, GEO, NCT IDs
- publication-trust: Verify literature claims backing analytical decisions
npx claudepluginhub ammawla/encode-toolkit
Track exact provenance for every operation on ENCODE data (tool versions, reference files, scripts, parameters, and timestamps) to enable publication-ready methods writing. Use when the user processes ENCODE files, runs any bioinformatics tool, creates filtered/merged datasets, runs pipelines, performs liftover, uses R/Python/Bash for analysis, or needs to document their analysis chain for reproducibility and publication. Also use when the user says "write me methods" to auto-generate methods sections from the provenance log. This skill implements comprehensive provenance documentation: every tool, every version, every reference file, every parameter, every accession, with no shortcuts. Use this skill for ANY processing step, ANY file transformation, ANY analysis operation on ENCODE data.