Generates publication-ready methods sections, figure legends, supplementary tables, and data availability statements from ENCODE analysis provenance. Use for reproducible genomics paper writing from logged pipelines.
Slash command: `/encode-toolkit:scientific-writing`
Generate publication-quality scientific writing from ENCODE analysis records. This skill integrates with `data-provenance` and `cite-encode` to auto-generate methods from logged pipeline runs. Every generated section follows rigorous scientific documentation standards -- complete reporting of all experimental and computational parameters with zero ambiguity.
Most methods sections in genomics papers are incomplete. They omit software versions, skip reference file details, conflate technical and biological replicates, and use phrases like "default parameters" without stating what those defaults are. Reviewers catch these omissions, and readers cannot reproduce the analysis.
This skill solves the problem by generating methods text directly from the provenance chain. When every processing step has been logged (via data-provenance), the methods section writes itself. When metadata has been captured from ENCODE (via track-experiments), the experimental details are already recorded. This skill assembles these records into publication-ready prose, figure legends, supplementary tables, and data availability statements.
The documentation standard below is not aspirational; it is the minimum bar for reproducible science.
Every methods section MUST report the following fields. Omitting any of these fields produces an incomplete methods section that reviewers will flag and readers cannot reproduce.
| Field | Example | Why Required |
|---|---|---|
| Library preparation | TruSeq ChIP | Affects fragment size distribution and GC bias |
| Biological replicates | n=2 per condition | Statistical power and reproducibility |
| Cells/nuclei per replicate | 50,000 cells | Input sufficiency for the assay |
| Sequencing reads | 30M paired-end | Coverage depth determines sensitivity |
| Read length | 2x150 bp | Alignment accuracy and mappability |
| Paired/single-end | Paired-end | Fragment size estimation, structural variants |
| Sequencer | NovaSeq 6000 | Quality profile, error model, binning |
| Lab/batch | Snyder Lab, Stanford | Batch effect awareness |
| Reference genome | GRCh38/hg38 | Coordinate system for all downstream analysis |
| Gene annotation | GENCODE v44 | Gene definitions change between versions |
| ENCODE accessions | ENCSR133RZO | Exact data provenance for reproducibility |
| Blacklist version | ENCODE Blacklist v2 | Artifact exclusion affects all peak-based analyses |
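One way to enforce the table above is a completeness check that runs before any text is generated. The sketch below assumes the metadata has already been collected into a plain dictionary. The key names (`library_preparation`, `run_type`, and so on) are hypothetical labels chosen for this illustration and are not fields returned by any ENCODE tool.

```python
# Hypothetical key names mirroring the required-fields table; not an ENCODE schema.
REQUIRED_FIELDS = [
    "library_preparation",
    "biological_replicates",
    "cells_per_replicate",
    "sequencing_reads",
    "read_length",
    "run_type",  # paired-end or single-end
    "sequencer",
    "lab",
    "reference_genome",
    "gene_annotation",
    "encode_accessions",
    "blacklist_version",
]


def missing_fields(metadata):
    """Return the required fields that are absent or empty in `metadata`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        # Treat None, empty strings, and empty lists as "not reported".
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing


if __name__ == "__main__":
    record = {
        "library_preparation": "TruSeq ChIP",
        "biological_replicates": 2,
        "reference_genome": "GRCh38/hg38",
        "encode_accessions": ["ENCSR133RZO"],
    }
    print(missing_fields(record))
```

An empty result means every required field is present; anything else is a list of gaps to resolve before drafting.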
# Track the experiment to capture metadata
encode_track_experiment(accession="ENCSR...", fetch_publications=True)
# Get full experiment details
encode_get_experiment(accession="ENCSR...")
# Get file-level metadata
encode_get_file_info(accession="ENCFF...")
# Get provenance for derived files
encode_get_provenance(file_path="/path/to/derived/file.bed")
Each template below is a fill-in-the-blank paragraph that reads like a real methods section. Bracketed fields [like this] are populated from ENCODE metadata and provenance records. Every template follows these documentation standards.
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) data for
[target] in [biosample] were obtained from the ENCODE Project (ENCODE
Project Consortium 2020) under accession [ENCSR accession]. [Library
preparation method] libraries were prepared from [number] biological
replicates ([cells/nuclei] per replicate) and sequenced on an Illumina
[sequencer model] to generate [read count]M [paired-end/single-end]
reads of [read length] bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
trimmed with Trim Galore (v[version]; Krueger 2015) to remove adapter
sequences and low-quality bases (Phred < 20). Trimmed reads were aligned
to the [organism] reference genome ([assembly]) using BWA-MEM (v[version];
Li 2013) with parameters [state the actual values, even when defaults were used]. Duplicate reads were marked and removed
using Picard MarkDuplicates (v[version]; Broad Institute). Reads with
mapping quality < 30 were excluded using samtools (v[version]; Danecek
et al. 2021). Reads mapping to ENCODE Blacklist v2 regions (Amemiya et al.
2019) were removed using bedtools intersect (v[version]; Quinlan & Hall
2010).
Peaks were called using MACS2 (v[version]; Zhang et al. 2008) with
parameters [--broad for broad marks / -q 0.05 for narrow marks]. For
narrow-peak targets, IDR analysis (Li et al. 2011) was performed on
replicate peak sets with a threshold of [0.05]. Signal tracks (fold
change over control) were generated using MACS2 bdgcmp and converted
to bigWig format using bedGraphToBigWig (Kent et al. 2010). Of [N]
called peaks, [N] ([%]) passed IDR filtering and [N] ([%]) remained
after blacklist removal.
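Filling the bracketed fields in a template such as the ChIP-seq one above can be sketched with a regular expression. This is a minimal illustration, not the skill's actual implementation: the function name `fill_template` is hypothetical, and it assumes each placeholder name maps to one value. A template that repeats `[version]` for several tools would need per-tool keys in practice.

```python
import re

# Matches an innermost bracketed placeholder such as "[target]" or "[version]".
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template, values):
    """Fill [field] placeholders from `values`; return (text, unfilled names)."""
    unfilled = []

    def substitute(match):
        name = match.group(1)
        if name in values:
            return str(values[name])
        unfilled.append(name)
        return match.group(0)  # leave the placeholder visible for review

    return PLACEHOLDER.sub(substitute, template), unfilled


if __name__ == "__main__":
    text, gaps = fill_template(
        "ChIP-seq data for [target] in [biosample] were obtained under "
        "accession [ENCSR accession].",
        {"target": "H3K27ac", "biosample": "pancreas"},
    )
    print(text)
    print(gaps)
```

Unfilled placeholders stay in the text, so a missing field remains visible instead of silently disappearing.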
Assay for Transposase-Accessible Chromatin with sequencing (ATAC-seq) data
for [biosample] were obtained from the ENCODE Project (ENCODE Project
Consortium 2020) under accession [ENCSR accession]. [Number] biological
replicates of [input count] [cells/nuclei] each were transposed with
Tn5 transposase ([library kit]) and sequenced on an Illumina [sequencer]
to generate [read count]M [paired-end/single-end] reads of [read length]
bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
adapter-trimmed with Trim Galore (v[version]; Krueger 2015). Trimmed
reads were aligned to [assembly] using Bowtie2 (v[version]; Langmead &
Salzberg 2012) with parameters --very-sensitive -X 2000 --no-mixed
--no-discordant. Mitochondrial reads were removed. Duplicate reads were
removed using Picard MarkDuplicates (v[version]; Broad Institute). Reads
with mapping quality < 30 were excluded. Tn5 transposase offset
correction was applied (+4 bp on the positive strand, -5 bp on the
negative strand; Buenrostro et al. 2013). ENCODE Blacklist v2 regions
(Amemiya et al. 2019) were excluded.
Peaks were called using MACS2 (v[version]; Zhang et al. 2008) with
parameters --nomodel --shift -75 --extsize 150 --keep-dup all -q 0.05.
Nucleosome-free fragments (< 150 bp) were used for peak calling. Signal
tracks were generated as fold change over background. TSS enrichment
score was [value] (threshold >= 6; ENCODE data standards; Yan et al. 2020). Of [N]
called peaks, [N] ([%]) passed quality filtering.
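The Tn5 offset correction named in the template above can be illustrated as a coordinate shift. The sketch assumes 0-based, half-open BED-style intervals and shifts the whole read interval, which is a literal reading of "+4 bp / -5 bp". Real pipelines may instead adjust only the 5' end or operate on fragments. The function name `tn5_shift` is hypothetical.

```python
def tn5_shift(start, end, strand):
    """Shift a read interval by the Tn5 offset: +4 bp (+ strand), -5 bp (- strand)."""
    if strand == "+":
        offset = 4
    elif strand == "-":
        offset = -5
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # Clamp at zero so a shifted read never receives a negative coordinate.
    new_start = max(0, start + offset)
    new_end = max(new_start, end + offset)
    return new_start, new_end


if __name__ == "__main__":
    print(tn5_shift(100, 150, "+"))
    print(tn5_shift(100, 150, "-"))
```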
RNA sequencing (RNA-seq) data for [biosample] were obtained from the
ENCODE Project (ENCODE Project Consortium 2020) under accession [ENCSR
accession]. Total RNA was extracted from [number] biological replicates
and [library preparation method] libraries were prepared. Libraries were
sequenced on an Illumina [sequencer] to generate [read count]M
[paired-end/single-end] reads of [read length] bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
MultiQC (v[version]; Ewels et al. 2016). Adapter sequences were trimmed
with Trim Galore (v[version]; Krueger 2015). Reads were aligned to
[assembly] with [GENCODE annotation version] gene annotations using STAR
(v[version]; Dobin et al. 2013) in two-pass mode. Gene-level
quantification was performed using RSEM (v[version]; Li & Dewey 2011)
for expected counts and TPM values. Transcript-level quantification was
obtained with Kallisto (v[version]; Bray et al. 2016). Mapping rate was
[%] and rRNA contamination was [%] (thresholds: mapping 70-90%, rRNA <
10%; Conesa et al. 2016; ENCODE data standards). Replicate Pearson correlation was [r value]
(threshold >= 0.9).
Differential expression analysis was performed using DESeq2 (v[version];
Love et al. 2014) in R (v[version]). The design formula was [~ batch +
condition]. Genes with |log2 fold change| > [threshold] and Benjamini-
Hochberg adjusted p-value < [threshold] were considered differentially
expressed. Of [N] genes tested, [N] ([%]) were significantly
upregulated and [N] ([%]) were significantly downregulated.
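The counts and percentages in the final sentence of the template above follow directly from the two thresholds. The sketch below assumes results are available as `(gene, log2FC, padj)` tuples. The default thresholds of 1.0 and 0.05 are illustrative placeholders only, since the template leaves both as `[threshold]`, and `None` stands in for a missing adjusted p-value.

```python
def classify_genes(results, lfc_threshold=1.0, padj_threshold=0.05):
    """Count up- and downregulated genes from (gene, log2FC, padj) tuples."""
    counts = {"up": 0, "down": 0, "not_significant": 0}
    for _gene, log2fc, padj in results:
        # A missing adjusted p-value is treated as not significant.
        significant = padj is not None and padj < padj_threshold
        if significant and log2fc > lfc_threshold:
            counts["up"] += 1
        elif significant and log2fc < -lfc_threshold:
            counts["down"] += 1
        else:
            counts["not_significant"] += 1
    tested = len(results)
    percents = {
        key: (100.0 * value / tested if tested else 0.0)
        for key, value in counts.items()
    }
    return counts, percents


if __name__ == "__main__":
    demo = [("A", 2.0, 0.01), ("B", -1.5, 0.001), ("C", 0.2, 0.5), ("D", 3.0, None)]
    print(classify_genes(demo))
```

Whether genes without an adjusted p-value count toward "genes tested" is a reporting decision; the sketch includes them in the denominator, and the methods text should state which convention was used.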
Whole-genome bisulfite sequencing (WGBS) data for [biosample] were
obtained from the ENCODE Project (ENCODE Project Consortium 2020) under
accession [ENCSR accession]. Genomic DNA from [number] biological
replicates was bisulfite-converted using [conversion kit] and sequenced
on an Illumina [sequencer] to generate [read count]M [paired-end/
single-end] reads of [read length] bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
trimmed with Trim Galore (v[version]; Krueger 2015) in --rrbs mode
[if applicable] or standard mode. Trimmed reads were aligned to
[assembly] using Bismark (v[version]; Krueger & Andrews 2011) with
Bowtie2 as the underlying aligner. Duplicate reads were removed using
Bismark deduplicate_bismark. Methylation calls were extracted using
MethylDackel (v[version]; https://github.com/dpryan79/MethylDackel)
with parameters [--minDepth X --mergeContext]. Bisulfite conversion
efficiency was [%] (threshold >= 98%; ENCODE data standards). CpG sites with
coverage >= [X]x were retained for analysis ([N] CpGs, [%] of all
genomic CpGs).
Differentially methylated regions (DMRs) were identified using DMRcate
(v[version]; Peters et al. 2021) with parameters [lambda, C]. CpG
sites within ENCODE Blacklist v2 regions (Amemiya et al. 2019) were
excluded prior to DMR calling.
Hi-C data for [biosample] were obtained from the ENCODE Project (ENCODE
Project Consortium 2020) under accession [ENCSR accession]. Chromatin
was crosslinked, digested with [restriction enzyme], proximity-ligated,
and sequenced on an Illumina [sequencer] to generate [read count]M
[paired-end] reads of [read length] bp across [number] biological
replicates.
Read pairs were aligned to [assembly] using BWA-MEM (v[version]; Li
2013) and processed with pairtools (v[version]; Open2C et al. 2024) for
pair classification, deduplication, and filtering. Valid interaction
pairs (cis-reads with MAPQ >= 30) were retained. Contact matrices were
generated using cooler (v[version]; Abdennur & Mirny 2020) at
resolutions of [1kb, 5kb, 10kb, 25kb, 50kb, 100kb, 250kb, 500kb, 1Mb].
Matrices were balanced using iterative correction (ICE; Imakaev et al.
2012). Cis/trans ratio was [%] (threshold > 60%) and long-range cis
interactions (> 20kb) comprised [%] of total (threshold > 40%; Yardimci
et al. 2019).
Chromatin loops were called using HiCCUPS (Juicer v[version]; Durand
et al. 2016) at [resolutions] with FDR thresholds of [values; state them even if defaults were used].
Topologically associating domains (TADs) were identified using [method]
at [resolution]. [N] loops and [N] TADs were identified.
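The Hi-C QC percentages reported in the template above can be computed from a list of valid pairs. The sketch assumes pairs are given as `(chrom1, pos1, chrom2, pos2)` tuples and interprets the "cis/trans ratio" as the cis share of all pairs, which is an assumption of this example. The function name `hic_qc` is hypothetical.

```python
def hic_qc(pairs, long_range_bp=20_000):
    """Compute cis, trans, and long-range cis percentages of all pairs."""
    total = len(pairs)
    if total == 0:
        raise ValueError("no read pairs supplied")
    cis = 0
    long_range_cis = 0
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 == chrom2:
            cis += 1
            # Long-range cis: same chromosome, separated by more than the cutoff.
            if abs(pos2 - pos1) > long_range_bp:
                long_range_cis += 1
    return {
        "cis_pct": 100.0 * cis / total,
        "trans_pct": 100.0 * (total - cis) / total,
        "long_range_cis_pct": 100.0 * long_range_cis / total,
    }


if __name__ == "__main__":
    demo = [
        ("chr1", 100, "chr1", 50_000),
        ("chr1", 100, "chr1", 5_000),
        ("chr1", 100, "chr2", 100),
        ("chr2", 10, "chr2", 30_000),
    ]
    print(hic_qc(demo))
```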
Cleavage Under Targets and Release Using Nuclease (CUT&RUN) data for
[target] in [biosample] were obtained from the ENCODE Project (ENCODE
Project Consortium 2020) under accession [ENCSR accession]. [Number]
biological replicates of [cells/nuclei] each were incubated with
[antibody] and processed with protein A-MNase. Released DNA fragments
were sequenced on an Illumina [sequencer] to generate [read count]M
[paired-end] reads of [read length] bp per replicate.
Raw reads were assessed with FastQC (v[version]; Andrews 2010) and
trimmed with Trim Galore (v[version]; Krueger 2015). Reads were aligned
to [assembly] using Bowtie2 (v[version]; Langmead & Salzberg 2012) with
parameters --very-sensitive --no-mixed --no-discordant -X 700. Spike-in
reads (E. coli or S. cerevisiae) were aligned separately and used for
calibration normalization ([spike-in method]). Duplicate reads were
removed with Picard MarkDuplicates (v[version]; Broad Institute). ENCODE
Blacklist v2 and suspect list regions (Amemiya et al. 2019; Nordin et
al. 2023) were excluded.
Peaks were called using SEACR (v[version]; Meers et al. 2019) in
[stringent/relaxed] mode with [IgG control / numeric threshold]. Note:
CUT&RUN has different QC profiles than ChIP-seq; standard ChIP-seq
metrics (FRiP, NSC, RSC) should not be directly applied (Nordin et al.
2023). [N] peaks were identified across replicates.
Single-cell RNA sequencing (scRNA-seq) data for [biosample] were obtained
from the ENCODE Project (ENCODE Project Consortium 2020) under accession
[ENCSR accession]. [Cell/nuclei isolation method] was performed on
[number] biological replicates. Libraries were prepared using [10x
Genomics Chromium / Smart-seq2 / other platform] targeting [N] cells per
replicate, and sequenced on an Illumina [sequencer] to generate [read
count]M reads per sample.
Raw reads were processed with [CellRanger v[version] (10x Genomics) /
STARsolo (STAR v[version]; Dobin et al. 2013)] to generate gene-cell
count matrices against [assembly] with [GENCODE annotation version].
Ambient RNA contamination was estimated and removed using [SoupX
(v[version]; Young & Behjati 2020) / CellBender (v[version]; Fleming
et al. 2023)]. Doublets were detected and removed using [Scrublet
(v[version]; Wolock et al. 2019) / DoubletFinder (v[version]; McGinnis
et al. 2019)]. Cells with [< N genes, > N% mitochondrial reads, < N
UMIs] were excluded.
Data were normalized using [SCTransform (v[version]; Hafemeister & Satija
2019) / scran (v[version]; Lun et al. 2016)] and integrated across
samples using [Harmony (v[version]; Korsunsky et al. 2019) / scVI
(v[version]; Lopez et al. 2018)]. Dimensionality reduction was performed
using PCA ([N] components) followed by UMAP (n_neighbors=[N],
min_dist=[value]; McInnes et al. 2018). Clustering was performed using
the Leiden algorithm (resolution=[value]; Traag et al. 2019) implemented
in [Seurat v[version] (Hao et al. 2024) / Scanpy v[version] (Wolf et
al. 2018)]. Cell types were annotated based on [marker genes / reference
mapping / automated annotation with method]. Final dataset comprised [N]
cells across [N] clusters representing [N] cell types.
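The cell-exclusion criteria in the template above amount to three threshold tests per cell. A minimal sketch, assuming per-cell QC values are held in dictionaries with the hypothetical keys `n_genes`, `mito_pct`, and `n_umis`:

```python
def filter_cells(cells, min_genes, max_mito_pct, min_umis):
    """Keep cells passing all three QC thresholds; return (kept, n_removed)."""
    kept = [
        cell
        for cell in cells
        if cell["n_genes"] >= min_genes
        and cell["mito_pct"] <= max_mito_pct
        and cell["n_umis"] >= min_umis
    ]
    return kept, len(cells) - len(kept)


if __name__ == "__main__":
    demo = [
        {"id": "c1", "n_genes": 1500, "mito_pct": 3.0, "n_umis": 4000},
        {"id": "c2", "n_genes": 150, "mito_pct": 2.0, "n_umis": 300},
        {"id": "c3", "n_genes": 2000, "mito_pct": 35.0, "n_umis": 6000},
    ]
    kept, removed = filter_cells(demo, min_genes=200, max_mito_pct=20.0, min_umis=500)
    print([cell["id"] for cell in kept], removed)
```

The threshold values in the demo are illustrative only; the methods text should report the values actually applied and the number of cells removed.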
Figure [N]. Genome browser view of [mark/signal] at the [gene name]
locus ([chr]:[start]-[end], [assembly]). Tracks shown from top to
bottom: [track 1 description, e.g., "H3K27ac ChIP-seq signal in
pancreatic islets (ENCSR...)"], [track 2], [track 3]. Signal tracks
represent fold change over input control. [Gene models from GENCODE
v[version] are shown at bottom.] [Shaded region highlights the
[promoter/enhancer/regulatory element] of interest.] Data were obtained
from the ENCODE Project (ENCODE Project Consortium 2020).
Figure [N]. Heatmap of [signal type, e.g., "H3K27ac ChIP-seq signal"]
across [N] [regions/genes/peaks] in [N] [samples/conditions/cell types].
Rows represent [individual peaks / genes / genomic regions], sorted by
[k-means clustering (k=[N]) / hierarchical clustering (method=[ward.D2],
distance=[euclidean]) / signal intensity]. Columns represent [samples /
cell types / conditions]. Color scale indicates [log2 fold change /
z-scored signal / CPM / RPKM] with [color scheme, e.g., "blue-white-red
diverging scale, range [-2, 2]"]. [Normalization method: quantile /
library size / spike-in.] Heatmap was generated using [deepTools
computeMatrix + plotHeatmap (v[version]; Ramirez et al. 2016) /
ComplexHeatmap (v[version]; Gu et al. 2016) / pheatmap (v[version];
Kolde 2019)].
Figure [N]. Volcano plot of differentially expressed genes between
[condition A] and [condition B] in [biosample/cell type]. X-axis shows
log2 fold change; y-axis shows -log10 adjusted p-value (Benjamini-
Hochberg correction). Significance thresholds: |log2FC| > [threshold]
(vertical dashed lines) and adjusted p-value < [threshold] (horizontal
dashed line). Red points: [N] significantly upregulated genes. Blue
points: [N] significantly downregulated genes. Gray points: [N] non-
significant genes. [Selected genes are labeled.] Differential expression
analysis was performed using [DESeq2 / edgeR / limma-voom] (v[version]).
[N] total genes were tested.
Figure [N]. Hi-C contact frequency map of [region, e.g., "chromosome 7:
25-30 Mb"] in [biosample] at [resolution, e.g., "10 kb"] resolution.
Upper triangle: [observed / Knight-Ruiz balanced / ICE-normalized]
contact frequencies displayed on a [log / linear] color scale. [Lower
triangle: [comparison condition / O/E ratio / difference map].] [Loops
identified by HiCCUPS are marked with [circles/squares].] [TAD
boundaries identified by [method] are shown as [lines/triangles].]
Contact matrix was generated using [cooler (Abdennur & Mirny 2020) /
Juicer (Durand et al. 2016)] from [N]M valid read pairs. Data were
obtained from ENCODE accession [ENCSR...].
Figure [N]. [UMAP / tSNE] projection of [N] single cells from
[biosample] colored by [cluster identity / gene expression / sample
origin / cell cycle phase]. [UMAP parameters: n_neighbors=[N],
min_dist=[value], computed on [N] principal components. / tSNE
parameters: perplexity=[N], computed on [N] principal components.]
[N] clusters were identified using the [Leiden / Louvain] algorithm
(resolution=[value]). Cell types were annotated based on [canonical
marker gene expression / reference-based mapping using [SingleR /
Azimuth / scArches] / manual curation]. [Inset: expression of [gene
name] across clusters, showing enrichment in [cell type].]
Figure [N]. [Venn diagram / UpSet plot] showing overlap of [peak sets /
gene lists / regulatory elements] across [N] [conditions / cell types /
datasets]. Set definitions: [Set A] = [N] [peaks/genes] from [source],
[Set B] = [N] [peaks/genes] from [source][, Set C = ...]. Overlaps were
computed using [bedtools intersect (minimum [N] bp overlap) / exact gene
ID matching]. [N] [peaks/genes] ([%]) were shared across all sets; [N]
([%]) were unique to [Set A]. [Statistical significance of overlap was
assessed using [Fisher exact test / hypergeometric test / permutation
test (N=[iterations])]; p = [value].] [UpSet plot generated with
UpSetR (v[version]; Conway et al. 2017).]
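For gene-list overlaps with a well-defined universe, the hypergeometric option mentioned in the legend above can be computed with the standard library alone. This is a sketch under the assumption that both sets are drawn from the same universe of known size; for genomic intervals, where the universe is harder to define, the permutation option may be the better fit. The function name is hypothetical.

```python
from math import comb


def hypergeom_overlap_pvalue(universe, size_a, size_b, overlap):
    """One-sided P(overlap >= observed) under the hypergeometric null."""
    if not (0 <= overlap <= min(size_a, size_b) and max(size_a, size_b) <= universe):
        raise ValueError("inconsistent set sizes")
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    print(hypergeom_overlap_pvalue(universe=10, size_a=3, size_b=3, overlap=2))
```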
Figure [N]. Genomic feature distribution of [N] [peak type, e.g.,
"H3K27ac peaks"] in [biosample]. Peaks were annotated using ChIPseeker
(v[version]; Yu et al. 2015) with [GENCODE v[version] / TxDb annotation
version] gene models. Bars show percentage of peaks overlapping each
genomic feature category: promoter (<= [N] kb from TSS, [%]), 5' UTR
([%]), 3' UTR ([%]), exon ([%]), intron ([%]), downstream (<= [N] kb,
[%]), and intergenic ([%]). [Background genomic distribution is shown
for comparison (gray bars).] [A second panel shows the distribution of
peak distance to nearest TSS.]
A master table listing all ENCODE experiments used in the study.
| Column | Description | Source |
|---|---|---|
| ENCODE Accession | ENCSR identifier | encode_get_experiment |
| Assay | Assay type (ChIP-seq, ATAC-seq, etc.) | encode_get_experiment |
| Target | Antibody target (for ChIP/CUT&RUN) | encode_get_experiment |
| Biosample | Tissue or cell type | encode_get_experiment |
| Organism | Species | encode_get_experiment |
| Assembly | Genome build | encode_get_experiment |
| Lab | Submitting laboratory | encode_get_experiment |
| Biological Replicates | Number of bio reps | encode_get_experiment |
| Sequencing Reads | Total reads per replicate | File metadata |
| Read Length | bp, paired/single | File metadata |
| Sequencer | Instrument model | encode_get_experiment |
| Library | Library preparation method | encode_get_experiment |
| Date Released | Release date on ENCODE portal | encode_get_experiment |
| Audit Status | ENCODE quality audit level | encode_get_experiment |
Generate with:
encode_export_data(format="csv")
Assay-specific QC metrics for all experiments. Columns depend on the assay type.
ChIP-seq columns: Accession, Target, Total reads, Mapped reads, Mapping rate, Duplicate rate, NRF, PBC1, PBC2, NSC, RSC, FRiP, Peak count, IDR peaks
ATAC-seq columns: Accession, Total reads, Mapped reads, Mapping rate, Mitochondrial rate, Duplicate rate, NRF, TSS enrichment, Fragment size distribution (NFR/mono/di), FRiP, Peak count
RNA-seq columns: Accession, Total reads, Mapped reads, Mapping rate, rRNA rate, Exonic rate, Intronic rate, Intergenic rate, Genes detected (TPM > 1), Replicate correlation (Pearson r)
WGBS columns: Accession, Total reads, Mapped reads, Mapping rate, Bisulfite conversion rate, CpG coverage (mean), CpGs at 1x, CpGs at 5x, CpGs at 10x, Global methylation (%)
Hi-C columns: Accession, Total read pairs, Valid pairs, Cis pairs (%), Trans pairs (%), Long-range cis (%), Short-range cis (%), Duplicate rate, Library complexity
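QC thresholds (include as table footnotes), as stated in the methods templates above:
- ATAC-seq: TSS enrichment >= 6
- RNA-seq: mapping rate 70-90%, rRNA < 10%, replicate Pearson r >= 0.9
- WGBS: bisulfite conversion efficiency >= 98%
- Hi-C: cis > 60%, long-range cis (> 20 kb) > 40%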
For differential expression, differential accessibility, or differential methylation analyses.
| Column | Description |
|---|---|
| Feature ID | Gene ID / Peak ID / CpG ID |
| Gene Symbol | Gene name (for expression) |
| Chromosome | chr |
| Start | Start coordinate |
| End | End coordinate |
| log2 Fold Change | Effect size |
| Standard Error | SE of log2FC |
| Stat | Test statistic |
| P-value | Raw p-value |
| Adjusted P-value | BH-corrected FDR |
| Base Mean / Mean Signal | Average expression or signal |
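The "Adjusted P-value" column above refers to the Benjamini-Hochberg procedure. Analysis packages compute this internally, so the following sketch is only meant to make the column's definition concrete. It assumes a plain list of raw p-values with no missing entries.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```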
For aggregated or filtered peak sets.
| Column | Description |
|---|---|
| Chromosome | chr |
| Start | Start coordinate (0-based) |
| End | End coordinate |
| Name | Peak identifier |
| Score | Scaled score (0-1000) |
| Signal Value | Fold enrichment or signal |
| P-value | -log10(p-value) |
| Q-value | -log10(q-value) |
| Summit | Distance from start to summit |
| Source Accession | ENCODE file accession |
| Biosample | Source tissue/cell type |
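Most columns in the table above correspond to the ten-column narrowPeak format, with the source accession and biosample appended. A sketch of that mapping follows; the function name and the use of a dictionary per row are choices made for this illustration.

```python
def narrowpeak_to_row(line, source_accession, biosample):
    """Convert one tab-separated narrowPeak line into a peak-table row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 narrowPeak columns, got {len(fields)}")
    chrom, start, end, name, score, _strand, signal, pval, qval, summit = fields
    return {
        "Chromosome": chrom,
        "Start": int(start),           # 0-based
        "End": int(end),
        "Name": name,
        "Score": int(score),           # scaled 0-1000
        "Signal Value": float(signal),
        "P-value": float(pval),        # -log10(p-value)
        "Q-value": float(qval),        # -log10(q-value)
        "Summit": int(summit),         # offset from Start
        "Source Accession": source_accession,
        "Biosample": biosample,
    }


if __name__ == "__main__":
    demo = "chr1\t100\t600\tpeak_1\t850\t.\t12.5\t30.2\t27.9\t250"
    print(narrowpeak_to_row(demo, "ENCFF123ABC", "pancreas"))
```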
Every tool used in the analysis with version and citation.
| Column | Description |
|---|---|
| Software | Tool name |
| Version | Exact version string |
| Purpose | What it was used for |
| Citation | Publication reference |
| DOI | Digital Object Identifier |
| URL | Download or documentation URL |
All sequencing data used in this study are publicly available through
the ENCODE Project portal (https://www.encodeproject.org) under the
following experiment accessions: [ENCSR list]. Processed files including
[peak calls / gene quantifications / contact matrices / methylation
calls] used in this analysis are available under file accessions [ENCFF
list]. ENCODE data are released under unrestricted use policy with no
embargo period (ENCODE Project Consortium 2020).
All derived data generated in this study, including [filtered peak sets /
merged catalogs / differential results / aggregated matrices], are
available at [repository URL / GEO accession / Zenodo DOI]. Complete
analysis scripts, provenance logs, and software environment
specifications are available at [GitHub URL]. A detailed description of
all processing steps, parameters, and software versions is provided in
Supplementary Table S[N].
Code used for all analyses in this study is available at [GitHub URL].
The repository includes [Nextflow / Snakemake / Bash / R / Python]
scripts for [list major analyses], along with the complete software
environment specification ([Docker image / conda environment YAML /
renv.lock]). All computational tools and their versions are listed in
Supplementary Table S[N].
Raw sequencing data generated in this study have been deposited in the
Gene Expression Omnibus (GEO) under accession [GSE number]. Processed
data are available at the same accession. Previously published ENCODE
data used in this study are available at https://www.encodeproject.org
under accessions listed in Supplementary Table S1.
Every bioinformatics tool used in an analysis must be cited with its full publication reference. The following table covers the most commonly used tools in ENCODE-based analyses.
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| BWA-MEM | Li H | arXiv | 2013 | 10.48550/arXiv.1303.3997 |
| Bowtie2 | Langmead B & Salzberg SL | Nature Methods | 2012 | 10.1038/nmeth.1923 |
| STAR | Dobin A et al. | Bioinformatics | 2013 | 10.1093/bioinformatics/bts635 |
| HISAT2 | Kim D et al. | Nature Biotechnology | 2019 | 10.1038/s41587-019-0201-4 |
| minimap2 | Li H | Bioinformatics | 2018 | 10.1093/bioinformatics/bty191 |
| samtools | Danecek P et al. | GigaScience | 2021 | 10.1093/gigascience/giab008 |
| Picard | Broad Institute | -- | -- | broadinstitute.github.io/picard |
| sambamba | Tarasov A et al. | Bioinformatics | 2015 | 10.1093/bioinformatics/btv098 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| FastQC | Andrews S | Babraham Bioinformatics | 2010 | bioinformatics.babraham.ac.uk |
| MultiQC | Ewels P et al. | Bioinformatics | 2016 | 10.1093/bioinformatics/btw354 |
| Trim Galore | Krueger F | Babraham Bioinformatics | 2015 | github.com/FelixKrueger/TrimGalore |
| fastp | Chen S et al. | Bioinformatics | 2018 | 10.1093/bioinformatics/bty560 |
| Preseq | Daley T & Smith AD | Nature Methods | 2013 | 10.1038/nmeth.2375 |
| Qualimap | Okonechnikov K et al. | Bioinformatics | 2016 | 10.1093/bioinformatics/btv566 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| MACS2 | Zhang Y et al. | Genome Biology | 2008 | 10.1186/gb-2008-9-9-r137 |
| IDR | Li Q et al. | Annals of Applied Statistics | 2011 | 10.1214/11-AOAS466 |
| SEACR | Meers MP et al. | Epigenetics & Chromatin | 2019 | 10.1186/s13072-019-0287-4 |
| Hotspot2 | John S et al. | -- | 2022 | github.com/Altius/hotspot2 |
| HOMER | Heinz S et al. | Molecular Cell | 2010 | 10.1016/j.molcel.2010.05.004 |
| F-Seq2 | Zhao H et al. | Bioinformatics | 2020 | 10.1093/bioinformatics/btab273 |
| deepTools | Ramirez F et al. | Nucleic Acids Research | 2016 | 10.1093/nar/gkw257 |
| bedtools | Quinlan AR & Hall IM | Bioinformatics | 2010 | 10.1093/bioinformatics/btq033 |
| bedGraphToBigWig | Kent WJ et al. | Bioinformatics | 2010 | 10.1093/bioinformatics/btq351 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| RSEM | Li B & Dewey CN | BMC Bioinformatics | 2011 | 10.1186/1471-2105-12-323 |
| Kallisto | Bray NL et al. | Nature Biotechnology | 2016 | 10.1038/nbt.3519 |
| Salmon | Patro R et al. | Nature Methods | 2017 | 10.1038/nmeth.4197 |
| featureCounts | Liao Y et al. | Bioinformatics | 2014 | 10.1093/bioinformatics/btt656 |
| HTSeq | Anders S et al. | Bioinformatics | 2015 | 10.1093/bioinformatics/btu638 |
| DESeq2 | Love MI et al. | Genome Biology | 2014 | 10.1186/s13059-014-0550-8 |
| edgeR | Robinson MD et al. | Bioinformatics | 2010 | 10.1093/bioinformatics/btp616 |
| limma | Ritchie ME et al. | Nucleic Acids Research | 2015 | 10.1093/nar/gkv007 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| Bismark | Krueger F & Andrews SR | Bioinformatics | 2011 | 10.1093/bioinformatics/btr167 |
| MethylDackel | Ryan DP | GitHub | -- | github.com/dpryan79/MethylDackel |
| DMRcate | Peters TJ et al. | Epigenetics & Chromatin | 2021 | 10.1186/s13072-021-00428-1 |
| bsmap | Xi Y & Li W | BMC Bioinformatics | 2009 | 10.1186/1471-2105-10-232 |
| methylKit | Akalin A et al. | Genome Biology | 2012 | 10.1186/gb-2012-13-10-r87 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| pairtools | Open2C et al. | PLOS Computational Biology | 2024 | 10.1371/journal.pcbi.1012164 |
| cooler | Abdennur N & Mirny LA | Bioinformatics | 2020 | 10.1093/bioinformatics/btz540 |
| Juicer | Durand NC et al. | Cell Systems | 2016 | 10.1016/j.cels.2016.07.002 |
| HiCCUPS | Rao SSP et al. | Cell | 2014 | 10.1016/j.cell.2014.11.021 |
| FAN-C | Kruse K et al. | Genome Biology | 2020 | 10.1186/s13059-020-02215-9 |
| Mustache | Roayaei Ardakany A et al. | Genome Biology | 2020 | 10.1186/s13059-020-02167-0 |
| HiGlass | Kerpedjiev P et al. | Genome Biology | 2018 | 10.1186/s13059-018-1486-1 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| Seurat | Hao Y et al. | Nature Biotechnology | 2024 | 10.1038/s41587-023-01767-y |
| Scanpy | Wolf FA et al. | Genome Biology | 2018 | 10.1186/s13059-017-1382-0 |
| SCTransform | Hafemeister C & Satija R | Genome Biology | 2019 | 10.1186/s13059-019-1874-1 |
| scran | Lun ATL et al. | Genome Biology | 2016 | 10.1186/s13059-016-0947-7 |
| Harmony | Korsunsky I et al. | Nature Methods | 2019 | 10.1038/s41592-019-0619-0 |
| scVI | Lopez R et al. | Nature Methods | 2018 | 10.1038/s41592-018-0229-2 |
| CellRanger | 10x Genomics | -- | -- | support.10xgenomics.com |
| STARsolo | Kaminow B et al. | Cell Genomics | 2021 | 10.1016/j.xgen.2021.100004 |
| Scrublet | Wolock SL et al. | Cell Systems | 2019 | 10.1016/j.cels.2018.11.005 |
| DoubletFinder | McGinnis CS et al. | Cell Systems | 2019 | 10.1016/j.cels.2019.03.003 |
| SoupX | Young MD & Behjati S | GigaScience | 2020 | 10.1093/gigascience/giaa151 |
| SingleR | Aran D et al. | Nature Immunology | 2019 | 10.1038/s41590-018-0276-y |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| ChIPseeker | Yu G et al. | Bioinformatics | 2015 | 10.1093/bioinformatics/btv145 |
| GREAT | McLean CY et al. | Nature Biotechnology | 2010 | 10.1038/nbt.1630 |
| clusterProfiler | Wu T et al. | Innovation | 2021 | 10.1016/j.xinn.2021.100141 |
| GSEA | Subramanian A et al. | PNAS | 2005 | 10.1073/pnas.0506580102 |
| Enrichr | Kuleshov MV et al. | Nucleic Acids Research | 2016 | 10.1093/nar/gkw377 |
| DAVID | Huang DW et al. | Nature Protocols | 2009 | 10.1038/nprot.2008.211 |
| ChromHMM | Ernst J & Kellis M | Nature Methods | 2012 | 10.1038/nmeth.1906 |
| liftOver | Kent WJ et al. | Genome Research | 2002 | 10.1101/gr.229102 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| IGV | Robinson JT et al. | Nature Biotechnology | 2011 | 10.1038/nbt.1754 |
| ggplot2 | Wickham H | Springer | 2016 | ISBN: 978-3-319-24277-4 |
| ComplexHeatmap | Gu Z et al. | Bioinformatics | 2016 | 10.1093/bioinformatics/btw313 |
| Gviz | Hahne F & Ivanek R | Methods in Molecular Biology | 2016 | 10.1007/978-1-4939-3578-9_16 |
| pyGenomeTracks | Lopez-Delisle L et al. | Bioinformatics | 2021 | 10.1093/bioinformatics/btaa692 |
| HiGlass | Kerpedjiev P et al. | Genome Biology | 2018 | 10.1186/s13059-018-1486-1 |
| UpSetR | Conway JR et al. | Bioinformatics | 2017 | 10.1093/bioinformatics/btx364 |
| Tool | Citation | Journal | Year | DOI |
|---|---|---|---|---|
| Nextflow | Di Tommaso P et al. | Nature Biotechnology | 2017 | 10.1038/nbt.3820 |
| Snakemake | Molder F et al. | F1000Research | 2021 | 10.12688/f1000research.29032.2 |
| Docker | Merkel D | Linux Journal | 2014 | -- |
| Singularity | Kurtzer GM et al. | PLOS ONE | 2017 | 10.1371/journal.pone.0177459 |
The provenance chain logged by the data-provenance skill contains all information needed to auto-generate a complete methods section. The workflow is:
encode_get_provenance(file_path="/path/to/derived/final_peaks.bed")
This returns the full chain: derived_file -> processing_steps -> source ENCODE files.
From each provenance record, extract:
- `tool_used` field: tool name and version
- `parameters` field: command-line arguments or function calls
- `source_accessions` field: ENCODE experiment and file accessions

Match each tool to the appropriate methods template section:
encode_get_experiment(accession="ENCSR...")
Populate template fields: biosample, target, lab, replicates, sequencer, read length, read count, library preparation.
Cross-reference every tool in the provenance chain against the Tool Citation Reference table above. Compile a complete bibliography.
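The cross-referencing step can be sketched as a dictionary lookup that also reports tools with no citation on file. Only four entries from the citation tables are copied here for brevity, and the function name `compile_bibliography` is hypothetical.

```python
# A few entries copied from the citation tables above; extend as needed.
CITATIONS = {
    "bedtools": "Quinlan & Hall 2010",
    "samtools": "Danecek et al. 2021",
    "MACS2": "Zhang et al. 2008",
    "STAR": "Dobin et al. 2013",
}


def compile_bibliography(tools):
    """Split tool names into cited entries and tools lacking a citation."""
    # Case-insensitive lookup that preserves the canonical tool name.
    lookup = {name.lower(): (name, ref) for name, ref in CITATIONS.items()}
    cited, uncited = {}, []
    for tool in tools:
        hit = lookup.get(tool.lower())
        if hit:
            cited[hit[0]] = hit[1]
        elif tool not in uncited:
            uncited.append(tool)
    return cited, uncited


if __name__ == "__main__":
    print(compile_bibliography(["bedtools", "awk", "macs2"]))
```

Tools returned in the second list, such as `awk` in the demo, need a manual decision: add a citation, or describe the step in prose without one.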
Combine all sections into a complete methods draft:
Review the draft against the documentation standards checklist to ensure no required fields are missing.
Given a provenance chain:
final_peaks.bed
<- tool: bedtools v2.31.0, params: intersect -v blacklist
<- filtered_peaks.bed
<- tool: awk, params: '$7 >= 4.5'
<- ENCFF123ABC.bed (IDR thresholded peaks, ENCSR456DEF)
Auto-generated methods:
H3K27ac ChIP-seq data for human pancreas were obtained from the ENCODE
Project (ENCODE Project Consortium 2020) under accession ENCSR456DEF.
IDR-thresholded peaks (ENCFF123ABC) aligned to GRCh38 were selected for
downstream analysis. Peaks were filtered to retain only those with MACS2
signal value >= 4.5 (retaining 34,521 of 45,231 peaks; 76.3%). Peaks
overlapping ENCODE Blacklist v2 regions (Amemiya et al. 2019) were
excluded using bedtools intersect (v2.31.0; Quinlan & Hall 2010),
yielding 34,198 peaks (99.1% of filtered set) for downstream analysis.
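The retention figures in the example follow directly from the input and output counts of each filtering step. A small sketch of that arithmetic, assuming the counts are already known (for example from line counts of the BED files); the function name `retention_clause` is hypothetical:

```python
def retention_clause(kept, total, unit="peaks"):
    """Format 'retaining K of N unit; P%' from raw counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= kept <= total:
        raise ValueError("kept must be between 0 and total")
    percent = 100.0 * kept / total
    return f"retaining {kept:,} of {total:,} {unit}; {percent:.1f}%"


print(retention_clause(34521, 45231))  # retaining 34,521 of 45,231 peaks; 76.3%
print(retention_clause(34198, 34521))  # retaining 34,198 of 34,521 peaks; 99.1%
```

Computing the percentage from the counts, rather than typing it by hand, keeps the reported numbers consistent if a filter threshold changes.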
Not reporting software versions: "Reads were aligned with STAR" is unacceptable. Reviewers require "STAR v2.7.11a (Dobin et al. 2013)." Every tool needs name, version, and citation.
Missing biological replicate counts: "ChIP-seq was performed" does not tell readers whether results are from 1 replicate or 5. Always state: "n=[N] biological replicates per condition."
Confusing technical vs biological replicates: Technical replicates (same sample, sequenced twice) are NOT biological replicates (independent biological samples). Reporting "4 replicates" when 2 are technical inflates apparent statistical power.
Not specifying genome assembly version: "Reads were aligned to the human genome" is ambiguous. hg19/GRCh37, hg38/GRCh38, and T2T-CHM13 have different coordinates, and the UCSC and GRC releases of the same assembly use different chromosome naming. Always specify: "GRCh38/hg38."
Using "default parameters" without stating what those defaults are: "MACS2 was run with default parameters" is not reproducible because defaults change between versions. State the actual parameter values.
Missing blacklist filtering mention: Peaks or signal tracks without blacklist filtering contain artifact regions. Always report: "ENCODE Blacklist v2 regions (Amemiya et al. 2019) were excluded."
Not reporting IDR threshold for ChIP-seq: IDR analysis is standard for ENCODE ChIP-seq. Report: "IDR threshold of 0.05" or whichever threshold was used.
Omitting spike-in normalization details for CUT&RUN: CUT&RUN experiments commonly rely on spike-in normalization. When it is used, report the spike-in organism, calibration method, and scaling factor.
Not distinguishing paired-end from single-end in methods: PE and SE reads produce different fragment size estimates, alignment rates, and duplicate detection. Always specify.
Omitting gene annotation version: GENCODE v38 and v44 define different gene sets. "Genes were annotated using GENCODE" is insufficient -- state the exact version.
Not reporting filtering statistics: Every filtering step should report input count, output count, and percentage retained. "Peaks were filtered" without numbers is uninformative.
Missing normalization method: "Signal tracks were compared across samples" -- how? Library-size normalization? Spike-in? RPKM? RPM? This changes interpretation.
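Several of these pitfalls can be caught mechanically before a draft reaches a reviewer. The following is a rough sketch of such a check, not a validator shipped with the skill: the regular expressions are illustrative heuristics, the list of assemblies is deliberately short, and the name `lint_methods` is hypothetical.

```python
import re

# Each check pairs a pitfall label with a predicate that returns True when the
# pitfall is present in the methods text. Patterns are heuristics only.
CHECKS = [
    ("uses 'default parameters' without stating values",
     lambda text: re.search(r"\bdefault (parameters|settings)\b", text, re.I) is not None),
    ("no genome assembly named",
     lambda text: re.search(r"\b(GRCh3[78]|hg19|hg38|T2T-CHM13)\b", text) is None),
    ("no software version found",
     lambda text: re.search(r"\bv\d+(\.\d+)+[a-z]?\b", text) is None),
    ("no biological replicate count",
     lambda text: re.search(r"\bn\s*=\s*\d+", text) is None),
]


def lint_methods(text):
    """Return the labels of all pitfalls detected in a methods paragraph."""
    return [label for label, is_present in CHECKS if is_present(text)]


weak = "Reads were aligned with STAR using default parameters."
strong = "Reads (n=2 biological replicates) were aligned to GRCh38 with STAR v2.7.11a."
print(lint_methods(weak))
print(lint_methods(strong))
```

A clean result only means the text mentions the required items somewhere; it does not confirm that the stated values are correct.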
| Reference | Year | Journal | Key Contribution | Citations |
|---|---|---|---|---|
| Landt et al. | 2012 | Genome Research | ChIP-seq reporting guidelines and quality standards; established FRiP, NSC, RSC thresholds | ~3,400 |
| ENCODE Consortium | 2020 | Nature | ENCODE Phase 3 expanded encyclopaedias; defined current data standards and uniform pipelines | ~1,200 |
| Conesa et al. | 2016 | Genome Biology | RNA-seq best practices survey; defined mapping rate, rRNA, and correlation thresholds | ~4,500 |
| Wilkinson et al. | 2016 | Scientific Data | FAIR data principles (Findable, Accessible, Interoperable, Reusable); framework for data sharing | ~8,000 |
| Sandve et al. | 2013 | PLOS Computational Biology | Ten simple rules for reproducible computational research; foundational reproducibility guide | ~1,800 |
| Baker | 2016 | Nature | 1,500 scientists lift the lid on reproducibility; established scale of reproducibility crisis | ~3,200 |
| Amemiya et al. | 2019 | Scientific Reports | ENCODE Blacklist: identification and exclusion of artifact regions across genome assemblies | ~1,372 |
| Buenrostro et al. | 2013 | Nature Methods | ATAC-seq method development; defined TSS enrichment and nucleosomal fragment thresholds | ~5,000 |
| Foox et al. | 2021 | Genome Biology | WGBS benchmarking and quality standards; bisulfite conversion thresholds | ~200 |
| Yardimci et al. | 2019 | Genome Biology | Hi-C quality metrics; cis/trans ratio and long-range cis thresholds | ~150 |
| Nordin et al. | 2023 | Genome Biology | CUT&RUN/CUT&Tag QC: different profiles from ChIP-seq; suspect list regions | ~50 |
| This skill produces... | Feed into... | Using tool/skill |
|---|---|---|
| Methods section text | Manuscript draft | Publication submission |
| Figure legends | Manuscript figures | visualization-workflow outputs |
| Data availability statements | Manuscript appendix | cite-encode accession lists |
| Supplementary table templates | Manuscript supplements | track-experiments → encode_export_data |
| Tool citation paragraphs | Bibliography | cite-encode → BibTeX/RIS export |
Goal: Auto-generate a publication-ready methods section describing ENCODE data acquisition, processing, and analysis, with complete tool citations and reproducible parameters.

Context: Journal methods sections require precise documentation of every tool, version, parameter, and data source. This skill automates that process using tracked provenance data.
encode_list_tracked()
Expected output:
{
  "experiments": [
    {"accession": "ENCSR000AKA", "assay": "Histone ChIP-seq", "notes": "GM12878 H3K27ac"},
    {"accession": "ENCSR637ENO", "assay": "ATAC-seq", "notes": "GM12878 accessibility"}
  ]
}
encode_get_provenance(file_path="/data/analysis/enhancers_filtered.bed")
Expected output:
{
  "file": "/data/analysis/enhancers_filtered.bed",
  "tool": "bedtools v2.31.0",
  "sources": [{"accession": "ENCFF001ABC"}],
  "description": "Blacklist-filtered H3K27ac peaks"
}
encode_summarize_collection()
Using provenance data, generate: "H3K27ac ChIP-seq data for GM12878 (ENCSR000AKA) were obtained from the ENCODE Portal (ENCODE Project Consortium, 2012). IDR-thresholded peaks (ENCFF001ABC) were filtered against the ENCODE blacklist v2 (Amemiya et al., 2019) using bedtools v2.31.0 (Quinlan & Hall, 2010)."
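As a rough illustration of how the outputs above could be turned into prose, the sketch below fills a sentence template from one tracked-experiment entry and one provenance record shaped like the expected outputs shown. The `CITATIONS` lookup and the function `draft_methods_sentence` are hypothetical; details such as the histone mark or the blacklist version are not in these records and would still need to be added from experiment metadata.

```python
import os

# Hypothetical citation lookup keyed by lowercase tool name.
CITATIONS = {"bedtools": "Quinlan & Hall, 2010"}


def draft_methods_sentence(experiment, provenance, citations=CITATIONS):
    """Render a short methods draft from one experiment entry and one provenance record."""
    tool = provenance["tool"]
    cited = citations.get(tool.split()[0].lower())
    # Flag missing citations loudly instead of silently omitting them.
    tool_text = f"{tool} ({cited})" if cited else f"{tool} (CITATION NEEDED)"
    sources = ", ".join(src["accession"] for src in provenance["sources"])
    derived = os.path.basename(provenance["file"])
    return (
        f"{experiment['assay']} data ({experiment['accession']}) were obtained "
        f"from the ENCODE Portal. {derived} was derived from {sources} "
        f"using {tool_text}."
    )


experiment = {"accession": "ENCSR000AKA", "assay": "Histone ChIP-seq",
              "notes": "GM12878 H3K27ac"}
provenance = {"file": "/data/analysis/enhancers_filtered.bed",
              "tool": "bedtools v2.31.0",
              "sources": [{"accession": "ENCFF001ABC"}],
              "description": "Blacklist-filtered H3K27ac peaks"}
draft = draft_methods_sentence(experiment, provenance)
print(draft)
```

The output is a skeleton to edit, not final text: it guarantees that the accession, source file, tool, and version appear, and leaves the scientific framing to the author.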
encode_list_tracked()
Expected output:
{
  "experiments": [
    {"accession": "ENCSR000AKA", "assay": "Histone ChIP-seq"}
  ]
}
encode_get_provenance(file_path="/data/peaks_filtered.bed")
Expected output:
{
  "file": "/data/peaks_filtered.bed",
  "tool": "bedtools v2.31.0",
  "sources": [{"accession": "ENCFF001ABC"}]
}
encode_get_citations(accession="ENCSR000AKA")
Expected output:
{
  "citations": [
    {"pmid": "29126249", "title": "ENCODE encyclopedia", "year": 2012}
  ]
}
| Skill | Relationship |
|---|---|
| data-provenance | Source of provenance records for methods auto-generation; log every operation |
| cite-encode | Citation formatting and ENCODE data use policy compliance |
| quality-assessment | QC metrics to include in methods sections and Table S2 |
| pipeline-guide | Pipeline parameters for methods text; links to assay-specific pipeline skills |
| publication-trust | Verify integrity of cited publications before including in references |
| pipeline-chipseq | ChIP-seq pipeline parameters for methods template |
| pipeline-atacseq | ATAC-seq pipeline parameters for methods template |
| pipeline-rnaseq | RNA-seq pipeline parameters for methods template |
| pipeline-wgbs | WGBS pipeline parameters for methods template |
| pipeline-hic | Hi-C pipeline parameters for methods template |
| pipeline-cutandrun | CUT&RUN pipeline parameters for methods template |
| visualization-workflow | Figure generation workflow to pair with figure legends |
| peak-annotation | Peak annotation details for methods and figure legends |
| batch-analysis | Multi-experiment analyses requiring comprehensive methods |
When generating scientific writing, present:
Complete methods paragraph(s) in a code block for easy copy-paste. Every paragraph should be self-contained and ready for insertion into a manuscript.
Documentation standards checklist showing which required metadata fields are covered vs. missing:
[x] Library preparation: TruSeq ChIP
[x] Biological replicates: n=2
[x] Sequencing reads: 42.3M PE
[ ] Cells/nuclei per replicate: NOT FOUND -- check experiment metadata
[x] Read length: 76 bp
...
Citation list of all tools referenced in the methods, formatted for the target reference manager (BibTeX, RIS, or inline text).
Supplementary table structure with column headers and example rows, ready to populate.
Data availability statement draft customized to the specific ENCODE accessions used.
Missing information flags: If any required documentation fields cannot be populated from available metadata, flag them explicitly with suggestions for how to obtain them (e.g., "Cells per replicate not available in ENCODE metadata -- check the associated publication or contact the submitting lab").
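The checklist format shown above is straightforward to produce from a metadata dictionary. A minimal sketch, assuming metadata has already been collected into a flat dict; the key names in `REQUIRED_FIELDS` and the function `render_checklist` are hypothetical and cover only the fields from the example:

```python
# Hypothetical (key, label) pairs; a real checklist would list every required field.
REQUIRED_FIELDS = [
    ("library_preparation", "Library preparation"),
    ("biological_replicates", "Biological replicates"),
    ("sequencing_reads", "Sequencing reads"),
    ("cells_per_replicate", "Cells/nuclei per replicate"),
    ("read_length", "Read length"),
]


def render_checklist(metadata):
    """Render one '[x]' or '[ ]' line per required field."""
    lines = []
    for key, label in REQUIRED_FIELDS:
        value = metadata.get(key)
        if value in (None, ""):
            lines.append(f"[ ] {label}: NOT FOUND -- check experiment metadata")
        else:
            lines.append(f"[x] {label}: {value}")
    return "\n".join(lines)


metadata = {
    "library_preparation": "TruSeq ChIP",
    "biological_replicates": "n=2",
    "sequencing_reads": "42.3M PE",
    "read_length": "76 bp",
}
print(render_checklist(metadata))
```

Treating an empty string the same as a missing key means a field that was fetched but came back blank is still flagged rather than reported as covered.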
npx claudepluginhub ammawla/encode-toolkit --plugin encode-toolkit