Slash command: /curation-skills:curate-bulk-rnaseq
This skill guides processing of bulk RNA-seq datasets for VEuPathDB resources.
Files:
TODO.md
resources/editing-large-xml.md
resources/pdf-extraction.md
resources/step-1-fetch-metadata.md
resources/step-2-analyze-samples.md
resources/step-3-curate-contacts.md
resources/step-4-generate-presenter.md
resources/step-5-generate-outputs.md
resources/valid-projects.json
scripts/check-delivery-dirs.sh
scripts/check-repos.sh
scripts/fetch-miniml.js
scripts/fetch-sra-metadata.js
scripts/generate-analysis-config.js
scripts/generate-presenter-xml.js
scripts/generate-samplesheet.js
This workflow requires the following repositories in veupathdb-repos/:
ApiCommonPresenters
EbrcModelCommon
First, run the repository status check to verify repositories are present:
Note: this script is located in the skill directory
bash scripts/check-repos.sh ApiCommonPresenters EbrcModelCommon
If repositories are missing, the script will provide clone instructions.
Branch Confirmation: After verifying repositories exist, check their current branches and status using git -C <path>, then confirm with the user before proceeding.
Example:
git -C veupathdb-repos/ApiCommonPresenters branch --show-current
git -C veupathdb-repos/ApiCommonPresenters status -sb
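The two checks above can be repeated for each required repository. The following is a minimal sketch, not one of the skill's scripts: report_repo_state is a hypothetical helper name, and it assumes the repositories sit under veupathdb-repos/ inside the workspace directory passed as the first argument.

```shell
# Hypothetical helper (not part of the skill): print the current branch and
# short status of each named repository under <workspace>/veupathdb-repos/.
# Usage: report_repo_state <workspace-dir> <repo> [<repo>...]
report_repo_state() {
  rrs_workspace="$1"
  shift
  for rrs_repo in "$@"; do
    rrs_path="$rrs_workspace/veupathdb-repos/$rrs_repo"
    # git -C runs the command inside the repository without changing directory.
    if ! git -C "$rrs_path" rev-parse --git-dir >/dev/null 2>&1; then
      echo "$rrs_repo: no git repository at $rrs_path" >&2
      return 1
    fi
    echo "== $rrs_repo =="
    git -C "$rrs_path" branch --show-current
    git -C "$rrs_path" status -sb
  done
}
```

From the curation workspace, report_repo_state . ApiCommonPresenters EbrcModelCommon would print both reports. The confirmation with the user still has to happen afterwards.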
IMPORTANT: All commands in this workflow must be run from your curation workspace directory (the directory that contains veupathdb-repos/ as a subdirectory).
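One way to enforce the workspace requirement is a small guard that runs before any other command. This is a sketch under stated assumptions: in_curation_workspace is a hypothetical name, and the only thing it checks is the presence of the veupathdb-repos/ subdirectory described above.

```shell
# Hypothetical guard (not part of the skill): succeed only when the given
# directory (default: the current directory) contains veupathdb-repos/.
in_curation_workspace() {
  icw_dir="${1:-.}"
  if [ -d "$icw_dir/veupathdb-repos" ]; then
    return 0
  fi
  echo "veupathdb-repos/ not found in $icw_dir; run from the curation workspace" >&2
  return 1
}
```

Calling in_curation_workspace with no argument checks the current directory and returns a non-zero status when the workflow is started from the wrong place.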
For Claude Code:
Do not use cd commands to change into subdirectories
Use git -C <path> for git operations in subdirectories
The workflow creates:
tmp/ - Intermediate files (gitignored)
delivery/bulk-rnaseq/<BIOPROJECT>/ - Pipeline outputs (gitignored)
Gather the following before starting:
BioProject accession (e.g., PRJNA1018599)
If a journal article is available for this dataset, providing it enhances the curation workflow:
To include a PDF:
Save it as tmp/<BIOPROJECT>_article.pdf (e.g., tmp/PRJNA1018599_article.pdf)
The PDF will be processed by a subagent once in Step 1, and the extracted data saved to tmp/<BIOPROJECT>_pdf_extracted.json for use throughout the workflow.
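Because the BioProject accession ends up in every file name in this workflow, it may be worth checking its shape before running anything. The sketch below assumes the standard INSDC BioProject prefixes (PRJNA, PRJEB, PRJDB) followed by digits. Whether the skill's scripts accept all three prefixes is not stated here, and is_bioproject_accession is a hypothetical helper name.

```shell
# Hypothetical check (not part of the skill): accept PRJNA/PRJEB/PRJDB
# followed by one or more digits, e.g. PRJNA1018599.
is_bioproject_accession() {
  case "${1:-}" in
    PRJNA*|PRJEB*|PRJDB*) iba_digits="${1#PRJ??}" ;;  # drop the 5-letter prefix
    *) return 1 ;;
  esac
  case "$iba_digits" in
    ""|*[!0-9]*) return 1 ;;  # nothing after the prefix, or a non-digit
    *) return 0 ;;
  esac
}
```

A GEO series identifier or a bare prefix is rejected, which catches the most common mix-ups early.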
Fetch run-level metadata from ENA and sample attributes from NCBI BioSample. If a journal article PDF is available, extract key information for use in later steps.
Commands:
node scripts/fetch-sra-metadata.js <BIOPROJECT>
Output: tmp/<BIOPROJECT>_sra_metadata.json
Optional - Fetch MINiML for GEO-linked datasets:
node scripts/fetch-miniml.js <BIOPROJECT>
Output: tmp/<GSE>_family.xml (if GEO-linked)
Optional - Extract PDF data:
If tmp/<BIOPROJECT>_article.pdf is present, a subagent will extract it (do not read it yourself).
Output (on success): tmp/<BIOPROJECT>_pdf_extracted.json
Detailed instructions: Step 1 - Fetch Metadata
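The optional PDF branch of Step 1 comes down to which of two files exists in tmp/. A minimal sketch of that decision follows. The helper name pdf_extraction_state and the three state labels are assumptions for illustration; only the two file names come from the workflow above.

```shell
# Hypothetical helper (not part of the skill): report where the optional
# article PDF stands for a BioProject. Prints: none | pending | extracted
# Usage: pdf_extraction_state <BIOPROJECT> [workspace-dir]
pdf_extraction_state() {
  pes_bioproject="$1"
  pes_tmp="${2:-.}/tmp"
  if [ -f "$pes_tmp/${pes_bioproject}_pdf_extracted.json" ]; then
    echo extracted   # extraction output already saved
  elif [ -f "$pes_tmp/${pes_bioproject}_article.pdf" ]; then
    echo pending     # PDF supplied, extraction not yet done
  else
    echo none        # no article provided
  fi
}
```

A result of pending is the case where the subagent extraction still needs to run. The function only inspects file presence and never reads the PDF itself.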
Claude analyzes the fetched metadata to:
Output: tmp/<BIOPROJECT>_sample_annotations.json
Detailed instructions: Step 2 - Analyze Samples
Identify and curate contact entries from GEO contributors or BioProject submitters.
Actions:
veupathdb-repos/EbrcModelCommon/Model/lib/xml/datasetPresenters/contacts/allContacts.xml
Detailed instructions: Step 3 - Curate Contacts
Generate the datasetPresenter XML, review/edit it, then insert into the presenter file.
Command:
node scripts/generate-presenter-xml.js <BIOPROJECT> <PROJECT> <PRIMARY_CONTACT_ID> [ADDITIONAL_CONTACT_IDS...]
Output: tmp/<BIOPROJECT>_presenter.xml
Workflow:
Target file: veupathdb-repos/ApiCommonPresenters/Model/lib/xml/datasetPresenters/<PROJECT>.xml
Detailed instructions: Step 4 - Generate Presenter
Generate pipeline configuration files for the data processing team.
Commands:
bash scripts/check-delivery-dirs.sh bulk-rnaseq <BIOPROJECT>
node scripts/generate-analysis-config.js <BIOPROJECT> [--strand-specific]
node scripts/generate-samplesheet.js <BIOPROJECT> [strandedness]
The strandedness argument accepts: stranded, unstranded, or auto. If omitted, the script checks _pdf_extracted.json and _sample_annotations.json before falling back to auto.
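The fallback order described above can be sketched as a small function. This is an illustration of the described behaviour, not the code in generate-samplesheet.js. It assumes the values from _pdf_extracted.json and _sample_annotations.json have already been read into plain strings, that they are consulted in that order, and that an empty string means "not specified". The name resolve_strandedness is hypothetical.

```shell
# Hypothetical sketch of the fallback order (not the script's actual code).
# $1: explicit command-line argument
# $2: value taken from _pdf_extracted.json      (assumed already extracted)
# $3: value taken from _sample_annotations.json (assumed already extracted)
resolve_strandedness() {
  for rs_candidate in "${1:-}" "${2:-}" "${3:-}"; do
    case "$rs_candidate" in
      stranded|unstranded|auto)
        echo "$rs_candidate"   # first valid value wins
        return 0
        ;;
      "")
        ;;                     # not specified here, try the next source
      *)
        echo "invalid strandedness: $rs_candidate" >&2
        return 2
        ;;
    esac
  done
  echo auto                    # nothing specified anywhere
}
```

An explicit argument always takes precedence, so resolve_strandedness stranded unstranded "" yields stranded, while three empty inputs yield auto.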
Outputs in delivery/bulk-rnaseq/<BIOPROJECT>/:
analysisConfig.xml - Pipeline configuration
samplesheet.csv - Also for the processing pipeline
Detailed instructions: Step 5 - Generate Outputs
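Before handing anything over, it can help to confirm that both Step 5 outputs exist and are non-empty. The sketch below checks only the two files named above. check_delivery_outputs is a hypothetical helper, and it does not validate the contents of either file.

```shell
# Hypothetical check (not part of the skill): verify that both Step 5 outputs
# exist and are non-empty under delivery/bulk-rnaseq/<BIOPROJECT>/.
# Usage: check_delivery_outputs <BIOPROJECT> [workspace-dir]
check_delivery_outputs() {
  cdo_dir="${2:-.}/delivery/bulk-rnaseq/$1"
  cdo_status=0
  for cdo_name in analysisConfig.xml samplesheet.csv; do
    if [ -s "$cdo_dir/$cdo_name" ]; then
      echo "ok: $cdo_dir/$cdo_name"
    else
      echo "missing or empty: $cdo_dir/$cdo_name" >&2
      cdo_status=1
    fi
  done
  return "$cdo_status"
}
```

The function reports every file rather than stopping at the first problem, so a single run shows everything that still needs to be generated.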
After completing this workflow:
Hand off delivery/bulk-rnaseq/<BIOPROJECT>/ to the data processing team
Scripts:
scripts/fetch-sra-metadata.js - Fetches SRA run metadata from ENA + BioSample attributes from NCBI
scripts/fetch-miniml.js - Fetches MINiML XML for GEO-linked datasets
scripts/generate-presenter-xml.js - Generates RNA-seq datasetPresenter XML
scripts/generate-analysis-config.js - Generates analysisConfig.xml for pipeline
scripts/generate-samplesheet.js - Generates/delivers samplesheet.csv and sampleAnnotations.json
scripts/check-repos.sh - Validates veupathdb-repos/ repository setup (synced from shared/)
scripts/check-delivery-dirs.sh - Creates delivery directory structure (synced from shared/)