From curation-skills
Process genome assembly datasets for VEuPathDB resources
Slash command: /curation-skills:curate-genome-assembly
This skill guides processing of genome assembly datasets for VEuPathDB resources.
Skill files:
- TODO.md
- resources/curator-branching.md
- resources/editing-large-xml.md
- resources/step-1-fetch-ncbi.md
- resources/step-2-fetch-bioproject.md
- resources/step-3-fetch-pubmed.md
- resources/step-4-curate-contacts.md
- resources/step-5-update-presenter.md
- resources/valid-projects.json
- scripts/check-repos.sh
- scripts/fetch-bioproject.js
- scripts/fetch-pubmed.js
- scripts/generate-presenter-xml.js
This workflow requires the following repositories in veupathdb-repos/:
- ApiCommonPresenters
- EbrcModelCommon
First, run the repository status check to verify that the repositories are present. The script is located in the skill directory:
bash scripts/check-repos.sh ApiCommonPresenters EbrcModelCommon
If repositories are missing, the script will provide clone instructions.
Branch Confirmation: After verifying repositories exist, check their current branches and status using git -C <path>, then confirm with the user before proceeding. Users typically create dataset-specific branches (see curator branching guidelines).
Example:
git -C veupathdb-repos/ApiCommonPresenters branch --show-current
git -C veupathdb-repos/ApiCommonPresenters status -sb
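The same branch and status checks can be scripted from Node.js. This is a minimal sketch and not one of the skill's scripts: the helper names `gitArgsFor` and `gitInRepo` are hypothetical, and it assumes the process is started from the curation workspace directory so that the relative veupathdb-repos/ path resolves.

```javascript
const { execFileSync } = require("node:child_process");

// Hypothetical helper: builds the argument list for `git -C <path> ...`
// so that no `cd` into the repository is needed.
function gitArgsFor(repoName, ...gitArgs) {
  return ["-C", `veupathdb-repos/${repoName}`, ...gitArgs];
}

// Hypothetical helper: runs git inside a repository under veupathdb-repos/
// and returns its trimmed standard output.
// Assumption: the current working directory is the curation workspace.
function gitInRepo(repoName, ...gitArgs) {
  return execFileSync("git", gitArgsFor(repoName, ...gitArgs), {
    encoding: "utf8",
  }).trim();
}

// Print the command line that would be run for the branch check.
console.log(
  "git " + gitArgsFor("ApiCommonPresenters", "branch", "--show-current").join(" ")
);
```

Calling `gitInRepo("ApiCommonPresenters", "status", "-sb")` would then return the short status text, which can be shown to the user before asking for confirmation.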
IMPORTANT: All commands in this workflow must be run from your curation workspace directory (the directory that contains veupathdb-repos/ as a subdirectory).
For Claude Code:
- Do not use cd commands to change into veupathdb-repos/ subdirectories.
- Use git -C <path> for git operations in subdirectories.
- For example, run git -C veupathdb-repos/ApiCommonPresenters status instead of cd veupathdb-repos/ApiCommonPresenters && git status.
The workflow will create a tmp/ subdirectory in the curation workspace directory for intermediate files.
Gather the following before starting:
- GenBank assembly accession (e.g. GCA_000988875.2, including version)
Fetch assembly metadata from NCBI using the GenBank accession.
Command:
curl -X GET "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/<ASSEMBLY_ACCESSION>/dataset_report" \
-H "Accept: application/json" > tmp/<ASSEMBLY_ACCESSION>_dataset_report.json
Detailed instructions: Step 1 - Fetch NCBI Metadata
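Because the accession must include its version, it can help to validate it before building the request. The following is a minimal sketch, not part of the skill's scripts: the helper name `datasetReportRequest` and the accession pattern (GCA_, nine digits, a dot, and a version number) are assumptions for illustration. Only the URL and the output path are taken from the curl command above.

```javascript
// Assumption: versioned GenBank assembly accessions look like GCA_000988875.2.
const ACCESSION_PATTERN = /^GCA_\d{9}\.\d+$/;

// Hypothetical helper: validates the accession, then returns the URL and
// output path used by the curl command in this step.
function datasetReportRequest(accession) {
  if (!ACCESSION_PATTERN.test(accession)) {
    throw new Error(
      `Expected a versioned GenBank assembly accession, got "${accession}"`
    );
  }
  return {
    url: `https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/${accession}/dataset_report`,
    outputPath: `tmp/${accession}_dataset_report.json`,
  };
}

const request = datasetReportRequest("GCA_000988875.2");
console.log(request.url);
console.log(request.outputPath);
```

An accession without a version suffix, such as GCA_000988875, makes the helper throw instead of producing a request.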
Extract the BioProject accession from the assembly report and fetch additional details.
Command:
node scripts/fetch-bioproject.js <BIOPROJECT_ACCESSION>
This retrieves the BioProject title and description and saves them to tmp/<BIOPROJECT>_bioproject.json.
Detailed instructions: Step 2 - Fetch BioProject
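Extracting the BioProject accession from the Step 1 report could look like the sketch below. It is illustrative only: the helper name `extractBioProjectAccession` is hypothetical, and the JSON shape (a `reports` array whose entries carry `assembly_info.bioproject_accession`) is an assumption that should be checked against the file actually saved in tmp/. The sample input is made up, not real NCBI output.

```javascript
// Hypothetical helper: pulls the BioProject accession out of a parsed
// dataset report.
// Assumption: the report has the shape
//   { reports: [ { assembly_info: { bioproject_accession: "PRJ..." } } ] }
function extractBioProjectAccession(datasetReport) {
  const report = (datasetReport.reports || [])[0];
  const accession =
    report &&
    report.assembly_info &&
    report.assembly_info.bioproject_accession;
  if (!accession) {
    throw new Error("No BioProject accession found in dataset report");
  }
  return accession;
}

// Illustrative input only; the BioProject accession here is a placeholder.
const sampleReport = {
  reports: [
    {
      accession: "GCA_000988875.2",
      assembly_info: { bioproject_accession: "PRJNA000000" },
    },
  ],
};
console.log(extractBioProjectAccession(sampleReport));
```

In practice the input would come from reading the saved report, for example with `JSON.parse(fs.readFileSync("tmp/<ASSEMBLY_ACCESSION>_dataset_report.json", "utf8"))`.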
Find and fetch publications for the genome assembly.
Command:
node scripts/fetch-pubmed.js <ASSEMBLY_ACCESSION>
Results are saved to tmp/<ASSEMBLY_ACCESSION>_pubmed.json.
Detailed instructions: Step 3 - Fetch PubMed
Identify and curate contact entries for the genome submission.
Contact identification priority:
Actions:
veupathdb-repos/EbrcModelCommon/Model/lib/xml/datasetPresenters/contacts/allContacts.xml
Detailed instructions: Step 4 - Curate Contacts
Generate the datasetPresenter XML and insert it into the appropriate presenter file.
Command:
node scripts/generate-presenter-xml.js <ASSEMBLY_ACCESSION> <PROJECT> <PRIMARY_CONTACT_ID> [ADDITIONAL_CONTACT_IDS...]
Target file: veupathdb-repos/ApiCommonPresenters/Model/lib/xml/datasetPresenters/<PROJECT>.xml
Detailed instructions: Step 5 - Update Presenter Files
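The argument order and the target path for this step can be assembled as in the sketch below. It is a minimal illustration under stated assumptions: the helper name `presenterInvocation` is hypothetical, and the project name `ExampleDB` and the contact IDs are placeholders rather than real values. Only the argument order and the path template come from the command and target file above.

```javascript
// Hypothetical helper: assembles the argument list for
// scripts/generate-presenter-xml.js and the presenter file it targets.
function presenterInvocation(
  accession,
  project,
  primaryContactId,
  additionalContactIds = []
) {
  if (!accession || !project || !primaryContactId) {
    throw new Error("accession, project and primaryContactId are required");
  }
  return {
    // Same order as the documented command line.
    args: [
      "scripts/generate-presenter-xml.js",
      accession,
      project,
      primaryContactId,
      ...additionalContactIds,
    ],
    targetFile: `veupathdb-repos/ApiCommonPresenters/Model/lib/xml/datasetPresenters/${project}.xml`,
  };
}

// Placeholder values for illustration only.
const invocation = presenterInvocation(
  "GCA_000988875.2",
  "ExampleDB",
  "example.primary",
  ["example.second"]
);
console.log("node " + invocation.args.join(" "));
console.log(invocation.targetFile);
```

The `args` array could be passed to `child_process.execFileSync("node", invocation.args)`. Checking the project name against resources/valid-projects.json first would be a sensible addition, but that file's structure is not shown here, so it is left out of the sketch.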
After completing this workflow:
- scripts/fetch-bioproject.js - Fetches BioProject metadata from NCBI (esearch + esummary)
- scripts/fetch-pubmed.js - Fetches PubMed records linked to a BioProject (elink + esummary)
- scripts/generate-presenter-xml.js - Generates datasetPresenter XML from fetched metadata
- scripts/check-repos.sh - Validates veupathdb-repos/ repository setup (synced from shared/)
Install the plugin:
npx claudepluginhub veupathdb/dataset-curator --plugin curation-skills