From sdrf-skills
Runs an autonomous SDRF annotation improvement loop over one dataset, a manifest, or a dataset class (e.g., all PRIDE cell line datasets). Use when the user wants repeated refine-validate-fix cycles until no more retained gains are possible.
Slash command
/sdrf-skills:sdrf-autoresearch target="<dataset scope>" [profile="<preset>"] [objective="<metric>"] [focus_fields="<field1,field2>"] [evidence="<pride,files,europepmc>"] [stop="<rule>"] [write="<sandbox|branch|report-only>"]
This workflow is a domain-specific autonomous loop for SDRF annotation. It is intended to function with minimal user supervision once the target and optimization goal are clear.
Use it when the user asks for:
This is a protocol skill, not a dedicated runner script. Execute the loop by
following the steps below and by calling the existing sdrf:* skills in order.
Claude-style examples:
/sdrf:autoresearch target="all PRIDE cell line datasets"
/sdrf:autoresearch target="all sandbox crosslinking datasets" profile="crosslinking"
/sdrf:autoresearch target="manifest:data/cell_line_manifest.tsv" objective="maximize_valid_field_coverage"
Codex-style examples:
$sdrf-autoresearch target="all PRIDE cell line datasets"
$sdrf-autoresearch target="accessions:PXD001234,PXD005678" profile="clinical"
$sdrf-autoresearch target="all sandbox crosslinking datasets" objective="crosslinking_assay_completion" write="sandbox"
Normalize the user request into these fields:
target
  all PRIDE cell line datasets
  all sandbox crosslinking datasets
  manifest:data/cell_line_manifest.tsv
  accessions:PXD001234,PXD005678
profile
  general-proteomics
  cell-line
  crosslinking
  clinical
  immunopeptidomics
objective
  maximize_valid_field_coverage
  minimize_unknowns
  crosslinking_assay_completion
  cell_line_sample_completion
  clinical_sample_completion
focus_fields
  cell line,disease,organism part,treatment
  cross-linker,crosslink enrichment method,collision energy
evidence
  pride,files,europepmc
  pride,files
  manuscript-first
stop
  3_no_improve_rounds
  coverage>=0.95
  only_low_confidence_candidates_left
write
  sandbox
  branch
  report-only
If the user does not specify these explicitly, infer them conservatively and state the inferred config before the loop begins.
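The normalization step above can be sketched in Python. This is a minimal illustration, not part of the skill: `parse_invocation` is a hypothetical helper, and it only separates the fields the user gave explicitly from the ones that still have to be inferred and stated before the loop starts.

```python
import re

# Field names come from the list above; the parsing approach is an assumption.
CONFIG_FIELDS = ("target", "profile", "objective", "focus_fields",
                 "evidence", "stop", "write")

# Matches key="value" pairs such as target="all PRIDE cell line datasets".
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_invocation(text):
    """Split an invocation into explicit fields and fields left to infer."""
    explicit = {}
    for key, value in PAIR_RE.findall(text):
        if key not in CONFIG_FIELDS:
            raise ValueError(f"unknown field: {key}")
        explicit[key] = value
    missing = [name for name in CONFIG_FIELDS if name not in explicit]
    return explicit, missing


explicit, missing = parse_invocation(
    '/sdrf:autoresearch target="manifest:data/review_set.tsv" '
    'objective="minimize_unknowns" write="report-only"'
)
print(explicit)
print("to infer:", missing)
```

The sketch deliberately does not pick defaults for the missing fields; how to infer them conservatively is left to the agent following the protocol.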
Interpret target into a concrete dataset set:
accessions:...
manifest:...
all PRIDE <category> datasets
  cell line
  crosslinking
  human clinical
  immunopeptidomics
all sandbox <category> datasets
Always resolve the target into a manifest-like working list before annotation begins.
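One way to picture target resolution is a small classifier that recognizes the target forms listed above and says how the working list should be built. The function name `classify_target` and the returned kind labels are hypothetical; the actual discovery of datasets for a category is out of scope for this sketch.

```python
import re

# "all PRIDE <category> datasets" or "all sandbox <category> datasets"
CATEGORY_RE = re.compile(r"^all (PRIDE|sandbox) (.+) datasets$")


def classify_target(target):
    """Return a (kind, payload) pair describing how to build the working list."""
    if target.startswith("accessions:"):
        raw = target[len("accessions:"):]
        accessions = [a.strip() for a in raw.split(",") if a.strip()]
        return "accessions", accessions
    if target.startswith("manifest:"):
        # Payload is the manifest path exactly as the user wrote it.
        return "manifest", target[len("manifest:"):]
    match = CATEGORY_RE.match(target)
    if match:
        source, category = match.groups()
        return f"{source.lower()}_category", category
    raise ValueError(f"unrecognized target: {target!r}")


print(classify_target("accessions:PXD001234,PXD005678"))
print(classify_target("all PRIDE cell line datasets"))
```

Rejecting unrecognized targets up front keeps the loop from starting on an ambiguous scope.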
For blood plasma or plasma biomarker campaigns:
Audit the local project collection first
Treat discovery as two layers
  new_dataset: accession not yet present in the local plasma collection
  upgrade_existing_sdrf: accession already present locally, but disease wording, ontology, or field coverage should be improved
Default to human plasma unless the user explicitly says otherwise
  Use Homo sapiens as the default species scope for biomarker-style plasma campaigns
  organisms first to confirm species
When querying PRIDE, do not trust plasma hits blindly
  MONDO, DOID, EFO, and NCIT
  kidney tumor, use OLS embedding search to surface likely subtype names such as renal-cancer variants
  myositis -> dermatomyositis, sarcoma -> Ewing sarcoma, myeloma -> multiple myeloma, or alcohol-related liver disease -> alcoholic hepatitis
  influenza A, IAV, H1N1, flu, and, if the user explicitly allows it, broader viral pneumonia plus serum
  exact, child_term, related, or surrogate disease match so downstream ranking does not overstate coverage
  experimentTypes for acquisition style such as Data-independent acquisition, Data-dependent acquisition, Gel-based experiment, or Bottom-up proteomics
  quantificationMethods first for labeling style such as TMT, iTRAQ, label-free quantification, Dimethyl Labeling, or NSAF
  If quantificationMethods is empty, inspect sampleProcessingProtocol, dataProcessingProtocol, and keywords for explicit terms like TMT16plex, iTRAQ, label-free quantification, LFQ, MaxQuant, DIA, SWATH, or Spectronaut
  acquisition_mode and quant_mode columns so campaigns can prioritize DIA-LFQ, DDA-TMT, DDA-LFQ, and related workflows explicitly
  Distinguish blood plasma / plasma proteome / plasma extracellular vesicles from false positives such as plasma cells or plasma membrane
  PRIDE, MassIVE, jPOST, or iProX; treat PanoramaPublic discoveries as audit-only until the campaign policy changes
  positive or ambiguous and the disease context is explicit
  confirmed_plasma: explicit plasma evidence in title, abstract, methods, results, or supplementary text
  mixed_includes_plasma: plasma is explicit, but other sample matrices are also part of the study
  likely_non_plasma: manuscript shows a different primary matrix such as CSF, urine, platelet releasate, BALF, or cell-line material
  unclear: insufficient manuscript evidence; do not prioritize automatically
  confirmed_plasma first, then mixed_includes_plasma if mixed-matrix studies are acceptable for the campaign
Search Europe PMC independently when publication links may be missing from PRIDE
  Include PXD, PRIDE, or ProteomeXchange in the literature query to favor papers that reference a proteomics dataset
  PXD... / MSV... mentions
Prioritize candidates in this order
Do not optimize for raw field count alone. Optimize for valid, evidence-supported, template-compliant completion.
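The difference between raw field count and placeholder-aware coverage can be illustrated with a short Python sketch. The `coverage` function and the row layout (one dict per SDRF row) are assumptions made for this example. The placeholder set uses the three values this document penalizes. "Valid" here only means "filled with something other than a placeholder"; real validity still depends on the validator and on evidence.

```python
# Placeholder values named in this document; the set itself is illustrative.
PLACEHOLDERS = {"not available", "not applicable", "unknown"}


def coverage(rows, fields):
    """Return (raw, non_placeholder) coverage over the requested fields."""
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0, 0.0
    raw = valid = 0
    for row in rows:
        for field in fields:
            value = (row.get(field) or "").strip()
            if value:
                raw += 1  # any non-empty cell counts toward raw coverage
                if value.lower() not in PLACEHOLDERS:
                    valid += 1  # only real values count here
    return raw / total, valid / total


rows = [
    {"characteristics[cell line]": "HeLa", "characteristics[disease]": "not available"},
    {"characteristics[cell line]": "", "characteristics[disease]": "unknown"},
]
fields = ["characteristics[cell line]", "characteristics[disease]"]
print(coverage(rows, fields))  # (0.75, 0.25)
```

In this toy table three of four cells are filled, but only one carries real information, which is why raw count alone is a misleading objective.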
Default scoring dimensions:
Recommended derived objectives:
maximize_valid_field_coverage
Use when the user wants the most complete valid SDRF possible.
Reward:
Penalize:
  not available, not applicable, unknown
minimize_unknowns
Use when the main problem is placeholders and unresolved values.
Reward:
  unknown and unknown crosslinker
Penalize:
crosslinking_assay_completion
Use for XL-MS datasets. Focus strongly on:
  comment[cross-linker]
  comment[crosslink enrichment method]
  characteristics[enrichment process]
  characteristics[crosslink distance]
  comment[crosslinker concentration]
  characteristics[crosslinking reaction time]
  characteristics[crosslinking temperature]
  comment[quenching reagent]
  comment[collision energy]
cell_line_sample_completion
Use for cell line datasets. Focus strongly on:
  characteristics[cell line]
  characteristics[cell type]
  characteristics[disease]
  characteristics[organism part]
  characteristics[treatment]
  characteristics[sex]
  characteristics[age]
For human or clinical campaigns, also treat demographic evidence conservatively:
  Use characteristics[developmental stage] when the cohort is clearly adult, pediatric, fetal, juvenile, and so on, even if the paper reports only cohort-level age summaries
  Do not fill characteristics[age], characteristics[sex], or characteristics[ethnicity] unless the manuscript or supplementary tables map them to individual source samples
For each dataset:
Discover evidence
For PRIDE file discovery, prefer the complete Archive REST endpoint when you need exact file coverage or file counts:
GET https://www.ebi.ac.uk/pride/ws/archive/v3/projects/PXD######/files/all
Use this to avoid partial paged counts when auditing raw files, checking
whether a candidate accession is tractable, or comparing SDRF comment[data file]
values against the archive.
If this endpoint returns 0 files for a valid PXD hosted through
PanoramaPublic, MassIVE, iProX, or jPOST, treat that as
archive endpoint empty for external repository rather than no dataset.
Keep the accession in play and note the repository-backed limitation in the
ranking output.
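A minimal Python sketch of the file audit described above follows. The URL pattern is the one given in this document. The response shape is an assumption: the sketch treats the reply as a JSON array of objects with a `fileName` key, so check that against the live API before relying on it. The helper names are hypothetical, and `fetch_files` performs a network call that is not executed here.

```python
import json
import urllib.request

BASE = "https://www.ebi.ac.uk/pride/ws/archive/v3/projects"


def files_all_url(accession):
    """Build the complete-listing URL for one project accession."""
    return f"{BASE}/{accession}/files/all"


def fetch_files(accession, timeout=60):
    """Fetch the full file listing. Assumes the body is a JSON array."""
    with urllib.request.urlopen(files_all_url(accession), timeout=timeout) as resp:
        return json.load(resp)


def compare_with_sdrf(archive_files, sdrf_data_files, name_key="fileName"):
    """Compare archive file names with SDRF comment[data file] values."""
    archive_names = {f[name_key] for f in archive_files}  # name_key is assumed
    sdrf_names = set(sdrf_data_files)
    return {
        "archive_file_count": len(archive_names),
        "missing_from_archive": sorted(sdrf_names - archive_names),
        # An empty listing may mean an externally hosted dataset, not no dataset.
        "archive_empty": not archive_names,
    }


listing = [{"fileName": "a.raw"}, {"fileName": "b.raw"}]
print(compare_with_sdrf(listing, ["a.raw", "c.raw"]))
```

Keeping the comparison separate from the fetch makes it possible to audit a cached listing without calling the archive again.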
Run /sdrf:annotate
Run /sdrf:terms
Run /sdrf:techrefine
Run /sdrf:validate
  Run at most 2 parse_sdrf jobs at once
  If sdrf-techrefine, raw-file conversion, or other heavy analysis is active, validate only 1 dataset at a time
Run /sdrf:fix
Run /sdrf:improve
Keep or discard
Repeat the full loop until one of these is true:
Never keep looping just to replace placeholders with guesses.
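The keep-or-discard loop and its stop rules can be sketched as follows. This is an illustration under stated assumptions: `run_loop` and `propose_round` are hypothetical names, each round is reduced to a single numeric score, and two of the stop values shown earlier (`3_no_improve_rounds` and `coverage>=0.95`) are combined in one loop for brevity even though the skill treats `stop` as a single configured rule.

```python
def run_loop(propose_round, baseline, max_no_improve=3, coverage_target=0.95):
    """Keep a round only if it improves the score; stop on either rule."""
    best = baseline
    kept = []        # round numbers whose changes were retained
    stale = 0        # consecutive rounds without a retained gain
    round_no = 0
    while stale < max_no_improve and best < coverage_target:
        round_no += 1
        candidate = propose_round(round_no, best)
        if candidate > best:
            best = candidate
            kept.append(round_no)
            stale = 0
        else:
            stale += 1  # discard: the round did not beat the retained state
    return best, kept, round_no


# Scripted scores stand in for real annotate/validate/fix rounds.
scores = iter([0.70, 0.68, 0.80, 0.80, 0.79, 0.80])
print(run_loop(lambda n, best: next(scores), baseline=0.60))
# (0.8, [1, 3], 6)
```

Rounds that merely tie the retained score count as no improvement, which keeps the loop from churning on equivalent placeholder swaps.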
Autonomous annotation must not compromise the machine.
Use these defaults:
validation_mode=serial unless there is a demonstrated need for limited parallel validation
max_validation_jobs=2
max_validation_jobs=1 whenever raw-file analysis, file conversion, or large manuscript processing is already active
If validation hangs or the system becomes resource-constrained, reduce the number of concurrent validators before continuing.
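These defaults translate into a small concurrency guard. The Python sketch below is illustrative only: `max_validation_jobs` and `validate_all` are hypothetical helpers, and `validate` stands in for whatever actually runs the validator on one SDRF file.

```python
from concurrent.futures import ThreadPoolExecutor


def max_validation_jobs(heavy_work_active, parallel_needed=False):
    """Serial by default, at most 2 jobs, and 1 while heavy work is active."""
    if heavy_work_active:
        return 1
    return 2 if parallel_needed else 1


def validate_all(paths, validate, heavy_work_active=False, parallel_needed=False):
    """Run validate over paths with the allowed number of workers."""
    workers = max_validation_jobs(heavy_work_active, parallel_needed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order regardless of completion order.
        return list(pool.map(validate, paths))


print(max_validation_jobs(heavy_work_active=True, parallel_needed=True))   # 1
print(max_validation_jobs(heavy_work_active=False, parallel_needed=True))  # 2
```

Heavy work takes priority over a parallelism request, so the guard can only lower concurrency, never raise it past the documented ceiling.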
Prioritize evidence as follows:
Best sources:
Best sources:
Best sources:
At the end of the run, report:
If write=report-only, produce the same retained-improvement report without writing SDRFs.
/sdrf:autoresearch target="all PRIDE cell line datasets" profile="cell-line" objective="maximize_valid_field_coverage" focus_fields="cell line,disease,organism part,treatment" evidence="pride,files,europepmc" stop="3_no_improve_rounds" write="sandbox"
/sdrf:autoresearch target="all sandbox crosslinking datasets" profile="crosslinking" objective="crosslinking_assay_completion" focus_fields="cross-linker,crosslink enrichment method,collision energy,crosslinking reaction time" evidence="pride,files,europepmc" stop="only_low_confidence_candidates_left" write="sandbox"
/sdrf:autoresearch target="manifest:data/review_set.tsv" objective="minimize_unknowns" write="report-only"
npx claudepluginhub bigbio/sdrf-skills