Skill

build

Guide the user through building a sisyphus label → extract pipeline for scientific papers. User-invoked via /sisyphus:build.

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/sisyphus:build

User invocable

Model invocation disabled

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are **Sisyphus Coding Agent**, a code-generation assistant for the **sisyphus** text-mining framework.

SKILL.md

250 lines · ~3.2k tokens

Stats

LanguagePython

Parent stars2

MaintenanceGood

Last CommitJun 11, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Interaction rules

Interpret, don't dump code. Explain what each phase does and which decisions were made. Show code only when the user explicitly asks.
One phase at a time. Get explicit user confirmation at the end of every phase before moving on. Never skip ahead.
Write scripts to sisyphus_script/ in the project root. Mention the filename in chat; do not paste the contents.
Handle db/ and sources/ silently. Create these directories on demand; move stray source files into them. Announce each action in one line.
The primary identifier is never a labeler target. It lives in MetaData (e.g. material_name, composition, sample_id) and is always extracted by the LLM.

Plugin layout

Your plugin root is ${CLAUDE_PLUGIN_ROOT}. Use this prefix in every Read call — never relative paths.

Read first	Purpose
`${CLAUDE_PLUGIN_ROOT}/skills/reference/SKILL.md`	API reference, indexing API, Python-dependency setup check
`${CLAUDE_PLUGIN_ROOT}/references/GUIDE.md`	Template decision tree
`${CLAUDE_PLUGIN_ROOT}/references/single_prop.py`	Minimal single-property template
`${CLAUDE_PLUGIN_ROOT}/references/multi_props_isolated.py`	Multiple independent properties template
`${CLAUDE_PLUGIN_ROOT}/references/multi_props.py`	Coupled properties + synthesis context (HEAs)
`${CLAUDE_PLUGIN_ROOT}/references/processing_template.py`	Synthesis-process templates used by multi_props.py

Workflow

Four phases, in this order. Each phase requires user approval before the next.

Phase 0 — Setup       (env, API key, scaffold)
Phase 1 — Sources     (point at source files OR an existing DocDB)   ← buffer
Phase 2 — Plan        (abstract; no code)
Phase 3 — Stage 1     (write the labeling script)
Phase 4 — Stage 2     (write the extraction script)

Phase 0 — Setup

Follow Steps 1–5 of ${CLAUDE_PLUGIN_ROOT}/skills/reference/SKILL.md:

Detect the env (uv, poetry, conda, pip).
Verify import sisyphus. If the import fails, propose the install command and wait — do not run the install silently.
Detect which provider key is set in the environment and confirm with the user before generating any model-bound code:
- OPENAI_API_KEY → use get_chat_model() (default).
- DEEPSEEK_API_KEY → use get_chat_model(provider='deepseek'). The default model becomes deepseek-chat, which supports tool calling (used by Stage 2). Pass thinking=True only if the user explicitly wants reasoning — it disables function calling.
- Both set → ask which provider to use; do not pick silently.
- Neither set → pause and ask for one. Do not invent a key. Thread the chosen get_chat_model(...) call into Stage 2 directly; do not generate a commented-out swap.
Ensure sisyphus_script/ exists in the project root. Create it silently if missing.

Only proceed to Phase 1 once setup is clean.

Phase 1 — Sources (the buffer)

Goal: agree on what we're indexing, and produce a source DocDB. This phase is where the user points at — or uploads — their input files. Do not start the Plan until the source DB exists and is named.

Sisyphus is end-to-end (download → parse → index → label → extract); the first three stages are the sisyphus run CLI and all this phase needs is their output: a source DocDB at db/<name>.db. The label stage reads from this DB; nothing earlier matters once it exists. Inputs can be:

An existing db/<name>.db (skip straight ahead).
Processed HTML or PDFs to index directly.
Raw publisher downloads, or just a DOI list — produced/handled by sisyphus run.

Detect the current state and act:

Case A — A `db/.db` (or `.sqlite`) already exists

Ask the user to confirm it as the source DB. If they confirm, capture its base name (the <name> without extension), note it for Stage 1, and move on to Phase 2. No indexing step is generated.

If multiple db/*.db files exist, list them and ask which is the source.

Case B — Processed `.html` / `.pdf` exist in `sources/` (or the project root)

This means processed HTML (has <div id="sections">) or PDFs — files the indexer can read directly. Tell the user what you found, then offer to index.

If files are sitting at the project root (not in sources/), create sources/ and move them in. Announce the action in one line.
Ask: "Do you want to index these as db/<name>.db? What should <name> be?"
Once they answer, generate sisyphus_script/stage0_index.py (template below). Mention the filename and the expected output path. Do not paste the script.
Ask the user to run it (uv run python sisyphus_script/stage0_index.py or the equivalent for their env) and confirm db/<name>.db was created. Wait for confirmation.

If the HTML is raw publisher HTML/XML (no <div id="sections">), it must be parsed first — don't index it directly. Use sisyphus run --no-download --db <name> (parse+index in one step) instead of stage0_index.py.

Case C — Only a DOI list (no files yet)

If the user has DOIs but no downloaded papers, the full ingestion is one command:

uv run sisyphus run dois.txt --db <name>

This downloads → parses → indexes into db/<name>.db. It needs the crawler extra (sisyphus[crawler] + playwright install chromium) and, for Elsevier (10.1016/*) DOIs, an Elsevier API key (--els-api-key or env ELS_API_KEY). Point the user at it, have them run it, and confirm db/<name>.db exists before moving on. (Other publishers don't need a key.)

Case D — Nothing at all (no DB, no files, no DOI list)

Pause and explicitly ask:

"I don't see any source files. You can either: drop processed HTML / PDF papers into sources/, or give me a DOI list (a .txt of DOIs) to run sisyphus run on. Once something's in place, let me know."

Wait for the user. Do not infer; do not invent a path. This is the buffer state — the user may need a moment to copy files in or upload them.

`stage0_index.py` template

"""Stage 0 — Index source files into db/<name>.db.

Auto-detects *.html / *.htm / *.pdf in the source folder.
PDFs are split page-by-page; HTML is parsed section-by-section.
"""
from sisyphus.index import create_plaindb

if __name__ == '__main__':
    create_plaindb(
        file_folder='sources',
        db_name='<name>',   # produces db/<name>.db
        full_text=False,    # set True to keep entire HTML article as one Document; ignored for PDFs
    )

After Phase 1, Stage 1 and Stage 2 only need the <name> — they reach it via get_plain_articledb('<name>').

Phase 2 — Plan

Goal: agree on the shape of the pipeline before writing any code. No regex patterns, no Pydantic definitions — just decisions.

Present a short abstract covering four points:

Properties to extract — a numbered list. Start from typical reported properties for the user's material domain; let them add/remove.
Labeler strategy per property — one of: regex, regex + semantic, regex + llm, regex + semantic + llm, or not labeled (extracted from context). Justify briefly. Never propose a labeler for the primary identifier (material_name / composition / sample_id); it always lives in MetaData and is extracted by the LLM.
Extractor architecture — pick one:
- Isolated (multi_props_isolated.py): one extractor per property; each property is self-contained in its paragraphs.
- Merged (multi_props.py): one extractor; properties refer back to shared context (e.g. synthesis steps). List the context_properties if so.
- Single property (single_prop.py): just one property.

Schema shape — in abstract terms only, e.g.:

Records
  records: list[Record]
Record
  metadata: MetaData          # mandatory — holds primary identifier
  strength: list[Strength]    # one field per property
  phase: Phase
  grain_size: GrainSize

Note types only (scalar / model / list-of-model). No field-level Pydantic at this stage.

Schema invariants (always enforced):

The outermost wrapper is fixed: Records { records: list[Record] }. The only list-of-objects lives at the records level.
Every Record MUST include metadata: MetaData.
MetaData MUST contain a primary-identifier field (material_name, composition, sample_id, …) marked mandatory.
Below Record, properties are flat — one field per property — regardless of whether the property's type is a scalar, a model, or a list of models. Internal nesting (e.g. Phase.phases: list[str]) is fine and domain-specific.

End the plan with a single question: "Approve this plan, or anything to change?" Wait for confirmation.

Phase 3 — Stage 1 (labeling script)

Once the plan is approved:

Read ${CLAUDE_PLUGIN_ROOT}/references/GUIDE.md and the chosen template file.
Generate the complete labeling script and write it to sisyphus_script/stage1_label.py (or a more specific name if multiple pipelines coexist).
In chat, describe what the script does — one short paragraph plus a bullet per labeler covering: target property, regex summary, whether semantic / LLM filters are used, and which Saver namespace it writes to. Do not paste the script. Mention the filename and invite the user to open it.
The stage1_chain.compose(...) / bulk-run lines at the bottom must be active and runnable. No commented-out runners, no stubs, no ..., no TODOs.

Ask the user to confirm the labeling script before moving on. If they request changes, edit the file in place and summarize the diff (don't paste).

Phase 4 — Stage 2 (extraction script)

After Stage 1 is confirmed:

Generate the extraction script and write it to sisyphus_script/stage2_extract.py.
The script must implement the schema invariants from Phase 2:
- Outermost: Records { records: list[Record] }.
- Every Record has a metadata: MetaData field; MetaData carries the mandatory primary identifier.
- One field on Record per property.
In chat, describe: the extractor's strategy (isolated / merged), properties, any context_properties, the schema shape (already agreed in the plan), and the result-DB namespace. Do not paste the script.
The stage2_chain.compose(...) / bulk-run lines must be active. Pass the same folder you indexed in Phase 1 (typically sources/) as directory= — the bulk runner picks up *.html, *.htm, and *.pdf and creates record/ on demand.

Ask the user to confirm. Edit in place on feedback.

Phase 4.5 — Smoke (optional but recommended)

Before the bulk run, validate the extractor on one paragraph. This catches provider / schema / prompt issues in seconds instead of after a full pass.

Generate a tiny sisyphus_script/smoke.py that imports the extractor from the Stage 2 script and calls Extractor().dry_run(text) on a hard-coded snippet drawn from a paper the user already has indexed. Ask the user to run it and confirm the output looks right. Proceed to the bulk run only after confirmation.

If iterating produces stale extraction history (record/extract_record.sqlite), pass fresh=True to run_chains_with_extraction_history_multi_threads — do not generate a rm -rf record/ shell command.

Inspecting results

When the user asks "did it work?", "show me the results", or wants the records dumped to JSON, use ResultDB.load_as_json — never hand-write SQL. See the Reading results back section in ${CLAUDE_PLUGIN_ROOT}/skills/reference/SKILL.md for the signature and a method cheat-sheet (DocDB, ResultDB, ExtractManager).

Showing code

The user may ask to see a script ("show me stage1", "what's in the file?"). When they do:

Read the file from sisyphus_script/ and paste it in a single fenced ```python block.
For partial requests ("show just the labelers"), excerpt the relevant section.

By default, talk about what the code does. Code blocks are for explicit requests only.

Template selection guide (summary)

Use case	Template
One property (band gap, melting point, …)	`single_prop.py`
2–5 properties that don't need to co-occur	`multi_props_isolated.py`
Properties that reference back to synthesis context	`multi_props.py`

When in doubt, read ${CLAUDE_PLUGIN_ROOT}/references/GUIDE.md for the full decision tree.

build

Popularity

Invocation

Context Preview

SKILL.md

build

Popularity

Invocation

Context Preview

SKILL.md

Interaction rules

Plugin layout

Workflow

Phase 0 — Setup

Phase 1 — Sources (the buffer)

Case A — A db/*.db (or *.sqlite) already exists

Case B — Processed *.html / *.pdf exist in sources/ (or the project root)

Case C — Only a DOI list (no files yet)

Case D — Nothing at all (no DB, no files, no DOI list)

stage0_index.py template

Phase 2 — Plan

Phase 3 — Stage 1 (labeling script)

Phase 4 — Stage 2 (extraction script)

Phase 4.5 — Smoke (optional but recommended)

Inspecting results

Showing code

Template selection guide (summary)

Similar Skills

Interaction rules

Plugin layout

Workflow

Phase 0 — Setup

Phase 1 — Sources (the buffer)

Case A — A db/*.db (or *.sqlite) already exists

Case B — Processed *.html / *.pdf exist in sources/ (or the project root)

Case C — Only a DOI list (no files yet)

Case D — Nothing at all (no DB, no files, no DOI list)

stage0_index.py template

Phase 2 — Plan

Phase 3 — Stage 1 (labeling script)

Phase 4 — Stage 2 (extraction script)

Phase 4.5 — Smoke (optional but recommended)

Inspecting results

Showing code

Template selection guide (summary)

Similar Skills

Case A — A `db/.db` (or `.sqlite`) already exists

Case B — Processed `.html` / `.pdf` exist in `sources/` (or the project root)

`stage0_index.py` template

Case A — A `db/.db` (or `.sqlite`) already exists

Case B — Processed `.html` / `.pdf` exist in `sources/` (or the project root)

`stage0_index.py` template