This skill should be used when the user asks to "extract a study", "convert this paper", "process this journal article", or drops a PMC link, DOI, PubMed URL, or research PDF into the conversation. Handles anything from PubMed/PMC, NEJM, JAMA, Lancet, JACC, Cureus, Nature, etc. Specifically for papers with IMRaD structure (Introduction, Methods, Results, and Discussion), not books. Also triggers when the user mentions a study PDF without explicitly asking to convert it.
Slash command
/obscura-scraper-crawler:extract-study
Convert research papers into clean Markdown with a metadata header, IMRaD section structure, and preserved references. The bundled Python script handles two-column layouts, section detection, and metadata extraction automatically.
Reuse extract-book's .venv if it exists; otherwise create one in the working directory:
[ -d .venv ] || python3 -m venv .venv
source .venv/bin/activate && python -c "import pdfplumber" 2>/dev/null || pip install pdfplumber
Don't suppress pip errors with 2>/dev/null — surface them so missing toolchains (e.g. no Python build deps) fail loud rather than as a confusing ModuleNotFoundError on the next line.
source .venv/bin/activate && python {SKILL_DIR}/scripts/extract_study_pdf.py "<path-to-pdf>" --dry-run
This prints the detected metadata (title, year, DOI, PMID/PMCID) and the list of sections found. Authors and journal are not auto-detected — fill those in by hand from the PDF's first page after extraction. Review with the user before extracting.
If the dry run prints a WARNING: detected N tokens that look like ... right-to-left order line, one of the paper's tables (usually a correlation matrix) was extracted with reversed cells. Either rerun with --layout or plan to reconstruct that table from the published HTML — never ship the .md without fixing it.
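The reversed-cell check behind that warning can be pictured with a small sketch. This is not the bundled script's code: the regex and the names `FORWARD_STAT`, `looks_reversed`, and `count_reversed` are assumptions made for illustration. The idea is to flag a token that does not read as a statistic left to right but does once its characters are reversed.

```python
import re

# Assumed shape of a forward-reading statistic: optional sign, digits,
# a decimal point, digits, then up to three significance stars.
FORWARD_STAT = re.compile(r"^-?\d*\.\d+\*{0,3}$")


def looks_reversed(token: str) -> bool:
    """True when the token only parses as a statistic after reversal."""
    forward_ok = FORWARD_STAT.match(token) is not None
    backward_ok = FORWARD_STAT.match(token[::-1]) is not None
    return (not forward_ok) and backward_ok


def count_reversed(text: str) -> int:
    """Count whitespace-separated tokens that look reversed."""
    return sum(looks_reversed(tok) for tok in text.split())


if __name__ == "__main__":
    row = "ISC **82.0 31.0- 0.45"
    print(count_reversed(row))
```

A bare value such as 82.0 is ambiguous (it reads as a number in both directions), so this sketch only flags tokens carrying a sign or significance stars. Reversed labels such as ISC would need a separate check.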
source .venv/bin/activate && python {SKILL_DIR}/scripts/extract_study_pdf.py "<path-to-pdf>" -o "<output-path>.md"
If -o is omitted, output goes next to the PDF.
After extraction, read the top of the file and verify:
When to do a full rewrite vs. patch: For short case reports, letters, or editorials (≤6 pages, non-IMRaD), the extracted output often needs complete reorganization. In that case, read the PDF carefully and rewrite the Markdown from scratch using the target format below. A full rewrite in one pass is faster than patching dozens of column-bleed artifacts.
Location: Ask the user where to save it unless the project's CLAUDE.md or an existing folder of study extractions makes it obvious. Match the filing pattern of neighboring studies if there is one.
Filename: Lowercase kebab-case — <firstauthor>-<year>-<short-slug>.md.
<firstauthor> is the first author's surname only (strip initials and "et al"): "Dugani SB" → dugani.
<year> is the publication year.
<short-slug> is a 3–6 word descriptive title. Keep the full paper title in the # H1 heading inside the file.
Surnames are lowercased to plain ASCII: obrien, "Mendoza-López" → mendoza-lopez.
Example: dugani-2021-lipid-markers-womens-health.md.
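The naming rule can be sketched as a small helper. It is illustrative only and not part of the bundled script: the function names are hypothetical, and it assumes the author string is PubMed-style ("Surname INITIALS"), with apostrophes dropped and diacritics stripped to reach plain ASCII.

```python
import re
import unicodedata


def ascii_slug(text: str) -> str:
    """Lowercase, strip diacritics and apostrophes, hyphenate the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = decomposed.encode("ascii", "ignore").decode("ascii").lower()
    plain = re.sub(r"['`]", "", plain)
    return re.sub(r"[^a-z0-9]+", "-", plain).strip("-")


def surname_only(author: str) -> str:
    """Assumes 'Surname INITIALS' input; drops 'et al' and trailing initials."""
    author = re.sub(r",?\s*et\s+al\.?", "", author, flags=re.IGNORECASE).strip()
    tokens = author.split()
    # Trailing initials look like "SB" or "J." (one to three capital letters).
    while len(tokens) > 1 and re.fullmatch(r"(?:[A-Z]\.?){1,3}", tokens[-1]):
        tokens.pop()
    return ascii_slug(" ".join(tokens))


def study_filename(first_author: str, year: int, short_slug: str) -> str:
    """Build <firstauthor>-<year>-<short-slug>.md in lowercase kebab-case."""
    return f"{surname_only(first_author)}-{year}-{ascii_slug(short_slug)}.md"


if __name__ == "__main__":
    print(study_filename("Dugani SB", 2021, "Lipid markers women's health"))
```

The helper does not enforce the 3–6 word limit on the slug; choosing a descriptive slug remains a manual judgment.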
PDF handling: Rename the source PDF to match the Markdown filename exactly (same kebab slug, same directory, .pdf extension) so the PDF and MD are paired and discoverable under one search. Use git mv if the PDF is already tracked.
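A minimal sketch of the pairing rule follows, assuming the caller already knows whether the PDF is tracked by git. The function names and the example paths are hypothetical; the sketch only builds the command and does not run it.

```python
from pathlib import PurePosixPath


def paired_pdf_target(md_path: str) -> PurePosixPath:
    """Same directory and slug as the Markdown file, with a .pdf extension."""
    return PurePosixPath(md_path).with_suffix(".pdf")


def rename_command(pdf_path: str, md_path: str, tracked: bool) -> list[str]:
    """Return the argv for the rename: git mv when tracked, plain mv otherwise."""
    tool = ["git", "mv"] if tracked else ["mv"]
    return tool + [pdf_path, str(paired_pdf_target(md_path))]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(rename_command("downloads/source.pdf",
                         "studies/dugani-2021-lipid-markers-womens-health.md",
                         tracked=True))
```

Returning the argument list instead of executing it keeps the rename reviewable before anything on disk changes.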
# [Paper Title]
**Authors:** [First Author], [Second Author], et al.
**Journal:** [Journal Name]
**Year:** [Year]
**DOI:** [DOI]
**PMID/PMCID:** [if known]
**Study design:** [RCT / cohort / meta-analysis / etc. — fill in manually if not obvious]
## Key Findings
- [bullet 1]
- [bullet 2]
- ...
---
## Abstract
[extracted text]
## Background / Introduction
[extracted text]
## Methods
[extracted text]
## Results
[extracted text]
## Discussion
[extracted text]
## Conclusion
[extracted text]
---
## References
[preserved reference list]
The script scans for IMRaD-style headings at the start of lines. It recognizes common variants:
Headings can be in ALL CAPS, Title Case, or numbered (1. Introduction, 2. Methods). The script normalizes them to ## Section Name. Anything before the first detected section becomes the metadata block + abstract area; anything after References stays under References.
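The normalization described above might look roughly like the following. This is a sketch under stated assumptions, not the bundled script's implementation: the variant table `CANONICAL` is a guess at plausible entries, mapped onto the section names of the target format, and `normalize_heading` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed variant table; the script's real list may differ.
CANONICAL = {
    "abstract": "Abstract",
    "background": "Background / Introduction",
    "introduction": "Background / Introduction",
    "methods": "Methods",
    "materials and methods": "Methods",
    "results": "Results",
    "discussion": "Discussion",
    "conclusion": "Conclusion",
    "conclusions": "Conclusion",
    "references": "References",
}

# A heading line: optional "1." style number, letters and spaces, optional colon.
HEADING_RE = re.compile(r"^\s*(?:\d+\.?\s+)?([A-Za-z][A-Za-z ]*?)\s*:?\s*$")


def normalize_heading(line: str) -> Optional[str]:
    """Return '## Section Name' for a recognized heading line, else None."""
    match = HEADING_RE.match(line)
    if match is None:
        return None
    name = CANONICAL.get(match.group(1).lower())
    return f"## {name}" if name else None


if __name__ == "__main__":
    for raw in ["1. Introduction", "METHODS", "Results were significant."]:
        print(repr(raw), "->", normalize_heading(raw))
```

Matching on the whole line keeps ordinary sentences that merely begin with a section word, such as "Results were significant.", from being promoted to headings.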
If columns bleed into each other, retry with --layout (uses pdfplumber's layout=True mode). For very short papers (≤6 pages), a full manual rewrite from the PDF is usually faster than patching bleed artifacts.
Reversed table cells: values like 0.28** come out as **82.0 and labels like CSI come out as ISC. The script detects this and prints a warning at extraction time. Fix by either retrying with --layout or pulling the table straight from the published HTML (PMC or the journal site) and pasting it in.
For NIH author manuscripts (nihms-XXXXXX.pdf), the PDF puts "HHS Public Access / Author manuscript" cover matter and a running journal cite ahead of the title. The script filters this chrome, but always cross-check the .md against the PMC HTML page (pmc.ncbi.nlm.nih.gov/articles/PMC<id>/): it gives you a clean Table 2, intact references with PubMed links, and a verified PMID. Make this a default step for any NIHMS-prefixed PDF.
Title detection copes with titles hyphenated across lines (cross- + sectional), but unfamiliar journal-specific cover matter can still leak through, so sanity-check the title.
DOI detection looks for 10.xxxx/... patterns. If the paper uses an unusual DOI format, add it manually.
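For the DOI case, a pattern of the 10.xxxx/... kind can be sketched as below. The exact regex is an assumption (a commonly used shape for modern DOIs, not necessarily what the script uses), `find_doi` is a hypothetical name, and the DOI in the usage line is a placeholder.

```python
import re
from typing import Optional

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# running up to whitespace, a quote, or an angle bracket.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Sentence punctuation often trails a DOI in running text; trim it.
    return match.group(0).rstrip(".,;)")


if __name__ == "__main__":
    # Placeholder DOI, for illustration only.
    print(find_doi("Available at https://doi.org/10.1234/example.2021.001."))
```

DOIs that fall outside this shape are exactly the unusual formats that still need to be added by hand.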
npx claudepluginhub noisemeldorg/skills --plugin extraction-skills