From research-pipeline
Multi-source literature discovery across academic and non-academic sources. Can be the FIRST step in a research project — give it a topic and it creates the library from scratch. Searches OpenAlex, Semantic Scholar, arXiv, and the general web (blogs, whitepapers, NIST docs, vendor publications, industry reports). Use this skill whenever the user says "research this topic," "find papers on," "discover literature," "I want to research," "find sources about," "what's been written about," "build me a bibliography," "start researching," "find more papers," "expand my bibliography," "what am I missing," "find related work," "literature search," "fill research gaps," or any request to search for literature — academic or otherwise — on a topic. This is the recommended ENTRY POINT for new research projects. Also trigger when gap analysis results suggest missing coverage areas, or when the user gives a topic without specifying what to do with it.
Slash command: /research-pipeline:literature-discovery
The starting point for any research project. Give it a topic — it finds the literature, creates the library, and loads everything into Supabase. Works with or without an existing library.
This skill has two discovery engines that work together:
API Engine — Direct calls to academic databases (OpenAlex, Semantic Scholar, arXiv) and web search. Fast, structured, returns metadata-rich results.
Multi-Model Swarm — Fans the same research query out to 2-3 LLMs via OpenRouter (Perplexity Sonar Pro + Gemini Flash or Claude Sonnet). Each model searches its own knowledge base and web-grounded sources, finding material the others miss. Results are merged, deduplicated, and synthesized by Claude.
By default, both engines run. The API engine catches the structured academic literature. The swarm catches the practitioner content, niche reports, and sources that don't show up in academic indexes.
The user has a topic but no bibliography yet. This is the most common entry point.
Flow:
The user already has a library and wants to find what's missing.
Flow:
Ask the user what they want to research. If they've given a clear topic, proceed. If vague, ask a few sharpening questions:
If no library exists for this topic:
INSERT INTO research_libraries (name, description, metadata)
VALUES (
'{topic_name}',
'{user_description}',
'{"topics": [...], "created_by": "dorian", "source_types": ["academic", "industry", "government"]}'::jsonb
)
RETURNING id, name
Tell the user: "Created library '[name]' — now searching for sources."
If a library already exists:
SELECT id, name FROM research_libraries ORDER BY created_at DESC
Ask which library to use, or auto-select one if its name matches the topic.
GET https://api.openalex.org/works?search={query}&per_page=50&sort=relevance_score:desc
Add &mailto={your_email} for polite pool access (higher rate limits).
Filter options:
&filter=from_publication_date:2020-01-01
&filter=concepts.id:C41008148
&filter=is_oa:true
Key response fields: doi, title, authorships, publication_year, primary_location.source.display_name, abstract_inverted_index, cited_by_count
Reconstruct abstracts from inverted index (see references/api-response-formats.md).
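A minimal sketch of that reconstruction, assuming the index maps each word to the list of positions where it occurs. The function name and sample data are illustrative and are not taken from references/api-response-formats.md.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex-style abstract_inverted_index.

    The index maps each word to the list of positions where it occurs.
    """
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()  # order words by their position in the abstract
    return " ".join(word for _, word in positioned)


# Hypothetical sample index, not real API output.
sample = {"Zero": [0], "trust": [1, 4], "requires": [2], "verified": [3]}
print(reconstruct_abstract(sample))  # Zero trust requires verified trust
```

Works without an abstract return an empty or missing index, so the function returns an empty string in that case rather than raising.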
Rate limit: 10/sec without key, 100/sec with mailto.
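The OpenAlex request above could be assembled along these lines. This is a sketch only: the helper name build_openalex_url and the example email address are assumptions. When several filters apply at once, OpenAlex expects them comma-joined inside a single filter parameter, which is what the helper does.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_openalex_url(query, filters=None, mailto=None, per_page=50):
    """Build a works-search URL; multiple filters are comma-joined into one parameter."""
    params = {
        "search": query,
        "per_page": per_page,
        "sort": "relevance_score:desc",
    }
    if filters:
        params["filter"] = ",".join(f"{key}:{value}" for key, value in filters.items())
    if mailto:
        params["mailto"] = mailto  # opts in to the polite pool
    # Keep ':' and ',' readable; everything else is percent-encoded as usual.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':,')}"


url = build_openalex_url(
    "zero trust architecture",
    filters={"from_publication_date": "2020-01-01", "is_oa": "true"},
    mailto="you@example.org",  # placeholder address
)
print(url)
```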
GET https://api.semanticscholar.org/graph/v1/paper/search?query={query}&limit=50&fields=title,authors,year,abstract,externalIds,citationCount
Rate limit: 100 requests/5 minutes.
GET https://export.arxiv.org/api/query?search_query=all:{query}&start=0&max_results=50&sortBy=relevance
Rate limit: 1 request/3 seconds. Always respect this.
For cybersecurity/compliance topics, use category filter: cat:cs.CR
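One way to honor the arXiv limit is a small throttle that is called before every request. The class below is a hypothetical helper, not part of any arXiv client; the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class MinIntervalThrottle:
    """Blocks so that consecutive calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep if needed, then record the call time. Returns the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay


# 1 request per 3 seconds, as stated above.
arxiv_throttle = MinIntervalThrottle(interval=3.0)
# Call arxiv_throttle.wait() immediately before each request to export.arxiv.org.
```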
These are just as important for practitioner-oriented research.
Use COMPOSIO_SEARCH_WEB or RUBE_SEARCH_TOOLS to search for:
Construct targeted queries for each source type:
"{topic}" site:blog OR whitepaper OR "technical report""{topic}" site:nist.gov OR site:disa.mil OR site:cyber.gov"{topic}" site:docs.* OR "technical documentation""{topic}" conference OR presentation OR "talk" filetype:pdf"{topic}" site:reddit.com OR site:stackoverflow.comFor compliance and cybersecurity topics, search NIST directly:
WebFetch: https://csrc.nist.gov/publications?keywords={query}
Extract publication titles, abstracts, and PDF links.
For STIG-related research:
WebFetch: https://public.cyber.mil/stigs/
Search for relevant STIGs, SRGs, and related documentation.
For each discovered source, classify it:
| Type | Entry Type | Example |
|---|---|---|
| Journal article | article | IEEE, ACM papers |
| Conference paper | inproceedings | Black Hat, RSA talks |
| Preprint | preprint | arXiv papers |
| Government publication | government | NIST SP 800-series |
| Industry whitepaper | whitepaper | Vendor security reports |
| Blog post | blog | Practitioner insights |
| Standard/Framework | standard | DISA STIGs, CIS Benchmarks |
| Book/Chapter | book | Textbooks, reference guides |
| Other web source | web | Everything else |
After the API and web searches complete, run the multi-model swarm to catch what structured APIs miss. This uses OpenRouter to query multiple LLMs in parallel, each with different training data, web access, and knowledge bases.
Send the same research prompt to 2-3 models via Rube MCP, targeting the OpenRouter chat completions endpoint. Each model returns sources it knows about. Claude then merges, deduplicates, and validates the combined results.
| Model | OpenRouter ID | Strength | Notes |
|---|---|---|---|
| Perplexity Sonar Pro | perplexity/sonar-pro | Web-grounded search, real-time citations with verified URLs | Primary. Always use. Best source of real, current URLs. |
| Google Gemini 2.5 Flash | google/gemini-2.5-flash-preview-05-20 | Fast, strong on government/standards docs | Use Flash not Pro — Pro's reasoning burns the token budget before generating output. |
| Anthropic Claude Sonnet | anthropic/claude-sonnet-4 | Deep reasoning, strong cross-domain connections | Good complement to Perplexity. Honest about what it doesn't know. |
| Model | Why |
|---|---|
| openai/gpt-4o | Fabricates URLs. Every single URL it returned in testing was example.com. Cannot be trusted for source discovery. |
| deepseek/deepseek-r1 | Reasoning phase takes too long, causes Rube MCP 60-second timeout. Use deepseek/deepseek-chat instead if you want DeepSeek coverage. |
| google/gemini-2.5-pro-preview-05-06 | Spends most of its token budget on internal reasoning, then hits max_tokens before outputting results. Use Flash instead. |
Run Perplexity plus one other model as a pair via RUBE_MULTI_EXECUTE_TOOL.
Two models in parallel complete within the 60-second Rube timeout.
Three models risk a timeout; run the third separately if needed.
Send this prompt (adapted per topic) to each model:
You are a research discovery agent. Your job is to find ALL significant sources
— academic papers, government publications, industry whitepapers, blog posts,
conference talks, and standards documents — on the following topic:
TOPIC: {topic}
CONTEXT: {library_description}
For each source you find, provide:
1. Title (exact)
2. Authors (if known)
3. Year of publication
4. Type: academic | government | industry | blog | standard | book
5. URL or DOI (if you have it — ONLY real ones, never fabricate)
6. A 2-3 sentence description of what the source covers
7. Why it's relevant to this research topic
Find at least 15 sources. Prioritize:
- Seminal/foundational works that everyone in this field cites
- Recent publications (last 3 years) showing current state of the art
- Government standards and guidance documents (NIST, DISA, DoD)
- Practitioner perspectives (blogs, conference talks, vendor whitepapers)
- Contrarian or critical viewpoints that challenge mainstream thinking
DO NOT fabricate citations. If you're unsure about a URL or DOI, say so.
It's better to give a title without a link than a fake link.
Format your response as a JSON array:
[
{
"title": "...",
"authors": "...",
"year": 2024,
"type": "academic",
"url": "https://...",
"doi": "10.1234/...",
"description": "...",
"relevance": "..."
}
]
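Replies to this prompt will not always be a bare JSON array: a model may add prose, citation markers such as [1], or Markdown fences around it. The sketch below, with a hypothetical helper name, scans the reply for the first JSON array that contains objects and ignores everything else.

```python
import json


def parse_source_array(reply_text):
    """Extract the first JSON array of source objects from a model reply.

    Tolerates surrounding prose or fences; returns [] if nothing parseable is found.
    """
    decoder = json.JSONDecoder()
    start = reply_text.find("[")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(reply_text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, list):
            sources = [item for item in value if isinstance(item, dict)]
            if sources:  # skip arrays like [1] that hold no source objects
                return sources
        start = reply_text.find("[", start + 1)
    return []


# Placeholder reply text, not real model output.
sample_reply = 'Found these [1]: [{"title": "Example Source", "year": 2024, "type": "blog"}] Done.'
print(parse_source_array(sample_reply))
```

Returning an empty list on failure lets the caller treat an unparseable reply the same as a model that found nothing.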
Use RUBE_MULTI_EXECUTE_TOOL or make parallel RUBE_REMOTE_WORKBENCH calls
to the OpenRouter API:
POST https://openrouter.ai/api/v1/chat/completions
Headers:
Authorization: Bearer {OPENROUTER_API_KEY}
Content-Type: application/json
HTTP-Referer: https://moxywolf.com
X-Title: MoxyWolf Research Pipeline
Body:
{
"model": "{model_id}",
"messages": [
{"role": "user", "content": "{swarm_prompt}"}
],
"temperature": 0.3,
"max_tokens": 4000
}
Send all model requests in parallel. Don't wait for one to finish before starting the next.
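For reference, the same fan-out can be sketched in plain Python outside the Rube tools. Everything here is an assumption about how one might do it, not the skill's own implementation: the function names are hypothetical, the 60-second timeout simply mirrors the Rube limit, and the reply is read from choices[0].message.content as in the OpenAI-compatible response format that OpenRouter returns.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_payload(model_id, swarm_prompt):
    """Request body matching the settings shown above."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": swarm_prompt}],
        "temperature": 0.3,
        "max_tokens": 4000,
    }


def post_to_openrouter(payload, api_key, timeout=60):
    """Send one chat-completions request and return the reply text."""
    request = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://moxywolf.com",
            "X-Title": "MoxyWolf Research Pipeline",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def run_swarm(model_ids, swarm_prompt, send):
    """Call send(payload) for every model concurrently; map model id to reply or error."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(model_ids))) as pool:
        futures = {
            model_id: pool.submit(send, build_payload(model_id, swarm_prompt))
            for model_id in model_ids
        }
        for model_id, future in futures.items():
            try:
                results[model_id] = future.result()
            except Exception as error:  # one failed model should not sink the run
                results[model_id] = error
    return results


# Usage (performs real network calls, so it is left commented out):
# replies = run_swarm(
#     ["perplexity/sonar-pro", "anthropic/claude-sonnet-4"],
#     swarm_prompt,
#     lambda payload: post_to_openrouter(payload, api_key),
# )
```

Passing the sender in as a parameter keeps the parallel logic separate from the HTTP call, so a slow or failing model is recorded as an error without discarding the other replies.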
After all models respond:
"discovered_by": ["perplexity", "gemini"]Actual costs from live testing (March 2026):
Total per discovery run (2 models): ~$0.03-0.07. Negligible for the coverage improvement.
After merging with API results, flag swarm-only discoveries in the presentation:
🔍 Multi-Model Discovery ([count] unique sources not found by API search)
22. [Title] ([Year]) — found by Perplexity + Gemini Flash
Type: whitepaper
URL: [link]
Why: [relevance explanation]
23. [Title] ([Year]) — found by Perplexity only (URL verified)
Type: blog
URL: [link]
Why: [relevance explanation]
This makes it clear which sources came from the swarm vs. structured APIs, so the user can weigh confidence accordingly.
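A rough sketch of how per-model results might be merged while recording which models found each source in a discovered_by list. The identity rule used here (DOI if present, otherwise a normalized title) is an assumption, and the function names are hypothetical.

```python
import re


def source_key(source):
    """Prefer the DOI as identity; fall back to a normalized title."""
    doi = (source.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = re.sub(r"[^a-z0-9]+", " ", (source.get("title") or "").lower()).strip()
    return "title:" + title


def merge_swarm_results(results_by_model):
    """Merge per-model source lists, recording which models found each source."""
    merged = {}
    for model_name, sources in results_by_model.items():
        for source in sources:
            key = source_key(source)
            if key not in merged:
                merged[key] = {**source, "discovered_by": []}
            entry = merged[key]
            if model_name not in entry["discovered_by"]:
                entry["discovered_by"].append(model_name)
            # Fill fields an earlier model left empty (for example a missing URL).
            for field, value in source.items():
                if value and not entry.get(field):
                    entry[field] = value
    return list(merged.values())
```

A source reported with a DOI by one model and without one by another will not be merged under this rule, so a title-similarity pass afterwards is still worthwhile.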
If expanding an existing library:
SELECT doi, title, url FROM citations WHERE library_id = {id}
Remove matches on DOI, URL, or title similarity >90%.
For new libraries, deduplicate across search results (same paper found by multiple APIs).
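The matching rule above can be sketched as follows. The text does not say how title similarity is measured, so difflib.SequenceMatcher's ratio is used here as an assumed stand-in; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join((text or "").lower().split())


def is_duplicate(candidate, existing, threshold=0.90):
    """True if the candidate matches on DOI, URL, or a near-identical title."""
    for field in ("doi", "url"):
        left, right = normalize(candidate.get(field)), normalize(existing.get(field))
        if left and left == right:
            return True
    left, right = normalize(candidate.get("title")), normalize(existing.get("title"))
    if not left or not right:
        return False
    return SequenceMatcher(None, left, right).ratio() > threshold


def remove_known(candidates, library_rows, threshold=0.90):
    """Drop candidates already in the library, plus repeats among the candidates."""
    kept = []
    for candidate in candidates:
        seen = list(library_rows) + kept
        if not any(is_duplicate(candidate, row, threshold) for row in seen):
            kept.append(candidate)
    return kept
```

Passing an empty list for library_rows covers the new-library case, where the only duplicates are the same paper returned by more than one API.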
Group results by source type for clarity:
Literature Discovery: "{topic}"
══════════════════════════════
Found [X] sources across academic and industry channels.
📄 Academic Papers ([count])
1. [Title] ([Year]) — [Journal]
Authors: [...] | Citations: [count]
DOI: [doi]
2. ...
🏛️ Government & Standards ([count])
3. [NIST SP 800-171 Rev 3] — NIST
Published: [date]
URL: [link]
4. ...
📝 Industry & Practitioner ([count])
5. [Blog Title] — [Site]
Author: [...] | Published: [date]
URL: [link]
6. ...
Which of these should I add to your library?
Say 'all', specific numbers, 'academic only', 'industry only', or 'none'.
For each approved source:
INSERT INTO citations (
library_id, citation_key, entry_type, title, authors, year,
journal, abstract, doi, arxiv_id, url, bibtex_raw,
verification_status, source
) VALUES (
{library_id},
'{generated_citation_key}',
'{entry_type}',
'{title}',
'{authors}',
{year},
'{journal_or_publisher}',
'{abstract_or_description}',
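NULLIF('{doi}', ''),
NULLIF('{arxiv_id}', ''),
'{url}',
'{bibtex_raw_if_available}',
-- VALUES cannot reference the target columns, so test the substituted values; empty strings count as missing
CASE WHEN NULLIF('{doi}', '') IS NOT NULL OR NULLIF('{arxiv_id}', '') IS NOT NULL THEN 'verified' ELSE 'unverified' END,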
'{source_api}'
)
ON CONFLICT (library_id, citation_key) DO NOTHING
For non-academic sources without a BibTeX key, generate one:
{first_author_surname}_{year} for authored works
{org_acronym}_{year}_{short_title} for org publications (e.g., nist_2024_sp800171)
{site}_{year}_{slug} for blog posts (e.g., krebs_2024_stig_automation)
For web sources, generate a BibTeX entry so the library stays export-compatible:
@misc{krebs_2024_stig_automation,
author = {Krebs, Brian},
title = {Automating STIG Compliance at Scale},
year = {2024},
url = {https://example.com/article},
note = {Blog post. Accessed: 2026-03-18},
abstract = {Brief description of the content...}
}
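Key generation and the @misc entry above could be produced with helpers along these lines. This is a sketch with hypothetical function names: it keeps only the first word of the prefix, truncates the title slug to three words, and does not escape BibTeX special characters.

```python
import re


def slugify(text, max_words=3):
    """Lowercase, keep alphanumerics, join the first few words with underscores."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "_".join(words[:max_words])


def make_citation_key(prefix, year, short_title=None):
    """Build keys such as krebs_2024_stig_automation or nist_2024_sp800171."""
    parts = [slugify(prefix, max_words=1), str(year)]
    if short_title:
        parts.append(slugify(short_title))
    return "_".join(parts)


def make_misc_entry(key, author, title, year, url, note):
    """Render a minimal @misc BibTeX entry for a web source."""
    fields = {"author": author, "title": title, "year": year, "url": url, "note": note}
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


key = make_citation_key("Krebs", 2024, "STIG automation")
print(key)  # krebs_2024_stig_automation
```

Because the INSERT uses ON CONFLICT (library_id, citation_key) DO NOTHING, two different sources that generate the same key would silently collapse into one row, so it may be worth appending a suffix when a key already exists.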
Library Built: {name}
═══════════════════
Total sources added: [X]
├─ Academic papers: [count]
├─ Government/standards: [count]
├─ Industry/practitioner: [count]
└─ Other: [count]
Verification status:
├─ Verified (DOI/arXiv): [count]
└─ Unverified (web): [count]
Next steps:
→ "Verify my citations" — validate DOIs and check for broken links
→ "Find more papers" — run another discovery round with refined terms
→ "Synthesize my research" — build thematic map + writing perspective
→ "Import my BibTeX" — add your own collected references on top
When expanding an existing library, choose based on its state:
Broaden search terms based on themes already in the library.
Follow references and citations from the library's most-cited papers via Semantic Scholar.
Target specific gaps identified in the research_gaps table.
Fill year-range gaps — find recent work or foundational papers.
If the library is all academic, find industry sources. If all industry, find the academic backing.
Weekly automated scan (configure via /schedule):
references/api-response-formats.md: Parsing guides for OpenAlex, Semantic Scholar, arXiv, CrossRef, DataCite
references/openrouter-swarm.md: Multi-model swarm configuration (model roster, prompt templates, merge algorithm, cost breakdown)
npx claudepluginhub moxywolfllc/moxywolf-plugins --plugin research-pipeline