Automates web-scale data collection for research datasets using a human-in-the-loop LLM framework. Formulates search queries, navigates pages, extracts structured data, and performs quality control.
Slash command: /superpowers:llm-web-data-collection
This skill provides a human-in-the-loop framework for automating web-scale data collection using Large Language Models. Manual data collection is time-consuming and error-prone; the skill addresses this by automating search query formulation, page navigation, structured data extraction, and quality control.
Key Innovation: Human-in-the-loop design allows researchers to inspect and adjust decisions at each stage, ensuring alignment with research objectives while mitigating LLM hallucinations and search engine bias.
Use this skill when:
Define Target Dataset:
dataset_spec = {
    "name": "Clinical Trial Sites",
    "description": "Collect information about clinical trial sites including location, specialties, and contact information",
    "fields": [
        {"name": "site_name", "type": "string", "required": True},
        {"name": "location", "type": "string", "required": True},
        {"name": "specialties", "type": "list", "required": False},
        {"name": "contact_email", "type": "email", "required": False},
        {"name": "phone", "type": "phone", "required": False},
        {"name": "website", "type": "url", "required": False}
    ],
    "constraints": [
        "Only include active sites",
        "Focus on US-based facilities",
        "Prefer academic medical centers"
    ]
}
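Before the spec goes to human review, a quick structural check can catch obvious mistakes. The following is a minimal sketch, not part of the original skill: `validate_spec` and `ALLOWED_TYPES` are hypothetical names, and the set of allowed types is assumed from the field types used in the example above.

```python
# Assumption: the allowed types are exactly those used in the example spec.
ALLOWED_TYPES = {"string", "list", "email", "phone", "url"}

def validate_spec(spec):
    """Return a list of human-readable problems found in a dataset spec."""
    problems = []
    for key in ("name", "description", "fields"):
        if not spec.get(key):
            problems.append(f"missing or empty key: {key}")
    seen = set()
    for field in spec.get("fields", []):
        name = field.get("name")
        if not name:
            problems.append("field without a name")
            continue
        if name in seen:
            problems.append(f"duplicate field name: {name}")
        seen.add(name)
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"unknown type for {name}: {field.get('type')}")
    return problems
```

An empty list means the spec passed these checks; it says nothing about whether the schema fits the research question, which remains a human decision.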
Human Review Point:
LLM-Based Query Generation:
import json

def generate_search_queries(dataset_spec, llm):
    """
    Use LLM to generate diverse search queries
    from dataset description
    """
    prompt = f"""
    Given this dataset specification:
    {json.dumps(dataset_spec, indent=2)}

    Generate 20 diverse search engine queries that would help
    find web pages containing this information.

    Consider:
    - Different phrasings of the same concept
    - Specific vs general queries
    - Including and excluding certain terms
    - Different source types (directories, databases, articles)

    Return as JSON list of queries.
    """
    # llm is any client object exposing generate(prompt) -> str
    response = llm.generate(prompt)
    # parse_json is a helper that turns the raw response into a Python list
    queries = parse_json(response)
    return queries
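The `parse_json` helper called above is not defined in this document. One possible sketch, assuming the model returns a JSON list that may be wrapped in prose or in a Markdown code fence (the fallback behavior here is an assumption, not the skill's documented behavior):

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example readable

def parse_json(response):
    """Parse a JSON value from raw LLM text, tolerating fences and surrounding prose."""
    text = response.strip()
    # If the model wrapped its answer in a code fence, keep only the fenced body.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] span in the text.
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("no JSON list found in LLM response")
        return json.loads(text[start:end + 1])
```

Raising on unparseable output, rather than returning an empty list, keeps a failed generation visible to the reviewer.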
Query Diversification:
def diversify_queries(initial_queries, llm):
    """
    Expand queries to reduce search engine bias:
    - Add synonyms
    - Vary query structure
    - Include different geographic modifiers
    - Add temporal modifiers if relevant
    """
    diversified = []
    for query in initial_queries:
        variations = llm.generate_variations(query)
        diversified.extend(variations)
    # Remove duplicates and near-duplicates
    return deduplicate(diversified)
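The `deduplicate` step is left undefined above. A minimal sketch of near-duplicate removal, assuming character-level similarity from the standard library's `difflib` is good enough for short queries; the `threshold` of 0.9 is an illustrative assumption and would need tuning:

```python
from difflib import SequenceMatcher

def normalize(query):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(query.lower().split())

def deduplicate(queries, threshold=0.9):
    """Keep the first query of each group whose pairwise similarity >= threshold."""
    kept = []
    for query in queries:
        norm = normalize(query)
        if not norm:
            continue  # skip empty strings
        is_duplicate = any(
            SequenceMatcher(None, norm, normalize(other)).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(query)
    return kept
```

This compares every query with every kept query, which is quadratic; that is acceptable for a few hundred queries but not for much larger sets.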
Human Review Point:
Search Execution:
def execute_search(queries, search_engine="google"):
    """
    Execute search queries and collect URLs
    """
    all_results = []
    for query in queries:
        # search_api is a placeholder for whichever search client is in use
        results = search_api.search(
            query,
            num_results=50,
            engine=search_engine
        )
        # Tag each result with the query that produced it
        for result in results:
            result['source_query'] = query
        all_results.extend(results)
    return deduplicate_urls(all_results)
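`deduplicate_urls` is also undefined above. One way it might work, sketched under the assumption that each result is a dict with a `url` key (the key name and the `source_queries` list are illustrative, not taken from the original):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Normalize a URL for comparison: lowercase scheme and host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def deduplicate_urls(results):
    """Keep the first result per canonical URL and record every query that found it."""
    by_url = {}
    for result in results:
        key = canonical_url(result["url"])
        if key not in by_url:
            by_url[key] = {**result, "source_queries": []}
        query = result.get("source_query")
        if query and query not in by_url[key]["source_queries"]:
            by_url[key]["source_queries"].append(query)
    return list(by_url.values())
```

Keeping the list of queries per URL preserves provenance: a page found by many different queries may deserve earlier review than one found by a single query.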
Page Relevance Scoring:
def score_page_relevance(url, page_content, dataset_spec, llm):
    """
    Use LLM to assess page relevance to dataset spec
    """
    prompt = f"""
    Dataset objective: {dataset_spec['description']}
    Page URL: {url}
    Page content (first 5000 chars): {page_content[:5000]}

    Score this page's relevance (0-10) and explain:
    1. Does it contain relevant data points?
    2. Is the data structured or extractable?
    3. Is this a primary source or aggregator?

    Return JSON: {{"score": X, "reasoning": "...", "data_fields_present": [...]}}
    """
    return llm.generate(prompt)
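The scoring function returns the raw model response, so something downstream has to parse it and apply a cutoff. A minimal sketch, assuming the response is the JSON object requested in the prompt; `filter_relevant` and the `min_score` default of 6 are hypothetical choices, not values from the original:

```python
import json

def filter_relevant(scored_pages, min_score=6):
    """Split (url, raw_llm_response) pairs into kept pages and pages needing human review."""
    kept, needs_review = [], []
    for url, raw in scored_pages:
        try:
            score = float(json.loads(raw)["score"])
        except (ValueError, KeyError, TypeError):
            needs_review.append(url)  # unparseable response: leave it to a human
            continue
        if not 0 <= score <= 10:
            needs_review.append(url)  # out-of-range score is treated as suspect
        elif score >= min_score:
            kept.append((url, score))
    # Highest-scoring pages first
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept, needs_review
```

Pages below the cutoff are dropped silently here; routing them to a sampled spot-check instead would be a reasonable variation.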
Human Review Point:
Schema-Guided Extraction:
def extract_data(page_content, dataset_spec, llm):
    """
    Extract structured data according to schema
    """
    prompt = f"""
    Extract the following fields from this page content:

    Fields to extract:
    {json.dumps(dataset_spec['fields'], indent=2)}

    Page content:
    {page_content}

    Rules:
    - Only extract explicitly stated information
    - Mark uncertain extractions with confidence score
    - Return null for missing required fields
    - Flag potential hallucination risks

    Return JSON matching the schema.
    """
    extracted = llm.generate(prompt)
    return validate_extraction(extracted, dataset_spec)
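`validate_extraction` is not shown in the original. The sketch below is one plausible shape for it, assuming the extraction is a single JSON object (or an already-parsed dict) and that returning `(record, errors)` is acceptable; the regular expressions are deliberately loose illustrations, not authoritative validators.

```python
import json
import re

# Loose patterns: they catch obvious garbage, not every invalid value.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "url": re.compile(r"^https?://\S+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,20}$"),
}

def validate_extraction(extracted, dataset_spec):
    """Return (record, errors) for one extracted record checked against the spec."""
    record = json.loads(extracted) if isinstance(extracted, str) else dict(extracted)
    errors = []
    for field in dataset_spec["fields"]:
        name, ftype = field["name"], field["type"]
        value = record.get(name)
        if value in (None, "", []):
            if field.get("required"):
                errors.append(f"{name}: required field is missing")
            continue
        if ftype == "list" and not isinstance(value, list):
            errors.append(f"{name}: expected a list")
        elif ftype in PATTERNS and not PATTERNS[ftype].match(str(value)):
            errors.append(f"{name}: does not look like a valid {ftype}")
    return record, errors
```

Records with a non-empty error list are natural candidates for the extraction review checkpoint rather than automatic rejection.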
Hallucination Mitigation:
def verify_extraction(extracted_data, page_content, llm):
    """
    Verify extracted data against source to prevent hallucination
    """
    verification_results = []
    for field, value in extracted_data.items():
        # Check if value appears verbatim or closely in source
        if not find_in_source(value, page_content):
            # Use LLM to verify derivation
            prompt = f"""
            Verify this extraction:
            Field: {field}
            Extracted value: {value}
            Source text: {page_content}

            Is this value:
            1. Directly stated in source
            2. Reasonably derived from source
            3. Possibly hallucinated

            Return confidence score and evidence.
            """
            verification = llm.generate(prompt)
            verification_results.append(verification)
    return flag_low_confidence(verification_results)
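The cheap first check, `find_in_source`, is what keeps most fields from needing a second LLM call. A minimal sketch, assuming that matching after lowercasing and stripping punctuation counts as "verbatim or closely"; this normalization rule is an assumption, and fuzzier matching may be needed in practice:

```python
import re

def _normalize(text):
    """Lowercase and collapse every run of non-word characters to a single space."""
    return re.sub(r"[\W_]+", " ", str(text).lower()).strip()

def find_in_source(value, page_content):
    """True if the value (or every item of a list value) appears in the page text."""
    if value is None:
        return True  # nothing was extracted, so nothing can be hallucinated
    if isinstance(value, (list, tuple)):
        return all(find_in_source(item, page_content) for item in value)
    needle = _normalize(value)
    haystack = _normalize(page_content)
    # Pad with spaces so matches respect word boundaries.
    return bool(needle) and f" {needle} " in f" {haystack} "
```

Because punctuation is dropped, a phone number written as `(555) 010-0199` on the page matches an extracted `555-010-0199`.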
Human Review Point:
Cross-Validation:
def cross_validate(dataset, external_sources):
    """
    Validate extracted data against known sources
    """
    validation_results = []
    for record in dataset:
        # Check against external databases/APIs
        external_match = lookup_external(record, external_sources)
        if external_match:
            agreement = compute_agreement(record, external_match)
            validation_results.append({
                'record': record,
                'external_match': external_match,
                'agreement': agreement
            })
    return validation_results
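How `compute_agreement` measures agreement is not specified. One simple reading, sketched here as an assumption, is the fraction of shared fields whose values match after trimming and lowercasing; the optional `fields` parameter is hypothetical:

```python
def compute_agreement(record, external_match, fields=None):
    """Fraction of comparable fields on which two records agree, or None if none compare."""
    fields = fields or sorted(set(record) & set(external_match))
    compared = agreed = 0
    for name in fields:
        ours, theirs = record.get(name), external_match.get(name)
        if ours is None or theirs is None:
            continue  # a missing value is not evidence of disagreement
        compared += 1
        if str(ours).strip().lower() == str(theirs).strip().lower():
            agreed += 1
    return agreed / compared if compared else None
```

Returning `None` when nothing could be compared keeps "no evidence" distinct from "complete disagreement" (0.0).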
Consistency Checks:
def check_consistency(dataset):
    """
    Check for internal consistency:
    - Duplicate detection
    - Conflicting values
    - Outlier detection
    - Format validation
    """
    issues = []

    # Duplicate detection
    duplicates = find_duplicates(dataset)
    issues.extend(duplicates)

    # Value consistency (same entity, different values)
    conflicts = find_conflicts(dataset)
    issues.extend(conflicts)

    # Outlier detection
    outliers = detect_outliers(dataset)
    issues.extend(outliers)

    return issues
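As an illustration of the "same entity, different values" check, here is a possible sketch of `find_conflicts`. It assumes records are dicts and that an entity is identified by `site_name`; the `key_fields` parameter and the shape of the issue dicts are assumptions, not part of the original.

```python
from collections import defaultdict

def _key(record, key_fields):
    """Case- and whitespace-insensitive identity key for a record."""
    return tuple(" ".join(str(record.get(f, "")).lower().split()) for f in key_fields)

def find_conflicts(dataset, key_fields=("site_name",)):
    """Report fields where records sharing a key carry different non-null values."""
    groups = defaultdict(list)
    for record in dataset:
        groups[_key(record, key_fields)].append(record)
    issues = []
    for key, records in groups.items():
        if len(records) < 2:
            continue
        field_names = sorted({name for r in records for name in r} - set(key_fields))
        for name in field_names:
            values = sorted({str(r[name]) for r in records if r.get(name) is not None})
            if len(values) > 1:
                issues.append({"type": "conflict", "key": key, "field": name, "values": values})
    return issues
```

Each issue lists all competing values, so a reviewer can pick the correct one instead of re-opening the source pages.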
Human Review Point:
Generate Research-Ready Output:
import json

import pandas as pd

def export_dataset(dataset, output_format="csv"):
    """
    Export in standard research formats
    """
    # CSV for tabular data
    if output_format == "csv":
        df = pd.DataFrame(dataset)
        df.to_csv("dataset.csv", index=False)
    # JSON for nested data
    elif output_format == "json":
        with open("dataset.json", "w") as f:
            json.dump(dataset, f, indent=2)
    else:
        raise ValueError(f"Unsupported output format: {output_format}")

    # Generate data dictionary
    generate_data_dictionary(dataset)
    # Generate provenance log
    generate_provenance_log(dataset)
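What `generate_data_dictionary` produces is not described. A minimal sketch, assuming the dataset is a list of dicts and that a per-field summary (observed types, fill rate, one example) is useful; returning rows instead of writing a file is a choice made here for illustration:

```python
def generate_data_dictionary(dataset):
    """Summarize each field: observed Python types, fill rate, and one example value."""
    field_names = sorted({name for record in dataset for name in record})
    rows = []
    for name in field_names:
        # Only non-null values count toward types and fill rate.
        values = [r[name] for r in dataset if r.get(name) is not None]
        rows.append({
            "field": name,
            "types": sorted({type(v).__name__ for v in values}),
            "fill_rate": round(len(values) / len(dataset), 3),
            "example": values[0] if values else None,
        })
    return rows
```

The returned rows can be passed to `pd.DataFrame(rows).to_csv(...)` to ship the dictionary alongside the dataset.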
Documentation:
# Dataset Documentation
## Collection Methodology
- Queries used: [list]
- Sources searched: [list]
- Date range: [dates]
## Quality Metrics
- Total records: X
- Verified records: Y%
- Human-reviewed: Z%
## Limitations
- Search engine bias mitigation: [description]
- Known gaps: [description]
## Provenance
- Each record includes source URL
- Extraction confidence scores included
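The Quality Metrics entries in the template above could be computed rather than filled in by hand. A minimal sketch, assuming each record carries hypothetical boolean flags `verified` and `human_reviewed` (neither name comes from the original):

```python
def quality_metrics(dataset):
    """Compute the template's quality metrics from per-record boolean flags."""
    total = len(dataset)

    def pct(flag):
        # Percentage of records where the flag is truthy; 0.0 for an empty dataset.
        if total == 0:
            return 0.0
        return round(100 * sum(1 for r in dataset if r.get(flag)) / total, 1)

    return {
        "total_records": total,
        "verified_pct": pct("verified"),
        "human_reviewed_pct": pct("human_reviewed"),
    }
```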
| Phase | Checkpoint | Decision |
|---|---|---|
| 1 | Dataset spec review | Approve/modify schema |
| 2 | Query review | Add/remove queries |
| 3 | Page relevance | Adjust scoring criteria |
| 4 | Extraction review | Correct extractions |
| 5 | Quality review | Resolve conflicts |
| 6 | Final approval | Approve dataset |
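The checkpoints in the table share one pattern: present items to a reviewer, record the decision, and carry forward only what was approved or corrected. A minimal sketch of that pattern, assuming the reviewer is modeled as a callback; `review_checkpoint`, `decide`, and the three decision labels are hypothetical names chosen for illustration:

```python
VALID_DECISIONS = {"approve", "modify", "reject"}

def review_checkpoint(phase, items, decide):
    """Pass items through a reviewer callback; keep those approved or modified.

    decide(phase, item) must return (decision, replacement), where replacement
    is only used when decision == "modify".
    """
    approved, log = [], []
    for item in items:
        decision, replacement = decide(phase, item)
        if decision not in VALID_DECISIONS:
            raise ValueError(f"unknown decision: {decision!r}")
        log.append({"phase": phase, "item": item, "decision": decision})
        if decision == "approve":
            approved.append(item)
        elif decision == "modify":
            approved.append(replacement)
    return approved, log
```

In interactive use, `decide` would prompt a person; in tests it can be a plain function. The returned log records every decision, which supports the provenance requirements described earlier.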
# LLM
pip install openai # or anthropic, google-generativeai
# Web
pip install requests beautifulsoup4 selenium
# Data
pip install pandas
# Search APIs (optional)
pip install googlesearch-python
npx claudepluginhub lunartech-x/superpowers --plugin superpowers