Generates SFT training data for content learning: Q&A pairs extracted from documents, or generated from topic trees and answered via web search. Use it to teach a model domain knowledge through supervised fine-tuning.
Slash command
/lightningrod-python-sdk:content-learning-examples
---
From documents: Documents → chunk → QuestionAndLabelGenerator (extracts Q and A) → SFT. Use QuestionAndLabelGenerator, not WebSearchLabeler — the answers are in the documents.
From a topic/domain (no documents): Domain → TopicTreeSeedGenerator → questions → WebSearchLabeler (finds answers from the web) → SFT.
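The choice between the two routes above can be sketched as a small lookup. This is only an illustration: `PipelinePlan` and `plan_content_learning` are hypothetical helpers, not part of the Lightning Rod SDK, and the strings are just the component names mentioned in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelinePlan:
    """Hypothetical record naming the SDK component used at each stage."""
    seed_generator: str
    question_generator: str
    labeler: Optional[str]  # None when the question generator also produces the answer


def plan_content_learning(has_documents: bool) -> PipelinePlan:
    """Pick the component names for the two routes described above."""
    if has_documents:
        # Answers are already in the text, so Q and A are extracted together.
        return PipelinePlan("FileSetSeedGenerator", "QuestionAndLabelGenerator", None)
    # No documents: expand a topic tree, then look answers up on the web.
    return PipelinePlan("TopicTreeSeedGenerator", "QuestionGenerator", "WebSearchLabeler")


print(plan_content_learning(has_documents=True))
print(plan_content_learning(has_documents=False))
```

The only real decision is whether the answers already exist in your own material; everything else follows from that.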
Goal: Train a model to give step-by-step survival instructions for grid-down emergencies.
TopicTreeSeedGenerator decomposes broad domains into specific leaf seeds for coverage, then WebSearchLabeler finds authoritative answers from the web.
Source: lightningrod-python-sdk/notebooks/fine_tuning/03_survival_llm.ipynb
from lightningrod import (
    LightningRod, QuestionPipeline,
    QuestionGenerator, FreeResponseAnswerType, WebSearchLabeler,
)
# TopicTreeSeedGenerator is coming soon — not yet available in the SDK.
# When released, import it from lightningrod and use as shown below.
from lightningrod import TopicTreeSeedGenerator # available soon
lr = LightningRod(api_key=api_key)
answer_type = FreeResponseAnswerType(
    labeler_instruction=(
        "You are a survival expert giving emergency field instructions. "
        "Direct, numbered steps. No introductions or disclaimers. "
        "Specific measurements and techniques."
    ),
    answer_format_instruction=(
        "Direct, step-by-step answer. Start with step 1, no introduction."
    ),
)
pipeline = QuestionPipeline(
    # TopicTreeSeedGenerator decomposes each root topic into degree^depth leaf seeds.
    # 16 roots × 5^2 = 400 specific seeds like
    # "Field medicine → improvising supplies → makeshift tourniquets"
    seed_generator=TopicTreeSeedGenerator(
        topic=[
            "Field medicine and trauma care in austere environments",
            "Water purification and safe water sourcing without electricity",
            "Food preservation, canning, and long-term storage without refrigeration",
            "Ham radio and emergency communications setup and operation",
            "Land navigation using map, compass, and natural indicators",
            "Growing food: gardening, permaculture, and seed saving",
            "Herbal medicine and natural remedies from wild plants",
            "Construction, structural repair, and improvised building",
            "Welding, metalworking, and tool fabrication",
            "Vehicle repair and mechanical troubleshooting without a shop",
            "Fire starting, fire management, and fuel sourcing",
            "Emergency shelter building from natural and salvaged materials",
            "Hunting, trapping, fishing, and wild game processing",
            "Knot tying, rope work, and cordage making",
            "Weather reading and natural forecasting without instruments",
            "Perimeter security, self-defense, and community safety planning",
        ],
        tree_depth=2,   # levels of recursive expansion
        tree_degree=5,  # subtopics per node
        model_name="google/gemini-3-flash-preview",
        model_system_prompt=(
            "You are an expert in survival and self-reliance. "
            "Generate specific, practical subtopics useful in a grid-down emergency."
        ),
    ),
    question_generator=QuestionGenerator(
        answer_type=answer_type,
        questions_per_seed=10,  # high: topic seeds are conceptual, not dense text
        instructions=(
            "Generate practical survival questions for grid-down emergencies. "
            "Specific, scenario-based, ask HOW to do something with limited tools. "
            "Each must cover a DISTINCT technique."
        ),
        examples=[
            "How do I purify water using only sand, gravel, and charcoal?",
            "How do I perform a needle decompression for tension pneumothorax in the field?",
            "How do I build a Dakota fire hole to minimize smoke and maximize heat?",
        ],
        bad_examples=[
            "What is survival? (too vague)",
            "Tell me about water purification. (not actionable)",
            "How does a ham radio work? (theoretical, not how-to)",
        ],
    ),
    labeler=WebSearchLabeler(answer_type=answer_type, confidence_threshold=0.8),
)
dataset = lr.transforms.run(pipeline, name="SurvivalLLM")
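The seed arithmetic in the pipeline comment (16 roots × 5^2 = 400) can be checked with a few lines of plain Python. This sketch assumes every node expands to exactly `tree_degree` children and that only the leaves become seeds; `leaf_seed_count` is a hypothetical helper, not an SDK function, and the question count is an upper bound before the labeler's `confidence_threshold` drops anything.

```python
def leaf_seed_count(num_roots: int, tree_degree: int, tree_depth: int) -> int:
    """Leaves of a full tree: each root fans out tree_degree ways, tree_depth times."""
    return num_roots * tree_degree ** tree_depth


seeds = leaf_seed_count(num_roots=16, tree_degree=5, tree_depth=2)
max_questions = seeds * 10  # questions_per_seed=10 in the pipeline above

print(seeds)          # 400
print(max_questions)  # 4000
```

Because the count grows exponentially with `tree_depth`, raising the depth from 2 to 3 under the same assumptions would give 16 × 125 = 2,000 seeds.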
After dataset = lr.transforms.run(...), prepare a train split and run hosted SFT on Lightning Rod (same service as GRPO training):
from lightningrod import prepare_for_training, FilterParams, SplitParams, SFTTrainingConfig
from lightningrod import display_lint_overview, get_lint_affected_sample_ids

# Lint the full dataset before splitting
lint_result = lr.datasets.linter.run(dataset.id)
display_lint_overview(lint_result)

train_dataset, test_dataset = prepare_for_training(
    dataset,
    filter=FilterParams(),
    split=SplitParams(test_size=0.2),
)

BASE_MODEL = "Qwen/Qwen3-8B-Instruct"
training_config = SFTTrainingConfig(
    base_model_id=BASE_MODEL,
    training_steps=50,
    epochs=3,
    learning_rate=2e-4,
)

# Check the estimate before launching the job
cost = lr.training.estimate_cost(training_config, dataset=train_dataset)
job = lr.training.run(training_config, dataset=train_dataset, name="survival-sft-v1")
# job.model_id is your LoRA checkpoint for inference via lr.predict(...)
For low-level local training loops (e.g. direct Tinker ServiceClient), use the Tinker SDK separately; the snippet above is the recommended path when your data already lives in Lightning Rod samples.
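To make the effect of `SplitParams(test_size=0.2)` concrete, here is a minimal sketch of an 80/20 holdout split in plain Python. It is not the SDK's implementation: `split_train_test`, the shuffle, and the fixed seed are assumptions for illustration, and `prepare_for_training` may shuffle, stratify, or filter differently.

```python
import random


def split_train_test(sample_ids, test_size=0.2, seed=0):
    """Shuffle the ids reproducibly, then hold out a test_size fraction."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_size)
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = split_train_test(range(1000), test_size=0.2)
print(len(train_ids), len(test_ids))  # 800 200
```

Holding out a test split before training gives you examples the model has never seen for a before/after comparison.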
Goal: Train a model to answer clinical nutrition questions using knowledge from medical textbooks.
QuestionAndLabelGenerator extracts Q&A pairs directly from document chunks — no labeler needed since the answers are in the text.
Source: llm_forecasting/notebooks/client_work/takeoff41/dataset_generation.ipynb
import json
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from lightningrod import (
    LightningRod, FileSetMetadataSchemaInput,
    MetadataFieldDefinitionInput, MetadataFieldType,
)

lr = LightningRod(api_key=api_key)

schema = FileSetMetadataSchemaInput(fields=[
    MetadataFieldDefinitionInput(
        name="book_title", field_type=MetadataFieldType.STRING, required=True,
        description="Title of the textbook"
    ),
])
fileset = lr.filesets.create(
    name="Medical Nutrition Textbooks",
    description="Clinical nutrition textbooks for SFT training data",
    metadata_schema=schema,
)

# textbooks is a list of (pdf_path, title) tuples; pdf_path is a pathlib.Path
file_names = [pdf_path.name for pdf_path, _ in textbooks]
upload_response = lr.filesets.upload_folder(fileset.id, file_names)

# Upload PDFs in parallel
def upload_file(pdf_path, title):
    url = upload_response.upload_urls.additional_properties[pdf_path.name]
    with open(pdf_path, "rb") as f:
        requests.put(url, data=f.read()).raise_for_status()

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [
        executor.submit(upload_file, pdf_path, title)
        for pdf_path, title in textbooks
    ]
    for future in as_completed(futures):
        future.result()  # re-raise any upload error instead of dropping it silently

# Upload metadata manifest
manifest = {pdf_path.name: {"book_title": title} for pdf_path, title in textbooks}
manifest_url = upload_response.upload_urls.additional_properties["_manifest.json"]
requests.put(manifest_url, data=json.dumps(manifest).encode("utf-8")).raise_for_status()
# The vector index is built automatically when the FileSet is first used in a pipeline
from lightningrod import (
    QuestionPipeline, FileSetSeedGenerator,
    QuestionAndLabelGenerator, FreeResponseAnswerType,
)

pipeline = QuestionPipeline(
    seed_generator=FileSetSeedGenerator(
        file_set_id=fileset.id,
        chunk_size=4000,  # larger chunks = more context per Q&A
        chunk_overlap=200,
    ),
    question_generator=QuestionAndLabelGenerator(
        answer_type=FreeResponseAnswerType(),
        questions_per_seed=3,  # 3 Q&A pairs per chunk: dense medical text
        instructions=(
            "Generate questions testing understanding of clinical nutrition concepts, "
            "medical procedures, and evidence-based practices. Specific, proper terminology. "
            "Answers should cite specific values/ranges when mentioned."
        ),
    ),
)
dataset = lr.transforms.run(pipeline, max_seeds=4000, name="Medical nutrition Q&A")
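How `chunk_size` and `chunk_overlap` interact can be illustrated with a simple sliding window. This is a sketch under stated assumptions, not how `FileSetSeedGenerator` works internally: it measures sizes in characters, while the SDK may count tokens or respect sentence and page boundaries. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


document = "x" * 10_000  # stand-in for extracted textbook text
chunks = chunk_text(document)
print(len(chunks))      # 3
print(len(chunks) * 3)  # 9 Q&A pairs at questions_per_seed=3
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated context.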
sft_data = []
for s in dataset.download():
    # Skip samples the pipeline marked invalid
    if not s.is_valid:
        continue
    q, a = s.question.question_text, s.label.label
    # Skip empty or undetermined Q&A pairs
    if not q or not a or a == "undetermined":
        continue
    sft_data.append({"messages": [
        {"role": "system", "content": "You are a clinical nutrition expert."},
        {"role": "user", "content": q},
        {"role": "assistant", "content": a},
    ]})
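Once `sft_data` is built, a common next step is to write it out as JSON Lines, one chat record per line. The sketch below uses only the standard library. `write_jsonl`, the output filename, and the placeholder record are assumptions for illustration, and whether a given trainer expects this exact file layout should be checked against that trainer's documentation.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Placeholder record in the same shape as the entries appended to sft_data.
sft_data = [
    {"messages": [
        {"role": "system", "content": "You are a clinical nutrition expert."},
        {"role": "user", "content": "<question text>"},
        {"role": "assistant", "content": "<answer text>"},
    ]},
]

out_path = Path(tempfile.mkdtemp()) / "sft_train.jsonl"
written = write_jsonl(sft_data, out_path)

# Read the file back to confirm each line parses as a standalone JSON object.
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(written, loaded[0]["messages"][1]["role"])
```

Passing `ensure_ascii=False` keeps units and symbols from the source text readable in the file instead of escaping them.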
| Book | Q&A Pairs |
|---|---|
| ASPEN Parenteral Nutrition | 1,504 |
| ASPEN Fluids & Electrolytes | 1,127 |
| ASPEN Pediatric Nutrition | 3,787 |
| Handbook | 1,347 |
| NBNSC Book | 908 |
| Pediatric Nutrition | 1,666 |
| Total | 10,339 |
- From documents: use QuestionAndLabelGenerator, not WebSearchLabeler; the answers are in the documents.
- From topics: WebSearchLabeler is correct; the web provides answers for topic-generated questions.
- FilterCriteria(min_score=0.7), score cutoffs, or agreement checks
- Match questions_per_seed to density: topic tree nodes → 10, doc chunks (4000) → 3, doc chunks (2000) → 2, short text → 1.

Install: npx claudepluginhub lightning-rod-labs/lightningrod-python-sdk --plugin lightningrod-python-sdk