Skill

content-learning-examples

Generates SFT training data for content learning: Q&A pairs extracted from documents, or generated from topic trees and answered via web search. Use for teaching domain knowledge through supervised fine-tuning.

Python

ai-ml


Invocation

Slash command: /lightningrod-python-sdk:content-learning-examples


SKILL.md

264 lines · ~2.6k tokens

Stats

Language: Jupyter Notebook
Stars: 51
Forks: 4
Maintenance: Excellent
Last Commit: Jun 16, 2026


Content Learning Examples (SFT)

Two Starting Points

From documents: Documents → chunk → QuestionAndLabelGenerator (extracts Q and A) → SFT. Use QuestionAndLabelGenerator, not WebSearchLabeler — the answers are in the documents.

From a topic/domain (no documents): Domain → TopicTreeSeedGenerator → questions → WebSearchLabeler (finds answers from the web) → SFT.
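The choice between the two starting points can be summarized as a small lookup. This is an illustrative sketch only: `choose_components` is a hypothetical helper, not part of the lightningrod SDK, and it returns the class names used in this guide as plain strings rather than constructing anything.

```python
def choose_components(has_documents: bool) -> dict:
    """Hypothetical helper: map a starting point to pipeline component names."""
    if has_documents:
        # Answers are already in the text, so one component extracts both Q and A.
        return {
            "seed_generator": "FileSetSeedGenerator",
            "question_generator": "QuestionAndLabelGenerator",
            "labeler": None,
        }
    # No documents: expand a topic tree, then find answers on the web.
    return {
        "seed_generator": "TopicTreeSeedGenerator",
        "question_generator": "QuestionGenerator",
        "labeler": "WebSearchLabeler",
    }


from_docs = choose_components(has_documents=True)
from_topics = choose_components(has_documents=False)
```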

Example 1: Survival Field Guide (Topic Tree + Web Q&A)

Goal: Train a model to give step-by-step survival instructions for grid-down emergencies.

TopicTreeSeedGenerator decomposes broad domains into specific leaf seeds for coverage, then WebSearchLabeler finds authoritative answers from the web.

Source: lightningrod-python-sdk/notebooks/fine_tuning/03_survival_llm.ipynb

Pipeline

from lightningrod import (
    LightningRod, QuestionPipeline,
    QuestionGenerator, FreeResponseAnswerType, WebSearchLabeler,
)
# TopicTreeSeedGenerator is coming soon — not yet available in the SDK.
# When released, import it from lightningrod and use as shown below.
from lightningrod import TopicTreeSeedGenerator  # available soon

lr = LightningRod(api_key=api_key)

answer_type = FreeResponseAnswerType(
    labeler_instruction=(
        "You are a survival expert giving emergency field instructions. "
        "Direct, numbered steps. No introductions or disclaimers. "
        "Specific measurements and techniques."
    ),
    answer_format_instruction=(
        "Direct, step-by-step answer. Start with step 1, no introduction."
    ),
)

pipeline = QuestionPipeline(
    # TopicTreeSeedGenerator decomposes each root topic into degree^depth leaf seeds.
    # 16 roots × 5^2 = 400 specific seeds like
    # "Field medicine → improvising supplies → makeshift tourniquets"
    seed_generator=TopicTreeSeedGenerator(
        topic=[
            "Field medicine and trauma care in austere environments",
            "Water purification and safe water sourcing without electricity",
            "Food preservation, canning, and long-term storage without refrigeration",
            "Ham radio and emergency communications setup and operation",
            "Land navigation using map, compass, and natural indicators",
            "Growing food: gardening, permaculture, and seed saving",
            "Herbal medicine and natural remedies from wild plants",
            "Construction, structural repair, and improvised building",
            "Welding, metalworking, and tool fabrication",
            "Vehicle repair and mechanical troubleshooting without a shop",
            "Fire starting, fire management, and fuel sourcing",
            "Emergency shelter building from natural and salvaged materials",
            "Hunting, trapping, fishing, and wild game processing",
            "Knot tying, rope work, and cordage making",
            "Weather reading and natural forecasting without instruments",
            "Perimeter security, self-defense, and community safety planning",
        ],
        tree_depth=2,       # levels of recursive expansion
        tree_degree=5,      # subtopics per node
        model_name="google/gemini-3-flash-preview",
        model_system_prompt=(
            "You are an expert in survival and self-reliance. "
            "Generate specific, practical subtopics useful in a grid-down emergency."
        ),
    ),
    question_generator=QuestionGenerator(
        answer_type=answer_type,
        questions_per_seed=10,          # high — topic seeds are conceptual, not dense text
        instructions=(
            "Generate practical survival questions for grid-down emergencies. "
            "Specific, scenario-based, ask HOW to do something with limited tools. "
            "Each must cover a DISTINCT technique."
        ),
        examples=[
            "How do I purify water using only sand, gravel, and charcoal?",
            "How do I perform a needle decompression for tension pneumothorax in the field?",
            "How do I build a Dakota fire hole to minimize smoke and maximize heat?",
        ],
        bad_examples=[
            "What is survival? (too vague)",
            "Tell me about water purification. (not actionable)",
            "How does a ham radio work? (theoretical, not how-to)",
        ],
    ),
    labeler=WebSearchLabeler(answer_type=answer_type, confidence_threshold=0.8),
)

dataset = lr.transforms.run(pipeline, name="SurvivalLLM")
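
The seed count quoted in the pipeline comment can be checked with a few lines of arithmetic. This sketch assumes every node expands to exactly tree_degree children, which the real generator may not guarantee; `leaf_seed_count` is a hypothetical helper, not an SDK function.

```python
def leaf_seed_count(num_roots: int, tree_degree: int, tree_depth: int) -> int:
    """Leaf seeds for a full tree: each root yields tree_degree ** tree_depth leaves."""
    return num_roots * tree_degree ** tree_depth


seeds = leaf_seed_count(num_roots=16, tree_degree=5, tree_depth=2)
# With questions_per_seed=10 this is the number of questions requested,
# before WebSearchLabeler drops answers below its confidence threshold.
max_questions = seeds * 10
print(seeds, max_questions)  # 400 4000
```

Raising tree_depth by one multiplies the seed count by tree_degree, so depth is the parameter to change cautiously.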

SFT Training

After dataset = lr.transforms.run(...), prepare a train split and run hosted SFT on Lightning Rod (same service as GRPO training):

from lightningrod import (
    prepare_for_training, FilterParams, SplitParams, SFTTrainingConfig,
    display_lint_overview, get_lint_affected_sample_ids,
)

# Lint the full dataset before splitting.
# get_lint_affected_sample_ids is imported for follow-up inspection; it is not called here.
lint_result = lr.datasets.linter.run(dataset.id)
display_lint_overview(lint_result)

train_dataset, test_dataset = prepare_for_training(
    dataset,
    filter=FilterParams(),
    split=SplitParams(test_size=0.2),
)

BASE_MODEL = "Qwen/Qwen3-8B-Instruct"
training_config = SFTTrainingConfig(
    base_model_id=BASE_MODEL,
    training_steps=50,
    epochs=3,
    learning_rate=2e-4,
)

cost = lr.training.estimate_cost(training_config, dataset=train_dataset)
print(cost)  # review the estimate before launching the job
job = lr.training.run(training_config, dataset=train_dataset, name="survival-sft-v1")
# job.model_id: your LoRA checkpoint for inference via lr.predict(...)

For low-level local training loops (e.g. direct Tinker ServiceClient), use the Tinker SDK separately; the snippet above is the recommended path when your data already lives in Lightning Rod samples.

Example 2: Medical Textbooks (Document Q&A)

Goal: Train a model to answer clinical nutrition questions using knowledge from medical textbooks.

QuestionAndLabelGenerator extracts Q&A pairs directly from document chunks — no labeler needed since the answers are in the text.

Source: llm_forecasting/notebooks/client_work/takeoff41/dataset_generation.ipynb

Step 1: Upload Documents to FileSet

import json
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from lightningrod import (
    LightningRod, FileSetMetadataSchemaInput,
    MetadataFieldDefinitionInput, MetadataFieldType,
)

lr = LightningRod(api_key=api_key)

schema = FileSetMetadataSchemaInput(fields=[
    MetadataFieldDefinitionInput(
        name="book_title", field_type=MetadataFieldType.STRING, required=True,
        description="Title of the textbook"
    ),
])

fileset = lr.filesets.create(
    name="Medical Nutrition Textbooks",
    description="Clinical nutrition textbooks for SFT training data",
    metadata_schema=schema,
)

# textbooks is a list of (pdf_path, title) tuples
file_names = [pdf_path.name for pdf_path, _ in textbooks]
upload_response = lr.filesets.upload_folder(fileset.id, file_names)

# Upload PDFs in parallel
def upload_file(pdf_path):
    url = upload_response.upload_urls.additional_properties[pdf_path.name]
    with open(pdf_path, "rb") as f:
        requests.put(url, data=f.read()).raise_for_status()

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(upload_file, pdf_path) for pdf_path, _ in textbooks]
    for future in as_completed(futures):
        future.result()  # re-raise upload errors instead of silently dropping them

# Upload metadata manifest
manifest = {pdf_path.name: {"book_title": title} for pdf_path, title in textbooks}
manifest_url = upload_response.upload_urls.additional_properties["_manifest.json"]
requests.put(manifest_url, data=json.dumps(manifest).encode("utf-8")).raise_for_status()

# The vector index is built automatically when the FileSet is first used in a pipeline

Step 2: Run Q&A Generation Pipeline

from lightningrod import (
    QuestionPipeline, FileSetSeedGenerator,
    QuestionAndLabelGenerator, FreeResponseAnswerType,
)

pipeline = QuestionPipeline(
    seed_generator=FileSetSeedGenerator(
        file_set_id=fileset.id,
        chunk_size=4000,        # larger chunks = more context per Q&A
        chunk_overlap=200,
    ),
    question_generator=QuestionAndLabelGenerator(
        answer_type=FreeResponseAnswerType(),
        questions_per_seed=3,   # 3 Q&A pairs per chunk — dense medical text
        instructions=(
            "Generate questions testing understanding of clinical nutrition concepts, "
            "medical procedures, and evidence-based practices. Specific, proper terminology. "
            "Answers should cite specific values/ranges when mentioned."
        ),
    ),
)

dataset = lr.transforms.run(pipeline, max_seeds=4000, name="Medical nutrition Q&A")
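
To make the chunk_size and chunk_overlap settings concrete, here is a minimal sketch of overlapping chunking. It assumes character-based windows; how FileSetSeedGenerator actually splits documents (characters, tokens, or sentence boundaries) is not stated here, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 4000, chunk_overlap: int = 200) -> list:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(10000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [4000, 4000, 2400]
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a fact that straddles a boundary is still seen whole in at least one chunk.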

Step 3: Filter and Format for SFT

sft_data = []
for s in dataset.download():
    if not s.is_valid: continue
    q, a = s.question.question_text, s.label.label
    if not q or not a or a == "undetermined": continue
    sft_data.append({"messages": [
        {"role": "system", "content": "You are a clinical nutrition expert."},
        {"role": "user", "content": q},
        {"role": "assistant", "content": a},
    ]})
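
Records in this chat-message shape are commonly stored one JSON object per line (JSONL). The sketch below shows one way to write and read back such a file using only the standard library. The helper names, the file name, and the placeholder record are assumptions for illustration; nothing here is part of the lightningrod SDK.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path) -> list:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder record in the same shape as the sft_data entries above.
example = [{"messages": [
    {"role": "system", "content": "You are a clinical nutrition expert."},
    {"role": "user", "content": "Example question text"},
    {"role": "assistant", "content": "Example answer text"},
]}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "sft_train.jsonl"  # hypothetical file name
    written = write_jsonl(example, out_path)
    round_tripped = read_jsonl(out_path)
```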

Results

Book	Q&A Pairs
ASPEN Parenteral Nutrition	1,504
ASPEN Fluids & Electrolytes	1,127
ASPEN Pediatric Nutrition	3,787
Handbook	1,347
NBNSC Book	908
Pediatric Nutrition	1,666
Total	10,339

Things to Watch For

From documents: use QuestionAndLabelGenerator, not WebSearchLabeler — answers are in the documents
From topics: WebSearchLabeler is correct — the web provides answers for topic-generated questions
Always apply a quality filter: FilterCriteria(min_score=0.7), score cutoffs, or agreement checks
The system prompt matters: it shapes the persona and gets baked into the training data
Lint before splitting. Run the dataset linter on the full generated dataset before splitting or training — it catches structural issues (duplicates, missing fields) that quality filters don't check for
Match questions_per_seed to density: topic tree nodes → 10, doc chunks (4000) → 3, doc chunks (2000) → 2, short text → 1
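
The questions_per_seed guidance above can be written as a small lookup. This is a sketch, not an SDK feature: `suggest_questions_per_seed` and the seed_kind labels are hypothetical, and the cutoffs for chunk sizes between the listed points (2000 and 4000) are an assumption.

```python
from typing import Optional


def suggest_questions_per_seed(seed_kind: str, chunk_size: Optional[int] = None) -> int:
    """Hypothetical helper encoding the density guidance from the list above."""
    if seed_kind == "topic_tree":
        return 10  # conceptual seeds support many distinct questions
    if seed_kind == "short_text":
        return 1
    if seed_kind == "doc_chunk":
        if chunk_size is None:
            raise ValueError("chunk_size is required for doc_chunk seeds")
        # Assumed cutoffs: the guidance only gives the points 4000 -> 3 and 2000 -> 2.
        if chunk_size >= 4000:
            return 3
        if chunk_size >= 2000:
            return 2
        return 1
    raise ValueError(f"unknown seed_kind: {seed_kind!r}")


print(suggest_questions_per_seed("doc_chunk", chunk_size=4000))  # 3
```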
