Generates seeds from local files, CSVs, PDFs, and user uploads using the Lightningrod SDK. Supports chunking, metadata extraction, and large-scale file processing.

Slash command: `/lightningrod-python-sdk:custom-dataset-seeds`

```python
from lightningrod import preprocessing

# Glob pattern: supports .txt, .md, .pdf, .csv
samples = preprocessing.files_to_samples(
    "data/*.pdf",
    chunk_size=1000,
    chunk_overlap=100,
)

# Single file
samples = preprocessing.file_to_samples("report.pdf")

# CSV with explicit columns
samples = preprocessing.files_to_samples(
    "data.csv",
    csv_text_column="body",
    csv_label_column="outcome",  # optional: embeds label in sample
)

# Raw string chunks (`chunks` is a collection of strings you prepared yourself)
samples = preprocessing.chunks_to_samples(chunks, metadata={"source": "internal"})

# Upload the samples as an input dataset. `lr` is an already-initialized
# Lightningrod client; its setup is not shown in this section.
input_dataset = lr.datasets.create_from_samples(samples, batch_size=1000)

# Pass to lr.transforms.run() (`pipeline` is a transform pipeline defined elsewhere):
dataset = lr.transforms.run(pipeline, input_dataset=input_dataset, max_questions=10)
```

Prefer `preprocessing.files_to_samples()` for small, one-shot collections that only need to become seeds. Reach for a FileSet when any of these apply:

```python
from lightningrod import (
    FileSetMetadataSchemaInput, MetadataFieldDefinitionInput, MetadataFieldType,
)

# Metadata schema is optional: include it only if you plan to filter on these fields later
schema = FileSetMetadataSchemaInput(fields=[
    MetadataFieldDefinitionInput(
        name="ticker", field_type=MetadataFieldType.STRING, required=True,
        description="Company ticker",
    ),
])
fs = lr.filesets.create(
    name="quarterly-reports",
    description="Investor reports",
    metadata_schema=schema,  # omit for unstructured collections
)

from datetime import datetime

# Per-file metadata is a dict keyed by filename. file_date powers temporal ordering.
result = lr.filesets.upload_files(
    fs.id,
    file_paths=["report_q1.pdf", "report_q2.pdf"],
    metadata={
        "report_q1.pdf": {"ticker": "APEX", "file_date": datetime(2024, 3, 31)},
        "report_q2.pdf": {"ticker": "APEX", "file_date": datetime(2024, 6, 30)},
    },
)
print(result.succeeded, result.failed, result.errors)

# Scale path: uses parallel GCS transfer. Requires the [transfer] extra:
#   pip install "lightningrod-ai[transfer]"
result = lr.filesets.upload_directory(
    fs.id, "./docs", pattern="*.pdf", max_workers=100, show_progress=True,
)
```

After upload, choose the transform pattern that matches how the documents relate to your questions.

`FileSetSeedGenerator`

Use when documents provide the material questions are about, but labels come from elsewhere (web search, news events, a later-resolved outcome).

```python
from lightningrod import FileSetSeedGenerator

seed_gen = FileSetSeedGenerator(
    file_set_id=fs.id,
    chunk_size=2000,
    chunk_overlap=200,
    metadata_filters=["ticker='APEX'"],  # SQL-style, optional
)
```

`FileSetDocumentContextGenerator` / `FileSetDocumentLabeler`

No embeddings, no vector search. Picks a single document by chronological relationship to the seed and passes its full text to the LLM. Right when:

- … (`NEXT_DOCUMENT` / `PREVIOUS_DOCUMENT`): Qdrant can't express this

```python
from lightningrod import (
    FileSetDocumentContextGenerator, FileSetDocumentLabeler, TemporalConstraint,
    BinaryAnswerType,
)

# TemporalConstraint values: EQUAL, NEXT_DOCUMENT, PREVIOUS_DOCUMENT, BEFORE, AFTER
context = FileSetDocumentContextGenerator(
    file_set_id=fs.id,
    temporal_constraint=TemporalConstraint.EQUAL,  # same doc as seed
    metadata_filter_keys=["ticker"],  # match seed's ticker
    system_instruction="Extract sections relevant to forecasting.",
    max_document_chars=200_000,  # optional
)
labeler = FileSetDocumentLabeler(
    file_set_id=fs.id,
    temporal_constraint=TemporalConstraint.NEXT_DOCUMENT,  # resolve from next report
    metadata_filter_keys=["ticker"],
    confidence_threshold=0.7,
    answer_type=BinaryAnswerType(
        labeler_instruction="Resolve Yes/No only when explicitly addressed.",
    ),
)
```

`QdrantContextGenerator` / `QdrantRAGLabeler`

Builds a vector index over the FileSet (`BAAI/bge-small-en-v1.5`, `index_chunk_size=1500`). At runtime, embeds the question and retrieves `top_k` chunks across the whole corpus. Right when:

```python
from lightningrod import BinaryAnswerType, QdrantContextGenerator, QdrantRAGLabeler

context = QdrantContextGenerator(
    file_set_id=fs.id,
    top_k=5,
    # Maps Qdrant payload key -> sample metadata key. Restricts retrieval to
    # chunks whose `ticker` payload equals the sample's `ticker`.
    payload_filters={"ticker": "ticker"},
    temporal_direction="before",  # soft timestamp filter: "before" | "after"
)
labeler = QdrantRAGLabeler(
    file_set_id=fs.id,
    payload_filters={"ticker": "ticker"},
    temporal_direction="after",  # forward-looking questions resolved by later docs
    confidence_threshold=0.7,
    answer_type=BinaryAnswerType(),
)
```

| Dimension | Qdrant* | FileSetDocument* |
|---|---|---|
| Retrieval | Vector search, top_k chunks | Single whole document, picked chronologically |
| Index | Builds embeddings on first use | None |
| Temporal param | temporal_direction="before"/"after" | temporal_constraint=TemporalConstraint.{EQUAL,NEXT_DOCUMENT,PREVIOUS_DOCUMENT,BEFORE,AFTER} |
| Metadata filter | payload_filters={"qdrant_key": "sample_key"} | metadata_filter_keys=["key1", "key2"] |
| Best for | Knowledge-base search | Periodic reports that resolve each other |

Rule of thumb: `FileSetDocument*` = periodic reports that resolve each other. `Qdrant*` = searching a knowledge base.
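
To make the chronological selection concrete, here is a minimal sketch in plain Python of how a single document might be chosen relative to a seed. It does not call the Lightningrod SDK: the `Doc` class, the `pick_document` helper, the `other_q2.pdf` file, and the exact matching rules are assumptions made for illustration, and the SDK's real behavior (including `BEFORE` and `AFTER`, which are omitted here) may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Doc:
    # Hypothetical stand-in for an uploaded file with per-file metadata.
    name: str
    ticker: str
    file_date: datetime


def pick_document(docs: List[Doc], seed: Doc, constraint: str) -> Optional[Doc]:
    """Pick one document by its chronological relationship to `seed`.

    Assumed semantics (not confirmed by the SDK):
      EQUAL             -> the seed document itself
      NEXT_DOCUMENT     -> earliest document dated after the seed
      PREVIOUS_DOCUMENT -> latest document dated before the seed
    Only documents sharing the seed's ticker are considered, mirroring
    metadata_filter_keys=["ticker"].
    """
    same_key = sorted(
        (d for d in docs if d.ticker == seed.ticker),
        key=lambda d: d.file_date,
    )
    if constraint == "EQUAL":
        return seed
    if constraint == "NEXT_DOCUMENT":
        later = [d for d in same_key if d.file_date > seed.file_date]
        return later[0] if later else None
    if constraint == "PREVIOUS_DOCUMENT":
        earlier = [d for d in same_key if d.file_date < seed.file_date]
        return earlier[-1] if earlier else None
    raise ValueError(f"unsupported constraint: {constraint}")


docs = [
    Doc("report_q1.pdf", "APEX", datetime(2024, 3, 31)),
    Doc("report_q2.pdf", "APEX", datetime(2024, 6, 30)),
    Doc("other_q2.pdf", "ZETA", datetime(2024, 5, 15)),  # different ticker, ignored
]
seed = docs[0]
next_doc = pick_document(docs, seed, "NEXT_DOCUMENT")
prev_doc = pick_document(docs, seed, "PREVIOUS_DOCUMENT")
print(next_doc.name if next_doc else None)  # report_q2.pdf
print(prev_doc)  # None: nothing precedes the Q1 report
```

This ordering relationship is what makes the `FileSetDocument*` pattern a fit for periodic reports: a question seeded from one report can be resolved from the one that follows it.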

Before building a pipeline, check that the data is suitable:

| Check | How | Minimum bar |
|---|---|---|
| Volume | len(samples) | ≥ 50 samples for a meaningful demo |
| Date coverage | Check sample.date fields | Dates present for temporal split; span ≥ 30 days for forecasting |
| Text quality | Spot-check sample.text values | Readable prose, not garbled OCR or empty strings |
| Label availability | Check sample.label if using QuestionAndLabelGenerator | Labels present and non-null |

If the data fails a check, explain the issue clearly and stop; do not build a pipeline on bad inputs.
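
The checks in the table can be scripted before any pipeline is built. The sketch below is illustrative only: the `Sample` dataclass is a hypothetical stand-in for the SDK's sample objects (it assumes `text`, `date`, and `label` attributes as named in the table), and `check_samples` is not part of the SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Sample:
    # Hypothetical stand-in; real SDK samples may be shaped differently.
    text: str
    date: Optional[datetime] = None
    label: Optional[str] = None


def check_samples(samples: List[Sample], need_labels: bool = False) -> List[str]:
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    # Volume: at least 50 samples.
    if len(samples) < 50:
        problems.append(f"volume: {len(samples)} samples, need >= 50")
    # Date coverage: every sample dated, with a span of at least 30 days.
    dates = [s.date for s in samples if s.date is not None]
    if len(dates) < len(samples):
        problems.append(f"dates: {len(samples) - len(dates)} samples missing a date")
    elif dates and max(dates) - min(dates) < timedelta(days=30):
        problems.append("dates: span is under 30 days")
    # Text quality: flag empty strings (garbled OCR still needs a manual spot-check).
    empty = sum(1 for s in samples if not s.text.strip())
    if empty:
        problems.append(f"text: {empty} empty samples")
    # Labels: only required when the pipeline consumes them.
    if need_labels:
        unlabeled = sum(1 for s in samples if s.label is None)
        if unlabeled:
            problems.append(f"labels: {unlabeled} samples missing a label")
    return problems


start = datetime(2024, 1, 1)
good = [Sample(f"Report text {i}", start + timedelta(days=i), "yes") for i in range(60)]
bad = [Sample("", None, None) for _ in range(10)]

print(check_samples(good, need_labels=True))  # []
for problem in check_samples(bad, need_labels=True):
    print(problem)
```

Returning a list of problems rather than raising on the first failure makes it easy to report every issue to the user at once before stopping.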

- `chunk_size=1000, chunk_overlap=100` works for most documents
- … (`chunk_size=500`)
- … (`chunk_size=1500`)

- `notebooks/getting_started/02_custom_documents_datasource.ipynb`
- `notebooks/custom_filesets/`
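
As a rough picture of what `chunk_size` and `chunk_overlap` control, the sketch below splits text with a character-based sliding window. This is an assumption for illustration: the SDK may count tokens rather than characters or split on different boundaries, and `chunk_text` is a hypothetical helper, not an SDK function.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list:
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so text near a
    boundary is repeated at the start of the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# 2500 characters of repeating digits, so overlaps are easy to compare.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])  # [1000, 1000, 700]
```

A smaller `chunk_size` yields more, shorter seeds; a larger one keeps more context per seed at the cost of fewer samples.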