By jmagly
Corpus-to-dataset pipeline for AI training data curation. Ingests sources, synthesizes examples, generates preference pairs, applies decontamination, and exports to Alpaca/ShareGPT/ChatML/JSONL/Parquet with provenance and reproducibility. Grounded in 485 research REFs covering DPO/KTO/ORPO/SimPO, Self-Instruct/Evol/Orca/Phi/PersonaHub/STaR/ReST, Model Collapse guard, Datasheets/Model Cards/Data Statements, HF Datasets/Arrow+Parquet.
Computes dataset-level metrics (diversity, difficulty, domain balance, quality grade distribution) and prepares the matric-eval handoff package for model evaluation.
Coordinates dataset versioning, datasheet/model card generation, integrity manifests, and the publication gate including override escalation paths.
Runs exact, fuzzy, and semantic contamination checks against eval-set targets and feeds the publication gate.
Generates SFT training examples from admitted sources using self-instruct, evol-instruct, squad, and STaR patterns with per-example provenance.
Runs mechanical format adapters (alpaca, sharegpt, chatml, jsonl, parquet) with round-trip validation and sidecar metadata.
Acquire a training data source with license validation and delegate ingest to the semantic memory kernel
Generate Datasheet, Model Card, and Data Statement from a dataset manifest
Deterministically rebuild a dataset from its manifest and verify fixity equivalence
Create a versioned training dataset with manifest, fixity, provenance, and archive snapshot
Detect training-eval overlap against benchmark sets before dataset publication
Uses power tools
Uses Bash, Write, or Edit tools
Own this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimOwn this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimBased on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
Corpus-to-dataset pipeline for AI training data curation
15 skills, 7 agents, 5 format adapters, 3 decontamination modes, 6 default benchmark targets, 485 research REFs. Agentic surface that works out of the box + optional Python runtime for scale.
/plugin install training@aiwg # Claude Code plugin install
aiwg use training # or via AIWG CLI
Get Started · What You Get · Architecture · Research · Docs
aiwg-training is a marketplace plugin for AIWG that turns any corpus — research papers, code repositories, conversation logs, documentation sites — into training-ready datasets for fine-tuning language models. It produces datasets suitable for SFT, DPO, KTO, ORPO, SimPO, and GRPO training workflows, with full provenance, license inheritance, benchmark decontamination, and byte-for-byte reproducibility.
If you have tried to build a fine-tuning dataset and ended up with ad-hoc scripts, manually curated JSONL files, mystery licenses, and hope-this-doesn't-contaminate-the-eval vibes, aiwg-training is the missing infrastructure layer. It implements every published best practice from dataset methodology research (Self-Instruct, Evol-Instruct, Orca, PersonaHub, STaR), preference-optimization research (DPO, KTO, ORPO, SimPO), governance standards (Datasheets for Datasets, Model Cards, Data Statements, ML Reproducibility Checklist), and safety research (Benchmark Contamination, Model Collapse, Llama Guard) behind a single cohesive framework.
Unlike HuggingFace datasets (storage format) or Axolotl (training orchestrator), aiwg-training is a curation pipeline. It ingests, assesses, synthesizes, filters, formats, decontaminates, versions, and documents — the work that happens before you invoke trainer.train() and the part that determines whether your fine-tune actually learns anything useful.
Building a fine-tuning dataset is hard in ways that don't show up in tutorials. Four failure modes dominate:
Typical dataset scripts produce JSONL files with no record of where each example came from, what license governs it, what transformations were applied, or how to rebuild the same dataset again next week. When something goes wrong — a model overfits a biased subsample, a source is later retracted, a license changes — there's no way to trace or fix it.
Without aiwg-training: 70%+ of published fine-tuning datasets fail the ML Reproducibility Checklist (Pineau et al. 2020). Lineage from raw source to trained model is almost always missing.
With aiwg-training: Every example traces back to its source via W3C PROV (REF-062). Every dataset version ships with a SHA-256 fixity manifest + deterministic seed + reproduction recipe. aiwg-training dataset reproduce byte-reproduces any prior version.
Most fine-tuning datasets accidentally include examples from the benchmarks you'll later use to evaluate the model. Your "HumanEval 67.2%" score is meaningless if 40% of HumanEval was in your training data. Published papers have been retracted over this.
Without aiwg-training: Benchmark leakage is detected post-hoc, if ever. REF-442 (Sainz et al. 2023) shows ChatGPT reproduces CoNLL-2003 verbatim — pervasive contamination across major benchmarks.
With aiwg-training: Decontamination is a first-class pipeline stage that blocks publication. Three detection modes (exact 13-gram per REF-442, fuzzy edit-distance, semantic embedding similarity). Six default targets (MMLU, GSM8K, HumanEval, HELM, MT-Bench, AlpacaEval) extensible to any benchmark. The decontamination-gate lint rule makes override explicit with triple audit trail (manifest + activity log + report appendix).
npx claudepluginhub jmagly/aiwg-trainingMarketing automation framework with 37 specialized agents for campaign management, content strategy, brand compliance, and analytics. Full campaign lifecycle from strategy to measurement.
Complete SDLC framework with 58 specialized agents for software development lifecycle management. Phase-based workflows (Inception→Elaboration→Construction→Transition), security reviews, testing orchestration, and deployment automation.
Voice profile system for consistent, authentic writing. Apply, create, blend, and analyze voices. Includes 4 built-in profiles: technical-authority, friendly-explainer, executive-brief, and casual-conversational.
Writing quality validation and AI pattern detection. Identify AI-generated patterns, enhance authenticity, and enforce writing standards. Includes writing-validator agent and ai-pattern-detection skill.
Core AIWG utilities for context regeneration, workspace management, development kit, and @-mention traceability. Essential foundation for other AIWG plugins.
Synthetic data generation — composable blocks and YAML-defined flows for building LLM training datasets
LLM post-training — unified interface for SFT, OSFT, LoRA fine-tuning, and GRPO reinforcement learning
ML engineering plugin: Give your AI coding agent ML engineering superpowers.
Style transfer pipeline for training LLMs to write in specific author styles using SFT with LoRA
Self-documenting, self-improving framework for analytical repositories
Agent Skills for AI/ML tasks including dataset creation, model training, evaluation, and research paper publishing on Hugging Face Hub