Improve LLM apps and agents from real traces — reduce cost/latency, raise quality/reliability, capture traces, build evals, run local optimization (GEPA), compare models/providers, and route through Understudy.
Use when a developer wants to replace an expensive frontier model on a classification workload (binary, multi-class, multi-label, or structured extraction) with a fine-tuned open-weight student — "distill this classifier", "can a small model do this tagging job", "the frontier labels these for $X, make it cheaper", "consensus-label my data". Multi-teacher majority-vote labeling, failure-directed SFT data, and a four-way promote/shadow/collect/stop verdict.
Use when a developer already has production LLM traces — a bucket of captures, provider log exports, or gateway capture files — and wants them turned into local, redacted eval sets, or profiled for cost first. "Ingest my traces", "turn these logs into an eval set", "where is my LLM spend going", "which calls could a local model take over".
Use when a developer wants to install, enable, update, reinstall, or verify the Understudy skills as a Claude Code plugin — "install understudy", "add the understudy skills", "set up the plugin", "why can't you see the understudy skill". Runs the non-interactive `claude plugin` CLI (or shows the commands), then tells the developer the one activation step and whether a restart is needed.
Use to build a simulated, seeded environment (AutomationBench / verifiers style) so any model can run a captured agentic workload end-to-end and be scored on final state — "simulate this workload's tools", "build a validator for these traces", "let a small model attempt the whole task", "score recall/precision against gold", or any handoff from understand-workload toward whole-case model comparison.
Use when you need to know HOW two model runs differ behaviorally on the same tasks, not just THAT one scores higher — per-task trajectory diffing that classifies the gap as persistence/recovery, knowledge, or format/parsing. "why does the bigger model pass these", "is this gap RL-shaped", "diff these two trajectory runs", "where do the trajectories diverge", "what would distillation buy me". The behavioral complement to compare-model-sweep.
Public, MIT-licensed Understudy skill library and thin CLI.
This repo is the public skills surface for local-first AI workload evaluation, optimization planning, gateway handoff, and agent-led implementation. The CLI is thin TypeScript/Node: durable shortcuts, auth, artifact checks, and runtime wrappers that a coding agent can monitor.
The OSS MVP loop is local-first and skill-led:
capture evidence -> attach harness/environment
-> confirm metric/validator/holdout -> rerun baseline
-> optimize workload -> conservative claim packet
Registration is not required for that loop. Hosted gateway access is available
after `understudy login`; browser, channel, daemon, and desktop-runtime
commands remain outside this public CLI until intentionally extracted.
The hosted surface this CLI consumes is documented at docs.understudylabs.com — see open-source/agent-tools for how this repo fits the platform and open-source/cli for the command-level CLI reference. The skills here stay local-first; the docs site covers the hosted contracts behind them.
| Spine | Path | Purpose |
|---|---|---|
| CLI | src/ | Thin TypeScript shortcuts for auth, artifact checks, and durable runs. |
| Skills | skills/ | MVP progressive-disclosure agent playbooks. |
| Docs | docs/ | Public methodology and release-boundary notes. |
| Scripts | scripts/ | Repo hygiene checks, not product CLI code. |
| Vendor | vendor/ | Vendored or mirrored compatibility shims, with license metadata. |
The CLI should stay boring. Workflow judgment belongs in skills; durable shortcuts belong in TypeScript only when the agent needs reliable execution, auth injection, artifact writes, or a safety gate.
Fast first-run installer:
curl -fsSL https://raw.githubusercontent.com/UnderstudyLabs/understudy-agent-tools/main/install.sh | bash
This installs the CLI, installs or refreshes the Claude Code plugin when
`claude` is available, then opens Claude Code in the current directory. In
Claude Code, run:
/reload-plugins
/understudy:onboard
The installer intentionally does not download model weights, start MLX, install Pi, launch tmux/iTerm, or make frontier calls. Those belong inside the Claude Code skill flow, where the agent can explain the tradeoffs, ask consent, coach the user on opening their preferred terminal, and run the same commands itself when appropriate.
For non-interactive installs, add --yes:
curl -fsSL https://raw.githubusercontent.com/UnderstudyLabs/understudy-agent-tools/main/install.sh | bash -s -- --yes
The installer is resumable. It writes step markers under
~/.understudy/agent-tools/install-state; after a failed run, use:
curl -fsSL https://raw.githubusercontent.com/UnderstudyLabs/understudy-agent-tools/main/install.sh | bash -s -- --resume
You can also jump directly to a step:
curl -fsSL https://raw.githubusercontent.com/UnderstudyLabs/understudy-agent-tools/main/install.sh | bash -s -- --from-step 2
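The resume and jump behavior described above can be pictured as a small planning function. This is only a sketch in TypeScript, not the installer's actual logic: the real `install.sh` is a shell script whose internals are not shown here. The step ids, the `Step` type, and `planSteps` are hypothetical, and the sketch assumes that `--from-step N` runs step N and everything after it regardless of existing markers.

```typescript
// Hypothetical step list, loosely based on what the installer is described
// as doing. The real step numbering in install.sh is an assumption here.
type Step = { id: number; name: string };

const STEPS: Step[] = [
  { id: 1, name: "install-cli" },
  { id: 2, name: "install-plugin" },
  { id: 3, name: "open-claude-code" },
];

// Decide which steps to run, given the ids that already have a marker.
// - fromStep: run that step and all later ones, ignoring markers (assumed).
// - resume:   skip every step that already has a marker.
// - neither:  run every step from the top.
function planSteps(
  steps: Step[],
  completed: Set<number>,
  opts: { resume?: boolean; fromStep?: number } = {},
): Step[] {
  if (opts.fromStep !== undefined) {
    const from = opts.fromStep;
    return steps.filter((s) => s.id >= from);
  }
  if (opts.resume) {
    return steps.filter((s) => !completed.has(s.id));
  }
  return steps;
}
```

In this model, a run that failed after step 1 would leave one marker behind, so a resumed run picks up at step 2 instead of repeating the CLI install.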
Developer install from a clone:
npm install
npm run build
node dist/bin.js --help
After package publication:
npm install -g @understudylabs/understudy-agent-tools
understudy spine
No provider calls, uploads, model downloads, secret-value inspection, or hosted
jobs run by default. After authentication, the CLI emits bounded product
telemetry documented in `docs/telemetry.md`; disable it
with `UNDERSTUDY_TELEMETRY=0`.
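As an illustration of the opt-out described above, a gate of this kind might look like the following sketch. The function name `telemetryEnabled` and the `Env` type are hypothetical, not part of the CLI. It assumes that only the literal value `0` disables telemetry; the authoritative behavior is whatever `docs/telemetry.md` specifies.

```typescript
// Minimal stand-in for process.env so the sketch has no Node type dependency.
type Env = Record<string, string | undefined>;

// Assumption: telemetry is emitted only after authentication, and only the
// exact value "0" in UNDERSTUDY_TELEMETRY turns it off.
function telemetryEnabled(env: Env, authenticated: boolean): boolean {
  if (!authenticated) return false;
  return env["UNDERSTUDY_TELEMETRY"] !== "0";
}
```

In a Node process this could be called as `telemetryEnabled(process.env, isLoggedIn)`, where `isLoggedIn` is whatever the caller uses to track authentication.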
The skills in skills/ ship as a Claude Code plugin, declared in
.claude-plugin/ (plugin.json + marketplace.json).
Installing it registers the public invocable skills in skills/,
including the understudy orchestrator, onboarding, capture/eval, optimization,
local model, distillation, RLM, and verifier-handoff workers.
From a clone of this repo:
claude plugin marketplace add /path/to/understudy-agent-tools
claude plugin install understudy@understudy-skills
Then run /reload-plugins in your Claude Code session to activate — no
restart required. The equivalent interactive flow is /plugin marketplace add <path> then /plugin install understudy@understudy-skills. The
install-plugin skill automates this and
reports whether the plugin is already installed.
After /reload-plugins, run /understudy:onboard. That is where the coding
agent guides the first local model, terminal choice, Pi/tmux handoff, and any
frontier comparison with explicit consent.
npx claudepluginhub understudylabs/understudy-agent-tools --plugin understudy