By slapglif
End-to-end ML training ecosystem powered by SkyPilot. Launch, monitor, fix, iterate, ablate, and serve models on any cloud GPU with production-grade frameworks.
Analyzes SkyPilot cloud spending and optimizes GPU costs. Monitors active clusters, identifies waste (idle clusters, over-provisioned resources, missing autostop), and suggests concrete savings strategies with dollar amounts. Use proactively when expensive jobs are running or when reviewing cloud costs, or reactively when users ask about spending.
Validates SkyPilot YAML configurations before launch. Checks for known gotchas, estimates costs, verifies resource feasibility, and suggests optimizations. Use proactively before any sky launch or sky jobs launch command, or reactively when a user asks to review a config. Triggers when YAML files are being written or when launch commands are about to execute.
Designs and runs controlled ML experiments -- hyperparameter sweeps, architecture ablations, scaling law studies, and A/B comparisons. Use when comparing configurations, searching for optimal settings, or when the user discusses experimentation, ablation, or comparison. Triggers proactively after a training run completes to suggest next experiments.
Diagnoses and fixes ML training failures -- OOM crashes, NaN loss, loss plateaus, gradient explosion, slow throughput, data corruption, and preemption recovery. Use when training encounters problems or produces unexpected results. Triggers proactively when error patterns are detected in logs, or reactively when users report training issues.
Autonomous ML training lifecycle manager. Use when the user wants to train a model end-to-end -- framework selection, YAML generation, launch, monitoring, failure recovery, and iteration. Triggers proactively when training tasks are detected, such as dataset preparation discussions, model fine-tuning requests, or when training scripts exist but have not been launched.
Use when saving or resuming training checkpoints, merging LoRA adapters, converting between model formats (safetensors, GGUF, PyTorch), quantizing models, using mergekit for model merging, or managing checkpoint lifecycle on cloud storage
Use when reducing cloud GPU costs, choosing spot vs on-demand, estimating training expenses, managing budgets, configuring auto-stop, or optimizing spend across clouds - covers SkyPilot optimizer, spot failover, pricing tiers, and budget management
Use when designing a data pipeline, curating pretraining data, choosing between NeMo Curator and other tools, configuring tokenization, running deduplication (exact or fuzzy), applying data filtering or quality classifiers, working with FineWeb or Dolma or RedPajama, or assessing data quality for ML training - the production data pipeline design reference
Use when setting up multi-GPU or multi-node training, configuring torchrun or deepspeed launchers, choosing parallelism strategy (FSDP, DeepSpeed ZeRO, tensor parallel, pipeline parallel), tuning NCCL, or enabling high-performance networking with SkyPilot
Use when selecting a training framework, comparing NeMo vs Axolotl vs torchtune vs TRL vs DeepSpeed vs Megatron, choosing between FSDP and DeepSpeed, deciding how to fine-tune or pretrain a model, or configuring LoRA/QLoRA/full fine-tuning - the definitive framework selection guide for ML training at any scale
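As a rough illustration of the waste checks the cost agent describes (idle clusters, missing autostop, savings in dollar amounts), here is a minimal sketch. It does not call SkyPilot: `ClusterSnapshot`, `find_waste`, and the one-hour idle threshold are hypothetical names and values invented for this example, and the plugin's actual logic may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClusterSnapshot:
    """Hypothetical record of one cluster's state; not a SkyPilot type."""
    name: str
    hourly_cost_usd: float
    idle_hours: float
    autostop_minutes: Optional[int]  # None means autostop is not configured


def find_waste(
    clusters: List[ClusterSnapshot],
    idle_threshold_hours: float = 1.0,  # assumed threshold, for illustration only
) -> List[Tuple[str, str, float]]:
    """Return (cluster name, issue, estimated wasted USD) for each finding."""
    findings = []
    for c in clusters:
        # A cluster without autostop keeps billing until someone stops it by hand.
        if c.autostop_minutes is None:
            findings.append((c.name, "missing autostop", 0.0))
        # Idle time multiplied by the hourly rate gives a dollar estimate of waste.
        if c.idle_hours >= idle_threshold_hours:
            wasted = round(c.idle_hours * c.hourly_cost_usd, 2)
            findings.append((c.name, "idle cluster", wasted))
    return findings


clusters = [
    ClusterSnapshot("train-a", hourly_cost_usd=2.0, idle_hours=3.0, autostop_minutes=None),
    ClusterSnapshot("eval-b", hourly_cost_usd=1.0, idle_hours=0.0, autostop_minutes=30),
]
for name, issue, usd in find_waste(clusters):
    print(f"{name}: {issue} (${usd:.2f})")
```

In practice the snapshot fields would be filled from whatever cluster status and pricing data is available; the sketch only shows how idle time and a missing autostop setting turn into findings with a dollar figure.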
Admin access level
Server config contains admin-level keywords
Executes bash commands
Hook triggers when Bash tool is used
Based on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
Uses power tools
Uses Bash, Write, or Edit tools
npx claudepluginhub slapglif/skymcp --plugin skymcp
MCP server for Maestro mobile testing + ADB toolkit - Playwright-style control for iOS/Android apps
Zero-config P2P networking — connect agents to a swarm for messaging, file transfer, and tunneling
Multi-agent message bus for coordinating AI agents across sessions via pub/sub, request/response, and broadcast patterns
Mathematical physics tooling suite - symbolic math, numerical physics, ML, theorem proving, verification, bioinformatics, and discrete mathematics
Machine learning training and inference pipeline using cloud GPUs (Modal, Lambda Labs, RunPod) with HuggingFace ecosystem - no local GPU required
SkyPilot agent skill for launching cloud VMs, Kubernetes pods, and Slurm jobs across 25+ clouds
ML engineering plugin: Give your AI coding agent ML engineering superpowers.
Claude Code skill pack for CoreWeave (24 skills)
Train ML models with scikit-learn, PyTorch, TensorFlow. Use for classification/regression, neural networks, hyperparameter tuning, or encountering overfitting, underfitting, convergence issues.
LLM post-training — unified interface for SFT, OSFT, LoRA fine-tuning, and GRPO reinforcement learning