By t2ance
Prediction-first autonomous monitoring for ML/DL training jobs. General-purpose framework with domain-specific skills for GRPO/RL, distributed training, and Kubernetes.
Heuristics for monitoring multi-GPU and multi-process distributed training. Common patterns, NCCL diagnostics, known failure modes. Reference knowledge, not rules.
Heuristics for monitoring GRPO, PPO, and other RL training. Common patterns, typical indicators, known failure modes. Reference knowledge, not rules.
Heuristics for monitoring training jobs on Kubernetes. Common patterns, pod anomalies, scheduling failures, escalation ladder. Reference knowledge, not rules.
Interactive setup wizard for the training-monitor plugin. Goal-driven — the agent determines which dependencies are needed by checking project context and asking the user only when the context is ambiguous. Installs missing dependencies and reports available capabilities.
Prediction-first monitoring for ML/DL training jobs. Single-agent execution with reviewer sub-agent. Derives judgment criteria from training artifacts, not hardcoded rules.
Uses power tools
Uses Bash, Write, or Edit tools
Plugin marketplace for ML/DL training monitoring.
claude plugin marketplace add t2ance/training-monitor-plugin
claude plugin install training-monitor@training-monitor
Or add to ~/.claude/settings.json:
{
  "extraKnownMarketplaces": {
    "training-monitor": {
      "source": {
        "source": "github",
        "repo": "t2ance/training-monitor-plugin"
      }
    }
  },
  "enabledPlugins": {
    "training-monitor@training-monitor": true
  }
}
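If ~/.claude/settings.json already holds other settings, pasting the snippet over it would discard them. The following is a minimal sketch of merging the two keys into an existing file instead. The helper names `merge_plugin_settings` and `update_settings_file` are hypothetical and not part of the plugin; the sketch assumes the file, if present, is valid JSON with an object at the top level.

```python
import json
from pathlib import Path

# Entries copied from the settings snippet above.
MARKETPLACE = {
    "training-monitor": {
        "source": {"source": "github", "repo": "t2ance/training-monitor-plugin"}
    }
}
PLUGIN_KEY = "training-monitor@training-monitor"


def merge_plugin_settings(settings: dict) -> dict:
    """Return a copy of `settings` with the marketplace registered and the plugin enabled.

    Hypothetical helper: existing marketplaces, plugins, and unrelated keys are kept.
    """
    merged = dict(settings)

    marketplaces = dict(merged.get("extraKnownMarketplaces", {}))
    marketplaces.update(MARKETPLACE)
    merged["extraKnownMarketplaces"] = marketplaces

    enabled = dict(merged.get("enabledPlugins", {}))
    enabled[PLUGIN_KEY] = True
    merged["enabledPlugins"] = enabled

    return merged


def update_settings_file(path: Path) -> None:
    """Read `path` if it exists, merge the plugin entries, and write the result back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_plugin_settings(existing), indent=2) + "\n")
```

Calling `update_settings_file(Path.home() / ".claude" / "settings.json")` would apply the merge to the user-level settings file; backing the file up first is a sensible precaution.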
| Plugin | Description |
|---|---|
| training-monitor | Prediction-first autonomous monitoring for ML/DL training jobs |
After installation, run /monitor-doctor for interactive setup.
npx claudepluginhub t2ance/training-monitor-plugin
Metacognitive FOK/JOL loop MCP with periodic open-session reminder.
ML engineering plugin: Give your AI coding agent ML engineering superpowers.
Machine learning training and inference pipeline using cloud GPUs (Modal, Lambda Labs, RunPod) with HuggingFace ecosystem - no local GPU required
Train ML models with scikit-learn, PyTorch, TensorFlow. Use for classification/regression, neural networks, hyperparameter tuning, or encountering overfitting, underfitting, convergence issues.
Claude Code skill pack for CoreWeave (24 skills)
LLM post-training — unified interface for SFT, OSFT, LoRA fine-tuning, and GRPO reinforcement learning
Skills for tracing, evaluating, and improving AI agents with MLflow. Supports the full agent improvement loop: instrument → trace → evaluate → iterate → validate.