By kengbailey
Tune llama-server for optimal performance and GPU utilization. Analyzes GPU VRAM, model architecture (dense/MoE), and generates launch commands for maximum tok/s.
Personal Claude Code plugin marketplace.
Add this marketplace to Claude Code:
/plugin marketplace add <owner>/bailey-claude-marketplace
Then install individual plugins:
/plugin install <plugin-name>@bailey-marketplace
| Plugin | Description | Source |
|---|---|---|
| claude-mem | Persistent memory system for Claude Code. Captures tool usage, compresses observations with AI, and re-injects relevant context into future sessions. | External (thedotmack/claude-mem) |
| llama-tune | Tune llama-server for optimal performance and GPU utilization. Supports dense and MoE models. | In-repo |
Persistent memory across Claude Code sessions. Automatically captures everything Claude does, compresses it with AI, and provides continuity in future sessions.
Auto-installed dependencies (installed on first run):
Runtime:
- localhost:37777
- http://localhost:37777
- ~/.claude-mem/

Install:
/plugin install claude-mem@bailey-marketplace
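To confirm that something is listening on the runtime port listed above, a plain TCP probe is enough. This is a minimal sketch: the helper name `is_listening` is hypothetical, and it only checks that the port accepts connections, not that the process behind it is claude-mem.

```python
import socket


def is_listening(host="localhost", port=37777, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection tries every address the host resolves to.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


if __name__ == "__main__":
    print("port 37777 open:", is_listening())
```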
Tunes llama-server (llama.cpp) launch parameters for maximum tok/s on your hardware. Auto-detects GPU VRAM, CPU cores, and system RAM. Inspects GGUF model files to determine architecture (dense vs MoE), then calculates optimal flags including KV cache quantization, flash attention, expert offloading (MoE), and partial GPU layer placement.
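The kind of arithmetic behind KV cache quantization and partial GPU layer placement can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's actual calculation: the function names are hypothetical, weights and KV cache are assumed to be spread evenly across layers, and a fixed 1 GiB reserve stands in for compute buffers and other overhead.

```python
# Approximate bytes per cached element for common KV cache types.
# f16 is 2 bytes; q8_0 packs 32 values into 34 bytes; q4_0 packs 32 into 18.
KV_BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, cache_type="f16"):
    """Size of the K and V caches for one full context window."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # factor 2: K and V
    return int(per_token * ctx * KV_BYTES_PER_ELEMENT[cache_type])


def gpu_layers_that_fit(vram_bytes, model_bytes, n_layers, kv_bytes,
                        reserve_bytes=1 << 30):
    """Layers that fit in VRAM, assuming an even per-layer split (assumption)."""
    per_layer = (model_bytes + kv_bytes) / n_layers
    budget = vram_bytes - reserve_bytes
    if budget <= 0:
        return 0
    return min(n_layers, int(budget // per_layer))


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dim 128, 8192 context.
    kv_f16 = kv_cache_bytes(32, 8, 128, 8192, "f16")
    kv_q8 = kv_cache_bytes(32, 8, 128, 8192, "q8_0")
    print(kv_f16, kv_q8)
    print(gpu_layers_that_fit(8 << 30, 15 << 30, 32, kv_f16))
```

Quantizing the cache from f16 to q8_0 roughly halves its footprint in this estimate, which can free room for more layers on the GPU.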
Features:
- llama-gguf

Skill: /llama-tune <model.gguf> [--ctx SIZE] [--slots N] [--port PORT] [--launch]
Install:
/plugin install llama-tune@bailey-marketplace
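For a sense of what a generated launch command might look like, here is a hedged sketch that assembles a llama-server command line from a few tuning decisions. The function `build_llama_server_cmd` and its defaults are hypothetical, and the mapping from the skill's `--ctx`, `--slots`, and `--port` arguments to llama-server flags is an assumption. The flags themselves are from recent llama.cpp builds; check `llama-server --help` for your version.

```python
import shlex


def build_llama_server_cmd(model, ctx=8192, slots=1, port=8080,
                           n_gpu_layers=99, kv_type="q8_0", n_cpu_moe=0):
    """Assemble a llama-server argv list from tuning decisions (illustrative)."""
    cmd = [
        "llama-server",
        "--model", model,
        "--ctx-size", str(ctx),
        "--parallel", str(slots),
        "--port", str(port),
        "--n-gpu-layers", str(n_gpu_layers),
        # Recent builds take on/off/auto here; older builds use a bare -fa flag.
        "--flash-attn", "on",
        # A quantized V cache requires flash attention to be enabled.
        "--cache-type-k", kv_type,
        "--cache-type-v", kv_type,
    ]
    if n_cpu_moe > 0:
        # MoE only: keep the expert weights of the first N layers on the CPU.
        cmd += ["--n-cpu-moe", str(n_cpu_moe)]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_llama_server_cmd("model.gguf", ctx=16384, slots=2)))
```

Returning a list instead of a string keeps the command safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is only used for display.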
In-repo plugins go in the plugins/ directory. External plugins are referenced by source in .claude-plugin/marketplace.json.
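As a rough illustration of how the two kinds of entries could look in `.claude-plugin/marketplace.json`, the sketch below builds a manifest and prints it as JSON. The field layout follows my understanding of the Claude Code marketplace format and may not match this repository's actual file; the `./plugins/llama-tune` path in particular is an assumption.

```python
import json

# Hypothetical manifest: one in-repo plugin and one external plugin.
marketplace = {
    "name": "bailey-marketplace",
    "owner": {"name": "kengbailey"},
    "plugins": [
        {
            "name": "llama-tune",
            # In-repo plugin: relative path under plugins/ (assumed path).
            "source": "./plugins/llama-tune",
            "description": "Tune llama-server for optimal performance and GPU utilization.",
        },
        {
            "name": "claude-mem",
            # External plugin: referenced by its GitHub repository.
            "source": {"source": "github", "repo": "thedotmack/claude-mem"},
            "description": "Persistent memory system for Claude Code.",
        },
    ],
}

manifest_json = json.dumps(marketplace, indent=2)
print(manifest_json)
```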
MIT
npx claudepluginhub kengbailey/bailey-marketplace --plugin llama-tune