From llmstxt-generator
Guide for Ollama setup, model selection, and integration for llms.txt generation
Slash command: /llmstxt-generator:ollama-integration
**Ollama** is a lightweight local LLM runtime that lets you run large language models on your machine without cloud dependencies. The llmstxt-generator plugin uses Ollama to intelligently generate llms.txt files.
Ollama is an open-source project that makes running LLMs locally simple and fast.
Official Site: https://ollama.ai
GitHub: https://github.com/ollama/ollama
Option 1: Download App (Easiest)
Option 2: Homebrew
brew install ollama
ollama serve # Start the server
Option 1: Installation Script
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve
Option 2: Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama:latest
# Check if Ollama is running
ollama list
# Should show available models, or empty list initially
If you see "Connection refused", Ollama isn't running:
# macOS/Linux
ollama serve
# Or check if service is running
ps aux | grep ollama
Different Ollama models have different characteristics. Here's a guide for generating llms.txt:
ollama pull llama2
Best for: Balanced generation quality and speed
ollama pull mistral
Best for: Quick generation on limited hardware
ollama pull neural-chat
Best for: Natural, conversational link descriptions
ollama pull dolphin-mixtral
Best for: Maximum quality when resources available
Too Small:
Too Slow:
Specialized:
Each model is a one-time download that is stored locally:
# Download (first time only)
ollama pull llama2
# ~3.8 GB download (the size ollama list reports); download time depends on your connection
# Future runs use local copy (instant)
ollama list
# Shows: NAME ID SIZE MODIFIED
# llama2:latest 91ab59b18b92 3.8 GB 2 minutes ago
# Remove a model to free space
ollama rm llama2
# Check disk usage
du -sh ~/.ollama/
# Show all models
ollama list
The llmstxt-generator plugin communicates with Ollama via API (default: http://localhost:11434).
Ollama starts automatically as a background service after installation.
# Verify it's running
ollama list
# In terminal window 1
ollama serve
# In terminal window 2
ollama list
docker run -d \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama:latest
# Download a model
docker exec ollama ollama pull llama2
The llmstxt-generator plugin connects to Ollama's API. Here's how it works:
# Ollama API endpoint (default)
http://localhost:11434
# Generate request example (what the plugin does)
curl http://localhost:11434/api/generate \
-d '{
"model": "llama2",
"prompt": "Create an llms.txt structure for FastHTML library",
"stream": false
}'
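The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_generate_request` and `generate` are illustrative and are not taken from the plugin, whose client code is not shown in this guide.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default endpoint from this guide


def build_generate_request(prompt, model="llama2", host=OLLAMA_HOST):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama2", host=OLLAMA_HOST, timeout=60):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, model, host)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Non-streaming replies carry the full text in the "response" field.
    return body["response"]
```

With Ollama running and the model pulled, `generate("Create an llms.txt structure for FastHTML library")` would return the model's text as a single string.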
If Ollama runs on a different host or port, configure it in .claude/llmstxt-generator.local.md:
# llmstxt-generator Configuration
## Ollama Settings
- host: http://192.168.1.100:11434
- default_model: llama2
# Test if Ollama is accessible
curl http://localhost:11434/api/tags
# Should return:
# {"models":[{"name":"llama2:latest",...}]}
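The `/api/tags` response can also be used to check that a specific model has been downloaded before generating. A rough sketch, assuming only the `models[].name` field shown above; the function names are hypothetical, and an untagged name is treated as `:latest`, which mirrors how Ollama resolves names such as `llama2`.

```python
import json
import urllib.error
import urllib.request


def installed_models(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if `wanted` is installed; a name without a tag means ':latest'."""
    full_name = wanted if ":" in wanted else f"{wanted}:latest"
    return full_name in installed_models(tags_payload)


def fetch_tags(host="http://localhost:11434", timeout=5):
    """Return the parsed /api/tags response, or None if Ollama is unreachable."""
    url = f"{host.rstrip('/')}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running".
        return None
```

A caller could run `fetch_tags()` first, report "Ollama isn't running" when it returns `None`, and otherwise suggest `ollama pull` when `has_model` is false.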
On typical hardware (MacBook Pro M1, 16GB RAM):
| Model | First Token | Full llms.txt | Quality |
|---|---|---|---|
| mistral (7B) | ~100ms | ~2-3 seconds | Good |
| neural-chat (7B) | ~100ms | ~2-3 seconds | Excellent |
| llama2 (13B) | ~200ms | ~4-6 seconds | Excellent |
| dolphin-mixtral (8x7B) | ~500ms | ~15-20 seconds | Superior |
Note: Speeds vary by hardware. GPU acceleration greatly improves performance.
For the prompt: "Create an llms.txt link description for an authentication guide":
mistral:
"Complete OAuth2 implementation with JWT tokens, refresh logic"
llama2:
"Comprehensive OAuth2, API key, and JWT authentication patterns with examples"
neural-chat:
"Step-by-step OAuth2, API keys, and JWT authentication with error handling"
dolphin-mixtral:
"Complete authentication guide: OAuth2 flows, API key management, JWT tokens, session handling, security best practices"
All are good; llama2 and neural-chat offer the best balance of quality and speed.
Cause: Ollama isn't running
Solution:
# Start Ollama
ollama serve
# Or check if it's running
lsof -i :11434
Cause: Model hasn't been downloaded yet
Solution:
# Download the model
ollama pull llama2
# List available models
ollama list
Cause: Model is too large for available RAM
Solution:
Set the num_gpu option to 0 (for example, "options": {"num_gpu": 0} in the API request) to use CPU only (slower)
Cause: Model running on CPU instead of GPU, or model too large
Solution:
# Check if GPU is enabled (look for "metal" or "cuda")
ollama serve
# If using CPU only:
# - Reduce context size
# - Use smaller model (mistral)
# - Add GPU support (check ollama.ai for hardware)
Solution:
# Refresh model list
ollama list
# If still not showing, restart Ollama
pkill ollama
ollama serve
For advanced users, you can use custom quantized models:
# Create Modelfile (custom model config)
cat > Modelfile << EOF
FROM ./model.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF
# Create custom model
ollama create mymodel -f Modelfile
# Use in generation
ollama run mymodel "prompt"
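When several parameter combinations need to be tried, the Modelfile can be generated rather than written by hand. A small sketch that produces the same text as the heredoc above; `render_modelfile` is a hypothetical helper, and it only handles the `FROM` and `PARAMETER` instructions used in this guide.

```python
def render_modelfile(base, parameters):
    """Render a Modelfile with one FROM line and one PARAMETER line per setting."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


def write_modelfile(path, base, parameters):
    """Write the rendered Modelfile to `path` for use with `ollama create -f`."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_modelfile(base, parameters))
```

For example, `write_modelfile("Modelfile", "./model.gguf", {"temperature": 0.7, "top_p": 0.9})` writes a file that can then be passed to `ollama create mymodel -f Modelfile`.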
Ollama automatically uses GPU if available:
Check the ollama serve output to confirm GPU use.
Smaller context = faster generation:
# Generate with smaller context (faster)
curl http://localhost:11434/api/generate \
-d '{
"model": "llama2",
"prompt": "...",
"options": {"num_ctx": 512}
}'
For long-running generation, stream responses:
# Stream mode (get output as it generates)
curl http://localhost:11434/api/generate \
-d '{
"model": "llama2",
"prompt": "...",
"stream": true
}'
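In stream mode, Ollama replies with newline-delimited JSON: each line is an object holding a `response` fragment, and the final object has `done` set to true. A minimal sketch of reassembling the text from such lines; `collect_stream` is an illustrative name, not a plugin function.

```python
import json


def collect_stream(lines):
    """Join the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of generation
    return "".join(parts)
```

Because an HTTP response object from `urllib.request.urlopen` can be iterated line by line, it can be passed to `collect_stream` directly; printing each fragment inside the loop instead would show output as it is generated.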
The first inference is slower because the model must be loaded into memory. Subsequent requests are faster:
# Pre-warm to cache model in memory
ollama run llama2 "What is llms.txt?" > /dev/null
Do you have GPU acceleration?
├─ Yes (GPU)
│ └─ Use: llama2 (best quality/speed balance)
│ └─ Or: dolphin-mixtral (maximum quality)
│
└─ No (CPU only)
├─ Generation takes >10 seconds?
│ └─ Use: mistral (fast, good quality)
│
└─ Generation is fast enough?
└─ Use: neural-chat (excellent for descriptions)
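The decision tree above can be written as a small function. This is only a sketch of the same rules; `choose_model` and its parameter names are hypothetical, and the 10-second threshold is the one given in the tree.

```python
def choose_model(has_gpu, want_max_quality=False, cpu_seconds=None):
    """Pick a model name following the decision tree in this guide.

    has_gpu: whether GPU acceleration is available.
    want_max_quality: on GPU, prefer dolphin-mixtral over llama2.
    cpu_seconds: measured generation time on CPU, if known.
    """
    if has_gpu:
        return "dolphin-mixtral" if want_max_quality else "llama2"
    if cpu_seconds is not None and cpu_seconds > 10:
        return "mistral"  # generation is too slow, so favor speed
    return "neural-chat"  # fast enough, so favor description quality
```

For instance, a CPU-only machine where a run took 14 seconds would get `mistral`, while a GPU machine with no special quality requirement would get `llama2`.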
In .claude/llmstxt-generator.local.md:
# Ollama Settings
## Connection
- host: http://localhost:11434
- timeout: 60 # seconds
## Model Selection
- default_model: llama2
- allow_model_selection: true
## Generation Parameters
- temperature: 0.7 # 0-1: Lower = deterministic, Higher = creative
- top_p: 0.9 # Nucleus sampling (0-1)
- top_k: 40 # Diversity control
- num_ctx: 2048 # Context window size
- num_predict: 512 # Max tokens to generate
- repeat_penalty: 1.1 # Prevent repetition
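To illustrate how these settings could reach Ollama, the sketch below reads `- key: value` lines from the file and collects the generation parameters into the `options` object that `/api/generate` accepts. The plugin's own parser is not shown in this guide, so the parsing rules here (headings ignored, trailing ` # comment` stripped, simple type coercion) are assumptions, and `parse_settings` and `generation_options` are hypothetical names.

```python
OPTION_KEYS = (
    "temperature", "top_p", "top_k",
    "num_ctx", "num_predict", "repeat_penalty",
)


def coerce(value):
    """Convert a setting string to bool, int, or float when it looks like one."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_settings(text):
    """Parse '- key: value' lines, ignoring headings and trailing comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # headings, blank lines, prose
        key, sep, value = line[2:].partition(":")
        if not sep:
            continue
        value = value.split(" #", 1)[0].strip()  # drop an inline comment
        settings[key.strip()] = coerce(value)
    return settings


def generation_options(settings):
    """Keep only the keys that belong in the request's "options" object."""
    return {key: settings[key] for key in OPTION_KEYS if key in settings}
```

The result of `generation_options(parse_settings(text))` could then be sent as the `"options"` field of a generate request, with `host`, `timeout`, and `default_model` used by the client itself.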
Download a model: ollama pull llama2
Run /llmstxt:generate in Claude Code
See generate_llmstxt.py for advanced usage