From ork
Provides LLM integration patterns for function calling, streaming responses, Ollama local inference, and fine-tuning customization. Use for tool use, SSE streaming, local deployment, LoRA/QLoRA, or multi-provider APIs.
Slash command: `/ork:llm-integration`
Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has individual rule files in `rules/` loaded on-demand.
- checklists/fine-tuning-decision.md
- checklists/streaming-checklist.md
- checklists/tool-checklist.md
- metadata.json
- references/dpo-alignment.md
- references/lora-qlora.md
- references/model-selection.md
- references/synthetic-data.md
- references/tool-schema.md
- references/when-to-finetune.md
- rules/_sections.md
- rules/_template.md
- rules/calling-parallel.md
- rules/calling-tool-definition.md
- rules/calling-validation.md
- rules/context-caching.md
- rules/context-window-management.md
- rules/evaluation-benchmarks.md
- rules/evaluation-metrics.md
- rules/local-gpu-optimization.md

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Function Calling | 3 | CRITICAL | Tool definitions, parallel execution, input validation |
| Streaming | 3 | HIGH | SSE endpoints, structured streaming, backpressure handling |
| Local Inference | 3 | HIGH | Ollama setup, model selection, GPU optimization |
| Fine-Tuning | 3 | HIGH | LoRA/QLoRA training, dataset preparation, evaluation |
| Context Optimization | 2 | HIGH | Window management, compression, caching, budget scaling |
| Evaluation | 2 | HIGH | LLM-as-judge, RAGAS metrics, quality gates, benchmarks |
| Prompt Engineering | 4 | HIGH | CoT, few-shot, versioning, DSPy optimization, ReAct, cost optimization |
Total: 20 rules across 7 categories

    # Function calling: strict mode tool definition
    tools = [{
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search knowledge base",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results"}
                },
                "required": ["query", "limit"],
                "additionalProperties": False
            }
        }
    }]
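
Strict mode in OpenAI-style function calling expects every property to appear in `required` and `additionalProperties` to be `False`, which the definition above satisfies. A minimal sketch of a pre-flight check for those constraints follows. `strict_schema_problems` is a hypothetical helper, not part of any SDK, and it inspects only the top-level parameters object, not nested ones.

```python
def strict_schema_problems(tool: dict) -> list[str]:
    """Return strict-mode problems found in one tool definition.

    Hypothetical helper: checks only the top-level parameters object.
    """
    fn = tool.get("function", {})
    params = fn.get("parameters", {})
    problems = []
    if fn.get("strict") is not True:
        problems.append("strict is not enabled")
    if params.get("additionalProperties") is not False:
        problems.append("additionalProperties must be False")
    # In strict mode every declared property must also be listed as required.
    missing = set(params.get("properties", {})) - set(params.get("required", []))
    if missing:
        problems.append(f"properties missing from required: {sorted(missing)}")
    return problems


good_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
            "required": ["query", "limit"],
            "additionalProperties": False,
        },
    },
}

# Same tool, but without strict, without additionalProperties, and with
# "limit" left out of required.
loose_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
            "required": ["query"],
        },
    },
}

print(strict_schema_problems(good_tool))
print(strict_schema_problems(loose_tool))
```

Running a check like this at startup surfaces schema mistakes before the first request rather than as an API error at call time.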
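
    # Streaming: SSE endpoint with FastAPI
    # Assumes the fastapi and sse-starlette packages; async_stream is an
    # application-defined async generator that yields tokens for a prompt.
    from fastapi import FastAPI
    from sse_starlette.sse import EventSourceResponse

    app = FastAPI()

    @app.get("/chat/stream")
    async def stream_chat(prompt: str):
        async def generate():
            async for token in async_stream(prompt):
                yield {"event": "token", "data": token}
            yield {"event": "done", "data": ""}
        return EventSourceResponse(generate())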
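
On the wire, each event yielded by the generator becomes one Server-Sent Events frame: an `event:` line, one `data:` line per line of payload, and a blank line as terminator. The sketch below shows that serialization with a hypothetical `format_sse` helper. It is for illustration only; libraries such as sse-starlette do their own serialization and may differ in details such as line endings.

```python
def format_sse(event: str, data: str) -> str:
    """Serialize one event as an SSE frame (hypothetical helper)."""
    lines = [f"event: {event}"]
    # A payload containing newlines must be split across several data: lines.
    lines.extend(f"data: {part}" for part in data.split("\n"))
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


frames = [format_sse("token", "Hello"), format_sse("done", "")]
print("".join(frames))
```

A browser `EventSource` client dispatches each frame to the listener registered for its event name, so the explicit `done` event gives the frontend a clear signal to close the connection.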
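
    # Local inference: Ollama with LangChain
    # Assumes the langchain-ollama package and a running local Ollama server.
    from langchain_ollama import ChatOllama

    llm = ChatOllama(
        model="deepseek-r1:70b",
        base_url="http://localhost:11434",
        temperature=0.0,
        num_ctx=32768,
    )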
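
    # Fine-tuning: QLoRA with Unsloth
    from unsloth import FastLanguageModel

    # Load the base model with 4-bit quantized weights (the "Q" in QLoRA).
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B",
        max_seq_length=2048, load_in_4bit=True,
    )
    # Attach LoRA adapters: rank 16, scaling factor lora_alpha / r = 2.
    model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
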
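
To see why a rank such as `r=16` is cheap: LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors instead, adding r x (d_in + d_out) parameters, with the update scaled by `lora_alpha / r`. The sketch below works through that arithmetic. The 4096 x 4096 layer size is an illustrative assumption, not a value read from the Unsloth configuration above, and both function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / r


# Illustrative square projection layer (assumed size).
d_in = d_out = 4096
full = d_in * d_out
added = lora_trainable_params(d_in, d_out, r=16)

print(f"full matrix: {full} params")
print(f"LoRA adapter: {added} params ({added / full:.2%} of the full matrix)")
print(f"scaling: {lora_scaling(32, 16)}")
```

For this assumed layer the adapter holds 131,072 trainable parameters against 16,777,216 frozen ones, under 1% of the matrix, which is what makes fine-tuning feasible on a single GPU.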
Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results.
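
A minimal sketch of "validate all inputs" and "return errors as tool results" follows. It swaps in hand-written standard-library checks where the skill recommends Pydantic or Zod, so that it runs without dependencies. The `search_documents` body, the 1 to 50 limit range, and the OpenAI-style tool message shape are assumptions made for illustration.

```python
import json


def search_documents(query: str, limit: int) -> list[str]:
    """Hypothetical tool implementation; returns placeholder results."""
    return [f"result {i + 1} for {query!r}" for i in range(limit)]


def validate_search_args(args: object) -> list[str]:
    """Collect all validation errors instead of raising on the first one."""
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    errors = []
    extra = set(args) - {"query", "limit"}
    if extra:
        errors.append(f"unexpected fields: {sorted(extra)}")
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        errors.append("query must be a non-empty string")
    limit = args.get("limit")
    # bool is a subclass of int in Python, so compare the exact type.
    if type(limit) is not int or not 1 <= limit <= 50:
        errors.append("limit must be an integer between 1 and 50")
    return errors


def run_tool_call(call_id: str, raw_arguments: str) -> dict:
    """Execute one tool call and always return a tool-result message."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        content = {"error": f"invalid JSON arguments: {exc.msg}"}
    else:
        errors = validate_search_args(args)
        if errors:
            # The error goes back to the model so it can correct the call.
            content = {"error": "; ".join(errors)}
        else:
            content = {"results": search_documents(args["query"], args["limit"])}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(content)}


print(run_tool_call("call_1", '{"query": "llm", "limit": 2}'))
print(run_tool_call("call_2", '{"query": "", "limit": true}'))
```

Returning the error as a tool result, rather than raising, keeps the conversation loop alive and gives the model a chance to retry with corrected arguments.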
- calling-tool-definition.md -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
- calling-parallel.md -- Parallel tool execution, asyncio.gather, strict mode constraints
- calling-validation.md -- Input validation, error handling, tool execution loops

Deliver LLM responses in real-time for better UX. Use SSE for web, WebSocket for bidirectional. Handle backpressure with bounded queues.
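
A minimal sketch of backpressure with a bounded queue, assuming an asyncio-based server: the producer suspends on `put()` whenever the queue is full, so a slow client cannot cause unbounded buffering. The function names, the `delay` parameter that simulates a slow consumer, and the default `maxsize` of 100 are illustrative assumptions.

```python
import asyncio

_DONE = object()  # sentinel marking the end of the stream


async def producer(queue: asyncio.Queue, tokens: list[str]) -> None:
    for token in tokens:
        # put() suspends while the queue is full; this is the backpressure.
        await queue.put(token)
    await queue.put(_DONE)


async def consumer(queue: asyncio.Queue, delay: float, high_water: list[int]) -> list[str]:
    received = []
    while True:
        # Record the deepest the buffer has been, to show it stays bounded.
        high_water[0] = max(high_water[0], queue.qsize())
        token = await queue.get()
        if token is _DONE:
            return received
        received.append(token)
        await asyncio.sleep(delay)  # simulate a slow client


async def stream_with_backpressure(tokens: list[str], maxsize: int = 100,
                                   delay: float = 0.001) -> tuple[list[str], int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    high_water = [0]
    producer_task = asyncio.create_task(producer(queue, tokens))
    received = await consumer(queue, delay, high_water)
    await producer_task
    return received, high_water[0]


tokens = [f"tok{i}" for i in range(20)]
received, deepest = asyncio.run(stream_with_backpressure(tokens, maxsize=4))
print(len(received), "tokens delivered; deepest buffer:", deepest)
```

In a real endpoint the consumer side would be the SSE generator, and cancelling `producer_task` when the client disconnects stops token generation that nobody will read.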
- streaming-sse.md -- FastAPI SSE endpoints, frontend consumers, async iterators
- streaming-structured.md -- Streaming with tool calls, partial JSON parsing, chunk accumulation
- streaming-backpressure.md -- Backpressure handling, bounded buffers, cancellation

Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models and use a provider factory for cloud/local switching.
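
One way to pre-warm a model is to send Ollama's `/api/generate` endpoint a request that names the model but carries no prompt, which loads the model into memory, and to set `keep_alive` so it stays resident. The sketch below builds such a request with the standard library. The helper names and the `30m` keep-alive value are assumptions, and `prewarm` needs a running Ollama server, so it is defined but not called here.

```python
import json
import urllib.request


def build_prewarm_request(model: str, base_url: str = "http://localhost:11434",
                          keep_alive: str = "30m") -> urllib.request.Request:
    """Build a load-only request: a model name with no prompt (hypothetical helper)."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def prewarm(model: str, timeout: float = 300.0) -> bool:
    """Send the load request; requires a running Ollama server."""
    request = build_prewarm_request(model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


request = build_prewarm_request("deepseek-r1:70b")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `prewarm` once at application startup moves the model load time out of the first user request.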
- local-ollama-setup.md -- Installation, model pulling, environment configuration
- local-model-selection.md -- Model comparison by task, hardware profiles, quantization
- local-gpu-optimization.md -- Apple Silicon tuning, keep-alive, CI integration

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.
- tuning-lora.md -- LoRA/QLoRA configuration, Unsloth training, adapter merging
- tuning-dataset-prep.md -- Synthetic data generation, quality validation, deduplication
- tuning-evaluation.md -- DPO alignment, evaluation metrics, anti-patterns

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.
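
A minimal sketch of a compression trigger, using the thresholds this skill recommends (start compressing at 70% window utilization, aim for 50%). The function names are hypothetical, and how tokens are counted and what gets summarized are left to the application.

```python
def should_compress(used_tokens: int, window_tokens: int, trigger: float = 0.70) -> bool:
    """True once context utilization reaches the trigger fraction."""
    return used_tokens / window_tokens >= trigger


def tokens_to_free(used_tokens: int, window_tokens: int, target: float = 0.50) -> int:
    """How many tokens compression must remove to reach the target utilization."""
    budget = int(window_tokens * target)
    return max(0, used_tokens - budget)


window = 32768
for used in (20000, 23000):
    if should_compress(used, window):
        print(f"{used} tokens: compress, free {tokens_to_free(used, window)}")
    else:
        print(f"{used} tokens: below trigger")
```

Keeping a gap between the trigger and the target avoids compressing again on every turn once the window starts to fill.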
- context-window-management.md -- Five-layer architecture, anchored summarization, compression triggers
- context-caching.md -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+

Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.
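
A minimal sketch of a quality gate over multi-dimension scores, using the thresholds this skill recommends (0.7 for production, 0.6 for drafts). Aggregating by unweighted mean is an assumption, as are the dimension names; a real gate might weight dimensions or require a minimum on each one.

```python
from statistics import mean

# Thresholds taken from this skill's recommendations.
THRESHOLDS = {"production": 0.7, "draft": 0.6}


def passes_quality_gate(scores: dict[str, float], stage: str = "production") -> bool:
    """Compare the unweighted mean of the dimension scores to the stage threshold."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    return mean(list(scores.values())) >= THRESHOLDS[stage]


strong = {"relevance": 0.8, "faithfulness": 0.7, "coherence": 0.75}
borderline = {"relevance": 0.6, "faithfulness": 0.7}

print(passes_quality_gate(strong))
print(passes_quality_gate(borderline), passes_quality_gate(borderline, stage="draft"))
```

The scores themselves would come from an LLM-as-judge or RAGAS run; the gate only decides whether an output is allowed through.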
- evaluation-metrics.md -- LLM-as-judge, RAGAS metrics, hallucination detection
- evaluation-benchmarks.md -- Quality gates, batch evaluation, pairwise comparison

Design, version, and optimize prompts for production LLM applications.
- prompt-design.md -- Chain-of-Thought, few-shot learning, pattern selection guide
- prompt-testing.md -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
- prompt-react-pattern.md -- ReAct loop for tool-using agents, thought-action-observation format
- prompt-optimization.md -- Token reduction, cost optimization, model tiering, prompt spec format

| Decision | Recommendation |
|---|---|
| Tool schema mode | strict: true (2026 best practice) |
| Tool count | 5-15 max per request |
| Streaming protocol | SSE for web, WebSocket for bidirectional |
| Buffer size | 50-200 tokens |
| Local model (reasoning) | deepseek-r1:70b |
| Local model (coding) | qwen2.5-coder:32b |
| Fine-tuning approach | LoRA/QLoRA (try prompting first) |
| LoRA rank | 16-64 typical |
| Training epochs | 1-3 (more risks overfitting) |
| Context compression | Anchored iterative (60-80%) |
| Compress trigger | 70% utilization, target 50% |
| Judge model | GPT-5.2-mini or Haiku 4.5 |
| Quality threshold | 0.7 production, 0.6 drafts |
| Few-shot examples | 3-5 diverse, representative |
| Prompt versioning | Langfuse with labels |
| Auto-optimization | DSPy MIPROv2 |

- ork:rag-retrieval -- Embedding patterns, when RAG is better than fine-tuning
- agent-loops -- Multi-step tool use with reasoning
- llm-evaluation -- Evaluate fine-tuned and local models
- langfuse-observability -- Track training experiments

Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools Solves:
Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream Solves:
Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon Solves:
Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment Solves:
npx claudepluginhub yonatangross/orchestkit --plugin ork