From johnfink-skills
Use whenever the user is building, debugging, or extending code that puts an LLM in a loop — calling tools, taking multi-step actions, deciding what to do next. Triggers include words like "agent", "agentic", "tool use", "tool calling", "ReAct", "autonomous", and code that loops over `messages.create` / `responses.create` / similar SDK calls. Apply when wiring webhooks/events/queues to an LLM, when an agent is misbehaving (looping, flailing, blowing context, spawning too many subagents), and when planning a new agent. Do NOT apply for one-shot prompting — only when the LLM has tools and decides its own next step.
Slash command
/johnfink-skills:agentic-llm
Most "first-pass agent" code is one of two things: a 30-line ReAct loop calling an LLM with a tool list, or an event handler that fires off an LLM call per inbound webhook. Both work in a demo. Both crater in production. The principles below are what separate a real agentic system from a demo.
The escalation ladder for "I need an LLM to do X":
1. No model at all. Can deterministic code, regex, SQL, or a rules-based router solve this? If yes, stop.
2. Embeddings, not generation. If the task is classification into N buckets, similarity search, clustering, deduplication, routing to one of K handlers, or finding the closest match, you want an embedding model + cosine similarity / k-NN / a tiny classifier head. Not an LLM in a loop.

   This is the speed point. Embeddings are orders of magnitude faster than an agent. A cosine-similarity classifier against 12 bucket centroids returns in single-digit milliseconds, deterministically, for fractional-cent cost. An agent doing the same job: seconds of latency, dollars per thousand calls, non-deterministic outputs, and a tool-call budget you have to bound. "Categorize this ticket," "route this query to the right backend," "find similar items in the catalog," "is this near-duplicate of something we've seen?" are embedding problems, not agent problems.

   Embed once at ingest, store the vectors, compare at query time. That's the whole system.
3. Single optimized LLM call. One tight prompt, one model call, structured output. Most problems people give to agents end here.
4. Workflow. Prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer. Predefined control flow with model-directed branches at the named decision points.
5. Autonomous agent. Last resort.
Reach for a full autonomous loop only when:
Agents trade reliability, latency, and cost for flexibility — and "flexibility" is rarely what the actual problem needs. Most production "agents" are really workflows with one or two model-directed branches.
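To make the embeddings rung of the ladder concrete, here is a minimal sketch of a centroid classifier in plain Python. The three-dimensional vectors and the bucket names are made up for illustration; in a real system the vectors would come from whatever embedding model you use, computed once at ingest.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Mean vector of a bucket's example embeddings, computed once at ingest."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(query_vec: Sequence[float], centroids: Dict[str, Sequence[float]]) -> str:
    """Return the bucket whose centroid is most similar to the query vector."""
    return max(centroids, key=lambda name: cosine(query_vec, centroids[name]))


# Toy 3-dimensional "embeddings" (assumption: real ones come from an embedding model)
EXAMPLES = {
    "billing": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "outage": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
CENTROIDS = {name: centroid(vecs) for name, vecs in EXAMPLES.items()}

print(classify([0.0, 1.0, 0.2], CENTROIDS))
```

There is no loop, no tool budget, and no sampling here: the same query vector always maps to the same bucket.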
Every tool call is a roundtrip: latency, tokens, and a chance for the model to choose wrong. So before exposing a tool, ask: will the model need this information on essentially every run? If yes, put it in the prompt up front. Don't make the model ask.
Examples of things to pre-load, not tool-fetch:
Examples of things to leave behind a tool:
The rule: pre-load high-confidence, small context; JIT-fetch low-confidence or large context. A first-pass agent that builds get_current_date, get_user_timezone, get_routing_table as tools is doing extra work that costs latency on every run. Just put those in the system prompt.
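As a rough sketch of the pre-load rule, the always-needed values can be rendered into the system prompt once per run instead of being exposed as tools. The function name `build_system_prompt`, its parameters, and the prompt wording are assumptions for illustration, not part of any SDK.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_system_prompt(
    user_timezone: str,
    routing_table: Dict[str, str],
    now: Optional[datetime] = None,
) -> str:
    """Render small, always-needed context directly into the system prompt."""
    now = now or datetime.now(timezone.utc)
    routes = "\n".join(
        f"- {topic}: {handler}" for topic, handler in sorted(routing_table.items())
    )
    return (
        "You are a triage assistant.\n"
        f"Current date (UTC): {now.date().isoformat()}\n"
        f"User timezone: {user_timezone}\n"
        "Routing table:\n"
        f"{routes}\n"
    )


prompt = build_system_prompt(
    "Europe/Berlin",
    {"billing": "billing-queue", "outage": "oncall-queue"},  # hypothetical routes
)
print(prompt)
```

With this in place the model never spends a turn asking for the date, the timezone, or the routing table.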
This is the single biggest mistake in first-pass agentic code, and it's the one principle here not borrowed from the canonical agent essays — it's classical producer/consumer queueing applied to agent invocation. Don't let inbound events spawn agent invocations directly.
Bad:
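    @app.post("/webhook")
    async def webhook(req: Request):
        # assuming a FastAPI/Starlette-style Request, where json() is a coroutine and must be awaited
        return await run_agent(await req.json())  # fans out per event, unbounded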
A burst of 500 events spawns 500 concurrent agent loops, each chewing tokens, each racing for the same downstream resources, none cancellable.
Good — events enqueue, workers dequeue:
    @app.post("/webhook")
    async def webhook(req: Request):
        await queue.put(await req.json())  # cheap, bounded
        return {"status": "queued"}

    # elsewhere, N workers (N = chosen concurrency)
    async def worker():
        while True:
            msg = await queue.get()
            await run_agent(msg)
The queue can be Redis, SQS, an asyncio.Queue, a DB table polled by workers, an APScheduler jobstore — anything that gives you (a) a place to land work without doing it, and (b) a bounded set of consumers. The point isn't the technology; the point is the agent's concurrency is set by the worker pool, not by the inbound traffic.
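To check the claim that concurrency follows the worker pool rather than the traffic, here is a self-contained sketch using `asyncio.Queue`. The `run_agent` body is a stand-in that only sleeps, and `handle_burst` and the `stats` dictionary are hypothetical helpers added to measure peak concurrency.

```python
import asyncio


async def run_agent(msg: dict, stats: dict) -> None:
    """Stand-in for a real agent run (hypothetical); records how many run at once."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # pretend to do model and tool work
    stats["active"] -= 1
    stats["done"] += 1


async def worker(queue: asyncio.Queue, stats: dict) -> None:
    while True:
        msg = await queue.get()
        try:
            await run_agent(msg, stats)
        finally:
            queue.task_done()  # lets queue.join() know this item is finished


async def handle_burst(n_events: int, n_workers: int) -> dict:
    """Enqueue a burst of events and drain it with a fixed-size worker pool."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)  # bounded landing zone
    stats = {"active": 0, "peak": 0, "done": 0}
    workers = [asyncio.create_task(worker(queue, stats)) for _ in range(n_workers)]
    for i in range(n_events):
        await queue.put({"event_id": i})
    await queue.join()  # wait until every queued event has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return stats


print(asyncio.run(handle_burst(n_events=50, n_workers=4)))
```

However large the burst, `peak` never exceeds `n_workers`. A production worker would also catch and log exceptions from `run_agent` so that one bad message does not kill the worker.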
Benefits:
For UI-driven cases, the "fire-and-forget" variant works too: the action inserts a row and returns a session ID immediately; a background task runs the cycle and streams events via SSE/WebSocket. The frontend never blocks on agent execution.
The model will always try one more thing. Enforcement lives in the harness, not the prompt. Required limits, per run:
AbortController / cancellation token, not a sleep watchdog.
Every limit gets a clean failure path that returns a structured result ({success: false, reason, turns, tool_calls}), not a thrown exception. The caller needs to know why it stopped.
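A minimal sketch of harness-side limits that return the structured result described above might look like this. The specific limits (`max_turns`, `max_tool_calls`, `max_seconds`) and the `step` callback are assumptions for illustration, and the wall-clock check here is a simplification: it only runs between turns, whereas a real harness would pass a cancellation token into the in-flight model and tool calls.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    success: bool
    reason: str
    turns: int
    tool_calls: int


def run_bounded(
    step: Callable[[int], Optional[int]],
    max_turns: int = 10,
    max_tool_calls: int = 25,
    max_seconds: float = 60.0,
) -> dict:
    """Drive an agent loop under hard limits.

    `step(turn)` is a hypothetical callback: it returns None when the agent is
    finished, otherwise the number of tool calls it made during that turn.
    """
    started = time.monotonic()
    turns = 0
    tool_calls = 0
    while True:
        # Every limit exits through a structured result, never an exception.
        if turns >= max_turns:
            return asdict(RunResult(False, "max_turns", turns, tool_calls))
        if tool_calls >= max_tool_calls:
            return asdict(RunResult(False, "max_tool_calls", turns, tool_calls))
        if time.monotonic() - started >= max_seconds:
            return asdict(RunResult(False, "timeout", turns, tool_calls))
        turns += 1
        made = step(turns)
        if made is None:
            return asdict(RunResult(True, "done", turns, tool_calls))
        tool_calls += made


# An agent that never finishes is stopped by the turn limit.
print(run_bounded(lambda turn: 1, max_turns=3))
```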
When a tool call fails, the agent loop should:
ToolResult whose output describes the error in natural language.
The model can read "ENOENT: file not found at /x/y" and retry with a corrected path. It cannot read a Python stack trace that crashed the loop. Unknown tool name? Same: return {"error": "unknown tool: foo"} to the model, don't 500.
This is the difference between an agent that gracefully recovers and one that the engineer has to babysit.
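One way to sketch this is a dispatcher that converts every failure into a result the model can read. The `TOOLS` registry, the `read_file` tool, and the `{"output": ...}` / `{"error": ...}` shapes are illustrative assumptions, not a specific SDK's types.

```python
from typing import Any, Callable, Dict


def read_file(path: str) -> str:
    """Example tool: return the contents of a text file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical tool registry: tool name -> callable
TOOLS: Dict[str, Callable[..., Any]] = {"read_file": read_file}


def execute_tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Run one tool call; always return a result dict, never raise into the loop."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"output": tool(**arguments)}
    except Exception as exc:  # deliberately broad: the model should see every failure
        return {"error": f"{type(exc).__name__}: {exc}"}


print(execute_tool_call("foo", {}))
print(execute_tool_call("read_file", {"path": "/definitely/not/here.txt"}))
```

The loop appends whichever dict comes back as the tool result for that call, so a bad path or a misspelled tool name costs one extra turn rather than the whole run.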
Tools are how the model perceives and acts on your system. Treat their design with the care you'd give a public API consumed by a junior engineer.
schedule_meeting(when, with, topic) beats three calls to find_user, find_slot, create_event.
search_recent_emails vs search_archived_emails, not two search tools that the model has to guess between.
Not just "validation failed", but "validation failed: 'when' must be ISO8601, got '2026-13-01'".
Test tools by reading the description cold: can you tell what the tool does, when to use it, and what inputs are valid? If not, the model can't either.
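The error-message example above can be sketched as a small validator. `validate_when` is a hypothetical helper, and `datetime.fromisoformat` is used as a stand-in ISO 8601 check (before Python 3.11 it accepts only a subset of the standard).

```python
from datetime import datetime


def validate_when(value: str) -> dict:
    """Validate the 'when' argument and explain exactly what was wrong."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        # Name the field, the expected format, and the offending value.
        return {"error": f"validation failed: 'when' must be ISO8601, got '{value}'"}
    return {"ok": parsed.isoformat()}


print(validate_when("2026-13-01"))
print(validate_when("2026-12-01T10:00:00"))
```

The model can act on the first message directly: month 13 is visibly wrong, and the expected format is named.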
The tradeoff runs in a direction that's easy to get backward:
The small, high-confidence pre-loads from §1 (date, user profile, routing table) are the case where front-loading is good for both axes — small enough not to crush attention, always-needed enough that the model expects to use them. The "often costs accuracy" caveat kicks in when you start front-loading large or speculative material to avoid tool calls.
Pick the axis before tuning:
The mechanics below apply regardless; the dials get set differently:
cacheControl: { type: "ephemeral" } on the system message). Multi-turn conversations within the 5-min TTL get drastic cost/latency wins.
Success probability decays exponentially in the number of things one prompt is trying to do. Three specialized prompts that chain are almost always better than one mega-prompt that tries to plan + research + act + format.
Common decomposition: a planner produces a structured plan (deterministic, schema-validated), an executor runs the plan step by step (possibly with the model's help on each step), an evaluator scores the result. Each role has a focused prompt and a small tool set.
Bonus: each role is independently testable. Mega-prompts are not.
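The planner / executor / evaluator split can be sketched with each role as a plain callable, which is also what makes the roles independently testable. Everything here is illustrative: the action vocabulary, the plan schema, and the `fake_*` stand-ins replace what would be focused model calls in a real system.

```python
import json
from typing import Callable, Dict, List

ALLOWED_ACTIONS = {"search", "summarize"}  # hypothetical action vocabulary


def validate_plan(raw: str) -> List[Dict[str, str]]:
    """Parse planner output and reject anything outside the expected schema."""
    plan = json.loads(raw)
    if not isinstance(plan, list) or not plan:
        raise ValueError("plan must be a non-empty JSON list")
    for step in plan:
        if not isinstance(step, dict) or set(step) != {"action", "input"}:
            raise ValueError(f"bad step shape: {step!r}")
        if step["action"] not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {step['action']}")
    return plan


def run_pipeline(
    task: str,
    planner: Callable[[str], str],
    executor: Callable[[Dict[str, str]], str],
    evaluator: Callable[[str, List[str]], float],
) -> dict:
    """Chain the three roles; each one sees only what it needs."""
    plan = validate_plan(planner(task))
    outputs = [executor(step) for step in plan]
    return {"plan": plan, "outputs": outputs, "score": evaluator(task, outputs)}


# Stand-ins for model calls, so each role can be tested without a model.
def fake_planner(task: str) -> str:
    return json.dumps([
        {"action": "search", "input": task},
        {"action": "summarize", "input": "search results"},
    ])


def fake_executor(step: Dict[str, str]) -> str:
    return f"{step['action']} ok"


def fake_evaluator(task: str, outputs: List[str]) -> float:
    return 1.0 if all(out.endswith("ok") for out in outputs) else 0.0


print(run_pipeline("find the retry docs", fake_planner, fake_executor, fake_evaluator))
```

Because the plan is validated before anything executes, a malformed or out-of-vocabulary plan fails fast at the boundary between roles.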
When the model issues multiple tool calls in a single turn, run them concurrently (asyncio.gather, Promise.all). This is free latency reduction — they were already independent or the model wouldn't have batched them.
Do NOT parallelize across turns. Turn N+1's prompt depends on turn N's tool results; trying to overlap them is racing with yourself. The agent loop is fundamentally sequential per run.
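A minimal sketch of within-turn parallelism with `asyncio.gather` follows. The two tools and the call format (`name` plus `arguments`) are assumptions for illustration; real SDKs have their own tool-call objects.

```python
import asyncio
from typing import Any, Dict, List


async def get_weather(city: str) -> str:
    await asyncio.sleep(0.05)  # simulated network latency
    return f"weather for {city}"


async def get_time(city: str) -> str:
    await asyncio.sleep(0.05)  # simulated network latency
    return f"time in {city}"


# Hypothetical tool registry: tool name -> async callable
TOOLS = {"get_weather": get_weather, "get_time": get_time}


async def run_tool_calls(calls: List[Dict[str, Any]]) -> List[Any]:
    """Run all tool calls from ONE model turn concurrently.

    Results come back in the same order as `calls`. With return_exceptions=True
    a failing call yields its exception object instead of cancelling the others.
    """
    coros = [TOOLS[call["name"]](**call["arguments"]) for call in calls]
    return await asyncio.gather(*coros, return_exceptions=True)


turn_calls = [
    {"name": "get_weather", "arguments": {"city": "Oslo"}},
    {"name": "get_time", "arguments": {"city": "Oslo"}},
]
print(asyncio.run(run_tool_calls(turn_calls)))
```

The next model call still waits for the whole batch, which keeps the loop sequential across turns.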
Anthropic's published numbers: multi-agent research systems burn ~15× the tokens of single-turn chat, and about 80% of the performance variance in their evaluation was explained by token usage alone, not architectural cleverness. The architecture is worth it when the task is parallelizable and breadth-first (research, scouting, brainstorming). It is NOT worth it for coding, anything that needs shared state across workers, or anything that needs tight coordination.
Heuristic: if you'd struggle to describe what each subagent is doing in one sentence each, you don't need subagents — you need a single agent with better tools.
Agents are non-deterministic and stateful. You cannot ship one without a way to know it's getting better or worse over time.
If you don't have evals yet, that's the next thing to build — before more tools, before more capability, before scaling.
Agents fail in ways that one-shot LLM calls do not: mid-loop crashes, deploys landing during a long-running session, identical inputs producing divergent traces. Build for it.
Prompts live in .md files, not in code
A prompt is content. Code that talks to an LLM should read("prompts/foo.md"), not embed a multi-line string literal that looks like markdown but is never rendered as markdown anywhere.
Bad:
    SYSTEM_PROMPT = """
    You are a triage assistant.
    ## Your task
    Read the incoming ticket and output JSON with...
    """
Good:
    from pathlib import Path
    SYSTEM_PROMPT = (Path(__file__).parent / "prompts" / "triage_system.md").read_text(encoding="utf-8")
What you gain:
.md is what you actually wrote: headings, lists, code blocks rendered.
.md file diffs cleanly. The same edit inside a Python triple-string is a blob diff.
{ characters: none of it needs escaping in a real file.
prompts/welcome.md in any editor, and they don't need to learn how to dodge Python quoting.
prompts/triage.md loads from Python and TypeScript. Triple-strings don't cross language boundaries.
For templating: use Jinja, handlebars, or just .format() on the loaded content. The substitution layer is independent of where the prompt lives.
The only case where embedding makes sense: a literal one-line instruction in a throwaway script. Anything multi-line, anything formatted, anything reused — it goes in a file.
For any agent touching untrusted input (user content, web pages, third-party tool outputs), pick at most two of:
All three together is the lethal trifecta — a prompt injection in the untrusted input steers the agent to exfiltrate the sensitive data via the action it can take. Mitigations exist (sandboxing, allowlists, human-in-the-loop) but prompt injection is unsolved; design the agent so a successful injection can't reach all three corners.
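The "at most two" rule can be enforced as a startup check on the agent's configuration. This is only a sketch: the three field names below paraphrase the corners named in the paragraph above (untrusted input, sensitive data, an action that can move data out) and are assumptions, as is the `check_rule_of_two` helper.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    # Field names are illustrative paraphrases of the three corners.
    reads_untrusted_input: bool
    accesses_sensitive_data: bool
    takes_external_actions: bool


def check_rule_of_two(caps: AgentCapabilities) -> None:
    """Refuse to start an agent that combines all three capabilities."""
    enabled = [name for name, on in asdict(caps).items() if on]
    if len(enabled) == 3:
        raise ValueError(
            "lethal trifecta: agent has " + ", ".join(enabled)
            + "; drop one or gate it behind human approval"
        )


# Two of three is allowed: this agent cannot act externally.
check_rule_of_two(AgentCapabilities(True, True, False))
```

A check like this does not stop prompt injection; it only keeps a deployment from quietly drifting into the configuration where a successful injection can reach all three corners.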
| Failure | Principle |
|---|---|
| Webhook storm spawns 500 parallel agents | 2 (queue) |
| Agent loops forever on the same broken tool call | 3 (bounds) + 4 (errors as results) |
| Context window blows up at turn 7 | 3 (token budget) + 6 (compaction) |
| Tool returns 10MB; next call 400s | 5 (paginate/truncate) |
| Model confuses two similar tools | 5 (naming + descriptions) |
| 15 subagents spawned for a one-shot lookup | 3 (concurrency cap) + 9 (earn it) |
| Mega-prompt accuracy is 60% | 1 (workflow first) + 7 (decompose) |
| Prompt injection exfiltrates customer data | Rule of Two |
| "It used to work" but no one knows when it broke | 10 (evals) + 11 (observability) |
| Multi-line prompt embedded as a """...""" string in code | 12 (move to prompts/<name>.md, load via read_text()) |
npx claudepluginhub johnfink8/skill-repo --plugin johnfink-skills