From grimoire
Enforces token budgets, per-user quotas, request timeouts, and loop detection in LLM applications to prevent runaway costs and denial of service.
Slash command: /grimoire:apply-llm-resource-limits
Enforce token budgets, per-user quotas, request timeouts, and loop detection in LLM applications — preventing runaway inference costs, agent infinite loops, and user-driven denial of service.
Adopted by: OWASP Top 10 for LLM Applications 2025, LLM10 (Unbounded Consumption), which supersedes LLM04 (Model Denial of Service) from the earlier edition. OpenAI, Anthropic, and Google all enforce per-account and per-minute rate limits on their APIs. AWS Bedrock, Azure OpenAI, and Google Vertex AI all provide quota management controls. LangChain, LlamaIndex, and AutoGen include resource limit configurations in their framework defaults.
Status: Emerging. The attack class is well understood, but defense tooling and best practices were still being standardized in 2024-2025.
Impact: A single request with a 100,000-token context at $0.01 per 1K tokens costs $1; 100 such requests per minute cost $100 per minute in inference. Agentic systems (AutoGPT-style) with no iteration limits have been observed consuming $50–$500 in a single runaway session. Without limits, a single malicious or buggy user can drain a monthly budget in minutes. Recursive tool-calling loops in multi-agent systems can saturate compute indefinitely.
Why best: Monitoring costs after the fact is the common approach, but it detects abuse only after the damage has occurred. Pre-emptive per-request and per-user limits cap the blast radius at known thresholds.
Sources: OWASP LLM Top 10 2025 LLM10 (formerly LLM04); CWE-770; OpenAI rate limit documentation; Anthropic responsible scaling policy
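The impact arithmetic above can be reproduced with a small helper. This is a minimal sketch: the name `estimate_burn_rate` and the single flat price per 1K tokens are illustrative assumptions, not real provider pricing.

```python
def estimate_burn_rate(tokens_per_request: int,
                       price_per_1k_tokens: float,
                       requests_per_minute: int) -> tuple:
    """Return (cost per request, cost per minute) in USD.

    Hypothetical helper: assumes one flat price per 1K tokens.
    """
    cost_per_request = tokens_per_request / 1000 * price_per_1k_tokens
    return cost_per_request, cost_per_request * requests_per_minute


# The figures from the Impact paragraph: 100K tokens, $0.01 per 1K, 100 requests/minute
per_request, per_minute = estimate_burn_rate(100_000, 0.01, 100)
print(f"${per_request:.2f} per request, ${per_minute:.2f} per minute")
```

Plugging in a team's own traffic and pricing gives a quick upper bound on how fast an unthrottled endpoint could spend.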
Set per-request token budgets:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MAX_INPUT_TOKENS = 4096   # max context size per request
MAX_OUTPUT_TOKENS = 2048  # max generated tokens per response

def call_llm(prompt: str, system: str = "") -> str:
    # Estimate tokens before calling (approx. 4 characters per token)
    estimated_input = len(prompt + system) // 4
    if estimated_input > MAX_INPUT_TOKENS:
        raise ValueError(f"Input too long: ~{estimated_input} tokens")
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=MAX_OUTPUT_TOKENS,  # hard cap on output
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
Implement per-user and per-tenant token quotas:
import redis
from datetime import datetime, timezone

DAILY_TOKEN_LIMIT = 100_000      # per user per day
MONTHLY_TOKEN_LIMIT = 2_000_000  # per user per month

r = redis.Redis()  # connection settings are deployment-specific

class QuotaExceeded(Exception):
    pass

def check_and_consume_quota(user_id: str, estimated_tokens: int):
    now = datetime.now(timezone.utc)
    day_key = f"tokens:{user_id}:{now.date()}"
    month_key = f"tokens:{user_id}:{now.strftime('%Y-%m')}"
    pipe = r.pipeline()
    pipe.incrby(day_key, estimated_tokens)
    pipe.expire(day_key, 86400 * 2)     # 2-day TTL
    pipe.incrby(month_key, estimated_tokens)
    pipe.expire(month_key, 86400 * 35)  # 35-day TTL
    day_total, _, month_total, _ = pipe.execute()
    # Note: the counters are incremented even when the request is rejected
    if day_total > DAILY_TOKEN_LIMIT:
        raise QuotaExceeded("Daily token limit reached. Resets at midnight UTC.")
    if month_total > MONTHLY_TOKEN_LIMIT:
        raise QuotaExceeded("Monthly token limit reached.")
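The quota logic can be exercised without a Redis server. The sketch below is a hypothetical single-process stand-in (`InMemoryQuota` is not part of the skill) that checks the limit before consuming, so a rejected request does not count against the user. Doing the same atomically across processes would need extra machinery on the Redis side, which is not shown here.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Optional


class QuotaExceeded(Exception):
    """Raised when a request would push a user past their quota."""


class InMemoryQuota:
    """Hypothetical single-process stand-in for the Redis counters."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self._used = defaultdict(int)  # (user_id, date) -> tokens consumed

    def consume(self, user_id: str, tokens: int,
                now: Optional[datetime] = None) -> int:
        """Reserve tokens for user_id, or raise without consuming anything."""
        now = now or datetime.now(timezone.utc)
        key = (user_id, now.date())
        if self._used[key] + tokens > self.daily_limit:
            raise QuotaExceeded(f"Daily token limit reached for {user_id}")
        self._used[key] += tokens
        return self.daily_limit - self._used[key]  # tokens remaining today


quota = InMemoryQuota(daily_limit=1_000)
print(quota.consume("alice", 600))  # 400 tokens remain
try:
    quota.consume("alice", 600)     # would exceed the daily limit
except QuotaExceeded as exc:
    print(exc)
print(quota.consume("alice", 400))  # the rejected request consumed nothing
```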
Limit agent iteration counts and recursion depth:
class AgentError(Exception):
    pass

class SafeAgent:
    MAX_ITERATIONS = 20
    MAX_TOOL_CALLS_PER_ITER = 5

    def __init__(self, llm, tools):
        self.llm = llm      # client exposing complete(task, tools=...)
        self.tools = tools

    def run(self, task: str) -> str:
        iteration = 0
        total_tool_calls = 0
        while iteration < self.MAX_ITERATIONS:
            iteration += 1
            response = self.llm.complete(task, tools=self.tools)
            if not response.tool_calls:
                return response.content  # task complete
            if len(response.tool_calls) > self.MAX_TOOL_CALLS_PER_ITER:
                raise AgentError("Too many tool calls in single step")
            total_tool_calls += len(response.tool_calls)
            # execute_tool_calls (implemented elsewhere) runs the tools and
            # feeds their results back into the agent's state
            self.execute_tool_calls(response.tool_calls)
        raise AgentError(
            f"Agent exceeded {self.MAX_ITERATIONS} iterations "
            f"({total_tool_calls} tool calls) without completion"
        )
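An iteration cap bounds the damage, but the skill description also mentions loop detection, which can stop a stuck agent earlier. One possible approach, sketched here with hypothetical names (`ToolCallLoopDetector`, `LoopDetected`), is to count identical tool calls and abort once the same call repeats too often. It assumes tool arguments are JSON-serializable.

```python
import json
from collections import Counter


class LoopDetected(Exception):
    """Raised when the same tool call repeats too many times."""


class ToolCallLoopDetector:
    """Hypothetical helper: counts identical (tool name, arguments) pairs."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def record(self, tool_name: str, arguments: dict) -> None:
        # Serialize with sorted keys so argument order does not matter.
        signature = (tool_name, json.dumps(arguments, sort_keys=True))
        self._seen[signature] += 1
        if self._seen[signature] > self.max_repeats:
            raise LoopDetected(
                f"{tool_name} called {self._seen[signature]} times "
                "with identical arguments"
            )


detector = ToolCallLoopDetector(max_repeats=2)
detector.record("search", {"q": "weather", "lang": "en"})
detector.record("search", {"lang": "en", "q": "weather"})  # same call, different key order
try:
    detector.record("search", {"q": "weather", "lang": "en"})
except LoopDetected as exc:
    print(exc)
```

In an agent loop, `record` would be called once per tool call before execution. Exact-match counting will miss loops where arguments vary slightly, so it complements the iteration cap instead of replacing it.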
Set wall-clock timeouts on LLM calls:
import asyncio
import logging

logger = logging.getLogger(__name__)

class ServiceTimeout(Exception):
    pass

async def call_llm_with_timeout(prompt: str, timeout_seconds: float = 30.0) -> str:
    try:
        # llm_client.acompletions is a placeholder for your async client call
        response = await asyncio.wait_for(
            llm_client.acompletions(prompt),
            timeout=timeout_seconds,
        )
        return response.text
    except asyncio.TimeoutError:
        logger.warning("LLM call timed out after %ss", timeout_seconds)
        raise ServiceTimeout("AI response took too long")
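The timeout behaviour can be checked locally without a real model. In this sketch `fake_llm_call` is a hypothetical stand-in that simply sleeps; the point is to show `asyncio.wait_for` returning normally for a fast call and cancelling a slow one.

```python
import asyncio


async def fake_llm_call(delay_seconds: float) -> str:
    """Hypothetical stand-in for a real async LLM client call."""
    await asyncio.sleep(delay_seconds)
    return "done"


async def call_with_deadline(delay_seconds: float, timeout_seconds: float) -> str:
    try:
        return await asyncio.wait_for(
            fake_llm_call(delay_seconds), timeout=timeout_seconds
        )
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising
        return "timed out"


print(asyncio.run(call_with_deadline(0.01, timeout_seconds=0.5)))  # fast call completes
print(asyncio.run(call_with_deadline(0.5, timeout_seconds=0.05)))  # slow call is cancelled
```

Cancelling on the client side frees the caller, but whether the provider stops generating (and billing) at that moment depends on the API, so a timeout is best combined with an output token cap.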
Monitor and alert on cost anomalies:
def track_llm_cost(user_id: str, input_tokens: int, output_tokens: int,
                   model: str):
    # Calculate cost (example rates in USD per 1K tokens; check current pricing)
    cost_usd = (input_tokens * 0.003 + output_tokens * 0.015) / 1000
    # metrics and alert_ops stand in for your metrics client and alerting hook
    metrics.increment('llm.tokens.input', input_tokens, tags={'user': user_id})
    metrics.increment('llm.tokens.output', output_tokens, tags={'user': user_id})
    metrics.gauge('llm.cost.request', cost_usd, tags={'user': user_id, 'model': model})
    # Alert if single request exceeds threshold
    if cost_usd > 1.00:  # $1 per request
        alert_ops(f"High-cost LLM request: ${cost_usd:.2f} from user {user_id}")
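A fixed per-request threshold misses a user who sends many moderately priced requests in a short period. One way to catch that, sketched below under assumed names and thresholds (`SpendWindow`, a 60-second window, a $1.00 limit), is to sum each user's spend over a sliding time window.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SpendWindow:
    """Hypothetical tracker: per-user spend over a sliding time window."""

    def __init__(self, window_seconds: float, alert_threshold_usd: float):
        self.window_seconds = window_seconds
        self.alert_threshold_usd = alert_threshold_usd
        self._events = defaultdict(deque)  # user_id -> deque of (timestamp, cost)

    def record(self, user_id: str, cost_usd: float,
               now: Optional[float] = None) -> bool:
        """Record a request cost; return True if the window total exceeds the threshold."""
        now = time.monotonic() if now is None else now
        events = self._events[user_id]
        events.append((now, cost_usd))
        # Drop events that have aged out of the window.
        while events and events[0][0] <= now - self.window_seconds:
            events.popleft()
        return sum(cost for _, cost in events) > self.alert_threshold_usd


window = SpendWindow(window_seconds=60, alert_threshold_usd=1.00)
print(window.record("alice", 0.40, now=0))   # False: $0.40 in window
print(window.record("alice", 0.40, now=10))  # False: $0.80 in window
print(window.record("alice", 0.40, now=20))  # True: $1.20 in window
print(window.record("alice", 0.40, now=90))  # False: earlier events expired
```

The `now` parameter exists only to make the example deterministic; in production the default monotonic clock would be used, and a `True` result would feed the same alerting hook as the per-request check.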
Validate and truncate user-supplied context before including in prompts:
import tiktoken

def truncate_to_token_limit(text: str, max_tokens: int, model: str = "gpt-4") -> str:
    # tiktoken implements OpenAI tokenizers; counts for other providers' models are approximate
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Truncate and add indicator (the indicator adds a few tokens beyond max_tokens)
    truncated = enc.decode(tokens[:max_tokens])
    return truncated + "\n[Content truncated due to length]"
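Where a tokenizer library is unavailable, a rougher fallback is possible using the 4-characters-per-token heuristic from the per-request budget step. This sketch (the name `truncate_by_estimate` is hypothetical) also reserves room for the truncation marker so the result stays inside the estimated budget. It is an approximation and is not equivalent to exact token counting.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, same as the per-request estimate
MARKER = "\n[Content truncated due to length]"


def truncate_by_estimate(text: str, max_tokens: int) -> str:
    """Truncate text so that text plus marker fits the estimated character budget."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    if max_chars <= len(MARKER):
        return text[:max_chars]  # budget too small to fit the marker
    return text[:max_chars - len(MARKER)] + MARKER


short = truncate_by_estimate("hello", max_tokens=100)     # returned unchanged
clipped = truncate_by_estimate("x" * 1000, max_tokens=50)  # cut to at most 200 characters
print(len(short), len(clipped))
```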
Set max_tokens on every LLM API call; never rely on default limits.
npx claudepluginhub jeffreytse/grimoire --plugin grimoire