From token-budget
Plan, implement, and monitor token budgets for LLM-based applications. Covers three areas — context budgets (fitting content into context windows), cost budgets (tracking spend per user/service/month), and quota/rate budgets (rate limiting LLM calls). Use this skill whenever the user mentions token budgets, token limits, LLM costs, context window management, token tracking, API budgets, rate limiting for LLMs, token usage monitoring, or wants to control how much context is sent to an LLM. Also trigger when the user is building a ChatModelListener, token counter, or usage tracker for LangChain4j/Quarkus.
Slash command
/token-budget:token-budget
Help plan, implement, and monitor token budgets in LLM-based applications. Token budgets come in three flavors, often used in combination:
Context budgets: fitting content into context windows
Cost budgets: tracking spend per user, service, or month
Quota/rate budgets: rate limiting LLM calls
Each flavor addresses a different concern but they share infrastructure (token counting, tracking, configuration). This skill guides the user through an interview to understand their needs, then generates the appropriate Quarkus/LangChain4j code.
Before generating code, ask these questions to understand the scope. Skip questions that are already answered from context.
Which use case? Context / Cost / Quota / Combination?
Which LLM provider? Ollama / LM Studio / OpenAI / Anthropic / Azure OpenAI / other?
How many services/agents make LLM calls?
Is there a monthly monetary budget?
Which framework? Quarkus + LangChain4j (primary) / Spring + LangChain4j / other?
Should token usage be persisted?
After the interview, determine which budget types to generate and read the corresponding reference file(s) from references/ for implementation details.
All three budget types share a common entry point — the ChatModelListener from LangChain4j. This CDI bean intercepts every LLM call and provides access to input/output token counts.
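```java
import dev.langchain4j.model.chat.listener.ChatModelListener;
import dev.langchain4j.model.chat.listener.ChatModelResponseContext;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class TokenTrackingListener implements ChatModelListener {
    @Override
    public void onResponse(ChatModelResponseContext context) {
        // Depending on the LangChain4j version, this accessor may be chatResponse() instead
        var usage = context.response().tokenUsage();
        if (usage == null) {
            return; // some providers do not report token usage
        }
        int inputTokens = usage.inputTokenCount();
        int outputTokens = usage.outputTokenCount();
        // Route to the appropriate budget handler(s)
    }
}
```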
This listener is the foundation. Each budget type adds its own logic on top.
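One way to layer budget-specific logic on top of the listener is a small fan-out: the listener hands the token counts to a dispatcher, and each budget type registers its own handler. The sketch below is plain Java with no Quarkus dependency; `BudgetHandler` and `BudgetDispatcher` are hypothetical names chosen for illustration, not artifacts the skill is known to generate.

```java
import java.util.List;

/** Hypothetical callback that a budget type implements to receive usage numbers. */
interface BudgetHandler {
    void onUsage(int inputTokens, int outputTokens);
}

/** Fans one usage report out to every registered handler, in registration order. */
final class BudgetDispatcher {

    private final List<BudgetHandler> handlers;

    BudgetDispatcher(List<BudgetHandler> handlers) {
        // Defensive, immutable copy so later changes to the caller's list have no effect
        this.handlers = List.copyOf(handlers);
    }

    void dispatch(int inputTokens, int outputTokens) {
        for (BudgetHandler handler : handlers) {
            handler.onUsage(inputTokens, outputTokens);
        }
    }
}
```

In a listener like the one above, `onResponse` would call `dispatch(inputTokens, outputTokens)` where the "Route to the appropriate budget handler(s)" comment sits.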
Before sending a request, you often need to estimate how many tokens the content will use. For this, the skill generates a TokenEstimator utility.
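The generated utility itself is not reproduced here. As a rough illustration of the idea, the sketch below estimates tokens from character count. The ratio of about four characters per token is an assumed rule of thumb for English text, not a value taken from the skill, and real tokenizers will differ by model and language.

```java
/** Rough token estimator; the characters-per-token ratio is an assumed heuristic. */
final class TokenEstimator {

    // Assumption: roughly 4 characters per token for English text
    private static final double CHARS_PER_TOKEN = 4.0;

    private TokenEstimator() {
    }

    /** Returns an approximate token count, rounded up; 0 for null or empty input. */
    static int estimate(String text) {
        if (text == null || text.isEmpty()) {
            return 0;
        }
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    /** True if the text is estimated to fit into the remaining token budget. */
    static boolean fits(String text, int remainingTokens) {
        return estimate(text) <= remainingTokens;
    }
}
```

Rounding up keeps the estimate on the conservative side, which is usually the safer direction when the goal is to avoid overflowing a context window.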
Based on the interview answers, read the appropriate reference file(s) and generate code:
| Use Case | Reference File | Key Artifacts |
|---|---|---|
| Context Budget | references/context-budget.md | TokenBudgetService, TokenEstimator, prioritization config |
| Cost Budget | references/cost-budget.md | TokenUsageTracker, REST endpoints, alert events, Flyway migration |
| Quota Budget | references/quota-budget.md | TokenRateLimiter, priority queue, Prometheus metrics |
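To make the context budget row more concrete, here is a minimal sketch of priority-based selection: content pieces are sorted by priority and added until the token budget is used up. `ContextItem` and `ContextBudgetPlanner` are hypothetical names, and the greedy strategy is an assumption for illustration; the actual TokenBudgetService described in references/context-budget.md may work differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical content piece; a lower priority value means more important. */
record ContextItem(String name, int priority, int estimatedTokens) {
}

final class ContextBudgetPlanner {

    private ContextBudgetPlanner() {
    }

    /** Picks items in priority order and skips any item that no longer fits. */
    static List<ContextItem> select(List<ContextItem> items, int maxTokens) {
        List<ContextItem> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingInt(ContextItem::priority));

        List<ContextItem> selected = new ArrayList<>();
        int used = 0;
        for (ContextItem item : sorted) {
            if (used + item.estimatedTokens() <= maxTokens) {
                selected.add(item);
                used += item.estimatedTokens();
            }
        }
        return selected;
    }
}
```

With a budget of 450 tokens and items of 100 (priority 0), 300 (priority 1), and 500 (priority 2) tokens, the first two are selected and the third is skipped.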
AtomicLong / ConcurrentHashMap for in-memory tracking
ChatModelListener abstracts the provider
quarkus-micrometer-registry-prometheus
application.properties with sensible defaults
Generated code follows these patterns:
{project.package}.token (or {module}.control if using BCE)
@ApplicationScoped for all services
@ConfigProperty with prefix token-budget
Log.infof() for budget consumption, Log.warnf() for threshold alerts
@QuarkusTest with REST Assured for endpoint testing
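The in-memory tracking with AtomicLong and ConcurrentHashMap mentioned above could look roughly like the following. This is a plain-Java sketch under stated assumptions: the class name `InMemoryTokenUsageTracker`, the per-service limit, and the warning ratio are all illustrative, and in generated Quarkus code the values would presumably come from `@ConfigProperty` and the threshold check would feed a `Log.warnf()` call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Thread-safe, in-memory token counter per service; names and limits are illustrative. */
final class InMemoryTokenUsageTracker {

    private final Map<String, AtomicLong> tokensByService = new ConcurrentHashMap<>();
    private final long limit;       // assumed per-service token limit
    private final double warnRatio; // assumed fraction of the limit that triggers a warning

    InMemoryTokenUsageTracker(long limit, double warnRatio) {
        this.limit = limit;
        this.warnRatio = warnRatio;
    }

    /** Adds usage for a service and returns the new total for that service. */
    long record(String service, int inputTokens, int outputTokens) {
        return tokensByService
                .computeIfAbsent(service, key -> new AtomicLong())
                .addAndGet((long) inputTokens + outputTokens);
    }

    /** Returns the tokens recorded so far, or 0 for an unknown service. */
    long total(String service) {
        AtomicLong counter = tokensByService.get(service);
        return counter == null ? 0L : counter.get();
    }

    /** True once a service has consumed at least warnRatio of the limit. */
    boolean thresholdReached(String service) {
        return total(service) >= limit * warnRatio;
    }
}
```

`computeIfAbsent` plus `addAndGet` keeps concurrent updates from different request threads consistent without explicit locking. The counters are lost on restart, so persisted usage would need a database-backed variant.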
npx claudepluginhub mgoericke/javamark-claude-plugins --plugin token-budget