Makes Claude prompts cheaper to send and replies cheaper to receive — without changing meaning — and shows the savings on a live dashboard (UI brand: FortiInferenceIQ).
CONCISE) — appends a short "be brief, lead with the answer" instruction to
the last turn, so replies come back shorter. The big lever: output tokens cost ~5× input.You apply it two ways: a Claude Code hook/plugin (works on any login, incl. Pro/Max OAuth) and an in-path proxy (API-key only; measures real output savings). It never breaks Claude Code's tool loop.
| Path | What it is |
|---|---|
core-engine/optimize.py | mechanical filler-strip + token counting + reporting; also a CLI |
core-engine/router.py | deterministic intent → model routing (Haiku/Sonnet/Opus) |
core-engine/semcache.py | 3-layer semantic cache (non-agentic, text-only) |
core-engine/calibrate.py | same-prompt brevity gauge (reply-trimming on vs off) |
proxy/intercept.py | optimizing reverse proxy, port :8082 |
dashboard/collector.py | standalone metrics dashboard, port :8088 |
.claude/hooks/optimize_prompt.py | UserPromptSubmit hook (shipped as a plugin) |
iq | launcher: starts the proxy and runs Claude Code through it |
In Claude Code:
/plugin marketplace add svuillaume/InferenceIQ
/plugin install inferenceiq@inferenceiq
Auto-shortens your prompts and nudges shorter replies on every prompt; never blocks; works on any login. (Takes effect next session.)
docker compose up -d --build # start the proxy on :8082
./iq # run Claude Code through the proxy
docker compose down # stop
CONCISE=0 docker compose up -d intercept (on by default).INFERENCEIQ_DASHBOARD=http://<host>:8088 docker compose up -d intercept.Depends on nothing else; one instance collects from many machines and persists totals across restarts via a volume.
cd dashboard
docker build -t iq-dashboard .
docker run -d --name iq-dashboard -p 8088:8088 -v iq_data:/data --restart unless-stopped iq-dashboard
http://<host>:8088 (simple before/after view) · http://<host>:8088/full (detailed).curl -XPOST http://<host>:8088/api/resetdocker stop iq-dashboard && docker rm iq-dashboard,
then re-run with the same -v iq_data:/data.Input — filler removed by the optimizer (after = billed input).
Output — median reply length with trimming on (CONCISE=1) vs off (CONCISE=0),
over the concise replies served. For a true same-prompt baseline:
ANTHROPIC_API_KEY=… ./core-engine/calibrate.py.
All from Anthropic's real usage object — not estimates.
🧠 Key clarification (important)
CONCISE (1/0 reply-trimming) · ROUTE_MODELS (on/advise/off) · CACHE_ENABLED (1/0) ·
INFERENCEIQ_DASHBOARD (where to report; off disables) · IQ_TOKEN (shared secret for a
token-protected collector) · IQ_PERSIST_PATH (dashboard persistence file, default /data/tally.json
in the image).
See CLAUDE.md for architecture/safety invariants and roadmap.md for what's applied vs planned.
No model invocation
Executes directly as bash, bypassing the AI model
Own this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimOwn this plugin?
Verify ownership to unlock analytics, metadata editing, and a verified badge. GitHub access is read-only (username + org membership).
Sign in to claimBased on adoption, maintenance, documentation, and repository signals. Not a security audit or endorsement.
npx claudepluginhub svuillaume/inferenceiq --plugin inferenceiqMulti-model consensus engine integrating OpenAI Codex CLI, Gemini CLI, and Claude CLI for collaborative code review and problem-solving.
Memory compression system for Claude Code - persist context across sessions
Editorial "Web Designer" bundle for Claude Code from Antigravity Awesome Skills.
Complete collection of battle-tested Claude Code configs from an Anthropic hackathon winner - agents, skills, hooks, and rules evolved over 10+ months of intensive daily use