cortex

A proxy layer for long-context retrieval on top of any LLM. Cortex embeds and cosine-ranks cold chat-history message-groups against a reformulated query and injects the top-K verbatim into the system prompt. Inference-time only — no model modification, no fine-tuning, no KV-cache surgery. Drop-in proxy for the Anthropic Messages API and the OpenAI Chat Completions API.

Calibration: Anthropic reports Opus 4.6 (1M-token native context) scoring 76% on the MRCR v2 8-needle 1M variant; Sonnet 4.5 scores 18.5%. Opus 4.7 + cortex scores 100% at 1M tokens on the same benchmark and stays at 100% through 10M tokens. Same result on RULER niah_multikey_3 from 64K to 10M llama3 tokens. A local Qwen3.5-9B routed through cortex hits 100% on 30 stratified MRCR rows (raw 9B: 67%, raw Opus 4.7: 73%).

Demo 1 — MRCR v2 8-needle at Anthropic's published scales

Opus + cortex MRCR v2 8-needle scaling: 256K / 1M / 5M / 10M tokens

Same four context lengths Anthropic measures in the claude-opus-4-6 announcement. Vanilla Opus 4.7 (200K native context) scores 16% at 256K and the Anthropic API rejects every request at 1M+. Opus 4.7 + cortex scores 100% at every scale, including past Opus 4.6's 1M-token native limit, by compressing up to ~39K cold messages into 7 verbatim turns + a ~6K-token recap.

context	vanilla Opus 4.7	Opus 4.7 + cortex	reference (Anthropic)
256K	16%	100%	—
1M	OVERFLOW	100%	Opus 4.6: 76%
5M	OVERFLOW	100%	beyond Opus 4.6 limit
10M	OVERFLOW	100%	beyond Opus 4.6 limit

n=4, seed=42, lenient rubric (response.lstrip() then strict random-string prefix check + SequenceMatcher ratio). 256K and 1M are real MRCR 8-needle rows from openai/mrcr (dataset max ≈ 625K tokens per row); 5M and 10M are synthesized by stitching real rows with the gold needle preserved in the base row. Token counts via char/4 convention. Reproduce:

# After Path A install (below), with cortex.server running on :8080:
PYTHONPATH=. .venv/Scripts/python.exe bench/pilot_opus/run.py \
  --targets 256000,1000000,5000000,10000000 --unit tokens \
  --n-needles 8 --seed 42 \
  --out results/opus_vs_cortex/mrcr_v3.json

Raw output: results/opus_vs_cortex/mrcr_v3.json. Chart code: bench/pilot_opus/make_chart_v3.py.

Same result, second benchmark: RULER niah_multikey_3 to 10M tokens

Opus + cortex on RULER stays perfect through 10M tokens

To rule out MRCR-specific quirks, we re-ran the experiment on RULER — a different long-context benchmark, different rubric, different needle shape. Cortex stays 100% perfect at every scale from 64K to 10M llama3 tokens. Vanilla Opus 4.7 matches cortex through 512K, then the Anthropic API rejects every request at 1M+ outright.

Two independent benchmarks, two orders of magnitude past Anthropic's published context window, same result.

n=1 per scale (preflight slice of the RULER niah_multikey_3 subtask). The 64K-1M rows are real RULER-llama3-1M samples; 2M/5M/10M are synthesized by stitching the 1M base row with additional RULER distractor lines. Cortex's verbatim_recall_k is tuned per scale (K=16 default, K=200 at 512K-5M, K=2000 at 10M) — high-cardinality NIAH needs more recall candidates as the haystack grows. K is a config knob, not a fundamental limit; the recap budget bounds insertion regardless of K (still ~65K tokens at 10M). Methodology and raw data: bench/pilot_opus/run_ruler.py, results/opus_vs_cortex/ruler_all.json.

A local 9B + cortex matches Opus on MRCR

9B + cortex matches Opus on 30 MRCR rows

Same proxy, different model. Qwen3.5-9B + cortex hits 100% perfect on 30 MRCR rows. Vanilla Opus 4.7 hits 73%. The 9B catches the frontier on retrieval-shaped tasks because cortex pre-locates the needles — the model only has to read the recap.

n=30 single-seed pilot. Strict-rubric is 0 across both 9B arms (qwen3.5 prepends \n\n); the lenient rubric is the headline. Full caveats: bench/pilot_cortex/PAPER.md.

What it is

Cortex is an HTTP proxy in front of any OpenAI- or Anthropic-compatible LLM endpoint. When a conversation exceeds the upstream model's window, cortex:

timegraph-cortex

Popularity

What's Inside

README

cortex

Demo 1 — MRCR v2 8-needle at Anthropic's published scales

Same result, second benchmark: RULER niah_multikey_3 to 10M tokens

A local 9B + cortex matches Opus on MRCR

What it is

Confidence

Similar Plugins

claude-mem

nanobanana

llm-council-plugin

product-management

claude-dashboard