cortex
A proxy layer for long-context retrieval on top of any LLM. Cortex
embeds and cosine-ranks cold chat-history message-groups against a
reformulated query and injects the top-K verbatim into the system
prompt. Inference-time only — no model modification, no fine-tuning,
no KV-cache surgery. Drop-in proxy for the Anthropic Messages API and
the OpenAI Chat Completions API.
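
The ranking-and-injection step described above can be illustrated with a minimal sketch. This is not cortex's implementation: the bag-of-words `embed` is a toy stand-in for a real embedding model, and the function names, the chronological re-ordering of selected groups, and the `[Recalled history, verbatim]` marker are all assumptions made for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: a real embedding model goes here)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_groups(query: str, groups: list[str], k: int) -> list[str]:
    """Return the k message-groups most similar to the (reformulated) query."""
    q = embed(query)
    ranked = sorted(
        range(len(groups)),
        key=lambda i: cosine(q, embed(groups[i])),
        reverse=True,
    )
    # Assumption: selected groups are re-emitted in their original chat order.
    return [groups[i] for i in sorted(ranked[:k])]


def inject(system_prompt: str, recalled: list[str]) -> str:
    """Append the recalled groups, unmodified, to the system prompt."""
    block = "\n\n".join(recalled)
    return f"{system_prompt}\n\n[Recalled history, verbatim]\n{block}"
```

Calling `inject(system_prompt, top_k_groups(query, cold_groups, k))` produces the system prompt that would be forwarded upstream in this sketch; the model itself is untouched.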
Calibration: Anthropic reports Opus 4.6 (1M-token native context)
scoring 76% on the MRCR v2 8-needle 1M variant; Sonnet 4.5 scores
18.5%. Opus 4.7 + cortex scores 100% at 1M tokens on the same
benchmark and stays at 100% through 10M tokens. The same result holds
on RULER niah_multikey_3 from 64K to 10M llama3 tokens. A local
Qwen3.5-9B routed through cortex hits 100% on 30 stratified MRCR rows
(raw 9B: 67%, raw Opus 4.7: 73%).
Demo 1 — MRCR v2 8-needle at Anthropic's published scales

These are the same four context lengths Anthropic measures in the
claude-opus-4-6 announcement.
Vanilla Opus 4.7 (200K native context) scores 16% at 256K, and the
Anthropic API rejects every request at 1M and above. Opus 4.7 + cortex
scores 100% at every scale, including past Opus 4.6's 1M-token native
limit, by compressing up to ~39K cold messages into 7 verbatim turns
plus a ~6K-token recap.
| context | vanilla Opus 4.7 | Opus 4.7 + cortex | reference (Anthropic) |
|---|---|---|---|
| 256K | 16% | 100% | — |
| 1M | OVERFLOW | 100% | Opus 4.6: 76% |
| 5M | OVERFLOW | 100% | beyond Opus 4.6 limit |
| 10M | OVERFLOW | 100% | beyond Opus 4.6 limit |
n=4, seed=42, lenient rubric (response.lstrip(), then the strict
random-string prefix check and SequenceMatcher ratio). The 256K and 1M
rows are real MRCR 8-needle rows from openai/mrcr (dataset max ≈ 625K
tokens per row); the 5M and 10M rows are synthesized by stitching real
rows together, with the gold needle preserved in the base row. Token
counts use the chars/4 convention. Reproduce:
# After Path A install (below), with cortex.server running on :8080:
PYTHONPATH=. .venv/Scripts/python.exe bench/pilot_opus/run.py \
--targets 256000,1000000,5000000,10000000 --unit tokens \
--n-needles 8 --seed 42 \
--out results/opus_vs_cortex/mrcr_v3.json
Raw output: results/opus_vs_cortex/mrcr_v3.json.
Chart code: bench/pilot_opus/make_chart_v3.py.
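
For readers who want to see what the lenient rubric amounts to, here is a minimal sketch based on the description above: strip leading whitespace, require the random-string prefix, then compare with `difflib.SequenceMatcher`. The function names are hypothetical and the scripts under bench/ remain the authoritative version; `approx_tokens` illustrates the chars/4 convention.

```python
from difflib import SequenceMatcher


def grade_strict(response: str, answer: str, prefix: str) -> float:
    """Strict check: the response must begin with the random-string prefix."""
    if not response.startswith(prefix):
        return 0.0
    # Compare the remainders once the shared prefix is removed.
    response = response.removeprefix(prefix)
    answer = answer.removeprefix(prefix)
    return SequenceMatcher(None, response, answer).ratio()


def grade_lenient(response: str, answer: str, prefix: str) -> float:
    """Lenient check: drop leading whitespace first, then grade strictly."""
    return grade_strict(response.lstrip(), answer, prefix)


def approx_tokens(text: str) -> int:
    """chars/4 convention: roughly one token per four characters."""
    return len(text) // 4
```

Under this sketch, a response that starts with `\n\n` fails the strict check outright and is recovered by the lenient one, which is why the two rubrics can diverge sharply for models that pad their output.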
Same result, second benchmark: RULER niah_multikey_3 to 10M tokens

To rule out MRCR-specific quirks, we re-ran the experiment on RULER,
a different long-context benchmark with a different rubric and a
different needle shape. Cortex stays at 100% at every scale from 64K
to 10M llama3 tokens. Vanilla Opus 4.7 matches cortex through 512K,
then the Anthropic API rejects every request at 1M and above.
Two independent benchmarks, an order of magnitude past Anthropic's
published 1M-token context window, same result.
n=1 per scale (preflight slice of the RULER niah_multikey_3 subtask).
The 64K-1M rows are real RULER-llama3-1M samples; 2M/5M/10M are
synthesized by stitching the 1M base row with additional RULER
distractor lines. Cortex's verbatim_recall_k is tuned per scale
(K=16 default, K=200 at 512K-5M, K=2000 at 10M) because high-cardinality
NIAH needs more recall candidates as the haystack grows. K is a config
knob, not a fundamental limit; the recap budget bounds how much is
inserted regardless of K (still ~65K tokens at 10M). Methodology and raw data:
bench/pilot_opus/run_ruler.py,
results/opus_vs_cortex/ruler_all.json.
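
The per-scale K schedule above can be written down as a small lookup. This is only a sketch: `recall_k_for_scale` is a hypothetical helper, not part of cortex, and the cutoffs are assumptions (512K is taken as 512,000 tokens, and scales between 5M and 10M, which were not measured, are assumed to keep K=200).

```python
def recall_k_for_scale(context_tokens: int) -> int:
    """Pick a verbatim_recall_k value for a given haystack size in tokens."""
    if context_tokens >= 10_000_000:
        return 2000  # value reported for the 10M run
    if context_tokens >= 512_000:
        return 200   # value reported for the 512K-5M runs
    return 16        # reported default


# Scales named in the text, mapped to the K used at each.
for scale in (64_000, 512_000, 5_000_000, 10_000_000):
    print(scale, recall_k_for_scale(scale))
```

The result would be passed as the verbatim_recall_k setting; how that setting is supplied to cortex is not shown here.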
A local 9B + cortex matches Opus on MRCR

Same proxy, different model. Qwen3.5-9B + cortex hits 100% on 30 MRCR
rows. Vanilla Opus 4.7 hits 73%. The 9B catches the frontier on
retrieval-shaped tasks because cortex pre-locates the needles, so the
model only has to read the recap.
n=30 single-seed pilot. The strict rubric scores 0 on both 9B arms
because Qwen3.5 prepends \n\n to its responses, so the lenient rubric
is the headline number. Full caveats:
bench/pilot_cortex/PAPER.md.
What it is
Cortex is an HTTP proxy in front of any OpenAI- or Anthropic-compatible
LLM endpoint. When a conversation exceeds the upstream model's window,
cortex: