A deterministic benchmark of the context-rot arc across 6 models. Under full-context replay, cumulative input tokens grow O(n²); with retrieval, they grow O(n). By turn 100 that's ~21× fewer input tokens — at 100% recall. This is the full picture: the workload, the two failure modes, the per-turn data, and exactly how to reproduce every number.
Every turn plants one unique code inside a short memo — a needle like k9f3a2. Then a probe asks for a specific memo's code, and grading is done at the answer level: an exact string match, nothing fuzzy. Same workload, run two ways.
raw re-sends the entire transcript each turn. memory retrieves only the single most relevant memo (top-1). Both must still answer the probe correctly.
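A minimal sketch of the workload and the grader described above, not the harness's actual code. The names `make_code`, `build_workload`, and `grade`, the memo wording, and the whitespace trimming before comparison are all assumptions for illustration.

```python
import random
import string


def make_code(rng: random.Random, length: int = 6) -> str:
    """Return a random lowercase alphanumeric needle, shaped like 'k9f3a2'."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_workload(turns: int, seed: int = 0) -> list[dict]:
    """Plant one memo per turn, each carrying a unique code (hypothetical format)."""
    rng = random.Random(seed)  # fixed seed keeps the workload deterministic
    memos, seen = [], set()
    for turn in range(1, turns + 1):
        code = make_code(rng)
        while code in seen:  # re-draw on the rare collision so codes stay unique
            code = make_code(rng)
        seen.add(code)
        memos.append({
            "turn": turn,
            "text": f"Memo {turn}: the access code is {code}.",
            "code": code,
        })
    return memos


def grade(answer: str, expected_code: str) -> bool:
    """Exact string match on the answer; trimming whitespace is an assumption."""
    return answer.strip() == expected_code
```

Because the seed fixes every code, the raw and memory configurations can be graded against the same list of memos.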
There's no free lunch with raw context. context-clock reproduces each failure mode deterministically before measuring the way out.
Re-read the whole transcript every turn and cumulative input tokens pile up in proportion to n²/2, where n is the number of turns. By turn 100 a single gpt-5.4 raw session has read 1,326,734 cumulative input tokens, for one conversation.
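The arithmetic behind the quadratic can be sketched in a few lines. This assumes every turn adds the same number of tokens to the transcript (`tokens_per_turn`, a hypothetical parameter); real sessions also carry prompt and answer overhead, so this shows the shape of the curve, not the measured figure.

```python
def raw_cumulative_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every call re-reads the whole transcript."""
    total = 0
    transcript = 0
    for _ in range(turns):
        transcript += tokens_per_turn  # the transcript grows by one turn
        total += transcript            # and this call reads all of it
    return total


def closed_form(turns: int, tokens_per_turn: int) -> int:
    """The same sum in closed form: c * n * (n + 1) / 2."""
    return tokens_per_turn * turns * (turns + 1) // 2
```

Doubling the number of turns roughly quadruples the total, which is the signature of quadratic growth.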
cumulative input tokens · gpt-5.4 · 100 turns · no memory
Cap the window to bound cost and the oldest turns truncate. Recall decays in a staircase — and it's identical from a 3B model to a 671B one. Truncation is mechanical, not a function of model size.
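Why truncation is mechanical can be seen in a toy model that contains no language model at all. The sketch below assumes the cap is expressed as a number of memos (`window_memos`, hypothetical; a real cap is in tokens) and that every memo planted so far is probed once per turn.

```python
from collections import deque


def capped_recall(turns: int, window_memos: int) -> list[float]:
    """Recall after each turn when only the newest `window_memos` memos fit."""
    window = deque(maxlen=window_memos)  # oldest entries fall off automatically
    planted = []
    recall = []
    for turn in range(1, turns + 1):
        code = f"code{turn:04d}"
        planted.append(code)
        window.append(code)
        visible = set(window)
        # A probe can only succeed if its memo is still inside the window.
        hits = sum(1 for c in planted if c in visible)
        recall.append(hits / len(planted))
    return recall
```

The function takes no model argument, so any model placed behind the same cap sees the same truncated transcript and the same upper bound on recall.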
recall, by turn 10 · capped window · every model
Same workload, run two ways, both holding 100% recall: re-send everything (raw) vs retrieve only what's needed (memory). Cumulative input tokens — the part memory actually controls. gpt-5.4, native window.
The cumulative curve is just the running sum of what each turn reads. Raw's per-call input climbs toward the window; memory's stays flat at ~209 tokens per call for the entire session.
Each turn re-reads the whole transcript, so the per-call read grows linearly — and the cumulative sum grows quadratically.
Retrieve top-1 and the per-call read is flat — prompt + one memo + probe — from turn 1 to turn 100.
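A sketch of how the two prompts differ in size. The word-overlap scorer here is a stand-in chosen for illustration; the source does not say how the harness ranks memos, and `retrieve_top1`, `memory_prompt`, and `raw_prompt` are hypothetical names.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_top1(memos: list[str], probe: str) -> str:
    """Return the memo sharing the most words with the probe (toy scorer)."""
    probe_words = words(probe)
    return max(memos, key=lambda m: len(words(m) & probe_words))


def memory_prompt(system: str, memos: list[str], probe: str) -> str:
    """Prompt + one memo + probe: the size does not depend on len(memos)."""
    return "\n".join([system, retrieve_top1(memos, probe), probe])


def raw_prompt(system: str, memos: list[str], probe: str) -> str:
    """Prompt + every memo so far + probe: the size grows with each turn."""
    return "\n".join([system, *memos, probe])
```

With memos of the form "Memo 37: the code is c0037." and a probe naming memo 37, the memory prompt is the same string whether the store holds 50 memos or 100, while the raw prompt keeps growing.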
Six models — open and closed, 3B-class to frontier — all trace the identical reduction curve. The number is set by the shape of the workload, not the model under it.
The reduction climbs 5.6× (t24) → 11.0× (t50) → 16.2× (t75) → ~21–22.5× (t100). A parabola divided by a straight line grows linearly in n, so the ratio is unbounded, and it doesn't care which model you put under it.
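The ratio of the two curves can be written down directly. This sketch assumes an idealized workload in which raw adds `raw_per_turn` tokens to the transcript each turn and memory reads a constant `memory_per_turn`; both parameters are hypothetical, not measured values.

```python
def reduction(turns: int, raw_per_turn: float, memory_per_turn: float) -> float:
    """Raw cumulative input divided by memory cumulative input after `turns`."""
    raw_total = raw_per_turn * turns * (turns + 1) / 2  # parabola
    memory_total = memory_per_turn * turns              # straight line
    return raw_total / memory_total
```

Algebraically this simplifies to raw_per_turn * (n + 1) / (2 * memory_per_turn), a straight line in n. The quoted checkpoints are consistent with that shape: they rise by about 0.2× per turn between t24, t50, and t75.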
Cost does drop — but the multiple is distorted by two things that have nothing to do with the workload: provider prompt caching and answer verbosity. Tokens are the universal, model-independent number; cost is downstream of billing quirks.
OpenAI auto-caches the repeated raw prefix, so the raw side is already cheap and cost only drops ~3–4×. Anthropic and DeepSeek don't cache here, so the raw side is billed in full and cost drops ~17–19.5×. The token count doesn't change; only the bill does.
kimi-k2.6 is chatty: its answer-side output tokens dominate the bill, and memory doesn't touch output, so its cost ratio compresses to 2.5× even though its input-token reduction is 21.0×. Same workload geometry, different output behavior.
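Both distortions fall out of a simple billing model. The sketch below is illustrative only: the prices are arbitrary units rather than any provider's rates, and `session_cost`, `cost_ratio`, and `raw_cached_fraction` are hypothetical names.

```python
def session_cost(input_tokens: float, cached_fraction: float, output_tokens: float,
                 input_price: float, cached_price: float, output_price: float) -> float:
    """Bill for one session: uncached input + discounted cached input + output."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    return uncached * input_price + cached * cached_price + output_tokens * output_price


def cost_ratio(raw_input: float, memory_input: float, output_tokens: float,
               raw_cached_fraction: float, input_price: float = 1.0,
               cached_price: float = 0.1, output_price: float = 4.0) -> float:
    """Raw cost divided by memory cost; output is the same in both configs."""
    raw = session_cost(raw_input, raw_cached_fraction, output_tokens,
                       input_price, cached_price, output_price)
    # Assumption: the short memory prompt gets no cache discount.
    mem = session_cost(memory_input, 0.0, output_tokens,
                       input_price, cached_price, output_price)
    return raw / mem
```

Holding the input-token ratio fixed at 21×, raising `raw_cached_fraction` or `output_tokens` pulls the cost ratio below 21× without changing a single input token.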
Cumulative input tokens for gpt-5.4, both configs, with the running reduction. Memory adds a constant ~647 input tokens per turn (linear); raw's per-turn increment grows every turn (quadratic).
Read the two middle columns as functions of n: the memory column is a straight line (+~647/turn); the raw column is a parabola (its per-turn increment itself grows every turn). The reduction is their ratio — which is why it keeps climbing with no ceiling.
Fork it, run the rot stress test, run the fix, then point the same harness at any model via OpenRouter. Locally it runs on Ollama with no API keys.
The arc is real and it reproduces; the scope is deliberately narrow. Here's exactly where the edges are.
One unique code per memo, retrieved top-1, no distractors. This isolates "does the system still have the fact?" — it is not a claim about hard multi-hop or conflicting-fact retrieval.
Models run at full context capacity. Capping is shown only to reproduce the cap-and-rot failure mode — the headline 21× numbers are all at native window.
memory's win depends on a small retrieval budget. A fat budget — retrieving many memos per turn — drags the memory curve back toward quadratic. top-1 is doing real work here.
memory cuts the input tokens (what the model reads). Output — the model's answer — is unchanged across configs. The headline number is specifically about input.
Any model excluded from the results was dropped for provider rate-limiting during the run, not for disagreeing with the curve. The remaining 6 models already agree on the identical curve, so the conclusion stands.
Reported cost is real billed usage.cost, but it's distorted by caching and verbosity. That's why the report leads with input tokens, the universal number.
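The retrieval-budget caveat above can be made concrete with a small sketch. It assumes a fixed per-call overhead (`overhead_tokens`) and a fixed memo size (`memo_tokens`); both are hypothetical parameters, and `top_k` is the number of memos retrieved per call.

```python
def memory_cumulative(turns: int, top_k: int, memo_tokens: int,
                      overhead_tokens: int) -> int:
    """Cumulative input tokens when each call retrieves up to `top_k` memos."""
    total = 0
    for turn in range(1, turns + 1):
        retrieved = min(top_k, turn)  # cannot retrieve more memos than exist
        total += overhead_tokens + retrieved * memo_tokens
    return total
```

With `top_k=1` the total is a straight line in the number of turns. Once `top_k` reaches the number of turns, every call reads every memo and the total matches full-context replay.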
This benchmark proves the problem; Attestor is the open-source memory layer that delivers the fix: flat ~200 tokens per call, ~21× fewer input tokens, 100% recall, two API calls.
github.com/bolnet/context-clock