Benchmark · Full methodology & results

How much does it cost to remember? We measured it.

A deterministic benchmark of the context-rot arc across 6 models. Under full-context replay, cumulative input tokens grow O(n²); with retrieval, they grow O(n). By turn 100 that's ~21× fewer input tokens — at 100% recall. This is the full picture: the workload, the two failure modes, the per-turn data, and exactly how to reproduce every number.

# reproduce the whole arc locally — no API keys
$ git clone https://github.com/bolnet/context-clock
$ python -m context_clock.run --until-rotted --turns 100
$ python -m context_clock.run --memory --turns 100
6 models · open + closed
100 turns per session
~21× fewer input tokens at turn 100
100% recall held, both configs
The setup · the workload

An inject-and-probe needle in a haystack.

Every turn plants one unique code inside a short memo — a needle like k9f3a2. Then a probe asks for a specific memo's code, and grading is done at the answer level: an exact string match, nothing fuzzy. Same workload, run two ways.
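
The inject-and-probe loop can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the memo and probe wording, the helper names (`make_codes`, `make_memo`, `make_probe`, `grade`), and the whitespace stripping before comparison are all assumptions.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits

def make_codes(n: int, seed: int = 0, length: int = 6) -> list[str]:
    """Generate n unique needle codes from a fixed seed (repeatable)."""
    rng = random.Random(seed)
    codes: list[str] = []
    seen: set[str] = set()
    while len(codes) < n:
        code = "".join(rng.choice(ALPHABET) for _ in range(length))
        if code not in seen:  # every memo must carry a unique code
            seen.add(code)
            codes.append(code)
    return codes

def make_memo(turn: int, code: str) -> str:
    # Hypothetical memo wording; the real memos may read differently.
    return f"Memo {turn}: the access code is {code}."

def make_probe(turn: int) -> str:
    return f"What is the access code in memo {turn}? Reply with the code only."

def grade(answer: str, expected: str) -> bool:
    # Answer-level exact string match: no fuzzy matching, no partial credit.
    return answer.strip() == expected

codes = make_codes(3)
memos = [make_memo(t, c) for t, c in enumerate(codes, start=1)]
print(memos[0])
print(make_probe(1), "->", grade(codes[0], codes[0]))
```

Because grading is an exact match on a code that appears nowhere else, a correct answer can only come from a prompt that still contains the right memo.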

The two configs

raw · re-send all: O(n²)
memory · top-1: O(n)

raw re-sends the entire transcript each turn. memory retrieves only the single most relevant memo (top-1). Both must still answer the probe correctly.
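
A toy version of the two configs shows where the gap comes from. This is a sketch under loud assumptions: tokens are approximated by whitespace-separated words, retrieval is replaced by a direct lookup of the probed memo, and the memo and probe wording is made up, so the totals will not match the measured figures.

```python
import random

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())

def raw_prompt(memos: list[str], probe: str) -> str:
    # raw: re-send every memo planted so far, then the probe.
    return "\n".join(memos + [probe])

def memory_prompt(memos: list[str], probe: str, target: int) -> str:
    # memory: stand-in for top-1 retrieval, which here simply looks up
    # the one memo the probe asks about.
    return "\n".join([memos[target - 1], probe])

rng = random.Random(0)  # fixed seed keeps the run repeatable
memos: list[str] = []
raw_total = 0
mem_total = 0
for turn in range(1, 101):
    memos.append(f"Memo {turn}: the access code is c{turn:05d}.")
    target = rng.randint(1, turn)  # probe any memo planted so far
    probe = f"What is the access code in memo {target}?"
    raw_total += count_tokens(raw_prompt(memos, probe))
    mem_total += count_tokens(memory_prompt(memos, probe, target))

print(raw_total, mem_total, round(raw_total / mem_total, 1))
```

The raw prompt gains one memo every turn while the memory prompt always holds exactly one, so the first total is a sum of a growing series and the second is a constant times the turn count.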

Run parameters · held fixed

  • Native context window — models run at full capacity.
  • temperature 0 — deterministic, repeatable.
  • cadence 1 — one inject + one probe every turn.
  • 100 turns — long enough for the curves to separate.
  • Grading: answer-level exact string on the planted code.
Two ways to run out of road

Grow it and you pay quadratically. Cap it and it forgets.

There's no free lunch with raw context. context-clock reproduces each failure mode deterministically before measuring the way out.

Grow it → O(n²)

Re-read the whole transcript every turn and input tokens pile up as n²/2. By turn 100 a single gpt-5.4 raw session has read 1,326,734 cumulative input tokens — for one conversation.

1,326,734

cumulative input tokens · gpt-5.4 · 100 turns · no memory
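
The n²/2 shape falls out of simple arithmetic. As a sketch, assume each call reads a fixed `overhead` plus `per_turn` tokens for every turn so far (both names and the values below are illustrative, not measured): the cumulative read is then overhead·n + per_turn·n(n+1)/2.

```python
def cumulative_raw(n: int, overhead: int, per_turn: int) -> int:
    """Sum the per-call reads turn by turn."""
    return sum(overhead + per_turn * t for t in range(1, n + 1))

def cumulative_raw_closed_form(n: int, overhead: int, per_turn: int) -> int:
    """Same total via the arithmetic-series formula."""
    return overhead * n + per_turn * n * (n + 1) // 2

# Illustrative values only; not taken from the benchmark.
at_100 = cumulative_raw(100, overhead=50, per_turn=80)
at_200 = cumulative_raw(200, overhead=50, per_turn=80)
assert at_100 == cumulative_raw_closed_form(100, 50, 80)
print(at_100, at_200, round(at_200 / at_100, 2))
```

Doubling the session length roughly quadruples the cumulative read, which is the signature of quadratic growth.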

Cap it → rot

Cap the window to bound cost and the oldest turns truncate. Recall decays in a staircase — and it's identical from a 3B model to a 671B one. Truncation is mechanical, not a function of model size.

100% → 0%

recall, by turn 10 · capped window · every model

The cap-and-rot staircase · identical 3B → 671B

turn      recall
1–7       100%
8         67%
9         33%
10+       0%
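
A toy model makes the mechanics of the staircase concrete. The parameters are assumptions chosen so the sketch matches the reported shape (a window that holds the last 7 turns, and probes that target the memos from turns 1 to 3); the harness's real cap and probe schedule may differ.

```python
def recall_at(turn: int, window_turns: int = 7, targets=(1, 2, 3)) -> float:
    """Fraction of already-planted target memos still inside the capped window."""
    oldest_kept = max(1, turn - window_turns + 1)  # older turns are truncated
    planted = [t for t in targets if t <= turn]
    kept = [t for t in planted if t >= oldest_kept]
    return len(kept) / len(planted)

for turn in (1, 7, 8, 9, 10, 50):
    print(f"turn {turn}: {recall_at(turn):.0%}")
# turn 1: 100%
# turn 7: 100%
# turn 8: 67%
# turn 9: 33%
# turn 10: 0%
# turn 50: 0%
```

Nothing in `recall_at` refers to the model. Once a memo falls outside the window it is simply absent from the prompt, which is why the staircase looks the same at every model size.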
The proof · linear vs quadratic

A parabola becomes a straight line.

Same workload, run two ways, both holding 100% recall: re-send everything (raw) vs retrieve only what's needed (memory). Cumulative input tokens — the part memory actually controls. gpt-5.4, native window.

[Figure: cumulative input tokens by turn, t1 to t100, y-axis 0 to 1.4M. raw (re-send all) is quadratic and reaches 1.33M; memory (retrieve top-1) is linear and reaches 61.8K; 21.5× fewer input tokens at t100. gpt-5.4 · native window · 100% recall both sides.]
Per-call context · what the model reads each turn

It's the per-call read that diverges.

The cumulative curve is just the running sum of what each turn reads. Raw's per-call input climbs toward the window; memory's stays flat near ~209 tokens the entire session.

raw · input tokens read on a single call

turn      input tokens
1         132
24        2,122
50        4,380
100       8,709

Each turn re-reads the whole transcript, so the per-call read grows linearly — and the cumulative sum grows quadratically.

memory · input tokens read on a single call

turn      input tokens
1         ~209
24        ~209
50        ~209
100       ~209

Retrieve top-1 and the per-call read is flat — prompt + one memo + probe — from turn 1 to turn 100.
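
The linear growth of the raw per-call read can be checked directly from the four sampled turns reported above. The sketch below only computes the slope between consecutive samples; a roughly constant slope is what a straight line looks like.

```python
# Raw per-call input tokens at the sampled turns (figures from the report).
raw_per_call = {1: 132, 24: 2122, 50: 4380, 100: 8709}

turns = sorted(raw_per_call)
slopes = [
    (raw_per_call[b] - raw_per_call[a]) / (b - a)  # tokens added per turn
    for a, b in zip(turns, turns[1:])
]
for (a, b), slope in zip(zip(turns, turns[1:]), slopes):
    print(f"t{a} to t{b}: {slope:.1f} tokens per turn")
```

All three segments come out between 86 and 87 tokens per turn, so each call reads a near-constant amount more than the previous one, while the memory side adds nothing per call.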

6-model study · input × at turn 100 · all 100% recall

Input × is workload geometry — model-independent.

Six models, open and closed, 3B-class to frontier, all trace nearly the same reduction curve, landing between 21.0× and 22.5× at turn 100. The number is set by the shape of the workload, not the model under it.

model                   input ×    recall
gpt-5.4                 21.5×      100%
gpt-5.4-mini            21.5×      100%
claude-sonnet-4         22.1×      100%
claude-opus-4           22.2×      100%
kimi-k2.6 (open)        21.0×      100%
deepseek-v3.2 (open)    22.5×      100%

[Figure: Input-token reduction at turn 100 · per model. Bar chart of the input × values, from 21.0× (kimi-k2.6) to 22.5× (deepseek-v3.2).]

The reduction curve is nearly identical across all 6 models

5.6× (t24) → 11.0× (t50) → 16.2× (t75) → ~21–22.5× (t100). A straight line against a parabola: the ratio grows roughly linearly with the turn count and has no ceiling, and it doesn't care which model you put under it.

Cost · a footnote, not the headline

Why we lead with input tokens, not dollars.

Cost does drop — but the multiple is distorted by two things that have nothing to do with the workload: provider prompt caching and answer verbosity. Tokens are the universal, model-independent number; cost is downstream of billing quirks.

model                   cost ×    billed @ t100 · raw → mem
gpt-5.4                 3.0×      $0.5701 → $0.1902
gpt-5.4-mini            3.8×      $0.2191 → $0.0571
claude-sonnet-4         19.5×     $4.8304 → $0.2482
claude-opus-4           19.5×     $24.1521 → $1.2392
kimi-k2.6 (open)        2.5×      $0.4263 → $0.1677
deepseek-v3.2 (open)    17.4×     $0.1617 → $0.0093

Prompt caching flattens the closed-source spread

OpenAI auto-caches the repeated raw prefix, so the raw side is already cheap → cost only drops ~3–4×. Anthropic and DeepSeek don't cache here, so the raw side is billed in full → ~17–19.5× cost reduction. The token count doesn't change; only the bill does.

Verbosity distorts the open-source spread

kimi-k2.6 is chatty — its answer-side output tokens dominate, which memory doesn't touch — so its cost ratio compresses to 2.5× even though its input × is 21.0×. Same workload geometry, different output behavior.
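
Both distortions can be reproduced with a simple cost model. This is a sketch, not any provider's billing formula: the prices, the cached fraction, the cache discount, and the output-token counts are hypothetical, and only the two input-token totals come from the gpt-5.4 run reported here. It also assumes caching touches only the raw side's repeated prefix.

```python
def session_cost(input_tokens, output_tokens, price_in, price_out,
                 cached_fraction=0.0, cache_discount=0.0):
    """Dollar cost of one session; prices are per million tokens."""
    # Cached input tokens are billed at (1 - cache_discount) of the input price.
    billed_input = input_tokens * (1 - cached_fraction * cache_discount)
    return (billed_input * price_in + output_tokens * price_out) / 1_000_000

RAW_IN, MEM_IN = 1_326_734, 61_818  # cumulative input tokens at turn 100 (from the report)
PRICE_IN, PRICE_OUT = 1.0, 4.0      # hypothetical $ per 1M tokens

def cost_ratio(output_tokens, cached_fraction=0.0, cache_discount=0.0):
    # Assumption: only the raw side benefits from prefix caching.
    raw = session_cost(RAW_IN, output_tokens, PRICE_IN, PRICE_OUT,
                       cached_fraction, cache_discount)
    mem = session_cost(MEM_IN, output_tokens, PRICE_IN, PRICE_OUT)
    return raw / mem

token_ratio = RAW_IN / MEM_IN
terse_uncached = cost_ratio(output_tokens=2_000)
terse_cached = cost_ratio(output_tokens=2_000, cached_fraction=0.9, cache_discount=0.9)
chatty_uncached = cost_ratio(output_tokens=50_000)

print(f"tokens {token_ratio:.1f}x | uncached {terse_uncached:.1f}x | "
      f"cached {terse_cached:.1f}x | chatty {chatty_uncached:.1f}x")
```

With the token ratio held fixed, a cached raw prefix or a large output term each pulls the cost ratio well below it, which is the pattern in the billed table.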

claude-opus-4: $24.15 raw → $1.24 memory, for a single 100-turn session. Input tokens are the cause; the bill is just one of their shadows.
Per-turn data · gpt-5.4 · cumulative input tokens

The numbers, turn by turn.

Cumulative input tokens for gpt-5.4, both configs, with the running reduction. Memory adds a constant ~647 input tokens per turn (linear); raw's per-turn add grows every turn (quadratic).

turn    raw · cum. input    memory · cum. input    reduction
t1      132                 125                    1.1×
t24     80,897              14,405                 5.6×
t50     338,147             30,706                 11.0×
t75     751,406             46,268                 16.2×
t100    1,326,734           61,818                 21.5×

Read the two middle columns as functions of n: the memory column is a straight line (+~647/turn); the raw column is a parabola (its per-turn increment itself grows every turn). The reduction is their ratio — which is why it keeps climbing with no ceiling.
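
The reduction column is plain division of the two cumulative columns, which a short check confirms. The figures below are copied from the per-turn table; only the variable names are made up.

```python
# turn -> (raw cumulative input, memory cumulative input), from the per-turn table
rows = {
    1: (132, 125),
    24: (80_897, 14_405),
    50: (338_147, 30_706),
    75: (751_406, 46_268),
    100: (1_326_734, 61_818),
}

reductions = {turn: raw / mem for turn, (raw, mem) in rows.items()}
for turn, ratio in reductions.items():
    print(f"t{turn}: {ratio:.1f}x")
# t1: 1.1x
# t24: 5.6x
# t50: 11.0x
# t75: 16.2x
# t100: 21.5x
```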

Reproduce it · 100% local

Every number here is re-runnable.

Fork it, run the rot stress test, run the fix, then point the same harness at any model via OpenRouter. Locally it runs on Ollama with no API keys.

# clone
$ git clone https://github.com/bolnet/context-clock

# 1 · grow until recall dies — the O(n²) failure mode
$ python -m context_clock.run --until-rotted --turns 100

# 2 · the fix — retrieved memory, flat context, 100% recall
$ python -m context_clock.run --memory --turns 100

# 3 · any model, same harness — real billed cost from usage.cost
$ python -m context_clock.run --provider openrouter --model openai/gpt-5.4
Tests: reproduce every number here
Ollama: runs locally · no API keys needed
usage.cost: real billed cost from OpenRouter
MIT: open source · no lock-in · fork freely
Honest scope · the fine print

What this benchmark does and doesn't claim.

The arc is real and it reproduces; the scope is deliberately narrow. Here's exactly where the edges are.

Single-fact NIAH, top-1

One unique code per memo, retrieved top-1, no distractors. This isolates "does the system still have the fact?" — it is not a claim about hard multi-hop or conflicting-fact retrieval.

Native window throughout

Models run at full context capacity. Capping is shown only to reproduce the cap-and-rot failure mode — the headline 21× numbers are all at native window.

The recall budget must be tight

memory's win depends on a small retrieval budget. A fixed top-k budget stays linear but steepens the line in proportion to k, and a budget that grows with the transcript drags the memory curve back toward quadratic. top-1 is doing real work here.
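
The effect of the retrieval budget can be sketched with the same arithmetic used for the raw curve. The token sizes and the three budget policies below are illustrative assumptions, not settings from the harness.

```python
def memory_cumulative(n, memo_tokens, overhead, budget):
    """Cumulative input tokens when turn t retrieves budget(t) memos (at most t exist)."""
    return sum(
        overhead + memo_tokens * min(budget(t), t)
        for t in range(1, n + 1)
    )

def growth(budget, memo_tokens=80, overhead=50):
    # How much the cumulative read grows when the session doubles from 100 to 200 turns.
    return (memory_cumulative(200, memo_tokens, overhead, budget)
            / memory_cumulative(100, memo_tokens, overhead, budget))

top1 = lambda t: 1                 # the benchmark's setting
top10 = lambda t: 10               # fixed but fatter budget
half = lambda t: max(1, t // 2)    # budget that grows with the transcript

for name, budget in (("top-1", top1), ("top-10", top10), ("half", half)):
    total = memory_cumulative(100, 80, 50, budget)
    print(f"{name}: {total} tokens at t100, x{growth(budget):.2f} when turns double")
```

A fixed budget roughly doubles when the session doubles, so it stays linear with a steeper slope. A budget tied to transcript length roughly quadruples, which is the quadratic behavior the memory config is meant to avoid.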

Input vs total tokens

memory cuts the input tokens (what the model reads). Output — the model's answer — is unchanged across configs. The headline number is specifically about input.

Qwen3-Max excluded

Dropped for provider rate-limiting during the run, not for disagreeing. The remaining 6 models already agree on the identical curve, so the conclusion stands.

Cost is downstream

Reported cost is real billed usage.cost, but it's distorted by caching and verbosity. That's why the report leads with input tokens, the universal number.

From measurement to fix

context-clock measures it; Attestor fixes it.

This benchmark proves the problem; Attestor is the open-source memory layer that delivers the fix: flat ~200 tokens per call, ~21× fewer input tokens, 100% recall, two API calls.

Get the fix → attestor.dev
