Benchmark · Full methodology & results

How much does it cost to remember? We measured it.

A deterministic benchmark of the context-rot arc across 6 models. Under full-context replay, cumulative input tokens grow O(n²); with retrieval, they grow O(n). By turn 100 that's ~21× fewer input tokens — at 100% recall. This is the full picture: the workload, the two failure modes, the per-turn data, and exactly how to reproduce every number.


# reproduce the whole arc locally — no API keys
$ git clone https://github.com/bolnet/context-clock
$ python -m context_clock.run --until-rotted --turns 100
$ python -m context_clock.run --memory --turns 100

6 models · open + closed

100 turns per session

~21× fewer input tokens at turn 100

100% recall held, both configs

The setup · the workload

An inject-and-probe needle in a haystack.

Every turn plants one unique code inside a short memo — a needle like k9f3a2. Then a probe asks for a specific memo's code, and grading is done at the answer level: an exact string match, nothing fuzzy. Same workload, run two ways.
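A minimal sketch of the inject-and-probe workload described above. The memo wording, the `make_needle`, `make_memo`, and `grade` helpers, the six-character alphabet, and the whitespace stripping before comparison are all assumptions for illustration; the harness's own templates and grader may differ.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits


def make_needle(rng: random.Random, length: int = 6) -> str:
    """Return a random code such as 'k9f3a2' (hypothetical format)."""
    return "".join(rng.choices(ALPHABET, k=length))


def make_memo(turn: int, needle: str) -> str:
    """Wrap the needle in a short memo (hypothetical wording)."""
    return f"Memo {turn}: the access code is {needle}."


def grade(answer: str, needle: str) -> bool:
    """Answer-level exact match: no fuzzy or partial credit."""
    return answer.strip() == needle


rng = random.Random(0)  # fixed seed keeps the workload repeatable
needles = {turn: make_needle(rng) for turn in range(1, 4)}
memos = [make_memo(turn, needle) for turn, needle in needles.items()]
```

Grading `needles[2]` against itself passes; any extra character or case change fails, which is what "nothing fuzzy" implies.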

The two configs

raw · re-send all

O(n²)

memory · top-1

O(n)

raw re-sends the entire transcript each turn. memory retrieves only the single most relevant memo (top-1). Both must still answer the probe correctly.
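The difference between the two configs comes down to how the prompt is assembled. The sketch below uses a toy word-overlap retriever as a stand-in for top-1 retrieval; `retrieve_top1`, `raw_prompt`, `memory_prompt`, and the memo wording are hypothetical names and formats, not the harness's API.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_top1(memos: list[str], probe: str) -> str:
    """Toy retriever: the memo sharing the most word tokens with the probe."""
    probe_tokens = tokens(probe)
    return max(memos, key=lambda memo: len(tokens(memo) & probe_tokens))


def raw_prompt(memos: list[str], probe: str) -> str:
    """raw: re-send every memo seen so far, then the probe."""
    return "\n".join([*memos, probe])


def memory_prompt(memos: list[str], probe: str) -> str:
    """memory: send only the single best-matching memo, then the probe."""
    return "\n".join([retrieve_top1(memos, probe), probe])


memos = [f"memo {i}: the code is c{i:03d}x" for i in range(1, 51)]
probe = "What is the code in memo 37?"
print(len(raw_prompt(memos, probe).splitlines()))     # 51 lines
print(len(memory_prompt(memos, probe).splitlines()))  # 2 lines
```

The raw prompt grows by one memo every turn, while the memory prompt stays at one memo plus the probe regardless of how many memos exist.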

Run parameters · held fixed

Native context window — models run at full capacity.
temperature 0 — deterministic, repeatable.
cadence 1 — one inject + one probe every turn.
100 turns — long enough for the curves to separate.
Grading: answer-level exact string on the planted code.

Two ways to run out of road

Grow it and you pay quadratically. Cap it and it forgets.

There's no free lunch with raw context. context-clock reproduces each failure mode deterministically before measuring the way out.

Grow it → O(n²)

Re-read the whole transcript every turn and cumulative input tokens grow in proportion to n²/2. By turn 100 a single gpt-5.4 raw session has read 1,326,734 cumulative input tokens, for one conversation.

1,326,734

cumulative input tokens · gpt-5.4 · 100 turns · no memory
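Where the n²/2 comes from: if each call reads a transcript that grows by a fixed amount per turn, the cumulative total is an arithmetic series. The sketch below checks the closed form against a brute-force sum; the growth of 87 tokens per turn is an assumed round figure, not a measured constant.

```python
def cumulative_raw(turns: int, growth_per_turn: int) -> int:
    """Sum the per-call reads when call t reads growth_per_turn * t tokens."""
    return sum(growth_per_turn * t for t in range(1, turns + 1))


def closed_form(turns: int, growth_per_turn: int) -> int:
    """Arithmetic series: g * n * (n + 1) / 2, which is about g * n^2 / 2."""
    return growth_per_turn * turns * (turns + 1) // 2


GROWTH = 87  # hypothetical tokens added to the transcript each turn

at_100 = cumulative_raw(100, GROWTH)
at_200 = cumulative_raw(200, GROWTH)
print(at_100 == closed_form(100, GROWTH))
print(at_200 / at_100)  # close to 4: doubling the turns roughly quadruples the total
```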

Cap it → rot

Cap the window to bound cost and the oldest turns truncate. Recall decays in a staircase — and it's identical from a 3B model to a 671B one. Truncation is mechanical, not a function of model size.

100% → 0%

recall, by turn 10 · capped window · every model

The cap-and-rot staircase · identical 3B → 671B

turns 1–7: 100%
turn 8: 67%
turn 9: 33%
turn 10+: 0%
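One way a staircase like this can arise purely from truncation is sketched below: a sliding window that keeps only the most recent memos, probed for a fixed set of early ones. The cap of 7 memos and the three probed memos are assumptions chosen for illustration, not the harness's documented settings; no model is involved at all.

```python
from collections import deque


def recall_by_turn(turns: int, cap: int, probed: tuple[int, ...]) -> list[float]:
    """Fraction of already-planted probed memos still inside the capped window."""
    window: deque[int] = deque(maxlen=cap)  # oldest memo is evicted when full
    recall = []
    for turn in range(1, turns + 1):
        window.append(turn)
        planted = [memo for memo in probed if memo <= turn]
        kept = [memo for memo in planted if memo in window]
        recall.append(len(kept) / len(planted))
    return recall


staircase = recall_by_turn(turns=10, cap=7, probed=(1, 2, 3))
print([round(r, 2) for r in staircase])
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.67, 0.33, 0.0]
```

Because the result depends only on which memos have been evicted, swapping the model underneath cannot change it, which is consistent with the identical staircase reported across model sizes.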

The proof · linear vs quadratic

A parabola becomes a straight line.

Same workload, run two ways, both holding 100% recall: re-send everything (raw) vs retrieve only what's needed (memory). Cumulative input tokens — the part memory actually controls. gpt-5.4, native window.

Chart · cumulative input tokens by turn · gpt-5.4 · native window · 100% recall both sides. At turn 100: raw (re-send all, quadratic) reaches 1.33M input tokens; memory (retrieve top-1, linear) reaches 61.8K, which is 21.5× fewer.

Per-call context · what the model reads each turn

It's the per-call read that diverges.

The cumulative curve is just the running sum of what each turn reads. Raw's per-call input climbs toward the window; memory's stays flat at ~209 tokens for the entire session.

raw · input tokens read on a single call

turn 1: 132
turn 24: 2,122
turn 50: 4,380
turn 100: 8,709

Each turn re-reads the whole transcript, so the per-call read grows linearly — and the cumulative sum grows quadratically.

memory · input tokens read on a single call

turn 1: ~209
turn 24: ~209
turn 50: ~209
turn 100: ~209

Retrieve top-1 and the per-call read is flat — prompt + one memo + probe — from turn 1 to turn 100.
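The linear growth of raw's per-call read can be checked directly from the four reported points. This sketch fits a straight line through the first and last point and compares its predictions with the two middle points; the variable names are illustrative.

```python
# Raw per-call input tokens reported above, keyed by turn.
raw_per_call = {1: 132, 24: 2122, 50: 4380, 100: 8709}

# Slope of the line through turn 1 and turn 100, in tokens per turn.
slope = (raw_per_call[100] - raw_per_call[1]) / (100 - 1)


def predicted(turn: int) -> float:
    """Per-call read predicted by the straight line anchored at turn 1."""
    return raw_per_call[1] + slope * (turn - 1)


errors = {turn: abs(predicted(turn) - actual) for turn, actual in raw_per_call.items()}
print(round(slope, 1))  # 86.6 tokens added to each call per turn
print(errors)
```

The middle points land within a few tokens of the line, so each raw call reads roughly 87 more tokens than the one before it, while the memory call stays at ~209.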

6-model study · input × at turn 100 · all 100% recall

Input × is workload geometry — model-independent.

Six models, open and closed, from 3B-class to frontier, all trace nearly the same reduction curve (21.0× to 22.5× at turn 100). The number is set by the shape of the workload, not the model under it.

model                   input ×    recall
gpt-5.4                 21.5×      100%
gpt-5.4-mini            21.5×      100%
claude-sonnet-4         22.1×      100%
claude-opus-4           22.2×      100%
kimi-k2.6 (open)        21.0×      100%
deepseek-v3.2 (open)    22.5×      100%

Chart · input-token reduction at turn 100, per model: gpt-5.4 21.5×, gpt-5.4-mini 21.5×, claude-sonnet-4 22.1×, claude-opus-4 22.2×, kimi-k2.6 (open) 21.0×, deepseek-v3.2 (open) 22.5×.

The reduction curve is nearly identical across all 6 models

5.6× (t24) → 11.0× (t50) → 16.2× (t75) → ~21–22.5× (t100). A straight line against a parabola → the ratio is unbounded, and it doesn't care which model you put under it.

Cost · a footnote, not the headline

Why we lead with input tokens, not dollars.

Cost does drop — but the multiple is distorted by two things that have nothing to do with the workload: provider prompt caching and answer verbosity. Tokens are the universal, model-independent number; cost is downstream of billing quirks.

model                   cost ×    billed @ t100 · raw → mem
gpt-5.4                 3.0×      $0.5701 → $0.1902
gpt-5.4-mini            3.8×      $0.2191 → $0.0571
claude-sonnet-4         19.5×     $4.8304 → $0.2482
claude-opus-4           19.5×     $24.1521 → $1.2392
kimi-k2.6 (open)        2.5×      $0.4263 → $0.1677
deepseek-v3.2 (open)    17.4×     $0.1617 → $0.0093

Prompt caching flattens the closed-source spread

OpenAI auto-caches the repeated raw prefix, so the raw side is already cheap → cost only drops ~3–4×. Anthropic and DeepSeek don't cache here, so the raw side is billed in full → ~17–19.5× cost reduction. The token count doesn't change; only the bill does.

Verbosity distorts the open-source spread

kimi-k2.6 is chatty — its answer-side output tokens dominate, which memory doesn't touch — so its cost ratio compresses to 2.5× even though its input × is 21.0×. Same workload geometry, different output behavior.
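A rough cost model makes both distortions concrete. The input-token totals below are the gpt-5.4 turn-100 figures from this report; the per-million prices, the 90% cache-hit fraction, and the output-token counts are hypothetical values picked only to show the direction of the effect, not any provider's real rates.

```python
def session_cost(input_tokens: int, output_tokens: int, cached_fraction: float,
                 price_in: float, price_cached: float, price_out: float) -> float:
    """Dollar cost of one session; all prices are per million tokens."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    total = uncached * price_in + cached * price_cached + output_tokens * price_out
    return total / 1_000_000


RAW_INPUT, MEMORY_INPUT = 1_326_734, 61_818  # gpt-5.4 cumulative input at turn 100
PRICES = dict(price_in=2.0, price_cached=0.2, price_out=8.0)  # hypothetical prices


def cost_ratio(raw_cached_fraction: float, output_tokens: int) -> float:
    """raw cost divided by memory cost, with output identical on both sides."""
    raw = session_cost(RAW_INPUT, output_tokens, raw_cached_fraction, **PRICES)
    mem = session_cost(MEMORY_INPUT, output_tokens, 0.0, **PRICES)
    return raw / mem


token_ratio = RAW_INPUT / MEMORY_INPUT              # the input-token multiple
no_cache = cost_ratio(0.0, output_tokens=2_000)     # raw billed in full
with_cache = cost_ratio(0.9, output_tokens=2_000)   # most of the raw prefix cached
verbose = cost_ratio(0.0, output_tokens=50_000)     # chatty answers on both sides
print(token_ratio, no_cache, with_cache, verbose)
```

Under these assumptions the uncached cost multiple sits just below the token multiple, while either heavy prefix caching or a large output bill pulls it down to single digits, even though the input-token counts never change.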

claude-opus-4: $24.15 raw → $1.24 memory — for a single 100-turn session. Input tokens are the cause; the bill is just one of its shadows.

Per-turn data · gpt-5.4 · cumulative input tokens

The numbers, turn by turn.

Cumulative input tokens for gpt-5.4, both configs, with the running reduction. Memory adds a constant ~647 input tokens per turn (linear); raw's per-turn add grows every turn (quadratic).

turn    raw · cum. input    memory · cum. input    reduction
t1      132                 125                    1.1×
t24     80,897              14,405                 5.6×
t50     338,147             30,706                 11.0×
t75     751,406             46,268                 16.2×
t100    1,326,734           61,818                 21.5×

Read the two middle columns as functions of n: the memory column is a straight line (+~647/turn); the raw column is a parabola (its per-turn increment itself grows every turn). The reduction is their ratio — which is why it keeps climbing with no ceiling.
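The reduction column and the shape of both curves can be recomputed from the five rows above. This sketch derives the ratio at each checkpoint and the average tokens added per turn between checkpoints; it uses only the table's figures.

```python
# (turn, raw cumulative input, memory cumulative input) for gpt-5.4
rows = [
    (1, 132, 125),
    (24, 80_897, 14_405),
    (50, 338_147, 30_706),
    (75, 751_406, 46_268),
    (100, 1_326_734, 61_818),
]

# Reduction = raw / memory at each checkpoint.
reductions = [round(raw / mem, 1) for _, raw, mem in rows]
print(reductions)  # [1.1, 5.6, 11.0, 16.2, 21.5]

# Average tokens added per turn between consecutive checkpoints.
raw_slopes, memory_slopes = [], []
for (t0, raw0, mem0), (t1, raw1, mem1) in zip(rows, rows[1:]):
    raw_slopes.append((raw1 - raw0) / (t1 - t0))
    memory_slopes.append((mem1 - mem0) / (t1 - t0))

print([round(s) for s in raw_slopes])     # grows at every checkpoint
print([round(s) for s in memory_slopes])  # stays nearly constant
```

A slope that keeps rising is the signature of the parabola; a slope that stays put is the straight line.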

Reproduce it · 100% local

Every number here is re-runnable.

Fork it, run the rot stress test, run the fix, then point the same harness at any model via OpenRouter. Locally it runs on Ollama with no API keys.

# clone
$ git clone https://github.com/bolnet/context-clock

# 1 · grow until recall dies — the O(n²) failure mode
$ python -m context_clock.run --until-rotted --turns 100

# 2 · the fix — retrieved memory, flat context, 100% recall
$ python -m context_clock.run --memory --turns 100

# 3 · any model, same harness — real billed cost from usage.cost
$ python -m context_clock.run --provider openrouter --model openai/gpt-5.4

tests — reproduce every number here

Ollama: runs locally · no API keys needed

usage.cost: real billed cost from OpenRouter

MIT: open source · no lock-in · fork freely

Honest scope · the fine print

What this benchmark does and doesn't claim.

The arc is real and it reproduces; the scope is deliberately narrow. Here's exactly where the edges are.

Single-fact NIAH, top-1

One unique code per memo, retrieved top-1, no distractors. This isolates "does the system still have the fact?" — it is not a claim about hard multi-hop or conflicting-fact retrieval.

Native window throughout

Models run at full context capacity. Capping is shown only to reproduce the cap-and-rot failure mode — the headline 21× numbers are all at native window.

The recall budget must be tight

memory's win depends on a small retrieval budget. A fat budget, retrieving many memos per turn, steepens the memory line and shrinks the reduction; a budget that grows with the transcript drags the memory curve back toward quadratic. top-1 is doing real work here.
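How quickly a larger budget erodes the win can be sketched with a simple token model. The 120-token fixed overhead and 87-token memo size are assumed round numbers, and `cumulative_input` is a hypothetical helper, not part of the harness.

```python
from typing import Optional

OVERHEAD = 120  # hypothetical tokens per call for system prompt + probe
MEMO = 87       # hypothetical tokens per memo


def cumulative_input(turns: int, budget: Optional[int]) -> int:
    """Total input tokens over a session; budget=None means re-send everything."""
    total = 0
    for turn in range(1, turns + 1):
        # A top-k retriever cannot return more memos than exist so far.
        memos_read = turn if budget is None else min(budget, turn)
        total += OVERHEAD + MEMO * memos_read
    return total


raw_total = cumulative_input(100, budget=None)
reduction = {k: raw_total / cumulative_input(100, budget=k) for k in (1, 5, 20, 100)}
print({k: round(r, 1) for k, r in reduction.items()})
```

With a budget of 1 the model gives a reduction above 20×; each larger budget gives a smaller one, and a budget of 100 over 100 turns is exactly the raw config, so the reduction collapses to 1×.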

Input vs total tokens

memory cuts the input tokens (what the model reads). Output — the model's answer — is unchanged across configs. The headline number is specifically about input.

Qwen3-Max excluded

Dropped for provider rate-limiting during the run, not for disagreeing. The remaining 6 models already agree on the identical curve, so the conclusion stands.

Cost is downstream

Reported cost is real billed usage.cost, but it's distorted by caching and verbosity. That's why the report leads with input tokens, the universal number.

From measurement to fix

context-clock measures it; Attestor fixes it.

This benchmark proves the problem; Attestor is the open-source memory layer that delivers the fix: flat ~200 tokens per call, ~21× fewer input tokens, 100% recall, two API calls.

Get the fix → attestor.dev
