Open-source memory for agents · tokenmaxing · MIT

Cut your agent's token burn 21×. Two API calls.

Full-context replay re-reads the whole conversation every turn — input tokens that grow O(n²) and a bill that compounds with every session. Attestor retrieves only what's needed: flat ~200 tokens per call, 21× fewer input tokens by turn 100, 100% recall — measured across six models, open and closed.

Get started → See the benchmark →

# when new information arrives
await attestor.add(namespace, content)

# before each model call — returns ~200 flat tokens
facts = await attestor.recall(namespace, query)

# working context is now flat. session caps come off.

21×

fewer input tokens @ 100 turns

~200

tokens / call, flat — turn 1 to 100

100%

recall held the whole session

O(n)

linear, not O(n²) — by architecture

The token math · linear vs quadratic

Same answer. 21× fewer input tokens to get it.

Re-sending the transcript makes input tokens a parabola; retrieving the one fact you need makes them a straight line. Both hold 100% recall — the only thing that changes is how much the model has to read.

t100

turn

1.33M

replay input tokens

61.8K

attestor input tokens

21.5×

fewer

replay — re-send all (→ 1.33M) attestor — retrieve (→ 61.8K) gpt-5.4 · 100% recall both sides

$24.15

Claude Opus 4 · one 100-turn session
full-context replay

→

$1.24

Claude Opus 4 · same session
with Attestor

6-model study · input-token reduction @ turn 100 · 100% recallinput ×cost (raw → mem)

gpt-5.421.5×$0.57 → $0.19

gpt-5.4-mini21.5×$0.22 → $0.06

claude-opus-422.2×$24.15 → $1.24

claude-sonnet-422.1×$4.83 → $0.25

kimi-k2.6 open21.0×$0.43 → $0.17

deepseek-v3.2 open22.5×$0.16 → $0.009

Input × is model-independent — it's workload geometry. Every model: 5.6× (t24) → 11× (t50) → ~21–22.5× (t100). Verify it yourself with context-clock.

Who it's for

One architecture. Three rooms.

The token curve is the same problem whether you're underwriting it, operating it, or building it. Pick your seat.

The diligence question: is cost per session linear or quadratic?

Full-context replay is O(n²). Your most engaged users become your most expensive liabilities — and the curve compounds with every session they run.

Token governance is now a CFO-level line item. Salesforce reported 20 trillion tokens consumed in a quarter; teams burn annual AI budgets in months. A product whose cost grows quadratically with engagement has a margin problem hiding in plain sight.

One Claude Opus 4 session, 100 turns: $24.15 → $1.24 with Attestor.
The fix is architectural, not a model swap — it survives the next model.
Both tools are open-source: the benchmark is the diligence, and it's free.

Review the benchmark →Audit a pipeline: context-clock

Deterministic recall and a bitemporal audit trail — without the exponential token tax.

Cap context to control cost and you trigger silent memory loss: recall decays 100% → 0% by turn 10, identical across every model. Truncation is mechanical — bigger models don't save you.

Attestor replaces transcript replay with targeted retrieval over a graph + vector store with bitemporal semantics: every fact is timestamped across valid-time and transaction-time, so agent state is immutable, auditable, and reproducible. Per-call context stays flat at ~200 tokens whether the agent is on turn 1 or turn 100.

Flat ~200 tok/call → cost fixed per turn, not compounding.
Bitemporal log → replayable, auditable agent state.
100% deterministic recall — no black-box forgetting at scale.

Read the architecture →Run the benchmark

Cut token burn 21× with two API calls. No model change, no prompt rewrite.

By turn 100 your agent reads 8,700 input tokens per call just to answer one question. You built that in. Here's the entire fix:

await attestor.add(namespace, content)
facts = await attestor.recall(namespace, query)

That's it. Per-call context drops from a climbing parabola to a flat ~200 tokens, your session-length caps come off, and recall holds at 100% across all 100 turns. Measure first — context-clock runs locally on Ollama, no API keys.

Drop-in: no pipeline overhaul, no lock-in (MIT).
DeepSeek-v3.2 session cost: $0.16 → $0.009.
99 tests — reproduce every number on this page.

Ship the fix →Fork the benchmark

How it works

Stop replaying. Start retrieving.

Two failure modes, one architecture that beats both — without sacrificing what the model remembers.

Full-context replay — the default

O(n²) input-token growth
Cap the window → silent memory loss (100%→0% by t10)
Non-deterministic at scale, no audit trail
Cost compounds the more your users engage

Attestor — retrieved memory

O(n) linear growth — flat ~200 tok/call, always
100% recall, no truncation, no forgetting
Bitemporal audit log — reproducible state
Cost fixed per turn · two API calls · MIT

Under the hood · 100% reproducible

The benchmark, in detail.

Every claim here is measured on a deterministic, single-fact workload — verify it yourself with the open-source context-clock harness.

0models benchmarked · open + closed

0turns per session

0tests · every number reproducible

0recall held, all turns

Per-call context — what the model reads each turn

replay · turn 100

8,709

attestor · any turn

~209

Replay climbs every turn toward the window; Attestor stays flat at ~209 tokens from turn 1 to turn 100.

The gap widens with session length

t24

5.6×

t50

11.0×

t75

16.2×

t100

21.5×

Input-token reduction (gpt-5.4). A straight line vs a parabola → the ratio is unbounded.

Method · honest scope

One unique code per memo (needle-in-a-haystack), retrieved top-1, no distractors — isolates "does the agent still have the fact?"
Native context window — models run at full capacity; capping is shown only to reproduce the cap-and-rot failure mode.
100% recall = the right memo is reliably found and read — not a claim about hard multi-hop or conflicting-fact retrieval.
temperature 0, deterministic; real billed cost from OpenRouter usage.cost, never list-price math.

Full methodology + per-turn data →

Tokenmaxing isn't a model swap. It's an architecture.

Measure your agent's token waste with context-clock, then fix it with Attestor — flat context, 21× fewer input tokens, 100% recall. Both free, both open source.

Get started → attestor.dev Benchmark first → context-clock