Open-source memory for agents · tokenmaxing · MIT

Cut your agent's token burn 21×. Two API calls.

Full-context replay re-reads the whole conversation every turn, so cumulative input tokens grow O(n²) and the bill compounds with every session. Attestor retrieves only what's needed: a flat ~200 tokens per call, 21× fewer input tokens by turn 100, and 100% recall, measured across six models, open and closed.

# when new information arrives
await attestor.add(namespace, content)

# before each model call — returns ~200 flat tokens
facts = await attestor.recall(namespace, query)

# working context is now flat. session caps come off.
21× fewer input tokens @ 100 turns
~200 tokens / call, flat from turn 1 to 100
100% recall held the whole session
O(n): linear, not O(n²), by architecture
The token math · linear vs quadratic

Same answer. 21× fewer input tokens to get it.

Re-sending the transcript makes input tokens a parabola; retrieving the one fact you need makes them a straight line. Both hold 100% recall — the only thing that changes is how much the model has to read.
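To make the parabola-versus-line claim concrete, here is a minimal toy model. It assumes every turn adds the same number of tokens to the transcript and that retrieval reads a fixed-size context per call; `TOKENS_PER_TURN` and `TOKENS_PER_CALL` are illustrative values, not measurements from the benchmark.

```python
def replay_tokens(turns: int, tokens_per_turn: int) -> int:
    """Cumulative input tokens when every call re-sends the whole transcript."""
    # Call t reads t * tokens_per_turn tokens, so the total is an arithmetic series.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def retrieval_tokens(turns: int, tokens_per_call: int) -> int:
    """Cumulative input tokens when every call reads a fixed-size retrieved context."""
    return turns * tokens_per_call


# Illustrative parameters only (assumptions, not measured values).
TOKENS_PER_TURN = 90
TOKENS_PER_CALL = 200

for turns in (25, 50, 100):
    replay = replay_tokens(turns, TOKENS_PER_TURN)
    flat = retrieval_tokens(turns, TOKENS_PER_CALL)
    print(f"turn {turns:>3}: replay={replay:>7}  retrieval={flat:>6}  ratio={replay / flat:.1f}x")
```

In this model the replay total grows with the square of the turn count while the retrieval total grows in step with it, so the ratio between them keeps rising as the session gets longer.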

[Chart: cumulative input tokens by turn, t1 to t100, y-axis 0 to 1.4M · gpt-5.4 · 100% recall on both sides. At turn 100: replay (re-send all) reaches 1.33M input tokens, Attestor (retrieve) reaches 61.8K, 21.5× fewer.]
$24.15 · Claude Opus 4 · one 100-turn session · full-context replay
$1.24 · Claude Opus 4 · same session · with Attestor
6-model study · input-token reduction @ turn 100 · 100% recall

model                  input ×   cost (raw → mem)
gpt-5.4                21.5×     $0.57 → $0.19
gpt-5.4-mini           21.5×     $0.22 → $0.06
claude-opus-4          22.2×     $24.15 → $1.24
claude-sonnet-4        22.1×     $4.83 → $0.25
kimi-k2.6 (open)       21.0×     $0.43 → $0.17
deepseek-v3.2 (open)   22.5×     $0.16 → $0.009

Input × is model-independent — it's workload geometry. Every model: 5.6× (t24) → 11× (t50) → ~21–22.5× (t100). Verify it yourself with context-clock.

Who it's for

One architecture. Three rooms.

The token curve is the same problem whether you're underwriting it, operating it, or building it. Pick your seat.

The diligence question: is cost per session linear or quadratic?

Full-context replay is O(n²). Your most engaged users become your most expensive liabilities — and the curve compounds with every session they run.

Token governance is now a CFO-level line item. Salesforce reported 20 trillion tokens consumed in a quarter; teams burn annual AI budgets in months. A product whose cost grows quadratically with engagement has a margin problem hiding in plain sight.

  • One Claude Opus 4 session, 100 turns: $24.15 → $1.24 with Attestor.
  • The fix is architectural, not a model swap — it survives the next model.
  • Both tools are open-source: the benchmark is the diligence, and it's free.

Deterministic recall and a bitemporal audit trail, without the quadratic token tax.

Cap context to control cost and you trigger silent memory loss: recall decays 100% → 0% by turn 10, identical across every model. Truncation is mechanical — bigger models don't save you.
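Why truncation is mechanical can be shown without any model at all. The sketch below plants one fact on turn 1 and applies a sliding window over messages; the cap of 9 messages and the planted `NEEDLE` string are hypothetical values chosen for illustration, so the exact turn at which the fact disappears depends on the cap you pick.

```python
def windowed_context(transcript: list[str], max_messages: int) -> list[str]:
    """Keep only the most recent messages, as a capped context window would."""
    return transcript[-max_messages:]


def fact_visible(context: list[str], needle: str) -> bool:
    """True if the planted fact is still somewhere in what the model would read."""
    return any(needle in message for message in context)


NEEDLE = "access code is 7Q4-ZETA"  # hypothetical fact planted on turn 1
transcript = [f"turn 1: the {NEEDLE}"]
visible_by_turn = {1: True}

for turn in range(2, 13):
    transcript.append(f"turn {turn}: unrelated chatter")
    context = windowed_context(transcript, max_messages=9)
    visible_by_turn[turn] = fact_visible(context, NEEDLE)

print(visible_by_turn)
```

Once the fact slides out of the window it is simply not in the input, so no model, large or small, can read it back.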

Attestor replaces transcript replay with targeted retrieval over a graph + vector store with bitemporal semantics: every fact is timestamped across valid-time and transaction-time, so agent state is immutable, auditable, and reproducible. Per-call context stays flat at ~200 tokens whether the agent is on turn 1 or turn 100.
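For readers new to bitemporal semantics, here is a minimal sketch of the idea: each fact carries a valid time (when it became true) and a transaction time (when the store recorded it), and the log is append-only. The `Fact` and `BitemporalLog` names and fields are hypothetical illustrations, not Attestor's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: datetime   # when the fact became true in the world (valid time)
    recorded_at: datetime  # when the store learned about it (transaction time)


class BitemporalLog:
    """Append-only log: corrections add rows, they never overwrite earlier ones."""

    def __init__(self) -> None:
        self._rows: list[Fact] = []

    def append(self, fact: Fact) -> None:
        self._rows.append(fact)

    def as_of(self, key: str, valid_at: datetime, known_at: datetime) -> Optional[str]:
        """Value of `key` true at `valid_at`, using only rows recorded by `known_at`."""
        candidates = [
            row for row in self._rows
            if row.key == key and row.valid_from <= valid_at and row.recorded_at <= known_at
        ]
        if not candidates:
            return None
        # Latest valid_from wins; ties go to the most recently recorded row.
        return max(candidates, key=lambda row: (row.valid_from, row.recorded_at)).value


log = BitemporalLog()
log.append(Fact("plan", "basic", valid_from=datetime(2024, 1, 1), recorded_at=datetime(2024, 1, 1)))
# On March 5 the store learns the customer had already upgraded on February 1.
log.append(Fact("plan", "pro", valid_from=datetime(2024, 2, 1), recorded_at=datetime(2024, 3, 5)))

believed_then = log.as_of("plan", valid_at=datetime(2024, 2, 15), known_at=datetime(2024, 2, 20))
known_now = log.as_of("plan", valid_at=datetime(2024, 2, 15), known_at=datetime(2024, 3, 10))
print(believed_then, known_now)
```

Because nothing is overwritten, the same query with the same two timestamps always returns the same answer, which is what makes past agent state reproducible for an audit.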

  • Flat ~200 tok/call → cost fixed per turn, not compounding.
  • Bitemporal log → replayable, auditable agent state.
  • 100% deterministic recall — no black-box forgetting at scale.

Cut token burn 21× with two API calls. No model change, no prompt rewrite.

By turn 100 your agent reads 8,700 input tokens per call just to answer one question. You built that in. Here's the entire fix:
await attestor.add(namespace, content)
facts = await attestor.recall(namespace, query)

That's it. Per-call context drops from a climbing parabola to a flat ~200 tokens, your session-length caps come off, and recall holds at 100% across all 100 turns. Measure first — context-clock runs locally on Ollama, no API keys.
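Here is a rough sketch of where the two calls sit in an agent loop. `InMemoryAttestor` is a hypothetical stand-in that mimics the `add` and `recall` signatures shown above with a toy word-overlap ranking; it is not the real client, and the session data is made up for illustration.

```python
import asyncio
from collections import defaultdict


class InMemoryAttestor:
    """Hypothetical stand-in for the two calls above; not the real Attestor client."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = defaultdict(list)

    async def add(self, namespace: str, content: str) -> None:
        self._facts[namespace].append(content)

    async def recall(self, namespace: str, query: str) -> list[str]:
        # Toy ranking (assumption): return the one stored fact sharing the most words with the query.
        words = set(query.lower().split())
        ranked = sorted(
            self._facts[namespace],
            key=lambda fact: len(words & set(fact.lower().split())),
            reverse=True,
        )
        return ranked[:1]


async def run_session(turns: int) -> list[str]:
    memory = InMemoryAttestor()
    namespace = "session-1"
    await memory.add(namespace, "the deploy token is stored in vault path alpha")

    prompts = []
    for turn in range(1, turns + 1):
        is_question = turn % 10 == 0
        message = "where is the deploy token" if is_question else f"status update number {turn}"

        # Before each model call: build the input from retrieved facts, not the transcript.
        facts = await memory.recall(namespace, message)
        prompts.append("Known facts:\n" + "\n".join(facts) + f"\n\nUser: {message}")

        # When new information arrives: store it for later turns.
        if not is_question:
            await memory.add(namespace, message)
    return prompts


prompts = asyncio.run(run_session(50))
sizes = [len(prompt.split()) for prompt in prompts]
print(f"prompt size in words: min={min(sizes)}, max={max(sizes)}")
```

The prompt is assembled from one retrieved fact plus the current message, so its size stays bounded no matter how many turns have been stored.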

  • Drop-in: no pipeline overhaul, no lock-in (MIT).
  • DeepSeek-v3.2 session cost: $0.16 → $0.009.
  • 99 tests — reproduce every number on this page.
How it works

Stop replaying. Start retrieving.

Two failure modes, one architecture that beats both — without sacrificing what the model remembers.

Full-context replay — the default

  • O(n²) input-token growth
  • Cap the window → silent memory loss (100%→0% by t10)
  • Non-deterministic at scale, no audit trail
  • Cost compounds the more your users engage

Attestor — retrieved memory

  • O(n) linear growth — flat ~200 tok/call, always
  • 100% recall, no truncation, no forgetting
  • Bitemporal audit log — reproducible state
  • Cost fixed per turn · two API calls · MIT
Under the hood · 100% reproducible

The benchmark, in detail.

Every claim here is measured on a deterministic, single-fact workload — verify it yourself with the open-source context-clock harness.

6 models benchmarked · open + closed
100 turns per session
99 tests · every number reproducible
100% recall held, all turns

Per-call context — what the model reads each turn

replay · turn 100
8,709
attestor · any turn
~209

Replay climbs every turn toward the window; Attestor stays flat at ~209 tokens from turn 1 to turn 100.

The gap widens with session length

t24
5.6×
t50
11.0×
t75
16.2×
t100
21.5×

Input-token reduction (gpt-5.4). A straight line vs a parabola → the ratio is unbounded.
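The reason the ratio has no ceiling can be worked out under a simplified model. Assume each turn adds $k$ tokens to the transcript and retrieval reads a fixed $f$ tokens per call; both symbols are assumptions introduced here, not parameters from the benchmark.

```latex
\begin{align*}
  R(n) &= \sum_{t=1}^{n} k\,t = \frac{k\,n\,(n+1)}{2}
       && \text{cumulative input tokens, replay} \\
  A(n) &= f\,n
       && \text{cumulative input tokens, retrieval} \\
  \frac{R(n)}{A(n)} &= \frac{k\,(n+1)}{2f}
       && \text{grows linearly in } n
\end{align*}
```

Under these assumptions the reduction factor rises by a constant amount per turn, which matches the shape of the measured ratios above.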

Method · honest scope
  • One unique code per memo (needle-in-a-haystack), retrieved top-1, no distractors — isolates "does the agent still have the fact?"
  • Native context window — models run at full capacity; capping is shown only to reproduce the cap-and-rot failure mode.
  • 100% recall = the right memo is reliably found and read — not a claim about hard multi-hop or conflicting-fact retrieval.
  • temperature 0, deterministic; real billed cost from OpenRouter usage.cost, never list-price math.
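The workload shape described in this list can be sketched in a few lines. This is a toy illustration only, not the context-clock harness: the memo wording, the `project-N` topics, and the word-overlap retriever are all assumptions made for the example.

```python
import random
import string


def make_memos(count: int, seed: int = 0) -> dict[str, str]:
    """One memo per topic, each carrying its own randomly generated code."""
    rng = random.Random(seed)
    memos = {}
    for i in range(count):
        code = "".join(rng.choices(string.ascii_uppercase + string.digits, k=8))
        memos[f"project-{i}"] = f"memo for project-{i}: the release code is {code}"
    return memos


def retrieve_top1(memos: dict[str, str], query: str) -> str:
    """Toy retriever: return the single memo sharing the most words with the query."""
    words = set(query.lower().split())
    return max(
        memos.values(),
        key=lambda memo: len(words & set(memo.lower().replace(":", " ").split())),
    )


def recall_rate(memos: dict[str, str]) -> float:
    """Fraction of topics whose code appears in the top-1 retrieved memo."""
    hits = 0
    for topic, memo in memos.items():
        expected_code = memo.rsplit(" ", 1)[-1]
        answer = retrieve_top1(memos, f"what is the release code for {topic}")
        hits += expected_code in answer
    return hits / len(memos)


print(recall_rate(make_memos(100)))
```

A workload this clean measures only whether the right memo is found and read, which is the narrow question the scope notes above say the benchmark answers.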
Full methodology + per-turn data →

Tokenmaxing isn't a model swap. It's an architecture.

Measure your agent's token waste with context-clock, then fix it with Attestor — flat context, 21× fewer input tokens, 100% recall. Both free, both open source.