Auditable memory for agent teams. Deterministic. Bitemporal. Self‑hosted, with no LLM in the critical path.
One tier your whole pipeline shares — planner writes, executor reads, reviewer sees both. Namespaces, RBAC, provenance, temporal correctness, ranked retrieval, token budgets — built in, not bolted on. Python library, REST API, or containerized service. No SaaS middleman. No per‑seat fees. No vendor lock‑in. It’s yours.
Eight hundred memories in the store. Six light up. Not the six it searched — the six this moment needs. Your planner’s decision from Monday. The researcher’s finding from Wednesday. The reviewer’s objection from yesterday. Ranked, deduped, fit to budget. Zero LLM calls in the critical path.
Four writes across Mon–Thu land as persisted memories. Friday morning a fresh Portfolio Planner wakes up to a new task, calls mem.load_context(), and Attestor ranks, dedupes, and budget-fits all four back into the context window. The agent resumes with full continuity: earnings signal, risk cap, prior stance, compliance precedent.
Single agents rediscover the same facts every run. Multi-agent pipelines are worse — the planner’s decisions never reach the executor, the researcher’s findings never reach the reviewer. So teams do the only thing they can: stuff giant prompts between agents. Burn tokens. Hope nothing important fell off the edge. That’s not an architecture. That’s a workaround.
“We had a planner, a coder, a reviewer, a deployer — four agents in a pipeline. None of them knew what the others learned. We were passing giant prompts between them and burning tokens on stale information.”
Overheard · Engineering Lead, Fortune 100 Bank
More agents, more sessions, more memories — retrieval gets better while context cost stays flat.
Every recall and write is scoped to an AgentContext — identity, role, namespace, parent trail, token budget, write quota, visibility. Contexts are immutable. Spawning a sub-agent returns a new context with inherited provenance. Planner writes. Executor reads. Reviewer sees both. Every call authorised before it touches storage.
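Below is a minimal sketch of an immutable context that passes provenance down to sub-agents. It assumes a frozen dataclass. The field names follow the list above. The default values and the `spawn()` method and its signature are hypothetical, not Attestor's published API.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class AgentContext:
    """Hypothetical immutable call scope; fields mirror the prose above."""
    agent_id: str
    role: str
    namespace: str
    parent_trail: Tuple[str, ...] = ()
    token_budget: int = 2000        # placeholder default
    write_quota: int = 100          # placeholder default
    visibility: str = "namespace"   # placeholder default

    def spawn(self, agent_id: str, role: str, **overrides) -> "AgentContext":
        # Returns a new context; self is never mutated. The child's parent
        # trail records this agent, so provenance is inherited.
        return replace(self, agent_id=agent_id, role=role,
                       parent_trail=self.parent_trail + (self.agent_id,),
                       **overrides)


planner = AgentContext(agent_id="planner-1", role="planner", namespace="desk-a")
executor = planner.spawn("executor-1", "executor", token_budget=500)
print(executor.parent_trail)  # ('planner-1',)
```

The dataclass is frozen, so reassigning any field raises `FrozenInstanceError`. The only way to change scope is to derive a new context.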
Every agent, project, or tenant gets its own namespace. Planner writes, coder reads, reviewer sees both. Isolated by default, shared when you configure it.
Orchestrator, Planner, Executor, Researcher, Reviewer, Monitor. Read-only observers to full admins. Control who can read, write, or supersede in each namespace.
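The role model can be pictured as two lookups: one from role to allowed actions, and one listing which roles each namespace grants. Only the six role names come from the text. The permissions assigned to each role here are illustrative assumptions, and `is_allowed` is a hypothetical helper.

```python
from typing import Dict, FrozenSet, Mapping, Set

# Hypothetical role -> permission mapping, from read-only observer to full admin.
ROLE_PERMS: Dict[str, FrozenSet[str]] = {
    "orchestrator": frozenset({"read", "write", "supersede", "admin"}),
    "planner": frozenset({"read", "write", "supersede"}),
    "executor": frozenset({"read", "write"}),
    "researcher": frozenset({"read", "write"}),
    "reviewer": frozenset({"read", "write"}),
    "monitor": frozenset({"read"}),
}


def is_allowed(role: str, action: str, namespace: str,
               grants: Mapping[str, Set[str]]) -> bool:
    """True if `namespace` grants `role` and the role permits `action`.

    Namespaces are isolated by default: one missing from `grants` admits no role.
    """
    return (role in grants.get(namespace, set())
            and action in ROLE_PERMS.get(role, frozenset()))
```

With `grants = {"desk-a": {"planner", "executor", "reviewer"}}`, the executor can read desk-a but cannot supersede there. The monitor is shut out of desk-a until someone grants it access.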
Know which agent wrote which memory, when, and under which parent session. The reviewer can trace a decision back to the planner three sessions ago — not some unknown source.
Agent A learns "user works at Google." Agent B learns "user works at Meta." Attestor auto-supersedes. Full history preserved. Zero inference calls in the critical path.
Budgets are set per call: recall(query, budget=2000). A lightweight summarizer might ask for 500 tokens; a deep reasoner, 5,000. Each agent receives only what fits in its own context window.
Prevent a runaway agent from flooding the store with noise. Rate-limit writes per namespace, flag writes for human review, add compliance tags for audit.
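A common way to enforce a per-namespace write quota is a token bucket. This sketch assumes that approach; it is not Attestor's documented limiter. `NamespaceWriteLimiter`, `rate` and `burst` are hypothetical names. The clock is injectable so tests stay deterministic.

```python
import time
from typing import Callable, Dict


class NamespaceWriteLimiter:
    """One token bucket per namespace: refills at `rate` writes/sec, holds `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._tokens: Dict[str, float] = {}
        self._last: Dict[str, float] = {}

    def try_write(self, namespace: str) -> bool:
        now = self.clock()
        tokens = self._tokens.get(namespace, float(self.burst))
        last = self._last.get(namespace, now)
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        self._last[namespace] = now
        if tokens >= 1.0:
            self._tokens[namespace] = tokens - 1.0
            return True
        self._tokens[namespace] = tokens  # rejected: caller may queue or flag it
        return False
```

Each namespace gets its own bucket, so a noisy agent in one namespace cannot exhaust another namespace's quota.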
When an agent calls recall(query, budget), six cooperating steps find, narrow, rank, diversify, decay, and fit the most relevant memories into the requested token ceiling. Ten million memories in the store. Your context window never sees more than the budget. And because there’s no LLM in the path, the same query always returns the same ranking. You can unit‑test it.
Fig. 3 — Live trace. Same pipeline. Ten memories or ten million. Replay it; it ranks the same.
Embed the query (local Ollama bge-m3 by default; cloud providers as fallback) and fetch the 50 nearest memories from pgvector by cosine similarity.
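Here is step 01 on its own: rank stored vectors by cosine similarity to the query and keep the top K. This pure-Python sketch scans a dict exactly and leaves the embedding call out. `top_k` is a hypothetical helper. In pgvector the equivalent is ordering by the `<=>` cosine-distance operator with a `LIMIT`.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0  # zero vectors match nothing


def top_k(query_vec: Sequence[float], index: Dict[str, Sequence[float]],
          k: int = 50) -> List[Tuple[str, float]]:
    # Exact scan for clarity; a real store would use an ANN index instead.
    scored = ((mid, cosine(query_vec, vec)) for mid, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```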
BFS depth ≤ 2 on the entity graph from each candidate's entity to the question entities. Boosts hits by hop distance; penalises unreachable.
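A sketch of the depth-limited BFS behind step 02, assuming the graph is an adjacency dict. The boost multipliers and the unreachable penalty are placeholder values, because the source does not state the real ones.

```python
from collections import deque
from typing import Dict, Optional, Set


def hop_distance(graph: Dict[str, Set[str]], start: str,
                 targets: Set[str], max_depth: int = 2) -> Optional[int]:
    """Shortest hop count from `start` to any target entity, or None if
    no target is reachable within `max_depth` hops."""
    if start in targets:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            if nxt in targets:
                return depth + 1
            seen.add(nxt)
            frontier.append((nxt, depth + 1))
    return None


# Placeholder multipliers: closer entities get a larger boost.
HOP_BOOST = {0: 1.3, 1: 1.2, 2: 1.1}
UNREACHABLE_PENALTY = 0.8


def graph_adjust(score: float, distance: Optional[int]) -> float:
    factor = UNREACHABLE_PENALTY if distance is None else HOP_BOOST[distance]
    return score * factor
```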
Inject typed-edge triples (uses, authored-by, supersedes) as synthetic memories so consumers reason over relations, not just text.
Maximal Marginal Relevance rerank with λ=0.7 trades relevance against diversity, pushing near-duplicate candidates down the ranking while preserving topical coverage.
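The standard greedy MMR rule picks, at each step, the candidate that maximizes λ·relevance - (1 - λ)·(highest similarity to anything already picked). This sketch uses λ = 0.7 and assumes relevance scores and a pairwise similarity function are already available. `mmr` is a hypothetical helper.

```python
from typing import Callable, Dict, List


def mmr(relevance: Dict[str, float],
        similarity: Callable[[str, str], float],
        k: int, lam: float = 0.7) -> List[str]:
    """Greedy Maximal Marginal Relevance over candidate memory ids."""
    selected: List[str] = []
    remaining = dict(relevance)

    def marginal(mid: str) -> float:
        # Penalise a candidate by its closest match among those already picked.
        redundancy = max((similarity(mid, s) for s in selected), default=0.0)
        return lam * remaining[mid] - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=marginal)
        selected.append(best)
        del remaining[best]
    return selected
```

Consider relevance `{"a": 0.9, "a_dup": 0.88, "b": 0.6}`, where `a` and `a_dup` are near-duplicates. With λ = 0.7, the second pick goes to the less relevant but distinct `b`. With λ = 1.0 the rule reduces to pure relevance and `a_dup` wins.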
Confidence decay penalises stale, low-confidence rows; temporal boost lifts recent writes. Optional BM25/FTS lane fuses with vector via RRF (k=60).
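Reciprocal Rank Fusion scores each memory as the sum of 1 / (k + rank) over every lane that returned it. It therefore needs ranks, not comparable raw scores. This sketch uses k = 60. The tie-break by id is an assumption added to keep the output deterministic. Decay and temporal boost are not modeled because the text does not give their formulas.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: score(d) = sum over lanes of 1 / (k + rank_d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, mid in enumerate(ranking, start=1):
            scores[mid] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda mid: (-scores[mid], mid))
```

Fusing a vector lane `["m1", "m2", "m3"]` with a BM25 lane `["m3", "m1", "m4"]` puts `m1` first, because it ranks high in both lanes.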
Greedy monotonic-by-score selection packs the surviving memories into the caller's token budget. The rest never enter context.
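A sketch of greedy packing, assuming each candidate arrives with a score and a precomputed token count. The text does not say whether Attestor stops at the first memory that does not fit or skips it and keeps scanning. This version skips and continues. `pack` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def pack(candidates: Sequence[Tuple[str, float, int]], budget: int) -> List[str]:
    """Keep (memory_id, score, tokens) candidates in descending score order
    while they still fit under `budget`."""
    kept: List[str] = []
    used = 0
    for mid, _score, tokens in sorted(candidates, key=lambda c: (-c[1], c[0])):
        if used + tokens <= budget:
            kept.append(mid)
            used += tokens
    return kept
```

With a 2,000-token budget and candidates costing 1,200, 1,000 and 600 tokens in score order, the sketch keeps the first and third (1,800 tokens total).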
Every memory is persisted across three complementary stores. Every supported backend combo is just a different technology choice for one or more of these roles.
Source of truth. Content, tags, entity, category, timestamps, provenance, confidence. Where add() commits and recall() hydrates final text.
Dense embedding per memory, keyed by memory ID. Finds memories by meaning when no tag or word overlaps the query.
Entity nodes + typed edges (uses, authored-by, supersedes). Query “Python” can surface “Django” via the graph.
add()
          mem.add(content, tags, entity, ...)
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Document      │  │ Vector        │  │ Graph         │
│ store         │  │ store         │  │ store         │
├───────────────┤  ├───────────────┤  ├───────────────┤
│ insert row    │  │ embed(text)   │  │ extract       │
│ content,      │  │ → 1024-d vec  │  │ entities +    │
│ tags, meta,   │  │ insert keyed  │  │ relations     │
│ provenance    │  │ by memory ID  │  │ upsert nodes, │
│               │  │               │  │ add edges     │
└───────────────┘  └───────────────┘  └───────────────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           ▼
            Contradiction check per entity
            (older conflicting facts → superseded,
             kept in timeline)
                           │
                           ▼
                          done
The three writes commit as one logical transaction. On SQL backends it’s a real DB transaction; on distributed backends it’s sequenced with best-effort rollback.
recall()
    mem.recall(query, budget=2000)
                  │
                  ▼
┌────────────────────────────────────┐
│ 01 Vector top-K → vec store        │
│    pgvector · cosine · k=50        │
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ 02 Graph narrow → graph            │
│    Neo4j BFS depth ≤ 2             │
│    hop-affinity boost / penalty    │
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ 03 Triples inject                  │
│    uses · authored-by · supersedes │
│    synthetic-memory facts          │
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ 04 MMR diversity λ = 0.7           │
│ 05 Decay + temporal boost          │
│ 06 Greedy token pack ≤ budget      │
└─────────────────┬──────────────────┘
                  ▼
    List[RetrievalResult] ≤ budget
          (zero LLM calls)
Optional BM25 / FTS lane → fused with the vector lane via RRF k=60
when the document store has the content_tsv tsvector column populated.
Each step takes the previous step’s candidate set as input. Only memory IDs travel until the final budget-fit step hydrates the keepers from the document store. A store with ten million rows still returns a tight result set inside the caller’s token ceiling.
Isolation. Temporal correctness. Provenance. Not features we bolted on. Primitives we started from.
Every memory lives inside a namespace. Every agent is bound to one of six roles. Every call is authorised before it touches storage. Enforced at the row level, not in application code.
Attestor doesn’t overwrite. It supersedes. Every fact has a validity window. The timeline is replayable to any point in the past.
Ask recall(as_of=…) to replay the past. Nothing is deleted. Auditors can reconstruct what the desk knew and when.
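Replaying the past comes down to filtering on each fact's validity window. This minimal sketch models valid time only; a fully bitemporal store would also track when each row was recorded. `Fact` and `as_of` are illustrative names, and the sketch assumes half-open windows `[valid_from, valid_until)`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    memory_id: str
    content: str
    valid_from: datetime
    valid_until: Optional[datetime] = None  # None = still valid


def as_of(facts: List[Fact], when: datetime) -> List[Fact]:
    """Facts whose validity window [valid_from, valid_until) contains `when`."""
    return [
        f for f in facts
        if f.valid_from <= when and (f.valid_until is None or when < f.valid_until)
    ]
```

Suppose "user works at Google" is valid until 2024-06-01 and "user works at Meta" is valid from that date. Asking about March returns the Google fact. Asking about July returns the Meta fact. Neither row is ever deleted.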
Every sentence the agent writes back is traceable to its source. A cryptographic chain from raw feed ingest to grounded answer.
The citation isn’t a string the LLM chose. It’s the memory_id carried through the pipeline. Click it, land on the raw wire, verify the hash. Grounded, not plausible.
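A tamper-evident chain can be sketched with SHA-256. Each link hashes the previous link plus a canonical encoding of its payload, so editing any earlier stage changes every later hash. This shows the general technique only. Attestor's actual hashing scheme, genesis value and payload fields are not described here, and the names below are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that anchors the first link


def link(prev_hash: str, payload: Dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) so equal payloads hash equally.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads: List[Dict]) -> List[str]:
    hashes: List[str] = []
    prev = GENESIS
    for payload in payloads:
        prev = link(prev, payload)
        hashes.append(prev)
    return hashes


def verify(payloads: List[Dict], hashes: List[str]) -> bool:
    # Recompute from the start; any edited stage breaks every later hash.
    return build_chain(payloads) == hashes
```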
Real agent teams have a planner, an executor, and a reviewer — often on different model families. They write to the same memory store. Without a contract for how those writes interact, you get duplicate facts, lost edits, and silent overwrites. Attestor ships five mechanisms that turn multi-LLM writes into an auditable transaction log.
Deterministic. Zero LLM in the loop. On every add(), Attestor checks for active rows with the same (entity, category, namespace) and different content. Each match is auto-marked superseded with valid_until=now and superseded_by=<new_id>.
Old facts are never deleted. recall(as_of=…) still returns them.
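Below is a sketch of the supersession rule as stated above, with an in-memory list standing in for the document store. `Row` and `add_with_supersede` are hypothetical names. The text does not say how Attestor treats an add with identical content, so this sketch simply appends the row without superseding anything.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Row:
    id: str
    entity: str
    category: str
    namespace: str
    content: str
    valid_from: datetime
    valid_until: Optional[datetime] = None   # None while the fact is active
    superseded_by: Optional[str] = None

    @property
    def active(self) -> bool:
        return self.valid_until is None


def add_with_supersede(store: List[Row], entity: str, category: str,
                       namespace: str, content: str,
                       now: Optional[datetime] = None) -> Row:
    now = now or datetime.now(timezone.utc)
    new = Row(str(uuid.uuid4()), entity, category, namespace, content, now)
    for row in store:
        same_key = (row.entity, row.category, row.namespace) == (entity, category, namespace)
        if row.active and same_key and row.content != content:
            # Close the old row's validity window; never delete it.
            row.valid_until = now
            row.superseded_by = new.id
    store.append(new)
    return new
```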
For conversation ingest, every newly extracted fact gets one of four decisions from the LLM. Each carries an evidence_episode_id — every supersession is auditable.
ADD no existing match — write fresh
UPDATE refined value, keep id
INVALIDATE old contradicted — mark superseded
NOOP already represented — skip
Failsafe: any LLM/parse failure falls back to ADD-by-default. Better a duplicate-ish row than a silent drop.
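A sketch of the ADD-by-default failsafe. The LLM call is injected as a plain function. The JSON reply shape `{"decision": ...}` is an assumption, not Attestor's actual prompt contract.

```python
import json
from enum import Enum
from typing import Callable


class Decision(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    INVALIDATE = "INVALIDATE"
    NOOP = "NOOP"


def decide(fact: str, ask_llm: Callable[[str], str]) -> Decision:
    """Parse the model's verdict; any failure degrades to ADD."""
    try:
        reply = json.loads(ask_llm(fact))
        return Decision(reply["decision"])
    except Exception:
        # LLM error, malformed JSON, missing key, or unknown label:
        # prefer a near-duplicate row over a silently dropped fact.
        return Decision.ADD
```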
Two passes per round, with separate prompts. User-turn extractor only emits facts attributable to the user; assistant-turn extractor only emits facts the assistant introduced. Stops cross-attribution — the “Mem0 +53.6” fix in our LongMemEval scores.
Periodic synthesis across a user’s recent facts. One LLM call distills many episodes into structured outputs:
stable_preferences — patterns appearing in 3+ episodes
stable_constraints — rules the user repeatedly invokes
changed_beliefs — preferences shifted (old → new)
contradictions_for_review — flagged, not auto-resolved
Hard contradictions in regulated chat systems must be a human decision. The prompt is explicit: do NOT auto-resolve.
Every LongMemEval answer is scored by two independent judges from different model families; the second judge guards against answerer-judge collusion. A model that judges its own answers tends to reward them; cross-family judging surfaces real disagreement.
The scripts/lme_smoke_local.py driver runs this exact dual-judge stack against your install. See the Quickstart » Smoke benchmark tab.
Every memory carries agent_id, session_id, and source_episode_id. Every supersede writes the prior id into superseded_by. Cryptographic signing is opt-in (Phase 8.1). The result: any disputed answer can be traced back to which agent wrote which fact, when, and what it superseded.
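To trace a disputed answer, follow `superseded_by` pointers backwards. This sketch works over plain dicts and assumes each old row points to exactly one successor. `lineage` is a hypothetical helper.

```python
from typing import Dict, List


def lineage(memory_id: str, rows: Dict[str, Dict]) -> List[str]:
    """Return [memory_id, its predecessor, that one's predecessor, ...]."""
    # superseded_by points old -> new; invert it to walk new -> old.
    predecessor: Dict[str, str] = {
        row["superseded_by"]: rid
        for rid, row in rows.items()
        if row.get("superseded_by")
    }
    chain = [memory_id]
    while chain[-1] in predecessor and predecessor[chain[-1]] not in chain:
        chain.append(predecessor[chain[-1]])  # the guard above stops on cycles
    return chain
```

Mapping the chain through each row's `agent_id` and `session_id` answers which agent wrote which fact, and in which session.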
One Python library. One Starlette ASGI container. Six deployment targets — laptop, dev VM, AWS, Azure, GCP, on‑prem. The three storage roles swap per column. Your agent code never learns which backend it’s talking to. Terraform templates live under agent_memory/infra/. Clone. Set your variables. terraform apply.
$ pip install attestor
$ attestor api --host 0.0.0.0 --port 8080
Starlette ASGI on http://localhost:8080. Run attestor setup local to bring up Postgres 16 + Neo4j (with GDS) via the bundled Docker Compose. Point every agent in your stack at the same URL — they share memory instantly. No SaaS. No API keys. Air-gap it behind your firewall and walk away.
App Runner with Starlette ASGI. Auto-scaling, HTTPS, custom domains.
Container Apps with Cosmos DB DiskANN. Scale-to-zero. Same API, same results.
Cloud Run with AlloyDB. Scale-to-zero. Google's managed infrastructure.
pgvector + Apache AGE. Neon serverless or any Postgres 16.
Multi-model: graph + document + vector in one engine. Oasis or self-hosted.
Postgres + pgvector + Neo4j via Docker Compose. Air-gapped deployments. Full data sovereignty.
One Attestor image. Six targets. The three storage roles swap per column. That’s the whole trick.
attestor setup local
Every deployment is the same Python library wrapped in the same Starlette ASGI container. DocumentStore, VectorStore, and GraphStore are three interfaces; each column above is one implementation triple.
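One way to express the three-role split in Python is with `typing.Protocol`. The method names below are hypothetical stand-ins for whatever DocumentStore, VectorStore and GraphStore actually expose. The in-memory classes show that calling code depends only on the interfaces.

```python
from typing import Dict, List, Protocol, Sequence, Tuple


class DocumentStore(Protocol):
    def put(self, memory_id: str, content: str) -> None: ...
    def get(self, memory_id: str) -> str: ...


class VectorStore(Protocol):
    def upsert(self, memory_id: str, vector: Sequence[float]) -> None: ...
    def nearest(self, vector: Sequence[float], k: int) -> List[str]: ...


class GraphStore(Protocol):
    def add_edge(self, src: str, relation: str, dst: str) -> None: ...
    def neighbors(self, node: str) -> List[Tuple[str, str]]: ...


class InMemoryDocs:
    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def put(self, memory_id: str, content: str) -> None:
        self._rows[memory_id] = content

    def get(self, memory_id: str) -> str:
        return self._rows[memory_id]


class InMemoryVectors:
    def __init__(self) -> None:
        self._vecs: Dict[str, List[float]] = {}

    def upsert(self, memory_id: str, vector: Sequence[float]) -> None:
        self._vecs[memory_id] = list(vector)

    def nearest(self, vector: Sequence[float], k: int) -> List[str]:
        # Dot-product ranking; ties broken by id for a stable order.
        def score(mid: str) -> float:
            return sum(a * b for a, b in zip(vector, self._vecs[mid]))
        return sorted(self._vecs, key=lambda mid: (-score(mid), mid))[:k]


class InMemoryGraph:
    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str]]] = {}

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self._edges.setdefault(src, []).append((relation, dst))

    def neighbors(self, node: str) -> List[Tuple[str, str]]:
        return list(self._edges.get(node, []))


def remember(docs: DocumentStore, vecs: VectorStore, graph: GraphStore,
             memory_id: str, content: str, vector: Sequence[float],
             edges: Sequence[Tuple[str, str, str]] = ()) -> None:
    # Caller code depends only on the three interfaces, never on a backend.
    docs.put(memory_id, content)
    vecs.upsert(memory_id, vector)
    for src, relation, dst in edges:
        graph.add_edge(src, relation, dst)
```

Swapping to a Postgres- or Neo4j-backed class means providing another object with the same methods. `remember` itself does not change.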
Prototype on a laptop. Promote to Docker Compose on a dev VM. Promote to a managed container runtime. Same API throughout. Only the storage URLs change.
Same engine. Same storage. Same retrieval. Different coupling. Pick by latency budget and how many agents share the tier.
AgentMemory('./store') in‑process. Sub‑millisecond. Right for a single agent prototyping on a laptop.
attestor api on localhost:8080. Any language. Go or Rust agents call HTTP without Python in the image.
One Attestor service in front of an agent mesh. App Runner · Cloud Run · Container Apps. Managed storage behind. This is the production shape.
Code path is identical across all three. Only configuration changes.
Not a config file. Not an MCP setup guide. Three words in your terminal. Attestor interviews you, installs itself, wires the hooks, and runs a health check before you’ve put the kettle on.
Paste into Claude Code · global or project scope · auto‑merges MCP config
Same call surface whether Attestor runs in‑process on your laptop or behind a Cloud Run service. Pick the interface. Ship.
What we won’t compromise on — no matter how loud the pressure gets.
Your data stays in your infrastructure. Self‑hosted, always.
Retrieval is deterministic. No LLM judges. No hidden inference in the path.
One recall(). Every backend. Swap without rewriting.
Agent teams are first‑class. Not a bolt‑on.
Boring where it counts. Proven, debuggable, no magic.
Attestor's LongMemEval-S verdicts, per question, per judge, and per release, are tracked on Braintrust. Every release reruns the four LME-S categories (temporal‑reasoning, multi‑session, knowledge‑update, single‑session‑user) against the canonical configured stack. CI fails any PR that regresses past the threshold.