Q is built around a simple promise: when an AI-native project team asks for context, the system should return the decisions, constraints, and history that actually matter. That sounds obvious until you try to do it across weeks of Slack messages, voice notes, commits, docs, and client requests.
We use LongMemEval-style tests as one of the gates for that promise. The latest production-fit run cleared the bar we set for initial release: post-curation answer correctness reached 90%, with retrieval latency still under the target threshold.
What we measured
The run compares retrieval quality before and after curation on a 20-entry LongMemEval production-fit set. We judge correctness with an LLM-as-judge because exact string matching undercounts correct answers when the system gives the right evidence in different wording.
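To make that judging step concrete, here is a minimal sketch of an LLM-as-judge scorer. The prompt wording, model name, and client call are illustrative assumptions, not Q's actual evaluation harness; the only thing taken from the run setup is that each of the 20 entries is judged correct or incorrect and the fraction correct is reported.

```python
# Minimal sketch of LLM-as-judge scoring (assumed prompt, model, and client).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a memory-retrieval answer.
Question: {question}
Expected answer: {expected}
System answer: {answer}
Reply with exactly "correct" if the system answer conveys the same facts
as the expected answer, otherwise reply "incorrect"."""

def judge(question: str, expected: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, answer=answer)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")

def score(run: list[dict]) -> float:
    # Fraction of the evaluation set judged correct.
    return sum(judge(r["question"], r["expected"], r["answer"]) for r in run) / len(run)
```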
The architecture is organized around a few stable responsibilities:
- project memory is stored canonically in Postgres with vector search
- graph expansion is a derived layer for richer context, not the source of truth
- retrieved evidence is reranked before it is given to the user or agent
- curation improves the memory layer by turning raw artifacts into structured, provenance-aware knowledge
This separation lets us evaluate retrieval quality, curation quality, and response latency independently.
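To make the separation concrete, here is a minimal sketch of the retrieval path under those responsibilities, assuming a psycopg-style connection and a pgvector-style query. The table name, columns, scoring, and reranking heuristic are illustrative assumptions, not Q's schema or code.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    id: str
    text: str
    score: float

def vector_search(conn, query_embedding: str, k: int = 20) -> list[Evidence]:
    # Canonical project memory lives in Postgres; a pgvector-style
    # nearest-neighbour query over a hypothetical project_memory table.
    rows = conn.execute(
        "SELECT id, text, embedding <-> %s AS distance "
        "FROM project_memory ORDER BY distance LIMIT %s",
        (query_embedding, k),
    ).fetchall()
    return [Evidence(id=r[0], text=r[1], score=1.0 / (1.0 + r[2])) for r in rows]

def expand_graph(conn, seeds: list[Evidence]) -> list[Evidence]:
    # Derived layer: pull decisions and constraints linked to the seed hits
    # for richer context. Enrichment only, never the source of truth.
    return seeds  # graph traversal elided in this sketch

def rerank(query: str, candidates: list[Evidence]) -> list[Evidence]:
    # Evidence is reranked before being handed to the user or agent.
    return sorted(candidates, key=lambda e: e.score, reverse=True)

def retrieve(conn, query: str, query_embedding: str) -> list[Evidence]:
    seeds = vector_search(conn, query_embedding)
    expanded = expand_graph(conn, seeds)
    return rerank(query, expanded)
```

Keeping each stage as its own function is what lets retrieval quality, curation quality, and latency be measured independently.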
Latest run
| Metric | Baseline | Treatment | Target | Verdict |
|---|---|---|---|---|
| Retrieval p95 | 1431.4ms | 990.1ms | under 1500ms | Pass |
| Post-curation correctness | 90.0% | 90.0% | at least 70% | Pass |
| Extraction p95 / max | 517s / 536s | n/a | under 600s | Pass |
| Worst curation stage | 76s | n/a | under 180s | Pass |
| API 4xx/5xx | 0 | 0 | 0 | Pass |
The useful part is not just that the number is above target. It is that the system cleared the quality bar while staying inside a latency budget that can work in an interactive product.
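For context on how a table like this becomes a pass/fail gate, here is an illustrative check against the published targets. Only the thresholds and the run values come from the table; the field names and structure are assumptions.

```python
# Targets from the table above; field names are illustrative.
TARGETS = {
    "retrieval_p95_ms": 1500,        # under 1500 ms
    "correctness": 0.70,             # at least 70%
    "extraction_p95_s": 600,         # under 600 s
    "worst_curation_stage_s": 180,   # under 180 s
    "api_errors": 0,                 # zero 4xx/5xx
}

def gate(metrics: dict[str, float]) -> dict[str, bool]:
    return {
        "retrieval_p95_ms": metrics["retrieval_p95_ms"] < TARGETS["retrieval_p95_ms"],
        "correctness": metrics["correctness"] >= TARGETS["correctness"],
        "extraction_p95_s": metrics["extraction_p95_s"] < TARGETS["extraction_p95_s"],
        "worst_curation_stage_s": metrics["worst_curation_stage_s"] < TARGETS["worst_curation_stage_s"],
        "api_errors": metrics["api_errors"] == TARGETS["api_errors"],
    }

# The latest run, expressed in these terms, passes on every axis.
run = {"retrieval_p95_ms": 990.1, "correctness": 0.90,
       "extraction_p95_s": 517, "worst_curation_stage_s": 76, "api_errors": 0}
assert all(gate(run).values())
```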
What changed our confidence
String matching reported a lower score than the judge. That is expected in memory retrieval: if the system answers "the OAuth work was descoped because of infrastructure constraints" while the fixture expected a slightly different phrase, a strict string metric can mark a useful answer as wrong.
For Q, the more important question is whether the returned evidence is sufficient for an owner, developer, or coding agent to make the next decision. That is why we treat LLM-judged correctness as the primary public quality signal, with latency and failure rates as guardrails.
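A tiny concrete version of that gap, using the OAuth example above (the strings are illustrative, not fixture data):

```python
expected = "OAuth was descoped due to infrastructure constraints"
answer = "the OAuth work was descoped because of infrastructure constraints"

# A strict string metric marks this answer wrong despite identical evidence.
strict_correct = answer.strip().lower() == expected.strip().lower()  # False

# The judge is asked a different question: does the answer convey the same
# decision and the same reason? For this pair it says yes.
```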
Current limits
The latest run is encouraging, but it does not remove the need for continued measurement. The main areas we are watching are:
- retrieval tail latency, which is inside the budget but still needs ongoing monitoring
- lineage density around superseded decisions is still lower than we want
- the benchmark is a gate, not a substitute for pilot usage on real agency data
The practical conclusion is that Q's retrieval layer is now above our initial production-fit line. The next step is to keep validating it in pilot workflows while improving digest quality and the owner-facing surfaces around the memory.