Q is built around a simple promise: when an AI-native project team asks for context, the system should return the decisions, constraints, and history that actually matter. That sounds obvious until you try to do it across weeks of Slack messages, voice notes, commits, docs, and client requests.
We use LongMemEval-style tests as one of the gates for that promise. The latest production-fit run cleared the bar we set for initial release: post-curation answer correctness reached 90%, with retrieval latency still under the target threshold.
What we measured
The run compares retrieval quality before and after curation on a 20-entry LongMemEval production-fit set. We judge correctness with an LLM-as-judge because exact string matching undercounts correct answers when the system gives the right evidence in different wording.
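To make that judging step concrete, here is a minimal sketch of an LLM-as-judge scorer. The prompt wording, model name, and client call are illustrative assumptions, not Q's actual evaluation harness; the only thing taken from the run setup is that each of the 20 entries is judged correct or incorrect and the fraction correct is reported.

```python
# Minimal sketch of LLM-as-judge scoring (assumed prompt, model, and client).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a memory-retrieval answer.
Question: {question}
Expected answer: {expected}
System answer: {answer}
Reply with exactly "correct" if the system answer conveys the same facts
as the expected answer, otherwise reply "incorrect"."""

def judge(question: str, expected: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, answer=answer)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")

def score(run: list[dict]) -> float:
    # Fraction of the evaluation set judged correct.
    return sum(judge(r["question"], r["expected"], r["answer"]) for r in run) / len(run)
```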
The architecture is organized around a few stable responsibilities:
- project memory is stored canonically in Postgres with vector search
- graph expansion is a derived layer for richer context, not the source of truth
- retrieved evidence is reranked before it is given to the user or agent
- curation improves the memory layer by turning raw artifacts into structured, provenance-aware knowledge
This separation lets us evaluate retrieval quality, curation quality, and response latency independently.
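To make the separation concrete, here is a minimal sketch of the retrieval path under those responsibilities, assuming a psycopg-style connection and a pgvector-style query. The table name, columns, scoring, and reranking heuristic are illustrative assumptions, not Q's schema or code.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    id: str
    text: str
    score: float

def vector_search(conn, query_embedding: str, k: int = 20) -> list[Evidence]:
    # Canonical project memory lives in Postgres; a pgvector-style
    # nearest-neighbour query over a hypothetical project_memory table.
    rows = conn.execute(
        "SELECT id, text, embedding <-> %s AS distance "
        "FROM project_memory ORDER BY distance LIMIT %s",
        (query_embedding, k),
    ).fetchall()
    return [Evidence(id=r[0], text=r[1], score=1.0 / (1.0 + r[2])) for r in rows]

def expand_graph(conn, seeds: list[Evidence]) -> list[Evidence]:
    # Derived layer: pull decisions and constraints linked to the seed hits
    # for richer context. Enrichment only, never the source of truth.
    return seeds  # graph traversal elided in this sketch

def rerank(query: str, candidates: list[Evidence]) -> list[Evidence]:
    # Evidence is reranked before being handed to the user or agent.
    return sorted(candidates, key=lambda e: e.score, reverse=True)

def retrieve(conn, query: str, query_embedding: str) -> list[Evidence]:
    seeds = vector_search(conn, query_embedding)
    expanded = expand_graph(conn, seeds)
    return rerank(query, expanded)
```

Keeping each stage as its own function is what lets retrieval quality, curation quality, and latency be measured independently.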
Latest run
| Metric | Baseline | Treatment | Target | Verdict |
|---|---|---|---|---|
| Retrieval p95 | 1431.4ms | 990.1ms | under 1500ms | Pass |
| Post-curation correctness | 90.0% | 90.0% | at least 70% | Pass |
| Extraction p95 / max | 517s / 536s | n/a | under 600s | Pass |
| Worst curation stage | 76s | n/a | under 180s | Pass |
| API 4xx/5xx | 0 | 0 | 0 | Pass |
The useful part is not just that the number is above target. It is that the system cleared the quality bar while staying inside a latency budget that can work in an interactive product.
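For context on how a table like this becomes a pass/fail gate, here is an illustrative check against the published targets. Only the thresholds and the run values come from the table; the field names and structure are assumptions.

```python
# Targets from the table above; field names are illustrative.
TARGETS = {
    "retrieval_p95_ms": 1500,        # under 1500 ms
    "correctness": 0.70,             # at least 70%
    "extraction_p95_s": 600,         # under 600 s
    "worst_curation_stage_s": 180,   # under 180 s
    "api_errors": 0,                 # zero 4xx/5xx
}

def gate(metrics: dict[str, float]) -> dict[str, bool]:
    return {
        "retrieval_p95_ms": metrics["retrieval_p95_ms"] < TARGETS["retrieval_p95_ms"],
        "correctness": metrics["correctness"] >= TARGETS["correctness"],
        "extraction_p95_s": metrics["extraction_p95_s"] < TARGETS["extraction_p95_s"],
        "worst_curation_stage_s": metrics["worst_curation_stage_s"] < TARGETS["worst_curation_stage_s"],
        "api_errors": metrics["api_errors"] == TARGETS["api_errors"],
    }

# The latest run, expressed in these terms, passes on every axis.
run = {"retrieval_p95_ms": 990.1, "correctness": 0.90,
       "extraction_p95_s": 517, "worst_curation_stage_s": 76, "api_errors": 0}
assert all(gate(run).values())
```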
What changed our confidence
String matching reported a lower score than the judge. That is expected in memory retrieval: if the system answers "the OAuth work was descoped because of infrastructure constraints" while the fixture expected a slightly different phrase, a strict string metric can mark a useful answer as wrong.
For Q, the more important question is whether the returned evidence is sufficient for an owner, developer, or coding agent to make the next decision. That is why we treat LLM-judged correctness as the primary public quality signal, with latency and failure rates as guardrails.
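A tiny concrete version of that gap, using the OAuth example above (the strings are illustrative, not fixture data):

```python
expected = "OAuth was descoped due to infrastructure constraints"
answer = "the OAuth work was descoped because of infrastructure constraints"

# A strict string metric marks this answer wrong despite identical evidence.
strict_correct = answer.strip().lower() == expected.strip().lower()  # False

# The judge is asked a different question: does the answer convey the same
# decision and the same reason? For this pair it says yes.
```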
Current limits
The latest run is encouraging, but it does not remove the need for continued measurement. The main areas we are watching are:
- retrieval tail latency, which is inside the budget but still needs ongoing monitoring
- lineage density around superseded decisions is still lower than we want
- the benchmark is a gate, not a substitute for pilot usage on real agency data
The practical conclusion is that Q's retrieval layer is now above our initial production-fit line. The next step is to keep validating it in pilot workflows while improving digest quality and the owner-facing surfaces around the memory.