Benchmarks
We evaluate Noumenon by asking the same 40 questions about 9 repos in 7 languages, across three context conditions: raw source, the full knowledge graph, and TF-IDF retrieval alone. Same model, same prompt template, same judge. The numbers are real, but the design choices behind them favor Noumenon in specific ways. See Known Flaws and Limitations before quoting.
Headline Numbers
More Accurate
Without 47.7% → With 62.0% mean across 9 repos.
Better Per Token
TF-IDF-only context delivers more quality per input token than the full KG.
Seven Languages
Same 40 questions per repo. Reproducible with noum bench.
Run date 2026-04-03, embed-tfidf branch. Speed and per-question token figures from earlier runs are intentionally omitted; numbers below come from a single benchmark run on this date and method.
Methodology
Nine repositories, forty hand-written questions per repo, three context conditions (:raw, :full, :embedded), one answerer model, one judge. The point of holding everything else still is to isolate the effect of the context, not the model or the agent loop.
How Each Question Is Asked
Every condition uses the same answerer model with the same temperature and the same prompt template. The prompt is a single turn:
You are answering a question about a software codebase.
Context (...):
<context block>
Question: <question>
Provide a detailed, accurate answer based on the context provided.
The model has no tools and no agent loop. It cannot read files, run shell commands, call queries, follow up with another turn, or browse the codebase. The only thing it sees beyond the question is whatever sits in the context block. That's deliberate: the benchmark measures context quality, not agent behavior.
This is not noum ask. The interactive Ask agent in regular use has TF-IDF seeding, a routing-model hint, and iterative tool use. The benchmark freezes that surface so the comparison is apples-to-apples.
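As a minimal sketch of the single-turn setup described above, the prompt can be rendered from one fixed template in which only the context block changes between conditions. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, the exact whitespace is an assumption, and the elided `(...)` after "Context" is kept exactly as it appears in the template.

```python
# Hypothetical rendering of the benchmark prompt; line breaks are assumed.
PROMPT_TEMPLATE = (
    "You are answering a question about a software codebase.\n"
    "Context (...):\n"
    "{context}\n"
    "Question: {question}\n"
    "Provide a detailed, accurate answer based on the context provided."
)


def build_prompt(context: str, question: str) -> str:
    """Render the single-turn prompt; only `context` differs between conditions."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Same question, two conditions: the prompts differ only in the context block.
raw_prompt = build_prompt("<raw source paste>", "Which files are complex?")
kg_prompt = build_prompt("<query result>", "Which files are complex?")
```

Holding the template fixed in one place is what makes a difference in score attributable to the context rather than to prompt wording.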
What :raw Means
The context block is the repo's source code. Files are listed via git ls-tree HEAD, read in that order, and concatenated, each wrapped in <file-content> delimiters. Per-file content is capped at 10,000 characters. Total context is capped at the API budget; when the budget runs out, the remaining files are dropped from the tail.
This is more honest than "empty context" or "only the README," but it is not the strongest baseline you could build. A developer using Claude Code or Cursor against the same repo would let the model call read_file and grep on demand instead of pasting everything up front. We have not benchmarked against that baseline. Take :raw as "raw paste," not "best you can do without Noumenon."
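The raw-paste assembly can be sketched roughly as follows. This is an illustration under assumptions, not the harness code: the budget is expressed in characters here (the real cap is the API budget), the `path` attribute on the delimiter is invented for readability, and `build_raw_context` is a hypothetical name. Files are passed in as `(path, content)` pairs in listing order so the sketch stays self-contained.

```python
from typing import Iterable, Tuple

PER_FILE_CAP = 10_000  # characters per file, as stated in the text


def build_raw_context(files: Iterable[Tuple[str, str]], budget_chars: int) -> str:
    """Concatenate capped files in listing order until the budget runs out."""
    parts = []
    used = 0
    for path, content in files:
        # Delimiter shape is an assumption; only the tag name comes from the text.
        block = (
            f'<file-content path="{path}">\n'
            f"{content[:PER_FILE_CAP]}\n"
            "</file-content>\n"
        )
        if used + len(block) > budget_chars:
            break  # this file and every later one is dropped from the tail
        parts.append(block)
        used += len(block)
    return "".join(parts)


files = [("a.py", "x" * 20_000), ("b.py", "y" * 50), ("c.py", "z" * 50)]
context = build_raw_context(files, budget_chars=10_200)
# a.py is cut to 10,000 characters, b.py fits, c.py is dropped.
```

The sketch makes the tail-drop behavior concrete: a file late in the listing never reaches the model once the budget is spent, regardless of how relevant it is.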
What :full Means
Each question carries a :query-name in the question file: the named Datalog query whose result contains the facts the question is about. For the :full run, that query executes against the knowledge graph, and the structured result becomes the context block.
The result is typically far smaller than the raw source dump and contains pre-extracted facts (file paths, complexity ratings, layer assignments, dependency edges, contributor counts, etc.) instead of code. The model still answers from context only, but the context is the answer-shaped data Noumenon's pipeline already pulled out.
What :embedded Means
A third condition added in the TF-IDF work. The context block is the top fifteen results from a TF-IDF cosine-similarity search of the question against per-file and per-component summaries: file paths plus their summaries plus a relevance score. No graph traversal. No Datalog. Just retrieval.
It runs much smaller than :full context-wise (typically thousands rather than tens of thousands of input tokens) and captures about 69% of the :full mean accuracy on its own (42.5% against 62.0% in the table below). By the per-question token measure, :embedded delivers about 3.7× more quality per input token than :full. This isn't a replacement for the graph; it's the cheapest tier for cost-bounded use, and it's what the production Ask agent warms up with before a single Datalog query runs. See Ask for how the two compose at runtime.
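For readers unfamiliar with the retrieval step, here is a small pure-Python sketch of TF-IDF cosine-similarity ranking over summaries. It is a generic textbook version, not Noumenon's implementation: the tokenizer, the smoothed IDF formula, and the function name `tfidf_top_k` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_top_k(question, summaries, k=15):
    """Rank {path: summary} entries by TF-IDF cosine similarity to the question."""
    docs = {path: Counter(tokenize(text)) for path, text in summaries.items()}
    n = len(docs)
    # Document frequency: in how many summaries each term appears.
    df = Counter(term for counts in docs.values() for term in counts)
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

    def vectorize(counts):
        # Terms never seen in any summary carry no weight.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vectorize(Counter(tokenize(question)))
    scored = [(cosine(q, vectorize(c)), path) for path, c in docs.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by path
    return [(path, round(score, 3)) for score, path in scored[:k]]


summaries = {
    "src/auth.py": "handles user login and session tokens",
    "src/db.py": "database connection pool",
    "src/cli.py": "command line entry point",
}
top = tfidf_top_k("how does login work", summaries, k=2)
```

In this toy example only `src/auth.py` shares a term with the question, so it ranks first and the remaining entries score zero. The context block for the condition would then list each returned path with its summary and score.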
Question Set
Forty questions, hand-written, covering three categories. The same set runs against every repo. Concrete examples from the current questions.edn:
- Single-hop (deterministic). Answerable from one query result. "Which source files are rated as 'complex' or 'very-complex'?" and "Who are the top three contributors by commit count?"
- Multi-hop (LLM-judged). Combine multiple facts. "Which files are most frequently changed together? What does this suggest about coupling?" and "Which files are classified as 'trivial' complexity AND in the 'core' architectural layer?"
- Architectural (LLM-judged). Synthesis and reasoning. "Describe the overall architecture in terms of its layers and how they relate." and "Based on the file complexity distribution, where would you focus a code review? Why?"
How Answers Are Scored
Two scoring paths. Single-hop questions score deterministically: the same Datalog query that produced the context (or, in the :raw case, would have) is the ground truth. The scorer pulls the expected file paths, layer keywords, or contributor names out of that result and checks (with word-boundary regex) whether the answer mentions them. The resulting :correct, :partial, or :wrong verdict is purely mechanical.
Multi-hop and architectural questions go through an LLM judge using a fixed rubric template that ships with the benchmark. The judge sees only the question, the per-question rubric, and the answer text. It does not see the source code, the knowledge graph, or which condition produced the answer. The rubric includes calibration examples to anchor the score scale.
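The deterministic path for single-hop questions can be illustrated with a short sketch. The thresholds are an assumption (all expected terms mentioned gives correct, some gives partial, none gives wrong); the text does not state the actual cut-offs, and `score_single_hop` is a hypothetical name.

```python
import re


def score_single_hop(answer: str, expected: list[str]) -> str:
    """Return 'correct', 'partial', or 'wrong' from word-boundary mentions."""

    def mentioned(term: str) -> bool:
        # Lookarounds act as word boundaries even when the term itself
        # contains non-word characters such as '/' or '.'.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        return re.search(pattern, answer, flags=re.IGNORECASE) is not None

    hits = sum(mentioned(term) for term in expected)
    if expected and hits == len(expected):
        return "correct"
    return "partial" if hits else "wrong"


expected = ["src/core.clj", "src/db.clj"]
score_single_hop("The complex files are src/core.clj and src/db.clj.", expected)
score_single_hop("Only src/core.clj stands out.", expected)
score_single_hop("The core namespace is the complex one.", expected)
```

The third call shows the weakness of this style of scoring: an answer that paraphrases instead of naming the path is marked wrong even if a human would accept it.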
Repo and Question Selection
Nine repositories chosen for language coverage, not for outcomes:
- ring (Clojure), flask (Python), express (JavaScript), fresh (TypeScript), ripgrep (Rust), fzf (Go), redis (C), guava (Java) from the repo manifest at resources/benchmark/repos.edn.
- The noumenon repo runs against itself as a sanity check, and is included in the table below.
Question selection followed developer relevance, not where Noumenon happens to be strong. The set has been kept stable so the numbers are comparable across releases. The redis and express rows are illustrative: large codebases the LLM already knows well from training data, where Noumenon's lift is smaller.
What's Counted
- Accuracy. Mean score across all 40 questions per repo, weighted equally between deterministic and LLM-judged.
- Layers. Three context conditions per question: :raw, :full, :embedded. Per-layer means feed the headline numbers; :embedded also enables the cost-efficiency comparison.
Speed and per-question token figures from earlier reports are not represented here. We dropped them rather than carry forward numbers we cannot vouch for under the current harness.
Known Flaws and Limitations
The point of this section is to be specific, not to wave a hand at uncertainty. Anyone reading the headline numbers should also read this list.
- The :full condition is given a pre-computed answer. Every question carries a :query-name whose result is, by design, close to the answer. The :full model is mostly summarizing a structured result. The :raw model is asked to find the same fact inside tens of thousands of characters of source. That is a real design choice in Noumenon's favor, not just a context-size difference.
- The question set was written to match existing queries. We have not benchmarked anything Noumenon doesn't already have a named query for. Coverage is biased toward questions the pipeline produces clean context for. Harder or less-mappable questions are absent because they didn't get written, not because the pipeline answers them well.
- Single run per repo, no variance bars. Run-to-run variance from API sampling, judge nondeterminism, and concurrent extraction order is not quantified. A repeat of any row would land at a slightly different number. :full is reliably ahead of :raw, but the per-repo gap should not be read past one significant figure.
- The judge is an LLM. LLM-as-judge has well-documented biases: agreement with itself, preference for fluent or verbose answers, drift across runs. The rubric ships with calibration examples but does not eliminate these. We do not currently have a human-graded sample to anchor against.
- The :raw truncation can drop the relevant file. When the raw source exceeds the API budget, the tail of git ls-tree is dropped. If the file containing the answer is late in that order, the :raw run is being scored on a context that genuinely cannot answer the question. We do not currently report what fraction of questions are affected per repo.
- The noumenon repo is in the test set. We benchmark our own codebase as one of the nine repos. The pipeline has been iterated on with that data in the loop, so that row is double-dipping. We keep it because the trend across the other eight repos is what matters, but the noumenon row is not a blind result.
- Cost numbers exclude the upfront analyze pass. When we report a per-question token number for the :full or :embedded condition, that's only the answerer's input. It does not include the LLM tokens spent during analyze and synthesize to build the graph. For a one-off question on a fresh repo, :raw is cheaper. The break-even depends on how many questions you ask per repo before the upfront cost amortizes.
- The :embedded advantage is partly trivial. TF-IDF context is much smaller than full-graph context, so per-token efficiency is partly a consequence of context size, not Noumenon being clever. The interesting result is that small context still captures about 69% of the :full mean (42.5% against 62.0%).
- Deterministic scoring is regex-based. Single-hop scoring checks for file paths or layer keywords appearing in the answer text. A correct answer that paraphrases or uses a synonym scores wrong. This biases against the :raw condition less than the :full condition (raw-source answers tend to quote filenames verbatim), but the noise is real either way.
- Garden-path risk in authorship. Both the question set and the named queries were written by the same person who wrote Noumenon. Questions probing aspects of code understanding the pipeline doesn't do well are likely under-represented just because they didn't get written. An external question set would be more telling and we don't have one.
- One model, one temperature. Numbers are from a single answerer model. The lift may shrink with a larger frontier model that needs less context-shaping help, or grow with a smaller one. We have not swept the model axis.
- Out of scope entirely: real agentic Ask. The interactive Ask agent in production seeds with TF-IDF, calls tools, and refines across turns. The benchmark deliberately holds that loop still so the comparison is apples-to-apples on context. It is silent on whether agentic Ask is better or worse than single-turn KG context.
We list these because the alternative is selling the result. The knowledge graph has real benefits that survive the caveats: structured facts compose, queries are reproducible, and per-question cost is predictable once the graph exists. Treat the headline numbers as directional and reproduce locally on your own repos before quoting them.
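The amortization caveat about the upfront analyze pass can be made concrete with a little arithmetic. If building the graph costs U tokens once, a raw-paste question costs r input tokens, and a graph-backed question costs k, then the graph is cheaper in total once N·r > U + N·k, that is, N > U / (r − k). The sketch below computes the smallest such N. Every number in the usage line is hypothetical and chosen only to show the shape of the calculation; no token figures are reported in this document, and the sketch ignores price differences between models or between input and output tokens.

```python
import math


def break_even_questions(upfront_tokens: int, raw_tokens_per_q: int,
                         kg_tokens_per_q: int) -> float:
    """Smallest question count at which graph-backed context uses fewer
    total tokens than raw paste, or infinity if it never does."""
    saving = raw_tokens_per_q - kg_tokens_per_q
    if saving <= 0:
        return math.inf  # no per-question saving, so no break-even
    # Strictly cheaper requires N > upfront / saving.
    return upfront_tokens // saving + 1


# Hypothetical inputs: 2,000,000 tokens to build the graph,
# 150,000 per raw question, 20,000 per graph-backed question.
n = break_even_questions(2_000_000, 150_000, 20_000)  # 16
```

With these made-up inputs, fifteen questions still favor raw paste (2,250,000 against 2,300,000 tokens) and the sixteenth tips the balance.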
Per-Repository Results
| Repository | Language | :raw | :full | :embedded | Δ vs raw |
|---|---|---|---|---|---|
| ripgrep | Rust | 41.9% | 75.0% | 47.2% | +33.1pp |
| ring | Clojure | 52.6% | 75.7% | 45.9% | +23.1pp |
| flask | Python | 44.9% | 65.4% | 46.1% | +20.5pp |
| fresh | TypeScript | 48.7% | 67.1% | 47.4% | +18.4pp |
| noumenon | Clojure | 44.7% | 56.8% | 35.1% | +12.1pp |
| guava | Java | 43.1% | 51.5% | 22.2% | +8.4pp |
| redis | C | 38.2% | 46.2% | 39.7% | +8.0pp |
| fzf | Go | 52.7% | 56.9% | 48.6% | +4.2pp |
| express | JavaScript | 62.2% | 63.5% | 50.0% | +1.3pp |
| Average | | 47.7% | 62.0% | 42.5% | +14.3pp |
Run date 2026-04-03, embed-tfidf branch. Reproduce with noum bench <repo> on a digested database.
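The Average row and the headline lift can be rechecked directly from the per-repo rows. The sketch below only transcribes the table and recomputes unweighted column means; the names `ROWS` and `column_means` are hypothetical, and nothing here comes from the benchmark harness itself.

```python
# (raw, full, embedded) accuracy in percent, copied from the table above.
ROWS = {
    "ripgrep": (41.9, 75.0, 47.2),
    "ring": (52.6, 75.7, 45.9),
    "flask": (44.9, 65.4, 46.1),
    "fresh": (48.7, 67.1, 47.4),
    "noumenon": (44.7, 56.8, 35.1),
    "guava": (43.1, 51.5, 22.2),
    "redis": (38.2, 46.2, 39.7),
    "fzf": (52.7, 56.9, 48.6),
    "express": (62.2, 63.5, 50.0),
}


def column_means(rows):
    """Unweighted mean of each condition across repos, to one decimal."""
    n = len(rows)
    return tuple(round(sum(r[i] for r in rows.values()) / n, 1) for i in range(3))


raw_mean, full_mean, embedded_mean = column_means(ROWS)  # 47.7, 62.0, 42.5
delta_pp = round(full_mean - raw_mean, 1)                # 14.3
embedded_share = round(embedded_mean / full_mean, 2)     # 0.69
```

Because each repo counts equally in these means, a single outlier row such as guava's :embedded score moves the average noticeably.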
Run the benchmark yourself with noum bench <repo>, retrieve a past run via noum results <run-id>, or compare two runs with noum compare <a> <b>. The MCP equivalents are noumenon_benchmark_run, noumenon_benchmark_results, and noumenon_benchmark_compare. Results land in the same Datomic graph as everything else.