Pipeline | Noumenon

Five stages turn a git repository into a queryable knowledge graph. Each stage is idempotent: re-running it does no new work when nothing has changed.

Import

Commits, files, authors, diffs, and directory structure parsed from git history into Datomic. No LLM calls. Fully reproducible: re-running on the same git state produces the same graph.

Enrich

Cross-file import and dependency edges resolved by parsing source code. Supports Clojure, Python, JS/TS, Rust, Java, C#, C/C++, Go, Elixir, and Erlang. No LLM calls. This is structural extraction, not interpretation.

Analyze

An LLM reads each file and extracts code segments (functions, classes, types) with complexity ratings, code smells, safety concerns, purity analysis, and architectural hints. Parallelized with configurable concurrency. The most expensive stage.
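As a rough illustration of "parallelized with configurable concurrency", the sketch below caps the number of in-flight per-file calls with a thread pool. It is not Noumenon's implementation: analyze_all, the concurrency parameter name, and fake_analyze (a stand-in for the LLM call) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_all(paths, analyze_file, concurrency=4):
    """Run analyze_file over paths with at most `concurrency` calls in flight."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(paths, pool.map(analyze_file, paths)))

def fake_analyze(path):
    # Hypothetical stand-in for the per-file LLM call.
    return len(path)

analyzed = analyze_all(["a.clj", "b.py", "c.rs"], fake_analyze, concurrency=2)
print(analyzed)
```

Raising the limit shortens wall-clock time until the LLM provider's rate limits become the bottleneck, which is presumably why the setting is configurable.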

Synthesize

Queries the graph to identify logical components, classify files into architectural layers, and map component dependencies. Uses hierarchical map-reduce so it scales to repos with thousands of files.
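The general shape of hierarchical map-reduce can be sketched as follows: condense items in small batches, then condense the batch results, and repeat until one result remains. This is a generic sketch, not Noumenon's code. The names map_reduce, fan_in, and merge are hypothetical, and merge stands in for whatever step condenses a batch, which this page does not specify.

```python
def map_reduce(items, summarize, fan_in=4):
    """Reduce items level by level; each summarize call sees at most fan_in inputs."""
    level = list(items)
    depth = 0
    while len(level) > 1:
        level = [summarize(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        depth += 1
    return (level[0] if level else None), depth

def merge(batch):
    # Hypothetical stand-in for the step that condenses a batch of summaries.
    return {"files": sum(b["files"] for b in batch)}

root, depth = map_reduce([{"files": 1} for _ in range(1000)], merge, fan_in=10)
print(root, depth)
```

Because no single call ever sees more than fan_in inputs, the per-call input stays bounded while the number of levels grows only logarithmically with repo size.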

Embed

Builds a TF-IDF vector index from file and component summaries. Powers semantic search via noumenon_search, and seeds the Ask agent with relevant files before any query runs. No LLM calls.
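To make the idea concrete, here is a minimal TF-IDF index with cosine-similarity search over a few made-up summaries. The tokenizer, the smoothed IDF formula, and every name in the block (build_index, search, the sample paths and summaries) are assumptions for illustration; the index behind noumenon_search may weight and tokenize differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(vec):
    """Scale a sparse vector to unit length (empty vectors stay empty)."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def build_index(docs):
    """docs: {name: summary}. Returns (idf, {name: unit tf-idf vector})."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    n = len(docs)
    # Smoothed IDF (an assumed variant): rarer terms get larger weights.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = {name: normalize({t: tf * idf[t] for t, tf in c.items()})
               for name, c in counts.items()}
    return idf, vectors

def search(query, idf, vectors, k=3):
    """Rank documents by cosine similarity to the query; drop zero scores."""
    q = normalize({t: tf * idf[t]
                   for t, tf in Counter(tokenize(query)).items() if t in idf})
    scored = [(sum(w * vec.get(t, 0.0) for t, w in q.items()), name)
              for name, vec in vectors.items()]
    return [name for score, name in sorted(scored, reverse=True)[:k] if score > 0]

# Made-up summaries, standing in for file and component summaries.
summaries = {
    "src/auth/token.clj": "Issues and verifies signed session tokens",
    "src/db/schema.clj": "Datomic schema for commits, files, and authors",
    "src/http/routes.clj": "HTTP routes and request handlers",
}
idf, vectors = build_index(summaries)
print(search("session tokens", idf, vectors))
```

Everything here is plain arithmetic over term counts, which is consistent with the stage needing no LLM calls.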

Scoping the Work

Pipeline commands accept selectors so you can process a subset of the repo without redoing everything. This is useful for big monorepos and for tight feedback loops while you tune prompts.

--path src/foo limits to a directory.
--include "src/**/*.clj" is a glob whitelist.
--exclude "**/*_test.clj" is a glob blacklist.
--lang clojure restricts to one language.

Selectors apply to analyze, enrich, update, and digest.
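One plausible way the four selectors could combine is sketched below: a file is kept only if it passes every selector that was given. This is an assumption, not documented behavior. The function selected, its parameter names, and the LANG_BY_EXT table are hypothetical. Python's fnmatch also treats ** like * and lets * cross directory separators, so edge cases may differ from the tool's own glob matching.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical extension-to-language table, for illustration only.
LANG_BY_EXT = {".clj": "clojure", ".py": "python", ".rs": "rust", ".go": "go"}

def selected(path, *, under=None, include=(), exclude=(), lang=None):
    """Return True if a repo-relative path passes every selector given."""
    p = PurePosixPath(path)
    if under is not None and PurePosixPath(under) not in p.parents:
        return False  # --path: must live below the directory
    if include and not any(fnmatchcase(path, g) for g in include):
        return False  # --include: must match at least one glob
    if any(fnmatchcase(path, g) for g in exclude):
        return False  # --exclude: must match no glob
    if lang is not None and LANG_BY_EXT.get(p.suffix) != lang:
        return False  # --lang: must be the requested language
    return True

files = [
    "src/foo/core.clj",
    "src/foo/core_test.clj",
    "src/bar/util.py",
    "docs/readme.md",
]
picked = [f for f in files
          if selected(f, under="src", include=["src/**/*.clj"],
                      exclude=["**/*_test.clj"], lang="clojure")]
print(picked)
```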

Promotion (Content-Addressed Cache)

Experimental: interfaces may change between releases. Before analyze calls the LLM on a file, it checks the current database for a previously analyzed file whose :file/blob-sha matches and whose analysis was produced under the same :prov/prompt-hash and :prov/model-version. On a hit, the donor's analysis attributes are copied onto the recipient with :prov/promoted-from lineage, and no LLM call is made. The result map reports files-analyzed alongside files-promoted so the cache hit rate is visible.

Pass --no-promote to bypass the cache and always invoke the LLM. Cross-DB promotion (a delta promoting from trunk) records the donor's db-name in :prov/promoted-from-db-name. Because the foreign tx-id is meaningless in the recipient DB, the ref attr is omitted and the db-name acts as the breadcrumb.
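The lookup described in this section amounts to a cache keyed on content hash, prompt hash, and model version. The sketch below models it with an in-memory dict instead of Datomic. The function names, the call_llm callback, and the promote flag (mirroring --no-promote) are hypothetical, and it assumes files-analyzed and files-promoted are disjoint counts. Only blob_sha follows a known rule: git hashes a blob as SHA-1 over a "blob <size>" header, a NUL byte, and the content.

```python
import hashlib

def blob_sha(content: bytes) -> str:
    """Git-style blob hash: SHA-1 over the blob header plus the content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def analyze(files, cache, prompt_hash, model_version, call_llm, promote=True):
    """files: {path: bytes}. cache: {(sha, prompt, model): (donor, analysis)}."""
    results = {}
    stats = {"files-analyzed": 0, "files-promoted": 0}
    for path, content in files.items():
        key = (blob_sha(content), prompt_hash, model_version)
        hit = cache.get(key) if promote else None
        if hit is not None:
            donor, analysis = hit
            # Copy the donor's analysis and record where it came from.
            results[path] = {**analysis, "promoted-from": donor}
            stats["files-promoted"] += 1
        else:
            analysis = call_llm(path, content)
            cache[key] = (path, analysis)
            results[path] = analysis
            stats["files-analyzed"] += 1
    return results, stats

calls = []
def fake_llm(path, content):
    # Hypothetical stand-in for the real LLM call; records each invocation.
    calls.append(path)
    return {"summary": f"{len(content)} bytes"}

files = {
    "a/util.clj": b"(ns util)",
    "b/util.clj": b"(ns util)",   # same content as a/util.clj
    "c/core.clj": b"(ns core)",
}
results, stats = analyze(files, {}, "prompt-v1", "model-x", fake_llm)
print(stats)
```

Because the prompt hash and model version are part of the key, changing either one turns every lookup into a miss, so a promoted analysis is never one that a different prompt or model produced.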

Prompt and Model Drift

When you change a prompt template or switch LLM models, prior analysis results are still valid until you decide otherwise. Drift is advisory by default. Noumenon logs which files were analyzed with a different prompt or model; pass --reanalyze prompt-changed, --reanalyze model-changed, --reanalyze stale, or --reanalyze all to refresh.
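The selection logic for three of the --reanalyze modes might look like the sketch below, which compares the prompt hash and model version stored with each analysis against the current ones. The record layout and the names drift and to_reanalyze are hypothetical. The stale mode is left out because this page does not define it.

```python
def drift(record, current_prompt, current_model):
    """Return the set of drift reasons for one stored analysis record."""
    reasons = set()
    if record["prompt-hash"] != current_prompt:
        reasons.add("prompt-changed")
    if record["model-version"] != current_model:
        reasons.add("model-changed")
    return reasons

def to_reanalyze(records, mode, current_prompt, current_model):
    """mode: None (advisory only), 'prompt-changed', 'model-changed', or 'all'."""
    if mode is None:
        return []               # drift is only logged, nothing is refreshed
    if mode == "all":
        return sorted(records)  # refresh everything regardless of drift
    return sorted(path for path, rec in records.items()
                  if mode in drift(rec, current_prompt, current_model))

# Made-up records; the current prompt is "p2" and the current model is "m2".
records = {
    "src/a.clj": {"prompt-hash": "p1", "model-version": "m1"},
    "src/b.clj": {"prompt-hash": "p2", "model-version": "m1"},
    "src/c.clj": {"prompt-hash": "p2", "model-version": "m2"},
}
print(to_reanalyze(records, "prompt-changed", "p2", "m2"))
print(to_reanalyze(records, "model-changed", "p2", "m2"))
```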

After embed, the graph is ready to query. The iterative commands noum ask and noum query, as well as the MCP server, all read from the same Datomic database. noum introspect uses the graph to improve itself; see Introspect.