
Indexing a corpus

  • Corpus = a folder: a repo, docs tree, mono-repo subtree, or any folder you point at.

  • Persisted in Postgres: chunks, embeddings, and sparse search live in PostgreSQL (pgvector + FTS).

  • Optional graph context: Neo4j can store additional context to improve cross-file retrieval.


Use stable corpus ids

Use lowercase slugs like myapp, docs, customer-a. Avoid spaces and special characters.
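The naming guidance can be encoded in a small shell helper. This is purely illustrative (is_valid_corpus_id is our own function, and the server's actual validation rules may differ):

```shell
# is_valid_corpus_id ID -> succeeds only for lowercase slugs:
# lowercase letters and digits, optionally separated by single dashes.
# (Illustrative helper, not part of the API.)
is_valid_corpus_id() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'
}

is_valid_corpus_id "customer-a" && echo "ok"        # prints "ok"
is_valid_corpus_id "My App"     || echo "rejected"  # spaces and uppercase fail
```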

What indexing does

Indexing turns a folder into a set of retrieval primitives:

  • Chunks (text/code spans) with file paths and line ranges
  • Embeddings (vector search) stored in Postgres/pgvector
  • Sparse index (Postgres FTS/BM25-style scoring)
  • Graph context (optional) stored in Neo4j
The pipeline, in Mermaid notation:

flowchart LR
  A["Folder"] --> L["Load"]
  L --> C["Chunk"]
  C --> E["Embed"]
  C --> S["Sparse index"]
  E --> P["Postgres (pgvector)"]
  S --> P
  C --> G["Neo4j (optional)"]

Before you index: estimate size/time (optional)

Use the estimate endpoint to catch “oops, this repo is huge” early:

curl -sS -X POST "http://127.0.0.1:8012/api/index/estimate" \
  -H "Content-Type: application/json" \
  -d '{
    "corpus_id": "demo",
    "repo_path": "/absolute/path/to/your/project",
    "force_reindex": false
  }' | jq .

Estimates are heuristics

The estimate is intentionally rough (machine speed, provider latency, DB IO, and corpus makeup all matter). Use it for sizing, not for SLAs.

Start indexing

curl -sS -X POST "http://127.0.0.1:8012/api/index" \
  -H "Content-Type: application/json" \
  -d '{
    "corpus_id": "demo",
    "repo_path": "/absolute/path/to/your/project",
    "force_reindex": false
  }' | jq .

Monitor progress

curl -sS "http://127.0.0.1:8012/api/index/demo/status" | jq .
curl -sS "http://127.0.0.1:8012/api/index/demo/stats" | jq .

In the UI, this typically maps to RAG → Indexing and Dashboard → Storage.
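Rather than polling by hand, a small loop can watch the status endpoint until indexing stops. This is a sketch: it assumes the status response has a top-level "status" field whose terminal values include "complete" and "failed" (field name and values are assumptions about the response shape):

```shell
BASE_URL="${BASE_URL:-http://127.0.0.1:8012}"

# is_terminal STATUS -> succeeds once indexing has stopped.
# "complete" and "failed" are assumed terminal values.
is_terminal() {
  case "$1" in
    complete|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_index CORPUS -> print the status every 5s until terminal.
wait_for_index() {
  while :; do
    status=$(curl -sS "${BASE_URL}/api/index/$1/status" | jq -r '.status')
    echo "status: ${status}"
    is_terminal "${status}" && break
    sleep 5
  done
}

# wait_for_index demo
```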

Reindexing safely

Common reasons to reindex:

  • you changed chunking rules
  • you changed embedding model/dimensions
  • you changed inclusion/exclusion patterns
  • you upgraded graph building logic

Recommended workflow:

  • Confirm the corpus is not currently indexing (/api/index/<corpus>/status)
  • Decide whether you need a full rebuild (force_reindex=true)
  • Start indexing and monitor
  • Validate with a few known-good queries after completion
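The workflow above can be sketched as shell. The endpoints match the curl examples earlier; build_index_payload is our own helper, and "indexing" as the busy status value is an assumption about the response shape:

```shell
BASE_URL="${BASE_URL:-http://127.0.0.1:8012}"

# build_index_payload CORPUS PATH FORCE -> JSON body for /api/index.
# (Illustrative helper; paths containing quotes would need real JSON escaping.)
build_index_payload() {
  printf '{"corpus_id":"%s","repo_path":"%s","force_reindex":%s}' "$1" "$2" "$3"
}

# reindex CORPUS PATH -> full rebuild, but only if the corpus is idle.
reindex() {
  # "indexing" as the in-progress status value is an assumption.
  current=$(curl -sS "${BASE_URL}/api/index/$1/status" | jq -r '.status')
  if [ "${current}" = "indexing" ]; then
    echo "corpus $1 is still indexing; not starting a rebuild" >&2
    return 1
  fi
  build_index_payload "$1" "$2" true |
    curl -sS -X POST "${BASE_URL}/api/index" \
      -H "Content-Type: application/json" -d @-
}

# reindex demo /absolute/path/to/your/project
```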

Embeddings are not always compatible

If you change embedding dimensions or switch providers/models, you usually need a full reindex. Mixing incompatible embeddings can silently degrade retrieval quality.

The knobs that matter (where to tune)

You tune indexing through config (Pydantic-first). The knobs most likely to matter:

  • Better recall: chunk size/overlap, candidate top-k, include more file types
  • Better precision: tighter chunking, better reranking, raise confidence gates
  • Faster indexing: larger batches, skip graph build, skip expensive summarization
  • Lower cost: deterministic embeddings, smaller models, disable optional stages

Troubleshooting indexing

Indexing never reaches complete
  • Check /api/ready first (DB connectivity).
  • Look at backend logs (in UI: Infrastructure → Docker or terminal output).
  • If you see repeated failures on one file, temporarily exclude that file type and re-run.
Indexing is slow
  • Large corpora + cloud embeddings will be bound by provider latency.
  • On Apple Silicon, local/MLX paths may be faster for some stages.
  • Disable optional graph stages until you have baseline search working.
I’m missing chunks / the index looks empty
  • Verify the repo_path exists inside the environment that’s indexing (host vs container path mismatch is the classic failure).
  • Confirm you’re querying the correct corpus_id (corpora are isolated).
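A quick way to rule out the host-vs-container mismatch is to check the path in both environments. check_repo_path is our own helper, and "backend" is an assumed compose service name; adjust both to your setup:

```shell
# check_repo_path PATH -> report whether the path exists *here*.
check_repo_path() {
  if [ -d "$1" ]; then
    echo "ok: $1 exists in this environment"
  else
    echo "missing: $1 (host vs container path mismatch?)" >&2
    return 1
  fi
}

# On the host:
# check_repo_path /absolute/path/to/your/project
# Inside the container, if the API runs in Docker
# ("backend" is an assumed service name):
# docker compose exec backend sh -c 'ls -d /absolute/path/to/your/project'
```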