Indexing a corpus
- Corpus = a folder: A corpus can be a repo, docs tree, mono-repo subtree, or any folder you point at.
- Persisted in Postgres: Chunks, embeddings, and sparse search live in PostgreSQL (pgvector + FTS).
- Optional graph context: Neo4j can store additional context to improve cross-file retrieval.
Use stable corpus ids
Use lowercase slugs like myapp, docs, customer-a. Avoid spaces and special characters.
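If corpus ids come from user input or folder names, it can help to check them before calling the API. The sketch below is one possible reading of the rule above; the exact pattern (lowercase letters and digits separated by single hyphens) and both function names are assumptions, not something the server is known to enforce.

```python
import re

# Assumed rule: lowercase letters/digits, optionally separated by single hyphens.
_CORPUS_ID_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_corpus_id(corpus_id: str) -> bool:
    """Return True if corpus_id looks like a stable lowercase slug."""
    return _CORPUS_ID_RE.fullmatch(corpus_id) is not None


def slugify_corpus_id(name: str) -> str:
    """Turn a free-form name into a slug such as 'customer-a'."""
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a corpus id from {name!r}")
    return slug
```

For example, `slugify_corpus_id("Customer A!")` yields `customer-a`, which then passes `is_valid_corpus_id`.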
What indexing does
Indexing turns a folder into a set of retrieval primitives:
- Chunks (text/code spans) with file paths and line ranges
- Embeddings (vector search) stored in Postgres/pgvector
- Sparse index (Postgres FTS/BM25-style scoring)
- Graph context (optional) stored in Neo4j
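To make the first primitive concrete, here is a minimal sketch of a chunk that carries its file path and line range. It uses a naive fixed-size line window with overlap; this is an illustration only, not the chunker this project ships, and the names `Chunk`, `chunk_by_lines`, `max_lines`, and `overlap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    file_path: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_by_lines(file_path: str, text: str, max_lines: int = 40, overlap: int = 5) -> list:
    """Split text into overlapping windows of at most max_lines lines."""
    if max_lines <= 0 or not 0 <= overlap < max_lines:
        raise ValueError("need max_lines > 0 and 0 <= overlap < max_lines")
    lines = text.splitlines()
    step = max_lines - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        chunks.append(Chunk(file_path, start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window reached the end of the file
        start += step
    return chunks
```

Larger windows and more overlap tend to favor recall; tighter windows tend to favor precision, at the cost of more chunks to embed.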
flowchart LR
A["Folder"] --> L["Load"]
L --> C["Chunk"]
C --> E["Embed"]
C --> S["Sparse index"]
E --> P["Postgres (pgvector)"]
S --> P
C --> G["Neo4j (optional)"]
Before you index: estimate size/time (optional)
Use the estimate endpoint to catch “oops, this repo is huge” early:
curl -sS -X POST "http://127.0.0.1:8012/api/index/estimate" \
-H "Content-Type: application/json" \
-d '{
"corpus_id": "demo",
"repo_path": "/absolute/path/to/your/project",
"force_reindex": false
}' | jq .
Estimates are heuristics
The estimate is intentionally rough (machine speed, provider latency, DB IO, and corpus makeup all matter). Use it for sizing, not for SLAs.
Start indexing
curl -sS -X POST "http://127.0.0.1:8012/api/index" \
-H "Content-Type: application/json" \
-d '{
"corpus_id": "demo",
"repo_path": "/absolute/path/to/your/project",
"force_reindex": false
}' | jq .
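The same request can be issued from a script. The sketch below builds the JSON body used in the curl examples with only the standard library; `build_index_request` and `send` are hypothetical helper names, and the absolute-path check is an added convenience rather than documented server behavior.

```python
import json
import os.path
import urllib.request

BASE_URL = "http://127.0.0.1:8012"  # same host/port as the curl examples


def build_index_request(corpus_id, repo_path, *, force_reindex=False,
                        estimate=False, base_url=BASE_URL):
    """Build (but do not send) a POST to /api/index or /api/index/estimate."""
    if not os.path.isabs(repo_path):
        raise ValueError(f"repo_path must be absolute, got {repo_path!r}")
    endpoint = "/api/index/estimate" if estimate else "/api/index"
    body = json.dumps({
        "corpus_id": corpus_id,
        "repo_path": repo_path,
        "force_reindex": force_reindex,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `send(build_index_request("demo", "/absolute/path/to/your/project", estimate=True))` mirrors the estimate call; dropping `estimate=True` starts the actual indexing run.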
Monitor progress
curl -sS "http://127.0.0.1:8012/api/index/demo/status" | jq .
curl -sS "http://127.0.0.1:8012/api/index/demo/stats" | jq .
In the UI, this typically maps to RAG → Indexing and Dashboard → Storage.
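For unattended runs, the status endpoint can be polled until indexing finishes. The response schema is not documented here, so this sketch takes both the fetch function and the "is it done?" check as parameters; `wait_for_index` and its arguments are hypothetical names.

```python
import time


def wait_for_index(fetch_status, is_done, *, interval_s=5.0, timeout_s=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until is_done(status) is true or the timeout passes.

    fetch_status: callable returning the decoded JSON from
                  GET /api/index/<corpus>/status.
    is_done:      callable deciding whether that payload means "finished";
                  the field it inspects is up to the caller (an assumption here).
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"indexing did not finish within {timeout_s} seconds")
        sleep(interval_s)
```

Passing `sleep` and `clock` in makes the loop easy to test without waiting; in normal use the defaults are enough.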
Reindexing safely
Common reasons to reindex:
- you changed chunking rules
- you changed embedding model/dimensions
- you changed inclusion/exclusion patterns
- you upgraded graph building logic
Recommended workflow:
- Confirm the corpus is not currently indexing (/api/index/<corpus>/status)
- Decide whether you need a full rebuild (force_reindex=true)
- Start indexing and monitor
- Validate with a few known-good queries after completion
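The last step, validating with known-good queries, can be scripted as a small smoke test. The search API is not described in this section, so the sketch assumes a caller-supplied `search` function that returns file paths in rank order; that adapter and the name `failed_smoke_queries` are hypothetical.

```python
def failed_smoke_queries(search, expectations, top_k=5):
    """Return the queries whose expected file is missing from the top_k hits.

    search:       callable(query) -> iterable of file paths, best match first
                  (a thin adapter over whatever search endpoint you use).
    expectations: mapping of query -> file path that should be retrieved.
    """
    failures = []
    for query, expected_path in expectations.items():
        hits = list(search(query))[:top_k]
        if expected_path not in hits:
            failures.append(query)
    return failures
```

An empty result means every known-good query still retrieves its expected file after the reindex.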
Embeddings are not always compatible
If you change embedding dimensions or switch providers/models, you usually need a full reindex. Mixing incompatible embeddings can silently degrade retrieval quality.
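One way to avoid mixing embeddings by accident is to record which embedding settings a corpus was indexed with and compare them to the current config before each run. This is a sketch under that assumption; `EmbeddingSpec` and `embedding_changes` are hypothetical names, not part of the project's config model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    provider: str
    model: str
    dimensions: int


def embedding_changes(indexed_with: EmbeddingSpec, configured: EmbeddingSpec) -> list:
    """List the fields that differ between the stored and current settings."""
    return [
        field
        for field in ("provider", "model", "dimensions")
        if getattr(indexed_with, field) != getattr(configured, field)
    ]
```

A non-empty result is a signal to set `force_reindex` to `true` rather than adding new vectors next to old ones.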
The knobs that matter (where to tune)
You tune indexing through config (Pydantic-first). For deep reference, see:
Here’s the short list of “most likely to matter” knobs:
| Goal | Knobs to look at |
|---|---|
| Better recall | chunk size/overlap, candidate top-k, include more file types |
| Better precision | tighter chunking, better reranking, raise confidence gates |
| Faster indexing | larger batches, skip graph build, skip expensive summarization |
| Lower cost | deterministic embeddings, smaller models, disable optional stages |
Troubleshooting indexing
Indexing never reaches complete
- Check /api/ready first (DB connectivity).
- Look at backend logs (in the UI: Infrastructure → Docker, or terminal output).
- If you see repeated failures on one file, temporarily exclude that file type and re-run.
Indexing is slow
- Large corpora + cloud embeddings will be bound by provider latency.
- On Apple Silicon, local/MLX paths may be faster for some stages.
- Disable optional graph stages until you have baseline search working.
I’m missing chunks / the index looks empty
- Verify the repo_path exists inside the environment that’s indexing (host vs container path mismatch is the classic failure).
- Confirm you’re querying the correct corpus_id (corpora are isolated).
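When the indexer runs in a container, the path you send must be the path as the container sees it. The sketch below translates a host path using a mapping of bind mounts; the mapping itself is something you would copy from your own Docker setup, and `to_container_path` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def to_container_path(host_path: str, mounts: dict) -> str:
    """Map a host path to its container path using host_dir -> container_dir mounts."""
    path = PurePosixPath(host_path)
    # Try the most specific (deepest) host directory first.
    for host_dir in sorted(mounts, key=lambda d: len(PurePosixPath(d).parts), reverse=True):
        try:
            relative = path.relative_to(host_dir)
        except ValueError:
            continue  # host_path is not under this mount
        return str(PurePosixPath(mounts[host_dir]) / relative)
    raise ValueError(f"{host_path!r} is not under any mounted directory")
```

A `ValueError` here usually means the folder was never mounted into the container, which matches the empty-index symptom described above.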