
Indexing a corpus

  • Corpus = a folder: a repo, docs tree, mono-repo subtree, or any folder you point at.

  • Persisted in Postgres: chunks, embeddings, and sparse search live in PostgreSQL (pgvector + FTS).

  • Optional graph context: Neo4j can store additional context to improve cross-file retrieval.


Use stable corpus ids

Use lowercase slugs like myapp, docs, customer-a. Avoid spaces and special characters.
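The naming guidance can be encoded in a small shell helper. This is purely illustrative (is_valid_corpus_id is our own function, and the server's actual validation rules may differ):

```shell
# is_valid_corpus_id ID -> succeeds only for lowercase slugs:
# lowercase letters and digits, optionally separated by single dashes.
# (Illustrative helper, not part of the API.)
is_valid_corpus_id() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'
}

is_valid_corpus_id "customer-a" && echo "ok"        # prints "ok"
is_valid_corpus_id "My App"     || echo "rejected"  # spaces and uppercase fail
```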

What indexing does

Indexing turns a folder into a set of retrieval primitives:

  • Chunks (text/code spans) with file paths and line ranges
  • Embeddings (vector search) stored in Postgres/pgvector
  • Sparse index (Postgres FTS/BM25-style scoring)
  • Graph context (optional) stored in Neo4j
The pipeline, in Mermaid notation:

flowchart LR
  A["Folder"] --> L["Load"]
  L --> C["Chunk"]
  C --> E["Embed"]
  C --> S["Sparse index"]
  E --> P["Postgres (pgvector)"]
  S --> P
  C --> G["Neo4j (optional)"]

Before you index: estimate size/time (optional)

Use the estimate endpoint to catch “oops, this repo is huge” early:

curl -sS -X POST "http://127.0.0.1:8012/api/index/estimate" \
  -H "Content-Type: application/json" \
  -d '{
    "corpus_id": "demo",
    "repo_path": "/absolute/path/to/your/project",
    "force_reindex": false
  }' | jq .

Estimates are heuristics

The estimate is intentionally rough (machine speed, provider latency, DB IO, and corpus makeup all matter). Use it for sizing, not for SLAs.

Start indexing

curl -sS -X POST "http://127.0.0.1:8012/api/index" \
  -H "Content-Type: application/json" \
  -d '{
    "corpus_id": "demo",
    "repo_path": "/absolute/path/to/your/project",
    "force_reindex": false
  }' | jq .

Monitor progress

curl -sS "http://127.0.0.1:8012/api/index/demo/status" | jq .
curl -sS "http://127.0.0.1:8012/api/index/demo/stats" | jq .

In the UI, this typically maps to RAG → Indexing and Dashboard → Storage.
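Rather than polling by hand, a small loop can watch the status endpoint until indexing stops. This is a sketch: it assumes the status response has a top-level "status" field whose terminal values include "complete" and "failed" (field name and values are assumptions about the response shape):

```shell
BASE_URL="${BASE_URL:-http://127.0.0.1:8012}"

# is_terminal STATUS -> succeeds once indexing has stopped.
# "complete" and "failed" are assumed terminal values.
is_terminal() {
  case "$1" in
    complete|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_index CORPUS -> print the status every 5s until terminal.
wait_for_index() {
  while :; do
    status=$(curl -sS "${BASE_URL}/api/index/$1/status" | jq -r '.status')
    echo "status: ${status}"
    is_terminal "${status}" && break
    sleep 5
  done
}

# wait_for_index demo
```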

Reindexing safely

Common reasons to reindex:

  • you changed chunking rules
  • you changed embedding model/dimensions
  • you changed inclusion/exclusion patterns
  • you upgraded graph building logic

Recommended workflow:

  • Confirm the corpus is not currently indexing (/api/index/<corpus>/status)
  • Decide whether you need a full rebuild (force_reindex=true)
  • Start indexing and monitor
  • Validate with a few known-good queries after completion
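The workflow above can be sketched as shell. The endpoints match the curl examples earlier; build_index_payload is our own helper, and "indexing" as the busy status value is an assumption about the response shape:

```shell
BASE_URL="${BASE_URL:-http://127.0.0.1:8012}"

# build_index_payload CORPUS PATH FORCE -> JSON body for /api/index.
# (Illustrative helper; paths containing quotes would need real JSON escaping.)
build_index_payload() {
  printf '{"corpus_id":"%s","repo_path":"%s","force_reindex":%s}' "$1" "$2" "$3"
}

# reindex CORPUS PATH -> full rebuild, but only if the corpus is idle.
reindex() {
  # "indexing" as the in-progress status value is an assumption.
  current=$(curl -sS "${BASE_URL}/api/index/$1/status" | jq -r '.status')
  if [ "${current}" = "indexing" ]; then
    echo "corpus $1 is still indexing; not starting a rebuild" >&2
    return 1
  fi
  build_index_payload "$1" "$2" true |
    curl -sS -X POST "${BASE_URL}/api/index" \
      -H "Content-Type: application/json" -d @-
}

# reindex demo /absolute/path/to/your/project
```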

Embeddings are not always compatible

If you change embedding dimensions or switch providers/models, you usually need a full reindex. Mixing incompatible embeddings can silently degrade retrieval quality.

The knobs that matter (where to tune)

You tune indexing through config (Pydantic-first). The knobs most likely to matter:

  • Better recall: chunk size/overlap, candidate top-k, include more file types
  • Better precision: tighter chunking, better reranking, raise confidence gates
  • Faster indexing: larger batches, skip graph build, skip expensive summarization
  • Lower cost: deterministic embeddings, smaller models, disable optional stages

Troubleshooting indexing

Indexing never reaches complete
  • Check /api/ready first (DB connectivity).
  • Look at backend logs (in UI: Infrastructure → Docker or terminal output).
  • If you see repeated failures on one file, temporarily exclude that file type and re-run.
Indexing is slow
  • Large corpora + cloud embeddings will be bound by provider latency.
  • On Apple Silicon, local/MLX paths may be faster for some stages.
  • Disable optional graph stages until you have baseline search working.
I’m missing chunks / the index looks empty
  • Verify the repo_path exists inside the environment that’s indexing (host vs container path mismatch is the classic failure).
  • Confirm you’re querying the correct corpus_id (corpora are isolated).
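A quick way to rule out the host-vs-container mismatch is to check the path in both environments. check_repo_path is our own helper, and "backend" is an assumed compose service name; adjust both to your setup:

```shell
# check_repo_path PATH -> report whether the path exists *here*.
check_repo_path() {
  if [ -d "$1" ]; then
    echo "ok: $1 exists in this environment"
  else
    echo "missing: $1 (host vs container path mismatch?)" >&2
    return 1
  fi
}

# On the host:
# check_repo_path /absolute/path/to/your/project
# Inside the container, if the API runs in Docker
# ("backend" is an assumed service name):
# docker compose exec backend sh -c 'ls -d /absolute/path/to/your/project'
```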