# Indexing Pipeline

- **Loader**: Git-aware discovery honoring `.gitignore` with root-relative patterns.
- **Chunker**: Fixed, AST-aware, or hybrid chunking strategies with line attribution.
- **Embedder**: Deterministic local or provider-backed embeddings configured in Pydantic.
- **Chunk Summaries**: Optional LLM-generated `chunk_summaries` to improve sparse search.
- **Graph Builder**: Entity/relationship extraction and Neo4j persistence.
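The fixed-size chunking strategy with overlap and line attribution can be sketched roughly as follows (an illustrative stand-alone function; the function and field names are assumptions, not the project's actual API):

```python
def chunk_lines(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split text into fixed-size character chunks, recording start/end lines."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Line attribution: count newlines before each boundary (1-indexed).
        chunks.append({
            "text": text[start:end],
            "start_line": text.count("\n", 0, start) + 1,
            "end_line": text.count("\n", 0, end) + 1,
        })
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks
```

The AST-aware and hybrid strategies would replace the fixed character boundaries with syntax-node boundaries, but the line-attribution bookkeeping stays the same.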
## Idempotent Indexing

Use `force_reindex=false` for incremental updates. The indexer skips unchanged files using mtime/hash checks where available.
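The skip check amounts to a cheap mtime comparison with a content-hash fallback, along these lines (a sketch; the helper name and stored fields are illustrative, not the indexer's internals):

```python
import hashlib
import os
from typing import Optional

def needs_reindex(path: str, recorded_mtime: Optional[float],
                  recorded_sha: Optional[str]) -> bool:
    """Return True if a file changed since it was last indexed.

    Fast path: identical mtime means unchanged. Otherwise hash the
    content, which catches files touched without actual edits.
    """
    mtime = os.path.getmtime(path)
    if recorded_mtime is not None and mtime == recorded_mtime:
        return False  # unchanged fast path, no read needed
    with open(path, "rb") as f:
        sha = hashlib.sha256(f.read()).hexdigest()
    return sha != recorded_sha
```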
## Storage Layout

Chunks, embeddings, and FTS indexes live in PostgreSQL; graph artifacts live in Neo4j. Sizes are summarized via the dashboard endpoints.
## Large Corpora

Configure the Neo4j heap and page cache via environment variables for multi-million-edge graphs, and monitor PostgreSQL disk growth for pgvector indexes.
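With the official Neo4j Docker image, memory settings map onto environment variables like the following (the sizes are example values only; tune them to your graph and available RAM):

```bash
NEO4J_server_memory_heap_initial__size=4G
NEO4J_server_memory_heap_max__size=8G
NEO4J_server_memory_pagecache_size=8G
```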
## Pipeline Flow

```mermaid
flowchart LR
    L["FileLoader"] --> C["Chunker"]
    C --> E["Embedder"]
    E --> P["PostgreSQL"]
    C --> S["ChunkSummarizer"]
    S --> P
    C --> GB["GraphBuilder"]
    GB --> N["Neo4j"]
```

## Chunking & Embedding Controls (Selected)
| Section | Field | Default | Notes |
|---|---|---|---|
| chunking | chunk_size | 1000 | Target chars per chunk |
| chunking | chunk_overlap | 200 | Overlap for continuity |
| chunking | chunking_strategy | ast | ast \| greedy \| hybrid |
| chunking | max_chunk_tokens | 8000 | Split recursively if larger |
| embedding | embedding_type | openai | Provider selector |
| embedding | embedding_model | text-embedding-3-large | Model id |
| embedding | embedding_dim | 3072 | Must match model outputs |
| indexing | bm25_tokenizer | stemmer | Tokenizer for FTS |
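The fields above map onto a settings model along these lines (a minimal sketch using stdlib dataclasses for self-containment; the project's actual Pydantic model may differ):

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    chunk_size: int = 1000          # target chars per chunk
    chunk_overlap: int = 200        # overlap for continuity
    chunking_strategy: str = "ast"  # ast | greedy | hybrid
    max_chunk_tokens: int = 8000    # split recursively if larger

@dataclass
class EmbeddingConfig:
    embedding_type: str = "openai"
    embedding_model: str = "text-embedding-3-large"
    embedding_dim: int = 3072       # must match the model's output width
```

A mismatched `embedding_dim` is worth validating at startup, since pgvector columns are fixed-width and a wrong value fails only at insert time.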
## Start Indexing via API (Annotated)

```python
import httpx

base = "http://127.0.0.1:8012/api"
req = {
    "corpus_id": "tribrid",  # (1)!
    "repo_path": "/work/src/tribrid",
    "force_reindex": False,
}
httpx.post(f"{base}/index", json=req).raise_for_status()  # (2)!
status = httpx.get(f"{base}/index/tribrid/status").json()
print(status["status"], status.get("progress"))  # (3)!
```

1. Create/refresh a specific corpus.
2. Start indexing.
3. Poll progress.
```bash
BASE=http://127.0.0.1:8012/api
curl -sS -X POST "$BASE/index" -H 'Content-Type: application/json' -d '{
  "corpus_id":"tribrid","repo_path":"/work/src/tribrid","force_reindex":false
}'
curl -sS "$BASE/index/tribrid/status" | jq .
```
```typescript
import type { IndexRequest, IndexStatus } from "./web/src/types/generated";

async function reindex(path: string) {
  const req: IndexRequest = { corpus_id: "tribrid", repo_path: path, force_reindex: false };
  await fetch("/api/index", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  }); // (2)!
  const status: IndexStatus = await (await fetch("/api/index/tribrid/status")).json(); // (3)!
  console.log(status.status, status.progress);
}
```
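A single status check only snapshots progress; a poll-until-done helper can wrap the status endpoint like this (a sketch; the terminal state names `"running"`/`"pending"` are assumptions, so check your deployment's actual values):

```python
import time

def wait_for_index(fetch_status, poll_s: float = 2.0, timeout_s: float = 600.0):
    """Poll until indexing leaves a running state.

    fetch_status: zero-arg callable returning the /index/{id}/status JSON dict.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] not in ("running", "pending"):  # assumed state names
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"indexing did not finish within {timeout_s}s")
```

With httpx this would be called as `wait_for_index(lambda: httpx.get(f"{base}/index/tribrid/status").json())`.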
## Graph Indexing (Neo4j)

| Field | Default | Meaning |
|---|---|---|
| graph_indexing.enabled | true | Enable graph building during indexing |
| graph_indexing.build_lexical_graph | true | Add Chunk/NEXT_CHUNK structure |
| graph_indexing.store_chunk_embeddings | true | Store chunk vectors for Neo4j vector search |
| graph_indexing.semantic_kg_enabled | false | Extract concept relations (heuristic or LLM) |
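The lexical graph structure (Chunk nodes linked in file order by NEXT_CHUNK edges) can be sketched as Cypher generated per file (illustrative only; the node label and property names are assumptions about the schema):

```python
def lexical_graph_cypher(file_path: str, chunk_ids: list) -> list:
    """Build MERGE statements for one file's Chunk nodes and NEXT_CHUNK chain."""
    # One Chunk node per chunk, carrying its source file and sequence position.
    stmts = [
        f"MERGE (:Chunk {{id: '{cid}', file: '{file_path}', seq: {i}}})"
        for i, cid in enumerate(chunk_ids)
    ]
    # Link consecutive chunks so graph traversal can walk a file in order.
    stmts += [
        f"MATCH (a:Chunk {{id: '{a}'}}), (b:Chunk {{id: '{b}'}}) "
        f"MERGE (a)-[:NEXT_CHUNK]->(b)"
        for a, b in zip(chunk_ids, chunk_ids[1:])
    ]
    return stmts
```

A real builder would pass values as query parameters rather than interpolating them into the Cypher string; the interpolation here is purely for readability of the sketch.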
## Failure Modes
- File decoding errors: logged and skipped.
- Embedding timeouts: retried with backoff; chunk remains un-embedded if persistent.
- Graph build failures: retrieval continues with vector/sparse; flagged in logs.
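The embedding retry policy described above amounts to exponential backoff with a bounded attempt count, roughly like this (a sketch; the retry count and delays are assumptions, not the indexer's configured values):

```python
import time

def embed_with_retry(embed_fn, text, retries: int = 3, base_delay: float = 0.5):
    """Call embed_fn, retrying transient timeouts with exponential backoff.

    Returns None after exhausting retries so the chunk stays un-embedded
    and indexing can continue with sparse/graph signals only.
    """
    for attempt in range(retries):
        try:
            return embed_fn(text)
        except TimeoutError:
            if attempt == retries - 1:
                return None  # persistent failure: leave chunk un-embedded
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```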