# Indexing Pipeline

- **Loader**: Git-aware discovery honoring `.gitignore` with root-relative patterns.
- **Chunker**: Fixed, AST-aware, or hybrid chunking strategies with line attribution.
- **Embedder**: Deterministic local or provider-backed embeddings configured in Pydantic.
- **Chunk Summaries**: Optional LLM-generated `chunk_summaries` to improve sparse search.
- **Graph Builder**: Entity/relationship extraction and Neo4j persistence.
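The fixed-size chunking strategy with overlap and line attribution can be sketched roughly as follows (an illustrative stand-alone function; the function and field names are assumptions, not the project's actual API):

```python
def chunk_lines(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split text into fixed-size character chunks, recording start/end lines."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Line attribution: count newlines before each boundary (1-indexed).
        chunks.append({
            "text": text[start:end],
            "start_line": text.count("\n", 0, start) + 1,
            "end_line": text.count("\n", 0, end) + 1,
        })
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks
```

The AST-aware and hybrid strategies would replace the fixed character boundaries with syntax-node boundaries, but the line-attribution bookkeeping stays the same.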
## Idempotent Indexing

Use `force_reindex=false` for incremental updates. The indexer skips unchanged files using mtime/hash checks where available.
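The skip check amounts to a cheap mtime comparison with a content-hash fallback, along these lines (a sketch; the helper name and stored fields are illustrative, not the indexer's internals):

```python
import hashlib
import os
from typing import Optional

def needs_reindex(path: str, recorded_mtime: Optional[float],
                  recorded_sha: Optional[str]) -> bool:
    """Return True if a file changed since it was last indexed.

    Fast path: identical mtime means unchanged. Otherwise hash the
    content, which catches files touched without actual edits.
    """
    mtime = os.path.getmtime(path)
    if recorded_mtime is not None and mtime == recorded_mtime:
        return False  # unchanged fast path, no read needed
    with open(path, "rb") as f:
        sha = hashlib.sha256(f.read()).hexdigest()
    return sha != recorded_sha
```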
## Storage Layout

Chunks, embeddings, and FTS indexes live in PostgreSQL; graph artifacts live in Neo4j. Sizes are summarized via the dashboard endpoints.
## Large Corpora

Configure the Neo4j heap and page cache via environment variables for multi-million-edge graphs, and monitor PostgreSQL disk growth for pgvector indexes.
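With the official Neo4j Docker image, memory settings map onto environment variables like the following (the sizes are example values only; tune them to your graph and available RAM):

```bash
NEO4J_server_memory_heap_initial__size=4G
NEO4J_server_memory_heap_max__size=8G
NEO4J_server_memory_pagecache_size=8G
```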
## Pipeline Flow

```mermaid
flowchart LR
    L["FileLoader"] --> C["Chunker"]
    C --> E["Embedder"]
    E --> P["PostgreSQL"]
    C --> S["ChunkSummarizer"]
    S --> P
    C --> GB["GraphBuilder"]
    GB --> N["Neo4j"]
```

## Chunking & Embedding Controls (Selected)
| Section | Field | Default | Notes |
|---|---|---|---|
| chunking | chunk_size | 1000 | Target chars per chunk |
| chunking | chunk_overlap | 200 | Overlap for continuity |
| chunking | chunking_strategy | ast | ast \| greedy \| hybrid |
| chunking | max_chunk_tokens | 8000 | Split recursively if larger |
| embedding | embedding_type | openai | Provider selector |
| embedding | embedding_model | text-embedding-3-large | Model id |
| embedding | embedding_dim | 3072 | Must match model outputs |
| indexing | bm25_tokenizer | stemmer | Tokenizer for FTS |
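The fields above map onto a settings model along these lines (a minimal sketch using stdlib dataclasses for self-containment; the project's actual Pydantic model may differ):

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    chunk_size: int = 1000          # target chars per chunk
    chunk_overlap: int = 200        # overlap for continuity
    chunking_strategy: str = "ast"  # ast | greedy | hybrid
    max_chunk_tokens: int = 8000    # split recursively if larger

@dataclass
class EmbeddingConfig:
    embedding_type: str = "openai"
    embedding_model: str = "text-embedding-3-large"
    embedding_dim: int = 3072       # must match the model's output width
```

A mismatched `embedding_dim` is worth validating at startup, since pgvector columns are fixed-width and a wrong value fails only at insert time.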
## Start Indexing via API (Annotated)

```python
import httpx

base = "http://127.0.0.1:8012/api"
req = {
    "corpus_id": "tribrid",  # (1)!
    "repo_path": "/work/src/tribrid",
    "force_reindex": False,
}
httpx.post(f"{base}/index", json=req).raise_for_status()  # (2)!
status = httpx.get(f"{base}/index/tribrid/status").json()
print(status["status"], status.get("progress"))  # (3)!
```

1. Create/refresh a specific corpus.
2. Start indexing.
3. Poll progress.
```bash
BASE=http://127.0.0.1:8012/api
curl -sS -X POST "$BASE/index" -H 'Content-Type: application/json' -d '{
  "corpus_id":"tribrid","repo_path":"/work/src/tribrid","force_reindex":false
}'
curl -sS "$BASE/index/tribrid/status" | jq .
```
```typescript
import type { IndexRequest, IndexStatus } from "./web/src/types/generated";

async function reindex(path: string) {
  const req: IndexRequest = { corpus_id: "tribrid", repo_path: path, force_reindex: false };
  await fetch("/api/index", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  }); // (2)!
  const status: IndexStatus = await (await fetch("/api/index/tribrid/status")).json(); // (3)!
  console.log(status.status, status.progress);
}
```
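A single status check only snapshots progress; a poll-until-done helper can wrap the status endpoint like this (a sketch; the terminal state names `"running"`/`"pending"` are assumptions, so check your deployment's actual values):

```python
import time

def wait_for_index(fetch_status, poll_s: float = 2.0, timeout_s: float = 600.0):
    """Poll until indexing leaves a running state.

    fetch_status: zero-arg callable returning the /index/{id}/status JSON dict.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] not in ("running", "pending"):  # assumed state names
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"indexing did not finish within {timeout_s}s")
```

With httpx this would be called as `wait_for_index(lambda: httpx.get(f"{base}/index/tribrid/status").json())`.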
## Graph Indexing (Neo4j)

| Field | Default | Meaning |
|---|---|---|
| graph_indexing.enabled | true | Enable graph building during indexing |
| graph_indexing.build_lexical_graph | true | Add Chunk/NEXT_CHUNK structure |
| graph_indexing.store_chunk_embeddings | true | Store chunk vectors for Neo4j vector search |
| graph_indexing.semantic_kg_enabled | false | Extract concept relations (heuristic or LLM) |
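The lexical graph structure (Chunk nodes linked in file order by NEXT_CHUNK edges) can be sketched as Cypher generated per file (illustrative only; the node label and property names are assumptions about the schema):

```python
def lexical_graph_cypher(file_path: str, chunk_ids: list) -> list:
    """Build MERGE statements for one file's Chunk nodes and NEXT_CHUNK chain."""
    # One Chunk node per chunk, carrying its source file and sequence position.
    stmts = [
        f"MERGE (:Chunk {{id: '{cid}', file: '{file_path}', seq: {i}}})"
        for i, cid in enumerate(chunk_ids)
    ]
    # Link consecutive chunks so graph traversal can walk a file in order.
    stmts += [
        f"MATCH (a:Chunk {{id: '{a}'}}), (b:Chunk {{id: '{b}'}}) "
        f"MERGE (a)-[:NEXT_CHUNK]->(b)"
        for a, b in zip(chunk_ids, chunk_ids[1:])
    ]
    return stmts
```

A real builder would pass values as query parameters rather than interpolating them into the Cypher string; the interpolation here is purely for readability of the sketch.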
## Failure Modes
- File decoding errors: logged and skipped.
- Embedding timeouts: retried with backoff; chunk remains un-embedded if persistent.
- Graph build failures: retrieval continues with vector/sparse; flagged in logs.
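The embedding retry policy described above amounts to exponential backoff with a bounded attempt count, roughly like this (a sketch; the retry count and delays are assumptions, not the indexer's configured values):

```python
import time

def embed_with_retry(embed_fn, text, retries: int = 3, base_delay: float = 0.5):
    """Call embed_fn, retrying transient timeouts with exponential backoff.

    Returns None after exhausting retries so the chunk stays un-embedded
    and indexing can continue with sparse/graph signals only.
    """
    for attempt in range(retries):
        try:
            return embed_fn(text)
        except TimeoutError:
            if attempt == retries - 1:
                return None  # persistent failure: leave chunk un-embedded
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```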