Config reference: indexing

- Enterprise tuning surface: defaults and constraints are rendered directly from Pydantic.
- Env keys when available: many fields have an env-style alias (from TriBridConfig.to_flat_dict()).
- Tooltip-level guidance: if a matching glossary entry exists, you’ll see deeper tuning notes.
Total parameters: 16
Group index
(root)
| JSON key | Env key(s) | Type | Default | Constraints | Summary |
|---|---|---|---|---|---|
| indexing.bm25_stemmer_lang | BM25_STEMMER_LANG | str | "english" | — | Stemmer language |
| indexing.bm25_tokenizer | BM25_TOKENIZER | str | "stemmer" | pattern=^(stemmer\|lowercase\|whitespace)$ | BM25 tokenizer type |
| indexing.estimated_tokens_per_second_local | ESTIMATED_TOKENS_PER_SECOND_LOCAL | int \| None | null | ≥ 100, ≤ 500000 | Optional local embedding throughput override for index-time estimates (tokens/sec) |
| indexing.index_excluded_exts | INDEX_EXCLUDED_EXTS | str | ".png,.jpg,.gif,.ico,.svg,.woff,.ttf" | — | Excluded file extensions (comma-separated) |
| indexing.index_max_file_size_mb | INDEX_MAX_FILE_SIZE_MB | int | 250 | ≥ 1, ≤ 1024 | Max file size to index (MB) |
| indexing.indexing_batch_size | INDEXING_BATCH_SIZE | int | 100 | ≥ 10, ≤ 1000 | Batch size for indexing |
| indexing.indexing_workers | INDEXING_WORKERS | int | 4 | ≥ 1, ≤ 16 | Parallel workers for indexing |
| indexing.large_file_mode | — | Literal["read_all", "stream"] | "stream" | allowed: "read_all", "stream" | How to ingest very large text files. 'stream' avoids loading entire files into memory |
| indexing.large_file_stream_chunk_chars | — | int | 2000000 | ≥ 100000, ≤ 50000000 | When large_file_mode='stream', read text files in bounded char blocks (best-effort) |
| indexing.parquet_extract_include_column_names | PARQUET_EXTRACT_INCLUDE_COLUMN_NAMES | int | 1 | ≥ 0, ≤ 1 | Include column headers when extracting Parquet text |
| indexing.parquet_extract_max_cell_chars | PARQUET_EXTRACT_MAX_CELL_CHARS | int | 20000 | ≥ 100, ≤ 200000 | Max characters per extracted Parquet cell (best-effort) |
| indexing.parquet_extract_max_chars | PARQUET_EXTRACT_MAX_CHARS | int | 2000000 | ≥ 10000, ≤ 50000000 | Max characters to extract from a single Parquet file during indexing (best-effort) |
| indexing.parquet_extract_max_rows | PARQUET_EXTRACT_MAX_ROWS | int | 5000 | ≥ 1, ≤ 200000 | Max rows to extract from a single Parquet file during indexing (best-effort) |
| indexing.parquet_extract_text_columns_only | PARQUET_EXTRACT_TEXT_COLUMNS_ONLY | int | 1 | ≥ 0, ≤ 1 | Extract only text/string-like columns from Parquet files when possible |
| indexing.postgres_url | POSTGRES_URL | str | "postgresql://postgres:postgres@localhost:5432/tribrid_rag" | — | PostgreSQL connection string (DSN) for pgvector + FTS storage |
| indexing.skip_dense | SKIP_DENSE | int | 0 | ≥ 0, ≤ 1 | Skip dense vector indexing |
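
Env aliases from the table override defaults at load time. The sketch below is illustrative only (not the project's actual loader, whose shape comes from TriBridConfig): it shows the env-key-wins precedence with the documented min/max constraints enforced.

```python
import os

# Illustrative loader sketch: an env key from the table overrides the
# Pydantic default, and the documented constraint range is enforced.
def load_int_setting(env_key: str, default: int, lo: int, hi: int) -> int:
    raw = os.environ.get(env_key)
    if raw is None:
        return default                      # fall back to the Pydantic default
    value = int(raw)
    if not (lo <= value <= hi):
        raise ValueError(f"{env_key}={value} outside [{lo}, {hi}]")
    return value

# INDEX_MAX_FILE_SIZE_MB: default 250, constrained to 1..1024
os.environ["INDEX_MAX_FILE_SIZE_MB"] = "500"
max_mb = load_int_setting("INDEX_MAX_FILE_SIZE_MB", 250, 1, 1024)
```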
Details (glossary)
indexing.bm25_stemmer_lang (BM25_STEMMER_LANG) — BM25 Stemmer Language
Category: retrieval
BM25_STEMMER_LANG chooses the stemming or morphological normalization profile applied before sparse indexing. Correct language normalization improves recall by unifying inflected word forms, while incorrect stemming can collapse distinct technical terms and reduce precision. Multilingual corpora often need language-aware analyzers by field rather than one global stemmer, especially when prose and code identifiers coexist. Any change here requires reindexing and targeted multilingual relevance checks because token statistics and BM25 behavior shift across the entire corpus.
Badges: Linguistics
Links:
- Milco: Multilingual Sparse Retrieval via Connector (arXiv)
- Elasticsearch Language Analyzers
- Snowball Stemming Algorithms
- Lucene Analysis Common Module
indexing.bm25_tokenizer (BM25_TOKENIZER) — BM25 Tokenizer
Category: retrieval
BM25_TOKENIZER determines how text is split into sparse terms, and this often has a larger impact than small parameter tweaks. Conservative tokenization preserves exact symbols and identifier fragments useful for code retrieval, while aggressive normalization helps natural-language matching. The right choice depends on corpus composition: APIs and filenames benefit from symbol-aware token boundaries, whereas narrative documents benefit from linguistic normalization. Because tokenizer behavior changes term frequencies and document lengths, retune BM25 parameters after tokenizer changes instead of carrying old values forward.
Badges: Tokenization
Links:
- Multilingual Generative Retrieval via Semantic Compression (arXiv)
- Elasticsearch Tokenizers
- Hugging Face Tokenizers
- Lucene WhitespaceTokenizer
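
To make the tradeoff concrete, here is a toy comparison of the three modes. These are not the real analyzers (in particular, the "stemmer" branch only lowercases and strips a trailing plural 's'); the point is how each mode treats a code identifier versus prose.

```python
import re

# Toy illustration of the three BM25_TOKENIZER modes; real analyzers
# (Snowball stemming, Unicode segmentation) are far more elaborate.
def tokenize(text: str, mode: str) -> list[str]:
    if mode == "whitespace":
        return text.split()                  # preserves case, symbols, identifiers
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # split on non-alphanumerics
    if mode == "lowercase":
        return tokens
    if mode == "stemmer":
        # stand-in for real stemming: strip a trailing plural 's'
        return [t.rstrip("s") if len(t) > 3 else t for t in tokens]
    raise ValueError(f"unknown tokenizer mode: {mode!r}")
```

Note how "whitespace" keeps `parse_requests` intact as one searchable term, while the normalizing modes split it, which helps prose queries but hurts exact-identifier lookup.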
indexing.index_excluded_exts (INDEX_EXCLUDED_EXTS) — Excluded Extensions
Category: infrastructure
Defines a denylist of file extensions that should be skipped before ingestion so the index is not polluted by binaries, build artifacts, media blobs, and other low-signal assets. In code and docs RAG, good exclusion rules improve both precision and indexing cost by avoiding irrelevant tokens and expensive parsing failures. Keep this list aligned with your repository layout and parser capabilities, because extension-only filtering can miss mislabeled files unless combined with MIME or content checks. Review exclusions after major stack changes, especially when adding documentation generators or notebook-heavy workflows. Overly broad exclusions can silently remove valuable domain knowledge from retrieval.
Badges: Corpus hygiene
Links:
- Vision-Guided Chunking Improves RAG in Multimodal Long Context Scenarios
- gitignore Pattern Format
- Unstructured Open Source Overview
- Azure AI Search: Chunk Large Documents
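
A minimal sketch of extension-denylist filtering as described above (the real pipeline may layer MIME or content checks on top, since extension-only filtering misses mislabeled files):

```python
from pathlib import Path

# Parse the comma-separated INDEX_EXCLUDED_EXTS value into a lowercase set.
def parse_excluded_exts(raw: str) -> set[str]:
    return {e.strip().lower() for e in raw.split(",") if e.strip()}

# Extension-only check; case-insensitive so "Logo.PNG" is still excluded.
def should_index(path: str, excluded: set[str]) -> bool:
    return Path(path).suffix.lower() not in excluded

excluded = parse_excluded_exts(".png,.jpg,.gif,.ico,.svg,.woff,.ttf")
```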
indexing.index_max_file_size_mb (INDEX_MAX_FILE_SIZE_MB) — Index max file size (MB)
Category: chunking
Sets a hard upper bound on file size for indexing to prevent memory spikes and long-tail ingestion delays caused by extremely large documents. In RAG pipelines this value protects indexing stability, but if set too low it can remove high-value sources such as architecture guides, policy manuals, or API bundles. Use corpus stats to choose a threshold, typically around the P95 or P99 file size, then special-case known large files with streaming or sectioned ingestion. This setting interacts with chunking strategy, parser behavior, and total token budget, so tune it alongside chunk size and overlap rather than in isolation. Periodic audits of skipped-file lists help avoid accidental knowledge gaps.
Badges: Stability guardrail
Links:
- HiFi-RAG: Enhancing Retrieval-Augmented Generation through High-Fidelity Contextual Chunking and Reasoning
- Azure AI Search: Chunk Large Documents
- Unstructured Open Source Overview
- Weaviate Data Import
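
The "use corpus stats" advice can be sketched as follows; this is a suggested starting point, not part of the shipped tooling, clamped to the field's 1..1024 constraint.

```python
import statistics

# Suggest a max-file-size threshold near the P95 of observed file sizes (MB).
def suggest_max_file_size_mb(sizes_mb: list[float]) -> int:
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles(sizes_mb, n=20)[18]
    return max(1, min(1024, round(p95)))
```

Files above the suggested cap would then be candidates for the streaming path (large_file_mode='stream') rather than being silently dropped.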
indexing.indexing_batch_size (INDEXING_BATCH_SIZE) — Indexing Batch Size
Category: embedding
INDEXING_BATCH_SIZE sets how many chunks or records are processed together per indexing step, affecting throughput, memory pressure, and failure blast radius. Larger batches generally improve GPU and network utilization for embeddings and vector upserts, but they also increase peak memory and make retries more expensive. Smaller batches are slower but more resilient when providers rate-limit, vector stores throttle writes, or occasional malformed records appear. The best value depends on embedding latency, vector DB ingest speed, and available RAM, so it should be tuned with real pipeline telemetry. Start conservatively, then increase until throughput gains flatten or error rates begin rising.
Badges: Throughput
Links:
- Qdrant Bulk Upload Tutorial
- pgvector Repository
- PostgreSQL COPY Command
- LightRetriever (2025): Faster Query Inference
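
The batching behavior itself is simple; a minimal sketch (generic, not the project's actual pipeline code) of how INDEXING_BATCH_SIZE partitions chunks per step:

```python
from typing import Iterator

# Partition a chunk list into fixed-size batches. Smaller batches shrink the
# retry blast radius; larger ones amortize per-call embedding/upsert overhead.
def batched(items: list, batch_size: int) -> Iterator[list]:
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 250 chunks at the default batch size of 100 -> batches of 100, 100, 50
batches = list(batched(list(range(250)), 100))
```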
indexing.indexing_workers (INDEXING_WORKERS) — Indexing Workers
Category: infrastructure
Controls how many parallel workers execute indexing stages such as parsing, chunking, sparse indexing, and embedding preparation. In RAG systems this is a throughput lever, but only up to the point where CPU cores, memory bandwidth, disk I/O, or embedding-provider rate limits become the bottleneck. A practical baseline is physical cores minus one or two so interactive tasks and background services still have headroom. If this value is set too high, context switching, queue contention, and retry pressure can increase total wall-clock time rather than reduce it. Tune with real run metrics, especially files-per-second, average chunk latency, and failed-task retries.
Badges: Throughput tuning
Links:
- GraphAnchor: Graph-Enhanced and Attention-Driven Retrieval for RAG
- Python concurrent.futures
- Docker CPU Resource Constraints
- FAISS Documentation
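
The "cores minus headroom" baseline can be computed as below. Note the caveat: os.cpu_count() reports logical cores (and ignores container CPU limits), so treat this as a starting point and tune against real run metrics.

```python
import os

# Baseline worker count: core count minus headroom for interactive tasks,
# clamped to the INDEXING_WORKERS constraint range of 1..16.
def default_indexing_workers(headroom: int = 2) -> int:
    cores = os.cpu_count() or 1     # logical cores; may exceed physical cores
    return max(1, min(16, cores - headroom))
```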
indexing.parquet_extract_include_column_names (PARQUET_EXTRACT_INCLUDE_COLUMN_NAMES) — Parquet Include Column Names
Category: indexing
When enabled, column headers are injected into extracted Parquet text so retrieval can align values with field semantics (for example, distinguishing price from discount_price). This generally improves schema-aware search and downstream answer grounding, especially for wide analytical tables. The downside is extra tokens and potentially noisier chunks if column names are verbose or system-generated. Keep this on by default for mixed tabular + natural-language corpora, then validate index size impact on large datasets.
Links:
- TGR: Table Graph Reasoner for Dense Tables (arXiv 2026)
- Apache Parquet Documentation
- DuckDB Parquet Overview
- Polars scan_parquet API
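
An illustrative sketch of what header injection changes for one extracted row (the real extractor's formatting may differ; the `col=val` layout here is an assumption):

```python
# Render one Parquet row as extractable text, with or without column headers.
def render_row(row: dict, include_column_names: bool) -> str:
    if include_column_names:
        # header-aware form lets retrieval distinguish price from discount_price
        return " | ".join(f"{col}={val}" for col, val in row.items())
    return " | ".join(str(val) for val in row.values())

row = {"price": 19.99, "discount_price": 14.99}
```

Without headers, the two numbers are indistinguishable at query time; with headers, a query like "discount price" can match the right field.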
indexing.parquet_extract_max_cell_chars (PARQUET_EXTRACT_MAX_CELL_CHARS) — Parquet Extract Max Cell Chars
Category: indexing
Upper bound for characters extracted from any single Parquet cell before truncation. This prevents rare long values (JSON blobs, stack traces, raw HTML, encoded payloads) from dominating chunk budgets and crowding out other rows. A low cap improves throughput and keeps chunks balanced, but may clip high-value context in long descriptive fields. Choose a cap that protects indexing stability while preserving enough per-cell signal for your query patterns.
Links:
- Efficient Table Retrieval from Massive Data Lakes (arXiv 2026)
- Apache Parquet Format Repository
- DuckDB Parquet Performance Tips
- pandas.read_parquet Reference
indexing.parquet_extract_max_chars (PARQUET_EXTRACT_MAX_CHARS) — Parquet Extract Max Chars
Category: indexing
Global character budget for text extracted from one Parquet file during indexing. Once this threshold is reached, extraction stops (best effort), giving predictable upper bounds on memory, ingestion time, and index growth. This setting is critical for very large tables where full-file extraction is unnecessary or too expensive. Pair it with row limits and cell caps so your truncation strategy is intentional rather than accidental.
Links:
- Scalable Tabular In-Context Learning (arXiv 2025)
- Parquet Implementation Status
- DuckDB Querying Parquet Files
- pyarrow.parquet.read_table Reference
indexing.parquet_extract_max_rows (PARQUET_EXTRACT_MAX_ROWS) — Parquet Extract Max Rows
Category: indexing
Best-effort cap on the number of rows read from a Parquet file during extraction. It is a coarse but effective control for ingestion cost when a dataset is too large to fully materialize into text. Higher values improve coverage and long-tail recall, while lower values reduce indexing time and memory pressure. If row order is meaningful (for example, temporal logs), this cap also determines which slice of data becomes searchable first.
Links:
- Scalable Tabular In-Context Learning (arXiv 2025)
- Polars scan_parquet API (row limiting)
- DuckDB Parquet Overview
- pyarrow.parquet.read_table Reference
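
The three caps compose into one intentional truncation strategy. The sketch below assumes an order (row cap, then per-cell truncation, then the global character budget); the actual extractor may differ in details such as how separators count against the budget.

```python
# Best-effort Parquet text extraction under the three documented caps.
def extract_text(rows, max_rows=5000, max_cell_chars=20000, max_chars=2000000):
    out, used = [], 0
    for row in rows[:max_rows]:                  # parquet_extract_max_rows
        # per-cell truncation: parquet_extract_max_cell_chars
        line = " ".join(str(v)[:max_cell_chars] for v in row)
        if used + len(line) > max_chars:         # parquet_extract_max_chars
            out.append(line[: max_chars - used]) # partial final line, then stop
            break
        out.append(line)
        used += len(line)
    return "\n".join(out)
```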
indexing.parquet_extract_text_columns_only (PARQUET_EXTRACT_TEXT_COLUMNS_ONLY) — Parquet Text Columns Only
Category: indexing
Controls whether the Parquet ingestion path indexes only text-like columns (strings, long text blobs, comments, descriptions) instead of every column in the table. Keeping this enabled usually improves retrieval quality because numeric IDs, sparse codes, and high-cardinality counters often add noise without helping semantic recall. For mixed analytics datasets, this setting is a cost and relevance lever: you reduce token volume, embedding spend, and index size while preserving the fields that actually answer natural-language questions. Disable it only when numeric or categorical columns are first-class search targets and you have evaluation evidence that broader indexing improves recall more than it harms precision.
Links:
- Text-to-SQL in the Wild: Benchmarking LLMs on Semi-structured Tables (arXiv 2025)
- Apache Parquet Documentation
- DuckDB Parquet Integration Overview
- pandas read_parquet Reference
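
A hedged sketch of the column-selection idea: the real implementation presumably reads the Parquet schema, whereas this toy version just inspects Python value types in a sample row.

```python
# Keep only text-like columns from a sample row; numeric IDs and counters
# are dropped, matching the rationale above. Schema-based checks are more
# robust than this value-type heuristic.
def text_columns(rows: list[dict]) -> list[str]:
    if not rows:
        return []
    return [col for col, val in rows[0].items() if isinstance(val, str)]

rows = [{"id": 41, "comment": "late delivery", "count": 7}]
```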
indexing.postgres_url (POSTGRES_URL) — PostgreSQL pgvector URL
Category: infrastructure
Connection DSN used to reach PostgreSQL for relational storage and pgvector-backed similarity retrieval. This single string determines host, port, database, credentials, SSL behavior, and optional connection parameters, so parsing mistakes or stale credentials can break indexing and retrieval simultaneously. Keep secrets out of committed config and inject this value at runtime via environment management; then validate connectivity and extension availability (pgvector) during startup checks. If you operate multiple environments, treat DSN changes as deploy-time infrastructure changes with explicit migration and rollback plans.
Links:
- Text2VectorSQL: Bridging SQL and Vector Retrieval (arXiv 2025)
- PostgreSQL libpq Connection Strings
- PostgreSQL Connection Settings
- pgvector Extension (GitHub)
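
A startup sanity check can catch DSN parsing mistakes before they surface as driver errors. The sketch below only validates the URI shape with the standard library; a full check would also open a connection and confirm the extension, e.g. `SELECT 1 FROM pg_extension WHERE extname = 'vector'`.

```python
from urllib.parse import urlsplit

# Parse a libpq-style connection URI into its components, failing fast on
# malformed values instead of at first query time.
def parse_dsn(dsn: str) -> dict:
    parts = urlsplit(dsn)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,          # libpq default port
        "dbname": parts.path.lstrip("/"),
        "user": parts.username,
    }

info = parse_dsn("postgresql://postgres:postgres@localhost:5432/tribrid_rag")
```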
indexing.skip_dense (SKIP_DENSE) — Skip Dense Embeddings
Category: retrieval
When enabled, indexing skips dense embedding generation and vector-store writes, leaving retrieval fully lexical (BM25/FTS). This is useful for fast local iteration, constrained CI environments, or deployments where vector infrastructure is unavailable. The tradeoff is predictable: lower indexing cost and simpler ops, but weaker semantic recall for paraphrases and concept-level matches. Use this mode when exact term matching dominates your workload (file names, identifiers, error strings), and disable it for natural-language-heavy corpora where semantic expansion materially improves first-pass recall.
Badges: Much faster, Keyword-only, No semantic search
Links:
- Mixture of Retrieval (MoR): Integrating Sparse and Dense Retrieval for RAG (arXiv 2025)
- PostgreSQL Full Text Search
- Elasticsearch Reciprocal Rank Fusion (RRF)
- Search in PostgreSQL: Full Text Search (ParadeDB)
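
The gate itself reduces to a conditional around the dense path; this minimal sketch (hypothetical function names, not the project's pipeline) shows that the lexical write always runs while the embedding + vector write is skipped when the flag is set:

```python
# Index one chunk: the BM25/FTS write always happens; the dense
# embedding + pgvector write is skipped when skip_dense is enabled.
def index_chunk(chunk: str, skip_dense: bool,
                sparse_writes: list, dense_writes: list) -> None:
    sparse_writes.append(chunk)          # lexical path, always on
    if not skip_dense:
        dense_writes.append(chunk)       # dense path, gated by SKIP_DENSE

sparse, dense = [], []
for c in ["def main():", "error: timeout"]:
    index_chunk(c, skip_dense=True, sparse_writes=sparse, dense_writes=dense)
```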