Config reference: embedding

- **Enterprise tuning surface**: defaults and constraints are rendered directly from Pydantic.
- **Env keys when available**: many fields have an env-style alias (from `TriBridConfig.to_flat_dict()`).
- **Tooltip-level guidance**: if a matching glossary entry exists, you'll see deeper tuning notes.
Total parameters: 18
Group index
(root)
| JSON key | Env key(s) | Type | Default | Constraints | Summary |
|---|---|---|---|---|---|
| embedding.auto_set_dimensions | — | bool | true | — | When true, the UI auto-syncs embedding_dim from data/models.json when the model changes. |
| embedding.contextual_chunk_embeddings | — | Literal["off", "prepend_context", "late_chunking_local_only"] | "off" | allowed="off", "prepend_context", "late_chunking_local_only" | Contextual chunk embedding mode. 'late_chunking_local_only' requires a local/HF provider backend. |
| embedding.embed_text_prefix | — | str | "" | — | Prefix added before chunk text prior to embedding (stable document context). |
| embedding.embed_text_suffix | — | str | "" | — | Suffix added after chunk text prior to embedding. |
| embedding.embedding_backend | — | Literal["deterministic", "provider"] | "deterministic" | allowed="deterministic", "provider" | Embedding execution backend. 'deterministic' is offline/test-friendly; 'provider' calls real providers. |
| embedding.embedding_batch_size | EMBEDDING_BATCH_SIZE | int | 64 | ≥ 1, ≤ 256 | Batch size for embedding generation. |
| embedding.embedding_cache_enabled | EMBEDDING_CACHE_ENABLED | int | 1 | ≥ 0, ≤ 1 | Enable embedding cache. |
| embedding.embedding_dim | EMBEDDING_DIM | int | 3072 | ≥ 128, ≤ 4096 | Embedding dimensions. |
| embedding.embedding_max_tokens | EMBEDDING_MAX_TOKENS | int | 8000 | ≥ 512, ≤ 8192 | Max tokens per embedding chunk. |
| embedding.embedding_model | EMBEDDING_MODEL | str | "text-embedding-3-large" | — | OpenAI embedding model. |
| embedding.embedding_model_local | EMBEDDING_MODEL_LOCAL | str | "all-MiniLM-L6-v2" | — | Local SentenceTransformer model. |
| embedding.embedding_model_mlx | EMBEDDING_MODEL_MLX | str | "mlx-community/all-MiniLM-L6-v2-4bit" | — | MLX-optimized embedding model (used when embedding_type=mlx). |
| embedding.embedding_retry_max | EMBEDDING_RETRY_MAX | int | 3 | ≥ 1, ≤ 5 | Max retries for embedding API. |
| embedding.embedding_timeout | EMBEDDING_TIMEOUT | int | 30 | ≥ 5, ≤ 120 | Embedding API timeout (seconds). |
| embedding.embedding_type | EMBEDDING_TYPE | str | "openai" | — | Embedding provider (dynamic; validated against models.json at runtime). |
| embedding.input_truncation | — | Literal["error", "truncate_end", "truncate_middle"] | "truncate_end" | allowed="error", "truncate_end", "truncate_middle" | What to do when text exceeds embedding/token limits. |
| embedding.late_chunking_max_doc_tokens | — | int | 8192 | ≥ 256, ≤ 65536 | Max tokens per document segment for local late chunking. |
| embedding.voyage_model | VOYAGE_MODEL | str | "voyage-code-3" | — | Voyage embedding model. |
Details (glossary)
embedding.embedding_batch_size (EMBEDDING_BATCH_SIZE) — Embedding Batch Size
Category: embedding
Controls how many chunks are embedded in each request or inference pass. Larger batches usually improve throughput by reducing per-request overhead and increasing accelerator utilization, but they raise peak memory pressure and can hit rate or timeout limits. Smaller batches are safer on constrained hosts and unstable networks but increase total indexing time. Tune this setting from observed throughput and error rates, not fixed defaults.
Badges: - Throughput tuning
Links: - Hugging Face Text Embeddings Inference - Voyage embeddings docs - Qdrant points and upserts - Dynamic batching for LLM throughput (2025)
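The batching trade-off above can be sketched with a small hypothetical helper (`batched` is illustrative, not an API of this project) that splits a chunk list into `embedding_batch_size`-sized requests:

```python
def batched(chunks, batch_size=64):
    """Yield successive batches of at most `batch_size` chunks.

    `batch_size` mirrors embedding.embedding_batch_size (1-256):
    larger values improve throughput, smaller values cap peak memory.
    """
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]
```

Lowering `batch_size` trades total indexing time for lower peak memory and fewer rate-limit or timeout failures.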
embedding.embedding_cache_enabled (EMBEDDING_CACHE_ENABLED) — Embedding Cache
Category: embedding
Enables reuse of previously computed embeddings for identical normalized text, reducing repeated compute and API spend during reindex cycles. Cache hits are most beneficial when rerunning ingestion on mostly stable corpora or during iterative chunking tests. Cache keys should include model identifier, model revision, and preprocessing policy to prevent stale vectors from contaminating retrieval quality comparisons. Disable the cache only when validating backend or model changes end-to-end.
Badges: - Cost control
Links: - Redis client-side caching - Qdrant points and upserts - Pinecone semantic search guide - ContextPilot context reuse (2025)
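The cache-key guidance above can be sketched as follows, assuming a whitespace-collapsing normalization policy (the function name and normalization choice are illustrative, not this project's implementation):

```python
import hashlib

def embedding_cache_key(text, model_id, revision, preprocessing):
    """Build a cache key that changes whenever the model or preprocessing changes.

    Including model_id, revision, and the preprocessing policy in the key
    prevents stale vectors from one configuration leaking into another.
    """
    normalized = " ".join(text.split())  # example normalization: collapse whitespace
    payload = "\x1f".join([model_id, revision, preprocessing, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```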
embedding.embedding_dim (EMBEDDING_DIM) — Embedding Dimension
Category: embedding
Defines vector dimensionality in the index and must match model output exactly. Larger dimensions can preserve more semantic detail and improve hard-case recall, but they increase memory, storage, and approximate-nearest-neighbor compute cost. Smaller dimensions reduce cost and can speed search, especially when using embeddings designed for compression. Treat this as a quality-versus-efficiency control and rebenchmark whenever the dimension changes.
Badges: - Vector schema
Links: - Qdrant collections and vector size - Weaviate vector search concepts - SentenceTransformer API - Dimensionality reduction impact study (2025)
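The must-match-exactly constraint can be enforced with a simple guard before vectors are upserted; this is an illustrative sketch, not part of the project API:

```python
def check_embedding_dim(vector, expected_dim=3072):
    """Fail fast when model output does not match the configured embedding_dim.

    A mismatch means the model or config changed, so the index needs a rebuild.
    """
    if len(vector) != expected_dim:
        raise ValueError(
            f"model returned {len(vector)} dims, index expects {expected_dim}; "
            "reindex required"
        )
    return vector
```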
embedding.embedding_max_tokens (EMBEDDING_MAX_TOKENS) — Embedding Max Tokens
Category: embedding
This sets the maximum token count sent to the embedding model for each chunk. Content beyond the limit is truncated, so the value directly controls how much semantic evidence is preserved in each vector. Higher limits can improve recall for long code blocks and docs, but they increase indexing cost, latency, and the chance of mixing multiple topics into one embedding. Lower limits are cheaper and often cleaner semantically, but can drop critical tail context. Tune this against your chunk size distribution and monitor truncation rate so most chunks fit without clipping.
Badges: - Affects cost
Links: - HiChunk (arXiv 2025) - OpenAI Cookbook: Embedding Long Inputs - tiktoken README - Voyage Embeddings Docs
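The clipping behavior interacts with `embedding.input_truncation`; the three policies can be sketched in pure Python on an already-tokenized chunk (tokenization itself is out of scope, and the helper name is illustrative):

```python
def apply_truncation(tokens, max_tokens, mode="truncate_end"):
    """Apply an embedding.input_truncation-style policy to a token list."""
    if len(tokens) <= max_tokens:
        return tokens
    if mode == "error":
        raise ValueError(f"chunk has {len(tokens)} tokens, limit is {max_tokens}")
    if mode == "truncate_end":
        return tokens[:max_tokens]          # drop the tail
    if mode == "truncate_middle":
        head = max_tokens // 2              # keep start and end, drop the middle
        tail = max_tokens - head
        return tokens[:head] + tokens[-tail:]
    raise ValueError(f"unknown truncation mode: {mode}")
```

`truncate_middle` preserves both the chunk's opening context and its tail, which `truncate_end` would drop.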
embedding.embedding_model (EMBEDDING_MODEL) — Embedding Model (OpenAI)
Category: embedding
This names the OpenAI embedding model used for indexing and query encoding when the OpenAI provider is selected. Model choice sets the quality, speed, vector shape options, and cost profile that downstream retrieval depends on. Because embedding spaces are model-specific, changing this value after indexing requires a full rebuild to keep similarity search valid. Treat model upgrades as versioned infrastructure changes: pin model ids, benchmark on your query set, and roll forward only with measured quality and latency impact. Avoid ad hoc switching between runs.
Badges: - Requires reindex
Links: - jina-embeddings-v5-text (arXiv 2026) - OpenAI Cookbook: Get Embeddings - openai-python API Reference - MTEB Leaderboard
embedding.embedding_model_local (EMBEDDING_MODEL_LOCAL) — Local Embedding Model
Category: embedding
This specifies the local embedding model (usually SentenceTransformers or Hugging Face) used when running without hosted embedding APIs. It is a core quality and performance lever: larger models often improve semantic recall but consume more memory and index slower. Different local models also use different dimensions and training objectives, so changing models requires reindexing. Pin exact model revisions to avoid drift across machines and CI jobs. Use your own benchmark queries to choose a model, since leaderboard rank alone may not match your codebase or domain vocabulary.
Badges: - Local inference
Links: - jina-embeddings-v5-text (arXiv 2026) - mxbai-embed-large-v1 Model Card - BGE Small v1.5 Model Card - SentenceTransformers Pretrained Models
embedding.embedding_model_mlx (EMBEDDING_MODEL_MLX) — MLX Embedding Model
Category: embedding
This sets the MLX-compatible embedding model used on Apple Silicon. MLX uses Metal-optimized kernels, so it can provide strong local throughput for private or offline indexing pipelines. As with every embedding backend, the model id and dimension define the vector space; changing either requires full reindexing to keep search comparable. Quantized variants can reduce memory and speed up inference, but you should validate recall on representative queries before adopting them broadly. Record model id and quantization in index metadata for reproducible builds.
Badges: - Apple Silicon
Links: - jina-embeddings-v5-text (arXiv 2026) - MLX Repository - MLX Examples Repository - mlx-community all-MiniLM-L6-v2-4bit
embedding.embedding_retry_max (EMBEDDING_RETRY_MAX) — Embedding Max Retries
Category: embedding
This controls how many times the system retries a failed embedding call before marking the operation failed. It protects indexing from transient failures such as short network interruptions, temporary overload, and bursty rate-limit responses. Too few retries make jobs brittle; too many can mask persistent faults and dramatically increase end-to-end indexing time. Pair this setting with exponential backoff and jitter so workers do not retry in synchronized waves. Track retry exhaustion in telemetry and fix root causes rather than continually raising the retry ceiling.
Badges: - Reliability
Links: - MINES: Web API Invariant Anomaly Detection (arXiv 2025) - AWS Builders Library: Timeouts, Retries, Backoff with Jitter - Google Cloud Retry Strategy - openai-python API Reference
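The backoff-with-jitter pairing above can be sketched as follows (an illustrative helper over any callable embedding client; a real deployment should catch provider-specific transient errors rather than bare `Exception`):

```python
import random
import time

def embed_with_retry(call, max_retries=3, base_delay=0.5):
    """Retry a failing embedding call with exponential backoff and full jitter.

    max_retries mirrors embedding.embedding_retry_max (1-5).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the root cause
            # full jitter: sleep a random slice of the exponential window,
            # so parallel workers do not retry in synchronized waves
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```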
embedding.embedding_timeout (EMBEDDING_TIMEOUT) — Embedding Timeout
Category: embedding
This is the maximum wait time for an embedding request before the call is treated as failed. It defines how long indexing workers can block on slow upstream responses and strongly affects throughput under load. If the timeout is too low, valid requests fail and trigger unnecessary retries; if it is too high, stuck calls reduce parallelism and delay incident detection. Tune this together with retry count, concurrency, and observed p95 and p99 latency, not mean latency alone. Use separate timeout profiles for interactive queries versus bulk indexing jobs when possible.
Badges: - Latency control
Links: - LO2: Microservice API Anomaly Dataset (arXiv 2025) - AWS Builders Library: Timeouts, Retries, Backoff with Jitter - Google Cloud Retry Strategy - openai-python API Reference
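Deriving the timeout from observed tail latency rather than the mean can be sketched as below (a hypothetical helper; the 1.5x headroom factor is an assumption to tune, and the clamp mirrors this field's 5-120 s bounds):

```python
import statistics

def suggest_timeout(latencies_s, floor=5, ceiling=120, headroom=1.5):
    """Suggest a timeout from observed p99 latency plus headroom.

    Clamped to embedding.embedding_timeout bounds (>= 5, <= 120 seconds).
    """
    p99 = statistics.quantiles(latencies_s, n=100)[98]  # 99th percentile cut point
    return max(floor, min(ceiling, round(p99 * headroom)))
```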
embedding.embedding_type (EMBEDDING_TYPE) — Embedding Provider
Category: embedding
This selects the embedding backend family and therefore the core operating mode of retrieval: hosted API providers versus local inference runtimes. The choice drives quality, cost, privacy boundaries, tokenizer behavior, dimensionality, and operational dependencies such as network availability or local model files. Switching type usually changes the vector space and requires reindexing to preserve ranking validity. Decide the provider type at the architecture level by balancing security and compliance constraints against latency and budget. Record provider and model together in index metadata so deployments remain reproducible.
Badges: - Requires reindex
Links: - jina-embeddings-v5-text (arXiv 2026) - OpenAI Cookbook: Get Embeddings - Voyage Embeddings Docs - Gemini Embeddings Docs
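The reindex-on-change rule can be made explicit by comparing stored index metadata against the current config; a minimal illustrative sketch (the function and field names are assumptions, not this project's schema):

```python
def reindex_required(stored, current):
    """True when any vector-space-defining field differs between the metadata
    recorded at index build time and the current configuration."""
    keys = ("embedding_type", "embedding_model", "embedding_dim")
    return any(stored.get(k) != current.get(k) for k in keys)
```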
embedding.voyage_model (VOYAGE_MODEL) — Voyage Embedding Model
Category: embedding
Selects which Voyage embedding model generates vectors for indexing and retrieval. The model choice determines embedding behavior (for example, code-optimized versus general-text behavior), output dimensionality, and cost and latency characteristics, so it directly affects both relevance quality and infrastructure footprint.
Change this deliberately and evaluate with a fixed benchmark query set. Because model changes alter vector semantics, switching models should be treated as a reindex event: regenerate vectors, rebuild the index, and compare recall@k, reranked precision, and p95 latency before promoting to production.
Badges: - Requires reindex - Code-optimized
Links: - Llama-Embed-Nemotron-8B (arXiv 2025) - Voyage AI Embeddings API - Voyage Contextualized Chunk Embeddings - Voyage AI FAQ
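recall@k, mentioned above as a promotion gate, has a straightforward definition; a minimal sketch (the function name is illustrative):

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant ids found in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

Running the same fixed query set through the old and new model, then comparing recall@k per query, makes the reindex decision measurable rather than anecdotal.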