
Config reference: generation

  • Enterprise tuning surface: defaults and constraints are rendered directly from Pydantic.
  • Env keys when available: many fields have an env-style alias (from TriBridConfig.to_flat_dict()).
  • Tooltip-level guidance: if a matching glossary entry exists, you’ll see deeper tuning notes.


Total parameters: 20

Group index
  • (root)

(root)

| JSON key | Env key(s) | Type | Default | Constraints | Summary |
| --- | --- | --- | --- | --- | --- |
| generation.enrich_backend | ENRICH_BACKEND | str | "openai" | pattern=^(openai\|ollama\|mlx)$ | Enrichment backend |
| generation.enrich_disabled | ENRICH_DISABLED | int | 0 | ≥ 0, ≤ 1 | Disable code enrichment |
| generation.enrich_model | ENRICH_MODEL | str | "gpt-4o-mini" | | Model for code enrichment |
| generation.enrich_model_ollama | ENRICH_MODEL_OLLAMA | str | "" | | Ollama enrichment model |
| generation.gen_backend | GEN_BACKEND | str | "openai" | pattern=^(openai\|anthropic\|ollama\|mlx\|openrouter)$ | Provider backend for gen_model and channel overrides |
| generation.gen_max_tokens | GEN_MAX_TOKENS | int | 2048 | ≥ 100, ≤ 8192 | Max tokens for generation |
| generation.gen_model | GEN_MODEL | str | "gpt-4o-mini" | | Primary generation model |
| generation.gen_model_cli | GEN_MODEL_CLI | str | "qwen3-coder:14b" | | CLI generation model |
| generation.gen_model_http | GEN_MODEL_HTTP | str | "" | | HTTP transport generation model override |
| generation.gen_model_mcp | GEN_MODEL_MCP | str | "" | | MCP transport generation model override |
| generation.gen_model_ollama | GEN_MODEL_OLLAMA | str | "qwen3-coder:30b" | | Ollama generation model |
| generation.gen_retry_max | GEN_RETRY_MAX | int | 2 | ≥ 1, ≤ 5 | Max retries for generation |
| generation.gen_temperature | GEN_TEMPERATURE | float | 0.0 | ≥ 0.0, ≤ 2.0 | Generation temperature |
| generation.gen_timeout | GEN_TIMEOUT | int | 60 | ≥ 10, ≤ 300 | Generation timeout (seconds) |
| generation.gen_top_p | GEN_TOP_P | float | 1.0 | ≥ 0.0, ≤ 1.0 | Nucleus sampling threshold |
| generation.ollama_num_ctx | OLLAMA_NUM_CTX | int | 8192 | ≥ 2048, ≤ 32768 | Context window for Ollama |
| generation.ollama_request_timeout | OLLAMA_REQUEST_TIMEOUT | int | 300 | ≥ 30, ≤ 1200 | Maximum total time to wait for a local (Ollama) generation request to complete (seconds) |
| generation.ollama_stream_idle_timeout | OLLAMA_STREAM_IDLE_TIMEOUT | int | 60 | ≥ 5, ≤ 300 | Maximum idle time allowed between streamed chunks from local (Ollama) during generation (seconds) |
| generation.ollama_url | OLLAMA_URL | str | "http://127.0.0.1:11434/api" | | Ollama API URL |
| generation.openai_base_url | OPENAI_BASE_URL | str | "" | | OpenAI API base URL override (for proxies) |
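
As a sketch of how the flat env keys above are typically consumed (values are illustrative, not recommendations; whether you set them in a .env file or the process environment depends on your deployment):

```shell
# Minimal generation setup using keys from the table above.
export GEN_BACKEND="openai"     # provider for the primary model
export GEN_MODEL="gpt-4o-mini"  # primary answer-synthesis model
export GEN_MAX_TOKENS=2048      # within the 100-8192 constraint
export GEN_TIMEOUT=60           # seconds; within the 10-300 constraint
```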

Details (glossary)

generation.enrich_backend (ENRICH_BACKEND) — Enrichment Backend

Category: general

Backend service for generating code summaries and enrichment metadata during indexing. Options: "openai" (GPT models, highest quality), "ollama" (local models, free), "mlx" (Apple Silicon optimized). Enrichment adds per-chunk summaries and keywords used by features like cards and improved reranking. Disable to speed up indexing or reduce costs.

Badges: - Optional feature - Increases index time

Links: - MLX on Apple Silicon - Ollama Local Models
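
A hedged sketch of switching enrichment to a local backend (the model name is illustrative):

```shell
# Route enrichment to local Ollama instead of OpenAI.
export ENRICH_BACKEND="ollama"                 # one of: openai | ollama | mlx
export ENRICH_MODEL_OLLAMA="qwen2.5-coder:7b"  # used when the backend is ollama
```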

generation.enrich_disabled (ENRICH_DISABLED) — Disable Enrichment

Category: general

Completely disable code enrichment (summaries, keywords, cards) during indexing (1=disable, 0=enable). When disabled, indexing is much faster and cheaper (no LLM API calls) but you lose card search, enriched metadata, and semantic boosting. Use this for quick re-indexing during development, CI/CD pipelines, or when working with non-code content. Re-enable for production to get full retrieval quality benefits.

Recommended: 0 (enrichment ON) for production, 1 (enrichment OFF) for fast iteration and testing.

Badges: - Much faster indexing - Loses card search
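
The fast-iteration toggle described above, as a sketch:

```shell
# Skip summaries/keywords/cards for a quick dev or CI re-index.
export ENRICH_DISABLED=1   # 1 = disable, 0 = enable
# For production indexing, restore full retrieval quality:
# export ENRICH_DISABLED=0
```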

generation.enrich_model (ENRICH_MODEL) — Enrichment Model

Category: generation

Specific model name for code enrichment when ENRICH_BACKEND is set. For OpenAI: "gpt-4o-mini" (recommended, cheap), "gpt-4o" (higher quality, costly). For Ollama: specify via ENRICH_MODEL_OLLAMA instead. Smaller models (gpt-4o-mini, qwen2.5-coder:7b) balance cost and quality for summaries. Enrichment happens during indexing, not at query time.

Badges: - Affects index cost

Links: - OpenAI Models - Cost Comparison

generation.enrich_model_ollama (ENRICH_MODEL_OLLAMA) — Enrichment Model (Ollama)

Category: generation

Ollama model name for code enrichment when ENRICH_BACKEND=ollama. Recommended: "qwen2.5-coder:7b" (fast, code-focused), "deepseek-coder:6.7b" (excellent code understanding), "codellama:13b" (high quality, slower). The model must be pulled via "ollama pull <model>" before use. Local enrichment is free but slower than cloud APIs.

Badges: - Free (local) - Requires model download

Links: - Ollama Models - Pull Models - Code-Focused Models

generation.gen_backend (GEN_BACKEND) — Generation Backend

Category: generation

Provider backend for the primary generation model (gen_model) and channel overrides (gen_model_cli, gen_model_http, gen_model_mcp). Options: "openai" (GPT models), "anthropic" (Claude models), "ollama" (local models), "mlx" (Apple Silicon optimized), "openrouter" (multi-provider gateway). This controls which API the generation model is routed to. Ollama-specific model overrides (gen_model_ollama, enrich_model_ollama) always use the Ollama backend regardless of this setting. Enrichment uses its own enrich_backend setting.

Links: - OpenAI Models - Anthropic Claude - OpenRouter
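
A sketch of the routing rule above (the Claude model name is illustrative, not a recommendation): Ollama-specific overrides stay pinned to the Ollama backend even when GEN_BACKEND points elsewhere.

```shell
# Route primary generation through Anthropic...
export GEN_BACKEND="anthropic"             # openai | anthropic | ollama | mlx | openrouter
export GEN_MODEL="claude-3-5-sonnet"       # illustrative model name
# ...while this override is always served by Ollama, regardless of GEN_BACKEND:
export GEN_MODEL_OLLAMA="qwen3-coder:30b"
```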

generation.gen_max_tokens (GEN_MAX_TOKENS) — Max Tokens

Category: generation

Maximum number of tokens the LLM can generate in a single response. Higher values allow longer answers but increase cost and latency. Typical: 512-1024 for concise answers, 2048-4096 for detailed explanations.

Links: - OpenAI Token Limits - Token Counting

generation.gen_model (GEN_MODEL) — Generation Model

Category: generation

Primary generation model used for answer synthesis in the retrieval pipeline. This model receives retrieved context and is responsible for final response quality, style, and latency profile.

Choose the model based on your target balance of correctness, speed, and cost. Keep this aligned with provider credentials and endpoint overrides so fallback behavior is predictable.

  • Affects: answer quality, latency, token cost, tool-call behavior
  • Re-evaluate when switching corpus/domain or SLA target

Badges: - Affects latency

Links: - OpenAI Models - Ollama API (GitHub)

generation.gen_model_cli (GEN_MODEL_CLI) — CLI Channel Model

Category: generation

Override GEN_MODEL for CLI chat sessions only. Allows using different models for terminal vs web interface - e.g., faster models for CLI iteration, higher quality for production GUI. Useful for developer workflows where CLI is for quick testing and HTTP is for end users. If not set, uses GEN_MODEL.

Badges: - Channel-specific

generation.gen_model_http (GEN_MODEL_HTTP) — HTTP Channel Model

Category: generation

Override GEN_MODEL specifically for HTTP API requests (GUI, external API calls). Useful for serving different models to different channels - e.g., use gpt-4o for production HTTP but qwen-coder locally. If not set, falls back to GEN_MODEL. Example use case: cheaper models for public API, expensive models for internal tools.

Badges: - Channel-specific

Links: - Model Selection

generation.gen_model_mcp (GEN_MODEL_MCP) — MCP Channel Model

Category: generation

Override GEN_MODEL for MCP tool invocations only. Use a lighter/cheaper model for MCP tools since tool calls are typically simpler than complex reasoning. Example: gpt-4o-mini for MCP, gpt-4o for main chat. Reduces costs when tools are called frequently (search, file operations, etc.). If not set, uses GEN_MODEL.

Badges: - Cost savings - Channel-specific

Links: - Model Pricing
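
The three channel overrides above compose like this (a sketch; model names are illustrative):

```shell
# Per-channel model routing. Any channel without an override falls back to GEN_MODEL.
export GEN_MODEL="gpt-4o-mini"          # default for all channels
export GEN_MODEL_CLI="qwen3-coder:14b"  # quick terminal iteration
export GEN_MODEL_HTTP="gpt-4o"          # higher quality for end users via the API/GUI
export GEN_MODEL_MCP="gpt-4o-mini"      # cheap model for frequent tool calls
```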

generation.gen_model_ollama (GEN_MODEL_OLLAMA) — Generation Model (Ollama)

Category: generation

Local Ollama model override used when generation is routed through Ollama. This allows a local model choice that differs from the cloud/default generation model while keeping the same retrieval flow.

Use explicit model tags (including version/size) so behavior is reproducible across machines. Confirm the model is pulled and compatible with your configured context window/timeouts.

  • Example: qwen3-coder:30b
  • Pair with: OLLAMA_NUM_CTX, OLLAMA_REQUEST_TIMEOUT, OLLAMA_STREAM_IDLE_TIMEOUT
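
The pairing above, as a sketch (timeout and context values are illustrative starting points, not tuned recommendations):

```shell
# Pin an explicit local model tag and the knobs it is paired with.
export GEN_MODEL_OLLAMA="qwen3-coder:30b"  # explicit tag for reproducibility
export OLLAMA_NUM_CTX=16384                # room for retrieval payload + answer
export OLLAMA_REQUEST_TIMEOUT=600          # large local models need headroom
export OLLAMA_STREAM_IDLE_TIMEOUT=120
# The model must be pulled first, e.g.:
# ollama pull qwen3-coder:30b
```
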

generation.gen_retry_max (GEN_RETRY_MAX) — Generation Max Retries

Category: generation

Number of retry attempts for failed LLM API calls due to rate limits, network errors, or transient failures. Higher values improve reliability but increase latency on failures. Typical: 2-3 retries.

Links: - Retry Strategies - Exponential Backoff

generation.gen_temperature (GEN_TEMPERATURE) — Default Response Creativity

Category: generation

Default sampling temperature for generation. Lower values produce more deterministic answers; higher values increase variability and creative paraphrasing.

For technical retrieval QA, start near 0.0-0.3 to reduce hallucinated variation. Increase only when you explicitly want brainstorming-style or stylistically varied outputs.

  • 0.0-0.3: stable, factual, repeatable
  • 0.4-0.8: more diverse phrasing, higher drift risk
  • >0.8: rarely ideal for grounded code retrieval

Links: - Sampling Controls - Nucleus/Top‑p

generation.gen_timeout (GEN_TIMEOUT) — Generation Timeout

Category: generation

Maximum seconds to wait for LLM response before timing out. Prevents hanging on slow models or network issues. Increase for large models or slow connections. Typical: 30-120 seconds.

Links: - Timeout Best Practices - HTTP Timeouts

generation.gen_top_p (GEN_TOP_P) — Top-P (Nucleus Sampling)

Category: generation

Controls randomness via nucleus sampling (0.0-1.0). Lower values (0.1-0.5) make output more focused and deterministic. Higher values (0.9-1.0) increase creativity and diversity. Recommended: 0.9 for general use.

Links: - Nucleus Sampling - Top-P Explanation
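
Combining the two sampling controls for grounded retrieval QA, as a sketch (values follow the guidance above, not a universal recommendation):

```shell
# Deterministic, repeatable answers for technical retrieval QA.
export GEN_TEMPERATURE=0.0  # stable, factual output
export GEN_TOP_P=1.0        # leave nucleus sampling wide when temperature is low
# For more varied phrasing, raise temperature before narrowing top_p:
# export GEN_TEMPERATURE=0.4
```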

generation.ollama_num_ctx (OLLAMA_NUM_CTX) — Ollama Context Window

Category: generation

Context window size requested from Ollama for generation calls. This defines how many tokens of prompt + retrieved context + response budget the model can handle in one request.

Set high enough to fit your retrieval payload and answer target, but avoid unnecessary inflation because larger contexts increase memory usage and can degrade latency.

  • Too low: truncation, missing context, weaker answers
  • Too high: slower inference, higher local resource pressure

generation.ollama_request_timeout (OLLAMA_REQUEST_TIMEOUT) — Local Request Timeout (seconds)

Category: generation

Maximum end-to-end request timeout (seconds) for Ollama generation calls. This is the hard upper bound for waiting on local model inference before failing the request.

Choose based on model size and hardware capability. Large local models can require higher values, but extremely high timeouts hide operational problems and increase user wait time.

  • Lower values: faster failure detection
  • Higher values: tolerate heavy local inference

Links: - Ollama API: Generate - HTTP Timeouts

generation.ollama_stream_idle_timeout (OLLAMA_STREAM_IDLE_TIMEOUT) — Local Stream Idle Timeout (seconds)

Category: generation

Maximum idle time (seconds) allowed between streamed tokens/chunks from Ollama before considering the stream stalled. This protects clients from hanging connections when generation stops mid-response.

Increase only if models legitimately pause for long intervals on your hardware. If idle timeouts trigger often, inspect model load, GPU/CPU saturation, and prompt size.

  • Too low: premature stream cancellation
  • Too high: slower detection of stuck streams

Links: - Streaming Basics - Ollama Streaming
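
The two local timeouts above work together with the general generation timeout; a sketch of a budget for a mid-size local model (values are illustrative):

```shell
export GEN_TIMEOUT=120                # overall LLM wait, within 10-300 s
export OLLAMA_REQUEST_TIMEOUT=600     # hard end-to-end bound for local inference
export OLLAMA_STREAM_IDLE_TIMEOUT=60  # fail fast on stalled streams
```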

generation.ollama_url (OLLAMA_URL) — Ollama URL

Category: generation

Local inference endpoint for Ollama running on your machine (e.g., http://127.0.0.1:11434/api). Used when GEN_MODEL targets a local model like llama2, mistral, qwen, or neural-chat. Requires Ollama installed and running: ollama serve

Links: - Ollama REST API - Ollama Docker Setup - Ollama Model Library
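
A sketch of pointing at a non-default Ollama host (the LAN address is hypothetical):

```shell
# Use a remote GPU box on the LAN instead of localhost.
export OLLAMA_URL="http://192.168.1.50:11434/api"
# Quick reachability check (Ollama's /api/tags lists pulled models):
# curl http://192.168.1.50:11434/api/tags
```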

generation.openai_base_url (OPENAI_BASE_URL) — OpenAI Base URL

Category: generation

ADVANCED: Override the OpenAI API base URL for OpenAI-compatible endpoints. Use cases: local inference servers (LM Studio, vLLM, text-generation-webui), Azure OpenAI (https://YOUR_RESOURCE.openai.azure.com/), proxy services. Default: https://api.openai.com/v1. Useful for development, air-gapped environments, or cost optimization via self-hosted models.

Badges: - Advanced - For compatible endpoints only

Links: - OpenAI API Reference - Azure OpenAI - LM Studio Setup - vLLM Compatibility
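
A sketch of routing OpenAI-compatible traffic to a local server (the port is LM Studio's usual default; the model name is hypothetical and must match whatever the server actually serves):

```shell
export OPENAI_BASE_URL="http://localhost:1234/v1"  # OpenAI-compatible local endpoint
export GEN_BACKEND="openai"                        # reuse the OpenAI-compatible client
export GEN_MODEL="local-model"                     # must match the served model name
```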