Config reference: generation
- Enterprise tuning surface: defaults and constraints are rendered directly from Pydantic.
- Env keys when available: many fields have an env-style alias (from TriBridConfig.to_flat_dict()).
- Tooltip-level guidance: if a matching glossary entry exists, you'll see deeper tuning notes.
Total parameters: 20
Group index
(root)
| JSON key | Env key(s) | Type | Default | Constraints | Summary |
|---|---|---|---|---|---|
generation.enrich_backend | ENRICH_BACKEND | str | "openai" | pattern=^(openai|ollama|mlx)$ | Enrichment backend |
generation.enrich_disabled | ENRICH_DISABLED | int | 0 | ≥ 0, ≤ 1 | Disable code enrichment (1 = disabled) |
generation.enrich_model | ENRICH_MODEL | str | "gpt-4o-mini" | — | Model for code enrichment |
generation.enrich_model_ollama | ENRICH_MODEL_OLLAMA | str | "" | — | Ollama enrichment model |
generation.gen_backend | GEN_BACKEND | str | "openai" | pattern=^(openai|anthropic|ollama|mlx|openrouter)$ | Provider backend for gen_model and channel overrides |
generation.gen_max_tokens | GEN_MAX_TOKENS | int | 2048 | ≥ 100, ≤ 8192 | Max tokens for generation |
generation.gen_model | GEN_MODEL | str | "gpt-4o-mini" | — | Primary generation model |
generation.gen_model_cli | GEN_MODEL_CLI | str | "qwen3-coder:14b" | — | CLI generation model |
generation.gen_model_http | GEN_MODEL_HTTP | str | "" | — | HTTP transport generation model override |
generation.gen_model_mcp | GEN_MODEL_MCP | str | "" | — | MCP transport generation model override |
generation.gen_model_ollama | GEN_MODEL_OLLAMA | str | "qwen3-coder:30b" | — | Ollama generation model |
generation.gen_retry_max | GEN_RETRY_MAX | int | 2 | ≥ 1, ≤ 5 | Max retries for generation |
generation.gen_temperature | GEN_TEMPERATURE | float | 0.0 | ≥ 0.0, ≤ 2.0 | Generation temperature |
generation.gen_timeout | GEN_TIMEOUT | int | 60 | ≥ 10, ≤ 300 | Generation timeout (seconds) |
generation.gen_top_p | GEN_TOP_P | float | 1.0 | ≥ 0.0, ≤ 1.0 | Nucleus sampling threshold |
generation.ollama_num_ctx | OLLAMA_NUM_CTX | int | 8192 | ≥ 2048, ≤ 32768 | Context window for Ollama |
generation.ollama_request_timeout | OLLAMA_REQUEST_TIMEOUT | int | 300 | ≥ 30, ≤ 1200 | Maximum total time to wait for a local (Ollama) generation request to complete (seconds) |
generation.ollama_stream_idle_timeout | OLLAMA_STREAM_IDLE_TIMEOUT | int | 60 | ≥ 5, ≤ 300 | Maximum idle time allowed between streamed chunks from local (Ollama) during generation (seconds) |
generation.ollama_url | OLLAMA_URL | str | "http://127.0.0.1:11434/api" | — | Ollama API URL |
generation.openai_base_url | OPENAI_BASE_URL | str | "" | — | OpenAI API base URL override (for proxies) |
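The env keys above map one-to-one onto the JSON keys via TriBridConfig.to_flat_dict(). As an illustration of how the numeric constraints behave, here is a minimal sketch; the helper name is hypothetical, the bounds are copied from the table, and the real model validates via Pydantic (rejecting out-of-range values) rather than clamping as this sketch does:

```python
# Hypothetical helper: clamp env-style overrides into the documented ranges.
# The actual TriBridConfig validates via Pydantic and rejects out-of-range
# values instead of clamping; bounds below are copied from the table above.
BOUNDS = {
    "GEN_MAX_TOKENS": (100, 8192),
    "GEN_TEMPERATURE": (0.0, 2.0),
    "GEN_TIMEOUT": (10, 300),
    "OLLAMA_NUM_CTX": (2048, 32768),
}

def load_env_overrides(env):
    """Parse env-style string overrides and clamp them into their ranges."""
    out = {}
    for key, (lo, hi) in BOUNDS.items():
        raw = env.get(key)
        if raw is None:
            continue
        val = type(lo)(float(raw))  # int bounds -> int, float bounds -> float
        out[key] = min(max(val, lo), hi)
    return out
```

For example, `load_env_overrides({"GEN_MAX_TOKENS": "9999"})` returns `{"GEN_MAX_TOKENS": 8192}`, the table's upper bound.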
Details (glossary)
generation.enrich_backend (ENRICH_BACKEND) — Enrichment Backend
Category: general
This chooses the runtime that generates enrichment metadata during indexing, such as chunk summaries, tags, and semantic hints. Backend choice changes quality, latency, cost, privacy posture, and operational complexity, so it can materially alter downstream retrieval and reranking behavior. Hosted backends generally reduce ops burden and may provide stronger quality, while local backends can improve data control and predictable marginal cost. Treat backend changes like model migrations: version prompts and settings, then rerun evaluation before production rollout. Do not assume enrichment outputs are interchangeable across backends.
Badges: Index pipeline
Links:
- Meta-RAG on Large Codebases Using Code Summarization (arXiv 2025)
- openai-python API Reference
- Ollama API Docs
- MLX Repository
generation.enrich_disabled (ENRICH_DISABLED) — Disable Enrichment
Category: general
This switch disables enrichment generation entirely during indexing. It is useful for fast iteration, low-cost development cycles, and emergency backfills where raw embedding retrieval is acceptable. The cost of disabling is reduced semantic metadata for reranking, cards, and explanatory UX features, which can lower answer quality on abstract or architecture-level questions. Use it intentionally and record when it is active so benchmark comparisons remain meaningful. A common pattern is enrichment disabled for local loops and enabled for production-grade index builds.
Badges: Faster indexing
Links:
- Not All Tokens Matter: Efficient Code Summarization (arXiv 2026)
- Ollama README
- openai-python API Reference
- MLX Repository
generation.enrich_model (ENRICH_MODEL) — Enrichment Model
Category: generation
This selects the exact model used by the configured enrichment backend. It is the main lever on the quality versus cost versus throughput tradeoff for generated summaries and keywords. Higher-capability models can improve semantic signals for reranking and explanation quality, while lighter models reduce expense and indexing time. Even without changing embeddings, enrichment model swaps can shift retrieval outcomes, so they should be benchmarked and version-controlled. Pin model ids and evaluate outputs on representative repositories before adopting changes in production pipelines.
Badges: Affects quality/cost
Links:
- Code vs Serialized AST Inputs for Code Summarization (arXiv 2026)
- EyeLayer: Human Attention for Code Summarization (arXiv 2026)
- openai-python API Reference
- Ollama API Docs
generation.enrich_model_ollama (ENRICH_MODEL_OLLAMA) — Enrichment Model (Ollama)
Category: generation
Selects the local Ollama model used for enrichment steps such as code-card expansion, metadata extraction, and structure-aware summaries before retrieval. This choice is a quality versus latency tradeoff: larger coder models usually produce richer symbols and relationships, while smaller models reduce indexing time and hardware pressure. Keep the selected model pinned to an explicit tag so enrichment output stays reproducible across rebuilds. In production, validate the model on a fixed enrichment sample set and monitor drift in extracted fields after model upgrades.
Badges: Local model tuning
Links:
- rStar-Coder (arXiv)
- Ollama Documentation
- Ollama Model Library
- Ollama Quickstart
generation.gen_backend (GEN_BACKEND) — Generation Backend
Category: generation
Generation backend selects the provider stack that executes model calls, which affects auth, parameter semantics, rate limits, timeouts, and tool-calling behavior. Treat backend choice as an operational contract, not a cosmetic model switch. In RAG systems, keep backend-specific defaults normalized so output length, safety behavior, and citation style stay predictable across providers. If you support multiple backends, define a deterministic fallback order and record backend metadata in logs for incident triage. Backend heterogeneity without observability is a common source of inconsistent answer quality.
Badges: Provider routing
Links:
- Universal Model Routing for Efficient LLM Inference (arXiv 2025)
- OpenAI Python SDK
- Anthropic Claude Models
- Ollama API Reference
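The deterministic fallback order recommended above can be sketched as a tiny routing helper (names are hypothetical; how backend health is determined is out of scope here):

```python
def pick_backend(preferred, healthy):
    """Return the first healthy backend in the configured order.

    Raises instead of guessing, so callers fail loudly rather than
    silently routing to an unexpected provider during an incident.
    """
    for backend in preferred:
        if backend in healthy:
            return backend
    raise RuntimeError(f"no healthy generation backend among {preferred!r}")
```

With `preferred=["openai", "anthropic", "ollama"]` and only the latter two healthy, this returns `"anthropic"` every time, which is exactly the predictability the log-and-triage advice depends on.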
generation.gen_max_tokens (GEN_MAX_TOKENS) — Max Tokens
Category: generation
This is the upper bound on generated output length per request. In RAG, it directly controls cost and latency, but also determines whether answers can include full reasoning, citations, and edge-case handling without truncation. Set defaults by task class instead of one global value, then enforce stricter caps on interactive channels to protect tail latency. Pair this with context packing and answer format constraints so tokens are spent on grounded content rather than repetition. Monitor both truncation frequency and response quality, because either metric alone can hide a bad token budget.
Badges: Cost and latency
Links:
- TimeBill: Time-Budgeted Inference for LLMs (arXiv 2025)
- Anthropic Messages API
- Gemini Token Counting
- OpenAI Cookbook: Count Tokens with tiktoken
generation.gen_model (GEN_MODEL) — Generation Model
Category: generation
This is the primary model used to synthesize answers from retrieved context, and it dominates quality, latency, and cost behavior. Choose it with workload-specific evaluation sets, not leaderboard intuition, because retrieval quality and prompt structure can change model rankings. Version model IDs explicitly so experiments are reproducible and regressions can be traced. Re-evaluate whenever provider releases shift default behavior, even if API names stay stable. Good retrieval can still underperform if the generation model is misaligned with your task style and response requirements.
Badges: Primary quality lever
Links:
- Lookahead Routing for Large Language Models (arXiv 2025)
- OpenAI Python SDK
- Anthropic Claude Models
- OpenRouter Provider Selection
generation.gen_model_cli (GEN_MODEL_CLI) — CLI Channel Model
Category: generation
This override selects a model specifically for CLI sessions, which are usually iterative and speed-sensitive. Using a smaller or local model here can improve developer feedback loops while keeping production channels on a higher-capability model. Keep retrieval stack and system prompts aligned across channels so CLI debugging reflects real behavior. Log the active CLI model in run metadata to make test results reproducible. Use this for workflow optimization, not as an untracked fork of application behavior.
Badges: Developer workflow
Links:
- Universal Model Routing for Efficient LLM Inference (arXiv 2025)
- Ollama Quickstart
- LiteLLM Router
- OpenAI Python SDK
generation.gen_model_http (GEN_MODEL_HTTP) — HTTP Channel Model
Category: generation
This override controls model selection for HTTP/API traffic, where SLOs, concurrency, and cost controls are usually stricter than interactive internal use. It enables channel-specific governance, such as serving public endpoints with stable low-variance models while reserving premium models for internal workflows. Treat changes here as API behavior changes and validate with canary rollouts. Align timeout and retry policies to the chosen model because latency profile varies significantly by provider and model class. Clear fallback order prevents unpredictable responses during upstream incidents.
Badges: API channel
Links:
- Lookahead Routing for Large Language Models (arXiv 2025)
- Anthropic Messages API
- LiteLLM Router
- OpenRouter Provider Selection
generation.gen_model_mcp (GEN_MODEL_MCP) — MCP Channel Model
Category: generation
This override applies to MCP tool-invocation paths, where requests are structured and often latency-sensitive. A lighter model can be sufficient for tool selection and argument construction, reducing spend without degrading end-to-end quality. Prioritize schema adherence and tool-call reliability over open-ended generation fluency in this channel. Validate with tool-call success rate, argument validity, and recovery behavior after tool errors. If tool use regresses while chat quality remains stable, this override is the first place to inspect.
Badges: Tool channel
Links:
- INFERENCEDYNAMICS: Efficient Routing Across LLMs (arXiv 2025)
- Model Context Protocol Introduction
- Model Context Protocol Specification (2025-06-18)
- MCP Transport Specification
generation.gen_model_ollama (GEN_MODEL_OLLAMA) — Generation Model (Ollama)
Category: generation
GEN_MODEL_OLLAMA selects the concrete local model tag used when generation is routed through Ollama instead of a hosted provider. In this configuration family, the default is qwen3-coder:30b, and changing it directly affects response quality, latency, memory pressure, and token-context behavior for all generation calls that use the Ollama path. Use explicit model tags and keep them consistent across environments so evaluation runs remain reproducible and regressions can be traced to model changes rather than retrieval or prompt drift. When updating this value, validate compatibility with your configured context and timeout settings before promoting to shared environments.
Links:
- Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch (arXiv)
- Ollama API Reference (repo docs)
- Ollama Modelfile Reference
- Ollama Context Length Guide
generation.gen_retry_max (GEN_RETRY_MAX) — Generation Max Retries
Category: generation
This sets how many times generation requests are retried after transient failures such as rate limits or temporary backend faults. Higher values can improve success rate but also increase latency and amplify traffic during outages if backoff is weak. Use bounded retries with exponential backoff and jitter, and track request IDs to avoid accidental duplicate side effects. Interactive channels usually need fewer retries than background jobs. If retries are frequent but final success stays low, reduce retries and fix timeout, routing, or provider health first.
Badges: Resilience
Links:
- KevlarFlow: Resiliency in LLM Serving (arXiv 2026)
- OpenAI Cookbook: Handle Rate Limits
- Anthropic API Errors
- LiteLLM Reliability and Fallbacks
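"Bounded retries with exponential backoff and jitter" can be sketched as a delay schedule; the helper name and the base/cap values are illustrative assumptions, not project defaults:

```python
import random

def backoff_schedule(retry_max, base=0.5, cap=30.0, rng=None):
    """Delays for up to `retry_max` retries: exponential growth, capped,
    with full jitter so synchronized clients don't retry in lockstep."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * (2 ** attempt)))
            for attempt in range(retry_max)]
```

With the default GEN_RETRY_MAX of 2, this yields two delays drawn uniformly from [0, 0.5] and [0, 1.0] seconds, so even a burst of simultaneous failures spreads its retries out rather than hammering the backend at the same instant.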
generation.gen_temperature (GEN_TEMPERATURE) — Default Response Creativity
Category: generation
Temperature controls sampling randomness. In retrieval-grounded QA, lower values usually improve consistency and factual stability, while higher values increase stylistic variation and drift risk. Keep defaults low for technical explanations, debugging steps, and config guidance where repeatability matters. Raise it only for explicitly creative tasks and monitor variance across repeated runs of the same query. If answer facts change across retries with identical context, temperature is likely set too high for your use case.
Badges: Sampling control
Links:
- Learning Temperature Policy from LLM Internal States (arXiv 2026)
- Anthropic Prompt Engineering: Use Temperature
- Hugging Face Text Generation Parameters
- OpenAI Cookbook: Formatting Chat Inputs
generation.gen_timeout (GEN_TIMEOUT) — Generation Timeout
Category: generation
Timeout sets the maximum wait for generation before the request is aborted. This is a reliability boundary that protects workers and users during provider slowdowns; too low causes false failures, too high causes queue buildup and cascading retries. Tune it by model class and expected output length, then enforce stricter limits for interactive paths. Combine timeout with retry policy so slow requests do not create retry storms. Rising timeout rates usually indicate context bloat, backend saturation, or routing misconfiguration rather than a need for unlimited timeout.
Badges: SLO guardrail
Links:
- KevlarFlow: Resiliency in LLM Serving (arXiv 2026)
- LiteLLM Timeout Controls
- LiteLLM Reliability and Fallbacks
- Anthropic API Errors
generation.gen_top_p (GEN_TOP_P) — Top-P (Nucleus Sampling)
Category: generation
Top-p applies nucleus sampling by limiting choices to the smallest token set whose cumulative probability reaches p. Lower values narrow the candidate set and improve determinism, while higher values increase lexical diversity. In RAG answers, top-p is usually tuned with temperature; high values for both can increase hallucination risk even with good retrieval context. Keep top-p conservative for technical and policy-sensitive responses. When troubleshooting unstable outputs, reduce top-p before redesigning prompts so you isolate sampling entropy effects first.
Badges: Sampling control
Links:
- Top-H Decoding: Bounded Entropy Text Generation (arXiv 2025)
- Hugging Face Text Generation Parameters
- Anthropic Messages API
- OpenAI Cookbook: Formatting Chat Inputs
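The nucleus-sampling definition above ("smallest token set whose cumulative probability reaches p") can be written out directly; this is a minimal reference sketch, not any provider's actual decoder:

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; return (token_index, renormalized_prob) pairs."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)  # renormalize over the surviving set
    return [(idx, p / total) for idx, p in kept]
```

For probabilities `[0.5, 0.3, 0.15, 0.05]` and `top_p=0.8`, only the first two tokens survive, renormalized to 0.625 and 0.375 — which is why lowering top-p narrows the candidate set and improves determinism.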
generation.ollama_num_ctx (OLLAMA_NUM_CTX) — Ollama Context Window
Category: generation
Ollama num_ctx sets the maximum context tokens used per generation request, directly affecting both quality headroom and memory footprint. Higher values let you include more retrieved evidence and longer instructions, but they increase KV-cache pressure and can reduce throughput on constrained hardware. Tune this parameter from measured prompt composition (system + query + retrieved chunks + expected answer) rather than guessing. If requests regularly approach the ceiling, improve chunk selection and compression before simply raising num_ctx, because blindly increasing window size can destabilize latency.
Links:
- ParisKV: KV-Cache Compression for Long-Context LLMs (arXiv 2026)
- Ollama Context Length
- Ollama Modelfile Reference
- Ollama FAQ
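Tuning "from measured prompt composition" can be made concrete with a small budget check; the token counts in the usage note are illustrative (a real tokenizer would supply them), and the helper names are hypothetical:

```python
def fits_context(num_ctx, system_tokens, query_tokens, chunk_tokens, answer_budget):
    """Does system + query + retrieved chunks + expected answer fit num_ctx?"""
    used = system_tokens + query_tokens + sum(chunk_tokens) + answer_budget
    return used <= num_ctx

def max_chunks(num_ctx, system_tokens, query_tokens, chunk_tokens, answer_budget):
    """How many retrieved chunks (taken in rank order) fit the remaining budget."""
    remaining = num_ctx - system_tokens - query_tokens - answer_budget
    n = 0
    for t in chunk_tokens:
        if remaining - t < 0:
            break
        remaining -= t
        n += 1
    return n
```

At the default `num_ctx=8192`, with a 600-token system prompt, a 200-token query, and a 2048-token answer budget, only three 1500-token chunks fit — making explicit why better chunk selection often beats simply raising num_ctx.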
generation.ollama_request_timeout (OLLAMA_REQUEST_TIMEOUT) — Local Request Timeout (seconds)
Category: generation
Ollama request timeout is the hard client-side budget for a full generation call, including model warmup, first-token delay, and decode time. Set it too low and valid long-context responses fail prematurely; set it too high and stalled calls tie up worker capacity and degrade system responsiveness. Calibrate timeout from observed p95/p99 latency per model and hardware profile, and revisit after model swaps or context-window changes. Use streaming plus clear retry/abort policy so timeout behavior remains predictable during spikes.
Links:
- Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse (arXiv 2025)
- Ollama Generate API
- Ollama Streaming API
- vLLM Multi-LoRA Serving and Latency Metrics (2026)
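Calibrating the timeout from observed p95/p99 latency needs only the stdlib; the nearest-rank percentile below is one standard convention, and the 1.5x headroom factor is an assumption for illustration, not a project default:

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile (q in (0, 100]) of observed latencies."""
    s = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(s)))
    return s[rank - 1]

def suggest_timeout(latencies_s, headroom=1.5, lo=30, hi=1200):
    """Timeout = p95 latency plus headroom, clamped to the field's bounds."""
    p95 = percentile(latencies_s, 95)
    return int(min(hi, max(lo, p95 * headroom)))
```

Re-run this after model swaps or context-window changes, since both shift the latency distribution the suggestion is derived from.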
generation.ollama_stream_idle_timeout (OLLAMA_STREAM_IDLE_TIMEOUT) — Local Stream Idle Timeout (seconds)
Category: generation
Maximum allowed silent gap (in seconds) between streamed tokens/chunks from the local Ollama endpoint before Crucible treats the response as stalled and aborts the request. This protects the UI from hanging indefinitely when a socket is half-open or a backend worker dies mid-generation. Set it too low and you will cut off valid long-prefill responses on larger context windows; set it too high and users wait too long on dead streams. Tune this together with network proxy timeouts and model size so cancellation behavior is fast but not trigger-happy.
Links:
- Rethinking Latency DoS in KV Cache Systems (arXiv 2026)
- Ollama Streaming API
- Nginx proxy_read_timeout
- Node.js AbortSignal.timeout()
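The idle-gap check can be sketched as a wrapper around a synchronous chunk iterator. Note the limitation: this version only notices a stall after the blocking read returns, so production code would pair it with an async read under a deadline (e.g. `asyncio.wait_for`) to cancel a dead socket mid-wait. The helper name is hypothetical:

```python
import time

def iter_with_idle_timeout(chunks, idle_timeout_s):
    """Yield streamed chunks, raising TimeoutError if the gap between
    consecutive chunks exceeds idle_timeout_s seconds."""
    it = iter(chunks)
    while True:
        started = time.monotonic()
        try:
            chunk = next(it)  # blocks until the next chunk arrives
        except StopIteration:
            return
        if time.monotonic() - started > idle_timeout_s:
            raise TimeoutError(
                f"stream idle for more than {idle_timeout_s}s; treating as stalled")
        yield chunk
```

Because the first gap is measured too, a very slow prefill on a large context window can trip the same check — which is exactly the "too low cuts off valid responses" failure mode described above.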
generation.ollama_url (OLLAMA_URL) — Ollama URL
Category: generation
Base URL Crucible uses to reach Ollama over HTTP for local-model inference (for example, endpoints such as /api/chat, /api/generate, and /api/tags). This value controls where requests are sent from the app backend, so incorrect hostnames, ports, or path prefixes cause immediate model-routing failures. For remote or containerized setups, use a reachable address from the process that runs Crucible, not just from your browser. When multiple OpenAI-compatible backends are in play, keep this endpoint explicit so provider routing and debugging stay deterministic.
Links:
- FlyingServing: Scalable and Fault-Tolerant LLM Serving (arXiv 2026)
- Ollama API Reference
- Ollama Model Tags Endpoint
- vLLM OpenAI-Compatible Server
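Since the default value already ends in `/api`, a common failure mode is duplicating or dropping that prefix when composing endpoint paths. A defensive join might look like this (hypothetical helper; whether Crucible normalizes the URL this way is an assumption):

```python
def ollama_endpoint(base_url, name):
    """Join an endpoint name onto OLLAMA_URL without duplicating or
    dropping the /api prefix, regardless of trailing slashes."""
    base = base_url.rstrip("/")
    if not base.endswith("/api"):
        base += "/api"
    return f"{base}/{name.lstrip('/')}"
```

So both `"http://127.0.0.1:11434/api"` and a bare `"http://ollama:11434"` resolve `generate` to a single, correct `/api/generate` path, keeping provider routing deterministic across deployment styles.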
generation.openai_base_url (OPENAI_BASE_URL) — OpenAI Base URL
Category: generation
Advanced endpoint override for OpenAI-compatible APIs. Use this when routing Crucible through alternative backends such as Azure-hosted deployments, vLLM, or internal gateway proxies that implement OpenAI-style request/response contracts. Base URL mismatches are a common root cause of 404/401 errors because SDK path composition, versioning, and auth headers differ across providers. Keep this setting paired with explicit model/provider assignments so you can trace which endpoint handled each request during failures.
Badges: Advanced, For compatible endpoints only
Links:
- LMCache: Optimizing LLM Caching and Routing (arXiv 2025)
- OpenAI Node SDK
- vLLM OpenAI-Compatible Server
- Azure OpenAI API Version and Endpoint Guidance
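A quick pre-flight check catches the 404/401 failure mode described above before any request is sent. The heuristics are illustrative, not exhaustive, and provider-specific path rules (e.g. Azure API versions) still apply; the helper name is hypothetical:

```python
from urllib.parse import urlsplit

def check_openai_base_url(url):
    """Return a list of likely base-URL misconfigurations for an
    OpenAI-compatible endpoint; an empty list means no obvious problem."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("missing or non-HTTP scheme")
    if not parts.netloc:
        problems.append("missing host")
    if parts.path.rstrip("/").endswith("/chat/completions"):
        problems.append("base URL should not include the request path")
    if parts.query or parts.fragment:
        problems.append("base URL should not carry a query string or fragment")
    return problems
```

For example, `https://gateway.internal/v1` passes cleanly, while pasting a full request URL such as `https://gw/v1/chat/completions` is flagged, since the SDK appends the request path itself and would produce a doubled path and a 404.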