Config reference: tokenization
- **Enterprise tuning surface**: defaults and constraints are rendered directly from Pydantic.
- **Env keys when available**: many fields have an env-style alias (from `TriBridConfig.to_flat_dict()`).
- **Tooltip-level guidance**: if a matching glossary entry exists, you'll see deeper tuning notes.
Total parameters: 7
Group index

- (root)

(root)
| JSON key | Env key(s) | Type | Default | Constraints | Summary |
|---|---|---|---|---|---|
| tokenization.estimate_only | — | bool | false | — | If true, use fast approximate token counting. |
| tokenization.hf_tokenizer_name | — | str | "gpt2" | — | HuggingFace tokenizer name (strategy='huggingface'). |
| tokenization.lowercase | — | bool | false | — | Lowercase before tokenization. |
| tokenization.max_tokens_per_chunk_hard | — | int | 8192 | ≥ 256, ≤ 65536 | Absolute hard limit on tokens per chunk (safety ceiling). |
| tokenization.normalize_unicode | — | bool | true | — | Normalize Unicode (NFKC) before tokenization for stability. |
| tokenization.strategy | — | Literal["whitespace", "tiktoken", "huggingface"] | "tiktoken" | allowed: "whitespace", "tiktoken", "huggingface" | Tokenization strategy used for chunking/budgeting. |
| tokenization.tiktoken_encoding | — | str | "o200k_base" | — | tiktoken encoding name (strategy='tiktoken'). |
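To make the defaults and constraints above concrete, here is a minimal stdlib-only sketch of the tokenization group as a validated config object. Field names, defaults, and the range/choice constraints come from the table; the `TokenizationConfig` class itself (and its `__post_init__` validation) is a hypothetical reconstruction, not the project's actual Pydantic model.

```python
from dataclasses import dataclass

# Allowed values for tokenization.strategy, per the table above.
ALLOWED_STRATEGIES = {"whitespace", "tiktoken", "huggingface"}


@dataclass
class TokenizationConfig:
    """Sketch of the tokenization config group (hypothetical class name)."""
    estimate_only: bool = False
    hf_tokenizer_name: str = "gpt2"           # used when strategy='huggingface'
    lowercase: bool = False
    max_tokens_per_chunk_hard: int = 8192     # constrained to [256, 65536]
    normalize_unicode: bool = True            # NFKC normalization before tokenizing
    strategy: str = "tiktoken"
    tiktoken_encoding: str = "o200k_base"     # used when strategy='tiktoken'

    def __post_init__(self) -> None:
        # Enforce the constraints column from the reference table.
        if not (256 <= self.max_tokens_per_chunk_hard <= 65536):
            raise ValueError(
                "max_tokens_per_chunk_hard must be in [256, 65536], "
                f"got {self.max_tokens_per_chunk_hard}"
            )
        if self.strategy not in ALLOWED_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
```

Overriding a field works as usual for dataclasses, e.g. `TokenizationConfig(strategy="whitespace", lowercase=True)`; an out-of-range `max_tokens_per_chunk_hard` raises `ValueError` at construction time.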