Config reference: chunking
- Enterprise tuning surface: defaults and constraints are rendered directly from Pydantic.
- Env keys when available: many fields have an env-style alias (from TriBridConfig.to_flat_dict()).
- Tooltip-level guidance: if a matching glossary entry exists, you’ll see deeper tuning notes.
Total parameters: 18
Group index
(root)
| JSON key | Env key(s) | Type | Default | Constraints | Summary |
|---|---|---|---|---|---|
| chunking.ast_overlap_lines | AST_OVERLAP_LINES | int | 20 | ≥ 0, ≤ 100 | Overlap lines for AST chunking |
| chunking.chunk_overlap | CHUNK_OVERLAP | int | 200 | ≥ 0, ≤ 1000 | Overlap between chunks |
| chunking.chunk_size | CHUNK_SIZE | int | 1000 | ≥ 200, ≤ 5000 | Target chunk size (non-whitespace chars) |
| chunking.chunking_strategy | CHUNKING_STRATEGY | str | "ast" | pattern=`^(ast\|hybrid\|greedy\|fixed_chars\|fixed_tokens\|recursive\|markdown\|sentence\|qa_blocks\|semantic)$` | Chunking strategy (document + code) |
| chunking.emit_chunk_ordinal | — | bool | true | — | Emit chunk ordinal metadata for neighbor-window retrieval |
| chunking.emit_parent_doc_id | — | bool | true | — | Emit parent document id metadata for neighbor-window retrieval |
| chunking.greedy_fallback_target | GREEDY_FALLBACK_TARGET | int | 800 | ≥ 200, ≤ 2000 | Target size for greedy chunking |
| chunking.markdown_include_code_fences | — | bool | true | — | Whether to include fenced code blocks in markdown sections |
| chunking.markdown_max_heading_level | — | int | 4 | ≥ 1, ≤ 6 | Max heading level to split on for markdown chunking |
| chunking.max_chunk_tokens | MAX_CHUNK_TOKENS | int | 8000 | ≥ 100, ≤ 32000 | Maximum tokens per chunk; chunks exceeding this are split recursively |
| chunking.max_indexable_file_size | MAX_INDEXABLE_FILE_SIZE | int | 250000000 | ≥ 10000, ≤ 2000000000 | Max file size to index (bytes); larger files are skipped |
| chunking.min_chunk_chars | MIN_CHUNK_CHARS | int | 50 | ≥ 10, ≤ 500 | Minimum chunk size |
| chunking.overlap_tokens | — | int | 64 | ≥ 0, ≤ 2048 | Token overlap between chunks (token-based strategies) |
| chunking.preserve_imports | PRESERVE_IMPORTS | int | 1 | ≥ 0, ≤ 1 | Include imports in chunks |
| chunking.recursive_max_depth | — | int | 10 | ≥ 1, ≤ 50 | Max recursion depth for recursive chunking |
| chunking.separator_keep | — | Literal["none", "prefix", "suffix"] | "suffix" | allowed: "none", "prefix", "suffix" | Whether to keep separators when splitting (recursive strategy) |
| chunking.separators | — | list[str] | ["\n\n", "\n", ". ", " ", ""] | — | Separators for recursive chunking, in priority order |
| chunking.target_tokens | — | int | 512 | ≥ 64, ≤ 8192 | Target tokens per chunk (token-based strategies) |
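Read together, the JSON-key and env-key columns suggest a simple flattening rule; here is a minimal sketch of that rule, assuming the alias is just the last key segment uppercased (the actual TriBridConfig.to_flat_dict() may apply additional rules):

```python
def to_env_key(json_key: str) -> str:
    """Flattening rule implied by the table above (a sketch, not the
    real TriBridConfig.to_flat_dict() implementation)."""
    return json_key.split(".")[-1].upper()

to_env_key("chunking.chunk_size")         # "CHUNK_SIZE"
to_env_key("chunking.ast_overlap_lines")  # "AST_OVERLAP_LINES"
```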
Details (glossary)
chunking.ast_overlap_lines (AST_OVERLAP_LINES) — AST Overlap Lines
Category: chunking
Number of overlapping lines between consecutive AST-based code chunks. Overlap ensures context continuity across chunk boundaries, preventing loss of meaning when functions or classes are split. Higher overlap (5-15 lines) improves retrieval quality by providing more context but increases index size and duplicate content. Lower overlap (0-5 lines) reduces redundancy but risks fragmenting logical units.
Sweet spot: 3-5 lines for balanced context preservation. Use 5-10 lines for codebases with large functions or complex nested structures where context matters heavily. Use 0-2 lines for memory-constrained environments or when chunk boundaries align well with natural code structure (e.g., clean function boundaries). AST-aware chunking (cAST method) respects syntax boundaries, so overlap supplements structural chunking. Note: the rendered default here (20 lines) sits above this typical band; treat the table as the source of truth for the shipped default and the ranges below as tuning guidance.
Example: With 5-line overlap, if chunk 1 ends at line 100, chunk 2 starts at line 96, creating a 5-line bridge. This helps when a query matches content near chunk boundaries - the overlapping region appears in both chunks, improving recall. The cAST paper (EMNLP 2025) shows overlap significantly improves code retrieval accuracy.
- Range: 0-15 lines (typical)
- Minimal: 0-2 lines (tight memory, clean boundaries)
- Balanced: 3-5 lines (recommended for most codebases)
- High context: 5-10 lines (complex nested code)
- Very high: 10-15 lines (maximum context, high redundancy)
- Trade-off: more overlap = better recall, larger index
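The boundary-bridging behaviour in the example above can be sketched with a hypothetical fixed-window line chunker (the real AST chunker cuts at syntax boundaries, not fixed line counts):

```python
def chunk_lines(lines, chunk_size=100, overlap=5):
    """Split lines into fixed-size chunks where each chunk repeats the
    last `overlap` lines of its predecessor (illustrative helper only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(lines):
        chunks.append(lines[start:start + chunk_size])
        start += step
    return chunks

source = [f"line {i}" for i in range(1, 201)]
chunks = chunk_lines(source)
# chunk 1 spans lines 1-100; chunk 2 starts at line 96, the 5-line
# bridge described in the example above
```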
Badges: - Advanced chunking - Requires reindex
Links: - cAST Chunking Paper (EMNLP 2025) - AST Chunking Toolkit - Context Window in RAG
chunking.chunk_overlap (CHUNK_OVERLAP) — Chunk Overlap
Category: chunking
Number of characters overlapped between adjacent chunks. Overlap reduces boundary effects and improves recall at the cost of a larger index and slower indexing.
Links: - LangChain: Text Splitters
chunking.chunk_size (CHUNK_SIZE) — Chunk Size
Category: chunking
Target size (in characters) for each indexed chunk. For AST chunking this acts as a guardrail when nodes are large. Larger chunks preserve more context but reduce recall; smaller chunks improve recall but may fragment semantics.
Badges: - Affects recall/precision
Links: - LangChain: Text Splitters - Okapi BM25 (context windows)
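Taken together with chunking.chunk_overlap, the size/overlap interplay can be illustrated by a plain sliding-window character splitter (a sketch only; the real indexer counts non-whitespace characters and respects AST boundaries):

```python
def split_chars(text, chunk_size=1000, chunk_overlap=200):
    """Sliding-window character splitter using the two defaults above
    (illustrative, not the indexer's actual splitter)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

pieces = split_chars("x" * 2500)
# step = 800, so windows start at 0, 800, 1600, 2400 -> 4 chunks
```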
chunking.chunking_strategy (CHUNKING_STRATEGY) — Chunking Strategy
Category: chunking
Primary strategy for splitting content into chunks during indexing. The core code-oriented options are "ast" (AST-aware, syntax-respecting, recommended for code), "greedy" (line-based splitting, simpler), and "hybrid" (AST with greedy fallback); per the constraint pattern above, the field also accepts document-oriented strategies (fixed_chars, fixed_tokens, recursive, markdown, sentence, qa_blocks, semantic). AST chunking uses the cAST method (EMNLP 2025) to respect function/class boundaries, preserving semantic units. Greedy chunking splits at line breaks to hit a target size, ignoring syntax. Hybrid uses AST primarily with a greedy fallback for unparseable files.
"ast" (recommended for code): Parses syntax tree and chunks at natural boundaries (functions, classes, methods). Produces semantically coherent chunks. Best for code retrieval. Requires parseable syntax - fails gracefully on malformed code.
"greedy": Simple line-based splitting at target character count. Fast, always works, but may split mid-function or mid-class, fragmenting semantic units. Use for non-code (markdown, text) or when AST parsing is too slow.
"hybrid": Tries AST first, falls back to greedy on parse errors. Balanced approach - gets AST benefits for well-formed code, handles edge cases gracefully. Recommended for mixed codebases (code + docs + config).
- ast: syntax-aware, best retrieval quality, code-only, requires parseable syntax (recommended for code)
- greedy: fast, always works, ignores syntax, lower-quality chunks, good for non-code
- hybrid: AST + greedy fallback, balanced, handles all files (recommended for mixed repos)
- Effect: fundamental impact on chunk quality, retrieval precision, index structure
- Requires reindex: changes take effect after full rebuild
Badges: - Core indexing choice - Requires reindex
Links: - cAST Chunking Paper (EMNLP 2025) - AST Chunking Toolkit - RAG Chunking Best Practices
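The hybrid dispatch described above can be sketched in a few lines, here using Python's own ast module with "AST chunking" stubbed as one chunk per top-level node (chunk_greedy is a simplified stand-in; the real chunkers are considerably more sophisticated):

```python
import ast

def chunk_greedy(source, target=800):
    """Line-based fallback: accumulate lines until roughly `target`
    characters (stand-in for the real greedy chunker)."""
    chunks, buf, size = [], [], 0
    for line in source.splitlines(keepends=True):
        buf.append(line)
        size += len(line)
        if size >= target:
            chunks.append("".join(buf))
            buf, size = [], 0
    if buf:
        chunks.append("".join(buf))
    return chunks

def chunk_hybrid(source, target=800):
    """Try AST first; fall back to greedy on parse errors."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return chunk_greedy(source, target)
    return [ast.get_source_segment(source, node) for node in tree.body]

parts = chunk_hybrid("def a():\n    return 1\n\ndef b():\n    return 2\n")
# one chunk per top-level function when the file parses cleanly
```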
chunking.greedy_fallback_target (GREEDY_FALLBACK_TARGET) — Greedy Fallback Target (Chars)
Category: general
Target chunk size (in characters) for greedy fallback chunking when AST-based chunking fails or encounters oversized logical units. Greedy chunking splits text at line boundaries to hit this approximate size. Used as a safety mechanism when: (1) file syntax is unparseable, (2) a single function/class exceeds MAX_CHUNK_TOKENS, (3) non-code files (markdown, text) are indexed.
Sweet spot: 500-800 characters for fallback chunks. This roughly corresponds to 100-150 tokens, providing reasonable context when AST chunking isn't possible. Use 800-1200 for larger fallback chunks (more context but less precise boundaries). Use 300-500 for smaller fallback chunks (tighter boundaries, less context). Greedy chunking is less semantic than AST chunking - it splits at line breaks regardless of code structure.
Example: If a 3000-char function exceeds the chunk size limit and can't be split structurally, greedy fallback divides it into ~4 chunks of ~750 chars each (based on GREEDY_FALLBACK_TARGET=800). This preserves part of the function in each chunk. Greedy fallback is rare in well-formed code but essential for robustness.
- Range: 300-1500 characters (typical)
- Small: 300-500 chars (tight boundaries, less context)
- Balanced: 500-800 chars (recommended, ~100-150 tokens)
- Large: 800-1200 chars (more context per fallback chunk)
- Very large: 1200-1500 chars (maximum context, rare use)
- When used: syntax errors, oversized units, non-code files
Badges: - Fallback mechanism - Requires reindex
Links: - Chunking Robustness - Greedy Chunking
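The arithmetic in the example above works out as follows:

```python
import math

# Worked arithmetic from the example above: a 3000-char unit split
# greedily against GREEDY_FALLBACK_TARGET=800.
unit_len, target = 3000, 800
n_chunks = math.ceil(unit_len / target)   # 4 chunks
approx_size = unit_len / n_chunks         # 750.0 chars per chunk
```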
chunking.max_chunk_tokens (MAX_CHUNK_TOKENS) — Max Chunk Tokens
Category: chunking
Maximum token length for a single code chunk during AST-based chunking. Limits chunk size to fit within embedding model token limits (typically 512-8192 tokens). Larger chunks (1000-2000 tokens) capture more context per chunk, reducing fragmentation of large functions/classes. Smaller chunks (200-512 tokens) create more granular units, improving precision but potentially losing broader context.
Sweet spot: 512-768 tokens for balanced chunking. This fits most embedding models (e.g., OpenAI text-embedding-3 supports up to 8191 tokens, but 512-768 is practical). Use 768-1024 for code with large docstrings or complex classes where context matters. Use 256-512 for tight memory budgets or when targeting very specific code snippets. AST chunking respects syntax, so chunks won't split mid-function even if size limit is hit (falls back to greedy chunking).
Token count is approximate (based on whitespace heuristics, not exact tokenization). Actual embedding input may vary slightly. If a logical unit (function, class) exceeds MAX_CHUNK_TOKENS, the chunker splits it using GREEDY_FALLBACK_TARGET for sub-chunking while preserving structure where possible.
- Range: 200-2000 tokens (typical)
- Small: 256-512 tokens (precision, tight memory)
- Balanced: 512-768 tokens (recommended, fits most models)
- Large: 768-1024 tokens (more context, larger functions)
- Very large: 1024-2000 tokens (maximum context, risky for some models)
- Constraint: must not exceed the embedding model's token limit
Badges: - Advanced chunking - Requires reindex
Links: - Token Limits by Model - cAST Paper - Chunking Size Tradeoffs - Token Estimation
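The approximate token counting and greedy sub-chunking described above might look like this (estimate_tokens and enforce_max_tokens are hypothetical names; the real chunker preserves structure where it can):

```python
def estimate_tokens(text):
    """Whitespace-split token estimate, mirroring the approximate
    counting described above (not a real tokenizer)."""
    return len(text.split())

def enforce_max_tokens(chunk, max_chunk_tokens=8000, greedy_fallback_target=800):
    """Sub-chunking sketch: chunks whose estimated token count exceeds
    the limit are cut into greedy character slices."""
    if estimate_tokens(chunk) <= max_chunk_tokens:
        return [chunk]
    step = greedy_fallback_target
    return [chunk[i:i + step] for i in range(0, len(chunk), step)]

oversized = "word " * 10_000              # ~10,000 estimated tokens
pieces = enforce_max_tokens(oversized)    # over the 8000 default -> split
```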
chunking.max_indexable_file_size (MAX_INDEXABLE_FILE_SIZE) — Max Indexable File Size
Category: infrastructure
Maximum file size in bytes that will be indexed. Files larger than this limit are skipped during indexing to prevent memory issues and avoid indexing large binary or generated files. Default is 250MB (250,000,000 bytes). Increase for codebases with legitimately large source files; decrease to speed up indexing and reduce memory usage.
Sweet spot: 1-2 MB for most codebases. Use 500KB-1MB for memory-constrained environments or when you want to exclude large auto-generated files. Use 2-5MB for codebases with large source files (e.g., bundled assets, data files that should be searchable). Files exceeding this limit are logged as skipped.
Example: A 5MB SQL dump file would be skipped with MAX_INDEXABLE_FILE_SIZE=2000000. To include it, increase to 6000000 (6MB). Large files that are indexed will be chunked normally, but may take longer to process and consume more embedding API tokens.
- Range: 100KB-10MB (typical)
- Tight: 100KB-500KB (skip most large files, fast indexing)
- Balanced: 1MB-2MB (recommended, handles normal source files)
- Large: 2MB-5MB (include larger source files)
- Very large: 5MB-10MB (include data files, maximum coverage)
- Trade-off: higher limit = more coverage, slower indexing, more memory
Badges: - File filtering - Requires reindex
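A minimal sketch of the size gate, replaying the 5MB-dump example with a 2MB limit (indexable is a hypothetical helper; the real indexer also logs each skip):

```python
import os
import tempfile

def indexable(path, max_bytes=250_000_000):
    """True if the file is small enough to index. The default mirrors
    the table above; this is an illustrative helper, not the indexer."""
    try:
        return os.path.getsize(path) <= max_bytes
    except OSError:
        return False  # unreadable/missing files are skipped too

# hypothetical ~5 MB dump, as in the example above
with tempfile.NamedTemporaryFile(delete=False, suffix=".sql") as f:
    f.write(b"x" * 5_000_000)
    demo = f.name

kept = indexable(demo, max_bytes=2_000_000)   # False: 5 MB > 2 MB limit
os.unlink(demo)
```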
chunking.min_chunk_chars (MIN_CHUNK_CHARS) — Min Chunk Chars
Category: chunking
Minimum characters required for a chunk to be kept. Very small fragments below this are dropped or merged to reduce indexing noise. Default: 50. Range: 10-500. Raise it for a cleaner index; lower it if you need tiny helpers and stubs searchable. Changing this requires reindexing.
Badges: - Index quality control - Requires reindex
Links: - Code Chunking Best Practices - cAST Filtering
chunking.preserve_imports (PRESERVE_IMPORTS) — Preserve Imports
Category: infrastructure
Include import and require blocks even when they are below MIN_CHUNK_CHARS (0/1). Improves dependency discovery queries like "where is X imported". Default: 1. Range: 0-1. Changing this requires reindexing.
Badges: - Dependency tracking - Requires reindex
Links: - Code Structure Analysis - Module Systems
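Combining min_chunk_chars with preserve_imports, the keep/drop decision might be sketched as follows (keep_chunk and its prefix-based import detection are illustrative; the real detection is language-aware):

```python
def keep_chunk(chunk, min_chunk_chars=50, preserve_imports=1):
    """Drop fragments below min_chunk_chars unless they are import
    blocks and preserve_imports is enabled (sketch only)."""
    text = chunk.strip()
    if len(text) >= min_chunk_chars:
        return True
    is_import = text.startswith(("import ", "from "))
    return bool(preserve_imports) and is_import

keep_chunk("import os")   # kept: short, but an import block
keep_chunk("x = 1")       # dropped: below 50 chars, not an import
```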