Chunking for LLMs
Naive line-count or character-count splitting breaks code apart at arbitrary points. A function split across two chunks loses its signature. A class split mid-method gives the model half a definition.
Syntax-aware chunking walks the concrete syntax tree and splits at natural boundaries. Here's the difference:
```python
def process_order(order_id: str, quantity: int) -> dict:
    """Process an order and return the result."""
    # validate input
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    item = fetch_item(order_id)
    price = item["price"] * quantity
    return {"order_id": order_id, "total": price, "status": "pending"}
```
Naive chunking at 100 bytes might split right after `raise ValueError(...)`, leaving the return statement stranded in the next chunk. Syntax-aware chunking keeps `process_order` together as one unit; the chunker only splits inside a function when that function alone exceeds the byte budget.
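For contrast, a naive fixed-size splitter is just a byte counter. This sketch is not part of the library; it exists only to show the failure mode:

```python
def naive_chunks(source: str, max_bytes: int = 100) -> list[str]:
    # Cut every max_bytes bytes, blind to syntax: boundaries can land
    # mid-statement, mid-string, or even inside a multi-byte character.
    data = source.encode("utf-8")
    return [
        data[i : i + max_bytes].decode("utf-8", errors="replace")
        for i in range(0, len(data), max_bytes)
    ]
```

Run against `process_order` above, it produces exactly the kind of mid-function cut described.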
Basic usage¶
Set `chunk_max_size` in `ProcessConfig` to enable chunking:
**Python**

```python
from tree_sitter_language_pack import process, ProcessConfig

with open("src/service.py") as f:
    source = f.read()

result = process(source, ProcessConfig(
    language="python",
    chunk_max_size=1000,  # max bytes per chunk
    structure=True,       # include structure metadata
))

for i, chunk in enumerate(result["chunks"]):
    print(f"Chunk {i + 1}: lines {chunk['start_line']}-{chunk['end_line']} "
          f"({chunk['end_byte'] - chunk['start_byte']} bytes)")
```
**TypeScript**

```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";
import { readFileSync } from "fs";

const source = readFileSync("src/service.ts", "utf8");
const result = await process(source, {
  language: "typescript",
  chunkMaxSize: 1000,
  structure: true,
});

result.chunks.forEach((chunk, i) => {
  console.log(`Chunk ${i + 1}: lines ${chunk.startLine}-${chunk.endLine} (${chunk.endByte - chunk.startByte} bytes)`);
});
```
**Rust**

```rust
use tree_sitter_language_pack::{process, ProcessConfig};

// read the source to chunk
let source = std::fs::read_to_string("src/service.rs")?;

let mut config = ProcessConfig::new("rust").with_chunking(1000);
config.structure = true;
let result = process(&source, &config)?;

for (i, chunk) in result.chunks.iter().enumerate() {
    println!("Chunk {}: lines {}-{} ({} bytes)",
        i + 1, chunk.start_line, chunk.end_line, chunk.end_byte - chunk.start_byte);
}
```
Chunk fields¶
| Field | Type | Description |
|---|---|---|
| `content` | `str` | Source code text for this chunk |
| `start_byte` | `int` | Start byte offset in source |
| `end_byte` | `int` | End byte offset in source |
| `start_line` | `int` | First line (1-indexed) |
| `end_line` | `int` | Last line (1-indexed) |
| `node_types` | `list[str]` | Top-level tree-sitter node types in this chunk |
How it works¶
The chunker runs three passes:
1. Collect top-level declarations (functions, classes, methods) as atomic units. Comments and docstrings above a declaration attach to it.
2. Pack units into chunks without exceeding `chunk_max_size`. When the current chunk would overflow, close it and start a new one.
3. For any single unit that exceeds `chunk_max_size` on its own, split at the next logical sub-boundary — between methods in a class, or between statement blocks in a function.
The result: functions are never split unless they're individually too large, decorators stay with their function, and imports group into a single chunk at the top.
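The packing pass is greedy bin-packing in document order. Here is a minimal sketch of that strategy, assuming each unit is a `(start_byte, end_byte)` span over the source; this illustrates the idea, not the library's actual code:

```python
def pack(units: list[tuple[int, int]], max_size: int) -> list[list[tuple[int, int]]]:
    """Greedy, in-order packing of (start_byte, end_byte) units."""
    chunks: list[list[tuple[int, int]]] = []
    current: list[tuple[int, int]] = []
    current_size = 0
    for start, end in units:
        size = end - start
        # Close the current chunk if this unit would overflow it.
        # An oversized unit therefore always gets a chunk of its own
        # (which pass 3 would then split at sub-boundaries).
        if current and current_size + size > max_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append((start, end))
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```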
Byte budget¶
`chunk_max_size` is an upper bound in bytes, not a fixed size. The chunker may produce smaller chunks when a natural boundary falls before the limit. With a 1000-byte budget, for example, two 400-byte functions pack into one 800-byte chunk, while a 900-byte function that follows starts a new chunk rather than being split.
Structure metadata with chunks¶
When `structure=True` is also set, each chunk's `node_types` field shows what kind of code it contains. This is useful for metadata-enriched vector store ingestion:
```python
config = ProcessConfig(
    language="python",
    chunk_max_size=1000,
    structure=True,
    docstrings=True,
)
result = process(source, config)

documents = []
for chunk in result["chunks"]:
    documents.append({
        "content": chunk["content"],
        "metadata": {
            "language": "python",
            "start_line": chunk["start_line"],
            "end_line": chunk["end_line"],
            "node_types": chunk["node_types"],
            "size_bytes": chunk["end_byte"] - chunk["start_byte"],
        },
    })
```
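From here, `documents` can go into whatever store you use. As one example sketch, assuming Chroma (`pip install chromadb`) as the store; note that Chroma only accepts scalar metadata values, so the `node_types` list is flattened to a string:

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="code-chunks")

collection.add(
    ids=[f"chunk-{i}" for i in range(len(documents))],
    documents=[d["content"] for d in documents],
    metadatas=[
        # Chroma metadata must be str/int/float/bool,
        # so join the node_types list into one string.
        {**d["metadata"], "node_types": ",".join(d["metadata"]["node_types"])}
        for d in documents
    ],
)
```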
Indexing a repository¶
A complete example that walks a codebase and produces LLM-ready chunks:
```python
import os
from pathlib import Path

from tree_sitter_language_pack import process, ProcessConfig, has_language

LANGUAGE_MAP = {
    ".py": "python", ".js": "javascript", ".ts": "typescript",
    ".rs": "rust", ".go": "go", ".java": "java",
    ".rb": "ruby", ".ex": "elixir", ".php": "php",
    ".cs": "csharp", ".cpp": "cpp", ".c": "c",
}

def chunk_repository(repo_path: str, chunk_size: int = 800) -> list[dict]:
    chunks = []
    for root, _, files in os.walk(repo_path):
        for filename in files:
            ext = Path(filename).suffix
            language = LANGUAGE_MAP.get(ext)
            if not language or not has_language(language):
                continue
            filepath = os.path.join(root, filename)
            try:
                source = Path(filepath).read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue
            result = process(source, ProcessConfig(
                language=language,
                chunk_max_size=chunk_size,
                structure=True,
                imports=True,
                docstrings=True,
            ))
            for chunk in result["chunks"]:
                chunks.append({
                    "content": chunk["content"],
                    "file": filepath,
                    "start_line": chunk["start_line"],
                    "end_line": chunk["end_line"],
                    "language": language,
                    "node_types": chunk["node_types"],
                    "size_bytes": chunk["end_byte"] - chunk["start_byte"],
                })
    return chunks

docs = chunk_repository("./my-project")
print(f"{len(docs)} chunks from {len(set(d['file'] for d in docs))} files")
```
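One practical note: real repositories contain vendored and generated trees you rarely want to index, and `os.walk` supports pruning them in place. The directory names below are assumptions to adjust per project; swapping the `os.walk(repo_path)` loop in `chunk_repository` for this generator is all it takes:

```python
import os

SKIP_DIRS = {".git", "node_modules", "target", "dist", ".venv", "__pycache__"}

def walk_sources(repo_path: str):
    for root, dirs, files in os.walk(repo_path):
        # Mutating dirs in place stops os.walk from descending into them.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        yield root, files
```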
Next steps¶
- Code intelligence — the other `ProcessConfig` fields that work alongside chunking
- Concepts: Code intelligence — the extraction engine design