# Chunking for LLMs
When feeding source code to a language model, naive line-count or character-count splitting produces broken, incoherent fragments. A function split across two chunks loses its signature. A class split mid-method gives the model half a definition. tree-sitter-language-pack solves this with syntax-aware chunking: it walks the concrete syntax tree and splits only at natural boundaries.
## Why Syntax-Aware Chunking Matters
Consider this Python file:
```python
def process_order(order_id: str, quantity: int) -> dict:
    """Process an order and return the result."""
    # validate input
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    item = fetch_item(order_id)
    price = item["price"] * quantity
    return {"order_id": order_id, "total": price, "status": "pending"}
```
Naive chunking at 100 tokens might split after `raise ValueError(...)`, leaving the return statement in the next chunk. The model sees an incomplete function in both chunks, with no way to understand the full intent.
Syntax-aware chunking keeps `process_order` together as one unit. Only when a single function exceeds the token budget does the chunker split inside it — and it marks this clearly.
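To make the failure mode concrete, here is a toy illustration (plain Python, not library code) of what fixed-size splitting does to the function above: the split point lands mid-statement, so no piece is a parseable function on its own.

```python
# Toy demonstration: naive fixed-size splitting tears a function apart.
source = '''def process_order(order_id: str, quantity: int) -> dict:
    """Process an order and return the result."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    item = fetch_item(order_id)
    price = item["price"] * quantity
    return {"order_id": order_id, "total": price, "status": "pending"}
'''

def naive_chunks(text: str, size: int) -> list[str]:
    """Split at fixed character offsets, ignoring syntax entirely."""
    return [text[i:i + size] for i in range(0, len(text), size)]

pieces = naive_chunks(source, 150)
# The first piece ends mid-statement; none of the pieces parses on its own.
print(len(pieces), "pieces; first piece ends with:", repr(pieces[0][-20:]))
```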
## Basic Usage
=== "Python"
```python
from tree_sitter_language_pack import process, ProcessConfig
with open("src/service.py") as f:
source = f.read()
config = ProcessConfig(
language="python",
chunk_max_size=1000, # target tokens per chunk
structure=True, # optionally include structure info
)
result = process(source, config)
for i, chunk in enumerate(result["chunks"]):
print(f"Chunk {i + 1}: lines {chunk['start_line']}-{chunk['end_line']} "
f"({chunk['token_count']} tokens)")
print(chunk["content"][:80] + "...")
print()
```
=== "Node.js"
```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";
import { readFileSync } from "fs";
const source = readFileSync("src/service.ts", "utf8");
const result = await process(source, {
language: "typescript",
chunkMaxSize: 1000,
structure: true,
});
result.chunks.forEach((chunk, i) => {
console.log(`Chunk ${i + 1}: lines ${chunk.startLine}-${chunk.endLine} (${chunk.tokenCount} tokens)`);
});
```
=== "Rust"
```rust
use ts_pack_core::{process, ProcessConfig};
use std::fs;
let source = fs::read_to_string("src/service.rs")?;
let config = ProcessConfig::new("rust")
.chunk_max_size(1000)
.structure(true);
let result = process(&source, &config)?;
for (i, chunk) in result.chunks.iter().enumerate() {
println!("Chunk {}: lines {}-{} ({} tokens)",
i + 1, chunk.start_line, chunk.end_line, chunk.token_count);
}
```
=== "CLI"
```bash
ts-pack process src/service.py --chunk-size 1000 --format json \
| jq '.chunks[] | {lines: "\(.start_line)-\(.end_line)", tokens: .token_count}'
```
## Chunk Structure
Each chunk contains:
| Field | Type | Description |
|-------|------|-------------|
| `content` | string | The source code text for this chunk |
| `start_line` | int | First line of the chunk (1-indexed) |
| `end_line` | int | Last line of the chunk (1-indexed) |
| `token_count` | int | Estimated token count (cl100k approximation) |
| `node_types` | list[str] | Tree-sitter node types at the top of this chunk |
| `is_partial` | bool | `True` if a single construct was split across chunks |
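Downstream code can use `is_partial` to reassemble oversized constructs. The sketch below works on hand-written chunk dicts shaped like the table above (not real library output), and assumes that the partial pieces of one construct appear as a contiguous run:

```python
# Hypothetical chunk dicts mimicking the fields in the table above.
chunks = [
    {"content": "import os\n", "start_line": 1, "end_line": 1,
     "token_count": 3, "node_types": ["import_statement"], "is_partial": False},
    {"content": "def big(): ...", "start_line": 3, "end_line": 120,
     "token_count": 900, "node_types": ["function_definition"], "is_partial": True},
    {"content": "    # tail of big()", "start_line": 121, "end_line": 160,
     "token_count": 400, "node_types": ["function_definition"], "is_partial": True},
]

def _join(parts: list[dict]) -> dict:
    """Fuse a contiguous run of partial chunks back into one logical unit."""
    return {
        "content": "".join(p["content"] for p in parts),
        "start_line": parts[0]["start_line"],
        "end_line": parts[-1]["end_line"],
        "token_count": sum(p["token_count"] for p in parts),
        "node_types": parts[0]["node_types"],
        "is_partial": False,
    }

def whole_units(chunks: list[dict]) -> list[dict]:
    """Merge runs of partial chunks; pass whole chunks through unchanged."""
    merged, buffer = [], []
    for c in chunks:
        if c["is_partial"]:
            buffer.append(c)
            continue
        if buffer:
            merged.append(_join(buffer))
            buffer = []
        merged.append(c)
    if buffer:
        merged.append(_join(buffer))
    return merged

units = whole_units(chunks)
```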
## How the Chunker Works
The chunker operates in three passes:
**Pass 1: Collect leaf units.** Walk the syntax tree and collect all top-level declarations (functions, classes, methods, etc.) as atomic units. Comments and docstrings above a declaration are attached to it.
**Pass 2: Pack units into chunks.** Greedily pack units into chunks without exceeding `chunk_max_size`. When the current chunk would overflow, close it and start a new one.
**Pass 3: Split oversized units.** If a single unit (e.g., a very large function) exceeds `chunk_max_size` on its own, split it at the next logical sub-boundary (e.g., between methods in a class, or between statement blocks in a function).
This strategy ensures:
- Functions are never split unless they are individually too large.
- A decorator or docstring is always in the same chunk as the function it belongs to.
- Class definitions keep their method list together where possible.
- Imports are grouped into a single chunk at the top.
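Pass 2 can be sketched in a few lines of plain Python (illustrative only, not the library's actual implementation). Each unit is a `(name, token_count)` pair collected by Pass 1:

```python
# Greedy packing sketch for Pass 2: close a chunk when the next unit
# would push it past the budget.
units = [("imports", 40), ("helper_a", 300), ("ClassB", 650), ("helper_c", 200)]

def pack(units: list[tuple[str, int]], budget: int) -> list[list[str]]:
    chunks, current, current_tokens = [], [], 0
    for name, tokens in units:
        # Close the current chunk if adding this unit would overflow it.
        if current and current_tokens + tokens > budget:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(name)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks

print(pack(units, 1000))  # → [['imports', 'helper_a', 'ClassB'], ['helper_c']]
```

Note that a unit larger than the budget still lands in a chunk by itself here; splitting it further is exactly the job of Pass 3.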
## Token Budget
The `chunk_max_size` parameter is an **upper bound** on tokens per chunk, not a fixed size. The chunker may produce smaller chunks when a natural boundary falls before the limit, and may slightly exceed the limit when the only split point is past it.
Token counting uses the `cl100k_base` approximation (roughly 4 characters ≈ 1 token), which tracks GPT-4, Claude, and Llama-family tokenizers closely. The token budget can be combined with an overlap between adjacent chunks:
=== "Python"
```python
config = ProcessConfig(
language="python",
chunk_max_size=1000,
chunk_overlap=100, # overlap tokens between adjacent chunks
)
```
=== "Node.js"
```typescript
const result = await process(source, {
language: "python",
chunkMaxSize: 1000,
chunkOverlap: 100, // repeat last N tokens of previous chunk
});
```
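The character-based heuristic itself can be sketched as a one-liner (illustrative only; the library's actual counting may differ in detail):

```python
# Sketch of the 4-characters-per-token approximation used for budgeting.
def estimate_tokens(text: str) -> int:
    # Never report zero tokens for non-empty budgeting purposes.
    return max(1, len(text) // 4)

print(estimate_tokens("def add(a, b):\n    return a + b\n"))  # → 8
```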
## Chunk Overlap
For retrieval use cases, you may want adjacent chunks to share some context. Set `chunk_overlap` to repeat the last N tokens of the previous chunk at the start of the next:
```python
config = ProcessConfig(
    language="python",
    chunk_max_size=800,
    chunk_overlap=150,  # repeat ~150 tokens of context
)
```
!!! warning "Overlap increases storage"

    Overlap causes chunks to share content. When storing chunks in a vector database, account for the increased total token count across chunks when planning your embedding budget.
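The storage cost of overlap is easy to estimate: each chunk after the first repeats roughly `overlap` tokens of its predecessor. A rough back-of-the-envelope helper (an illustration of the arithmetic, not library behavior):

```python
# Total stored tokens when each chunk after the first repeats `overlap`
# tokens from its predecessor.
def total_tokens(chunk_sizes: list[int], overlap: int) -> int:
    return sum(chunk_sizes) + overlap * (len(chunk_sizes) - 1)

print(total_tokens([800, 800, 800], 150))  # → 2700, vs 2400 without overlap
```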
## Including Structure Metadata
When `structure=True` is also set, each chunk's `node_types` field tells you what kind of code it contains, which is useful for metadata-enriched vector store ingestion:
```python
config = ProcessConfig(
    language="python",
    chunk_max_size=1000,
    structure=True,
    docstrings=True,
)

result = process(source, config)

# Build vector store documents
documents = []
for chunk in result["chunks"]:
    documents.append({
        "content": chunk["content"],
        "metadata": {
            "language": "python",
            "start_line": chunk["start_line"],
            "end_line": chunk["end_line"],
            "node_types": chunk["node_types"],
            "token_count": chunk["token_count"],
        },
    })
```
## Real-World Example: Indexing a Repository
```python
import os
from pathlib import Path

from tree_sitter_language_pack import process, ProcessConfig, has_language

LANGUAGE_MAP = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".rs": "rust",
    ".go": "go",
    ".java": "java",
    ".rb": "ruby",
    ".ex": "elixir",
    ".exs": "elixir",
    ".php": "php",
    ".cs": "csharp",
    ".cpp": "cpp",
    ".c": "c",
    ".kt": "kotlin",
    ".swift": "swift",
}

def chunk_repository(repo_path: str, chunk_size: int = 1000) -> list[dict]:
    chunks = []
    for root, _, files in os.walk(repo_path):
        for filename in files:
            ext = Path(filename).suffix
            language = LANGUAGE_MAP.get(ext)
            if not language or not has_language(language):
                continue
            filepath = os.path.join(root, filename)
            try:
                source = Path(filepath).read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue
            config = ProcessConfig(
                language=language,
                chunk_max_size=chunk_size,
                structure=True,
                imports=True,
                docstrings=True,
            )
            result = process(source, config)
            for chunk in result["chunks"]:
                chunks.append({
                    "content": chunk["content"],
                    "file": filepath,
                    "start_line": chunk["start_line"],
                    "end_line": chunk["end_line"],
                    "language": language,
                    "node_types": chunk["node_types"],
                    "token_count": chunk["token_count"],
                })
    return chunks

# Index a repository
docs = chunk_repository("./my-project", chunk_size=800)
print(f"Generated {len(docs)} chunks from {len(set(d['file'] for d in docs))} files")
```
## Chunking vs. Splitting by File
For large codebases, you might consider sending entire small files as single chunks and only chunking large files. Here is a pattern:
```python
# Assumes `source` (the file contents) and `language` have already been
# determined for the file being processed.
MAX_FILE_TOKENS = 600  # treat files at or under this size as one chunk
CHUNK_SIZE = 800

config_full = ProcessConfig(language=language, structure=True, imports=True)
config_chunked = ProcessConfig(
    language=language, chunk_max_size=CHUNK_SIZE, structure=True, imports=True
)

result = process(source, config_full)
file_tokens = result["metrics"].get("total_tokens", len(source) // 4)

if file_tokens <= MAX_FILE_TOKENS:
    # Use the whole file as one chunk
    chunks = [{
        "content": source,
        "start_line": 1,
        "end_line": result["metrics"]["total_lines"],
    }]
else:
    result = process(source, config_chunked)
    chunks = result["chunks"]
```