# Quick Start
This guide walks you from install to parsing, code intelligence, and LLM chunking.
## 1. Install
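Install the Python package from PyPI — the same command the Docker and CI snippets later in this guide use:

```shell
pip install tree-sitter-language-pack
```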
!!! tip "Other ecosystems"
    Go, Java, Ruby, Elixir, PHP, and WebAssembly are also supported. See Installation for the full list.
## 2. Download Parsers
Parsers download automatically on first use. For production, CI, Docker, or offline environments, pre-download them.
### Specific languages
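To fetch only the parsers your project uses, pass a list of language names to `download` — the same helper the CI snippet later in this guide calls:

```python
from tree_sitter_language_pack import download

# Fetch parsers for just these languages; names match those accepted by get_parser().
download(["python", "javascript", "rust"])
```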
### All 306 languages
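To fetch every parser at once, use `download_all` — the helper the Dockerfile below invokes. This is the simplest option for offline or air-gapped machines, at the cost of a larger download:

```python
from tree_sitter_language_pack import download_all

# Fetch all 306 parsers into the local cache.
download_all()
```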
### By language group
Groups bundle related languages: `web`, `systems`, `scripting`, `data`, `jvm`, `functional`.
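This guide does not show the exact call for group downloads. One plausible shape — assuming `download` also accepts a `groups` keyword, which is a hypothetical, unconfirmed signature — would be:

```python
from tree_sitter_language_pack import download

# Hypothetical: the group-download API is not shown in this guide;
# group names mirror those accepted in language-pack.toml.
download(groups=["web", "systems"])
```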
### Docker and CI
Pre-download parsers during your build to avoid runtime network calls:
```dockerfile
FROM python:3.12-slim

RUN pip install tree-sitter-language-pack

# Pre-download at build time — no network needed at runtime
RUN python -c "from tree_sitter_language_pack import download_all; download_all()"
```
```yaml
- name: Install and pre-download parsers
  run: |
    pip install tree-sitter-language-pack
    python -c "from tree_sitter_language_pack import download; download(['python', 'javascript', 'rust'])"
```
### Configuration file
Declare which languages your project needs in a `language-pack.toml`:

```toml
languages = ["python", "javascript", "rust", "go"]
# groups = ["web", "systems"]
# cache_dir = "/tmp/parsers"
```
Then download everything declared in the config:
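The exact config-aware command is not shown here. As one sketch, you can read `language-pack.toml` with the standard library's `tomllib` and hand the declared languages to the `download` helper used in the CI example:

```python
import tomllib  # Python 3.11+

from tree_sitter_language_pack import download

# Read the project's language-pack.toml (created in the step above).
with open("language-pack.toml", "rb") as f:
    config = tomllib.load(f)

# Pass the declared languages to download(); group handling, if supported,
# would follow the same pattern.
download(config["languages"])
```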
!!! info "Cache location"
    Parsers cache to `~/.cache/tree-sitter-language-pack/` on Linux/macOS and `%LOCALAPPDATA%\tree-sitter-language-pack\` on Windows. Override with `cache_dir` in `language-pack.toml` or the programmatic API. See Download Model for full details.
## 3. Parse Code
Build a concrete syntax tree from source code.
```python
from tree_sitter_language_pack import get_parser

parser = get_parser("python")

source = b"""
def greet(name: str) -> str:
    return f"Hello, {name}!"

result = greet("world")
"""

tree = parser.parse(source)
root = tree.root_node

print(root.type)          # module
print(root.child_count)   # 2
print(root.sexp()[:120])  # S-expression preview
```
```javascript
import { parseString, treeRootNodeType, treeRootChildCount } from "@kreuzberg/tree-sitter-language-pack";

const source = `
function greet(name) {
  return \`Hello, \${name}!\`;
}

greet("world");
`;

const tree = parseString("javascript", source);

console.log(treeRootNodeType(tree));   // program
console.log(treeRootChildCount(tree)); // 2
```
```rust
use tree_sitter_language_pack::get_parser;

fn main() -> anyhow::Result<()> {
    let mut parser = get_parser("rust")?;

    let source = r#"
fn greet(name: &str) -> String {
    format!("Hello, {}!", name)
}
"#;

    let tree = parser.parse(source, None).unwrap();
    let root = tree.root_node();

    println!("{}", root.kind());        // source_file
    println!("{}", root.child_count()); // 1
    println!("{}", root.to_sexp());

    Ok(())
}
```
## 4. Extract Code Intelligence
Go beyond the raw syntax tree. Extract functions, classes, imports, docstrings, and more with `process`.
```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
import os
from pathlib import Path

def read_file(path: str) -> str:
    \"\"\"Read and return the contents of a file.\"\"\"
    return Path(path).read_text()

class FileManager:
    def __init__(self, base_dir: str):
        self.base_dir = base_dir

    def get(self, name: str) -> str:
        return read_file(os.path.join(self.base_dir, name))
"""

config = ProcessConfig(
    language="python",
    structure=True,   # functions and classes
    imports=True,     # import statements
    comments=True,    # inline comments
    docstrings=True,  # docstring extraction
)

result = process(source, config)

print(f"Imports: {[i['name'] for i in result['imports']]}")
print(f"Symbols: {[s['name'] for s in result['structure']]}")
print(f"Docstring: {result['structure'][0]['docstring']}")
```
```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";

const source = `
import fs from "fs";
import { join } from "path";

/**
 * Read and return the contents of a file.
 */
function readFile(path: string): string {
  return fs.readFileSync(path, "utf8");
}

class FileManager {
  constructor(private baseDir: string) {}

  get(name: string): string {
    return readFile(join(this.baseDir, name));
  }
}
`;

const result = await process(source, {
  language: "typescript",
  structure: true,
  imports: true,
  docstrings: true,
});

console.log("Imports:", result.imports.map(i => i.name));
console.log("Symbols:", result.structure.map(s => s.name));
```
```rust
use tree_sitter_language_pack::{process, ProcessConfig};

fn main() -> anyhow::Result<()> {
    let source = r#"
use std::fs;
use std::path::Path;

/// Read and return the contents of a file.
fn read_file(path: &str) -> String {
    fs::read_to_string(path).unwrap()
}

struct FileManager {
    base_dir: String,
}
"#;

    let mut config = ProcessConfig::new("rust");
    config.structure = true;
    config.imports = true;
    config.docstrings = true;

    let result = process(source, &config)?;

    println!("Imports: {:?}", result.imports.iter().map(|i| &i.name).collect::<Vec<_>>());
    println!("Symbols: {:?}", result.structure.iter().map(|s| &s.name).collect::<Vec<_>>());

    Ok(())
}
```
## 5. Run Extraction Queries
Use `extract` to run custom tree-sitter queries and get structured results with captured text and metadata.
```python
import tree_sitter_language_pack as tslp

source = """
def greet(name: str) -> str:
    return f"Hello, {name}!"

def farewell(name: str) -> str:
    return f"Goodbye, {name}!"
"""

result = tslp.extract(source, {
    "language": "python",
    "patterns": {
        "functions": {
            "query": "(function_definition name: (identifier) @name)",
            "capture_output": "Text",
        }
    }
})

for match in result["results"]["functions"]["matches"]:
    print(match["captures"][0]["text"])
# greet
# farewell
```
## 6. Chunk for LLMs
Split code at natural boundaries so language models receive coherent, complete units — ideal for embedding pipelines and context windows.
```python
from tree_sitter_language_pack import process, ProcessConfig

with open("large_module.py") as f:
    source = f.read()

config = ProcessConfig(
    language="python",
    chunk_max_size=1500,  # max bytes per chunk
    structure=True,
)

result = process(source, config)

for i, chunk in enumerate(result["chunks"]):
    print(f"Chunk {i}: lines {chunk['start_line']}-{chunk['end_line']} "
          f"({chunk['end_byte'] - chunk['start_byte']} bytes)")
```
```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";
import { readFileSync } from "fs";

const source = readFileSync("large_module.ts", "utf8");

const result = await process(source, {
  language: "typescript",
  chunkMaxSize: 1500,
  structure: true,
});

result.chunks.forEach((chunk, i) => {
  console.log(`Chunk ${i}: lines ${chunk.startLine}-${chunk.endLine} (${chunk.endByte - chunk.startByte} bytes)`);
});
```
You now have the full workflow: install, download parsers, parse, extract intelligence, run queries, and chunk for LLMs. Go further with the following guides:

- Parsing guide — syntax trees, error handling, and incremental parsing
- Configuration — `language-pack.toml` and advanced options
- API Reference — full API docs for every binding