Code intelligence¶
process() parses a file and walks the AST to return structured data: the functions and classes the file defines, what it imports, its docstrings, comments, and more. You configure what to extract; the Rust core handles the extraction itself. It does not expose a built-in API for executing arbitrary queries.
Here's what a typical result looks like:
from tree_sitter_language_pack import process, ProcessConfig

source = '''
import os
from pathlib import Path

def read_file(path: str) -> str:
    """Read and return file contents."""
    return Path(path).read_text()

class FileCache:
    """Cache for file contents."""

    def __init__(self, root: str):
        self.root = root

    def get(self, name: str) -> str:
        """Return cached file contents."""
        return read_file(os.path.join(self.root, name))
'''

result = process(source, ProcessConfig(
    language="python",
    structure=True,
    imports=True,
    docstrings=True,
))

for item in result.structure:
    doc = f" - {item.docstring}" if item.docstring else ""
    print(f"{item.kind:8} {item.name:20} lines {item.start_line}-{item.end_line}{doc}")

print()
for imp in result.imports:
    names = ", ".join(imp.names) or "*"
    print(f"from {imp.source} import {names}")
In TypeScript:
import { process } from "@kreuzberg/tree-sitter-language-pack";

// `source` holds the TypeScript code to analyze.
const result = await process(source, {
  language: "typescript",
  structure: true,
  imports: true,
  docstrings: true,
});

result.structure.forEach(item => {
  const doc = item.docstring ? ` - ${item.docstring}` : "";
  console.log(`${item.kind.padEnd(8)} ${item.name.padEnd(20)} lines ${item.startLine}-${item.endLine}${doc}`);
});
In Rust:
use tree_sitter_language_pack::{process, ProcessConfig};

let mut config = ProcessConfig::new("rust");
config.structure = true;
config.imports = true;
config.docstrings = true;

let result = process(source, &config)?;
for item in &result.structure {
    println!("{:8} {:20} lines {}-{}",
        item.kind, item.name, item.start_line, item.end_line);
}
ProcessConfig fields¶
Pass language plus any of these fields:
| Field | Default | What it extracts |
|---|---|---|
| structure | True | Functions, classes, methods, interfaces, structs, traits, enums |
| imports | True | Import/require statements: source module and imported names |
| exports | True | Exported symbols |
| comments | False | All comments with text and location |
| docstrings | False | Docstrings attached to declarations (requires structure=True) |
| symbols | False | Deduplicated list of all identifiers, for search indexing |
| diagnostics | False | Syntax error nodes from the parse |
| data_extraction | False | Hierarchical key-value tree for structured data formats (Available by v1.9) |
| chunk_max_size | None | Maximum chunk size in bytes; see Chunking for LLMs |
Enable everything at once: ProcessConfig.all("python").
Result fields¶
structure¶
Each item has:
| Field | Type | Description |
|---|---|---|
| kind | str | function, class, method, interface, struct, trait, enum, impl, module, and so on. |
| name | str | Declaration name |
| start_line | int | First line (1-indexed) |
| end_line | int | Last line (1-indexed) |
| docstring | str \| None | Attached docstring; only present when docstrings=True |
Kinds vary by language:
| Language | Kinds |
|---|---|
| Python | function, class, method, async_function |
| JavaScript/TypeScript | function, class, method, async_function, interface, enum |
| Rust | function, struct, impl, trait, enum, type_alias, mod |
| Java | class, interface, method, constructor, enum |
| Go | function, struct, interface, method |
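Because kinds differ between languages, downstream code usually groups items by kind rather than assuming a fixed set. The sketch below does that grouping. To stay runnable without the library, it uses a hypothetical StructureItem dataclass that mirrors the documented fields, plus hand-written sample data with illustrative line numbers; real items would come from result.structure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructureItem:
    """Hypothetical stand-in mirroring the documented structure fields."""
    kind: str
    name: str
    start_line: int
    end_line: int
    docstring: Optional[str] = None


def group_by_kind(items):
    """Map each kind to the names declared with that kind, in source order."""
    groups = defaultdict(list)
    for item in sorted(items, key=lambda i: i.start_line):
        groups[item.kind].append(item.name)
    return dict(groups)


# Hand-written sample; the line numbers are illustrative only.
sample = [
    StructureItem("method", "get", 14, 16),
    StructureItem("function", "read_file", 4, 6),
    StructureItem("class", "FileCache", 8, 16),
    StructureItem("method", "__init__", 11, 12),
]

outline = group_by_kind(sample)
for kind, names in outline.items():
    print(f"{kind}: {', '.join(names)}")
```

Grouping on whatever kinds appear, instead of a hard-coded list, keeps the same helper usable across the languages in the table above.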
imports¶
Each import has source (module path), names (list of imported identifiers — empty for wildcard or bare imports), and start_line.
In Python this covers both import x and from x import y; in JavaScript, both import and require().
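Since names is empty for both wildcard and bare imports, a consumer cannot tell the two apart from names alone and should avoid printing a bare import as a wildcard. The following is a minimal sketch of a formatter that handles the empty case explicitly. ImportInfo is a hypothetical stand-in for the documented import shape, not the library's own class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImportInfo:
    """Hypothetical stand-in with the documented fields of an import entry."""
    source: str
    names: List[str] = field(default_factory=list)
    start_line: int = 1


def describe_import(imp):
    """Render one import without guessing what an empty names list means."""
    if imp.names:
        return f"{imp.source}: {', '.join(imp.names)}"
    # Empty names covers both `import x` and wildcard imports.
    return f"{imp.source}: (whole module or wildcard)"


# Hand-written entries shaped like the imports of the example at the top of the page.
imports = [
    ImportInfo("os", [], 1),
    ImportInfo("pathlib", ["Path"], 2),
]

for imp in imports:
    print(f"line {imp.start_line}: {describe_import(imp)}")
```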
exports¶
Language-specific:
- Python: module-level items not prefixed with _, or listed in __all__
- JavaScript/TypeScript: explicit export declarations
- Rust: items with pub visibility
Each export has name and kind.
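To make the Python rule concrete, here is a rough approximation written with the standard-library ast module. It is not the library's implementation (which works on tree-sitter nodes). It only illustrates the stated rule, reading "or" as a union, and it only considers module-level functions and classes.

```python
import ast


def python_exports(source):
    """Approximate the documented rule: public names, plus anything in __all__."""
    tree = ast.parse(source)
    public = set()
    declared = set()
    for node in tree.body:  # module-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if not node.name.startswith("_"):
                public.add(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    try:
                        # Only literal lists/tuples of strings can be read statically.
                        declared.update(ast.literal_eval(node.value))
                    except ValueError:
                        pass
    return public | declared


demo = '''
__all__ = ["_internal_but_listed"]

def visible(): ...
def _hidden(): ...
def _internal_but_listed(): ...
class Widget: ...
'''

print(sorted(python_exports(demo)))
```

The underscore-prefixed function appears in the result only because __all__ names it, which is the second half of the rule.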
comments¶
Each comment has text, start_line, and is_block (True for /* ... */, False for line comments).
docstrings¶
Docstrings attach to their parent item in structure as the docstring field. Extraction understands each language's convention:
| Language | Convention |
|---|---|
| Python | """...""" immediately after def/class |
| Rust | /// or //! above the item |
| JavaScript/TypeScript | /** ... */ JSDoc above the function |
| Java | /** ... */ Javadoc |
| Ruby | # ... lines immediately before def/class |
| Go | // FuncName ... comment block above the func |
| Elixir | @doc "..." or @moduledoc "..." |
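Because docstrings land on the structure items themselves, a documentation-coverage check is a simple pass over that list. This sketch assumes items expose name and docstring as described above. It uses SimpleNamespace objects as hypothetical stand-ins so that it runs without the library.

```python
from types import SimpleNamespace


def undocumented(items):
    """Names of declarations whose docstring is missing or empty."""
    return [item.name for item in items if not item.docstring]


def docstring_coverage(items):
    """Fraction of declarations that carry a docstring (1.0 for an empty list)."""
    if not items:
        return 1.0
    documented = sum(1 for item in items if item.docstring)
    return documented / len(items)


# Hand-written stand-ins shaped like the example at the top of the page.
items = [
    SimpleNamespace(name="read_file", docstring="Read and return file contents."),
    SimpleNamespace(name="FileCache", docstring="Cache for file contents."),
    SimpleNamespace(name="__init__", docstring=None),
    SimpleNamespace(name="get", docstring="Return cached file contents."),
]

print(undocumented(items))
print(f"{docstring_coverage(items):.0%}")
```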
symbols¶
A deduplicated list of all identifiers in the file. Useful for search indexing:
result = process(source, ProcessConfig(language="python", symbols=True))
print(sorted(result.symbols)[:10])
# ['FileCache', 'Path', 'get', 'name', 'os', 'path', 'read_file', 'root', 'str']
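For search indexing across many files, the per-file symbol lists can be merged into an inverted index from identifier to the files that mention it. The sketch below assumes you have already collected result.symbols for each file. The file paths and symbol lists are made up for illustration.

```python
from collections import defaultdict


def build_symbol_index(symbols_by_file):
    """Invert {path: [symbols]} into {symbol: {paths}}."""
    index = defaultdict(set)
    for path, symbols in symbols_by_file.items():
        for symbol in symbols:
            index[symbol].add(path)
    return index


# Hypothetical per-file symbol lists, as they might be gathered from process().
symbols_by_file = {
    "cache.py": ["FileCache", "read_file", "root"],
    "io_utils.py": ["read_file", "Path"],
}

index = build_symbol_index(symbols_by_file)
print(sorted(index["read_file"]))
```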
diagnostics¶
Syntax error nodes from the parse. A non-empty list means the file contains syntax errors, but not that parsing failed: tree-sitter recovers from errors and still produces a partial tree.
result = process(source, ProcessConfig(language="python", diagnostics=True))
for err in result.diagnostics:
    print(f"Line {err.start_line}, col {err.start_col}: {err.message}")
metrics¶
File-level statistics, independent of the other fields:
| Field | Type | Description |
|---|---|---|
| total_lines | int | All lines |
| code_lines | int | Non-blank, non-comment lines |
| comment_lines | int | Comment lines |
| blank_lines | int | Empty lines |
| max_depth | int | Maximum nesting depth of the syntax tree |
result = process(source, ProcessConfig(language="python"))
m = result.metrics
print(f"{m.total_lines} lines total, {m.code_lines} code, {m.comment_lines} comments")
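One derived figure that is easy to compute from these counts is comment density. The sketch below uses a hypothetical Metrics dataclass with the documented field names and invented numbers. Comment density itself is not a value the library reports.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical stand-in with the documented metrics fields."""
    total_lines: int
    code_lines: int
    comment_lines: int
    blank_lines: int
    max_depth: int


def comment_density(m):
    """Share of comment lines among all non-blank lines (0.0 if there are none)."""
    written = m.code_lines + m.comment_lines
    return m.comment_lines / written if written else 0.0


# Invented numbers for illustration.
m = Metrics(total_lines=120, code_lines=80, comment_lines=20, blank_lines=20, max_depth=7)
density = comment_density(m)
print(f"{density:.0%} of non-blank lines are comments")
```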
chunks¶
When chunk_max_size has a value, result.chunks contains syntax-aware splits ready for LLM ingestion. See Chunking for LLMs for full documentation.
Data extraction Available by v1.9¶
Set data_extraction = true on ProcessConfig to extract a hierarchical DataNode tree from structured-data languages. Instead of parsing code, this returns a nested key-value structure preserving the original document's hierarchy.
This is available through the process() API and generated bindings. The ts-pack process CLI does not expose a data_extraction flag.
Supported identifiers (19):
json, hjson, json5, toml, properties, hcl, hocon, kdl, cue, yaml, ini, editorconfig, csv, psv, po, nginx, caddy, xml, dtd.
DataNode shape¶
Each node contains:
| Field | Type | Description |
|---|---|---|
| kind | KeyValue \| Element \| Sequence | Node type: key-value pair, XML element, or sequence item |
| key | string \| None | Key name, attribute name, tag name, or positional index ("0", "1", …). None at document root. |
| value | string \| None | Leaf value if present. None for containers (objects, arrays, XML elements with children). |
| attributes | array | Attributes on XML elements; empty for other node types. |
| children | array | Nested child nodes for containers and XML element bodies. |
| span | object | Source location (start_byte, end_byte, start_line, end_line, start_col, end_col). |
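A common way to consume such a tree is to flatten it into dotted paths. The following is a minimal sketch under stated assumptions: DataNode here is a hypothetical dataclass that mirrors the documented fields (span is omitted), and the tree is built by hand to match the nested JSON example on this page instead of being produced by process().

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataNode:
    """Hypothetical stand-in mirroring the documented DataNode fields."""
    kind: str
    key: Optional[str] = None
    value: Optional[str] = None
    attributes: List[dict] = field(default_factory=list)
    children: List["DataNode"] = field(default_factory=list)


def flatten(node, prefix=""):
    """Collect leaf values into a {dotted.path: value} mapping."""
    if node.key is None:        # document root contributes no path segment
        path = prefix
    elif prefix:
        path = f"{prefix}.{node.key}"
    else:
        path = node.key
    flat = {}
    if node.value is not None and path:
        flat[path] = node.value
    for child in node.children:
        flat.update(flatten(child, path))
    return flat


# Hand-built tree for {"server": {"host": "localhost", "port": 8080}}.
root = DataNode("KeyValue", children=[
    DataNode("KeyValue", key="server", children=[
        DataNode("KeyValue", key="host", value="localhost"),
        DataNode("KeyValue", key="port", value="8080"),
    ]),
])

flat = flatten(root)
print(flat)
```

Sequence items would appear with their positional index as a path segment, for example a hypothetical servers.0.host, since the documented key for a sequence item is its index as a string.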
Examples¶
JSON nested object:
result = process('''
{
  "server": {
    "host": "localhost",
    "port": 8080
  }
}
''', ProcessConfig(language="json", data_extraction=True))

# result.data.kind = "KeyValue"
# result.data.key = None
# result.data.children[0].key = "server"
# result.data.children[0].children[0].key = "host"
# result.data.children[0].children[0].value = "localhost"
# result.data.children[0].children[1].key = "port"
# result.data.children[0].children[1].value = "8080"
Properties flat key-value (issue #136):
// configuration.properties:
//   database.url=jdbc:postgres://localhost
//   database.port=5432
//   cache.ttl=3600
var result = process(propertiesSource, ProcessConfig.builder()
    .language("properties")
    .dataExtraction(true)
    .build());

// Iterate key-value pairs:
for (DataNode pair : result.getData().getChildren()) {
    String key = pair.getKey();
    String value = pair.getValue();
    System.out.println(key + " = " + value);
}
YAML with nested mapping:
result = process('''
database:
  primary:
    host: db.example.com
    user: admin
  replica:
    host: db-replica.example.com
    user: readonly
''', ProcessConfig(language="yaml", data_extraction=True))

# result.data.children[0].key = "database"
# result.data.children[0].children[0].key = "primary"
# result.data.children[0].children[0].children[0].key = "host"
TOML sections:
result = process('''
[build]
name = "my-app"
version = "1.0"
''', ProcessConfig(language="toml", data_extraction=True))
# result.data.children[0].key = "build"
# result.data.children[0].children = [
# {"key": "name", "value": "my-app", ...},
# {"key": "version", "value": "1.0", ...}
# ]
XML elements with attributes:
result = process('''
<config>
  <server host="localhost" port="8080">
    <ssl enabled="true"/>
  </server>
</config>
''', ProcessConfig(language="xml", data_extraction=True))

# result.data.children[0].kind = "Element"
# result.data.children[0].key = "server"
# result.data.children[0].attributes = [
#     {"name": "host", "value": "localhost"},
#     {"name": "port", "value": "8080"}
# ]
# result.data.children[0].children = [
#     {"kind": "Element", "key": "ssl", "attributes": [...], ...}
# ]
Next steps¶
- Chunking for LLMs — split code at natural boundaries for LLM ingestion
- Parsing code — raw syntax trees and low-level node traversal