Code intelligence

process() parses a file and walks the AST to return structured data: the functions and classes defined in it, what it imports, its docstrings, comments, and more. You configure what to extract; the Rust core handles the extraction itself. It does not expose a built-in API for running arbitrary queries.

Here's what a typical result looks like:

from tree_sitter_language_pack import process, ProcessConfig

source = '''
import os
from pathlib import Path

def read_file(path: str) -> str:
    """Read and return file contents."""
    return Path(path).read_text()

class FileCache:
    """Cache for file contents."""

    def __init__(self, root: str):
        self.root = root

    def get(self, name: str) -> str:
        """Return cached file contents."""
        return read_file(os.path.join(self.root, name))
'''

result = process(source, ProcessConfig(
    language="python",
    structure=True,
    imports=True,
    docstrings=True,
))

for item in result.structure:
    doc = f" - {item.docstring}" if item.docstring else ""
    print(f"{item.kind:8} {item.name:20} lines {item.start_line}-{item.end_line}{doc}")

print()
for imp in result.imports:
    names = ", ".join(imp.names) or "*"
    print(f"from {imp.source} import {names}")

Output:

function read_file            lines 5-7 - Read and return file contents.
class    FileCache            lines 9-17 - Cache for file contents.
method   __init__             lines 12-13
method   get                  lines 15-17 - Return cached file contents.

from os import *
from pathlib import Path

import { process } from "@kreuzberg/tree-sitter-language-pack";

const result = await process(source, {
  language: "typescript",
  structure: true,
  imports: true,
  docstrings: true,
});

result.structure.forEach(item => {
  const doc = item.docstring ? ` — ${item.docstring}` : "";
  console.log(`${item.kind.padEnd(8)} ${item.name.padEnd(20)} lines ${item.startLine}-${item.endLine}${doc}`);
});

use tree_sitter_language_pack::{process, ProcessConfig};

let mut config = ProcessConfig::new("rust");
config.structure = true;
config.imports = true;
config.docstrings = true;

let result = process(source, &config)?;

for item in &result.structure {
    println!("{:8} {:20} lines {}-{}",
        item.kind, item.name, item.start_line, item.end_line);
}

# Extract structure and docstrings
ts-pack process src/app.py --structure --docstrings

# All fields, JSON output
ts-pack process src/app.py --all | jq '.structure'

ProcessConfig fields

Pass language plus any of these fields:

Field            Default  What it extracts
structure        True     Functions, classes, methods, interfaces, structs, traits, enums
imports          True     Import/require statements: source module and imported names
exports          True     Exported symbols
comments         False    All comments with text and location
docstrings       False    Docstrings attached to declarations (requires structure=True)
symbols          False    Deduplicated list of all identifiers, for search indexing
diagnostics      False    Syntax error nodes from the parse
data_extraction  False    Hierarchical key-value tree for structured data formats (available since v1.9)
chunk_max_size   None     Maximum chunk size in bytes; see Chunking for LLMs

Enable everything at once: ProcessConfig.all("python").

Result fields

structure

Each item has:

Field       Type        Description
kind        str         function, class, method, interface, struct, trait, enum, impl, module, and so on
name        str         Declaration name
start_line  int         First line (1-indexed)
end_line    int         Last line (1-indexed)
docstring   str | None  Attached docstring; only present when docstrings=True

Kinds vary by language:

Language               Kinds
Python                 function, class, method, async_function
JavaScript/TypeScript  function, class, method, async_function, interface, enum
Rust                   function, struct, impl, trait, enum, type_alias, mod
Java                   class, interface, method, constructor, enum
Go                     function, struct, interface, method
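
As a rough sketch of how the structure list might be consumed, the snippet below groups items by kind. StructureItem is a hypothetical stand-in dataclass that mirrors the documented fields so the example runs without the library, and the sample data is made up; with a real result you would pass result.structure instead.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructureItem:
    """Hypothetical stand-in mirroring the documented fields of a structure item."""
    kind: str
    name: str
    start_line: int
    end_line: int
    docstring: Optional[str] = None


def group_by_kind(items):
    """Map each kind to the names declared with that kind, in source order."""
    groups = defaultdict(list)
    for item in items:
        groups[item.kind].append(item.name)
    return dict(groups)


# Made-up sample data, not output from the library.
items = [
    StructureItem("function", "load", 1, 3, "Load a file."),
    StructureItem("class", "Cache", 5, 12, "A small cache."),
    StructureItem("method", "__init__", 8, 9),
    StructureItem("method", "lookup", 11, 12),
]

print(group_by_kind(items))
```

Because the set of kinds varies by language, grouping on whatever kinds appear avoids hard-coding a fixed list.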

imports

Each import has source (module path), names (list of imported identifiers — empty for wildcard or bare imports), and start_line.

Covers both import x and from x import y in Python, and both import and require() in JavaScript.
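
To make the record shape concrete, here is a minimal sketch that builds similar records for Python source with the standard-library ast module. This is a stand-in technique, not how the Rust core extracts imports; the function name sketch_imports and the use of plain dicts are assumptions for illustration.

```python
import ast


def sketch_imports(source):
    """Return dicts shaped like the documented import records."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import x` is a bare import: one record per module, no names.
            for alias in node.names:
                records.append(
                    {"source": alias.name, "names": [], "start_line": node.lineno}
                )
        elif isinstance(node, ast.ImportFrom):
            # A wildcard (`from x import *`) also yields an empty names list.
            names = [a.name for a in node.names if a.name != "*"]
            # Relative imports keep their leading dots, e.g. ".utils".
            module = "." * node.level + (node.module or "")
            records.append(
                {"source": module, "names": names, "start_line": node.lineno}
            )
    return sorted(records, key=lambda r: r["start_line"])


sample = "import os\nfrom pathlib import Path\nfrom typing import *\n"
for record in sketch_imports(sample):
    print(record)
```

Note that bare and wildcard imports both end up with an empty names list, so callers that need to tell them apart would have to inspect the source line.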

exports

Language-specific:

  • Python: module-level items not prefixed with _, or listed in __all__
  • JavaScript/TypeScript: explicit export declarations
  • Rust: items with pub visibility

Each export has name and kind.
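
The Python rule can be approximated with the standard-library ast module. The sketch below takes the rule literally (a module-level name is exported if it has no leading underscore, or if it appears in __all__); that reading, the function name sketch_python_exports, and the handling of only literal __all__ lists are assumptions, not the library's implementation.

```python
import ast


def sketch_python_exports(source):
    """Names the Python rule above would treat as exported (literal reading)."""
    declared = []   # module-level names, in source order
    listed = set()  # names found in a literal __all__ list or tuple
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            declared.append(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if not isinstance(target, ast.Name):
                    continue
                if target.id == "__all__":
                    if isinstance(node.value, (ast.List, ast.Tuple)):
                        listed.update(
                            elt.value
                            for elt in node.value.elts
                            if isinstance(elt, ast.Constant)
                            and isinstance(elt.value, str)
                        )
                else:
                    declared.append(target.id)
    return [n for n in declared if not n.startswith("_") or n in listed]


sample = (
    '__all__ = ["_internal_hook"]\n'
    "def load(): ...\n"
    "def _helper(): ...\n"
    "def _internal_hook(): ...\n"
    "class Cache: ...\n"
    "_secret = 1\n"
    'VERSION = "1.0"\n'
)
print(sketch_python_exports(sample))
```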

comments

Each comment has text, start_line, and is_block (True for /* ... */, False for line comments).
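
For Python source, records of this shape can be approximated with the standard-library tokenize module. This is only an illustrative stand-in: the function name sketch_comments is hypothetical, and whether the library's text field includes the comment marker is an assumption here.

```python
import io
import tokenize


def sketch_comments(source):
    """Collect Python comments as dicts with text, start_line, and is_block."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append({
                "text": tok.string,          # assumption: keep the leading '#'
                "start_line": tok.start[0],  # tokenize rows are 1-indexed
                "is_block": False,           # Python has no /* ... */ comments
            })
    return comments


sample = "# setup\nx = 1  # inline\ny = '# not a comment'\n"
for comment in sketch_comments(sample):
    print(comment)
```

Working from tokens rather than scanning for the marker character means a '#' inside a string literal is not mistaken for a comment.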

docstrings

Docstrings attach to their parent item in structure as the docstring field. Extraction understands each language's convention:

Language               Convention
Python                 """...""" immediately after def/class
Rust                   /// or //! above the item
JavaScript/TypeScript  /** ... */ JSDoc above the function
Java                   /** ... */ Javadoc
Ruby                   # ... lines immediately before def/class
Go                     // FuncName ... comment block above the func
Elixir                 @doc "..." or @moduledoc "..."

symbols

A deduplicated list of all identifiers in the file. Useful for search indexing:

result = process(source, ProcessConfig(language="python", symbols=True))
print(sorted(result.symbols)[:10])
# ['FileCache', 'Path', 'get', 'name', 'os', 'path', 'read_file', 'root', 'str']

diagnostics

Syntax error nodes. A non-empty list means the file contains syntax errors, but it does not mean parsing failed: tree-sitter recovers from errors and still produces a partial tree.

result = process(source, ProcessConfig(language="python", diagnostics=True))
for err in result.diagnostics:
    print(f"Line {err.start_line}, col {err.start_col}: {err.message}")

metrics

File-level statistics, independent of the other fields:

Field          Type  Description
total_lines    int   All lines
code_lines     int   Non-blank, non-comment lines
comment_lines  int   Comment lines
blank_lines    int   Empty lines
max_depth      int   Maximum nesting depth of the syntax tree

result = process(source, ProcessConfig(language="python"))
m = result.metrics
print(f"{m.total_lines} lines total, {m.code_lines} code, {m.comment_lines} comments")
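
To illustrate how the line counts relate to each other, here is a minimal line-based sketch for languages with '#' comments. It is not the library's algorithm: sketch_line_metrics is a hypothetical name, and treating a line as a comment only when it starts with '#' (so trailing comments count as code) is an assumption.

```python
def sketch_line_metrics(source):
    """Count total, code, comment, and blank lines for '#'-style comments."""
    counts = {"total_lines": 0, "code_lines": 0, "comment_lines": 0, "blank_lines": 0}
    for line in source.splitlines():
        counts["total_lines"] += 1
        stripped = line.strip()
        if not stripped:
            counts["blank_lines"] += 1
        elif stripped.startswith("#"):
            counts["comment_lines"] += 1
        else:
            # Includes lines with a trailing comment after code.
            counts["code_lines"] += 1
    return counts


sample = (
    "# header\n"
    "\n"
    "import os\n"
    "\n"
    "def cwd():\n"
    "    # body comment\n"
    "    return os.getcwd()  # trailing\n"
)
print(sketch_line_metrics(sample))
```

In this sketch every line falls into exactly one bucket, so the three counts always sum to total_lines; whether the library's metrics obey the same invariant is not stated here.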

chunks

When chunk_max_size has a value, result.chunks contains syntax-aware splits ready for LLM ingestion. See Chunking for LLMs for full documentation.

Data extraction (available since v1.9)

Set data_extraction = true on ProcessConfig to extract a hierarchical DataNode tree from structured-data languages. Instead of parsing code, this returns a nested key-value structure preserving the original document's hierarchy.

This is available through the process() API and generated bindings. The ts-pack process CLI does not expose a data_extraction flag.

Supported identifiers (19):

json, hjson, json5, toml, properties, hcl, hocon, kdl, cue, yaml, ini, editorconfig, csv, psv, po, nginx, caddy, xml, dtd.

DataNode shape

Each node contains:

Field       Type                           Description
kind        KeyValue | Element | Sequence  Node type: key-value pair, XML element, or sequence item
key         string | None                  Key name, attribute name, tag name, or positional index ("0", "1", …). None at document root.
value       string | None                  Leaf value if present. None for containers (objects, arrays, XML elements with children).
attributes  array                          Attributes on XML elements; empty for other node types.
children    array                          Nested child nodes for containers and XML element bodies.
span        object                         Source location (start_byte, end_byte, start_line, end_line, start_col, end_col).
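
A tree of this shape is easy to walk recursively. The sketch below flattens it into dotted paths; the DataNode dataclass is a hypothetical stand-in that models only some of the fields above (span is omitted), and the sample tree is built by hand rather than returned by process().

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class DataNode:
    """Hypothetical stand-in with a subset of the documented fields."""
    kind: str
    key: Optional[str] = None
    value: Optional[str] = None
    attributes: List[dict] = field(default_factory=list)
    children: List["DataNode"] = field(default_factory=list)


def flatten(node: DataNode, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, value) for every node that carries a leaf value."""
    # The root has no key, so it contributes nothing to the path.
    path = f"{prefix}.{node.key}" if prefix and node.key else (node.key or prefix)
    if node.value is not None:
        yield path, node.value
    for child in node.children:
        yield from flatten(child, path)


# Hand-built tree: a "server" container holding "host" and "port" leaves.
root = DataNode("KeyValue", children=[
    DataNode("KeyValue", key="server", children=[
        DataNode("KeyValue", key="host", value="localhost"),
        DataNode("KeyValue", key="port", value="8080"),
    ]),
])

for path, value in flatten(root):
    print(f"{path} = {value}")
```

Sequence items would appear in the path under their positional index keys, for example a hypothetical servers.0.host.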

Examples

JSON nested object:

result = process('''
{
  "server": {
    "host": "localhost",
    "port": 8080
  }
}
''', ProcessConfig(language="json", data_extraction=True))

# result.data.kind = "KeyValue"
# result.data.key = None
# result.data.children[0].key = "server"
# result.data.children[0].children[0].key = "host"
# result.data.children[0].children[0].value = "localhost"
# result.data.children[0].children[1].key = "port"
# result.data.children[0].children[1].value = "8080"

Properties flat key-value (issue #136):

// configuration.properties:
// database.url=jdbc:postgres://localhost
// database.port=5432
// cache.ttl=3600

var result = process(propertiesSource, ProcessConfig.builder()
    .language("properties")
    .dataExtraction(true)
    .build());

// Iterate key-value pairs:
for (DataNode pair : result.getData().getChildren()) {
    String key = pair.getKey();
    String value = pair.getValue();
    System.out.println(key + " = " + value);
}

YAML with nested mapping:

result = process('''
database:
  primary:
    host: db.example.com
    user: admin
  replica:
    host: db-replica.example.com
    user: readonly
''', ProcessConfig(language="yaml", data_extraction=True))

# result.data.children[0].key = "database"
# result.data.children[0].children[0].key = "primary"
# result.data.children[0].children[0].children[0].key = "host"

TOML sections:

result = process('''
[build]
name = "my-app"
version = "1.0"
''', ProcessConfig(language="toml", data_extraction=True))

# result.data.children[0].key = "build"
# result.data.children[0].children = [
#   {"key": "name", "value": "my-app", ...},
#   {"key": "version", "value": "1.0", ...}
# ]

XML elements with attributes:

result = process('''
<config>
  <server host="localhost" port="8080">
    <ssl enabled="true"/>
  </server>
</config>
''', ProcessConfig(language="xml", data_extraction=True))

# result.data.children[0].kind = "Element"
# result.data.children[0].key = "server"
# result.data.children[0].attributes = [
#   {"name": "host", "value": "localhost"},
#   {"name": "port", "value": "8080"}
# ]
# result.data.children[0].children = [
#   {"kind": "Element", "key": "ssl", "attributes": [...], ...}
# ]
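
Attribute lists of this shape are often easier to use as a mapping. The sketch below works on plain dicts that mirror the commented output above; real results may expose objects rather than dicts, and the helper names attributes_to_dict and find_elements are hypothetical.

```python
def attributes_to_dict(attributes):
    """Turn [{"name": ..., "value": ...}, ...] into {name: value}."""
    return {attr["name"]: attr["value"] for attr in attributes}


def find_elements(node, tag):
    """Depth-first search for Element nodes whose key equals tag."""
    if node.get("kind") == "Element" and node.get("key") == tag:
        yield node
    for child in node.get("children", []):
        yield from find_elements(child, tag)


# Hand-written dict shaped like the <server> element shown above.
server = {
    "kind": "Element",
    "key": "server",
    "attributes": [
        {"name": "host", "value": "localhost"},
        {"name": "port", "value": "8080"},
    ],
    "children": [
        {
            "kind": "Element",
            "key": "ssl",
            "attributes": [{"name": "enabled", "value": "true"}],
            "children": [],
        },
    ],
}

print(attributes_to_dict(server["attributes"]))
for ssl in find_elements(server, "ssl"):
    print(attributes_to_dict(ssl["attributes"]))
```

Attribute values stay strings, as in the examples above, so callers convert "8080" or "true" to typed values themselves.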
