
Code Intelligence Guide

Beyond raw syntax trees, tree-sitter-language-pack can extract semantic information useful for code analysis, search, documentation generation, and LLM ingestion. This guide covers the process() function, ProcessConfig options, and working with intelligence results.

Overview

The process() function runs tree-sitter queries against a parsed AST to extract structured data about code. Enable only the features you need:

  • structure: Functions, classes, methods, interfaces, and other declarations
  • imports: Import statements and their sources
  • exports: Exported symbols and public API surface
  • comments: Inline and block comments
  • docstrings: Documentation strings attached to declarations
  • symbols: All identifiers in the code (for search indexing)
  • diagnostics: Syntax errors and malformed code regions
  • chunks: Syntax-aware splitting for LLM token budgets (see Chunking)
  • metrics: File-level statistics (line count, complexity, etc.)

Quick Start

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
def greet(name: str) -> str:
    '''Say hello to a user.'''
    return f"Hello {name}"

class Greeter:
    def __init__(self, prefix: str = ""):
        self.prefix = prefix
"""

config = ProcessConfig(
    language="python",
    structure=True,      # extract classes, functions, methods
    docstrings=True,     # attach docstrings to declarations
    metrics=True,        # file statistics
)
result = process(source, config)

# Access results
for item in result["structure"]:
    print(f"{item['kind']:10} {item['name']:20} lines {item['start_line']}-{item['end_line']}")

print(f"\nFile metrics: {result['metrics']['total_lines']} lines, "
      f"{result['metrics']['code_lines']} code")
```
```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";

const source = `
function greet(name: string): string {
  return \`Hello \${name}\`;
}

class Greeter {
  constructor(public prefix: string = "") {}
}
`;

const result = await process(source, {
  language: "typescript",
  structure: true,
  docstrings: true,
  metrics: true,
});

// Access results
result.structure.forEach(item => {
  console.log(`${item.kind.padEnd(10)} ${item.name.padEnd(20)} lines ${item.startLine}-${item.endLine}`);
});

console.log(`\nFile metrics: ${result.metrics.totalLines} lines, ${result.metrics.codeLines} code`);
```
```rust
use ts_pack_core::{process, ProcessConfig};

let source = r#"
/// Greet a user.
pub fn greet(name: &str) -> String {
    format!("Hello {}", name)
}

pub struct Greeter {
    prefix: String,
}
"#;

let config = ProcessConfig::new("rust")
    .structure(true)
    .docstrings(true)
    .metrics(true);

let result = process(source, &config)?;

// Access results
for item in &result.structure {
    println!("{:10} {:20} lines {}-{}",
        item.kind, item.name, item.start_line, item.end_line);
}

println!("\nFile metrics: {} lines, {} code",
    result.metrics.total_lines, result.metrics.code_lines);
```
```bash
# Extract structure and docstrings
ts-pack process src/app.py --structure --docstrings

# Enable all features and output as JSON
ts-pack process src/app.py --all --format json | jq '.structure'

# Count functions
ts-pack process src/lib.rs --structure --format json | jq '.structure | map(select(.kind == "function")) | length'
```

ProcessConfig Reference

Creating Configuration

```python
from tree_sitter_language_pack import ProcessConfig

# Create config for a language
config = ProcessConfig(language="python")

# Enable specific features
config = ProcessConfig(
    language="python",
    structure=True,       # functions, classes, methods
    imports=True,         # import statements
    exports=True,         # exported symbols
    comments=True,        # inline comments
    docstrings=True,      # docstrings attached to declarations
    symbols=True,         # all identifiers
    diagnostics=True,     # syntax errors
    metrics=True,         # file statistics
    chunk_max_size=0,     # 0 = no chunking
    chunk_overlap=0,      # tokens to overlap between chunks
)

# Or enable everything at once
config = ProcessConfig(language="python").all()
```
```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";

const result = await process(source, {
  language: "typescript",
  structure: true,
  imports: true,
  exports: true,
  comments: true,
  docstrings: true,
  symbols: true,
  diagnostics: true,
  metrics: true,
  chunkMaxSize: 0,      // 0 = no chunking
  chunkOverlap: 0,      // tokens to overlap
});
```
```rust
use ts_pack_core::{ProcessConfig, process};

let config = ProcessConfig::new("rust")
    .structure(true)
    .imports(true)
    .exports(true)
    .comments(true)
    .docstrings(true)
    .symbols(true)
    .diagnostics(true)
    .metrics(true);

// Or enable everything at once
let config = ProcessConfig::new("rust").all();

let result = process(source, &config)?;
```

ProcessResult Fields

structure — Code Declarations

List of top-level and nested code constructs: functions, classes, methods, interfaces, traits, etc.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
def process_data(input_file: str):
    '''Read and process data from a file.'''
    data = load_file(input_file)
    return transform(data)

class DataProcessor:
    '''A class for processing data.'''

    def __init__(self, name: str):
        self.name = name

    def process(self, data):
        '''Process data and return result.'''
        return data
"""

config = ProcessConfig(language="python", structure=True, docstrings=True)
result = process(source, config)

for item in result["structure"]:
    print(f"Kind:       {item['kind']}")           # "function", "class", "method"
    print(f"Name:       {item['name']}")           # "process_data", "DataProcessor"
    print(f"Start line: {item['start_line']}")     # 1-indexed
    print(f"End line:   {item['end_line']}")
    print(f"Docstring:  {item.get('docstring', '(none)')}")
    print()
```
```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";

const source = `
function processData(inputFile: string): any {
  const data = loadFile(inputFile);
  return transform(data);
}

class DataProcessor {
  name: string;

  constructor(name: string) {
    this.name = name;
  }

  process(data: any): any {
    return data;
  }
}
`;

const result = await process(source, {
  language: "typescript",
  structure: true,
  docstrings: true,
});

result.structure.forEach(item => {
  console.log(`Kind:       ${item.kind}`);             // "function", "class", "method"
  console.log(`Name:       ${item.name}`);             // "processData", "DataProcessor"
  console.log(`Start line: ${item.startLine}`);
  console.log(`End line:   ${item.endLine}`);
  console.log(`Docstring:  ${item.docstring || "(none)"}`);
  console.log();
});
```

Structure item fields:

| Field | Type | Description |
| --- | --- | --- |
| kind | string | One of: function, class, method, interface, struct, trait, enum, module, etc. |
| name | string | Name of the declaration |
| start_line | int | First line (1-indexed) |
| end_line | int | Last line (1-indexed) |
| docstring | string \| null | Docstring if docstrings=True |

Supported kinds by language:

| Language | Kinds |
| --- | --- |
| Python | function, class, method, async_function |
| JavaScript/TypeScript | function, class, method, async_function, interface, enum |
| Rust | function, struct, impl, trait, enum, type_alias, mod |
| Java | class, interface, method, constructor, enum |
| Go | function, struct, interface, method |
| Ruby | def, class, module, singleton_method |
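
As a sketch of how structure items can be consumed, the helper below formats a flat file outline from item dicts shaped like the fields table above. Note that `build_outline` and the sample data are illustrative, not part of the library:

```python
# Illustrative sketch: render a flat outline from structure items.
# The dict fields (kind, name, start_line, end_line) follow the
# table above; build_outline and the sample data are hypothetical.

def build_outline(structure):
    """One line per declaration, ordered by position in the file."""
    items = sorted(structure, key=lambda i: i["start_line"])
    return [
        f"{i['kind']:10} {i['name']:20} lines {i['start_line']}-{i['end_line']}"
        for i in items
    ]

sample = [
    {"kind": "class", "name": "Greeter", "start_line": 6, "end_line": 8},
    {"kind": "function", "name": "greet", "start_line": 1, "end_line": 3},
]
for line in build_outline(sample):
    print(line)
```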

imports — Import Statements

All import and require declarations with their source modules.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
import os
from pathlib import Path
from typing import Optional, Dict
from .utils import helper
"""

config = ProcessConfig(language="python", imports=True)
result = process(source, config)

for imp in result["imports"]:
    print(f"Source: {imp['source']}")          # "os", "pathlib", "typing", "./utils"
    print(f"Names:  {imp['names']}")           # ["Path"], ["Optional", "Dict"], ["helper"]
    print(f"Line:   {imp['start_line']}")
    print()
```

Output:

```text
Source: os
Names:  []

Source: pathlib
Names:  ['Path']

Source: typing
Names:  ['Optional', 'Dict']

Source: ./utils
Names:  ['helper']
```

```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";

const source = `
import os from "os";
import { readFile, writeFile } from "fs";
import * as path from "path";
import utils from "./utils.js";
`;

const result = await process(source, {
  language: "javascript",
  imports: true,
});

result.imports.forEach(imp => {
  console.log(`Source: ${imp.source}`);
  console.log(`Names:  ${imp.names.join(", ") || "(all)"}`);
  console.log();
});
```

Import item fields:

| Field | Type | Description |
| --- | --- | --- |
| source | string | Module path or name |
| names | list[string] | Imported identifiers (empty = wildcard or bare import) |
| start_line | int | Line where the import appears |
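
A common consumer of these fields is dependency triage. The sketch below splits import items into relative and external groups by inspecting the source prefix. `classify_imports` and the sample data are hypothetical, shaped like the fields above:

```python
# Illustrative sketch: a source starting with "." marks a relative
# (local) import; everything else is treated as external.

def classify_imports(imports):
    relative, external = [], []
    for imp in imports:
        bucket = relative if imp["source"].startswith(".") else external
        bucket.append(imp["source"])
    return relative, external

sample = [
    {"source": "os", "names": [], "start_line": 1},
    {"source": "./utils", "names": ["helper"], "start_line": 4},
]
relative, external = classify_imports(sample)
```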

exports — Exported Symbols

Symbols that are part of the module's public API.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
def public_function():
    pass

def _private_function():
    pass

class PublicClass:
    pass

__all__ = ["public_function", "PublicClass"]
"""

config = ProcessConfig(language="python", exports=True)
result = process(source, config)

for exp in result["exports"]:
    print(f"Name: {exp['name']}")   # "public_function", "PublicClass"
    print(f"Kind: {exp['kind']}")   # "function", "class"
    print()
```

```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";

const source = `
export function helper() {}
export const API_KEY = "secret";
export class Logger {}
function internal() {}
`;

const result = await process(source, {
  language: "typescript",
  exports: true,
});

result.exports.forEach(exp => {
  console.log(`${exp.kind.padEnd(10)} ${exp.name}`);
});
// Output:
// function   helper
// const      API_KEY
// class      Logger
```

Language Differences

Export detection varies by language:

  • Python: module-level items not prefixed with _, or those listed in __all__
  • JavaScript/TypeScript: only explicit export declarations
  • Rust: public items with pub visibility
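
One use of the distinction is finding internal helpers: cross-referencing structure against exports yields the declarations that are not part of the public API. The `internal_names` helper and sample data below are hypothetical sketches over the documented dict shapes:

```python
# Illustrative sketch: names present in structure but absent from
# exports are internal to the module.

def internal_names(structure, exports):
    exported = {e["name"] for e in exports}
    return sorted(i["name"] for i in structure if i["name"] not in exported)

structure = [
    {"kind": "function", "name": "public_function"},
    {"kind": "function", "name": "_private_function"},
]
exports = [{"kind": "function", "name": "public_function"}]
internal = internal_names(structure, exports)
```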

comments — Comments

All comments in the source, with their text and location.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
# Top-level comment
def process():
    # Inline comment
    return 42  # End-of-line comment
"""

config = ProcessConfig(language="python", comments=True)
result = process(source, config)

for comment in result["comments"]:
    print(f"Text: {comment['text']}")
    print(f"Start line: {comment['start_line']}")
    print(f"Is block: {comment['is_block']}")
    print()
```

Output:

```text
Text: # Top-level comment
Start line: 1
Is block: False

Text: # Inline comment
Start line: 3
Is block: False

Text: # End-of-line comment
Start line: 4
Is block: False
```

docstrings — Documentation Strings

Docstrings are automatically attached to their parent constructs in the structure field when docstrings=True.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = '''
def read_file(path: str) -> str:
    """Read and return the contents of a file.

    Args:
        path: Path to the file to read.

    Returns:
        The file contents as a string.

    Raises:
        FileNotFoundError: If the file does not exist.
    """
    with open(path) as f:
        return f.read()

class FileCache:
    """In-memory cache for file contents."""
    pass
'''

config = ProcessConfig(language="python", structure=True, docstrings=True)
result = process(source, config)

for item in result["structure"]:
    if item.get("docstring"):
        print(f"{item['kind']} {item['name']}:")
        print(f"  {item['docstring'][:80]}...")
        print()
```

Docstring conventions by language:

| Language | Convention |
| --- | --- |
| Python | """...""" triple-quoted string after def/class |
| Rust | /// or //! doc comments above the item |
| JavaScript/TypeScript | /** ... */ JSDoc above the function |
| Java | /** ... */ Javadoc above the method/class |
| Ruby | # ... lines immediately before def/class |
| Go | // FuncName ... comment block above the function |
| Elixir | @doc "..." or @moduledoc "..." |
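
Whatever the per-language convention, docstrings come back as plain strings, so post-processing is uniform. A common step is taking the first non-empty line as a one-line summary for search or outline display; the `first_line` helper below is an illustrative sketch, not a library function:

```python
# Illustrative sketch: first non-empty line of a docstring, used as
# a short summary. first_line is hypothetical.

def first_line(docstring):
    for line in (docstring or "").splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ""
```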

symbols — All Identifiers

A deduplicated list of all identifiers referenced in the file, useful for search indexing.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
from os import path
from typing import List

def process_files(directory: str) -> List[str]:
    results = []
    for file in path.listdir(directory):
        if file.endswith(".txt"):
            results.append(file)
    return results
"""

config = ProcessConfig(language="python", symbols=True)
result = process(source, config)

print("Symbols found:")
for symbol in sorted(result["symbols"]):
    print(f"  - {symbol}")
```

Output:

```text
Symbols found:
  - List
  - append
  - directory
  - endswith
  - file
  - listdir
  - os
  - path
  - process_files
  - results
  - txt
  - typing
```
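
For a multi-file search index, the per-file symbol lists are typically inverted into a symbol-to-files map. The `index_symbols` helper and the sample paths below are hypothetical:

```python
# Illustrative sketch: invert {path: [symbols]} into
# {symbol: [paths]} for fast lookup during code search.

def index_symbols(symbols_by_file):
    index = {}
    for path, symbols in symbols_by_file.items():
        for symbol in symbols:
            index.setdefault(symbol, []).append(path)
    return index

sample = {
    "a.py": ["process_files", "path"],
    "b.py": ["path", "helper"],
}
index = index_symbols(sample)
```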

diagnostics — Syntax Errors

Syntax errors and error nodes detected during parsing.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
def broken_function(  # missing closing paren
    print("hello")
"""

config = ProcessConfig(language="python", diagnostics=True)
result = process(source, config)

if result["diagnostics"]:
    for error in result["diagnostics"]:
        print(f"Error at line {error['start_line']}, col {error['start_col']}")
        print(f"  {error['message']}")
else:
    print("No syntax errors")
```

Tip

A non-empty diagnostics list does not mean the file is unparsable—tree-sitter recovers and produces a partial tree. Use diagnostics to detect and report broken syntax.
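
This makes diagnostics a natural fit for a CI-style syntax gate. The sketch below formats diagnostic items into compiler-style messages; `format_diagnostics` and the sample data are hypothetical, using the fields shown in the example above:

```python
# Illustrative sketch: turn diagnostics into path:line:col messages
# and a pass/fail flag for CI reporting.

def format_diagnostics(diagnostics, path):
    return [
        f"{path}:{d['start_line']}:{d['start_col']}: {d['message']}"
        for d in diagnostics
    ]

sample = [{"start_line": 2, "start_col": 27, "message": "syntax error"}]
messages = format_diagnostics(sample, "app.py")
ok = not messages  # False when any diagnostic was reported
```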

metrics — File Statistics

Basic metrics about the file.

```python
from tree_sitter_language_pack import process, ProcessConfig

source = """
# This is a comment

def hello():
    print("world")

# Another comment
x = 1
"""

config = ProcessConfig(language="python", metrics=True)
result = process(source, config)

m = result["metrics"]
print(f"Total lines: {m['total_lines']}")
print(f"Code lines: {m['code_lines']}")
print(f"Comment lines: {m['comment_lines']}")
print(f"Blank lines: {m['blank_lines']}")
print(f"Complexity: {m.get('complexity', 'N/A')}")
```

Metric fields:

| Field | Type | Description |
| --- | --- | --- |
| total_lines | int | Total lines in the file |
| code_lines | int | Non-blank, non-comment lines |
| comment_lines | int | Lines that are comments |
| blank_lines | int | Empty lines |
| complexity | int \| null | Cyclomatic complexity (if supported) |
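
Derived statistics are easy to compute from these fields. For example, a comment ratio (comment lines over comment plus code lines) can flag under-documented files; `comment_ratio` and the sample numbers below are hypothetical:

```python
# Illustrative sketch: share of commented lines among non-blank
# lines, guarding against division by zero.

def comment_ratio(metrics):
    commented = metrics["comment_lines"]
    denominator = metrics["code_lines"] + commented
    return commented / denominator if denominator else 0.0

m = {"total_lines": 10, "code_lines": 6, "comment_lines": 2, "blank_lines": 2}
ratio = comment_ratio(m)
```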

Chunking for LLMs

When chunk_max_size > 0, the result includes a chunks field with syntax-aware splits optimized for LLM token budgets. See Chunking for LLMs for full documentation.

```python
from tree_sitter_language_pack import process, ProcessConfig

config = ProcessConfig(
    language="python",
    chunk_max_size=1000,  # target 1000 tokens per chunk
    structure=True,
    imports=True,
)
result = process(source, config)

for i, chunk in enumerate(result["chunks"]):
    print(f"Chunk {i+1}: lines {chunk['start_line']}-{chunk['end_line']} "
          f"({chunk['token_count']} tokens)")
```
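
Downstream, chunks are often grouped into requests that stay under a model's context budget. The greedy packer below is an illustrative sketch; `pack_chunks` and the sample token counts are hypothetical, relying only on the token_count field shown above:

```python
# Illustrative sketch: greedily group consecutive chunks so each
# group's total token_count stays within the budget.

def pack_chunks(chunks, budget):
    groups, current, used = [], [], 0
    for chunk in chunks:
        if current and used + chunk["token_count"] > budget:
            groups.append(current)
            current, used = [], 0
        current.append(chunk)
        used += chunk["token_count"]
    if current:
        groups.append(current)
    return groups

sample = [{"token_count": 600}, {"token_count": 700}, {"token_count": 300}]
groups = pack_chunks(sample, budget=1000)
```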

Full Example

```python
from tree_sitter_language_pack import process, ProcessConfig

source = '''
"""Module for file operations."""
import os
from pathlib import Path

def read_file(path: str) -> str:
    """Read and return file contents.

    Args:
        path: Path to file.

    Returns:
        File contents.
    """
    # TODO: add error handling
    return Path(path).read_text()

class FileManager:
    """Manage file operations."""

    def __init__(self, root: str):
        self.root = Path(root)

    def get(self, name: str) -> str:
        """Get file contents."""
        return read_file(os.path.join(self.root, name))
'''

# Extract everything
config = ProcessConfig(
    language="python",
    structure=True,
    imports=True,
    exports=True,
    comments=True,
    docstrings=True,
    symbols=True,
    metrics=True,
)
result = process(source, config)

# Print structure
print("=== Structure ===")
for item in result["structure"]:
    print(f"{item['kind']:12} {item['name']:20} ({item['start_line']}-{item['end_line']})")

# Print imports
print("\n=== Imports ===")
for imp in result["imports"]:
    names = ", ".join(imp["names"]) if imp["names"] else "*"
    print(f"from {imp['source']} import {names}")

# Print exports
print("\n=== Exports ===")
for exp in result["exports"]:
    print(f"{exp['kind']:12} {exp['name']}")

# Print comments
print("\n=== Comments ===")
for comment in result["comments"]:
    print(f"Line {comment['start_line']:3} {comment['text']}")

# Print metrics
print("\n=== Metrics ===")
m = result["metrics"]
print(f"Lines: {m['total_lines']:3} total, {m['code_lines']:3} code, "
      f"{m['comment_lines']:3} comments, {m['blank_lines']:3} blank")

# Print symbols
print(f"\n=== Symbols ({len(result['symbols'])}) ===")
print(", ".join(sorted(result["symbols"])[:20]))  # first 20
```

Working with Language-Specific Results

Different languages produce different structure kinds, imports patterns, and docstring conventions. Always check what's available:

```python
from tree_sitter_language_pack import process, ProcessConfig

config = ProcessConfig(language="python", structure=True)
result = process("...", config)

# Group structure by kind
by_kind = {}
for item in result["structure"]:
    kind = item["kind"]
    if kind not in by_kind:
        by_kind[kind] = []
    by_kind[kind].append(item["name"])

for kind in sorted(by_kind.keys()):
    print(f"{kind}: {by_kind[kind]}")
```

```typescript
import { process } from "@kreuzberg/tree-sitter-language-pack";

const config = {
  language: "typescript",
  structure: true,
};
const result = await process("...", config);

// Group structure by kind
const byKind: Record<string, string[]> = {};
result.structure.forEach(item => {
  byKind[item.kind] ??= [];
  byKind[item.kind].push(item.name);
});

for (const kind of Object.keys(byKind).sort()) {
  console.log(`${kind}: ${byKind[kind]}`);
}
```

Performance Considerations

Enable Only What You Need

Enabling more features increases processing time. Start with just structure=True and add features as needed.

Reuse Parsers

The process() function creates a parser internally on each call. When processing many files in the same language, consider the lower-level parsing APIs for better control over parser reuse.

Chunking Overhead

Chunking (chunk_max_size > 0) adds computational cost. Only enable if you plan to split code for LLM ingestion.

Next Steps

  • Parse without intelligence: Use Parsing for raw syntax trees and low-level tree navigation.
  • Split for LLMs: See Chunking for LLMs for syntax-aware code splitting.
  • Configure cache: Use Configuration to set cache directories and pre-download languages.