Extraction queries

Extraction queries let you run arbitrary tree-sitter S-expression queries against source code. Each match returns captured text, node metadata, and optionally the text of named child fields.

Use extract() when the built-in fields in process() don't cover your use case — finding all calls to a specific function, listing decorator names, extracting test method names matching a pattern. The two can also run together: ProcessConfig.extractions runs custom patterns alongside the standard analysis pass.

Basic usage¶

The simplest case: find all function names.

PythonNode.jsRust

from tree_sitter_language_pack import extract

source = "def hello(): pass\ndef world(): pass\n"

result = extract(source, {
    "language": "python",
    "patterns": {
        "functions": {
            "query": "(function_definition name: (identifier) @fn_name)",
        }
    }
})

for match in result["functions"]["matches"]:
    for capture in match["captures"]:
        print(capture["text"])
# hello
# world

import { extract } from "@kreuzberg/tree-sitter-language-pack";

const source = "def hello(): pass\ndef world(): pass\n";

const result = extract(source, {
  language: "python",
  patterns: {
    functions: {
      query: "(function_definition name: (identifier) @fn_name)",
    },
  },
});

for (const match of result.functions.matches) {
  for (const capture of match.captures) {
    console.log(capture.text);
  }
}

use tree_sitter_language_pack::{ExtractionConfig, ExtractionPattern, extract_patterns};
use ahash::AHashMap;

let mut patterns = AHashMap::new();
patterns.insert("functions".to_string(), ExtractionPattern {
    query: "(function_definition name: (identifier) @fn_name)".to_string(),
    ..Default::default()
});

let result = extract_patterns(
    "def hello(): pass\ndef world(): pass\n",
    &ExtractionConfig { language: "python".to_string(), patterns },
)?;

assert_eq!(result.results["functions"].total_count, 2);

Extracting child fields¶

When you capture a parent node, child_fields pulls the text of its named children without needing to write extra captures:

result = extract("def greet(name): pass\n", {
    "language": "python",
    "patterns": {
        "functions": {
            "query": "(function_definition) @fn_def",
            "child_fields": ["name", "parameters"],
        }
    }
})

capture = result["functions"]["matches"][0]["captures"][0]
print(capture["child_fields"]["name"])        # "greet"
print(capture["child_fields"]["parameters"])  # "(name)"

Field names depend on the grammar. Run ts-pack parse file.py to see the sexp with field labels, or check the grammar's node-types.json.

Capture output modes¶

capture_output controls how much data each match returns:

Mode	`text`	`node`	Use when
`"Text"` (default for most cases)	present	null	You only need the matched text
`"Node"`	null	present	You only need position/type info
`"Full"`	present	present	You need both

The node field contains type, start_byte, end_byte, start_point, and end_point.

result = extract(source, {
    "language": "python",
    "patterns": {
        "names": {
            "query": "(function_definition name: (identifier) @fn_name)",
            "capture_output": "Text",
        }
    }
})

capture = result["names"]["matches"][0]["captures"][0]
print(capture["text"])  # "hello"
print(capture["node"])  # None

Limiting results¶

max_results caps the matches list. total_count always shows the true match count, so you can detect truncation:

result = extract(source, {
    "language": "python",
    "patterns": {
        "fns": {
            "query": "(function_definition name: (identifier) @fn_name)",
            "max_results": 5,
        }
    }
})

pattern = result["fns"]
print(len(pattern["matches"]))  # at most 5
print(pattern["total_count"])   # actual count, e.g. 42

Restricting to a byte range¶

byte_range limits extraction to a [start, end] byte offset range. Matches return when the root node falls within the range:

source = "def a(): pass\ndef b(): pass\ndef c(): pass\n"

result = extract(source, {
    "language": "python",
    "patterns": {
        "fns": {
            "query": "(function_definition name: (identifier) @fn_name)",
            "byte_range": [14, 28],
        }
    }
})

# Only "b" falls in bytes 14-28
print(result["fns"]["matches"][0]["captures"][0]["text"])  # "b"

Validating queries¶

validate_extraction() checks query syntax without running extraction. Use it to catch mistakes before running extractions:

from tree_sitter_language_pack import validate_extraction

result = validate_extraction({
    "language": "python",
    "patterns": {
        "good": {
            "query": "(function_definition name: (identifier) @fn_name)",
        },
        "bad": {
            "query": "((((not valid syntax",
        }
    }
})

print(result["valid"])                    # False
print(result["patterns"]["good"]["valid"])  # True
print(result["patterns"]["good"]["capture_names"])  # ["fn_name"]
print(result["patterns"]["bad"]["errors"])          # ["<syntax error>"]

Each pattern result has:

Field	Type	Description
`valid`	bool	Whether the query compiled
`capture_names`	list[str]	Capture names in the query
`pattern_count`	int	Number of patterns
`warnings`	list[str]	Non-fatal warnings
`errors`	list[str]	Fatal errors

Combined with process()¶

ProcessConfig.extractions runs custom patterns alongside the standard analysis pass. Results appear in result["extractions"], keyed by pattern name:

from tree_sitter_language_pack import process, ProcessConfig

result = process(source, ProcessConfig(
    language="python",
    structure=True,
    extractions={
        "decorators": {
            "query": "(decorator) @dec",
            "capture_output": "Text",
        }
    }
))

print(result["structure"])                              # standard results
print(result["extractions"]["decorators"]["matches"])   # custom results

Compiled extraction (Rust)¶

CompiledExtraction pre-compiles queries once for reuse across inputs — useful when processing files with the same patterns:

use tree_sitter_language_pack::CompiledExtraction;

let compiled = CompiledExtraction::compile(&config)?;

// Reuse across many files
let r1 = compiled.extract("def a(): pass\n")?;
let r2 = compiled.extract("def x(): pass\ndef y(): pass\n")?;

assert_eq!(r1.results["fns"].total_count, 1);
assert_eq!(r2.results["fns"].total_count, 2);

CompiledExtraction is Send + Sync. To skip re-parsing when you already have a tree:

let tree = parse_string("python", source.as_bytes())?;
let result = compiled.extract_from_tree(&tree, source.as_bytes())?;

Binding support¶

Binding	`extract()`	`validate_extraction()`
Python	yes	yes
Node.js	yes	yes
Rust	yes	yes
Ruby	yes	yes
Elixir	yes	yes
PHP	yes	yes
Wasm	yes	yes
C FFI	yes	yes
Go	not yet	not yet
C#	not yet	not yet
Java	not yet	not yet

Next steps¶

Code intelligence — built-in extraction for common patterns (structure, imports, exports)
Parsing code — understanding the syntax tree your queries run against