Tree-sitter and its Query

MartinLwx included in category Program-Analysis

2024-09-07 2025-04-26 1799 words 8 minutes

Contents

Info

Updates：

2025-04-26: fix the broken playground link and add a new chapter - Playground in Neovim.

Intro

Tree-sitter is a parser generator, that is, we can leverage it to generate a specific parser. In addition to that, it also offers other functionality. For example, we can write a Query using S-expression, which will do the pattern matching on the AST. Before we delve into this feature, let’s talk about some backgrounds.

S-expression

Info

If you once wrote Lisp (or its dialects, such as Scheme and Racket), you should be familiar with the S-expression

We can define a S-expression recursively, the definition is

A S-expression can be x, where x is an atom
A S-expression is not empty and follows the form of (S1 S2 ... Sn), where each Si is a S-expression

The definition of a tree in data structure is also recursive, so we can use a S-expression to represent the tree. For example

The AST of 1 + (2 - 3) $\uparrow$, which has the following S-expression representation.

(+ 1 (- 2 3))

In this situation, the first S-expression in () is the parent node, and the others are its children.

AST in Tree-sitter

The Tree-sitter supports many programming languages. For each programming language, there is a GitHub repository called tree-sitter-*, where * represents the programming language. For example, Python’s corresponding GitHub repository is tree-sitter-python.

You may find a specific file called grammar.js in such a repository. This file defines the grammar of the programming language. Let’s take tree-sitter-python for an example, the grammar shows the syntax of for-loop is

for_statement: $ => seq(
  optional('async'),
  'for',
  field('left', $._left_hand_side),
  'in',
  field('right', $._expressions),
  ':',
  field('body', $._suite),
  field('alternative', optional($.else_clause)),
),

If you have some compiler background, it’s easy to infer the syntax of for-loop in Python is

[async] for <left> in <right>:
    <body>
[<else_clause>]

Tip

The [] means it’s optional, and the <> means they can be further expanded.

After knowing how the AST is defined, we want to know how to generate an AST for a code snippet using the parser produced by Tree-sitter.

We can install tree-sitter in the local machine. If you are using the Mac, you can use homebrew

$ brew install tree-sitter
$ tree-sitter init-config

Then, we need to clone the tree-sitter-* GitHub repository we want to use. After cloning this repo, we can run the following commands (take tree-sitter-python for an example)

$ git clone git@github.com:tree-sitter/tree-sitter-python.git
$ tree-sitter generate  # generate a parser in src/parser.c

Tip

If you want to use the tree-sitter CLI tool anywhere, then you need to

Use the tree-sitter init-config command to find where the config file is in your OS
Open this config file and check the parse-directories attribute
Copy your tree-sitter-* GitHub repository to one of the folder

Now, in the root directory of tree-sitter-python, we can generate the AST. Before that, let’s just write a simple Python code snippet.

def sum(x: int, y: int) -> int:
    return x + y


sum(1, 2)

Then, we can use the following command to parse this code snippet. The output should be like

(module [0, 0] - [5, 0]
  (function_definition [0, 0] - [1, 16]
    name: (identifier [0, 4] - [0, 7])
    parameters: (parameters [0, 7] - [0, 23]
      (typed_parameter [0, 8] - [0, 14]
        (identifier [0, 8] - [0, 9])
        type: (type [0, 11] - [0, 14]
          (identifier [0, 11] - [0, 14])))
      (typed_parameter [0, 16] - [0, 22]
        (identifier [0, 16] - [0, 17])
        type: (type [0, 19] - [0, 22]
          (identifier [0, 19] - [0, 22]))))
    return_type: (type [0, 27] - [0, 30]
      (identifier [0, 27] - [0, 30]))
    body: (block [1, 4] - [1, 16]
      (return_statement [1, 4] - [1, 16]
        (binary_operator [1, 11] - [1, 16]
          left: (identifier [1, 11] - [1, 12])
          right: (identifier [1, 15] - [1, 16])))))
  (expression_statement [4, 0] - [4, 9]
    (call [4, 0] - [4, 9]
      function: (identifier [4, 0] - [4, 3])
      arguments: (argument_list [4, 3] - [4, 9]
        (integer [4, 4] - [4, 5])
        (integer [4, 7] - [4, 8])))))

The [x1, y1] - [x2, y2] here denotes the range of the source code for a specific AST node. That is, row x1 to row x2 and column y1 to column y2. Note that the index starts from 0 and it is right exclusive.

Basic Query Syntax

Here comes the main topic of this post. How to find the AST nodes you want to match? Just use the Query feature.

The Query is just a S-expression. For example, if we want to find the function definition in Python code, we can write

(function_definition)

In addition to that, the Tree-sitter also enables us to capture the matched source code and save it in a variable. We just need to put @variable_name after the specific S-expression.

(function_definition) @func

Tip

The captured function definition source code will be stored at func.

Tip

There are two types of AST nodes in Tree-sitter. The first one is named AST node, and the other one is anonymous AST node. The syntax of named AST nodes is field_name: S-expression.

Usually, we want to restrict the matched AST nodes. For example, we want to match an AST node only when it has specific children. We can do this by just adding the field: prefix. The following Query will find Python functions that have type hints for return value.

(function_definition
  return_type: (type)) @has_return_type_func

Similarly, we can only match an AST node when it does not have specific children. We just need to add ! before a field. The following Query will find Python functions which does not have type hints for the return value.

(function_definition
  !return_type) @no_return_type_func

What about the anonymous AST nodes? We specific the expected string directly without using the (). Let’s say we want to find binary operations which do addition.

(binary_operator
  operator: "+") @add_op

Tip

Since anonymous AST nodes aren’t visible in the AST representation, you might wonder how to write queries for them. The answer is hidden in the grammar.js file. I referred to these lines when I wrote the above Query.

Advanced Query Syntax

Now let’s take a step further, the Query supports regex expression natively. All of *, ?, +, [abc] are supported. Just consider the following Python code.

arr1 = []
arr2 = [1]
arr3 = [1, 2]

We want to extract all the integers within all Python lists. However, the number of integers varies in different lists. That is exactly what * did in regex expression

((integer) ("," (integer))*)@int_list

Besides, the Query supports the . mark. The functionality will change depending on the location of .

If . appears before the first children (denoted by S2), then the S2 must be the first children of the parent node (denoted by S1). If we want to extract all the first elements in all Python lists, we can write

(list . (integer) @first_int)

If . appears after the last children (denoted by Sn), then the Sn must be the last children of the parent node (denoted by S1). If we want to extract all the last elements in all Python lists, we can write

(list (integer) @last_int .)

Things get more complicated when the elements within a Python list have different types. For example

arr1 = []
arr2 = [None]
arr3 = [1, 2, "90"]

Now we can not just use (integer) to match the elements. What we need is a special mark that denotes any possibility. That is what _ did in Python code, the Query just borrows it.

(list (_) @last_int .)

Tip

The (_) can match any named AST nodes, and the _ can match both named AST nodes and anonymous AST nodes.

More Advanced Query Syntax

Let’s dive deeper into the Query feature. Using @variable_name to capture the matched source code is indeed useful. If we can apply some condition to control if the match is considered legal, it would be much better! Luckily, the Tree-sitter offers this functionality called predicate :)

The predicate is also a S-expression, and the syntax is (S1 S2 S3)

S1: It can be any of #eq?, #not-eq?, #any-eq?, #any-not-eq?, #match?, #not-match?, #any-match?, #any-not-match?, #any-of?, #not-any-of?. The functionality is indicated by its name.
S2: It must be a @variable_name
S3: It can be a @variable_name, or a string (for *eq?), a regex expression (for *match?), or candidates (for *-of?). The candidates do not need to be parenthesized.

Info

The eq? is a function in Lisp and its dialects

Now, let’s use the predicate to extract the last element only when it is a string and it is equal to "90". The Query would be like

(list (string) @last_str (#eq? @last_str "90") .)

Bindings

Previously, we used the CLI command and the online Playground to demonstrate the ideas. Can we use other programming languages to do the same thing? Of course, we can. The Tree-sitter provides us with many programming languages bindings

Let’s use the Python binding. First, we need to install some packages.

$ pip install tree-sitter
$ pip install tree-sitter-python

Then we can start to write Python code. After generating the Parser, we can exploit the parser to parse Python code.

import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())

code = """def sum(x: int, y: int) -> int:
    return x + y


sum(1, 2)
"""

parser = Parser(PY_LANGUAGE)
tree = parser.parse(bytes(code, encoding="utf8"))

Now tree is the AST. Let’s check the root_node (module) to see if the result is right.

parser = Parser(PY_LANGUAGE)
tree = parser.parse(bytes(code, encoding="utf8"))
root_node = tree.root_node

print(root_node.type)  # module
print(root_node.start_point)  # Point(row=0, column=0)
print(root_node.end_point)  # Point(row=5, column=0)

We can see from the output that the result is aligned with the previous result.

(module [0, 0] - [5, 0]
... ;; omit

Now we can write a Query, which will extract the function’s arguments with type hints.

query = PY_LANGUAGE.query(
    """(typed_parameter
         (identifier)@name
         type: (type (identifier)@val))"""
)
captures = query.captures(root_node)
for arg_name, arg_type in zip(captures["name"], captures["val"]):
    print(f"{str(arg_name.text)}: {str(arg_type.text)}")

Playground in Neovim

If you’re also using Neovim and have installed the nvim-treesitter, you can simply use the built-in playground feature.

You can launch the interactive playground by running the :EditQuery command. As shown in the image below, matched code segments are visually highlighted (shadowed) for easy identification.

Wrap-up

Tree-sitter is a powerful parser generator. I use its Query feature heavily to do program analysis. The information extracted from the AST can be further analyzed. Compared to another parse generator (like ANTLR), I prefer Tree-sitter because it is easier to use in my opinion. This may be related to my background, I once learned Scheme and Racket because of my interest. I am quite familiar with writing a S-expression :)