Tree-sitter and its Query

Tree-sitter is a parser generator, that is, we can leverage it to generate a specific parser. In addition to that, it also offers other functionality. For example, we can write a Query using S-expression, which will do the pattern matching on the AST. Before we delve into this feature, let’s talk about some backgrounds

Info
If you once wrote Lisp (or its dialects, such as Scheme and Racket), you should be familiar with the S-expression

We can define a S-expression recursively, the definition is

  • A S-expression can be x, where x is an atom
  • A S-expression is not empty and follows the form of (S1 S2 ... Sn), where each Si is a S-expression

The definition of a tree in data structure is also recursive, so we can use a S-expression to represent the tree. For example

python

        +
       / \
      1   -
         / \
        2   3

The AST of 1 + (2 - 3) $\uparrow$, which has the following S-expression representation.

lisp

(+ 1 (- 2 3))

In this situation, the first S-expression in () is the parent node, and the others are its children.

The Tree-sitter supports many programming languages. For each programming language, there is a GitHub repository called tree-sitter-*, where * represents the programming language. For example, Python’s corresponding GitHub repository is tree-sitter-python.

You may find a specific file called grammar.js in such a repository. This file defines the grammar of the programming language. Let’s take tree-sitter-python for an example, the grammar shows the syntax of for-loop is

js

for_statement: $ => seq(
  optional('async'),
  'for',
  field('left', $._left_hand_side),
  'in',
  field('right', $._expressions),
  ':',
  field('body', $._suite),
  field('alternative', optional($.else_clause)),
),

If you have some compiler background, it’s easy to infer the syntax of for-loop in Python is

python

[async] for <left> in <right>:
    <body>
[<else_clause>]
Tip
The [] means it’s optional, and the <> means they can be further expanded.

After knowing how the AST is defined, we want to know how to generate an AST for a code snippet using the parser produced by Tree-sitter.

We can install tree-sitter in the local machine. If you are using the Mac, you can use homebrew

sh

$ brew install tree-sitter
$ tree-sitter init-config

Then, we need to clone the tree-sitter-* GitHub repository we want to use. After cloning this repo, we can run the following commands (take tree-sitter-python for an example)

sh

$ git clone git@github.com:tree-sitter/tree-sitter-python.git
$ tree-sitter generate  # generate a parser in src/parser.c

Now, in the root directory of tree-sitter-python, we can generate the AST. Before that, let’s just write a simple Python code snippet.

python

def sum(x: int, y: int) -> int:
    return x + y


sum(1, 2)

The output is a S-expression and it should be like

python

(module [0, 0] - [5, 0]
  (function_definition [0, 0] - [1, 16]
    name: (identifier [0, 4] - [0, 7])
    parameters: (parameters [0, 7] - [0, 23]
      (typed_parameter [0, 8] - [0, 14]
        (identifier [0, 8] - [0, 9])
        type: (type [0, 11] - [0, 14]
          (identifier [0, 11] - [0, 14])))
      (typed_parameter [0, 16] - [0, 22]
        (identifier [0, 16] - [0, 17])
        type: (type [0, 19] - [0, 22]
          (identifier [0, 19] - [0, 22]))))
    return_type: (type [0, 27] - [0, 30]
      (identifier [0, 27] - [0, 30]))
    body: (block [1, 4] - [1, 16]
      (return_statement [1, 4] - [1, 16]
        (binary_operator [1, 11] - [1, 16]
          left: (identifier [1, 11] - [1, 12])
          right: (identifier [1, 15] - [1, 16])))))
  (expression_statement [4, 0] - [4, 9]
    (call [4, 0] - [4, 9]
      function: (identifier [4, 0] - [4, 3])
      arguments: (argument_list [4, 3] - [4, 9]
        (integer [4, 4] - [4, 5])
        (integer [4, 7] - [4, 8])))))

The [x1, y1] - [x2, y2] here denotes the range of the source code for a specific AST node. That is, row x1 to row x2 and column y1 to column y2. Note that the index starts from 0 and it is right exclusive.

Tip
There is an online Playground on the official website. You can just try it!

Here comes the main topic of this post. How to find the AST nodes you want to match? Just use the Query feature.

The Query is just a S-expression. For example, if we want to find the function definition in Python code, we can write

lisp

(function_definition)

In addition to that, the Tree-sitter also enables us to capture the matched source code and save it in a variable. We just need to put @variable_name after the specific S-expression.

lisp

(function_definition) @func
Tip
If you are using the online Playground, you can see the matched source code is highlighted.

Tip
There are two types of AST nodes in Tree-sitter. The first one is named AST node, and the other one is anonymous AST node. The syntax of named AST nodes is field_name: S-expression.

Usually, we want to restrict the matched AST nodes. For example, we want to match an AST node only when it has specific children. We can do this by just adding the field: prefix. The following Query will find Python functions that have type hints for return value.

lisp

(function_definition
  return_type: (type)) @has_return_type_func

Similarly, we can only match an AST node when it does not have specific children. We just need to add ! before a field. The following Query will find Python functions which does not have type hints for the return value.

lisp

(function_definition
  !return_type) @no_return_type_func

What about the anonymous AST nodes? We specific the expected string directly without using the (). Let’s say we want to find binary operations which do addition

lisp

(binary_operator
  operator: "+") @add_op
Tip
The anonymous AST nodes won’t be shown in the AST. Then how can we know how to write the Query? The answer is hidden in the grammar.js file. I referred to these lines when I wrote the above Query.

Now let’s take a step further, the Query supports regex expression natively. All of *, ?, +, [abc] are supported. Just consider the following Python code.

python

arr1 = []
arr2 = [1]
arr3 = [1, 2]

We want to extract all the integers within all Python lists. However, the number of integers varies in different lists. That is exactly what * did in regex expression

lisp

((integer) ("," (integer))*)@int_list

Besides, the Query supports the . mark. The functionality will change depending on the location of .

If . appears before the first children (denoted by S2), then the S2 must be the first children of the parent node (denoted by S1). If we want to extract all the first elements in all Python lists, we can write

lisp

(list . (integer) @first_int)

If . appears after the last children (denoted by Sn), then the Sn must be the last children of the parent node (denoted by S1). If we want to extract all the last elements in all Python lists, we can write

lisp

(list (integer) @last_int .)

Things get more complicated when the elements within a Python list have different types. For example

python

arr1 = []
arr2 = [None]
arr3 = [1, 2, "90"]

Now we can not just use (integer) to match the elements. What we need is a special mark that denotes any possibility. That is what _ did in Python code, the Query just borrows it.

lisp

(list (_) @last_int .)
Tip
The (_) can match any named AST nodes, and the _ can match both named AST nodes and anonymous AST nodes.

Let’s dive deeper into the Query feature. Using @variable_name to capture the matched source code is indeed useful. If we can apply some condition to control if the match is considered legal, it would be much better! Luckily, the Tree-sitter offers this functionality called predicate :)

The predicate is also a S-expression, and the syntax is (S1 S2 S3)

  • S1: It can be any of #eq?, #not-eq?, #any-eq?, #any-not-eq?, #match?, #not-match?, #any-match?, #any-not-match?, #any-of?, #not-any-of?. The functionality is indicated by its name.
  • S2: It must be a @variable_name
  • S3: It can be a @variable_name, or a string (for *eq?), a regex expression (for *match?), or candidates (for *-of?). The candidates do not need to be parenthesized.
Info
The eq? is a function in Lisp and its dialects

Now, let’s use the predicate to extract the last element only when it is a string and it is equal to "90". The Query would be like

lisp

(list (string) @last_str (#eq? @last_str "90") .)

Previously, we used the CLI command and the online Playground to demonstrate the ideas. Can we use other programming languages to do the same thing? Of course, we can. The Tree-sitter provides us with many programming languages bindings

Let’s use the Python binding. First, we need to install some packages

sh

$ pip install tree-sitter
$ pip install tree-sitter-python

Then we can start to write Python code. After generating the Parser, we can exploit the parser to parse Python code.

python

import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())

code = """def sum(x: int, y: int) -> int:
    return x + y


sum(1, 2)
"""

parser = Parser(PY_LANGUAGE)
tree = parser.parse(bytes(code, encoding="utf8"))

Now tree is the AST. Let’s check the root_node (module) to see if the result is right

python

parser = Parser(PY_LANGUAGE)
tree = parser.parse(bytes(code, encoding="utf8"))
root_node = tree.root_node

print(root_node.type)  # module
print(root_node.start_point)  # Point(row=0, column=0)
print(root_node.end_point)  # Point(row=5, column=0)

We can see from the output that the result is aligned with the previous result

lisp

(module [0, 0] - [5, 0]
... ;; omit

Now we can write a Query, which will extract the function’s arguments with type hints.

python

query = PY_LANGUAGE.query(
    """(typed_parameter
         (identifier)@name
         type: (type (identifier)@val))"""
)
captures = query.captures(root_node)
for arg_name, arg_type in zip(captures["name"], captures["val"]):
    print(f"{str(arg_name.text)}: {str(arg_type.text)}")

Tree-sitter is a powerful parser generator. I use its Query feature heavily to do program analysis. The information extracted from the AST can be further analyzed. Compared to another parse generator (like ANTLR), I prefer Tree-sitter because it is easier to use in my opinion. This may be related to my background, I once learned Scheme and Racket because of my interest. I am quite familiar with writing a S-expression :)