Tree-sitter and its Query
Intro
Tree-sitter is a parser generator, that is, we can leverage it to generate a specific parser. In addition to that, it also offers other functionality. For example, we can write a Query using S-expression, which will do the pattern matching on the AST. Before we delve into this feature, let’s talk about some backgrounds
S-expression
We can define a S-expression recursively, the definition is
- A S-expression can be
x
, wherex
is an atom - A S-expression is not empty and follows the form of
(S1 S2 ... Sn)
, where eachSi
is a S-expression
The definition of a tree in data structure is also recursive, so we can use a S-expression to represent the tree. For example
+
/ \
1 -
/ \
2 3
The AST of 1 + (2 - 3)
$\uparrow$, which has the following S-expression representation.
(+ 1 (- 2 3))
In this situation, the first S-expression in ()
is the parent node, and the others are its children.
AST in Tree-sitter
The Tree-sitter supports many programming languages. For each programming language, there is a GitHub repository called tree-sitter-*
, where *
represents the programming language. For example, Python’s corresponding GitHub repository is tree-sitter-python.
You may find a specific file called grammar.js
in such a repository. This file defines the grammar of the programming language. Let’s take tree-sitter-python
for an example, the grammar shows the syntax of for-loop is
for_statement: $ => seq(
optional('async'),
'for',
field('left', $._left_hand_side),
'in',
field('right', $._expressions),
':',
field('body', $._suite),
field('alternative', optional($.else_clause)),
),
If you have some compiler background, it’s easy to infer the syntax of for-loop in Python is
[async] for <left> in <right>:
<body>
[<else_clause>]
[]
means it’s optional, and the <>
means they can be further expanded.After knowing how the AST is defined, we want to know how to generate an AST for a code snippet using the parser produced by Tree-sitter.
We can install tree-sitter
in the local machine. If you are using the Mac, you can use homebrew
$ brew install tree-sitter
$ tree-sitter init-config
Then, we need to clone the tree-sitter-*
GitHub repository we want to use. After cloning this repo, we can run the following commands (take tree-sitter-python for an example)
$ git clone git@github.com:tree-sitter/tree-sitter-python.git
$ tree-sitter generate # generate a parser in src/parser.c
If you want to use the tree-sitter
CLI tool anywhere, then you need to
- Use the
tree-sitter init-config
command to find where the config file is in your OS - Open this config file and check the
parse-directories
attribute - Copy your
tree-sitter-*
GitHub repository to one of the folder
Now, in the root directory of tree-sitter-python
, we can generate the AST. Before that, let’s just write a simple Python code snippet.
def sum(x: int, y: int) -> int:
return x + y
sum(1, 2)
Then, we can use the following command to parse this code snippet. The output should be like
(module [0, 0] - [5, 0]
(function_definition [0, 0] - [1, 16]
name: (identifier [0, 4] - [0, 7])
parameters: (parameters [0, 7] - [0, 23]
(typed_parameter [0, 8] - [0, 14]
(identifier [0, 8] - [0, 9])
type: (type [0, 11] - [0, 14]
(identifier [0, 11] - [0, 14])))
(typed_parameter [0, 16] - [0, 22]
(identifier [0, 16] - [0, 17])
type: (type [0, 19] - [0, 22]
(identifier [0, 19] - [0, 22]))))
return_type: (type [0, 27] - [0, 30]
(identifier [0, 27] - [0, 30]))
body: (block [1, 4] - [1, 16]
(return_statement [1, 4] - [1, 16]
(binary_operator [1, 11] - [1, 16]
left: (identifier [1, 11] - [1, 12])
right: (identifier [1, 15] - [1, 16])))))
(expression_statement [4, 0] - [4, 9]
(call [4, 0] - [4, 9]
function: (identifier [4, 0] - [4, 3])
arguments: (argument_list [4, 3] - [4, 9]
(integer [4, 4] - [4, 5])
(integer [4, 7] - [4, 8])))))
The [x1, y1] - [x2, y2]
here denotes the range of the source code for a specific AST node. That is, row x1
to row x2
and column y1
to column y2
. Note that the index starts from 0 and it is right exclusive.
Basic Query Syntax
Here comes the main topic of this post. How to find the AST nodes you want to match? Just use the Query feature.
The Query is just a S-expression. For example, if we want to find the function definition in Python code, we can write
(function_definition)
In addition to that, the Tree-sitter also enables us to capture the matched source code and save it in a variable. We just need to put @variable_name
after the specific S-expression.
(function_definition) @func
field_name: S-expression
.Usually, we want to restrict the matched AST nodes. For example, we want to match an AST node only when it has specific children. We can do this by just adding the field:
prefix. The following Query will find Python functions that have type hints for return value.
(function_definition
return_type: (type)) @has_return_type_func
Similarly, we can only match an AST node when it does not have specific children. We just need to add !
before a field. The following Query will find Python functions which does not have type hints for the return value.
(function_definition
!return_type) @no_return_type_func
What about the anonymous AST nodes? We specific the expected string directly without using the ()
. Let’s say we want to find binary operations which do addition
(binary_operator
operator: "+") @add_op
grammar.js
file. I referred to these lines when I wrote the above Query.Advanced Query Syntax
Now let’s take a step further, the Query supports regex expression natively. All of *, ?, +, [abc]
are supported. Just consider the following Python code.
arr1 = []
arr2 = [1]
arr3 = [1, 2]
We want to extract all the integers within all Python lists. However, the number of integers varies in different lists. That is exactly what *
did in regex expression
((integer) ("," (integer))*)@int_list
Besides, the Query supports the .
mark. The functionality will change depending on the location of .
If .
appears before the first children (denoted by S2
), then the S2
must be the first children of the parent node (denoted by S1
). If we want to extract all the first elements in all Python lists, we can write
(list . (integer) @first_int)
If .
appears after the last children (denoted by Sn
), then the Sn
must be the last children of the parent node (denoted by S1
). If we want to extract all the last elements in all Python lists, we can write
(list (integer) @last_int .)
Things get more complicated when the elements within a Python list have different types. For example
arr1 = []
arr2 = [None]
arr3 = [1, 2, "90"]
Now we can not just use (integer)
to match the elements. What we need is a special mark that denotes any possibility. That is what _
did in Python code, the Query just borrows it.
(list (_) @last_int .)
(_)
can match any named AST nodes, and the _
can match both named AST nodes and anonymous AST nodes.More Advanced Query Syntax
Let’s dive deeper into the Query feature. Using @variable_name
to capture the matched source code is indeed useful. If we can apply some condition to control if the match is considered legal, it would be much better! Luckily, the Tree-sitter offers this functionality called predicate :)
The predicate is also a S-expression, and the syntax is (S1 S2 S3)
S1
: It can be any of#eq?, #not-eq?, #any-eq?, #any-not-eq?, #match?, #not-match?, #any-match?, #any-not-match?, #any-of?, #not-any-of?
. The functionality is indicated by its name.S2
: It must be a@variable_name
S3
: It can be a@variable_name
, or a string (for*eq?
), a regex expression (for*match?
), or candidates (for*-of?
). The candidates do not need to be parenthesized.
eq?
is a function in Lisp and its dialectsNow, let’s use the predicate to extract the last element only when it is a string and it is equal to "90"
. The Query would be like
(list (string) @last_str (#eq? @last_str "90") .)
Bindings
Previously, we used the CLI command and the online Playground to demonstrate the ideas. Can we use other programming languages to do the same thing? Of course, we can. The Tree-sitter provides us with many programming languages bindings
Let’s use the Python binding. First, we need to install some packages
$ pip install tree-sitter
$ pip install tree-sitter-python
Then we can start to write Python code. After generating the Parser, we can exploit the parser to parse Python code.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser
PY_LANGUAGE = Language(tspython.language())
code = """def sum(x: int, y: int) -> int:
return x + y
sum(1, 2)
"""
parser = Parser(PY_LANGUAGE)
tree = parser.parse(bytes(code, encoding="utf8"))
Now tree
is the AST. Let’s check the root_node (module
) to see if the result is right
parser = Parser(PY_LANGUAGE)
tree = parser.parse(bytes(code, encoding="utf8"))
root_node = tree.root_node
print(root_node.type) # module
print(root_node.start_point) # Point(row=0, column=0)
print(root_node.end_point) # Point(row=5, column=0)
We can see from the output that the result is aligned with the previous result
(module [0, 0] - [5, 0]
... ;; omit
Now we can write a Query, which will extract the function’s arguments with type hints.
query = PY_LANGUAGE.query(
"""(typed_parameter
(identifier)@name
type: (type (identifier)@val))"""
)
captures = query.captures(root_node)
for arg_name, arg_type in zip(captures["name"], captures["val"]):
print(f"{str(arg_name.text)}: {str(arg_type.text)}")
Wrap-up
Tree-sitter is a powerful parser generator. I use its Query feature heavily to do program analysis. The information extracted from the AST can be further analyzed. Compared to another parse generator (like ANTLR), I prefer Tree-sitter because it is easier to use in my opinion. This may be related to my background, I once learned Scheme and Racket because of my interest. I am quite familiar with writing a S-expression :)