Tail call and Tail-call Optimization (TCO)
Tail call & Tail-recursion
Prerequisites: function calls & the call stack. The call stack contains all functions that are currently executing but have not yet returned.
- Calling a function `f` pushes a stack frame for `f` onto the call stack.
- When `f` finishes execution and returns, its stack frame is popped off the stack.
You may find the code demo here
Assume that function A calls function B. We call function A Caller and function B Callee.
A tail call occurs when the caller only needs to wait for the callee’s return value, because all of its other work has already been completed.
If this is a recursive call (i.e., the caller and the callee are the same function), the function is said to be tail-recursive.
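A minimal sketch in OCaml (the language used later in this post) makes the distinction concrete; the function names here are mine, chosen for illustration:

```ocaml
(* NOT a tail call: after [count] returns, the caller still has to add 1,
   so the caller's stack frame must be kept alive. *)
let rec count n = if n = 0 then 0 else 1 + count (n - 1)

(* Tail call: the recursive call is the very last thing the function does.
   Since caller and callee are the same function, it is tail-recursive. *)
let rec count_tail n acc = if n = 0 then acc else count_tail (n - 1) (acc + 1)
```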
Tail-call Optimization (TCO)
According to the mechanism of a function call, a stack frame is created to store bookkeeping information such as the return address. What if the call is a tail call? Can you spot an optimization opportunity?
By definition, in a tail call the caller has already done all of its work and only needs to wait for the result, so most of the information in the caller’s stack frame is no longer needed. We can therefore reuse the caller’s stack frame instead of creating a new one. That is exactly what tail-call optimization (TCO) does.
The benefit of TCO is that the overhead of creating new stack frames is avoided. We can safely use tail-recursion without worrying about the famous stack overflow issue, as long as the programming language supports TCO, and it’s enabled.
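A quick way to see this in practice is the following sketch (names are mine; the exact recursion depth at which the naive version overflows depends on your stack size and compiler settings):

```ocaml
(* NOT a tail call: [n + ...] runs after the recursive call returns,
   so each level of recursion keeps a stack frame alive. *)
let rec sum n = if n = 0 then 0 else n + sum (n - 1)

(* Tail-recursive: OCaml reuses the stack frame, so this runs in
   constant stack space regardless of [n]. *)
let rec sum_tco n acc = if n = 0 then acc else sum_tco (n - 1) (acc + n)

let () =
  (* [sum 10_000_000] would typically raise Stack_overflow here,
     while the tail-recursive version completes without issue. *)
  Printf.printf "%d\n" (sum_tco 10_000_000 0)
```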
Make recursion tail-recursion
I use OCaml to demonstrate the ideas because it supports TCO.
To get the benefits of TCO, we need to learn how to rewrite any naive recursive function to a tail-recursive function. One of the popular ways is the accumulator pattern.
Steps to follow:
- Create a helper function that usually has two parameters
- The data to be processed
- The accumulated value
- The base case of the helper function is: returning the accumulated value
- The body of the original function becomes a single call to the helper, passing in the initial accumulator value
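The steps above can be sketched on a different function before we apply them to factorial; here is list summation, with names of my own choosing:

```ocaml
(* Naive version: [x + ...] runs after the recursive call returns. *)
let rec sum_list lst =
  match lst with
  | [] -> 0
  | x :: rest -> x + sum_list rest

(* Accumulator pattern: the helper takes the remaining data and the
   accumulated value; its base case returns the accumulator. *)
let sum_list_tco lst =
  let rec helper lst acc =
    match lst with
    | [] -> acc                       (* base case: return the accumulator *)
    | x :: rest -> helper rest (acc + x)
  in
  helper lst 0                        (* the entry point just calls the helper *)
```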
Let’s use a classic recursive function that calculates the factorial as an example. First, create a new project with dune:
$ dune init project tco
Then we can write a naive recursive function in ./tco/bin/main.ml as follows.
let rec factorial n =
  match n with
  | 0 | 1 -> 1
  | n -> n * factorial (n - 1)
This function is not tail-recursive: after the recursive call factorial (n - 1) returns, we still need to multiply its result by n.
Now we can rewrite this function to a tail-recursive function using the accumulator pattern.
let factorial_tco n =
  let rec helper cur acc =
    match cur with
    | 0 | 1 -> acc (* Return the accumulated value *)
    | cur -> (helper [@tailcall]) (cur - 1) (acc * cur)
  in
  helper n 1 (* Call the helper function *)
We can use the [@tailcall] annotation to assert that a call is a tail call; if it is not, the compiler emits a warning.
To benchmark the two implementations, we can use the core_bench tool. Let’s first install the required packages.
$ opam install core core_bench
After the installation, we need to change the configuration file ./tco/bin/dune
(executable
 (public_name tco)
 (name main)
 (libraries tco core core_bench core_unix.command_unix))
Now, we can add the performance testing code in ./tco/bin/main.ml as follows.
open Core
open Core_bench

let rec factorial n =
  match n with
  | 0 | 1 -> 1
  | n -> n * factorial (n - 1)

let factorial_tco n =
  let rec helper cur acc =
    match cur with
    | 0 | 1 -> acc
    | cur -> (helper [@tailcall]) (cur - 1) (acc * cur)
  in
  helper n 1

(* Benchmark *)
let bench () =
  Bench.make_command
    [
      Bench.Test.create ~name:"factorial" (fun () -> factorial 1000);
      Bench.Test.create ~name:"factorial_tco" (fun () -> factorial_tco 1000);
    ]
  |> Command_unix.run

let _ = bench ()
Build, then run:
$ dune build
$ dune exec tco
The output of performance testing on my MacBook M3 Air is:
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├───────────────┼──────────┼────────────┤
│ factorial │ 5.55us │ 100.00% │
│ factorial_tco │ 1.32us │ 23.76% │
└───────────────┴──────────┴────────────┘
The performance gain from TCO is quite significant: factorial_tco runs in roughly 24% of the time of factorial.
The exact values of the output may vary depending on the hardware. However, the performance gain should be obvious. The deeper the recursion is, the more performance you will gain. You can change the 1000 to other values to explore the differences.
Wrap-up
In my opinion, the accumulator pattern carries the intermediate result in a parameter so that the recursive call becomes a tail call, and TCO can then reuse the stack frame instead of creating a new one. We gain a performance boost but lose some of the beauty of recursion, so I recommend rewriting recursive functions into tail-recursive form only when necessary.