Tail Call and Tail-Call Optimization (TCO)

Assume that function A calls function B. We call function A the Caller and function B the Callee.

A tail call is a call after which the Caller has nothing left to do except return the Callee's result; everything else has already been completed¹.

If this is a recursive call (i.e. the Caller and the Callee are the same function), the function is said to be tail-recursive.
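
For instance, here is a minimal OCaml sketch (g and h are made-up names used only for illustration); the first call is a tail call, the second is not:

ocaml

(* [g] makes a tail call: the result of [f x] is returned as-is,
   so [g] has nothing left to do after the call. *)
let g f x = f x

(* [h] does NOT make a tail call: after [f x] returns,
   [h] still has to add 1 to the result. *)
let h f x = f x + 1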

When a function is called, a stack frame is created to store useful information such as the return address and local variables. What if the call is a tail call? Can you spot an optimization opportunity?

By the definition of a tail call, the Caller has already done everything and just needs to wait; most of its information is no longer needed. We can therefore reuse the Caller's stack frame instead of creating a new one. That is exactly what tail-call optimization (TCO) does².

The benefit of TCO is that the overhead of creating new stack frames is avoided. We can safely use tail recursion without worrying about the famous stack-overflow issue, as long as the programming language supports TCO and it is enabled.

Info
I use OCaml to demonstrate the ideas because it supports TCO.
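
As a quick illustration of the stack-overflow point above (a sketch of my own, not part of the project built below), a non-tail-recursive counter exhausts the stack on deep inputs, while its tail-recursive twin runs in constant stack space once TCO kicks in:

ocaml

(* Not tail-recursive: the pending [1 + ...] keeps every frame alive,
   so a very large [n] (e.g. 100_000_000) will likely raise Stack_overflow. *)
let rec count n = if n = 0 then 0 else 1 + count (n - 1)

(* Tail-recursive: with TCO the same stack frame is reused,
   so the recursion depth is no longer limited by the stack size. *)
let count_tr n =
  let rec go acc n = if n = 0 then acc else go (acc + 1) (n - 1) in
  go 0 n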

To get the benefits of TCO, we need to learn how to rewrite a naive recursive function as a tail-recursive one. One popular way is the accumulator pattern.

Steps to follow³ (a quick sketch applying them to a simpler function appears right after the list)

  • Create a helper function that typically takes two parameters:
    • The data to be processed
    • The accumulated value
  • The base case of the helper function returns the accumulated value
  • The body of the original function becomes a single call to the helper with the initial accumulator value
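
Before applying the pattern to factorial, here is a small sketch of the same three steps applied to summing a list (sum_tco and its helper are hypothetical names used only for illustration):

ocaml

let sum_tco lst =
  (* Step 1: the helper takes the data and the accumulated value. *)
  let rec helper rest acc =
    match rest with
    | [] -> acc                       (* Step 2: base case returns the accumulator *)
    | x :: tl -> helper tl (acc + x)  (* tail call: nothing left to do afterwards *)
  in
  helper lst 0  (* Step 3: the original function just calls the helper *)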

Let’s use the classic factorial function as an example. First, create a new project with dune

sh

$ dune init project tco

Then we can write a naive recursive function in ./tco/bin/main.ml as follows

ocaml

let rec factorial n =
  match n with
  | 0 | 1 -> 1
  | n -> n * factorial (n - 1)

This function is clearly not tail-recursive: after factorial (n - 1) returns, we still need to multiply its result by n, so the recursive call is not the last thing the function does.
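
Another way to see it: the recursive branch behaves as if it were written like this (factorial_desugared is only an illustrative rewrite, not code from the project):

ocaml

let rec factorial_desugared n =
  match n with
  | 0 | 1 -> 1
  | n ->
      let r = factorial_desugared (n - 1) in  (* control must return here...           *)
      n * r                                   (* ...to perform the pending multiplication *)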

Now we can rewrite this function to a tail-recursive function using the accumulator pattern.

ocaml

let factorial_tco n =
  let rec helper cur acc =
    match cur with
    | 0 | 1 -> acc (* Return the accumulated value *)
    | cur -> (helper [@tailcall]) (cur - 1) (acc * cur)
  in
  helper n 1  (* Call the helper function *)
Tip
We can use the [@tailcall] annotation to assert that a call is a tail call; if it is not, the compiler emits a warning.
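
For example, annotating a call that is not actually in tail position, as in the naive version below, makes the compiler complain (a sketch with a hypothetical name; the exact warning text depends on your compiler version):

ocaml

let rec factorial_warn n =
  match n with
  | 0 | 1 -> 1
  (* Not a tail call: the multiplication happens after the call returns,
     so the [@tailcall] expectation is wrong and the compiler warns about it. *)
  | n -> n * (factorial_warn [@tailcall]) (n - 1)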

To benchmark the two implementations, we can use the core_bench library. Let’s first install the required packages

sh

$ opam install core_bench

After the installation, we need to change the configuration file ./tco/bin/dune

lisp

(executable
 (public_name tco)
 (name main)
 (libraries tco core core_bench core_unix.command_unix))

Now, we can add the performance testing code in ./tco/bin/main.ml as follows

ocaml

open Core
open Core_bench

let rec factorial n =
  match n with
  | 0 | 1 -> 1
  | n -> n * factorial (n - 1)

let factorial_tco n =
  let rec helper cur acc =
    match cur with
    | 0 | 1 -> acc
    | cur -> (helper [@tailcall]) (cur - 1) (acc * cur)
  in
  helper n 1

(* Benchmark *)
let bench () =
  Bench.make_command
    [
      Bench.Test.create ~name:"factorial" (fun () -> factorial 10000);
      Bench.Test.create ~name:"factorial_tco" (fun () -> factorial_tco 10000);
    ]
  |> Command_unix.run

let _ = bench ()

Build, then run

sh

$ dune build
$ dune exec tco

The benchmark output on my MacBook Air (M3) is:

text

Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────────┬──────────┬────────────┐
│ Name          │ Time/Run │ Percentage │
├───────────────┼──────────┼────────────┤
│ factorial     │  40.42us │    100.00% │
│ factorial_tco │  10.06us │     24.90% │
└───────────────┴──────────┴────────────┘

The performance gain of TCO is quite impressive 🚀

Warning
The exact numbers will vary with your hardware, but the performance gain should be obvious. The deeper the recursion, the bigger the gain. You can change 10000 to other values to explore the difference.

In my view, the accumulator pattern carries the pending result in a dedicated parameter, which lets the compiler reuse the stack frame and avoid the overhead of creating new ones. We gain performance but lose some of the elegance of plain recursion, so I would suggest rewriting recursive functions as tail-recursive ones only when needed.