Tail call and Tail-call Optimization (TCO)
Tail call & Tail-recursion
Prerequisites: function calls & the call stack. The call stack contains all functions that are currently executing but have not yet returned.
- Calling a function `f` pushes a stack frame for `f` onto the call stack.
- When `f` finishes execution and returns, its stack frame is popped off the stack.
You may find the code demo here
Assume that function A calls function B. We call function A Caller and function B Callee.
A tail call occurs when the caller only needs to wait for the callee’s return value, because all of its other work has already been completed.
If this is a recursive call (i.e., the caller and the callee are the same function), the function is said to be tail-recursive.
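A minimal sketch in OCaml (the language used later in this post) makes the distinction concrete; the function names here are mine, chosen for illustration:

```ocaml
(* NOT a tail call: after [count] returns, the caller still has to add 1,
   so the caller's stack frame must be kept alive. *)
let rec count n = if n = 0 then 0 else 1 + count (n - 1)

(* Tail call: the recursive call is the very last thing the function does.
   Since caller and callee are the same function, it is tail-recursive. *)
let rec count_tail n acc = if n = 0 then acc else count_tail (n - 1) (acc + 1)
```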
Tail-call Optimization (TCO)
According to the mechanism of a function call, a stack frame is created to store bookkeeping information such as the return address. What if the call is a tail call? Can you spot an optimization opportunity?
By definition, in a tail call the caller has already done all of its work and only needs to wait for the result, so most of the information in the caller’s stack frame is no longer needed. We can therefore reuse the caller’s stack frame instead of creating a new one. That is exactly what tail-call optimization (TCO) does.
The benefit of TCO is that the overhead of creating new stack frames is avoided. We can safely use tail-recursion without worrying about the famous stack overflow issue, as long as the programming language supports TCO, and it’s enabled.
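A quick way to see this in practice is the following sketch (names are mine; the exact recursion depth at which the naive version overflows depends on your stack size and compiler settings):

```ocaml
(* NOT a tail call: [n + ...] runs after the recursive call returns,
   so each level of recursion keeps a stack frame alive. *)
let rec sum n = if n = 0 then 0 else n + sum (n - 1)

(* Tail-recursive: OCaml reuses the stack frame, so this runs in
   constant stack space regardless of [n]. *)
let rec sum_tco n acc = if n = 0 then acc else sum_tco (n - 1) (acc + n)

let () =
  (* [sum 10_000_000] would typically raise Stack_overflow here,
     while the tail-recursive version completes without issue. *)
  Printf.printf "%d\n" (sum_tco 10_000_000 0)
```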
Make recursion tail-recursion
I use OCaml to demonstrate the ideas because it supports TCO.
To get the benefits of TCO, we need to learn how to rewrite any naive recursive function to a tail-recursive function. One of the popular ways is the accumulator pattern.
Steps to follow:
- Create a helper function that usually has two parameters
- The data to be processed
- The accumulated value
- The base case of the helper function is: returning the accumulated value
- The body of the original function becomes a single call to the helper, passing in the initial accumulator value
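The steps above can be sketched on a different function before we apply them to factorial; here is list summation, with names of my own choosing:

```ocaml
(* Naive version: [x + ...] runs after the recursive call returns. *)
let rec sum_list lst =
  match lst with
  | [] -> 0
  | x :: rest -> x + sum_list rest

(* Accumulator pattern: the helper takes the remaining data and the
   accumulated value; its base case returns the accumulator. *)
let sum_list_tco lst =
  let rec helper lst acc =
    match lst with
    | [] -> acc                       (* base case: return the accumulator *)
    | x :: rest -> helper rest (acc + x)
  in
  helper lst 0                        (* the entry point just calls the helper *)
```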
Let’s use a classic recursive function that calculates the factorial as an example. First, create a new project with dune:
$ dune init project tco
Then we can write a naive recursive function in ./tco/bin/main.ml as follows.
let rec factorial n =
  match n with
  | 0 | 1 -> 1
  | n -> n * factorial (n - 1)
This function is not tail-recursive: after the recursive call factorial (n - 1) returns, we still need to multiply its result by n.
Now we can rewrite this function to a tail-recursive function using the accumulator pattern.
let factorial_tco n =
  let rec helper cur acc =
    match cur with
    | 0 | 1 -> acc (* Return the accumulated value *)
    | cur -> (helper [@tailcall]) (cur - 1) (acc * cur)
  in
  helper n 1 (* Call the helper function *)
We can use the [@tailcall] annotation to assert that a call is a tail call; if it is not, the compiler emits a warning.
To benchmark the two implementations, we can use the core_bench tool. Let’s first install the required packages.
$ opam install core core_bench
After the installation, we need to change the configuration file ./tco/bin/dune
(executable
 (public_name tco)
 (name main)
 (libraries tco core core_bench core_unix.command_unix))
Now, we can add the performance testing code in ./tco/bin/main.ml as follows.
open Core
open Core_bench

let rec factorial n =
  match n with
  | 0 | 1 -> 1
  | n -> n * factorial (n - 1)

let factorial_tco n =
  let rec helper cur acc =
    match cur with
    | 0 | 1 -> acc
    | cur -> (helper [@tailcall]) (cur - 1) (acc * cur)
  in
  helper n 1

(* Benchmark *)
let bench () =
  Bench.make_command
    [
      Bench.Test.create ~name:"factorial" (fun () -> factorial 1000);
      Bench.Test.create ~name:"factorial_tco" (fun () -> factorial_tco 1000);
    ]
  |> Command_unix.run

let _ = bench ()
Build, then run:
$ dune build
$ dune exec tco
The output of performance testing on my MacBook M3 Air is:
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├───────────────┼──────────┼────────────┤
│ factorial │ 5.55us │ 100.00% │
│ factorial_tco │ 1.32us │ 23.76% │
└───────────────┴──────────┴────────────┘
The performance gain from TCO is quite significant: factorial_tco runs in roughly 24% of the time of factorial.
The exact values of the output may vary depending on the hardware. However, the performance gain should be obvious. The deeper the recursion is, the more performance you will gain. You can change the 1000 to other values to explore the differences.
Wrap-up
In my opinion, the accumulator pattern carries the intermediate result in a parameter so that the recursive call becomes a tail call, and TCO can then reuse the stack frame instead of creating a new one. We gain a performance boost but lose some of the beauty of recursion, so I recommend rewriting recursive functions into tail-recursive form only when necessary.