
Dynamic Time Warping: From Euclidean Distance to Optimal Alignment

Note

You can find the code demo here

Recently, I’ve been exploring how to diagnose issues encountered during LLM training through log analysis. This naturally involves analyzing time-series data — during iterative training, you can collect various signals such as loss curves. I noticed that some papers use Dynamic Time Warping (DTW) to detect anomalies in time-series data, so I decided to document this algorithm.

Let’s define some notation:

  • $\mathbf x_{1:t}$ represents a time series of length $t$ (1-indexed)
  • $\mathbf y_{1:t}$ represents a time series of length $t$ (1-indexed; for simplicity both series share the same length here, though DTW also handles series of different lengths)

Imagine this: if you were given two time series, $\mathbf x$ and $\mathbf y$, how would you determine whether they are sufficiently similar?

A very natural idea is to measure the distance between the two sequences.

The next question is: what distance metric should we use?

The first thing that probably comes to mind is Euclidean distance. Its formulation is also quite straightforward, as shown below.

$$ \operatorname{Euclidean}(\mathbf x,\mathbf y)=\sqrt{\sum_i(\mathbf x_i-\mathbf y_i)^2} $$

Can it effectively capture the distance between time series? The answer is no. This is because time series have a key property: two similar sequences may be misaligned in the time dimension due to warping (stretching or compressing along the time axis). A classic example is speech recognition: different speakers may say the same sentence, but variations in speaking speed cause the signals to have different lengths or temporal alignment, even though the underlying content is essentially the same.
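As a minimal sketch (assuming equal-length 1-D NumPy arrays), the index-wise Euclidean distance is a one-liner:

```python
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    # strict index-wise comparison: x[i] is only ever paired with y[i]
    assert len(x) == len(y), "Euclidean distance requires equal-length series"
    return float(np.sqrt(np.sum((x - y) ** 2)))

print(euclidean(np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])))  # ≈ 1.414
```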

Let’s generate some mock data with the $\texttt{sin}$ function, as shown below.
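Mock data of this kind can be produced along the following lines (the exact sample count and phase shift here are my own choices, not necessarily those used in the original plot):

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 100)
series_x = np.sin(t)        # baseline sine wave
series_y = np.sin(t - 0.5)  # same shape, shifted along the time axis
```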

We humans can see that the two time series are similar even though they are misaligned along the time axis. However, that’s not how Euclidean distance works: it forces a strict one-to-one alignment in time. As the formula shows, $\mathbf x_i$ is only compared with $\mathbf y_i$, i.e., the values at the same time index $i$.

The red vertical line highlights a key property of the Euclidean distance - it enforces index-wise alignment.

The problem here is that Euclidean distance does not account for this property of time-series data. We need a better alignment method, and that’s exactly the motivation behind DTW. The core idea of the DTW algorithm is that each point $\mathbf x_i$ does not have to correspond to $\mathbf y_i$ at the same time index. Instead, it is allowed to find its best matching counterpart $\mathbf y_j$, where $j$ is not necessarily equal to $i$.

Here are the rules DTW follows when matching two time series.

  • Non-one-to-one matching: Each point in $\mathbf x$ can match one or more points in $\mathbf y$, and vice versa. ✅️ This is what handles time warping.
  • Continuous matching: Each point in $\mathbf x$ must have a match. ✅️ You cannot deliberately skip a point.
  • Boundary matching: $\mathbf x_1$ must match $\mathbf y_1$, and $\mathbf x_t$ must match $\mathbf y_t$. ✅️ If two time series are similar, their beginnings and ends must correspond.
  • Monotonic matching: The matching must be monotonic: if $\mathbf x_i$ is matched with $\mathbf y_j$, then for any $\mathbf x_m$ with $m>i$, its match $\mathbf y_n$ must satisfy $n\ge j$ (equality is allowed because one point may match several). This may sound a bit abstract, but if you visualize the matches as connecting lines between nodes, the rule simply means these lines are not allowed to cross. ✅️ This enforces that time progresses monotonically and cannot move backward.

The DTW algorithm will find such an optimal match with minimal cost, which can be represented as this equation.

$$ \operatorname{DTW}(\mathbf x,\mathbf y)=\min_{W}\sum_{(i,j)\in W} d(\mathbf x_i,\mathbf y_j) $$

Here, $W$ represents a possible alignment path. It can be viewed as a dictionary that specifies which data points in $\mathbf{x}$ should be matched with which data points in $\mathbf{y}$. For scalar series, $d$ can simply be the absolute difference, i.e., the one-dimensional Euclidean distance.
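To make $W$ concrete: if we represent a path as a list of index pairs $(i, j)$, its cost is just the sum of pointwise distances along it. A sketch, using the absolute difference as $d$ and a hand-picked example path:

```python
import numpy as np

def path_cost(x: np.ndarray, y: np.ndarray, path: list[tuple[int, int]]) -> float:
    # cost of one candidate alignment W: sum of d(x_i, y_j) over all matched pairs
    return float(sum(abs(x[i] - y[j]) for i, j in path))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 3.0])
# one valid path: x's middle point (0-indexed 1) absorbs the repeated 2 in y
w = [(0, 0), (1, 1), (1, 2), (2, 3)]
print(path_cost(x, y, w))  # → 0.0
```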

We have now developed an intuition for the DTW algorithm. How should we implement it?

Before implementing the algorithm, let’s change perspective on how DTW matching works. Lay the two series along a grid: $\mathbf x$ down the rows and $\mathbf y$ across the columns. The optimal match found by DTW is then the minimal-cost path from the top-left cell to the bottom-right cell.

|  | $\mathbf y_1$ | $\mathbf y_2$ | $\mathbf y_3$ |
| --- | --- | --- | --- |
| $\mathbf x_1$ | $d(\mathbf x_1, \mathbf y_1)$ | $d(\mathbf x_1, \mathbf y_2)$ | $d(\mathbf x_1, \mathbf y_3)$ |
| $\mathbf x_2$ | $d(\mathbf x_2, \mathbf y_1)$ | $d(\mathbf x_2, \mathbf y_2)$ | $d(\mathbf x_2, \mathbf y_3)$ |
| $\mathbf x_3$ | $d(\mathbf x_3, \mathbf y_1)$ | $d(\mathbf x_3, \mathbf y_2)$ | $d(\mathbf x_3, \mathbf y_3)$ |
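This grid of pairwise costs can be built in one shot with NumPy broadcasting (a sketch, again using the absolute difference as $d$):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 2.0, 3.0])
# cost[i, j] = d(x_i, y_j); rows follow x, columns follow y, as in the grid above
cost = np.abs(x[:, None] - y[None, :])
print(cost.shape)  # → (3, 4)
```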

Why is that?

  • Non-one-to-one matching: moving right ➡️ stays on the same $\mathbf x_i$ while advancing $\mathbf y$, so one point in $\mathbf x$ can match several in $\mathbf y$; likewise, moving down ⬇️ lets one $\mathbf y_j$ match several points in $\mathbf x$
  • Continuous matching: every step moves right ➡️, down ⬇️, or diagonally down-right ↘️ by exactly one cell, so no point is skipped
  • Boundary matching: the path starts at the top-left corner $(\mathbf x_1, \mathbf y_1)$ and ends at the bottom-right corner
  • Monotonic matching: no backward moves are allowed; you can never move left or up

It’s clear that this can be solved with dynamic programming. From each state (cell) (x, y), we have these choices:

  • Move right: (x, y) -> (x, y + 1)
  • Move down: (x, y) -> (x + 1, y)
  • Move diagonally down-right: (x, y) -> (x + 1, y + 1)
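Putting these moves together, and writing $dp_{i,j}$ for the minimal cost of aligning $\mathbf x_{1:i}$ with $\mathbf y_{1:j}$, the recurrence is:

$$ dp_{i,j} = d(\mathbf x_i, \mathbf y_j) + \min\left(dp_{i-1,j-1},\; dp_{i-1,j},\; dp_{i,j-1}\right) $$

with base case $dp_{0,0}=0$ and $dp_{i,0}=dp_{0,j}=\infty$ otherwise, which forces every path to start at $(\mathbf x_1, \mathbf y_1)$.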

The following Python code shows the implementation details. It returns the distance and the optimal match relations.

import numpy as np

def dtw(series_a: np.ndarray, series_b: np.ndarray) -> tuple[float, dict[int, list[int]]]:
    m, n = len(series_a), len(series_b)
    # dp[i][j]: minimal cost of aligning series_a[:i] with series_b[:j]
    dp = [[float("inf") for _ in range(n + 1)] for _ in range(m + 1)]
    dp[0][0] = 0.0

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1],  # diagonal move: both series advance
                dp[i - 1][j],      # series_b[j - 1] matches one more point
                dp[i][j - 1],      # series_a[i - 1] matches one more point
            ) + abs(series_a[i - 1] - series_b[j - 1])

    # backtrace from the bottom-right cell to recover the match relations
    matches: dict[int, list[int]] = {}
    i, j = m, n
    while i > 0 and j > 0:
        matches.setdefault(i - 1, []).append(j - 1)
        candidates = [
            (dp[i - 1][j - 1], i - 1, j - 1),
            (dp[i][j - 1], i, j - 1),
            (dp[i - 1][j], i - 1, j),
        ]
        _, i, j = min(candidates, key=lambda t: t[0])

    return dp[-1][-1], matches
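As a quick sanity check (restating the distance computation from the `dtw()` function above in condensed, distance-only form), a shifted pulse gets a DTW distance of 0 while the index-wise Euclidean distance penalizes the shift:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    # distance-only variant of the dtw() above, for verification
    m, n = len(a), len(b)
    dp = [[float("inf")] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]) \
                + abs(a[i - 1] - b[j - 1])
    return float(dp[m][n])

a = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # same pulse, shifted one step earlier
print(dtw_distance(a, b))                    # → 0.0: warping absorbs the shift
print(float(np.sqrt(np.sum((a - b) ** 2))))  # → ~1.414: index-wise comparison penalizes it
```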

Let me plot the optimal match so you can see it clearly.

Aha, DTW does find a reasonable matching relation :)

You can also compare the distances computed by the two methods in quantitative terms:

  • Euclidean distance only: 7.04
  • DTW (optimal alignment) + Euclidean distance: 5.13

DTW is used to find the optimal alignment between two time series. Under this optimal matching, the sequences are not forced into a strict point-to-point alignment. Instead, the algorithm adapts to the structure of the data and finds a correspondence that best reflects its underlying temporal patterns. As a result, the computed distance is often more meaningful and robust than a direct pointwise comparison.