
t-SNE + K-Means: Data Visualization and Clustering

I recently used the K-Means algorithm for clustering in my work. While reviewing my notes on the topic, I decided to publish them online :)

K-Means is a clustering algorithm that assigns each sample to one of the $K$ clusters.

The K-Means algorithm relies on distance calculations, so the data should be normalized to prevent features with larger scales from dominating the results. The normalization can be done with the Scikit-Learn library as follows.

from sklearn import preprocessing
X_normalized = preprocessing.normalize(X)
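
Note that preprocessing.normalize rescales each sample (each row) to unit norm. If you instead want to put the features themselves on a common scale, a feature-wise scaler such as StandardScaler is the usual tool; here is a sketch, in case that better fits your data:

from sklearn.preprocessing import StandardScaler

# Feature-wise standardization: zero mean and unit variance per column.
X_scaled = StandardScaler().fit_transform(X)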

Input: The number of clusters (denoted as $K$)

Algorithm

  1. Initialize $K$ cluster centroids (denoted as $\mu_1,\mu_2,\dots,\mu_K$). A cluster centroid represents the average position of all samples within its cluster.
  2. Repeat until convergence (the positions of the cluster centroids barely change):
    1. For each sample: find the closest cluster centroid and assign the sample to that cluster.
    2. Each sample may now be assigned to a new cluster, so update each cluster centroid by recalculating its position based on the current assignments.

Output: $K$ clusters.
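
To make the assign-and-update loop concrete, here is a minimal NumPy sketch of the algorithm. The function name, initialization scheme, and stopping tolerance are my own choices, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=42):
    rng = np.random.default_rng(seed)
    # Initialize the centroids by picking k distinct samples at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each sample to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned samples.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids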

The K-Means algorithm is straightforward and intuitive, but how can we evaluate its performance?

The inertia metric measures the average distance between each sample and its assigned cluster centroid¹:

$$ \texttt{inertia}=\frac{1}{m}\sum_{i=1}^{m}\texttt{distance}(\mathbf x_i,\mu_i) $$

where

  • $m$ denotes the total number of samples.
  • $\mu_i$ represents the centroid of the cluster to which $\mathbf x_i$ belongs.
  • $\texttt{distance}$ is the distance function, such as Euclidean distance.
Tip

In the Scikit-Learn library, you can access the inertia_ value after you have trained a KMeans model.
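
One caveat: Scikit-Learn's inertia_ is the sum of squared distances to the closest centroid, not the average distance defined above. Here is a small sketch comparing the two, assuming the X_normalized array from earlier and using $K=3$ just for illustration:

import numpy as np
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=42).fit(X_normalized)

# Distance from each sample to the centroid of its assigned cluster.
centroids = model.cluster_centers_[model.labels_]
dists = np.linalg.norm(X_normalized - centroids, axis=1)

print(dists.mean())        # the average-distance definition used in this post
print((dists ** 2).sum())  # matches model.inertia_
print(model.inertia_)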

The silhouette score is another metric for evaluating K-Means performance. It is computed as the average of the silhouette coefficients of all samples. The silhouette coefficient of a single sample is calculated by:

$$ \texttt{Silhouette}(\mathbf x)=\frac{B - A}{\max(A, B)} $$

where

  • $A$ denotes the average distance from $\mathbf x$ to the samples within the same cluster.
  • $B$ denotes the average distance from $\mathbf x$ to the samples within the nearest neighboring cluster.

We aim for a lower $A$ and a higher $B$. In the ideal case, $A=0$ and $B>0$, resulting in a silhouette coefficient of $B/B=1$. In the same manner, the worst case ($A>0$, $B=0$) yields $-A/A=-1$.
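
As a quick sanity check of these bounds, here is a small sketch with Scikit-Learn on made-up data (the toy points and labels are purely illustrative):

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight, well-separated toy clusters, so every coefficient should be
# close to the ideal value of +1.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])

print(silhouette_samples(X_toy, labels))  # per-sample coefficients
print(silhouette_score(X_toy, labels))    # their average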

If you can determine a suitable $K$ based on the problem domain, that’s sufficient.

Otherwise, you can leverage the metrics mentioned above:

  • The inertia should be lower. You can use the famous elbow method to find the best $K$.
  • The silhouette score should be higher.
Info

You can find this demo here.

Note

In real-world applications, the dimensionality of the feature vectors, as well as the number of samples, is typically much higher. The underlying concept remains the same.

I chose the iris dataset for demonstration mainly because

  • It has only 150 samples => we can clearly see how K-Means performs.
  • The feature vector has size 4 instead of 2 => we can use t-SNE for data visualization.

The iris dataset is so popular that you may know the real $K$ in advance. Let’s pretend that we have no idea what the value of $K$ is :)

Let’s import some libraries.

import pandas as pd
import altair as alt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn import preprocessing

And then, let’s do normalization as mentioned before.

data, target = load_iris(return_X_y=True, as_frame=True)
X, y = data.to_numpy(), target.to_numpy()
X_normalized = preprocessing.normalize(X)
df = pd.concat((pd.DataFrame(X_normalized, columns=data.columns), target), axis=1)

Now we can check the first 5 samples with df.head(5):

id  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0             0.803773          0.551609          0.220644          0.031521       0
1             0.828133          0.507020          0.236609          0.033801       0
2             0.805333          0.548312          0.222752          0.034269       0
3             0.800030          0.539151          0.260879          0.034784       0
4             0.790965          0.569495          0.221470          0.031639       0

Because the feature vector has size 4, we can leverage t-SNE to reduce the dimensionality.

model = TSNE(n_components=2)
X_proj = model.fit_transform(X)
Tip

Note that we need to use the raw data X for data visualization.

I prefer using altair for plotting, so I’ll need to create a DataFrame.

df_proj_without_label = pd.DataFrame(X_proj, columns=["x", "y"])
df_proj_with_label = pd.concat((df_proj_without_label, df['target']), axis=1)

Finally, we can visualize the iris dataset with the following code.

alt.Chart(df_proj_without_label).mark_circle(size=60).encode(
    x = "x",
    y = "y",
).interactive().properties(width=400, height=400)

From this diagram, the best guess is $K=3$.

Let’s train a K-Means model with $K=3$ to see if it’s the case.

Tip

Note that when you train a K-Means model, you should use the normalized vectors X_normalized.

model = KMeans(n_clusters=3, random_state=42)
model.fit(X_normalized)

df_with_new_pred = pd.concat((df_proj_without_label, pd.Series(model.labels_)), axis=1)
df_with_new_pred.columns = ["x", "y", "target"]
alt.Chart(df_with_new_pred).mark_circle(size=60).encode(
    x = "x",
    y = "y",
    color='target:N',
).interactive().properties(width=400, height=400)

The diagram suggests that $K=3$ is definitely a good choice. Does there exist a better $K$?

ks = range(1, 12)
inertias = []
silhouette_scores = []
models = []
for k in ks:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X_normalized)

    models.append(model)

    inertias.append(model.inertia_)
    if k != 1:
        # The silhouette score is undefined for a single cluster.
        silhouette_scores.append(silhouette_score(X_normalized, model.labels_))
        
# For plotting
silhouette_scores = [silhouette_scores[0]] + silhouette_scores
curve = pd.DataFrame(
    zip(ks, inertias, silhouette_scores), columns=["K", "inertia", "silhouette"]
)
base = alt.Chart(curve).encode(x='K')
alt.layer(
    base.mark_line(color='blue').encode(y='inertia'),
    base.mark_line(color='red').encode(y='silhouette')
)
Warning

Note that the silhouette score requires at least 2 clusters, so I insert a dummy value for $K=1$ above.
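
By the way, inertia and the silhouette score live on rather different scales, so layering them on a shared y-axis can flatten one of the curves. altair can give each layer its own axis:

alt.layer(
    base.mark_line(color='blue').encode(y='inertia'),
    base.mark_line(color='red').encode(y='silhouette')
).resolve_scale(y='independent')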

The line chart (red: silhouette score, blue: inertia) says that $K=3$ is indeed the best choice:

  • The elbow of the inertia curve is at $K=3$ (a programmatic way to locate it is sketched after this list).
  • The silhouette score keeps decreasing as $K$ increases.
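
If you would rather locate the elbow programmatically than by eye, one simple heuristic (my own sketch, not a standard API) is to pick the $K$ where the inertia curve bends the most, i.e. where its second difference is largest:

import numpy as np

second_diff = np.diff(inertias, n=2)           # one value per interior K
elbow_k = ks[int(np.argmax(second_diff)) + 1]  # +1 realigns the index with ks
print(elbow_k)

Dedicated packages such as kneed implement more robust variants of this idea.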

Congratulations! The iris dataset does indeed contain three classes.

Let’s see the plot of the ground truth.

alt.Chart(df_proj_with_label).mark_circle(size=60).encode(
    x = "x",
    y = "y",
    color='target:N',
).interactive().properties(width=400, height=400)

In this post, we review how the K-Means algorithm works and use the iris dataset as a demo. The main takeaways are:

  • How to visualize data with high dimensionality 👉 Use t-SNE on the raw data.
  • How to do clustering 👉 Use the K-Means algorithm on normalized data.
  • How to find the best $K$ 👉 If $K$ can’t be determined from the problem domain, just try different values of $K$ and observe how the inertia and silhouette scores (or other measurements) change. You can use the elbow method here.

  1. Li Hang. Statistical Learning Methods (2nd edition). ↩︎