
t-SNE + K-Means: Data Visualization and Clustering

I recently used the K-Means algorithm for clustering in my work. While reviewing my notes on the topic, I decided to publish them online :)

K-Means is a clustering algorithm that assigns each sample to one of the $K$ clusters.

The K-Means algorithm relies on distance calculations, so the data should be normalized to prevent features with larger scales from dominating the results. The normalization can be done with the Scikit-Learn library as follows.

from sklearn import preprocessing
X_normalized = preprocessing.normalize(X)
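
Note that preprocessing.normalize rescales each sample (each row) to unit norm. If you instead want to put the features themselves on a common scale, a feature-wise scaler such as StandardScaler is the usual tool; here is a sketch, in case that better fits your data:

from sklearn.preprocessing import StandardScaler

# Feature-wise standardization: zero mean and unit variance per column.
X_scaled = StandardScaler().fit_transform(X)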

Input: The number of clusters (denoted as $K$)

Algorithm

  1. Initialize $K$ cluster centroids (denoted as $\mu_1,\mu_2,\dots,\mu_K$). A cluster centroid represents the average position of all samples within its cluster.
  2. Repeat until convergence (the positions of the cluster centroids barely change):
    1. For each sample: find the closest cluster centroid and assign the sample to that cluster.
    2. Each sample may now be assigned to a new cluster, so update each cluster centroid by recalculating its position based on the current assignments.

Output: $K$ clusters.
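
To make the assign-and-update loop concrete, here is a minimal NumPy sketch of the algorithm. The function name, initialization scheme, and stopping tolerance are my own choices, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=42):
    rng = np.random.default_rng(seed)
    # Initialize the centroids by picking k distinct samples at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each sample to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned samples.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids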

The K-Means algorithm is straightforward and intuitive, but how can we evaluate its performance?

The inertia metric measures the average distance between each sample and its assigned cluster centroid¹:

$$ \texttt{inertia}=\frac{1}{m}\sum_{i=1}^{m}\texttt{distance}(\mathbf x_i,\mu_i) $$

where

  • $m$ denotes the total number of samples.
  • $\mu_i$ represents the centroid of the cluster to which $\mathbf x_i$ belongs.
  • $\texttt{distance}$ is the distance function, such as Euclidean distance.
Tip

In the Scikit-Learn library, you can access the inertia_ value after you have trained a KMeans model.
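
One caveat: Scikit-Learn's inertia_ is the sum of squared distances to the closest centroid, not the average distance defined above. Here is a small sketch comparing the two, assuming the X_normalized array from earlier and using $K=3$ just for illustration:

import numpy as np
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=42).fit(X_normalized)

# Distance from each sample to the centroid of its assigned cluster.
centroids = model.cluster_centers_[model.labels_]
dists = np.linalg.norm(X_normalized - centroids, axis=1)

print(dists.mean())        # the average-distance definition used in this post
print((dists ** 2).sum())  # matches model.inertia_
print(model.inertia_)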

The silhouette score is another metric for evaluating K-Means performance. It is computed as the average of the silhouette coefficients of all samples. The silhouette coefficient of a single sample is calculated by:

$$ \texttt{Silhouette}(\mathbf x)=\frac{B - A}{\max(A, B)} $$

where

  • $A$ denotes the average distance from $\mathbf x$ to the samples within the same cluster.
  • $B$ denotes the average distance from $\mathbf x$ to the samples within the nearest neighboring cluster.

We aim for a lower $A$ and a higher $B$. In the ideal case, $A=0$ and $B>0$, resulting in a silhouette coefficient of $B/B=1$. In the same manner, the worst case ($A>0$, $B=0$) yields $-A/A=-1$.
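
As a quick sanity check of these bounds, here is a small sketch with Scikit-Learn on made-up data (the toy points and labels are purely illustrative):

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight, well-separated toy clusters, so every coefficient should be
# close to the ideal value of +1.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])

print(silhouette_samples(X_toy, labels))  # per-sample coefficients
print(silhouette_score(X_toy, labels))    # their average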

If you can determine a suitable $K$ based on the problem domain, that’s sufficient.

Otherwise, you can leverage the metrics mentioned above:

  • The inertia should be lower. You can use the famous elbow method to find the best $K$.
  • The silhouette score should be higher.
Info

You can find this demo here.

Note

In real-world applications, the dimensionality of the feature vectors, as well as the number of samples, is typically much higher. The underlying concept remains the same.

I chose the iris dataset for demonstration mainly because

  • It has only 150 samples => we can clearly see how K-Means performs.
  • The feature vector has size 4 instead of 2 => we can use t-SNE for data visualization.

The iris dataset is so popular that you may know the real $K$ in advance. Let’s pretend that we have no idea what the value of $K$ is :)

Let’s import some libraries.

import pandas as pd
import altair as alt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn import preprocessing

And then, let’s do normalization as mentioned before.

data, target = load_iris(return_X_y=True, as_frame=True)
X, y = data.to_numpy(), target.to_numpy()
X_normalized = preprocessing.normalize(X)
df = pd.concat((pd.DataFrame(X_normalized, columns=data.columns), target), axis=1)

Now we can check the first 5 samples with df.head(5):

id  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0             0.803773          0.551609          0.220644          0.031521       0
1             0.828133          0.507020          0.236609          0.033801       0
2             0.805333          0.548312          0.222752          0.034269       0
3             0.800030          0.539151          0.260879          0.034784       0
4             0.790965          0.569495          0.221470          0.031639       0

Because the feature vector has size 4, we can leverage t-SNE to reduce the dimensionality.

model = TSNE(n_components=2)
X_proj = model.fit_transform(X)
Tip

Note that we need to use the raw data X for data visualization.

I prefer using altair for plotting, so I’ll need to create a DataFrame.

df_proj_without_label = pd.DataFrame(X_proj, columns=["x", "y"])
df_proj_with_label = pd.concat((df_proj_without_label, df['target']), axis=1)

Finally, we can visualize the iris dataset with the following code.

alt.Chart(df_proj_without_label).mark_circle(size=60).encode(
    x = "x",
    y = "y",
).interactive().properties(width=400, height=400)

From this diagram, the best guess is $K=3$.

Let’s train a K-Means model with $K=3$ to see if it’s the case.

Tip

Note that when you train a K-Means model, you should use the normalized vectors X_normalized.

model = KMeans(n_clusters=3, random_state=42)
model.fit(X_normalized)

df_with_new_pred = pd.concat((df_proj_without_label, pd.Series(model.labels_)), axis=1)
df_with_new_pred.columns = ["x", "y", "target"]
alt.Chart(df_with_new_pred).mark_circle(size=60).encode(
    x = "x",
    y = "y",
    color='target:N',
).interactive().properties(width=400, height=400)

The diagram suggests that $K=3$ is definitely a good choice. Does there exist a better $K$?

ks = range(1, 12)
inertias = []
silhouette_scores = []
models = []
for k in ks:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X_normalized)

    models.append(model)

    inertias.append(model.inertia_)
    if k != 1:
        # The silhouette score is undefined for a single cluster.
        silhouette_scores.append(silhouette_score(X_normalized, model.labels_))
        
# For plotting
silhouette_scores = [silhouette_scores[0]] + silhouette_scores
curve = pd.DataFrame(
    zip(ks, inertias, silhouette_scores), columns=["K", "inertia", "silhouette"]
)
base = alt.Chart(curve).encode(x='K')
alt.layer(
    base.mark_line(color='blue').encode(y='inertia'),
    base.mark_line(color='red').encode(y='silhouette')
)
Warning

Note that the silhouette score requires at least 2 clusters, so I insert a dummy value for $K=1$ above.
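
By the way, inertia and the silhouette score live on rather different scales, so layering them on a shared y-axis can flatten one of the curves. altair can give each layer its own axis:

alt.layer(
    base.mark_line(color='blue').encode(y='inertia'),
    base.mark_line(color='red').encode(y='silhouette')
).resolve_scale(y='independent')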

The line chart (red: silhouette score, blue: inertia) says that $K=3$ is indeed the best choice:

  • The elbow of the inertia curve is at $K=3$ (a programmatic way to locate it is sketched after this list).
  • The silhouette score keeps decreasing as $K$ increases.
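
If you would rather locate the elbow programmatically than by eye, one simple heuristic (my own sketch, not a standard API) is to pick the $K$ where the inertia curve bends the most, i.e. where its second difference is largest:

import numpy as np

second_diff = np.diff(inertias, n=2)           # one value per interior K
elbow_k = ks[int(np.argmax(second_diff)) + 1]  # +1 realigns the index with ks
print(elbow_k)

Dedicated packages such as kneed implement more robust variants of this idea.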

Congratulations! The iris dataset does indeed contain three classes.

Let’s see the plot of the ground truth.

alt.Chart(df_proj_with_label).mark_circle(size=60).encode(
    x = "x",
    y = "y",
    color='target:N',
).interactive().properties(width=400, height=400)

In this post, we review how the K-Means algorithm works and use the iris dataset as a demo. The main takeaways are:

  • How to visualize data with high dimensionality 👉 Use t-SNE on the raw data.
  • How to do clustering 👉 Use the K-Means algorithm on normalized data.
  • How to find the best $K$ 👉 If $K$ can’t be determined from the problem domain, just try different values of $K$ and observe how the inertia and silhouette scores (or other measurements) change. You can use the elbow method here.

  1. Li Hang. Statistical Learning Methods (2nd edition). ↩︎