t-SNE + K-Means: Data Visualization and Clustering
K-Means Algorithm
I recently used the K-Means algorithm for clustering in my work. While reviewing my notes on the topic, I decided to publish them online :)
K-Means is a clustering algorithm that assigns each sample to one of the $K$ clusters.
How does the K-Means algorithm work?
Data Preparation
The K-Means algorithm relies on distance calculations, so data should be normalized to prevent features with larger scales from dominating the results. The normalization can be done with the Scikit-Learn library as follows.
from sklearn import preprocessing
X_normalized = preprocessing.normalize(X)
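Note that `preprocessing.normalize` rescales each individual sample to unit norm (L2 by default), rather than standardizing each feature; if you want zero-mean, unit-variance features instead, Scikit-Learn's `preprocessing.StandardScaler` is the usual alternative.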
K-Means Clustering
Input: The number of clusters (denoted as $K$)
Algorithm
- Initialize $K$ cluster centroids (denoted as $\mu_1,\mu_2,\ldots,\mu_K$). A cluster centroid represents the average position of all samples within its cluster.
- Repeat until convergence (i.e., the positions of the cluster centroids barely change):
- For each sample: find the closest cluster centroid and assign the sample to that cluster.
- Since each sample may now have been assigned to a new cluster, update each cluster centroid by recalculating its position based on the current assignments.
Output: $K$ clusters.
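To make the steps concrete, here is a minimal NumPy sketch of the loop above. It is an illustration only: it assumes Euclidean distance and random initialization, whereas library implementations such as Scikit-Learn's `KMeans` use smarter initialization schemes (e.g., k-means++) and handle edge cases such as empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: compute the distance from every sample to every
        # centroid, then take the index of the closest centroid per sample.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned samples.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence check: stop once the centroids barely move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```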
How to measure the performance of K-Means
The K-Means algorithm is straightforward and intuitive, but how can we evaluate its performance?
Inertia
The inertia metric measures the average distance between each sample and its assigned cluster centroid[^1]:
$$ \texttt{inertia}=\frac{1}{m}\sum_{i=1}^{m}\texttt{distance}(\mathbf x_i,\mu_i) $$
where
- $m$ denotes the total number of samples.
- $\mu_i$ represents the centroid of the cluster to which $\mathbf x_i$ belongs.
- $\texttt{distance}$ is the distance function, such as Euclidean distance.
In the Scikit-Learn library, you can access the `inertia_` attribute after you have trained a `KMeans` model. Note that Scikit-Learn's `inertia_` is the sum of squared distances to the closest centroid, rather than the average distance shown above.
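For instance, here is a quick sketch reusing the `X_normalized` array from the data-preparation step, showing both Scikit-Learn's `inertia_` and the average-distance variant from the formula above:

```python
import numpy as np
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=42)
model.fit(X_normalized)

# Scikit-Learn's inertia: sum of squared distances to the closest centroids.
print(model.inertia_)

# The average-distance version defined by the formula above.
dists = np.linalg.norm(X_normalized - model.cluster_centers_[model.labels_], axis=1)
print(dists.mean())
```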
Silhouette Score
The silhouette score is another metric for evaluating K-Means performance. It is computed as the average of the silhouette coefficients of all samples. The silhouette coefficient for a single sample can be calculated as:
$$ \texttt{Silhouette}(\mathbf x)=\frac{B - A}{\max(A, B)} $$
where
- $A$ denotes the average distance from $\mathbf x$ to the other samples within its own cluster.
- $B$ denotes the average distance from $\mathbf x$ to the samples in the nearest neighboring cluster.
We aim for a low $A$ and a high $B$. In the ideal case, $A=0$ and $B>A$, resulting in a silhouette coefficient of $B/B=1$. In a similar manner, you can derive the worst case, which has a value of $-1$.
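In practice you don't have to compute this by hand. Here is a short sketch, assuming the fitted `model` and the `X_normalized` array from the inertia example above:

```python
from sklearn.metrics import silhouette_score, silhouette_samples

# Mean silhouette coefficient over all samples.
print(silhouette_score(X_normalized, model.labels_))

# Per-sample silhouette coefficients, if you want the full distribution.
print(silhouette_samples(X_normalized, model.labels_))
```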
How to find the best $K$
If you can determine a suitable $K$ based on the problem domain, that’s sufficient.
Otherwise, you can leverage the measurements we mentioned before.
- The inertia should be lower. You can use the famous elbow method to find the best $K$.
- The silhouette score should be higher.
Demo
You can find this demo here.
In real-world applications, the dimensionality of the feature vectors, as well as the number of samples, is typically much higher. The underlying concept remains the same.
I chose the iris dataset for demonstration mainly because:
- It has only 150 samples => we can clearly see how K-Means performs.
- The feature vector has size 4 instead of 2 => we can use t-SNE for data visualization.
The iris dataset is so popular that you may know the real $K$ in advance. Let’s pretend that we have no idea what the value of $K$ is :)
Let’s import some libraries.
import pandas as pd
import altair as alt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn import preprocessing
And then, let’s do normalization as mentioned before.
data, target = load_iris(return_X_y=True, as_frame=True)
X, y = data.to_numpy(), target.to_numpy()
X_normalized = preprocessing.normalize(X)
df = pd.concat((pd.DataFrame(X_normalized, columns=data.columns), target), axis=1)
Now we can check the first 5 samples with `df.head(5)`:
id | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
---|---|---|---|---|---|
0 | 0.803773 | 0.551609 | 0.220644 | 0.031521 | 0 |
1 | 0.828133 | 0.507020 | 0.236609 | 0.033801 | 0 |
2 | 0.805333 | 0.548312 | 0.222752 | 0.034269 | 0 |
3 | 0.800030 | 0.539151 | 0.260879 | 0.034784 | 0 |
4 | 0.790965 | 0.569495 | 0.221470 | 0.031639 | 0 |
Because the feature vector has size 4, we can leverage t-SNE to reduce the dimensionality.
model = TSNE(n_components=2)
X_proj = model.fit_transform(X)
Note that we need to use the raw data `X` for data visualization.
I prefer using altair for plotting, so I'll need to create a DataFrame.
df_proj_without_label = pd.DataFrame(X_proj, columns=["x", "y"])
df_proj_with_label = pd.concat((df_proj_without_label, df['target']), axis=1)
Finally, we can visualize the iris dataset with the following code.
alt.Chart(df_proj_without_label).mark_circle(size=60).encode(
    x="x",
    y="y",
).interactive().properties(width=400, height=400)
From this diagram, the best guess is $K=3$.
Let’s train a K-Means model with $K=3$ to see if it’s the case.
Note that when you train a K-Means model, you should use the normalized vectors `X_normalized`.
model = KMeans(n_clusters=3, random_state=42)
model.fit(X_normalized)
df_with_new_pred = pd.concat((df_proj_without_label, pd.Series(model.labels_)), axis=1)
df_with_new_pred.columns = ["x", "y", "target"]
alt.Chart(df_with_new_pred).mark_circle(size=60).encode(
    x="x",
    y="y",
    color='target:N',
).interactive().properties(width=400, height=400)
The diagram suggests that $K=3$ is definitely a good choice. Does there exist a better $K$?
ks = range(1, 12)
inertias = []
silhouette_scores = []
models = []
for k in ks:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X_normalized)
    models.append(model)
    inertias.append(model.inertia_)
    if k != 1:
        # The silhouette score is undefined for a single cluster.
        silhouette_scores.append(silhouette_score(X_normalized, model.labels_))
# For plotting: duplicate the first silhouette value as a dummy entry for K=1.
silhouette_scores = [silhouette_scores[0]] + silhouette_scores
curve = pd.DataFrame(
    zip(ks, inertias, silhouette_scores), columns=["K", "inertia", "silhouette"]
)
base = alt.Chart(curve).encode(x='K')
alt.layer(
    base.mark_line(color='blue').encode(y='inertia'),
    base.mark_line(color='red').encode(y='silhouette')
)
Note that the silhouette score requires at least 2 clusters, so I inserted a dummy value for $K=1$ to keep the two curves aligned.
- The elbow position of inertia is $K=3$.
- The silhouette score keeps decreasing as $K$ increases.
Congratulations! The iris dataset does indeed contain three classes.
Let’s see the plot of the ground truth.
alt.Chart(df_proj_with_label).mark_circle(size=60).encode(
    x="x",
    y="y",
    color='target:N',
).interactive().properties(width=400, height=400)
Wrap-up
In this post, we reviewed how the K-Means algorithm works and used the iris dataset as a demo. The main takeaways are:
- How to visualize data with high dimensionality 👉 Use t-SNE on the raw data.
- How to do clustering 👉 Use the K-Means algorithm on normalized data.
- How to find the best $K$ 👉 If the value of $K$ can't be determined by the problem domain, just try different values of $K$ and observe how the inertia and silhouette scores (or other measurements) change. You can use the elbow method here.
[^1]: Statistical Learning Methods (2nd Edition). Li Hang.