Unsupervised Learning

Somsak Chanaim

International College of Digital Innovation, CMU

October 1, 2025

Unsupervised Learning

Unsupervised Learning is a Machine Learning approach that analyzes data without a target variable (no labels).

In other words, the model must discover the structure or hidden patterns in the data on its own, without being told in advance how the data should be grouped or related. In some cases, we may not even know how many groups should exist in the data.

Principles of Unsupervised Learning

Instead of learning from examples with predefined answers (labels), as in Supervised Learning, an Unsupervised Learning model works by:

  • Analyzing the structure of the data

  • Discovering hidden patterns or relationships

  • Grouping similar data together

  • Reducing data dimensionality to simplify analysis

Main Types of Unsupervised Learning

1. Clustering

  • Clustering automatically groups data so that items within the same group are similar to each other, while being different from items in other groups.

  • Example algorithms include:

    • (k)-means clustering
    • Hierarchical clustering
    • DBSCAN (Density-Based Spatial Clustering)
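To make the comparison concrete, here is a minimal sketch (assuming Python with scikit-learn; the synthetic data and parameter values such as eps=0.7 are made up for illustration) that applies the three listed algorithms to the same dataset:

```python
# Hedged sketch: compare the three clustering algorithms listed above on
# synthetic 2-D data. Dataset and parameters are illustrative only.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# 300 points forming 3 ball-shaped groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)  # label -1 = noise

print(kmeans_labels[:10], hier_labels[:10], dbscan_labels[:10])
```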

2. Dimensionality Reduction

  • Used to reduce the number of variables in the data while preserving as much important information as possible.

  • Useful for addressing the “Curse of Dimensionality” and allowing models to learn faster.

  • Example algorithms include:

    • PCA (Principal Component Analysis)

    • t-SNE (t-Distributed Stochastic Neighbor Embedding)

    • UMAP (Uniform Manifold Approximation and Projection)
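As a minimal sketch (assuming Python with scikit-learn, using the built-in Iris dataset purely for illustration), PCA can compress the four Iris features into two components:

```python
# Hedged sketch: reduce 4 features to 2 principal components with PCA.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                        # 150 samples x 4 features
X_std = StandardScaler().fit_transform(X)   # put features on the same scale first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_)        # variance retained by each component
```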

3. Anomaly Detection

  • Used to identify values that deviate from the normal pattern in a dataset, such as detecting fraud or system errors.

  • Example algorithms include:

    • Isolation Forest

    • One-Class SVM

    • Autoencoders (Deep Learning)
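A minimal sketch of the idea (assuming Python with scikit-learn; the transaction amounts are invented for illustration) using Isolation Forest:

```python
# Hedged sketch: flag unusually large "transaction amounts" as anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=100, scale=10, size=(200, 1))  # typical amounts
outliers = np.array([[400.0], [550.0]])                # two suspicious amounts
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)                            # 1 = normal, -1 = anomaly

print(X[labels == -1].ravel())                         # amounts flagged as anomalous
```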

4. Association Rule Learning

  • Used to discover relationships between variables in a dataset.
    Example: Market Basket Analysis, which identifies items that customers often purchase together.

  • Example algorithms include:

    • Apriori

    • FP-Growth
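A minimal Market Basket Analysis sketch (assuming Python with the mlxtend library; the exact function signatures can vary slightly between mlxtend versions, and the transactions are made up):

```python
# Hedged sketch: frequent itemsets and association rules with Apriori (mlxtend).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```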

Real-World Applications of Unsupervised Learning

  • Customer Segmentation
    Companies can use clustering to group customers based on purchasing behavior, e.g., loyal customers, new customers, or churned customers.

  • Topic Modeling
    Unsupervised Learning can analyze and group articles or reviews. For example, LDA (Latent Dirichlet Allocation) is used to discover the topics underlying a collection of documents.

  • Fraud Detection
    Anomaly Detection helps identify suspicious transactions, such as credit card use in very different locations within a short period.

  • Dimensionality Reduction
    PCA can be applied to reduce the number of variables before feeding data into a Machine Learning model, simplifying the process.

Advantages and Disadvantages of Unsupervised Learning

Advantages

  • Does not require labeled data, reducing the cost of data preparation.

  • Automatically discovers hidden patterns in the data.

  • Works well with large and complex datasets.

  • Useful for uncovering new knowledge that humans may not easily notice.

Disadvantages

  • Results are often harder to interpret compared to Supervised Learning.

  • Setting the number of clusters or certain parameters requires expertise.

  • May struggle with noisy or poorly prepared data.

Interactive: Clustering

K-Means Clustering

Clustering with \(k\)-means (K-Means Clustering)

\(k\)-means is one of the most popular clustering techniques in Unsupervised Learning.

It works by dividing data into \(k\) groups based on similarity,
with each group represented by its own centroid (the central point).

How \(k\)-means Works

flowchart TD
    A[Start]
    B[Set number of clusters k]
    C[Randomly initialize k centroids]
    D[Compute distances to centroids]
    E[Assign each point to the nearest centroid]
    F[Update centroids as mean of assigned points]
    G{Centroids changed significantly and max iterations not reached?}
    H[Output cluster labels and final centroids]
    I[Stop]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G -- Yes --> D
    G -- No --> H
    H --> I

  1. Set the number of clusters \(k\):
    The user specifies the number of clusters \(k\) before starting.

  2. Randomly initialize centroids:
    Randomly select \(k\) centroids from the dataset.

  3. Compute distances and assign groups:
    Calculate the distance of each data point to the centroids (commonly using Euclidean Distance)
    and assign each point to the cluster with the nearest centroid.

  4. Update centroid positions:
    Recalculate each centroid by taking the mean of all points in its cluster.

  5. Repeat steps 3 and 4:
    Continue until the centroids no longer change significantly (convergence) or a maximum number of iterations is reached.
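The five steps can be written out directly in code. The following is a from-scratch sketch in Python with NumPy (illustrative only; it omits details such as empty-cluster handling, and in practice a library implementation such as scikit-learn's KMeans is normally used):

```python
# Hedged from-scratch sketch of k-means following the steps above.
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster would need special handling here)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Tiny made-up dataset with two obvious groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6], [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```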

Distance Metrics in \(k\)-means Clustering

In the process of \(k\)-means clustering, calculating the distance between data points and centroids is crucial, as it determines which cluster each point belongs to.

By default, \(k\)-means uses Euclidean Distance.
However, other distance metrics can also be applied depending on the nature of the data.

1. Euclidean Distance

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

✅ Easy to use and understand
✅ Suitable for numerical quantitative data

❌ Sensitive to outliers
❌ Not suitable when features are on different scales (apply Standardization first)

2. Manhattan Distance

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]

✅ Suitable for data distributed in a grid-like structure (e.g., pixel images)
✅ Less sensitive to outliers compared to Euclidean

❌ Not ideal for data with strong linear relationships or directional patterns

3. Cosine Similarity

\[ \cos(\theta) = \frac{x \cdot y}{||x|| \cdot ||y||} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \cdot \sqrt{\sum_{i=1}^{n} y_i^2}} \]

In general, Cosine Similarity measures the angle-based similarity between vectors rather than the distance.

✅ Suitable for vector-based data such as documents, text, and embeddings
✅ Not affected by vector magnitude (e.g., documents of different lengths)

❌ Not suitable when magnitude itself is important for grouping

Distance Calculation with Different Methods

Example data:

Point x y
a 1 2
b 3 4
c 4 3

Euclidean Distance

\[ \begin{aligned} Dis_{ab}&=\sqrt{2^2+2^2}=\sqrt{8}=2.828\\ Dis_{ac}&=\sqrt{3^2+1^2}=\sqrt{10}=3.162\\ Dis_{bc}&=\sqrt{1^2+1^2}=\sqrt{2}=1.414 \end{aligned} \]

Manhattan Distance

\[ \begin{aligned} Dis_{ab}&= |1-3|+|2-4| = 2+2 = 4\\ Dis_{ac}&= |1-4|+|2-3| = 3+1 = 4\\ Dis_{bc}&= |3-4|+|4-3| = 1+1 = 2 \end{aligned} \]
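The hand calculations above can be checked with a short sketch (assuming Python with NumPy and SciPy):

```python
# Hedged sketch: verify the Euclidean and Manhattan distances for a, b, c,
# and also show the cosine similarity between a and b for comparison.
import numpy as np
from scipy.spatial import distance

a, b, c = np.array([1, 2]), np.array([3, 4]), np.array([4, 3])

print(distance.euclidean(a, b), distance.euclidean(a, c), distance.euclidean(b, c))
# -> 2.828..., 3.162..., 1.414...
print(distance.cityblock(a, b), distance.cityblock(a, c), distance.cityblock(b, c))
# -> 4, 4, 2   (Manhattan / city-block distance)
print(1 - distance.cosine(a, b))   # cosine similarity, about 0.98
```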

Interactive: Distance

Example of \(k\)-means

1. Sample Data

We have a dataset of 5 points in 2 dimensions (X, Y) and want to split them into \(k=2\) clusters.

Data Point X Y
A 2 10
B 2 5
C 8 4
D 5 8
E 7 5

2. Steps of \(k\)-means Calculation

(1) Define \(k\) and randomly select initial centroids

  • Set the number of clusters: \(k = 2\)

  • Randomly choose initial centroids (suppose we pick A and C):

    • Centroid 1 (C1) = A = (2, 10)

    • Centroid 2 (C2) = C = (8, 4)


(2) Calculate the distance from each point to the centroids

We use Euclidean Distance:

\[ d(x, y) = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \]

Data Point Distance to C1 Distance to C2 Nearest Cluster
A (2,10) 0.00 8.49 C1
B (2,5) 5.00 6.08 C1
C (8,4) 8.49 0.00 C2
D (5,8) 3.61 5.00 C1
E (7,5) 7.07 1.41 C2

First-round result:

  • C1 cluster: {A, B, D}, C2 cluster: {C, E}

(3) Compute new centroids for each cluster

Centroid 1 (C1) {A, B, D}

\[\begin{aligned} C1_x &= \frac{2 + 2 + 5}{3} = \frac{9}{3} = 3.00\\ C1_y &= \frac{10 + 5 + 8}{3} = \frac{23}{3} = 7.67\end{aligned}\]

Centroid 2 (C2) {C, E}

\[\begin{aligned} C2_x &= \frac{8 + 7}{2} = \frac{15}{2} = 7.50\\ C2_y &= \frac{4 + 5}{2} = \frac{9}{2} = 4.50\end{aligned}\]

(4) Repeat steps 2 and 3

  • Recompute distances and reassign points to clusters.

  • Update the centroids, and continue iterating until the centroids no longer change (convergence).
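The same worked example can be reproduced with scikit-learn by supplying A and C as the initial centroids (a sketch, assuming Python with scikit-learn):

```python
# Hedged sketch: rerun the worked example with sklearn's KMeans,
# starting from the same initial centroids C1 = A and C2 = C.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5]], dtype=float)  # A, B, C, D, E
init = np.array([[2, 10], [8, 4]], dtype=float)                       # C1 = A, C2 = C

km = KMeans(n_clusters=2, init=init, n_init=1, max_iter=100)
labels = km.fit_predict(X)

print(labels)               # expected [0 0 1 0 1]: {A, B, D} and {C, E}
print(km.cluster_centers_)  # expected about [[3.00, 7.67], [7.50, 4.50]]
```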

3. Compute \(k\)-means

Animation of K-means

Set up

K-means (3D)

Choosing the Optimal \(k\) in \(k\)-means using Silhouette Score

Selecting the right number of clusters \(k\) in \(k\)-means is crucial.
If \(k\) is chosen too high or too low, clustering performance may suffer.

Silhouette Score is a metric for evaluating clustering quality, based on:

  • \(a(i)\): the average distance between a point and other points in the same cluster → “cluster cohesion”
  • \(b(i)\): the average distance between a point and points in the nearest neighboring cluster → “cluster separation”

\[ S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]

  • \(S(i)\) close to 1 → The point is well clustered within its group.

  • \(S(i)\) close to 0 → The point lies near the boundary between clusters.

  • \(S(i)\) close to -1 → The point is likely assigned to the wrong cluster.

How to Use Silhouette Score to Choose \(k\)

The average Silhouette Score indicates how appropriate the chosen \(k\) is.

  1. Run \(k\)-means with multiple values of \(k\) (e.g., \(k = 2, 3, 4, \dots, 10\)).

  2. Calculate the Silhouette Score for each \(k\).

  3. Select the \(k\) that yields the highest average Silhouette Score.
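A sketch of this procedure (assuming Python with scikit-learn; the synthetic data is generated only for illustration):

```python
# Hedged sketch: choose k by the highest average Silhouette Score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # average S(i) over all points

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)   # should come out near 4 for this synthetic data
```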

Interpreting Silhouette Values by Cluster

Consider an illustrative example with three clusters and their per-point Silhouette values:

  • Cluster 1 (Well-clustered): All three points have high positive Silhouette values (0.60–0.85). This means the points are tightly grouped and clearly separated from other clusters.

  • Cluster 2 (Borderline): One point is exactly at 0 and others are close to 0 (0.05–0.10). This indicates these points lie near the boundary between clusters, so their assignment is uncertain.

  • Cluster 3 (Poor cluster): The only point has a negative Silhouette value (-0.20). This shows the point is closer to another cluster than to its own, suggesting possible misclassification.

Interactive K-means & Silhouette Score

Interactive Silhouette Values

Advantages and Disadvantages of \(k\)-means

Advantages of \(k\)-means

✅ Easy to implement and computationally efficient
✅ Works well when data has clear cluster structures
✅ Scales effectively to large datasets

Disadvantages of \(k\)-means

❌ Requires specifying \(k\) in advance
❌ Sensitive to outliers
❌ Performs poorly when clusters are non-spherical or vary in size

\(k\)-means with Orange Data Mining

Example of \(k\)-means in Orange (1)

Example of \(k\)-means in Orange (2)
