Road to Clustering
Before we learn about clustering techniques, we need to understand the concept of distance, which we refer to as a metric.
Matric Space
A metric space is a fundamental concept in mathematics, particularly in geometry and analysis, that provides a formal framework for discussing distances between points. It consists of a set of points and a distance function (called a “metric”) that satisfies certain properties.
A metric space is a set \(X\), along with a metric (distance function) \(d\), that assigns a non-negative real number to every pair of points \(x\) and \(y\) in \(X\), representing the distance between them. The metric space is denoted as \((X, d)\).
Formally, a metric space is defined by:
A set \(X\) of points (e.g., numbers, vectors, or more abstract objects).
A metric \(d: X \times X \to \mathbb{R}\), which is a function that assigns a distance \(d(x, y)\) between any two points \(x, y \in X\).
The function \(d(x, y)\) (the distance between \(x\) and \(y\)) must satisfy the following four properties:
Non-negativity: \[ d(x, y) \geq 0 \quad \text{for all} \quad x, y \in X \] The distance between any two points must be a non-negative number.
Identity of Indiscernibles: \[ d(x, y) = 0 \quad \text{if and only if} \quad x = y \] The distance between two distinct points must be strictly positive, and the distance between a point and itself is zero.
Symmetry: \[ d(x, y) = d(y, x) \quad \text{for all} \quad x, y \in X \] The distance from \(x\) to \(y\) must be the same as the distance from \(y\) to \(x\).
Triangle Inequality: \[ d(x, z) \leq d(x, y) + d(y, z) \quad \text{for all} \quad x, y, z \in X \] The direct distance between two points must be less than or equal to the distance via an intermediate point. This ensures the shortest path between two points is a straight line.
Example of Metric Spaces:
Euclidean Space \(\mathbb{R}^n\):
The set \(X\) is the space of \(n\)-dimensional real vectors \(\mathbb{R}^n\).
The metric \(d(x, y)\) is the Euclidean distance between two vectors \(x\) and \(y\): \[ d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \]
Manhattan Distance (L1 Norm):
- The set \(X\) is again \(\mathbb{R}^n\), and the metric is defined by: \[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]
- This is commonly used in grid-based or urban environments, where diagonal movement is restricted.
Applications of Metric Spaces:
- Machine Learning: Clustering algorithms (like K-Means) rely on metric spaces to group data points based on their “distances”.
Exercise computing the distance
Here are three exercises to compute distances using both L2 (Euclidean) and L1 (Manhattan) distance metrics.
Exercise 1: Points in 2D Space
You are given two points in a 2D space: \(A(1, 3)\) and \(B(4, 7)\).
- Task 1: Compute the L2 (Euclidean) distance between points \(A\) and \(B\).
\(\sqrt{(1-4)^2+(3-7)^2}=\sqrt{(-3)^2+(-4)^2}=\sqrt{9+16}=\sqrt{25}\) =
- Task 2: Compute the L1 (Manhattan) distance between points \(A\) and \(B\).
\(|1-4|+|3-7|=|-3|+|-4|=3+4\) =
Exercise 2: Points in 3D Space
Consider two points in 3D space: \(P(2, -1, 5)\) and \(Q(6, 3, -2)\).
- Task 1: Compute the L2 (Euclidean) distance between points \(P\) and \(Q\).
\(\sqrt{(2-6)^2+(-1-3)^2+(5-(-2))^2}\) =
- Task 2: Compute the L1 (Manhattan) distance between points \(P\) and \(Q\).
\(|2-6|+|-1-3|+|5-(-2)|\) =
Exercise 3:High-dimensional Space
You are given two points in 5-dimensional space:
\(X(1, 2, 3, 4, 5)\) and \(Y(5, 4, 3, 2, 1)\).
We cannot draw a graph for dimensions higher than 3.
- Task 1: Compute the L2 (Euclidean) distance between points \(X\) and \(Y\).
\(\sqrt{(1-5)^2+(2-4)^2+(4-2)^2+(5-1)^2}\) =
- Task 2: Compute the L1 (Manhattan) distance between points \(X\) and \(Y\).
\(|1-5|+|2-4|+|4-2|+|5-1|\) =