\(~\)Hierarchical Clustering\(~\)

Somsak Chanaim

International College of Digital Innovation, CMU

September 30, 2025

What is Hierarchical Clustering?

Hierarchical Clustering is one of the Clustering techniques
used in Exploratory Data Analysis, with the goal of dividing data into groups based on similarity.

Applications of Hierarchical Clustering in Business

Hierarchical Clustering can be applied in business to analyze and segment large datasets.

This helps companies better understand customer behavior, improve marketing strategies, and enhance overall operational efficiency.

1. Customer Segmentation

Objective: Segment customers based on behavior, interests, or other factors
to enable targeted marketing.

Real-world examples:

A retail company uses Hierarchical Clustering to divide customers into groups such as
Loyal Customers, Occasional Buyers, and New Customers.
Banks apply this technique to segment customers by their risk levels in lending.

Benefits:

Design targeted advertising campaigns
Develop loyalty or reward programs for each customer group
Improve customer retention rates

2. Market Basket Analysis

Objective: Analyze which products are most frequently purchased together
to support sales strategy planning.

Real-world examples:

A supermarket applies Hierarchical Clustering to group items that are often bought together,
e.g., customers who buy bread 🍞 often purchase peanut butter 🧈 as well.
Online stores use this information to recommend products through a Recommendation System.

Benefits:

Tailor promotions for specific customer groups
Optimize product placement within stores
Increase sales by suggesting related products

3. Product Categorization

Objective: Group similar products to support better inventory and product management.

Real-world examples:

Retail stores can use Hierarchical Clustering to categorize products into
premium, regular, and budget items.
E-commerce platforms can organize products into categories, making it easier for users to search.

Benefits:

Improve menu structures in websites or applications
Plan inventory management more effectively
Develop suitable pricing strategies

4. Credit Risk Analysis

Objective: Segment customers based on their credit risk levels.

Real-world examples:

Financial institutions use Hierarchical Clustering to group customers by risk level,
e.g., good repayment history 😊, medium risk 😐, and high risk 😨.
Credit card companies use this information to set appropriate credit limits for different customer groups.

Benefits:

Reduce the risk of lending
Adjust interest rates according to customer profiles
Prevent non-performing loans (NPL)

5. Employee Segmentation

Objective: Segment employees to design policies tailored to each group.

Real-world examples:

Companies can use Hierarchical Clustering to divide employees into
🚀 High Performers 🌟, 🏢 General Staff 🙂, and 🌱 Employees needing further development 📚.
Human Resources (HR) departments can use this information to design training programs that fit the needs of each group.

Benefits:

Adjust bonus and benefits strategies
Develop career paths for different employee groups
Reduce employee turnover rates

6. Market Trend Analysis

Objective: Group market trends or customer segments to support the development of new products.

Real-world examples:

Technology companies can use Hierarchical Clustering to track customer trends,
e.g., those who always adopt the latest smartphones vs. those who upgrade only when necessary.
Cosmetic companies can apply this technique to segment consumers,
e.g., those who prefer organic products vs. those who prioritize budget-friendly items.

Benefits:

Help companies understand market trends and customer behavior
Enable the development of products that meet target customer needs
Increase competitiveness in the market

Interactive of Hierachical Clustering

viewof hc_ctrl = (() => {
  const form = Inputs.form({
    linkage: Inputs.select(["average","single","complete"], { label: "Linkage", value: "average" }),
    thresh:  Inputs.range([0, 8], { step: 0.1, value: 2.5, label: "Cut at distance" }),
    click:   Inputs.button("Randomize data")
  });

  // layout เป็นกริดหลายคอลัมน์ในแถวเดียว
  form.style.display = "grid";
  form.style.gridTemplateColumns = "repeat(3, minmax(220px, 1fr))";
  form.style.gap = "12px";
  form.style.alignItems = "end";
  return form;
})()

// ตัวแปรที่ส่วนอื่นใช้
hc_linkage = hc_ctrl.linkage
hc_thresh  = hc_ctrl.thresh
hc_clicks  = hc_ctrl.click

// ===== RNG helpers (ประกาศครั้งเดียวแล้วใช้ทุกที่) =====
RNG = (() => {
  function xmur3(str){
    let h = 1779033703 ^ str.length;
    for (let i=0; i<str.length; i++)
      h = Math.imul(h ^ str.charCodeAt(i), 3432918353), h = (h << 13) | (h >>> 19);
    return () => {
      h = Math.imul(h ^ (h >>> 16), 2246822507);
      h = Math.imul(h ^ (h >>> 13), 3266489909);
      return (h ^ (h >>> 16)) >>> 0;
    };
  }
  function mulberry32(a){
    return () => {
      let t = a += 0x6D2B79F5;
      t = Math.imul(t ^ (t >>> 15), t | 1);
      t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
      return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
  }
  const seed = (s) => mulberry32(xmur3(String(s))());
  return { seed };
})();

hc_makeData = (n=60, seed=1) => {
  const rand = RNG.seed(seed);
  const j = s => (rand()*2 - 1)*s;
  const blob = (cx,cy,s,n) => Array.from({length:n}, () => ({x:cx+j(s), y:cy+j(s)}));
  const s = 1.0;
  return [
    ...blob(0,   0,   s, n),
    ...blob(3.5, 3.5, s, n),
    ...blob(7,   0,   s, n)
  ];
}
hc_data = hc_makeData(60, hc_clicks)

hc_pairwise = (() => {
  const n = hc_data.length;
  const D = Array.from({length:n}, () => Array(n).fill(0));
  for (let i=0;i<n;i++){
    for (let j=i+1;j<n;j++){
      const dx = hc_data[i].x - hc_data[j].x;
      const dy = hc_data[i].y - hc_data[j].y;
      const d  = Math.hypot(dx, dy);
      D[i][j] = D[j][i] = d;
    }
  }
  return D;
})()

hc_linkage_table = (() => {
  const n = hc_data.length;
  let clusters = new Map();                   // id -> Set(indices)
  for (let i=0;i<n;i++) clusters.set(i, new Set([i]));
  let nextId = n;
  const merges = [];

  const distAB = (A, B) => {
    let cnt=0, sum=0, min=Infinity, max=0;
    for (const i of A) for (const j of B){
      const d = hc_pairwise[i][j];
      cnt++; sum+=d; if (d<min) min=d; if (d>max) max=d;
    }
    if (hc_linkage === "single")   return min;
    if (hc_linkage === "complete") return max;
    return sum/cnt; // average
  };

  while (clusters.size > 1){
    const ids = [...clusters.keys()];
    let best = {a:null, b:null, d:Infinity};
    for (let i=0;i<ids.length;i++){
      for (let j=i+1;j<ids.length;j++){
        const a = ids[i], b = ids[j];
        const d = distAB(clusters.get(a), clusters.get(b));
        if (d < best.d) best = {a,b,d};
      }
    }
    const A = clusters.get(best.a), B = clusters.get(best.b);
    const merged = new Set([...A, ...B]);
    merges.push({ id: nextId, left: best.a, right: best.b, dist: best.d, size: merged.size });
    clusters.delete(best.a); clusters.delete(best.b);
    clusters.set(nextId, merged);
    nextId++;
  }
  return merges; // ความยาว = n-1
})()

hc_cut = (() => {
  const n = hc_data.length;
  const current = new Map();                  // id -> Set(indices)
  for (let i=0;i<n;i++) current.set(i, new Set([i]));

  for (const m of hc_linkage_table){
    if (m.dist > hc_thresh) break;            // ตัดที่ระยะ
    const A = current.get(m.left), B = current.get(m.right);
    const merged = new Set([...A, ...B]);
    current.delete(m.left); current.delete(m.right);
    current.set(m.id, merged);
  }

  const clusters = [...current.values()];
  const assign = new Array(n);
  clusters.forEach((set, k) => { for (const i of set) assign[i] = k; });
  return {clusters, assign};
})()

// ===== Scatter + Dendrogram (side-by-side, 640×360 each, smaller points) =====
{
  const P = await import("https://cdn.jsdelivr.net/npm/@observablehq/plot@0.6/+esm");

  // ----- เตรียมโครง dendrogram จากตัวแปรเดิม -----
  const n = hc_data.length;
  const merges = hc_linkage_table;
  const rootId = merges.at(-1)?.id ?? null;
  const byId = new Map(merges.map(m => [m.id, m]));

  const order = [];
  const collect = (id) => {
    if (id < n) { order.push(id); return; }      // leaf
    const m = byId.get(id); collect(m.left); collect(m.right);
  };
  if (rootId != null) collect(rootId); else for (let i=0;i<n;i++) order.push(i);

  const yPos = new Map(order.map((leaf, i) => [leaf, i]));
  const nodes = new Map(); for (let i=0;i<n;i++) nodes.set(i, {id:i, dist:0, y:yPos.get(i)});
  const horiz = [], vert = [];
  for (const m of merges) {
    const L = nodes.get(m.left), R = nodes.get(m.right);
    horiz.push({ y: L.y, x1: L.dist, x2: m.dist });
    horiz.push({ y: R.y, x1: R.dist, x2: m.dist });
    vert.push({ x: m.dist, y1: Math.min(L.y, R.y), y2: Math.max(L.y, R.y) });
    nodes.set(m.id, { id: m.id, dist: m.dist, y: (L.y + R.y)/2 });
  }
  const leaves = order.map(leaf => ({ x:0, y:yPos.get(leaf), cluster:`C${hc_cut.assign[leaf]+1}` }));
  const maxDist = Math.max(merges.at(-1)?.dist ?? 1, hc_thresh);

  // ----- ขนาดคงที่ + คอนฟิก -----
  const W_FIX = 640;   // ความกว้างแต่ละกราฟ
  const H_FIX = 360;   // ความสูงแต่ละกราฟ
  const GAP   = 16;    // ช่องว่างระหว่างกราฟ
  const POINT_R = 5;   // << จุดใน scatter เล็กลง 50%
  const LEAF_R  = 3.5; // จุดใบใน dendrogram ให้เล็กลงตามสัดส่วน

  // ----- SCATTER -----
  const colored = hc_data.map((p,i)=> ({...p, cluster:`C${hc_cut.assign[i]+1}`}));
  const scatter = P.plot({
    width: W_FIX, height: H_FIX,
    grid: true, nice: false,
    x: { label: "X", domain: [-2, 9] },
    y: { label: "Y", domain: [-2, 7] },
    marks: [
      P.dot(colored, { x:"x", y:"y", r: POINT_R, fill:"cluster", opacity: 0.85, tip:true })
    ]
  });

  // ----- DENDROGRAM -----
  const dendro = P.plot({
    width: W_FIX, height: H_FIX,
    marginLeft: 60, marginRight: 20, marginTop: 10, marginBottom: 30,
    x: { label: "distance", domain: [0, maxDist * 1.05] },
    y: { domain: [order.length - 1, 0], tickFormat: () => "" },
    marks: [
      P.ruleY(horiz, { y:"y", x1:"x1", x2:"x2", stroke:"#444" }),
      P.ruleX(vert,  { x:"x", y1:"y1", y2:"y2", stroke:"#444" }),
      P.ruleX([{ x: hc_thresh, y1: -0.5, y2: order.length - 0.5 }],
              { x:"x", y1:"y1", y2:"y2", stroke:"#c00", strokeDash:[6,4] }),
      P.dot(leaves, { x:"x", y:"y", r: LEAF_R, fill:"cluster", stroke:"black" })
    ]
  });

  // ----- วางซ้าย–ขวา -----
  const wrap  = document.createElement("div");
  wrap.style.display = "flex";
  wrap.style.gap = GAP + "px";
  wrap.style.alignItems = "flex-start";
  wrap.style.justifyContent = "center";
  wrap.style.overflowX = "hidden";

  const left  = document.createElement("div");
  const right = document.createElement("div");
  left.style.flex  = `0 0 ${W_FIX}px`;
  right.style.flex = `0 0 ${W_FIX}px`;

  const titleL = Object.assign(document.createElement("div"), {textContent:"Scatter (colored by clusters)"});
  titleL.style.font = "600 14px system-ui, sans-serif"; titleL.style.margin = "0 0 .5rem 0";
  const titleR = Object.assign(document.createElement("div"), {textContent:"Dendrogram"});
  titleR.style.font = "600 14px system-ui, sans-serif"; titleR.style.margin = "0 0 .5rem 0";

  left.append(titleL, scatter);
  right.append(titleR, dendro);
  wrap.append(left, right);
  return wrap;
}

How It Works

Hierarchical Clustering is based on the concept of a hierarchical structure,
by building a Dendrogram, which is a tree diagram showing the relationships among data.

Start by treating each data point as its own cluster (Singleton Cluster).
Merge the closest clusters based on distance or similarity.
Repeat until all data points are combined into a single cluster.

Types of Hierarchical Clustering

Agglomerative Hierarchical Clustering (AHC)

Bottom-up approach
Start with each data point as its own cluster
Iteratively merge the closest clusters
Continue until only one cluster remains

What is Linkage?

Linkage is a method for measuring the distance between clusters in Hierarchical Clustering,
which directly affects how data points are merged into clusters.

Single Linkage
Complete Linkage
Average Linkage
Ward’s Method

Types of Linkage Methods

1. Single Linkage (Nearest Neighbor)

Uses the minimum distance between two clusters
Suitable for data with chain-like or connected structures
May suffer from the “Chain Effect,” where clusters form long chains

2. Complete Linkage (Farthest Neighbor)

Uses the maximum distance between two clusters
Produces more compact clusters
Reduces the chance of the Chain Effect

3. Average Linkage (Unweighted Pair Group Method with Arithmetic Mean - UPGMA)

Uses the average distance between all points in the two clusters
A compromise between Single and Complete Linkage
Produces balanced results in terms of cluster size and distance

4. Weighted Linkage (WPGMA - Weighted Pair Group Method with Arithmetic Mean)

Similar to Average Linkage but gives weight to the size of the clusters being merged

5. Centroid Linkage (Unweighted Pair Group Method with Centroid - UPGMC)

Uses the distance between the centroids of two clusters
May cause issues with overlapping clusters

6. Ward’s Method

Focuses on minimizing within-cluster variance
Often produces well-structured clusters and works well with large datasets
Widely used in business and data science applications

Choosing a Linkage Method

Single Linkage → Suitable when data tends to form connected or chain-like clusters
Complete Linkage → Use when compact clusters are desired
Average Linkage → Works well when clusters vary in size
Ward’s Method → Good for general use and minimizing within-cluster variance

Interactive Linkage

viewof seed   = Inputs.number({label: "Seed", value: 42, step: 1, min: 0})
viewof ktrue  = Inputs.range([2,5], {label: "Number of true clusters", value: 3, step: 1})
viewof nper   = Inputs.range([15,60], {label: "Points per cluster", value: 35, step: 5})
viewof spread = Inputs.range([0.1, 1.2], {label: "Cluster spread (sd)", value: 0.35, step: 0.05})
viewof linkage = Inputs.radio(["single","complete","average","centroid","ward"], {label: "Linkage"})

// โหมดเลือกคู่คลัสเตอร์
viewof pairMode = Inputs.radio(["Auto (best pair)", "Choose pair"], {label: "Pair selection", value: "Auto (best pair)"})

// ===== Data & Cluster Options =====
data = {
  const rng  = mulberry32(seed);
  const K    = ktrue, Nper = nper, sd = spread;
  const centers = Array.from({length: K}, (_,i)=>({id:i, x:2+6*rng(), y:2+6*rng()}));
  const pts = [];
  for (const c of centers){
    for (let i=0;i<Nper;i++){
      pts.push({x:c.x + sd*randn(rng), y:c.y + sd*randn(rng), cid:c.id});
    }
  }
  return {pts, centers, K};
}

// ตัวเลือก 1-based
clusterOptions = Array.from({length: data.K}, (_,i)=>i+1)

// เลือกคลัสเตอร์ 2 อันเมื่ออยู่โหมด Choose (ค่าเริ่มจาก 1)
viewof clusterA = Inputs.select(clusterOptions, {label: "Cluster A", value: 1})
viewof clusterB = Inputs.select(clusterOptions, {label: "Cluster B", value: Math.min(2, data.K)})

// Explanation (callout) — English
// Callout as HTML (works in Observable)
linkageHelp = {
  const help = {
    single:   html`<div style="border-left:4px solid #2b8cbe; padding:0.5em 1em; background:#f0f8ff; border-radius:4px;">
      <strong>single</strong>: Line between the <b>closest pair of points</b> (minimum pairwise distance)
    </div>`,
    complete: html`<div style="border-left:4px solid #e6550d; padding:0.5em 1em; background:#fff5eb; border-radius:4px;">
      <strong>complete</strong>: Line between the <b>farthest pair of points</b> (maximum pairwise distance)
    </div>`,
    average:  html`<div style="border-left:4px solid #31a354; padding:0.5em 1em; background:#f7fcf5; border-radius:4px;">
      <strong>average</strong>: Line between the <b>centroids</b> of the two clusters (label shows the <b>average pairwise distance</b>)
    </div>`,
    centroid: html`<div style="border-left:4px solid #756bb1; padding:0.5em 1em; background:#f5f5f5; border-radius:4px;">
      <strong>centroid</strong>: Line directly between the <b>centroids</b> of the two clusters
    </div>`,
    ward:     html`<div style="border-left:4px solid #636363; padding:0.5em 1em; background:#f7f7f7; border-radius:4px;">
      <strong>ward</strong>: Line between the <b>centroids</b>, with a label showing <b>ΔSSE</b> (increase in within-cluster SSE after merging)
    </div>`
  };
  return help[linkage];
}

function randn(rng){ const u=1-rng(), v=1-rng(); return Math.sqrt(-2*Math.log(u))*Math.cos(2*Math.PI*v); }
function L2(a,b){ return Math.hypot(a.x-b.x, a.y-b.y); }
function mulberry32(a){return function(){let t=a+=0x6D2B79F5;t=Math.imul(t^t>>>15,t|1);t^=t+Math.imul(t^t>>>7,t|61);return((t^t>>>14)>>>0)/4294967296;};}

// ===== Compute pairwise criteria =====
pairs = {
  const {pts, K} = data;
  const byCluster = d3.group(pts, d=>d.cid);
  const out = [];
  for (let i=0;i<K;i++){
    for (let j=i+1;j<K;j++){
      const A = byCluster.get(i), B = byCluster.get(j);

      // single (min)
      let single = Infinity, sPair=null;
      for (const a of A) for (const b of B){ const d=L2(a,b); if (d<single){single=d; sPair=[a,b];} }

      // complete (max)
      let complete = -Infinity, cPair=null;
      for (const a of A) for (const b of B){ const d=L2(a,b); if (d>complete){complete=d; cPair=[a,b];} }

      // average
      let sum=0, cnt=0; for (const a of A) for (const b of B){ sum+=L2(a,b); cnt++; }
      const average = sum/cnt;

      // centroid & Ward
      const cA = {x:d3.mean(A,d=>d.x), y:d3.mean(A,d=>d.y)};
      const cB = {x:d3.mean(B,d=>d.x), y:d3.mean(B,d=>d.y)};
      const centroid = L2(cA,cB);

      const SSE = C => {
        const mx=d3.mean(C,d=>d.x), my=d3.mean(C,d=>d.y);
        return d3.sum(C, p => (p.x-mx)**2 + (p.y-my)**2);
      };
      const ward = SSE([...A,...B]) - SSE(A) - SSE(B);

      out.push({i,j,A,B,cA,cB,single,sPair,complete,cPair,average,centroid,ward});
    }
  }
  return out;
}

chosen = {
  const met = linkage; // "single" | "complete" | "average" | "centroid" | "ward"
  let cand = pairs;

  // ถ้าเลือกโหมดเจาะจงคู่ (UI เป็น 1-based -> แปลงกลับเป็น 0-based)
  if (pairMode === "Choose pair") {
    if (clusterA === clusterB) return null; // กันเลือกคู่เดียวกัน
    const a = Math.min(clusterA - 1, clusterB - 1);
    const b = Math.max(clusterA - 1, clusterB - 1);
    cand = pairs.filter(d => d.i === a && d.j === b);
  }

  // กรองเฉพาะที่มีค่า metric เป็นตัวเลขปกติ
  cand = cand
    .map(d => ({...d, _score: d[met]}))
    .filter(d => Number.isFinite(d._score));

  if (!cand.length) return null;

  // หา minimum เอง
  const best = cand.reduce((m, d) => (d._score < m._score ? d : m), cand[0]);

  // เตรียมจุดเริ่ม/ปลายของเส้นตาม linkage
  let segStart, segEnd, label, style = "solid";
  if (met === "single")   { segStart = best.sPair[0]; segEnd = best.sPair[1]; label = best.single; }
  else if (met === "complete"){ segStart = best.cPair[0]; segEnd = best.cPair[1]; label = best.complete; }
  else if (met === "centroid"){ segStart = best.cA;      segEnd = best.cB;      label = best.centroid; }
  else if (met === "average") { segStart = best.cA;      segEnd = best.cB;      label = best.average; style = "dash"; }
  else if (met === "ward")    { segStart = best.cA;      segEnd = best.cB;      label = best.ward;    style = "dot"; }

  return { method: met, segStart, segEnd, label, pair: [best.i, best.j], style };
}

!chosen ? md`**Select two different clusters or change settings — no valid pair.**` : null

color = d3.scaleOrdinal(d3.schemeTableau10).domain(d3.range(data.K))

centroids = d3.range(data.K).map(i => ({
  x: d3.mean(data.pts.filter(p=>p.cid===i), d=>d.x),
  y: d3.mean(data.pts.filter(p=>p.cid===i), d=>d.y),
  cid: i
}))

fmt = d3.format(".2f")

plotTitle = {
  if (!chosen) return "Select a linkage and (optionally) a pair of clusters";

  const a = chosen.pair[0] + 1;   // 1-based
  const b = chosen.pair[1] + 1;
  const L = fmt(chosen.label);

  const t = {
    single:   (a,b,L)=>`single: closest pair of points — cluster ${a} ↔ cluster ${b} (d = ${L})`,
    complete: (a,b,L)=>`complete: farthest pair of points — cluster ${a} ↔ cluster ${b} (d = ${L})`,
    average:  (a,b,L)=>`average: centroid ↔ centroid — clusters ${a} ↔ ${b} (avg d = ${L})`,
    centroid: (a,b,L)=>`centroid: centroid ↔ centroid distance — clusters ${a} ↔ ${b} (d = ${L})`,
    ward:     (a,b,L)=>`ward: ΔSSE between clusters ${a} ↔ ${b} (ΔSSE = ${L})`
  };

  return t[chosen.method](a,b,chosen.label);
}

// ===== Plot =====
Plot.plot({
  width: 760,
  height: 540,
  grid: true,
  marginBottom: 60,
  title: plotTitle,   // <-- เพิ่มบรรทัดนี้
  x: {label: "X"},
  y: {label: "Y"},
  marks: [
    Plot.dot(data.pts, {x:"x", y:"y", r:3.5, fill:d=>color(d.cid)}),
    Plot.dot(centroids, {x:"x", y:"y", r:8, fill:d=>color(d.cid), stroke:"white", strokeWidth:2}),

    // label บน centroid เริ่มจาก 1
    Plot.text(centroids.map(c=>({x:c.x,y:c.y,t:String(c.cid+1)})),
              {x:"x", y:"y", text:"t", dy:2, fontWeight:700, fill:"black",
               stroke:"white", strokeWidth:6, paintOrder:"stroke"}),

    // เส้นตาม linkage
    chosen && Plot.line([chosen.segStart, chosen.segEnd], {
      x:d=>d.x, y:d=>d.y, stroke:"black", strokeWidth:3,
      strokeDasharray: chosen.style==="dash" ? "6,6" : (chosen.style==="dot" ? "2,6" : "0")
    }),

    // ป้ายตัวเลขกลางเส้น
    chosen && Plot.text([{
      x:(chosen.segStart.x+chosen.segEnd.x)/2,
      y:(chosen.segStart.y+chosen.segEnd.y)/2,
      t:(chosen.method==="ward" ? "ΔSSE = " : "d = ") + d3.format(".2f")(chosen.label)
    }], {x:"x", y:"y", text:"t", dy:-8, stroke:"white", strokeWidth:5, paintOrder:"stroke"}),
    chosen && Plot.text([{
      x:(chosen.segStart.x+chosen.segEnd.x)/2,
      y:(chosen.segStart.y+chosen.segEnd.y)/2,
      t:(chosen.method==="ward" ? "ΔSSE = " : "d = ") + d3.format(".2f")(chosen.label)
    }], {x:"x", y:"y", text:"t", dy:-8})
  ],
  color: {legend: true}
})

// Show the explanation right under the chart
Plot.plot({
  // ...options เดิม...
  caption: linkageHelp  // ถ้ารองรับจะขึ้นใต้กราฟในกรอบคำอธิบาย
})

Example Workflow

1. Load the Data

Example dataset:

x	y	label
1	1	A
1	2	B
6	6	C
8	4	D
8	7	E

Visualized

2. Rescale (Normalize or Standardize) if Necessary

Not required in this example.

3. Compute the Distance Matrix

	A	B	C	D	E
A	0.00	1.22	8.66	9.33	11.29
B	1.22	0.00	7.84	8.92	10.54
C	8.66	7.84	0.00	3.46	2.74
D	9.33	8.92	3.46	0.00	3.67
E	11.29	10.54	2.74	3.67	0.00

4. Perform Hierarchical Clustering

The default method is ‘complete linkage’, but other methods such as ‘single’ or ‘average’ can also be used.

5. Plot the Dendrogram

A dendrogram is a tree-like diagram that shows the hierarchical structure of clusters.

Hierachical Clustering Step by Step

//| panel: center
(async () => {
  // ===== Skeleton =====
  const box = html`<div style="max-width:1200px;font:14px system-ui;">
    <style>
      .gd-side {font:12px system-ui; max-width:1200px}
      .gd-side .row-h {display:flex; flex-wrap:wrap; gap:12px; align-items:flex-end}
      .gd-side .ctrl {display:flex; flex-direction:column; gap:6px}
      .gd-side input[type="number"]{width:110px; padding:2px 6px}
      .gd-side input[type="range"]{width:240px}
      .gd-side .oi-radio {display:flex; flex-wrap:wrap; gap:10px}
      .gd-side .oi-radio label {font-weight:400; font-size:12px}
      .gd-side button[disabled] {opacity:.5; cursor:not-allowed}

      .topbar {
        display:grid; grid-template-columns: 1fr auto;
        align-items:start; gap:16px; margin-top:0; white-space:nowrap;
      }
      .kpi {font:12px system-ui; color:#333}
      .kpi b{font-weight:600}
      .ctrl-right{
        display:grid; grid-auto-rows:auto; row-gap:6px;
        justify-items:end; margin-top:-18px;
      }
      .ctrl-right .row{ display:flex; gap:12px; align-items:center; flex-wrap:wrap; justify-content:flex-end; }
      .ctrl-right .label{ min-width:72px; text-align:right; font-weight:600 }
      #row-step { margin-top:-4px; }

      .gridR { display:grid; grid-template-columns: 1fr 1fr; grid-auto-rows:auto; gap:12px; margin-top:10px; }
      #plot-scatter { grid-column:1; }
      #plot-dendro  { grid-column:2; }
      #plot-scree   { grid-column:1 / -1; }
    </style>

    <div id="ctrl" class="gd-side">
      <div class="row-h" id="row-h"></div>
      <div class="row-h" id="row-btn" style="margin-top:8px;"></div>
    </div>

    <div id="topbar" class="topbar">
      <div id="kpi" class="kpi"></div>
      <div id="right" class="ctrl-right">
        <div class="row" id="row-linkage"><span class="label">Linkage</span></div>
        <div class="row" id="row-step"><span class="label">Step</span></div>
      </div>
    </div>

    <div id="gridR" class="gridR">
      <div id="plot-scatter"></div>
      <div id="plot-dendro"></div>
      <div id="plot-scree"></div>
    </div>

    <div id="note" style="margin-top:8px;color:#444"></div>
  </div>`;

  const rowH   = box.querySelector("#row-h");
  const rowBtn = box.querySelector("#row-btn");
  const kpiBox = box.querySelector("#kpi");
  const rowLink= box.querySelector("#row-linkage");
  const rowStep= box.querySelector("#row-step");
  const gridR  = box.querySelector("#gridR");
  const divSca = box.querySelector("#plot-scatter");
  const divDen = box.querySelector("#plot-dendro");
  const divScr = box.querySelector("#plot-scree");
  const note   = box.querySelector("#note");

  // ===== Inputs =====
  const seedI   = Inputs.number({label:"Seed", value:42, step:1, min:0});
  const nI      = Inputs.range([10, 120], {label:"Sample size (n)", value:40, step:2});
  const kI      = Inputs.range([2, 6],   {label:"True clusters", value:3, step:1});
  const spreadI = Inputs.range([0.05, 1.5], {label:"Cluster spread (sd)", value:0.35, step:0.05});
  const newBtn  = Inputs.button("🎲 New sample");

  rowH.append(
    html`<div class="ctrl" style="min-width:140px">${seedI}</div>`,
    html`<div class="ctrl">${nI}</div>`,
    html`<div class="ctrl">${kI}</div>`,
    html`<div class="ctrl">${spreadI}</div>`
  );
  rowBtn.append(html`<div class="ctrl">${newBtn}</div>`);

  const methodI = Inputs.radio(["single","complete","average","centroid","ward"], {label:"", value:"average"});
  const stepI   = Inputs.range([0, 39], {label:"", value:0, step:1});
  rowLink.append(methodI);
  rowStep.append(stepI);

  // ===== Utils =====
  const TAU = 2*Math.PI;
  function mulberry32(a){ return function(){ let t=a+=0x6D2B79F5; t=Math.imul(t^t>>>15,t|1); t^=t+Math.imul(t^t>>>7,t|61); return ((t^t>>>14)>>>0)/4294967296; }; }
  function rnorm(rng, m=0, s=1){ let u=0,v=0; while(u===0) u=rng(); while(v===0) v=rng(); return m + s*Math.sqrt(-2*Math.log(u))*Math.cos(TAU*v); }
  function centroid(points){ const n=points.length; if(!n) return {x:0,y:0}; let sx=0,sy=0; for(const p of points){ sx+=p.x; sy+=p.y; } return {x:sx/n,y:sy/n}; }
  function sse(points){ if(!points.length) return 0; const c=centroid(points); let s=0; for(const p of points){ const dx=p.x-c.x, dy=p.y-c.y; s+=dx*dx+dy*dy; } return s; }
  function dist(a,b){ return Math.hypot(a.x-b.x,a.y-b.y); }

  // ===== Sampler =====
  function makeData(seed,n,K,sd){
    const rng = mulberry32(seed >>> 0);
    const centers = Array.from({length:K}, (_,i)=>({
      x:rnorm(rng,(i-(K-1)/2)*2.2,0.2),
      y:rnorm(rng,(i%2?1:-1)*1.2,0.2)
    }));
    const data=[];
    for(let i=0;i<n;i++){
      const c = centers[Math.floor(rng()*K)];
      data.push({id:i, x:rnorm(rng,c.x,sd), y:rnorm(rng,c.y,sd)});
    }
    return data;
  }

  // ===== Hierarchical clustering (with witness pairs) =====
  function hclustAll(data, method){
    const n = data.length;
    let clusters = Array.from({length:n}, (_,i)=>[i]);
    const states = [clusters.map(c=>c.slice())];
    const merges = [];
    const heights = [];

    function pairwiseMinMax(Ci, Cj, pick){
      let bestD = (pick==="min" ? Infinity : -Infinity);
      let ia = -1, jb = -1;
      for(const i of Ci){
        for(const j of Cj){
          const d = dist(data[i], data[j]);
          if ((pick==="min" && d < bestD) || (pick==="max" && d > bestD)){
            bestD = d; ia = i; jb = j;
          }
        }
      }
      return {d: bestD, ia, jb};
    }

    for(let step=0; step<n-1; step++){
      let bestI=-1, bestJ=-1, bestH=Infinity;
      let bestCentroids=null, bestWitness=null;

      for(let i=0;i<clusters.length;i++){
        for(let j=i+1;j<clusters.length;j++){
          const Ci = clusters[i], Cj = clusters[j];
          let h, witness=null, cents=null;

          if(method==="single"){
            const r = pairwiseMinMax(Ci, Cj, "min");
            h = r.d; witness = {ia:r.ia, jb:r.jb};
          } else if(method==="complete"){
            const r = pairwiseMinMax(Ci, Cj, "max");
            h = r.d; witness = {ia:r.ia, jb:r.jb};
          } else if(method==="average"){
            let s=0,cnt=0;
            for(const a of Ci) for(const b of Cj){ s+=dist(data[a],data[b]); cnt++; }
            h = s/Math.max(1,cnt);
            cents = { ca: centroid(Ci.map(i=>data[i])), cb: centroid(Cj.map(j=>data[j])) };
          } else if(method==="centroid"){
            const ca = centroid(Ci.map(i=>data[i]));
            const cb = centroid(Cj.map(j=>data[j]));
            h = dist(ca, cb);
            cents = {ca, cb};
          } else if(method==="ward"){
            const Pi=Ci.map(i=>data[i]), Pj=Cj.map(j=>data[j]);
            h = sse(Pi.concat(Pj)) - sse(Pi) - sse(Pj); // ΔSSE
            cents = { ca: centroid(Pi), cb: centroid(Pj) };
          } else {
            h = Infinity;
          }

          if(h < bestH){
            bestH = h;
            bestI = i; bestJ = j;
            bestCentroids = cents;
            bestWitness = witness;
          }
        }
      }

      const A=clusters[bestI], B=clusters[bestJ];
      const merged=A.concat(B);
      const next=[]; for(let k=0;k<clusters.length;k++) if(k!==bestI && k!==bestJ) next.push(clusters[k]); next.push(merged);
      clusters=next;

      merges.push({
        aIdx:bestI, bIdx:bestJ, height:bestH,
        pair:[A.slice(),B.slice()],
        centroids: bestCentroids,
        witness:   bestWitness
      });
      heights.push(bestH);
      states.push(clusters.map(c=>c.slice()));
    }
    return {states, merges, heights};
  }

  // ===== Dendrogram segments =====
  function buildDendroSegments(data, result){
    const n = data.length;
    const leaves = Array.from({length:n}, (_,i)=>i).sort((i,j)=> (data[i].x - data[j].x) || (data[i].y - data[j].y));
    const xPos = new Map(leaves.map((idx, k)=> [idx, k+1]));
    const nodeKey = (arr)=> arr.slice().sort((a,b)=>a-b).join(",");
    const nodes = new Map();
    for(const i of leaves) nodes.set(nodeKey([i]), {x:xPos.get(i), y:0, size:1});

    const verts=[], horiz=[];
    result.merges.forEach((m, stepIdx)=>{
      const A = m.pair[0], B = m.pair[1], h = m.height;
      const kA = nodeKey(A), kB = nodeKey(B);
      const nA = nodes.get(kA), nB = nodes.get(kB);
      const xA = nA.x, yA = nA.y, sA = nA.size;
      const xB = nB.x, yB = nB.y, sB = nB.size;

      verts.push({x1:xA,y1:yA,x2:xA,y2:h, step:stepIdx+1});
      verts.push({x1:xB,y1:yB,x2:xB,y2:h, step:stepIdx+1});
      horiz.push({x1:Math.min(xA,xB), y1:h, x2:Math.max(xA,xB), y2:h, step:stepIdx+1});

      const keyU = nodeKey(A.concat(B));
      const xU = (xA*sA + xB*sB) / (sA+sB);
      nodes.set(keyU, {x:xU, y:h, size:sA+sB});
    });

    const maxH = result.heights.length ? Math.max(...result.heights) : 1;
    return {verts, horiz, maxH, n};
  }

  // ===== State =====
  const state = { data:[], result:null, stepMax:0 };

  function rebuild(){
    state.data = makeData(+seedI.value, +nI.value, +kI.value, +spreadI.value);
    state.result = hclustAll(state.data, methodI.value);
    state.stepMax = state.result.states.length-1;

    const rangeEl = stepI.querySelector('input[type="range"]');
    if(rangeEl){ rangeEl.max = state.stepMax; }
    if(+stepI.value > state.stepMax) stepI.value = state.stepMax;

    draw();
  }

  function draw(){
    divSca.innerHTML = ""; divDen.innerHTML = ""; divScr.innerHTML = "";

    const colW  = Math.max(320, Math.floor(gridR.clientWidth / 2) - 12);
    const Hpair = Math.max(260, Math.min(800, Math.floor(colW * 0.9)));
    const Wfull = gridR.clientWidth;

    const t = +stepI.value;
    const {states, merges, heights} = state.result;
    const clusters = states[t];

    // KPI
    const total = state.data.length;
    kpiBox.innerHTML = `
      <span><b>Step:</b> ${t} / ${state.stepMax}</span>
      <span>•</span>
      <span><b>#Clusters:</b> ${clusters.length}</span>
      <span>•</span>
      <span><b>n:</b> ${total}</span>
      ${t>0 ? `<span>•</span><span><b>Height:</b> ${heights[t-1].toFixed(4)}</span>` : ``}
      <span style="margin-left:12px;opacity:.8">(Linkage: <b>${methodI.value}</b>)</span>
    `;

    // ===== Scatter with numeric labels =====
    const palette = d3.schemeTableau10;
    const colorOfCluster = (ci)=> palette[ci % palette.length];
    const pointRows = [];
    clusters.forEach((clu, ci)=>{ for(const idx of clu) pointRows.push({x:state.data[idx].x, y:state.data[idx].y, ci}); });

    const marksScatter = [Plot.frame()];
    if(t===0){
      marksScatter.push(Plot.dot(state.data.map(p=>({x:p.x,y:p.y})), {x:"x", y:"y", r:4, fill:"black"}));
    } else {
      const m = merges[t-1];
      const {pair, centroids, witness} = m;

      const setA=new Set(pair[0]), setB=new Set(pair[1]);
      const hiA = state.data.filter((_,i)=> setA.has(i)).map(p=>({x:p.x,y:p.y}));
      const hiB = state.data.filter((_,i)=> setB.has(i)).map(p=>({x:p.x,y:p.y}));

      let explainMarks = [];
      if(methodI.value==="single" || methodI.value==="complete"){
        if(witness && witness.ia>=0 && witness.jb>=0){
          const pa = state.data[witness.ia], pb = state.data[witness.jb];
          const mid = {x:(pa.x+pb.x)/2, y:(pa.y+pb.y)/2};
          const dval = dist(pa, pb);
          explainMarks.push(
            Plot.link([{x1:pa.x,y1:pa.y,x2:pb.x,y2:pb.y}],
              {x1:"x1",y1:"y1",x2:"x2",y2:"y2", stroke:"black", strokeDasharray:"4,3", strokeWidth:1.5, tip:true}),
            Plot.text([{x:mid.x, y:mid.y, t:`d = ${dval.toFixed(3)}`}],
              {x:"x", y:"y", text:"t", dy:-8, fontWeight:700, tip:true})
          );
        }
      } else if(methodI.value==="average"){
        const h = m.height;
        const mid = centroids ? {x:(centroids.ca.x+centroids.cb.x)/2, y:(centroids.ca.y+centroids.cb.y)/2} : {x:0,y:0};
        explainMarks.push(
          Plot.text([{x:mid.x, y:mid.y, t:`avg = ${h.toFixed(3)}`}],
            {x:"x", y:"y", text:"t", dy:-10, fontWeight:700, tip:true})
        );
      } else if(methodI.value==="centroid"){
        if(centroids){
          const mid = {x:(centroids.ca.x+centroids.cb.x)/2, y:(centroids.ca.y+centroids.cb.y)/2};
          const dcc = dist(centroids.ca, centroids.cb);
          explainMarks.push(
            Plot.link([{x1:centroids.ca.x,y1:centroids.ca.y,x2:centroids.cb.x,y2:centroids.cb.y}],
              {x1:"x1",y1:"y1",x2:"x2",y2:"y2", stroke:"black", strokeDasharray:"4,3", strokeWidth:1.5, tip:true}),
            Plot.text([{x:mid.x, y:mid.y, t:`‖cₐ − c_b‖ = ${dcc.toFixed(3)}`}],
              {x:"x", y:"y", text:"t", dy:-10, fontWeight:700, tip:true})
          );
        }
      } else { // ward
        if(centroids){
          const mid = {x:(centroids.ca.x+centroids.cb.x)/2, y:(centroids.ca.y+centroids.cb.y)/2};
          const dval = m.height; // ΔSSE
          explainMarks.push(
            Plot.link([{x1:centroids.ca.x,y1:centroids.ca.y,x2:centroids.cb.x,y2:centroids.cb.y}],
              {x1:"x1",y1:"y1",x2:"x2",y2:"y2", stroke:"black", strokeDasharray:"4,3", strokeWidth:1.5, tip:true}),
            Plot.text([{x:mid.x, y:mid.y, t:`ΔSSE = ${dval.toFixed(3)}`}],
              {x:"x", y:"y", text:"t", dy:-10, fontWeight:700, tip:true})
          );
        }
      }

      marksScatter.push(
        Plot.dot(pointRows, {x:"x", y:"y", r:4, fill:d=>colorOfCluster(d.ci), stroke:"white"}),
        ...explainMarks,
        Plot.dot(hiA, {x:"x", y:"y", r:4, stroke:"black", fill:"none"}),
        Plot.dot(hiB, {x:"x", y:"y", r:4, stroke:"black", fill:"none"})
      );
    }

    divSca.append(Plot.plot({
      marks: marksScatter,
      width: colW, height: Hpair, grid:true,
      x:{label:"x"}, y:{label:"y"}
    }));

    // ===== Dendrogram =====
    const den = buildDendroSegments(state.data, state.result);
    const sNow = t>0 ? t : -1;
    const vAll = den.verts, hAll = den.horiz;
    const vHL  = sNow>0 ? vAll.filter(d=>d.step===sNow) : [];
    const hHL  = sNow>0 ? hAll.filter(d=>d.step===sNow) : [];
    divDen.append(Plot.plot({
      marks: [
        Plot.link(vAll, {x1:"x1",y1:"y1",x2:"x2",y2:"y2", stroke:"#bbb"}),
        Plot.link(hAll, {x1:"x1",y1:"y1",x2:"x2",y2:"y2", stroke:"#bbb"}),
        (sNow>0 ? Plot.link(vHL, {x1:"x1",y1:"y1",x2:"x2",y2:"y2", stroke:"black", strokeWidth:2}) : null),
        (sNow>0 ? Plot.link(hHL, {x1:"x1",y1:"y1",x2:"x2",y2:"y2", stroke:"black", strokeWidth:2}) : null),
        Plot.dot(Array.from({length:den.n}, (_,i)=>({x:i+1,y:0})), {x:"x", y:"y", r:2})
      ],
      width: colW, height: Hpair, grid:true,
      x:{label:"leaves (ordered by x)", domain:[0.5, den.n+0.5]},
      y:{label: methodI.value==="ward" ? "ΔSSE" : "height", domain:[0, den.maxH*1.05]}
    }));

    // ===== Scree =====
    const series = heights.map((h,i)=>({step:i+1, h}));
    const cur = t>0 ? [{step:t, h:heights[t-1]}] : [];
    divScr.innerHTML = "";
    divScr.append(Plot.plot({
      marks:[
        Plot.ruleY([0]),
        Plot.line(series, {x:"step", y:"h"}),
        Plot.dot(series,  {x:"step", y:"h"}),
        (t>0 ? Plot.ruleX([t], {stroke:"crimson"}) : null),
        (t>0 ? Plot.dot(cur, {x:"step", y:"h", r:6}) : null),
        (t>0 ? Plot.text(cur, {x:"step", y:"h", dy:-8, text:d=>`${methodI.value==="ward" ? "ΔSSE" : "h"} = ${d.h.toFixed(3)}`, fontWeight:600}) : null)
      ],
      width: gridR.clientWidth, height: Math.max(200, Math.floor(Hpair * 0.7)), grid:true,
      x:{label:"merge step (1 … n-1)", domain:[1, state.stepMax]},
      y:{label: methodI.value==="ward" ? "ΔSSE" : "height"}
    }));

    note.innerHTML = `แสดงตัวเลขการคำนวณใน scatter: single/complete → d, average → avg, centroid → ‖cₐ−c_b‖, ward → ΔSSE.`;
  }

  // ===== Events =====
  newBtn.addEventListener("click", () => {
    newBtn.disabled=true; setTimeout(()=>newBtn.disabled=false,600);
    seedI.value = (+seedI.value || 0) + 1;
    rebuild();
  });
  [seedI, nI, kI, spreadI].forEach(el => el.addEventListener("input", rebuild));
  methodI.addEventListener("input", () => { state.result = hclustAll(state.data, methodI.value); draw(); });
  stepI.addEventListener("input", draw);
  window.addEventListener("resize", () => draw(), {passive:true});

  // init
  rebuild();
  return box;
})()