International College of Digital Innovation, CMU
October 7, 2025
Students are able to…
Describe the basic concepts and workings of Natural Language Processing (NLP).
Explain the basics of sentiment analysis.
Recognise everyday applications of sentiment analysis.
Use simple tools for sentiment analysis.
Interpret the results of sentiment analysis clearly and simply.
Text Mining (also called Text Data Mining or Text Analytics) is the process of extracting useful information, patterns, and knowledge from unstructured text data.
It combines techniques from natural language processing (NLP), machine learning, and statistics to transform text into structured data for analysis.
✅ Scenario
An e-commerce company (e.g., Amazon) wants to improve product quality and customer satisfaction. They receive thousands of product reviews daily, which are unstructured text.
🔑 Process
Data Collection:
Preprocessing:
Text Mining Techniques:
Business Action:
Amazon uses text mining for product review analysis.
Starbucks applies text mining to Twitter and Instagram posts.
Healthcare: Mining Medical Records and Clinical Notes for Diagnosis Support
Hospitals and clinics generate massive amounts of unstructured text data such as:
These contain valuable information but are difficult to analyze manually.
Data Collection
Preprocessing
Text Mining Techniques
Business/Healthcare Impact
IBM Watson Health: Uses text mining to extract meaningful insights from clinical notes for diagnosis support.
Mount Sinai Hospital (New York): Applied NLP to electronic health records (EHRs) to predict heart failure risk earlier than traditional methods.
Finance: Detecting Fraud or Analyzing News Sentiment for Stock Prediction
Financial institutions handle enormous volumes of unstructured data:
This data contains hidden signals for fraud detection and investment prediction.
Fraud Detection (Credit Cards & Transactions)
Data Sources: transaction descriptions, merchant names, customer complaint notes
Techniques:
Impact: Real-time fraud alerts, reduced financial losses
News Sentiment for Stock Prediction
Data Sources: news headlines, financial articles, Twitter posts
Techniques:
Impact: Helps traders forecast price direction and build sentiment-driven trading strategies
JPMorgan Chase 🏦
Bloomberg Terminal & Reuters 📰
S&P Global Market Intelligence 📈
Education & Research: Summarizing Articles, Plagiarism Detection, or Learning Analytics 🎓📚
Universities and researchers deal with massive amounts of unstructured text:
Text mining makes it possible to process and analyze this information efficiently.
Summarizing Articles
Plagiarism Detection
Learning Analytics
Turnitin → Plagiarism detection across millions of student papers.
Coursera & edX → Analyze forum discussions to improve course design.
Semantic Scholar (Allen Institute for AI) → Uses NLP to summarize and recommend research papers.
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language (spoken or written).
It combines techniques from linguistics, computer science, and machine learning to bridge the gap between human communication and computer understanding.
Key Capabilities of NLP
We often assign a sentiment score to text.
Sentiment values usually range between –1 (very negative) and +1 (very positive).
Sentence:
“The movie was fantastic and inspiring.”
Sentence:
“The service was terrible and disappointing.”
Sentence:
“The food was okay, nothing special.”
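The three example sentences above can be scored with a very small lexicon-based sketch. The word lists and the `sentimentScore` function are illustrative assumptions, not a real sentiment resource: the score is simply (positive matches − negative matches) divided by the total number of matches, which keeps it between –1 and +1.

```javascript
// Illustrative mini-lexicon (an assumption for this sketch, not a real resource)
const POSITIVE = new Set(["fantastic", "inspiring", "great", "good", "love"]);
const NEGATIVE = new Set(["terrible", "disappointing", "bad", "awful", "hate"]);

// Returns a score in [-1, +1]: (pos - neg) / (pos + neg), or 0 when nothing matches
function sentimentScore(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) ?? [];
  let pos = 0, neg = 0;
  for (const w of words) {
    if (POSITIVE.has(w)) pos++;
    else if (NEGATIVE.has(w)) neg++;
  }
  const matched = pos + neg;
  return matched === 0 ? 0 : (pos - neg) / matched;
}

console.log(sentimentScore("The movie was fantastic and inspiring."));      // 1
console.log(sentimentScore("The service was terrible and disappointing.")); // -1
console.log(sentimentScore("The food was okay, nothing special."));         // 0
```

Real tools use much larger lexicons or trained models, so their scores are rarely exactly –1, 0, or +1.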
Task: Classify text into broad categories → Positive, Negative, or Neutral.
Task: Break down sentiment into levels of polarity.
Very Positive: 😍 / 🤩 / 🥳 / ⭐⭐⭐⭐⭐
Positive: 🙂 / 😊 / ⭐⭐⭐⭐
Neutral: 😐 / 😶 / ⭐⭐⭐
Negative: 🙁 / 😟 / ⭐⭐
Very Negative: 😡 / 😠 / 😭 / ⭐
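One way to obtain these five levels is to cut a score in [–1, +1] into bands. The cut-off values and the `polarityLevel` name below are assumptions chosen for illustration; in practice they would be tuned on labelled data.

```javascript
// Cut-offs are illustrative assumptions, not standard values
function polarityLevel(score) {
  if (score >= 0.6) return { label: "Very Positive", stars: 5 };
  if (score >= 0.2) return { label: "Positive", stars: 4 };
  if (score > -0.2) return { label: "Neutral", stars: 3 };
  if (score > -0.6) return { label: "Negative", stars: 2 };
  return { label: "Very Negative", stars: 1 };
}

// Print one example per band
for (const s of [0.9, 0.3, 0, -0.4, -0.8]) {
  const { label, stars } = polarityLevel(s);
  console.log(s, label, "⭐".repeat(stars));
}
```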
Sentence:
Task: Use NLP and psychology-based models to classify emotions.
Looks at specific aspects/features of a product or service.
Task: Identify what part of the product/service the sentiment refers to.
(async () => {
// ========== SHELL ==========
const box = html`<div style="max-width:1200px;font:14px system-ui;">
<style>
.grid { display:grid; grid-template-columns: 340px 1fr; gap:16px; }
.card { background:#fff; border:1px solid #ddd; border-radius:10px; padding:12px; }
.row { display:flex; gap:10px; align-items:center; flex-wrap:wrap; }
.pill { display:inline-block; padding:2px 8px; border-radius:999px; font:12px system-ui; border:1px solid #ddd; }
.tok { padding:1px 4px; border-radius:6px; margin:2px 3px; display:inline-block; }
.tok.pos { background:#e7f6ec; border:1px solid #b8e0c6; }
.tok.neg { background:#fde7e7; border:1px solid #f6bcbc; }
.tok.neu { background:#f1f1f1; border:1px solid #e2e2e2; }
.bar { height:12px; background:#eee; border-radius:999px; overflow:hidden; }
.bar > div { height:100%; }
.mono { font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; }
textarea { width:100%; min-height:120px; font:13px/1.4 system-ui; }
.hint { font-size:12px; color:#666; }
table.dict { width:100%; border-collapse:collapse; }
table.dict th, table.dict td { border-bottom:1px solid #eee; padding:6px 4px; text-align:left; }
table.dict th { border-bottom:1px solid #ddd; }
.badge { font:11px system-ui; border:1px solid #ddd; padding:1px 6px; border-radius:999px; }
</style>
<div class="grid">
<div class="card">
<div class="row"><span class="pill">Sentiment Lab</span></div>
<div style="margin-top:10px">
<label>Language</label>
<div class="row" id="langRow"></div>
<label style="display:block;margin-top:8px">Mode</label>
<div class="row" id="modeRow"></div>
<div id="singleWrap" style="margin-top:8px"></div>
<div id="batchWrap" style="display:none; margin-top:8px"></div>
<div style="margin-top:10px">
<b>Parameters</b>
<div class="row" id="paramRow"></div>
<div style="margin-top:6px" id="threshRow"></div>
</div>
<div style="margin-top:8px" id="btnRow"></div>
<div class="hint" style="margin-top:6px">Tip: Batch mode → one sentence per line; threshold controls what’s considered Neutral.</div>
</div>
</div>
<div class="card">
<div class="row"><span class="pill">Results</span></div>
<div id="summary" style="margin-top:8px"></div>
<div id="viz" style="margin-top:10px"></div>
</div>
</div>
<div class="card" style="margin-top:16px">
<div class="row" style="justify-content:space-between; align-items:center;">
<span class="pill">Detectable Terms Dictionary</span>
<div id="dictCtrl" class="row"></div>
</div>
<div id="dictTable" style="margin-top:8px"></div>
</div>
</div>`;
// ---------- Controls ----------
const langSel = Inputs.radio(["English","Thai"], {value:"English"});
const modeSel = Inputs.radio(["Single text","Batch (one per line)"], {value:"Single text"});
const textSingle= Inputs.textarea({label:"Text", value:"I absolutely love this! But the battery isn't great.", rows:6});
const textBatch = Inputs.textarea({label:"Lines", value:"I love this so much!\nThis is not good at all.\nใช้งานง่ายมากเลย ชอบ!\nไม่ค่อยดีเท่าไหร่", rows:8});
const negToggle = Inputs.toggle({label:"Negation (not/ไม่)", value:true});
const intToggle = Inputs.toggle({label:"Intensifiers (!!, very, มาก)", value:true});
const emojiToggle = Inputs.toggle({label:"Emoji cues 🙂😢", value:true});
const threshRange = Inputs.range([0, 1], {label:"Neutral zone ±", step:0.05, value:0.15});
const analyzeBtn = Inputs.button("Analyze");
box.querySelector("#langRow").append(langSel);
box.querySelector("#modeRow").append(modeSel);
box.querySelector("#singleWrap").append(textSingle);
box.querySelector("#batchWrap").append(textBatch);
box.querySelector("#paramRow").append(negToggle, intToggle, emojiToggle);
box.querySelector("#threshRow").append(threshRange);
box.querySelector("#btnRow").append(analyzeBtn);
function syncMode() {
const singleWrap = box.querySelector("#singleWrap");
const batchWrap = box.querySelector("#batchWrap");
const mode = modeSel.value;
singleWrap.style.display = (mode==="Single text") ? "" : "none";
batchWrap.style.display = (mode==="Batch (one per line)") ? "" : "none";
}
modeSel.addEventListener("input", syncMode);
syncMode();
const summary = box.querySelector("#summary");
const viz = box.querySelector("#viz");
// ========== LEXICONS (Expanded) ==========
const LEX_EN = {
pos: [
"love","great","good","amazing","awesome","nice","happy","excellent","like","fantastic","cool",
"wonderful","brilliant","delightful","impressive","superb","satisfying","pleasant","lovely","marvelous"
],
neg: [
"bad","terrible","awful","hate","worse","worst","boring","slow","buggy","disappoint","poor",
"horrible","mediocre","useless","unreliable","frustrating","laggy","expensive","crash","broken"
]
};
const LEX_TH = {
pos: [
"ชอบ","ดีมาก","ดี","เยี่ยม","สุดยอด","ประทับใจ","โอเค","ง่าย","เจ๋ง","สุดยอดมาก","รัก",
"ประเสริฐ","โอเคมาก","แจ่ม","ปัง","เริ่ด","คุ้มค่า","ประหยัดเวลา","น่าพอใจ","สวยงาม"
],
neg: [
"แย่","ไม่ดี","แย่มาก","ช้า","ห่วย","ผิดหวัง","น่าเบื่อ","งง","เลว","โคตรแย่","พัง",
"หงุดหงิด","ปวดหัว","ห่วยแตก","แพง","ใช้งานไม่ได้","บั๊ก","หลุด","ค้าง","ล้มเหลว"
]
};
// Emoji cues
const EMOJI = {
"😀":2, "🙂":1.5, "😍":2.5, "😂":1.5, "😢":-2, "😡":-2.5, "😭":-2.5, "👍":1.5, "👎":-1.5,
"🔥":1.5, "💔":-1.5, "✨":1.2, "🤩":2.0, "🤮":-2.0
};
// Intensifiers / Negations
const BOOST_EN = new Set(["very","really","so","extremely","super","highly","truly","incredibly","insanely"]);
const BOOST_TH = new Set(["มาก","มากๆ","สุดๆ","โคตร","สุดสุด","อย่างยิ่ง","สุดยอด"]);
const NEG_EN = new Set(["not","no","never","n't"]);
const NEG_TH = new Set(["ไม่","ไม่ได้","ไม่มี","มิได้","มิใช่","ไม่มีทาง"]);
// ========== Core helpers ==========
const SCORE = (w, lang) => {
const L = (lang==="Thai") ? LEX_TH : LEX_EN;
if (L.pos.includes(w)) return +3;
if (L.neg.includes(w)) return -3;
return 0;
};
function tokenize(text, lang){
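if (lang==="Thai"){
const clean = text.replace(/[.,!?()";:]/g, " ");
// Thai is written without spaces between words, so prefer the built-in
// word segmenter when the runtime provides it; otherwise fall back to whitespace
if (typeof Intl !== "undefined" && Intl.Segmenter){
const seg = new Intl.Segmenter("th", {granularity:"word"});
return Array.from(seg.segment(clean)).filter(s=>s.isWordLike).map(s=>s.segment);
}
return clean.split(/\s+/).filter(Boolean);
}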
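// normalize curly apostrophes and split "!" off words so that "good!" still matches the lexicon
return text.toLowerCase().replace(/’/g,"'").replace(/!+/g," $& ")
.replace(/[^a-z0-9\s'!🙂😀😍😂😢😡😭👍👎🔥💔✨🤩🤮]/g," ")
.split(/\s+/).filter(Boolean);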
}
function analyzeOne(text, lang, opts){
const negSet = (lang==="Thai") ? NEG_TH : NEG_EN;
const boostSet = (lang==="Thai") ? BOOST_TH : BOOST_EN;
const emojis = Array.from(text).filter(c=> EMOJI[c]);
const toks = tokenize(text, lang);
const rows = [];
// simple negation scope (next 1–2 tokens)
let i=0;
while(i<toks.length){
const t = toks[i];
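// contractions such as "isn't" or "don't" never equal "n't" exactly, so test the suffix as well
const tNorm = t.replace(/[’']/g,"'");
const isNeg = opts.neg && (negSet.has(tNorm) || (lang!=="Thai" && tNorm.endsWith("n't")));
if (isNeg){
const nextN = Math.min(2, toks.length - i - 1);
// push the negator first so rows stay in reading order
rows.push({tok:t, base:0, contrib:0, effect:"negator"});
for (let k=1; k<=nextN; k++){
const w = toks[i+k]; const s = SCORE(w, lang);
rows.push({tok:w, base:s, contrib: s ? -s : 0, effect:"negation"});
}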
i += (1 + nextN);
continue;
}
// base + intensifier (look-behind)
let s = SCORE(t, lang);
const prev = toks[i-1];
if (opts.intens && prev && boostSet.has(prev)){
s = s ? s*1.5 : 0;
}
rows.push({tok:t, base:s, contrib:s, effect: s? "base":"none"});
i++;
}
// emoji contribution
if (opts.emoji && emojis.length){
const emoSum = emojis.reduce((acc,e)=> acc + (EMOJI[e]||0), 0);
rows.push({tok: emojis.join(""), base:emoSum, contrib:emoSum, effect:"emoji"});
}
const raw = rows.reduce((a,b)=> a + (b.contrib||0), 0);
return { rows, raw };
}
function labelFromScore(s, nz){
if (s > nz) return ["Positive","#26a269"];
if (s < -nz) return ["Negative","#c01c28"];
return ["Neutral","#777"];
}
function renderSingle(text, lang, opts, neutralZone){
const {rows, raw} = analyzeOne(text, lang, opts);
const [lab, col] = labelFromScore(raw, neutralZone);
summary.innerHTML = `
<div class="row">
<div class="pill">Language: <b>${lang}</b></div>
<div class="pill">Score: <b class="mono">${raw.toFixed(2)}</b></div>
<div class="pill">Label: <b style="color:${col}">${lab}</b></div>
</div>
<div style="margin-top:6px" class="bar">
<div style="width:${Math.max(0, Math.min(100,(raw+4)/8*100))}%; background:${col}"></div>
</div>
<div class="hint" style="margin-top:6px">Score range ~[-4, 4]. Neutral if |score| ≤ ${neutralZone}.</div>
`;
const data = rows.filter(r=>r.tok.trim().length)
.map((r,i)=>({i, tok:r.tok, contrib:r.contrib||0, effect:r.effect, sign: Math.sign(r.contrib||0)}));
viz.innerHTML = "";
const W = 820, H = 280;
const fig1 = Plot.plot({
width: W, height: H, grid: true,
x: {label: "token index"},
y: {label: "contribution", domain: [-4.5,4.5]},
marks: [
Plot.ruleY([0]),
Plot.barY(data, {x:"i", y:"contrib", fill: d => d.sign>0 ? "#42b883" : (d.sign<0 ? "#e76f51" : "#bbb")}),
Plot.text(data, {x:"i", y:d=>d.contrib>0? d.contrib+0.15 : d.contrib-0.15, text:"tok", fontSize:11, textAnchor:"middle"})
]
});
const line = document.createElement("div");
line.style.marginTop = "8px";
for (const r of data){
const cls = r.contrib>0 ? "pos" : (r.contrib<0 ? "neg" : "neu");
const tip = `${r.tok} (${(r.contrib||0).toFixed(2)} ${r.effect})`;
line.insertAdjacentHTML("beforeend", `<span class="tok ${cls}" title="${tip}">${r.tok}</span>`);
}
viz.append(fig1, line);
}
function renderBatch(lines, lang, opts, neutralZone){
const rows = [];
for (const s of lines){
const {raw} = analyzeOne(s, lang, opts);
const [lab] = labelFromScore(raw, neutralZone);
rows.push({text:s, score:raw, label:lab});
}
summary.innerHTML = `
<div class="row">
<div class="pill">Language: <b>${lang}</b></div>
<div class="pill">Samples: <b class="mono">${rows.length}</b></div>
</div>
<div class="hint" style="margin-top:6px">Neutral if |score| ≤ ${neutralZone}. Drag the threshold in the sidebar to see label flips.</div>
`;
viz.innerHTML = "";
const W = 820, H = 260;
const figH = Plot.plot({
width: W, height: H, grid:true,
x: {label:"score"},
y: {label:"count"},
marks: [
Plot.rectY(rows, Plot.binY({y:"count"}, {x:"score", thresholds:16})),
Plot.ruleX([neutralZone, -neutralZone], {stroke:"#999", strokeDasharray:"4,4"}),
Plot.text([`+${neutralZone}`], {x:neutralZone, y:0, dy:-8}),
Plot.text([`-${neutralZone}`], {x:-neutralZone, y:0, dy:-8})
]
});
// table
const tbl = html`<table style="width:100%; border-collapse:collapse; margin-top:8px;">
<thead><tr>
<th style="border-bottom:1px solid #ddd; text-align:left">Text</th>
<th style="border-bottom:1px solid #ddd; text-align:right">Score</th>
<th style="border-bottom:1px solid #ddd; text-align:left">Label</th>
</tr></thead>
<tbody></tbody>
</table>`;
const tb = tbl.querySelector("tbody");
for (const r of rows){
const [_, col] = labelFromScore(r.score, neutralZone);
tb.insertAdjacentHTML("beforeend",
`<tr>
<td style="border-bottom:1px solid #eee; padding:4px 0">${r.text.replace(/</g,"&lt;")}</td>
<td class="mono" style="border-bottom:1px solid #eee; text-align:right">${r.score.toFixed(2)}</td>
<td style="border-bottom:1px solid #eee; color:${col}">${r.label}</td>
</tr>`);
}
viz.append(figH, tbl);
}
function run(){
const lang = langSel.value;
const mode = modeSel.value;
const opts = {
neg: !!negToggle.value,
intens: !!intToggle.value,
emoji: !!emojiToggle.value
};
const nz = +threshRange.value;
if (mode === "Single text"){
renderSingle(textSingle.value || "", lang, opts, nz);
} else {
const lines = (textBatch.value || "").split(/\r?\n/).map(s=>s.trim()).filter(Boolean);
renderBatch(lines, lang, opts, nz);
}
}
analyzeBtn.addEventListener("click", run);
run(); // initial
// ======== Dictionary Table (Show/Hide + CSV export) ========
const dictTable = box.querySelector("#dictTable");
const dictCtrl = box.querySelector("#dictCtrl");
const showTblToggle = Inputs.toggle({ label: "Show table", value: false });
const langFilter = Inputs.radio(["All","English","Thai"], {value:"All"});
const typeFilter = Inputs.select(["All","positive","negative","intensifier","negation","emoji"], {value:"All"});
const copyBtn = Inputs.button("Copy CSV");
const downloadBtn= Inputs.button("Download CSV");
dictCtrl.append(
showTblToggle,
html`<span class="badge">Filter:</span>`,
langFilter,
typeFilter,
copyBtn,
downloadBtn
);
function buildDictRows(){
const rows = [];
// EN
for (const w of LEX_EN.pos) rows.push({language:"English", type:"positive", term:w, value:3});
for (const w of LEX_EN.neg) rows.push({language:"English", type:"negative", term:w, value:-3});
for (const w of BOOST_EN) rows.push({language:"English", type:"intensifier", term:w, value:"×1.5"});
for (const w of NEG_EN) rows.push({language:"English", type:"negation", term:w, value:"flip"});
for (const [emo,sc] of Object.entries(EMOJI)) rows.push({language:"English", type:"emoji", term:emo, value:sc});
// TH
for (const w of LEX_TH.pos) rows.push({language:"Thai", type:"positive", term:w, value:3});
for (const w of LEX_TH.neg) rows.push({language:"Thai", type:"negative", term:w, value:-3});
for (const w of BOOST_TH) rows.push({language:"Thai", type:"intensifier", term:w, value:"×1.5"});
for (const w of NEG_TH) rows.push({language:"Thai", type:"negation", term:w, value:"flip"});
return rows;
}
function renderDict(){
const lf = langFilter.value;
const tf = typeFilter.value;
const all = buildDictRows().filter(r =>
(lf==="All" || r.language===lf) &&
(tf==="All" || r.type===tf)
);
const tbl = html`<table class="dict">
<thead>
<tr>
<th>Language</th><th>Type</th><th>Term</th><th>Value</th>
</tr>
</thead>
<tbody></tbody>
</table>`;
const tb = tbl.querySelector("tbody");
for (const r of all){
tb.insertAdjacentHTML("beforeend",
`<tr>
<td>${r.language}</td>
<td>${r.type}</td>
<td class="mono">${r.term.replace(/</g,"&lt;")}</td>
<td class="mono">${r.value}</td>
</tr>`);
}
dictTable.innerHTML = "";
dictTable.append(tbl);
}
function toCSV(rows){
const header = ["language","type","term","value"];
const esc = v => `"${String(v).replace(/"/g,'""')}"`;
const lines = [header.map(esc).join(",")].concat(
rows.map(r => [esc(r.language),esc(r.type),esc(r.term),esc(r.value)].join(","))
);
return lines.join("\n");
}
copyBtn.addEventListener("click", async () => {
const lf = langFilter.value;
const tf = typeFilter.value;
const rows = buildDictRows().filter(r =>
(lf==="All" || r.language===lf) &&
(tf==="All" || r.type===tf)
);
const csv = toCSV(rows);
try {
await navigator.clipboard.writeText(csv);
copyBtn.textContent = "Copied!";
setTimeout(()=> copyBtn.textContent = "Copy CSV", 1000);
} catch {
alert(csv); // fallback
}
});
downloadBtn.addEventListener("click", () => {
const lf = langFilter.value;
const tf = typeFilter.value;
const rows = buildDictRows().filter(r =>
(lf==="All" || r.language===lf) &&
(tf==="All" || r.type===tf)
);
const csv = toCSV(rows);
const blob = new Blob([csv], {type:"text/csv;charset=utf-8"});
const url = URL.createObjectURL(blob);
const a = document.createElement("a");
const ts = new Date().toISOString().slice(0,19).replace(/[:T]/g,"-");
a.href = url;
a.download = `sentiment-dictionary-${lf}-${tf}-${ts}.csv`;
document.body.appendChild(a);
a.click();
a.remove();
URL.revokeObjectURL(url);
});
function syncDictVisibility(){
const on = !!showTblToggle.value;
dictTable.style.display = on ? "" : "none";
}
showTblToggle.addEventListener("input", syncDictVisibility);
langFilter.addEventListener("input", renderDict);
typeFilter.addEventListener("input", renderDict);
renderDict();
syncDictVisibility();
return box;
})()
Example
I absolutely love this product—super easy to use! 🙂
The app is good, but the battery life is not great.
This update is incredibly fast and really impressive.
It’s not bad, just a bit slow sometimes.
The UX is terrible… I’m so disappointed. 👎
ใช้งานง่ายมาก ชอบฟีเจอร์ใหม่ที่สุด!
ไม่ดีเท่าไหร่ แถมค้างบ่อยๆ จนหงุดหงิด 😡
บริการโอเคนะ แต่ไม่ได้เร็วมาก
ราคาแพงไปนิด แต่คุณภาพก็ดีมากจริงๆ
Nothing special—works as expected.
Tokenization
Lowercasing / Normalization
Stop-word Removal
Stemming
Lemmatization
Punctuation & Special Character Removal
Handling Negations
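The preprocessing steps listed above can be chained into one small function. This is a minimal sketch under stated assumptions: the stop-word and negator lists are tiny illustrative samples, the stemming rule is deliberately crude, and lemmatization is left out because it needs a dictionary. The `preprocess` name and the `NOT_` prefix convention are choices made for this example.

```javascript
// Tiny illustrative lists (assumptions for this sketch)
const STOP_WORDS = new Set(["the", "is", "a", "an", "and", "it", "was", "to", "of"]);
const NEGATORS = new Set(["not", "no", "never"]);

function preprocess(text) {
  // Lowercase, then replace punctuation and special characters with spaces
  const cleaned = text.toLowerCase().replace(/[^a-z\s]/g, " ");
  // Tokenize on whitespace
  const tokens = cleaned.split(/\s+/).filter(Boolean);
  const out = [];
  let negate = false;
  for (const tok of tokens) {
    // Handle negation: remember it and mark the next content word
    if (NEGATORS.has(tok)) { negate = true; continue; }
    // Remove stop words
    if (STOP_WORDS.has(tok)) continue;
    // Crude stemming: strip a trailing "ing", "ed" or "s" from longer words
    const stem = tok.length > 4 ? tok.replace(/(ing|ed|s)$/, "") : tok;
    out.push(negate ? "NOT_" + stem : stem);
    negate = false;
  }
  return out;
}

console.log(preprocess("The battery is not lasting long, and it was disappointing!"));
// [ 'battery', 'NOT_last', 'long', 'disappoint' ]
```

Marking negated words with a prefix keeps "last" and "NOT_last" as different features for later steps.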
Feature Extraction is the process of transforming preprocessed text into numerical vectors that machine learning or deep learning models can understand.
Main Techniques
1. Bag of Words (BoW)
Concept: Represents text by counting how many times each word appears, ignoring grammar and word order.
Pros: Simple, easy to implement.
Cons: Loses word context, results in sparse data.
Example:
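A minimal sketch of Bag of Words on two short, made-up reviews: build a shared vocabulary, then count each vocabulary word per document. The `bagOfWords` helper is a name introduced here for illustration.

```javascript
function bagOfWords(docs) {
  const tokenized = docs.map(d => d.toLowerCase().match(/[a-z]+/g) ?? []);
  // Vocabulary: sorted list of distinct words across all documents
  const vocab = [...new Set(tokenized.flat())].sort();
  // One count vector per document, aligned with `vocab`
  const vectors = tokenized.map(tokens =>
    vocab.map(term => tokens.filter(t => t === term).length)
  );
  return { vocab, vectors };
}

const { vocab, vectors } = bagOfWords(["Great phone, great camera", "Bad phone"]);
console.log(vocab);   // [ 'bad', 'camera', 'great', 'phone' ]
console.log(vectors); // [ [ 0, 1, 2, 1 ], [ 1, 0, 0, 1 ] ]
```

Word order is gone: the first vector only records that "great" occurs twice, not where. Most entries are zero once the vocabulary grows, which is the sparsity noted above.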
2. TF–IDF
Term Frequency – Inverse Document Frequency
Example:
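A small sketch of TF–IDF on three made-up sentences. Several variants of the formula exist; this one assumes tf = (count of term in document) / (document length) and idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. The `tfidf` helper name is illustrative.

```javascript
function tfidf(docs) {
  const tokenized = docs.map(d => d.toLowerCase().match(/[a-z]+/g) ?? []);
  const N = tokenized.length;
  // Document frequency: in how many documents each term occurs
  const df = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) df.set(term, (df.get(term) ?? 0) + 1);
  }
  // Per-document weights: tf (relative frequency) times idf (ln N/df)
  return tokenized.map(tokens => {
    const weights = new Map();
    for (const term of tokens) {
      const tf = tokens.filter(t => t === term).length / tokens.length;
      weights.set(term, tf * Math.log(N / df.get(term)));
    }
    return weights;
  });
}

const weights = tfidf([
  "the camera is great",
  "the battery is bad",
  "the screen is great",
]);
console.log(weights[1]); // weights for "the battery is bad"
```

Here "the" and "is" appear in every document, so their idf is ln(1) = 0 and they drop out, while "battery" and "bad" occur in only one document and receive the highest weights.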
3. Word Embeddings
Example:
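Word embeddings place each word at a point in a vector space so that related words end up close together. The sketch below uses made-up 3-dimensional vectors purely for illustration (real embeddings are learned from large corpora and have far more dimensions) and compares words with cosine similarity.

```javascript
// Made-up vectors for illustration only; not taken from any trained model
const EMBEDDINGS = {
  good:  [0.9, 0.1, 0.3],
  great: [0.8, 0.2, 0.4],
  bad:   [-0.7, 0.1, 0.2],
};

// Cosine similarity: dot product divided by the product of vector lengths
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

console.log(cosineSimilarity(EMBEDDINGS.good, EMBEDDINGS.great)); // close to 1
console.log(cosineSimilarity(EMBEDDINGS.good, EMBEDDINGS.bad));   // negative
```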
4. Contextual Embeddings
Example:
Classification
Input: Raw text (reviews, tweets, news)
⚙ Process: Classification model (Naive Bayes, Logistic Regression, SVM, Neural Net)
Output: Discrete labels (e.g., Positive / Negative / Neutral, Spam / Not Spam)
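As a sketch of the classification step, here is a compact multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The four training texts are invented for illustration, and `trainNaiveBayes` is a name introduced for this example; a real model would be trained on many more labelled examples.

```javascript
function trainNaiveBayes(samples) {
  const tokenize = s => s.toLowerCase().match(/[a-z]+/g) ?? [];
  const docCount = new Map();   // label -> number of training texts
  const wordCount = new Map();  // label -> Map(word -> count)
  const totalWords = new Map(); // label -> total number of words
  const vocab = new Set();
  for (const { text, label } of samples) {
    docCount.set(label, (docCount.get(label) ?? 0) + 1);
    if (!wordCount.has(label)) wordCount.set(label, new Map());
    const counts = wordCount.get(label);
    for (const w of tokenize(text)) {
      counts.set(w, (counts.get(w) ?? 0) + 1);
      totalWords.set(label, (totalWords.get(label) ?? 0) + 1);
      vocab.add(w);
    }
  }
  // Returns the label with the highest log-probability for the given text
  return function predict(text) {
    let best = null, bestScore = -Infinity;
    for (const [label, n] of docCount) {
      let score = Math.log(n / samples.length); // log prior
      for (const w of tokenize(text)) {
        const c = wordCount.get(label).get(w) ?? 0;
        // add-one smoothing avoids log(0) for unseen words
        score += Math.log((c + 1) / ((totalWords.get(label) ?? 0) + vocab.size));
      }
      if (score > bestScore) { best = label; bestScore = score; }
    }
    return best;
  };
}

// Invented toy training set
const predict = trainNaiveBayes([
  { text: "great phone love it", label: "Positive" },
  { text: "excellent camera great screen", label: "Positive" },
  { text: "terrible battery hate it", label: "Negative" },
  { text: "bad screen awful support", label: "Negative" },
]);
console.log(predict("great camera"));  // Positive
console.log(predict("awful battery")); // Negative
```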
Regression
Input: Raw text (reviews, financial news, social media posts)
⚙ Process: Regression model (Linear Regression, Ridge/Lasso, SVR, Neural Networks)
Output: Continuous values (e.g., Predicted Rating = 4.2, Stock Change = –1.5%, Engagement Score = 2000 likes)
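For regression, the output is a number rather than a label. The sketch below fits a straight line by ordinary least squares from one feature (a sentiment score) to a star rating. The five training points are invented toy data, and `fitLine` and `predictRating` are names made up for this example.

```javascript
// Invented toy data: sentiment score -> star rating
const data = [
  { score: -1.0, rating: 1 },
  { score: -0.5, rating: 2 },
  { score:  0.0, rating: 3 },
  { score:  0.5, rating: 4 },
  { score:  1.0, rating: 5 },
];

// Ordinary least squares for a single feature
function fitLine(points) {
  const n = points.length;
  const meanX = points.reduce((s, p) => s + p.score, 0) / n;
  const meanY = points.reduce((s, p) => s + p.rating, 0) / n;
  let sxy = 0, sxx = 0;
  for (const p of points) {
    sxy += (p.score - meanX) * (p.rating - meanY);
    sxx += (p.score - meanX) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

const { slope, intercept } = fitLine(data);
const predictRating = score => intercept + slope * score;
console.log(predictRating(0.6).toFixed(1)); // "4.2"
```

In a full pipeline the single score would be replaced by a feature vector such as TF–IDF weights, but the idea of predicting a continuous value stays the same.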
Clustering
Input: Raw text (customer reviews, research papers, survey responses)
⚙ Process: Clustering model (k-Means, Hierarchical Clustering, DBSCAN, Topic Modeling such as LDA)
Output: Groups of similar texts (e.g., Delivery Issues, Price Concerns, Product Quality)
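Clustering needs no labels. The sketch below runs k-means on five made-up reviews. Two assumptions keep it small: each review is turned into a 2-number feature vector by counting hand-picked delivery and price keywords (a real pipeline would use Bag of Words or TF–IDF vectors), and the first k points are used as starting centroids so the result is repeatable.

```javascript
// Hand-picked keyword groups (an assumption for this sketch)
const DELIVERY = new Set(["delivery", "late", "shipping"]);
const PRICE = new Set(["price", "expensive", "cheap"]);

function featurize(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) ?? [];
  return [words.filter(w => DELIVERY.has(w)).length, words.filter(w => PRICE.has(w)).length];
}

function kMeans(points, k, maxIter = 20) {
  // Deterministic start: the first k points become the initial centroids
  let centroids = points.slice(0, k).map(p => [...p]);
  let labels = new Array(points.length).fill(-1);
  const dist2 = (a, b) => a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0);
  for (let iter = 0; iter < maxIter; iter++) {
    // Assignment step: nearest centroid for every point
    const next = points.map(p => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
      }
      return best;
    });
    if (next.every((l, i) => l === labels[i])) break; // converged
    labels = next;
    // Update step: move each centroid to the mean of its members
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (!members.length) return old;
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return labels;
}

const reviews = [
  "Delivery was late",
  "Too expensive for the price",
  "Shipping was late again",
  "Cheap price, good deal",
  "Late delivery and a high price",
];
const clusters = kMeans(reviews.map(featurize), 2);
console.log(clusters); // [ 0, 1, 0, 1, 0 ]
```

Cluster 0 gathers the delivery-related reviews and cluster 1 the price-related ones; naming the groups ("Delivery Issues", "Price Concerns") is still a human step.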
After preprocessing, feature extraction, and classification, the system produces results that can be interpreted and visualized.
Key Outputs
Sentiment Label
Sentiment Score / Probability
A numeric value representing sentiment intensity.
Range: –1.0 (very negative) to +1.0 (very positive).
Example:
Aspect-Based Sentiment
Sentiment toward specific product features.
Example:
“The phone’s camera is great but the battery is bad”
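The sentence above can be analysed with a simple rule-based sketch: split the text into clauses, find the aspect mentioned in each clause, and score the opinion words in that clause. The aspect list, the opinion lexicon, and the `aspectSentiment` name are all assumptions for illustration; production systems typically learn aspects and opinions from data.

```javascript
// Hypothetical aspect list and opinion lexicon for this sketch
const ASPECTS = ["camera", "battery", "screen", "price"];
const OPINIONS = new Map([
  ["great", 1], ["good", 1], ["excellent", 1],
  ["bad", -1], ["poor", -1], ["terrible", -1],
]);

function aspectSentiment(text) {
  const result = {};
  // Split into clauses at "but", "and", commas and semicolons
  const clauses = text.toLowerCase().split(/\bbut\b|\band\b|[,;]/);
  for (const clause of clauses) {
    const words = clause.match(/[a-z]+/g) ?? [];
    const aspect = words.find(w => ASPECTS.includes(w));
    if (!aspect) continue; // clause mentions no known aspect
    const score = words.reduce((sum, w) => sum + (OPINIONS.get(w) ?? 0), 0);
    result[aspect] = score > 0 ? "Positive" : score < 0 ? "Negative" : "Neutral";
  }
  return result;
}

console.log(aspectSentiment("The phone's camera is great but the battery is bad"));
// { camera: 'Positive', battery: 'Negative' }
```

A single overall score would average these two opinions away; the aspect-level result keeps them separate.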
Pie Charts
Bar Charts
Time-Series Plots
Word Clouds
Dashboards
(async () => {
// ---- Load Plot (with fallback) ----
let Plot;
try {
Plot = await require("@observablehq/plot@0.6.17");
} catch (err) {
const m = await import("https://esm.sh/@observablehq/plot@0.6?bundle");
Plot = m.default || m;
}
// ---- Shell & Styles ----
const box = html`<div style="max-width:1400px;margin:0 auto;font:14px system-ui;">
<style>
:root { --bow-border:#111; --bow-muted:#6b7280; }
.layout { display:grid; grid-template-columns: 380px 1fr; gap:18px; align-items:start; }
.card { background:#fff; border:1px solid #e5e7eb; border-radius:12px; padding:12px; }
.row { display:flex; gap:10px; align-items:center; flex-wrap:wrap; }
.label { font-weight:700; font-size:16px; margin-top:4px; }
.hint { color:var(--bow-muted); font-size:12px; }
.textarea { width:100%; min-height:220px; padding:10px 12px; border:3px solid var(--bow-border); border-radius:6px; box-sizing:border-box; font:14px/1.5 system-ui; }
.field input[type=text] { width:100%; padding:8px 10px; border:3px solid var(--bow-border); border-radius:6px; box-sizing:border-box; }
.h-radio { display:flex; gap:14px; align-items:center; flex-wrap:wrap; }
.pill { display:inline-block; padding:2px 8px; border-radius:999px; border:1px solid #ddd; font-size:12px; }
.topk { display:flex; gap:10px; align-items:center; }
.topk input[type=number]{ width:88px; padding:6px 8px; border:3px solid var(--bow-border); border-radius:6px; font:14px system-ui; }
.topk input[type=range]{ width:55%; min-width:260px; }
#plotWrap { min-height:220px; }
table.tbl { width:100%; border-collapse:collapse; }
table.tbl th, table.tbl td { border-bottom:1px solid #eee; padding:6px 8px; text-align:left; }
table.tbl th { border-bottom:1px solid #ddd; }
.mono { font-family: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace; }
details.stopbox > summary { cursor:pointer; user-select:none; font-weight:600; }
textarea.small { width:100%; min-height:72px; font:12px/1.45 system-ui; }
.toolbar { display:flex; gap:8px; align-items:center; flex-wrap:wrap; }
.btn { border:1px solid #d1d5db; background:#f9fafb; border-radius:8px; padding:6px 10px; cursor:pointer; }
.btn:active { transform: translateY(1px); }
.muted { color:#6b7280; }
.empty { color:#666; font-size:13px; }
@media (max-width: 980px){ .layout{ grid-template-columns: 1fr; } .topk input[type=range]{ width:100%; } }
</style>
<div class="layout">
<!-- Left: Controls -->
<div class="card" id="controls">
<div class="row"><span class="pill">Bag-of-Words Lab</span></div>
<div class="label">Input text</div>
<textarea id="txt" class="textarea" spellcheck="false"
placeholder="Type/paste English text here..."></textarea>
<div class="label" style="margin-top:10px;">Normalization</div>
<div id="normRow" class="h-radio"></div>
<div class="label" style="margin-top:10px;">Filter out</div>
<div class="hint">Enter words separated by spaces (e.g., the a an of to and ...)</div>
<div class="field"><input id="stopInp" type="text" placeholder="the a an of to and ..." /></div>
<details class="stopbox" style="margin-top:10px;">
<summary>Stopwords Menu <span class="hint">(quick presets & custom)</span></summary>
<div class="row" style="margin-top:8px; align-items:flex-start;">
<div style="flex:1;">
<div class="hint">Preset EN stopwords</div>
<textarea id="presetEN" class="small"></textarea>
</div>
<div style="flex:1;">
<div class="hint">Custom stopwords (space/comma/newline)</div>
<textarea id="customSW" class="small" placeholder="add your own..."></textarea>
</div>
</div>
<div class="toolbar" style="margin-top:6px;">
<button id="applySW" class="btn">Apply stopwords</button>
<span id="swInfo" class="muted"></span>
</div>
</details>
<div class="label" style="margin-top:10px;">Top-K</div>
<div class="topk">
<input id="kNum" type="number" value="15" min="1" max="200" step="1" />
<input id="kRng" type="range" value="15" min="1" max="200" step="1" />
<span class="muted">words</span>
</div>
<div class="toolbar" style="margin-top:10px;">
<button id="rebuild" class="btn">Rebuild</button>
<button id="reset" class="btn">Reset sample</button>
<span id="status" class="muted"></span>
</div>
</div>
<!-- Right: Results -->
<div class="card">
<div class="row"><span class="pill">Results</span></div>
<div id="plotWrap" style="margin-top:6px;"></div>
<div class="toolbar" style="margin-top:10px;">
<label class="h-radio" style="gap:8px;">
<input id="showTbl" type="checkbox" checked />
<span>Show table</span>
</label>
<button id="copyCSV" class="btn">Copy CSV</button>
<button id="dlCSV" class="btn">Download CSV</button>
<span id="meta" class="muted"></span>
</div>
<div id="tableWrap" style="margin-top:8px;"></div>
</div>
</div>
</div>`;
// ---- Controls refs ----
const txt = box.querySelector("#txt");
const normRow = box.querySelector("#normRow");
const stopInp = box.querySelector("#stopInp");
const presetEN = box.querySelector("#presetEN");
const customSW = box.querySelector("#customSW");
const applySW = box.querySelector("#applySW");
const swInfo = box.querySelector("#swInfo");
const kNum = box.querySelector("#kNum");
const kRng = box.querySelector("#kRng");
const rebuildBtn = box.querySelector("#rebuild");
const resetBtn = box.querySelector("#reset");
const status = box.querySelector("#status");
const plotWrap = box.querySelector("#plotWrap");
const showTbl = box.querySelector("#showTbl");
const copyCSV = box.querySelector("#copyCSV");
const dlCSV = box.querySelector("#dlCSV");
const meta = box.querySelector("#meta");
const tableWrap = box.querySelector("#tableWrap");
// ---- Normalization radios ----
function radio(name, items, value){
const wrap = document.createElement("div");
wrap.className = "h-radio";
items.forEach(v => {
const id = `${name}-${v}`;
const lab = html`<label for="${id}" style="display:inline-flex;gap:6px;align-items:center;">
<input type="radio" name="${name}" id="${id}" value="${v}" ${v===value?'checked':''}/>
<span>${v}</span>
</label>`;
wrap.append(lab);
});
return wrap;
}
const normWidget = radio("norm", ["none","stem","lemma"], "none");
normRow.append(normWidget);
function normValue(){
const el = normWidget.querySelector("input:checked");
return el ? el.value : "none";
}
// ---- Preset EN stopwords ----
const PRESET_EN = "a an the and or of to in on for with at from by is am are was were be been being it its as that this these those not very really so too just have has had do does did can could should would will about into over than then out up down more most less least again only also if when while which who whom what why how all any each other some no yes ever even".split(/\s+/);
presetEN.value = PRESET_EN.join(" ");
// ---- State ----
let STOP = new Set();
let rowsAll = []; // [{word,count}]
let rowsTop = []; // top-K applied
// ---- Sample text & reset ----
const sample = [
"I absolutely love this product. It’s incredibly easy to use and the design is delightful!",
"But the battery is not great, and the app sometimes feels slow.",
"Great value for money and super easy to use; onboarding is confusing in parts.",
"Customer support was helpful and incredibly quick to respond."
].join("\n");
function resetSample(){ txt.value = sample; }
resetSample();
// ---- Utils ----
function tokenize(s){
return (s||"").toLowerCase().match(/[a-z]+/g) ?? [];
}
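function stem(w){
// Naive suffix stripper for teaching purposes (not a full Porter stemmer).
// Each step applies only the FIRST rule that matches, so rules do not cascade
// (previously "interesting" lost "ing", then "est", then "er").
let s = (w||"").toLowerCase().replace(/['’]s?$/, "");
if (s.length <= 3) return s;
const applyFirst = (rules) => {
for (const [re, r] of rules){
const t = s.replace(re, r);
if (t !== s && t.length >= 3){ s = t; return true; }
}
return false;
};
// Step 1: plural endings
applyFirst([[/sses$/, "ss"], [/(x|ch|sh|z)es$/, "$1"], [/ies$/, "y"], [/([^s])s$/, "$1"]]);
// Step 2: one derivational or inflectional suffix (longer patterns first)
const stripped = applyFirst([
[/ational$/, "ate"], [/tional$/, "tion"], [/isation$/, "ize"], [/izer$/, "ize"],
[/fulness$/, "ful"], [/ousness$/, "ous"], [/iveness$/, "ive"],
[/ingly$/, ""], [/edly$/, ""], [/ing$/, ""], [/ed$/, ""],
[/ment$/, ""], [/ness$/, ""], [/able$/, ""], [/ible$/, ""],
[/est$/, ""], [/al$/, ""], [/er$/, ""], [/ly$/, ""]
]);
// Step 3: undo consonant doubling left by a removed suffix (stopped -> stopp -> stop);
// l, s and z are exempt so that e.g. "fall" and "class" keep their endings
if (stripped) s = s.replace(/([b-df-hjkmnp-rtv-y])\1$/, "$1");
return s;
}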
const lemma = (() => {
const irr = new Map(Object.entries({
am:"be", is:"be", are:"be", was:"be", were:"be", been:"be",
has:"have", had:"have", does:"do", did:"do", done:"do",
went:"go", gone:"go", ran:"run", running:"run",
ate:"eat", eaten:"eat", saw:"see", seen:"see",
bought:"buy", brought:"bring", thought:"think",
better:"good", best:"good", worse:"bad", worst:"bad",
children:"child", men:"man", women:"woman",
mice:"mouse", geese:"goose", feet:"foot", teeth:"tooth", people:"person"
}));
return function (w){
if (!w) return "";
let s = String(w).toLowerCase().replace(/['’]s?$/, "");
if (irr.has(s)) return irr.get(s);
if (/(^.{3,})ies$/.test(s)) return s.slice(0, -3) + "y";
if (/(xes|ches|shes|sses|zes)$/.test(s)) return s.slice(0, -2);
if (/s$/.test(s) && !/ss$/.test(s)) s = s.slice(0, -1);
if (/(^.{3,})ied$/.test(s)) return s.slice(0, -3) + "y";
if (/([b-df-hj-np-tv-z])\1ed$/.test(s)) return s.slice(0, -3);
if (/ed$/.test(s) && s.length > 3) s = s.replace(/ed$/, "");
if (/([b-df-hj-np-tv-z])\1ing$/.test(s)) return s.slice(0, -4);
if (/ing$/.test(s) && s.length > 4) s = s.slice(0, -3);
if (/(^.{3,})iest$/.test(s)) return s.slice(0, -4) + "y";
if (/(^.{3,})ier$/.test(s)) return s.slice(0, -3) + "y";
if (/est$/.test(s) && s.length > 4) s = s.slice(0, -3);
if (/er$/.test(s) && s.length > 4) s = s.slice(0, -2);
if (/ly$/.test(s) && s.length > 4) s = s.slice(0, -2);
return irr.get(s) || s;
};
})();
function buildStopSet(){
const manual = (stopInp.value || "").toLowerCase().match(/[a-z]+/g) ?? [];
const preset = (presetEN.value || "").toLowerCase().match(/[a-z]+/g) ?? [];
const custom = (customSW.value || "").toLowerCase().split(/[\s,]+/).filter(Boolean);
STOP = new Set([...preset, ...manual, ...custom]);
swInfo.textContent = `Stopwords loaded: ${STOP.size}`;
}
// ---- Core pipeline ----
function process(){
const tokens = tokenize(txt.value);
const kept = tokens.filter(w => !STOP.has(w));
const norm = normValue();
const final = norm === "stem" ? kept.map(stem)
: norm === "lemma" ? kept.map(lemma)
: kept;
const m = new Map();
for (const w of final) m.set(w, (m.get(w)||0) + 1);
rowsAll = Array.from(m, ([word, count]) => ({word, count}))
.sort((a,b)=> b.count - a.count || a.word.localeCompare(b.word));
const K = +kNum.value || 15;
rowsTop = rowsAll.slice(0, Math.min(K, rowsAll.length));
}
function renderPlot(){
plotWrap.innerHTML = "";
if (!rowsTop.length){
plotWrap.innerHTML = `<div class="empty">No tokens to display. Try removing some stopwords or adding more text.</div>`;
return;
}
const fig = Plot.plot({
width: plotWrap.clientWidth || 800,
height: Math.max(220, rowsTop.length * 26),
marginLeft: 110,
x: { label: "Count →" },
y: { domain: rowsTop.map(d=>d.word) },
marks: [
Plot.barX(rowsTop, {x:"count", y:"word", fill:"#4f46e5"}),
Plot.text(rowsTop, {x:"count", y:"word", text: d=>d.count, dx:6, textAnchor:"start", fill:"#111"})
]
});
plotWrap.append(fig);
}
function renderTable(){
tableWrap.innerHTML = "";
if (!showTbl.checked) { tableWrap.style.display = "none"; return; }
tableWrap.style.display = "";
const tbl = html`<table class="tbl">
<thead><tr><th>word</th><th style="text-align:right">count</th></tr></thead>
<tbody></tbody>
</table>`;
const tb = tbl.querySelector("tbody");
for (const r of rowsAll){
tb.insertAdjacentHTML("beforeend",
`<tr><td class="mono">${r.word}</td><td class="mono" style="text-align:right">${r.count}</td></tr>`
);
}
tableWrap.append(tbl);
}
function syncMeta(){
meta.textContent = `Vocab: ${rowsAll.length} • Showing top-K: ${rowsTop.length}`;
}
function toCSV(rows){
const esc = v => `"${String(v).replace(/"/g,'""')}"`;
return ["word,count"].concat(rows.map(r=>`${esc(r.word)},${r.count}`)).join("\n");
}
function rebuild(){
buildStopSet();
process();
renderPlot();
renderTable();
syncMeta();
status.textContent = "Updated";
setTimeout(()=> status.textContent = "", 800);
}
// ---- Wire up ----
// sync K number/range
function syncKFromNum(){ kRng.value = kNum.value; rebuild(); }
function syncKFromRng(){ kNum.value = kRng.value; rebuild(); }
kNum.addEventListener("input", syncKFromNum);
kRng.addEventListener("input", syncKFromRng);
// auto rebuild on change (debounced)
const debounce = (fn, ms=120)=>{ let t; return (...a)=>{ clearTimeout(t); t=setTimeout(()=>fn(...a),ms); }; };
const schedule = debounce(rebuild, 120);
txt.addEventListener("input", schedule);
stopInp.addEventListener("input", schedule);
normWidget.addEventListener("input", rebuild);
applySW.addEventListener("click", rebuild);
showTbl.addEventListener("change", renderTable);
rebuildBtn.addEventListener("click", rebuild);
resetBtn.addEventListener("click", () => { resetSample(); rebuild(); });
window.addEventListener("resize", debounce(()=>{ renderPlot(); }, 80), {passive:true});
// CSV actions
copyCSV.addEventListener("click", async () => {
const csv = toCSV(rowsAll);
try { await navigator.clipboard.writeText(csv); copyCSV.textContent = "Copied!"; setTimeout(()=>copyCSV.textContent="Copy CSV",800); }
catch { alert(csv); }
});
dlCSV.addEventListener("click", () => {
const csv = toCSV(rowsAll);
const blob = new Blob([csv], {type:"text/csv;charset=utf-8"});
const url = URL.createObjectURL(blob);
const a = document.createElement("a");
const ts = new Date().toISOString().slice(0,19).replace(/[:T]/g,"-");
a.href = url; a.download = `bag_of_words-${ts}.csv`; document.body.appendChild(a); a.click(); a.remove();
URL.revokeObjectURL(url);
});
// ---- First build ----
rebuild();
return box;
})()

(async () => {
const box = html`<div style="max-width:1200px;font:14px system-ui;">
<style>
.layout { display:grid; grid-template-columns: 360px 1fr; gap:16px; }
.card { background:#fff; border:1px solid #e5e7eb; border-radius:12px; padding:12px; }
.row { display:flex; gap:10px; align-items:center; flex-wrap:wrap; }
.pill { display:inline-block; padding:2px 8px; border-radius:999px; font:12px system-ui; border:1px solid #ddd; }
#cloudWrap { position:relative; width:100%; min-height:560px; border:1px dashed #ddd; border-radius:12px; overflow:hidden; background:#fafafa; }
.token { position:absolute; cursor:pointer; user-select:none; white-space:nowrap; transition: transform .06s ease-out, opacity .2s; }
.token:hover { outline:1px dashed rgba(0,0,0,.25); outline-offset:2px; }
.kwic { font:13px/1.5 system-ui; }
.kwic b { background: #fff3b0; padding:0 2px; border-radius:3px; }
.hint { color:#666; font-size:12px; }
.empty { position:absolute; inset:0; display:flex; align-items:center; justify-content:center; color:#666; font-size:13px; }
details.stopbox > summary { cursor:pointer; user-select:none; }
textarea.small { width:100%; min-height:80px; font:12px/1.4 system-ui; }
.badge { font:11px system-ui; border:1px solid #ddd; padding:1px 6px; border-radius:999px; }
</style>
<div class="layout">
<div class="card">
<div class="row"><span class="pill">Word Cloud Controls</span></div>
<div style="margin-top:8px">
<label>Language</label>
<div class="row" id="langRow"></div>
<label style="display:block;margin-top:8px">N-gram</label>
<div class="row" id="ngRow"></div>
<div style="margin-top:8px" id="txtRow"></div>
<div style="margin-top:8px">
<b>Options</b>
<div class="row" id="optRow"></div>
<div style="margin-top:6px" id="rngRow"></div>
<div style="margin-top:6px" id="rngRow2"></div>
</div>
<details class="stopbox" style="margin-top:10px">
<summary><b>Stopwords Menu</b> <span class="hint">(manage function words and connectives such as is, am, are, the, a, etc.)</span></summary>
<div style="margin-top:8px" id="stopCtrl"></div>
</details>
<div class="row" style="margin-top:10px" id="btnRow"></div>
<div class="hint" style="margin-top:6px">
Tip: if the cloud is too crowded, try lowering Max words or raising Min font.
</div>
</div>
</div>
<div class="card">
<div class="row"><span class="pill">Interactive Word Cloud</span></div>
<div id="stats" style="margin-top:6px"></div>
<div id="cloudWrap" style="margin-top:8px"></div>
</div>
</div>
<div class="card" style="margin-top:16px">
<div class="row"><span class="pill">KWIC (Key Word In Context)</span></div>
<div id="kwic" class="kwic" style="margin-top:8px"></div>
</div>
</div>`;
// -------- Controls --------
const langSel = Inputs.radio(["English","Thai"], {value:"English"});
const ngSel = Inputs.radio(["unigram","bigram","trigram"], {value:"unigram"});
const sampleTxt = [
"I absolutely love this product. It’s incredibly easy to use and the design is delightful!",
"But the battery is not great, and the app sometimes feels slow.",
"บริการรวดเร็วมาก ทีมงานตอบไว ประทับใจสุด ๆ ใช้งานง่าย",
"แต่ราคาค่อนข้างแพง และบางครั้งก็มีอาการค้าง ไม่ค่อยดีเท่าไหร่"
].join("\n");
const txtArea = Inputs.textarea({label:"Text", value: sampleTxt, rows:12});
const caseFold = Inputs.toggle({label:"Lowercase (EN)", value:true});
const rmStop = Inputs.toggle({label:"Remove stopwords", value:true});
const stripShort= Inputs.toggle({label:"Remove short tokens (≤2 letters)", value:false});
const rotateOpt = Inputs.select(["none","±30°","±60°","random"], {label:"Rotation", value:"±30°"});
const scaleSel = Inputs.select(["sqrt","linear","log"], {label:"Size scale", value:"sqrt"});
const maxWords = Inputs.range([50, 2000], {label:"Max words", value:300, step:50}); // can be raised up to 2,000
const minFreq = Inputs.range([1, 50], {label:"Min frequency", value:1, step:1});
const minFont = Inputs.range([8, 36], {label:"Min font (px)", value:12, step:1});
const maxFont = Inputs.range([28, 160], {label:"Max font (px)", value:80, step:2});
const runBtn = Inputs.button("Build cloud");
const shuffleBtn = Inputs.button("Shuffle layout");
box.querySelector("#langRow").append(langSel);
box.querySelector("#ngRow").append(ngSel);
box.querySelector("#txtRow").append(txtArea);
box.querySelector("#optRow").append(caseFold, rmStop, stripShort, rotateOpt, scaleSel);
box.querySelector("#rngRow").append(maxWords, minFreq);
box.querySelector("#rngRow2").append(minFont, maxFont);
box.querySelector("#btnRow").append(runBtn, shuffleBtn);
const stats = box.querySelector("#stats");
const cloudWrap = box.querySelector("#cloudWrap");
const kwicBox = box.querySelector("#kwic");
// -------- Stopwords Base & Menu --------
const STOP_EN_BASE = new Set(("a,an,the,and,or,of,to,in,on,for,with,at,from,by,is,am,are,was,were,be,been,being,it,its,as,that,this,these,those,not,very,really,so,too,just,have,has,had,do,does,did,can,could,should,would,will,about,into,over,than,then,out,up,down,more,most,less,least,again,only,also,if,when,while,which,who,whom,what,why,how,all,any,each,other,some,no,yes,ever,even").split(","));
const STOP_TH_BASE = new Set(("และ,หรือ,ของ,ที่,ได้,ใน,บน,ให้,กับ,จาก,ว่า,ก็,ค่ะ,ครับ,นะ,น่ะ,เลย,มาก,สุดๆ,ๆ,ก็ได้,อีก,ยัง,จึง,เพราะ,แต่,เมื่อ,ซึ่ง,คือ,เป็น,ได้ว่า,ได้ไหม,โดย,อยู่,ไป,มา,แล้ว,ด้วย,หรือไม่,ไม่,ไม่ได้").split(","));
const useDefaultEN = Inputs.toggle({label:"Use default EN stopwords", value:true});
const useDefaultTH = Inputs.toggle({label:"Use default TH stopwords", value:true});
const customEN = Inputs.textarea({label:"Custom EN stopwords (comma/space/newline)", rows:4, value:""});
const customTH = Inputs.textarea({label:"Custom TH stopwords (comma/space/newline)", rows:4, value:""});
const applyStop = Inputs.button("Apply stopwords");
const stopCtrl = box.querySelector("#stopCtrl");
stopCtrl.append(
html`<div class="row"><span class="badge">English</span></div>`,
useDefaultEN, customEN,
html`<div class="row" style="margin-top:6px"><span class="badge">Thai</span></div>`,
useDefaultTH, customTH,
html`<div class="row" style="margin-top:8px">${applyStop}</div>`,
html`<div class="hint" id="stopInfo" style="margin-top:6px"></div>`
);
const stopInfo = box.querySelector("#stopInfo");
function parseCustomList(text){
return new Set(text.split(/[\s,]+/).map(s=>s.trim()).filter(Boolean));
}
let STOP_EN = new Set(STOP_EN_BASE);
let STOP_TH = new Set(STOP_TH_BASE);
function rebuildStop(){
const addEN = parseCustomList(customEN.value);
const addTH = parseCustomList(customTH.value);
STOP_EN = new Set(useDefaultEN.value ? [...STOP_EN_BASE, ...addEN] : [...addEN]);
STOP_TH = new Set(useDefaultTH.value ? [...STOP_TH_BASE, ...addTH] : [...addTH]);
stopInfo.innerHTML = `EN stopwords: <b>${STOP_EN.size}</b> • TH stopwords: <b>${STOP_TH.size}</b>`;
}
applyStop.addEventListener("click", () => { rebuildStop(); buildCloud(false); });
rebuildStop();
// -------- Helpers --------
const palette = ["#4E79A7","#F28E2B","#E15759","#76B7B2","#59A14F","#EDC949","#AF7AA1","#FF9DA7","#9C755F","#BAB0AC"];
function extent(arr){
if (!arr.length) return [0,1];
let mn=arr[0], mx=arr[0];
for (const v of arr){ if(v<mn) mn=v; if(v>mx) mx=v; }
return [mn,mx];
}
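// Map a count in `domain` to a font size in `range` using a linear, log, or square-root (default) curve.
function makeScale(kind, domain, range){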
const [d0,d1] = domain, [r0,r1] = range;
if (d1 === d0) return () => (r0 + r1) / 2;
if (kind === "linear"){
const m = (r1-r0)/(d1-d0); return v => r0 + m*(v - d0);
}
if (kind === "log"){
const a = Math.max(1e-9, d0), b = Math.max(a*1.000001, d1);
const la = Math.log(a), lb = Math.log(b);
const m = (r1-r0)/(lb-la); return v => r0 + m*(Math.log(Math.max(a, v)) - la);
}
// sqrt
const m = 1/Math.sqrt(d1 - d0);
return v => r0 + (r1 - r0) * Math.sqrt(Math.max(0, v - d0)) * m;
}
// -------- Tokenization --------
function tokenize(text, lang){
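if (lang === "Thai"){
// Thai is written without spaces between words, so this whitespace split
// yields phrase-level chunks rather than individual words.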
return text.replace(/[“”"(),.!?:;[\]\-—]/g, " ")
.split(/\s+/).map(t=>t.trim()).filter(Boolean);
}
// Case folding is left to freqCount so the "Lowercase (EN)" toggle takes effect.
return text
.replace(/[^A-Za-z0-9\s'-]/g, " ")
.split(/\s+/).map(t=>t.trim()).filter(Boolean);
}
function buildNgrams(tokens, n){
if (n===1) return tokens;
const grams = [];
for (let i=0;i<=tokens.length-n;i++){
grams.push(tokens.slice(i,i+n).join(" "));
}
return grams;
}
// -------- Frequency + filtering --------
function freqCount(text, lang, ngram, doLower, removeStop){
let t = text;
if (doLower && lang==="English") t = t.toLowerCase();
let tokens = tokenize(t, lang);
if (stripShort.value && lang==="English"){
tokens = tokens.filter(w => w.length > 2); // drop very short tokens
}
const n = ngram==="unigram" ? 1 : (ngram==="bigram" ? 2 : 3);
const grams = buildNgrams(tokens, n);
const stop = (lang==="Thai") ? STOP_TH : STOP_EN;
const f = new Map();
for (const g of grams){
if (removeStop && n===1 && stop.has(g)) continue;
if (removeStop && n>1){
const parts = g.split(" ");
if (parts.every(w => stop.has(w))) continue;
}
f.set(g, (f.get(g)||0)+1);
}
return f;
}
// -------- Layout (spiral + collision) --------
// Seeded linear congruential generator (Numerical Recipes constants); returns floats in [0, 1).
function rand(seed=Date.now()){
let s = seed >>> 0;
return function(){
s = Math.imul(1664525, s) + 1013904223 | 0;
return (s>>>0) / 4294967296;
};
}
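// Greedy placement: each word walks outward from the center along an Archimedean spiral
// until its axis-aligned bounding box fits inside W×H without overlapping earlier words.
function placeWords(words, W, H, rotateMode, rng, maxTrials=3500){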
const placed = [];
const ctx = document.createElement("canvas").getContext("2d");
function measure(t, fontPx){
ctx.font = `${Math.round(fontPx)}px system-ui, -apple-system, Segoe UI, Roboto`;
const w = ctx.measureText(t).width;
const h = fontPx;
return [w, h];
}
function pickAngle(){
if (rotateMode==="none") return 0;
if (rotateMode==="±30°") return (rng()<0.5 ? -1 : 1) * (Math.PI/6);
if (rotateMode==="±60°") return (rng()<0.5 ? -1 : 1) * (Math.PI/3);
const deg = [-90,-60,-30,0,30,60,90][Math.floor(rng()*7)];
return deg * Math.PI/180;
}
function collide(r, others){
for (const o of others){
if (r.x + r.w < o.x || o.x + o.w < r.x || r.y + r.h < o.y || o.y + o.h < r.y) continue;
return true;
}
return false;
}
const centerX = W/2, centerY = H/2;
for (const w of words){
const angle = pickAngle();
const [w0, h0] = measure(w.text, w.size);
const cos = Math.cos(angle), sin = Math.sin(angle);
const wRot = Math.abs(w0*cos) + Math.abs(h0*sin);
const hRot = Math.abs(w0*sin) + Math.abs(h0*cos);
let success = false;
for (let t=0; t<maxTrials; t++){
const r = 2 + 4 * (t/20);
const th = t * 0.15;
const x = centerX + r * Math.cos(th) - wRot/2;
const y = centerY + r * Math.sin(th) - hRot/2;
const cand = {x, y, w:wRot+2, h:hRot+2};
if (x<0 || y<0 || x+wRot>W || y+hRot>H) continue;
if (!collide(cand, placed.map(p=>p.rect))){
placed.push({rect:cand, text:w.text, size:w.size, angle, color:w.color});
success = true;
break;
}
}
if (!success) {
// If the word cannot be placed, skip it (avoids hanging when there are many words).
}
// Soft cap for speed (still handles several hundred words).
if (placed.length > 1200) break;
}
return placed;
}
// -------- KWIC --------
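// Key Word In Context: return up to 30 matches of `term`, each with `window` characters of context on both sides.
function kwic(text, term, window=30){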
const re = new RegExp(term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "gi");
const out = [];
let m;
while ((m = re.exec(text))!==null){
const i = m.index, j = re.lastIndex;
out.push({
pre: text.slice(Math.max(0, i-window), i),
hit: text.slice(i, j),
post: text.slice(j, Math.min(text.length, j+window))
});
if (out.length>=30) break;
}
return out;
}
// -------- Render cloud --------
let seed = Date.now();
function buildCloud(shuffle=false){
if (shuffle) seed = Date.now();
const text = txtArea.value || "";
const lang = langSel.value;
const ngram = ngSel.value;
const f = freqCount(text, lang, ngram, !!caseFold.value, !!rmStop.value);
let rows = Array.from(f.entries()).map(([term, count]) => ({term, count}));
rows = rows.sort((a,b)=> b.count - a.count);
let used = rows.filter(r => r.count >= +minFreq.value).slice(0, +maxWords.value);
let note = "";
if (used.length === 0) {
used = rows.slice(0, Math.min(+maxWords.value, 500));
note = ` (fallback: no tokens ≥ Min frequency; showing top ${used.length})`;
}
const counts = used.map(d=>d.count);
const [c0, c1] = extent(counts.length ? counts : [1,1]);
const scale = makeScale(scaleSel.value, [c0, c1], [+minFont.value, +maxFont.value]);
const words = used.map((r,i)=>({
text: r.term,
size: scale(r.count),
color: palette[i % palette.length]
}));
const W = cloudWrap.clientWidth || 860;
const H = Math.max(560, cloudWrap.clientHeight || 560);
const rng = rand(seed);
const placed = placeWords(words, W, H, rotateOpt.value, rng);
cloudWrap.innerHTML = "";
if (placed.length === 0){
const empty = document.createElement("div");
empty.className = "empty";
empty.innerHTML = `No tokens to display. Try lowering <b>Min frequency</b>, turning off <b>Remove stopwords</b>, or increasing <b>Max words</b>.`;
cloudWrap.append(empty);
} else {
for (const w of placed){
const span = document.createElement("span");
span.className = "token";
span.textContent = w.text;
span.style.left = `${w.rect.x}px`;
span.style.top = `${w.rect.y}px`;
span.style.fontSize = `${Math.round(w.size)}px`;
span.style.color = w.color;
span.style.transform = `rotate(${(w.angle*180/Math.PI).toFixed(1)}deg)`;
span.title = `${w.text}`;
span.addEventListener("click", () => {
for (const el of cloudWrap.querySelectorAll(".token")) el.style.opacity = ".35";
span.style.opacity = "1";
const kw = kwic(text, w.text, 40);
kwicBox.innerHTML = kw.length
? kw.map(k => `${k.pre.replace(/</g,"&lt;")}<b>${k.hit.replace(/</g,"&lt;")}</b>${k.post.replace(/</g,"&lt;")}`).join("<br>")
: `<span class="hint">No occurrences found (tokenizer/stopwords may have filtered it).</span>`;
});
cloudWrap.append(span);
}
}
stats.innerHTML = `Tokens shown: <b>${placed.length}</b>${note} • Vocab: <b>${rows.length}</b> • Min freq ≥ ${+minFreq.value} • N-gram: <b>${ngram}</b>`;
kwicBox.innerHTML = `<span class="hint">Click a word to see KWIC (up to 30 hits).</span>`;
}
// auto rebuild on change (debounced)
let timer=null;
function scheduleBuild(){ clearTimeout(timer); timer=setTimeout(()=>buildCloud(false), 120); }
[langSel, ngSel, txtArea, caseFold, rmStop, stripShort, rotateOpt, scaleSel, maxWords, minFreq, minFont, maxFont]
.forEach(el => el.addEventListener("input", scheduleBuild));
runBtn.addEventListener("click", () => buildCloud(false));
shuffleBtn.addEventListener("click", () => buildCloud(true));
window.addEventListener("resize", scheduleBuild, {passive:true});
// first render
buildCloud(false);
return box;
})()
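
The word cloud above tokenizes Thai text by splitting on whitespace, which returns whole phrases because Thai does not put spaces between words. One possible alternative, sketched below, is the built-in Intl.Segmenter API available in recent browsers and in Node.js 16+. The helper names segmentWords and countWords are hypothetical and are not part of the widget; the exact word boundaries depend on the dictionary shipped with the JavaScript runtime, so the output may differ slightly between environments.

```javascript
// Sketch only: segmentWords and countWords are hypothetical helpers,
// not functions from the word cloud widget.

// Split text into words using the runtime's locale-aware segmenter.
function segmentWords(text, locale = "th") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for spaces and punctuation, so those are skipped.
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Count how often each word occurs (same Map-based tally the widget uses).
function countWords(words) {
  const freq = new Map();
  for (const w of words) freq.set(w, (freq.get(w) || 0) + 1);
  return freq;
}

// Usage: one line of the widget's Thai sample text.
const sample = "บริการรวดเร็วมาก ทีมงานตอบไว";
const words = segmentWords(sample);
console.log(words);
console.log(countWords(words));
```

A whitespace split of the same sample yields only two chunks, so word-level segmentation of this kind should give the frequency table and the cloud more meaningful Thai tokens. The Thai stopword list would then be matched against individual words rather than phrases.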