Text Mining

Somsak Chanaim

International College of Digital Innovation, CMU

October 7, 2025

Learning objectives

Students will be able to:

  1. Describe the basic concepts and tasks of Natural Language Processing (NLP).

  2. Explain the basics of sentiment analysis.

  3. Recognise everyday applications of sentiment analysis.

  4. Use simple tools for sentiment analysis.

  5. Interpret the results of sentiment analysis clearly and simply.

What is Text Mining

Text Mining (also called Text Data Mining or Text Analytics) is the process of extracting useful information, patterns, and knowledge from unstructured text data.

It combines techniques from natural language processing (NLP), machine learning, and statistics to transform text into structured data for analysis.

Applications of Text Mining

Business

✅ Scenario

An e-commerce company (e.g., Amazon) wants to improve product quality and customer satisfaction. They receive thousands of product reviews daily, which are unstructured text.

🔑 Process

  1. Data Collection:

    • Gather customer reviews from the website, app stores, or third-party platforms.
  2. Preprocessing:

    • Remove stop words (“the”, “is”), punctuation, and irrelevant data.
    • Apply stemming/lemmatization (e.g., “running” → “run”).
  3. Text Mining Techniques:

    • Sentiment Analysis → Identify if reviews are positive 😊, negative 😡, or neutral 😐.
    • Topic Modeling (LDA) → Detect common themes (e.g., “delivery”, “price”, “quality”).
    • Keyword Extraction (TF–IDF) → Highlight frequent complaints or praises.
  4. Business Action:

    • If many reviews mention “late delivery”, logistics teams can investigate shipping.
    • If positive reviews mention “great packaging”, marketing can highlight this in ads.
  • Amazon uses text mining for product review analysis.

    • Sentiment analysis helps rank products and recommend items.
    • Negative reviews trigger alerts for quality control.
  • Starbucks applies text mining on Twitter and Instagram posts.

    • Detects trending flavors or complaints.
    • Adjusts marketing campaigns (e.g., launching seasonal drinks).
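
The "Business Action" step above can be sketched with plain keyword matching. This is only a simple stand-in for topic modeling or TF–IDF, not how Amazon or Starbucks actually do it. The reviews, the theme names, and the `count_themes` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical reviews and theme keywords, for illustration only.
REVIEWS = [
    "Late delivery again, the box arrived after a week.",
    "Great packaging and fast shipping!",
    "Late delivery and the price is too high.",
]
THEMES = {
    "delivery": ["late delivery", "shipping", "arrived"],
    "packaging": ["packaging", "box"],
    "price": ["price", "expensive"],
}

def count_themes(reviews, themes):
    """Count how many reviews mention each theme at least once."""
    counts = Counter()
    for review in reviews:
        text = review.lower()
        for theme, keywords in themes.items():
            if any(keyword in text for keyword in keywords):
                counts[theme] += 1
    return counts

theme_counts = count_themes(REVIEWS, THEMES)
for theme, n in theme_counts.most_common():
    print(f"{theme}: mentioned in {n} review(s)")
```

A theme that many reviews mention, such as delivery here, is a candidate for the relevant team to investigate.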

Healthcare

Healthcare: Mining Medical Records and Clinical Notes for Diagnosis Support

✅ Scenario

Hospitals and clinics generate massive amounts of unstructured text data such as:

  • Electronic Health Records (EHRs)
  • Doctor’s clinical notes
  • Lab reports
  • Radiology findings

These contain valuable information but are difficult to analyze manually.

🔑 Process

  1. Data Collection

    • Extract EHRs, discharge summaries, and physician notes.
  2. Preprocessing

    • Remove stopwords and normalize medical terminology.
    • Handle abbreviations (e.g., HTN → Hypertension).
  3. Text Mining Techniques

    • Named Entity Recognition (NER): Identify diseases, symptoms, and treatments in text.
    • Text Classification: Categorize notes by diagnosis type.
    • Clustering & Pattern Mining: Find common co-occurrences (e.g., diabetes + hypertension).
    • Predictive Modeling: Predict risks based on past notes (e.g., readmission risk).
  4. Business/Healthcare Impact

    • Helps doctors detect patterns and support faster diagnosis.
    • Enables personalized treatment plans.
    • Improves patient safety by detecting adverse drug interactions.
  • IBM Watson Health: Uses text mining to extract insights from clinical notes for diagnosis support.

  • Mount Sinai Hospital (New York): Applied NLP to EHRs to predict heart failure risk earlier than traditional methods.
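
The abbreviation-handling step in preprocessing (HTN → Hypertension) can be sketched as a dictionary lookup. The table below is a tiny hypothetical one: the "DM" entry and the sample note are added for illustration, and a real system would rely on a curated medical vocabulary.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "HTN": "hypertension",
    "DM": "diabetes mellitus",
}

def expand_abbreviations(note, table):
    """Replace whole-word abbreviations with their full terms."""
    pattern = r"\b(" + "|".join(re.escape(abbr) for abbr in table) + r")\b"
    return re.sub(pattern, lambda match: table[match.group(0)], note)

note = "Patient has HTN and DM, stable."
expanded = expand_abbreviations(note, ABBREVIATIONS)
print(expanded)
```

The `\b` word boundaries keep the pattern from rewriting letters inside longer words.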

Finance

Finance: Detecting Fraud or Analyzing News Sentiment for Stock Prediction

✅ Scenario

Financial institutions handle enormous volumes of data, much of it unstructured text:

  • Customer transaction logs
  • Credit card records
  • Financial news & analyst reports
  • Social media posts about stocks

This data contains hidden signals for fraud detection and investment prediction.

🔑 How Text Mining Works in Finance

  1. Fraud Detection (Credit Cards & Transactions)

    • Data Sources: transaction descriptions, merchant names, customer complaint notes

    • Techniques:

      • Natural Language Processing (NLP) to parse transaction text
      • Anomaly detection to flag unusual behavior
      • Classification models (legitimate ✅ vs. suspicious ❌)
    • Impact: Real-time fraud alerts, reduced financial losses

  2. News Sentiment for Stock Prediction

    • Data Sources: news headlines, financial articles, Twitter posts

    • Techniques:

      • Sentiment analysis (positive/negative/neutral)
      • Named Entity Recognition (NER) to identify companies & tickers
      • Correlation with market movements
    • Impact: Helps traders forecast price direction, build sentiment-driven trading strategies

  • JPMorgan Chase 🏦

    • Uses text mining + machine learning to scan millions of customer emails, chats, and documents for signs of fraud or insider trading.
  • Bloomberg Terminal & Reuters 📰

    • Apply real-time sentiment analysis on global financial news.
    • Traders see alerts when sentiment about a stock/commodity changes sharply.
  • S&P Global Market Intelligence 📈

    • Uses NLP to mine earnings call transcripts.
    • Analysts detect tone and sentiment shifts in CEO/CFO statements → early signal of company performance.

Education & Research

Education & Research: Summarizing Articles, Plagiarism Detection, or Learning Analytics 🎓📚

✅ Scenario

Universities and researchers deal with massive amounts of unstructured text:

  • Research papers
  • Student essays and assignments
  • Online learning logs and forum posts

Text mining makes it possible to process and analyze this information efficiently.

🔑 How Text Mining Works in Education & Research

  1. Summarizing Articles

    • NLP algorithms create concise summaries of long research papers.
    • Saves time for students and researchers scanning large literature databases.
    • Example: Elsevier uses AI summarization in its academic platforms.
  2. Plagiarism Detection

    • Systems compare a student’s assignment against millions of documents.
    • Detects copied or paraphrased text.
    • Example: Turnitin (widely used globally) applies text mining + similarity analysis.
  3. Learning Analytics

    • Mining discussion forums, assignments, or quiz responses.
    • Identifies at-risk students based on writing style or engagement.
    • Example: Moodle Analytics and Coursera apply NLP to track learner progress.
  • Turnitin → Plagiarism detection across millions of student papers.

  • Coursera & edX → Analyze forum discussions to improve course design.

  • Semantic Scholar (Allen Institute for AI) → Uses NLP to summarize and recommend research papers.
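
The similarity analysis behind plagiarism detection can be illustrated with Jaccard similarity over word 3-grams ("shingles"). This is a minimal sketch of one common approach, not Turnitin's actual method. The sample sentences and function names are made up for the example.

```python
import re

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical texts, for illustration only.
original = "Text mining extracts useful patterns from unstructured text data."
partial = "Text mining extracts useful patterns from messy documents."
unrelated = "The weather was sunny and warm all weekend."

print(jaccard_similarity(original, original))
print(jaccard_similarity(original, partial))
print(jaccard_similarity(original, unrelated))
```

Exact copies score 1.0 and unrelated texts score 0.0. Catching paraphrased text usually needs more than shingle overlap, for example embeddings.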

Natural Language Processing

Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language (spoken or written).

It combines techniques from linguistics, computer science, and machine learning to bridge the gap between human communication and computer understanding.

Key Capabilities of NLP

  1. Text Preprocessing → Tokenization, stemming, lemmatization, stop-word removal.
  2. Text Classification → Spam detection, sentiment analysis, topic labeling.
  3. Named Entity Recognition (NER) → Identifying people, places, dates, organizations in text.
  4. Machine Translation → Google Translate, DeepL.
  5. Sentiment Analysis → Detecting emotions (positive, negative, neutral) in text.
  6. Speech Recognition → Turning speech into text (e.g., Siri, Alexa).
  7. Text Generation → Chatbots, large language models (ChatGPT, Gemini ✨).

Sentiment analysis

Sentiment Analysis is a technique used to determine the emotional tone in text. It helps computers identify whether a piece of text expresses positive, negative, or neutral sentiment.

Example Sentence and Sentiment Value

We often assign a sentiment score to text.

Sentiment values usually range between –1 (very negative) and +1 (very positive).

Sentence:

“The movie was fantastic and inspiring.”

  • Sentiment Value: +0.85 (strongly positive)

Sentence:

“The service was terrible and disappointing.”

  • Sentiment Value: –0.80 (strongly negative)

Sentence:

“The food was okay, nothing special.”

  • Sentiment Value: 0.05 (neutral / slightly positive)
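
One simple way to produce such a score is a lexicon lookup: average the weights of the sentiment words found in the text. The sketch below assumes a tiny hypothetical lexicon. Its weights are invented and were picked so that the toy scores land near the values above. Real tools use much larger lexicons or trained models.

```python
import re

# Tiny hypothetical lexicon; the weights are made up for illustration.
LEXICON = {
    "fantastic": 0.9, "inspiring": 0.8,
    "terrible": -0.9, "disappointing": -0.7,
    "okay": 0.05,
}

def sentiment_score(text, lexicon=LEXICON):
    """Average the lexicon weights of the words found; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    weights = [lexicon[w] for w in words if w in lexicon]
    return sum(weights) / len(weights) if weights else 0.0

for sentence in [
    "The movie was fantastic and inspiring.",
    "The service was terrible and disappointing.",
    "The food was okay, nothing special.",
]:
    print(f"{sentiment_score(sentence):+.2f}  {sentence}")
```

Because every weight lies between –1 and +1, the average stays in that range too.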

1. Standard Sentiment Analysis (SSA)

Task: Classify text into broad categories → Positive, Negative, or Neutral.

Example

  • Text: “The food was delicious.” → Positive
  • Text: “The service was slow.” → Negative

2. Fine-grained Sentiment Analysis (SSA Upgrade)

Task: Break down sentiment into levels of polarity

  • Very Positive: 😍 / 🤩 / 🥳 / ⭐⭐⭐⭐⭐

  • Positive: 🙂 / 😊 / ⭐⭐⭐⭐

  • Neutral: 😐 / 😶 / ⭐⭐⭐

  • Negative: 🙁 / 😟 / ⭐⭐

  • Very Negative: 😡 / 😠 / 😭 / ⭐

Example

  • Text: “The movie was absolutely amazing!” → Very Positive
  • Text: “The product is okay.” → Neutral
  • Text: “This was the worst experience ever!” → Very Negative
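
If a numeric sentiment score between –1 and +1 is already available, fine-grained labels can be obtained by bucketing it. The thresholds below are assumptions chosen for illustration. In practice they would be tuned on labeled data.

```python
def polarity_level(score):
    """Map a score in [-1, 1] to one of five levels (thresholds are assumptions)."""
    if score >= 0.6:
        return "Very Positive"
    if score >= 0.2:
        return "Positive"
    if score > -0.2:
        return "Neutral"
    if score > -0.6:
        return "Negative"
    return "Very Negative"

for score in (0.85, 0.3, 0.05, -0.3, -0.8):
    print(f"{score:+.2f} -> {polarity_level(score)}")
```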

3. Emotion Detection (identifies specific emotions)

Task: Use NLP and psychology-based models to classify emotions.

  • Happiness
  • Anger
  • Sadness
  • Fear
  • Surprise
  • Disgust
  • etc.
  1. I’m so excited for my new job! → Joy/Excitement 😀🤩

  2. I’m scared about the results. → Fear 😨

  3. This food tastes terrible. → Disgust 🤢

  4. Wow, I didn’t expect that surprise party! → Surprise 😲
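
A very rough way to approximate emotion detection is to count cue words for each emotion. The cue lists below are hypothetical and far too small for real use. Production systems typically rely on trained classifiers rather than keyword lists.

```python
import re
from collections import Counter

# Hypothetical cue words per emotion, for illustration only.
EMOTION_CUES = {
    "joy": {"excited", "happy", "love"},
    "fear": {"scared", "afraid", "worried"},
    "disgust": {"terrible", "gross", "disgusting"},
    "surprise": {"wow", "surprise", "unexpected"},
}

def detect_emotion(text):
    """Return the emotion with the most cue-word hits, or 'unknown'."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for emotion, cues in EMOTION_CUES.items():
        hits[emotion] = sum(1 for w in words if w in cues)
    emotion, count = hits.most_common(1)[0]
    return emotion if count > 0 else "unknown"

for sentence in [
    "I'm so excited for my new job!",
    "I'm scared about the results.",
    "This food tastes terrible.",
    "Wow, I didn't expect that surprise party!",
]:
    print(detect_emotion(sentence), "<-", sentence)
```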

4. Aspect-Based Sentiment Analysis (ABSA)

Looks at specific aspects/features of a product or service.

Task: Identify what part of the product/service the sentiment refers to.

Aspect Sentiments:

  • Camera → Positive

  • Battery → Negative

  • Price → Neutral

Summary

  • SSA → Positive / Negative / Neutral
  • Fine-grained → Adds intensity (Very Positive → Very Negative)
  • Emotion Detection → Identifies specific feelings (joy, anger, fear, etc.)
  • ABSA → Links sentiment to specific product features or aspects

Interactive Sentiment Analysis (Demo)

Example

  • I absolutely love this product—super easy to use! 🙂

  • The app is good, but the battery life is not great.

  • This update is incredibly fast and really impressive.

  • It’s not bad, just a bit slow sometimes.

  • The UX is terrible… I’m so disappointed. 👎

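  • ใช้งานง่ายมาก ชอบฟีเจอร์ใหม่ที่สุด! (Thai: “Very easy to use. I like the new feature the most!”)

  • ไม่ดีเท่าไหร่ แถมค้างบ่อยๆ จนหงุดหงิด 😡 (Thai: “Not very good, and it freezes so often that it is frustrating.”)

  • บริการโอเคนะ แต่ไม่ได้เร็วมาก (Thai: “The service is okay, but not very fast.”)

  • ราคาแพงไปนิด แต่คุณภาพก็ดีมากจริงๆ (Thai: “The price is a bit high, but the quality is really very good.”)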

  • Nothing special—works as expected.

Workflow of Sentiment Analysis

Preprocessing Steps: Cleaning, Normalizing, and Structuring

  1. Tokenization

    • Breaking sentences into smaller units (tokens) such as words or phrases.
    • Example: “The movie was great” → [“The”, “movie”, “was”, “great”]
  2. Lowercasing / Normalization

    • Converting all text into lowercase to avoid duplication.
    • Example: Great and great are treated the same.
  3. Stop-word Removal

    • Removing common words that don’t add much meaning.
    • Example: “the”, “is”, “and”, “of”
  4. Stemming

    • Reducing words to their root form by chopping off suffixes.
    • Example: “running”, “runs” → “run”
  5. Lemmatization

    • Converting words to their base form using grammar and vocabulary.
    • Example: better → good, am/are/is → be
  6. Punctuation & Special Character Removal

    • Cleaning out unnecessary symbols, numbers, or punctuation.
    • Example: “!!!” → “”
  7. Handling Negations

    • Keeping negation words attached to what they modify (e.g., “not good”), so the meaning is preserved.
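
Most of these steps can be sketched in a few lines of plain Python. The stop-word list, the negation list, and the crude suffix-stripping `stem` function are simplified assumptions. They do not implement a real stemmer such as Porter's. Lemmatization is left out because it needs a dictionary. The tokenizer only keeps the letters a to z, so it suits English text only.

```python
import re

# Small hypothetical word lists, for illustration only.
STOP_WORDS = {"the", "is", "and", "of", "was", "a"}
NEGATIONS = {"not", "no", "never"}

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Crude stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            root = word[: -len(suffix)]
            # Undo consonant doubling (runn -> run), except for l, s, z.
            if suffix != "s" and root[-1] == root[-2] and root[-1] not in "aeioulsz":
                root = root[:-1]
            return root
    return word

def join_negations(tokens):
    """Merge a negation with the following token: not, good -> not_good."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def preprocess(text):
    tokens = tokenize(text)           # tokenization, lowercasing, punctuation removal
    tokens = join_negations(tokens)   # done before stop-word removal so "not" survives
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t if "_" in t else stem(t) for t in tokens]

print(preprocess("The movie was NOT good!!!"))
print(preprocess("Running and runs"))
```

Negations are merged before stop-word removal because many stop-word lists include “not”, and dropping it would flip the meaning.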

🔎 Feature Extraction

Feature Extraction is the process of transforming preprocessed text into numerical vectors that machine learning or deep learning models can understand.

Main Techniques

1. Bag of Words (BoW)

  • Concept: Represents text by counting how many times each word appears, ignoring grammar and word order.

  • Pros: Simple, easy to implement.

  • Cons: Loses word context, results in sparse data.

Example:

  • Text: The movie was great, great acting
  • Features: {the:1, movie:1, was:1, great:2, acting:1}
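
The same counts can be reproduced with `collections.Counter` from the Python standard library. The `bag_of_words` helper is a hypothetical name. It lowercases the text and keeps only alphabetic tokens.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word appears, ignoring order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

features = bag_of_words("The movie was great, great acting")
print(dict(features))
```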

2. TF–IDF

Term Frequency – Inverse Document Frequency

  • Concept: Assigns weight to words based on how frequently they appear in a document compared to across all documents.
  • Pros: Reduces the importance of common words like the, is.
  • Cons: Still does not capture semantic context.

Example:

  • The word quality in product reviews gets higher weight than the word the.
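
A minimal sketch of the calculation, using the textbook form tf × log(N / df). Libraries often apply smoothed variants, so their numbers will differ. The three reviews are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical product reviews, for illustration only.
reviews = [
    "the quality is great",
    "the delivery is late",
    "the price is fair",
]
weights = tf_idf(reviews)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:<8} {weight:.3f}")
```

Words that occur in every review, such as “the” and “is”, get weight zero, while “quality” keeps a positive weight.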

3. Word Embeddings

  • Concept: Converts words into dense vectors where words with similar meanings are close in vector space.
  • Models: Word2Vec, GloVe, fastText.
  • Pros: Captures semantic similarity between words.
  • Cons: Pre-trained embeddings may not cover domain-specific vocabulary.

Example:

  • king – man + woman ≈ queen
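
The analogy can be demonstrated with cosine similarity on toy vectors. The 2-D vectors below are hand-made, with made-up axes for "royalty" and "maleness". They are not real Word2Vec or GloVe embeddings, which have hundreds of dimensions learned from text.

```python
import math

# Hand-made toy vectors (royalty, maleness); NOT real embeddings.
VECTORS = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.2, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", VECTORS)
print("king - man + woman is closest to:", result)
```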

4. Contextual Embeddings

  • Concept: Uses advanced language models (BERT, RoBERTa, GPT embeddings) to capture word meaning based on sentence context.
  • Pros: Context-aware, achieves state-of-the-art performance.
  • Cons: Computationally expensive.

Example:

  • bank in river bank ≠ bank in financial bank

Model

Classification

  • Input: Raw text (reviews, tweets, news)

  • Process: Classification model (Naive Bayes, Logistic Regression, SVM, Neural Net)

  • Output: Discrete labels (e.g., Positive / Negative / Neutral, Spam / Not Spam)
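
As one concrete instance of this pipeline, here is a small multinomial Naive Bayes classifier with add-one smoothing, written from scratch. The four training sentences are hypothetical and far too few for a real model. They only show the text-in, label-out flow.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data, for illustration only.
train_texts = [
    "the food was delicious",
    "great service and great food",
    "the service was slow",
    "terrible food and slow service",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("delicious food"))
print(model.predict("slow service"))
```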

Regression

  • Input: Raw text (reviews, financial news, social media posts)

  • Process: Regression model (Linear Regression, Ridge/Lasso, SVR, Neural Networks)

  • Output: Continuous values (e.g., Predicted Rating = 4.2, Stock Change = –1.5%, Engagement Score = 2000 likes)

Clustering

  • Input: Raw text (customer reviews, research papers, survey responses)

  • Process: Clustering model (k-Means, Hierarchical Clustering, DBSCAN, Topic Modeling such as LDA)

  • Output: Groups of similar texts (e.g., Delivery Issues, Price Concerns, Product Quality)

Output & Visualization

After preprocessing, feature extraction, and classification, the system produces results that can be interpreted and visualized.

Key Outputs

  1. Sentiment Label

    • The main classification result.
    • Categories: Positive, Negative, Neutral (or Very Positive → Very Negative in fine-grained analysis).
    • Example: “The product is excellent” → Positive
  2. Sentiment Score / Probability

    • A numeric value representing sentiment intensity.

    • Range: –1.0 (very negative) to +1.0 (very positive).

    • Example:

      • “I love this phone” → +0.85
      • “The service is awful” → –0.90
  3. Aspect-Based Sentiment

    • Sentiment toward specific product features.

    • Example:

      • “The phone’s camera is great but the battery is bad”

        • Camera → Positive (+0.8)
        • Battery → Negative (–0.7)
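
A rough sketch of how such aspect-level scores could be produced: split the sentence into clauses, then give each aspect the lexicon score of its own clause. The aspect list and lexicon weights are hypothetical. Real ABSA systems use trained models rather than this clause-splitting heuristic.

```python
import re

# Hypothetical aspect list and sentiment lexicon, for illustration only.
ASPECTS = {"camera", "battery", "price"}
LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.7, "poor": -0.6}

def aspect_sentiment(text):
    """Assign each aspect the average lexicon score of the clause it appears in."""
    results = {}
    clauses = re.split(r"\bbut\b|\band\b|[,;.]", text.lower())
    for clause in clauses:
        words = re.findall(r"[a-z]+", clause)
        scores = [LEXICON[w] for w in words if w in LEXICON]
        if not scores:
            continue  # no sentiment word in this clause
        for word in words:
            if word in ASPECTS:
                results[word] = sum(scores) / len(scores)
    return results

result = aspect_sentiment("The phone's camera is great but the battery is bad")
print(result)
```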

Visualization Techniques

  1. Pie Charts

    • Show proportion of Positive / Neutral / Negative reviews.
  2. Bar Charts

    • Compare sentiment across different products, brands, or time periods.
  3. Time-Series Plots

    • Track sentiment trend over time (e.g., Twitter posts during an event).
  4. Word Clouds

    • Highlight frequent positive/negative keywords.
  5. Dashboards

    • Combine charts & KPIs for decision-makers.
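
The numbers behind a pie or bar chart are just label proportions. The sketch below computes them and prints a text-only bar chart, which keeps the example free of dependencies. A plotting library such as matplotlib would be the usual choice for real charts. The label list is hypothetical.

```python
from collections import Counter

def sentiment_distribution(labels):
    """Return the share of each label as a fraction of all labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def text_bar_chart(distribution, width=20):
    """Render one line per label, with a bar proportional to its share."""
    lines = []
    for label, share in sorted(distribution.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(share * width)
        lines.append(f"{label:<9} {bar} {share:.0%}")
    return "\n".join(lines)

# Hypothetical classifier output, for illustration only.
labels = ["Positive", "Positive", "Negative", "Neutral"]
distribution = sentiment_distribution(labels)
chart = text_bar_chart(distribution)
print(chart)
```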

Interactive Bag of Words

Interactive Word Cloud (Demo)