End-to-End NLP Pipeline: From Data to Deployment

Somsak Chanaim

International College of Digital Innovation, CMU

October 10, 2025

1) Data

2) Tokenization (Demo)

3) Features (Demo)

Example: Classification

Example: Clustering

Note: Feature Types

Bag of Words (Count)

What it is: Represents each document by raw token counts.

How to compute (per document \(d\), token \(t\)):

  1. Tokenize text (lowercasing, punctuation removal, stopword filtering, stemming/lemmatization as configured).

  2. Build a vocabulary of unique tokens.

  3. For each token \(t\) in \(d\), set the feature value to the count: \[\text{BoW}(t,d) = \#\text{occurrences of } t \text{ in } d\]

Use when: You want a simple baseline and your documents are short and similarly sized.
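For concreteness, here is a minimal Bag-of-Words sketch using scikit-learn's CountVectorizer (assumed available; the in-class demo may use different preprocessing settings):

```python
# Bag-of-Words sketch (assumes scikit-learn is installed; preprocessing
# choices here are illustrative, not the demo's exact configuration).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie good acting", "bad movie bad plot"]

vectorizer = CountVectorizer()              # lowercases by default; no stopwords/stemming here
X = vectorizer.fit_transform(docs)          # sparse document-term matrix of raw counts

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one row of counts per document
```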

Term Frequency (TF)

What it is: Normalizes raw counts by the document’s length, so longer docs don’t dominate just because they have more words.

How to compute: \[\text{TF}(t,d) = \frac{\#(t \text{ in } d)}{\sum_{w}\#(w \text{ in } d)}\]

i.e., the count of \(t\) divided by the total number of tokens in \(d\).

Use when: You need length-invariance across documents and care about within-document salience.
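As a sketch, TF can be computed by dividing each row of the count matrix by that document's total token count (same toy corpus as above; NumPy and scikit-learn assumed):

```python
# TF sketch: TF(t, d) = count of t in d / total tokens in d.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie good acting", "bad movie bad plot"]
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

doc_lengths = counts.sum(axis=1, keepdims=True)   # total tokens per document
tf = counts / doc_lengths                         # each row now sums to 1

print(tf)   # e.g. "good" in d1 -> 2 / 4 = 0.5
```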

TF–IDF (Term Frequency–Inverse Document Frequency)

What it is: Downweights tokens that appear in many documents (e.g., “movie”, “phone”, “issue”) and upweights tokens that are distinctive for a document (e.g., “thrilling”, “refund”).

How to compute (common variant):

  1. TF as above.

  2. IDF (document rarity): \[\text{IDF}(t) = \log\frac{N}{\text{df}(t)} \quad\text{or}\quad \log\Big(\frac{N+1}{\text{df}(t)+1}\Big) + 1 ~(\text{smoothing})\]

    where \(N\) = number of documents and \(\text{df}(t)\) = number of documents containing \(t\).

  3. TF–IDF: \[\text{TF–IDF}(t,d) = \text{TF}(t,d)\times \text{IDF}(t)\]

Use when: You want to emphasize discriminative words and reduce the weight of ubiquitous words.
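In practice a library handles this. The sketch below uses scikit-learn's TfidfVectorizer, whose default IDF is the smoothed variant \(\log\big(\frac{N+1}{\text{df}(t)+1}\big)+1\) shown above, applied to raw counts and followed by L2 row normalization, so its numbers differ from the plain \(\text{TF}\times\log(N/\text{df})\) variant worked out later:

```python
# TF-IDF via TfidfVectorizer (defaults: smooth_idf=True, norm='l2').
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie good acting", "bad movie bad plot"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(3))     # "movie" (in both docs) gets the lowest nonzero weight
```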

Optional post-processing you might see in the demo

  • L2 row normalization (recommended for cosine similarity): After computing BoW/TF/TF–IDF, scale each document vector \(x_d\) so that \(\|x_d\|_2 = 1\).

  • N-grams (1–2): In addition to unigrams (single tokens), include bigrams like very_good. This captures short phrases and simple word order patterns.

  • Stemming / Lemmatization: Collapses inflected forms (running, runs → run; better → good with a custom lemma map). This reduces sparsity and can improve generalization.

  • Stopword removal: Drops extremely common function words (“the”, “and”), which rarely help classification.
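A hedged sketch of how these options typically look in code (assumed scikit-learn API; parameter choices are illustrative):

```python
# Optional post-processing: 1-2 grams, English stopword removal, L2 normalization.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["good movie good acting", "bad movie bad plot"]

vectorizer = CountVectorizer(
    ngram_range=(1, 2),       # unigrams plus bigrams such as "good movie"
    stop_words="english",     # drop common function words
)
X = vectorizer.fit_transform(docs)

X_l2 = normalize(X, norm="l2")      # each document vector now has unit L2 norm
print(vectorizer.get_feature_names_out())
print(X_l2.toarray().round(3))
```

Stemming/lemmatization is not built into CountVectorizer; it is usually supplied as a custom tokenizer or done in a separate preprocessing step.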

Tiny worked example

Corpus (N=2):

  • \(d_1\): “good movie good acting”

  • \(d_2\): “bad movie bad plot”

Vocabulary: {good, movie, acting, bad, plot}

BoW (counts)

  • \(d_1\): {good:2, movie:1, acting:1, bad:0, plot:0}

  • \(d_2\): {good:0, movie:1, acting:0, bad:2, plot:1}

TF (divide by doc length; each doc length = 4)

  • \(d_1\): {good:0.5, movie:0.25, acting:0.25, bad:0, plot:0}
  • \(d_2\): {bad:0.5, movie:0.25, plot:0.25, good:0, acting:0}

IDF (plain; df: movie=2, others=1)

  • idf(good)=log(2/1)=0.693, idf(movie)=log(2/2)=0, idf(acting)=0.693, idf(bad)=0.693, idf(plot)=0.693

TF–IDF

  • \(d_1\): {good:0.5×0.693=0.347, movie:0, acting:0.25×0.693=0.173, bad:0, plot:0}

  • \(d_2\): {bad:0.5×0.693=0.347, movie:0, plot:0.25×0.693=0.173, good:0, acting:0}

Notice how “movie” (present in both docs) gets 0 weight with this IDF, while distinctive words keep weight.
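The numbers above can be reproduced with a few lines of plain Python (plain IDF \(\log(N/\text{df})\), no smoothing):

```python
# Reproduce the tiny worked example by hand.
import math

docs = [["good", "movie", "good", "acting"],
        ["bad", "movie", "bad", "plot"]]
vocab = ["good", "movie", "acting", "bad", "plot"]
N = len(docs)

df = {t: sum(t in d for d in docs) for t in vocab}        # document frequency
idf = {t: math.log(N / df[t]) for t in vocab}             # plain IDF

for i, d in enumerate(docs, start=1):
    tf = {t: d.count(t) / len(d) for t in vocab}          # TF(t, d)
    tfidf = {t: round(tf[t] * idf[t], 3) for t in vocab}  # TF-IDF(t, d)
    print(f"d{i}:", tfidf)
# d1: {'good': 0.347, 'movie': 0.0, 'acting': 0.173, 'bad': 0.0, 'plot': 0.0}
```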

Quick guidance

  • BoW: simplest; can work okay with strong models and lots of data, but is length-biased.

  • TF: normalizes for length; good general baseline.

  • TF–IDF: usually best for traditional text classification and search ranking when you don’t use pretrained embeddings.

Notation summary

  • \(t\) = a term (aka token or feature). Examples: good, movie, running, or a bigram like very_good.

  • \(d\) = a document (one text item in your corpus). Examples: one review, one email, one tweet, one sentence—whatever unit you choose.

Quick glossary (ties it together):

  • Corpus = your whole dataset (a set of documents).
  • Vocabulary = the set of all unique terms \(t\) found in the corpus.
  • \(N\) = number of documents in the corpus.
  • \(\#(t \text{ in } d)\) = how many times term \(t\) appears in document \(d\).
  • \(\text{df}(t)\) = document frequency = number of documents that contain \(t\).
  • \(\text{TF}(t,d)\) = term frequency of \(t\) in \(d\), usually \(\#(t \text{ in } d) / \text{(total tokens in } d)\).
  • \(\text{IDF}(t)\) = inverse document frequency, e.g. \(\log\big(\frac{N+1}{\text{df}(t)+1}\big)+1\).
  • \(\text{TF–IDF}(t,d)\) = \(\text{TF}(t,d)\times \text{IDF}(t)\).

Tiny example:

  • Documents:

    • \(d_1\): “good movie good acting”

    • \(d_2\): “bad movie bad plot”

  • Terms \(t\) (vocabulary): {good, movie, acting, bad, plot}

So when you see \(\text{TF}(t,d)\), just read it as: “the TF value for term \(t\) in document \(d\).”