Data Wrangling

Somsak Chanaim

International College of Digital Innovation, CMU

September 7, 2025

Data Wrangling

Data wrangling, also known as data cleaning or data remediation, is the process of transforming and refining data in various ways to make it suitable and ready for further use.

Common Data Wrangling Activities

  • Merging data from multiple sources into a single dataset for analysis

  • Identifying missing values (e.g., blank cells in a spreadsheet) to either fill in or remove them

  • Removing unnecessary or irrelevant data that is not related to the project at hand

  • Detecting extreme outliers in the data and explaining or removing them to allow for proper analysis
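
The activities above can be sketched in pandas (the course's own tooling is R's dplyr/tidyr); the tiny DataFrame, its values, and the outlier threshold below are made up for illustration:

```python
import pandas as pd

# Illustrative data with one missing value and one extreme outlier
df = pd.DataFrame({
    "sales": [120.0, 135.0, None, 128.0, 9_999.0],
    "region": ["N", "S", "S", "N", "N"],
})

# Identify missing values, then fill them in (or drop them)
n_missing = df["sales"].isna().sum()
df["sales"] = df["sales"].fillna(df["sales"].median())

# Flag extreme outliers, here via a z-score threshold chosen for this tiny sample
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
outliers = df[z.abs() > 1.5]
```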

Data Frames

A data frame is a dataset arranged in tabular format (rows and columns).
Each column can contain:

  • Character values

  • Numeric values

  • Logical values

  • Date values

  • etc.


Data on Thailand and China from Gapminder

Reshaping Data:

Reshaping Data refers to modifying or transforming the structure of a dataset to make it more suitable for analysis or presentation.

Key Concept of Reshaping Data

Reshaping often involves converting data from one format to another, such as:

  • Wide format → Long format (transforming data from wide to long structure)

  • Long format → Wide format (transforming data from long to wide structure)

Wide format → Long format

  • The columns lifeExp, pop, and gdpPercap, which were previously separate, are gathered into a single column called metric.

  • The values of these variables are moved into a new column called value.
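
This wide-to-long step can be sketched in pandas with `melt` (the slides use R's tidyr); the `metric` and `value` column names follow the slide, and the Gapminder-style numbers are illustrative:

```python
import pandas as pd

# Gapminder-style wide data (values are illustrative)
wide = pd.DataFrame({
    "country": ["Thailand", "China"],
    "year": [2007, 2007],
    "lifeExp": [70.6, 73.0],
    "pop": [65_068_149, 1_318_683_096],
    "gdpPercap": [7458.0, 4959.0],
})

# Gather lifeExp, pop, gdpPercap into metric/value pairs
long = wide.melt(
    id_vars=["country", "year"],
    value_vars=["lifeExp", "pop", "gdpPercap"],
    var_name="metric",
    value_name="value",
)
```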

Long format → Wide format

  • The values that were previously stored in the metric column
    are now spread back into separate columns: lifeExp, pop, and gdpPercap.

  • The values from the value column are distributed back to their corresponding variables.
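
The reverse, long-to-wide step can be sketched with pandas `pivot` (illustrative values again):

```python
import pandas as pd

# Long-format data: one metric/value pair per row
long = pd.DataFrame({
    "country": ["Thailand"] * 3 + ["China"] * 3,
    "year": [2007] * 6,
    "metric": ["lifeExp", "pop", "gdpPercap"] * 2,
    "value": [70.6, 65_068_149, 7458.0, 73.0, 1_318_683_096, 4959.0],
})

# Spread the metric column back into separate columns
wide = (
    long.pivot(index=["country", "year"], columns="metric", values="value")
        .reset_index()
)
```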

Combine Multiple Columns into a Single Column

Combining the country and year variables into a single column.
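
In pandas this can be sketched with string concatenation (tidyr's `unite` is the R counterpart); the separator `_` is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Thailand", "China"], "year": [2002, 2007]})

# Combine country and year into one column, e.g. "Thailand_2002"
df["country_year"] = df["country"] + "_" + df["year"].astype(str)
```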

Split a Column into Multiple Columns
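
A sketch of the opposite operation in pandas (tidyr's `separate` is the R counterpart), assuming the combined column uses `_` as its separator:

```python
import pandas as pd

df = pd.DataFrame({"country_year": ["Thailand_2002", "China_2007"]})

# Split one column into two on the underscore separator
df[["country", "year"]] = df["country_year"].str.split("_", expand=True)
df["year"] = df["year"].astype(int)
```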

Sort Data in Ascending or Descending Order Using Specific Columns

You can sort the data by columns such as year, lifeExp, or pop.
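
Sorting by one or more columns, in either direction, can be sketched in pandas with `sort_values` (dplyr's `arrange` is the R counterpart; values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand", "China", "Thailand", "China"],
    "year": [2002, 2002, 2007, 2007],
    "lifeExp": [68.6, 72.0, 70.6, 73.0],
})

# Ascending by year, then descending by lifeExp within each year
ordered = df.sort_values(["year", "lifeExp"], ascending=[True, False])
```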

(Click on the column header to sort)

Rename Variables (Do not use duplicate names)

Rename the variable country

Rename the variable year

Rename the variable lifeExp

Rename the variable pop

Rename the variable gdpPercap
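
Renaming can be sketched with pandas `rename` (dplyr's `rename` is the R counterpart); the new names `nation` and `population` are arbitrary examples:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Thailand"], "pop": [65_068_149]})

# Rename variables; the new names must not duplicate existing ones
df = df.rename(columns={"country": "nation", "pop": "population"})
```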

Subset Observations (Rows)

  • Select rows that match a logical condition

  • Randomly sample a fraction of the entire dataset

  • Randomly sample a specific number of rows (without exceeding the total)

  • Select rows by specified positions

Filter Rows by Logical Condition

Select rows that match a logical condition, for example all rows from a given year onward.

Randomly Sample a Fraction of the Dataset

Randomly select a sample as a fraction of the entire dataset, for example half of all rows.

Randomly Sample a Specific Number of Rows (No More Than Available)

Randomly sample a specified number of rows from the dataset.

Select Rows by Specified Positions

Select rows between two specified positions.
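
The four ways of subsetting rows can be sketched in pandas (dplyr's `filter`, `sample_frac`, `sample_n`, and `slice` are the R counterparts); the data, the cutoff year, and the sample sizes are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand", "China"] * 3,
    "year": [1997, 1997, 2002, 2002, 2007, 2007],
})

# 1. Rows matching a logical condition: from the year 2002 onward
recent = df[df["year"] >= 2002]

# 2. A random 50% fraction of all rows
half = df.sample(frac=0.5, random_state=1)

# 3. A fixed number of rows (must not exceed the total)
three = df.sample(n=3, random_state=1)

# 4. Rows by position: the first two rows
first_two = df.iloc[0:2]
```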

Subset Variables (Using Variable Conditions)

  • Select columns by specific variable names

  • Select columns that contain a specified string in their names

  • Select columns that end with a specific character or string

  • Select all columns

  • Select columns that match a specified pattern (Regular Expression)

  • Select columns from a defined list of names

  • Select all columns except certain ones

Select Columns by Variable Name

Select Variables

Select Columns that Contain a Specified String

Search Column Names

If you want to search for multiple keywords, separate them with commas and no spaces; at most 5 keywords are allowed. For example: pop,life,count
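
The column-selection conditions above can be sketched in pandas (dplyr's `select` with helpers like `contains`, `ends_with`, and `matches` is the R counterpart); the keyword set mirrors the example on the slide:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand"], "year": [2007],
    "lifeExp": [70.6], "pop": [65_068_149], "gdpPercap": [7458.0],
})

by_name  = df[["country", "lifeExp"]]        # specific variable names
contains = df.filter(like="life")            # names containing a string
ends     = df.filter(regex="cap$")           # names ending with "cap"
pattern  = df.filter(regex=r"^(pop|life)")   # regular-expression match
keywords = df.filter(regex="pop|life|count") # several keywords at once
drop_one = df.drop(columns=["gdpPercap"])    # all columns except certain ones
```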

Summarise Data

1. Select the variable you want to summarise

2. Select the statistic you want to compute

3. Choose whether or not to group the calculation by the variable country
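
These three steps can be sketched in pandas (dplyr's `group_by` + `summarise` is the R counterpart), here with the mean of `lifeExp` as the chosen statistic and illustrative values:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand", "Thailand", "China", "China"],
    "lifeExp": [68.6, 70.6, 72.0, 73.0],
})

# Mean life expectancy over all rows (no grouping)
overall = df["lifeExp"].mean()

# The same statistic, grouped by country
by_country = df.groupby("country")["lifeExp"].mean()
```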

Create New Variables

Examples of New Variables You Can Create

1. Total GDP of a Country (gdpTotal)

\[\tiny \text{gdpTotal} = \text{gdpPercap} \times \text{pop}\]

Meaning: Calculate the total GDP of a country using gdpPercap × pop

2. Convert Population to Millions (pop_million)

\[\tiny \text{pop\_million} = \dfrac{\text{pop}}{1{,}000{,}000}\]

Meaning: Convert the population unit from individuals to millions by dividing pop by 1,000,000

New Data Set

Data Set A

Data Set B

Given that:

  • Data set A contains variables x1 and x2
  • Data set B contains variables x1 and x3

Combining Data Sets

Left join: Used to combine data from two tables (data frames) by keeping all rows from the main (left) table and matching rows from the secondary (right) table based on a shared key.


Right join: Used to combine data from two tables (data frames) by keeping all rows from the secondary (right) table and matching rows from the main (left) table based on a shared key.


Inner join: Used to combine data from two tables (data frames) by keeping only the rows that have matching values in the key column used for joining.


Full join: Used to combine data from two tables (data frames) by keeping all rows from both tables, regardless of whether the key values match. If a value in the key column does not exist in the other table, NA will be filled in for missing values.
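
The four joins can be sketched in pandas with `merge` (dplyr's `left_join`, `right_join`, `inner_join`, and `full_join` are the R counterparts), using the A/B setup from the slide; the key values themselves are illustrative:

```python
import pandas as pd

# Data set A has x1 and x2; data set B has x1 and x3; x1 is the shared key
A = pd.DataFrame({"x1": ["a", "b", "c"], "x2": [1, 2, 3]})
B = pd.DataFrame({"x1": ["a", "b", "d"], "x3": [True, False, True]})

left  = A.merge(B, on="x1", how="left")   # keep all rows of A
right = A.merge(B, on="x1", how="right")  # keep all rows of B
inner = A.merge(B, on="x1", how="inner")  # keep only matching keys
full  = A.merge(B, on="x1", how="outer")  # keep all rows; NA where no match
```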


Why Do We Need One-Hot Encoding?

Problem with Categorical Data

  • Machine learning models can’t interpret text labels like "Regular", "VIP", "New" as-is.

  • Using numeric labels like 1, 2, 3 creates false ordinal relationships: It implies "VIP" > "Regular", which may not be true.

One-Hot Encoding Solves This

  • Transforms categories into independent binary variables

  • Avoids implying ranking or distance between categories

  • Makes data compatible with algorithms like logistic regression, decision trees, neural networks

Example: Customer Type

Data: a customer_type column with categorical values such as "Regular", "VIP", and "New"

\(\rightarrow\)

One-hot encoded: one binary column per category (e.g. customer_type_VIP), equal to 1 when the row belongs to that category and 0 otherwise
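
One-hot encoding can be sketched in pandas with `get_dummies`, using the customer-type categories from the slide (the particular rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"customer_type": ["Regular", "VIP", "New", "VIP"]})

# One binary (0/1) column per category, with no implied ordering
encoded = pd.get_dummies(df, columns=["customer_type"], dtype=int)
```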

Business Use Cases

  • Customer segmentation

  • Predicting churn or purchase behavior

  • Classifying transactions or complaints by category

Why Do We Need to Normalize Data?

1. Ensures Fair Comparison

  • Variables like Sales (in thousands) and Customer Age (in years) have different scales.

  • Without normalization, larger values dominate model training.

2. Improves Model Performance

  • Algorithms such as K-Means, K-NN, SVM, and Neural Networks rely on distances or gradients.

  • Normalized data ensures faster convergence and more accurate results.

3. Prevents Bias from Feature Magnitude

  • A feature with large numbers (e.g. income) may overpower smaller-scaled features (e.g. purchase frequency).
  • Normalization gives each feature equal weight.

Normalization vs. Standardization Formulas

1. Min-Max Normalization

Scales values to a fixed range, usually \([0, 1]\):

\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]

Use when range matters (e.g., neural networks)

Data: a numeric column on its original scale

\(\rightarrow\)

Min-Max Normalization: the same values rescaled to the interval \([0, 1]\)
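
The min-max formula above can be sketched directly in pandas; the `sales` values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"sales": [100.0, 150.0, 200.0, 300.0]})

# x' = (x - min(x)) / (max(x) - min(x)), scaled to [0, 1]
df["sales_minmax"] = (df["sales"] - df["sales"].min()) / (
    df["sales"].max() - df["sales"].min()
)
```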

2. Z-Score Standardization

Also known as standard scaling:

\[ x' = \frac{x - \mu}{\sigma} \]

Where:

  • \(\mu\) is the mean

  • \(\sigma\) is the standard deviation (sd)

Use when data has different units or outliers

Data: a numeric column on its original scale

\(\rightarrow\)

Standardization: the same values rescaled to mean 0 and standard deviation 1
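
The z-score formula above can likewise be sketched in pandas; note that `.std()` uses the sample standard deviation (ddof=1), and the `age` values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0]})

# x' = (x - mu) / sigma, giving mean 0 and (sample) sd 1
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()
```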

References

  • Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation (Version 1.1.4) [Computer software]. Retrieved from https://dplyr.tidyverse.org

  • Wickham, H., Vaughan, D., & Girlich, M. (2024). tidyr: Tidy Messy Data (Version 1.3.1) [Computer software]. Retrieved from https://tidyr.tidyverse.org