Data Wrangling

Somsak Chanaim

International College of Digital Innovation, CMU

September 7, 2025

Data Wrangling

Data wrangling, also known as data cleaning or data remediation, is the process of transforming and refining data in various ways to make it suitable and ready for further use.

Common Data Wrangling Activities

  • Merging data from multiple sources into a single dataset for analysis

  • Identifying missing values (e.g., blank cells in a spreadsheet) to either fill in or remove them

  • Removing unnecessary or irrelevant data that is not related to the project at hand

  • Detecting extreme outliers in the data and explaining or removing them to allow for proper analysis
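
The activities above can be sketched in pandas (the course's own tooling is R's dplyr/tidyr); the tiny DataFrame, its values, and the outlier threshold below are made up for illustration:

```python
import pandas as pd

# Illustrative data with one missing value and one extreme outlier
df = pd.DataFrame({
    "sales": [120.0, 135.0, None, 128.0, 9_999.0],
    "region": ["N", "S", "S", "N", "N"],
})

# Identify missing values, then fill them in (or drop them)
n_missing = df["sales"].isna().sum()
df["sales"] = df["sales"].fillna(df["sales"].median())

# Flag extreme outliers, here via a z-score threshold chosen for this tiny sample
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
outliers = df[z.abs() > 1.5]
```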

Data Frames

A data frame is a dataset arranged in tabular format (rows and columns).
Each column can contain:

  • Character values

  • Numeric values

  • Logical values

  • Date values

  • etc.


Data on Thailand and China from Gapminder

Reshaping Data:

Reshaping Data refers to modifying or transforming the structure of a dataset to make it more suitable for analysis or presentation.

Key Concept of Reshaping Data

Reshaping often involves converting data from one format to another, such as:

  • Wide format → Long format (transforming data from wide to long structure)

  • Long format → Wide format (transforming data from long to wide structure)

Wide format → Long format

  • The columns lifeExp, pop, and gdpPercap, which were previously separate, are gathered into a single column called metric.

  • The values of these variables are moved into a new column called value.
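
This wide-to-long step can be sketched in pandas with `melt` (the slides use R's tidyr); the `metric` and `value` column names follow the slide, and the Gapminder-style numbers are illustrative:

```python
import pandas as pd

# Gapminder-style wide data (values are illustrative)
wide = pd.DataFrame({
    "country": ["Thailand", "China"],
    "year": [2007, 2007],
    "lifeExp": [70.6, 73.0],
    "pop": [65_068_149, 1_318_683_096],
    "gdpPercap": [7458.0, 4959.0],
})

# Gather lifeExp, pop, gdpPercap into metric/value pairs
long = wide.melt(
    id_vars=["country", "year"],
    value_vars=["lifeExp", "pop", "gdpPercap"],
    var_name="metric",
    value_name="value",
)
```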

Long format → Wide format

  • The values that were previously stored in the metric column
    are now spread back into separate columns: lifeExp, pop, and gdpPercap.

  • The values from the value column are distributed back to their corresponding variables.
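
The reverse, long-to-wide step can be sketched with pandas `pivot` (illustrative values again):

```python
import pandas as pd

# Long-format data: one metric/value pair per row
long = pd.DataFrame({
    "country": ["Thailand"] * 3 + ["China"] * 3,
    "year": [2007] * 6,
    "metric": ["lifeExp", "pop", "gdpPercap"] * 2,
    "value": [70.6, 65_068_149, 7458.0, 73.0, 1_318_683_096, 4959.0],
})

# Spread the metric column back into separate columns
wide = (
    long.pivot(index=["country", "year"], columns="metric", values="value")
        .reset_index()
)
```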

Combine Multiple Columns into a Single Column

Combining the country and year variables into a single column.
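
In pandas this can be sketched with string concatenation (tidyr's `unite` is the R counterpart); the separator `_` is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Thailand", "China"], "year": [2002, 2007]})

# Combine country and year into one column, e.g. "Thailand_2002"
df["country_year"] = df["country"] + "_" + df["year"].astype(str)
```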

Split a Column into Multiple Columns
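
A sketch of the opposite operation in pandas (tidyr's `separate` is the R counterpart), assuming the combined column uses `_` as its separator:

```python
import pandas as pd

df = pd.DataFrame({"country_year": ["Thailand_2002", "China_2007"]})

# Split one column into two on the underscore separator
df[["country", "year"]] = df["country_year"].str.split("_", expand=True)
df["year"] = df["year"].astype(int)
```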

Sort Data in Ascending or Descending Order Using Specific Columns

You can sort the data by columns such as year, lifeExp, or pop.
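
Sorting by one or more columns, in either direction, can be sketched in pandas with `sort_values` (dplyr's `arrange` is the R counterpart; values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand", "China", "Thailand", "China"],
    "year": [2002, 2002, 2007, 2007],
    "lifeExp": [68.6, 72.0, 70.6, 73.0],
})

# Ascending by year, then descending by lifeExp within each year
ordered = df.sort_values(["year", "lifeExp"], ascending=[True, False])
```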

(Click on the column header to sort)

Rename Variables (Do not use duplicate names)

Rename the variable country

Rename the variable year

Rename the variable lifeExp

Rename the variable pop

Rename the variable gdpPercap
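
Renaming can be sketched with pandas `rename` (dplyr's `rename` is the R counterpart); the new names `nation` and `population` are arbitrary examples:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Thailand"], "pop": [65_068_149]})

# Rename variables; the new names must not duplicate existing ones
df = df.rename(columns={"country": "nation", "pop": "population"})
```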

Subset Observations (Rows)

  • Select rows that match a logical condition

  • Randomly sample a fraction of the entire dataset

  • Randomly sample a specific number of rows (without exceeding the total)

  • Select rows by specified positions

Filter Rows by Logical Condition

Select rows that match a logical condition, for example all rows from a given year onward.

Randomly Sample a Fraction of the Dataset

Randomly select a sample as a fraction of the entire dataset, for example half of all rows.

Randomly Sample a Specific Number of Rows (No More Than Available)

Randomly sample a specified number of rows from the dataset.

Select Rows by Specified Positions

Select rows between two specified positions.
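
The four ways of subsetting rows can be sketched in pandas (dplyr's `filter`, `sample_frac`, `sample_n`, and `slice` are the R counterparts); the data, the cutoff year, and the sample sizes are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand", "China"] * 3,
    "year": [1997, 1997, 2002, 2002, 2007, 2007],
})

# 1. Rows matching a logical condition: from the year 2002 onward
recent = df[df["year"] >= 2002]

# 2. A random 50% fraction of all rows
half = df.sample(frac=0.5, random_state=1)

# 3. A fixed number of rows (must not exceed the total)
three = df.sample(n=3, random_state=1)

# 4. Rows by position: the first two rows
first_two = df.iloc[0:2]
```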

Subset Variables (Using Variable Conditions)

  • Select columns by specific variable names

  • Select columns that contain a specified string in their names

  • Select columns that end with a specific character or string

  • Select all columns

  • Select columns that match a specified pattern (Regular Expression)

  • Select columns from a defined list of names

  • Select all columns except certain ones

Select Columns by Variable Name

Select Variables

Select Columns that Contain a Specified String

Search Column Names

If you want to search for multiple keywords, separate them with commas and no spaces; at most 5 keywords are allowed. For example: pop,life,count
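
The column-selection conditions above can be sketched in pandas (dplyr's `select` with helpers like `contains`, `ends_with`, and `matches` is the R counterpart); the keyword set mirrors the example on the slide:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand"], "year": [2007],
    "lifeExp": [70.6], "pop": [65_068_149], "gdpPercap": [7458.0],
})

by_name  = df[["country", "lifeExp"]]        # specific variable names
contains = df.filter(like="life")            # names containing a string
ends     = df.filter(regex="cap$")           # names ending with "cap"
pattern  = df.filter(regex=r"^(pop|life)")   # regular-expression match
keywords = df.filter(regex="pop|life|count") # several keywords at once
drop_one = df.drop(columns=["gdpPercap"])    # all columns except certain ones
```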

Summarise Data

1. Select the variable you want to summarise

2. Select the statistic you want to compute

3. Choose whether or not to group the calculation by the variable country
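
These three steps can be sketched in pandas (dplyr's `group_by` + `summarise` is the R counterpart), here with the mean of `lifeExp` as the chosen statistic and illustrative values:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Thailand", "Thailand", "China", "China"],
    "lifeExp": [68.6, 70.6, 72.0, 73.0],
})

# Mean life expectancy over all rows (no grouping)
overall = df["lifeExp"].mean()

# The same statistic, grouped by country
by_country = df.groupby("country")["lifeExp"].mean()
```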

Create New Variables

Examples of New Variables You Can Create

1. Total GDP of a Country (gdpTotal)

\[\tiny \text{gdpTotal} = \text{gdpPercap} \times \text{pop}\]

Meaning: Calculate the total GDP of a country using gdpPercap × pop

2. Convert Population to Millions (pop_million)

\[\tiny \text{pop\_million} = \dfrac{\text{pop}}{1{,}000{,}000}\]

Meaning: Convert the population unit from individuals to millions by dividing pop by 1,000,000

New Data Set

Data Set A

Data Set B

Given that:

  • Data set A contains variables x1 and x2
  • Data set B contains variables x1 and x3

Combining Data Sets

Left join: Used to combine data from two tables (data frames) by keeping all rows from the main (left) table and matching rows from the secondary (right) table based on a shared key.


Right join: Used to combine data from two tables (data frames) by keeping all rows from the secondary (right) table and matching rows from the main (left) table based on a shared key.


Inner join: Used to combine data from two tables (data frames) by keeping only the rows that have matching values in the key column used for joining.


Full join: Used to combine data from two tables (data frames) by keeping all rows from both tables, regardless of whether the key values match. If a value in the key column does not exist in the other table, NA will be filled in for missing values.
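
The four joins can be sketched in pandas with `merge` (dplyr's `left_join`, `right_join`, `inner_join`, and `full_join` are the R counterparts), using the A/B setup from the slide; the key values themselves are illustrative:

```python
import pandas as pd

# Data set A has x1 and x2; data set B has x1 and x3; x1 is the shared key
A = pd.DataFrame({"x1": ["a", "b", "c"], "x2": [1, 2, 3]})
B = pd.DataFrame({"x1": ["a", "b", "d"], "x3": [True, False, True]})

left  = A.merge(B, on="x1", how="left")   # keep all rows of A
right = A.merge(B, on="x1", how="right")  # keep all rows of B
inner = A.merge(B, on="x1", how="inner")  # keep only matching keys
full  = A.merge(B, on="x1", how="outer")  # keep all rows; NA where no match
```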


Why Do We Need One-Hot Encoding?

Problem with Categorical Data

  • Machine learning models can’t interpret text labels like "Regular", "VIP", "New" as-is.

  • Using numeric labels like 1, 2, 3 creates false ordinal relationships: It implies "VIP" > "Regular", which may not be true.

One-Hot Encoding Solves This

  • Transforms categories into independent binary variables

  • Avoids implying ranking or distance between categories

  • Makes data compatible with algorithms like logistic regression, decision trees, neural networks

Example: Customer Type

Data: a customer_type column with categorical values such as "Regular", "VIP", and "New"

\(\rightarrow\)

One-hot encoded: one binary column per category (e.g. customer_type_VIP), equal to 1 when the row belongs to that category and 0 otherwise
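
One-hot encoding can be sketched in pandas with `get_dummies`, using the customer-type categories from the slide (the particular rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"customer_type": ["Regular", "VIP", "New", "VIP"]})

# One binary (0/1) column per category, with no implied ordering
encoded = pd.get_dummies(df, columns=["customer_type"], dtype=int)
```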

Business Use Cases

  • Customer segmentation

  • Predicting churn or purchase behavior

  • Classifying transactions or complaints by category

Why Do We Need to Normalize Data?

1. Ensures Fair Comparison

  • Variables like Sales (in thousands) and Customer Age (in years) have different scales.

  • Without normalization, larger values dominate model training.

2. Improves Model Performance

  • Algorithms such as K-Means, K-NN, SVM, and Neural Networks rely on distances or gradients.

  • Normalized data ensures faster convergence and more accurate results.

3. Prevents Bias from Feature Magnitude

  • A feature with large numbers (e.g. income) may overpower smaller-scaled features (e.g. purchase frequency).
  • Normalization gives each feature equal weight.

Normalization vs. Standardization Formulas

1. Min-Max Normalization

Scales values to a fixed range, usually \([0, 1]\):

\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]

Use when range matters (e.g., neural networks)

Data: a numeric column on its original scale

\(\rightarrow\)

Min-Max Normalization: the same values rescaled to the interval \([0, 1]\)
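
The min-max formula above can be sketched directly in pandas; the `sales` values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"sales": [100.0, 150.0, 200.0, 300.0]})

# x' = (x - min(x)) / (max(x) - min(x)), scaled to [0, 1]
df["sales_minmax"] = (df["sales"] - df["sales"].min()) / (
    df["sales"].max() - df["sales"].min()
)
```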

2. Z-Score Standardization

Also known as standard scaling:

\[ x' = \frac{x - \mu}{\sigma} \]

Where:

  • \(\mu\) is the mean

  • \(\sigma\) is the standard deviation (sd)

Use when data has different units or outliers

Data: a numeric column on its original scale

\(\rightarrow\)

Standardization: the same values rescaled to mean 0 and standard deviation 1
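
The z-score formula above can likewise be sketched in pandas; note that `.std()` uses the sample standard deviation (ddof=1), and the `age` values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0]})

# x' = (x - mu) / sigma, giving mean 0 and (sample) sd 1
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()
```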

References

  • Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation (Version 1.1.4) [Computer software]. Retrieved from https://dplyr.tidyverse.org

  • Wickham, H., Vaughan, D., & Girlich, M. (2024). tidyr: Tidy Messy Data (Version 1.3.1) [Computer software]. Retrieved from https://tidyr.tidyverse.org