International College of Digital Innovation, CMU
September 7, 2025
Data wrangling, also known as data cleaning or data remediation, is the process of transforming and refining raw data to make it suitable for further use.
Merging data from multiple sources into a single dataset for analysis
Identifying missing values (e.g., blank cells in a spreadsheet) to either fill in or remove them
Removing unnecessary or irrelevant data that is not related to the project at hand
Detecting extreme outliers in the data and explaining or removing them to allow for proper analysis
Reshaping Data refers to modifying or transforming the structure of a dataset to make it more suitable for analysis or presentation.
Key Concept of Reshaping Data
Reshaping often involves converting data from one format to another, such as:
Wide format → Long format (transforming data from wide to long structure)
Long format → Wide format (transforming data from long to wide structure)
The columns lifeExp, pop, and gdpPercap, which were previously separate, are gathered into a single column called metric.
The values of these variables are moved into a new column called value.
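A minimal sketch of this wide-to-long step with tidyr's pivot_longer(), assuming the gapminder package supplies the data (the lecture's actual data object may differ):

```r
library(tidyr)
library(gapminder)  # assumed data source

# Gather lifeExp, pop, and gdpPercap into metric/value pairs.
long <- gapminder |>
  pivot_longer(
    cols = c(lifeExp, pop, gdpPercap),  # columns to gather
    names_to = "metric",                # old column names go here
    values_to = "value"                 # old cell values go here
  )
```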
The values that were previously stored in the metric column are now spread back into separate columns: lifeExp, pop, and gdpPercap.
The values from the value column are distributed back to their corresponding variables.
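The reverse step uses tidyr's pivot_wider(); this sketch assumes the long object created above:

```r
# Spread metric/value pairs back into separate columns.
wide <- long |>
  pivot_wider(
    names_from = metric,   # values of metric become column names
    values_from = value    # value supplies the cells
  )
```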
You can sort the data by columns such as year, lifeExp, or pop.
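A minimal sorting sketch with dplyr's arrange(); the chosen sort columns are illustrative:

```r
library(dplyr)
library(gapminder)  # assumed data source

# Sort by year, then by life expectancy (descending) within each year.
gapminder |>
  arrange(year, desc(lifeExp))
```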
Select rows that match a logical condition, e.g., keeping only rows from a given year onward
Randomly select a sample as a fixed fraction of the entire dataset
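A sketch of both row operations with dplyr; the year threshold and sampling fraction are illustrative, not values from the lecture:

```r
library(dplyr)
library(gapminder)  # assumed data source

# Keep rows that satisfy a logical condition.
gapminder |> filter(year >= 2000)       # threshold is illustrative

# Randomly sample a fraction of all rows.
gapminder |> slice_sample(prop = 0.10)  # fraction is illustrative
```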
Select columns by specific variable names
Select columns that contain a specified string in their names
Select columns that end with a specific character or string
Select all columns
Select columns that match a specified pattern (Regular Expression)
Select columns from a defined list of names
Select all columns except certain ones
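One sketch per selection rule above, using dplyr's select() helpers on gapminder column names:

```r
library(dplyr)
library(gapminder)  # assumed data source

gapminder |> select(year, pop)                 # specific variable names
gapminder |> select(contains("gdp"))           # names containing a string
gapminder |> select(ends_with("Exp"))          # names ending with a string
gapminder |> select(everything())              # all columns
gapminder |> select(matches("^p"))             # names matching a regex
gapminder |> select(any_of(c("year", "pop")))  # from a defined list
gapminder |> select(-continent)                # all except certain ones
```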
1. Total GDP of a Country (gdpTotal)
\[\tiny \text{gdpTotal} = \text{gdpPercap} \times \text{pop}\]
Meaning: Calculate the total GDP of a country using gdpPercap × pop
2. Convert Population to Millions (pop_million)
\[\tiny \text{pop\_million} = \dfrac{\text{pop}}{1{,}000{,}000}\]
Meaning: Convert the population unit from individuals to millions by dividing pop by 1,000,000
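Both derived columns in a single dplyr mutate() call:

```r
library(dplyr)
library(gapminder)  # assumed data source

gapminder |>
  mutate(
    gdpTotal    = gdpPercap * pop,  # total GDP of a country
    pop_million = pop / 1e6         # population in millions
  )
```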
Given two data frames that share the key column x1:
Data Set A with columns x1 and x2
Data Set B with columns x1 and x3
Left join: Used to combine data from two tables (data frames) by keeping all rows from the main (left) table and matching rows from the secondary (right) table based on a shared key.
Right join: Used to combine data from two tables (data frames) by keeping all rows from the secondary (right) table and matching rows from the main (left) table based on a shared key.
Inner join: Used to combine data from two tables (data frames) by keeping only the rows that have matching values in the key column used for joining.
Full join: Used to combine data from two tables (data frames) by keeping all rows from both tables, regardless of whether the key values match. If a value in the key column does not exist in the other table, NA will be filled in for missing values.
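A sketch of all four joins on toy versions of Data Set A and Data Set B; the row values are made up for illustration, and only the column layout (x1, x2 vs. x1, x3) comes from the setup above:

```r
library(dplyr)

# Hypothetical rows; A and B share the key column x1.
A <- data.frame(x1 = c("a", "b", "c"), x2 = 1:3)
B <- data.frame(x1 = c("a", "b", "d"), x3 = c(TRUE, FALSE, TRUE))

left_join(A, B, by = "x1")   # all rows of A; NA where B has no match
right_join(A, B, by = "x1")  # all rows of B; NA where A has no match
inner_join(A, B, by = "x1")  # only keys found in both tables
full_join(A, B, by = "x1")   # all rows of both; NA fills the gaps
```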
Problem with Categorical Data
Machine learning models can’t interpret text labels like "Regular", "VIP", "New" as-is.
Using numeric labels like 1, 2, 3 creates false ordinal relationships: It implies "VIP" > "Regular", which may not be true.
One-Hot Encoding
Transforms categories into independent binary variables
Avoids implying ranking or distance between categories
Makes data compatible with algorithms like logistic regression, decision trees, neural networks
Example: Customer Type
Data \(\rightarrow\) One-hot encoded data
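A minimal sketch using base R's model.matrix(); the customers data frame is hypothetical, built from the three labels in the example above:

```r
# Hypothetical data with one categorical column.
customers <- data.frame(
  customer_type = factor(c("Regular", "VIP", "New"))
)

# "~ customer_type - 1" drops the intercept, so every level
# becomes its own binary indicator column.
one_hot <- model.matrix(~ customer_type - 1, data = customers)
one_hot
```

Each category now occupies its own 0/1 column, so no ordering or distance between categories is implied.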
Customer segmentation
Predicting churn or purchase behavior
Classifying transactions or complaints by category
1. Ensures Fair Comparison
Variables like Sales (in thousands) and Customer Age (in years) have different scales.
Without normalization, larger values dominate model training.
2. Improves Model Performance
Algorithms such as K-Means, K-NN, SVM, and Neural Networks rely on distances or gradients.
Normalized data ensures faster convergence and more accurate results.
3. Prevents Bias from Feature Magnitude
1. Min-Max Normalization
Scales values to a fixed range, usually \([0, 1]\):
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]
Use when range matters (e.g., neural networks)
Data \(\rightarrow\) Min-max normalized data
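A minimal sketch of the formula in base R; the x values are illustrative:

```r
# Scale a numeric vector to [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(10, 20, 35, 50)  # illustrative values
min_max(x)              # 0.000 0.250 0.625 1.000
```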
2. Z-Score Standardization
Also known as standard scaling:
\[ x' = \frac{x - \mu}{\sigma} \]
Where:
\(\mu\) is the mean
\(\sigma\) is the standard deviation (sd)
Use when data has different units or outliers
Data \(\rightarrow\) Standardized data
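The same illustrative vector, standardized; base R's scale() gives an equivalent result:

```r
x <- c(10, 20, 35, 50)  # illustrative values

# Subtract the mean, divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# scale() does the same (it returns a one-column matrix).
as.vector(scale(x))
```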
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation (Version 1.1.4) [Computer software]. Retrieved from https://dplyr.tidyverse.org
Wickham, H., Vaughan, D., & Girlich, M. (2024). tidyr: Tidy Messy Data (Version 1.3.1) [Computer software]. Retrieved from https://tidyr.tidyverse.org