\(~~~~~~~~~~~\)Data Wrangling\(~~~~~~~~~~~\)

Somsak Chanaim

International College of Digital Innovation, CMU

April 8, 2025

Warning

“The results from a data frame or tibble are shown as a table using the kable() function from the knitr package.”

The dplyr Package

What is the dplyr package

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

Five verbs:

  • filter() picks cases based on their values.

  • mutate() adds new variables that are functions of existing variables

  • arrange() changes the ordering of the rows.

  • select() picks variables based on their names.

  • summarise() reduces multiple values down to a single summary.

How to install and use this package

How to use this package

Data frame for learning

The filter() function

The filter() function is used to subset rows in a data frame (or tibble) based on conditions.

It allows you to keep only those rows that meet the specified conditions.

  • filter()returns a data frame or tibble with only the rows that satisfy the condition(s).

  • You can chain filter() with other dplyr functions using the pipe operator |> for more complex data manipulation tasks.

Basic Syntax:

From the data frame Data, we need only group A

The standard code

Use Pipe operator

We need only group A, and the score more than 80

The standard code

Use pipe operator

From the previous code, assign the result to the variable Data2

From Data2, we need the score less than 85

From the previous code, the score not equal 83 too

The mutate() function

The mutate() function is used to create new columns or modify existing ones in a data frame or tibble.

It allows you to perform transformations on your data and add the results as new variables.

Basic Syntax:

Example: From the data frame Data, reducing the score from everyone 20.

if the score are more than or equal 70, your grade is A, else B.

More condition:

score grade
75+ A
71-74 B+
66-70 B
<66 C

case_when() function

case_when() is a super useful function in the dplyr package in R.

It’s like a vectorized version of if...else if...else, and is used mainly for conditional transformations inside mutate() or transmute().

Basic Syntax

  • Each condition is followed by ~ and the value to assign if that condition is TRUE.

  • The last condition should usually be TRUE (like an else).

Common mistakes:

  • Don’t forget the TRUE ~ ... for default values.

  • All return values must be of the same type (e.g., all character, or all numeric).

  • case_when() evaluates conditions in order, and stops at the first TRUE.

When to use case_when():

  • Recoding/categorizing variables

  • Creating new columns with conditions

  • Replacing nested ifelse() statements with clearer syntax

The select() function

The select() function is used to choose specific columns from a data frame or tibble.

It allows you to keep only the columns you are interested in, either by naming them directly or using helper functions for more advanced selections.

  • select() allows for both inclusion and exclusion of columns using positive or negative column names.

  • Helper functions like starts_with(), ends_with(), contains(), matches(), and where() can be used within select() for more flexible column selection.

  • The select() function can be used in combination with other dplyr functions using the pipe operator |> for streamlined data manipulation tasks.

Basic Syntax:

Selecting the variable group and grade from the data frame Data

From the Data, I need only group B, and grade more than B, and select the variable group and grade only.

The arrange() function

The arrange() function is used to reorder the rows of a data frame or tibble based on the values of one or more columns.

It can sort the data in ascending or descending order.

  • arrange() orders rows based on the specified columns. The default order is ascending, but you can use desc() to sort in descending order.

  • You can sort by multiple columns by listing them in the order of priority.

  • arrange() can be combined with other dplyr functions using the pipe operator |> for more complex data processing workflows.

Basic Syntax:

From the mtcars dataset, I need the variables cyl, gear, mpg, and disp assigned to the object Data3.

Sorting Data3 by the variable cyl (from minimum to maximum).

Sorting Data3 by the variable cyl (from maximum to minimum).

Sorting Data3 by the variables cyl and gear (from minimum to maximum).

The constant vector we should known.

Sorting data frame MONTH by month.abb

The unique() function

Used to find the unique values in a categorical variable from a data frame.

The factor() variable in R

In R, a factor is a data structure used to represent categorical variables.

Factors are particularly useful when you have data that falls into a limited number of distinct categories, such as gender, blood type, or education level.

Basic Syntax:

  • x is a vector of character

  • levels is the list of unique value in the vector and ordering from left to right.

Levels: Factors store the unique values as levels. If a value is not included in the levels, it will be treated as NA.

Ordered vs. Unordered: Factors can be unordered (nominal) or ordered (ordinal). Ordering affects how comparisons are made between levels.

Factors and Models: Factors are commonly used in statistical modeling because they help R understand the categorical nature of the variable.

Factors are a powerful way to handle categorical data in R, especially when you need to ensure that the data is treated as a categorical variable in analyses or visualizations.

Change variable month in MONTH2, from character to factor.

The summarize() function

The summarize() (or summarise()) function is used to create a summary statistic of a data frame or tibble.

It reduces multiple values down to a single value per group, often used in conjunction with the group_by() function.

Basic Syntax:

Grouping: When used withgroup_by(), summarize() can generate summary statistics for each group in your data.

Aggregation: Common summary functions include mean(), sum(), min(), max(), median(), sd(), and n().

Output: The output of summarize() is a data frame or tibble with one row per group (or a single row if not grouped) .

Custom Functions: You can use custom functions within summarize() to create more specific summaries. summarize() is a powerful tool for aggregating data and obtaining summary statistics, making it essential for data analysis workflows.

Summary Statistics with summary() function

I need mean, median, SD, VAR from the variable disp.

Compute summary statistic based on the variable cyl

Compute summary statistic based on the variable on the variables cyl and gear

stat() function in R

  • mean(): The average value of numeric variables.

  • sd(): The standard deviation

  • var(): The variance

  • median(): The demian

  • min(): The minimum value from the variable.

  • max(): The maximum value from the variable.

  • quantile(): Find the quantile value at level \(p \in(0,1)\)

  • n(): Count the number of observations in a group or variable (an internal function to use within functions from the dplyr package)

Question

From mtcars dataset,

selecting the cyl = 4 only,

selecting mpg and gear,

create kpg (kilo M. per gallon) (1 mpg (US) = 1.6093 kilometers per gallon (US))

compute (n, mean, quantile 25, 75 amd SD) of kpg separate by gear

Join Data Frame

Join data frame with the function from dplyr package

Join Data Frame funtion

To join data frames using the dplyr package in R, you can use several types of joins depending on how you want to combine the data.

This course we need only 4 ways.

  1. Inner Join: inner_join()

  2. Left Join: left_join()

  3. Right Join: right_join()

  4. Full Join: full_join()

Example

Created two data frames, df1 and df2:

1. Inner Join: inner_join()

Returns rows with matching keys in both data frames.

2. Left Join: left_join()

Returns all rows from the left data frame and the matching rows from the right data frame.

3. Right Join: right_join()

Returns all rows from the right data frame and the matching rows from the left data frame.

4. Full Join: full_join()

Returns all rows when there is a match in either data frame.

Exercise: Part 1

Exercise 1: Using filter()

Task: Filter the rows of the mtcars dataset where the number of cylinders (cyl) is 6 and the miles per gallon (mpg) is greater than 20.

Solution:

mtcars |> 
       filter( cyl == 6, mpg > 20)

Exercise 2: Using mutate()

Task: Create a new variable in the mtcars dataset called hp_per_cyl that represents horsepower (hp) per cylinder (cyl).

Solution:

mtcars |> 
         mutate( hp_per_cyl = hp / cyl)

Exercise 3: Using select()

Task: Select only the columns mpg, hp, and wt from the mtcars dataset.

Solution:

mtcars |> 
       select(mpg, hp, wt)

Exercise 4: Using arrange()

Task: Arrange the mtcars dataset in ascending order of mpg and then by descending order of hp.

Solution:

mtcars |> 
        arrange( mpg, desc(hp))

Exercise 5: Using summarize()

Task: Calculate the mean horsepower (hp) for each number of cylinders (cyl) in the mtcars dataset.

Solution:

mtcars |> 
        group_by(cyl) |> 
        summarize(mean_hp = mean(hp))

Exercise: Part 2

Exercise 6: Combining filter() and mutate()

Task: Filter the mtcars dataset for cars with more than 4 gears, then create a variable mpg_per_wt that represents miles per gallon per unit of weight.

Solution:

mtcars |> 
        filter(gear > 4) |> 
        mutate(mpg_per_wt = mpg / wt)

Exercise 7: Combining select() and arrange()

Task: Select the mpg, hp, and wt columns from the mtcars dataset and then arrange the resulting data in descending order of wt.

Solution:

mtcars |> 
       select(mpg, hp, wt) |> 
       arrange(desc(wt))

Exercise 8: Combining filter(), mutate(), and summarize()

Task: Filter the mtcars dataset for cars with more than 100 horsepower, create a variable hp_per_mpg as the ratio of horsepower to miles per gallon, and then calculate the average hp_per_mpg for each number of cylinders.

Solution:

mtcars |> 
       filter(hp > 100) |> 
       mutate(hp_per_mpg = hp / mpg) |> 
       group_by(cyl) |> 
       summarize(avg_hp_per_mpg = mean(hp_per_mpg))

Exercise 9: Using mutate() with conditional logic

Task: Create a variable in the mtcars dataset called performance that categorizes cars as "High" if horsepower (hp) is above 150, and "Low" otherwise.

Hint: Use the function if_else() or ifelse

Solution:

mtcars |> 
         mutate(performance = if_else(hp > 150, "High", "Low"))

Exercise 10: Combining All Functions

Task:

  • Filter the mtcars dataset for cars with more than 4 cylinders.

  • Select the mpg, hp, and wt columns.

  • Create a variable efficiency as the ratio of mpg to wt, arrange the data by efficiency.

  • Calculate the average efficiency for each number of gears.

Solution:

mtcars |> 
       filter(cyl > 4) |> 
       select(mpg, hp, wt, gear) |> 
       mutate(efficiency = mpg / wt) |> 
       arrange(efficiency) |> 
       group_by(gear) |> 
       summarize(avg_efficiency = mean(efficiency))

Exercise: Part 3

Exercise 11: Analyze Fuel Efficiency

Task:

  • Using the mtcars dataset, filter for cars with more than 4 cylinders and manual transmission (am == 1).

  • Create a variable efficiency_ratio as the ratio of miles per gallon (mpg) to weight (wt).

  • Select the columns cyl, mpg, wt, and efficiency_ratio, then arrange the data by efficiency_ratio in descending order.

  • Finally, calculate the average efficiency_ratio for each number of cylinders.

Solution:

library(dplyr)
result <- mtcars |>
          filter(cyl > 4, am == 1) |>
          mutate(efficiency_ratio = mpg / wt) |>
          select(cyl, mpg, wt, efficiency_ratio) |>
          arrange(desc(efficiency_ratio)) |>
          group_by(cyl) |>
          summarize(avg_efficiency = mean(efficiency_ratio))
result

Exercise 12: Performance Analysis

Task:

  • Filter the mtcars dataset for cars with more than 6 cylinders.

  • Create a variable performance_index as the product of horsepower (hp) and quarter-mile time (qsec).

  • Select the columns cyl, hp, qsec, and performance_index.

  • Arrange the data by performance_index in ascending order.

  • Calculate the median performance_index for each level of cylinders.

Solution:

result <- mtcars |>
          filter(cyl > 6) |>
          mutate(performance_index = hp * qsec) |>
          select(cyl, hp, qsec, performance_index) |>
          arrange(performance_index) |>
          group_by(cyl) |>
          summarize(median_performance = median(performance_index))
result

Exercise 13: Cost Efficiency

Task:

  • Filter the mtcars dataset for cars with miles per gallon (mpg) greater than 20.

  • Create a variable cost_efficiency as the ratio of horsepower (hp) to miles per gallon (mpg).

  • Select the columns mpg, hp, wt, and cost_efficiency.

  • Arrange the data by cost_efficiency in descending order.

  • Finally, calculate the maximum cost_efficiency for each number of gears.

Solution:

result <- mtcars |>
          filter(mpg > 20) |>
          mutate(cost_efficiency = hp / mpg) |>
          select(mpg, hp, wt, cost_efficiency, gear) |>
          arrange(desc(cost_efficiency)) |>
          group_by(gear) |>
          summarize(max_cost_efficiency = max(cost_efficiency))
result

Exercise 14: Weight-to-Power Ratio

Task:

  • Using the mtcars dataset, filter for cars with a weight (wt) greater than 3.

  • Create a variable weight_to_power as the ratio of weight (wt) to horsepower (hp).

  • Select the columns wt, hp, and weight_to_power.

  • Arrange the data by weight_to_power in ascending order. Calculate the average weight_to_power for each number of gears.

Solution:

result <- mtcars |>
          filter(wt > 3) |>
          mutate(weight_to_power = wt / hp) |>
          select(wt, hp, weight_to_power, gear) |>
          arrange(weight_to_power) |>
          group_by(gear) |>
          summarize(avg_weight_to_power = mean(weight_to_power))
result

Exercise 15: Power Efficiency by Cylinder

Task:

  • Filter the mtcars dataset for cars with 4 or 6 cylinders.

  • Create a variable power_efficiency as the ratio of horsepower (hp) to miles per gallon (mpg).

  • Select the columns cyl, hp, mpg, and power_efficiency.

  • Arrange the data by power_efficiency in descending order.

  • Calculate the average power_efficiency for each number of cylinders.

Solution:

result <- mtcars |>
          filter(cyl %in% c(4, 6)) |>
          mutate(power_efficiency = hp / mpg) |>
          select(cyl, hp, mpg, power_efficiency) |>
          arrange(desc(power_efficiency)) |>
          group_by(cyl) |>
          summarize(avg_power_efficiency = mean(power_efficiency))
result

Thank you

All animations are from gadenbuie