International College of Digital Innovation, CMU
April 8, 2025
“The results from a data frame or tibble are shown as a table using the kable()
function from the knitr
package.”
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
Five verbs:
filter()
picks cases based on their values.
mutate()
adds new variables that are functions of existing variables
arrange()
changes the ordering of the rows.
select()
picks variables based on their names.
summarise()
reduces multiple values down to a single summary.
How to use this package
The filter()
function is used to subset rows in a data frame (or tibble) based on conditions.
It allows you to keep only those rows that meet the specified conditions.
filter()
returns a data frame or tibble with only the rows that satisfy the condition(s).
You can chain filter()
with other dplyr functions using the pipe operator |>
for more complex data manipulation tasks.
From the data frame Data
, we need only group A
The standard code
Use Pipe operator
We need only group A
, and the score more than 80
The standard code
Use pipe operator
From the previous code, assign the result to the variable Data2
From Data2, we need the score less than 85
From the previous code, the score not equal 83 too
The mutate()
function is used to create new columns or modify existing ones in a data frame or tibble.
It allows you to perform transformations on your data and add the results as new variables.
Example: From the data frame Data
, reducing the score from everyone 20.
if the score are more than or equal 70, your grade is A, else B.
More condition:
score | grade |
---|---|
75+ | A |
71-74 | B+ |
66-70 | B |
<66 | C |
case_when() function
case_when()
is a super useful function in the dplyr
package in R.
It’s like a vectorized version of if...else if...else
, and is used mainly for conditional transformations inside mutate()
or transmute()
.
Basic Syntax
Each condition is followed by ~
and the value to assign if that condition is TRUE.
The last condition should usually be TRUE
(like an else
).
Common mistakes:
Don’t forget the TRUE ~ ...
for default values.
All return values must be of the same type (e.g., all character, or all numeric).
case_when()
evaluates conditions in order, and stops at the first TRUE
.
When to use case_when()
:
Recoding/categorizing variables
Creating new columns with conditions
Replacing nested ifelse()
statements with clearer syntax
The select()
function is used to choose specific columns from a data frame or tibble.
It allows you to keep only the columns you are interested in, either by naming them directly or using helper functions for more advanced selections.
select()
allows for both inclusion and exclusion of columns using positive or negative column names.
Helper functions like starts_with()
, ends_with()
, contains()
, matches()
, and where()
can be used within select()
for more flexible column selection.
The select()
function can be used in combination with other dplyr functions using the pipe operator |>
for streamlined data manipulation tasks.
Selecting the variable group and grade from the data frame Data
From the Data, I need only group B, and grade more than B, and select the variable group and grade only.
The arrange()
function is used to reorder the rows of a data frame or tibble based on the values of one or more columns.
It can sort the data in ascending or descending order.
arrange()
orders rows based on the specified columns. The default order is ascending, but you can use desc()
to sort in descending order.
You can sort by multiple columns by listing them in the order of priority.
arrange()
can be combined with other dplyr functions using the pipe operator |>
for more complex data processing workflows.
From the mtcars
dataset, I need the variables cyl
, gear
, mpg
, and disp
assigned to the object Data3
.
Sorting Data3
by the variable cyl
(from minimum to maximum).
Sorting Data3 by the variable cyl (from maximum to minimum).
Sorting Data3
by the variables cyl
and gear
(from minimum to maximum).
The constant vector we should known.
Sorting data frame MONTH by month.abb
Used to find the unique values in a categorical variable from a data frame.
In R, a factor
is a data structure used to represent categorical variables.
Factors are particularly useful when you have data that falls into a limited number of distinct categories, such as gender, blood type, or education level.
Basic Syntax:
x
is a vector of character
levels
is the list of unique value in the vector and ordering from left to right.
Levels: Factors store the unique values as levels. If a value is not included in the levels, it will be treated as NA.
Ordered vs. Unordered: Factors can be unordered (nominal) or ordered (ordinal). Ordering affects how comparisons are made between levels.
Factors and Models: Factors are commonly used in statistical modeling because they help R understand the categorical nature of the variable.
Factors are a powerful way to handle categorical data in R, especially when you need to ensure that the data is treated as a categorical variable in analyses or visualizations.
Change variable month in MONTH2, from character to factor.
The summarize()
(or summarise()
) function is used to create a summary statistic of a data frame or tibble.
It reduces multiple values down to a single value per group, often used in conjunction with the group_by()
function.
Grouping: When used withgroup_by()
, summarize()
can generate summary statistics for each group in your data.
Aggregation: Common summary functions include mean()
, sum()
, min()
, max()
, median()
, sd()
, and n()
.
Output: The output of summarize()
is a data frame or tibble with one row per group (or a single row if not grouped) .
Custom Functions: You can use custom functions within summarize()
to create more specific summaries. summarize() is a powerful tool for aggregating data and obtaining summary statistics, making it essential for data analysis workflows.
Summary Statistics with summary() function
I need mean, median, SD, VAR from the variable disp
.
Compute summary statistic based on the variable cyl
Compute summary statistic based on the variable on the variables cyl
and gear
mean()
: The average value of numeric variables.
sd()
: The standard deviation
var()
: The variance
median()
: The demian
min()
: The minimum value from the variable.
max()
: The maximum value from the variable.
quantile()
: Find the quantile value at level \(p \in(0,1)\)
n()
: Count the number of observations in a group or variable (an internal function to use within functions from the dplyr
package)
From mtcars dataset,
selecting the cyl = 4
only,
selecting mpg
and gear
,
create kpg
(kilo M. per gallon) (1 mpg (US) = 1.6093 kilometers per gallon (US))
compute (n, mean, quantile 25, 75 amd SD) of kpg separate by gear
To join data frames using the dplyr
package in R, you can use several types of joins depending on how you want to combine the data.
This course we need only 4 ways.
Inner Join: inner_join()
Left Join: left_join()
Right Join: right_join()
Full Join: full_join()
Example
Created two data frames, df1
and df2
:
Returns rows with matching keys in both data frames.
Returns all rows from the left data frame and the matching rows from the right data frame.
Returns all rows from the right data frame and the matching rows from the left data frame.
Returns all rows when there is a match in either data frame.
Task: Filter the rows of the mtcars
dataset where the number of cylinders (cyl
) is 6 and the miles per gallon (mpg
) is greater than 20.
Task: Create a new variable in the mtcars
dataset called hp_per_cyl
that represents horsepower (hp
) per cylinder (cyl
).
Task: Select only the columns mpg
, hp
, and wt
from the mtcars
dataset.
Task: Arrange the mtcars
dataset in ascending order of mpg
and then by descending order of hp
.
Task: Calculate the mean horsepower (hp
) for each number of cylinders (cyl
) in the mtcars
dataset.
Task: Filter the mtcars
dataset for cars with more than 4 gears, then create a variable mpg_per_wt
that represents miles per gallon per unit of weight.
Task: Select the mpg
, hp
, and wt
columns from the mtcars
dataset and then arrange the resulting data in descending order of wt
.
Task: Filter the mtcars
dataset for cars with more than 100 horsepower, create a variable hp_per_mpg
as the ratio of horsepower to miles per gallon, and then calculate the average hp_per_mpg
for each number of cylinders.
Task: Create a variable in the mtcars
dataset called performance
that categorizes cars as "High"
if horsepower (hp
) is above 150, and "Low"
otherwise.
Hint: Use the function if_else()
or ifelse
Task:
Filter the mtcars
dataset for cars with more than 4 cylinders.
Select the mpg
, hp
, and wt
columns.
Create a variable efficiency
as the ratio of mpg
to wt
, arrange the data by efficiency
.
Calculate the average efficiency
for each number of gears.
Task:
Using the mtcars
dataset, filter for cars with more than 4 cylinders and manual transmission (am == 1
).
Create a variable efficiency_ratio
as the ratio of miles per gallon (mpg
) to weight (wt
).
Select the columns cyl
, mpg
, wt
, and efficiency_ratio
, then arrange the data by efficiency_ratio
in descending order.
Finally, calculate the average efficiency_ratio
for each number of cylinders.
Task:
Filter the mtcars
dataset for cars with more than 6 cylinders.
Create a variable performance_index
as the product of horsepower (hp
) and quarter-mile time (qsec
).
Select the columns cyl
, hp
, qsec
, and performance_index
.
Arrange the data by performance_index
in ascending order.
Calculate the median performance_index
for each level of cylinders.
Task:
Filter the mtcars
dataset for cars with miles per gallon (mpg
) greater than 20.
Create a variable cost_efficiency
as the ratio of horsepower (hp
) to miles per gallon (mpg
).
Select the columns mpg
, hp
, wt
, and cost_efficiency
.
Arrange the data by cost_efficiency
in descending order.
Finally, calculate the maximum cost_efficiency
for each number of gears.
Task:
Using the mtcars
dataset, filter for cars with a weight (wt
) greater than 3.
Create a variable weight_to_power
as the ratio of weight (wt
) to horsepower (hp
).
Select the columns wt
, hp
, and weight_to_power
.
Arrange the data by weight_to_power
in ascending order. Calculate the average weight_to_power
for each number of gears.
Task:
Filter the mtcars
dataset for cars with 4 or 6 cylinders.
Create a variable power_efficiency
as the ratio of horsepower (hp
) to miles per gallon (mpg
).
Select the columns cyl
, hp
, mpg
, and power_efficiency
.
Arrange the data by power_efficiency
in descending order.
Calculate the average power_efficiency
for each number of cylinders.
All animations are from gadenbuie