International College of Digital Innovation, CMU
October 30, 2024
The ggplot2 package is one of the most popular and powerful data visualization packages in R.
It is based on the “Grammar of Graphics,” a framework that breaks down graphs into components such as scales, layers, and themes.
This approach allows users to build complex and customized plots in a systematic and consistent way.
Benefits of Using ggplot2:
Consistency: The grammar of graphics approach makes it easier to build and understand plots.
Flexibility: It allows for extensive customization, from simple plots to complex multi-layered visualizations.
Community Support: As one of the most widely used R packages, ggplot2 has a large community, extensive documentation, and numerous tutorials and examples.
Open your R console or RStudio and run the following command:
This command will download and install ggplot2.
Load the Package
After installation, you need to load the package into your R session using the library() function:
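A minimal sketch of both steps in one script. The requireNamespace() check is an addition here, so that the package is only downloaded when it is missing:

```r
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package into the current R session
library(ggplot2)
```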
The basic structure typically involves the following components:
Data: The dataset that you want to visualize.
Aesthetics (aes): Mappings of data variables to visual properties like x and y coordinates, colors, sizes, and shapes.
Geometries (geom): The type of plot or visual elements to represent the data (e.g., points, lines, bars).
Facets: Optional; used to create multiple plots based on subsets of the data.
Scales: Optional; used to control the mapping of data to aesthetics.
Coordinates: Optional; control the coordinate system.
Themes: Optional; used to customize the appearance of the plot.
Let’s break down an example of creating a scatter plot using ggplot2:
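A minimal sketch of such a scatter plot, assuming the built-in mtcars dataset. The choice of variables (wt, mpg, cyl) and the object name p_scatter are illustrative only:

```r
library(ggplot2)

# Data: the built-in mtcars dataset
p_scatter <- ggplot(mtcars) +
  # Aesthetics: map weight to x, miles per gallon to y, cylinders to color
  aes(x = wt, y = mpg, color = factor(cyl)) +
  # Geometry: one point per car
  geom_point(size = 2) +
  # Optional components: labels and a theme
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders") +
  theme_minimal()

p_scatter
```

Each component is added with +, so a layer can be removed or swapped without touching the rest of the plot.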
Basic Structure
After loading the ggplot2 package:
Explanation:
geom_histogram() creates the histogram.
bins = 30 specifies the number of bins instead of using binwidth.
fill = "skyblue" sets the fill color of the bars.
color = "black" outlines the bars in black.
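The arguments explained above can be put together as in the following sketch. The simulated N(0,1) sample, the seed, and the names df and p_hist are assumptions made for illustration:

```r
library(ggplot2)

# Simulated data: 1000 draws from a standard normal distribution
set.seed(123)
df <- data.frame(x = rnorm(1000))

p_hist <- ggplot(df) +
  aes(x = x) +
  geom_histogram(
    bins = 30,          # number of bins
    fill = "skyblue",   # fill color of the bars
    color = "black"     # outline color of the bars
  )

p_hist
```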
After creating the ggplot object, we may need to add or modify titles, axis labels, legend titles, or captions.
The labs() function customizes the labels of these plot elements.
Basic Syntax
labs(
title = NULL, # Title of the plot
subtitle = NULL, # Subtitle of the plot
x = NULL, # Label for the x-axis
y = NULL, # Label for the y-axis
caption = NULL, # Caption at the bottom of the plot
tag = NULL, # Tag for the plot (like a figure number)
fill = NULL, # Label for fill legend (if applicable)
color = NULL, # Label for color legend (if applicable)
size = NULL, # Label for size legend (if applicable)
shape = NULL # Label for shape legend (if applicable)
)
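As a sketch of how these arguments are used in practice, the following adds labels to a histogram of the built-in mtcars data. The label texts and the name p_labeled are illustrative:

```r
library(ggplot2)

p_labeled <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of fuel economy",  # plot title
    subtitle = "mtcars dataset",             # subtitle under the title
    x = "Miles per gallon",                  # x-axis label
    y = "Number of cars",                    # y-axis label
    caption = "Source: mtcars"               # caption at the bottom
  )

p_labeled
```

Arguments left out of labs() keep their default labels.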
Question:
Add the title “The histogram of N(0,1)” and the caption ‘Your Name’.
The default theme is theme_gray().
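A short sketch of switching themes, assuming the built-in mtcars data. theme_minimal() and theme_bw() are two of the complete themes shipped with ggplot2; the names p_default, p_minimal, and p_bw are illustrative:

```r
library(ggplot2)

p_default <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2)

p_default                                  # drawn with theme_gray()
p_minimal <- p_default + theme_minimal()   # no background, light grid
p_bw <- p_default + theme_bw()             # white background, dark border

p_minimal
p_bw
```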
Task: Create a simple histogram of the mpg (miles per gallon) variable from the mtcars dataset. Set the bin width to 2 and use default colors.
Expected Outcome: A histogram that shows the distribution of miles per gallon for cars in the mtcars dataset.
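One possible solution, offered as a sketch rather than the official answer (the name p_ex1 is illustrative):

```r
library(ggplot2)

# Histogram of mpg with a bin width of 2 and default colors
p_ex1 <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2)

p_ex1
```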
Task: Modify the histogram created in Exercise 1 by changing the fill color to lightblue and the border color of the bars to black.
Expected Outcome: A histogram with lightblue bars and black borders.
Task: Overlay a density curve on top of the histogram of the mpg variable from the mtcars dataset. Use a bin width of 2, and set the density curve color to red.
Expected Outcome: A histogram with a red density curve overlaid, representing the smoothed distribution of mpg.
Solution
# Histogram with density curve overlay
ggplot(mtcars) +
aes(x = mpg, y = after_stat(density)) +
geom_histogram(
binwidth = 2,
fill = "lightblue",
color = "black",
alpha = 0.6) +
geom_density(color = "red") +
labs(title = "Histogram and Density of Miles Per Gallon",
x = "Miles Per Gallon",
y = "Density")
Task: Create faceted histograms of the mpg variable, separated by the number of cylinders (cyl) in the mtcars dataset. Each facet should show a histogram for a different number of cylinders.
Expected Outcome: A set of histograms, each representing the distribution of mpg for cars with 4, 6, and 8 cylinders, respectively.
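One possible solution, as a sketch. The bin width of 2 and the name p_ex4 are assumptions, since the task does not fix them:

```r
library(ggplot2)

# One histogram panel per number of cylinders
p_ex4 <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
  facet_grid(. ~ cyl)

p_ex4
```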
Task: Create a histogram of the mpg variable from the mtcars dataset using a specific number of bins (e.g., 10 bins). Customize the title and axis labels.
Expected Outcome: A histogram with exactly 10 bins, showing the distribution of mpg.
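One possible solution, as a sketch. The label texts and the name p_ex5 are illustrative:

```r
library(ggplot2)

# Histogram with the number of bins set explicitly
p_ex5 <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(bins = 10, fill = "lightblue", color = "black") +
  labs(title = "Histogram of Miles Per Gallon (10 bins)",
       x = "Miles Per Gallon",
       y = "Count")

p_ex5
```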
To draw a density plot, we just change the function geom_histogram() to geom_density().
To plot the histogram and the density curve in the same graph, set the argument y = after_stat(density) in the aes() function.
Question
From the code, if you start with a density plot and follow it with a histogram, what happens?
Example: suppose we want to compare the histograms of income for males and females in this data.
Question and remark
From the code, if you move the argument fill = gender from the aes() function to geom_histogram(), what happens?
If you remove the argument position = "identity", what happens?
Remark: The argument position = "identity" is very important; do not forget it.
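To see why the position argument matters, the following sketch builds the same histogram twice, once with position = "identity" and once with the default, which for geom_histogram() is "stack". The simulated data mirror the income example in this section; the names base, p_identity, and p_stack are illustrative:

```r
library(ggplot2)

# Simulated incomes, mirroring the example data used in this section
set.seed(1)
Data <- data.frame(
  gender = rep(c("male", "female"), each = 500),
  income = c(rnorm(500, mean = 18000, sd = 2000),
             rnorm(500, mean = 25000, sd = 1500))
)

base <- ggplot(Data) + aes(x = income, fill = gender)

# Overlaid: every bar starts at zero, so the two groups can overlap
p_identity <- base +
  geom_histogram(bins = 30, alpha = 0.7, color = "black",
                 position = "identity")

# Stacked (the default): bars of one group sit on top of the other group
p_stack <- base +
  geom_histogram(bins = 30, alpha = 0.7, color = "black")

p_identity
p_stack
```

With stacking, the height of a bar shows the combined count of both groups, so the shape of each individual distribution is harder to read where the groups overlap.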
We can change the fill colors by adding the scale_fill_manual() function with the argument values = <vector of colors>.
Remark: The colors are assigned in the order of the factor levels (alphabetical order for character variables).
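Because the assignment depends on the level order, a named vector is a safer way to pair each group with a color. A sketch, using simulated data like the income example in this section; the color names and the object names Data and p_named are illustrative:

```r
library(ggplot2)

set.seed(1)
Data <- data.frame(
  gender = rep(c("male", "female"), each = 500),
  income = c(rnorm(500, mean = 18000, sd = 2000),
             rnorm(500, mean = 25000, sd = 1500))
)

p_named <- ggplot(Data) +
  aes(x = income, fill = gender) +
  geom_histogram(bins = 30, alpha = 0.7, color = "black",
                 position = "identity") +
  # Names match the values of gender, so the order no longer matters
  scale_fill_manual(values = c(male = "steelblue", female = "tomato"))

p_named
```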
theme(legend.position = <position>), where <position> is “top”, “bottom”, “right”, “left”, or “none”.
set.seed(1)
# Simulated incomes: 500 males and 500 females
male <- rnorm(n = 500, mean = 18000, sd = 2000)
female <- rnorm(n = 500, mean = 25000, sd = 1500)
Data <- data.frame(gender = rep(c("male", "female"), each = 500),
                   income = c(male, female))
Data |> ggplot() +
  aes(x = income, fill = gender) +
  geom_histogram(color = "black",
                 alpha = 0.7, bins = 30,
                 position = "identity") +
  # Colors follow the level order: "female" is blue, "male" is red
  scale_fill_manual(values = c("blue", "red")) +
  # Replace "right" with "top", "bottom", "left", or "none"
  theme(legend.position = "right")
In ggplot2, a facet is a way to create multiple plots (panels) based on the levels of one or more categorical variables, allowing you to compare different subsets of the data side by side. Faceting is especially useful when you want to visualize the same relationship across different groups in the data.
How Faceting Works
Facet by a Single Variable: You can create separate panels for each level of a single categorical variable.
Facet by Two Variables: You can create a grid of panels, where rows correspond to one variable and columns correspond to another.
The facet_grid() function creates a grid of panels based on two variables.
The rows correspond to levels of one variable, and the columns correspond to levels of another variable.
Useful when you want to explore the interaction between two categorical variables.
By Columns
By Rows
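A sketch of the two layouts, assuming the built-in mtcars data. In the facet_grid() formula the left side gives the row variable and the right side the column variable, with a dot meaning "none"; the names base, p_cols, and p_rows are illustrative:

```r
library(ggplot2)

base <- ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black")

# By columns: one panel per value of cyl, placed side by side
p_cols <- base + facet_grid(. ~ cyl)

# By rows: one panel per value of cyl, stacked vertically
p_rows <- base + facet_grid(cyl ~ .)

p_cols
p_rows
```

Putting a second variable in place of the dot, for example facet_grid(am ~ cyl), gives the full grid of rows and columns.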
Why Use Facets
Comparison: Faceting allows you to easily compare different subsets of your data.
Clarity: By splitting data into separate panels, facets can make complex plots easier to read.
Exploration: Faceting helps explore how relationships in the data change across different groups.
Summary
Facets in ggplot2 are a powerful tool for building multi-panel plots that enable comparisons across different groups or categories in your data. Faceting makes complex relationships easier to understand by breaking them down into more manageable, comparable pieces.
Task:
Use the mpg dataset to create a histogram of the hwy variable (highway miles per gallon).
Facet the histogram by the cyl (number of cylinders) variable to compare the distribution of highway miles per gallon across different cylinder counts.
Hint:
Use facet_grid(. ~ cyl) to create the facets.
Task:
Filter the mpg dataset to only include cars with displ (engine displacement) greater than 3.
Create a histogram of the hwy variable, coloring the bars by the class variable.
Use scale_fill_brewer to apply a custom color palette and add a legend.
Hint:
Use filter() from the dplyr package to filter the data.
Use aes(fill = class) to map colors to the class variable.
Solution:
library(dplyr)  # provides filter()
# Keep only cars with engine displacement greater than 3
mpg_filtered <- mpg |> filter(displ > 3)
mpg_filtered |>
  ggplot() +
  aes(x = hwy, fill = class) +
  # Overlaid, semi-transparent histograms, one per class
  geom_histogram(binwidth = 2, color = "black",
                 position = "identity", alpha = 0.5) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Histogram of Highway MPG for Cars with Displacement > 3",
       x = "Highway MPG",
       y = "Count",
       fill = "Class") +
  theme(legend.position = "top")
Task:
Create a faceted histogram of the hwy variable, faceting by drv (drive type).
Use custom fill colors for the bars using scale_fill_manual.
Filter the data to include only cars with cyl equal to 4 or 6.
Hint:
Use filter(cyl %in% c(4, 6)) to filter the data.
Customize the fill colors using scale_fill_manual().
Solution:
mpg_filtered <- mpg |> filter(cyl %in% c(4, 6))
mpg_filtered |>
ggplot() +
aes(x = hwy, fill = drv) +
geom_histogram(binwidth = 2, color = "black") +
facet_grid(.~ drv) +
scale_fill_manual(values = c("4" = "red",
"f" = "green",
"r" = "blue")) +
labs(title = "Histogram of Highway MPG by Drive Type",
x = "Highway MPG",
y = "Count",
fill = "Drive Type")
Task:
Create a stacked histogram of the hwy variable, stacking by class.
Facet the histogram by the manufacturer variable and filter the dataset to include only cars with year equal to 2008.
Use scale_fill_brewer to apply a custom color palette.
Hint:
Use aes(fill = class) for stacking by class.
Facet by manufacturer.
Solution:
mpg_filtered <- mpg |>
filter(year == 2008)
mpg_filtered |>
ggplot() +
aes(x = hwy, fill = class) +
geom_histogram(binwidth = 2, color = "black") +
facet_grid(.~ manufacturer) +
scale_fill_brewer(palette = "Pastel1") +
labs(title = "Stacked Histogram of Highway MPG by Manufacturer (2008)",
x = "Highway MPG",
y = "Count",
fill = "Class") +
theme(legend.position = "bottom")
Task:
Create a histogram of the hwy variable and overlay a density plot on top.
Use facets by the class variable, and apply a custom color palette to the density plot using scale_fill_manual.
Filter the data to include only cars with displ between 2 and 4.
Hint:
Use geom_density() for the density plot.
Adjust transparency of the density plot with alpha.
Solution:
# Keep only cars with engine displacement between 2 and 4
mpg_filtered <- mpg |>
  filter(displ >= 2, displ <= 4)
mpg_filtered |>
  ggplot() +
  aes(x = hwy) +
  # Histogram on the density scale so it is comparable with the curve
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 2,
                 fill = "lightblue",
                 color = "black") +
  geom_density(aes(fill = class), alpha = 0.4) +
  facet_grid(. ~ class) +
  # Every class left after filtering needs a color, otherwise its fill is NA
  scale_fill_manual(values = c("compact" = "red",
                               "midsize" = "purple",
                               "minivan" = "orange",
                               "pickup" = "blue",
                               "subcompact" = "brown",
                               "suv" = "green")) +
  labs(title = "Histogram and Density Plot of Highway MPG by Class",
       x = "Highway MPG",
       y = "Density",
       fill = "Class")