Visualizing Data in R with ggplot2: Histograms

Somsak Chanaim

International College of Digital Innovation, CMU

October 30, 2024

What is the ggplot2 package?

The ggplot2 package is one of the most popular and powerful data visualization packages in R.

It is based on the “Grammar of Graphics,” a framework that breaks down graphs into components such as scales, layers, and themes.

This approach allows users to build complex and customized plots in a systematic and consistent way.

Benefits of Using ggplot2:

  • Consistency: The grammar of graphics approach makes it easier to build and understand plots.

  • Flexibility: It allows for extensive customization, from simple plots to complex multi-layered visualizations.

  • Community Support: As one of the most widely used R packages, ggplot2 has a large community, extensive documentation, and numerous tutorials and examples.

Install and Load the Package

Open your R console or RStudio and run the following command:

install.packages("ggplot2")

This command will download and install ggplot2.

Load the Package

After installation, you need to load the package into your R session using the library() function:

library(ggplot2)

Load the Packages for This Topic

library(ggplot2)
library(dplyr)
library(DT)

The Structure of Code in ggplot2

The basic structure typically involves the following components:

  • Data: The dataset that you want to visualize.

  • Aesthetics (aes): Mappings of data variables to visual properties like x and y coordinates, colors, sizes, and shapes.

  • Geometries (geom): The type of plot or visual elements to represent the data (e.g., points, lines, bars).

  • Facets: Optional; used to create multiple plots based on subsets of the data.

  • Scales: Optional; used to control the mapping of data to aesthetics.

  • Coordinates: Optional; control the coordinate system.

  • Themes: Optional; used to customize the appearance of the plot.

Basic Structure of ggplot2 Code

ggplot(data = <DATA>) +
  aes(<MAPPINGS>) +
  <GEOM_FUNCTION>() +
  <FACET_FUNCTION>() +
  <SCALE_FUNCTIONS>() +
  <COORDINATE_FUNCTION>() +
  <THEME_FUNCTION>()

Example Breakdown

Let’s break down an example of creating a scatter plot using ggplot2:
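As a minimal sketch of such a breakdown, each line below corresponds to one component of the template. The built-in mtcars dataset and the variables wt, mpg, and cyl are assumptions chosen for illustration, not necessarily the example used in the original slides.

```r
library(ggplot2)

# Data: the built-in mtcars dataset (assumed here for illustration)
p_scatter <- ggplot(data = mtcars) +
  # Aesthetics: weight on x, fuel economy on y
  aes(x = wt, y = mpg) +
  # Geometry: one point per car
  geom_point() +
  # Facets: one panel per number of cylinders
  facet_grid(. ~ cyl) +
  # Scales: control the x-axis breaks
  scale_x_continuous(breaks = 2:5) +
  # Coordinates: the default Cartesian system, written out explicitly
  coord_cartesian() +
  # Theme: a plain black-and-white look
  theme_bw()

p_scatter
```

Only the data, aesthetics, and geometry lines are required; the remaining layers can be dropped and ggplot2 falls back to its defaults.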

The Histogram with ggplot2

Basic Structure

After loading the ggplot2 package:

Explanation:

  • geom_histogram() creates the histogram.

  • bins = 30 specifies the number of bins (an alternative to setting binwidth).

  • fill = "skyblue" sets the fill color of the bars.

  • color = "black" outlines the bars in black.
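A minimal sketch consistent with the arguments explained above. The data frame Data, its column x, and the sample size of 1000 are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)  # for reproducibility
# Hypothetical data: 1000 draws from N(0, 1); the sample size is an assumption
Data <- data.frame(x = rnorm(1000, mean = 0, sd = 1))

p_hist <- ggplot(data = Data) +
  aes(x = x) +
  geom_histogram(bins = 30,          # number of bins
                 fill = "skyblue",   # bar fill color
                 color = "black")    # bar outline color

p_hist
```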

Modifying the ggplot Object with the labs() Function

After creating a ggplot object, we can add or modify its labels with the labs() function. It customizes the labels of various elements in a plot, including titles, axis labels, legend titles, and captions.

Basic Syntax

labs(
  title = NULL,      # Title of the plot
  subtitle = NULL,   # Subtitle of the plot
  x = NULL,          # Label for the x-axis
  y = NULL,          # Label for the y-axis
  caption = NULL,    # Caption at the bottom of the plot
  tag = NULL,        # Tag for the plot (like a figure number)
  fill = NULL,       # Label for fill legend (if applicable)
  color = NULL,      # Label for color legend (if applicable)
  size = NULL,       # Label for size legend (if applicable)
  shape = NULL       # Label for shape legend (if applicable)
)
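As a short usage sketch, labs() is added to a plot like any other layer. The mtcars histogram and the label texts below are illustrative choices only.

```r
library(ggplot2)

p_labeled <- ggplot(data = mtcars) +
  aes(x = mpg) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(
    title = "Fuel economy in mtcars",   # main title
    subtitle = "32 car models",         # mtcars has 32 rows
    x = "Miles per gallon",             # x-axis label
    y = "Number of cars"                # y-axis label
  )

p_labeled
```

Arguments left out of labs() keep their current values, so only the labels you want to change need to be listed.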

Question:

Add the title “The histogram of N(0,1)” and the caption ‘Your Name’.

ggplot Theme

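library(ggplot2)

set.seed(1)
n <- 1000            # sample size (an assumed value; not defined in the original)
mu <- 0; sigma <- 1
# Random data from N(mu, sigma^2)
Data <- data.frame(
           x = rnorm(n, 
                     mean = mu,
                     sd = sigma))
ggplot(data = Data) +
 aes(x = x) +
 geom_histogram(fill = "skyblue",
                color = "black",
                 bins = 30) +
 theme_bw()   # stands in for theme_xxx(): try theme_minimal(), theme_classic(), theme_dark(), ...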

The default theme is theme_gray().

Exercise 1: Basic Histogram

Task: Create a simple histogram of the mpg (miles per gallon) variable from the mtcars dataset. Set the bin width to 2 and use default colors.

Expected Outcome: A histogram that shows the distribution of miles per gallon for cars in the mtcars dataset.

Solution

# Create a histogram of the mpg variable
ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2) +
  labs(title = "Histogram of Miles Per Gallon", 
           x = "Miles Per Gallon", 
           y = "Frequency")

Exercise 2: Customized Histogram Colors

Task: Modify the histogram created in Exercise 1 by changing the fill color to lightblue and the border color of the bars to black.

Expected Outcome: A histogram with lightblue bars and black borders.

Solution

# Create a histogram with customized colors
ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2, 
                     fill = "lightblue", 
                    color = "black") +
  labs(title = "Histogram of Miles Per Gallon", 
           x = "Miles Per Gallon", 
           y = "Frequency")

Exercise 3: Histogram with Density Curve

Task: Overlay a density curve on top of the histogram of the mpg variable from the mtcars dataset. Use a bin width of 2, and set the density curve color to red.

Expected Outcome: A histogram with a red density curve overlaid, representing the smoothed distribution of mpg.

Solution

# Histogram with density curve overlay
ggplot(mtcars) +
  aes(x = mpg, y = after_stat(density)) +
  geom_histogram( 
              binwidth = 2, 
                  fill = "lightblue", 
                 color = "black", 
                 alpha = 0.6) +
  geom_density(color = "red") +
  labs(title = "Histogram and Density of Miles Per Gallon", 
           x = "Miles Per Gallon", 
           y = "Density")

Exercise 4: Faceted Histogram

Task: Create faceted histograms of the mpg variable, separated by the number of cylinders (cyl) in the mtcars dataset. Each facet should show a histogram for a different number of cylinders.

Expected Outcome: A set of histograms, each representing the distribution of mpg for cars with 4, 6, and 8 cylinders, respectively.

Solution

# Faceted histogram by number of cylinders
ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(binwidth = 2, 
                     fill = "lightgreen", 
                    color = "black") +
  labs(title = "Histogram of Miles Per Gallon by Number of Cylinders", 
           x = "Miles Per Gallon", 
           y = "Frequency") +
  facet_grid(.~ cyl)

Exercise 5: Histogram with Custom Number of Bins

Task: Create a histogram of the mpg variable from the mtcars dataset using a specific number of bins (e.g., 10 bins). Customize the title and axis labels.

Expected Outcome: A histogram with exactly 10 bins, showing the distribution of mpg.

Solution

# Histogram with custom number of bins
ggplot(mtcars) +
  aes(x = mpg) +
  geom_histogram(bins = 10, 
                 fill = "orange", 
                color = "black") +
  labs(title = "Histogram of Miles Per Gallon with 10 Bins", 
           x = "Miles Per Gallon", 
           y = "Frequency")

Density Plot with Iris Data

Density Plot with ggplot2

To draw a density plot, we just change the function geom_histogram() to geom_density(). To draw the histogram and the density plot in the same graph, we set the argument y = after_stat(density) in the aes() function so that both layers share the density scale.
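A minimal sketch of both cases using the built-in iris data. The choice of the Sepal.Length column and of 20 bins is an assumption made for illustration.

```r
library(ggplot2)

# Density plot only: geom_density() takes the place of geom_histogram()
p_density <- ggplot(data = iris) +
  aes(x = Sepal.Length) +
  geom_density(fill = "skyblue", alpha = 0.5)

# Histogram and density together: both layers use the density scale on y
p_both <- ggplot(data = iris) +
  aes(x = Sepal.Length, y = after_stat(density)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black") +
  geom_density(color = "red")

p_density
p_both
```

With y = after_stat(density), the bar areas of the histogram sum to 1, which puts the bars and the density curve on a comparable vertical scale.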

Question

From the code, if you start with a density plot and follow it with a histogram, what happens?

Separate histogram or density by fill

Example: suppose we want to compare the histograms of income for males and females in this data.

Question and remark

From the code, if you move the argument fill = gender from the aes() function to geom_histogram(), what happens?

And if you remove the argument position = "identity", what happens?

Remark: The argument position = "identity" is very important; don't forget it.

Changing the Colors of the fill Argument

We can change the colors by adding the scale_fill_manual() function, whose main argument is

values = <vector of color>

Remark: The colors are assigned in the order of the factor levels (alphabetical order for character values).
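One way to avoid depending on that order is to pass a named vector to values, so each color is tied to a group by name. The sketch below uses hypothetical two-group income data generated only for this illustration.

```r
library(ggplot2)

set.seed(1)
# Hypothetical two-group data, used only to illustrate the color mapping
Data <- data.frame(
  gender = rep(c("male", "female"), each = 100),
  income = c(rnorm(100, mean = 18000, sd = 2000),
             rnorm(100, mean = 25000, sd = 1500))
)

p_named <- ggplot(data = Data) +
  aes(x = income, fill = gender) +
  geom_histogram(bins = 30, color = "black", alpha = 0.7,
                 position = "identity") +
  # Named values: each color is matched to a level by name, not by position
  scale_fill_manual(values = c("male" = "blue", "female" = "red"))

p_named
```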

Legend position

theme(legend.position = <position>), where <position> is "top", "bottom", "right", "left", or "none"

set.seed(1)
male <- rnorm(n =500, mean = 18000, sd = 2000)
female <- rnorm(n =500, mean = 25000, sd = 1500)
Data <- data.frame(gender = rep(c("male","female"), each = 500),
                   income = c(male, female))

Data |> ggplot() +
        aes(x = income, fill = gender) +
        geom_histogram( color ="black", 
                        alpha = 0.7, bins = 30, 
                        position = "identity") +
        scale_fill_manual(values =c("blue","red")) +
        theme(legend.position = "top")  # stands in for "xxx": or "bottom", "left", "right", "none"

The facet_grid() Function

In ggplot2, a facet is a way to create multiple plots (panels) based on the levels of one or more categorical variables, allowing you to compare different subsets of the data side by side. Faceting is especially useful when you want to visualize the same relationship across different groups in the data.

How Faceting Works

  • Facet by a Single Variable: You can create separate panels for each level of a single categorical variable.

  • Facet by Two Variables: You can create a grid of panels, where rows correspond to one variable and columns correspond to another.

facet_grid(rows ~ cols)

  • This function creates a grid of panels based on two variables.

  • The rows correspond to levels of one variable, and the columns correspond to levels of another variable.

  • Useful when you want to explore the interaction between two categorical variables.
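A minimal sketch of a two-variable grid, assuming the mpg dataset that ships with ggplot2. The variables drv (rows) and year (columns) are illustrative choices.

```r
library(ggplot2)

# Rows: drive type (drv); columns: model year (year)
p_grid <- ggplot(data = mpg) +
  aes(x = hwy) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  facet_grid(drv ~ year)

p_grid
```

The variable to the left of ~ defines the rows and the variable to the right defines the columns.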

By Columns
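A minimal sketch of faceting by columns, assuming the mpg dataset that ships with ggplot2 and drv as the grouping variable (both are illustrative choices).

```r
library(ggplot2)

# One column of panels per drive type; the dot means "no row variable"
p_cols <- ggplot(data = mpg) +
  aes(x = hwy) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  facet_grid(. ~ drv)

p_cols
```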

By Rows
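A minimal sketch of faceting by rows, again assuming the mpg dataset that ships with ggplot2 and drv as the grouping variable (both are illustrative choices).

```r
library(ggplot2)

# One row of panels per drive type; the dot means "no column variable"
p_rows <- ggplot(data = mpg) +
  aes(x = hwy) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  facet_grid(drv ~ .)

p_rows
```

Stacking the panels in rows keeps the x-axes aligned, which makes shifts in the distribution between groups easier to see.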

Why Use Facets

  • Comparison: Faceting allows you to easily compare different subsets of your data.

  • Clarity: By splitting data into separate panels, facets can make complex plots easier to read.

  • Exploration: Faceting helps explore how relationships in the data change across different groups.

Summary

Facets in ggplot2 are a powerful tool for building multi-panel plots that enable comparisons across different groups or categories in your data. Faceting makes complex relationships easier to understand by breaking them down into more manageable, comparable pieces.

Example

Exercise 6: Basic Histogram with Facets

Task:

  • Use the mpg dataset to create a histogram of the hwy variable (highway miles per gallon).

  • Facet the histogram by the cyl (number of cylinders) variable to compare the distribution of highway miles per gallon across different cylinder counts.

Hint:

  • Use facet_grid(.~ cyl) to create the facets.

Solution:

mpg |> 
ggplot() +
  aes(x = hwy) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  facet_grid(.~ cyl) +
  labs(title = "Histogram of Highway MPG by Cylinder Count",
           x = "Highway MPG",
           y = "Count")

Exercise 7: Filtered Density Plot with Custom Colors and Legend

Task:

  • Filter the mpg dataset to only include cars with displ (engine displacement) greater than 3.

  • Create a density plot of the hwy variable, filling the curves by the class variable.

  • Use scale_fill_brewer to apply a custom color palette and add a legend.

Hint:

  • Use filter() from the dplyr package to filter the data.

  • Use aes(fill = class) to map colors to the class variable.

Solution:

mpg_filtered <- mpg |> filter(displ > 3)
mpg_filtered |> 
ggplot() +
  aes(x = hwy, fill = class) +
  geom_density(color = "black", position = "identity", alpha =0.5) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Density of Highway MPG for Cars with Displacement > 3",
           x = "Highway MPG",
           y = "Density",
        fill = "Class") +
  theme(legend.position = "top")

Exercise 8: Faceted Histogram with Custom Fill Colors

Task:

  • Create a faceted histogram of the hwy variable, faceting by drv (drive type).

  • Use custom fill colors for the bars using scale_fill_manual.

  • Filter the data to include only cars with cyl equal to 4 or 6.

Hint:

  • Use filter(cyl %in% c(4, 6)) to filter the data.

  • Customize the fill colors using scale_fill_manual().

Solution:

mpg_filtered <- mpg |> filter(cyl %in% c(4, 6))

mpg_filtered |> 
 ggplot() +
  aes(x = hwy, fill = drv) +
  geom_histogram(binwidth = 2, color = "black") +
  facet_grid(.~ drv) +
  scale_fill_manual(values = c("4" = "red", 
                               "f" = "green", 
                               "r" = "blue")) +
  labs(title = "Histogram of Highway MPG by Drive Type",
           x = "Highway MPG",
           y = "Count",
         fill = "Drive Type")

Exercise 9: Stacked Histogram with Custom Colors and Facets

Task:

  • Create a stacked histogram of the hwy variable, stacking by class.

  • Facet the histogram by the manufacturer variable and filter the dataset to include only cars with year equal to 2008.

  • Use scale_fill_brewer to apply a custom color palette.

Hint:

  • Use aes(fill = class) for stacking by class.

  • Facet by manufacturer.

Solution:

mpg_filtered <- mpg |> 
                    filter(year == 2008)
mpg_filtered |> 
ggplot() +
  aes(x = hwy, fill = class) +
  geom_histogram(binwidth = 2, color = "black") +
  facet_grid(.~ manufacturer) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Stacked Histogram of Highway MPG by Manufacturer (2008)",
           x = "Highway MPG",
           y = "Count",
        fill = "Class") +
  theme(legend.position = "bottom")

Exercise 10: Density Plot Overlaid on Histogram with Facets and Custom Colors

Task:

  • Create a histogram of the hwy variable and overlay a density plot on top.

  • Use facets by the class variable, and apply a custom color palette to the density plot using scale_fill_manual.

  • Filter the data to include only cars with displ between 2 and 4.

Hint:

  • Use geom_density() for the density plot.

  • Adjust transparency of the density plot with alpha.

Solution:

mpg_filtered <- mpg |> 
                filter(displ >= 2) |> 
                filter(displ <= 4)
mpg_filtered |> 
     ggplot() +
        aes(x = hwy) +
        geom_histogram(aes(y = after_stat(density)), 
                    binwidth = 2, 
                        fill = "lightblue", 
                       color = "black") +
       geom_density(aes(fill = class), alpha = 0.4) +
       facet_grid(.~ class) +
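       scale_fill_manual(values = c("compact" = "red", 
                                        "suv" = "green", 
                                     "pickup" = "blue",
                                    "minivan" = "orange",
                                    "midsize" = "purple",     # class also present after filtering
                                 "subcompact" = "brown")) +   # class also present after filtering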
       labs(title = "Histogram and Density Plot of Highway MPG by Class",
                x = "Highway MPG",
                y = "Density",
             fill = "Class")