Visualizing Data in R with ggplot2:
Scatter Plot

Somsak Chanaim

International College of Digital Innovation, CMU

October 30, 2024

What is a Scatter Plot?

A scatter plot is a type of data visualization that displays individual data points on a two-dimensional graph.

Each point on the scatter plot represents the values of two variables.

The position of a point on the x-axis corresponds to the value of one variable, while the position on the y-axis corresponds to the value of the other variable.

Key Features of a Scatter Plot:

  • Two Variables: Scatter plots typically show the relationship between two continuous variables.

  • Data Points: Each point represents an observation in the dataset.

  • Trends: Scatter plots are useful for identifying patterns, correlations, or trends between the two variables.

  • No Line: Unlike line plots, scatter plots do not connect the points with a line; each point stands alone.

When to Use a Scatter Plot

  • Correlation: To determine if there is a relationship between two variables.

  • Outliers: To spot any outliers or unusual observations in the data.

  • Trends: To visually inspect trends, such as whether one variable tends to increase as the other increases (positive correlation), decrease as the other increases (negative correlation), or show no clear pattern (no correlation).

Scatter plots are a fundamental tool in exploratory data analysis and are commonly used in various fields, including statistics, economics, and the natural sciences.

The geom_point() function in ggplot2

The geom_point() function in ggplot2 is used to create scatter plots.

Basic Usage

library(ggplot2)

# Template: replace data, variable1, and variable2 with your own names
data |>
  ggplot() +
  aes(x = variable1, y = variable2) +
  geom_point()

  • data: The dataset being used.

  • aes(x = variable1, y = variable2): Defines the aesthetics, mapping the variables to the x and y axes. For a scatter plot, x and y should be numeric (continuous or integer).

  • geom_point(): Adds the points to the plot.

Example

Let’s create a basic scatter plot using the mpg dataset:
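A minimal sketch of such a plot, assuming ggplot2 is installed (the mpg dataset ships with it). The object name p_basic is only illustrative.

```r
library(ggplot2)  # also provides the mpg dataset

# Map displ to the x-axis and hwy to the y-axis, then draw one point per car
p_basic <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point()

p_basic  # print the plot
```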

This code creates a scatter plot of engine displacement (displ) versus highway miles per gallon (hwy).

Customizing geom_point(): Changing Point Color

You can change the color of the points using the color argument:
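As a sketch, setting color outside aes() applies one fixed color to every point. The value "blue" is an arbitrary choice for illustration.

```r
library(ggplot2)

# color is set as a constant inside geom_point(), not mapped to data
p_color <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point(color = "blue")

p_color
```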

Customizing geom_point(): Mapping Color to a Variable

You can map a color to a variable, which will change the color of the points based on the values of that variable:
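A possible version, assuming the class column of mpg is the variable of interest. Because color sits inside aes(), ggplot2 picks the palette and adds a legend.

```r
library(ggplot2)

# color is mapped to a variable, so each class gets its own color
p_class <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy, color = class) +
  geom_point()

p_class
```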

In this case, points will be colored based on the car’s class.

Customizing geom_point(): Changing Point Size

You can adjust the size of the points with the size argument:
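For example, a sketch with all points enlarged. The value 3 is an arbitrary illustration, not a recommended setting.

```r
library(ggplot2)

# size is a constant here, so every point is drawn at the same size
p_size <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point(size = 3)

p_size
```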

The default size for geom_point() is 1.5.

Customizing geom_point(): Mapping Size to a Variable

You can map the size of the points to a variable:
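One way this might look, assuming cyl (number of cylinders) from mpg as the size variable:

```r
library(ggplot2)

# size is mapped inside aes(), so it varies with the value of cyl
p_cyl <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy, size = cyl) +
  geom_point()

p_cyl
```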

Here, the size of each point corresponds to the number of cylinders (cyl).

Customizing geom_point(): Changing Point Shape

The shape of the points can be changed using the shape argument:
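A short sketch using shape code 17 (a filled triangle); any other valid shape code or name could be substituted.

```r
library(ggplot2)

# shape = 17 draws filled triangles instead of the default filled circles
p_shape <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point(shape = 17)

p_shape
```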

Customizing geom_point(): Combining Aesthetics

You can combine multiple aesthetics (color, size, shape) in one plot:
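A sketch of one such combination, assuming class for color, drv for shape, and an illustrative fixed size of 3:

```r
library(ggplot2)

# color and shape are mapped to variables; size is a constant
p_combo <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy, color = class, shape = drv) +
  geom_point(size = 3)

p_combo
```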

This creates a scatter plot where the color represents the car class, the shape represents the drv (drive type), and the size of the points is fixed.

Customizing geom_point(): xlim() and ylim()

xlim() and ylim() are functions that control the limits of the x and y axes, respectively.

These functions are often used together to create scatter plots with customized axis ranges.

  • xlim(<x_min>, <x_max>): Sets the minimum and maximum limits for the x-axis.

  • ylim(<y_min>, <y_max>): Sets the minimum and maximum limits for the y-axis.

Suppose you want to create a scatter plot using the mpg dataset to show the relationship between engine displacement (displ) and highway miles per gallon (hwy), and you want to restrict the x-axis to the range 2 to 6 and the y-axis to the range 15 to 40.
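The scenario above could be sketched as follows, assuming ggplot2 is loaded:

```r
library(ggplot2)

# Restrict the x-axis to [2, 6] and the y-axis to [15, 40]
p_limits <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point() +
  xlim(2, 6) +
  ylim(15, 40)

p_limits
```

Note that xlim() and ylim() drop observations that fall outside the limits (ggplot2 warns about the removed rows) rather than simply zooming in. To zoom without dropping data, coord_cartesian(xlim = c(2, 6), ylim = c(15, 40)) is an alternative.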

Explanation:

  • xlim(2, 6): Sets the x-axis to display values between 2 and 6.

  • ylim(15, 40): Sets the y-axis to display values between 15 and 40.

Customizing geom_point(): Adding a Regression Line with geom_smooth()

To add a regression line to a scatter plot in ggplot2 after using geom_point(), you can use the geom_smooth() function.

The geom_smooth() function can fit and add a variety of trend lines to your plot, including linear regression lines.

Using the mpg dataset to create a scatter plot of engine displacement (displ) versus highway miles per gallon (hwy), with a linear regression line added:
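A minimal sketch of that plot, assuming ggplot2 is loaded. geom_smooth() uses the formula y ~ x by default and prints a message saying so.

```r
library(ggplot2)

# Layer 1: the points; layer 2: a straight line fitted with lm()
p_lm <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_lm
```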

Explanation:

  • geom_point(): Creates a scatter plot of displ versus hwy.

  • geom_smooth(method = "lm", se = FALSE):

    • method = "lm": Specifies that a linear model (linear regression) should be fitted to the data.

    • se = FALSE: Removes the confidence interval shading around the regression line. If you want to display the confidence interval, you can set se = TRUE or omit the argument.

Adding Confidence Interval:

By default, geom_smooth() adds a shaded area around the regression line representing the confidence interval. You can enable or disable this with the se argument:
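For instance, a sketch with the band switched on explicitly (se = TRUE is also the default, so the argument could be omitted):

```r
library(ggplot2)

# se = TRUE draws the shaded confidence band around the fitted line
p_se <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

p_se
```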

Changing Line Color:

We can change the color of the regression line using the color argument:
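As an illustration, with "red" chosen arbitrarily:

```r
library(ggplot2)

# color inside geom_smooth() affects only the fitted line, not the points
p_line_color <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")

p_line_color
```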

Changing Line Type:

To change the type of line (e.g., dashed, dotted), use the linetype argument:
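A sketch using a dashed line; "dotted", "dotdash", "longdash", and "twodash" are other standard linetype names.

```r
library(ggplot2)

# linetype = "dashed" changes how the fitted line is drawn
p_linetype <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed")

p_linetype
```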

Adding Non-Linear Trend Lines:

If you want to fit a non-linear model (e.g., a LOESS curve), you can specify a different method in geom_smooth():
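A possible LOESS version, assuming the same displ and hwy variables as before:

```r
library(ggplot2)

# method = "loess" fits a locally weighted smooth curve instead of a straight line
p_loess <- mpg |>
  ggplot() +
  aes(x = displ, y = hwy) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE)

p_loess
```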

As with the linear fit, you can then modify the color, linetype, or se arguments of the non-linear curve.

Question

Use the Gapminder dataset from the gapminder package.

Exercise

Exercise 1. Basic Scatter Plot

  • Task: Create a scatter plot showing the relationship between GDP per capita (gdpPercap) and life expectancy (lifeExp) for the year 2007.

  • Hint: Use geom_point() and filter the data to include only the year 2007.

solution

library(ggplot2)    # plotting
library(dplyr)      # filter()
library(gapminder)  # gapminder dataset

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point()

Exercise 2. Scatter Plot with Color Mapping

  • Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007 only, and color the points by continent.

  • Hint: Use the color aesthetic to map the continent variable.

solution

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp, color = continent) +
  geom_point()

Exercise 3. Scatter Plot with Size Mapping

  • Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007 only, with point sizes representing the population (pop).

  • Hint: Use the size aesthetic to map the pop variable.

solution

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp, color = continent, size = pop) +
  geom_point()

Exercise 4. Logarithmic Transformation

  • Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007 only, but apply a logarithmic transformation to the x-axis.

  • Hint: Use scale_x_log10() to apply the transformation.

solution

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp, color = continent, size = pop) +
  scale_x_log10() +  # log10 scale on the x-axis
  geom_point()

Exercise 5. Adding a Trend Line

  • Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007, and add a linear regression line.

  • Hint: Use geom_smooth(method = "lm", se = FALSE) after geom_point().

solution

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Exercise 6. Facet Wrap by Continent

  • Task: Create scatter plots of GDP per capita vs. life expectancy for each continent for the year 2007, displayed in separate panels.

  • Hint: Use facet_grid(. ~ continent); facet_wrap(~ continent) gives a similar panel layout.

solution

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point() +
  facet_grid(. ~ continent)  # one column of panels per continent

Exercise 7. Facet Grid by Year and Continent

  • Task: Create scatter plots of GDP per capita vs. life expectancy, arranged in a grid of panels with one column per year and one row per continent.

  • Hint: Use facet_grid(continent ~ year).

solution

gapminder |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point() +
  facet_grid(continent ~ year)  # rows: continent, columns: year

Exercise 8. Highlight Specific Countries

  • Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007, but highlight the points for China, India, and the United States in red.

  • Hint: Create a new variable COLOR that holds the color for each point, and use scale_color_identity().

solution

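# Flag the three countries to highlight; everything else stays black
gapminder$COLOR <- ifelse(
  gapminder$country %in% c("China", "India", "United States"),
  "red", "black"
)

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp, color = COLOR) +
  geom_point() +
  scale_color_identity()  # use the values in COLOR as literal colors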

Exercise 9. Customizing Point Shapes

  • Task: Create a scatter plot of GDP per capita vs. life expectancy for year 2007, using different point shapes for each continent.

  • Hint: Use the shape aesthetic to map the continent variable.

solution

gapminder |>
  filter(year == 2007) |>
  ggplot() +
  aes(x = gdpPercap, y = lifeExp, shape = continent) +
  geom_point()