International College of Digital Innovation, CMU
October 30, 2024
A scatter plot is a type of data visualization that displays individual data points on a two-dimensional graph.
Each point on the scatter plot represents the values of two variables.
The position of a point on the x-axis corresponds to the value of one variable, while the position on the y-axis corresponds to the value of the other variable.
Key Features of a Scatter Plot:
Two Variables: Scatter plots typically show the relationship between two continuous variables.
Data Points: Each point represents an observation in the dataset.
Trends: Scatter plots are useful for identifying patterns, correlations, or trends between the two variables.
No Line: Unlike line plots, scatter plots do not connect the points with a line; each point stands alone.
When to Use a Scatter Plot
Correlation: To determine if there is a relationship between two variables.
Outliers: To spot any outliers or unusual observations in the data.
Trends: To visually inspect trends, such as whether one variable tends to increase as the other increases (positive correlation), decrease as the other increases (negative correlation), or show no clear pattern (no correlation).
Scatter plots are a fundamental tool in exploratory data analysis and are commonly used in various fields, including statistics, economics, and the natural sciences.
The geom_point() function in ggplot2 is used to create scatter plots.
Exercise Basic Usage
data: The dataset being used.
aes(x = variable1, y = variable2): Defines the aesthetics, mapping the variables to the x and y axes (x and y should be continuous or integer values).
geom_point(): Adds the points to the plot.
Let’s create a basic scatter plot using the mpg dataset:
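A minimal sketch of such a plot, assuming the ggplot2 package is installed (the mpg dataset ships with it); the object name p is only illustrative:

```r
library(ggplot2)  # provides ggplot(), geom_point(), and the mpg dataset

# Map engine displacement to the x-axis and highway mileage to the y-axis,
# then draw one point per car
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

p  # printing the object renders the plot
```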
This code creates a scatter plot of engine displacement (displ) versus highway miles per gallon (hwy).
You can change the color of the points using the color argument:
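For instance, a fixed color could be set as in the sketch below; the choice of "blue" and the name p_color are illustrative assumptions:

```r
library(ggplot2)

# color is set outside aes(), so every point gets the same fixed color
p_color <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

p_color
```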
You can map a color to a variable, which will change the color of the points based on the values of that variable:
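One way to write this mapping, sketched with the class column of mpg (the name p_class is illustrative):

```r
library(ggplot2)

# color sits inside aes(), so it is mapped to a variable;
# ggplot2 picks one color per class and adds a legend
p_class <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_class
```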
In this case, points will be colored based on the car’s class.
You can adjust the size of the points with the size argument:
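A short sketch with a fixed size; the value 3 is an arbitrary choice for illustration:

```r
library(ggplot2)

# size is set outside aes(), so all points are drawn at the same size
p_size <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3)

p_size
```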
The default point size is 1.5.
You can map the size of the points to a variable:
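A possible version of this mapping, using the cyl column of mpg (the name p_cyl is illustrative):

```r
library(ggplot2)

# size sits inside aes(), so point size varies with the cyl column
# and a size legend is added
p_cyl <- ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point()

p_cyl
```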
Here, the size of each point corresponds to the number of cylinders (cyl).
The shape of the points can be changed using the shape argument:
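As an illustration, shape code 17 (a filled triangle) is assumed here; any valid R shape code could be used instead:

```r
library(ggplot2)

# shape = 17 draws every point as a filled triangle
p_shape <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 17)

p_shape
```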
You can combine multiple aesthetics (color, size, shape) in one plot:
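A sketch of one such combination, assuming color is mapped to class, shape to drv, and size is fixed at 3 (an arbitrary value):

```r
library(ggplot2)

# color and shape are mapped to variables inside aes();
# size is a fixed setting outside aes()
p_combo <- ggplot(mpg, aes(x = displ, y = hwy, color = class, shape = drv)) +
  geom_point(size = 3)

p_combo
```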
This creates a scatter plot where the color represents the car class, the shape represents the drv (drive type), and the size of the points is fixed.
xlim() and ylim() are functions that control the limits of the x and y axes, respectively.
These functions are often used together to create scatter plots with customized axis ranges.
xlim(<x_min>, <x_max>): Sets the minimum and maximum limits for the x-axis.
ylim(<y_min>, <y_max>): Sets the minimum and maximum limits for the y-axis.
Suppose you want to create a scatter plot using the mpg dataset to show the relationship between engine displacement (displ) and highway miles per gallon (hwy), and you want to restrict the x-axis to the range 2 to 6 and the y-axis to the range 15 to 40.
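This could be written as in the sketch below. Note that xlim() and ylim() drop observations outside the limits (ggplot2 issues a warning when the plot is drawn); coord_cartesian() is an alternative that zooms without dropping data.

```r
library(ggplot2)

# Restrict the x-axis to [2, 6] and the y-axis to [15, 40];
# cars outside these ranges are removed from the plot
p_limits <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  xlim(2, 6) +
  ylim(15, 40)

p_limits
```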
Explanation:
xlim(2, 6): Sets the x-axis to display values between 2 and 6.
ylim(15, 40): Sets the y-axis to display values between 15 and 40.
To add a regression line to a scatter plot in ggplot2 after using geom_point(), you can use the geom_smooth() function.
The geom_smooth() function can fit and add a variety of trend lines to your plot, including linear regression lines.
The following example uses the mpg dataset to create a scatter plot of engine displacement (displ) versus highway miles per gallon (hwy), with a linear regression line added:
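A minimal sketch, assuming ggplot2 is installed (the name p_lm is illustrative):

```r
library(ggplot2)

# Layer 1: the points; layer 2: a straight line fitted by lm(),
# drawn without the confidence band
p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_lm
```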
Explanation:
geom_point(): Creates a scatter plot of displ versus hwy.
geom_smooth(method = "lm", se = FALSE):
  method = "lm": Specifies that a linear model (linear regression) should be fitted to the data.
  se = FALSE: Removes the confidence interval shading around the regression line. If you want to display the confidence interval, you can set se = TRUE or omit the argument.
Adding Confidence Interval:
By default, geom_smooth() adds a shaded area around the regression line representing the confidence interval. You can enable or disable this with the se argument:
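For example, the band could be requested explicitly as sketched here (se = TRUE matches the default, so it is written out only for clarity):

```r
library(ggplot2)

# se = TRUE draws the shaded confidence band around the fitted line
p_se <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

p_se
```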
Changing Line Color:
We can change the color of the regression line using the color argument:
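A sketch with a red line; the color "red" is an arbitrary choice for illustration:

```r
library(ggplot2)

# color inside geom_smooth() affects only the fitted line, not the points
p_red_line <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")

p_red_line
```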
Changing Line Type:
To change the type of line (e.g., dashed, dotted), use the linetype argument:
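One possible version, assuming a dashed line is wanted:

```r
library(ggplot2)

# linetype accepts names such as "solid", "dashed", or "dotted"
p_dashed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed")

p_dashed
```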
Adding Non-Linear Trend Lines:
If you want to fit a non-linear model (e.g., a LOESS curve), you can specify a different method in geom_smooth():
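A sketch using LOESS smoothing on the same data (the name p_loess is illustrative):

```r
library(ggplot2)

# method = "loess" fits a locally weighted smooth curve instead of a straight line
p_loess <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess")

p_loess
```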
You can then modify color, linetype, or se for the non-linear curve in the same way as for the linear regression line.
Use the Gapminder dataset from the gapminder package.
Task: Create a scatter plot showing the relationship between GDP per capita (gdpPercap) and life expectancy (lifeExp) for the year 2007.
Hint: Use geom_point() and filter the data to include only the year 2007.
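One possible approach to this first task is sketched below. It assumes the gapminder package is installed and uses base R's subset() for the filtering step (dplyr::filter() would work equally well); the names gap_2007 and p_2007 are illustrative. The later tasks can start from the same filtered data.

```r
library(ggplot2)
library(gapminder)  # provides the gapminder data frame

# Keep only the rows for the year 2007 (one row per country)
gap_2007 <- subset(gapminder, year == 2007)

# GDP per capita on the x-axis, life expectancy on the y-axis
p_2007 <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

p_2007
```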
Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007 only, and color the points by continent.
Hint: Use the color aesthetic to map the continent variable.
Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007 only, with point sizes representing the population (pop).
Hint: Use the size aesthetic to map the pop variable.
Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007 only, but apply a logarithmic transformation to the x-axis.
Hint: Use scale_x_log10() to apply the transformation.
Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007, and add a linear regression line.
Hint: Use geom_smooth(method = "lm", se = FALSE) after geom_point().
Task: Create scatter plots of GDP per capita vs. life expectancy for each continent for the year 2007, displayed in separate panels.
Hint: Use facet_grid(. ~ continent).
Task: Create scatter plots of GDP per capita vs. life expectancy, with a grid of plots showing different years on the x-axis and continents on the y-axis.
Hint: Use facet_grid(continent ~ year).
Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007, but highlight the points for China, India, and the United States in red.
Hint: Create a new variable COLOR to assign the color, and use scale_color_identity().
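The hint for the highlighting task could be followed as in this sketch. It assumes the gapminder package is installed; the grey shade used for the non-highlighted countries and the names highlight and p_highlight are illustrative choices, not part of the task.

```r
library(ggplot2)
library(gapminder)

gap_2007 <- subset(gapminder, year == 2007)

# COLOR holds a literal color name for each row
highlight <- c("China", "India", "United States")
gap_2007$COLOR <- ifelse(gap_2007$country %in% highlight, "red", "grey60")

# scale_color_identity() tells ggplot2 to use the values in COLOR
# directly as colors instead of treating them as categories
p_highlight <- ggplot(gap_2007, aes(x = gdpPercap, y = lifeExp, color = COLOR)) +
  geom_point() +
  scale_color_identity()

p_highlight
```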
Task: Create a scatter plot of GDP per capita vs. life expectancy for the year 2007, using different point shapes for each continent.
Hint: Use the shape aesthetic to map the continent variable.