Visualizing Data in R with Default Package: Histograms

Somsak Chanaim

International College of Digital Innovation, CMU

April 8, 2025

Overview

Credit: https://www.thinklytics.io/how-to-choose-the-better-graph-for-data-visualization/

How to Select a Chart

Selecting the right chart for your data depends on the type of data you have and the message you want to convey. Here are some general guidelines to help you choose the appropriate chart:

Histogram:

  • Used to represent the distribution of a continuous dataset.

  • Useful for understanding the underlying frequency distribution of a set of continuous or discrete data.

Bar Chart:

  • Use when you want to compare values across categories.

  • Helpful for showing trends over time or comparing values for different groups.

Pie Chart:

  • Suitable for displaying parts of a whole.

  • Avoid using for more than 5-7 categories, as it can become difficult to interpret.

Line Chart:

  • Ideal for showing trends and changes over continuous intervals (time, temperature, etc.).

  • Useful for displaying data with multiple series.

Scatter Plot:

  • Shows the relationship between two variables.

  • Useful for identifying patterns and outliers in data.

Bubble Chart:

  • Represents three dimensions of data: x-axis, y-axis, and size of the bubble.

  • Useful for showing relationships between three variables.

Financial Chart:

  • Represents financial data, such as price series, for technical analysis.

  • Used to examine price trends, momentum, and similar indicators.

When selecting a chart, consider the nature of your data, the story you want to tell, and the audience you are addressing.

Experiment with different chart types and choose the one that effectively communicates your message.

Additionally, you can read more at https://r-graph-gallery.com

How to choose the better graph

https://www.thinklytics.io

The components of a statistical graph

The components of a statistical graph include various elements that help visualize and communicate statistical information effectively.

The key components commonly found in statistical graphs:

  1. Axes: Graphs typically have X and Y axes that represent the horizontal and vertical dimensions, respectively. The axes provide a reference for the data points and numerical scales.

  2. Data Points: These are individual markers or symbols representing specific data values on the graph. Data points can take the form of dots, crosses, or other symbols, depending on the graph type.

  3. Lines and Curves: In line graphs, lines or curves connect data points, showing trends or patterns in the data. This is common in time series or continuous data representation.

  4. Bars: Bar graphs use bars of varying lengths to represent different data categories or groups. The height of the bar corresponds to the value of the data it represents.

  5. Title and Labels: Graphs typically have a title that describes the content of the graph. Labels for the X and Y axes help identify the variables and their units, providing context for interpretation.

  6. Symbols: Symbols, such as different shapes or colors, may be used to distinguish between different data sets or categories within the graph. These symbols aid in clarity and interpretation.

  7. Legend: A legend is a key that explains the meaning of symbols, colors, or line types used in the graph. It helps readers understand the representation of different elements in the graph.

  8. Color: Color can be used to differentiate between data sets, highlight specific points, or convey additional information. Careful color choices enhance the visual appeal and clarity of the graph.

  9. Gridlines: Gridlines on the graph provide a reference to the scale, aiding in the interpretation of values and the overall structure of the graph.

  10. Scale: The scale on the axes defines the numerical values represented by the graph. It helps readers understand the magnitude of the data and facilitates accurate interpretation.

  11. Tick Marks: Tick marks along the axes indicate specific points on the scale, aiding in reading and interpreting the graph.

  12. Frame: The frame is the boundary or border that encloses the entire graph. It provides a visual boundary and contributes to the overall aesthetics of the graph.

Effective use and combination of these components depend on the type of graph and the specific information being presented.

Well-designed statistical graphs enhance data communication and understanding for the audience.

Histogram

A histogram is a graphical representation of the distribution of data.

It is a type of bar chart that displays the frequencies or counts of data within specified intervals or bins.

The horizontal axis of the histogram represents the range of values (or intervals), and the vertical axis represents the frequency or count of occurrences within each interval.

How a histogram is typically constructed:

  • Data Collection: Gather a set of data that you want to analyze.

  • Divide into Intervals (Bins): Divide the range of the data into intervals or bins. Each bin represents a specific range of values.

  • Count Frequencies: Count the number of data points that fall into each bin.

  • Create Bars: Draw bars above each bin on the histogram. The height of each bar corresponds to the frequency of data points in that bin.

Histograms are useful for visualizing the distribution of data, showing whether the data is symmetric or skewed.
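
The construction steps above can be sketched by hand in base R. The data vector x and the bin edges are made up for illustration; cut() and table() do the counting that hist() otherwise performs internally.

```r
# Step 1: a small made-up data set (assumption, for illustration only)
x <- c(1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6, 4.7, 5.0)

# Step 2: bin edges that cover the range of the data
edges <- c(1, 2, 3, 4, 5)

# Step 3: count the observations in each bin; intervals are (a, b],
# and include.lowest = TRUE keeps a value equal to the first edge
counts <- table(cut(x, breaks = edges, include.lowest = TRUE))
print(counts)

# Step 4: hist() performs steps 2 to 4 in one call and draws the bars
h <- hist(x, breaks = edges, main = "Histogram built from 4 bins")
print(h$counts)
```

The counts returned by hist() in h$counts match the manual table, which shows that the bar heights are simply the bin frequencies.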

The hist() function

The hist() function in R is used to create a histogram, which is a type of plot that shows the distribution of a numeric variable.

Key arguments

Argument      Description
x             A numeric vector (the data you want to plot)
breaks        Controls the number or placement of bins
main          Title of the plot
xlab, ylab    Labels for the x- and y-axes
col           Fill color for the bars
border        Color of the borders around the bars
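
A minimal sketch that uses all of the arguments listed above in one call. The data are simulated (the vector x and its "score" interpretation are assumptions made for this example, not part of the original slides).

```r
set.seed(123)                          # make the simulated data reproducible
x <- rnorm(200, mean = 50, sd = 10)    # hypothetical sample of 200 scores

h <- hist(x,
          breaks = 20,                 # suggested number of bins
          main   = "Simulated scores", # plot title
          xlab   = "Score",            # x-axis label
          ylab   = "Frequency",        # y-axis label
          col    = "lightblue",        # fill color of the bars
          border = "white")            # color of the bar outlines
```

Assigning the result to h is optional; it only keeps the computed breaks and counts available for inspection.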

Change the color

You can find color names and hex codes at https://www.color-hex.com

Try some of their hex colors:

#fff8e7, #a8e4a0, #b2ec5d, #e8f48c, #bfefff, #e0ffff, #e0b0ff
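
A minimal sketch of changing the fill color, assuming simulated data (the vector x is hypothetical). The col argument accepts either a color name or a hex code such as those listed above.

```r
set.seed(1)
x <- rnorm(100)   # hypothetical data

# Fill the bars with one of the hex colors from the list
h <- hist(x, col = "#a8e4a0", main = "Fill color set with a hex code")
```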

Change the title
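
The title is controlled by the main argument. A short sketch, again with simulated data standing in for the original dataset:

```r
set.seed(1)
x <- rnorm(100)   # hypothetical data

# main replaces the default title "Histogram of x"
h <- hist(x, main = "Distribution of 100 simulated values")
```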

Change border color

#fff8e7, #a8e4a0, #b2ec5d, #e8f48c, #bfefff, #e0ffff, #e0b0ff
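
A short sketch of the border argument, which sets the outline color of the bars independently of the fill. The data are simulated and the particular color pairing is only an example.

```r
set.seed(1)
x <- rnorm(100)   # hypothetical data

# col sets the fill, border sets the outline of each bar
h <- hist(x,
          col    = "#e0ffff",
          border = "#e0b0ff",
          main   = "Custom border color")
```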

Application of the histogram

The mean-variance criterion

The mean-variance criterion is a decision-making approach commonly used in finance and investment theory. It was introduced by Harry Markowitz in 1952 and is a key component of modern portfolio theory (MPT).

The mean-variance criterion aims to optimize investment decisions by considering two key factors: the expected return and the volatility (or risk) of a portfolio.

  • Mean (Expected Return): This represents the average return an investor can expect from a portfolio. The higher the expected return, the better.

  • Variance (or Standard Deviation): This measures the volatility or risk associated with the returns of a portfolio. A lower variance indicates less risk.

The mean-variance criterion seeks the optimal portfolio by balancing these two factors. Investors are assumed to be risk-averse: for a given expected return they prefer the portfolio with the lower variance, and for a given variance they prefer the one with the higher expected return.
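
To make the two quantities concrete, the sketch below computes the sample mean and variance of two return series. The series are simulated, and the names return.stock1 and return.stock2 are assumptions chosen to match the return.stock3 mentioned later; they are not real market data.

```r
set.seed(2025)
# Hypothetical daily returns: same expected return, different risk
return.stock1 <- rnorm(250, mean = 0.001, sd = 0.01)
return.stock2 <- rnorm(250, mean = 0.001, sd = 0.03)

# Sample mean (estimate of expected return) and sample variance (risk)
stats <- data.frame(
  stock    = c("stock1", "stock2"),
  mean     = c(mean(return.stock1), mean(return.stock2)),
  variance = c(var(return.stock1), var(return.stock2))
)
print(stats)
```

Because both series are generated with the same mean, a risk-averse investor comparing them would favor the one with the smaller variance. A histogram of each series shows the same information visually: the riskier series is more spread out.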

Draw two histograms in the same plot

Does it work?

How do we solve it?

We need to pass the argument add = TRUE to the second hist() call.
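
A minimal sketch of the fix, assuming two simulated return series (return.stock1 and return.stock2 are hypothetical names and values). Without add = TRUE the second hist() call starts a new plot and replaces the first one.

```r
set.seed(2025)
return.stock1 <- rnorm(250, mean = 0.001, sd = 0.01)   # hypothetical returns
return.stock2 <- rnorm(250, mean = 0.001, sd = 0.03)   # hypothetical returns

# Shared bin edges make the two histograms directly comparable
edges <- pretty(range(return.stock1, return.stock2), n = 30)

# First histogram sets up the axes; semi-transparent red fill
h1 <- hist(return.stock1, breaks = edges,
           col = rgb(1, 0, 0, 0.5),
           main = "Returns of two simulated stocks",
           xlab = "Return")

# add = TRUE draws on top of the existing plot instead of replacing it
h2 <- hist(return.stock2, breaks = edges,
           col = rgb(0, 0, 1, 0.5),
           add = TRUE)
```

The semi-transparent colors from rgb() with an alpha value of 0.5 keep both sets of bars visible where they overlap.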

Question

If you also run the code for return.stock3, how can you draw all three histograms in the same plot?

Add legend

You can set the position of the legend with "topleft", "bottomleft", "topright", or "bottomright".
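
A sketch of adding a legend to two overlaid histograms. The return series and the labels "Stock 1" and "Stock 2" are assumptions made for this example.

```r
set.seed(2025)
return.stock1 <- rnorm(250, mean = 0.001, sd = 0.01)   # hypothetical returns
return.stock2 <- rnorm(250, mean = 0.001, sd = 0.03)   # hypothetical returns

edges <- pretty(range(return.stock1, return.stock2), n = 30)
col1  <- rgb(1, 0, 0, 0.5)
col2  <- rgb(0, 0, 1, 0.5)

hist(return.stock1, breaks = edges, col = col1,
     main = "Returns of two simulated stocks", xlab = "Return")
hist(return.stock2, breaks = edges, col = col2, add = TRUE)

# The first argument is the position keyword; fill draws colored boxes
lg <- legend("topright",
             legend = c("Stock 1", "Stock 2"),
             fill   = c(col1, col2))
```

Changing "topright" to one of the other keywords moves the legend to that corner of the plot.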

Add box
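
By default hist() does not draw a frame around the plot region; calling box() afterwards adds one. A short sketch with simulated data:

```r
set.seed(1)
x <- rnorm(100)   # hypothetical data

h <- hist(x, col = "#bfefff", main = "Histogram with a box")

# Draw a frame around the current plot region
box()
```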

The Density Plot

A density plot is a smoothed version of a histogram that shows the probability density of a continuous variable.

In R, we can create a density plot using the base plot() and density() functions.
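
A minimal sketch, assuming simulated data: density() computes a kernel density estimate and plot() draws it as a smooth curve.

```r
set.seed(42)
x <- rnorm(300, mean = 10, sd = 2)   # hypothetical data

# Kernel density estimate (Gaussian kernel by default)
d <- density(x)

plot(d, main = "Density plot of simulated data", xlab = "x")
```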

Histogram + Density

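Add the argument probability = TRUE to the hist() function so that the y-axis shows density instead of counts, which puts the histogram on the same scale as the density curve.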

  • xlab: This argument sets the label for the x-axis in a plot.

  • ylab: This argument sets the label for the y-axis in a plot.

  • lwd: This stands for “line width” and controls the thickness of lines in a plot. The default line width is lwd = 1, and increasing this value will make the line thicker.
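
A sketch combining these arguments, with simulated data standing in for the original dataset. The histogram is drawn on the density scale and lines() overlays the kernel density estimate.

```r
set.seed(42)
x <- rnorm(300, mean = 10, sd = 2)   # hypothetical data

# probability = TRUE rescales the bars so that their total area is 1
h <- hist(x,
          probability = TRUE,
          main = "Histogram with density curve",
          xlab = "x",
          ylab = "Density",
          col  = "#bfefff")

# lwd = 2 draws the curve twice as thick as the default
lines(density(x), col = "red", lwd = 2)
```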

Density + Histogram

Histogram + Density

  • xlim: This argument sets the limits (range) of the x-axis. It takes a vector of two numbers, where the first number is the lower limit and the second is the upper limit.

  • ylim: This argument sets the limits (range) of the y-axis. It also takes a vector of two numbers, specifying the lower and upper limits.

  • lty: This stands for “line type” and controls the style of lines in a plot. It accepts integers or character strings representing different line styles. Common lty values:

    • lty = 1 or "solid": A solid line (default).

    • lty = 2 or "dashed": A dashed line.

    • lty = 3 or "dotted": A dotted line.

    • lty = 4 or "dotdash": A dot-dash line.

    • lty = 5 or "longdash": A long-dash line.

    • lty = 6 or "twodash": A two-dash line.
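
A sketch that uses xlim, ylim, and lty together. The data are simulated from a normal distribution with mean 10 and standard deviation 2, so the theoretical curve can be added for comparison; the axis limits are chosen by hand for this example.

```r
set.seed(42)
x <- rnorm(300, mean = 10, sd = 2)   # hypothetical data

h <- hist(x,
          probability = TRUE,
          xlim = c(2, 18),           # lower and upper limit of the x-axis
          ylim = c(0, 0.3),          # lower and upper limit of the y-axis
          main = "Histogram with two curves",
          xlab = "x",
          col  = "#e0ffff")

# Kernel density estimate as a dashed line (lty = 2)
lines(density(x), lty = 2, lwd = 2, col = "red")

# Normal density used to simulate the data, as a dotted line
curve(dnorm(x, mean = 10, sd = 2), add = TRUE,
      lty = "dotted", lwd = 2, col = "blue")
```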

Extra Topic: Add another color to histogram.

Suppose we want to use red for the bars where x > 1.75.

Step 1: Compute the histogram without drawing it: call hist() with the argument plot = FALSE and assign the result to a new variable, h.

Step 2: Use the bin midpoints stored in h (h$mids) to select the bars to recolor, in this case those beyond 1.75, and draw the histogram with the plot() function.
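
The two steps can be sketched as follows. The data are simulated, and the bin edges are set in steps of 0.25 starting from a whole number so that 1.75 falls exactly on a bin boundary; both choices are assumptions made for this example rather than the original slide code.

```r
set.seed(7)
x <- rnorm(500)   # hypothetical data

# Bin edges of width 0.25 covering the data; 1.75 is one of the edges
edges <- seq(floor(min(x)), ceiling(max(x)), by = 0.25)

# Step 1: compute the histogram without drawing it
h <- hist(x, breaks = edges, plot = FALSE)

# Step 2: one color per bar, chosen from the bin midpoints
bar.col <- ifelse(h$mids > 1.75, "red", "lightblue")

# plot() on a histogram object draws it with the given bar colors
plot(h, col = bar.col,
     main = "Bars above 1.75 in red", xlab = "x")
```

Because col accepts a vector with one entry per bar, any rule based on h$mids or h$counts can be used to color the bars.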

Exercise 1: Basic Histogram

Task:

Create a basic histogram of the mpg (miles per gallon) variable from the mtcars dataset.

Instructions:

  • Load the mtcars dataset using data(mtcars).

  • Use the hist() function to create the histogram.

  • Add a title and labels to the axes using the main, xlab, and ylab arguments.

Solution:

# Load the mtcars dataset
data(mtcars)

# Create a basic histogram
hist(mtcars$mpg,
     main = "Histogram of Miles Per Gallon",
     xlab = "Miles Per Gallon",
     ylab = "Frequency",
     col = "lightblue",
     border = "black")

Exercise 2: Adjusting Bin Width

Task:

Create a histogram of the Sepal.Length variable from the iris dataset with custom bin widths.

Instructions:

  • Load the iris dataset using data(iris).

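  • Use the breaks argument in the hist() function to create three histograms with different numbers of bins (e.g., breaks = 5, breaks = 15, breaks = 30). A single number is only a suggestion: R rounds the breakpoints to "pretty" values, so the actual number of bins may differ.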

  • Observe how changing the number of bins affects the histogram.

Solution:

# Load the iris dataset
data(iris)

# Histogram with breaks = 5 (a suggestion; R may use a different number of bins)
hist(iris$Sepal.Length,
     breaks = 5,
     main = "Histogram of Sepal Length (breaks = 5)",
     xlab = "Sepal Length",
     col = "lightgreen",
     border = "black")

# Histogram with breaks = 15
hist(iris$Sepal.Length,
     breaks = 15,
     main = "Histogram of Sepal Length (breaks = 15)",
     xlab = "Sepal Length",
     col = "lightcoral",
     border = "black")

# Histogram with breaks = 30
hist(iris$Sepal.Length,
     breaks = 30,
     main = "Histogram of Sepal Length (breaks = 30)",
     xlab = "Sepal Length",
     col = "lightblue",
     border = "black")

Exercise 3: Customizing Colors

Task:

Create a histogram of the weight variable from a custom dataset and apply custom colors to the bars.

Instructions:

  • Create a vector weight <- c(58, 62, 67, 70, 73, 75, 80, 85, 90, 95, 100, 105, 110).

  • Use the hist() function to plot the histogram.

  • Apply a custom color to the bars using the col argument (e.g., col = "skyblue").

  • Customize the border color of the bars using the border argument (e.g., border = "black").

Solution:

# Create a custom weight vector
weight <- c(58, 62, 67, 70, 73, 75, 80, 85, 90, 95, 100, 105, 110)

# Plot the histogram with custom colors
hist(weight,
     main = "Histogram of Weight",
     xlab = "Weight (kg)",
     ylab = "Frequency",
     col = "skyblue",
     border = "black")

Exercise 4: Adding Density and Rug Plot

Task:

Overlay a density line on top of the histogram of the mpg variable from the mtcars dataset and add a rug plot.

Instructions:

  • Load the mtcars dataset.

  • Use hist(mtcars$mpg, freq = FALSE) to plot the histogram with density on the y-axis.

  • Use the lines() function with the density() function to add a density plot.

  • Add a rug plot using the rug() function.

Solution:

# Load the mtcars dataset
data(mtcars)

# Plot the histogram with density
hist(mtcars$mpg,
     freq = FALSE,
     main = "Histogram with Density and Rug Plot",
     xlab = "Miles Per Gallon",
     ylab = "Density",
     col = "lightgray",
     border = "black")

# Add a density line
lines(density(mtcars$mpg),
      col = "red",
      lwd = 2)

# Add a rug plot
rug(mtcars$mpg)

Exercise 5: Comparing Two Distributions

Task:

Compare the histograms of two different variables (Sepal.Length and Sepal.Width) from the iris dataset on the same plot.

Instructions:

  • Load the iris dataset.

  • Use the hist() function to plot the histogram of Sepal.Length with a specific color and transparency (col = rgb(1, 0, 0, 0.5)).

  • Use the hist() function again to plot the histogram of Sepal.Width on the same plot by setting add = TRUE and choosing a different color (col = rgb(0, 0, 1, 0.5)).

  • Add a legend to differentiate between the two histograms.

Solution:

# Load the iris dataset
data(iris)

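# Plot the histogram of Sepal.Length.
# xlim and ylim are widened so that the Sepal.Width bars added below
# (values of about 2 to 4.4) are not cut off by the first plot's axes.
hist(iris$Sepal.Length,
     col = rgb(1, 0, 0, 0.5),
     xlim = c(2, 8),
     ylim = c(0, 40),
     main = "Histogram of Sepal Length and Sepal Width",
     xlab = "Length/Width",
     ylab = "Frequency",
     border = "black")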

# Add the histogram of Sepal.Width on the same plot
hist(iris$Sepal.Width,
     col = rgb(0, 0, 1, 0.5),
     add = TRUE,
     border = "black")

# Add a legend
legend("topright",
       legend = c("Sepal Length", "Sepal Width"),
       fill = c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)))