International College of Digital Innovation, CMU
April 8, 2025
Credit: https://www.thinklytics.io/how-to-choose-the-better-graph-for-data-visualization/
Selecting the right chart for your data depends on the type of data you have and the message you want to convey. Here are some general guidelines to help you choose the appropriate chart:
Histogram:
Used to represent the distribution of a continuous dataset.
Useful for understanding the underlying frequency distribution of a set of continuous or discrete data.
Bar Chart:
Use when you want to compare values across categories.
Helpful for showing trends over time or comparing values for different groups.
Pie Chart:
Suitable for displaying parts of a whole.
Avoid using for more than 5-7 categories, as it can become difficult to interpret.
Line Chart:
Ideal for showing trends and changes over continuous intervals (time, temperature, etc.).
Useful for displaying data with multiple series.
Scatter Plot:
Shows the relationship between two variables.
Useful for identifying patterns and outliers in data.
Bubble Chart:
Represents three dimensions of data: x-axis, y-axis, and size of the bubble.
Useful for showing relationships between three variables.
Financial Chart:
Represents financial data, such as prices, for technical analysis.
Helps you assess price trends, momentum, and similar indicators.
When selecting a chart, consider the nature of your data, the story you want to tell, and the audience you are addressing.
Experiment with different chart types and choose the one that effectively communicates your message.
Additionally, you can read more at https://r-graph-gallery.com
The components of a statistical graph include various elements that help visualize and communicate statistical information effectively.
The key components commonly found in statistical graphs:
Axes: Graphs typically have X and Y axes that represent the horizontal and vertical dimensions, respectively. The axes provide a reference for the data points and numerical scales.
Data Points: These are individual markers or symbols representing specific data values on the graph. Data points can take the form of dots, crosses, or other symbols, depending on the graph type.
Lines and Curves: In line graphs, lines or curves connect data points, showing trends or patterns in the data. This is common in time series or continuous data representation.
Bars: Bar graphs use bars of varying lengths to represent different data categories or groups. The height of the bar corresponds to the value of the data it represents.
Title and Labels: Graphs typically have a title that describes the content of the graph. Labels for the X and Y axes help identify the variables and their units, providing context for interpretation.
Symbols: Symbols, such as different shapes or colors, may be used to distinguish between different data sets or categories within the graph. These symbols aid in clarity and interpretation.
Legend: A legend is a key that explains the meaning of symbols, colors, or line types used in the graph. It helps readers understand the representation of different elements in the graph.
Color: Color can be used to differentiate between data sets, highlight specific points, or convey additional information. Careful color choices enhance the visual appeal and clarity of the graph.
Gridlines: Gridlines on the graph provide a reference to the scale, aiding in the interpretation of values and the overall structure of the graph.
Scale: The scale on the axes defines the numerical values represented by the graph. It helps readers understand the magnitude of the data and facilitates accurate interpretation.
Tick Marks: Tick marks along the axes indicate specific points on the scale, aiding in reading and interpreting the graph.
Frame: The frame is the boundary or border that encloses the entire graph. It provides a visual boundary and contributes to the overall aesthetics of the graph.
Effective use and combination of these components depend on the type of graph and the specific information being presented.
Well-designed statistical graphs enhance data communication and understanding for the audience.
A histogram is a graphical representation of the distribution of data.
It is a type of bar chart that displays the frequencies or counts of data within specified intervals or bins.
The horizontal axis of the histogram represents the range of values (or intervals), and the vertical axis represents the frequency or count of occurrences within each interval.
How a histogram is typically constructed:
Data Collection: Gather a set of data that you want to analyze.
Divide into Intervals (Bins): Divide the range of the data into intervals or bins. Each bin represents a specific range of values.
Count Frequencies: Count the number of data points that fall into each bin.
Create Bars: Draw bars above each bin on the histogram. The height of each bar corresponds to the frequency of data points in that bin.
Histograms are useful for visualizing the distribution of data, showing whether the data is symmetric or skewed.
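The construction steps above can be sketched in R. This is a minimal illustration, assuming a simulated sample x (not data from this course); hist() with plot = FALSE returns the bin edges and counts without drawing, and plot() then draws the bars.

```r
# Hypothetical sample: 200 draws from a standard normal distribution
set.seed(123)
x <- rnorm(200)

# Divide into bins and count frequencies without drawing anything.
# breaks = 10 is only a suggestion; R picks "pretty" cut points near it.
h <- hist(x, breaks = 10, plot = FALSE)
h$breaks   # bin edges
h$counts   # number of observations in each bin

# Create the bars from the pre-computed bins
plot(h, main = "Histogram of x", xlab = "x", col = "lightblue")
```

Every observation falls into exactly one bin, so the counts add up to the sample size.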
viewof N = Inputs.range([1000, 10000], {step: 100, label: "N"})
viewof myColor = Inputs.color({ label: "Choose a color", value: "#ff0000" })
viewof myText = Inputs.text({ label: "Enter text", placeholder: "Type title" })
viewof Choices = Inputs.radio([
"✔️ Yes",
"❌ No"
], { label: "Theoretical curve", value:
"❌ No" })
viewof clicks = Inputs.button("Click to Random")
The hist() function in R is used to create a histogram, which is a type of plot that shows the distribution of a numeric variable.
Key Arguments
Argument | Description |
---|---|
x | A numeric vector (the data you want to plot) |
breaks | Controls the number or placement of bins |
main | Title of the plot |
xlab, ylab | Labels for the x- and y-axes |
col | Fill color for the bars |
border | Color of the borders around bars |
You can find color names and hex codes at https://www.color-hex.com
#fff8e7, #a8e4a0, #b2ec5d, #e8f48c, #bfefff, #e0ffff, #e0b0ff
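As a rough sketch of how these arguments fit together, the call below sets each of them on a simulated sample. The data, the title, and the axis labels are made up for illustration; the fill color is one of the hex codes listed above.

```r
# Hypothetical data: 500 simulated scores
set.seed(1)
x <- rnorm(500, mean = 50, sd = 10)

h <- hist(x,
          breaks = 20,                           # suggested number of bins
          main   = "Histogram of simulated scores",
          xlab   = "Score",
          ylab   = "Frequency",
          col    = "#a8e4a0",                    # fill color given as a hex code
          border = "white")                      # border color given as a name
```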
The mean-variance criterion
The mean-variance criterion is a decision-making approach commonly used in finance and investment theory. It was introduced by Harry Markowitz in 1952 and is a key component of modern portfolio theory (MPT).
The mean-variance criterion aims to optimize investment decisions by considering two key factors: the expected return and the volatility (or risk) of a portfolio.
Mean (Expected Return): This represents the average return an investor can expect from a portfolio. The higher the expected return, the better.
Variance (or Standard Deviation): This measures the volatility or risk associated with the returns of a portfolio. A lower variance indicates less risk.
The mean-variance criterion seeks the optimal portfolio by balancing these two factors. Investors are assumed to be risk-averse: for a given expected return they prefer the portfolio with lower risk, and for a given level of risk they prefer the one with higher expected return.
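To make the two quantities concrete, here is a small sketch that computes the sample mean and standard deviation of two return series. The series stock_a and stock_b are simulated placeholders, not real market data; they are given the same expected return and different volatility on purpose.

```r
set.seed(2025)
# Hypothetical daily returns (simulated): same expected return, different risk
stock_a <- rnorm(250, mean = 0.0008, sd = 0.010)
stock_b <- rnorm(250, mean = 0.0008, sd = 0.025)

summary_tbl <- data.frame(
  stock = c("A", "B"),
  mean  = c(mean(stock_a), mean(stock_b)),   # estimated expected return
  sd    = c(sd(stock_a), sd(stock_b))        # estimated risk (volatility)
)
print(summary_tbl)
```

When two assets have similar expected returns, the criterion favors the one with the smaller standard deviation. On a histogram, that asset shows up as the narrower distribution.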
Does it work or not?
We need to add the argument add = TRUE to the second hist() call.
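A minimal sketch of overlaying two histograms with add = TRUE follows. The vectors ret1 and ret2 are simulated stand-ins, since the original return series are not shown here. Shared breaks and a pre-computed y-range are a design choice in this sketch so that neither histogram is clipped.

```r
set.seed(42)
# Hypothetical return series (placeholders for the course data)
ret1 <- rnorm(1000, mean = 0.00, sd = 0.02)
ret2 <- rnorm(1000, mean = 0.01, sd = 0.04)

# Common bin edges covering both series
brks <- pretty(range(c(ret1, ret2)), n = 30)

# Count first (no drawing) to find a y-axis tall enough for both
h1 <- hist(ret1, breaks = brks, plot = FALSE)
h2 <- hist(ret2, breaks = brks, plot = FALSE)
ymax <- max(h1$counts, h2$counts)

# First call creates the plot; second call draws on top of it
hist(ret1, breaks = brks, ylim = c(0, ymax), col = rgb(1, 0, 0, 0.4),
     main = "Two return distributions", xlab = "Return")
hist(ret2, breaks = brks, col = rgb(0, 0, 1, 0.4), add = TRUE)
```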
Question
After running the code for return.stock3, how would you put all three histograms on the same plot?
You can set the legend position with "topleft", "bottomleft", "topright", or "bottomright".
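One possible approach, sketched with simulated data: draw the first histogram normally, add the other two with add = TRUE, and place a legend with one of the position keywords. The vectors r1, r2, and r3 and the legend labels are hypothetical stand-ins for the return series used in class.

```r
set.seed(7)
# Hypothetical return series (simulated placeholders)
r1 <- rnorm(1000, mean = 0.00, sd = 0.02)
r2 <- rnorm(1000, mean = 0.01, sd = 0.03)
r3 <- rnorm(1000, mean = 0.02, sd = 0.05)

brks <- pretty(range(c(r1, r2, r3)), n = 40)   # shared bin edges
cols <- c(rgb(1, 0, 0, 0.4), rgb(0, 0.6, 0, 0.4), rgb(0, 0, 1, 0.4))

# Tallest bar across the three series, so nothing is cut off
ymax <- max(sapply(list(r1, r2, r3), function(r) {
  max(hist(r, breaks = brks, plot = FALSE)$counts)
}))

hist(r1, breaks = brks, ylim = c(0, ymax), col = cols[1],
     main = "Three return distributions", xlab = "Return")
hist(r2, breaks = brks, col = cols[2], add = TRUE)
hist(r3, breaks = brks, col = cols[3], add = TRUE)

# Position keyword: "topleft", "bottomleft", "topright", or "bottomright"
legend("topright", legend = c("Stock 1", "Stock 2", "Stock 3"), fill = cols)
```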
A density plot is a smoothed version of a histogram that shows the probability density of a continuous variable.
In R, we can create a density plot using the base plot() and density() functions.
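A minimal sketch of this, assuming a simulated sample x: density() computes a kernel density estimate, and plot() draws it.

```r
set.seed(10)
x <- rnorm(300)   # hypothetical sample

d <- density(x)   # kernel density estimate (Gaussian kernel by default)
plot(d, main = "Density plot of x", xlab = "x")
```

The area under the estimated curve is approximately 1, which is what makes it comparable to a histogram drawn on the density scale.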
Add the argument probability = TRUE to the hist() function.
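The sketch below shows why this matters when a density curve is drawn over a histogram. With probability = TRUE (an alias for freq = FALSE) the bar areas sum to 1, so the bars and the density curve share the same vertical scale. The sample x is simulated for illustration.

```r
set.seed(11)
x <- rnorm(300)   # hypothetical sample

# Histogram on the density scale instead of raw counts
h <- hist(x, probability = TRUE,
          main = "Histogram with density curve", xlab = "x",
          col = "lightgray")

# Overlay the kernel density estimate
lines(density(x), col = "red", lwd = 2)
```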
xlab: This argument sets the label for the x-axis in a plot.
ylab: This argument sets the label for the y-axis in a plot.
lwd: This stands for "line width" and controls the thickness of lines in a plot. The default line width is lwd = 1, and increasing this value will make the line thicker.
xlim: This argument sets the limits (range) of the x-axis. It takes a vector of two numbers, where the first number is the lower limit and the second is the upper limit.
ylim: This argument sets the limits (range) of the y-axis. It also takes a vector of two numbers, specifying the lower and upper limits.
lty: This stands for "line type" and controls the style of lines in a plot. It accepts integers or character strings representing different line styles. Common lty values:
lty = 1 or "solid": A solid line (default).
lty = 2 or "dashed": A dashed line.
lty = 3 or "dotted": A dotted line.
lty = 4 or "dotdash": A dot-dash line.
lty = 5 or "longdash": A long-dash line.
lty = 6 or "twodash": A two-dash line.
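A short sketch of these arguments used together, drawing three normal density curves with different line types and widths. The curves, labels, and axis limits are chosen only for illustration.

```r
x <- seq(-4, 4, length.out = 200)

# First curve: solid line, explicit axis limits and labels
plot(x, dnorm(x), type = "l", lwd = 2, lty = 1,
     xlim = c(-4, 4), ylim = c(0, 0.45),
     xlab = "x", ylab = "Density", main = "Normal densities")

# Additional curves: lty given as an integer and as a character string
lines(x, dnorm(x, sd = 1.5), lwd = 2, lty = 2)
lines(x, dnorm(x, sd = 2),   lwd = 3, lty = "dotted")

legend("topright", legend = c("sd = 1", "sd = 1.5", "sd = 2"),
       lty = c(1, 2, 3), lwd = c(2, 2, 3))
```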
Suppose we want to color the bars red for x > 1.75.
Step 1: Compute the histogram without drawing it by setting plot = FALSE in the hist() function, and assign the result to a new variable (here, h).
Step 2: Select the bars by their midpoints, which are stored in the object h (in this case, using the cutoff 1.75), and then draw the histogram with the plot() function.
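The two steps might look like the sketch below. The sample x is simulated, and the strict comparison h$mids > 1.75 is an assumption about how the cutoff is applied. Because colors are assigned per bar, the rule acts on whole bins through their midpoints rather than on individual observations.

```r
set.seed(3)
x <- rnorm(1000)   # hypothetical sample

# Step 1: compute the histogram object without plotting
h <- hist(x, plot = FALSE)

# Step 2: one color per bar, chosen from the bar midpoints
bar_cols <- ifelse(h$mids > 1.75, "red", "lightgray")
plot(h, col = bar_cols, main = "Bars above 1.75 in red", xlab = "x")
```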
Task:
Create a basic histogram of the mpg (miles per gallon) variable from the mtcars dataset.
Instructions:
Load the mtcars dataset using data(mtcars).
Use the hist() function to create the histogram.
Add a title and labels to the axes using the main, xlab, and ylab arguments.
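A possible solution, sketched here because none is given for this task. The title, labels, and fill color are arbitrary choices.

```r
# Load the mtcars dataset
data(mtcars)

# Basic histogram of mpg with a title and axis labels
h <- hist(mtcars$mpg,
          main = "Histogram of Miles Per Gallon",
          xlab = "Miles Per Gallon",
          ylab = "Frequency",
          col  = "lightblue")
```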
Task:
Create a histogram of the Sepal.Length variable from the iris dataset with custom bin widths.
Instructions:
Load the iris dataset using data(iris).
Use the breaks argument in the hist() function to create three histograms with different numbers of bins (e.g., breaks = 5, breaks = 15, breaks = 30).
Observe how changing the number of bins affects the histogram.
Solution:
# Load the iris dataset
data(iris)
# Note: a single number for breaks is only a suggestion;
# R picks "pretty" cut points, so the actual bin count may differ.
# Histogram with breaks = 5
hist(iris$Sepal.Length,
     breaks = 5,
     main = "Histogram of Sepal Length (breaks = 5)",
     xlab = "Sepal Length",
     col = "lightgreen",
     border = "black")
# Histogram with breaks = 15
hist(iris$Sepal.Length,
     breaks = 15,
     main = "Histogram of Sepal Length (breaks = 15)",
     xlab = "Sepal Length",
     col = "lightcoral",
     border = "black")
# Histogram with breaks = 30
hist(iris$Sepal.Length,
     breaks = 30,
     main = "Histogram of Sepal Length (breaks = 30)",
     xlab = "Sepal Length",
     col = "lightblue",
     border = "black")
Task:
Create a histogram of the weight variable from a custom dataset and apply custom colors to the bars.
Instructions:
Create a vector weight <- c(58, 62, 67, 70, 73, 75, 80, 85, 90, 95, 100, 105, 110).
Use the hist() function to plot the histogram.
Apply a custom color to the bars using the col argument (e.g., col = "skyblue").
Customize the border color of the bars using the border argument (e.g., border = "black").
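A possible solution, sketched here because none is given for this task. The title and axis label are arbitrary choices; the colors are the ones suggested in the instructions.

```r
# Custom data from the task
weight <- c(58, 62, 67, 70, 73, 75, 80, 85, 90, 95, 100, 105, 110)

# Histogram with a custom fill color and border color
h <- hist(weight,
          main   = "Histogram of Weight",
          xlab   = "Weight",
          col    = "skyblue",
          border = "black")
```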
Task:
Overlay a density line on top of the histogram of the mpg variable from the mtcars dataset and add a rug plot.
Instructions:
Load the mtcars dataset.
Use hist(mtcars$mpg, freq = FALSE) to plot the histogram with density on the y-axis.
Use the lines() function with the density() function to add a density plot.
Add a rug plot using the rug() function.
Solution:
# Load the mtcars dataset
data(mtcars)
# Plot the histogram with density
hist(mtcars$mpg,
     freq = FALSE,
     main = "Histogram with Density and Rug Plot",
     xlab = "Miles Per Gallon",
     ylab = "Density",
     col = "lightgray",
     border = "black")
# Add a density line
lines(density(mtcars$mpg),
      col = "red",
      lwd = 2)
# Add a rug plot
rug(mtcars$mpg)
Task:
Compare the histograms of two different variables (Sepal.Length and Sepal.Width) from the iris dataset on the same plot.
Instructions:
Load the iris dataset.
Use the hist() function to plot the histogram of Sepal.Length with a specific color and transparency (col = rgb(1, 0, 0, 0.5)).
Use the hist() function again to plot the histogram of Sepal.Width on the same plot by setting add = TRUE and choosing a different color (col = rgb(0, 0, 1, 0.5)).
Add a legend to differentiate between the two histograms.
Solution:
# Load the iris dataset
data(iris)
# Plot the histogram of Sepal.Length
# xlim and ylim are widened so the Sepal.Width bars added below are not cut off
hist(iris$Sepal.Length,
     col = rgb(1, 0, 0, 0.5),
     xlim = c(2, 8),
     ylim = c(0, 40),
     main = "Histogram of Sepal Length and Sepal Width",
     xlab = "Length/Width",
     ylab = "Frequency",
     border = "black")
# Add the histogram of Sepal.Width on the same plot
hist(iris$Sepal.Width,
     col = rgb(0, 0, 1, 0.5),
     add = TRUE,
     border = "black")
# Add a legend
legend("topright",
       legend = c("Sepal Length", "Sepal Width"),
       fill = c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)))