International College of Digital Innovation, CMU
December 17, 2024
In R, a data frame is a fundamental data structure used for storing and organizing data in a tabular format.
It’s similar to a table in a database or a spreadsheet in which data is arranged in rows and columns.
Here are a few key points about data frames in R:
Tabular Structure: Data frames consist of rows and columns where each column can hold different types of data (numeric, character, factor, etc.). Rows represent observations, while columns represent variables or attributes.
Mixed Data Types: Unlike matrices, data frames can contain columns with different data types. For instance, one column might contain numeric values, another might have strings, and another might hold categorical data.
Data Manipulation: Data frames allow for easy manipulation, subsetting, and transformation of data using various functions and operations provided by R.
Importing and Exporting Data: R provides functions to import data from various file formats (such as CSV, Excel, etc.) into data frames, making it convenient to work with external datasets. Similarly, data frames can be exported to these formats as well.
A data frame is created from multiple vector objects in R by using the data.frame() function.
Example
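A minimal sketch of building a data frame from vectors. The vector names and values are hypothetical, chosen only for illustration; they are not the course dataset.

```r
# Hypothetical vectors of equal length (illustrative values only)
id      <- 1:5
name    <- c("Anna", "Ben", "Chai", "Dao", "Ek")
score   <- c(82.5, 74.0, 91.0, 68.5, 88.0)
is_Thai <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Each vector becomes one column; the column names default to the vector names
Data <- data.frame(id, name, score, is_Thai)
print(Data)
```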
We can also provide alternative names for the variables in a data frame.
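One way to supply alternative names is to pass each vector as a named argument to data.frame(). The sketch below assumes hypothetical vectors and names.

```r
# Hypothetical vectors (illustrative values only)
id   <- 1:3
name <- c("Anna", "Ben", "Chai")

# The argument names (StudentID, StudentName) become the column names
Data2 <- data.frame(StudentID = id, StudentName = name)
print(names(Data2))
```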
We can check the structure of a data frame with the str() function.
The results show the following:
The number of variables and observations in the data frame.
The types of variables: character, numeric, integer, logical, factor, etc.
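A short sketch of str() on a small, hypothetical data frame. The first line of the output reports the number of observations and variables, and each following line gives one variable with its type.

```r
# Hypothetical data frame with three different column types
Data <- data.frame(
  id      = 1:3,
  name    = c("Anna", "Ben", "Chai"),
  is_Thai = c(TRUE, FALSE, TRUE)
)

# Compact overview: dimensions, variable names, types, and first values
str(Data)
```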
The datatable() function from the DT package displays a data frame as an interactive table, which is very useful for HTML output.
Install the DT package
Use the datatable() function
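A minimal sketch of both steps, assuming a hypothetical data frame. The call is guarded with requireNamespace() so the code still runs when DT is not installed.

```r
# install.packages("DT")  # run once if the package is not yet installed

# Hypothetical data frame (illustrative values only)
Data <- data.frame(id = 1:3, name = c("Anna", "Ben", "Chai"))

if (requireNamespace("DT", quietly = TRUE)) {
  # datatable() returns an HTML widget; printing it renders the
  # interactive table in the viewer or in HTML output
  table_widget <- DT::datatable(Data)
  print(class(table_widget))
}
```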
The colnames() function in R is used to get or set the column names of a matrix or data frame.
Example of colnames() function usage
Change the variable “is_Thai” to “is_Chinese”
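A sketch of getting and setting column names. The data frame is hypothetical, and only the column name is_Thai is taken from the text above.

```r
# Hypothetical data frame containing an is_Thai column
Data <- data.frame(id = 1:3, is_Thai = c(TRUE, FALSE, TRUE))

# Get the current column names
print(colnames(Data))

# Set: rename only the column called "is_Thai"
colnames(Data)[colnames(Data) == "is_Thai"] <- "is_Chinese"
print(colnames(Data))
```

Matching on the name rather than on a position keeps the rename correct even if the column order changes.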
Use the cbind() function
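A minimal sketch of adding a variable with cbind(), assuming a hypothetical data frame and a hypothetical new vector age.

```r
# Hypothetical data frame and new variable (illustrative values only)
Data <- data.frame(id = 1:3, name = c("Anna", "Ben", "Chai"))
age  <- c(21, 25, 19)

# cbind() appends the vector as a new column; its length must match nrow(Data)
Data <- cbind(Data, age)
print(Data)
```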
Another way to add a new variable to the data frame using
We can access any value from the data frame in a manner similar to accessing a matrix.
The first observation of the first variable.
All observations of the first variable.
The first 5 observations of every variable.
Observations 1, 3, and 5 of variables 1 and 3.
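The four cases above can be sketched with matrix-style [row, column] indexing. The data frame here is hypothetical.

```r
# Hypothetical data frame with 6 observations and 3 variables
Data <- data.frame(
  id     = 1:6,
  letter = letters[1:6],
  score  = c(10, 20, 30, 40, 50, 60)
)

first_value <- Data[1, 1]                 # first observation, first variable
first_var   <- Data[, 1]                  # all observations, first variable
first_five  <- Data[1:5, ]                # first 5 observations, every variable
picked      <- Data[c(1, 3, 5), c(1, 3)]  # observations 1, 3, 5; variables 1 and 3

print(picked)
```

Leaving the row or column position empty means "take everything" along that dimension.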
The data frame command to access one variable from the data frame.
Show every value from the second variable.
Show the first 5 observations from the second variable.
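A sketch assuming the command meant here is the $ operator, applied to a hypothetical data frame whose second variable is called letter.

```r
# Hypothetical data frame; "letter" is the second variable
Data <- data.frame(id = 1:6, letter = letters[1:6])

# Every value of the second variable
all_letters <- Data$letter
print(all_letters)

# The first 5 observations of the second variable
first_five <- Data$letter[1:5]
print(first_five)
```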
All variables in the data frame Data.
To delete the variable letter.
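One possible way to list the variables and then delete one, assuming a hypothetical data frame Data that contains a variable named letter.

```r
# Hypothetical data frame containing a "letter" variable
Data <- data.frame(id = 1:3, letter = c("a", "b", "c"), score = c(10, 20, 30))

# List all variable names
print(names(Data))

# Assigning NULL to a column removes it from the data frame
Data$letter <- NULL
print(names(Data))
```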
head() function: Returns the first n rows of the data frame.
Show the first 6 observations
Show the first 3 observations
tail() function: Returns the last n rows of the data frame.
Show the last 6 observations
Show the last 4 observations
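A short sketch of head() and tail() on a hypothetical 10-row data frame. Both functions return 6 rows when n is not given.

```r
# Hypothetical data frame with 10 observations
Data <- data.frame(id = 1:10, letter = letters[1:10])

first_six   <- head(Data)      # first 6 observations (default n = 6)
first_three <- head(Data, 3)   # first 3 observations
last_six    <- tail(Data)      # last 6 observations
last_four   <- tail(Data, 4)   # last 4 observations

print(first_three)
print(last_four)
```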
summary(): Basic descriptive statistics.
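A minimal sketch of summary() on a hypothetical data frame. Numeric columns are summarized by their minimum, quartiles, mean, and maximum, and factor columns by counts per level.

```r
# Hypothetical data frame with one numeric and one factor column
Data <- data.frame(
  score = c(10, 20, 30, 40),
  group = factor(c("a", "b", "a", "b"))
)

# One summary block per column
stats <- summary(Data)
print(stats)
```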
To export a CSV file, use the readr package.
To export an XLSX file, use the writexl package.
1) Install 3 packages (only one time).
Load the libraries (put this at the top of your R script)
Export the data frame “Data” to “Data152.csv”
Export the data frame “Data” to “Data152.xlsx”
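A sketch of the export steps. It assumes the three packages are readr, writexl, and readxl, uses a hypothetical data frame, and writes to a temporary directory rather than the working directory so it can be run safely. The calls are guarded so the code still runs when a package is missing.

```r
# install.packages(c("readr", "writexl", "readxl"))  # one-time install (assumed package list)

# Hypothetical data frame to export
Data <- data.frame(id = 1:3, name = c("Anna", "Ben", "Chai"))

# Temporary output paths; replace with "Data152.csv" / "Data152.xlsx"
# to write into the working directory instead
csv_path  <- file.path(tempdir(), "Data152.csv")
xlsx_path <- file.path(tempdir(), "Data152.xlsx")

if (requireNamespace("readr", quietly = TRUE)) {
  readr::write_csv(Data, csv_path)      # export to CSV
}
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(Data, xlsx_path)  # export to XLSX
}
```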
To import a CSV file into R, use the read.csv() function.
To import an XLSX file into R, use the readxl package.
Example
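A self-contained sketch of the import step. It first writes a small hypothetical CSV file to a temporary directory so that read.csv() has something to read; the XLSX part is guarded and only runs if readxl is installed and the file exists.

```r
# Write a small hypothetical CSV file so the example is self-contained
csv_path <- file.path(tempdir(), "Data152.csv")
write.csv(
  data.frame(id = 1:3, name = c("Anna", "Ben", "Chai")),
  csv_path,
  row.names = FALSE
)

# Import the CSV file into a data frame
Data_csv <- read.csv(csv_path)
str(Data_csv)

# Import an XLSX file (only if readxl is installed and the file exists)
xlsx_path <- file.path(tempdir(), "Data152.xlsx")
if (requireNamespace("readxl", quietly = TRUE) && file.exists(xlsx_path)) {
  Data_xlsx <- readxl::read_excel(xlsx_path)
  print(Data_xlsx)
}
```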
tibble vs data frame
Both tibble and data frame are structures used to store tabular data in R, but they differ in behavior and functionality in several ways.
Printing Output
data frame: Displays all the data when printed, which can be overwhelming if the dataset is large.
tibble: Prints in a more compact format, showing only the first few rows and the columns that fit the screen, making it cleaner and easier to read.
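The printing difference can be seen with a hypothetical 100-row data frame. The tibble part is guarded because the tibble package may not be installed.

```r
# Hypothetical data frame with 100 rows
df <- data.frame(x = 1:100, y = sqrt(1:100))

# A data frame prints every row
print(df)

if (requireNamespace("tibble", quietly = TRUE)) {
  # A tibble prints only the first rows plus a note about the remaining ones
  tb <- tibble::as_tibble(df)
  print(tb)
}
```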
tibble is part of the tidyverse suite of packages.
Subsetting a data frame is central to data analysis, manipulation, and visualization in R. Here are some key reasons why subsetting is essential:
1. Extracting Relevant Data:
Data frames often contain a large amount of data.
Subsetting allows you to extract and work with specific rows, columns, or subsets of data that are relevant to your analysis.
This helps in focusing on the relevant parts of the data without being overwhelmed by unnecessary information.
2. Filtering Data:
Subsetting enables you to filter rows based on specific conditions.
For example, you can extract all rows where a certain column meets a criterion (e.g., all customers from a specific city, all transactions above a certain amount).
3. Creating New Data Frames:
Subsetting allows you to create new data frames that contain only the subset of data you are interested in.
This can be useful for creating subsets for different analyses or for sharing specific parts of the data with others.
4. Data Manipulation:
Once you have subsets of data, you can perform various operations such as calculating summary statistics, aggregating data, or creating plots.
Subsetting helps in efficiently manipulating data for these tasks.
5. Improving Performance:
Working with smaller subsets of data can improve the performance of your analysis, especially when dealing with large datasets.
Subsetting allows you to focus computations and visualizations on smaller portions of the data, which can be processed more quickly.
Selecting Columns: selects only some of the variables in the data frame.
Filtering Rows: filters rows based on a condition specified in condition.
Slicing: df[row_indices, col_indices] selects specific rows and columns based on indices or logical conditions (covered in the previous topic).
1. Selecting rows from the mtcars dataset where mpg > 20
2. Selecting rows from the mtcars dataset where mpg > 20 and mpg < 25.
3. From mtcars, select the rows with mpg > 20 and mpg < 25, then select the variables mpg, cyl, and disp.
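The three mtcars tasks can be sketched in base R as follows. The original notes may use different functions; logical indexing and subset() are shown here as one possible approach.

```r
# 1. Rows where mpg > 20
high_mpg <- mtcars[mtcars$mpg > 20, ]

# 2. Rows where mpg > 20 and mpg < 25
mid_mpg <- mtcars[mtcars$mpg > 20 & mtcars$mpg < 25, ]

# 3. Same rows, keeping only mpg, cyl, and disp
mid_small <- mtcars[mtcars$mpg > 20 & mtcars$mpg < 25, c("mpg", "cyl", "disp")]

# Equivalent to task 3 using subset(): condition first, then the columns to keep
mid_small2 <- subset(mtcars, mpg > 20 & mpg < 25, select = c(mpg, cyl, disp))

print(mid_small)
```

Inside subset() the column names can be written without the mtcars$ prefix, which keeps longer conditions readable.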
The pipe operator |> takes the output from the expression on its left-hand side and passes it as the first argument to the function call on its right-hand side.
This allows you to chain multiple function calls together, where each function operates on the result of the previous one.
Note
Mac: Command + Shift + M
Windows: Ctrl + Shift + M
The standard code
Use pipe operator
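A sketch comparing the two styles on the built-in mtcars dataset. The specific task (filter, select, then take the first 3 rows) is an illustrative assumption. The native pipe |> requires R 4.1.0 or later.

```r
# Standard code: nested calls are read from the inside out
nested <- head(subset(mtcars, mpg > 20, select = c(mpg, cyl, disp)), 3)

# Pipe operator: the same steps read top to bottom
piped <- mtcars |>
  subset(mpg > 20, select = c(mpg, cyl, disp)) |>
  head(3)

print(piped)
```

Both versions produce the same result; only the reading order of the code differs.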
Important
You will see more of the benefits of the pipe operator in the data wrangling chapter.
Readability: Code written with the pipe operator reads left-to-right, making it easier to understand the flow of operations.
Code Structure: It allows for a more modular approach to coding, where each step in a data manipulation or analysis pipeline is clear and separate.
Debugging: It simplifies debugging because you can comment out or inspect intermediate steps easily.
Avoiding Nested Functions: It reduces the need for nested function calls (f(g(h(x)))), making the code more readable and maintainable.
1. Create a data frame named my_data with three columns:
ID (containing the numbers 1 to 5).
Name (containing the names “Alice”, “Bob”, “Charlie”, “David”, “Eva”).
Age (containing the ages 25, 30, 35, 40, 45).
2. Select the Name column from my_data.
3. Filter the rows where Age is greater than 30.
4. Add a new column Salary to my_data with values 50000, 55000, 60000, 65000, and 70000.
5. Rename ID to EmployeeID and Name to EmployeeName.
6. […] the Salary column from my_data.
7. Sort my_data by the Age column in descending order.
8. Create another data frame my_data2 with columns
EmployeeID (1 to 5)
Department (“HR”, “IT”, “Finance”, “Marketing”, “Sales”).
After that, merge my_data with my_data2 on the EmployeeID column.
9. […] my_data. Hint: use the mean() function.
10. From merged_data in question 8, select the EmployeeID and Department columns for employees in the “IT” department.
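A hedged sketch of how several of these steps might look in base R. It is one possible approach, not necessarily the intended solution, and it covers only the steps whose wording is fully given. The mean of Age at the end is an assumption about what the mean() hint refers to.

```r
# Create my_data with the three columns described above
my_data <- data.frame(
  ID   = 1:5,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age  = c(25, 30, 35, 40, 45)
)

# Add the Salary column
my_data$Salary <- c(50000, 55000, 60000, 65000, 70000)

# Rename ID and Name by matching on the current names
names(my_data)[names(my_data) == "ID"]   <- "EmployeeID"
names(my_data)[names(my_data) == "Name"] <- "EmployeeName"

# Sort by Age in descending order
my_data <- my_data[order(my_data$Age, decreasing = TRUE), ]

# Create my_data2 and merge on EmployeeID
my_data2 <- data.frame(
  EmployeeID = 1:5,
  Department = c("HR", "IT", "Finance", "Marketing", "Sales")
)
merged_data <- merge(my_data, my_data2, by = "EmployeeID")

# EmployeeID and Department for employees in the "IT" department
it_rows <- merged_data[merged_data$Department == "IT", c("EmployeeID", "Department")]
print(it_rows)

# Assumption: the mean() hint is illustrated here with the Age column
mean_age <- mean(my_data$Age)
print(mean_age)
```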
[…] Use rbind to combine them into a single data frame.
[…] Use cbind to combine them into one data frame.
Use the subset function to extract rows where a certain column meets a specified condition (e.g., subset(df, column_name > 50)).
Add a new row to a data frame using rbind. Ensure the new row has the same column structure as the original data frame.
Add a new column to a data frame using cbind. The new column can be a vector of the same length as the number of rows in the data frame.
[…] Use rbind to combine them, and handle the resulting NA values appropriately.
Use the subset function to select specific columns from a data frame, returning a new data frame with only those columns.
[…] Use a conditional statement to rbind only the rows from the second data frame that meet a certain condition into the first data frame.
Use subset to extract rows that meet multiple conditions (e.g., subset(df, column1 > 50 & column2 == "value")).
[…] Use cbind to combine them and handle the resulting mismatch.
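A minimal sketch of the rbind, cbind, and subset operations these exercises practice. The data frames df1 and df2 and their columns are hypothetical.

```r
# Two hypothetical data frames with the same column structure
df1 <- data.frame(id = 1:2, value = c(40, 60))
df2 <- data.frame(id = 3:4, value = c(80, 20))

# rbind() stacks rows; the column names must match
stacked <- rbind(df1, df2)

# cbind() adds a column; the vector length must equal nrow(stacked)
stacked <- cbind(stacked, label = c("a", "b", "c", "d"))

# subset() with one condition
above_50 <- subset(stacked, value > 50)

# subset() with multiple conditions and a column selection
one_row <- subset(stacked, value > 50 & label == "c", select = c(id, value))

# Conditional rbind: append only the rows of df2 that meet a condition
appended <- rbind(df1, df2[df2$value > 50, ])

print(above_50)
print(appended)
```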
Solution: