International College of Digital Innovation, CMU
October 29, 2025
Data Structure in R (ref: First Steps in R)
In R, a data frame is a fundamental data structure used for storing and organizing data in a tabular format.
It’s similar to a table in a database or a spreadsheet in which data is arranged in rows and columns.
Here are a few key points about data frames in R:
Tabular Structure: Data frames consist of rows and columns where each column can hold different types of data (numeric, character, factor, etc.). Rows represent observations, while columns represent variables or attributes.
Mixed Data Types: Unlike matrices, data frames can contain columns with different data types. For instance, one column might contain numeric values, another might have strings, and another might hold categorical data.
Data Manipulation: Data frames allow for easy manipulation, subsetting, and transformation of data using various functions and operations provided by R.
Importing and Exporting Data: R provides functions to import data from various file formats (such as CSV, Excel, etc.) into data frames, making it convenient to work with external datasets. Similarly, data frames can be exported to these formats as well.
The data frame is created from multiple vector objects in R by using the data.frame() function.
Example
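A minimal sketch of building a data frame from vectors with data.frame(). The vector names and values below (letter, number, is_Thai) are illustrative assumptions, not the original course data.

```r
# Three vectors of equal length (hypothetical example values)
letter  <- c("a", "b", "c", "d", "e", "f")
number  <- c(10, 20, 30, 40, 50, 60)
is_Thai <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# Combine the vectors into a data frame; each vector becomes one column
# and the column names are taken from the vector names
Data <- data.frame(letter, number, is_Thai)
Data
```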
We can also provide alternative names for the variables in a data frame.
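One way to supply alternative variable names is to pass the vectors as name = value pairs to data.frame(). The names Letter and Score and the vectors x and y are hypothetical, chosen only for illustration.

```r
# Source vectors with short, uninformative names (hypothetical)
x <- c("a", "b", "c")
y <- c(1, 2, 3)

# name = value pairs set the column names explicitly
Data2 <- data.frame(Letter = x, Score = y)
names(Data2)
```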
We can check the structure of a data frame with the str() function.
The results show the following:
The number of variables and observation values in the data frame.
The types of variables: character, numeric, integer, logical, factor, etc.
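As a small sketch of what str() reports, the data frame below is a made-up example with one column of each common type.

```r
# Hypothetical data frame with four column types
Data <- data.frame(
  letter  = c("a", "b", "c"),
  number  = c(1L, 2L, 3L),
  score   = c(2.5, 3.0, 4.5),
  is_Thai = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Prints the number of observations and variables,
# followed by the type and first values of each column
str(Data)
```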
The datatable() function is used to display the data frame in an interactive style and is very useful for HTML output.
Install the DT package
Use the datatable() function
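A hedged sketch of both steps, assuming the DT package is available from CRAN. The data frame is a made-up example, and the requireNamespace() guard is an addition so the script does not fail when DT is not installed.

```r
# Install once (uncomment to run):
# install.packages("DT")

# Hypothetical example data
Data <- data.frame(
  letter = c("a", "b", "c"),
  number = c(10, 20, 30)
)

# Build the interactive table only if DT is installed
if (requireNamespace("DT", quietly = TRUE)) {
  table_widget <- DT::datatable(Data)
  # Typing table_widget at the console, or in an HTML document chunk,
  # displays the searchable, sortable table
}
```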
The colnames() function in R is used to get or set the column names of a matrix or data frame.
Example of colnames() function usage
Change the variable “is_Thai” to “is_Chinese”
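The rename can be sketched as follows. The data frame contents are assumed for illustration; only the column names is_Thai and is_Chinese come from the text.

```r
# Hypothetical data frame containing an is_Thai column
Data <- data.frame(
  letter  = c("a", "b", "c"),
  number  = c(10, 20, 30),
  is_Thai = c(TRUE, FALSE, TRUE)
)

# Get the current column names
colnames(Data)

# Set a new name for the column currently called "is_Thai"
colnames(Data)[colnames(Data) == "is_Thai"] <- "is_Chinese"
colnames(Data)
```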
Use the cbind() function
Another way to add a new variable to the data frame is direct assignment, for example with the $ operator.
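A short sketch of two ways to add a variable: cbind() and direct assignment with $. The column names height and weight and all values are hypothetical.

```r
# Hypothetical starting data frame
Data <- data.frame(
  letter = c("a", "b", "c"),
  number = c(10, 20, 30)
)

# Option 1: bind a new column onto the data frame
height <- c(150, 160, 170)
Data <- cbind(Data, height = height)

# Option 2: assign a new column directly
Data$weight <- c(50, 60, 70)

Data
```

In both cases the new vector should have as many elements as the data frame has rows.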
We can access any value from the data frame in a manner similar to accessing a matrix.
The first observation of the first variable.
All observations of the first variable.
The first 5 observations of every variable.
Observations 1, 3, and 5 from variables 1 and 3.
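The four access patterns listed above can be sketched with matrix-style [row, column] indexing. The data frame here is a made-up example.

```r
# Hypothetical data frame with 6 observations and 3 variables
Data <- data.frame(
  letter  = c("a", "b", "c", "d", "e", "f"),
  number  = seq(10, 60, by = 10),
  is_Thai = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

Data[1, 1]                   # first observation of the first variable
Data[, 1]                    # all observations of the first variable
Data[1:5, ]                  # first 5 observations of every variable
Data[c(1, 3, 5), c(1, 3)]    # observations 1, 3, 5 of variables 1 and 3
```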
Use the $ operator to access one variable from the data frame.
Show every value from the second variable.
Show the first 5 observations from the second variable.
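A minimal sketch of accessing a single variable by name, assuming the second variable is called number (a hypothetical name).

```r
# Hypothetical data frame; "number" is the second variable
Data <- data.frame(
  letter = c("a", "b", "c", "d", "e", "f"),
  number = seq(10, 60, by = 10)
)

Data$number        # every value of the second variable
Data$number[1:5]   # first 5 observations of the second variable
```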
All variables in the data frame Data.
Please run this code again.
Delete the variable letter.
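A sketch of listing the variables and then deleting one. It assumes names() for the listing and assignment of NULL for the deletion; the data values are invented.

```r
# Hypothetical data frame (re-created so the example starts from a clean state)
Data <- data.frame(
  letter  = c("a", "b", "c"),
  number  = c(10, 20, 30),
  is_Thai = c(TRUE, FALSE, TRUE)
)

names(Data)           # all variables in Data

Data$letter <- NULL   # assigning NULL removes the column
names(Data)
```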
head() function: Return the first n parts of the data frame object.
Show the first 6 observations
Show the first 3 observations
tail() function: Return the last n parts of the data frame object.
Show the last 6 observations
Show the last 4 observations
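The head() and tail() calls described above can be sketched with the built-in mtcars dataset, used here only as a convenient example.

```r
head(mtcars)        # first 6 observations (the default n is 6)
head(mtcars, 3)     # first 3 observations

tail(mtcars)        # last 6 observations
tail(mtcars, 4)     # last 4 observations
```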
summary(): Basic descriptive statistics.
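As an illustration, summary() on a few columns of the built-in mtcars dataset (an assumed example) gives the minimum, quartiles, mean, and maximum of each numeric column.

```r
# Summarise two numeric columns of a built-in dataset
stats <- summary(mtcars[, c("mpg", "hp")])
stats
```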
To export a CSV file, use the readr package.
To export an XLSX file, use the writexl package.
Load the libraries (put this at the top of your R script)
Export a data frame “Data” to “Data152.csv”
Export a data frame “Data” to “Data152.xlsx”
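A hedged sketch of both exports, assuming readr and writexl are installed. The data values are invented, and the requireNamespace() guards are an addition so the script still runs when a package is missing; in a normal script you would call library(readr) and library(writexl) at the top instead.

```r
# Hypothetical data frame to export
Data <- data.frame(
  letter = c("a", "b", "c"),
  number = c(10, 20, 30)
)

# Export to CSV in the current working directory
if (requireNamespace("readr", quietly = TRUE)) {
  readr::write_csv(Data, "Data152.csv")
}

# Export to XLSX in the current working directory
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(Data, "Data152.xlsx")
}
```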
To import a CSV file into R, use the read.csv() function.
To import an XLSX file into R, use the readxl package.
Example
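A self-contained sketch of both imports. So that it runs anywhere, it first writes small example files to a temporary directory; the paths and data are assumptions, and the XLSX part runs only if readxl and writexl are installed.

```r
# Hypothetical data written to temporary files so there is something to import
Data <- data.frame(
  letter = c("a", "b", "c"),
  number = c(10, 20, 30)
)
csv_path <- file.path(tempdir(), "Data152.csv")
write.csv(Data, csv_path, row.names = FALSE)

# Import the CSV file with base R
Data_csv <- read.csv(csv_path)

# Import an XLSX file with readxl (the result is a tibble)
if (requireNamespace("readxl", quietly = TRUE) &&
    requireNamespace("writexl", quietly = TRUE)) {
  xlsx_path <- file.path(tempdir(), "Data152.xlsx")
  writexl::write_xlsx(Data, xlsx_path)
  Data_xlsx <- readxl::read_excel(xlsx_path)
}
```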
Both tibble and data frame are structures used to store tabular data in R, but they differ in behavior and functionality in several ways.
Key Differences between tibble and data frame:
Printing Output
data frame: Displays all the data when printed, which can be overwhelming if the dataset is large.
tibble: Prints in a more compact format, showing only a few rows and columns that fit the screen, making it cleaner and easier to read.
Advantages of tibble:
Handles large datasets better.
More intuitive printing and subsetting behavior.
Reduces errors from partial name matching.
Integrates seamlessly with the tidyverse suite of packages.
Advantages of data frame:
Works well with base R functions.
Familiar and widely used for general R programming tasks.
No need to load additional packages to work with it.
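Two of the differences above, partial name matching and subsetting, can be sketched as follows. The column name number is hypothetical, and the tibble part runs only if the tibble package is installed.

```r
# Hypothetical one-column data frame
df <- data.frame(number = 1:3)

# A data frame partially matches column names with $
df$num        # returns the values of the "number" column

# Selecting a single column with [ , ] drops to a plain vector
df[, 1]

if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)

  # A tibble does not partially match: this gives NULL with a warning
  partial <- suppressWarnings(tb$num)

  # Selecting a single column with [ , ] still returns a tibble
  one_col <- tb[, 1]
}
```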
Subsetting a data frame in R is crucial for several reasons related to data analysis, manipulation, and visualization. Here are some key reasons why subsetting is essential:
1. Extracting Relevant Data:
Data frames often contain a large amount of data.
Subsetting allows you to extract and work with specific rows, columns, or subsets of data that are relevant to your analysis.
This helps in focusing on the relevant parts of the data without being overwhelmed by unnecessary information.
2. Filtering Data:
Subsetting enables you to filter rows based on specific conditions.
For example, you can extract all rows where a certain column meets a criterion (e.g., all customers from a specific city, all transactions above a certain amount).
3. Creating New Data Frames:
Subsetting allows you to create new data frames that contain only the subset of data you are interested in.
This can be useful for creating subsets for different analyses or for sharing specific parts of the data with others.
4. Data Manipulation:
Once you have subsets of data, you can perform various operations such as calculating summary statistics, aggregating data, or creating plots.
Subsetting helps in efficiently manipulating data for these tasks.
5. Improving Performance:
Working with smaller subsets of data can improve the performance of your analysis, especially when dealing with large datasets.
Subsetting allows you to focus computations and visualizations on smaller portions of the data, which can be processed more quickly.
Selecting Columns: Select only some of the variables in the data frame.
Filtering Rows: Filter rows based on a logical condition.
Slicing: df[row_indices, col_indices] selects specific rows and columns based on indices or logical conditions (see the previous topic).
1. Selecting rows from the mtcars dataset where mpg > 20
2. Selecting rows from the mtcars dataset where mpg > 20 and mpg < 25.
3. From mtcars select the data with mpg > 20 and mpg < 25, then select variable mpg, cyl and disp
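The three mtcars tasks above can be sketched with base R's subset() function. The result names (high_mpg, mid_mpg, mid_mpg_cols) are hypothetical, and the last line shows an equivalent bracket-indexing form.

```r
# 1. Rows where mpg > 20
high_mpg <- subset(mtcars, mpg > 20)

# 2. Rows where mpg > 20 and mpg < 25
mid_mpg <- subset(mtcars, mpg > 20 & mpg < 25)

# 3. Same rows, keeping only mpg, cyl and disp
mid_mpg_cols <- subset(mtcars, mpg > 20 & mpg < 25,
                       select = c(mpg, cyl, disp))

# Equivalent slicing with [rows, columns]
mid_mpg_slice <- mtcars[mtcars$mpg > 20 & mtcars$mpg < 25,
                        c("mpg", "cyl", "disp")]
```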
The pipe operator |> takes the output from the expression on its left-hand side and passes it as the first argument to the function call on its right-hand side.
This allows you to chain multiple function calls together, where each function operates on the result of the previous one.
Shortcut Keys
Mac: Command + Shift + M
Windows: Ctrl + Shift + M
Comparing standard code with the pipe operator
The standard code
Use pipe operator
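A small sketch of the comparison, using the built-in mtcars dataset as an assumed example. The native pipe |> requires R 4.1 or later.

```r
# The standard code: calls are nested and read inside-out
result_standard <- head(subset(mtcars, mpg > 20, select = c(mpg, cyl)), 3)

# Use pipe operator: the same steps read left to right
result_pipe <- mtcars |>
  subset(mpg > 20, select = c(mpg, cyl)) |>
  head(3)

# Both approaches produce the same data frame
identical(result_standard, result_pipe)
```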
Important
We’ll explore the advantages of the pipe operator further in the data wrangling chapter.
Benefits of Using the Pipe Operator
Readability: Code written with the pipe operator reads left-to-right, making it easier to understand the flow of operations.
Code Structure: It allows for a more modular approach to coding, where each step in a data manipulation or analysis pipeline is clear and separate.
Debugging: It simplifies debugging because you can comment out or inspect intermediate steps easily.
Avoiding Nested Functions: It reduces the need for nested function calls (f(g(h(x)))), making the code more readable and maintainable.
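To illustrate the nested-call point, here is a minimal sketch with a made-up numeric vector; each piped step can be commented out or inspected on its own. It assumes R 4.1 or later for |>.

```r
x <- c(1, 4, 9, 16)   # hypothetical input

# Nested form f(g(h(x))): read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: one step per line, in the order the steps happen
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

nested == piped
```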
1. Create a data frame named my_data with three columns:
ID (containing the numbers 1 to 5).
Name (containing the names “Alice”, “Bob”, “Charlie”, “David”, “Eva”).
Age (containing the ages 25, 30, 35, 40, 45).
2. Select the Name column from my_data.
3. Filter the rows where Age is greater than 30.
4. Add a new column Salary to my_data with values 50000, 55000, 60000, 65000, and 70000.
5. Rename ID to EmployeeID and Name to EmployeeName.
6. Remove the Salary column from my_data.
7. Sort my_data by the Age column in descending order.
8. Create another data frame my_data2 with columns
EmployeeID (1 to 5)
Department (“HR”, “IT”, “Finance”, “Marketing”, “Sales”).
After that, merge my_data with my_data2 on the EmployeeID column.
9. Calculate the average of a numeric column in my_data (for example, Age). Hint: use the mean() function.
10. From merged_data in question 8, select the EmployeeID and Department columns for employees in the “IT” department.
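One possible way to work through the my_data tasks above in base R is sketched below. It is not the official solution: the intermediate names older and it_staff are hypothetical, and it assumes the average is taken over the Age column.

```r
# Create my_data
my_data <- data.frame(
  ID   = 1:5,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age  = c(25, 30, 35, 40, 45)
)

my_data$Name                                   # the Name column
older <- subset(my_data, Age > 30)             # rows where Age > 30

# Add a Salary column
my_data$Salary <- c(50000, 55000, 60000, 65000, 70000)

# Rename ID and Name
names(my_data)[names(my_data) == "ID"]   <- "EmployeeID"
names(my_data)[names(my_data) == "Name"] <- "EmployeeName"

my_data$Salary <- NULL                         # remove Salary

# Sort by Age, descending
my_data <- my_data[order(my_data$Age, decreasing = TRUE), ]

# Second data frame, then merge on EmployeeID
my_data2 <- data.frame(
  EmployeeID = 1:5,
  Department = c("HR", "IT", "Finance", "Marketing", "Sales")
)
merged_data <- merge(my_data, my_data2, by = "EmployeeID")

average_age <- mean(my_data$Age)               # average of Age (assumed column)

# EmployeeID and Department for the IT department
it_staff <- subset(merged_data, Department == "IT",
                   select = c(EmployeeID, Department))
```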
Use rbind to combine them into a single data frame.
Use cbind to combine them into one data frame.
Use the subset function to extract rows where a certain column meets a specified condition (e.g., subset(df, column_name > 50)).
Add a new row to a data frame with rbind. Ensure the new row has the same column structure as the original data frame.
Add a new column to a data frame with cbind. The new column can be a vector of the same length as the number of rows in the data frame.
Use rbind to combine them, and handle the resulting NA values appropriately.
Use the subset function to select specific columns from a data frame, returning a new data frame with only those columns.
Use a conditional statement to rbind only the rows from the second data frame that meet a certain condition into the first data frame.
Use subset to extract rows that meet multiple conditions (e.g., subset(df, column1 > 50 & column2 == "value")).
Use cbind to combine them and handle the resulting mismatch.
Solution: