Data Structure in R: Data Frame

Somsak Chanaim

International College of Digital Innovation, CMU

December 17, 2024

Data Structures

Data Structure in R (ref: First Steps in R)

Data Frame

In R, a data frame is a fundamental data structure used for storing and organizing data in a tabular format.

It’s similar to a table in a database or a spreadsheet in which data is arranged in rows and columns.

Here are a few key points about data frames in R:

  1. Tabular Structure: Data frames consist of rows and columns where each column can hold different types of data (numeric, character, factor, etc.). Rows represent observations, while columns represent variables or attributes.

  2. Mixed Data Types: Unlike matrices, data frames can contain columns with different data types. For instance, one column might contain numeric values, another might have strings, and another might hold categorical data.

  3. Data Manipulation: Data frames allow for easy manipulation, subsetting, and transformation of data using various functions and operations provided by R.

  4. Importing and Exporting Data: R provides functions to import data from various file formats (such as CSV, Excel, etc.) into data frames, making it convenient to work with external datasets. Similarly, data frames can be exported to these formats as well.

How to create a data frame

The data frame is created from multiple vector objects in R by using the data.frame() function.

Example

We can also provide alternative names for the variables in a data frame.
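A minimal sketch of both steps follows. The vectors `name`, `age`, and `is_Thai` and their values are hypothetical, chosen only for illustration.

```r
# Hypothetical vectors; names and values are illustrative only
name    <- c("Anna", "Ben", "Chai")
age     <- c(21, 25, 30)
is_Thai <- c(FALSE, FALSE, TRUE)

# Each vector becomes one column (variable) of the data frame
Data <- data.frame(name, age, is_Thai)
Data

# Alternative variable names are given as new_name = vector
Data2 <- data.frame(Name = name, Age = age, Thai = is_Thai)
Data2
```

All vectors should have the same length, because each row is one observation.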

The str() function

We can check the structure of a data frame with the str() function.

The results show the following:

  • The number of observations and variables in the data frame.

  • The types of variables: character, numeric, integer, logical, factor, etc.
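As a small illustration, assuming a hypothetical three-variable data frame named `Data`:

```r
# Hypothetical data frame used only for illustration
Data <- data.frame(
  name    = c("Anna", "Ben", "Chai"),
  age     = c(21, 25, 30),
  is_Thai = c(FALSE, FALSE, TRUE)
)

# Prints the number of observations and variables,
# followed by one line per variable with its type
str(Data)
```

Here `name` is reported as character (chr), `age` as numeric (num), and `is_Thai` as logical (logi).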

The datatable() function from the DT package.

The datatable() function is used to display the data frame in an interactive style and is very useful for HTML output.

Install the DT package

Using the datatable() function
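A possible sketch of both steps, using the built-in `mtcars` data set as a stand-in for your own data frame. The `requireNamespace()` guard is an addition here so that the code does not fail when DT is not installed.

```r
# install.packages("DT")   # run once to install the package

# Only build the table if the DT package is available
if (requireNamespace("DT", quietly = TRUE)) {
  widget <- DT::datatable(mtcars)   # interactive table with search and paging
  widget
}
```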

The colnames() function

The colnames() function in R is used to get or set the column names of a matrix or data frame.

Example of colnames() function usage

Change the variable “is_Thai” to “is_Chinese”
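A minimal sketch, assuming a hypothetical data frame `Data` whose third variable is `is_Thai`:

```r
# Hypothetical data frame used only for illustration
Data <- data.frame(
  name    = c("Anna", "Ben", "Chai"),
  age     = c(21, 25, 30),
  is_Thai = c(FALSE, FALSE, TRUE)
)

# Get the current column names
colnames(Data)

# Set a new name for the column currently called "is_Thai"
colnames(Data)[colnames(Data) == "is_Thai"] <- "is_Chinese"
colnames(Data)
```

Selecting the column by its name rather than by its position keeps the code correct even if the column order changes.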

How to add another variable to the data frame

Use the cbind() function

Another way to add a new variable to the data frame
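The sketch below shows `cbind()` and, as one common alternative, assignment with the `$` operator. The data frame and the variables `height` and `weight` are hypothetical.

```r
# Hypothetical data frame used only for illustration
Data <- data.frame(name = c("Anna", "Ben", "Chai"), age = c(21, 25, 30))

# cbind(): bind a new vector as an extra column; the result is a new data frame
height     <- c(160, 175, 168)
Data_cbind <- cbind(Data, height)
Data_cbind

# $ assignment: create the new column directly inside the data frame
Data$weight <- c(50, 70, 65)
Data
```

In both cases the new vector must have as many values as the data frame has rows.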

How to access/edit the data frame

We can access any value from the data frame in a manner similar to accessing a matrix.

First observation value in the first variable.

All observation values in the first variable.

The first 5 observations of every variable.

Observations 1, 3, and 5 from variables 1 and 3.
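The four cases above can be sketched with the built-in `mtcars` data set, which stands in here for your own data frame. The object names on the left are hypothetical.

```r
# Data[row, column]: leave an index empty to select everything
first_value  <- mtcars[1, 1]                 # first observation, first variable
first_column <- mtcars[, 1]                  # all observations, first variable
first_five   <- mtcars[1:5, ]                # first 5 observations, every variable
picked       <- mtcars[c(1, 3, 5), c(1, 3)]  # observations 1, 3, 5; variables 1 and 3

first_value
picked

# Editing uses the same indexing on the left-hand side of <-
Data <- mtcars          # work on a copy
Data[1, 1] <- 22
Data[1, 1]
```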

Use the $ operator to access one variable from the data frame.

Show every value from the second variable.

Show the first 5 observations from the second variable.
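A short sketch with the built-in `mtcars` data set, whose second variable is `cyl`:

```r
# Every value of the second variable, selected by name
mtcars$cyl

# The result is an ordinary vector, so it can be indexed like one
mtcars$cyl[1:5]
```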

How to remove a variable from the data frame

All variables in the data frame Data.

To delete the variable letter.
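A minimal sketch, assuming a hypothetical data frame `Data` that contains a variable named `letter`. Assigning `NULL` is one common way to delete a column.

```r
# Hypothetical data frame used only for illustration
Data <- data.frame(
  id     = 1:3,
  letter = c("a", "b", "c"),
  value  = c(2.5, 3.1, 4.8)
)

# All variables in the data frame
names(Data)

# Delete the variable letter
Data$letter <- NULL
names(Data)
```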

The head(), tail(), and summary() functions

head() function: Returns the first n rows of the data frame (6 by default).

Show the first 6 observations

Show the first 3 observations

tail() function: Returns the last n rows of the data frame (6 by default).

Show the last 6 observations

Show the last 4 observations

summary(): Basic descriptive statistics.
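The calls described above, sketched with the built-in `mtcars` data set:

```r
head(mtcars)       # first 6 observations (the default)
head(mtcars, 3)    # first 3 observations

tail(mtcars)       # last 6 observations (the default)
tail(mtcars, 4)    # last 4 observations

# Minimum, quartiles, median, mean, and maximum of each numeric variable
summary(mtcars)
```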

Export a data frame to CSV or XLSX files.

  • To export a CSV file, use the readr package.

  • To export an XLSX file, use the writexl package.

1) Install the 3 packages (only once).


Load the libraries (put this at the top of your R script)

Export the data frame “Data” to “Data152.csv”

Export the data frame “Data” to “Data152.xlsx”
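A possible sketch of the export steps. It assumes that the three packages are readr, writexl, and readxl, and it uses a hypothetical data frame `Data`. The `requireNamespace()` guards and the use of `tempdir()` are additions that keep the sketch runnable anywhere.

```r
# install.packages(c("readr", "writexl", "readxl"))   # run once

# Hypothetical data frame used only for illustration
Data <- data.frame(id = 1:3, score = c(80, 75, 90))

# Write to a temporary folder; use "Data152.csv" on its own
# to write into the working directory instead
csv_path  <- file.path(tempdir(), "Data152.csv")
xlsx_path <- file.path(tempdir(), "Data152.xlsx")

# Export to CSV with readr
if (requireNamespace("readr", quietly = TRUE)) {
  readr::write_csv(Data, csv_path)
}

# Export to XLSX with writexl
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(Data, xlsx_path)
}
```

After `library(readr)` and `library(writexl)`, the same functions can be called without the `readr::` and `writexl::` prefixes.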

Import a data frame from CSV or XLSX files into R

  • To import a CSV file into R, use the read.csv() function.

  • To import an XLSX file into R, use the readxl package.

Example
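The sketch below first writes a small hypothetical CSV file so that there is something to read back. The file location in `tempdir()` and the object names are assumptions for illustration.

```r
# Create a small CSV file to import (hypothetical data)
csv_path <- file.path(tempdir(), "Data152.csv")
write.csv(data.frame(id = 1:3, score = c(80, 75, 90)),
          csv_path, row.names = FALSE)

# Import the CSV file with base R
Data_csv <- read.csv(csv_path)
Data_csv

# Import an XLSX file with readxl, if the package and the file are available
xlsx_path <- file.path(tempdir(), "Data152.xlsx")
if (requireNamespace("readxl", quietly = TRUE) && file.exists(xlsx_path)) {
  Data_xlsx <- readxl::read_excel(xlsx_path)
  Data_xlsx
}
```

`read.csv()` returns a data frame, while `read_excel()` returns a tibble.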

tibble vs data frame

Both tibble and data frame are structures used to store tabular data in R, but they differ in behavior and functionality in several ways.

Key Differences between tibble and data frame:

Printing Output

  • data frame: Displays all the data when printed, which can be overwhelming if the dataset is large.

  • tibble: Prints in a more compact format, showing only a few rows and columns that fit the screen, making it cleaner and easier to read.
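The printing difference can be seen with the built-in `mtcars` data set. This sketch assumes the tibble package is installed, and the guard simply skips the tibble part when it is not.

```r
# A data frame prints all 32 rows and all 11 columns
print(mtcars)

# install.packages("tibble")   # run once
if (requireNamespace("tibble", quietly = TRUE)) {
  mtcars_tbl <- tibble::as_tibble(mtcars)
  # A tibble prints only the first rows, the columns that fit
  # on screen, and the type of each column
  print(mtcars_tbl)
}
```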

Advantages of tibble:

  • Handles large datasets better.
  • More intuitive printing and subsetting behavior.
  • Reduces errors from partial name matching.
  • Integrates seamlessly with the tidyverse suite of packages.

Advantages of data frame:

  • Works well with base R functions.
  • Familiar and widely used for general R programming tasks.
  • No need to load additional packages to work with it.

The subset() function

Subsetting a data frame in R is crucial for several reasons related to data analysis, manipulation, and visualization. Here are some key reasons why subsetting is essential:

1. Extracting Relevant Data:

Data frames often contain a large amount of data.

Subsetting allows you to extract and work with specific rows, columns, or subsets of data that are relevant to your analysis.

This helps in focusing on the relevant parts of the data without being overwhelmed by unnecessary information.

2. Filtering Data:

Subsetting enables you to filter rows based on specific conditions.

For example, you can extract all rows where a certain column meets a criterion (e.g., all customers from a specific city, all transactions above a certain amount).

3. Creating New Data Frames:

Subsetting allows you to create new data frames that contain only the subset of data you are interested in.

This can be useful for creating subsets for different analyses or for sharing specific parts of the data with others.

4. Data Manipulation:

Once you have subsets of data, you can perform various operations such as calculating summary statistics, aggregating data, or creating plots.

Subsetting helps in efficiently manipulating data for these tasks.

5. Improving Performance:

Working with smaller subsets of data can improve the performance of your analysis, especially when dealing with large datasets.

Subsetting allows you to focus computations and visualizations on smaller portions of the data, which can be processed more quickly.

Examples of Subsetting

  • Selecting Columns: Keep only some of the variables in the data frame.

  • Filtering Rows: Keep only the rows that satisfy a given condition.

  • Slicing: df[row_indices, col_indices] selects specific rows and columns based on indices or logical conditions (covered in the previous topic).

1. Selecting rows from the mtcars dataset where mpg > 20

2. Selecting rows from the mtcars dataset where mpg > 20 and mpg < 25.

3. From mtcars select the data with mpg > 20 and mpg < 25, then select variable mpg, cyl and disp
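The three tasks above can be sketched with `subset()` as follows. The result names `ex1`, `ex2`, and `ex3` are hypothetical.

```r
# 1. Rows where mpg > 20
ex1 <- subset(mtcars, mpg > 20)
ex1

# 2. Rows where mpg > 20 and mpg < 25
ex2 <- subset(mtcars, mpg > 20 & mpg < 25)
ex2

# 3. The same rows, keeping only the variables mpg, cyl, and disp
ex3 <- subset(mtcars, mpg > 20 & mpg < 25, select = c(mpg, cyl, disp))
ex3
```

Inside `subset()`, column names are written directly, without `mtcars$` in front of them.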

Pipe Operation (|>)

The pipe operator |> takes the output from the expression on its left-hand side and passes it as the first argument to the function call on its right-hand side.

This allows you to chain multiple function calls together, where each function operates on the result of the previous one.

Note

MAC: command + shift + m

WINDOWS: ctrl + shift + m

Comparing standard code with the pipe operator

The standard code

Use pipe operator
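As an illustration, the same task is written both ways below, using the built-in `mtcars` data set. The task itself (filter, select, then keep the first 3 rows) and the names `r1` and `r2` are chosen only for this sketch. The native pipe requires R 4.1 or later.

```r
# Standard code: nested calls are read from the inside out
r1 <- head(subset(mtcars, mpg > 20, select = c(mpg, cyl)), 3)
r1

# Pipe operator: the same steps, read from left to right
r2 <- mtcars |>
  subset(mpg > 20, select = c(mpg, cyl)) |>
  head(3)
r2

# Both versions give the same result
identical(r1, r2)
```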

Important

You will see more of the benefits of the pipe operation in the data wrangling chapter.

Benefits of Using the Pipe Operator

  1. Readability: Code written with the pipe operator reads left-to-right, making it easier to understand the flow of operations.

  2. Code Structure: It allows for a more modular approach to coding, where each step in a data manipulation or analysis pipeline is clear and separate.

  3. Debugging: It simplifies debugging because you can comment out or inspect intermediate steps easily.

  4. Avoiding Nested Functions: It reduces the need for nested function calls (f(g(h(x)))), making the code more readable and maintainable.

Exercise: Data Frame Part 1

1. Create a Data Frame

Create a data frame named my_data with three columns:

  • ID: (containing numbers 1 to 5).

  • Name: (containing the names “Alice”, “Bob”, “Charlie”, “David”, “Eva”).

  • Age: (containing the ages 25, 30, 35, 40, 45).

Solution:

my_data <- data.frame(
       ID = 1:5,
       Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
       Age = c(25, 30, 35, 40, 45)
   )
my_data

2. Access a Column

  • Access the Name column from my_data.

Solution:

my_data$Name
#Or
name_column <- my_data$Name
name_column

3. Subset Rows Based on a Condition

  • Subset the rows where Age is greater than 30.

Solution:

subset(my_data, subset = Age > 30)

4. Add a New Column

  • Add a new column named Salary to my_data with values 50000, 55000, 60000, 65000, and 70000.

Solution:

my_data$Salary <- c(50000, 55000, 60000, 65000, 70000)
my_data

5. Rename Columns

  • Rename the columns ID to EmployeeID and Name to EmployeeName.
  • Show only rows 1 to 3.

Solution:

colnames(my_data) <- c("EmployeeID", "EmployeeName", "Age", "Salary")
head(my_data, 3)

6. Remove a Column

  • Remove the Salary column from my_data.

Solution:

my_data <- my_data[, -4]

# or 

my_data$Salary <- NULL

7. Sort the Data Frame

  • Sort my_data by the Age column in descending order.

Solution:

my_data[order(-my_data$Age), ]
# or
sorted_data <- my_data[order(-my_data$Age), ]
sorted_data

8. Merge Two Data Frames

Create another data frame my_data2 with columns

  • EmployeeID (1 to 5)

  • Department (“HR”, “IT”, “Finance”, “Marketing”, “Sales”).

After that, merge my_data with my_data2 on the EmployeeID column.

Solution:

my_data2 <- data.frame(
       EmployeeID = 1:5,
       Department = c("HR", "IT", "Finance", "Marketing", "Sales")
       )
merged_data <- merge(my_data, my_data2, by = "EmployeeID")
merged_data

9. Calculate Summary Statistics

  • Calculate the mean age of employees in my_data.

Hint: use the mean() function.

Solution:

mean_age <- mean(my_data$Age)
mean_age

10. Filter and Select Specific Columns

From merged_data in question 8.

Select the EmployeeID and Department columns for employees in the “IT” department.

Solution:

# Keep the "IT" rows, then keep only the two requested columns
it_department <- merged_data |>
   subset(Department == "IT") |>
   subset(select = c(EmployeeID, Department))
it_department

Exercise: Data Frame Part 2

11. Basic Row Binding

  • Create two small data frames with the same column names but different rows.

Use rbind to combine them into a single data frame.

Solution:

df1 <- data.frame(A = 1:3, B = c("a", "b", "c"))
df2 <- data.frame(A = 4:6, B = c("d", "e", "f"))
combined_df <- rbind(df1, df2)
combined_df

12. Column Binding with Matching Rows

  • Create two data frames with the same number of rows but different columns.

Use cbind to combine them into one data frame.

Solution:

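df1 <- data.frame(A = 1:3, B = c("a", "b", "c"))
df2 <- data.frame(C = c(TRUE, FALSE, TRUE), D = c(1.5, 2.5, 3.5))
combined_df <- cbind(df1, df2)
combined_df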

13. Subsetting by Condition

  • Given a data frame, use the subset function to extract rows where a certain column meets a specified condition (e.g., subset(df, column_name > 50)).

Solution:

df <- data.frame(A = 1:10, B = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50))
subset_df <- subset(df, B > 30)
subset_df

14. Adding a New Row

  • Create a data frame, then add a new row to it using rbind.

Ensure the new row has the same column structure as the original data frame.

Solution:

df <- data.frame(A = 1:3, B = c("a", "b", "c"))
new_row <- data.frame(A = 4, B = "d")
updated_df <- rbind(df, new_row)
updated_df

15. Adding a New Column

  • Start with a data frame, then add a new column to it using cbind.

The new column can be a vector of the same length as the number of rows in the data frame.

Solution:

df <- data.frame(A = 1:3, B = c("a", "b", "c"))
new_column <- c(10, 20, 30)
updated_df <- cbind(df, new_column)
updated_df

16. Combining Data Frames with Different Columns

  • Create two data frames with different column names.

Use rbind to combine them, and handle the resulting NA values appropriately.

Solution:

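df1 <- data.frame(A = 1:3, B = c("a", "b", "c"))
df2 <- data.frame(C = c(TRUE, FALSE, TRUE), D = c(1.5, 2.5, 3.5))
# rbind() needs the same column names in both data frames.
# To handle this, you can add the missing columns with NA values
df1$C <- NA
df1$D <- NA
df2$A <- NA
df2$B <- NA
combined_df <- rbind(df1, df2)
combined_df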

17. Subsetting Specific Columns

  • Use the subset function to select specific columns from a data frame, returning a new data frame with only those columns.

Solution:

df <- data.frame(A = 1:5, B = letters[1:5], C = c(10, 20, 30, 40, 50))
subset_df <- subset(df, select = c(A, C))
subset_df

18. Conditional Row Binding

  • Create two data frames.

Use a conditional statement to rbind only the rows from the second data frame that meet a certain condition into the first data frame.

Solution:


df1 <- data.frame(A = 1:3, B = c("a", "b", "c"))
df2 <- data.frame(A = 4:6, B = c("d", "e", "f"))
df2_subset <- subset(df2, A > 4)
combined_df <- rbind(df1, df2_subset)
combined_df

19. Subsetting Rows by Multiple Conditions

  • Given a data frame, use subset to extract rows that meet multiple conditions (e.g., subset(df, column1 > 50 & column2 == "value")).

Solution:


df <- data.frame(A = 1:10, B = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50))
subset_df <- subset(df, A > 5 & B < 40)
subset_df

20. Combining with Different Row Numbers

  • Create two data frames with a different number of rows.

Use cbind to combine them and handle the resulting mismatch.

Solution:

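# To handle this, make sure the data frames have the same number of rows
df1 <- data.frame(A = 1:4, B = c("a", "b", "c", "d"))
# Pad the shorter data frame with NA so that it also has 4 rows
df2 <- data.frame(C = c(TRUE, FALSE, TRUE, NA))
combined_df <- cbind(df1, df2)
combined_df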