Data Wrangling
Here are some of the most common and essential functions from the dplyr package in R.
dplyr is widely used for data manipulation, especially with data frames (or tibble objects), due to its intuitive, chainable syntax with the |> pipe operator.
1. Selecting Columns
select() – Choose specific columns by name. Supports renaming and helper functions like starts_with(), ends_with(), contains(), etc.
rename() – Rename specific columns while keeping all other columns.
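As a minimal sketch of the two functions above, assuming a small made-up data frame (the name sales and its columns are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical example data; names are invented for illustration.
sales <- data.frame(
  order_id  = 1:3,
  cust_name = c("Ana", "Ben", "Cy"),
  cust_city = c("Oslo", "Rome", "Lima"),
  amount    = c(12.5, 40, 7.25)
)

# select(): keep order_id (renamed to id) plus every column starting with "cust_".
customers <- sales |>
  select(id = order_id, starts_with("cust_"))

# rename(): change one column name and keep every column.
renamed <- sales |>
  rename(total = amount)

names(customers)
names(renamed)
```

Here select() drops amount because it was not named, while rename() returns all four columns with amount relabeled as total.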
2. Filtering Rows
filter() – Subset rows based on conditions (e.g., filter(data, col1 > 10 & col2 == "A")).
distinct() – Remove duplicate rows based on one or more columns.
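The row-filtering functions above could be tried on a toy data frame like the following. The data frame df is hypothetical; its column names simply mirror the col1 and col2 used in the example condition.

```r
library(dplyr)

# Hypothetical data with one fully duplicated row (rows 2 and 3).
df <- data.frame(
  col1 = c(5, 12, 12, 30),
  col2 = c("A", "A", "A", "B")
)

# Keep rows where both conditions hold.
big_a <- df |> filter(col1 > 10 & col2 == "A")

# Drop rows that are duplicates across all columns.
unique_rows <- df |> distinct()

# Keep one row per distinct value of col2 (only col2 is returned).
unique_col2 <- df |> distinct(col2)
```

With this data, big_a has 2 rows, unique_rows has 3, and unique_col2 has 2.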
3. Adding or Modifying Columns
mutate() – Add new columns or modify existing ones. Expressions can be arbitrarily complex, and later expressions in the same call can refer to columns created earlier in it.
transmute() – Add or modify columns, but only keep the new columns created.
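The difference between the two functions is easiest to see side by side. This is a small sketch on invented data (df, price, and qty are hypothetical names):

```r
library(dplyr)

# Hypothetical order lines.
df <- data.frame(price = c(10, 20), qty = c(3, 5))

# mutate(): existing columns are kept and the new one is appended.
with_total <- df |> mutate(total = price * qty)

# transmute(): only the columns named in the call are returned.
only_total <- df |> transmute(total = price * qty)

names(with_total)
names(only_total)
```

Both calls compute the same total values; they differ only in which columns survive.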
4. Summarizing and Aggregating Data
summarize() / summarise() – Compute summary statistics for each group or the entire data (e.g., mean, sum, min, max).
count() – Count the number of occurrences of each unique value in one or more columns.
n() – A helper function used within summarize() to count the number of observations in each group.
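A short sketch of these summary functions on ungrouped data, assuming a hypothetical data frame df with columns group and value:

```r
library(dplyr)

# Hypothetical measurements in two groups.
df <- data.frame(
  group = c("a", "a", "b"),
  value = c(1, 3, 10)
)

# Without grouping, summarize() collapses the whole table to one row.
overall <- df |>
  summarize(mean_value = mean(value), n_rows = n())

# count() tallies rows per distinct value and stores the tally in column n.
per_group <- df |> count(group)
```

Here overall is a single row reporting 3 rows in total, and per_group has one row for "a" (2 occurrences) and one for "b" (1 occurrence).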
5. Grouping and Ungrouping
group_by() – Group data by one or more columns, typically used with summarize(), mutate(), or filter() to perform operations within each group.
ungroup() – Remove grouping from a data frame.
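To illustrate how grouping changes the behavior of other verbs, here is a sketch on a hypothetical data frame (team and score are invented names):

```r
library(dplyr)

# Hypothetical scores for two teams.
df <- data.frame(
  team  = c("x", "x", "y", "y"),
  score = c(1, 3, 10, 20)
)

# One summary row per team.
team_means <- df |>
  group_by(team) |>
  summarize(mean_score = mean(score))

# Grouped mutate(): mean(score) is computed within each team, not overall.
# ungroup() afterwards so later steps are not silently done per team.
centered <- df |>
  group_by(team) |>
  mutate(diff = score - mean(score)) |>
  ungroup()
```

In centered, each score is expressed relative to its own team's mean, so the diff values are -1, 1, -5, and 5.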
6. Arranging Rows
arrange() – Sort rows based on one or more columns, in ascending order by default (use desc() for descending).
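A brief sketch of sorting by two columns, using made-up data (df, name, and score are hypothetical):

```r
library(dplyr)

# Hypothetical scores with a tie between "b" and "c".
df <- data.frame(
  name  = c("b", "a", "c"),
  score = c(2, 9, 2)
)

# Highest score first; ties are broken alphabetically by name.
sorted <- df |> arrange(desc(score), name)
```

The second argument only matters for rows that tie on the first, so the result is ordered a, b, c.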
7. Joining Data Frames
inner_join() – Return rows that match in both data frames.
left_join() – Return all rows from the left data frame and matching rows from the right.
right_join() – Return all rows from the right data frame and matching rows from the left.
full_join() – Return all rows from both data frames, with NA where there are no matches.
semi_join() – Return all rows from the left data frame where there are matches in the right data frame.
anti_join() – Return all rows from the left data frame where there are no matches in the right data frame.
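The six joins above can be compared on two small tables. This is a sketch under assumed data: orders and customers are hypothetical, and they share the key column cust_id.

```r
library(dplyr)

# Hypothetical tables: customer 4 has an order but no record,
# customer 3 has a record but no order.
orders    <- data.frame(cust_id = c(1, 2, 4), amount = c(10, 20, 40))
customers <- data.frame(cust_id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))

inner <- inner_join(orders, customers, by = "cust_id")  # ids 1, 2
left  <- left_join(orders, customers, by = "cust_id")   # ids 1, 2, 4
right <- right_join(orders, customers, by = "cust_id")  # ids 1, 2, 3
full  <- full_join(orders, customers, by = "cust_id")   # ids 1, 2, 3, 4

# Filtering joins: they only keep or drop rows of `orders`
# and never add columns from `customers`.
semi <- semi_join(orders, customers, by = "cust_id")    # ids 1, 2
anti <- anti_join(orders, customers, by = "cust_id")    # id 4
```

In left, the name for cust_id 4 is NA because no customer record matches it.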
8. Reshaping Data
bind_rows() – Bind multiple data frames by rows.
bind_cols() – Bind multiple data frames by columns.
rowwise() – Mark a data frame so that subsequent operations are computed one row at a time, useful for applying functions row-by-row.
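A small sketch of the three functions above, with invented data frames a and b (all names here are hypothetical):

```r
library(dplyr)

# Hypothetical pieces of one table.
a <- data.frame(x = 1:2, y = c(10, 20))
b <- data.frame(x = 3L, y = 30)

# Stack the rows; columns are matched by name.
stacked <- bind_rows(a, b)

# Glue columns side by side; rows are matched by position,
# so both inputs need the same number of rows.
side <- bind_cols(a, data.frame(z = c("p", "q")))

# rowwise(): max() is evaluated once per row instead of once per column.
row_max <- stacked |>
  rowwise() |>
  mutate(biggest = max(x, y)) |>
  ungroup()
```

Without rowwise(), max(x, y) would return a single value for the whole table (30 here) and repeat it on every row.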
9. Window Functions
lag() and lead() – Access the previous or next value by shifting a column a specified number of rows.
cumsum(), cummean(), etc. – Cumulative sum, mean, and other cumulative operations.
ntile() – Divide rows into n roughly equal-sized buckets.
min_rank(), dense_rank(), row_number() – Ranking functions to assign ranks within groups.
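The window functions above all return one value per input row, which makes them a natural fit for mutate(). A sketch on hypothetical daily sales (df, day, and sales are invented names):

```r
library(dplyr)

# Hypothetical daily sales with a tie at 30.
df <- data.frame(day = 1:4, sales = c(10, 30, 20, 30))

out <- df |>
  mutate(
    prev          = dplyr::lag(sales),   # explicit prefix: stats::lag() also exists
    nxt           = lead(sales),
    running_total = cumsum(sales),
    running_mean  = cummean(sales),
    half          = ntile(sales, 2),     # bucket 1 = lower half, 2 = upper half
    rank_min      = min_rank(sales),     # ties share a rank, next rank is skipped
    rank_dense    = dense_rank(sales),   # ties share a rank, no gaps
    row_id        = row_number()
  )
```

The first value of prev and the last value of nxt are NA because there is no earlier or later row to take a value from. Adding group_by() before mutate() would restart every one of these calculations within each group.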
10. Combining Data Manipulations with Piping
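|> (Pipe operator) – Allows chaining of multiple dplyr functions together for cleaner, more readable code. |> is the native pipe built into R (version 4.1.0 and later); it passes the result on its left as the first argument of the call on its right.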
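To show what chaining looks like in practice, here is a sketch that combines several verbs in one pipeline. The data frame df and its columns dept and salary are hypothetical, and |> needs R 4.1.0 or later.

```r
library(dplyr)

# Hypothetical staff table.
df <- data.frame(
  dept   = c("hr", "hr", "it", "it", "it"),
  salary = c(50, 60, 70, 80, 20)
)

# Read top to bottom: filter, then group, then summarize, then sort.
result <- df |>
  filter(salary > 30) |>
  group_by(dept) |>
  summarize(avg_salary = mean(salary), staff = n()) |>
  arrange(desc(avg_salary))

# The same computation written as nested calls, for comparison.
nested <- arrange(
  summarize(group_by(filter(df, salary > 30), dept),
            avg_salary = mean(salary), staff = n()),
  desc(avg_salary)
)
```

Both versions return the same table; the piped one lists the steps in the order they run, while the nested one has to be read from the inside out.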
11. Helper Functions
everything() – Select all columns, often used in select() to rearrange the order of columns.
all_of() – Select multiple columns by name, useful when column names are stored in a variable.
any_of() – Select columns if they exist, without raising an error for names that are missing.
across() – Apply a function to multiple columns in mutate() or summarize().
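A sketch of the helpers above on a hypothetical data frame (df, its columns, and the vectors wanted and maybe are invented for illustration):

```r
library(dplyr)

# Hypothetical measurements.
df <- data.frame(
  id     = 1:2,
  height = c(1.5, 1.8),
  weight = c(60, 80),
  note   = c("a", "b")
)

# Move note to the front, then keep everything else in its original order.
reordered <- df |> select(note, everything())

# Column names held in a character vector; all_of() errors if one is missing.
wanted <- c("id", "weight")
picked <- df |> select(all_of(wanted))

# any_of() silently skips names that are not in the data.
maybe <- c("id", "missing_col")
safe  <- df |> select(any_of(maybe))

# across(): apply one function to several columns at once.
means <- df |> summarize(across(c(height, weight), mean))
```

Replacing any_of(maybe) with all_of(maybe) would stop with an error here, because missing_col does not exist in df.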
These dplyr functions make it easy to perform complex data manipulations in a clear, readable way.