Data Wrangling

The functions below are among the most commonly used in R's dplyr package. dplyr is widely used for manipulating data frames (and tibbles) because its functions have consistent, readable names and chain naturally with the |> pipe operator.

1. Selecting Columns

  • select() – Choose specific columns by name. Supports renaming and helper functions like starts_with(), ends_with(), contains(), etc.
  • rename() – Rename specific columns (new_name = old_name) while keeping all other columns unchanged.
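
As a rough illustration of the two functions above, here is a minimal sketch. The tibble and its column names are made up for the example and are not from any real dataset.

```r
library(dplyr)

# Hypothetical example data; the column names are invented for illustration.
df <- tibble(
  id = 1:3,
  score_math = c(90, 75, 82),
  score_art = c(70, 88, 95),
  city = c("Oslo", "Lima", "Pune")
)

# Keep id plus every column whose name starts with "score_",
# renaming id to student_id in the same call.
scores <- df |> select(student_id = id, starts_with("score_"))

# Rename one column and keep all the others unchanged.
renamed <- df |> rename(town = city)
```

select() drops everything it is not asked for, while rename() leaves the column set and order intact.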

2. Filtering Rows

  • filter() – Subset rows based on conditions (e.g., filter(data, col1 > 10 & col2 == "A")).
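  • distinct() – Keep only unique rows, optionally judged on one or more columns. When columns are named, only those columns are returned unless .keep_all = TRUE is set.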
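
The row-filtering functions above might be used as in the following sketch. The data frame reuses the col1 and col2 names from the filter() example, with made-up values.

```r
library(dplyr)

# Hypothetical data using the col1/col2 names from the bullet above.
df <- tibble(
  col1 = c(5, 12, 20, 20, 8),
  col2 = c("A", "A", "B", "A", "A")
)

# Comma-separated conditions are combined with AND, like using &.
big_a <- df |> filter(col1 > 10, col2 == "A")

# Unique values of col2; only the named column is returned by default.
groups <- df |> distinct(col2)

# Keep the first row for each col2 value, along with the other columns.
first_rows <- df |> distinct(col2, .keep_all = TRUE)
```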

3. Adding or Modifying Columns

  • mutate() – Add new columns or modify existing ones. You can create complex transformations within the function.
  • transmute() – Add or modify columns, keeping only the columns created or modified in the call. Recent dplyr releases mark it as superseded in favor of mutate(.keep = "none").
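
A small sketch of the difference between the two functions, assuming a made-up tibble with price and qty columns:

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(price = c(10, 20, 30), qty = c(1, 2, 3))

# mutate() keeps the existing columns and adds the new ones.
# Later expressions can use columns created earlier in the same call.
with_total <- df |>
  mutate(
    total = price * qty,
    share = total / sum(total)
  )

# transmute() returns only the columns created in the call.
only_total <- df |> transmute(total = price * qty)
```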

4. Summarizing and Aggregating Data

  • summarize() / summarise() – Compute summary statistics for each group or the entire data (e.g., mean, sum, min, max).
  • count() – Count the number of occurrences of each unique value in one or more columns.
  • n() – A helper that returns the number of rows in the current group. It works only inside dplyr functions such as summarize(), mutate(), and filter().
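
The summarizing functions above could be combined as in this sketch, which assumes an invented tibble of team scores:

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(
  team = c("a", "a", "b", "b", "b"),
  points = c(10, 20, 5, 15, 10)
)

# One row of summary statistics for the whole data frame.
overall <- df |> summarize(mean_points = mean(points), rows = n())

# One row per team.
by_team <- df |>
  group_by(team) |>
  summarize(mean_points = mean(points), rows = n())

# count() is shorthand for grouping, counting rows, and ungrouping.
team_counts <- df |> count(team)
```

count() stores its result in a column named n by default.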

5. Grouping and Ungrouping

  • group_by() – Group data by one or more columns, typically used with summarize(), mutate(), or filter() to perform operations within each group.
  • ungroup() – Remove grouping from a data frame.
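
Grouping also changes how mutate() and filter() behave, as the following sketch suggests. The data is made up for illustration.

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(
  team = c("a", "a", "b", "b"),
  points = c(10, 20, 5, 15)
)

# mutate() on grouped data computes mean(points) within each team.
centered <- df |>
  group_by(team) |>
  mutate(diff_from_team_mean = points - mean(points)) |>
  ungroup()

# filter() on grouped data keeps the top scorer of each team.
top_scorers <- df |>
  group_by(team) |>
  filter(points == max(points)) |>
  ungroup()
```

Calling ungroup() at the end keeps later steps from silently operating per group.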

6. Arranging Rows

  • arrange() – Sort rows based on one or more columns, in ascending or descending order (use desc() for descending).
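
A minimal sketch of sorting on two columns, assuming an invented tibble:

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(
  name = c("x", "y", "z"),
  group = c("b", "a", "a"),
  value = c(1, 3, 2)
)

# Sort by group ascending, then by value descending within each group.
sorted <- df |> arrange(group, desc(value))
```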

7. Joining Data Frames

  • inner_join() – Return rows that match in both data frames.
  • left_join() – Return all rows from the left data frame and matching rows from the right.
  • right_join() – Return all rows from the right data frame and matching rows from the left.
  • full_join() – Return all rows from both data frames, with NA where there are no matches.
  • semi_join() – Return the rows of the left data frame that have a match in the right data frame, keeping only the left data frame's columns.
  • anti_join() – Return the rows of the left data frame that have no match in the right data frame, keeping only the left data frame's columns.
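
The six joins above can be compared side by side on two small tables. This is only a sketch: the orders and customers tibbles and their columns are hypothetical.

```r
library(dplyr)

# Hypothetical example data: customer 4 has an order but no customer record,
# and customer 3 has a record but no orders.
orders <- tibble(order_id = 1:4, customer_id = c(1, 2, 2, 4))
customers <- tibble(customer_id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))

# Only orders with a known customer.
inner <- inner_join(orders, customers, by = "customer_id")

# Every order; name is NA where the customer is unknown.
left <- left_join(orders, customers, by = "customer_id")

# Every customer; order_id is NA where the customer has no orders.
right <- right_join(orders, customers, by = "customer_id")

# Every order and every customer.
full <- full_join(orders, customers, by = "customer_id")

# Filtering joins: rows of orders only, with no columns added from customers.
has_customer <- semi_join(orders, customers, by = "customer_id")
no_customer <- anti_join(orders, customers, by = "customer_id")
```

Passing by explicitly avoids the message dplyr prints when it has to guess the join columns.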

8. Binding Data Frames and Row-wise Operations

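  • bind_rows() – Stack multiple data frames on top of each other. Columns are matched by name, and columns missing from an input are filled with NA.
  • bind_cols() – Place multiple data frames side by side. Rows are matched by position, so the inputs need the same number of rows.
  • rowwise() – Group a data frame by row so that later operations are computed one row at a time, useful for functions that are not vectorized across columns.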
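
A short sketch of the three functions above, using made-up tibbles:

```r
library(dplyr)

# Hypothetical example data.
jan <- tibble(id = 1:2, sales = c(100, 150))
feb <- tibble(id = 3:4, sales = c(120, 90), region = c("N", "S"))

# Columns are matched by name; region is NA for the rows that came from jan.
stacked <- bind_rows(jan, feb)

# Rows are matched by position.
extra <- tibble(target = c(110, 140))
side_by_side <- bind_cols(jan, extra)

# rowwise() makes max() run once per row instead of once over both whole columns.
scores <- tibble(a = c(1, 5), b = c(4, 2))
with_best <- scores |>
  rowwise() |>
  mutate(best = max(a, b)) |>
  ungroup()
```

For this particular case, the vectorized pmax(a, b) would give the same result without rowwise(); the row-wise form is shown only to illustrate the mechanism.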

9. Window Functions

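  • lag() and lead() – Return the previous (lag) or next (lead) values of a column, offset by a specified number of rows and padded with NA.
  • cumsum(), cummean(), etc. – Cumulative sum, mean, and other cumulative operations. cumsum() comes from base R; cummean() is provided by dplyr.
  • ntile() – Split values into n roughly equal-sized buckets according to their rank.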
  • min_rank(), dense_rank(), row_number() – Ranking functions to assign ranks within groups.
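
The window functions above are typically used inside mutate(). The following sketch assumes a made-up tibble of daily sales; note how the ranking functions treat the tied value 30 differently.

```r
library(dplyr)

# Hypothetical example data with one tie (30 appears twice).
df <- tibble(day = 1:5, sales = c(10, 30, 20, 30, 50))

windowed <- df |>
  mutate(
    previous = lag(sales),           # NA for the first row
    following = lead(sales),         # NA for the last row
    running_total = cumsum(sales),
    running_mean = cummean(sales),
    half = ntile(sales, 2),          # bucket 1 or 2
    rank_min = min_rank(sales),      # ties share a rank, next rank is skipped
    rank_dense = dense_rank(sales),  # ties share a rank, no gaps
    position = row_number()          # row position within the data
  )
```

Adding group_by() before mutate() would compute each of these within groups rather than over the whole data frame.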

10. Combining Data Manipulations with Piping

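  • |> (pipe operator) – Passes the result on its left as the first argument of the function call on its right, so multiple dplyr functions chain into cleaner, more readable code. |> is base R's native pipe (R 4.1 and later); the magrittr pipe %>% is used the same way in most dplyr code.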
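
To see what the pipe buys you, the sketch below writes the same steps twice, once as nested calls and once as a pipeline. The data is invented, and the native pipe requires R 4.1 or later.

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(
  team = c("a", "a", "b", "b", "c"),
  points = c(10, 20, 5, 15, 2)
)

# Nested calls read inside-out.
nested <- arrange(
  summarize(group_by(filter(df, points > 3), team), total = sum(points)),
  desc(total)
)

# The same steps with the pipe read top to bottom.
piped <- df |>
  filter(points > 3) |>
  group_by(team) |>
  summarize(total = sum(points)) |>
  arrange(desc(total))
```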

11. Helper Functions

  • everything() – Select all columns, often used in select() to rearrange the order of columns.
  • all_of() – Select columns whose names are stored in a character vector. Raises an error if any name is missing from the data.
  • any_of() – Like all_of(), but silently skips names that do not exist in the data.
  • across() – Apply a function to multiple columns in mutate() or summarize().
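
The helpers above might be used as in this sketch. The tibble and the wanted vector are hypothetical names chosen for the example.

```r
library(dplyr)

# Hypothetical example data.
df <- tibble(
  id = 1:3,
  height = c(1.5, 1.8, 1.6),
  weight = c(50, 80, 65),
  note = c("x", "y", "z")
)

# Move note to the front, followed by every remaining column.
reordered <- df |> select(note, everything())

# Column names stored in a character vector.
wanted <- c("id", "weight")
strict <- df |> select(all_of(wanted))

# any_of() skips "age" because it is not present; all_of() would raise an error here.
lenient <- df |> select(any_of(c("id", "age")))

# Apply mean() to both measurement columns in one call.
means <- df |> summarize(across(c(height, weight), mean))
```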

These dplyr functions make it easy to perform complex data manipulations in a clear, readable way.