Cheat Sheet Tidyverse




The tidyverse is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy. Its main advantages are consistent functions, coverage of the whole workflow, and a parsimonious approach to the development of data science tools.

The tidyverse, a collection of packages specially focused on data science, marked a milestone in R programming. In this post I am going to summarize very briefly the essentials needed to get started. The tidyverse grammar follows a common structure in all functions: the first argument is the object, followed by the rest of the arguments. In addition, a set of verbs is provided to make the functions easier to use. The tidyverse philosophy and grammar are also reflected in other packages, which makes them compatible with the collection. For example, the sf package (simple features) is a standardized way to encode spatial vector data and supports many of the functions found in the dplyr package.

The core of the tidyverse collection is made up of the following packages:

| Package | Description |
|---------|-------------|
| ggplot2 | Grammar for creating graphics |
| purrr | R functional programming |
| tibble | Modern and effective table system |
| dplyr | Grammar for data manipulation |
| tidyr | Set of functions to create tidy data |
| stringr | Function set to work with characters |
| readr | An easy and fast way to import data |
| forcats | Tools to easily work with factors |

In addition to the mentioned packages, lubridate is also used very frequently to work with dates and times, and also readxl which allows us to import files in Excel format. To know all the available packages we can use the function tidyverse_packages().

It is very easy to get conflicts between functions, that is, that the same function name exists in several packages. To avoid this, we can write the name of the package in front of the function we want to use, separated by the colon symbol written twice (package_name::function_name).
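As a minimal sketch of such a conflict: both stats (loaded by default) and dplyr define a function called filter(). The package prefix makes explicit which one is meant. The built-in mtcars dataset is used here only for illustration.

```r
library(dplyr)

# stats::filter() applies a linear filter (here a 3-point moving average)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a table that meet a condition
efficient_cars <- dplyr::filter(mtcars, mpg > 30)
```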

Before getting started with the packages, in what I hope will be a really short introduction, here are some comments on programming style in R.

R has no universal style guide; that is, the R syntax does not force us to follow specific rules in our scripts. Still, it is recommended to write scripts in a homogeneous, uniform, legible and clear way. The tidyverse collection has its own guide (https://style.tidyverse.org/).

The most important recommendations are:

  • Avoid using more than 80 characters per line to allow reading the complete code.
  • Always use a space after a comma, never before.
  • The operators (+, -, <-, %>%, etc.) must have a space before and after.
  • There is no space between the name of a function and the first parenthesis, nor between the last argument and the final parenthesis of a function.
  • Avoid reusing names of functions and common variables (c <- 5 vs. c())
  • Organize the script into sections, separating them with comment lines of the form # Import data -----
  • Avoid accent marks or special symbols in object names, file names, paths, etc.
  • Object names must follow a consistent structure: day_one, day_1.

It is advisable to use a correct indentation for multiple arguments of a function or functions chained by the pipe operator (%>%).
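The recommendations above can be illustrated with a short script. This is only a sketch: the built-in mtcars dataset and the object names cars and cars_summary are assumptions chosen for the example.

```r
library(dplyr)

# Import data -----
cars <- as_tibble(mtcars, rownames = "model")

# Summarise data -----
# One step per line after each pipe, one argument per line in long calls
cars_summary <- cars %>%
  filter(cyl %in% c(4, 6)) %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),
    max_hp = max(hp)
  )
```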

To facilitate data management, manipulation and visualization, the magrittr package introduces the famous pipe operator in the form %>%, with the aim of combining various functions without the need to assign each result to a new object. The pipe operator passes the output of one function as the first argument of the next function. This way of combining functions allows you to chain several steps in order to perform sequential tasks. A very simple example is passing the vector 1:5 to the mean() function to calculate the average. You should know that there are a couple of other pipe operators in the same package.
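A minimal sketch of the pipe with the vector 1:5 (the object names are only illustrative):

```r
library(magrittr)

# The left-hand side becomes the first argument of mean()
piped <- 1:5 %>% mean()

# Equivalent call without the pipe
nested <- mean(1:5)
```

Both objects hold the same value; the piped form simply reads from left to right.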

4.1 Read and write data

The readr package makes it easy to read or write multiple file formats using functions that start with read_* or write_*. Compared to base R, readr functions are faster, they handle problematic column names, and dates are converted automatically. The imported tables are of class tibble (tbl_df), a modern version of data.frame from the tibble package. Similarly, the read_excel() function of the readxl package imports data from Excel sheets (more details in a separate blog post). The data imported here and used in the rest of the post are the mobility data registered by Google during the last months of the COVID-19 pandemic.

| Function | Description |
|----------|-------------|
| read_csv() or read_csv2() | comma- or semicolon-separated (CSV) |
| read_delim() | general separator |
| read_table() | whitespace-separated |

It is important to check the argument names, since in the readr functions they differ from those of base R. For example, the well-known header = TRUE argument of read.csv() is in this case col_names = TRUE. More details can be found in the Cheat-Sheet of readr.
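The following sketch shows the read and write functions on a small stand-in file. The file content and the column name parks are assumptions made for illustration; this is not the real Google mobility file.

```r
library(readr)

# Hypothetical stand-in for the mobility file, written to a temporary path
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    country_region_code = c("ES", "ES", "IT"),
    date = as.Date(c("2020-04-01", "2020-04-02", "2020-04-01")),
    parks = c(-80, -85, -90)
  ),
  csv_path
)

# col_names = TRUE (the default) plays the role of header = TRUE in read.csv();
# the column types, including the date column, are guessed by readr
mobility <- read_csv(csv_path, col_names = TRUE)

class(mobility)
```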

4.2 Character manipulations

For working with strings we use the stringr package, whose functions always start with str_ followed by a verb, and take the string as the first argument.

Some of these functions are as follows:

| Function | Description |
|----------|-------------|
| str_replace() | replace patterns |
| str_c() | combine characters |
| str_detect() | detect patterns |
| str_extract() | extract patterns |
| str_sub() | extract by position |
| str_length() | length of string |

Regular expressions are often used for character patterns. For example, the regular expression [aeiou] matches any single character that is a vowel. Square brackets [] define character classes: [abc] matches any one of the letters a, b or c regardless of its position, while [a-z], [A-Z] and [0-9] match any character between a and z, A and Z, or 0 and 9. [:punct:] matches punctuation, etc. With curly braces {} we can indicate how many times the previous element is repeated: {2} means exactly twice, {1,2} between one and two times, etc. With ^ and $ we can anchor the pattern to the beginning or to the end of the string, respectively. More details and patterns can be found in the Cheat-Sheet of stringr.

A very useful function is str_glue() to interpolate characters.
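A brief sketch of the functions and patterns mentioned in this section, applied to a made-up character vector (the vector fruits and the result names are assumptions for illustration):

```r
library(stringr)

fruits <- c("apple pie", "banana", "kiwi 42")

# Detect two consecutive vowels
has_double_vowel <- str_detect(fruits, "[aeiou]{2}")

# Extract the first run of digits (NA where there is none)
numbers <- str_extract(fruits, "[0-9]+")

# Replace the first lowercase letter at the start of each string
replaced <- str_replace(fruits, "^[a-z]", "X")

# Extract by position, measure and combine
first_three <- str_sub(fruits, 1, 3)
n_chars <- str_length(fruits)
labelled <- str_c("fruit: ", fruits)

# Interpolate R expressions inside curly braces
described <- str_glue("{fruits} has {n_chars} characters")
```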

4.3 Management of dates and times

The lubridate package is very powerful in handling dates and times. It allows us to create date and time objects that R recognizes with functions like ymd() or ymd_hms(), and we can even make calculations with them.

We only need to know the following abbreviations:

  • ymd: y is the year, m the month, d the day
  • hms: h is the hour, m the minutes, s the seconds

More useful functions:

Finally, the make_date() function is very useful to create dates from different date parts, such as the year, month, etc.
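A minimal sketch of these functions; the dates are arbitrary values chosen for illustration.

```r
library(lubridate)

# Parse character strings into date and date-time objects
d <- ymd("2020-04-01")
dt <- ymd_hms("2020-04-01 14:30:00")  # time zone defaults to UTC

# Calculations with dates
ten_days_later <- d + days(10)

# Extract components
y <- year(d)
m <- month(d)
h <- hour(dt)
weekday <- wday(d, week_start = 1)  # 1 = Monday

# Build a date from its parts
rebuilt <- make_date(year = 2020, month = 4, day = 1)
```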

More details can be found in the Cheat-Sheet of lubridate.

4.4 Table and vector manipulation

The dplyr and tidyr packages provide us with a data manipulation grammar, a set of useful verbs to solve common problems. The most important functions are:

| Function | Description |
|----------|-------------|
| mutate() | add new variables or modify existing ones |
| select() | select variables |
| filter() | filter |
| summarise() | summarize/reduce |
| arrange() | sort |
| group_by() | group |
| rename() | rename columns |

In case you haven’t done it before, we import the mobility data.

4.4.1 Select and rename

We can select or remove columns with the select() function, using the name or index of the column. To remove columns we use the negative sign. The rename() function renames columns, which can be referred to either by name or by index.
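A small sketch of select() and rename(). The toy table and the column names other than country_region_code are assumptions that stand in for the real mobility data.

```r
library(dplyr)

# Toy stand-in for the mobility table (values are made up)
mobility <- tibble(
  country_region_code = c("ES", "ES", "IT"),
  sub_region_1 = c("Galicia", "Madrid", "Lombardy"),
  date = as.Date("2020-04-01"),
  residential = c(25, 30, 33),
  parks = c(-80, -85, -90)
)

# Select by name or by index
by_name <- select(mobility, country_region_code, date, residential)
by_index <- select(mobility, 1, 3)

# Remove a column with the negative sign
without_parks <- select(mobility, -parks)

# Rename by name (new = old) or by index
renamed <- rename(mobility, region = sub_region_1, resi = 4)
```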

4.4.2 Filter and sort

To filter data, we use filter() with logical operators (|, >, etc.) or functions that return a logical value (str_detect(), is.na(), etc.). The arrange() function sorts from least to greatest by one or multiple variables (with the negative sign - the order is reversed, from greatest to least).
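A sketch of filter() and arrange() on a made-up table (all values and the column names sub_region_1 and parks are assumptions for illustration):

```r
library(dplyr)
library(stringr)

mobility <- tibble(
  country_region_code = c("ES", "ES", "IT", "IT"),
  sub_region_1 = c("Galicia", "Madrid", "Lombardy", NA),
  parks = c(-80, -85, -90, -70)
)

# Logical operators: rows for Spain OR with park mobility below -85
spain_or_strong <- filter(mobility, country_region_code == "ES" | parks < -85)

# Functions returning logical values: no missing region, name starts with "Ma"
madrid <- filter(mobility, !is.na(sub_region_1), str_detect(sub_region_1, "^Ma"))

# Sort ascending, then descending with the negative sign
ascending <- arrange(mobility, parks)
descending <- arrange(mobility, -parks)
```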

4.4.3 Group and summarize

Where do we find greater variability between regions in each country on April 1, 2020?

To answer this question, we first filter the data and then we group by the country column. When we use the summarize() function after grouping, it allows us to summarize by these groups. Moreover, combining group_by() with the mutate() function modifies columns in each group separately. In summarize() we calculate the maximum, minimum value and the difference between both extremes creating new columns.
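The steps described above could be sketched as follows. The table is a made-up stand-in, and the column names country_region, sub_region_1 and parks as well as the new column names are assumptions.

```r
library(dplyr)

mobility <- tibble(
  country_region = c("Spain", "Spain", "Spain", "Italy", "Italy", "Italy"),
  sub_region_1 = c("Galicia", "Madrid", "Galicia", "Lombardy", "Sicily", "Sicily"),
  date = as.Date(c("2020-04-01", "2020-04-01", "2020-04-02",
                   "2020-04-01", "2020-04-01", "2020-04-02")),
  parks = c(-80, -90, -75, -92, -70, -68)
)

# Filter the date, group by country, then summarise each group
variability <- mobility %>%
  filter(date == as.Date("2020-04-01")) %>%
  group_by(country_region) %>%
  summarise(
    parks_max = max(parks),
    parks_min = min(parks),
    parks_range = parks_max - parks_min
  ) %>%
  arrange(desc(parks_range))

# group_by() + mutate() modifies a column within each group separately
centred <- mobility %>%
  group_by(country_region) %>%
  mutate(parks_centred = parks - mean(parks)) %>%
  ungroup()
```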

4.4.4 Join tables

How can we filter the data to get a subset of Europe?

To do this, we import a spatial dataset with the country code and a column of regions. I'll leave detailed explanations of the sf (simple features) package for another post.

Other dplyr functions allow us to join tables: the *_join() family. Depending on which table (left or right) should keep all of its rows, the function changes: left_join(), right_join() or even full_join(). The by argument is not necessary as long as both tables have a column with the same name. However, in this case the variable names are different, so we use the following form: c('country_region_code'='iso_a2').

The forcats package of tidyverse has many useful functions for handling categorical variables (factors), that is, variables that have a fixed and known set of possible values. All forcats functions have the prefix fct_*. For example, in this case we use fct_reorder() to reorder the country labels by their maximum residential mobility. Finally, we create a new column 'resi_real' to change the reference value, the average or baseline, from 0 to 100.
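A sketch of the join, the factor reordering and the new column. Both tables are made-up stand-ins: the lookup table is a plain tibble rather than the spatial dataset, and the column names country_region, residential and subregion are assumptions.

```r
library(dplyr)
library(forcats)

mobility <- tibble(
  country_region_code = c("ES", "ES", "IT", "IT", "US"),
  country_region = c("Spain", "Spain", "Italy", "Italy", "United States"),
  residential = c(25, 30, 33, 20, 15)
)

# Hypothetical lookup table with the country code and a region column
europe <- tibble(
  iso_a2 = c("ES", "IT"),
  subregion = c("Southern Europe", "Southern Europe")
)

mobility_europe <- mobility %>%
  # The key columns have different names in the two tables
  left_join(europe, by = c("country_region_code" = "iso_a2")) %>%
  # Rows without a match get NA in subregion; keep only the European ones
  filter(!is.na(subregion)) %>%
  mutate(
    # Shift the baseline from 0 to 100
    resi_real = residential + 100,
    # Order the country labels by their maximum residential value
    country_region = fct_reorder(country_region, residential, .fun = max)
  )
```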

4.4.5 Long and wide tables

Before creating graphics with ggplot2, it is very common to reshape the table between two main formats, long and wide. A table is tidy when 1) each variable is a column, 2) each observation/case is a row, and 3) each type of observational unit forms a table.
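One way to switch between the two formats is with pivot_longer() and pivot_wider() from tidyr. The choice of these two functions and the toy table below are assumptions for illustration.

```r
library(tidyr)

# Wide format: one column per type of place
wide <- tibble(
  date = as.Date(c("2020-04-01", "2020-04-02")),
  residential = c(30, 28),
  parks = c(-85, -80)
)

# Long format: one row per date and place
long <- pivot_longer(
  wide,
  cols = c(residential, parks),
  names_to = "place",
  values_to = "mobility"
)

# And back to the wide format
wide_again <- pivot_wider(long, names_from = place, values_from = mobility)
```

The long format is usually the one ggplot2 expects when a variable such as place should be mapped to color or facets.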

Other functions worth a look are separate(), case_when() and complete(). More details can be found in the Cheat-Sheet of dplyr.
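A short sketch of these three functions on a made-up table (the column names, thresholds and labels are assumptions chosen for the example):

```r
library(dplyr)
library(tidyr)

records <- tibble(
  id = c("ES_Galicia", "ES_Madrid", "IT_Lombardy"),
  date = as.Date(c("2020-04-01", "2020-04-02", "2020-04-01")),
  parks = c(-80, -40, -90)
)

tidy_records <- records %>%
  # Split one column into two at the underscore
  separate(id, into = c("country", "region"), sep = "_") %>%
  # Classify values with a sequence of conditions
  mutate(level = case_when(
    parks <= -75 ~ "strong drop",
    parks <= -25 ~ "moderate drop",
    TRUE ~ "small change"
  )) %>%
  # Add the missing region/date combinations as rows filled with NA
  complete(nesting(country, region), date)
```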

4.5 Visualize data

ggplot2 is a modern system for data visualization with a huge variety of options. Unlike the R Base graphic system, in ggplot2 a different grammar is used. The grammar of graphics (gg) consists of the sum of several independent layers or objects that are combined using + to construct the final graph. ggplot differentiates between data, what is displayed and how it is displayed.

  • data: our dataset (data.frame or tibble)

  • aesthetics: with the aes() function we indicate the variables that correspond to the x, y, z, … axes, or the graphic parameters (color, size, shape) that should vary according to a variable. aes() can be included in ggplot() or in the corresponding geometry function geom_*.

  • geometries: geom_* objects that indicate the geometry to be used (e.g., geom_point(), geom_line(), geom_boxplot(), etc.).

  • scales: scale_* objects (e.g., scale_x_continuous(), scale_colour_manual()) to manipulate axes, define colors, etc.

  • statistics: stat_* objects (e.g., stat_density()) that apply statistical transformations.

More details can be found in the Cheat-Sheet of ggplot2. ggplot2 is constantly supplemented by extensions for geometries or other graphical options (see https://exts.ggplot2.tidyverse.org/ggiraph.html); for graphical ideas, have a look at the R Graph Gallery (https://www.r-graph-gallery.com/).
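The layers listed above can be combined as in the following sketch. It uses the built-in mtcars dataset, which is an assumption for illustration and not the data of this post.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # data + aesthetics
  geom_point(size = 2) +                                   # geometry
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistic
  scale_x_continuous(breaks = 2:5) +                        # scale for the x axis
  scale_colour_manual(                                      # scale for the colors
    values = c("4" = "darkgreen", "6" = "orange", "8" = "purple")
  )

p
```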

4.5.1 Line and scatter plot

We create a subset of our mobility data for residences and parks, filtering the records for Italian regions. In addition, we divide the mobility values in percentage by 100 to obtain the fraction, since ggplot2 allows us to indicate the unit of percentage in the label argument (see last plot in this section).

To modify the axes, we use the different scale_* functions that we must adapt to the scales of measurement (date, discrete, continuous, etc.). The labs() function helps us define the axis, legend and plot titles. Finally, we add the style of the graph with theme_light() (others are theme_bw(), theme_minimal(), etc.). We could also make changes to all graphic elements through theme().
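A sketch of such a line and scatter plot. The table is a made-up stand-in for the Italian subset, and the column names place and mobility are assumptions.

```r
library(ggplot2)

# Toy data: ten days of residential and park mobility, as fractions
italy <- data.frame(
  date = rep(seq(as.Date("2020-03-01"), by = "day", length.out = 10), times = 2),
  place = rep(c("residential", "parks"), each = 10),
  mobility = c(seq(0, 27, by = 3), seq(0, -72, by = -8)) / 100
)

p <- ggplot(italy, aes(x = date, y = mobility, colour = place)) +
  geom_line() +
  geom_point(size = 1) +
  # Scales adapted to the type of variable: dates on x, percentages on y
  scale_x_date(date_breaks = "3 days", date_labels = "%d %b") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = NULL,
    y = "Change from baseline",
    colour = "Place",
    title = "Mobility in Italy (toy data)"
  ) +
  theme_light()

p
```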

4.5.2 Boxplot

We can visualize different aspects of the mobility with other geometries. Here we will create boxplots for each European country representing the variability of mobility between and within countries during the COVID-19 pandemic.
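A minimal boxplot sketch with made-up values for three countries (the table and its column names are assumptions):

```r
library(ggplot2)

# Toy data: regional residential mobility per country, as fractions
europe <- data.frame(
  country = rep(c("Spain", "Italy", "Germany"), each = 5),
  residential = c(20, 24, 27, 30, 33,
                  22, 26, 29, 31, 35,
                  8, 10, 12, 13, 15) / 100
)

p <- ggplot(europe, aes(x = country, y = residential)) +
  geom_boxplot(fill = "grey90") +
  coord_flip() +  # horizontal boxes keep the country labels readable
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Residential mobility (change from baseline)") +
  theme_minimal()

p
```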

4.5.3 Heatmap

To visualize the mobility trend of all European countries it is recommended to use a heatmap instead of a bundle of lines. Before building the graph, we will create a vector of Sundays for the x-axis labels in the observation period.

To differentiate between European regions, we will use a color fill for the boxplots. We can set the color type with scale_fill_*, in this case from the viridis scheme. In addition, the guides() function can modify the color bar of the legend. Finally, here we see the use of theme() with additional changes to theme_minimal().
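A sketch of a heatmap along these lines, built with geom_tile(). The data are made up, the column name resi_real follows the text, and the remaining names and settings are assumptions.

```r
library(ggplot2)

dates <- seq(as.Date("2020-03-01"), as.Date("2020-03-29"), by = "day")

# Vector of Sundays for the x-axis labels (2020-03-01 was a Sunday)
sundays <- seq(as.Date("2020-03-01"), as.Date("2020-03-29"), by = "week")

# Toy data: residential mobility relative to a baseline of 100
trend <- data.frame(
  country = rep(c("Spain", "Italy", "Germany"), each = length(dates)),
  date = rep(dates, times = 3)
)
days_elapsed <- as.numeric(trend$date - min(dates))
trend$resi_real <- 100 + days_elapsed * ifelse(trend$country == "Germany", 0.5, 1)

p <- ggplot(trend, aes(x = date, y = country, fill = resi_real)) +
  geom_tile() +
  scale_x_date(breaks = sundays, date_labels = "%d %b", expand = c(0, 0)) +
  scale_fill_viridis_c() +  # continuous viridis color scheme
  guides(fill = guide_colorbar(title = "Residential mobility (baseline = 100)")) +
  labs(x = NULL, y = NULL) +
  theme_minimal() +
  theme(
    legend.position = "top",
    legend.key.width = unit(1.5, "cm"),
    panel.grid = element_blank()
  )

p
```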

4.6 Apply functions on vectors or lists

The purrr package contains a set of advanced functional programming functions for working with functions and vectors. The known lapply() family of R Base corresponds to the map() functions in this package. One of the biggest advantages is being able to reduce the use of loops (for, etc.).
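A minimal comparison of lapply() and the map() family; the list values is made up for illustration.

```r
library(purrr)

values <- list(a = 1:5, b = c(2, 4, 6), c = 10)

# Base R and purrr equivalents: both return a list
means_base <- lapply(values, mean)
means_list <- map(values, mean)

# map_dbl() returns a numeric vector instead of a list
means <- map_dbl(values, mean)

# The same result with an explicit loop
means_loop <- numeric(length(values))
for (i in seq_along(values)) {
  means_loop[i] <- mean(values[[i]])
}
```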

Finally, a more complex example. We calculate the correlation coefficient between residential and park mobility in all European countries. To get a tidy summary of a model or test we use the tidy() function of the broom package.

As we’ve seen before, there are variants of map_* that return an object of another class instead of a list, here one that binds the results into a data.frame.
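A sketch of this kind of calculation with made-up data for two countries. The use of cor.test(), split() and map_dfr() is an assumption about how it could be done, not necessarily the approach of the original example.

```r
library(dplyr)
library(purrr)
library(broom)

# Toy data: residential and park mobility per country
mobility <- tibble(
  country = rep(c("Spain", "Italy"), each = 6),
  residential = c(5, 12, 20, 26, 30, 31, 4, 15, 22, 30, 33, 35),
  parks = c(-10, -35, -60, -75, -80, -85, -5, -40, -70, -85, -90, -88)
)

cor_by_country <- mobility %>%
  # One table per country
  split(.$country) %>%
  # Test each table, tidy the result and bind the rows into one table
  map_dfr(~ tidy(cor.test(.x$residential, .x$parks)), .id = "country")

cor_by_country
```

The estimate column holds the correlation coefficient of each country, next to the test statistic and p-value returned by tidy().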

Other practical examples can be found in other posts on this blog. More details can be found in the Cheat-Sheet of purrr.