16.1 dplyr functions for data manipulation

dplyr is the data manipulation package in tidyverse.

Let’s load some example data. We will be looking at the iris flower dataset. The data is consists of different measurements for individuals of three species of iris.

> library(tidyverse)
> library(datasets)
> data("iris")
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

We will go through the most used functions in dplyr and see the different approaches between base R (meaning, without any packages) and the tidyverse syntax.

16.1.1 select()

The select() function is used to select specific variables (corresponding to columns) in your dataset. The base R equivalent would be iris[,"Sepal.Length"] or iris$Sepal.Length.

> select(iris, Sepal.Length) 
    Sepal.Length
1            5.1
2            4.9
3            4.7
4            4.6
5            5.0
...
> select(iris, Sepal.Width, Sepal.Length)
    Sepal.Width Sepal.Length
1           3.5          5.1
2           3.0          4.9
3           3.2          4.7
4           3.1          4.6
5           3.6          5.0
...

Two things to note:

First, note that in base R, the " " are necessary for column names, whereas here, select(iris, "Sepal.Length") and select(iris, Sepal.Length) both give the same result.

Secondly, even if you select just one column with the select() function, it will still output a data.frame, not a vector. This is different from base R. If you are specifically looking for a vector, you can add the pull() command. Don’t worry about the use of %>% yet, it will be explained later.

> class(iris[,"Sepal.Length"]) # base R command
[1] "numeric"
> class(select(iris, Sepal.Length)) # select() from dplyr
[1] "data.frame"
> class(select(iris, Sepal.Length) %>% pull())
[1] "numeric"

There are some select_helpers which allow you to select for very specific cases. You can get an overview on the R help page with ?select_helpers. Some examples are:

> select(iris, ends_with("Length"))
    Sepal.Length Petal.Length
1            5.1          1.4
2            4.9          1.4
3            4.7          1.3
4            4.6          1.5
5            5.0          1.4
...
> select(iris, last_col())
       Species
1       setosa
2       setosa
3       setosa
4       setosa
5       setosa
...

16.1.2 filter()

The filter() function is used to filter your observations (the rows of your dataset) according to the filters you set. The base R equivalent to this would be something like iris[iris$Species == "setosa",]

> filter(iris, Species == "setosa")
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
...
> filter(iris, Sepal.Length > 7 & Sepal.Width > 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.2         3.6          6.1         2.5 virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.2         3.2          6.0         1.8 virginica
4          7.9         3.8          6.4         2.0 virginica

The %in% operator can be useful to identify if an element belongs to a vector or a column in your data.frame. It can replace the need for multiple | statements.

> iris[iris$Species == "virginica" | iris$Species == "versicolor",] # without %in%
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
54           5.5         2.3          4.0         1.3 versicolor
55           6.5         2.8          4.6         1.5 versicolor
...
> filter(iris, Species  %in% c("virginica, versicolor")) # with dplyr and %in%
[1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species     
<0 rows> (or 0-length row.names)

16.1.3 ‘arrange()’

There is a dplyr function called arrange() that sorts your data.frame according to the values in a column. In the example below we want our data.frame in descending order of our variable Sepal.Length.

> arrange(iris, desc(Sepal.Length))
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            7.9         3.8          6.4         2.0  virginica
2            7.7         3.8          6.7         2.2  virginica
3            7.7         2.6          6.9         2.3  virginica
4            7.7         2.8          6.7         2.0  virginica
5            7.7         3.0          6.1         2.3  virginica
...

16.1.4 mutate()

The mutate() function creates new columns that are functions of existing columns. It can even use a created variable to create other variables within the same call. In this example, the created Petal.Aspect_ratio is used in the same call to also create the variable Petal.Very_oval.

> mutate(iris,
+        Petal.Aspect_ratio = Petal.Length / Petal.Width,
+        Petal.Very_oval = Petal.Aspect_ratio > 6)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
...

You can also use any arithmetic operators like +, -, *, ^ and so on within mutate().

In all previous examples in this chapter, we never overwrote the original dataset - we merely chose what to display in the console. If you want to save your wrangled dataset to the environment (for example after creating some fancy new columns), you have to explicitly do so. To illustrate this, see the example below.

> iris_aspect <- mutate(iris,
+        Petal.Aspect_ratio = Petal.Length / Petal.Width,
+        Petal.Very_oval = Petal.Aspect_ratio > 6)
> head(iris_aspect)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Aspect_ratio
1          5.1         3.5          1.4         0.2  setosa               7.00
2          4.9         3.0          1.4         0.2  setosa               7.00
3          4.7         3.2          1.3         0.2  setosa               6.50
4          4.6         3.1          1.5         0.2  setosa               7.50
5          5.0         3.6          1.4         0.2  setosa               7.00
6          5.4         3.9          1.7         0.4  setosa               4.25
  Petal.Very_oval
1            TRUE
2            TRUE
3            TRUE
4            TRUE
5            TRUE
6           FALSE

16.1.5 summarise()

Let us look at some summary statistics of the Sepal.Length of our new iris_aspect data.frame.

> summarise(iris_aspect, 
+           mean_Sepal.Length   = mean(Sepal.Length, na.rm = T),
+           sd_Sepal.Length     = sd(Sepal.Length, na.rm = T),
+           n_of_flowers       = n())
  mean_Sepal.Length sd_Sepal.Length n_of_flowers
1          5.843333       0.8280661          150

16.1.6 group_by()

These summaries are very useful, but in the iris flower example we have different species we may not want to merge into the same summary statistics. For these cases, dplyr provides the group_by() function. You can group your data.frame before you calculate summary statistics.

> iris_by_species <- group_by(iris_aspect, Species)
> iris_by_species
# A tibble: 150 × 7
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Aspect_ratio
          <dbl>       <dbl>        <dbl>       <dbl> <fct>                <dbl>
 1          5.1         3.5          1.4         0.2 setosa                7   
 2          4.9         3            1.4         0.2 setosa                7   
...

The generated table is now of the tibble class, which is the data.frame equivalent in tidyverse. We’re not focusing on this now, as it acts like a data.frame in almost every way. What we want to focus on is that we can see on the top of the table that our data is now grouped by Species, and that there are three species in our table. We could ungroup our data with the function ungroup().

Let’s see what happens with our grouped data if we use the summarise() function from above.

> summarise(iris_by_species, 
+           mean_Sepal.Length   = mean(Sepal.Length, na.rm = T),
+           sd_Sepal.Length     = sd(Sepal.Length, na.rm = T),
+           n_of_flowers       = n())
# A tibble: 3 × 4
  Species    mean_Sepal.Length sd_Sepal.Length n_of_flowers
  <fct>                  <dbl>           <dbl>        <int>
1 setosa                  5.01           0.352           50
2 versicolor              5.94           0.516           50
3 virginica               6.59           0.636           50

We can also define multiple groups.

> iris_by_two <- group_by(iris_aspect, Species, Petal.Very_oval)
> summarise(iris_by_two, 
+           mean_Sepal.Length   = mean(Sepal.Length, na.rm = T),
+           sd_Sepal.Length     = sd(Sepal.Length, na.rm = T),
+           n_of_flowers       = n())
# A tibble: 4 × 5
# Groups:   Species [3]
  Species    Petal.Very_oval mean_Sepal.Length sd_Sepal.Length n_of_flowers
  <fct>      <lgl>                       <dbl>           <dbl>        <int>
1 setosa     FALSE                        5.13           0.369           19
2 setosa     TRUE                         4.93           0.326           31
3 versicolor FALSE                        5.94           0.516           50
4 virginica  FALSE                        6.59           0.636           50

Hm, interesting, there is only four rows in this summary. Well, it seems that only for the setosa species there were both Petal.Very_oval == TRUE and Petal.Very_oval == FALSE. You can investigate this by looking at the column n_of_flowers. For two species, the number is 50. We know our data well enough to see that this means all of the flowers are accounted for.

16.1.7 Piping

The tidyverse uses a special operator %>% for piping. Because it is used so often, it may be worth learning the keyboard shortcut for it:

  • Ctrl+Shift+M on Windows and Linux
  • Cmd+Shift+M on Mac

It basically means “take my data and with it use function()”.

> iris %>% ncol()
[1] 5

Because the data.frame is defined before we pipe, we don’t have to mention it anymore in the ncol() function.

Piping can be done many times in a row:

> iris %>% select(Sepal.Width) %>% sum() %>% round(digits = 0)
[1] 459

Using the onion principle of many lengthy brackets in base R can make your code very hard to read. Piping is supposed to give your code better readability, especially if you use proper spacing and multiple lines.

To show this, we are again comparing base R to the tidyverse commands:

> # base R
> round(sum(iris_aspect$Sepal.Width[iris_aspect$Petal.Aspect_ratio > 5 & iris_aspect$Species == "setosa"], na.rm = TRUE, digits = 0))
[1] 115
> # dplyr
> iris_aspect %>%
+   filter(Petal.Aspect_ratio > 5,
+          Species == "setosa") %>% 
+   select(Sepal.Width) %>% 
+   sum(na.rm = TRUE) %>% 
+   round(digits = 0)
[1] 115

Using the piping operator %>% all the grouping and summarising we did above can be done in one step.

> iris %>% 
+   mutate(
+     Petal.Aspect_ratio = Petal.Length / Petal.Width,
+     Petal.Very_oval = Petal.Aspect_ratio > 6) %>% 
+   group_by(Species, Petal.Very_oval) %>% 
+   summarise(
+     mean_Sepal.Length   = mean(Sepal.Length, na.rm = T),
+     sd_Sepal.Length     = sd(Sepal.Length, na.rm = T),
+     n_of_flowers       = n())
# A tibble: 4 × 5
# Groups:   Species [3]
  Species    Petal.Very_oval mean_Sepal.Length sd_Sepal.Length n_of_flowers
  <fct>      <lgl>                       <dbl>           <dbl>        <int>
1 setosa     FALSE                        5.13           0.369           19
2 setosa     TRUE                         4.93           0.326           31
3 versicolor FALSE                        5.94           0.516           50
4 virginica  FALSE                        6.59           0.636           50

16.1.8 Outlook

There are many more packages in the tidyverse universe. This short bonus chapter just gives some insight into the overall topic.

16.1.9 Exercises: Tidyverse

See Section 18.0.40 for solutions.

  1. Find the rows where the Sepal.Length in our dataset is exactly 5.0, 6.0 or 7.0. Hint: avoid the use of |.

  2. Load in another example dataset using data("ChickWeight"). For every measured time, find the mean weight of all chicks. Exclude diet number 4. Try to use the piping operator.