16.1 dplyr functions for data manipulation
dplyr is the data manipulation package in tidyverse.
Let’s load some example data. We will be looking at the iris
flower dataset. The data is consists of different measurements for individuals of three species of iris.
> library(tidyverse)
> library(datasets)
> data("iris")
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
We will go through the most used functions in dplyr and see the different approaches between base R (meaning, without any packages) and the tidyverse syntax.
16.1.1 select()
The select()
function is used to select specific variables (corresponding to columns) in your dataset. The base R equivalent would be iris[,"Sepal.Length"]
or iris$Sepal.Length
.
> select(iris, Sepal.Length)
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
...
> select(iris, Sepal.Width, Sepal.Length)
Sepal.Width Sepal.Length
1 3.5 5.1
2 3.0 4.9
3 3.2 4.7
4 3.1 4.6
5 3.6 5.0
...
Two things to note:
First, note that in base R, the " "
are necessary for column names, whereas here, select(iris, "Sepal.Length")
and select(iris, Sepal.Length)
both give the same result.
Secondly, even if you select just one column with the select()
function, it will still output a data.frame, not a vector. This is different from base R. If you are specifically looking for a vector, you can add the pull()
command. Don’t worry about the use of %>%
yet, it will be explained later.
> class(iris[,"Sepal.Length"]) # base R command
[1] "numeric"
> class(select(iris, Sepal.Length)) # select() from dplyr
[1] "data.frame"
> class(select(iris, Sepal.Length) %>% pull())
[1] "numeric"
There are some select_helpers
which allow you to select for very specific cases. You can get an overview on the R help page with ?select_helpers
. Some examples are:
16.1.2 filter()
The filter()
function is used to filter your observations (the rows of your dataset) according to the filters you set. The base R equivalent to this would be something like iris[iris$Species == "setosa",]
> filter(iris, Species == "setosa")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
> filter(iris, Sepal.Length > 7 & Sepal.Width > 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.2 3.6 6.1 2.5 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.2 3.2 6.0 1.8 virginica
4 7.9 3.8 6.4 2.0 virginica
The %in%
operator can be useful to identify if an element belongs to a vector or a column in your data.frame. It can replace the need for multiple |
statements.
> iris[iris$Species == "virginica" | iris$Species == "versicolor",] # without %in%
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
...
> filter(iris, Species %in% c("virginica, versicolor")) # with dplyr and %in%
[1] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<0 rows> (or 0-length row.names)
16.1.3 ‘arrange()’
There is a dplyr function called arrange()
that sorts your data.frame according to the values in a column. In the example below we want our data.frame in descending order of our variable Sepal.Length
.
16.1.4 mutate()
The mutate()
function creates new columns that are functions of existing columns. It can even use a created variable to create other variables within the same call. In this example, the created Petal.Aspect_ratio
is used in the same call to also create the variable Petal.Very_oval
.
> mutate(iris,
+ Petal.Aspect_ratio = Petal.Length / Petal.Width,
+ Petal.Very_oval = Petal.Aspect_ratio > 6)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
You can also use any arithmetic operators like +
, -
, *
, ^
and so on within mutate()
.
In all previous examples in this chapter, we never overwrote the original dataset - we merely chose what to display in the console. If you want to save your wrangled dataset to the environment (for example after creating some fancy new columns), you have to explicitly do so. To illustrate this, see the example below.
> iris_aspect <- mutate(iris,
+ Petal.Aspect_ratio = Petal.Length / Petal.Width,
+ Petal.Very_oval = Petal.Aspect_ratio > 6)
> head(iris_aspect)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Aspect_ratio
1 5.1 3.5 1.4 0.2 setosa 7.00
2 4.9 3.0 1.4 0.2 setosa 7.00
3 4.7 3.2 1.3 0.2 setosa 6.50
4 4.6 3.1 1.5 0.2 setosa 7.50
5 5.0 3.6 1.4 0.2 setosa 7.00
6 5.4 3.9 1.7 0.4 setosa 4.25
Petal.Very_oval
1 TRUE
2 TRUE
3 TRUE
4 TRUE
5 TRUE
6 FALSE
16.1.5 summarise()
Let us look at some summary statistics of the Sepal.Length
of our new iris_aspect
data.frame.
16.1.6 group_by()
These summaries are very useful, but in the iris flower example we have different species we may not want to merge into the same summary statistics.
For these cases, dplyr provides the group_by()
function. You can group your data.frame before you calculate summary statistics.
> iris_by_species <- group_by(iris_aspect, Species)
> iris_by_species
# A tibble: 150 × 7
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Aspect_ratio
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 7
2 4.9 3 1.4 0.2 setosa 7
...
The generated table is now of the tibble
class, which is the data.frame equivalent in tidyverse. We’re not focusing on this now, as it acts like a data.frame in almost every way. What we want to focus on is that we can see on the top of the table that our data is now grouped by Species
, and that there are three species in our table. We could ungroup our data with the function ungroup()
.
Let’s see what happens with our grouped data if we use the summarise()
function from above.
> summarise(iris_by_species,
+ mean_Sepal.Length = mean(Sepal.Length, na.rm = T),
+ sd_Sepal.Length = sd(Sepal.Length, na.rm = T),
+ n_of_flowers = n())
# A tibble: 3 × 4
Species mean_Sepal.Length sd_Sepal.Length n_of_flowers
<fct> <dbl> <dbl> <int>
1 setosa 5.01 0.352 50
2 versicolor 5.94 0.516 50
3 virginica 6.59 0.636 50
We can also define multiple groups.
> iris_by_two <- group_by(iris_aspect, Species, Petal.Very_oval)
> summarise(iris_by_two,
+ mean_Sepal.Length = mean(Sepal.Length, na.rm = T),
+ sd_Sepal.Length = sd(Sepal.Length, na.rm = T),
+ n_of_flowers = n())
# A tibble: 4 × 5
# Groups: Species [3]
Species Petal.Very_oval mean_Sepal.Length sd_Sepal.Length n_of_flowers
<fct> <lgl> <dbl> <dbl> <int>
1 setosa FALSE 5.13 0.369 19
2 setosa TRUE 4.93 0.326 31
3 versicolor FALSE 5.94 0.516 50
4 virginica FALSE 6.59 0.636 50
Hm, interesting, there is only four rows in this summary. Well, it seems that only for the setosa
species there were both Petal.Very_oval == TRUE
and Petal.Very_oval == FALSE
. You can investigate this by looking at the column n_of_flowers
. For two species, the number is 50
. We know our data well enough to see that this means all of the flowers are accounted for.
16.1.7 Piping
The tidyverse uses a special operator %>%
for piping. Because it is used so often, it may be worth learning the keyboard shortcut for it:
Ctrl
+Shift
+M
on Windows and LinuxCmd
+Shift
+M
on Mac
It basically means “take my data
and with it use function()
”.
Because the data.frame is defined before we pipe, we don’t have to mention it anymore in the ncol()
function.
Piping can be done many times in a row:
Using the onion principle of many lengthy brackets in base R can make your code very hard to read. Piping is supposed to give your code better readability, especially if you use proper spacing and multiple lines.
To show this, we are again comparing base R to the tidyverse commands:
> # base R
> round(sum(iris_aspect$Sepal.Width[iris_aspect$Petal.Aspect_ratio > 5 & iris_aspect$Species == "setosa"], na.rm = TRUE, digits = 0))
[1] 115
> # dplyr
> iris_aspect %>%
+ filter(Petal.Aspect_ratio > 5,
+ Species == "setosa") %>%
+ select(Sepal.Width) %>%
+ sum(na.rm = TRUE) %>%
+ round(digits = 0)
[1] 115
Using the piping operator %>%
all the grouping and summarising we did above can be done in one step.
> iris %>%
+ mutate(
+ Petal.Aspect_ratio = Petal.Length / Petal.Width,
+ Petal.Very_oval = Petal.Aspect_ratio > 6) %>%
+ group_by(Species, Petal.Very_oval) %>%
+ summarise(
+ mean_Sepal.Length = mean(Sepal.Length, na.rm = T),
+ sd_Sepal.Length = sd(Sepal.Length, na.rm = T),
+ n_of_flowers = n())
# A tibble: 4 × 5
# Groups: Species [3]
Species Petal.Very_oval mean_Sepal.Length sd_Sepal.Length n_of_flowers
<fct> <lgl> <dbl> <dbl> <int>
1 setosa FALSE 5.13 0.369 19
2 setosa TRUE 4.93 0.326 31
3 versicolor FALSE 5.94 0.516 50
4 virginica FALSE 6.59 0.636 50
16.1.8 Outlook
There are many more packages in the tidyverse universe. This short bonus chapter just gives some insight into the overall topic.
16.1.9 Exercises: Tidyverse
See Section 18.0.40 for solutions.
Find the rows where the
Sepal.Length
in our dataset is exactly 5.0, 6.0 or 7.0. Hint: avoid the use of|
.Load in another example dataset using
data("ChickWeight")
. For every measured time, find the mean weight of all chicks. Exclude diet number 4. Try to use the piping operator.