14.1 Apply

As we have seen above, loops are relatively slow in R. Luckily, many functions accept vectors (and matrices or data.frames) as input and do their magic on each element individually, which allows us to avoid writing loops.

However, not all functions do that. In addition, we may no always want to apply a function to each element, but maybe to each column or row of a matrix or data.frame.

Luckily, R comes with a family of functions that help us avoiding loops altogether. We will discuss the most important ones here.

We will ilustrate some of these concepts using a simple simulation task. Imagine we would like to simulate a simple dice game in which it matters how many sixes the player obtains when rolling 5 dice. What is the distribution of 0, 1, … or 5 sixes?

Obviously, this is a binomial sampling problem. Knowing this, we can readily obtain these probabilities using dbinom():

> dbinom(0:5, size=5, prob=1/6)
[1] 0.4018775720 0.4018775720 0.1607510288 0.0321502058 0.0032150206
[6] 0.0001286008

But for the sake of the argument, let’s try to get those using simulations. A for loop implementation could look like this (note that R indexes start at 1):

> simSixes <- function(dice, rep){
+   counts <- integer(6)
+   for(i in 1:rep){
+     roll <- sample(1:6, size=dice, replace=TRUE)
+     numSixes <- sum(roll==6)
+     counts[numSixes+1] <- counts[numSixes+1] + 1
+   }
+   return(counts / rep)
+ }
> system.time(sim <- simSixes(dice=5, rep=10^5))
   user  system elapsed 
  0.903   0.020   0.923 
> sim
[1] 0.39957 0.40358 0.16092 0.03269 0.00315 0.00009

How can we speed up these simulations?

14.1.1 `replicate()`

The most basic function to avoid a for loop is replicate(). As the name suggests, it replicates a particular function call. It takes the following main arguments:

The number of replicates n.
The expression expr to be called.

We could use replicate() to generate many random numbers as follows:

> replicate(10, runif(1))
 [1] 0.6929201 0.4358046 0.7050972 0.3200239 0.8760117 0.6720674 0.3047773
 [8] 0.1398616 0.5652658 0.6980336

That is obviously not super helpful as the function runif() has a built-in mechanism to simulate an arbitrary number of random numbers. But we can illustrate the benefit of using replicate() over a for loop for our case of rolling dice:

> simRoll <- function(dice){
+   roll <- sample(1:6, size=dice, replace=TRUE)
+   return(sum(roll==6))
+ }
> simSixesReplicate <- function(dice, rep){
+   rolls <- replicate(rep, simRoll())
+   counts <- tabulate(rolls, nbins=dice+1)
+   return(counts / rep)
+ }
> system.time(sim <- simSixesReplicate(dice=5, rep=10^5))
   user  system elapsed 
  1.048   0.019   1.069 
> sim
[1] 0.40376 0.20082 0.05372 0.00827 0.00072 0.00004

Since replicate() returns a vector of results, we need to tabulate those into counts using tabulate().

Note that in this case, using replicate() did not result in a major speed-up. That is common for many tasks: while replicate() is a convenient function to write code cleaner, it does usually do little in terms of speed.

14.1.2 `apply()`

An often more useful function is apply() which applies a specific function to a data.frame or matrix. It takes the following arguments:

The data structure X to which a function should be applied.
The MARGIN over which to loop. MARGIN=1 corresponds to rows, MARGIN=2 to columns.
The function FUN to be applied.
Additional parameters ... forwarded to FUN.

What apply() does is basically to loop over X along the MARGIN (i.e. along rows or columns) and applying FUN to each vector of that margin (i.e. to each row or column). Here is an example:

> m <- matrix(runif(1000), nrow=10)
> apply(m, 1, sum)
 [1] 50.98664 47.96165 46.53472 49.42322 46.78828 55.02994 46.73263 50.38601
 [9] 49.15007 47.85920

In the example above, we created a matrix with ten rows and used apply() to apply the function sum() to each row. This call to apply() does returns a vector of length 10, one value per row.

Now let’s see how we could make use of apply() to make our dice simulations faster.

> simSixes <- function(dice, rep){
+   rolls <- matrix(sample(1:6, size=dice*rep, replace=TRUE), ncol=dice)
+   numSix <- apply(rolls == 6, 1, sum)
+   counts <- tabulate(numSix, nbins=dice+1)
+   return(counts / rep)
+ }
> system.time(sim <- simSixes(dice=5, rep=10^5))
   user  system elapsed 
  0.178   0.006   0.184 
> sim
[1] 0.40449 0.15945 0.03282 0.00283 0.00008 0.00000

As you can see, we now managed to significantly reduced computation time by making use of apply(). But let’s walk through the code to understand how it works.

The basic idea is to avoid multiple function calls at all levels, so we should aim for generating all random draws at once. In total, we roll dice*rep times a dice, which we can do fast with sample(1:6, size=dice*rep, replace=TRUE). However, we still need to preserve that we always consider dice rolls together. One way to do that is to store the resulting rolls in a matrix with dice columns such that each row corresponds to a roll of dice dice. Once we have this structure, we can use apply() to perform some calculations on a per row basis. Here, we want to use sum() to count the number of sixes, for which we first turn the integer matrix into a logical matrix using rolls == 6 and then apply() to calculate the number of TRUE’s in each row.

14.1.3 `lapply()` and `sapply()`

R offers a bunch of similar functions to apply() that differ in what input they accept and which output the produce. The apply() function, for instance, does not deal with lists, but the lapply() functions does precicely that:

> l1 <- as.list(1:5)
> l2 <- lapply(l1, runif)
> l2
[[1]]
[1] 0.845548

[[2]]
[1] 0.5016538 0.5085640

[[3]]
[1] 0.03651228 0.36896596 0.17814951

[[4]]
[1] 0.7455067 0.1054734 0.3717697 0.8702687

[[5]]
[1] 0.1525040 0.7820551 0.9155104 0.2130255 0.1950921

In the example above, we used lapply() to generate a list of vectors of random numbers of different sizes, which could not be stored in a matrix.

Note that while lapply() always returns a list, it may not require a list as input.

> lapply(1:3, runif)
[[1]]
[1] 0.6604787

[[2]]
[1] 0.9738182 0.9781264

[[3]]
[1] 0.8911678 0.5364177 0.8744750

Sometimes, we may want to apply a function to a list but not obtain the results in list form but as a vector or matrix. In these cases we may use sapply():

> sapply(l2, sum)
[1] 0.8455480 1.0102178 0.5836277 2.0930185 2.2581871

14.1.4 `sweep()`

The last such function we will discuss here is sweep() which sweeps out a summary statistics from a higher dimensional structure. The classic example is if you want to centralize some data to have a mean of zero.

Consider a matrix \(\boldsymbol{X}\) of 100 samples (rows) obtained from 10 replicates (columns). The goal is now to standardize each replicate (column) to have mean zero. We can get the means of these data sets using apply().

> X <- matrix(runif(1000), ncol=10)
> means <- apply(X, 2, mean)

But how can we now centralize each column? Well, that is trivial with a for loop:

> for(i in 1:ncol(X)){
+   X[,i] <- X[,i] - means[i]
+ }

We can avoid that for loop by expanding the means to match the dimensionality of \(\boldsymbol{X}\):

> M <- matrix(rep(means, nrow(X)), ncol=ncol(X), byrow=TRUE)
> X <- X - M

A more elegant way is offered by sweep that takes the following arguments

The data structure x to which a function should be applied.
The MARGIN over which to loop.
The vector of statistics STATS to be used.
The function FUN to be applied.

To centralize our data, we can thus use sweep() as follows.

> X <- sweep(X, MARGIN=2, STATS=means, FUN="-")

In this call, we ask sweep() to sweep across columns (MARGIN=2) and to use the statistics stored in means using the index of the column, and to apply the operator -. Note that since - is an operator not a function name, we need to put in in quotes as FUN="-".

14.1.5 Shortcuts

Some apply() calls are so frequently used, that R provides shortcuts to them (which are also faster!). These include rowSums(), colSums(), rowMeans() and colMeans() that correspond to apply() calls with the appropriate MARGIN and FUN set to sum or mean.

> X <- matrix(runif(20), ncol=5)
> rowSums(X)
[1] 2.072556 2.692741 1.432204 1.928534
> apply(X, 1, sum)
[1] 2.072556 2.692741 1.432204 1.928534
> colMeans(X)
[1] 0.2220654 0.4735840 0.2865934 0.6079728 0.4412933
> apply(X, 2, mean)
[1] 0.2220654 0.4735840 0.2865934 0.6079728 0.4412933

14.1.6 Exercises: Apply

See Section 18.0.38 for solutions.

Use lapply() to create a list with 100 vectors as follows: sample for each vector an integer \(n_i\) from 1, 2, …, 1000. The vector should then contain the value \(n_i\) 10 times.
Create a 11x1000 matrix \(\boldsymbol{X}\) with rows of 1000 random numbers each, drawn from a normal distribution with means -5:5 and standard deviations seq(0.1, 2, length.out=11), one per row. Verify that the rows have the proper means and standard deviations using apply().
Standardize the matrix \(\boldsymbol{X}\) from above such that each row has mean zero and standard deviation 1. Verify with apply().
Write a function sumBelow() that takes as input a vector \(\boldsymbol{x}\) and a constant \(a\). It then returns the sum of all elements of \(\boldsymbol{x}\) that are < than \(a\). Use apply() to apply our functions to each data set of the standardized matrix \(\boldsymbol{X}\) from the exercise above using \(a=0\) and \(a=1\).
Use apply() to generate a plot with four panels showing histograms of the data sets 4, 5, 6 and 7 of \(\boldsymbol{X}\)

14.1 Apply

14.1.1 replicate()

14.1.2 apply()

14.1.3 lapply() and sapply()

14.1.4 sweep()