4.4 Factors

Factors are data objects used to group data into categories. These categories are referred to as levels. A factor can be built from strings as well as from integers, but its levels are always stored as strings. Factors are useful when there is a limited number of unique values, for example sex, days of the week or geographic clusters.
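
As a small illustration of a factor built from integers, the sketch below factorizes a made-up vector of ratings. The names ratings and fRatings and the values are assumptions chosen for illustration only.

```r
# Hypothetical data: ratings on a scale from 1 to 3 (not part of the text)
ratings <- c(3, 1, 2, 3, 3, 1)
fRatings <- factor(ratings)

# Each distinct value becomes one level
nlevels(fRatings)               # 3

# The levels themselves are stored as strings, even for numeric input
levels(fRatings)                # "1" "2" "3"
is.character(levels(fRatings))  # TRUE
```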

Suppose we have sampled people from different cantons of Switzerland. These are listed in a vector. We can factorize these locations using the function factor():

> locations <- c("BE", "FR", "VD", "FR", "BE", "FR", "VS", "FR", "VD", "FR", "BE", "BE", "VD", "FR", "VD", "BE", "BE", "FR", "BE")
> fLocations <- factor(locations)
> fLocations
 [1] BE FR VD FR BE FR VS FR VD FR BE BE VD FR VD BE BE FR BE
Levels: BE FR VD VS

As you can see, R detected four categories (levels): BE, FR, VD and VS. Each entry of the vector fLocations is no longer a string, but refers to one of these levels. In fact, factors are stored as integer vectors, as can be seen from the structure of fLocations:

> str(fLocations)
 Factor w/ 4 levels "BE","FR","VD",..: 1 2 3 2 1 2 4 2 3 2 ...

We see that the levels are stored in a character vector and that the individual elements are stored as integer indices into that vector.
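
The two parts of a factor can also be inspected separately. The following is a minimal sketch using the base R functions levels() and as.integer(); the variable names lev, codes and recovered are introduced here for illustration only.

```r
locations <- c("BE", "FR", "VD", "FR", "BE", "FR", "VS", "FR", "VD", "FR",
               "BE", "BE", "VD", "FR", "VD", "BE", "BE", "FR", "BE")
fLocations <- factor(locations)

# The levels form a character vector
lev <- levels(fLocations)        # "BE" "FR" "VD" "VS"

# The elements are integer codes pointing into that vector
codes <- as.integer(fLocations)  # 1 2 3 2 1 2 4 ...

# Looking up each code among the levels recovers the original strings
recovered <- lev[codes]
identical(recovered, locations)  # TRUE
```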

Factors can easily be used to group elements of other vectors. Say we know the canton as well as the height of every person we sampled:

> height <- c(151, 159, 132, 144, 147, 144, 152, 139, 152, 139, 158, 141, 149, 144, 141, 135, 148, 147, 134)
> tapply(height, fLocations, mean)
      BE       FR       VD       VS 
144.8571 145.1429 143.5000 152.0000 

Don’t worry about the function tapply() for the moment; we will discuss it in more detail later. In essence, it extracts all individuals with level BE from the vector of heights and computes their mean. It then does the same for level FR, and so on for each remaining level.
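
To make the grouping explicit, the same means can be computed by hand with a loop over the levels. This is only a sketch of the idea, not of how tapply() is implemented internally; the names groupMeans, lev and inGroup are introduced here for illustration.

```r
locations <- c("BE", "FR", "VD", "FR", "BE", "FR", "VS", "FR", "VD", "FR",
               "BE", "BE", "VD", "FR", "VD", "BE", "BE", "FR", "BE")
height <- c(151, 159, 132, 144, 147, 144, 152, 139, 152, 139,
            158, 141, 149, 144, 141, 135, 148, 147, 134)
fLocations <- factor(locations)

# Compute the mean height for each level, one level at a time
groupMeans <- numeric(0)
for (lev in levels(fLocations)) {
  inGroup <- fLocations == lev              # TRUE for individuals from this canton
  groupMeans[lev] <- mean(height[inGroup])  # store the mean under the level's name
}
groupMeans
```

The resulting named vector should contain the same four means as the tapply() call above.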

4.4.1 Exercises: Factors

See Section 18.0.9 for solutions.

  1. If d <- c(7, 8, 3, 3, 5, 3, 8, 4, NA), what are the levels of factor(d)?

  2. Create a vector months of strings with elements “july”, “december”, “july”, “july”, “may”, “december”, “july”, “september”, “may”, “october”, “december”, “december”, “august”, “may”, “september”, “december”, “september”. Now factorize it. How many levels are there? What are their names?