4.5 Missing values and such

While we all strive for perfect data, missing values are common in real data sets. In R, missing data is represented by NA, meaning "not available". NA is a special value that cannot be meaningfully compared with anything. For instance, we cannot say whether NA is > 0, < 0 or == 0.

> a <- NA
> a < 0
[1] NA
> a == 0
[1] NA
> a > 0 
[1] NA

In fact, any comparison with NA is NA itself, as there is nothing that can be said about missing data. Of course, we cannot even say whether some missing data is equal to other missing data.

> NA == NA
[1] NA

To test if some data is missing, we thus need to use the function is.na().

> is.na(NA)
[1] TRUE
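
Since is.na() works element by element, it can also be used to count or locate missing values in a vector. The following is a minimal sketch; the vector y and the other variable names are made up for illustration.

```r
# Hypothetical example vector with two missing entries
y <- c(4, NA, 2, NA, 8)

# is.na() is applied element-wise and returns a logical vector
missing <- is.na(y)
print(missing)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(missing)
print(n_missing)

# which() returns the positions of the missing values
pos_missing <- which(missing)
print(pos_missing)

# Comparing with == does not help: every element of the result is NA
eq <- y == NA
print(eq)
```

The last line illustrates why y == NA cannot be used to find missing values: the comparison itself is NA for every element, including those that are not missing.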

Most functions applied to vectors also return NA as soon as the vector contains a missing value. What is the average of unknown data? Well, unknown.

> x <- c(1,3,7, NA, 9)
> mean(x)
[1] NA

In many cases, we may wish to exclude missing values from calculations. This can be done using a logical vector created by is.na(), or by using the argument na.rm that many functions offer.

> x <- c(1,3,7, NA, 9)
> mean(x[!is.na(x)])
[1] 5
> mean(x, na.rm=TRUE)
[1] 5
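
The argument na.rm is not specific to mean(); other summary functions such as sum(), max() and min() accept it as well. As a brief sketch using the same example values (the variable names total, largest and x_clean are chosen here for illustration), the base R function na.omit() can also be used to drop the missing values once, up front:

```r
# Same example values as above
x <- c(1, 3, 7, NA, 9)

# Without na.rm = TRUE, the result is NA
print(sum(x))

# With na.rm = TRUE, missing values are dropped before the calculation
total <- sum(x, na.rm = TRUE)
largest <- max(x, na.rm = TRUE)
print(total)
print(largest)

# na.omit() removes the missing values from the vector itself;
# as.numeric() strips the bookkeeping attributes that na.omit() adds
x_clean <- as.numeric(na.omit(x))
print(x_clean)
print(mean(x_clean))
```

Removing the missing values once may be convenient when several calculations follow, whereas na.rm = TRUE keeps the original vector untouched.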

Apart from NA, R knows a few other special values, including NaN (not a number) as well as Inf and -Inf, denoting positive and negative infinity, respectively. They mostly arise through calculations and usually inform us that we are doing something fishy…

> 1/0
[1] Inf
> 0/0
[1] NaN
> log(-1)
Warning in log(-1): NaNs produced
[1] NaN
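
To test for these special values, base R offers the functions is.nan(), is.infinite() and is.finite(). The sketch below applies them to a made-up vector v (a name chosen for illustration) containing one value of each kind. Note that is.na() also reports TRUE for NaN, while is.nan() is TRUE for NaN only.

```r
# Hypothetical vector with an ordinary number and each special value
v <- c(1, NA, NaN, Inf, -Inf)

# is.na() is TRUE for both NA and NaN
na_flags <- is.na(v)
print(na_flags)

# is.nan() is TRUE for NaN only
nan_flags <- is.nan(v)
print(nan_flags)

# is.infinite() is TRUE for Inf and -Inf
inf_flags <- is.infinite(v)
print(inf_flags)

# is.finite() is TRUE only for ordinary numbers
fin_flags <- is.finite(v)
print(fin_flags)
```

Selecting v[is.finite(v)] is thus one way to keep only the ordinary numbers of a vector.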