4.5 Missing values and such
While we all strive for perfect data, missing values are common in real data sets. In R, missing data is represented by NA, meaning "not available". NA is a special value that cannot be compared in the usual way. For instance, we cannot say whether NA is > 0, < 0 or == 0.
In fact, any comparison with NA is itself NA, as nothing can be said about missing data. We cannot even say whether one missing value is equal to another: NA == NA is also NA.
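A minimal sketch of this behaviour; the variable names are made up for illustration:

```r
x <- NA

gt_zero   <- x > 0     # NA: we cannot tell whether a missing value is positive
eq_zero   <- x == 0    # NA: nor whether it is zero
eq_itself <- NA == NA  # NA: two missing values need not be the same value

print(c(gt_zero, eq_zero, eq_itself))
```

All three comparisons return NA rather than TRUE or FALSE.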
To test whether some data is missing, we thus need to use the function is.na().
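As a short sketch, is.na() can be applied to a whole vector and returns one logical value per element. The vector heights and its values are hypothetical and only serve as an example:

```r
# Hypothetical measurements with two missing entries
heights <- c(1.72, NA, 1.65, NA, 1.80)

missing     <- is.na(heights)   # FALSE TRUE FALSE TRUE FALSE
n_missing   <- sum(missing)     # TRUE counts as 1, so this is 2
where       <- which(missing)   # positions of the missing entries: 2 4
any_missing <- anyNA(heights)   # TRUE if at least one value is missing

# Comparing with == does not work: every element of the result is NA
wrong <- heights == NA
```

Because logical values are treated as 0 and 1 in arithmetic, sum(is.na(x)) is a convenient way to count missing values.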
Most functions applied to vectors also struggle with NA values and simply return NA. What is the average of unknown data? Well, unknown.
In many cases, we may wish to exclude missing values from calculations. This can be done using a logical vector created by is.na(), or by using the argument na.rm that many functions offer.
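Both approaches can be sketched as follows, using mean() as one example of a function that offers na.rm. The vector heights is again hypothetical:

```r
# Hypothetical measurements with two missing entries
heights <- c(1.72, NA, 1.65, NA, 1.80)

plain_mean <- mean(heights)           # NA, because some values are unknown

# Option 1: keep only the observed values using a logical vector
observed    <- heights[!is.na(heights)]
mean_subset <- mean(observed)

# Option 2: let the function drop missing values itself
mean_narm <- mean(heights, na.rm = TRUE)

print(c(mean_subset, mean_narm))      # both give the same result
```

The first option is more general, since it works with any function. The second is shorter, but only available where a function provides the na.rm argument.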
Apart from NA, R knows a bunch of other special values, including NaN (not a number) as well as Inf and -Inf, denoting ∞ and −∞, respectively. They mostly arise through calculations and usually inform us that we are doing something fishy…
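A few calculations that produce these values, as a minimal sketch with made-up variable names, together with the base R functions that test for them:

```r
pos_inf  <- 1 / 0       # Inf
neg_inf  <- -1 / 0      # -Inf
not_num  <- 0 / 0       # NaN: the result is undefined
inf_diff <- Inf - Inf   # NaN as well

is.infinite(pos_inf)    # TRUE
is.nan(not_num)         # TRUE
is.na(not_num)          # TRUE: is.na() also reports NaN as missing
is.nan(NA)              # FALSE: but NA is not NaN

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike
finite <- is.finite(c(1, NA, NaN, Inf, -Inf))
```

To keep only ordinary numbers, is.finite() is therefore often the most convenient single test.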