NAs in R: some warnings (and a worked example; calculating standard deviations)

This post shows why passing is.na() or !is.na() straight into a summary function is not a valid way to “clean” a dataset with missing values when we want to compute summary statistics.

A guided example:

Goal for the exercise: Check whether Italy or Germany changes its mind about the US president more often.

Step 1: Let’s download survey data (aggregated at the country level) about what the world has been thinking of US presidents (not just Trump).

The dataset was posted by FiveThirtyEight, and is described here.

# Get the data from Github
approval <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/trump-world-trust/TRUMPWORLD-us.csv")

# Show the first 6 rows of the first 7 variables
head(approval[,1:7])
##   year      avg Canada France Germany Greece Hungary
## 1 2000 67.50000     NA     62      78     NA      NA
## 2 2002 61.50000     72     62      60     NA      NA
## 3 2003 44.69231     63     42      45     NA      NA
## 4 2004 35.66667     NA     37      38     NA      NA
## 5 2005 43.58333     59     43      42     NA      NA
## 6 2006 35.33333     NA     39      37     NA      NA

Step 2: Calculate standard deviations. (There are other ways to answer the question, e.g. look at year-over-year changes, count how many large swings have occurred in each country, etc.)
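As a rough sketch of the alternatives mentioned above, year-over-year changes and large swings can be computed with diff(). The series below is made up for illustration, and the 10-point threshold for a "large swing" is an arbitrary assumption.

```r
# Toy series (hypothetical values), one reading per survey wave
germany <- c(78, 60, 45, 38, 42)

# Change from each wave to the next
changes <- diff(germany)                 # -18 -15 -7 4

# Average absolute change between waves
mean_abs_change <- mean(abs(changes))    # 11

# Number of swings of at least 10 points (threshold is an assumption)
large_swings <- sum(abs(changes) >= 10)  # 2
```

Note that diff() also returns NA wherever one of the two neighbouring readings is missing, so this route needs the same care with NAs as the standard deviation.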

Getting the standard deviation of the approval ratings in Germany works, because that column has no missing values:

sd(approval$Germany)
## [1] 13.33174

Problem: Approval ratings of the sitting US president were not measured every year in Italy, and by default a single NA in the input makes sd() return NA:

sd(approval$Italy)
## [1] NA
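A minimal sketch of this behaviour on a toy vector (the values are made up, not taken from the survey data): NA propagates through sd() and mean(), and anyNA() or sum(is.na()) can be used to check for missing values before summarising.

```r
# Toy vector (hypothetical values) with one missing reading
ratings <- c(62, NA, 45, 38)

# NA propagates through most summary functions by default
sd(ratings)    # NA
mean(ratings)  # NA

# Check for missing values before summarising
has_missing <- anyNA(ratings)       # TRUE
n_missing   <- sum(is.na(ratings))  # 1
```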

Can you just wrap the variable in !is.na()? You shouldn’t!

sd(!is.na(approval$Italy)) # Don't do this
## [1] 0.5072997

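The output is suspicious (clearly wrong): the observed answers span 30 percentage points (from 53 to 83, see the summary below), so the SD cannot be less than one. The problem is that !is.na() returns a vector of TRUE and FALSE values, which sd() treats as ones and zeros. The command above therefore computes the standard deviation of the missingness indicator, not of the ratings.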

summary(approval$Italy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   53.00   63.25   73.00   70.30   76.00   83.00       7
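To make the coercion concrete, here is a small sketch on a toy vector (hypothetical values). It also rebuilds the suspicious number from nothing but the counts of observed and missing readings in the Italy column (10 and 7), which shows that the ratings themselves never enter the calculation.

```r
# Toy vector (hypothetical values): 3 observed readings, 2 missing
x <- c(53, NA, 83, NA, 73)

flags <- !is.na(x)   # TRUE FALSE TRUE FALSE TRUE
as.numeric(flags)    # 1 0 1 0 1

# sd() treats TRUE/FALSE as 1/0, so both calls return the same number
sd_flags      <- sd(flags)
sd_ones_zeros <- sd(as.numeric(flags))
all.equal(sd_flags, sd_ones_zeros)  # TRUE

# 10 observed and 7 missing readings, as in the Italy column
sd_indicator <- sd(rep(c(TRUE, FALSE), times = c(10, 7)))  # about 0.507
```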

The correct standard deviation can be obtained by running

sd(approval$Italy, na.rm = TRUE)
## [1] 9.39326

Or:

observed <- !is.na(approval$Italy)
sd(approval$Italy[observed])
## [1] 9.39326

The variable observed is a logical vector that is TRUE for the positions (rows) where a survey reading is available:

observed
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
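The two correct approaches can be checked against each other. The sketch below uses a toy data frame with made-up values (not the FiveThirtyEight data) and also shows one possible way to get the statistic for several countries in a single call.

```r
# Toy data frame (hypothetical values, not the FiveThirtyEight data)
toy <- data.frame(
  Germany = c(78, 60, 45, 38, 42),
  Italy   = c(76, NA, 60, NA, 73)
)

# Option 1: let sd() drop the NAs
sd_narm <- sd(toy$Italy, na.rm = TRUE)

# Option 2: keep only the observed rows, then compute
observed  <- !is.na(toy$Italy)
sd_subset <- sd(toy$Italy[observed])

all.equal(sd_narm, sd_subset)  # TRUE

# Same statistic for several countries at once
country_sds <- sapply(toy[c("Germany", "Italy")], sd, na.rm = TRUE)
```

Here !is.na() is used only as an index that selects rows, which is fine. The mistake earlier was handing its TRUE/FALSE output to sd() as if it were the data.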

This example illustrates that there are cases where it would be almost impossible to make this mistake in Stata, whose summary commands skip missing values by default, but… things get slightly bumpier with R.

I hope this helps you avoid related mistakes when you encounter NAs in your datasets.
