This post shows why wrapping a variable in !is.na() is not an ideal way to “clean” a dataset with missing values when we want to compute summary statistics.
A guided example:
Goal for the exercise: Check whether Italy or Germany change their mind about the US president more often.
Step 1: Let’s download survey data (aggregated at the country level) about what the world has been thinking of US presidents (not just Trump).
The dataset was posted by FiveThirtyEight, and is described here.
# Get the data from Github
approval <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/trump-world-trust/TRUMPWORLD-us.csv")
# Show the first 6 rows of the first 7 variables
head(approval[,1:7])
##   year      avg Canada France Germany Greece Hungary
## 1 2000 67.50000     NA     62      78     NA      NA
## 2 2002 61.50000     72     62      60     NA      NA
## 3 2003 44.69231     63     42      45     NA      NA
## 4 2004 35.66667     NA     37      38     NA      NA
## 5 2005 43.58333     59     43      42     NA      NA
## 6 2006 35.33333     NA     39      37     NA      NA
Step 2: Calculate standard deviations. (There are other ways to answer the question, e.g. look at year-over-year changes, count how many large swings have occurred in each country, etc.)
Getting the standard deviation for approval rates in Germany works:
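As a minimal sketch of this step: the call is presumably sd(approval$Germany). The snippet below is an assumption that reuses only the six German readings printed by head() above, not the full column, so its result will differ from the one for the whole dataset. The name germany_head is made up for illustration.

```r
# First six Germany readings, copied from the head() output above
# (an illustrative subset, not the full column).
germany_head <- c(78, 60, 45, 38, 42, 37)

# No missing values here, so sd() returns a number directly.
sd_germany_head <- sd(germany_head)
print(sd_germany_head)
```

Because none of these values is NA, sd() has nothing to trip over, which is why the Germany column causes no trouble.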
Problem: Approval ratings of the sitting US president were not measured every year in Italy:
Can you just wrap the variable in !is.na()? You shouldn’t!
sd(!is.na(approval$Italy)) # Don't do this
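The output is suspicious (clearly wrong): the observed ratings span 30 percentage points (from 53 to 83, as the summary of the Italy column shows), so the SD cannot be less than one. The problem is that the command above computes the standard deviation of a vector of ones and zeros (TRUE where a reading exists, FALSE where it is missing), not of the ratings themselves.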
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   53.00   63.25   73.00   70.30   76.00   83.00       7
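To see what went wrong, here is a small sketch that rebuilds the TRUE/FALSE pattern of !is.na(approval$Italy) by hand: 10 observed readings and 7 NAs, matching the NA count in the summary. The vector name observed_flags is made up for illustration. sd() silently treats TRUE as 1 and FALSE as 0, so the result only describes how often Italy was surveyed, not how much approval moved.

```r
# TRUE = reading available, FALSE = NA (17 survey years, 7 of them missing)
observed_flags <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
                    FALSE, FALSE, FALSE, FALSE,
                    TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)

# sd() coerces the logicals to 1/0, so these two calls agree
sd_flags <- sd(observed_flags)
sd_ones_zeros <- sd(as.numeric(observed_flags))

print(sd_flags)  # roughly 0.51, far below any plausible SD of approval ratings
```

A value around one half is what you would expect from a vector that is a bit more than half ones, whatever the underlying ratings happen to be.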
The standard deviation can be obtained by running
observed <- !is.na(approval$Italy)
sd(approval$Italy[observed])
observed marks the positions (rows) where a survey reading is available (TRUE) or missing (FALSE):
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
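A related sketch, not taken from the original code: sd() also has its own na.rm argument, which should give the same answer as subsetting with a logical vector. The ratings below are made-up numbers used purely for illustration, not the Italy column.

```r
# Hypothetical ratings with gaps (made-up numbers, not the Italy column)
ratings <- c(76, NA, 53, 83, NA, 70)

sd_plain  <- sd(ratings)                    # NA: sd() does not skip missing values by default
sd_subset <- sd(ratings[!is.na(ratings)])   # logical subsetting, as in the post
sd_na_rm  <- sd(ratings, na.rm = TRUE)      # sd()'s own na.rm argument

print(c(sd_plain, sd_subset, sd_na_rm))
```

Either form keeps !is.na() where it belongs, as an index into the data rather than as the data itself.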
This example illustrates that there are cases where it would be almost impossible to make a mistake in STATA, but things get slightly bumpier in R.
Hope this helps you avoid related mistakes when you encounter NAs in your datasets.