What every Stata user needs to know: how missing values are treated

This is a post for people who are learning Stata.

A common source of mistakes is generating a binary variable that should classify observations according to a particular condition (for example, tag everyone with income higher than 100K as a "high income individual").

The problem: unless she is careful, an analyst will wrongly tag people whose value is missing, even though it is unknown whether they meet the specified condition.

To test your understanding, generate a variable that is sometimes positive and missing in all other cases. In the simplest case, you can generate a sample of just two observations.

clear  
set obs 2  
gen x =. /*Generate two missing observations*/  
replace x = 1 if _n==2 /*Set observation 2 to one*/  

Then run the following code in Stata, or in your head, and see for yourself if you know what happens:

* Classify x based on its sign
gen below = (x < 0)  
gen above = (x > 0)  

(Many people will generate a variable equal to zero and then run something like replace above=1 if x >0; that will not help.)
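Here is a minimal sketch of why the zero-then-replace variant does not help, reusing the two-observation example. The variable name above2 is only for illustration.

```stata
clear
set obs 2
gen x = .
replace x = 1 if _n == 2

* Start from zero, then switch on where the condition holds
gen above2 = 0
replace above2 = 1 if x > 0

* The missing observation also satisfies x > 0, so it is tagged too
count if above2 == 1 & missing(x)
assert r(N) == 1
```

The assert passes silently, which confirms that exactly one missing observation was tagged.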

The first observation is missing and the second observation is positive, but they are not classified as intended: both end up with below equal to 0 and above equal to 1.

As the Stata website explains:

In the current system, you must be aware that missing values are coded and treated as positive infinity. Once this fact is absorbed, everything is consistent, drop and keep statements work as one would expect, and the logical comparisons make sense.
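To make the "positive infinity" rule concrete, here is a small sketch of how missing values compare with ordinary numbers, again on the two-observation example. Each assert passes silently when the comparison is true. The extended missing values .a through .z are not mentioned in the quote; they sort above the plain missing value.

```stata
clear
set obs 2
gen x = .
replace x = 1 if _n == 2

* Missing sorts above every number, however large
assert . > 1e300
assert !(. < 0)

* Extended missing values sort above the system missing value
assert .a > .
assert .z > .a

* Consequence: a "large values" condition also catches missing values
count if x > 1000
assert r(N) == 1
```

The same logic applies to drop and keep: drop if x > 1000 would remove the missing observation as well.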

The variable below correctly shows that the second observation is not less than zero, but it claims the same thing about the first observation of x, where it is unknowable whether that is true. The variable above is worse: it reports the missing first observation as positive.

Labeling the classification outcomes should make the problem obvious:

label define xClassification 0 "no" 1 "yes"  
label val below xClassification  
label val above xClassification  
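A sketch of how to look at the labeled result, rebuilding the example from scratch so it runs on its own:

```stata
clear
set obs 2
gen x = .
replace x = 1 if _n == 2
gen below = (x < 0)
gen above = (x > 0)

label define xClassification 0 "no" 1 "yes"
label val below xClassification
label val above xClassification

* Observation 1 has x missing, yet it is listed as below = no, above = yes
list x below above

* Both observations are counted as "yes", including the missing one
tabulate above
```

Seeing "yes" next to a missing x is usually enough to notice that something went wrong.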

To classify each observation properly, run the following instead:

drop below  
drop above  
gen below = (x < 0) if !missing(x)  
gen above = (x > 0) if !missing(x)  
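To check that the corrected commands do what they should, a couple of assert statements can be added. This is a sketch on the same two-observation data; assert stops with an error if a condition fails and is silent otherwise.

```stata
clear
set obs 2
gen x = .
replace x = 1 if _n == 2
gen below = (x < 0) if !missing(x)
gen above = (x > 0) if !missing(x)

* Missing x now yields missing classifications, not a 0 or a 1
assert missing(below) & missing(above) if missing(x)

* The non-missing observation is classified as before
assert below == 0 & above == 1 if x == 1
```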

Based on the example, I'd recommend the following:

Tip 1

When you generate a variable based on a condition on another variable, execute your command only for the non-missing cases, as above: gen above = (x > 0) if !missing(x), which is equivalent to:

gen above = (x > 0) if ~missing(x)

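Or (written as x < . rather than x != . so that the extended missing values .a through .z are excluded too):

gen above = (x > 0) if x < .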
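One caveat when comparing against . directly: Stata also has extended missing values, .a through .z, which sort above the plain missing value. The sketch below uses a hypothetical three-row dataset (the ok_* variable names are mine) to show how !missing(x), x < . and x != . behave.

```stata
clear
set obs 3
gen x = 1 in 1
replace x = .a in 3

gen byte ok_missing = !missing(x)
gen byte ok_lt      = (x < .)
gen byte ok_ne      = (x != .)

* Row 1 (x = 1): all three checks agree that x is non-missing
assert ok_missing == 1 & ok_lt == 1 & ok_ne == 1 in 1

* Row 2 (x = .): all three checks agree that x is missing
assert ok_missing == 0 & ok_lt == 0 & ok_ne == 0 in 2

* Row 3 (x = .a): x != . wrongly treats the value as non-missing
assert ok_missing == 0 & ok_lt == 0 & ok_ne == 1 in 3

list x ok_missing ok_lt ok_ne
```

If your data never contain extended missing values, the three forms give the same result; !missing(x) is the one that is safe either way.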

Tip 2

Run the mdesc command (user-written; install it with ssc install mdesc) to see which variables contain missing values. If nothing is missing, you are in good shape. Tidy datasets will make much of your work easier and you'll make fewer mistakes.
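A short sketch of checking for missing values, using the auto dataset that ships with Stata as a stand-in for your own data. It assumes an internet connection if mdesc still has to be installed; misstable summarize and count are built in and need nothing extra.

```stata
sysuse auto, clear

* Install mdesc from SSC only if it is not already available
capture which mdesc
if _rc {
    ssc install mdesc
}

* One line per variable: number and percent of missing values
mdesc

* Built-in alternative: reports only the variables with missing values
misstable summarize

* Or count the missing values of a single variable
count if missing(rep78)
assert r(N) == 5
```

In the auto data, rep78 is the only variable with missing values, so that is the one to keep an eye on in any condition.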

Tip 3

Unless it becomes too cumbersome, you could generate variables by listing the specific values that qualify, e.g., for an x that only takes the values 1, 2 and 3:

gen twoOrMore = (x==2 | x==3)

instead of

gen twoOrMore = (x >= 2)
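To see the difference between listing values and using an open-ended inequality, here is a sketch on hypothetical data where x takes the values 1, 2 and 3 plus one missing value. The variable names byList and byInequality are only for illustration, and inlist() is simply a shorter way of writing a chain of == comparisons joined by |.

```stata
clear
set obs 4
gen x = _n if _n <= 3

* Explicit list of qualifying values: a missing x can never match
gen byList = inlist(x, 2, 3)

* Open-ended inequality: a missing x counts as larger than 2
gen byInequality = (x >= 2)

assert byList == 0 if missing(x)
assert byInequality == 1 if missing(x)

* The two versions agree wherever x is observed
assert byList == byInequality if !missing(x)
```

Note that the explicit version sets the missing case to 0 rather than to missing; adding if !missing(x), as in Tip 1, keeps it missing instead.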