What every STATA user needs to know - how missing values are treated This is a post for people who are learning Stata.

A common source of mistakes is generating a binary variable that should classify observations according to a particular condition (for example, tag everyone with income higher than 100K as a "high income individual"). The problem: a Stata user may incorrectly classify/tag people who do not meet the specified condition (unless she is careful).

To test your understanding, generate a variable that is sometimes positive and contains missing values in all other cases. As a simplest case, generate a sample of just two observations:

clear
set obs 2
gen x = . /*Generate two missing observations*/
replace x = 1 if _n==2 /*Set observation 2 to one*/

Then run the following code in Stata, or in your head, and see for yourself if you know what happens:

* Classify x based on its sign
gen below = (x < 0)
gen above = (x > 0)

(Many people will generate a variable equal to zero and then run something like replace above=1 if x >0; that will not help.)

The first observation is missing and the second observation is positive. Have they been classified as intended? No.

As the Stata website explains:

In the current system, you must be aware that missing values are coded and treated as positive infinity. Once this fact is absorbed, everything is consistent, drop and keep statements work as one would expect, and the logical comparisons make sense.

The variable below appropriately shows that the second observation is not less than zero, but it claims the same thing about the first observation of x. In fact, it's unknowable whether that is true, so for the first observation the variable below should been missing. Labeling the classification outcomes should make the problem obvious:

label define xClassification 0 "no" 1 "yes"
label val below xClassification
label val above xClassification That's wrong because if the value of x is unknown for the first observation in the dataset, then a new variable should not treat x as if it contained a known (positive) value.

To classify each observation properly, run the following instead:

drop below
drop above
gen below = (x < 0) if !missing(x)
gen above = (x > 0) if !missing(x) This should bring home that if you are not absolutely sure you know what each line of code is doing, it will be a good idea to check the dataset and inspect new variables that you generated. Based on the pitfall shown in the simple example, here are a few suggestions.

Tip 1

When you generate variables based on a condition contained in an existing variable, execute your command only for the non-missing cases, as above: gen above = (x > 0) if !missing(x) which is equivalent to:

gen above = (x > 0) if ~missing(x)

Or:

gen above = (x > 0) if x != .

(This is safer than generating wrong values for some observations and then fixing those instances by replacing them with missing values.)

Tip 2

Unless it becomes too cumbersome, you could replace values with specific conditions, e.g. if you know that the only positive values are 1 and 2, this would work:

gen above_zero = .
replace above_zero = 1 if x==1 | x==2

But that obviously only works with discrete values of x.

Tip 3

Run the mdesc command to see which variables contain missing values. If nothing is missing, no need to worry about misclassification of empty cells.

Tip 4: use recode

Suppose you have 10 observations

clear
set obs 10
* Draw from U~[-5,5]
set seed 10009
gen x = runiform(-5,5)

Half of the observations are positive:

summarize x if x > 0

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
x |          5    3.467536    1.450541   1.894014   4.943838

We could recode the x variable, but it's better use recode ...,gen() like this:

recode x (-5/0= 0 "Negative") (0/max = 1 "Positive"), gen(above)

Check that this worked as intended:

tab above

RECODE of x |      Freq.     Percent        Cum.
------------+-----------------------------------
Negative |          5       50.00       50.00
Positive |          5       50.00      100.00
------------+-----------------------------------
Total |         10      100.00

And double-checking:

tab above if x > 0

RECODE of x |      Freq.     Percent        Cum.
------------+-----------------------------------
Positive |          5      100.00      100.00
------------+-----------------------------------
Total |          5      100.00