Q: (Missing data) My data set has missing values. For example, I am using X to model Y but for some cases I don't have a value for X or Y. What should I do about this?

A: There is so much written on missing data that it would be wasteful to attempt a complete answer here. Popular references are Little & Rubin (1989), Rubin (1987), and Schafer (1997). Allison (2002) gives a shorter and less technical treatment geared to social scientists.
    The essential facts are set out below, with some elaboration that is less technical than that in the printed sources.


It matters why your data are missing. Suppose you are modeling weight (Y) as a function of sex (X). Some respondents wouldn't disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:

  1.  There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing has no relationship to X or Y. Such data are said to be missing completely at random (MCAR).
  2.  One sex may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are missing at random (MAR).
  3.  Heavy (or light) people may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself. Such data are not missing at random (NMAR). (In another setting, we would say that Y suffers from selection bias.)

If your data are MCAR or MAR, you can ignore the missing-data mechanism and use multiple imputation or maximum likelihood. (Actually, one more condition is required to make the missing-data mechanism ignorable, but this condition is almost always met in practice.)

If your data are NMAR, you can't ignore the missing data mechanism; two approaches to NMAR data are selection models and pattern mixture.
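To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is invented for illustration: the group means, the missingness probabilities, the 165-pound cutoff, and the function name bias_among_men. It compares the mean disclosed weight among men with the mean weight of all men, disclosed or not.

```python
import random
import statistics

def bias_among_men(mechanism, n=40000, seed=1):
    """Observed-minus-true mean weight among men under one mechanism.

    All numbers (group means, missingness probabilities) are invented.
    """
    rng = random.Random(seed)
    true_y, seen_y = [], []
    for _ in range(n):
        male = rng.random() < 0.5                 # X: sex
        y = rng.gauss(180 if male else 150, 25)   # Y: weight in pounds
        if mechanism == "MCAR":
            p_missing = 0.3                       # unrelated to X or Y
        elif mechanism == "MAR":
            p_missing = 0.5 if male else 0.1      # depends only on X
        elif mechanism == "NMAR":
            p_missing = 0.5 if y > 165 else 0.1   # depends on Y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        disclosed = rng.random() >= p_missing
        if male:
            true_y.append(y)
            if disclosed:
                seen_y.append(y)
    return statistics.mean(seen_y) - statistics.mean(true_y)

for mechanism in ("MCAR", "MAR", "NMAR"):
    print(mechanism, round(bias_among_men(mechanism), 2))
```

Under MCAR and MAR the difference should be close to zero, because within a sex the men who disclose are a random subset of all men. Under NMAR it should be clearly negative, because heavier men are under-represented among those who disclose. That is the sense in which the mechanism can be ignored in the first two cases but not in the third.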


Multiple imputation. If your data are MAR or MCAR, one of the best methods to use is multiple imputation. Suppose we wish to estimate a model from a sample with missing values.

There are two sources of variation or uncertainty in the parameter estimates obtained from a single imputed data set: sampling variation and imputation variation. To estimate imputation variation, we have to carry out multiple imputations: the variation between the parameter estimates from different imputations is an estimate of imputation variance. A special formula allows us to combine our estimates of sampling variance and imputation variance to estimate the total variance of the parameter estimates.
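The combining formula is commonly known as Rubin's rules (Rubin 1987). The sketch below applies it to a single parameter; the function name pool and its argument names are my own, and it assumes you have already fitted the same model to each of m imputed data sets and saved each estimate with its squared standard error.

```python
import math
import statistics

def pool(estimates, variances):
    """Combine one parameter's results from m imputed data sets.

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # sampling variance
    between = statistics.variance(estimates)   # imputation variance
    total = within + (1 + 1 / m) * between     # total variance
    return pooled, math.sqrt(total)

# Hypothetical slope estimates and squared standard errors, m = 5.
estimate, std_error = pool([2.1, 1.8, 2.4, 2.0, 2.2],
                           [0.30, 0.28, 0.33, 0.31, 0.29])
print(estimate, std_error)
```

The factor (1 + 1/m) inflates the between-imputation variance to allow for the fact that only a finite number of imputations were carried out.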

The question remains: How do we impute plausible values for the missing data? Suppose you have a case where Y is missing. You know the X value for that case, and from other cases you can estimate the way that Y depends on X. Using this information, you can generate a distribution of plausible values for the missing Y value, and draw from that distribution at random. The value you draw is your imputed value.
    The imputed value is affected by two sources of random variation:

The best imputation methods account for both sources of variation. This is why the imputed data sets in a multiple imputation contain different imputed values.
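As a rough illustration, the sketch below draws one imputed Y for a case whose X is known. It rests on assumptions of mine rather than on anything stated above: that Y follows a normal linear regression on X, and that the two sources of variation are (1) uncertainty about the estimated regression and (2) residual scatter around the regression line. The function name impute_y is hypothetical.

```python
import math
import random
import statistics

def impute_y(x_obs, y_obs, x_new, rng):
    """Draw one plausible Y for a case with X = x_new.

    x_obs, y_obs: complete cases; rng: a random.Random instance.
    Assumes a normal linear regression of Y on X (my assumption).
    """
    n = len(x_obs)
    x_bar = statistics.mean(x_obs)
    y_bar = statistics.mean(y_obs)
    sxx = sum((x - x_bar) ** 2 for x in x_obs)
    slope = sum((x - x_bar) * (y - y_bar)
                for x, y in zip(x_obs, y_obs)) / sxx
    sse = sum((y - y_bar - slope * (x - x_bar)) ** 2
              for x, y in zip(x_obs, y_obs))

    # Source 1: the regression is only estimated, so draw a plausible
    # residual variance, line height, and slope rather than reusing the fit.
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2)  # SSE / chi-square(n - 2)
    centre = rng.gauss(y_bar, math.sqrt(sigma2 / n))
    slope_draw = rng.gauss(slope, math.sqrt(sigma2 / sxx))

    # Source 2: even if the line were known, Y would scatter around it.
    noise = rng.gauss(0, math.sqrt(sigma2))
    return centre + slope_draw * (x_new - x_bar) + noise

# Made-up complete cases: X is sex (0/1), Y is weight in pounds.
rng = random.Random(0)
xs = [i % 2 for i in range(200)]
ys = [150 + 30 * x + rng.gauss(0, 25) for x in xs]
print([round(impute_y(xs, ys, 1, rng), 1) for _ in range(5)])
```

Calling impute_y repeatedly for the same case gives a different value each time, which is why the imputed data sets differ from one another. Dropping either source, for example by always imputing the fitted value, would make the imputed data look more certain than they are.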


Maximum likelihood estimation. This is not the place for a technical discussion of maximum likelihood estimation with missing data. Excellent discussions are given in Allison (2002) and Little & Rubin (1989). Here we offer a heuristic explanation that gives a feeling for why the technique works.
    You can think of multiple imputation as an approximation to maximum likelihood. In multiple imputation, you try a few plausible values wherever you have missing data. In maximum likelihood, you integrate over all possible data values, giving more weight to values that are more plausible. So the results of maximum likelihood estimation are what you would get if you performed an infinite number of imputations.
    The disadvantage of maximum likelihood is that, at least as implemented in most software, it often makes quite restrictive assumptions about the distribution of the missing data. If you can live with those assumptions, then maximum likelihood is ideal. If not, then multiple imputation is more flexible.


Methods appropriate when data are not missing at random. If you think that Y may be missing in part because of the unobserved value of Y, then your data are NMAR and you can't ignore the missing-data mechanism. To review the example given earlier, suppose Y is weight in pounds; heavy people may be less inclined to report their weight. So the value of Y affects whether Y is missing; the data are NMAR. Two possible approaches when data are NMAR are selection models and pattern mixture.

Selection models. Social researchers have traditionally dealt with NMAR data by using selection models. In a selection model, you simultaneously model Y and the probability that Y is missing. Unfortunately, a number of practical difficulties are often encountered in estimating selection models.

Pattern mixture (Rubin 1987). When data are NMAR, an alternative to selection models is multiple imputation with pattern mixture. In this approach, you perform multiple imputations under a variety of assumptions about the missing-data mechanism. In ordinary multiple imputation, you assume that those people who report their weights are similar to those who don't. In a pattern-mixture model, you may assume that people who don't report their weights are, on average, 20 pounds heavier. This is of course an arbitrary assumption; the idea of pattern mixture is to try out a variety of plausible assumptions and see how much they affect your results.
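A rough version of this sensitivity analysis can be sketched as follows. It is a deliberately simplified illustration, not a full pattern-mixture analysis: it estimates only a mean weight, it imputes from a normal distribution fitted to the respondents without allowing for uncertainty in that fit, and the names sensitivity and deltas are my own.

```python
import random
import statistics

def sensitivity(observed, n_missing, deltas, m=20, seed=0):
    """Estimated mean weight under each assumed shift for nonrespondents.

    observed:  weights that were reported
    n_missing: number of people who did not report a weight
    deltas:    assumed average differences, nonrespondent minus respondent
    """
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    results = {}
    for delta in deltas:
        # Reuse the same random draws for every delta, so that the
        # results differ only because of the assumed shift.
        rng = random.Random(seed)
        means = []
        for _ in range(m):
            # Impute as if nonrespondents resembled respondents, then shift.
            imputed = [rng.gauss(mu, sd) + delta for _ in range(n_missing)]
            means.append(statistics.mean(observed + imputed))
        results[delta] = statistics.mean(means)
    return results

# Made-up data: 150 reported weights and 50 nonrespondents.
data_rng = random.Random(1)
reported = [data_rng.gauss(165, 25) for _ in range(150)]
for delta, mean_weight in sensitivity(reported, 50, [0, 10, 20]).items():
    print(delta, round(mean_weight, 1))
```

A delta of 0 corresponds to ordinary multiple imputation. If your conclusions change little across the plausible range of deltas, they are not very sensitive to the assumption about nonrespondents; if they change a lot, that sensitivity is worth reporting.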

Although pattern mixture strikes me as a more natural, flexible, and interpretable approach, it appears that social researchers more often use selection models. Perhaps this is because selection models are about 10 years older and were developed by social researchers (notably the economist James Heckman). I suspect that pattern mixture has greater potential than its rare use in the literature would suggest.


Special software routines are needed to implement the best methods. Social researchers sometimes try to implement imputation methods using, for example, SPSS. This is quite difficult to do, and the results are unlikely to be as good as those obtained from custom software. Programs for both multiple imputation and maximum likelihood are reviewed in Allison (2002). Software for multiple imputation only is reviewed by Horton & Lipsitz (2001) and tracked on two websites: www.stat.psu.edu/~jls/misoftwa.html and www.multiple-imputation.com.
    In OSU's sociology department, the most accessible programs are probably the multiple imputation procedures in SAS (PROC MI and PROC MIANALYZE) and the maximum likelihood procedure in AMOS. There are examples of using both programs to handle missing data in Allison (2002), pp. 25-26 (AMOS) and 41-47 (SAS). Both SAS and AMOS assume that the missing values follow a conditional normal distribution.

I do not recommend the impute command in Stata, because it omits residual variation. I do not recommend the Missing Values Analysis add-on to SPSS, for reasons given in a published review.

Both SAS and AMOS assume that data are MAR.
    If data are NMAR, researchers may wish to use the simple selection models implemented in Stata (pertinent routines are heckman and heckprob) or the larger library of selection models in LIMDEP. I am not aware of software specifically for pattern mixture, though a rough version can be implemented by editing the imputed values from SAS PROC MI.


Many old missing-data methods, some still in wide use, are terrible! Despite the growing availability of software for multiple imputation and maximum likelihood, much published research continues to use methods that are known to produce biased results. The following are some apparently plausible things you definitely shouldn't do. For simplicity's sake, we assume you are modeling Y as a function of X, and are only missing values for X.

Don't be led astray! Although you still see these and other ad hoc methods in published research, they are not reliable. They often give you worse results than you would get by simply deleting any cases with missing values.

The following methods are improvements on those above, but still not as good as maximum likelihood or multiple imputation:


References

Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage.
Little, RJA & Rubin, DB. (1990). Statistical analysis with missing data. New York: Wiley.
Horton, NJ & Lipsitz, SR. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3): 244-254.
Rubin, DB. (1987). Multiple imputation for survey nonresponse. New York: Wiley.
Schafer, JL. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.