A: There is so much written on missing data that it would be
wasteful to attempt a complete answer here. Popular references are Little
& Rubin (1989), Rubin (1987), and Schafer
(1997). Allison (2002) gives a shorter and less
technical treatment geared to social scientists.
The following are some essential facts.
It matters why your data are missing. Suppose you are modeling weight (Y) as a function of sex (X). Some respondents wouldn't disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:
If your data are NMAR, you can't ignore the missing data mechanism; two approaches to NMAR data are selection models and pattern mixture.
Multiple imputation. If your data are MAR or MCAR, one of the best methods to use is multiple imputation. Suppose we wish to estimate a model from a sample with missing values.
The question remains: How do we impute plausible values for the missing
data? Suppose you have a case where Y is missing. You know the X value
for that case, and from other cases you can estimate the way that Y depends
on X. Using this information, you can generate a distribution of plausible
values for the missing Y value, and draw from that distribution at random.
The value you draw is your imputed value.
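A minimal sketch of such a draw, in Python with NumPy. It assumes a linear regression of Y on X with normal errors, which the text above does not specify. The function name `impute_once` and the simulated weight and sex data are hypothetical, chosen only for illustration. The two labelled sources of randomness (uncertainty in the fitted regression, and scatter around it) are likewise an assumption of this sketch.

```python
import numpy as np

def impute_once(x, y, rng):
    """Return a copy of y with each missing value (NaN) replaced by one random draw."""
    missing = np.isnan(y)
    n_obs = int((~missing).sum())
    n_mis = int(missing.sum())

    # Estimate how Y depends on X from the complete cases (intercept + slope).
    X_obs = np.column_stack([np.ones(n_obs), x[~missing]])
    beta_hat, *_ = np.linalg.lstsq(X_obs, y[~missing], rcond=None)
    resid = y[~missing] - X_obs @ beta_hat

    # Assumed source 1: the regression is only an estimate, so draw a
    # plausible residual variance and plausible coefficients.
    sigma2 = (resid @ resid) / rng.chisquare(n_obs - 2)
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(X_obs.T @ X_obs))

    # Assumed source 2: individuals scatter around the regression line.
    X_mis = np.column_stack([np.ones(n_mis), x[missing]])
    y_completed = y.copy()
    y_completed[missing] = X_mis @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n_mis)
    return y_completed

# Hypothetical data: weight in pounds (Y) as a function of sex (X, coded 0/1).
rng = np.random.default_rng(0)
sex = rng.integers(0, 2, size=200).astype(float)
weight = 140.0 + 30.0 * sex + rng.normal(0.0, 20.0, size=200)
weight[rng.random(200) < 0.25] = np.nan  # some respondents do not disclose

completed = impute_once(sex, weight, rng)
```

Calling `impute_once` again gives different imputed values for the same cases. That is the point: repeating the draw several times is what makes the imputation "multiple".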
The imputed value is affected by two sources of
random variation:
Maximum likelihood estimation.
This is not the place for a technical discussion of maximum likelihood
estimation with missing data. Excellent discussions are given in Allison
(2002) and Little & Rubin (1989).
Here we offer a heuristic explanation that gives a feeling for why the
technique works.
You can think of multiple imputation as an approximation
to maximum likelihood. In multiple imputation, you try a few plausible
values wherever you have missing data. In maximum likelihood, you integrate
over all possible data values, giving more weight to values that
are more plausible. So the results of maximum likelihood estimation are
what you would get if you performed an infinite number of multiple imputations.
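In symbols, the heuristic can be sketched as follows. The notation is introduced here for illustration and is not taken from the references: Y_obs and Y_mis are the observed and missing values, θ the model parameters, and m the number of imputations. The sketch also assumes the data are MAR, so that the missing data mechanism can be left out of the likelihood.

```latex
% Maximum likelihood: integrate the complete-data density over the missing values.
L(\theta \mid Y_{\mathrm{obs}})
  = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\, dY_{\mathrm{mis}}

% Multiple imputation: replace the integral with m random draws
% Y_mis^{(1)}, ..., Y_mis^{(m)} and average the complete-data estimates.
\bar{\theta}_m
  = \frac{1}{m} \sum_{j=1}^{m}
    \hat{\theta}\bigl(Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(j)}\bigr)
```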
The disadvantage of maximum likelihood is that,
at least as implemented in most software, it often makes quite restrictive
assumptions about the distribution of the missing data. If you can live
with those assumptions, then maximum likelihood is ideal. If not, then
multiple imputation is more flexible.
Methods appropriate when data are not missing at random. If you think that Y may be missing in part because of the unobserved value of Y, then your data are NMAR and you can't ignore the missing data mechanism. To review the example given earlier, suppose Y is weight in pounds; people who are heavy may be less inclined to report their weight. So the value of Y affects whether Y is missing; the data are NMAR. Two possible approaches when data are NMAR are selection models and pattern mixture.
Selection models. Social researchers have traditionally dealt with NMAR data by using selection models. In a selection model, you simultaneously model Y and the probability that Y is missing. Unfortunately, a number of practical difficulties are often encountered in estimating selection models.
Pattern mixture (Rubin 1987). When data are NMAR, an alternative to selection models is multiple imputation with pattern mixture. In this approach, you perform multiple imputations under a variety of assumptions about the missing data mechanism. In ordinary multiple imputation, you assume that those people who report their weights are similar to those who don't. In a pattern-mixture model, you may assume that people who don't report their weights are, on average, 20 pounds heavier than those who do. This is of course an arbitrary assumption; the idea of pattern mixture is to try out a variety of plausible assumptions and see how much they affect your results.
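A rough sketch of that idea in Python with NumPy. Everything here is hypothetical: the simulated weights, the helper `shift_imputed`, and the list of offsets. The "ordinary" imputation step is a deliberately crude stand-in (draws from a normal distribution fitted to the reported weights) rather than the output of a real multiple imputation routine.

```python
import numpy as np

def shift_imputed(completed, was_missing, delta):
    """Add delta (in pounds) to the imputed values only; leave reported values alone."""
    shifted = completed.copy()
    shifted[was_missing] += delta
    return shifted

# Hypothetical sample: 500 weights, about 30% not reported.
rng = np.random.default_rng(1)
n = 500
weight = rng.normal(160.0, 25.0, size=n)
was_missing = rng.random(n) < 0.3
weight[was_missing] = np.nan
reported = weight[~was_missing]

# Crude stand-in for ordinary multiple imputation: m completed data sets,
# each assuming nonreporters resemble reporters.
m = 5
completed_sets = []
for _ in range(m):
    completed = weight.copy()
    completed[was_missing] = rng.normal(
        reported.mean(), reported.std(ddof=1), size=int(was_missing.sum())
    )
    completed_sets.append(completed)

# Pattern-mixture step: assume nonreporters are delta pounds heavier on average,
# then see how much the estimated mean weight moves.
results = {}
for delta in (0.0, 10.0, 20.0, 30.0):
    means = [shift_imputed(c, was_missing, delta).mean() for c in completed_sets]
    results[delta] = float(np.mean(means))

for delta, mean_weight in results.items():
    print(f"assumed offset {delta:4.0f} lb -> estimated mean weight {mean_weight:6.1f} lb")
```

If the conclusions barely change across the offsets you consider plausible, the NMAR concern matters little for your analysis; if they change a lot, that sensitivity is itself worth reporting.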
Although pattern mixture strikes me as a more natural, flexible, and interpretable approach, it appears that social researchers more often use selection models. Perhaps this is because selection models are about 10 years older and were developed by social researchers (notably the economist James Heckman). I suspect that pattern mixture has greater potential than its rare use in the literature would suggest.
Special
software routines are needed to implement the best methods. Social
researchers sometimes try to implement imputation methods using, for example,
SPSS. This is quite difficult to do, and the results are unlikely to be
as good as those obtained from custom software. Programs for both multiple
imputation and maximum likelihood are reviewed in Allison
(2002). Software for multiple imputation only is reviewed by Horton
& Lipsitz (2001) and tracked on two websites: www.stat.psu.edu/~jls/misoftwa.html
and www.multiple-imputation.com.
In OSU's sociology department, the most accessible
programs are probably the multiple imputation procedures in SAS (PROC MI and
PROC MIANALYZE) and the
maximum likelihood procedure in AMOS. There are examples of using both
programs to handle missing data in Allison (2002),
pp. 25-26 (AMOS) and 41-47 (SAS). Both SAS and AMOS assume that the missing
values follow a conditional normal distribution.
I do not recommend the impute command in Stata, because it omits residual variation. I do not recommend the Missing Values Analysis add-on to SPSS, for reasons given in a published review.
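Why omitting residual variation matters can be seen in a small simulation. This is a sketch on hypothetical simulated data, not a reproduction of any particular package's algorithm: it compares filling in the predicted value alone with filling in the predicted value plus a random residual.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # true variance of y is 1.25
miss = rng.random(n) < 0.5               # half of the y values are treated as missing

# Fit the regression of y on x using the complete cases only.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[~miss], y[~miss], rcond=None)
resid_sd = np.std(y[~miss] - X[~miss] @ beta, ddof=2)
pred = X[miss] @ beta

# Imputation without residual variation: every imputed value sits on the line.
deterministic = y.copy()
deterministic[miss] = pred

# Imputation with residual variation: add random scatter around the line.
stochastic = y.copy()
stochastic[miss] = pred + rng.normal(0.0, resid_sd, size=int(miss.sum()))

var_true = y.var(ddof=1)
var_det = deterministic.var(ddof=1)
var_sto = stochastic.var(ddof=1)
print(f"true {var_true:.2f}, without residuals {var_det:.2f}, with residuals {var_sto:.2f}")
```

Without the residual term, the completed data are less variable than the real data, so variances are understated and correlations with X are overstated.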
Both SAS and AMOS assume that
data are MAR.
If data are NMAR, researchers may wish to use the
simple selection models implemented in Stata (pertinent routines are heckman
and heckprob) or the larger library of selection models in LIMDEP. I am
not aware of software specifically for pattern mixture, though a rough
version can be implemented by editing the imputed values from SAS PROC
MI.
Many old missing-data methods, some still in wide use, are terrible! Despite the growing availability of software for multiple imputation and maximum likelihood, much published research continues to use methods that are known to produce biased results. The following are some apparently plausible things you definitely shouldn't do. For simplicity's sake, we assume you are modeling Y as a function of X, and are only missing values for X.
Don't be led astray! Although you still see these and other ad hoc methods in published research, they are not reliable. They often give you worse results than you would get by simply deleting any cases with missing values.
The following methods are improvements on those above, but still not as good as maximum likelihood or multiple imputation: