Q: (Imputing categories) I'm using SAS PROC MI for multiple imputation. The software assumes that imputed values have a conditional normal distribution, but some of the variables I'm imputing represent categories. What can I do about this?

A: This is a fairly hard question. Suppose that one of your variables is the dichotomy GENDER. Define a dummy variable X that is 1 for men and 0 for women. If X has missing values and you impute them using SAS PROC MI, the imputed values will typically be fractions such as .80.

Common advice is to round the imputed values (e.g., Allison 2002). For example, a value of .80 would be rounded to 1. It has been shown, however, that such rounding can produce substantial bias, particularly when there is a lot of missing data, or when well over half the cases have X=1 (or X=0) (Horton, Lipsitz, & Parzen 2003; Allison 2005). For this reason, I no longer recommend rounding imputed values.

It is important to distinguish between two goals:

  1. You want the imputed values to be plausible. When X is gender, as above, you want all the imputed values to be 0 or 1.
  2. When the imputed values are used in an analysis, you want the analytic results to be unbiased.

When using SAS PROC MI, these goals may not be compatible.

  1. If you round, you get plausible imputations but may get biased results (Horton, Lipsitz, & Parzen 2003).
  2. If you don't round, you get implausible imputations but approximately unbiased results (Horton, Lipsitz, & Parzen 2003; Allison 2005).

Since the analytic results are usually more important than the imputed values themselves, I wouldn't round.
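
The trade-off can be illustrated with a small simulation. The sketch below is in Python rather than SAS and does not reproduce what PROC MI does: it assumes a dummy variable X with 80% ones, deletes half the values completely at random, and imputes from a normal distribution fitted to the observed values, ignoring covariates and parameter uncertainty. The function name and all parameter values are made up for illustration.

```python
import random
import statistics


def simulate(p=0.8, n=200_000, frac_missing=0.5, seed=1):
    """Compare the mean of X after unrounded vs. rounded normal imputation."""
    rng = random.Random(seed)
    # True dummy variable: 1 with probability p, else 0.
    x = [1 if rng.random() < p else 0 for _ in range(n)]
    # Delete values completely at random.
    missing = [rng.random() < frac_missing for _ in range(n)]
    observed = [xi for xi, m in zip(x, missing) if not m]

    # Simplified normal imputation model (an assumption of this sketch):
    # draw from N(mean, sd) of the observed values.
    mu = statistics.fmean(observed)
    sd = statistics.pstdev(observed)
    draws = [rng.gauss(mu, sd) for m in missing if m]

    total = len(observed) + len(draws)
    # Unrounded: keep fractional imputations as they are.
    unrounded = (sum(observed) + sum(draws)) / total
    # Rounded: imputations of .5 or more become 1, the rest become 0.
    rounded = (sum(observed) + sum(1 if d >= 0.5 else 0 for d in draws)) / total
    return {"true_p": p, "unrounded": unrounded, "rounded": rounded}


result = simulate()
print(result)
```

Under these assumptions, the unrounded mean should land close to the true proportion, while the rounded mean should fall noticeably below it, because too few of the normal draws lie at or above .5. The size and direction of the bias depend on the proportion and on the amount of missing data.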

There does exist software that can impute dummy variables without giving unrealistic values (e.g., IVEware for SAS, or ICE for Stata), but that software is harder to use.
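
One way to get imputations that are always 0 or 1 is to draw each missing value from a Bernoulli distribution instead of a normal one. The Python sketch below shows only the simplest case, with no covariates, where the probability is just the observed proportion of ones; it is not the algorithm of any particular package, it ignores parameter uncertainty, and `bernoulli_impute` is a hypothetical helper rather than a library function.

```python
import random


def bernoulli_impute(values, seed=1):
    """Fill None entries with 0/1 draws (hypothetical helper, not a library API)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    # Observed proportion of cases with X = 1.
    p_hat = sum(observed) / len(observed)
    # Keep observed values; replace each missing value with a Bernoulli(p_hat) draw.
    return [v if v is not None else int(rng.random() < p_hat) for v in values]


x = [1, 0, None, 1, 1, None, 0, 1, None, 1]
completed = bernoulli_impute(x)
print(completed)
```

Every imputed value is a legitimate category, and on average the imputed values have the same proportion of ones as the observed values.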

References

Horton, N.J., Lipsitz, S.R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician 57(4), 229-232.

Allison, P. (2005). Imputation of categorical variables with PROC MI. 30th meeting of SAS Users Group International (SUGI 30). Philadelphia, PA.

Allison, P. (2002). Missing Data. Thousand Oaks, CA: Sage.