Q: (Recoding a quantitative variable into categories)
My model includes a regressor
X (for example age) that is measured in quantitative units (for example
years). Results using this regressor are confusing. When I break
X into artificial categories (20-29, 30-39, etc.), it seems to affect the
response variable. But when I use X as originally coded, the effect goes
away.
A: What you most likely have is a non-linear relationship.
Your first result tells you that X has an effect. Your second result tells
you that the effect is not linear.
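A small simulation can make this concrete. The sketch below uses made-up data, not data from any real study: the response is assumed to be U-shaped in age, lowest around 45. The variable names and the coefficient 0.01 are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: the response is high at both ends of the age range
# and low in the middle, so the relationship is not linear.
ages = [random.uniform(20, 70) for _ in range(2000)]
y = [0.01 * (a - 45) ** 2 + random.gauss(0, 1) for a in ages]


def pearson_r(u, v):
    """Pearson correlation, i.e. the strength of the *linear* association."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


# X as originally coded: the linear association is close to zero.
linear_r = pearson_r(ages, y)

# X broken into 10-year categories: the category means clearly differ.
decade_means = {}
for lo in range(20, 70, 10):
    group = [yi for a, yi in zip(ages, y) if lo <= a < lo + 10]
    decade_means[f"{lo}-{lo + 9}"] = statistics.fmean(group)

print(f"linear correlation: {linear_r:.3f}")
for label, mean in decade_means.items():
    print(f"mean response, age {label}: {mean:.2f}")
```

In this sketch the categories pick up an effect that the linear coding misses, which is the pattern described in the question.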
Non-linearity is actually quite common, even when
it does not make itself obvious this way, and it is an often overlooked
aspect of model specification. Even if a model contains all plausible confounding
variables, those variables may not be fully controlled if their effects
are falsely assumed to be linear.
A common visual check for non-linearity is
to fit the model using X in its original form, then plot the residuals
against X. Sometimes the residual pattern will be curved, but since residuals
are by definition noisy, it may be difficult to see any but the most obvious
curvature.
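One way to make the residual check less dependent on eyeballing a noisy scatterplot is to average the residuals within bins of X. This is a suggestion rather than a standard procedure, and the sketch below uses simulated data and arbitrary 10-unit bins; all names in it are hypothetical.

```python
import random
import statistics

random.seed(1)

# Hypothetical data with a curved (U-shaped) effect of x on y.
x = [random.uniform(20, 70) for _ in range(1000)]
y = [0.01 * (xi - 45) ** 2 + random.gauss(0, 1) for xi in x]

# Fit y = intercept + slope * x by ordinary least squares.
mx, my = statistics.fmean(x), statistics.fmean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)
intercept = my - slope * mx
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Average the residuals within bins of x. If the linear specification were
# adequate, every bin mean would be close to zero.
bin_means = []
for lo in range(20, 70, 10):
    in_bin = [r for xi, r in zip(x, residuals) if lo <= xi < lo + 10]
    bin_means.append(statistics.fmean(in_bin))
    print(f"x in [{lo}, {lo + 10}): mean residual {bin_means[-1]:+.2f}")
```

Bin means that are positive at both ends and negative in the middle (or the reverse) suggest curvature that the raw residual plot may hide.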
There are a few different ways to model non-linear
effects:
- Break the regressor into a few different categories. You have already
tried this. Whether it's satisfactory depends on how plausible the categories
are. It makes a lot of sense to categorize a variable like education, since
there are natural breakpoints when people complete certain degrees. Categorizing
age makes less sense, since someone at the top of one category (age 29)
may be little different from someone at the bottom of the next (age 30).
- Add polynomial terms such as X^2 and X^3. This
is the most common textbook recommendation, but it may not be so easy to
interpret, and it may not fit very well for large values of X. (When X
is large, X^2 and X^3 are very large.)
- Use a simple transformation such as log(X) or 1/X. Again, a common
textbook recommendation. While some relationships make a lot of sense on
the transformed scale, others may be difficult to fit or interpret. Think
about whether the expected relationship justifies the transformation you
propose to use.
- Use a spline transformation. Splines are probably under-used in
social research. They are flexible enough to capture a variety of non-linear
relationships, but allow you to impose sensible constraints--for example,
when you're sure that the effect of X is smooth and never changes direction.
Splines are very helpful when you can't think of a simple transformation
that captures the expected shape of the relationship, or when you're not
sure what shape to expect. They are implemented in SAS PROC TRANSREG; the
short article by Smith (1979) is a nice introduction.
NB: Because splines are so flexible, you will need to plot the spline transformation
before you interpret the results.
- Use a local regression method. Local regression is even more flexible
than spline transformation, but consequently easier to over-fit and harder
to interpret. One type of local regression model is implemented in an experimental
SAS procedure, PROC LOESS. A variety of local regression models are discussed
in Hastie & Tibshirani (1990).
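Several of these specifications can be compared on the same data. The following is a minimal sketch in Python with NumPy, not the SAS procedures mentioned above. The data are simulated, the knot locations are arbitrary, and the spline is a simple piecewise-linear one built from terms of the form max(X - k, 0). Local regression is left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with a U-shaped effect of x on y.
x = rng.uniform(20, 70, size=1000)
y = 0.01 * (x - 45) ** 2 + rng.normal(0, 1, size=1000)


def rss(design, response):
    """Residual sum of squares from a least-squares fit on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    return float(resid @ resid)


ones = np.ones_like(x)
# Centering and scaling x does not change the polynomial fit, but it keeps
# the squared and cubed terms from becoming very large.
z = (x - 45) / 10
knots = [30, 40, 50, 60]  # arbitrary knot locations, chosen for illustration

designs = {
    "linear": np.column_stack([ones, x]),
    # One indicator column per 10-year category (no separate intercept).
    "categories": np.column_stack(
        [(x >= lo) & (x < lo + 10) for lo in range(20, 70, 10)]
    ).astype(float),
    "polynomial": np.column_stack([ones, z, z**2, z**3]),
    "log": np.column_stack([ones, np.log(x)]),
    # Piecewise-linear spline: the slope is allowed to change at each knot.
    "linear spline": np.column_stack(
        [ones, x] + [np.maximum(x - k, 0) for k in knots]
    ),
}

fit = {name: rss(design, y) for name, design in designs.items()}
for name, value in fit.items():
    print(f"{name:>14}: residual sum of squares = {value:.1f}")
```

A lower residual sum of squares only describes in-sample fit, and more flexible specifications tend to fit better in-sample, so this comparison alone does not say which specification to prefer. Plausibility and interpretability, as discussed above, still matter.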
References
Hastie, TJ & Tibshirani, RJ. (1990). Generalized additive models.
London: Chapman & Hall.
Smith, PL. (1979). Splines as a useful and convenient
statistical tool. The American Statistician 33(2): 57-62.