Monday, November 17, 2014

Why I don't do backwards selection

I use model averaging a lot. My heart always sinks when I get a comment like this from a reviewer:
Why not use F-tests to actually pick one model instead of doing the model averaging?
Why not indeed?

I use model averaging to make sure I'm getting "unconditional" variances on predictions from a set of data. If I pick one model using backwards selection or AICc or BIC, then the variance of my predictions is conditional on that model being correct. But it isn't. So using an entire set of plausible models to make the prediction means that I will get wider confidence intervals, but more honest confidence intervals.
Another suggestion I get a lot is to use the single best model because the weight of that model is bigger than X, where X is your favorite arbitrary cutoff. If a single model in the set happens to be much, much better than all the rest, then its predictions naturally carry more weight, and ultimately the variance from the model set shrinks towards that of the single best model. Naturally, and without making arbitrary choices that are likely to differ from a reviewer's equally arbitrary choices.
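To make that concrete, here's a toy illustration of how BIC weights behave; the dBIC values below are made up:

dbic = c(0, 2, 10)  # hypothetical BIC differences for a three-model set
wbic = exp(-0.5*dbic) / sum(exp(-0.5*dbic))  # normalized model weights
round(wbic, 3)  # 0.727 0.268 0.005
# with a truly dominant model, say dbic = c(0, 15, 20), the top weight is ~0.999
# and the averaged prediction collapses to the single-model prediction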
I’m going to compare the predictions from a model chosen using backwards selection, a model chosen using BIC, and a model set weighted with BIC to demonstrate these differences. The data come from a paper I am working on with Brigitte Tenhumberg, Tomomi Suwa, Leland Russell, and Svata Louda. In the paper we’re synthesizing experiments on Bull Thistle (Cirsium vulgare) and Tall Thistle (Cirsium altissimum) carried out in a couple of different places with a variety of experimental manipulations. The individual experimental results are all published; the goal of the paper is to integrate them into size-structured Integral Projection Models (IPMs) to evaluate the effects of experimentally manipulating competition or herbivory on the entire life history.
An IPM represents life history traits like growth, survival, and reproduction as functions of size. Rather than a single number, the growth function has both an intercept and a slope that must be predicted under each combination of herbivory and competition. The other issue with thistles is that they eventually "bolt", or produce a flowering stalk. This dramatically affects their above-ground biomass, so I have to account for bolting in the model as well. So my "global", or most complex, model is
~veg.bio07 + Herb + Comp + bolt08 + veg.bio07:Comp + veg.bio07:Herb + veg.bio07:bolt08 + Herb:Comp + Herb:bolt08 + Comp:bolt08
which has 4 main effects and all 6 possible two-way interactions. Now, we don’t expect all those parameters to be important, so how do we simplify things?
Following the reviewer’s preference, I could do backwards selection. At each step I identify the term with the smallest influence on the model, and then "drop" that term to create a simpler model. If the simpler model is not significantly different from the more complex model (at some arbitrarily chosen level of significance, say α = 0.05), then I accept the simpler model and repeat the process.
library(lme4)
# maximum likelihood fit (REML=FALSE) so likelihood ratio tests between nested models are valid
bg0 = lmer(log(veg.bio) ~ veg.bio07 + Herb + Comp + bolt08 + veg.bio07:Comp +
             veg.bio07:Herb + veg.bio07:bolt08 + Herb:Comp + Herb:bolt08 +
             Comp:bolt08 + (1|plot),
           data=grow.bull[-remove,], REML=FALSE)
drop1(bg0, test="Chisq")
## Single term deletions
## 
## Model:
## log(veg.bio) ~ veg.bio07 + Herb + Comp + bolt08 + veg.bio07:Comp + 
##     veg.bio07:Herb + veg.bio07:bolt08 + Herb:Comp + Herb:bolt08 + 
##     Comp:bolt08 + (1 | plot)
##                  Df    AIC     LRT  Pr(Chi)   
## <none>              769.38                    
## veg.bio07:Comp    2 766.72  1.3354 0.512893   
## veg.bio07:Herb    1 767.94  0.5585 0.454865   
## veg.bio07:bolt08  1 776.24  8.8540 0.002924 **
## Herb:Comp         2 770.80  5.4121 0.066799 . 
## Herb:bolt08       1 773.41  6.0294 0.014069 * 
## Comp:bolt08       2 777.89 12.5035 0.001927 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The reviewer wanted F tests, but package lme4 only gives me likelihood ratio tests, so that’s what I’ll use. The veg.bio07:Comp interaction is the weakest (largest p-value), so I’ll drop that term and repeat.
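As an aside, if a reviewer really insists on F tests, the lmerTest package will produce them for lmer fits using a Satterthwaite approximation to the denominator degrees of freedom. A minimal sketch, which I haven't run on these data:

library(lmerTest)  # masks lme4::lmer, so refit the model
bg0F = lmer(log(veg.bio) ~ veg.bio07 + Herb + Comp + bolt08 + veg.bio07:Comp +
              veg.bio07:Herb + veg.bio07:bolt08 + Herb:Comp + Herb:bolt08 +
              Comp:bolt08 + (1|plot),
            data=grow.bull[-remove,], REML=FALSE)
anova(bg0F)  # F tests with Satterthwaite degrees of freedom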
bg1 = update(bg0,.~.-veg.bio07:Comp)
drop1(bg1,test="Chisq")
## Single term deletions
## 
## Model:
## log(veg.bio) ~ veg.bio07 + Herb + Comp + bolt08 + (1 | plot) + 
##     veg.bio07:Herb + veg.bio07:bolt08 + Herb:Comp + Herb:bolt08 + 
##     Comp:bolt08
##                  Df    AIC     LRT  Pr(Chi)   
## <none>              766.72                    
## veg.bio07:Herb    1 765.06  0.3362 0.562057   
## veg.bio07:bolt08  1 772.33  7.6138 0.005792 **
## Herb:Comp         2 767.11  4.3873 0.111507   
## Herb:bolt08       1 770.21  5.4879 0.019148 * 
## Comp:bolt08       2 775.75 13.0346 0.001478 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bg2 = update(bg1,.~.-veg.bio07:Herb)
drop1(bg2,test="Chisq")
## Single term deletions
## 
## Model:
## log(veg.bio) ~ veg.bio07 + Herb + Comp + bolt08 + (1 | plot) + 
##     veg.bio07:bolt08 + Herb:Comp + Herb:bolt08 + Comp:bolt08
##                  Df    AIC     LRT  Pr(Chi)   
## <none>              765.06                    
## veg.bio07:bolt08  1 770.40  7.3433 0.006731 **
## Herb:Comp         2 765.49  4.4374 0.108751   
## Herb:bolt08       1 770.67  7.6168 0.005783 **
## Comp:bolt08       2 773.83 12.7733 0.001684 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bg3 = update(bg2,.~.-Herb:Comp)
drop1(bg3,test="Chisq")
## Single term deletions
## 
## Model:
## log(veg.bio) ~ veg.bio07 + Herb + Comp + bolt08 + (1 | plot) + 
##     veg.bio07:bolt08 + Herb:bolt08 + Comp:bolt08
##                  Df    AIC     LRT  Pr(Chi)   
## <none>              765.49                    
## veg.bio07:bolt08  1 770.91  7.4203 0.006449 **
## Herb:bolt08       1 768.36  4.8640 0.027423 * 
## Comp:bolt08       2 772.49 10.9999 0.004087 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So I stop with bg3, because none of the remaining interaction terms can be dropped without causing a significant change in the log-likelihood. Here’s what the predicted functions look like for plants that do not bolt. All the lines are parallel because I dropped all the interactions between herbivory or competition and plant size in 2007. The shaded grey areas are bootstrapped 95% confidence regions.
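One way to get bootstrap limits like these from lme4 is a parametric bootstrap with bootMer. A minimal sketch, where nd stands in for a hypothetical data frame of the sizes and treatment combinations I want predictions for:

# nd is a hypothetical prediction grid (size x herbivory x competition)
pred.fun = function(m) predict(m, newdata=nd, re.form=NA)  # re.form=NA drops the plot effect
bb = bootMer(bg3, FUN=pred.fun, nsim=999)  # parametric bootstrap of the predictions
ci = apply(bb$t, 2, quantile, probs=c(0.025, 0.975))  # pointwise 95% limits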

Now I’ll use BIC to assess an a priori set of 35 models. Including all possible combinations of main effects and two-way interactions creates too many models: even with all 4 main effects fixed in the model, each of the 6 possible two-way interactions can be in or out, which already gives 2^6 = 64 models. The models I selected attempted to evaluate a range of plausible combinations of fixed effects and interactions. Because the population model is size structured, I only include models with the size variable, and when including interactions I preferentially included interactions with size for the same reason. Including too many models increases model selection uncertainty, leading to poor precision in my predictions, while including too few models risks overlooking important interactions between the covariates, leading to biased predictions. The model set we developed looks at all models involving size and one other main effect, and samples across the space of models involving 3 or 4 main effects. Here are the models that make up > 95% of the weight.

model                                    k  logLik   dBIC  wBIC
~veg.bio07 + bolt08                      5  -378.79  0.00  0.79
~veg.bio07 + Herb + bolt08               6  -377.91  3.84  0.12
~veg.bio07 + bolt08 + veg.bio07:bolt08   6  -378.16  4.34  0.09
These models are much, much simpler than the backwards selected model. For comparison with the table, bg3 has a log-likelihood of -370.75 and a BIC of 808.59. The log-likelihood of the more complex model is better, as expected. However, the BIC of the top model is 785.53, so the backwards selected (or BS, for short :) ) model would not even come close to the top models once the additional parameters are accounted for. Now I’ll make predictions from each of the 35 models in the set and then calculate a weighted average of those predictions.
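The mechanics look something like this; a minimal sketch, assuming bg.models is the list of 35 candidate formulas (each including the response and the (1|plot) random effect) and nd is the same hypothetical prediction grid as above:

# fit every candidate model to the same data with ML
fits = lapply(bg.models, lmer, data=grow.bull[-remove,], REML=FALSE)
# BIC weights across the whole set
bic = sapply(fits, BIC)
dbic = bic - min(bic)
wbic = exp(-0.5*dbic) / sum(exp(-0.5*dbic))
# population-level predictions from each model, then their weighted average
preds = sapply(fits, predict, newdata=nd, re.form=NA)
avg.pred = preds %*% wbic

One way to get unconditional confidence limits is then to pool the bootstrap replicates of each model in proportion to these weights.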


Now that took a lot longer than making the plot for one model; about 35 times longer, in fact. That might be one argument in favor of picking a single model, but it isn’t a very good one in my opinion. If you can’t find something else useful to do while you wait for a bootstrap to finish, you’re not busy enough! The biggest difference is that the curves are all on top of each other, because in this model set there is no effect of manipulating herbivory or competition on growth. In addition, the confidence limits on the predicted size are much narrower.
When I first saw this result, I was surprised. A single model is supposed to have more precision than a model-averaged result. And that’s true if you compare the model-averaged result with the BIC-selected best model; the difference is very small because most of the weight is on the top model. However, the BS model is much more complex, and as a result falls into a very different place on the bias-variance tradeoff. So what do you think? I know some of the readers of this blog use statistics. What’s your preferred model selection approach, and why?
