Stepwise selection after multiple imputation
A note
I have previously written two posts about multiple imputation using the mice package. This post is probably my last one on the topic.
Stepwise selection
The general steps when using the mice package are:
1. mice() - impute the NAs
2. with() - run the analysis (lm, glm, etc.)
3. pool() - pool the results
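For illustration, here is a minimal sketch of these three steps using the nhanes data that ships with mice (this toy example is my own, not part of the analysis below):
library(mice)
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 1) # step 1: impute the NAs
fit <- with(imp, lm(chl ~ bmi + age))                   # step 2: run the analysis
pool(fit)                                               # step 3: pool the results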
For backward and forward selection, we can perform the selection manually after pooling the results in step 3, but this manual approach does not work for stepwise selection.
Brand (1999) proposed this solution:
1. Perform stepwise selection separately on each imputed dataset
2. Fit a preliminary model containing all variables that are present in at least half of the models from step 1
3. Apply backward elimination to the variables in the preliminary model, removing them one at a time if p > 0.05
4. Repeat step 3 until all variables have p-values < 0.05
So, we are going to apply this solution, using the multivariate Wald test (D1() in the mice package) for model comparison instead of the pooled likelihood ratio p-value.
Example in R
Load the packages.
library(mice)
library(tidyverse)
Create missing data. We are going to use the famous mtcars dataset, which is already available in R.
set.seed(123)
dat <-
  mtcars %>%
  mutate(across(c(vs, am), as.factor)) %>% # code vs and am as factors
  select(-mpg) %>%
  missForest::prodNA(0.1) %>% # introduce 10% NAs completely at random
  bind_cols(mpg = mtcars$mpg) # keep the outcome mpg complete
summary(dat)
## cyl disp hp drat
## Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:4.000 1st Qu.:120.7 1st Qu.:103.0 1st Qu.:3.150
## Median :6.000 Median :225.0 Median :123.0 Median :3.715
## Mean :6.148 Mean :232.8 Mean :147.4 Mean :3.642
## 3rd Qu.:8.000 3rd Qu.:334.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
## NA's :5 NA's :1 NA's :4 NA's :2
## wt qsec vs am gear
## Min. :1.513 Min. :14.50 0 :17 0 :18 Min. :3.00
## 1st Qu.:2.429 1st Qu.:16.88 1 :11 1 :10 1st Qu.:3.00
## Median :3.203 Median :17.51 NA's: 4 NA's: 4 Median :4.00
## Mean :3.112 Mean :17.75 Mean :3.71
## 3rd Qu.:3.533 3rd Qu.:18.83 3rd Qu.:4.00
## Max. :5.424 Max. :22.90 Max. :5.00
## NA's :4 NA's :2 NA's :1
## carb mpg
## Min. :1.000 Min. :10.40
## 1st Qu.:2.000 1st Qu.:15.43
## Median :2.000 Median :19.20
## Mean :2.667 Mean :20.09
## 3rd Qu.:4.000 3rd Qu.:22.80
## Max. :6.000 Max. :33.90
## NA's :5
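As a quick aside (not part of the original workflow), we can also inspect the missing-data patterns before imputing with mice's md.pattern():
# Tabulate the missingness patterns: 1 = observed, 0 = missing
# (plot = FALSE suppresses the accompanying figure)
md.pattern(dat, plot = FALSE)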
Run mice() on the missing data with 10 imputed datasets (m = 10).
datImp <- mice(dat, m = 10, printFlag = F, seed = 123)
datImp
## Class: mids
## Number of multiple imputations: 10
## Imputation methods:
## cyl disp hp drat wt qsec vs am
## "pmm" "pmm" "pmm" "pmm" "pmm" "pmm" "logreg" "logreg"
## gear carb mpg
## "pmm" "pmm" ""
## PredictorMatrix:
## cyl disp hp drat wt qsec vs am gear carb mpg
## cyl 0 1 1 1 1 1 1 1 1 1 1
## disp 1 0 1 1 1 1 1 1 1 1 1
## hp 1 1 0 1 1 1 1 1 1 1 1
## drat 1 1 1 0 1 1 1 1 1 1 1
## wt 1 1 1 1 0 1 1 1 1 1 1
## qsec 1 1 1 1 1 0 1 1 1 1 1
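Before running any models, it is good practice to check the imputations themselves; this diagnostic step is an aside, not part of the original analysis:
# Trace plots of the chain means and SDs to eyeball convergence
plot(datImp)
# Compare densities of observed (blue) and imputed (red) values
densityplot(datImp)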
Run stepwise selection on each imputed dataset.
# Scope for step(): full model as upper bound, intercept-only as lower bound
sc <- list(upper = ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
           lower = ~ 1)
# Expression evaluated within each imputed dataset:
# fit the intercept-only model, then run stepwise selection
exp <- expression(f1 <- lm(mpg ~ 1),
                  f2 <- step(f1, scope = sc, trace = 0))
fit <- with(datImp, exp)
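Each element of fit$analyses now holds the final model that step() chose for one imputed dataset. For instance, to peek at the model selected for the first dataset (just an illustrative check):
formula(fit$analyses[[1]])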
Next, we count how many times each variable was selected across the models by the stepwise selection.
fit$analyses %>%
  map(formula) %>% # get the formula
  map(terms) %>% # get the terms
  map(labels) %>% # get the names of the variables
  unlist() %>%
  table()
## .
## am carb cyl disp drat hp qsec vs wt
## 7 5 3 2 4 5 3 4 7
We are going to select:
- am
- carb
- hp
- wt
These variables appear in at least half of the models; with 10 imputed datasets we have 10 models, so the cutoff is 5. Next, we fit a preliminary model containing these four variables.
fit_full1 <- with(datImp, lm(mpg ~ am + carb + hp + wt))
pool(fit_full1) %>%
summary()
## term estimate std.error statistic df p.value
## 1 (Intercept) 33.33683070 3.30280913 10.093478 15.81838 2.688191e-08
## 2 am1 3.06689135 1.94363342 1.577917 13.06329 1.384846e-01
## 3 carb -0.64791214 0.65564816 -0.988201 11.64959 3.431353e-01
## 4 hp -0.03414274 0.01159828 -2.943777 20.47239 7.895170e-03
## 5 wt -2.39586280 1.22218829 -1.960306 13.54830 7.085513e-02
We exclude the carb variable from the next model, as it has the largest non-significant p-value.
fit_full2 <- with(datImp, lm(mpg ~ am + hp + wt))
Next, we compare the two models using the multivariate Wald test.
D1(fit_full1, fit_full2)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 0.9765411 1 9.21378 27 0.3482934 0.6935655
The p-value is greater than 0.05, so we opt for the simpler model.
pool(fit_full2) %>%
summary()
## term estimate std.error statistic df p.value
## 1 (Intercept) 33.75666324 3.30083213 10.226713 16.87762 1.195383e-08
## 2 am1 2.50264907 1.79966590 1.390619 15.31418 1.842201e-01
## 3 hp -0.03950216 0.01162689 -3.397482 17.65719 3.280147e-03
## 4 wt -2.75412354 1.15870950 -2.376889 15.03403 3.116779e-02
We see that the am variable now has the largest non-significant p-value. So, we exclude this variable from the next model and compare the two latest models using the multivariate Wald test.
fit_full3 <- with(datImp, lm(mpg ~ hp + wt))
D1(fit_full2, fit_full3)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 1.93382 1 12.90982 28 0.1878483 0.4392918
Again, the p-value is greater than 0.05, so we opt for the simpler model.
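As an aside, mice also provides the pooled likelihood ratio test as D3(); if you prefer it over the multivariate Wald test, the call mirrors D1() (output not shown):
D3(fit_full2, fit_full3)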
pool(fit_full3) %>%
summary()
## term estimate std.error statistic df p.value
## 1 (Intercept) 37.50546490 1.91102857 19.625800 23.65472 4.440892e-16
## 2 hp -0.03263534 0.01042989 -3.129021 21.20234 5.031751e-03
## 3 wt -3.92792051 0.75157304 -5.226266 19.78033 4.238231e-05
There are no non-significant variables left in the model, so this is our final model.
gtsummary::tbl_regression(fit_full3)
Characteristic | Beta | 95% CI1 | p-value
---|---|---|---
hp | -0.03 | -0.05, -0.01 | 0.005
wt | -3.9 | -5.5, -2.4 | <0.001

1 CI = Confidence Interval
Reference:
Brand, J. P. L. (1999). Development, Implementation and Evaluation of Multiple Imputation Strategies for the Statistical Analysis of Incomplete Data Sets. PhD thesis, Erasmus University Rotterdam.