Stepwise selection after multiple imputation
A note
I have previously written two posts about multiple imputation using the mice package. This post is probably my last one on the topic.
Stepwise selection
The general steps when using the mice package are:
1. mice() - impute the NAs
2. with() - run the analysis (lm, glm, etc.)
3. pool() - pool the results
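For illustration, here is a minimal sketch of these three steps using the nhanes data that ships with mice (this toy example is my own, not part of the analysis below):
library(mice)
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 1) # step 1: impute the NAs
fit <- with(imp, lm(chl ~ bmi + age))                   # step 2: run the analysis
pool(fit)                                               # step 3: pool the results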
For backward and forward selection, we can perform the selection manually after pooling the results in step 3, but this manual approach does not work for stepwise selection.
Brand (1999) proposed this solution:
1. Perform stepwise selection separately on each imputed dataset
2. Fit a preliminary model containing all variables that are present in at least half of the models from step 1
3. Apply backward elimination to the variables in the preliminary model, removing them one at a time if p > 0.05
4. Repeat step 3 until all variables have p-values < 0.05
So, we are going to apply this solution, using the multivariate Wald test (D1() in the mice package) for model comparison instead of the pooled likelihood ratio p-value.
Example in R
Load the packages.
library(mice)
library(tidyverse)
Create missing data. We are going to use the famous mtcars dataset, which is already available in R.
set.seed(123)
dat <-
  mtcars %>%
  mutate(across(c(vs, am), as.factor)) %>% # code vs and am as factors
  select(-mpg) %>%
  missForest::prodNA(0.1) %>% # introduce 10% NAs completely at random
  bind_cols(mpg = mtcars$mpg) # keep the outcome mpg complete
summary(dat)
## cyl disp hp drat
## Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:4.000 1st Qu.:120.7 1st Qu.:103.0 1st Qu.:3.150
## Median :6.000 Median :225.0 Median :123.0 Median :3.715
## Mean :6.148 Mean :232.8 Mean :147.4 Mean :3.642
## 3rd Qu.:8.000 3rd Qu.:334.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
## NA's :5 NA's :1 NA's :4 NA's :2
## wt qsec vs am gear
## Min. :1.513 Min. :14.50 0 :17 0 :18 Min. :3.00
## 1st Qu.:2.429 1st Qu.:16.88 1 :11 1 :10 1st Qu.:3.00
## Median :3.203 Median :17.51 NA's: 4 NA's: 4 Median :4.00
## Mean :3.112 Mean :17.75 Mean :3.71
## 3rd Qu.:3.533 3rd Qu.:18.83 3rd Qu.:4.00
## Max. :5.424 Max. :22.90 Max. :5.00
## NA's :4 NA's :2 NA's :1
## carb mpg
## Min. :1.000 Min. :10.40
## 1st Qu.:2.000 1st Qu.:15.43
## Median :2.000 Median :19.20
## Mean :2.667 Mean :20.09
## 3rd Qu.:4.000 3rd Qu.:22.80
## Max. :6.000 Max. :33.90
## NA's :5
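As a quick aside (not part of the original workflow), we can also inspect the missing-data patterns before imputing with mice's md.pattern():
# Tabulate the missingness patterns: 1 = observed, 0 = missing
# (plot = FALSE suppresses the accompanying figure)
md.pattern(dat, plot = FALSE)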
Run mice() on the missing data with 10 imputed datasets (m = 10).
datImp <- mice(dat, m = 10, printFlag = F, seed = 123)
datImp
## Class: mids
## Number of multiple imputations: 10
## Imputation methods:
## cyl disp hp drat wt qsec vs am
## "pmm" "pmm" "pmm" "pmm" "pmm" "pmm" "logreg" "logreg"
## gear carb mpg
## "pmm" "pmm" ""
## PredictorMatrix:
## cyl disp hp drat wt qsec vs am gear carb mpg
## cyl 0 1 1 1 1 1 1 1 1 1 1
## disp 1 0 1 1 1 1 1 1 1 1 1
## hp 1 1 0 1 1 1 1 1 1 1 1
## drat 1 1 1 0 1 1 1 1 1 1 1
## wt 1 1 1 1 0 1 1 1 1 1 1
## qsec 1 1 1 1 1 0 1 1 1 1 1
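Before running any models, it is good practice to check the imputations themselves; this diagnostic step is an aside, not part of the original analysis:
# Trace plots of the chain means and SDs to eyeball convergence
plot(datImp)
# Compare densities of observed (blue) and imputed (red) values
densityplot(datImp)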
Run stepwise selection on each imputed dataset.
# Scope for step(): full model as upper bound, intercept-only as lower bound
sc <- list(upper = ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
           lower = ~ 1)
# Expression evaluated within each imputed dataset:
# fit the intercept-only model, then run stepwise selection
exp <- expression(f1 <- lm(mpg ~ 1),
                  f2 <- step(f1, scope = sc, trace = 0))
fit <- with(datImp, exp)
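Each element of fit$analyses now holds the final model that step() chose for one imputed dataset. For instance, to peek at the model selected for the first dataset (just an illustrative check):
formula(fit$analyses[[1]])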
Next, we count how many times each variable was selected across the models by the stepwise selection.
fit$analyses %>%
  map(formula) %>% # get the formula
  map(terms) %>% # get the terms
  map(labels) %>% # get the names of the variables
  unlist() %>%
  table()
## .
## am carb cyl disp drat hp qsec vs wt
## 7 5 3 2 4 5 3 4 7
We are going to select:
- am
- carb
- hp
- wt
These variables appear in at least half of the models; with 10 imputed datasets we have 10 models, so the cutoff is 5. Next, we fit a preliminary model containing these four variables.
fit_full1 <- with(datImp, lm(mpg ~ am + carb + hp + wt))
pool(fit_full1) %>%
summary()
## term estimate std.error statistic df p.value
## 1 (Intercept) 33.33683070 3.30280913 10.093478 15.81838 2.688191e-08
## 2 am1 3.06689135 1.94363342 1.577917 13.06329 1.384846e-01
## 3 carb -0.64791214 0.65564816 -0.988201 11.64959 3.431353e-01
## 4 hp -0.03414274 0.01159828 -2.943777 20.47239 7.895170e-03
## 5 wt -2.39586280 1.22218829 -1.960306 13.54830 7.085513e-02
We exclude the carb variable from the next model, as it has the largest non-significant p-value.
fit_full2 <- with(datImp, lm(mpg ~ am + hp + wt))
Next, we compare the two models using the multivariate Wald test.
D1(fit_full1, fit_full2)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 0.9765411 1 9.21378 27 0.3482934 0.6935655
The p-value is greater than 0.05, so we opt for the simpler model.
pool(fit_full2) %>%
summary()
## term estimate std.error statistic df p.value
## 1 (Intercept) 33.75666324 3.30083213 10.226713 16.87762 1.195383e-08
## 2 am1 2.50264907 1.79966590 1.390619 15.31418 1.842201e-01
## 3 hp -0.03950216 0.01162689 -3.397482 17.65719 3.280147e-03
## 4 wt -2.75412354 1.15870950 -2.376889 15.03403 3.116779e-02
We see that the am variable now has the largest non-significant p-value. So, we exclude this variable from the next model and compare the two latest models using the multivariate Wald test.
fit_full3 <- with(datImp, lm(mpg ~ hp + wt))
D1(fit_full2, fit_full3)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 1.93382 1 12.90982 28 0.1878483 0.4392918
Again, the p-value is greater than 0.05, so we opt for the simpler model.
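As an aside, mice also provides the pooled likelihood ratio test as D3(); if you prefer it over the multivariate Wald test, the call mirrors D1() (output not shown):
D3(fit_full2, fit_full3)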
pool(fit_full3) %>%
summary()
## term estimate std.error statistic df p.value
## 1 (Intercept) 37.50546490 1.91102857 19.625800 23.65472 4.440892e-16
## 2 hp -0.03263534 0.01042989 -3.129021 21.20234 5.031751e-03
## 3 wt -3.92792051 0.75157304 -5.226266 19.78033 4.238231e-05
There are no non-significant variables left in the model, so this is our final model.
gtsummary::tbl_regression(fit_full3)
Characteristic | Beta | 95% CI1 | p-value
---|---|---|---
hp | -0.03 | -0.05, -0.01 | 0.005
wt | -3.9 | -5.5, -2.4 | <0.001

1 CI = Confidence Interval
Reference:
Brand, J. P. L. (1999). Development, Implementation and Evaluation of Multiple Imputation Strategies for the Statistical Analysis of Incomplete Data Sets. PhD thesis, Erasmus University Rotterdam.