Fitted vs predict in R
There are two functions in R that seems almost similar yet different:
fitted()
predict()
First let’s prepare some data first.
# Packages
library(dplyr)
# Data
set.seed(123)
dat <-
iris %>%
mutate(twoGp = sample(c("Gp1", "Gp2"), 150, replace = T), #create two group factor
twoGp = as.factor(twoGp))
This is our data.
summary(dat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species twoGp
## setosa :50 Gp1:76
## versicolor:50 Gp2:74
## virginica :50
##
##
##
fitted()
is used to get a predicted values or \(\hat{y}\) based on the data. Let’s see this on the logistic regression.
logR <- glm(twoGp ~ ., family = binomial(), data = dat)
These are the fitted values.
fitted(logR) %>% head()
## 1 2 3 4 5 6
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198
For predict()
, we have three types:
- Response
- Link - default
- Terms
If no new data supplied to predict()
, it will use the original data used to fit the model.
1. Response
The type response is identical to fitted()
.
predict(logR, type = "response") %>% head()
## 1 2 3 4 5 6
## 0.4074988 0.3385228 0.3772767 0.3555640 0.4255196 0.4602198
We can confirm this as below.
all.equal(fitted(logR), predict(logR, type = "response"))
## [1] TRUE
Thus, fitted()
and predict(type = "response")
give use predicted probabilities on the scale of the response variable. The first observation of this values can be interpreted as probability of Gp2 (Gp1 is a reference group) for first observation is 0.41.
2. Link
predict(type = "link")
gives us predicted probabilities on the logit scale or log odds prediction.
predict(logR, type = "link") %>% head() #similar to predict(logR)
## 1 2 3 4 5 6
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578
So, the log odds prediction of Gp2 for the first observation is -0.37. Hence, we can get the same values if we apply a link function to the fitted values.
The link function for logistic regression is:
\[ ln(\frac{\mu}{1 - \mu}) \] So, we apply this link function to the fitted values.
logOddsProb <- log(fitted(logR) / (1 - fitted(logR)))
head(logOddsProb)
## 1 2 3 4 5 6
## -0.3743150 -0.6698840 -0.5011235 -0.5946702 -0.3001551 -0.1594578
We can further confirm this as we did previously.
all.equal(logOddsProb, predict(logR, type = "link"))
## [1] TRUE
Also, we can conclude predict(type = "link")
give use a fitted values before an application of link function (log odds).
3. Terms
Lastly, we have predict(type = "terms")
. This type gives us a matrix of fitted values for each variable of each observation in the model on the scale of linear predictor.
predict(logR, type = "terms") %>% head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.07988782 0.28070682 0.4819893 -0.2736677 -0.9178543
## 2 0.10138230 -0.03635661 0.4819893 -0.2736677 -0.9178543
## 3 0.12287679 0.09046877 0.5024299 -0.2736677 -0.9178543
## 4 0.13362403 0.02705608 0.4615487 -0.2736677 -0.9178543
## 5 0.09063506 0.34411951 0.4819893 -0.2736677 -0.9178543
## 6 0.04764610 0.53435757 0.4206675 -0.2188976 -0.9178543
So, if we add up the values of the first observation and the constant (or intercept), we will get the same values as the log odds prediction (predict(type = "link")
).
predTerm <- predict(logR, type = "terms")
sum(predTerm[1, ], attr(predTerm, "constant")) #add up the first observation and the constant
## [1] -0.374315
logOddsProb[1]
## 1
## -0.374315
Those values also similar to if we calculate manually using coefficient from the summary()
.
\[ LogOdds(Gp2) = \beta_0 + \beta_1(Sepal.Length) + \beta_2(Sepal.Width) + \] \[ \beta_3(Petal.Length) + \beta_4(Petal.Width) + \beta_5(Species) \] So, this is the values we get from the first observation.
coef(logR)[1] + coef(logR)[2]*dat$Sepal.Length[1] + coef(logR)[3]*dat$Sepal.Width[1] + coef(logR)[4]*dat$Petal.Length[1] + coef(logR)[5]*dat$Petal.Width[1] + 0 #setosa species
## (Intercept)
## -0.374315
However, in predict(type = "terms")
the values is centered, thus we have a different values for constant/intercept and for \(\beta_1(Sepal.Length)\),\(\beta_2(Sepal.Width)\) and so on. For example, the values for intercept for both models are:
# Intercept/constant from predict(type = "terms")
attr(predTerm, "constant")
## [1] -0.02537694
# Intercept/constant from summary()
coef(logR)[1]
## (Intercept)
## -1.814251
References: