\[\begin{gather*} y\in \{0,1\}: \text{dependent variable.}\\ x_1,\cdots,x_k: \text{predictor variables.}\\ x_i=(1,x_{i1},\cdots,x_{ik})^T,\ i=1,\cdots,n\\ \beta = (\beta_0,\cdots,\beta_k)^T \end{gather*}\]
Let the latent variable \(\hat{y_i}=\beta^Tx_i+\varepsilon_i\).
\[ y_i=\begin{cases} 1& \hat{y_i}\ge0\\ 0& \hat{y_i}<0 \end{cases} \]
For the probit model, \(\varepsilon_i\sim N(0,1)\). So, \[ P\{y_i=1\}=P\{\hat{y_i}\ge0\}=P\{\varepsilon_i\ge-\beta^Tx_i\}=1-\Phi(-\beta^Tx_i)=\Phi(\beta^Tx_i) \] Therefore the probit model is \(P\{y_i=1\}=\Phi(\beta^Tx_i)\), where \(\Phi\) is the standard normal C.D.F.:
\[\Phi(u)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{u}e^{-\frac{z^2}{2}}dz\]
For the logit model, \(\varepsilon_i \sim \text{Logistic}(0,1)\).
P.D.F of Standard Logistic Distribution: \(f(x)=\frac{e^{-x}}{(1+e^{-x})^2}\).
C.D.F of Standard Logistic Distribution: \(F(x)=\frac{1}{1+e^{-x}}\).
By the same argument, and since the logistic distribution is symmetric, \(P\{y_i=1\}=1-F(-\beta^Tx_i)=F(\beta^Tx_i)\). Therefore the logit model is shown below, where \(u=\beta_0+\beta_1x_1+\cdots+\beta_kx_k\):
\[ \Lambda(u)=\frac{1}{1+e^{-u}} \]
We can estimate \(\beta\) by maximum likelihood estimation.
Writing \(p=P\{y_i=1\}=\Lambda(u)\) and solving for \(u\) shows why \(u\) is also referred to as the log-odds:
\[ u=\ln\frac{p}{1-p} \]
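As a minimal, self-contained sketch of the MLE step (on simulated data, not one of the report's datasets), the logit log-likelihood can be maximized directly with optim; glm does the equivalent via Fisher scoring.
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                     # model matrix: intercept + one predictor
beta.true <- c(-0.5, 1.2)
y <- rbinom(n, 1, plogis(X %*% beta.true))  # simulate 0/1 outcomes
negloglik <- function(beta) {
  u <- X %*% beta                           # linear predictor
  -sum(y * u - log(1 + exp(u)))             # negative logit log-likelihood
}
fit <- optim(c(0, 0), negloglik, method = "BFGS")
fit$par                                     # close to coef(glm(y ~ X[, 2], family = binomial))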
To read “.xlsx” files, I use the R function read_excel from the package readxl.
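The loading step for this dataset is not echoed in the report, so the sketch below assumes a hypothetical file name for the doctors data.
library(readxl)
# Hypothetical file name; the actual path is not shown in the report
read_excel('../datasets/doctors.xlsx') -> doctors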
For each patient, variable descriptions are as follows:
In this case, variable x4 appears to be categorical, so I convert it to a factor. The code below then estimates a logistic regression model using the glm (generalized linear model) function.
# Treat x4 as a categorical predictor
factor(doctors$x4)->doctors$x4
# Fit the logit model on all predictors
doctors.logit<-glm(y~.,data = doctors,family = binomial(logit))
summary(doctors.logit)
##
## Call:
## glm(formula = y ~ ., family = binomial(logit), data = doctors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4599 -0.4412 0.1605 0.4367 1.7655
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.70095 3.69973 -2.892 0.00382 **
## x1 0.45959 0.17705 2.596 0.00944 **
## x2 0.81826 0.43646 1.875 0.06083 .
## x3 -0.37897 0.42125 -0.900 0.36832
## x42 -2.14618 4.14447 -0.518 0.60457
## x43 0.22639 0.84413 0.268 0.78855
## x5 -0.03383 0.02045 -1.654 0.09816 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 75.934 on 57 degrees of freedom
## Residual deviance: 39.251 on 51 degrees of freedom
## AIC: 53.251
##
## Number of Fisher Scoring iterations: 6
As shown in the above results, only x1 is statistically significant at the 0.05 level; x2 and x5 reach significance only at the 0.1 level. The logistic regression coefficients give the change in the log-odds of y=1 for a one-unit increase in the predictor variable; here x4=1 is the reference category.
The estimate for x1 means that for a one-unit increase in x1, the log-odds of y=1 (successful pregnancy) increase by \(0.45959\) (Bruin 2011).
The indicator variables for x4 are read slightly differently: each compares its category with the reference category. For instance, moving from x4=1 to x4=2 changes the log-odds of y=1 (successful pregnancy) by \(-2.14618\).
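Since these coefficients are changes in log-odds, exponentiating them gives odds ratios, which are often easier to read; a quick check:
# Odds ratios for the fitted logit model
exp(coef(doctors.logit))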
# x4 is already a factor; fit the probit model on all predictors
doctors.probit<-glm(y~.,data = doctors,family = binomial(probit))
summary(doctors.probit)
##
## Call:
## glm(formula = y ~ ., family = binomial(probit), data = doctors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4306 -0.4425 0.1226 0.4728 1.7298
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.12853 1.94100 -3.157 0.00159 **
## x1 0.26397 0.09484 2.783 0.00538 **
## x2 0.47466 0.23534 2.017 0.04370 *
## x3 -0.23146 0.23321 -0.992 0.32096
## x42 -1.23190 1.93357 -0.637 0.52405
## x43 0.09670 0.47985 0.202 0.84029
## x5 -0.01906 0.01125 -1.695 0.09014 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 75.934 on 57 degrees of freedom
## Residual deviance: 39.239 on 51 degrees of freedom
## AIC: 53.239
##
## Number of Fisher Scoring iterations: 7
From the result, we can see variables x1 and x2 are significant at the \(\alpha=0.05\) significance level. Note that probit coefficients are changes in the probit index (the z-score \(\beta^Tx\)), not log-odds: a one-unit increase in x1 raises the probit index of y=1 (successful pregnancy) by \(0.26397\), and a one-unit increase in x2 raises it by \(0.47466\).
I use the package aod for the next step, a joint significance (Wald) test of all coefficients except the intercept.
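The calls that produced the Wald-test output below are not echoed in the report; presumably something like the following, where Terms = 2:7 jointly tests the six non-intercept coefficients.
library(aod)
# Joint Wald test of all predictors (coefficients 2 to 7; coefficient 1 is the intercept)
wald.test(b = coef(doctors.logit), Sigma = vcov(doctors.logit), Terms = 2:7)
wald.test(b = coef(doctors.probit), Sigma = vcov(doctors.probit), Terms = 2:7)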
Logit model
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 13.5, df = 6, P(> X2) = 0.036
Probit model
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 17.2, df = 6, P(> X2) = 0.0087
As we can see, both p-values are less than \(0.05\), so both models are significant at the \(95\%\) confidence level. In other words, not all coefficients equal \(0\).
Computing the optimal cutoff helps minimize the misclassification error for both models above. I use the package InformationValue to find it.
Logit Model
# Predicted probabilities: inverse logit link applied to the linear predictor
doctors.logit.predicted <- plogis(predict(doctors.logit,doctors))
library(InformationValue)
# Cutoff that minimizes the misclassification error
doctors.logit.optcutoff <- optimalCutoff(doctors$y,doctors.logit.predicted)
show(doctors.logit.optcutoff)
## [1] 0.5293058
Probit Model
# For the probit model the inverse link is the standard normal C.D.F.,
# so the predicted probabilities need pnorm, not plogis
doctors.probit.predicted <- pnorm(predict(doctors.probit,doctors))
doctors.probit.optcutoff <- optimalCutoff(doctors$y,doctors.probit.predicted)
show(doctors.probit.optcutoff)
## [1] 0.5234879
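The confusion matrices below were presumably produced with InformationValue's confusionMatrix at the default and optimal cutoffs, roughly as follows:
confusionMatrix(doctors$y, doctors.logit.predicted)   # default cutoff 0.5
confusionMatrix(doctors$y, doctors.logit.predicted, threshold = doctors.logit.optcutoff)
confusionMatrix(doctors$y, doctors.probit.predicted)  # default cutoff 0.5
confusionMatrix(doctors$y, doctors.probit.predicted, threshold = doctors.probit.optcutoff)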
Logit Model, with default cutoff:
## 0 1
## 0 16 3
## 1 5 34
Logit Model, with optimal cutoff:
## 0 1
## 0 18 3
## 1 3 34
Probit Model, with default cutoff:
## 0 1
## 0 17 3
## 1 4 34
Probit Model, with optimal cutoff:
## 0 1
## 0 18 3
## 1 3 34
Logit Model
[Figure: ROC curve for the logit model]
Probit Model
[Figure: ROC curve for the probit model]
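The ROC curves were presumably drawn with InformationValue's plotROC, which also prints the area under the curve; a sketch:
plotROC(doctors$y, doctors.logit.predicted)   # ROC and AUC, logit model
plotROC(doctors$y, doctors.probit.predicted)  # ROC and AUC, probit model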
The higher the area under the curve, the better the predictive power of the model. From the figures above, we can conclude that the logit model is slightly better than the probit model here.
To read “.xlsx” files, I use the R function read_excel from the package readxl.
library(readxl)
read_excel('../datasets/binary regression.xlsx') -> exams
# Convert the predictor columns to numeric
exams$x1<-as.numeric(exams$x1)
exams$x2<-as.numeric(exams$x2)
For each exam, variable descriptions are as follows:
In this case, there is no categorical variable. The code below estimates a logistic regression model using the glm (generalized linear model) function.
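(The call itself is not echoed in the report; it is reconstructed here from the Call line in the summary output below.)
exams.logit<-glm(y~.,data = exams,family = binomial(logit))
summary(exams.logit)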
##
## Call:
## glm(formula = y ~ ., family = binomial(logit), data = exams)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2246 -0.1425 0.0042 0.2007 2.1524
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.1603 4.0819 -3.959 7.53e-05 ***
## x1 1.7454 0.4308 4.051 5.09e-05 ***
## x2 1.4865 0.3941 3.772 0.000162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.628 on 99 degrees of freedom
## Residual deviance: 43.402 on 97 degrees of freedom
## AIC: 49.402
##
## Number of Fisher Scoring iterations: 8
As shown in the above results, both x1 and x2 are statistically significant at the \(\alpha=0.001\) level. The estimates mean that a one-unit increase in x1 raises the log-odds of y=1 (pass exam) by \(1.7454\), and a one-unit increase in x2 raises it by \(1.4865\).
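The probit fit follows; again the call is reconstructed from the Call line in the summary output below.
exams.probit<-glm(y~.,data = exams,family = binomial(probit))
summary(exams.probit)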
##
## Call:
## glm(formula = y ~ ., family = binomial(probit), data = exams)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.20524 -0.08993 0.00000 0.15986 2.11765
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.3426 2.1896 -4.267 1.98e-05 ***
## x1 1.0057 0.2291 4.390 1.13e-05 ***
## x2 0.8643 0.2155 4.012 6.03e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.628 on 99 degrees of freedom
## Residual deviance: 43.074 on 97 degrees of freedom
## AIC: 49.074
##
## Number of Fisher Scoring iterations: 9
From the result, both x1 and x2 are significant at the \(\alpha=0.001\) level. As before, probit coefficients are changes in the probit index rather than log-odds: a one-unit increase in x1 raises the probit index of y=1 (pass exam) by \(1.0057\), and a one-unit increase in x2 raises it by \(0.8643\).
I use the package aod for the next step, a joint significance (Wald) test of the two predictors.
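Presumably the Wald tests below were produced like this, with Terms = 2:3 jointly testing the two predictors.
library(aod)
# Joint Wald test of the two predictors (coefficient 1 is the intercept)
wald.test(b = coef(exams.logit), Sigma = vcov(exams.logit), Terms = 2:3)
wald.test(b = coef(exams.probit), Sigma = vcov(exams.probit), Terms = 2:3)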
Logit model
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 16.5, df = 2, P(> X2) = 0.00026
Probit model
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 19.4, df = 2, P(> X2) = 6e-05
As we can see, both p-values are less than \(0.001\), so both models are significant even at the \(99.9\%\) confidence level. In other words, not all coefficients equal \(0\).
Computing the optimal cutoff helps minimize the misclassification error for both models above. I use the package InformationValue to find it.
Logit Model
# Predicted probabilities: inverse logit link applied to the linear predictor
exams.logit.predicted <- plogis(predict(exams.logit,exams))
library(InformationValue)
# Cutoff that minimizes the misclassification error
exams.logit.optcutoff <- optimalCutoff(exams$y,exams.logit.predicted)
show(exams.logit.optcutoff)
## [1] 0.6399997
Probit Model
# As above, the probit inverse link is the standard normal C.D.F. (pnorm), not plogis
exams.probit.predicted <- pnorm(predict(exams.probit,exams))
exams.probit.optcutoff <- optimalCutoff(exams$y,exams.probit.predicted)
show(exams.probit.optcutoff)
## [1] 0.5798182
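As before, the confusion matrices below were presumably produced with confusionMatrix at the default and optimal cutoffs:
confusionMatrix(exams$y, exams.logit.predicted)   # default cutoff 0.5
confusionMatrix(exams$y, exams.logit.predicted, threshold = exams.logit.optcutoff)
confusionMatrix(exams$y, exams.probit.predicted)  # default cutoff 0.5
confusionMatrix(exams$y, exams.probit.predicted, threshold = exams.probit.optcutoff)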
Logit Model, with default cutoff:
## 0 1
## 0 39 5
## 1 6 50
Logit Model, with optimal cutoff:
## 0 1
## 0 43 7
## 1 2 48
Probit Model, with default cutoff:
## 0 1
## 0 39 5
## 1 6 50
Probit Model, with optimal cutoff:
## 0 1
## 0 43 7
## 1 2 48
Logit Model
[Figure: ROC curve for the logit model]
Probit Model
[Figure: ROC curve for the probit model]
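As before, the ROC curves were presumably drawn with plotROC:
plotROC(exams$y, exams.logit.predicted)   # ROC and AUC, logit model
plotROC(exams$y, exams.probit.predicted)  # ROC and AUC, probit model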
The higher the area under the curve, the better the predictive power of the model. From the figures above, we can conclude that the probit model is slightly better than the logit model here.
Thanks to knitr, with which this report was produced (Xie 2015).
Bruin, J. 2011. “Newtest: Command to Compute New Test.” February 2011. https://stats.idre.ucla.edu/stata/ado/analysis/.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://yihui.name/knitr/.