To read “xlsx” files, I use the R function read_excel from the readxl package.
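A minimal sketch of the import step (the file path and the name patients are assumptions for illustration, not taken from the original source):

```r
library(readxl)

# Hypothetical path; adjust to where the workbook actually lives
patients <- read_excel("../datasets/patients.xlsx")
str(patients)  # inspect the imported columns
```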
For each patient, variable descriptions are as follows:
Model 1:\(X_1=\beta_0+\beta_1X_2+\beta_2X_3\)
where \(X_1,X_2,X_3\) are \(n\)-dimensional vectors (similarly hereinafter).
Model 2:\(X_1=\beta_0+\beta_1X_2\)
Model 3:\(X_1=\beta_0+\beta_1X_3\)
The null hypothesis is that the coefficients associated with the variables are equal to zero. The alternative hypothesis is that the coefficients are not equal to zero (i.e., there exists a relationship between the independent variable in question and the dependent variable).
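The summaries that follow can be reproduced with lm() and summary(); a sketch, assuming the data frame is called patients with columns X1 (systolic blood pressure), X2 (age) and X3 (weight):

```r
# Fit the three models described above
model1 <- lm(patients$X1 ~ patients$X2 + patients$X3)  # Model 1
model2 <- lm(patients$X1 ~ patients$X2)                # Model 2
model3 <- lm(patients$X1 ~ patients$X3)                # Model 3

# summary() prints per-coefficient t-tests and the overall F-test
summary(model1)
```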
##
## Call:
## lm(formula = patients$X1 ~ patients$X2 + patients$X3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## patients$X2 0.8614 0.2482 3.470 0.00844 **
## patients$X3 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
As we can see, the \(F\)-statistic for the hypothesis \(H_0:\beta_1=\beta_2=0\) is \(168.8\), and its p-value is less than \(0.05\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable (systolic blood pressure) and the independent variables (age and weight). We can conclude that the linear regression model is significant overall, and all coefficients of the model are individually significant.
##
## Call:
## lm(formula = patients$X1 ~ patients$X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.742 -2.415 1.015 1.795 5.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.7055 6.4524 9.098 7.81e-06 ***
## patients$X2 1.4632 0.1023 14.300 1.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.949 on 9 degrees of freedom
## Multiple R-squared: 0.9578, Adjusted R-squared: 0.9532
## F-statistic: 204.5 on 1 and 9 DF, p-value: 1.708e-07
Since the p-values for patients$X2 (age), the intercept, and the F-statistic are all less than \(0.05\), we can conclude that the linear regression model is significant overall and that all its coefficients are significant.
##
## Call:
## lm(formula = patients$X1 ~ patients$X3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7440 -1.0412 -0.3124 2.2283 4.2560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14161 12.36447 0.092 0.928
## patients$X3 0.76384 0.06318 12.090 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.459 on 9 degrees of freedom
## Multiple R-squared: 0.942, Adjusted R-squared: 0.9356
## F-statistic: 146.2 on 1 and 9 DF, p-value: 7.227e-07
Since the p-values for patients$X3 (weight) and the F-statistic are both less than \(0.05\), we can conclude that the weight coefficient is significant and that the linear regression model is significant overall. The intercept, however, is not significant (p = 0.928).
\(H_0\): the variance of the residuals is constant (homoscedasticity).
\(H_1\): the variance of the residuals is not constant (heteroscedasticity).
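The output below matches the non-constant variance score test from the car package; a sketch, assuming the three fits are stored as model1, model2 and model3:

```r
library(car)

# Score test for non-constant error variance against the fitted values
ncvTest(model1)  # Model 1
ncvTest(model2)  # Model 2
ncvTest(model3)  # Model 3
```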
Model 1:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.1314767, Df = 1, p = 0.71691
Model 2:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2609148, Df = 1, p = 0.60949
Model 3:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.5535792, Df = 1, p = 0.45686
As we can see, all p-values are larger than \(0.05\), so we fail to reject the null hypothesis that the variance of the residuals is constant.
Autocorrelation means that the residuals satisfy the equation (with lag 1):
\[ \varepsilon_t=\rho\varepsilon_{t-1}+v_t \] where \(\{v_t\}\) are independent normally distributed \((0,\sigma_v^2)\) random variables and \(|\rho|<1\) is the autocorrelation parameter.
\(H_0\): \(\rho=0\)
\(H_1\): \(\rho\ne0\)
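The lag tables below match car::durbinWatsonTest(); a sketch, again assuming the fits are stored as model1, model2 and model3:

```r
library(car)

# Durbin-Watson test for residual autocorrelation at lags 1 through 3;
# the p-values are computed by bootstrap, so they vary slightly between runs
durbinWatsonTest(model1, max.lag = 3)  # Model 1
durbinWatsonTest(model2, max.lag = 3)  # Model 2
durbinWatsonTest(model3, max.lag = 3)  # Model 3
```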
Model 1
## lag Autocorrelation D-W Statistic p-value
## 1 0.0007935901 1.927361 0.790
## 2 -0.2755027796 2.426084 0.438
## 3 0.2953338145 1.104598 0.292
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags are larger than \(0.05\), so we fail to reject the null hypothesis that the autocorrelation is zero (i.e., the residuals show no autocorrelation).
Model 2
## lag Autocorrelation D-W Statistic p-value
## 1 0.5161502 0.8198496 0.030
## 2 0.2062185 1.3363803 0.400
## 3 0.1631983 0.9002157 0.146
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags except lag 1 are larger than \(0.05\), so for lags 2 and 3 we fail to reject the null hypothesis of zero autocorrelation; for lag 1, however, we reject it.
Model 3
## lag Autocorrelation D-W Statistic p-value
## 1 -0.2486634 2.474519 0.404
## 2 -0.2171831 2.391001 0.406
## 3 0.1663044 1.498164 0.808
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags are larger than \(0.05\), so we fail to reject the null hypothesis that the autocorrelation is zero.
To read “xlsx” files, I use the R function read_excel from the readxl package.
For dataset cars, variable descriptions are as follows:
Question: Construct different linear regression models using “step” function with Y = price as a dependent variable, and independent variables: mileage, liter, cruise, sound, leather. Verify if the coefficients of the linear regression model are significant. Choose the best model using AIC statistics.
# keep only the variables named in the question
cars_data_subset <- subset(cars_data,
                           select = c(Price, Mileage, Liter, Cruise, Sound, Leather))
library(MASS)  # provides stepAIC()
cars.full.model <- lm(Price ~ ., data = cars_data_subset)  # all predictors
cars.null.model <- lm(Price ~ 1, data = cars_data_subset)  # intercept only
cars.selected.model <- stepAIC(cars.null.model, direction = "both", trace = 1,
                               scope = list(lower = cars.null.model,
                                            upper = cars.full.model))
## Start: AIC=14792.6
## Price ~ 1
##
## Df Sum of Sq RSS AIC
## + Liter 1 2.4443e+10 5.4019e+10 14494
## + Cruise 1 1.4565e+10 6.3896e+10 14630
## + Leather 1 1.9388e+09 7.6523e+10 14774
## + Mileage 1 1.6056e+09 7.6856e+10 14778
## + Sound 1 1.2132e+09 7.7248e+10 14782
## <none> 7.8461e+10 14793
##
## Step: AIC=14494.49
## Price ~ Liter
##
## Df Sum of Sq RSS AIC
## + Cruise 1 4.4346e+09 4.9584e+10 14428
## + Mileage 1 1.3810e+09 5.2638e+10 14476
## + Leather 1 9.2996e+08 5.3089e+10 14482
## + Sound 1 6.0710e+08 5.3411e+10 14487
## <none> 5.4019e+10 14494
## - Liter 1 2.4443e+10 7.8461e+10 14793
##
## Step: AIC=14427.62
## Price ~ Liter + Cruise
##
## Df Sum of Sq RSS AIC
## + Mileage 1 1.5597e+09 4.8024e+10 14404
## + Leather 1 1.4601e+09 4.8124e+10 14406
## + Sound 1 3.9456e+08 4.9189e+10 14423
## <none> 4.9584e+10 14428
## - Cruise 1 4.4346e+09 5.4019e+10 14494
## - Liter 1 1.4312e+10 6.3896e+10 14630
##
## Step: AIC=14403.92
## Price ~ Liter + Cruise + Mileage
##
## Df Sum of Sq RSS AIC
## + Leather 1 1.4801e+09 4.6544e+10 14381
## + Sound 1 4.3503e+08 4.7589e+10 14399
## <none> 4.8024e+10 14404
## - Mileage 1 1.5597e+09 4.9584e+10 14428
## - Cruise 1 4.6132e+09 5.2638e+10 14476
## - Liter 1 1.4014e+10 6.2038e+10 14608
##
## Step: AIC=14380.75
## Price ~ Liter + Cruise + Mileage + Leather
##
## Df Sum of Sq RSS AIC
## + Sound 1 7.6266e+08 4.5782e+10 14370
## <none> 4.6544e+10 14381
## - Leather 1 1.4801e+09 4.8024e+10 14404
## - Mileage 1 1.5797e+09 4.8124e+10 14406
## - Cruise 1 5.1573e+09 5.1702e+10 14463
## - Liter 1 1.2706e+10 5.9250e+10 14573
##
## Step: AIC=14369.47
## Price ~ Liter + Cruise + Mileage + Leather + Sound
##
## Df Sum of Sq RSS AIC
## <none> 4.5782e+10 14370
## - Sound 1 7.6266e+08 4.6544e+10 14381
## - Mileage 1 1.6371e+09 4.7419e+10 14396
## - Leather 1 1.8077e+09 4.7589e+10 14399
## - Cruise 1 4.9334e+09 5.0715e+10 14450
## - Liter 1 1.2327e+10 5.8108e+10 14559
The \(AIC\) statistic is a model-selection criterion: it favors the model that explains the data best while containing the fewest free parameters.
\[ AIC=2k+n\ln\frac{RSS}{n} \]
Where \(k\) is the number of parameters, \(n\) is the number of observations and \(RSS\) is the residual sum of squares.
Therefore, as the formula shows, we should prefer the model with the lowest AIC value.
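The AIC values printed by stepAIC can be reproduced from this formula via extractAIC(), which stepAIC uses internally (note that R's AIC() function differs from this form by an additive constant); a sketch, assuming the selected fit is stored as cars.selected.model:

```r
# Recompute AIC = 2k + n*log(RSS/n) by hand and compare with extractAIC()
rss <- sum(residuals(cars.selected.model)^2)  # residual sum of squares
n   <- nobs(cars.selected.model)              # number of observations
k   <- length(coef(cars.selected.model))      # number of parameters

c(manual     = 2 * k + n * log(rss / n),
  extractAIC = extractAIC(cars.selected.model)[2])  # the two should agree
```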
##
## Call:
## lm(formula = Price ~ Liter + Cruise + Mileage + Leather + Sound,
## data = cars_data_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12328 -5660 -1700 4546 38203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.318e+03 1.184e+03 6.182 1.01e-09 ***
## Liter 3.864e+03 2.636e+02 14.658 < 2e-16 ***
## Cruise 6.251e+03 6.741e+02 9.273 < 2e-16 ***
## Mileage -1.744e-01 3.265e-02 -5.342 1.20e-07 ***
## Leather 3.436e+03 6.121e+02 5.613 2.74e-08 ***
## Sound -2.126e+03 5.832e+02 -3.646 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7574 on 798 degrees of freedom
## Multiple R-squared: 0.4165, Adjusted R-squared: 0.4129
## F-statistic: 113.9 on 5 and 798 DF, p-value: < 2.2e-16
As we can see, the \(F\)-statistic for the hypothesis \(H_0:\beta_1=\beta_2=\cdots=\beta_5=0\) is \(113.9\), and its p-value is less than \(0.05\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable (Price) and the independent variables (Liter, Cruise, Mileage, Leather, Sound). We can conclude that the linear regression model is significant overall, and all coefficients of the model are individually significant.
\(H_0\): the variance of the residuals is constant (homoscedasticity).
\(H_1\): the variance of the residuals is not constant (heteroscedasticity).
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 105.567, Df = 1, p = < 2.22e-16
As we can see, the p-value is less than \(0.05\), so we reject the null hypothesis that the variance of the residuals is constant; the residuals are heteroscedastic.
\(H_0\): \(\rho=0\)
\(H_1\): \(\rho\ne0\)
## lag Autocorrelation D-W Statistic p-value
## 1 0.8479070 0.3021227 0
## 2 0.8269908 0.3433033 0
## 3 0.7839602 0.4285357 0
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags are less than \(0.05\), so we reject the null hypothesis that the autocorrelation is zero: the residuals are autocorrelated.
To read “txt” files, I use the R function read.table().
cigarettes_data <- read.table('../datasets/cigarettes.dat.txt', header = FALSE,
                              dec = '.', na.strings = 'NA')
For dataset cigarettes, variable descriptions are as follows:
Model: \(y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3\)
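The summary below corresponds to fitting this model with lm(); a sketch, where (per the discussion that follows) V2 = tar, V3 = nicotine, V4 = weight and V5 = carbon monoxide, and the model name cigarettes.model is an assumption for illustration:

```r
# Regress carbon monoxide (V5) on tar (V2), nicotine (V3) and weight (V4)
cigarettes.model <- lm(V5 ~ V2 + V3 + V4, data = cigarettes_data)
summary(cigarettes.model)
```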
##
## Call:
## lm(formula = V5 ~ V2 + V3 + V4, data = cigarettes_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89261 -0.78269 0.00428 0.92891 2.45082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2022 3.4618 0.925 0.365464
## V2 0.9626 0.2422 3.974 0.000692 ***
## V3 -2.6317 3.9006 -0.675 0.507234
## V4 -0.1305 3.8853 -0.034 0.973527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.446 on 21 degrees of freedom
## Multiple R-squared: 0.9186, Adjusted R-squared: 0.907
## F-statistic: 78.98 on 3 and 21 DF, p-value: 1.329e-11
In general
As we can see, the \(F\)-statistic for the hypothesis \(H_0:\beta_1=\beta_2=\beta_3=0\) is \(78.98\), and its p-value is less than \(0.05\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable (carbon monoxide) and the independent variables (tar, nicotine, weight). We can conclude that the linear regression model is significant overall.
For Each Coefficient
Since only the p-value of the tar coefficient (V2) is less than \(0.05\), we reject the null hypothesis of a zero coefficient for tar only: the tar coefficient is significant, while the intercept and the coefficients for nicotine and weight are not.
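The per-coefficient p-values can also be extracted programmatically from the summary table; a sketch, assuming the fit is stored as cigarettes.model:

```r
# Column "Pr(>|t|)" of the coefficient table holds the two-sided p-values
pvals <- summary(cigarettes.model)$coefficients[, "Pr(>|t|)"]
pvals[pvals < 0.05]  # coefficients significant at the 5% level
```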
\(H_0\): the variance of the residuals is constant (homoscedasticity).
\(H_1\): the variance of the residuals is not constant (heteroscedasticity).
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.562259, Df = 1, p = 0.10944
As we can see, the p-value is larger than \(0.05\), so we fail to reject the null hypothesis that the variance of the residuals is constant.
\(H_0\): \(\rho=0\)
\(H_1\): \(\rho\ne0\)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.5056358 2.8604692 0.016
## 2 0.4398677 0.9209631 0.004
## 3 -0.4002760 2.3849266 0.154
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags except lag 3 are less than \(0.05\), so for lags 1 and 2 we reject the null hypothesis of zero autocorrelation; for lag 3, however, we fail to reject it.
This document was produced with knitr (Xie 2015).
Xie, Yihui. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://yihui.name/knitr/.