To read “xlsx” files, I use the R function read_excel from the readxl package.
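A minimal sketch of the import step (the file path and the name patients are assumptions for illustration, not taken from the original source):

```r
library(readxl)

# Hypothetical path; adjust to where the workbook actually lives
patients <- read_excel("../datasets/patients.xlsx")
str(patients)  # inspect the imported columns
```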
For each patient, variable descriptions are as follows:
Model 1:\(X_1=\beta_0+\beta_1X_2+\beta_2X_3\)
where \(X_1,X_2,X_3\) are \(n\)-dimensional vectors (similarly hereinafter).
Model 2:\(X_1=\beta_0+\beta_1X_2\)
Model 3:\(X_1=\beta_0+\beta_1X_3\)
The null hypothesis is that the coefficients associated with the variables are equal to zero. The alternative hypothesis is that the coefficients are not equal to zero (i.e., there exists a relationship between the independent variable in question and the dependent variable).
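The summaries that follow can be reproduced with lm() and summary(); a sketch, assuming the data frame is called patients with columns X1 (systolic blood pressure), X2 (age) and X3 (weight):

```r
# Fit the three models described above
model1 <- lm(patients$X1 ~ patients$X2 + patients$X3)  # Model 1
model2 <- lm(patients$X1 ~ patients$X2)                # Model 2
model3 <- lm(patients$X1 ~ patients$X3)                # Model 3

# summary() prints per-coefficient t-tests and the overall F-test
summary(model1)
```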
##
## Call:
## lm(formula = patients$X1 ~ patients$X2 + patients$X3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## patients$X2 0.8614 0.2482 3.470 0.00844 **
## patients$X3 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
As we can see, the \(F\)-statistic for the hypothesis \(H_0:\beta_1=\beta_2=0\) is \(168.8\), and its p-value is less than \(0.05\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable (systolic blood pressure) and the independent variables (age and weight). We can conclude that the linear regression model is significant overall, and all coefficients of the model are individually significant.
##
## Call:
## lm(formula = patients$X1 ~ patients$X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.742 -2.415 1.015 1.795 5.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.7055 6.4524 9.098 7.81e-06 ***
## patients$X2 1.4632 0.1023 14.300 1.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.949 on 9 degrees of freedom
## Multiple R-squared: 0.9578, Adjusted R-squared: 0.9532
## F-statistic: 204.5 on 1 and 9 DF, p-value: 1.708e-07
Since the p-values for patients$X2 (age), the intercept, and the F-statistic are all less than \(0.05\), we can conclude that the linear regression model is significant overall and that all its coefficients are significant.
##
## Call:
## lm(formula = patients$X1 ~ patients$X3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7440 -1.0412 -0.3124 2.2283 4.2560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14161 12.36447 0.092 0.928
## patients$X3 0.76384 0.06318 12.090 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.459 on 9 degrees of freedom
## Multiple R-squared: 0.942, Adjusted R-squared: 0.9356
## F-statistic: 146.2 on 1 and 9 DF, p-value: 7.227e-07
Since the p-values for patients$X3 (weight) and the F-statistic are both less than \(0.05\), we can conclude that the weight coefficient is significant and that the linear regression model is significant overall. The intercept, however, is not significant (p = 0.928).
\(H_0\): the variance of the residuals is constant (homoscedasticity).
\(H_1\): the variance of the residuals is not constant (heteroscedasticity).
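The output below matches the non-constant variance score test from the car package; a sketch, assuming the three fits are stored as model1, model2 and model3:

```r
library(car)

# Score test for non-constant error variance against the fitted values
ncvTest(model1)  # Model 1
ncvTest(model2)  # Model 2
ncvTest(model3)  # Model 3
```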
Model 1:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.1314767, Df = 1, p = 0.71691
Model 2:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2609148, Df = 1, p = 0.60949
Model 3:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.5535792, Df = 1, p = 0.45686
As we can see, all p-values are larger than \(0.05\), so we fail to reject the null hypothesis that the variance of the residuals is constant.
Autocorrelation means that the residuals satisfy the equation (with lag 1):
\[ \varepsilon_t=\rho\varepsilon_{t-1}+v_t \] where \(\{v_t\}\) are independent normally distributed \((0,\sigma_v^2)\) random variables and \(|\rho|<1\) is the autocorrelation parameter.
\(H_0\): \(\rho=0\)
\(H_1\): \(\rho\ne0\)
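The lag tables below match car::durbinWatsonTest(); a sketch, again assuming the fits are stored as model1, model2 and model3:

```r
library(car)

# Durbin-Watson test for residual autocorrelation at lags 1 through 3;
# the p-values are computed by bootstrap, so they vary slightly between runs
durbinWatsonTest(model1, max.lag = 3)  # Model 1
durbinWatsonTest(model2, max.lag = 3)  # Model 2
durbinWatsonTest(model3, max.lag = 3)  # Model 3
```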
Model 1
## lag Autocorrelation D-W Statistic p-value
## 1 0.0007935901 1.927361 0.790
## 2 -0.2755027796 2.426084 0.438
## 3 0.2953338145 1.104598 0.292
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags are larger than \(0.05\), so we fail to reject the null hypothesis that the autocorrelation is zero (i.e., the residuals show no autocorrelation).
Model 2
## lag Autocorrelation D-W Statistic p-value
## 1 0.5161502 0.8198496 0.030
## 2 0.2062185 1.3363803 0.400
## 3 0.1631983 0.9002157 0.146
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags except lag 1 are larger than \(0.05\), so for lags 2 and 3 we fail to reject the null hypothesis of zero autocorrelation; for lag 1, however, we reject it.
Model 3
## lag Autocorrelation D-W Statistic p-value
## 1 -0.2486634 2.474519 0.404
## 2 -0.2171831 2.391001 0.406
## 3 0.1663044 1.498164 0.808
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags are larger than \(0.05\), so we fail to reject the null hypothesis that the autocorrelation is zero.
To read “xlsx” files, I use the R function read_excel from the readxl package.
For dataset cars, variable descriptions are as follows:
Question: Construct different linear regression models using “step” function with Y = price as a dependent variable, and independent variables: mileage, liter, cruise, sound, leather. Verify if the coefficients of the linear regression model are significant. Choose the best model using AIC statistics.
# keep only the variables named in the question
cars_data_subset <- subset(cars_data,
                           select = c(Price, Mileage, Liter, Cruise, Sound, Leather))
library(MASS)  # provides stepAIC()
cars.full.model <- lm(Price ~ ., data = cars_data_subset)  # all predictors
cars.null.model <- lm(Price ~ 1, data = cars_data_subset)  # intercept only
cars.selected.model <- stepAIC(cars.null.model, direction = "both", trace = 1,
                               scope = list(lower = cars.null.model,
                                            upper = cars.full.model))
## Start: AIC=14792.6
## Price ~ 1
##
## Df Sum of Sq RSS AIC
## + Liter 1 2.4443e+10 5.4019e+10 14494
## + Cruise 1 1.4565e+10 6.3896e+10 14630
## + Leather 1 1.9388e+09 7.6523e+10 14774
## + Mileage 1 1.6056e+09 7.6856e+10 14778
## + Sound 1 1.2132e+09 7.7248e+10 14782
## <none> 7.8461e+10 14793
##
## Step: AIC=14494.49
## Price ~ Liter
##
## Df Sum of Sq RSS AIC
## + Cruise 1 4.4346e+09 4.9584e+10 14428
## + Mileage 1 1.3810e+09 5.2638e+10 14476
## + Leather 1 9.2996e+08 5.3089e+10 14482
## + Sound 1 6.0710e+08 5.3411e+10 14487
## <none> 5.4019e+10 14494
## - Liter 1 2.4443e+10 7.8461e+10 14793
##
## Step: AIC=14427.62
## Price ~ Liter + Cruise
##
## Df Sum of Sq RSS AIC
## + Mileage 1 1.5597e+09 4.8024e+10 14404
## + Leather 1 1.4601e+09 4.8124e+10 14406
## + Sound 1 3.9456e+08 4.9189e+10 14423
## <none> 4.9584e+10 14428
## - Cruise 1 4.4346e+09 5.4019e+10 14494
## - Liter 1 1.4312e+10 6.3896e+10 14630
##
## Step: AIC=14403.92
## Price ~ Liter + Cruise + Mileage
##
## Df Sum of Sq RSS AIC
## + Leather 1 1.4801e+09 4.6544e+10 14381
## + Sound 1 4.3503e+08 4.7589e+10 14399
## <none> 4.8024e+10 14404
## - Mileage 1 1.5597e+09 4.9584e+10 14428
## - Cruise 1 4.6132e+09 5.2638e+10 14476
## - Liter 1 1.4014e+10 6.2038e+10 14608
##
## Step: AIC=14380.75
## Price ~ Liter + Cruise + Mileage + Leather
##
## Df Sum of Sq RSS AIC
## + Sound 1 7.6266e+08 4.5782e+10 14370
## <none> 4.6544e+10 14381
## - Leather 1 1.4801e+09 4.8024e+10 14404
## - Mileage 1 1.5797e+09 4.8124e+10 14406
## - Cruise 1 5.1573e+09 5.1702e+10 14463
## - Liter 1 1.2706e+10 5.9250e+10 14573
##
## Step: AIC=14369.47
## Price ~ Liter + Cruise + Mileage + Leather + Sound
##
## Df Sum of Sq RSS AIC
## <none> 4.5782e+10 14370
## - Sound 1 7.6266e+08 4.6544e+10 14381
## - Mileage 1 1.6371e+09 4.7419e+10 14396
## - Leather 1 1.8077e+09 4.7589e+10 14399
## - Cruise 1 4.9334e+09 5.0715e+10 14450
## - Liter 1 1.2327e+10 5.8108e+10 14559
The \(AIC\) statistic is a model-selection criterion: it favors the model that explains the data best while containing the fewest free parameters.
\[ AIC=2k+n\ln\frac{RSS}{n} \]
Where \(k\) is the number of parameters, \(n\) is the number of observations and \(RSS\) is the residual sum of squares.
Therefore, as the formula shows, we should prefer the model with the lowest AIC value.
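The AIC values printed by stepAIC can be reproduced from this formula via extractAIC(), which stepAIC uses internally (note that R's AIC() function differs from this form by an additive constant); a sketch, assuming the selected fit is stored as cars.selected.model:

```r
# Recompute AIC = 2k + n*log(RSS/n) by hand and compare with extractAIC()
rss <- sum(residuals(cars.selected.model)^2)  # residual sum of squares
n   <- nobs(cars.selected.model)              # number of observations
k   <- length(coef(cars.selected.model))      # number of parameters

c(manual     = 2 * k + n * log(rss / n),
  extractAIC = extractAIC(cars.selected.model)[2])  # the two should agree
```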
##
## Call:
## lm(formula = Price ~ Liter + Cruise + Mileage + Leather + Sound,
## data = cars_data_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12328 -5660 -1700 4546 38203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.318e+03 1.184e+03 6.182 1.01e-09 ***
## Liter 3.864e+03 2.636e+02 14.658 < 2e-16 ***
## Cruise 6.251e+03 6.741e+02 9.273 < 2e-16 ***
## Mileage -1.744e-01 3.265e-02 -5.342 1.20e-07 ***
## Leather 3.436e+03 6.121e+02 5.613 2.74e-08 ***
## Sound -2.126e+03 5.832e+02 -3.646 0.000284 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7574 on 798 degrees of freedom
## Multiple R-squared: 0.4165, Adjusted R-squared: 0.4129
## F-statistic: 113.9 on 5 and 798 DF, p-value: < 2.2e-16
As we can see, the \(F\)-statistic for the hypothesis \(H_0:\beta_1=\beta_2=\cdots=\beta_5=0\) is \(113.9\), and its p-value is less than \(0.05\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable (Price) and the independent variables (Liter, Cruise, Mileage, Leather, Sound). We can conclude that the linear regression model is significant overall, and all coefficients of the model are individually significant.
\(H_0\): the variance of the residuals is constant (homoscedasticity).
\(H_1\): the variance of the residuals is not constant (heteroscedasticity).
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 105.567, Df = 1, p = < 2.22e-16
As we can see, the p-value is less than \(0.05\), so we reject the null hypothesis that the variance of the residuals is constant; the residuals are heteroscedastic.
\(H_0\): \(\rho=0\)
\(H_1\): \(\rho\ne0\)
## lag Autocorrelation D-W Statistic p-value
## 1 0.8479070 0.3021227 0
## 2 0.8269908 0.3433033 0
## 3 0.7839602 0.4285357 0
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags are less than \(0.05\), so we reject the null hypothesis that the autocorrelation is zero: the residuals are autocorrelated.
To read “txt” files, I use the R function read.table().
cigarettes_data <- read.table('../datasets/cigarettes.dat.txt', header = FALSE,
                              dec = '.', na.strings = 'NA')
For dataset cigarettes, variable descriptions are as follows:
Model: \(y=\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3\)
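The summary below corresponds to fitting this model with lm(); a sketch, where (per the discussion that follows) V2 = tar, V3 = nicotine, V4 = weight and V5 = carbon monoxide, and the model name cigarettes.model is an assumption for illustration:

```r
# Regress carbon monoxide (V5) on tar (V2), nicotine (V3) and weight (V4)
cigarettes.model <- lm(V5 ~ V2 + V3 + V4, data = cigarettes_data)
summary(cigarettes.model)
```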
##
## Call:
## lm(formula = V5 ~ V2 + V3 + V4, data = cigarettes_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89261 -0.78269 0.00428 0.92891 2.45082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2022 3.4618 0.925 0.365464
## V2 0.9626 0.2422 3.974 0.000692 ***
## V3 -2.6317 3.9006 -0.675 0.507234
## V4 -0.1305 3.8853 -0.034 0.973527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.446 on 21 degrees of freedom
## Multiple R-squared: 0.9186, Adjusted R-squared: 0.907
## F-statistic: 78.98 on 3 and 21 DF, p-value: 1.329e-11
In general
As we can see, the \(F\)-statistic for the hypothesis \(H_0:\beta_1=\beta_2=\beta_3=0\) is \(78.98\), and its p-value is less than \(0.05\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable (carbon monoxide) and the independent variables (tar, nicotine, weight). We can conclude that the linear regression model is significant overall.
For Each Coefficient
Since only the p-value of the tar coefficient (V2) is less than \(0.05\), we reject the null hypothesis of a zero coefficient for tar only: the tar coefficient is significant, while the intercept and the coefficients for nicotine and weight are not.
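The per-coefficient p-values can also be extracted programmatically from the summary table; a sketch, assuming the fit is stored as cigarettes.model:

```r
# Column "Pr(>|t|)" of the coefficient table holds the two-sided p-values
pvals <- summary(cigarettes.model)$coefficients[, "Pr(>|t|)"]
pvals[pvals < 0.05]  # coefficients significant at the 5% level
```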
\(H_0\): the variance of the residuals is constant (homoscedasticity).
\(H_1\): the variance of the residuals is not constant (heteroscedasticity).
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.562259, Df = 1, p = 0.10944
As we can see, the p-value is larger than \(0.05\), so we fail to reject the null hypothesis that the variance of the residuals is constant.
\(H_0\): \(\rho=0\)
\(H_1\): \(\rho\ne0\)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.5056358 2.8604692 0.016
## 2 0.4398677 0.9209631 0.004
## 3 -0.4002760 2.3849266 0.154
## Alternative hypothesis: rho[lag] != 0
The p-values at all lags except lag 3 are less than \(0.05\), so for lags 1 and 2 we reject the null hypothesis of zero autocorrelation; for lag 3, however, we fail to reject it.
This document was produced with knitr (Xie 2015).
Xie, Yihui. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://yihui.name/knitr/.