To read “xlsx” files, I use the R function read_excel from the readxl package.
For each patient, variable descriptions are as follows:
\[Y=X\beta+\epsilon\]
\[J(\beta)=(y-X\beta)^T(y-X\beta)\]
Minimizing the cost function yields the OLS (ordinary least squares) estimator: \[\hat{\beta}=(X^TX)^{-1}X^TY\]
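As a sanity check, the closed-form estimator can be computed directly and compared against lm(). This is a sketch on simulated data; the names X, y, and beta_true are illustrative, not the patients variables:

```r
# Sketch: closed-form OLS on simulated data, checked against lm()
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with intercept column
beta_true <- c(2, 0.8, -0.3)
y <- X %*% beta_true + rnorm(n, sd = 0.5)

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^{-1} X^T y
fit <- lm(y ~ X[, 2] + X[, 3])

max(abs(beta_hat - coef(fit)))                 # agreement up to numerical error
```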
In this case, \(X_1\) is the dependent variable and the other two are the independent variables. I then build a linear regression model.
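The fitting chunk is not echoed; judging from the Call line in the output below, it was presumably:

```r
# Presumed chunk (matches the Call line in the output below)
blood.lr <- lm(X1 ~ ., data = patients)
summary(blood.lr)
```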
##
## Call:
## lm(formula = X1 ~ ., data = patients)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## X2 0.8614 0.2482 3.470 0.00844 **
## X3 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
As the derivation shows, we cannot guarantee that the matrix \(X^TX\) is invertible. If \(X^TX\) is singular, the estimator \(\hat{\beta}\) does not exist, and if \(|X^TX|\rightarrow0\), the estimator \(\hat{\beta}\) tends to infinity.
To solve this problem (multicollinearity), we can use ridge regression.
Cost Function of Ridge Regression
\[J(\beta,\lambda)=(y-X\beta)^T(y-X\beta)+\lambda\beta^T\beta\] Minimizing the cost function yields the ridge regression estimator: \[\hat\beta(\lambda)=(X^TX+\lambda I_{k+1})^{-1}X^TY\] For any matrix \(X\) and any \(\lambda > 0\), the matrix \(X^TX+\lambda I_{k+1}\) is invertible, so the estimator always exists.
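A quick numerical illustration of the estimator and its shrinkage effect, again on simulated data rather than the patients set:

```r
# Sketch: closed-form ridge estimator vs OLS on simulated data
set.seed(2)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(2, 0.8, -0.3) + rnorm(n, sd = 0.5)
lambda <- 0.5

# (X^T X + lambda I)^{-1} is well defined for any lambda > 0
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
beta_ols   <- solve(t(X) %*% X) %*% t(X) %*% y

sum(beta_ridge^2) < sum(beta_ols^2)   # ridge shrinks the coefficient norm
```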
Cost Function of Quantile Regression
\[J(\beta,\tau)=\sum p_{\tau}(y_i-(X\beta)_i)=\sum_{i:y_i\ge(X\beta)_i}\tau(y_i-(X\beta)_i)+\sum_{i:y_i\le(X\beta)_i}(1-\tau)((X\beta)_i-y_i),\ \tau \in (0,1)\]
Quantile regression estimator: \(\hat{\beta}(\tau)\); for \(\tau=0.5\) this reduces to the LAD (least absolute deviations) estimator.
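The check function \(\rho_\tau\) underlying this cost is easy to write down directly; a minimal sketch (check_loss is an illustrative name):

```r
# Check function rho_tau: tau*u for u >= 0, (tau - 1)*u for u < 0
check_loss <- function(u, tau) ifelse(u >= 0, tau * u, (tau - 1) * u)

check_loss(2, 0.5)    # 1   (tau = 0.5: symmetric, proportional to |u|)
check_loss(-2, 0.5)   # 1
check_loss(1, 0.8)    # 0.8 (tau = 0.8 penalizes under-prediction more)
check_loss(-1, 0.8)   # 0.2
```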
Fit ridge regressions for \(\lambda\) from \(0\) to \(2\) with step \(0.001\):
## modified HKB estimator is 0
## modified L-W estimator is 0
## smallest value of GCV at 0.56
From the figure and the output of the select function, I conclude that the choice of the ridge parameter \(\lambda\) depends strongly on which selection method is used.
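The HKB, L-W, and GCV figures above are what MASS::select reports for an lm.ridge fit; assuming that is the code behind this output, the chunk would look like:

```r
# Assumed chunk: ridge trace and parameter selection via MASS
library(MASS)
blood.ridge.seq <- lm.ridge(X1 ~ ., data = patients,
                            lambda = seq(0, 2, by = 0.001))
select(blood.ridge.seq)   # prints the HKB, L-W, and GCV-minimizing values
plot(blood.ridge.seq)     # ridge trace over the lambda grid
```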
In this case, let \(\lambda=0.56\). The result is shown below.
## (Intercept)          X2          X3
##  31.9476379   0.7876384   0.3535983
Ridge parameter chosen automatically:
##
## Call:
## linearRidge(formula = X1 ~ ., data = patients)
##
##
## Coefficients:
## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept) 30.9289 NA NA NA
## X2 0.8505 24.5135 6.2148 3.944
## X3 0.3387 18.5445 6.2148 2.984
## Pr(>|t|)
## (Intercept) NA
## X2 8e-05 ***
## X3 0.00285 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Ridge parameter: 0.004737444, chosen automatically, computed using 1 PCs
##
## Degrees of freedom: model 1.917 , variance 1.84 , residual 1.993
Ridge parameter chosen manually:
##
## Call:
## linearRidge(formula = X1 ~ ., data = patients, lambda = 0.56)
##
##
## Coefficients:
## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept) 54.4895 NA NA NA
## X2 0.5914 17.0441 1.7555 9.709
## X3 0.3009 16.4736 1.7555 9.384
## Pr(>|t|)
## (Intercept) NA
## X2 <2e-16 ***
## X3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Ridge parameter: 0.56
##
## Degrees of freedom: model 0.8644 , variance 0.6107 , residual 1.118
LR
blood.predicted.lr <- predict(blood.lr,patients)
blood.compare <- cbind(actual=patients$X1,blood.predicted.lr)
show(blood.compare)
## actual blood.predicted.lr
## 1 132 133.7183
## 2 143 143.4317
## 3 153 153.6716
## 4 162 164.5327
## 5 154 151.7570
## 6 168 168.4078
## 7 137 140.4640
## 8 149 146.4939
## 9 159 156.3019
## 10 128 126.5407
## 11 166 165.6804
## [1] 0.9886923
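The 0.9886923 above is an accuracy score; the later chunks compute it as the mean of the row-wise min/max ratios between actual and predicted values, which can be wrapped as a helper (accuracy is an illustrative name):

```r
# Accuracy as used in this report: mean of row-wise min/max ratios,
# so perfect predictions score exactly 1
accuracy <- function(cmp) mean(apply(cmp, 1, min) / apply(cmp, 1, max))

cmp <- cbind(actual = c(10, 20), predicted = c(10, 20))
accuracy(cmp)    # 1
cmp2 <- cbind(actual = c(10, 20), predicted = c(5, 20))
accuracy(cmp2)   # 0.75
```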
RR with \(\lambda=0.56\)
blood.predicted.rr.m <- predict(blood.rr.m,patients)
blood.compare1 <- cbind(actual=patients$X1,blood.predicted.rr.m)
show(blood.compare1)
## actual blood.predicted.rr.m
## 1 132 137.2896
## 2 143 144.7385
## 3 153 152.4780
## 4 162 161.1408
## 5 154 151.3057
## 6 168 164.4400
## 7 137 142.9852
## 8 149 147.1247
## 9 159 155.2066
## 10 128 131.9362
## 11 166 162.3547
## [1] 0.9792692
RR with \(\lambda = 0.00473\) chosen automatically
blood.predicted.rr.a <- predict(blood.rr.a,patients)
blood.compare2 <- cbind(actual=patients$X1,blood.predicted.rr.a)
show(blood.compare2)
## actual blood.predicted.rr.a
## 1 132 133.7481
## 2 143 143.4272
## 3 153 153.6181
## 4 162 164.4789
## 5 154 151.7440
## 6 168 168.3776
## 7 137 140.5295
## 8 149 146.4830
## 9 159 156.3201
## 10 128 126.6130
## 11 166 165.6605
## [1] 0.9887473
Comparing the three results, model blood.rr.a is more accurate than the other two.
## [1] 0.9887473
We use the function rq in package quantreg, for \(\tau\) from \(0.2\) to \(0.8\) with step \(0.1\). The results are shown below.
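The fitting chunk is not echoed; consistent with the Call lines below, it was presumably:

```r
# Presumed chunk (quantreg attaches SparseM, hence the masking message below)
library(quantreg)
blood.qr <- rq(X1 ~ ., tau = 2:8/10, data = patients)
summary(blood.qr)
```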
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Call: rq(formula = X1 ~ ., tau = 2:8/10, data = patients)
##
## tau: [1] 0.2
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 41.13115 -104.93178 63.27750
## X2 1.14754 -0.20444 1.28465
## X3 0.18033 -0.03971 0.93687
##
## Call: rq(formula = X1 ~ ., tau = 2:8/10, data = patients)
##
## tau: [1] 0.3
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 27.03704 4.76042 57.92169
## X2 0.95062 -0.10568 1.31937
## X3 0.32099 0.07881 0.79637
##
## Call: rq(formula = X1 ~ ., tau = 2:8/10, data = patients)
##
## tau: [1] 0.4
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 29.30435 1.80309 68.14027
## X2 0.79710 0.01156 1.31406
## X3 0.36232 0.16714 0.78413
##
## Call: rq(formula = X1 ~ ., tau = 2:8/10, data = patients)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 27.36782 0.88287 61.27839
## X2 0.64368 0.09425 1.53764
## X3 0.42529 0.00393 0.72892
##
## Call: rq(formula = X1 ~ ., tau = 2:8/10, data = patients)
##
## tau: [1] 0.6
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 27.36782 12.20839 59.54008
## X2 0.64368 0.20217 1.67670
## X3 0.42529 -0.12433 0.58920
##
## Call: rq(formula = X1 ~ ., tau = 2:8/10, data = patients)
##
## tau: [1] 0.7
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 15.18654 -13.49924 62.10933
## X2 0.49847 0.07013 1.59303
## X3 0.53823 -0.05272 0.60927
##
## Call: rq(formula = X1 ~ ., tau = 2:8/10, data = patients)
##
## tau: [1] 0.8
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 26.37870 -90.45791 56.34419
## X2 0.83432 -0.30389 1.70043
## X3 0.37870 -0.15552 0.77708
blood.predicted.qr <- predict(blood.qr,patients)
blood.compare.ql = cbind(actual=patients$X1,blood.predicted.lr,blood.predicted.qr)
show(blood.compare.ql)
## actual blood.predicted.lr tau= 0.2 tau= 0.3 tau= 0.4 tau= 0.5 tau= 0.6
## 1 132 133.7183 132.0000 132.0000 133.4348 134.4138 134.4138
## 2 143 143.4317 142.0164 142.1852 143.0000 143.5977 143.5977
## 3 153 153.6716 153.0000 153.0000 153.0000 153.0000 153.0000
## 4 162 164.5327 162.9508 164.1605 163.9420 164.0920 164.0920
## 5 154 151.7570 149.9180 150.7901 151.3333 151.9195 151.9195
## 6 168 168.4078 165.7213 168.0000 168.0000 168.5632 168.5632
## 7 137 140.4640 137.0000 138.7160 140.4638 142.0805 142.0805
## 8 149 146.4939 145.0328 145.3704 146.0435 146.5862 146.5862
## 9 159 156.3019 153.0492 155.2716 156.1159 157.2414 157.2414
## 10 128 126.5407 124.0328 124.3704 126.4783 128.0000 128.0000
## 11 166 165.6804 162.8852 165.1358 165.3188 166.0000 166.0000
## tau= 0.7 tau= 0.8
## 1 134.2202 135.2781
## 2 143.6300 145.2840
## 3 153.0000 155.7456
## 4 165.1407 167.1893
## 5 152.5810 154.0000
## 6 170.4832 171.4320
## 7 143.2905 142.6272
## 8 146.7798 148.4675
## 9 159.0000 159.0000
## 10 128.0000 128.0000
## 11 167.8716 168.6272
To read “xlsx” files, I use the R function read_excel from the readxl package.
For dataset cars, variable descriptions are as follows:
cars_data_subset <- subset(cars_data,
                           select = c(Price, Mileage, Liter, Cruise, Sound, Leather))
cars.lr <- lm(Price ~ ., data = cars_data_subset)
summary(cars.lr)
##
## Call:
## lm(formula = Price ~ ., data = cars_data_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12328 -5660 -1700 4546 38203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.318e+03 1.184e+03 6.182 1.01e-09 ***
## Mileage -1.744e-01 3.265e-02 -5.342 1.20e-07 ***
## Liter 3.864e+03 2.636e+02 14.658 < 2e-16 ***
## Cruise 6.251e+03 6.741e+02 9.273 < 2e-16 ***
## Sound -2.126e+03 5.832e+02 -3.646 0.000284 ***
## Leather 3.436e+03 6.121e+02 5.613 2.74e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7574 on 798 degrees of freedom
## Multiple R-squared: 0.4165, Adjusted R-squared: 0.4129
## F-statistic: 113.9 on 5 and 798 DF, p-value: < 2.2e-16
As in Task 1, I use the ridge package.
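The fitting chunk is again not shown; matching the Call line in the output below, presumably:

```r
# Presumed chunk (matches the Call line below)
library(ridge)
cars.rr <- linearRidge(Price ~ ., data = cars_data_subset)
summary(cars.rr)
```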
##
## Call:
## linearRidge(formula = Price ~ ., data = cars_data_subset)
##
##
## Coefficients:
## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept) 7.430e+03 NA NA NA
## Mileage -1.724e-01 -4.005e+04 7.492e+03 5.346
## Liter 3.828e+03 1.199e+05 8.125e+03 14.761
## Cruise 6.213e+03 7.602e+04 8.115e+03 9.368
## Sound -2.104e+03 -2.785e+04 7.621e+03 3.655
## Leather 3.398e+03 4.308e+04 7.657e+03 5.626
## Pr(>|t|)
## (Intercept) NA
## Mileage 9.01e-08 ***
## Liter < 2e-16 ***
## Cruise < 2e-16 ***
## Sound 0.000258 ***
## Leather 1.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Ridge parameter: 0.01153487, chosen automatically, computed using 5 PCs
##
## Degrees of freedom: model 4.938 , variance 4.876 , residual 4.999
LR
cars.predicted.lr <- predict(cars.lr,cars_data_subset)
cars.compare <- cbind(actual=cars_data_subset$Price,cars.predicted.lr)
show(head(cars.compare,5))
## actual cars.predicted.lr
## 1 17314.10 25423.85
## 2 17542.04 21828.63
## 3 16218.85 21120.36
## 4 16336.91 22698.11
## 5 16339.17 25525.24
## [1] 0.7791548
RR with \(\lambda = 0.012\) chosen automatically
cars.predicted.rr <- predict(cars.rr,cars_data_subset)
cars.compare1 <- cbind(actual=cars_data_subset$Price,cars.predicted.rr)
show(head(cars.compare1,5))
## actual cars.predicted.rr
## 1 17314.10 25385.84
## 2 17542.04 21830.27
## 3 16218.85 21130.06
## 4 16336.91 22691.73
## 5 16339.17 25487.94
## [1] 0.7794884
Comparing the two results, model cars.rr is slightly more accurate than cars.lr.
## [1] 0.7794884
We use the function rq in package quantreg, for \(\tau\) from \(0.2\) to \(0.8\) with step \(0.1\). The results are shown below.
library(SparseM)
library(quantreg)
cars.qr <- rq(Price ~ ., tau = 2:8/10, data = cars_data_subset)
summary(cars.qr)
##
## Call: rq(formula = Price ~ ., tau = 2:8/10, data = cars_data_subset)
##
## tau: [1] 0.2
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 6942.50069 5459.65065 7296.08236
## Mileage -0.11506 -0.13654 -0.09319
## Liter 3594.64481 3378.81399 3868.56366
## Cruise 542.71312 333.14546 928.59901
## Sound -769.24211 -1145.59397 -171.74786
## Leather 509.06156 228.23340 909.23020
##
## Call: rq(formula = Price ~ ., tau = 2:8/10, data = cars_data_subset)
##
## tau: [1] 0.3
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 6913.54754 6300.53997 7468.35597
## Mileage -0.12293 -0.14090 -0.10745
## Liter 3793.26932 3586.63649 3941.50503
## Cruise 924.61467 665.72890 1134.85978
## Sound -971.47254 -1297.00791 -644.73647
## Leather 603.48569 362.15424 1019.67261
##
## Call: rq(formula = Price ~ ., tau = 2:8/10, data = cars_data_subset)
##
## tau: [1] 0.4
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 6640.58054 6204.93794 7378.70988
## Mileage -0.12125 -0.14300 -0.10556
## Liter 4025.16643 3712.31439 4251.18936
## Cruise 1195.48266 915.01362 1573.66440
## Sound -1161.82742 -1840.88750 -748.55659
## Leather 779.70216 250.53125 1356.29465
##
## Call: rq(formula = Price ~ ., tau = 2:8/10, data = cars_data_subset)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 6670.48365 5414.67440 8472.41054
## Mileage -0.12289 -0.16614 -0.09624
## Liter 4383.62510 3561.08396 4721.63290
## Cruise 1360.78031 1022.04912 2674.79926
## Sound -1760.28023 -2525.58686 -886.08973
## Leather 1055.59005 391.66341 1734.42700
##
## Call: rq(formula = Price ~ ., tau = 2:8/10, data = cars_data_subset)
##
## tau: [1] 0.6
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 9471.32531 7725.86917 11763.82545
## Mileage -0.14851 -0.20021 -0.10058
## Liter 3429.83459 2861.32636 4122.94348
## Cruise 5296.83351 3888.21959 7159.46934
## Sound -4090.53479 -6138.48532 -2905.45510
## Leather 3516.73942 2017.71036 4711.76740
##
## Call: rq(formula = Price ~ ., tau = 2:8/10, data = cars_data_subset)
##
## tau: [1] 0.7
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 10264.78566 8537.39228 12231.31142
## Mileage -0.14267 -0.19072 -0.09341
## Liter 2572.85577 2236.60727 3136.25310
## Cruise 11640.27168 10590.45354 12161.87968
## Sound -3517.99528 -4887.65352 -2767.39458
## Leather 4572.27575 3018.65572 6120.76708
##
## Call: rq(formula = Price ~ ., tau = 2:8/10, data = cars_data_subset)
##
## tau: [1] 0.8
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 11194.34749 10027.85473 13410.30640
## Mileage -0.15380 -0.20443 -0.09938
## Liter 2947.35812 2429.84071 3456.85731
## Cruise 13993.36377 13263.49899 15233.27569
## Sound -2702.86035 -4105.50144 -1658.69236
## Leather 2490.54804 356.93850 3669.42748
cars.predicted.qr <- predict(cars.qr,cars_data_subset)
cars.compare.ql = cbind(actual=cars_data_subset$Price,cars.predicted.lr,cars.predicted.qr)
head(cars.compare.ql,5)
## actual cars.predicted.lr tau= 0.2 tau= 0.3 tau= 0.4 tau= 0.5 tau= 0.6
## 1 17314.10 25423.85 17422.54 18218.67 18935.15 19905.50 23605.93
## 2 17542.04 21828.63 16808.32 17502.83 18044.62 18737.59 19953.46
## 3 16218.85 21120.36 16341.07 17003.59 17552.22 18238.51 19350.35
## 4 16336.91 22698.11 16748.34 17588.31 18332.59 19612.17 22973.67
## 5 16339.17 25525.24 16855.85 17762.76 18689.13 20238.86 25972.10
## tau= 0.7 tau= 0.8
## 1 29762.30 32847.80
## 2 25059.63 30216.68
## 3 24480.25 29592.08
## 4 27549.40 31811.08
## 5 31623.76 33764.86
To read “txt” files, I use the R function read.table().
cigarettes_data <- read.table('../datasets/cigarettes.dat.txt', header = FALSE,
                              dec = '.', na.strings = 'NA')
For dataset cigarettes, variable descriptions are as follows:
##
## Call:
## lm(formula = V5 ~ V2 + V3 + V4, data = cigarettes_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89261 -0.78269 0.00428 0.92891 2.45082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2022 3.4618 0.925 0.365464
## V2 0.9626 0.2422 3.974 0.000692 ***
## V3 -2.6317 3.9006 -0.675 0.507234
## V4 -0.1305 3.8853 -0.034 0.973527
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.446 on 21 degrees of freedom
## Multiple R-squared: 0.9186, Adjusted R-squared: 0.907
## F-statistic: 78.98 on 3 and 21 DF, p-value: 1.329e-11
As in Task 1, I use the ridge package.
##
## Call:
## linearRidge(formula = V5 ~ V2 + V3 + V4, data = cigarettes_data)
##
##
## Coefficients:
## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept) 2.735032 NA NA NA
## V2 0.703002 19.513000 3.731162 5.230
## V3 1.382849 2.398586 3.751193 0.639
## V4 -0.007041 -0.003026 1.619055 0.002
## Pr(>|t|)
## (Intercept) NA
## V2 1.7e-07 ***
## V3 0.523
## V4 0.999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Ridge parameter: 0.0193945, chosen automatically, computed using 2 PCs
##
## Degrees of freedom: model 2.508 , variance 2.223 , residual 2.793
LR
cigarettes.predicted.lr <- predict(cigarettes.lr,cigarettes_data)
cigarettes.compare <- cbind(actual=cigarettes_data$V5,cigarettes.predicted.lr)
show(head(cigarettes.compare,5))
## actual cigarettes.predicted.lr
## 1 13.6 14.382689
## 2 16.6 15.671090
## 3 23.5 26.392608
## 4 10.2 9.018481
## 5 5.4 5.972616
cigarettes.accuracy <- mean(apply(cigarettes.compare,1,min)/
apply(cigarettes.compare,1,max))
cigarettes.accuracy
## [1] 0.9026594
RR with \(\lambda = 0.019\) chosen automatically
cigarettes.predicted.rr <- predict(cigarettes.rr,cigarettes_data)
cigarettes.compare1 <- cbind(actual=cigarettes_data$V5,cigarettes.predicted.rr)
show(head(cigarettes.compare1,5))
## actual cigarettes.predicted.rr
## 1 13.6 13.829671
## 2 16.6 15.441180
## 3 23.5 26.483468
## 4 10.2 9.279022
## 5 5.4 6.163817
cigarettes.accuracy1 <- mean(apply(cigarettes.compare1,1,min)/
apply(cigarettes.compare1,1,max))
cigarettes.accuracy1
## [1] 0.8974755
Comparing the two results, model cigarettes.lr is more accurate than cigarettes.rr; here ridge regularization does not improve the fit.
## [1] 0.9026594
We use the function rq in package quantreg, for \(\tau\) from \(0.2\) to \(0.8\) with step \(0.1\). The results are shown below.
library(SparseM)
library(quantreg)
cigarettes.qr <- rq(V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
summary(cigarettes.qr)
##
## Call: rq(formula = V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
##
## tau: [1] 0.2
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 3.65249 -4.69097 14.76883
## V2 0.54733 0.53517 1.31290
## V3 4.61991 -8.31325 5.43916
## V4 -2.93206 -16.03331 14.87044
##
## Call: rq(formula = V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
##
## tau: [1] 0.3
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 3.91020 -7.49230 12.85782
## V2 0.84280 0.54215 0.98870
## V3 -0.59492 -3.50732 5.43159
## V4 -1.82595 -10.37780 12.09846
##
## Call: rq(formula = V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
##
## tau: [1] 0.4
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 2.46006 -7.72342 10.93642
## V2 0.81179 0.73404 1.44977
## V3 0.41790 -16.01431 3.65981
## V4 -0.67556 -6.87898 11.40989
##
## Call: rq(formula = V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 1.84400 -5.88455 12.93210
## V2 1.00921 0.51541 1.57522
## V3 -2.37725 -13.95511 6.75901
## V4 0.39012 -8.28515 8.31436
##
## Call: rq(formula = V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
##
## tau: [1] 0.6
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 0.84449 -9.97981 13.63223
## V2 1.17639 0.33371 1.55546
## V3 -5.34000 -13.32838 9.46835
## V4 2.37129 -9.99074 11.59710
##
## Call: rq(formula = V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
##
## tau: [1] 0.7
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) 0.19785 -8.38333 17.38390
## V2 1.38739 0.26367 1.42203
## V3 -9.08348 -9.41974 9.95496
## V4 4.28449 -13.39838 11.08982
##
## Call: rq(formula = V5 ~ V2 + V3 + V4, tau = 2:8/10, data = cigarettes_data)
##
## tau: [1] 0.8
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) -4.90728 -8.97867 13.50917
## V2 0.93496 0.27969 1.49528
## V3 -1.14784 -10.40564 9.99748
## V4 8.52023 -8.66037 12.21889
cigarettes.predicted.qr <- predict(cigarettes.qr,cigarettes_data)
cigarettes.compare.ql = cbind(actual=cigarettes_data$V5,
cigarettes.predicted.lr,cigarettes.predicted.qr)
head(cigarettes.compare.ql,5)
## actual cigarettes.predicted.lr tau= 0.2 tau= 0.3 tau= 0.4 tau= 0.5
## 1 13.6 14.382689 12.454072 13.482943 13.600000 14.413823
## 2 16.6 15.671090 14.099861 14.767163 15.152674 15.898202
## 3 23.5 26.392608 25.925627 25.690724 26.712576 27.547151
## 4 10.2 9.018481 8.405556 8.559524 8.607416 8.686963
## 5 5.4 5.972616 4.970212 5.400000 5.316325 5.400000
## tau= 0.6 tau= 0.7 tau= 0.8
## 1 15.175579 16.169752 15.683453
## 2 16.600000 17.453962 18.154747
## 3 27.823183 28.094011 30.550383
## 4 8.878342 9.187039 9.710097
## 5 5.775388 6.306739 6.528752
This report was generated with knitr, designed by Xie (2015).
Xie, Yihui. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://yihui.name/knitr/.