To read the “txt” files, I use the R function read.table().
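As a minimal sketch of this step, the snippet below writes a few whitespace-separated rows to a temporary file and reads them back with read.table(). The file contents and the name toy_data are illustrative assumptions, not the original data file. With header = FALSE, R assigns the default column names V1, V2, ..., which are the names used throughout this report.

```r
# Hypothetical stand-in for the real text file: four whitespace-separated columns.
tmp <- tempfile(fileext = ".txt")
writeLines(c("5 1 3837 5", "104 1 3334 64", "118 2 3554 78"), tmp)

# header = FALSE makes read.table() assign the default names V1, V2, ...
toy_data <- read.table(tmp, header = FALSE)
str(toy_data)

# Remove the temporary file again.
unlink(tmp)
```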
For the babyboom dataset, the variable descriptions are as follows:
I use the subset() function to divide the babyboom data into two subsets: one for girls and one for boys.
baby_girls = subset(babyboom_data,V2==1,select = c(V1,V2,V3,V4))
baby_boys = subset(babyboom_data,V2==2,select = c(V1,V2,V3,V4))
The table below shows the first rows of baby_boys.
Row | V1 | V2 | V3 | V4 |
---|---|---|---|---|
3 | 118 | 2 | 3554 | 78 |
4 | 155 | 2 | 3838 | 115 |
5 | 257 | 2 | 3625 | 177 |
8 | 422 | 2 | 2846 | 262 |
9 | 431 | 2 | 3166 | 271 |
10 | 708 | 2 | 3520 | 428 |
I use the R function summary() to summarize the data. The output below shows the summaries of baby_boys and baby_girls, in that order.
## V1 V2 V3 V4
## Min. : 118.0 Min. :2 Min. :2121 Min. : 78.0
## 1st Qu.: 754.2 1st Qu.:2 1st Qu.:3198 1st Qu.: 464.2
## Median :1409.5 Median :2 Median :3404 Median : 849.5
## Mean :1311.9 Mean :2 Mean :3375 Mean : 799.6
## 3rd Qu.:1937.5 3rd Qu.:2 3rd Qu.:3629 3rd Qu.:1177.5
## Max. :2123.0 Max. :2 Max. :4162 Max. :1283.0
## V1 V2 V3 V4
## Min. : 5.0 Min. :1 Min. :1745 Min. : 5.0
## 1st Qu.: 837.8 1st Qu.:1 1st Qu.:2711 1st Qu.: 507.8
## Median :1406.5 Median :1 Median :3381 Median : 846.5
## Mean :1273.0 Mean :1 Mean :3132 Mean : 773.0
## 3rd Qu.:1804.2 3rd Qu.:1 3rd Qu.:3517 3rd Qu.:1094.2
## Max. :2355.0 Max. :1 Max. :3866 Max. :1435.0
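The two summaries can also be produced side by side without creating separate subsets. The following is a sketch on a small made-up data frame (toy and its values are hypothetical; only the column layout, with V2 as the sex code and V3 as the weight, follows the code above), using base R's tapply().

```r
# Made-up data with the same layout as babyboom_data (V2: 1 = girl, 2 = boy).
toy <- data.frame(V2 = c(1, 1, 2, 2, 2),
                  V3 = c(3000, 3400, 3500, 3600, 2100))

# Apply summary() to the weights within each sex group.
by_sex <- tapply(toy$V3, toy$V2, summary)
by_sex

# Group means only.
group_means <- tapply(toy$V3, toy$V2, mean)
group_means
```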
The figure below is the histogram of the number of births (boys) per hour after midnight.
hist(baby_boys$V4,xlab = "Minutes after 00:00",ylab = 'Frequency',
main = "Histogram of number of births(Boys)",breaks = seq(0,1440,by=60))
The next figure is the histogram of the number of births (girls) per hour after midnight.
hist(baby_girls$V4,xlab = "Minutes after 00:00",ylab = 'Frequency',
main = "Histogram of number of births(Girls)",breaks = seq(0,1440,by=60))
The last figure is the histogram of the number of births (boys and girls combined) per hour after midnight.
hist(babyboom_data$V4,xlab = "Minutes after 00:00",ylab = 'Frequency',
main = "Histogram of number of births(Boys and Girls)",breaks = seq(0,1440,by=60))
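With breaks = seq(0, 1440, by = 60), each histogram bar counts the births in one hour of the day. As a sketch of the same binning in tabular form, the snippet below uses cut() and table() on a few made-up minute values (minutes is a hypothetical vector, not the real data).

```r
# Made-up birth times in minutes after midnight.
minutes <- c(5, 64, 78, 115, 177, 245, 247, 262)

# Same hourly bins as the histograms: [0,60], (60,120], ..., (1380,1440].
# include.lowest = TRUE mirrors the default behaviour of hist().
hour_bins <- cut(minutes, breaks = seq(0, 1440, by = 60), include.lowest = TRUE)
births_per_hour <- table(hour_bins)

# Show only the first six hours.
head(births_per_hour, 6)
```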
The figures below show the histograms of weight for boys and girls, respectively.
par(mfrow=c(1,2))
hist(baby_boys$V3,xlab = "Weight",ylab = 'Frequency',
main = "Histogram of Weight(Boys)")
hist(baby_girls$V3,xlab = "Weight",ylab = 'Frequency',
main = "Histogram of Weight(Girls)")
We can use a box plot to detect outliers. The box plots of the babies' birth weights are shown below, with girls on the left and boys on the right.
According to the figure above, one boy's weight is clearly below the lower whisker, so I treat this value as an outlier; this boy may have been born prematurely. The median weight of the boys is higher than that of the girls. Finally, the interquartile range of the girls' weights is larger than that of the boys, so the girls' weights are more dispersed.
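The box-plot code is not shown above. A possible sketch, assuming the same column layout (V2 as the sex code, V3 as the weight), is given below on made-up weights; toy and its values are hypothetical. boxplot.stats() applies the usual 1.5 × IQR rule that the plot uses to flag outliers.

```r
# Made-up weights; the layout mirrors babyboom_data (V2: 1 = girl, 2 = boy).
toy <- data.frame(V2 = c(rep(1, 5), rep(2, 6)),
                  V3 = c(2700, 3100, 3400, 3500, 3800,
                         2100, 3200, 3400, 3450, 3600, 3700))

# Side-by-side box plots, girls on the left and boys on the right.
boxplot(V3 ~ V2, data = toy, names = c("Girls", "Boys"),
        ylab = "Weight", main = "Box plot of birth weight")

# Points more than 1.5 * IQR beyond the hinges are reported as outliers.
boy_outliers <- boxplot.stats(toy$V3[toy$V2 == 2])$out
boy_outliers
```

In this toy example only the lightest boy falls outside the whiskers, which is the same pattern described for the real data.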
This dataset contains both numeric and string columns, so I use the read.csv() function to read the data.
For the airport dataset, the variable descriptions are as follows:
I use the command below to select the numerical variables.
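The original selection command is not reproduced here. One common way to keep only the numeric columns, sketched on a made-up data frame (toy_airport and its values are hypothetical), is to test each column with is.numeric().

```r
# Made-up mixed-type data: V1 and V2 are text, V3 and V4 are numeric.
toy_airport <- data.frame(V1 = c("Airport A", "Airport B"),
                          V2 = c("City A", "City B"),
                          V3 = c(1200, 3400),
                          V4 = c(1250, 3380),
                          stringsAsFactors = FALSE)

# Keep the columns for which is.numeric() returns TRUE.
numeric_cols <- sapply(toy_airport, is.numeric)
toy_numeric <- toy_airport[, numeric_cols, drop = FALSE]
names(toy_numeric)
```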
I use the psych library to compute descriptive statistics.
## vars n mean sd median trimmed mad min
## V3 1 135 45702.44 56406.43 23519.00 34575.61 23729.01 1188.00
## V4 2 135 46453.73 57525.97 23906.00 35039.38 24117.45 1253.00
## V5 3 135 3139235.24 4587564.20 1254846.00 2172878.05 1428058.11 0.00
## V6 4 135 33640.65 80828.32 6192.36 13056.98 7865.55 7.95
## V7 5 135 11410.20 20510.77 2928.32 6680.76 3935.07 0.00
## max range skew kurtosis se
## V3 322430.0 321242.0 2.38 6.82 4854.69
## V4 332338.0 331085.0 2.40 6.93 4951.05
## V5 25636383.0 25636383.0 2.62 7.91 394834.66
## V6 614223.6 614215.7 4.22 21.62 6956.59
## V7 140359.4 140359.4 3.30 13.50 1765.29
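To make the skew and kurtosis columns concrete, the sketch below computes moment-based versions in base R on a made-up right-skewed vector. The helper names skew_simple and excess_kurtosis_simple are hypothetical, and the psych package may apply a slightly different small-sample convention, so the numbers need not match its output to the last digit.

```r
# Moment-based skewness: third central moment over the 1.5th power of the variance.
skew_simple <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth standardized moment minus 3
# (the theoretical value is 0 for a normal distribution).
excess_kurtosis_simple <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Made-up values with a long right tail, similar in shape to the airport variables.
x <- c(1, 2, 2, 3, 3, 3, 4, 50)
skew_simple(x)             # positive when the right tail is longer
excess_kurtosis_simple(x)
```

All five airport variables have positive skew in the table above, meaning a few very large airports pull the mean well above the median.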
The density and empirical CDF plots of these variables are shown below.
#par(mfrow=c(2,1))
dV3 = density(numerical_airport_data$V3)
dV4 = density(numerical_airport_data$V4)
plot(dV3,main = "PDF of Numerical Variables",xlab = "Value",ylab = "Density",
col="green")
lines(dV4,col="red")
legend("topright",legend=paste(c('Scheduled departures','Performed departures')),
lwd=1,col=c("green", "red"))
dV5 = density(numerical_airport_data$V5)
plot(dV5,main = "PDF of Enplaned passengers",xlab = "Value",ylab = "Density")
dV6 = density(numerical_airport_data$V6)
dV7 = density(numerical_airport_data$V7)
plot(dV7,main = "PDF of Numerical Variables",xlab = "Value",
ylab = "Density",col="blue")
lines(dV6,col="orange")
legend("topright",legend=paste(c('Enplaned revenue tons of freight',
'Enplaned revenue tons of mail')),
lwd=1,col=c("orange", "blue"))
par(mfrow=c(2,3))
plot(ecdf(numerical_airport_data$V3),main="CDF of Scheduled departures",col="green")
plot(ecdf(numerical_airport_data$V4),main="CDF of Performed departures",col="red")
plot(ecdf(numerical_airport_data$V5),main="CDF of Enplaned passengers")
plot(ecdf(numerical_airport_data$V6),main="CDF of Enplaned revenue tons of freight",
col="orange")
plot(ecdf(numerical_airport_data$V7),main="CDF of Enplaned revenue tons of mail",
col="blue")
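For reference, ecdf() returns a step function rather than a vector, so it can be evaluated at any point as well as plotted. A small sketch on made-up values (x and F_hat are hypothetical names):

```r
# Made-up sample.
x <- c(10, 20, 20, 40, 100)

# ecdf() returns a function F_hat with F_hat(t) = proportion of observations <= t.
F_hat <- ecdf(x)
F_hat(20)   # 3 of the 5 values are <= 20
F_hat(50)   # 4 of the 5 values are <= 50
```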
To read the “txt” files, I use the R function read.table().
For the euroweight dataset, the variable descriptions are as follows:
## V1 V2 V3
## Min. : 1.0 Min. :7.201 Min. :1.00
## 1st Qu.: 500.8 1st Qu.:7.498 1st Qu.:2.75
## Median :1000.5 Median :7.520 Median :4.50
## Mean :1000.5 Mean :7.521 Mean :4.50
## 3rd Qu.:1500.2 3rd Qu.:7.544 3rd Qu.:6.25
## Max. :2000.0 Max. :7.752 Max. :8.00
## vars n mean sd median trimmed mad min max range skew
## V1 1 2000 1000.50 577.49 1000.50 1000.50 741.30 1.0 2000.00 1999.00 0.00
## V2 2 2000 7.52 0.03 7.52 7.52 0.03 7.2 7.75 0.55 -0.19
## V3 3 2000 4.50 2.29 4.50 4.50 2.97 1.0 8.00 7.00 0.00
## kurtosis se
## V1 -1.20 12.91
## V2 4.42 0.00
## V3 -1.24 0.05
This dataset contains only one variable that is useful for us: “weight”.
#par(mfrow=c(2,2))
dV2 = density(euroweight_data$V2)
plot(dV2,main = "PDF of euroweight",ylim=range(0,15))
curve(dnorm(x,mean=7.52,sd=0.03),add = TRUE,col="red")
legend("topleft",legend=paste(c('density','N(7.52,0.0009)')),
lwd=1,col=c("black", "red"))
plot(ecdf(euroweight_data$V2))
curve(pnorm(x,mean=7.52,sd=0.03),add = TRUE,col="red")
legend("topleft",legend=paste(c('ECDF','CDF~N(7.52,0.0009)')),
lwd=1,col=c("black", "red"))
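According to the figures above, I conclude that the weight of euro coins does not follow a normal distribution exactly. The kurtosis of 4.42 reported above, which would be close to 0 for normally distributed data, points in the same direction.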
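A visual comparison can be complemented by a Q-Q plot and a formal test. The sketch below uses simulated heavy-tailed data as a stand-in for euroweight_data$V2 (an assumption for illustration only, not the real measurements); the same calls could be applied to the real column.

```r
set.seed(1)

# Simulated heavy-tailed values standing in for the coin weights (not the real data).
w <- 7.52 + 0.03 * rt(2000, df = 3) / sqrt(3)

# Points that leave the reference line in the tails suggest non-normality.
qqnorm(w)
qqline(w, col = "red")

# Shapiro-Wilk test (valid for 3 to 5000 observations);
# a small p-value is evidence against normality.
sw <- shapiro.test(w)
sw$p.value
```

With 2000 observations such a test is sensitive to even small departures from normality, so the plot and the test are best read together.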
The histogram and box plot of the weights are shown below.
par(mfrow=c(1,2))
hist(euroweight_data$V2,main = "Histogram of euroweight",xlab = "weight")
boxplot(euroweight_data$V2,main = "Box-plot of euroweight",
ylab="weight",xlab="euroweight")
Thanks to knitr (Xie 2015), which was used to produce this report.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://yihui.name/knitr/.