Babyboom

Data Read

To read the text file, I use the R function read.table().

read.table('../babyboom.dat.txt',header = FALSE,
           dec = ',',na.strings = 'NA') -> babyboom_data

For dataset babyboom, variable descriptions are as follows:

V1: Time of birth recorded on the 24-hour clock
V2: Sex of the child (1 = girl, 2 = boy)
V3: Birth weight in grams
V4: Number of minutes after midnight of each birth
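
V1 and V4 encode the same moment in two ways: reading V1 as hhmm, V4 equals 60 × hh + mm. The sketch below checks this against the six rows of baby_boys printed in this report; the helper name hhmm_to_minutes is made up for illustration and is not part of the original analysis.

```r
# Convert a 24-hour clock time stored as an integer hhmm (e.g. 118 = 01:18)
# into minutes after midnight. The helper name is hypothetical.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# V1 and V4 for the six boys listed in the report's baby_boys table
v1 <- c(118, 155, 257, 422, 431, 708)
v4 <- c(78, 115, 177, 262, 271, 428)

minutes <- hhmm_to_minutes(v1)
print(minutes)
print(all(minutes == v4))
```

The same conversion also reproduces the maxima in the summaries (2123 gives 1283 for boys, 2355 gives 1435 for girls), so either column could be derived from the other.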

Separate Data

I use the subset() function to split babyboom_data into two subsets, one for girls and one for boys.

baby_girls = subset(babyboom_data,V2==1,select = c(V1,V2,V3,V4))
baby_boys = subset(babyboom_data,V2==2,select = c(V1,V2,V3,V4))

The table below shows the first six rows of baby_boys.

knitr::kable(
  head(baby_boys, 6)
)

	V1	V2	V3	V4
3	118	2	3554	78
4	155	2	3838	115
5	257	2	3625	177
8	422	2	2846	262
9	431	2	3166	271
10	708	2	3520	428

Summary of data

The R function summary() summarizes each variable. The results below show the summaries of baby_boys and baby_girls.

summary(baby_boys)

##        V1               V2          V3             V4        
##  Min.   : 118.0   Min.   :2   Min.   :2121   Min.   :  78.0  
##  1st Qu.: 754.2   1st Qu.:2   1st Qu.:3198   1st Qu.: 464.2  
##  Median :1409.5   Median :2   Median :3404   Median : 849.5  
##  Mean   :1311.9   Mean   :2   Mean   :3375   Mean   : 799.6  
##  3rd Qu.:1937.5   3rd Qu.:2   3rd Qu.:3629   3rd Qu.:1177.5  
##  Max.   :2123.0   Max.   :2   Max.   :4162   Max.   :1283.0

summary(baby_girls)

##        V1               V2          V3             V4        
##  Min.   :   5.0   Min.   :1   Min.   :1745   Min.   :   5.0  
##  1st Qu.: 837.8   1st Qu.:1   1st Qu.:2711   1st Qu.: 507.8  
##  Median :1406.5   Median :1   Median :3381   Median : 846.5  
##  Mean   :1273.0   Mean   :1   Mean   :3132   Mean   : 773.0  
##  3rd Qu.:1804.2   3rd Qu.:1   3rd Qu.:3517   3rd Qu.:1094.2  
##  Max.   :2355.0   Max.   :1   Max.   :3866   Max.   :1435.0

Figures of data

The figure below is a histogram of the number of births (boys) per hour after midnight.

hist(baby_boys$V4,xlab = "Minutes after 00:00",ylab = 'Frequency',
     main = "Histogram of number of births (Boys)",breaks = seq(0,1440,by=60))

The next figure is the same histogram for girls.

hist(baby_girls$V4,xlab = "Minutes after 00:00",ylab = 'Frequency',
     main = "Histogram of number of births (Girls)",breaks = seq(0,1440,by=60))

The last figure of this kind is the histogram for all births, boys and girls together.

hist(babyboom_data$V4,xlab = "Minutes after 00:00",ylab = 'Frequency',
     main = "Histogram of number of births (all)",breaks = seq(0,1440,by=60))
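
The breaks argument seq(0,1440,by=60) produces 25 edges and therefore 24 one-hour bins. As a minimal sketch of what hist() counts, the block below uses only the six V4 values visible in the baby_boys table, not the full dataset, and asks for the counts without drawing a plot.

```r
# Hourly bin edges: 0, 60, ..., 1440 gives 25 edges and 24 one-hour bins
breaks <- seq(0, 1440, by = 60)

# V4 (minutes after midnight) for the six boys shown in the baby_boys table
v4 <- c(78, 115, 177, 262, 271, 428)

# plot = FALSE returns the bin counts without drawing anything
h <- hist(v4, breaks = breaks, plot = FALSE)

# Label each bin by the hour in which it starts (0 to 23)
counts <- setNames(h$counts, 0:23)
print(counts[counts > 0])
```

By default the bins are closed on the right, so a birth at exactly minute 60 would be counted in the first bin; passing right = FALSE to hist() switches to left-closed bins if that matters.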

The figures below show histograms of birth weight for boys and girls, respectively.

par(mfrow=c(1,2))
hist(baby_boys$V3,xlab = "Weight (g)",ylab = 'Frequency',
     main = "Histogram of Weight (Boys)")
hist(baby_girls$V3,xlab = "Weight (g)",ylab = 'Frequency',
     main = "Histogram of Weight (Girls)")

A box plot can be used to detect outliers. The box plots of birth weight are shown below, with girls on the left and boys on the right.

boxplot(baby_girls$V3,baby_boys$V3,names = c("Girls","Boys"),
        ylab = "Birth weight (g)")

Conclusion of Data - babyboom

According to the box plot above, one boy's weight clearly falls below the lower whisker limit, so I conclude that this value is an outlier; this boy may have been born prematurely. The median weight of the boys is higher than that of the girls. Finally, the interquartile range of the girls' weights is larger than that of the boys', so the girls' weights are more dispersed.
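
These conclusions can be checked roughly from the summary() output alone. The sketch below applies the usual 1.5 × IQR rule to the printed quartiles of V3; the helper name fences is made up for illustration, the inputs are rounded, and boxplot() itself works from hinges that can differ slightly from these quartiles, so the numbers are approximate.

```r
# Tukey's 1.5 * IQR rule applied to the quartiles printed by summary().
# Inputs are the rounded values from the report, so results are approximate.
fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(iqr = iqr, lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

boys  <- fences(q1 = 3198, q3 = 3629)
girls <- fences(q1 = 2711, q3 = 3517)
print(rbind(boys, girls))

# Is the lightest baby in each group (Min. of V3) below the lower fence?
print(c(boys = 2121 < boys[["lower"]], girls = 1745 < girls[["lower"]]))
```

With these inputs the boys' IQR is 431 against 806 for the girls, and only the boys' minimum of 2121 g lies below its lower fence (about 2551.5 g), which agrees with the reading of the box plot.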

Airport

Data Read

This dataset contains strings as well as numbers, so I use the read.csv() function to read it.

read.csv('../airport.csv',header = FALSE,dec = '.',na.strings = 'NA') -> airport_data

For dataset airport, variable descriptions are as follows:

V1: Airport
V2: City
V3: Scheduled departures
V4: Performed departures
V5: Enplaned passengers
V6: Enplaned revenue tons of freight
V7: Enplaned revenue tons of mail

I use the command below to select the numerical variables.

airport_data[c(3:7)]->numerical_airport_data

Descriptive Statistics

I use describe() from the psych package to compute descriptive statistics.

library(psych)
describe(numerical_airport_data)

##    vars   n       mean         sd     median    trimmed        mad     min
## V3    1 135   45702.44   56406.43   23519.00   34575.61   23729.01 1188.00
## V4    2 135   46453.73   57525.97   23906.00   35039.38   24117.45 1253.00
## V5    3 135 3139235.24 4587564.20 1254846.00 2172878.05 1428058.11    0.00
## V6    4 135   33640.65   80828.32    6192.36   13056.98    7865.55    7.95
## V7    5 135   11410.20   20510.77    2928.32    6680.76    3935.07    0.00
##           max      range skew kurtosis        se
## V3   322430.0   321242.0 2.38     6.82   4854.69
## V4   332338.0   331085.0 2.40     6.93   4951.05
## V5 25636383.0 25636383.0 2.62     7.91 394834.66
## V6   614223.6   614215.7 4.22    21.62   6956.59
## V7   140359.4   140359.4 3.30    13.50   1765.29
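
Two of the describe() columns follow directly from the others: se is sd / sqrt(n) and range is max - min. A short check for V3 (scheduled departures), using the values copied from the output above:

```r
# Values for V3 (scheduled departures) copied from the describe() output
n  <- 135
s  <- 56406.43   # sd
mn <- 1188       # min
mx <- 322430     # max

se <- s / sqrt(n)     # standard error of the mean
print(round(se, 2))
print(mx - mn)        # range
```

Both results agree with the se and range columns for V3. The mean of V3 (45702.44) being almost twice its median (23519) is also consistent with the strong positive skew of 2.38.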

CDF, PDF for Numerical Variables

PDF

The estimated density curves are shown below.

#par(mfrow=c(2,1))
dV3 = density(numerical_airport_data$V3)
dV4 = density(numerical_airport_data$V4)
plot(dV3,main = "PDF of departures",xlab = "Value",ylab = "Density",
     col="green")
lines(dV4,col="red")
legend("topright",legend=c('Scheduled departures','Performed departures'), 
       lwd=1,col=c("green", "red"))

dV5 = density(numerical_airport_data$V5)
plot(dV5,main = "PDF of Enplaned passengers",xlab = "Value",ylab = "Density")

dV6 = density(numerical_airport_data$V6)
dV7 = density(numerical_airport_data$V7)
plot(dV7,main = "PDF of freight and mail",xlab = "Value",
     ylab = "Density",col="blue")
lines(dV6,col="orange")
legend("topright",legend=c('Enplaned revenue tons of freight',
                           'Enplaned revenue tons of mail'), 
       lwd=1,col=c("orange", "blue"))

CDF

par(mfrow=c(2,3))
plot(ecdf(numerical_airport_data$V3),main="CDF of Scheduled departures",col="green")
plot(ecdf(numerical_airport_data$V4),main="CDF of Performed departures",col="red")
plot(ecdf(numerical_airport_data$V5),main="CDF of Enplaned passengers")
plot(ecdf(numerical_airport_data$V6),main="CDF of Enplaned revenue tons of freight",
     col="orange")
plot(ecdf(numerical_airport_data$V7),main="CDF of Enplaned revenue tons of mail",
     col="blue")
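
For reference, ecdf() returns a step function whose value at x is the share of observations less than or equal to x. A minimal sketch on a toy vector (illustrative values only, not airport data):

```r
# Toy vector for illustration only (not airport data)
x <- c(10, 20, 20, 40)

Fn <- ecdf(x)             # ecdf() returns a step function
print(Fn(20))             # share of observations <= 20
print(Fn(c(5, 10, 40)))   # evaluated at several points

# The same value computed directly from the definition
print(mean(x <= 20))
```

The returned function can be evaluated at any threshold, for example Fn(100000) on V3 would give the share of airports with at most 100000 scheduled departures.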

Euroweight

Read Data

To read the text file, I use the R function read.table().

read.table('../euroweight.dat.txt',header = FALSE,
           dec = '.',na.strings = 'NA') -> euroweight_data

For dataset euroweight, variable descriptions are as follows:

V1: ID - this is the case number
V2: weight - weight of the euro coin in grams
V3: batch - number of the package

Summary of Data

summary(euroweight_data)

##        V1               V2              V3      
##  Min.   :   1.0   Min.   :7.201   Min.   :1.00  
##  1st Qu.: 500.8   1st Qu.:7.498   1st Qu.:2.75  
##  Median :1000.5   Median :7.520   Median :4.50  
##  Mean   :1000.5   Mean   :7.521   Mean   :4.50  
##  3rd Qu.:1500.2   3rd Qu.:7.544   3rd Qu.:6.25  
##  Max.   :2000.0   Max.   :7.752   Max.   :8.00

Describe Data

library(psych)
describe(euroweight_data)

##    vars    n    mean     sd  median trimmed    mad min     max   range  skew
## V1    1 2000 1000.50 577.49 1000.50 1000.50 741.30 1.0 2000.00 1999.00  0.00
## V2    2 2000    7.52   0.03    7.52    7.52   0.03 7.2    7.75    0.55 -0.19
## V3    3 2000    4.50   2.29    4.50    4.50   2.97 1.0    8.00    7.00  0.00
##    kurtosis    se
## V1    -1.20 12.91
## V2     4.42  0.00
## V3    -1.24  0.05
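
The kurtosis column appears to be excess kurtosis, for which a normal distribution scores 0: V1 is simply the case number 1 to 2000, and its value of -1.20 is what a uniform distribution gives. The sketch below recomputes that figure from the plain moment definition; the helper name excess_kurtosis is made up for illustration, and describe() may use a slightly different small-sample variant, which is an assumption not verified here.

```r
# Excess kurtosis from population moments: m4 / m2^2 - 3.
# This is the textbook definition; psych::describe() may apply a
# small-sample correction (assumption, not checked here).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

id <- 1:2000   # same values as the ID column V1
print(round(excess_kurtosis(id), 2))
```

On that scale, the 4.42 reported for the weight V2 indicates noticeably heavier tails than a normal distribution.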

Visualization of Data

Only one variable in this dataset, the weight (V2), is of interest here.

#par(mfrow=c(2,2))
dV2 = density(euroweight_data$V2)
plot(dV2,main = "PDF of euroweight",ylim=c(0,15))
# Reference normal density with the mean and sd reported by describe()
curve(dnorm(x,mean=7.52,sd=0.03),
      add = TRUE,col="red")
legend("topleft",legend=c('density','N(7.52,0.0009)'), 
       lwd=1,col=c("black", "red"))
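
The reference curve uses the rounded mean and sd from describe(), and the second parameter in the legend, 0.0009, is the variance 0.03^2. A short check of these numbers, including the height of the normal density at its mode, which shows why an upper y-limit of 15 is enough for the red curve:

```r
# Reference normal used in the overlay: mean 7.52, sd 0.03 (from describe())
mu    <- 7.52
sigma <- 0.03

# Height of the normal density at its mode, 1 / (sigma * sqrt(2 * pi))
peak <- dnorm(mu, mean = mu, sd = sigma)
print(peak)
print(all.equal(peak, 1 / (sigma * sqrt(2 * pi))))

# Variance quoted in the legend, N(7.52, 0.0009)
print(sigma^2)
```

The peak is about 13.3. Using mean(euroweight_data$V2) and sd(euroweight_data$V2) instead of the rounded constants would avoid the rounding, at the cost of a less tidy legend.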

plot(ecdf(euroweight_data$V2),main = "ECDF of euroweight")
# Reference normal CDF with the mean and sd reported by describe()
curve(pnorm(x,mean=7.52,sd=0.03),
      add = TRUE,col="red")
legend("topleft",legend=c('ECDF','CDF of N(7.52,0.0009)'), 
       lwd=1,col=c("black", "red"))

qqnorm(euroweight_data$V2)
qqline(euroweight_data$V2,col="red",lwd=2)

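According to the figure above, I conclude that the weight of the euro coins does not follow a normal distribution. The kurtosis of 4.42 reported by describe() points the same way: the ID column V1, which is uniform by construction, scores -1.20, so the column is on the excess-kurtosis scale, where a normal distribution would give a value near 0.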

Further figures are shown below.

par(mfrow=c(1,2))
hist(euroweight_data$V2,main = "Histogram of euroweight",xlab = "weight")
boxplot(euroweight_data$V2,main = "Box-plot of euroweight",
        ylab="weight",xlab="euroweight")

Acknowledgements

Thanks to the knitr package (Xie 2015).

References

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://yihui.name/knitr/.

Lecture 1 - Descriptive statistics - Zhao Chi - 19.M09

Zhao Chi
