Babyboom

Read and Split the Data

To read the “txt” file, I use the R function read.table().

For dataset babyboom, variable descriptions are as follows:

  1. V1: Time of birth recorded on the 24-hour clock
  2. V2: Sex of the child (1 = girl, 2 = boy)
  3. V3: Birth weight in grams
  4. V4: Number of minutes after midnight of each birth

I use the function subset to divide the babyboom data into two subsets: one with the data for girls and one with the data for boys.

The table below shows the first rows of baby_boys.

    V1 V2   V3  V4
3  118  2 3554  78
4  155  2 3838 115
5  257  2 3625 177
8  422  2 2846 262
9  431  2 3166 271
10 708  2 3520 428
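The read-and-split step described above can be sketched as follows. The file itself is not shown here, so this sketch reads a few rows from an inline string instead: the rows with V2 equal to 2 are copied from the table above, while the two rows for girls are placeholders made up for illustration.

```r
# Illustrative rows only: rows with V2 == 2 are copied from the table above,
# the two rows with V2 == 1 are invented placeholders.
raw <- "
  5 1 3837   5
118 2 3554  78
155 2 3838 115
257 2 3625 177
405 1 2208 245
422 2 2846 262
"

# read.table() names unnamed columns V1, V2, ... by default
babyboom_data <- read.table(text = raw, header = FALSE)

# Split by sex (1 = girl, 2 = boy)
baby_girls <- subset(babyboom_data, V2 == 1)
baby_boys  <- subset(babyboom_data, V2 == 2)

head(baby_boys)
```

With the real file, `text = raw` would be replaced by the path to the “txt” file.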

One-Sample Tests for Normality

For All Babies

Kolmogorov-Smirnov D Test

The K-S test requires the tested distribution to be continuous, which means the sample should contain no duplicate values (ties). The sample of all babies does not satisfy this condition, so I use the R function jitter to add a small amount of noise to the sample.

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  jitter(babyboom_data$V3)
## D = 0.18328, p-value = 0.09131
## alternative hypothesis: two-sided

Lilliefors Test

## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  babyboom_data$V3
## D = 0.18336, p-value = 0.0007395

Anderson–Darling Test

## 
##  Anderson-Darling normality test
## 
## data:  babyboom_data$V3
## A = 1.7168, p-value = 0.0001788

Cramer–von Mises Test

## 
##  Cramer-von Mises normality test
## 
## data:  babyboom_data$V3
## W = 0.31125, p-value = 0.0002256

Shapiro–Wilk Test

## 
##  Shapiro-Wilk normality test
## 
## data:  babyboom_data$V3
## W = 0.89872, p-value = 0.0009944

Shapiro–Francia Test

## 
##  Shapiro-Francia normality test
## 
## data:  babyboom_data$V3
## W = 0.89701, p-value = 0.001519

Pearson chi-square test

## 
##  Pearson chi-square normality test
## 
## data:  babyboom_data$V3
## P = 20.091, p-value = 0.005377

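According to the results above, only the K-S test gives a different result: it does not reject normality at the \(0.05\) level, while every other test does. The reason is that I used the sample mean and standard deviation as the parameters of the hypothesized normal distribution, whereas the K-S test assumes they are known population values; estimating them from the same sample makes the test conservative. The Lilliefors test corrects for this, which is why it rejects with an almost identical D statistic. I conclude that the weight of all babies is not normally distributed.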
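The gap between the K-S and Lilliefors p-values can be illustrated with a small simulation. This is a sketch on synthetic data, not on the babyboom weights: it computes the naive K-S p-value with estimated parameters and then a Monte Carlo p-value in which the mean and standard deviation are re-estimated on every simulated normal sample, which is the idea behind the Lilliefors correction.

```r
set.seed(1)
# Synthetic, right-skewed "weights" (NOT the babyboom data)
x <- rgamma(44, shape = 4, scale = 800)

# Naive K-S test: parameters estimated from the same sample
naive <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Null distribution of D when mean and sd are re-estimated
# on every simulated normal sample of the same size
d_null <- replicate(2000, {
  z <- rnorm(length(x))
  ks.test(z, "pnorm", mean = mean(z), sd = sd(z))$statistic
})

# Monte Carlo p-value: share of simulated D values at least as large as observed
p_mc <- mean(d_null >= naive$statistic)

c(D = unname(naive$statistic), p_naive = naive$p.value, p_monte_carlo = p_mc)
```

Both p-values belong to the same D statistic; only the reference distribution changes, which mirrors the pair of results reported above.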

For Boys

## 
##  Shapiro-Wilk normality test
## 
## data:  baby_boys$V3
## W = 0.94747, p-value = 0.2022

Since the p-value is larger than \(0.05\), we cannot reject the null hypothesis that the weights of baby boys are normally distributed.

For Girls

## 
##  Shapiro-Wilk normality test
## 
## data:  baby_girls$V3
## W = 0.87028, p-value = 0.01798

Since the p-value is smaller than \(0.05\), we reject the null hypothesis that the weights of baby girls are normally distributed.

Test the hypothesis that the mean weight of girls is the same as the mean weight of boys.

One of our samples is not normally distributed, but the other one is. So I use two non-parametric tests, the Wilcoxon rank sum test and the K-S test, to compare the two samples.

Wilcoxon Rank sum test

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  baby_boys$V3 and baby_girls$V3
## W = 273.5, p-value = 0.3519
## alternative hypothesis: true location shift is not equal to 0

Since the p-value is larger than the significance level \(0.05\), we cannot reject the null hypothesis that the weights of baby boys and baby girls have the same location. The data give no evidence that the two samples come from different distributions or that their mean weights differ.

Two Sample Kolmogorov-Smirnov Tests

## Warning in ks.test(baby_boys$V3, baby_girls$V3, alternative = "two.sided", : p-
## value will be approximate in the presence of ties
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  baby_boys$V3 and baby_girls$V3
## D = 0.23932, p-value = 0.5762
## alternative hypothesis: two-sided

Like the Wilcoxon rank sum test, the two-sample Kolmogorov-Smirnov test does not reject the null hypothesis.

Student's t-Test

Finally, I try a t-test (R's t.test runs the Welch version by default, as the output shows); the conclusion is the same.

## 
##  Welch Two Sample t-test
## 
## data:  baby_boys$V3 and baby_girls$V3
## t = 1.4211, df = 27.631, p-value = 0.1665
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -107.4273  593.1538
## sample estimates:
## mean of x mean of y 
##  3375.308  3132.444
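The three two-sample comparisons in this section can be reproduced along the following lines. The weights below are simulated (the group sizes 26 and 18 are taken from the degrees of freedom reported in the outputs, everything else is an assumption), so the numbers will not match those above.

```r
set.seed(2)
# Synthetic birth weights in grams (NOT the babyboom data)
boys  <- rnorm(26, mean = 3400, sd = 430)
girls <- rnorm(18, mean = 3100, sd = 630)

w  <- wilcox.test(boys, girls)               # rank-based test for a location shift
ks <- ks.test(boys, girls)                   # any difference between the two CDFs
tt <- t.test(boys, girls)                    # var.equal = FALSE by default: Welch
st <- t.test(boys, girls, var.equal = TRUE)  # classical Student t-test

c(wilcoxon = w$p.value, ks = ks$p.value,
  welch = tt$p.value, student = st$p.value)
tt$method
```

The Welch version does not assume equal variances, which is why its degrees of freedom are generally not an integer.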

Test the hypothesis that the variance of the weight of girls is the same as that of boys.

The two samples are not both normally distributed, and they differ in size. The F-test and Bartlett's test are sensitive to departures from normality, so they are not appropriate here. Instead, I use Levene's test for homogeneity of variances.

## Warning in leveneTest.default(babyboom_data$V3, babyboom_data$V2):
## babyboom_data$V2 coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  1.8154 0.1851
##       42

The p-value is larger than \(0.05\), so we cannot reject the null hypothesis that the variance of the weight of girls is the same as that of boys.

Then I tried the F-test on the same null hypothesis and reached the same conclusion.

## 
##  F test to compare two variances
## 
## data:  baby_boys$V3 and baby_girls$V3
## F = 0.45933, num df = 25, denom df = 17, p-value = 0.07526
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1802395 1.0839460
## sample estimates:
## ratio of variances 
##          0.4593257
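For readers without the package that provides leveneTest, a median-centred Levene test (also known as the Brown-Forsythe test) can be written in base R as a one-way ANOVA on absolute deviations from the group medians. The sketch below uses simulated weights, with group sizes taken from the F-test output, and runs var.test alongside for comparison.

```r
set.seed(3)
# Synthetic weights (NOT the babyboom data); group sizes follow the F-test output
weight <- c(rnorm(26, 3400, 430), rnorm(18, 3100, 630))
sex    <- factor(rep(c("boy", "girl"), times = c(26, 18)))

# Median-centred Levene (Brown-Forsythe) test:
# one-way ANOVA on |x - median of own group|
abs_dev <- abs(weight - ave(weight, sex, FUN = median))
levene  <- anova(lm(abs_dev ~ sex))

# Classical F-test for the ratio of two variances (assumes normality)
f_test <- var.test(weight[sex == "boy"], weight[sex == "girl"])

levene
f_test$p.value
```

Passing the grouping variable as a factor from the start also avoids the "coerced to factor" warning seen in the Levene output.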

One-Sample Tests for Exponentiality

One-sample Kolmogorov–Smirnov test

## 
##  Kolmogorov-Smirnov test for exponentiality
## 
## data:  babyboom_data$V4
## KSn = 0.23477, p-value = 5e-04
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  babyboom_data$V4
## D = 0.99326, p-value < 2.2e-16
## alternative hypothesis: two-sided

Cramer–von Mises test

## 
##  Cramer-von Mises test for exponentiality
## 
## data:  babyboom_data$V4
## Wn = 0.75668, p-value = 1

Atkinson test

## 
##  Atkinson test for exponentiality
## 
## data:  babyboom_data$V4
## T = 0.016934, p-value = 0.001615

Lorenz test

## 
##  Lorenz test for exponentiality
## 
## data:  babyboom_data$V4
## L = 0.27801, p-value = 0.4086

Shapiro-Wilk test for exponentiality:

## 
##  Shapiro-Wilk test for exponentiality
## 
## data:  babyboom_data$V4
## W = 0.084434, p-value = 1

Kimber-Michael test for exponentiality:

## 
##  Kimber-Michael test for exponentiality
## 
## data:  babyboom_data$V4
## D = 0.19583, p-value < 2.2e-16

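Different methods give different results. According to the p-values, three of the tests for exponentiality (Kolmogorov-Smirnov, Atkinson and Kimber-Michael) reject at the \(0.05\) level, while the others (Cramer-von Mises, Lorenz and Shapiro-Wilk) do not. So I cannot say with confidence whether the sample follows an exponential distribution.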
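One detail of the plain one-sample K-S output is worth a sketch. The call that produced D = 0.99326 is not shown, so what follows is an inference: a D that close to 1 is what ks.test returns when the data are compared against pexp with its default rate of 1 (for instance, an earliest observation of 5 would give exactly 1 - exp(-5), about 0.99326). The example uses simulated times and compares the default rate with a rate matched to the scale of the data.

```r
set.seed(4)
# Synthetic "minutes after midnight" (NOT the babyboom data); earliest value is 5
minutes <- c(5, runif(43, min = 10, max = 1440))

# pexp() defaults to rate = 1, far from data measured in hundreds of minutes
default_rate <- ks.test(minutes, "pexp")

# Rate matched to the scale of the data (estimated as 1 / mean)
scaled_rate <- ks.test(minutes, "pexp", rate = 1 / mean(minutes))

c(D_default = unname(default_rate$statistic),
  D_scaled  = unname(scaled_rate$statistic),
  check     = 1 - exp(-5))
```

Estimating the rate from the sample makes the plain K-S p-value conservative, which is the situation the dedicated K-S test for exponentiality is designed for.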

Test the hypothesis that the number of births per hour follows a Poisson distribution

First, I build the frequency table of births per hour, then I apply a goodness-of-fit test to it.

## $breaks
##  [1]    0   60  120  180  240  300  360  420  480  540  600  660  720  780  840
## [16]  900  960 1020 1080 1140 1200 1260 1320 1380 1440
## 
## $counts
##  [1] 1 3 1 0 4 0 0 2 2 1 3 1 2 1 4 1 2 1 3 4 3 2 1 2
## 
## $density
##  [1] 0.0003787879 0.0011363636 0.0003787879 0.0000000000 0.0015151515
##  [6] 0.0000000000 0.0000000000 0.0007575758 0.0007575758 0.0003787879
## [11] 0.0011363636 0.0003787879 0.0007575758 0.0003787879 0.0015151515
## [16] 0.0003787879 0.0007575758 0.0003787879 0.0011363636 0.0015151515
## [21] 0.0011363636 0.0007575758 0.0003787879 0.0007575758
## 
## $mids
##  [1]   30   90  150  210  270  330  390  450  510  570  630  690  750  810  870
## [16]  930  990 1050 1110 1170 1230 1290 1350 1410
## 
## $xname
## [1] "babyboom_data$V4"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
##   Var1 Freq
## 1    0    3
## 2    1    8
## 3    2    6
## 4    3    4
## 5    4    3
## Warning in summary.goodfit(gf): Chi-squared approximation may be incorrect
## 
##   Goodness-of-fit test for poisson distribution
## 
##              X^2 df  P(> X^2)
## Pearson 4.555953  7 0.7139696

The p-value is larger than \(0.05\), so I cannot reject the null hypothesis that the number of births per hour follows a Poisson distribution.
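The same question can be checked by hand with a Pearson chi-square statistic. The sketch below is not the goodness-of-fit routine used above: it takes the hourly counts from the $counts output, pools the last category as "4 or more", and removes one degree of freedom for the estimated rate, so its statistic and degrees of freedom will differ from the printed ones.

```r
# Hourly birth counts, copied from the $counts output above
counts <- c(1, 3, 1, 0, 4, 0, 0, 2, 2, 1, 3, 1,
            2, 1, 4, 1, 2, 1, 3, 4, 3, 2, 1, 2)

lambda <- mean(counts)   # estimated Poisson rate (births per hour)

# Number of hours with 0, 1, 2, 3 and 4 births
observed <- as.vector(table(factor(counts, levels = 0:4)))

# Expected number of hours per category; the last cell collects P(X >= 4)
probs    <- c(dpois(0:3, lambda), ppois(3, lambda, lower.tail = FALSE))
expected <- length(counts) * probs

x2      <- sum((observed - expected)^2 / expected)
df      <- length(observed) - 1 - 1   # one df lost for estimating lambda
p_value <- pchisq(x2, df, lower.tail = FALSE)

c(X2 = x2, df = df, p = p_value)
```

Several expected counts are below 5 here, which is the situation the "Chi-squared approximation may be incorrect" warning refers to.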

Euroweight

Read Data

To read the “txt” file, I use the R function read.table().

For dataset euroweight, variable descriptions are as follows:

  1. V1: ID - this is the case number
  2. V2: weight - weight of the euro coin in grams
  3. V3: batch - number of the package

One-Sample Tests for Normality

For whole sample

## 
##  Shapiro-Wilk normality test
## 
## data:  euroweight_data$V2
## W = 0.97547, p-value < 2.2e-16

The p-value for the whole sample is smaller than \(0.05\), so we reject the null hypothesis of normality.

For each group

I wrote a function that tests normality for each of several groups in a sample. The results are shown below.

## Loading required package: magrittr
##             No        Group         W      p.value       norm.test
## 1            1            1 0.9955066 6.830017e-01            Norm
## 2            2            2 0.9909001 1.218770e-01            Norm
## 3            3            3 0.8634321 4.089445e-14 Other_situation
## 4            4            4 0.9955047 6.826586e-01            Norm
## 5            5            5 0.9910340 1.289928e-01            Norm
## 6            6            6 0.9840595 6.756499e-03 Other_situation
## 7            7            7 0.9907008 1.119834e-01            Norm
## 8            8            8 0.9367201 6.827698e-09 Other_situation
## 9 Test Method: Shapiro-Wilk        NA           NA            <NA>

Test the hypothesis that the mean of the weight of coins is the same in different packages

As the table above shows, not all groups in the sample are normally distributed. So I use non-parametric tests: the Kruskal-Wallis test and pairwise.wilcox.test.

Kruskal-Wallis test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  euroweight_data$V2 by euroweight_data$V3
## Kruskal-Wallis chi-squared = 97.5, df = 7, p-value < 2.2e-16

According to the result, we reject the null hypothesis that the coin weights in all packages come from the same distribution, and hence that the mean weight is the same in every group.

Pairwise Wilcox test

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  euroweight_data$V2 and euroweight_data$V3 
## 
##   1       2       3       4       5       6       7      
## 2 1.00000 -       -       -       -       -       -      
## 3 0.04297 0.00025 -       -       -       -       -      
## 4 0.00141 0.10329 7.8e-12 -       -       -       -      
## 5 0.00108 0.10329 2.6e-12 1.00000 -       -       -      
## 6 0.76768 0.04297 1.00000 2.9e-07 1.7e-07 -       -      
## 7 1.00000 1.00000 0.00012 0.10329 0.10202 0.04297 -      
## 8 1.00000 0.10329 0.73578 1.4e-06 7.1e-07 1.00000 0.10202
## 
## P value adjustment method: holm

Pairwise T test

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  euroweight_data$V2 and euroweight_data$V3 
## 
##   1       2       3       4       5       6       7      
## 2 1.00000 -       -       -       -       -       -      
## 3 0.01455 0.00014 -       -       -       -       -      
## 4 0.00285 0.11938 3.2e-11 -       -       -       -      
## 5 0.00203 0.10225 1.7e-11 1.00000 -       -       -      
## 6 1.00000 0.11938 0.47138 3.9e-06 2.4e-06 -       -      
## 7 1.00000 1.00000 0.00017 0.11019 0.09317 0.11942 -      
## 8 1.00000 0.32960 0.18828 4.6e-05 3.0e-05 1.00000 0.33590
## 
## P value adjustment method: holm

As we can see, several groups in the sample are not normally distributed. Comparing these results with the pairwise Wilcoxon test, the p-values differ most in the pairs that involve groups 3, 6 and 8, the groups that failed the normality test.
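The comparison between the rank-based and the t-based pairwise procedures can be sketched as follows. The coin weights here are simulated for 8 batches (batch size, mean and spread are assumptions), so the pairs that disagree will not be the ones in the tables above.

```r
set.seed(5)
# Synthetic coin weights in grams for 8 batches (NOT the euroweight data)
batch  <- factor(rep(1:8, each = 50))
shift  <- c(0, 0, 0.02, -0.02, -0.02, 0.01, 0, 0.01)
weight <- rnorm(length(batch), mean = 7.52 + shift[as.integer(batch)], sd = 0.035)

kw <- kruskal.test(weight ~ batch)                                  # overall test
pw <- pairwise.wilcox.test(weight, batch, p.adjust.method = "holm") # rank-based pairs
pt <- pairwise.t.test(weight, batch, p.adjust.method = "holm")      # pooled-SD t pairs

# Pairs where the two procedures reach different decisions at the 5% level
# (NA cells of the triangular p-value matrices are ignored by which())
disagree <- which((pw$p.value < 0.05) != (pt$p.value < 0.05), arr.ind = TRUE)

kw
disagree
```

Rows of `disagree` index the row batch (2 to 8) and columns the column batch (1 to 7) of the lower-triangular p-value tables.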

Iris

Read Data

To read the “txt” file, I use the R function read.table().

For dataset iris, variable descriptions are as follows:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class

Test the normality of length of flowers grouping them by the type of iris

As in the previous example, I use my function shapiro.test.mulity for the normality test.

##             No           Group         W   p.value norm.test
## 1            1     Iris-setosa 0.9776985 0.4595132      Norm
## 2            2 Iris-versicolor 0.9778357 0.4647370      Norm
## 3            3  Iris-virginica 0.9711794 0.2583147      Norm
## 4 Test Method:    Shapiro-Wilk        NA        NA      <NA>

As shown in the table above, the length of flowers in each group is normally distributed.
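The source of shapiro.test.mulity is not shown, so the following is only a guess at what a group-wise Shapiro-Wilk helper might look like. The name shapiro_by_group and its arguments are hypothetical, and the example runs on R's built-in iris data, whose columns are named Sepal.Length and Species rather than V1 and V5.

```r
# Hypothetical helper (not the author's shapiro.test.mulity):
# runs shapiro.test() within each group and labels the outcome
shapiro_by_group <- function(x, group, alpha = 0.05) {
  tests <- lapply(split(x, group), shapiro.test)
  p     <- vapply(tests, function(t) t$p.value, numeric(1))
  data.frame(
    Group     = names(tests),
    W         = vapply(tests, function(t) unname(t$statistic), numeric(1)),
    p.value   = p,
    norm.test = ifelse(p > alpha, "Norm", "Other_situation"),
    row.names = NULL
  )
}

res <- shapiro_by_group(iris$Sepal.Length, iris$Species)
res
```

Returning a plain data frame keeps the result easy to filter, for example `res[res$norm.test != "Norm", ]`.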

Test the hypotheses about similarity of distributions of characteristics of flowers of different types

For sepal length

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  iris_data$V1 and iris_data$V5 
## 
##                 Iris-setosa Iris-versicolor
## Iris-versicolor 1.7e-13     -              
## Iris-virginica  < 2e-16     5.9e-07        
## 
## P value adjustment method: holm

The results above show that no pair of samples comes from the same distribution.

For petal length

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  iris_data$V3 and iris_data$V5 
## 
##                 Iris-setosa Iris-versicolor
## Iris-versicolor <2e-16      -              
## Iris-virginica  <2e-16      <2e-16         
## 
## P value adjustment method: holm

The result is the same as for sepal length.

Test the hypotheses if the means and variances of the characteristics of flowers of different types are equal

For sepal length

Means

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  iris_data$V1 and iris_data$V5 
## 
##                 Iris-setosa Iris-versicolor
## Iris-versicolor 1.8e-15     -              
## Iris-virginica  < 2e-16     2.8e-09        
## 
## P value adjustment method: holm

The result above shows that the mean sepal lengths of flowers in the different groups are not equal.

Variances

## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris_data$V1 and iris_data$V5
## Bartlett's K-squared = 16.006, df = 2, p-value = 0.0003345

The result shows that the variances of sepal length in the different groups are not homogeneous.
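The pair of tests used for each characteristic can be sketched on R's built-in iris data. This is an assumption about how the calls looked: the built-in data frame uses the names Sepal.Length and Species instead of V1 and V5, and it may differ slightly from the text file read above. The last call is an optional variant that drops the pooled standard deviation, which may be preferable when the variances are not homogeneous.

```r
# Pairwise t-tests on the means, Holm-adjusted, pooled SD (the default)
means_test <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                              p.adjust.method = "holm")

# Bartlett test for homogeneity of variances across the three species
vars_test <- bartlett.test(Sepal.Length ~ Species, data = iris)

# Variant without a pooled SD (separate-variance t-tests for each pair)
welch_test <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                              p.adjust.method = "holm", pool.sd = FALSE)

means_test$p.value
vars_test
welch_test$p.value
```

The same three calls apply to the other characteristics by swapping the column.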

For sepal width

Means

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  iris_data$V2 and iris_data$V5 
## 
##                 Iris-setosa Iris-versicolor
## Iris-versicolor < 2e-16     -              
## Iris-virginica  2.1e-09     0.0032         
## 
## P value adjustment method: holm

The result above shows that the mean sepal widths of flowers in the different groups are not equal.

Variances

## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris_data$V2 and iris_data$V5
## Bartlett's K-squared = 2.2158, df = 2, p-value = 0.3302

The result shows that homogeneity of the variances of sepal width across the groups cannot be rejected.

For petal length

Means

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  iris_data$V3 and iris_data$V5 
## 
##                 Iris-setosa Iris-versicolor
## Iris-versicolor <2e-16      -              
## Iris-virginica  <2e-16      <2e-16         
## 
## P value adjustment method: holm

The result above shows that the mean petal lengths of flowers in the different groups are not equal.

Variances

## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris_data$V3 and iris_data$V5
## Bartlett's K-squared = 55.494, df = 2, p-value = 8.905e-13

The result shows that the variances of petal length in the different groups are not homogeneous.

For petal width

Means

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  iris_data$V4 and iris_data$V5 
## 
##                 Iris-setosa Iris-versicolor
## Iris-versicolor <2e-16      -              
## Iris-virginica  <2e-16      <2e-16         
## 
## P value adjustment method: holm

The result above shows that the mean petal widths of flowers in the different groups are not equal.

Variances

## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris_data$V4 and iris_data$V5
## Bartlett's K-squared = 37.996, df = 2, p-value = 5.615e-09

The result shows that the variances of petal width in the different groups are not homogeneous.

Height

Read Data

To read the “xlsx” file, I use the R function read_excel from the package readxl.

For dataset height, variable descriptions are as follows:

  1. height of football players
  2. height of basketball players

Test the normality of heights of football and basketball players

For football players

## 
##  Shapiro-Wilk normality test
## 
## data:  height_data$HtFt
## W = 0.93655, p-value = 0.01609

Since the p-value is smaller than \(0.05\), we reject the null hypothesis that the heights of football players are normally distributed.

For basketball players

## 
##  Shapiro-Wilk normality test
## 
## data:  height_data$HtBk
## W = 0.96839, p-value = 0.3197

Since the p-value is larger than \(0.05\), we cannot reject the null hypothesis that the heights of basketball players are normally distributed.

Test the equality of means and variances of the heights of football and basketball players.

Means

## 
##  Wilcoxon rank sum test
## 
## data:  jitter(height_data$HtFt) and jitter(height_data$HtBk)
## W = 531, p-value = 0.0009971
## alternative hypothesis: true location shift is not equal to 0
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  jitter(height_data$HtFt) and jitter(height_data$HtBk)
## D = 0.33889, p-value = 0.01137
## alternative hypothesis: two-sided

The results show that the heights of football and basketball players differ in location, so their means are not equal.

Variances

## Warning in leveneTest.default(height_data$HtFt, height_data$HtBk):
## height_data$HtBk coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group 14  0.5718 0.8617
##       25

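The p-value is larger than \(0.05\), so we cannot reject the null hypothesis that the variances of the heights of football and basketball players are equal. Note, however, that in this call height_data$HtBk was coerced to a factor and used as the grouping variable (hence the 14 group degrees of freedom), so the test does not compare the two sports directly.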
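A Levene-type test needs one column of values and one grouping factor, so two side-by-side columns have to be stacked first. The sketch below shows that reshaping on simulated heights (the column names HtFt and HtBk follow the output above; sample size and values are assumptions) and then runs a median-centred Levene test in base R.

```r
set.seed(6)
# Synthetic heights in cm (NOT the height data)
height_data <- data.frame(HtFt = round(rnorm(40, 183, 6)),
                          HtBk = round(rnorm(40, 188, 7)))

# Long format: one column of heights, one two-level factor for the sport
long <- stack(height_data)            # columns: values, ind
names(long) <- c("height", "sport")

# Median-centred Levene (Brown-Forsythe) test via one-way ANOVA
long$abs_dev <- abs(long$height - ave(long$height, long$sport, FUN = median))
lev <- anova(lm(abs_dev ~ sport, data = long))
lev
```

With the car package loaded, `leveneTest(height ~ sport, data = long)` should give the equivalent test on the stacked data.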

Test if the distributions of the heights of football and basketball players are the same.

## 
##  Wilcoxon rank sum test
## 
## data:  jitter(height_data$HtFt) and jitter(height_data$HtBk)
## W = 512, p-value = 0.0005228
## alternative hypothesis: true location shift is not equal to 0
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  jitter(height_data$HtFt) and jitter(height_data$HtBk)
## D = 0.35, p-value = 0.008027
## alternative hypothesis: two-sided

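According to the results, we can reject the null hypothesis that the distributions of the heights of football and basketball players are the same. The statistics differ slightly from those in the previous section because jitter adds fresh random noise on every call.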
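Because jitter draws random noise, two runs of the same jittered test give slightly different statistics unless the random seed is fixed. The sketch below, on simulated integer heights, shows one way to make the result reproducible, plus an alternative that keeps the ties and relies on the normal approximation instead of jittering.

```r
set.seed(7)
# Synthetic integer heights with many ties (NOT the height data)
football   <- round(rnorm(40, 183, 6))
basketball <- round(rnorm(40, 188, 7))

run_once <- function(seed) {
  set.seed(seed)   # fixes the random noise added by jitter()
  unname(wilcox.test(jitter(football), jitter(basketball))$statistic)
}

same_seed  <- c(run_once(1), run_once(1))  # identical W on both runs
other_seed <- run_once(2)                  # W may differ slightly

# Alternative without random noise: keep ties, use the normal approximation
no_jitter <- wilcox.test(football, basketball, exact = FALSE)

same_seed
other_seed
no_jitter$p.value
```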

Surgery

Binomial Test

##    temp Freq
## 1 FALSE   18
## 2  TRUE   69
## 
##  Exact binomial test
## 
## data:  69 and 87
## number of successes = 69, number of trials = 87, p-value = 0.06129
## alternative hypothesis: true probability of success is not equal to 0.7
## 95 percent confidence interval:
##  0.6928684 0.8725251
## sample estimates:
## probability of success 
##              0.7931034

The p-value is larger than \(0.05\), so we cannot reject the null hypothesis that the operation is successful with probability \(0.7\).
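The output above is consistent with a call of the following form; the counts are taken from the frequency table (69 successes out of 87 operations), and the two-sided alternative matches the printed hypothesis.

```r
# Exact binomial test of H0: success probability = 0.7
bt <- binom.test(x = 69, n = 87, p = 0.7, alternative = "two.sided")

bt$p.value    # two-sided p-value
bt$conf.int   # exact (Clopper-Pearson) 95% confidence interval
bt$estimate   # observed success proportion, 69 / 87
```

The confidence interval contains 0.7, which agrees with not rejecting the null hypothesis at the 5% level.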

Acknowledgements

Thanks to knitr (Xie 2015), which was used to produce this report.

References

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://yihui.name/knitr/.