To read “txt” files, I use the R function read.table().
read.table('../datasets/babyboom.dat.txt',header = FALSE,
dec = ',',na.strings = 'NA') -> babyboom_data
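For the babyboom dataset, the variables are: V1, time of birth on the 24-hour clock; V2, sex (1 = girl, 2 = boy); V3, birth weight in grams; V4, time of birth in minutes after midnight.
I use the function subset() to divide the babyboom data into two subsets: one for the girls and one for the boys.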
baby_girls = subset(babyboom_data,V2==1,select = c(V1,V2,V3,V4))
baby_boys = subset(babyboom_data,V2==2,select = c(V1,V2,V3,V4))
The table below shows some of the data in baby_boys.
Row | V1 | V2 | V3 | V4 |
---|---|---|---|---|
3 | 118 | 2 | 3554 | 78 |
4 | 155 | 2 | 3838 | 115 |
5 | 257 | 2 | 3625 | 177 |
8 | 422 | 2 | 2846 | 262 |
9 | 431 | 2 | 3166 | 271 |
10 | 708 | 2 | 3520 | 428 |
Kolmogorov-Smirnov D Test
Because the K-S test assumes a continuous distribution, the sample should contain no tied (duplicate) values. Our sample (all babies) does not satisfy this condition, so I use the R function jitter() to add a small amount of noise to the sample.
##
## One-sample Kolmogorov-Smirnov test
##
## data: jitter(babyboom_data$V3)
## D = 0.18328, p-value = 0.09131
## alternative hypothesis: two-sided
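A minimal sketch of the jitter-then-test step, assuming the reference normal distribution takes its mean and standard deviation from the sample itself. The `weights` vector is hypothetical (it only contains one tie on purpose) and is not the babyboom data.

```r
# Hypothetical weights containing one tied value; not the babyboom data.
set.seed(42)
weights <- c(3554, 3838, 3625, 2846, 3166, 3520, 3554, 3380, 3294, 2576)

# jitter() adds a small amount of uniform noise, which breaks the tie.
weights_j <- jitter(weights)

# One-sample K-S test against a normal distribution whose parameters
# are estimated from the same sample (this makes the test conservative).
ks_res <- ks.test(weights_j, "pnorm",
                  mean = mean(weights_j), sd = sd(weights_j))
print(ks_res)
```

Because the parameters are estimated rather than known, the reported p-value tends to be too large, which is the situation the Lilliefors correction addresses.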
Lilliefors Test
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: babyboom_data$V3
## D = 0.18336, p-value = 0.0007395
Anderson–Darling Test
##
## Anderson-Darling normality test
##
## data: babyboom_data$V3
## A = 1.7168, p-value = 0.0001788
Cramer–von Mises Test
##
## Cramer-von Mises normality test
##
## data: babyboom_data$V3
## W = 0.31125, p-value = 0.0002256
Shapiro–Wilk Test
##
## Shapiro-Wilk normality test
##
## data: babyboom_data$V3
## W = 0.89872, p-value = 0.0009944
Shapiro–Francia Test
##
## Shapiro-Francia normality test
##
## data: babyboom_data$V3
## W = 0.89701, p-value = 0.001519
Pearson chi-square test
##
## Pearson chi-square normality test
##
## data: babyboom_data$V3
## P = 20.091, p-value = 0.005377
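The outputs above look like those of the `nortest` package, although the calls themselves are not shown. The sketch below runs the same family of tests on a hypothetical, deliberately skewed sample `x`; the use of `nortest` is an assumption, so those calls are guarded and only run when the package is installed.

```r
set.seed(1)
x <- rexp(50)  # hypothetical, clearly non-normal sample

# Shapiro-Wilk is available in base R (package stats).
sw <- shapiro.test(x)
print(sw)

# The remaining tests are assumed to come from the 'nortest' package.
if (requireNamespace("nortest", quietly = TRUE)) {
  tests <- list(
    lilliefors       = nortest::lillie.test(x),
    anderson_darling = nortest::ad.test(x),
    cramer_von_mises = nortest::cvm.test(x),
    shapiro_francia  = nortest::sf.test(x),
    pearson          = nortest::pearson.test(x)
  )
  # Collect the p-values side by side for comparison.
  print(sapply(tests, function(t) t$p.value))
}
```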
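According to the results above, only the K-S D test gives a different result. The likely reason is that the K-S test assumes the population mean and standard deviation are known, whereas here they are replaced by the sample values, which differ from them. Since all the other tests reject normality, I conclude that the weights of all babies are not normally distributed.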
##
## Shapiro-Wilk normality test
##
## data: baby_boys$V3
## W = 0.94747, p-value = 0.2022
Since the p-value is larger than \(0.05\), we cannot reject the null hypothesis that the weights of baby boys are normally distributed.
##
## Shapiro-Wilk normality test
##
## data: baby_girls$V3
## W = 0.87028, p-value = 0.01798
Since the p-value is smaller than \(0.05\), we reject the null hypothesis that the weights of baby girls are normally distributed.
One of our samples is not normally distributed, although the other one is. So I use two non-parametric tests, the Wilcoxon rank sum test and the two-sample K-S test, to compare the locations of the two samples.
##
## Wilcoxon rank sum test with continuity correction
##
## data: baby_boys$V3 and baby_girls$V3
## W = 273.5, p-value = 0.3519
## alternative hypothesis: true location shift is not equal to 0
Since the p-value is larger than the significance level \(0.05\), we cannot reject the null hypothesis that the weights of baby boys and baby girls have the same location. The data give no evidence that the two samples come from different distributions or that their typical weights differ.
## Warning in ks.test(baby_boys$V3, baby_girls$V3, alternative = "two.sided", : p-
## value will be approximate in the presence of ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: baby_boys$V3 and baby_girls$V3
## D = 0.23932, p-value = 0.5762
## alternative hypothesis: two-sided
Like the Wilcoxon rank sum test, the two-sample Kolmogorov-Smirnov test gives the same result.
Finally, I also try Student's t-test (in its Welch form) on these samples; the result is the same.
##
## Welch Two Sample t-test
##
## data: baby_boys$V3 and baby_girls$V3
## t = 1.4211, df = 27.631, p-value = 0.1665
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -107.4273 593.1538
## sample estimates:
## mean of x mean of y
## 3375.308 3132.444
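The three two-sample comparisons used above can be sketched side by side. The vectors `boys` and `girls` are simulated stand-ins with the same group sizes as in the report (26 and 18); the means and standard deviations are assumptions chosen for illustration only.

```r
set.seed(7)
# Hypothetical samples, sized like the two groups (26 boys, 18 girls).
boys  <- rnorm(26, mean = 3400, sd = 430)
girls <- rnorm(18, mean = 3100, sd = 630)

wt <- wilcox.test(boys, girls)  # tests for a location shift
ks <- ks.test(boys, girls)      # tests for any difference in distribution
tt <- t.test(boys, girls)       # Welch t-test (var.equal = FALSE by default)

# Collect the three p-values for comparison.
p_values <- c(wilcoxon = wt$p.value, ks = ks$p.value, welch_t = tt$p.value)
print(round(p_values, 4))
```

The three tests answer slightly different questions, so agreement between them is reassuring but not guaranteed.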
The two samples are not both from a normal distribution, and they have different sizes. The F-test and the Bartlett test are sensitive to departures from normality, so they are not appropriate here; instead, I use Levene’s test to test the homogeneity of variances.
## Warning in leveneTest.default(babyboom_data$V3, babyboom_data$V2):
## babyboom_data$V2 coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 1.8154 0.1851
## 42
The p-value is larger than \(0.05\), so we cannot reject the null hypothesis that the variance of the girls' weights is the same as that of the boys' weights.
I then tried the F-test on the same null hypothesis and got the same result.
##
## F test to compare two variances
##
## data: baby_boys$V3 and baby_girls$V3
## F = 0.45933, num df = 25, denom df = 17, p-value = 0.07526
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1802395 1.0839460
## sample estimates:
## ratio of variances
## 0.4593257
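To make the two variance tests concrete, here is a sketch in base R only. Levene's test with `center = median` is equivalent to a one-way ANOVA on the absolute deviations from each group's median, so it can be reproduced without extra packages. The data are simulated and the names `weight` and `sex` are hypothetical.

```r
set.seed(3)
# Hypothetical weights for 26 boys and 18 girls.
weight <- c(rnorm(26, 3400, 430), rnorm(18, 3100, 630))
sex <- factor(rep(c("boy", "girl"), times = c(26, 18)))

# F test: ratio of the two sample variances; relies on normality.
ft <- var.test(weight ~ sex)

# Levene's test (median-centred): one-way ANOVA on the absolute
# deviations of each observation from its own group's median.
abs_dev <- abs(weight - ave(weight, sex, FUN = median))
lev <- anova(lm(abs_dev ~ sex))

print(ft$p.value)
print(lev)
```

With two groups and 44 observations the ANOVA table has 1 and 42 degrees of freedom, the same layout as the Levene output shown earlier.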
##
## Kolmogorov-Smirnov test for exponentiality
##
## data: babyboom_data$V4
## KSn = 0.23477, p-value = 5e-04
##
## One-sample Kolmogorov-Smirnov test
##
## data: babyboom_data$V4
## D = 0.99326, p-value < 2.2e-16
## alternative hypothesis: two-sided
##
## Cramer-von Mises test for exponentiality
##
## data: babyboom_data$V4
## Wn = 0.75668, p-value = 1
##
## Atkinson test for exponentiality
##
## data: babyboom_data$V4
## T = 0.016934, p-value = 0.001615
##
## Lorenz test for exponentiality
##
## data: babyboom_data$V4
## L = 0.27801, p-value = 0.4086
##
## Shapiro-Wilk test for exponentiality
##
## data: babyboom_data$V4
## W = 0.084434, p-value = 1
##
## Kimber-Michael test for exponentiality
##
## data: babyboom_data$V4
## D = 0.19583, p-value < 2.2e-16
Different methods give different results: four of the seven p-values are smaller than \(0.05\), while the other three are larger. So I cannot be sure whether the sample follows an exponential distribution.
First, I build the frequency table of births per hour, and then I apply a goodness-of-fit test to it.
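One possible reason for a one-sample K-S statistic close to 1 is the rate of the reference distribution: `pexp()` defaults to `rate = 1`, which rarely fits data measured in minutes. Whether that happened in the run above is an assumption, since the call is not shown. The sketch uses hypothetical waiting times `gaps` and compares the default rate with a rate estimated from the sample.

```r
set.seed(11)
gaps <- rexp(43, rate = 1 / 33)  # hypothetical waiting times, in minutes

# Rate of the fitted exponential: the reciprocal of the sample mean.
rate_hat <- 1 / mean(gaps)

ks_default <- ks.test(gaps, "pexp")                   # reference rate = 1
ks_fitted  <- ks.test(gaps, "pexp", rate = rate_hat)  # estimated rate

# Compare the two test statistics.
d_stats <- c(default_rate = unname(ks_default$statistic),
             fitted_rate  = unname(ks_fitted$statistic))
print(d_stats)
```

As with the normal case, estimating the rate from the same sample makes the K-S p-value approximate, which is why dedicated exponentiality tests exist.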
## $breaks
## [1] 0 60 120 180 240 300 360 420 480 540 600 660 720 780 840
## [16] 900 960 1020 1080 1140 1200 1260 1320 1380 1440
##
## $counts
## [1] 1 3 1 0 4 0 0 2 2 1 3 1 2 1 4 1 2 1 3 4 3 2 1 2
##
## $density
## [1] 0.0003787879 0.0011363636 0.0003787879 0.0000000000 0.0015151515
## [6] 0.0000000000 0.0000000000 0.0007575758 0.0007575758 0.0003787879
## [11] 0.0011363636 0.0003787879 0.0007575758 0.0003787879 0.0015151515
## [16] 0.0003787879 0.0007575758 0.0003787879 0.0011363636 0.0015151515
## [21] 0.0011363636 0.0007575758 0.0003787879 0.0007575758
##
## $mids
## [1] 30 90 150 210 270 330 390 450 510 570 630 690 750 810 870
## [16] 930 990 1050 1110 1170 1230 1290 1350 1410
##
## $xname
## [1] "babyboom_data$V4"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
## Var1 Freq
## 1 0 3
## 2 1 8
## 3 2 6
## 4 3 4
## 5 4 3
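library(grid)  # loaded for vcd
library(vcd)   # provides goodfit()
# Fit a Poisson distribution by the minimum chi-square method
gf <- goodfit(fre_matrix$Freq, type = "poisson", method = "MinChisq")
# plot(gf, main = "Count data vs Poisson distribution")
summary(gf)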
## Warning in summary.goodfit(gf): Chi-squared approximation may be incorrect
##
## Goodness-of-fit test for poisson distribution
##
## X^2 df P(> X^2)
## Pearson 4.555953 7 0.7139696
The p-value is larger than \(0.05\), so I cannot reject the null hypothesis that the number of births per hour follows a Poisson distribution.
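The same idea can be worked by hand in base R, which shows where the Pearson statistic comes from. This sketch uses the hourly counts printed in the `$counts` output above, estimates the Poisson rate by the sample mean (not by minimum chi-square, so the numbers will not match the `goodfit` summary), and pools the last cell as "4 or more".

```r
# Births per hour, copied from the $counts output above.
counts <- c(1, 3, 1, 0, 4, 0, 0, 2, 2, 1, 3, 1,
            2, 1, 4, 1, 2, 1, 3, 4, 3, 2, 1, 2)

# Number of hours with 0, 1, 2, 3 or 4 births.
observed <- table(factor(counts, levels = 0:4))

# Poisson cell probabilities with the rate estimated by the sample mean;
# the last cell collects "4 or more" so that the probabilities sum to 1.
lambda <- mean(counts)
p <- c(dpois(0:3, lambda), ppois(3, lambda, lower.tail = FALSE))

# Pearson statistic; one extra degree of freedom is lost because
# lambda was estimated from the data, so df = cells - 1 - 1.
expected <- sum(observed) * p
x2 <- sum((observed - expected)^2 / expected)
df <- length(observed) - 2
p_value <- pchisq(x2, df, lower.tail = FALSE)

print(rbind(observed = as.vector(observed), expected = round(expected, 2)))
print(c(X2 = x2, df = df, p_value = p_value))
```

Several expected counts are below 5, so the chi-squared approximation is rough here, in line with the warning printed by `summary(gf)`.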
To read “txt” files, I use the R function read.table().
read.table('../datasets/euroweight.dat.txt',header = FALSE,
dec = '.',na.strings = 'NA') -> euroweight_data
For the euroweight dataset, the variables used below are V2, the measured weight, and V3, the group number (1 to 8).
##
## Shapiro-Wilk normality test
##
## data: euroweight_data$V2
## W = 0.97547, p-value < 2.2e-16
The p-value for the whole sample is smaller than \(0.05\), so we reject the null hypothesis that the weights are normally distributed.
I wrote a function to test normality for each group in a sample. The results are shown below.
## Loading required package: magrittr
## No Group W p.value norm.test
## 1 1 1 0.9955066 6.830017e-01 Norm
## 2 2 2 0.9909001 1.218770e-01 Norm
## 3 3 3 0.8634321 4.089445e-14 Other_situation
## 4 4 4 0.9955047 6.826586e-01 Norm
## 5 5 5 0.9910340 1.289928e-01 Norm
## 6 6 6 0.9840595 6.756499e-03 Other_situation
## 7 7 7 0.9907008 1.119834e-01 Norm
## 8 8 8 0.9367201 6.827698e-09 Other_situation
## 9 Test Method: Shapiro-Wilk NA NA <NA>
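The helper itself is not shown in the report, so the following is only a guess at how such a function could look. The name `shapiro_by_group` and its arguments are hypothetical; the column names mimic the table above, and R's built-in `iris` data frame is used as a stand-in for the usage example.

```r
# Hypothetical helper: run shapiro.test() separately on each group.
shapiro_by_group <- function(x, group, alpha = 0.05) {
  pieces <- split(x, group)             # one numeric vector per group
  res <- lapply(pieces, shapiro.test)   # one 'htest' object per group
  out <- data.frame(
    Group   = names(pieces),
    W       = vapply(res, function(r) unname(r$statistic), numeric(1)),
    p.value = vapply(res, function(r) r$p.value, numeric(1)),
    row.names = NULL
  )
  # Label each group according to whether normality is rejected.
  out$norm.test <- ifelse(out$p.value > alpha, "Norm", "Other_situation")
  out
}

# Usage on the built-in iris data (sepal length by species).
out <- shapiro_by_group(iris$Sepal.Length, iris$Species)
print(out)
```

Testing many groups at the same level inflates the chance of at least one false rejection, so the labels are best read as a screening step.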
As the table above shows, not all groups in the sample are normally distributed. So I use non-parametric tests: the Kruskal-Wallis test and pairwise.wilcox.test.
##
## Kruskal-Wallis rank sum test
##
## data: euroweight_data$V2 by euroweight_data$V3
## Kruskal-Wallis chi-squared = 97.5, df = 7, p-value < 2.2e-16
According to the result, we reject the null hypothesis that the weights in all groups have the same location.
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: euroweight_data$V2 and euroweight_data$V3
##
## 1 2 3 4 5 6 7
## 2 1.00000 - - - - - -
## 3 0.04297 0.00025 - - - - -
## 4 0.00141 0.10329 7.8e-12 - - - -
## 5 0.00108 0.10329 2.6e-12 1.00000 - - -
## 6 0.76768 0.04297 1.00000 2.9e-07 1.7e-07 - -
## 7 1.00000 1.00000 0.00012 0.10329 0.10202 0.04297 -
## 8 1.00000 0.10329 0.73578 1.4e-06 7.1e-07 1.00000 0.10202
##
## P value adjustment method: holm
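A short sketch of the omnibus-then-pairwise workflow, run on simulated data. The eight groups, the group size of 30, and the shift applied to group 3 are all assumptions made for illustration.

```r
set.seed(5)
# Hypothetical data: 8 groups of 30 values; group 3 is shifted upwards.
group  <- factor(rep(1:8, each = 30))
weight <- rnorm(240, mean = 10, sd = 0.5) + ifelse(group == 3, 0.6, 0)

# Omnibus test: do all groups share the same location?
kw <- kruskal.test(weight ~ group)

# Follow-up: all pairwise Wilcoxon tests with Holm-adjusted p-values.
pw <- pairwise.wilcox.test(weight, group, p.adjust.method = "holm")

print(kw)
print(round(pw$p.value, 4))
```

The Holm adjustment controls the family-wise error rate across the 28 pairwise comparisons, which is why many entries in such a table are reported as exactly 1.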
##
## Pairwise comparisons using t tests with pooled SD
##
## data: euroweight_data$V2 and euroweight_data$V3
##
## 1 2 3 4 5 6 7
## 2 1.00000 - - - - - -
## 3 0.01455 0.00014 - - - - -
## 4 0.00285 0.11938 3.2e-11 - - - -
## 5 0.00203 0.10225 1.7e-11 1.00000 - - -
## 6 1.00000 0.11938 0.47138 3.9e-06 2.4e-06 - -
## 7 1.00000 1.00000 0.00017 0.11019 0.09317 0.11942 -
## 8 1.00000 0.32960 0.18828 4.6e-05 3.0e-05 1.00000 0.33590
##
## P value adjustment method: holm
Because several groups in the sample are not normally distributed, the pairwise t tests are less reliable here. Comparing them with the pairwise Wilcoxon results, we find that the results differ in the pairs involving groups 3, 6 and 8.
To read “txt” files, I use the R function read.table().
read.table('../datasets/iris.txt',header = FALSE,
dec = '.',na.strings = 'NA',sep = ",") -> iris_data
For the iris dataset, the variables are: V1, sepal length; V2, sepal width; V3, petal length; V4, petal width; V5, species.
As in the previous example, I use the function shapiro.test.mulity for the normality test.
## No Group W p.value norm.test
## 1 1 Iris-setosa 0.9776985 0.4595132 Norm
## 2 2 Iris-versicolor 0.9778357 0.4647370 Norm
## 3 3 Iris-virginica 0.9711794 0.2583147 Norm
## 4 Test Method: Shapiro-Wilk NA NA <NA>
As the table above shows, normality is not rejected for the length in any group, so the lengths in each group can be treated as normally distributed.
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: iris_data$V1 and iris_data$V5
##
## Iris-setosa Iris-versicolor
## Iris-versicolor 1.7e-13 -
## Iris-virginica < 2e-16 5.9e-07
##
## P value adjustment method: holm
The results above show that every pair of species differs significantly in sepal length, so these samples do not come from one common distribution.
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: iris_data$V3 and iris_data$V5
##
## Iris-setosa Iris-versicolor
## Iris-versicolor <2e-16 -
## Iris-virginica <2e-16 <2e-16
##
## P value adjustment method: holm
For petal length, the result is the same as for sepal length.
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris_data$V1 and iris_data$V5
##
## Iris-setosa Iris-versicolor
## Iris-versicolor 1.8e-15 -
## Iris-virginica < 2e-16 2.8e-09
##
## P value adjustment method: holm
The results above show that the mean sepal lengths of the different species are not equal.
##
## Bartlett test of homogeneity of variances
##
## data: iris_data$V1 and iris_data$V5
## Bartlett's K-squared = 16.006, df = 2, p-value = 0.0003345
The result shows that the variances of sepal length in the different species are not homogeneous.
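The pairing of a pooled-SD t test with Bartlett's test can be sketched with R's built-in `iris` data frame, used here as a stand-in for `iris_data` (`Sepal.Length` in place of V1, `Species` in place of V5). The unpooled variant at the end is an addition for comparison, not something the report ran.

```r
# Pairwise t tests with a pooled standard deviation and Holm adjustment.
pt <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                      p.adjust.method = "holm", pool.sd = TRUE)

# Bartlett's test of equal variances across the three species.
bt <- bartlett.test(Sepal.Length ~ Species, data = iris)

# When Bartlett's test rejects equal variances, comparing each pair
# without pooling the standard deviation is one possible alternative.
pt_unpooled <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                               p.adjust.method = "holm", pool.sd = FALSE)

print(pt)
print(bt)
print(pt_unpooled)
```

Pooling the standard deviation assumes equal variances, so the Bartlett result is directly relevant to how far the pooled p-values can be trusted.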
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris_data$V2 and iris_data$V5
##
## Iris-setosa Iris-versicolor
## Iris-versicolor < 2e-16 -
## Iris-virginica 2.1e-09 0.0032
##
## P value adjustment method: holm
The results above show that the mean sepal widths of the different species are not equal.
##
## Bartlett test of homogeneity of variances
##
## data: iris_data$V2 and iris_data$V5
## Bartlett's K-squared = 2.2158, df = 2, p-value = 0.3302
The result shows that homogeneity of the variances of sepal width across species cannot be rejected.
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris_data$V3 and iris_data$V5
##
## Iris-setosa Iris-versicolor
## Iris-versicolor <2e-16 -
## Iris-virginica <2e-16 <2e-16
##
## P value adjustment method: holm
The results above show that the mean petal lengths of the different species are not equal.
##
## Bartlett test of homogeneity of variances
##
## data: iris_data$V3 and iris_data$V5
## Bartlett's K-squared = 55.494, df = 2, p-value = 8.905e-13
The result shows that the variances of petal length in the different species are not homogeneous.
##
## Pairwise comparisons using t tests with pooled SD
##
## data: iris_data$V4 and iris_data$V5
##
## Iris-setosa Iris-versicolor
## Iris-versicolor <2e-16 -
## Iris-virginica <2e-16 <2e-16
##
## P value adjustment method: holm
The results above show that the mean petal widths of the different species are not equal.
##
## Bartlett test of homogeneity of variances
##
## data: iris_data$V4 and iris_data$V5
## Bartlett's K-squared = 37.996, df = 2, p-value = 5.615e-09
The result shows that the variances of petal width in the different species are not homogeneous.
To read “xlsx” files, I use the R function read_excel() from the package readxl.
For the height dataset, the variables used below are HtFt, the heights of football players, and HtBk, the heights of basketball players.
##
## Shapiro-Wilk normality test
##
## data: height_data$HtFt
## W = 0.93655, p-value = 0.01609
Since the p-value is smaller than \(0.05\), we reject the null hypothesis that the heights of football players are normally distributed.
##
## Shapiro-Wilk normality test
##
## data: height_data$HtBk
## W = 0.96839, p-value = 0.3197
Since the p-value is larger than \(0.05\), we cannot reject the null hypothesis that the heights of basketball players are normally distributed.
##
## Wilcoxon rank sum test
##
## data: jitter(height_data$HtFt) and jitter(height_data$HtBk)
## W = 531, p-value = 0.0009971
## alternative hypothesis: true location shift is not equal to 0
##
## Two-sample Kolmogorov-Smirnov test
##
## data: jitter(height_data$HtFt) and jitter(height_data$HtBk)
## D = 0.33889, p-value = 0.01137
## alternative hypothesis: two-sided
The results show that the heights of football and basketball players differ in location.
## Warning in leveneTest.default(height_data$HtFt, height_data$HtBk):
## height_data$HtBk coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 14 0.5718 0.8617
## 25
The p-value is larger than \(0.05\), so we cannot reject the null hypothesis that the variances of the heights of football and basketball players are equal.
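The warning in the Levene output above says that HtBk was coerced to a factor, and the table shows 14 group degrees of freedom, which suggests that the second column was treated as a grouping variable rather than as a second sample. A sketch of one way to set up the comparison with two groups follows; the heights are hypothetical values, and `car::leveneTest` is assumed to be the function behind that output, so the call is guarded.

```r
# Hypothetical heights; not the values from the workbook.
height_wide <- data.frame(
  HtFt = c(73, 75, 72, 74, 76, 71, 73, 75),
  HtBk = c(78, 80, 77, 81, 79, 76, 82, 78)
)

# stack() turns the two columns into one value column and one factor
# column with one level per original column (that is, per sport).
height_long <- stack(height_wide)
names(height_long) <- c("height", "sport")

# Two groups, so the test should report 1 numerator degree of freedom.
n_groups <- nlevels(height_long$sport)
print(n_groups)

# Assumed to be the function used in the report; run only if installed.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::leveneTest(height ~ sport, data = height_long))
}
```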
##
## Wilcoxon rank sum test
##
## data: jitter(height_data$HtFt) and jitter(height_data$HtBk)
## W = 512, p-value = 0.0005228
## alternative hypothesis: true location shift is not equal to 0
##
## Two-sample Kolmogorov-Smirnov test
##
## data: jitter(height_data$HtFt) and jitter(height_data$HtBk)
## D = 0.35, p-value = 0.008027
## alternative hypothesis: two-sided
According to these results, we reject the null hypothesis that the distributions of the heights of football and basketball players are the same.
To read “xlsx” files, I use the R function read_excel() from the package readxl.
temp = surgery_data_omitNA$`B V right`<surgery_data_omitNA$`A V right` &
surgery_data_omitNA$`B V left`<surgery_data_omitNA$`A V left`
as.data.frame(table(temp))
## temp Freq
## 1 FALSE 18
## 2 TRUE 69
##
## Exact binomial test
##
## data: 69 and 87
## number of successes = 69, number of trials = 87, p-value = 0.06129
## alternative hypothesis: true probability of success is not equal to 0.7
## 95 percent confidence interval:
## 0.6928684 0.8725251
## sample estimates:
## probability of success
## 0.7931034
The p-value is larger than \(0.05\), so we cannot reject the null hypothesis that the probability of a successful operation is 0.7.
Thanks to knitr, developed by Yihui Xie (Xie 2015).
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://yihui.name/knitr/.
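The call that produces this output is not shown; a sketch consistent with the printed header (69 successes in 87 trials, hypothesised probability 0.7, two-sided alternative) would be:

```r
# Exact binomial test: 69 successes out of 87 trials against p = 0.7.
bin_res <- binom.test(x = 69, n = 87, p = 0.7, alternative = "two.sided")

print(bin_res$estimate)  # observed success rate, 69/87
print(bin_res$conf.int)  # 95 percent confidence interval
print(bin_res$p.value)   # two-sided p-value
```

The confidence interval contains 0.7, which agrees with the decision not to reject the null hypothesis at the 0.05 level.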