7 Analysis of variance
ANOVA decomposes the total variation in the observed data into components attributable to factor effects and to experimental error, quantifies each component, and compares the contribution of each source of variation to the total; this decomposition then serves as the basis for further statistical inference.
Preconditions: independence, normality, homogeneity of variance
7.1 One-way ANOVA
7.1.1 Mathematical model
Suppose the experiment involves a single factor A with r levels A1, A2, ..., Ar. At level Ai, ni independent observations are made, giving the data Xij, j = 1, 2, ..., ni, i = 1, 2, ..., r.
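In symbols, the standard one-way model underlying this section is

$$X_{ij}=\mu+\alpha_i+\varepsilon_{ij},\qquad \varepsilon_{ij}\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2),\quad j=1,\dots,n_i,\; i=1,\dots,r,$$

where $\alpha_i$ is the effect of level $A_i$ (with the side condition $\sum_i n_i\alpha_i=0$), and the hypothesis to be tested is $H_0:\alpha_1=\cdots=\alpha_r=0$, i.e. all level means are equal.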
SST, the total sum of squares (total variation): the sum of squared differences between each Xij and the grand mean; it describes the dispersion of all the data.
SSE, the error sum of squares (within-group sum of squares): the sum of squared differences between the observations Xij and their level mean, measuring the variation of the data within a fixed level i.
SSA, the sum of squares for factor A (between-group sum of squares): the sum of squared differences between the sample mean at each level of factor A and the grand mean, weighted by the number of observations at that level.
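Written out explicitly (standard notation consistent with the descriptions above, where $\bar X_{i\cdot}$ is the mean at level $A_i$, $\bar X$ the grand mean, and $n=\sum_i n_i$):

$$SST=\sum_{i=1}^{r}\sum_{j=1}^{n_i}\left(X_{ij}-\bar X\right)^2,\qquad SSE=\sum_{i=1}^{r}\sum_{j=1}^{n_i}\left(X_{ij}-\bar X_{i\cdot}\right)^2,\qquad SSA=\sum_{i=1}^{r}n_i\left(\bar X_{i\cdot}-\bar X\right)^2,$$

so that $SST=SSA+SSE$. The test statistic is

$$F=\frac{SSA/(r-1)}{SSE/(n-r)}\;\sim\;F(r-1,\,n-r)\quad\text{under }H_0,$$

and $H_0$ is rejected when $F$ is large (equivalently, when the p-value is small).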
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
formula is the model formula of the ANOVA, written as x ~ A for a one-way ANOVA, and data is the data frame containing the variables.
Example: choosing an impurity-removal method. Five different impurity-removal methods are compared, and each method is tested 4 times (i.e. 4 replicates), as shown in the table below.
Impurity removal method Ai | Impurity removal quantity Xij | Average quantity |
---|---|---|
A1 | 25.6 22.2 28.0 29.8 | 26.4 |
A2 | 24.4 30.0 29.0 27.5 | 27.7 |
A3 | 25.0 27.7 23.0 32.2 | 27.0 |
A4 | 28.8 28.0 31.5 25.9 | 28.6 |
A5 | 20.6 21.2 22.0 21.2 | 21.3 |
```r
> x <- c(25.6, 22.2, 28.0, 29.8, 24.4, 30.0, 29.0, 27.5, 25.0, 27.7,
+        23.0, 32.2, 28.8, 28.0, 31.5, 25.9, 20.6, 21.2, 22.0, 21.2)
> A <- factor(rep(1:5, each = 4))
> miscellany <- data.frame(x, A)
> aov.mis <- aov(x ~ A, data = miscellany)
> summary(aov.mis)
            Df Sum Sq Mean Sq F value Pr(>F)  
A            4    132    33.0    4.31  0.016 *
Residuals   15    115     7.7                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Df is the degrees of freedom, Sum Sq the sum of squares, Mean Sq the mean square, F value the value of the F test statistic, and Pr(>F) the p-value of the test; A is the factor and Residuals is the residual (error) line.
Since F = 4.31 > F0.05(5 − 1, 20 − 5) = 3.06, equivalently p = 0.016 < 0.05, there is reason to reject the null hypothesis: the five impurity-removal methods differ significantly.
Analysis of variance table
According to the above results, the following ANOVA table can be filled out:
Variance source | Degrees of freedom df | Sum of squares SS | Mean square MS | F ratio | p value |
---|---|---|---|---|---|
Factor A | 4 | 132 | 33.0 | 4.31 | 0.016 |
Residual E | 15 | 115 | 7.7 | | |
Total T | 19 | 247 | | | |
```r
> plot(miscellany$x ~ miscellany$A)   # side-by-side boxplots of x for each level of A
```
The figure also shows clear differences in impurity removal among the five methods, particularly between the fifth method and the first four, while the differences between methods 1 and 3, and between methods 2 and 4, are not pronounced.
7.1.2 Multiple comparisons of means
The analysis of variance above showed that the level means differ significantly. However, it only tells us that some of the means differ; it does not tell us which pairs of means differ.
Multiple t-test method
Applying the t test repeatedly inflates the probability of a Type I error, so the conclusions may become unreliable. The p-values should therefore be adjusted when making repeated comparisons:
p.adjust(p, method = p.adjust.methods, n = length(p))
p is the vector composed of p values, and method is the correction method.
When the number of comparisons is large, the Bonferroni adjustment works well.
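As a minimal sketch (with made-up p-values purely for illustration), p.adjust() can be applied directly to a vector of unadjusted p-values:

```r
# Five made-up unadjusted p-values, e.g. from five pairwise t tests
p <- c(0.002, 0.010, 0.030, 0.040, 0.200)

p.adjust(p, method = "bonferroni")  # each p multiplied by the number of tests (capped at 1)
p.adjust(p, method = "holm")        # Holm's step-down procedure (the default method)

# Why an adjustment is needed: with 10 independent tests at level 0.05,
# all null hypotheses being true, the chance of at least one false rejection is
1 - 0.95^10   # approximately 0.40
```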
The p-values for all pairwise comparisons can be obtained with:
pairwise.t.test(x, g, p.adjust.method = p.adjust.methods, pool.sd = !paired, paired = FALSE, alternative = c("two.sided", "less", "greater"), ...)
x is the vector of responses, g is the grouping vector (a factor), and p.adjust.method is one of the adjustment methods described above; p.adjust.method = "none" means no adjustment. By default the p-values are adjusted with the Holm method.
Example: carry out multiple comparisons of the means in the example above as a further test.
The following three methods are used for multiple comparisons:
1) No adjustment for p
```r
> pairwise.t.test(x, A, p.adjust.method = "none")

	Pairwise comparisons using t tests with pooled SD 

data:  x and A 

  1     2     3     4    
2 0.509 -     -     -    
3 0.773 0.707 -     -    
4 0.289 0.679 0.434 -    
5 0.019 0.005 0.010 0.002

P value adjustment method: none
```
2) Adjust the p values with the default method "holm"
```r
> pairwise.t.test(x, A)

	Pairwise comparisons using t tests with pooled SD 

data:  x and A 

  1    2    3    4   
2 1.00 -    -    -   
3 1.00 1.00 -    -   
4 1.00 1.00 1.00 -   
5 0.13 0.04 0.08 0.02

P value adjustment method: holm
```
3) Adjust the p values with "bonferroni"
```r
> pairwise.t.test(x, A, p.adjust.method = "bonferroni")

	Pairwise comparisons using t tests with pooled SD 

data:  x and A 

  1    2    3    4   
2 1.00 -    -    -   
3 1.00 1.00 -    -   
4 1.00 1.00 1.00 -   
5 0.19 0.05 0.10 0.02

P value adjustment method: bonferroni
```
The output shows that the p-values increase after adjustment, which to some extent overcomes the drawback of repeated t tests.
Distinction between a statistically significant difference and a practically large difference
Because "significant" is commonly used to describe the size of a p-value, the most frequent misuse of the p-value is to confuse statistical significance with clinical or practical importance, that is, to confuse "statistically significant difference" with "a large difference". The former means p ≤ 0.05: if the two actually came from the same population, data this extreme would be seen less than 5% of the time, so we conclude that the two differ, and the risk of this conclusion being wrong is at most 5%. The latter means that the two really differ by a large amount. For example, 4 and 40 differ greatly, so one can speak of a large difference, whereas 4 and 4.2 differ very little; yet if the computed p-value is ≤ 0.05, their difference is statistically significant, even though it cannot be called large.
--------
Original link: https://blog.csdn.net/yu1581274988/article/details/117295802
A common confusion about confidence intervals
Saying that a parameter has 95% confidence of lying in interval A means:
Correct: draw 100 samples and compute a 95% confidence interval from each; about 95 of these intervals will contain the true value.
Wrong: after drawing 100 samples, 95 of the true values fall inside the confidence interval.
The true value does not change; it is the confidence interval that changes from sample to sample.
--------
Original link: https://blog.csdn.net/yimingsilence/article/details/78084810
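A small simulation (a sketch added here for illustration, not part of the quoted source) makes the correct reading concrete: the true mean stays fixed while the interval changes from sample to sample.

```r
set.seed(1)
mu <- 10                                    # fixed true mean
covered <- replicate(100, {
  s  <- rnorm(20, mean = mu, sd = 3)        # draw one new sample
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]                # does this interval contain mu?
})
mean(covered)                               # close to 0.95 over many repetitions
```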
7.1.3 Simultaneous confidence intervals: the Tukey method
Construct simultaneous confidence intervals for the pairwise differences of effects αi − αj to determine which effects are unequal.
TukeyHSD(x, which, ordered = FALSE, conf.level = 0.95, ...)
x is a fitted ANOVA (aov) object; which specifies the factors for which comparison intervals are to be computed; if ordered = TRUE, the factor levels are sorted in increasing order of their means so that the differences between levels appear as positive values.
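As a quick illustration before the new example below, TukeyHSD() can be applied directly to the aov.mis object fitted in Section 7.1.1:

```r
> TukeyHSD(aov.mis)        # simultaneous 95% intervals for all pairwise differences of A
> plot(TukeyHSD(aov.mis))  # intervals that do not cover 0 indicate significant pairs
```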
Example: a shop sells a new type of watch using five different sales methods. The sales over four consecutive days under each method are shown in the table below. Is there a significant difference between the sales methods?
Sales mode | Sales data |
---|---|
A1 | 23 19 21 13 |
A2 | 24 25 28 27 |
A3 | 20 18 19 15 |
A4 | 22 25 26 23 |
A5 | 24 23 26 27 |
```r
# Generate the data
> sales <- data.frame(x = c(23, 19, 21, 13, 24, 25, 28, 27, 20, 18, 19, 15,
+                           22, 25, 26, 23, 24, 23, 26, 27),
+                     A = factor(rep(1:5, c(4, 4, 4, 4, 4))))
# Analysis of variance
> summary(aov(x ~ A, sales))
            Df Sum Sq Mean Sq F value Pr(>F)   
A            4    213    53.2    7.98 0.0012 **
Residuals   15    100     6.7                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The sales methods differ; now find the differences between the means together with
# their simultaneous confidence intervals
> TukeyHSD(aov(x ~ A, sales))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = x ~ A, data = sales)

$A
    diff      lwr    upr  p adj
2-1    7   1.3622 12.638 0.0120
3-1   -1  -6.6378  4.638 0.9806
4-1    5  -0.6378 10.638 0.0945
5-1    6   0.3622 11.638 0.0344
3-2   -8 -13.6378 -2.362 0.0042
4-2   -2  -7.6378  3.638 0.8062
5-2   -1  -6.6378  4.638 0.9806
4-3    6   0.3622 11.638 0.0344
5-3    7   1.3622 12.638 0.0120
5-4    1  -4.6378  6.638 0.9806
```
7.1.4 Test of homogeneity of variances
Test whether the variance of data is the same at different levels.
Bartlett test
H0: the variances at all factor levels are equal; H1: the variances at the factor levels are not all equal.
bartlett.test(x, g, ...)
bartlett.test(formula, data, subset, na.action, ...)
x is a data vector or a list of data vectors, and g is the factor giving the groups (ignored when x is a list);
Formula is the analysis of variance formula and data is the data frame.
Levene test
levene.test(x, group) (in the lawstat package) or leveneTest(y, group, center = median, ...) (in the car package)
x (or y) is the data vector and group is the grouping factor.
Example: test the homogeneity of variances for the data of the example above.
1) Using the Bartlett test
```r
> bartlett.test(x ~ A, data = sales)

	Bartlett test of homogeneity of variances

data:  x by A
Bartlett's K-squared = 3.7, df = 4, p-value = 0.4
```
Since p = 0.4 > 0.05, the null hypothesis is not rejected, and the variances of the treatment groups are considered equal.
2) Levene test
```r
> leveneTest(sales$x, sales$A)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4    0.82   0.53
      15               
```
Notes
- The ANOVA model can be regarded as a special linear model, so the linear-model function lm() together with anova() can also be used to obtain the ANOVA table: summary(aov(formula)) gives the same table as anova(lm(formula)).
- One-way ANOVA can also be performed with oneway.test(). If the variances at the levels are equal (var.equal = TRUE), this is equivalent to an ordinary analysis of variance with aov(); if they are unequal (var.equal = FALSE), Welch's (1951) approximation is used.
- When the distributions at the levels are unknown (normality cannot be assumed), the Kruskal-Wallis rank-sum test can be used instead of the analysis of variance; see the sketch below.
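A brief sketch of these points, using the miscellany data frame built in Section 7.1.1:

```r
# ANOVA via the linear-model interface: the same table as summary(aov(x ~ A))
anova(lm(x ~ A, data = miscellany))

# One-way ANOVA with the Welch correction for unequal variances;
# with var.equal = TRUE it reproduces the classical equal-variance F test
oneway.test(x ~ A, data = miscellany, var.equal = FALSE)

# Distribution-free alternative when normality is doubtful
kruskal.test(x ~ A, data = miscellany)
```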
7.2 Two-way ANOVA
7.2.1 Analysis of variance without interaction
We can still use aov(); the model formula is x ~ A + B, where the plus sign indicates that the effects of the two factors are additive.
Example: three methods A1, A2 and A3 are available for determining the lead content of fruit juice, and a new rapid method A4 has been developed; we want to test whether A4 can replace the first three. The objects observed are fruit juices, with the different juices treated as different levels of factor B: B1 apple, B2 grape juice, B3 tomato juice, B4 apple juice, B5 orange juice, B6 pineapple-lemon juice. A crossed two-factor design is used: every juice is determined by all four methods, with the results shown in the table below. Question: do factor A (determination method) and factor B (juice product) have a significant effect on the measured lead content?
A | B1 | B2 | B3 | B4 | B5 | B6 | xi· |
---|---|---|---|---|---|---|---|
A1 | 0.05 | 0.46 | 0.12 | 0.16 | 0.84 | 1.30 | 2.93 |
A2 | 0.08 | 0.38 | 0.40 | 0.10 | 0.92 | 1.57 | 3.45 |
A3 | 0.11 | 0.43 | 0.05 | 0.10 | 0.94 | 1.10 | 2.73 |
A4 | 0.11 | 0.44 | 0.08 | 0.03 | 0.93 | 1.15 | 2.74 |
x·j | 0.35 | 1.71 | 0.65 | 0.39 | 3.63 | 5.12 | X··=11.85 |
```r
> juice <- data.frame(x = c(0.05, 0.46, 0.12, 0.16, 0.84, 1.30,
+                           0.08, 0.38, 0.40, 0.10, 0.92, 1.57,
+                           0.11, 0.43, 0.05, 0.10, 0.94, 1.10,
+                           0.11, 0.44, 0.08, 0.03, 0.93, 1.15),
+                     A = gl(4, 6), B = gl(6, 1, 24))
> # gl(n, k, length = n*k, labels = 1:n, ordered = FALSE): n is the number of levels,
> # k the number of consecutive repetitions of each level, length the total number of
> # observations, and ordered indicates whether the factor levels are ordered.
> juice.aov <- aov(x ~ A + B, data = juice)
> summary(juice.aov)
            Df Sum Sq Mean Sq F value Pr(>F)    
A            3   0.06   0.019    1.63   0.22    
B            5   4.90   0.980   83.98  2e-10 ***
Residuals   15   0.18   0.012                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> # The p-values indicate that B has a significant effect on the lead content, while
> # there is no sufficient evidence that A does.
> bartlett.test(x ~ A, data = juice)

	Bartlett test of homogeneity of variances

data:  x by A
Bartlett's K-squared = 0.27, df = 3, p-value = 1

> bartlett.test(x ~ B, data = juice)

	Bartlett test of homogeneity of variances

data:  x by B
Bartlett's K-squared = 17, df = 5, p-value = 0.004

> # Factor B does not satisfy the homogeneity-of-variance requirement.
```
7.2.2 Analysis of variance with interaction
The function aov() can still be used; the model formula is x ~ A + B + A:B, or equivalently x ~ A*B.
Example: in an experiment on drug toxicity, 48 mice were treated with one of three poisons (factor A) combined with one of four treatment regimens (factor B). Each factor combination was replicated four times, and the survival time of each mouse was recorded; the data are shown in the table below. Do the poison, the treatment regimen, and their interaction have a significant effect on the survival time of the mice?
Poison \ Treatment | A | B | C | D |
---|---|---|---|---|
1 | 0.31 0.45 0.46 0.43 | 0.82 1.10 0.88 0.72 | 0.43 0.45 0.63 0.76 | 0.45 0.71 0.66 0.62 |
2 | 0.36 0.29 0.40 0.23 | 0.92 0.61 0.49 1.24 | 0.44 0.35 0.31 0.40 | 0.56 1.02 0.71 0.38 |
3 | 0.22 0.21 0.18 0.23 | 0.30 0.37 0.38 0.29 | 0.23 0.25 0.24 0.22 | 0.30 0.36 0.31 0.33 |
```r
# Create the data frame
> rats <- data.frame(x = c(0.31, 0.45, 0.46, 0.43, 0.82, 1.10, 0.88, 0.72,
+                          0.43, 0.45, 0.63, 0.76, 0.45, 0.71, 0.66, 0.62,
+                          0.36, 0.29, 0.40, 0.23, 0.92, 0.61, 0.49, 1.24,
+                          0.44, 0.35, 0.31, 0.40, 0.56, 1.02, 0.71, 0.38,
+                          0.22, 0.21, 0.18, 0.23, 0.30, 0.37, 0.38, 0.29,
+                          0.23, 0.25, 0.24, 0.22, 0.30, 0.36, 0.31, 0.33),
+                    A = gl(3, 16, 48, labels = c("1", "2", "3")),
+                    B = gl(4, 4, 48, labels = c("A", "B", "C", "D")))
> par(mfrow = c(1, 2))           # modify the plot layout
> plot(x ~ A + B, data = rats)   # boxplots show clear differences between the levels of each factor
```
Use the function interaction.plot() to draw interaction plots and examine whether the two factors interact.
```r
> with(rats, interaction.plot(A, B, x, trace.label = "B"))
> with(rats, interaction.plot(B, A, x, trace.label = "A"))
```
The curves in the two plots do not cross to any marked degree, so it is tentatively concluded that the two factors do not interact. This is tested further with aov().
```r
> rats.aov <- aov(x ~ A*B, data = rats)
> summary(rats.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)    
A            2  1.033   0.517   23.22 3.3e-07 ***
B            3  0.921   0.307   13.81 3.8e-06 ***
A:B          6  0.250   0.042    1.87    0.11    
Residuals   36  0.801   0.022                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The poisons and the treatment regimens both have a significant effect on the survival time of the mice, while the effect of their interaction is not significant.
Finally, check whether the data satisfy the homogeneity-of-variance requirement under factors A and B:
```r
> bartlett.test(rats$x, rats$A)

	Bartlett test of homogeneity of variances

data:  rats$x and rats$A
Bartlett's K-squared = 26, df = 2, p-value = 2e-06

> bartlett.test(rats$x, rats$B)

	Bartlett test of homogeneity of variances

data:  rats$x and rats$B
Bartlett's K-squared = 13, df = 3, p-value = 0.004
```
Both p-values are less than 0.05, so the homogeneity-of-variance requirement is not satisfied, which is consistent with the boxplots drawn at the beginning.
7.3 Analysis of covariance
Analysis of covariance (ANCOVA) is a statistical method that combines linear regression with analysis of variance. The basic idea is to treat variables that affect the response y but are unknown or hard to control as covariates, establish a linear regression relationship between y and the covariate x, use this relationship to adjust the observations to a common value of x, and then test the differences between the adjusted group means of y. In essence, the regression sum of squares of y on x is removed from the total sum of squares of y, the remaining sum of squares is decomposed further, and an analysis of variance is then performed, which gives a better assessment of the treatment effect.
In other words, the influence of these uncontrollable factors is removed or balanced out while the group means of two or more groups are compared.
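In symbols, the standard one-covariate ANCOVA model underlying this idea is

$$Y_{ij}=\mu+\tau_i+\beta\,(X_{ij}-\bar X)+\varepsilon_{ij},\qquad \varepsilon_{ij}\sim N(0,\sigma^2),$$

where $\tau_i$ is the effect of the $i$-th treatment group and $\beta$ is the common slope of the regression of $Y$ on the covariate $X$; the hypothesis of interest is $H_0:\tau_1=\cdots=\tau_r$ for the means adjusted to a common value of $X$.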
```r
ancova(formula, data.in = NULL, ..., x, groups, transpose = FALSE,
       display.plot.command = FALSE, superpose.level.name = "superpose",
       ignore.groups = FALSE, ignore.groups.name = "ignore.groups",
       blocks, blocks.pch = letters[seq(levels(blocks))],
       layout, between, main,
       pch = trellis.par.get()$superpose.symbol$pch)
```
formula is the formula for the analysis of covariance, data.in is the data frame, and x is the covariate. When plotting, if x does not appear in the formula it must be supplied explicitly, and groups must be given as a factor. (The ancova() function is provided by the HH package, which must be loaded first with library(HH).)
Example: to study the fattening effect of three feeds A, B and C on pigs, 8 pigs were fed with each feed for a period of time, and each pig's initial weight X and weight gain Y were recorded. Are the fattening effects of the three feeds on pigs the same?
Pig No. | Feed A: X1 Y1 | Feed B: X2 Y2 | Feed C: X3 Y3 |
---|---|---|---|
1 | 15 85 | 17 97 | 22 89 |
2 | 13 83 | 16 90 | 24 91 |
3 | 11 65 | 18 100 | 20 83 |
4 | 12 76 | 18 95 | 23 95 |
5 | 12 80 | 21 103 | 25 100 |
6 | 16 91 | 22 106 | 27 102 |
7 | 14 84 | 19 99 | 30 105 |
8 | 17 90 | 18 94 | 32 110 |
The feed is a qualitative factor that can be controlled. The initial weight of a pig is a quantitative factor that is difficult to control and is taken as the covariate X. The observed response is the weight gain Y. Because the initial weight affects the gain, the gains of the groups cannot be compared directly by analysis of variance; analysis of covariance is required.
```r
# Create the data set
> feed <- rep(c('A', 'B', 'C'), each = 8)
> X <- c(15, 13, 11, 12, 12, 16, 14, 17, 17, 16, 18, 18, 21, 22, 19, 18,
+        22, 24, 20, 23, 25, 27, 30, 32)
> Y <- c(85, 83, 65, 76, 80, 91, 84, 90, 97, 90, 100, 95, 103, 106, 99, 94,
+        89, 91, 83, 95, 100, 102, 105, 110)
> pig <- data.frame(feed, X, Y)
```
1) Assume that the initial weights of the pigs differ, but the growth rate (the slope of Y on X) is the same for the three feeds
```r
> ancova(Y ~ X + feed, data = pig)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value  Pr(>F)    
X          1   1621    1621   142.4 1.5e-10 ***
feed       2    707     354    31.1 7.3e-07 ***
Residuals 20    228      11                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Both the initial weight (covariate X) and the feed have a significant effect on the weight gain of the pigs.
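ancova() prints the ANOVA table of the underlying linear model, so the same sequential table should be reproducible with base R (a quick cross-check, using the pig data frame built above):

```r
anova(lm(Y ~ X + feed, data = pig))   # sequential (Type I) sums of squares, as above
```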
2) Assume that both the initial weights and the growth rates differ under the three feeds (i.e. include the X:feed interaction)
```r
> ancova(Y ~ X*feed, data = pig)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value  Pr(>F)    
X          1   1621    1621  162.49 1.9e-10 ***
feed       2    707     354   35.44 5.7e-07 ***
X:feed     2     48      24    2.41    0.12    
Residuals 18    180      10                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Comparing the two fits (and the plots produced by ancova()), the interaction X:feed is not significant (p = 0.12): there is no evidence that the effect of the initial weight on the gain differs among the three feeds, so the parallel-slopes model in 1) is adequate. Under that model, the three feeds differ significantly in their fattening effect once the initial weight has been accounted for.