R language learning notes 7: Analysis of variance

Posted by askbapi on Wed, 22 Dec 2021 11:33:03 +0100

7. Analysis of variance

The main task of analysis of variance (ANOVA) is to decompose the total variation of the observed data into factor effects and experimental error according to the different sources of variation, quantify each component, and compare the importance of the various sources within the total variation, as a basis for further statistical inference.
Preconditions: independence, normality, homogeneity of variance

7.1 One-way ANOVA

7.1.1 Mathematical model

Suppose the experiment involves a single factor A with r levels A1, A2, ..., Ar, and that ni independent observations are made at level Ai, yielding the data Xij, j = 1, 2, ..., ni, i = 1, 2, ..., r.
SST, total sum of squares (total variation): the sum of squared differences between each Xij and the grand mean; it describes the dispersion of all the data.
SSE, error sum of squares (within-group sum of squares): the sum of squared differences between the observations Xij and their level mean for each fixed level i; it measures variation within groups.
SSA, sum of squares of the effects of factor A (between-group sum of squares): the weighted sum of squared differences between the sample mean at each level of factor A and the grand mean.
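
In symbols, writing $\bar{X}_{i\cdot}$ for the sample mean at level $A_i$ and $\bar{X}$ for the grand mean, the standard decomposition is:

$$SS_T=\sum_{i=1}^{r}\sum_{j=1}^{n_i}\bigl(X_{ij}-\bar{X}\bigr)^2,\qquad SS_E=\sum_{i=1}^{r}\sum_{j=1}^{n_i}\bigl(X_{ij}-\bar{X}_{i\cdot}\bigr)^2,\qquad SS_A=\sum_{i=1}^{r}n_i\bigl(\bar{X}_{i\cdot}-\bar{X}\bigr)^2,$$

$$SS_T=SS_E+SS_A.$$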

aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)

formula is the ANOVA model formula, written x ~ A for one-way ANOVA; data is the data frame.

Example: selecting an impurity removal method. Five different impurity removal methods are tried, and each method is tested 4 times (i.e., 4 replicates), as shown in the table below.

Impurity removal method Ai   Impurity removal quantity Xij   Average
A1                           25.6 22.2 28.0 29.8             26.4
A2                           24.4 30.0 29.0 27.5             27.7
A3                           25.0 27.7 23.0 32.2             27.0
A4                           28.8 28.0 31.5 25.9             28.6
A5                           20.6 21.2 22.0 21.2             21.3
> x<-c(25.6,22.2,28.0,29.8, 24.4,30.0,29.0,27.5, 25.0,27.7,23.0,32.2, 28.8,28.0,31.5,25.9, 20.6,21.2,22.0,21.2)
> A<-factor(rep(1:5,each=4))
> miscellany<-data.frame(x,A)
> aov.mis<-aov(x~A,data=miscellany)
> summary(aov.mis)
            Df Sum Sq Mean Sq F value Pr(>F)  
A            4    132    33.0    4.31  0.016 *
Residuals   15    115     7.7                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Df is the degrees of freedom, Sum Sq is the sum of squares, Mean Sq is the mean square, F value is the F test statistic, and Pr(>F) is the p value; A is the factor row and Residuals is the residual row.

It can be seen that F = 4.31 > F0.05(5-1, 20-5) = F0.05(4, 15) = 3.06, or equivalently p = 0.016 < 0.05, so there is reason to reject the null hypothesis: the five impurity removal methods differ significantly.
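
The critical value used above can be computed directly in R:

> qf(0.95, 4, 15)  # upper 5% quantile of F(4,15), about 3.06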

Analysis of variance table

According to the above results, the following ANOVA table can be filled out:

Variance source   Degrees of freedom df   Sum of squares SS   Mean square MS   F ratio   p value
Factor A          4                       132                 33.0             4.31      0.016
Residual E        15                      115                 7.7
Total T           19                      247
> plot(miscellany$x~miscellany$A)


The boxplot also shows clear differences in impurity removal among the five methods, especially between method 5 and the other four, while the differences between methods 1 and 3, and between methods 2 and 4, are not obvious.

7.1.2 Multiple comparisons of means

After the analysis of variance finds significant differences among the effect means, we only know that some means differ from each other; we do not yet know which means are different. Multiple comparisons answer that question.

Multiple t-test method

When the t test is applied repeatedly to many pairs, the probability of committing a Type I error inflates, so the conclusions may be unreliable. The p values should therefore be adjusted when making repeated comparisons:

p.adjust(p, method = p.adjust.methods, n = length(p))

p is a vector of p values and method is the adjustment method.
When the number of comparisons is large, the Bonferroni method works well.
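
A minimal illustration with made-up p values:

> p <- c(0.01, 0.02, 0.04)
> p.adjust(p, method = "bonferroni")  # each p multiplied by n = 3, capped at 1
[1] 0.03 0.06 0.12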
To obtain the p values of all pairwise comparisons directly:

pairwise.t.test(x, g, p.adjust.method = p.adjust.methods, pool.sd = !paired, paired = FALSE, alternative = c("two.sided", "less", "greater"), ...)

x is the vector of response values, g is the grouping vector (factor), and p.adjust.method is the adjustment method described above; p.adjust.method = "none" means no adjustment. By default the p values are adjusted by the Holm method.

Example: make multiple comparisons of the means from the example above to test further.
The following three approaches are compared:

1) No adjustment for p

> pairwise.t.test(x,A,p.adjust.method = 'none')

	Pairwise comparisons using t tests with pooled SD 

data:  x and A 

  1     2     3     4    
2 0.509 -     -     -    
3 0.773 0.707 -     -    
4 0.289 0.679 0.434 -    
5 0.019 0.005 0.010 0.002

P value adjustment method: none 

2) Adjust the p values with the default method "holm"

> pairwise.t.test(x,A)

	Pairwise comparisons using t tests with pooled SD 

data:  x and A 

  1    2    3    4   
2 1.00 -    -    -   
3 1.00 1.00 -    -   
4 1.00 1.00 1.00 -   
5 0.13 0.04 0.08 0.02

P value adjustment method: holm 

3) Adjust the p values with the "bonferroni" method

> pairwise.t.test(x,A,p.adjust.method = "bonferroni")

	Pairwise comparisons using t tests with pooled SD 

data:  x and A 

  1    2    3    4   
2 1.00 -    -    -   
3 1.00 1.00 -    -   
4 1.00 1.00 1.00 -   
5 0.19 0.05 0.10 0.02

P value adjustment method: bonferroni

The output shows that the adjusted p values are larger, which to some extent overcomes the drawback of repeated t tests.

Distinction between statistical significance and practical significance

Because "significant" is commonly used to describe the size of a p value, the most common misuse of p values is to confuse statistical significance with clinical or practical importance. Statistical significance means p <= 0.05: there is sufficient reason to believe that the probability the two samples come from the same population is below 5%, so the two are judged to differ, and the chance that this conclusion is wrong is at most 5%. Practical importance means the difference is actually large: 4 versus 40 is a big difference, whereas 4 versus 4.2 is tiny. Even if the computed p value is <= 0.05 and the difference is "statistically significant", it does not follow that the difference is practically important.
--------
Original link: https://blog.csdn.net/yu1581274988/article/details/117295802

Misinterpretation of confidence intervals

A 95% confidence interval for a parameter means:
Correct: if 100 samples are drawn and a 95% confidence interval is computed from each, about 95 of the intervals will contain the true value.
Wrong: after 100 samples, 95 of the true values fall within the confidence interval.
The true value is fixed; it is the confidence interval that varies from sample to sample.
--------
Original link: https://blog.csdn.net/yimingsilence/article/details/78084810
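
The correct interpretation can be checked by a small simulation (a sketch; the population mean 5, sd 2 and sample size 20 are made-up values):

> set.seed(1)
> covered <- replicate(100, {
+   s <- rnorm(20, mean = 5, sd = 2)  # sample from a known population
+   ci <- t.test(s)$conf.int          # 95% confidence interval for the mean
+   ci[1] <= 5 && 5 <= ci[2]          # does the interval contain the true mean?
+ })
> sum(covered)  # close to 95 of the 100 intervals cover the fixed true value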

7.1.3 Simultaneous confidence intervals: the Tukey method

Construct simultaneous confidence intervals for the pairwise effect differences αi - αj to determine which effects are unequal.

TukeyHSD(x, which, ordered = FALSE, conf.level = 0.95, ...)

x is a fitted ANOVA object; which names the factors for which comparison intervals are computed; if ordered = TRUE the factor levels are sorted in increasing order of mean, so the differences between factors appear as positive values.

Example: a store sells a new watch using five different sales methods. The sales over four consecutive days under each method are shown in the table below. Do the sales methods differ significantly?

Sales mode   Sales data
A1           23 19 21 13
A2           24 25 28 27
A3           20 18 19 15
A4           22 25 26 23
A5           24 23 26 27
# Generate the data
> sales<-data.frame(x=c(23,19,21,13, 24,25,28,27, 20,18,19,15, 22,25,26,23, 24,23,26,27), A=factor(rep(1:5,each=4)))
#Analysis of variance
> summary(aov(x~A,sales))
            Df Sum Sq Mean Sq F value Pr(>F)   
A            4    213    53.2    7.98 0.0012 **
Residuals   15    100     6.7                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The sales methods differ significantly; next find the pairwise mean differences and their simultaneous confidence intervals
> TukeyHSD(aov(x~A,sales))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = x ~ A, data = sales)

$A
    diff      lwr    upr  p adj
2-1    7   1.3622 12.638 0.0120
3-1   -1  -6.6378  4.638 0.9806
4-1    5  -0.6378 10.638 0.0945
5-1    6   0.3622 11.638 0.0344
3-2   -8 -13.6378 -2.362 0.0042
4-2   -2  -7.6378  3.638 0.8062
5-2   -1  -6.6378  4.638 0.9806
4-3    6   0.3622 11.638 0.0344
5-3    7   1.3622 12.638 0.0120
5-4    1  -4.6378  6.638 0.9806
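
The simultaneous intervals can also be shown graphically; pairs whose interval does not cross zero differ significantly:

> plot(TukeyHSD(aov(x~A,sales)))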

7.1.4 Homogeneity of variance tests

Test whether the variance of data is the same at different levels.

Bartlett test

H0: the variances at all factor levels are equal; H1: the variances at the factor levels are not all equal.

bartlett.test(x, g, ...)
bartlett.test(formula, data, subset, na.action, ...)

x is a vector or list of data values, and g is the factor vector of groupings (ignored when x is a list);
formula is the ANOVA formula and data is the data frame.

Levene test

levene.test(x, group)                      # from the lawstat package
leveneTest(y, group, center=median, ...)   # from the car package

x (or y) is the vector of data values and group is the grouping factor; center chooses the location estimate used for centering.

Example: test homogeneity of variance for the sales data above.

1) Using the Bartlett test

> bartlett.test(x~A,data=sales)

	Bartlett test of homogeneity of variances

data:  x by A
Bartlett's K-squared = 3.7, df = 4, p-value = 0.4

Since p = 0.4 > 0.05, the null hypothesis is not rejected, and the variances of the treatment groups are considered equal.
2) Using the Levene test

> library(car)
> leveneTest(sales$x,sales$A)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4    0.82   0.53
      15 
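
Besides homogeneity of variance, the preconditions listed at the start of this chapter also include normality; a quick sketch of checking it on the residuals with base R:

> sales.aov <- aov(x~A, data = sales)
> shapiro.test(residuals(sales.aov))  # H0: the residuals are normal
> qqnorm(residuals(sales.aov)); qqline(residuals(sales.aov))  # visual check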

Notes

  • The ANOVA model can be regarded as a special linear model, so the linear-model functions lm() and anova() can also produce the ANOVA table: anova(lm(formula)) is equivalent to summary(aov(formula)).
  • One-way ANOVA can also be done with the function oneway.test(). With var.equal=TRUE it is equivalent to an ordinary aov() analysis; with var.equal=FALSE it uses Welch's (1951) approximate method, which does not assume equal variances.
  • When the distributions at the levels are unknown, the Kruskal-Wallis rank sum test can be used instead; see the sketch below.
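
A sketch of these alternatives, run on the impurity-removal data from 7.1.1:

> anova(lm(x~A, data = miscellany))    # same ANOVA table via the linear model
> oneway.test(x~A, data = miscellany)  # Welch ANOVA; does not assume equal variances
> kruskal.test(x~A, data = miscellany) # Kruskal-Wallis rank sum test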

7.2 Two-way ANOVA

7.2.1 Analysis of variance without interaction

We can still use the function aov(); the model formula is x ~ A + B, where the plus sign indicates that the two factors are additive (no interaction).

Example: three methods A1, A2 and A3 are used to determine the lead content of fruit juice, and a new rapid method A4 has been developed; we need to test whether A4 can replace the first three. The objects of observation are different juices, treated as the levels of a second factor: B1 apple, B2 grape juice, B3 tomato juice, B4 apple juice, B5 orange juice, B6 pineapple-lemon juice. A two-factor cross design is used: each juice is measured by all four methods, with results in the table below. Question: do factor A (test method) and factor B (juice product) have a significant effect on the measured lead content?

A     B1     B2     B3     B4     B5     B6     xi·
A1    0.05   0.46   0.12   0.16   0.84   1.30   2.93
A2    0.08   0.38   0.40   0.10   0.92   1.57   3.45
A3    0.11   0.43   0.05   0.10   0.94   1.10   2.73
A4    0.11   0.44   0.08   0.03   0.93   1.15   2.74
x·j   0.35   1.71   0.65   0.39   3.63   5.12   x·· = 11.85
> juice<-data.frame(x=c(0.05,0.46,0.12,0.16,0.84,1.30,0.08,0.38,0.40,0.10,0.92,1.57,0.11,0.43,0.05,0.10,0.94,1.10,0.11,0.44,0.08,0.03,0.93,1.15),A=gl(4,6),B=gl(6,1,24))
> # gl(n, k, length = n*k, labels = 1:n, ordered = FALSE): n is the number of levels, k is the number of consecutive repetitions of each level, length is the total number of observations, and ordered indicates whether the factor levels are ordered.
> juice.aov<-aov(x~A+B,data=juice)
> summary(juice.aov)
            Df Sum Sq Mean Sq F value Pr(>F)    
A            3   0.06   0.019    1.63   0.22    
B            5   4.90   0.980   83.98  2e-10 ***
Residuals   15   0.18   0.012                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> # The p values indicate that B has a significant effect on lead content, while there is not sufficient evidence that A does.
> bartlett.test(x~A,data=juice)

	Bartlett test of homogeneity of variances

data:  x by A
Bartlett's K-squared = 0.27, df = 3, p-value = 1

> bartlett.test(x~B,data=juice)

	Bartlett test of homogeneity of variances

data:  x by B
Bartlett's K-squared = 17, df = 5, p-value = 0.004
> #Factor B does not meet the requirement of homogeneity of variance.

7.2.2 Analysis of variance with interaction

The function aov() can still be used; the model formula is x ~ A + B + A:B, or equivalently x ~ A*B.

Example: in a drug toxicity experiment, 48 mice were given one of three poisons (factor A) and one of four treatment schemes (factor B). Each of the factor combinations was replicated four times, and the survival time of the mice was recorded, as shown in the table below. Do the poison, the treatment scheme, and their interaction have a significant effect on survival time?

Poison   A                     B                     C                     D
1        0.31 0.45 0.46 0.43   0.82 1.10 0.88 0.72   0.43 0.45 0.63 0.76   0.45 0.71 0.66 0.62
2        0.36 0.29 0.40 0.23   0.92 0.61 0.49 1.24   0.44 0.35 0.31 0.40   0.56 1.02 0.71 0.38
3        0.22 0.21 0.18 0.23   0.30 0.37 0.38 0.29   0.23 0.25 0.24 0.22   0.30 0.36 0.31 0.33
# Create the data frame
> rats<-data.frame(x=c(0.31,0.45,0.46,0.43, 0.82,1.10,0.88,0.72, 0.43,0.45,0.63,0.76, 0.45,0.71,0.66,0.62, 0.36,0.29,0.40,0.23, 0.92,0.61,0.49,1.24, 0.44,0.35,0.31,0.40, 0.56,1.02,0.71,0.38, 0.22,0.21,0.18,0.23, 0.30,0.37,0.38,0.29, 0.23,0.25,0.24,0.22, 0.30,0.36,0.31,0.33), A=gl(3,16,48,labels=c("1","2","3")), B=gl(4,4,48,labels=c("A","B","C","D")))
> par(mfrow=c(1,2))     # set a 1x2 plot layout
> plot(x~A+B,data=rats) # boxplots suggest clear differences across the levels of each factor


Use the function interaction.plot() to draw interaction plots and examine whether the two factors interact.

> with(rats,interaction.plot(A,B,x,trace.label = "B"))
> with(rats,interaction.plot(B,A,x,trace.label = "A"))


The curves in the two plots do not clearly cross, so we tentatively conclude that the two factors do not interact. This is tested further with aov().

> rats.aov<-aov(x~A*B,data = rats)
> summary(rats.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)    
A            2  1.033   0.517   23.22 3.3e-07 ***
B            3  0.921   0.307   13.81 3.8e-06 ***
A:B          6  0.250   0.042    1.87    0.11    
Residuals   36  0.801   0.022                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The poisons and the treatment schemes both have a significant effect on the survival time of the mice, but the interaction effect is not significant.
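
Since the interaction is not significant, one could refit the additive model (a sketch):

> summary(aov(x~A+B, data = rats))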
Finally, check whether the data under factors A and B meet the requirements of variance homogeneity:

> bartlett.test(rats$x,rats$A)

	Bartlett test of homogeneity of variances

data:  rats$x and rats$A
Bartlett's K-squared = 26, df = 2, p-value = 2e-06

> bartlett.test(rats$x,rats$B)

	Bartlett test of homogeneity of variances

data:  rats$x and rats$B
Bartlett's K-squared = 13, df = 3, p-value = 0.004

Both p values are less than 0.05, so the homogeneity-of-variance requirement is not met, which is consistent with the boxplots at the beginning.
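
A common remedy (not in the original notes) is to transform the response; for survival times the reciprocal 1/x, interpretable as a rate of dying, often stabilizes the variance:

> bartlett.test(1/x~A, data = rats)   # re-check homogeneity after the transform
> bartlett.test(1/x~B, data = rats)
> summary(aov(1/x~A*B, data = rats))  # ANOVA on the transformed response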

7.3 Analysis of covariance

Analysis of covariance (ANCOVA) combines linear regression with analysis of variance. The basic idea: treat variables that affect the response y but are unknown or uncontrollable as covariates; fit a linear regression of y on the covariate x; use this regression to adjust the groups to a common value of x; and then test for differences among the adjusted group means of y. In essence, the regression sum of squares of y on x is removed from the total sum of squares of y, the residual sum of squares is decomposed further, and analysis of variance is carried out on the result, which gives a better assessment of the treatment effect.

In other words, the influence of the uncontrollable factor is deducted or balanced out while comparing the means of two or more groups.

ancova(formula, data.in = NULL, ...,
       x, groups, transpose = FALSE,
       display.plot.command = FALSE,
       superpose.level.name = "superpose",
       ignore.groups = FALSE, ignore.groups.name = "ignore.groups",
       blocks, blocks.pch = letters[seq(levels(blocks))],
       layout, between, main,
       pch=trellis.par.get()$superpose.symbol$pch)

This ancova() comes from the HH package. formula is the covariance-analysis formula, data.in is the data frame, and x is the covariate. When plotting, if x does not appear in the formula it must be supplied explicitly, and groups must point to the factor.

Example: to study the fattening effect of three feeds A, B and C on pigs, 8 pigs were fed each feed for a period of time, and the initial weight X and the weight gain Y of each pig were recorded. Do the three feeds have the same fattening effect?

      A           B           C
No.   X1    Y1    X2    Y2    X3    Y3
1     15    85    17    97    22    89
2     13    83    16    90    24    91
3     11    65    18    100   20    83
4     12    76    18    95    23    95
5     12    80    21    103   25    100
6     16    91    22    106   27    102
7     14    84    19    99    30    105
8     17    90    18    94    32    110

The feed is a controllable qualitative factor. The initial weight of a pig is a quantitative factor that is hard to control, so it is taken as the covariate X; the observed index, the weight gain, is the response variable Y. Because weight gain is affected by initial weight, the group means cannot be compared by ANOVA directly; analysis of covariance is needed.

# Create the data set (the ancova() function below is from the HH package)
> library(HH)
> feed<-rep(c('A','B','C'),each=8)
> X<-c(15,13,11,12,12,16,14,17,17,16,18,18,21,22,19,18,22,24,20,23,25,27,30,32)
> Y<-c(85,83,65,76,80,91,84,90,97,90,100,95,103,106,99,94,89,91,83,95,100,102,105,110)
> pig<-data.frame(feed,X,Y)

1) Assume the initial weights of the pigs differ but the growth rate is the same for all feeds (a common slope):

> ancova(Y~X+feed,data=pig)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value  Pr(>F)    
X          1   1621    1621   142.4 1.5e-10 ***
feed       2    707     354    31.1 7.3e-07 ***
Residuals 20    228      11                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The output shows that both the initial weight X and the feed have significant effects on the weight gain of the pigs.
2) Assume both the initial weights and the growth rates differ under the three feeds (separate slopes):

> ancova(Y~X*feed,data=pig)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value  Pr(>F)    
X          1   1621    1621  162.49 1.9e-10 ***
feed       2    707     354   35.44 5.7e-07 ***
X:feed     2     48      24    2.41    0.12    
Residuals 18    180      10                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Because the X:feed interaction is not significant (p = 0.12), the slopes of weight gain on initial weight can be regarded as equal for the three feeds, so the common-slope model in 1) is appropriate. After adjusting for initial weight, the significant feed effect there indicates that the fattening effects of the three feeds are not all the same.
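
As a cross-check, the same analysis can be done with base R only (a sketch; HH's ancova() fits the same linear models and adds plots):

> fit1 <- lm(Y~X+feed, data = pig)  # common slope, different intercepts
> anova(fit1)                       # same table as ancova(Y ~ X + feed)
> fit2 <- lm(Y~X*feed, data = pig)  # separate slope for each feed
> anova(fit1, fit2)                 # F test for the slope-by-feed interaction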

Topics: R Language