Jobs for the Baseball dataset
1. Topic 1
Note: I said at_before the title when I was doing my homework Bas>=100 of 209 data, I filtered, found that later topics began to use this subset, let's not change, they have spent so much time.
(1) Experimental code
baseball=read.csv("datasets/baseball.txt",stringsAsFactors=TRUE,sep='') baseball=baseball[which(baseball$at_bats>=100),] plot(baseball$homeruns,baseball$bat_ave, xlab = "homeruns",ylab = "bat_ave")
(2) Experimental results
2. Question 2
(1) Experimental code
# Two Paints Scatter Chart baseball1=baseball[,c(-1,-2,-4)] pairs(baseball1) # Calculate correlation coefficient
(2) Principle analysis
(3) Experimental results
(4) Interpretation of results
The calculation of the clearance scatterplot and the linear correlation coefficient shows that:
game and at_bats,runs,hits,doubles,RBIs,walks, strikeouts have linear correlation
at_bats and runs,hits,doubles,RBIs have linear correlation
runs and hits,doubles,homeruns,RBIs have linear correlation
hits and doubles,RBIs have linear correlation
Linear correlation between doubles and RBIs
Horuns and RBIs have linear correlation
bat_ave and on_base_pct,slugging_pct has linear correlation
on_base_pct and slugging_pct has linear correlation
3. Question 3
(1) Experimental code
#Modeling lm1<-lm(baseball$bat_ave~baseball$homeruns) plot(bat_ave ~ homeruns, data = baseball, pch = 16, col = "black", ylab = "bat_ave") abline(lm1, col = "red")#Draw a model on a scatterplot # Note that this is the standardized residuals qqnorm(rstandard(lm1), datax = TRUE) qqline(rstandard(lm1), datax = TRUE) plot_ZP = function(ti) # Draw normal probability map { n = length(ti) order = rank(ti) #In ascending order, t(i) is the first order Pi = order/n #Cumulative probability plot(ti,Pi,xlab = "standard_residual",ylab = "Percentage") #Draw normal probability map #Add Regression Line fm = lm(Pi~ti) abline(fm) } plot_ZP(rstandard(lm1))
(2) Principle analysis
(3) Experimental results
(4) Interpretation of results
I think residuals can be considered normal within an acceptable range
4. Question 4
(1) Experimental code
plot(lm1$fitted.values, lm1$residuals,pch = 16, col = "red",main = " Residuals by Fitted Values", ylab = "Residuals", xlab = "Fitted Values") abline(0,0)
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
From the results, you can see that the zero mean hypothesis is met, but others can clearly see that the distribution of residuals changes with the value of fitted values.
5. Question 5
(1) Experimental code
baseball$log_homeruns <- log(baseball$homeruns+1e-5)#Logarithm lm2<-lm(baseball$bat_ave~baseball$log_homeruns) plot(bat_ave ~ log_homeruns, data = baseball, pch = 16, col = "black", ylab = "bat_ave") abline(lm2, col = "red")#Draw a model on a scatterplot # Note that this is the standardized residuals qqnorm(rstandard(lm2), datax = TRUE) qqline(rstandard(lm2), datax = TRUE) plot_ZP(rstandard(lm2))
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
Logarithms show that the distribution of standardized residuals is more normal.
6. Title 6
(1) Experimental code
plot(lm2$fitted.values, lm2$residuals,pch = 16, col = "red",main = " Residuals by Fitted Values", ylab = "Residuals", xlab = "Fitted Values") abline(0,0)
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
Without looking at the outliers, you can see that residual s satisfy the four assumptions in the book.
7. Topic 7
(1) Experimental code
plot(baseball$caught_stealing,baseball$stolen_bases, xlab = "caught_stealing",ylab = "stolen_bases")
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
It can be seen that there is some correlation.
8. Question 8
(1) Experimental code
nothing
(2) Principle analysis
nothing
(3) Experimental results
nothing
(4) Interpretation of results
Some transformations are needed, because you can see from the scatterplot that there is a correlation between the two, but it's not obvious, and it's harder to see that there is a linear relationship.
9. Question 9
(1) Experimental code
lm3<-lm(baseball$caught_stealing~baseball$stolen_bases) plot(caught_stealing ~ stolen_bases, data = baseball, pch = 16, col = "black", ylab = "caught_stealing") abline(lm3, col = "red")#Draw a model on a scatterplot
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
It does feel like there are some linear relationships.
10.Topic 10
(1) Experimental code
a1<-anova(lm3) r2.1 <- a1$"Sum Sq"[1] / (a1$"Sum Sq"[1] + a1$"Sum Sq"[2])
(2) Principle analysis
(3) Experimental results
(4) Interpretation of results
This r^2 value is not small, but it is not large enough.
11.11 Question 11
(1) Experimental code
nothing
(2) Principle analysis
nothing
(3) Experimental results
nothing
(4) Interpretation of results
Because only one observation variable is selected, it is easy to miss the explanatory variable.
Jobs for cereal datasets
1. Topic 1
(1) Experimental code
cereal <- read.csv("datasets/cereals.csv",stringsAsFactors=TRUE, header=TRUE) plot(cereal$Sodium, cereal$Rating,pch = 16, col = "red", ylab = "Rating", xlab = "Sodium") lm1<-lm(cereal$Rating~cereal$Sodium) standard_res=rstandard(lm1) which(abs(standard_res)>2)
(2) Principle analysis
An outlier is considered if the absolute value of the standardized residuals exceeds 2
(3) Experimental results
(4) Interpretation of results
The fourth data is outliers.
2. Question 2
(1) Experimental code
plot(Rating ~ Sodium, data = cereal, pch = 16, col = "black", ylab = "Rating") abline(lm1, col = "red")#Draw a model on a scatterplot
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
nothing
3. Question 3
(1) Experimental code
nothing
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
The slope and intercept do not differ much, indicating that an occasional outlier does not have much effect on linear regression
4. Question 4
(1) Experimental code
nothing
(2) Principle analysis
nothing
(3) Experimental results
(4) Interpretation of results
As shown in the red box in the figure above, outliers occur because the score is too high and the x-axis coordinates are still within the normal range, so the outlier causes the intercept value to change more than the slope.