# Big Data Fundamentals, Assignment 4: Linear Regression

Posted by jaylee on Sun, 07 Nov 2021 17:54:37 +0100

# Assignments for the Baseball dataset

## 1. Question 1

Note: before Question 1, I filtered the data down to the 209 rows with at_bats >= 100. It turned out that the later questions also use this subset, so I kept the filter rather than redoing the work.

### (1) Experimental code

```r
baseball = read.csv("datasets/baseball.txt", stringsAsFactors = TRUE, sep = '')
baseball = baseball[which(baseball$at_bats >= 100), ]
plot(baseball$homeruns, baseball$bat_ave,
     xlab = "homeruns", ylab = "bat_ave")
```

## 2. Question 2

### (1) Experimental code

```r
# Draw pairwise scatter plots
baseball1 = baseball[, c(-1, -2, -4)]
pairs(baseball1)
# Calculate the correlation coefficients
cor(baseball1)
```

### (4) Interpretation of results

The pairwise scatterplots and the linear correlation coefficients show that:

- game is linearly correlated with at_bats, runs, hits, doubles, RBIs, walks, and strikeouts
- at_bats is linearly correlated with runs, hits, doubles, and RBIs
- runs is linearly correlated with hits, doubles, homeruns, and RBIs
- hits is linearly correlated with doubles and RBIs
- doubles is linearly correlated with RBIs
- homeruns is linearly correlated with RBIs
- bat_ave is linearly correlated with on_base_pct and slugging_pct
- on_base_pct is linearly correlated with slugging_pct
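The pairs listed above can also be pulled out of the correlation matrix programmatically. A minimal sketch; the 0.5 cutoff is my own illustrative choice, not something the assignment specifies:

```r
# List variable pairs whose absolute correlation exceeds a chosen cutoff.
# The 0.5 threshold is an arbitrary illustration, not part of the assignment.
cm <- cor(baseball1)
cutoff <- 0.5
for (i in 1:(ncol(cm) - 1)) {
  for (j in (i + 1):ncol(cm)) {
    if (abs(cm[i, j]) > cutoff) {
      cat(colnames(cm)[i], "~", colnames(cm)[j], ":", round(cm[i, j], 2), "\n")
    }
  }
}
```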

## 3. Question 3

### (1) Experimental code

```r
# Fit the model
lm1 <- lm(baseball$bat_ave ~ baseball$homeruns)
plot(bat_ave ~ homeruns,
     data = baseball,
     pch = 16,
     col = "black",
     ylab = "bat_ave")
abline(lm1, col = "red")  # Draw the fitted line on the scatterplot

# Note: these are the standardized residuals
qqnorm(rstandard(lm1), datax = TRUE)
qqline(rstandard(lm1), datax = TRUE)

# Draw a normal probability plot
plot_ZP = function(ti)
{
  n = length(ti)
  r = rank(ti)    # In ascending order, t(i) has rank i
  Pi = r / n      # Cumulative probability
  plot(ti, Pi, xlab = "standard_residual", ylab = "Percentage")
  fm = lm(Pi ~ ti)
  abline(fm)
}
plot_ZP(rstandard(lm1))
```

### (4) Interpretation of results

Within an acceptable tolerance, I think the residuals can be considered approximately normal.
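Beyond the visual check, a formal normality test could back this up. A sketch (my own addition, not required by the assignment):

```r
# Shapiro-Wilk test on the standardized residuals of lm1:
# a p-value above 0.05 means normality cannot be rejected.
sw <- shapiro.test(rstandard(lm1))
print(sw$p.value)
```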

## 4. Question 4

### (1) Experimental code

```r
plot(lm1$fitted.values, lm1$residuals, pch = 16, col = "red",
     main = "Residuals by Fitted Values",
     ylab = "Residuals", xlab = "Fitted Values")
abline(0, 0)
```

### (4) Interpretation of results

The plot shows that the zero-mean assumption is satisfied, but the spread of the residuals clearly changes with the fitted values, so the constant-variance assumption does not hold.
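One simple way to quantify this changing spread (my own sketch, in the spirit of a Breusch-Pagan test, not part of the assignment) is to regress the squared residuals on the fitted values and check whether the slope is significant:

```r
# Breusch-Pagan-style check: regress squared residuals on fitted values.
# A significant slope suggests the residual variance depends on the fit.
aux <- lm(I(lm1$residuals^2) ~ lm1$fitted.values)
summary(aux)$coefficients
```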

## 5. Question 5

### (1) Experimental code

```r
# Take the logarithm; 1e-5 is added so that a zero home-run count does not give log(0)
baseball$log_homeruns <- log(baseball$homeruns + 1e-5)
lm2 <- lm(baseball$bat_ave ~ baseball$log_homeruns)
plot(bat_ave ~ log_homeruns,
     data = baseball,
     pch = 16,
     col = "black",
     ylab = "bat_ave")
abline(lm2, col = "red")  # Draw the fitted line on the scatterplot

# Note: these are the standardized residuals
qqnorm(rstandard(lm2), datax = TRUE)
qqline(rstandard(lm2), datax = TRUE)
plot_ZP(rstandard(lm2))
```

### (4) Interpretation of results

After the log transformation, the distribution of the standardized residuals is closer to normal.

## 6. Question 6

### (1) Experimental code

```r
plot(lm2$fitted.values, lm2$residuals, pch = 16, col = "red",
     main = "Residuals by Fitted Values",
     ylab = "Residuals", xlab = "Fitted Values")
abline(0, 0)
```

### (4) Interpretation of results

Setting the outliers aside, the residuals satisfy the four assumptions given in the book (zero mean, constant variance, independence, and normality).

## 7. Question 7

### (1) Experimental code

```r
plot(baseball$caught_stealing, baseball$stolen_bases,
     xlab = "caught_stealing", ylab = "stolen_bases")
```

### (4) Interpretation of results

The scatterplot shows some correlation between caught_stealing and stolen_bases.

## 8. Question 8


### (4) Interpretation of results

A transformation is needed: the scatterplot shows some correlation between the two variables, but it is not obvious, and a linear relationship is even harder to make out.
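For example, a square-root transform, a common choice for count data, could be tried. This is my own suggestion, not something the assignment specifies:

```r
# Square-root transform of both count variables (an illustrative choice,
# not prescribed by the assignment), then replot to look for linearity.
baseball$sqrt_stolen <- sqrt(baseball$stolen_bases)
baseball$sqrt_caught <- sqrt(baseball$caught_stealing)
plot(sqrt_caught ~ sqrt_stolen, data = baseball,
     xlab = "sqrt(stolen_bases)", ylab = "sqrt(caught_stealing)")
```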

## 9. Question 9

### (1) Experimental code

```r
lm3 <- lm(baseball$caught_stealing ~ baseball$stolen_bases)
plot(caught_stealing ~ stolen_bases,
     data = baseball,
     pch = 16,
     col = "black",
     ylab = "caught_stealing")
abline(lm3, col = "red")  # Draw the fitted line on the scatterplot
```

### (4) Interpretation of results

The fitted line does suggest some linear relationship.

## 10. Question 10

### (1) Experimental code

```r
a1 <- anova(lm3)
r2.1 <- a1$"Sum Sq"[1] / (a1$"Sum Sq"[1] + a1$"Sum Sq"[2])
```
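As a sanity check (my own addition), the value computed from the ANOVA table should equal the R-squared reported by `summary()`:

```r
# R^2 from the ANOVA decomposition is SSR / (SSR + SSE),
# which should match summary()'s r.squared.
stopifnot(all.equal(r2.1, summary(lm3)$r.squared))
```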

### (4) Interpretation of results

This R^2 value is not small, but it is not large enough: the model explains only part of the variation in caught_stealing.

## 11. Question 11


### (4) Interpretation of results

Because only one explanatory variable is used, the model can easily leave out other relevant explanatory variables.
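A multiple regression could bring in more candidate predictors. A sketch; the choice of predictors (and the column names, taken from the correlation list in Question 2) is my own illustration:

```r
# Multiple regression with several candidate predictors (an illustrative
# choice of variables; column names are assumed from the dataset description).
lm4 <- lm(caught_stealing ~ stolen_bases + runs + at_bats, data = baseball)
summary(lm4)
```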

# Assignments for the cereal dataset

## 1. Question 1

### (1) Experimental code

```r
cereal <- read.csv("datasets/cereals.csv", stringsAsFactors = TRUE, header = TRUE)
plot(cereal$Sodium, cereal$Rating, pch = 16, col = "red",
     ylab = "Rating", xlab = "Sodium")
lm1 <- lm(cereal$Rating ~ cereal$Sodium)
standard_res = rstandard(lm1)
which(abs(standard_res) > 2)
```

### (2) Principle analysis

A point is treated as an outlier if the absolute value of its standardized residual exceeds 2.

### (4) Interpretation of results

The fourth observation is an outlier.

## 2. Question 2

### (1) Experimental code

```r
plot(Rating ~ Sodium,
     data = cereal,
     pch = 16,
     col = "black",
     ylab = "Rating")
abline(lm1, col = "red")  # Draw the fitted line on the scatterplot
```


## 3. Question 3

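### (1) Experimental code

The code for this step is missing from the write-up. A minimal reconstruction sketch, assuming the task is to remove the outlier found in Question 1 and refit the model (the names `cereal2` and `lm2c` are my own):

```r
# Remove the outlier (observation 4, found in Question 1) and refit.
cereal2 <- cereal[-4, ]
lm2c <- lm(cereal2$Rating ~ cereal2$Sodium)
# Compare the coefficients of the original and refitted models
coef(lm1)
coef(lm2c)
```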

### (4) Interpretation of results

The slope and intercept do not differ much, indicating that an occasional outlier does not have a large effect on the linear regression.

## 4. Question 4


### (4) Interpretation of results

As shown in the red box in the figure above, the outlier arises because its Rating is unusually high while its Sodium value (the x-axis coordinate) is still within the normal range; as a result, the outlier changes the intercept more than the slope.

Topics: Big Data