R Language Self-Study Notes: Basics of Regression Analysis

Posted by beerman on Tue, 08 Mar 2022 14:48:27 +0100

I. Linear regression

1. The lm() function fits a linear model to the input variables, and the returned model object can be passed to many downstream functions.

> lm.model <- lm(wt ~ mpg, data = mtcars)


> coefficients(lm.model) # Extract the fitted coefficients
(Intercept)         mpg 
   6.047255   -0.140862 

> confint(lm.model, level=0.95) # 95% confidence intervals for the model coefficients
                 2.5 %     97.5 %
(Intercept)  5.4168245  6.6776856
mpg         -0.1709569 -0.1107671
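
As a sketch of what confint() is doing, the mpg interval can be reproduced by hand as estimate ± t-quantile × standard error:

est <- coef(lm.model)["mpg"]
se  <- summary(lm.model)$coefficients["mpg", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(lm.model)) * se  ## matches the confint() row above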


> fitted(lm.model) # Fitted (predicted) values of wt from the model
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive   Hornet Sportabout             Valiant          Duster 360 
           3.089154            3.089154            2.835602            3.032809            3.413136            3.497653            4.032929 
          Merc 240D            Merc 230            Merc 280           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC 
           2.610223            2.835602            3.342705            3.539912            3.737119            3.610343            3.906153 
 Cadillac Fleetwood Lincoln Continental   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla       Toyota Corona 
           4.582291            4.582291            3.976584            1.483327            1.765051            1.272034            3.018723 
   Dodge Challenger         AMC Javelin          Camaro Z28    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
           3.863894            3.906153            4.173791            3.342705            2.201723            2.384844            1.765051 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
           3.821636            3.272274            3.934325            3.032809 



> residuals(lm.model) # Residuals: observed wt minus fitted wt
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive   Hornet Sportabout             Valiant          Duster 360 
        -0.46915365         -0.21415365         -0.51560210          0.18219114          0.02686382         -0.03765336         -0.46292884 
          Merc 240D            Merc 230            Merc 280           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC 
         0.57977705          0.31439790          0.09729481         -0.09991195          0.33288129          0.11965707         -0.12615307 
 Cadillac Fleetwood Lincoln Continental   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla       Toyota Corona 
         0.66770947          0.84170947          1.36841594          0.71667281         -0.15005113          0.56296577         -0.55372266 
   Dodge Challenger         AMC Javelin          Camaro Z28    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
        -0.34389448         -0.47115307         -0.33379081          0.50229481         -0.26672324         -0.24484380         -0.25205113 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
        -0.65163589         -0.50227421         -0.36432547         -0.25280886 
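
By construction, fitted values plus residuals reproduce the observed response; a quick sanity check:

all.equal(unname(fitted(lm.model) + residuals(lm.model)), mtcars$wt)  ## TRUE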



> anova(lm.model) # ANOVA table
Analysis of Variance Table

Response: wt
          Df  Sum Sq Mean Sq F value    Pr(>F)    
mpg        1 22.3431 22.3431  91.375 1.294e-10 ***
Residuals 30  7.3356  0.2445                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
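
The F value is just the ratio of the two mean squares, which you can check by hand (the small difference from 91.375 comes from rounding in the printed table):

22.3431 / 0.2445  ## ≈ 91.38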



> vcov(lm.model) # Covariance matrix of the coefficient estimates
             (Intercept)           mpg
(Intercept)  0.095289963 -0.0043626666
mpg         -0.004362667  0.0002171494
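
The standard errors reported by summary() are the square roots of this matrix's diagonal:

sqrt(diag(vcov(lm.model)))  ## standard errors of (Intercept) and mpg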



> influence(lm.model) # Regression diagnostics: leverage ($hat), leave-one-out changes in coefficients and sigma, and weighted residuals
$hat
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive   Hornet Sportabout             Valiant          Duster 360 
         0.03198439          0.03198439          0.03776901          0.03277255          0.03296737          0.03476902          0.06102792 
          Merc 240D            Merc 230            Merc 280           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC 
         0.04774195          0.03776901          0.03195442          0.03590963          0.04334604          0.03816586          0.05249086 
 Cadillac Fleetwood Lincoln Continental   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla       Toyota Corona 
         0.11464634          0.11464634          0.05705606          0.16580983          0.12563611          0.20060244          0.03301399 
   Dodge Challenger         AMC Javelin          Camaro Z28    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
         0.04996488          0.05249086          0.07220085          0.03195442          0.07740711          0.06226177          0.12563611 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
         0.04759875          0.03138551          0.05426366          0.03277255 

$coefficients
                     (Intercept)           mpg
Mazda RX4           -0.007282030 -3.913985e-04
Mazda RX4 Wag       -0.003324014 -1.786609e-04
Datsun 710           0.009157470 -1.289282e-03
Hornet 4 Drive       0.001485910  2.190313e-04
Hornet Sportabout    0.001557359 -3.430679e-05
Valiant             -0.002604528  6.896128e-05
Duster 360          -0.066342655  2.535306e-03
Merc 240D           -0.027785644  2.330044e-03
Merc 230            -0.005583936  7.861634e-04
Merc 280             0.004737902 -7.949360e-05
Merc 280C           -0.007473905  2.108129e-04
Merc 450SE           0.033786317 -1.140454e-03
Merc 450SL           0.010081735 -3.083068e-04
Merc 450SLC         -0.015778260  5.782587e-04
Cadillac Fleetwood   0.153962430 -6.490318e-03
Lincoln Continental  0.194083866 -8.181645e-03
Chrysler Imperial    0.184925710 -6.947280e-03
Fiat 128            -0.161833613  9.391507e-03
Honda Civic          0.026202892 -1.571169e-03
Toyota Corolla      -0.151504789  8.636477e-03
Toyota Corona       -0.003495500 -7.167077e-04
Dodge Challenger    -0.040959820  1.475709e-03
AMC Javelin         -0.058928219  2.159665e-03
Camaro Z28          -0.054830702  2.169570e-03
Pontiac Firebird     0.024459924 -4.103942e-04
Fiat X1-9            0.028152063 -1.850938e-03
Porsche 914-2        0.019369325 -1.370227e-03
Lotus Europa         0.044014788 -2.639199e-03
Ford Pantera L      -0.073758568  2.607048e-03
Ferrari Dino        -0.019818649  1.798843e-04
Maserati Bora       -0.047027088  1.741542e-03
Volvo 142E          -0.002061852 -3.039283e-04

$sigma
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive   Hornet Sportabout             Valiant          Duster 360 
          0.4950874           0.5013168           0.4933815           0.5017657           0.5029179           0.5028932           0.4950577 
          Merc 240D            Merc 230            Merc 280           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC 
          0.4906934           0.4994096           0.5026082           0.5025884           0.4989569           0.5024330           0.5023674 
 Cadillac Fleetwood Lincoln Continental   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla       Toyota Corona 
          0.4853738           0.4747194           0.4295043           0.4813738           0.5020600           0.4891637           0.4919538 
   Dodge Challenger         AMC Javelin          Camaro Z28    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
          0.4986579           0.4948469           0.4988098           0.4939281           0.5002931           0.5007472           0.5004465 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
          0.4874197           0.4939342           0.4981090           0.5006732 

$wt.res
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive   Hornet Sportabout             Valiant          Duster 360 
        -0.46915365         -0.21415365         -0.51560210          0.18219114          0.02686382         -0.03765336         -0.46292884 
          Merc 240D            Merc 230            Merc 280           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC 
         0.57977705          0.31439790          0.09729481         -0.09991195          0.33288129          0.11965707         -0.12615307 
 Cadillac Fleetwood Lincoln Continental   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla       Toyota Corona 
         0.66770947          0.84170947          1.36841594          0.71667281         -0.15005113          0.56296577         -0.55372266 
   Dodge Challenger         AMC Javelin          Camaro Z28    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
        -0.34389448         -0.47115307         -0.33379081          0.50229481         -0.26672324         -0.24484380         -0.25205113 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
        -0.65163589         -0.50227421         -0.36432547         -0.25280886
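
A small sketch of how the $hat component is typically used: the hat values measure leverage, so the largest one flags the point with the most influence on its own fitted value:

h <- influence(lm.model)$hat
h[which.max(h)]  ## Toyota Corolla, the highest-leverage observation in this fit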

2. Evaluation of fitting quality

(1) Obtain predicted values from the model

predict(model, newdata)  ## newdata: the data set used to fit the model, or new data

(2) Use the R2 function in the caret package

R2(pred, obs)  ## pred: the predict() results; obs: the y values from the original data set
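
A minimal sketch putting both steps together for lm.model from Part I:

library(caret)

pred <- predict(lm.model, mtcars)  ## predictions on the training data
R2(pred, mtcars$wt)                ## caret's R2: by default the squared correlation of predicted vs. observed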

II. Multiple linear regression

1. lm() and interaction analysis

(1) Put multiple variables into the formula directly, joined with plus signs:

mtcars.m <- lm(wt ~ mpg + disp + mpg:disp, data = mtcars)

## Model the dependent variable wt on the independent variables mpg and disp;
## mpg:disp adds their interaction term, i.e. the two predictors are not assumed to act independently of each other

summary(mtcars.m)

Result:

> summary(mtcars.m)

Call:
lm(formula = wt ~ mpg + disp + mpg:disp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8047 -0.2636  0.0539  0.2340  0.8817 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.3716803  0.7066020   4.772 5.18e-05 ***
mpg         -0.0505361  0.0257971  -1.959  0.06014 .  
disp         0.0065258  0.0020525   3.179  0.00359 ** 
mpg:disp    -0.0001603  0.0001232  -1.301  0.20383    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4055 on 28 degrees of freedom
Multiple R-squared:  0.8449,	Adjusted R-squared:  0.8283 
F-statistic: 50.84 on 3 and 28 DF,  p-value: 1.871e-11

(2) Relative importance of the independent variables

Quantify the "importance" of each variable to the model:

library(relaimpo);
calc.relimp( wt~mpg+disp+mpg:disp, data = mtcars )

 

> calc.relimp( wt~mpg+disp+mpg:disp, data = mtcars )
Response variable: wt 
Total response variance: 0.957379 
Analysis based on 32 observations 

3 Regressors: 
mpg disp mpg:disp 
Proportion of variance explained by model: 84.49%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics:  ## the lmg weights; note that they sum to the model R² (0.8449)

                 lmg
mpg      0.399914296
disp     0.435589844
mpg:disp 0.009378141

Average coefficients for different model sizes: 

                   1X          2Xs           3Xs
mpg      -0.140861970 -0.066315339 -0.0505361268
disp      0.007010325  0.004277131  0.0065257929
mpg:disp          NaN          NaN -0.0001603097

(3) Visualize how independent variables affect each other

The interaction is mainly reflected in the change of slope:

library(interactions)
sim_slopes(mtcars.m, pred = mpg, modx = disp, jnplot = TRUE)
## pred: the focal predictor; modx: the moderator whose value changes the slope of pred

Results:

JOHNSON-NEYMAN INTERVAL 

When disp is INSIDE the interval [8.39, 747.98], the slope of mpg is p < .05.

Note: The range of observed values of disp is [71.10, 472.00]

SIMPLE SLOPES ANALYSIS 

## With this arbitrarily chosen data, the simple-slopes estimates look rather extreme

2. glm()

When the glm() parameter family = gaussian, the fit is identical to lm(). A minimal sketch confirming this with the model from Part I:
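
gm <- glm(wt ~ mpg, data = mtcars, family = gaussian)
coef(gm)  ## same estimates as coefficients(lm.model)

But glm() also covers other regression models, for example logistic regression for binary classification: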

## Keep only two iris species so the response is binary; the predicted class can then
## be read off the numerical results
library(dplyr)

dat <- iris %>% filter( Species %in% c("setosa", "virginica") )
dat$Species <- as.factor(dat$Species)  ## the response column must be a factor, otherwise glm() may throw an error

bm <- glm( Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = dat, family = binomial )  ## binomial family = logistic regression

iris.predict1 <- predict(bm, dat, type = "response")  ## probabilities
iris.predict2 <- predict(bm, dat)  ## default type: values on the link (log-odds) scale

Results: with type = "response" the output is a probability between 0 and 1, which directly gives the binary classification; the default type returns values on the link scale, i.e. log-odds.

> iris.predict1
           1            2            3            4            5            6            7            8            9           10           11           12 
2.220446e-16 4.732528e-13 1.369834e-13 3.388726e-12 2.220446e-16 3.892826e-13 7.748530e-13 1.988060e-13 5.204109e-12 3.900993e-13 2.220446e-16 1.453927e-12 
          13           14           15           16           17           18           19           20           21           22           23           24 
2.656364e-13 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.472442e-13 2.682276e-13 2.220446e-16 4.132652e-11 
          25           26           27           28           29           30           31           32           33           34           35           36 
5.282004e-11 3.493413e-12 4.614812e-12 2.220446e-16 2.220446e-16 4.976501e-12 5.082694e-12 2.857677e-13 2.220446e-16 2.220446e-16 1.032749e-12 2.220446e-16 
          37           38           39           40           41           42           43           44           45           46           47           48 
2.220446e-16 2.220446e-16 1.035326e-12 1.337875e-13 2.220446e-16 3.421128e-11 4.494793e-13 2.131127e-11 2.126479e-11 1.861773e-12 2.220446e-16 6.741667e-13 
          49           50           51           52           53           54           55           56           57           58           59           60 
2.220446e-16 2.220446e-16 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
          61           62           63           64           65           66           67           68           69           70           71           72 
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
          73           74           75           76           77           78           79           80           81           82           83           84 
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
          85           86           87           88           89           90           91           92           93           94           95           96 
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
          97           98           99          100 
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
> iris.predict2
        1         2         3         4         5         6         7         8         9        10        11        12        13        14        15 
-31.25726 -28.37915 -29.61892 -26.41057 -31.27837 -28.57447 -27.88610 -29.24645 -25.98157 -28.57237 -32.08233 -27.25675 -28.95665 -30.56889 -38.51083 
       16        17        18        19        20        21        22        23        24        25        26        27        28        29        30 
-34.24374 -33.36463 -30.28368 -30.31909 -30.33771 -28.43567 -28.94694 -34.48423 -23.90952 -23.66413 -26.38014 -26.10175 -30.45579 -31.23614 -26.02629 
       31        32        33        34        35        36        37        38        39        40        41        42        43        44        45 
-26.00518 -28.88360 -33.93252 -35.76190 -27.59880 -32.00469 -34.03910 -31.85587 -27.59630 -29.64252 -31.08514 -24.09847 -28.43069 -24.57178 -24.57397 
       46        47        48        49        50        51        52        53        54        55        56        57        58        59        60 
-27.00949 -30.11375 -28.02530 -31.68625 -30.02680  42.30338  30.16757  35.29448  32.36693  37.44698  41.69689  25.43424  36.78895  34.84647  38.68465 
       61        62        63        64        65        66        67        68        69        70        71        72        73        74        75 
 26.28266  30.18619  31.69255  31.17407  34.61827  31.99455  29.96005  40.13440  48.50935  26.36952  34.80433  29.12103  42.35915  24.81853  33.23214 
       76        77        78        79        80        81        82        83        84        85        86        87        88        89        90 
 32.34083  23.59987  24.35911  35.30878  28.83298  35.38856  33.80247  36.28236  23.87569  30.51635  37.26026  36.12244  29.93893  23.55764  29.68174 
       91        92        93        94        95        96        97        98        99       100 
 35.78971  28.03628  30.16757  37.59549  37.12645  30.44316  27.82403  28.31458  33.14986  27.54634 
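
A small follow-up sketch: the probabilities above are P(Species == "virginica") here, so thresholding them at 0.5 yields class labels:

pred.class <- ifelse(iris.predict1 > 0.5, "virginica", "setosa")
table(predicted = pred.class, observed = dat$Species)  ## confusion table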

III. Multiple nonlinear regression

1. Machine learning -- regression analysis

(1) Analyze the data with the earth and caret packages

For example, run a regression of Education on all the other columns in swiss:

library(earth)
nls.model <- earth(Education ~ ., data = swiss)
nls.model
plot(nls.model)

(2) Tune the model parameters (training)

Taking n-fold cross-validation as an example, use the train() function from the caret package:

train(x, y, method = "rf", preProcess = NULL, ...,
      weights = NULL, metric = ifelse(is.factor(y), "Accuracy", "RMSE"),
      maximize = ifelse(metric %in% c("RMSE", "logLoss", "MAE"), FALSE, TRUE),
      trControl = trainControl(), tuneGrid = NULL,
      tuneLength = ifelse(trControl$method == "none", 1, 3))

Where:

x: the independent variables used for fitting

y: the dependent variable used for fitting

weights: optional case weights for the model; usually unnecessary

trControl: the object returned by trainControl(), which specifies the resampling/validation scheme

tuneGrid: a data frame of candidate tuning-parameter values; different methods have different tuning parameters, so the required column names of the grid differ (train() matches the grid's column names to the method's parameters)

For example, fit a model for Fertility in swiss, using all the other columns as independent variables:

grid <- expand.grid(
  degree = 1:3,
  nprune = seq(2, 100, length.out = 8)
)
## Tuning grid for method = "earth": degree is the maximum interaction degree,
## nprune the maximum number of model terms kept after pruning
## expand.grid() generates one row for every combination of the supplied values


set.seed(12222)  ## fix the random seed so the resampling, and hence the model, is reproducible

swiss.model1 <- train(
  x = subset(swiss, select = -Fertility),
  y = swiss$Fertility,
  method = "earth",
  metric = "RMSE",
  trControl = trainControl(method = "cv", number = 10),  ## 10-fold cross-validation
  tuneGrid = grid
)

Result:

> grid
   degree nprune
1       1      2
2       2      2
3       3      2
4       1     16
5       2     16
6       3     16
7       1     30
8       2     30
9       3     30
10      1     44
11      2     44
12      3     44
13      1     58
14      2     58
15      3     58
16      1     72
17      2     72
18      3     72
19      1     86
20      2     86
21      3     86
22      1    100
23      2    100
24      3    100
> swiss.model1
Multivariate Adaptive Regression Spline 

47 samples
 5 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 42, 42, 42, 42, 42, 44, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE       Rsquared   MAE     
  1         2     11.278358  0.3284905  9.209832
  1        16      9.541177  0.5068536  7.839405
  1        30      9.541177  0.5068536  7.839405
  1        44      9.541177  0.5068536  7.839405
  1        58      9.541177  0.5068536  7.839405
  1        72      9.541177  0.5068536  7.839405
  1        86      9.541177  0.5068536  7.839405
  1       100      9.541177  0.5068536  7.839405
  2         2     11.294113  0.3270052  9.216748
  2        16     10.717470  0.4382987  8.802805
  2        30     10.717470  0.4382987  8.802805
  2        44     10.717470  0.4382987  8.802805
  2        58     10.717470  0.4382987  8.802805
  2        72     10.717470  0.4382987  8.802805
  2        86     10.717470  0.4382987  8.802805
  2       100     10.717470  0.4382987  8.802805
  3         2     11.294113  0.3270052  9.216748
  3        16     10.798077  0.4366048  8.891187
  3        30     10.798077  0.4366048  8.891187
  3        44     10.798077  0.4366048  8.891187
  3        58     10.798077  0.4366048  8.891187
  3        72     10.798077  0.4366048  8.891187
  3        86     10.798077  0.4366048  8.891187
  3       100     10.798077  0.4366048  8.891187

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 16 and degree = 1.

(3) Use vip to display the importance of the independent variables in the final model

library(vip);
p1 <- vip(swiss.model1, num_features = 4, geom = "point", value = "gcv") 

Here num_features controls how many variables are displayed (the most influential take priority). value selects the importance metric: "gcv" stands for generalized cross-validation, and "rss" instead measures importance by the residual sum of squares. vip() returns a ggplot object, so p1 can be printed or styled further.

2. Logistic regression

> library(drc)
> m <- drm(cyl ~ wt, fct = L.3(), data = mtcars)
> plot(m, log="", main = "Logistic function")

## Just an illustration; these natural data do not really follow a logistic curve

3. Nonlinear least squares

(1) Use the nls() function

nls(formula, data, start, control, algorithm,
    trace, subset, weights, na.action, model,
    lower, upper, ...)

formula: the functional form of the model must be known in advance; nls() only fits its parameters

start: a named list giving initial values for the parameters, needed when the model function cannot self-start

control: options that control the fitting process (see nls.control())

set.seed(43247)
x <- seq(0, 100, 1)
y <- rnorm(1, 0, 1) * x^2 + runif(1, 0, 1) * x^3 + rnorm(101, 0, 1)
m <- nls(y ~ a * x^2 + b * x^3,
         control = nls.control(maxiter = 100000),  ## cap the maximum number of iterations
         start = list(a = 1, b = 1))

Result:

> m
Nonlinear regression model
  model: y ~ (a * x^2 + b * x^3)
   data: parent.frame()
      a       b 
-0.7908  0.3721 
 residual sum-of-squares: 87.8

Number of iterations to convergence: 1 
Achieved convergence tolerance: 5.332e-09
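
A minimal sketch to visualize the fit, reusing x, y and m from above:

plot(x, y, main = "nls fit")       ## observed data
lines(x, predict(m), col = "red")  ## fitted curve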

(2) Automatic fitting with the drm() function

There is no need to know the functional form in advance: just choose one of the built-in models, and drm() self-starts the fit adaptively.

library(drc)
library(ggplot2)
m <- drm( y ~ x, fct = LL.3() )  ## built-in three-parameter log-logistic model
dat <- data.frame(x = x, y = y, nls = predict(m))  ## predict(m) with no new data returns the fitted values
ggplot( dat, aes(x = x, y = y) ) +
  geom_point() + geom_line( aes(x = x, y = nls), colour = "red")

'''
## fct: a list with three or more elements specifying the non-linear function, the accompanying self-starter function, the names of the parameters in the non-linear function and, optionally, the first and second derivatives as well as information used for calculation of ED values. Currently available functions include, among others, the four- and five-parameter log-logistic models LL.4, LL.5 and the Weibull model W1.4. Use getMeanFunctions for a full list.

LL.3 is the three-parameter log-logistic model with the lower limit fixed at 0:

LL.3(fixed = c(NA, NA, NA), names = c("b", "d", "e"), ...)

f(x) = 0 + \frac{d - 0}{1 + \exp(b(\log(x) - \log(e)))}
'''
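
As the quoted documentation notes, the full list of built-in mean functions can be printed:

library(drc)
getMeanFunctions()  ## lists all available built-in models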

 

Topics: R Language