statsmodels learning -- fitting data using multiple regression

Posted by AliasXNeo on Sun, 30 Jan 2022 19:07:09 +0100

1. Multiple regression fitting code

First, construct an arbitrary dataframe:

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.DataFrame(data=np.random.randint(0, 20, size=(100, 4)),
                  columns=['x1', 'x2', 'x3', 'y'])

The dataframe constructed is as follows:

    x1  x2  x3   y
0   12  13   1  10
1   11  10   9   8
2   12  18  18   2
3   18  11  18  12
..  ..  ..  ..  ..

Then divide the independent variable and dependent variable:

X = df.loc[:, 'x1':'x3']
Y = df['y']

Then fit the data using statsmodels:

X = sm.add_constant(X)  # add an intercept column (statsmodels does not include one by default)
model = sm.OLS(Y, X).fit()

The model is now fitted. To generate predictions:

predictions = model.predict(X)

Here predictions contains the model's fitted values for the original training data; to predict on unseen data, pass a new design matrix (with the constant column added) to model.predict() instead.
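For example, a minimal sketch using made-up test observations (the values in new_X below are hypothetical), continuing from the code above:

# Hypothetical new observations with the same three features
new_X = pd.DataFrame({'x1': [5, 7], 'x2': [3, 14], 'x3': [8, 1]})
# has_constant='add' forces the intercept column even if a column happens to be constant
new_X = sm.add_constant(new_X, has_constant='add')
new_predictions = model.predict(new_X)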

2. Analysis results

  1. Print all results:
print(model.summary())

The following table is obtained:

==============================================================================
Dep. Variable:                      y   R-squared:                       0.032
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.046
Date:                Fri, 11 Jun 2021   Prob (F-statistic):              0.376
Time:                        10:29:28   Log-Likelihood:                -316.20
No. Observations:                 100   AIC:                             640.4
Df Residuals:                      96   BIC:                             650.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.8099      1.793      3.241      0.002       2.252       9.368
x1             0.1445      0.094      1.533      0.128      -0.043       0.332
x2             0.0976      0.104      0.936      0.352      -0.109       0.305
x3             0.0169      0.102      0.165      0.869      -0.186       0.220
==============================================================================
Omnibus:                       27.225   Durbin-Watson:                   1.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.946
Skew:                           0.162   Prob(JB):                       0.0512
Kurtosis:                       1.850   Cond. No.                         55.4
==============================================================================
  2. Interpreting the fitted results

The coef column contains the fitted coefficients, which can be printed with:

print(model.params)
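Since model.params is a pandas Series indexed by variable name, the fitted equation can be written out directly; a small sketch, continuing from the code above:

# model.params is indexed by 'const', 'x1', 'x2', 'x3'
b = model.params
print(f"y = {b['const']:.4f} + {b['x1']:.4f}*x1 + {b['x2']:.4f}*x2 + {b['x3']:.4f}*x3")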

Other variables:

  • Dep. Variable: the dependent variable, i.e. the y column we passed in; statsmodels uses y to label the model's response.
  • Model: the model used; here it is OLS (ordinary least squares).
  • Date/Time: the date/time when the model was fitted
  • No. Observations: sample size
  • Df Residuals: residual degrees of freedom, equal to No. Observations - Df Model - 1 (here 100 - 3 - 1 = 96)
  • Df Model: model degrees of freedom, equal to the number of regressors; here x1, x2 and x3 give three
  • Covariance Type: the covariance estimator used; nonrobust means no robustness correction is applied
  • R-squared: coefficient of determination, computed as SSR/SST, where SSR is the Sum of Squares for Regression and SST is the Sum of Squares for Total
  • Adj. R-squared: R-squared adjusted for the number of predictors, in the spirit of Occam's razor (the principle that, other things being equal, simpler models with fewer parameters are preferred)
  • F-statistic: the F-test of overall significance. The null hypothesis is that all regression coefficients (apart from the intercept) are zero, i.e. the model has no linear explanatory power; the larger the value, the stronger the evidence against this null.
  • Prob (F-statistic): the p-value of the F-statistic. The smaller it is, the stronger the evidence against the null hypothesis. In this example it is 0.376, which is large; since the data are random, the regression is, as expected, not significant
  • Log-Likelihood: the log-likelihood of the fitted model
  • AIC: Akaike information criterion, which measures goodness of fit while penalizing model complexity; models with smaller AIC are generally preferred
  • BIC: Bayesian information criterion
  • std err: standard error of coefficient estimation.
  • t: the t-statistic for each coefficient; the larger its absolute value, the stronger the evidence against the null hypothesis that the coefficient is zero.
  • P>|t|: the p-value of the t-test; the smaller it is, the stronger the evidence against that null hypothesis.
  • [0.025, 0.975]: the lower and upper bounds of the 95% confidence interval for each coefficient.
  • Omnibus: tests the normality of the residuals based on skewness and kurtosis; usually read together with the Jarque-Bera test
  • Prob(Omnibus): probability of Omnibus test.
  • Durbin-Watson: tests for autocorrelation in the residuals, mainly by checking whether the correlation between adjacent error terms is zero.
  • Skewness: skewness
  • Kurtosis: kurtosis
  • Jarque-Bera (JB): another test of residual normality based on skewness and kurtosis; usually read together with the Omnibus test
  • Prob(JB): probability of JB test.
  • Cond. No.: the condition number, a diagnostic for multicollinearity, i.e. exact or strong linear correlation between the independent variables
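All of these statistics are also exposed as attributes of the fitted results object, so they can be read programmatically instead of being parsed from the summary text. A short sketch, continuing from the code above:

# Key statistics as attributes of the results object
print(model.rsquared, model.rsquared_adj)  # R-squared, Adj. R-squared
print(model.fvalue, model.f_pvalue)        # F-statistic and Prob (F-statistic)
print(model.aic, model.bic)                # AIC, BIC
print(model.bse)                           # std err of each coefficient
print(model.tvalues, model.pvalues)        # t-statistics and P>|t|
print(model.conf_int())                    # 95% confidence intervals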

3. Complete case

import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.DataFrame(data=np.random.randint(0, 20, size=(100, 4)),
                  columns=['x1', 'x2', 'x3', 'y'])
# Split into independent variables (X) and dependent variable (Y)
X = df.loc[:, 'x1':'x3']
Y = df['y']

X = sm.add_constant(X)  # add an intercept column
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)
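Since the data are generated randomly, the numbers in the summary will differ on every run; seeding NumPy before constructing the dataframe makes the example reproducible, e.g.:

np.random.seed(0)  # fix the random data so the summary output is reproducible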
