1. Multiple regression fitting code
First, construct an arbitrary dataframe:
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.DataFrame(data=np.random.randint(0, 20, size=(100, 4)),
                  columns=['x1', 'x2', 'x3', 'y'])
```
The dataframe constructed is as follows:
| | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 12 | 13 | 1 | 10 |
| 1 | 11 | 10 | 9 | 8 |
| 2 | 12 | 18 | 18 | 2 |
| 3 | 18 | 11 | 18 | 12 |
| ... | ... | ... | ... | ... |
Then split the data into the independent variables and the dependent variable:
```python
X = df.loc[:, 'x1':'x3']  # independent variables
Y = df['y']               # dependent variable
```
Then fit the data using statsmodels:
```python
X = sm.add_constant(X)      # add a constant (intercept) column
model = sm.OLS(Y, X).fit()
```
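As an aside, the same model can also be fitted with statsmodels' formula interface, which adds the intercept automatically. This is just an equivalent alternative to the `sm.OLS` call above, not required for the rest of the example:

```python
import statsmodels.formula.api as smf

# equivalent fit via the formula API; the intercept is added automatically
model_f = smf.ols('y ~ x1 + x2 + x3', data=df).fit()
```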
The model is now fitted. Predictions are generated with:
```python
predictions = model.predict(X)
```
Here `predictions` holds the model's fitted values for the original training data; `X` can be replaced with new test data to predict on unseen samples.
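For example, a minimal sketch of predicting on new data (the values below are made up purely for illustration); the constant column has to be added to the test data in the same way as for training:

```python
# hypothetical new test data with the same columns x1, x2, x3
new_data = pd.DataFrame({'x1': [5, 7], 'x2': [3, 14], 'x3': [8, 1]})
new_X = sm.add_constant(new_data, has_constant='add')  # force-add the intercept column
new_predictions = model.predict(new_X)
print(new_predictions)
```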
2. Analysis results
- Print all results:
```python
print(model.summary())
```
The following table is obtained:
```
==============================================================================
Dep. Variable:                      y   R-squared:                       0.032
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.046
Date:                Fri, 11 Jun 2021   Prob (F-statistic):              0.376
Time:                        10:29:28   Log-Likelihood:                -316.20
No. Observations:                 100   AIC:                             640.4
Df Residuals:                      96   BIC:                             650.8
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.8099      1.793      3.241      0.002       2.252       9.368
x1             0.1445      0.094      1.533      0.128      -0.043       0.332
x2             0.0976      0.104      0.936      0.352      -0.109       0.305
x3             0.0169      0.102      0.165      0.869      -0.186       0.220
==============================================================================
Omnibus:                       27.225   Durbin-Watson:                   1.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.946
Skew:                           0.162   Prob(JB):                       0.0512
Kurtosis:                       1.850   Cond. No.                         55.4
==============================================================================
```
- Read the fitted coefficients: the `coef` column in the table above contains the fitted coefficients; they can also be printed directly with:
```python
print(model.params)
```
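For the summary shown above, this prints a pandas Series along the lines of (exact formatting may vary):

```
const    5.8099
x1       0.1445
x2       0.0976
x3       0.0169
dtype: float64
```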
The other fields in the summary are listed below; most of them can also be read programmatically from the results object, as shown in the sketch after this list:
- Dep. Variable: the dependent variable, i.e. the `y` column we passed in; statsmodels reports its name here.
- Model: the model used for fitting; here it is OLS (ordinary least squares).
- Date/Time: the date and time at which the model was fitted.
- No. Observations: sample size
- Df Residuals: degrees of freedom of the residuals, equal to No. Observations - Df Model - 1.
- Df Model: degrees of freedom of the model, equal to the number of regressors; here x1, x2 and x3 give 3.
- Covariance Type: the type of covariance estimator used for the coefficient standard errors; `nonrobust` is the default.
- R-squared: the coefficient of determination, computed as SSR/SST, where SSR is the regression (explained) sum of squares and SST is the total sum of squares.
- Adj. R-squared: R-squared adjusted for the number of regressors, penalizing unnecessary model complexity (in the spirit of Occam's razor: prefer the simpler model when the fit is comparable).
- F-statistic: the F-test statistic for the null hypothesis that all slope coefficients are zero, i.e. that the regressors have no linear effect on y. The larger the value, the stronger the evidence against this null.
- Prob (F-statistic): the p-value of the F-test; the smaller it is, the more strongly the null hypothesis can be rejected. In the table above it is 0.376, which is large, so the regression is not significant here; this is unsurprising, since the data were generated at random.
- Log-Likelihood: the log-likelihood of the fitted model.
- AIC: the Akaike information criterion, which measures goodness of fit while penalizing the number of parameters; models with smaller AIC are generally preferred.
- BIC: the Bayesian information criterion, similar to AIC but with a stronger penalty on the number of parameters.
- std err: the standard error of each coefficient estimate.
- t: the t-statistic; the larger its absolute value, the stronger the evidence against the null hypothesis that the coefficient is zero.
- P>|t|: the p-value of the t-test; the smaller it is, the more strongly the null hypothesis can be rejected.
- [0.025, 0.975]: the lower and upper bounds of the 95% confidence interval for each coefficient.
- Omnibus: a test of the normality of the residuals based on skewness and kurtosis; usually read together with the Jarque-Bera test.
- Prob(Omnibus): the p-value of the Omnibus test.
- Durbin-Watson: a test for autocorrelation in the residuals; it checks whether the correlation between adjacent error terms is zero.
- Skew: the skewness of the residuals.
- Kurtosis: the kurtosis of the residuals.
- Jarque-Bera (JB): another test of residual normality based on skewness and kurtosis; usually read together with the Omnibus test.
- Prob(JB): the p-value of the Jarque-Bera test.
- Cond. No.: the condition number, a diagnostic for multicollinearity, i.e. exact or strong linear relationships among the regressors; large values suggest multicollinearity.
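As referenced above, most of these quantities are also exposed as attributes of the fitted results object. A minimal sketch (attribute names are from the statsmodels regression results API; note that in statsmodels the attribute `ssr` is the residual sum of squares, not the regression sum of squares):

```python
# read individual statistics off the fitted results object
print(model.rsquared, model.rsquared_adj)   # R-squared and Adj. R-squared
print(model.fvalue, model.f_pvalue)         # F-statistic and Prob (F-statistic)
print(model.llf, model.aic, model.bic)      # Log-Likelihood, AIC, BIC
print(model.bse)                            # std err of each coefficient
print(model.tvalues)                        # t-statistics
print(model.pvalues)                        # P>|t|
print(model.conf_int(alpha=0.05))           # 95% confidence intervals

# R-squared recomputed from sums of squares: explained SS / total SS
# (model.ess is the explained sum of squares, model.centered_tss the total)
print(model.ess / model.centered_tss)
```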
3. Complete example
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.DataFrame(data=np.random.randint(0, 20, size=(100, 4)),
                  columns=['x1', 'x2', 'x3', 'y'])

# split x and y
X = df.loc[:, 'x1':'x3']
Y = df['y']

X = sm.add_constant(X)  # add a constant (intercept) column
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)
```
Reference material
- Related YouTube tutorial: https://www.youtube.com/watch?v=L_h7XFUGWAk
- statsmodels.regression.linear_model.OLS official documentation: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html
- Ordinary Least Squares official documentation: https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html