1 Multivariable Linear Regression Application Scenario
So far we have discussed the single-variable (single-feature) regression model. Now we add more features to the housing-price model, such as the number of rooms and the number of floors, to form a multivariable model.
1.1 Univariate Linear Regression Case
- Model: h_θ(x) = θ_0 + θ_1·x
1.2 Multivariate Linear Regression Case
- Model: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n
- New concepts:
  - n: the number of features
  - m: the number of training samples
  - x^(i): the feature vector (input) of the i-th training sample
  - x^(i)_j: the value of feature j in the i-th training sample

For example, with four features per house:

x^(1) = [40, 1, 1, 10]    x^(2) = [96, 2, 1, 5]    x^(3) = [135, 3, 2, 20]

and individual feature values are indexed as:

x^(1)_1 = 40    x^(1)_2 = 1    ...
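As a small sketch (not part of the original notes), the three example samples can be stacked into a design matrix in NumPy and indexed in the same way:

import numpy as np

# Stack the three example samples into a design matrix:
# each row is one sample x^(i), each column one feature
X = np.array([
    [40, 1, 1, 10],    # x^(1)
    [96, 2, 1, 5],     # x^(2)
    [135, 3, 2, 20],   # x^(3)
])

m, n = X.shape   # m = 3 samples, n = 4 features
print(m, n)
print(X[0, 1])   # x^(1)_2 = 1 (NumPy indexing is 0-based)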
2 Multivariate Gradient Descent
- Model: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n (with x_0 = 1, this is θᵀx)
- Parameters: θ_0, θ_1, ..., θ_n
- Loss function: J(θ_0, θ_1, ..., θ_n) = 1/(2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
- Gradient descent update (repeated until convergence): θ_j := θ_j − α · ∂J(θ)/∂θ_j, updating all θ_j (j = 0, 1, ..., n) simultaneously
2.1 Univariate gradient descent (n = 1)
Repeat until convergence:
θ_0 := θ_0 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
θ_1 := θ_1 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)
2.2 Multivariate gradient descent (n > 1)
Repeat until convergence, updating all j = 0, 1, ..., n simultaneously (with x^(i)_0 = 1):
θ_j := θ_j − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)_j
2.3 Multivariate Batch Gradient Descent Code
import numpy as np

# 1. Simulated data
X1 = 2 * np.random.randn(100, 1)
X2 = 4 * np.random.rand(100, 1)
X3 = 6 * np.random.rand(100, 1)
y = 4 + 3 * X1 + 4 * X2 + 5 * X3 + np.random.randn(100, 1)

# 2. Implement the batch gradient descent algorithm
# np.c_ concatenates the columns: (100,1) + (100,1) + (100,1) + (100,1) -> (100, 4)
X_b = np.c_[np.ones((100, 1)), X1, X2, X3]

# Initialize theta; four values need to be learned (bias + 3 feature weights)
theta = np.random.randn(4, 1)

# Set the learning rate and the number of iterations
learning_rate = 0.1
n_iterations = 1000

# Apply the gradient descent update repeatedly
for iteration in range(n_iterations):
    # Gradient = (1/m) * X^T · (X·theta - y)
    gradients = 1 / 100 * X_b.T.dot(X_b.dot(theta) - y)
    # theta = theta - learning_rate * gradient
    theta = theta - learning_rate * gradients

print(theta)
- Code execution results: the printed theta should come out close to [4, 3, 4, 5], the intercept and coefficients used to generate the simulated data.
3 Gradient Descent in Practice I: Feature Scaling
3.1 Problems encountered by gradient descent method
When we face a problem with multiple features, we need to ensure that those features are on similar scales; this helps the gradient descent algorithm converge faster. Feature scaling ensures that all features are on a similar order of magnitude.
Taking the housing-price problem as an example, suppose we use two features: x_1 = the area of the house (0-400 m²) and x_2 = the number of bedrooms (1-5). Using the two parameters as the horizontal and vertical axes, we can draw the contour map of the cost function. The contours appear very elongated, and the gradient descent algorithm needs many iterations to converge.
3.2 Solutions
- Solution 1: Try to scale all features into roughly the range −1 to 1, for example by dividing by the maximum value:
  x_1 = housing area / 400
  x_2 = number of bedrooms / 5
- Solution 2: Mean normalization
  Replace each feature x_i with (x_i − μ_i) / s_i, where μ_i is the mean of the feature and s_i is its maximum value; s_i can also be the standard deviation or the range (maximum − minimum).
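As a minimal sketch (not from the original notes), both scalings can be applied with a few lines of NumPy; the sample values below are illustrative and follow the housing example above:

import numpy as np

# Hypothetical raw features: area in m^2 (0-400) and number of bedrooms (1-5)
area = np.array([40.0, 96.0, 135.0, 320.0])
bedrooms = np.array([1.0, 2.0, 3.0, 5.0])

# Solution 1: divide by the maximum so each feature falls roughly in [0, 1]
area_scaled = area / 400
bedrooms_scaled = bedrooms / 5

# Solution 2: mean normalization, (x - mean) / (max - min)
area_norm = (area - area.mean()) / (area.max() - area.min())
bedrooms_norm = (bedrooms - bedrooms.mean()) / (bedrooms.max() - bedrooms.min())

print(area_scaled, bedrooms_scaled)
print(area_norm, bedrooms_norm)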
4 Gradient Descent in Practice II: Learning Rate
4.1 Problems encountered in gradient descent method
The number of iterations required for gradient descent to converge varies from model to model and cannot be predicted in advance. We can plot the cost function against the number of iterations to observe when the algorithm converges.
Each iteration of the gradient descent algorithm is affected by the learning rate:
- If the learning rate is too small, many iterations are needed to reach convergence.
- If the learning rate is too large, an iteration may overshoot the minimum, so the cost function may not decrease at every step and the algorithm may fail to converge.
4.2 Solutions
- Test for convergence automatically, for example by checking whether the decrease of the cost function in one iteration falls below a threshold (e.g. 0.001); in practice it is usually better to look at the plot of cost versus iterations.
- Try selecting α from values such as ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ... (each step roughly 3× the previous value).
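As a small sketch (not from the original notes), the candidate learning rates can be compared by running batch gradient descent on the same kind of simulated data as above and checking the final cost; the cost function and data generation here are assumptions for illustration:

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = 100

def cost(theta):
    # Mean squared error cost J(theta) = 1/(2m) * sum((X·theta - y)^2)
    errors = X_b.dot(theta) - y
    return float(1 / (2 * m) * errors.T.dot(errors))

for alpha in [0.001, 0.01, 0.1, 0.3]:
    theta = np.zeros((2, 1))
    costs = []
    for iteration in range(200):
        gradients = 1 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - alpha * gradients
        costs.append(cost(theta))
    # A too-small alpha leaves the cost high after 200 iterations;
    # a suitable alpha drives it close to the noise floor
    print(alpha, costs[-1])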
5 Gradient Descent Algorithms (Supplementary)
5.1 Summary of the Three Gradient Descent Variants
How to choose?
- Small training set (roughly fewer than 2000 samples): use batch gradient descent.
- Larger training set: use mini-batch gradient descent with a mini-batch size such as 64, 128, 256, 512, or 1024, chosen so the batch fits in CPU/GPU memory.
5.2 Stochastic Gradient Descent
The idea of stochastic gradient descent: treat the m samples as m individual parts and use a single sample for each gradient step. In other words, with m samples, batch gradient descent performs only one parameter update per pass over the data, whereas stochastic gradient descent performs m updates.
- Implementation code
import numpy as np
import random

X = 2 * np.random.rand(100, 1)
Y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# In each epoch, all m samples are processed one at a time
n_epochs = 1000
# initial learning rate
a0 = 0.1
# decay rate
decay_rate = 1

def learning_schedule(epoch_num):
    """Learning rate decay as a function of the epoch number."""
    return (1.0 / (decay_rate * epoch_num + 1)) * a0

# Initialize theta with random values
theta = np.random.randn(2, 1)

num = [i for i in range(100)]
m = 100

for epoch in range(n_epochs):
    # Shuffle the sample indices so each epoch visits them in a random order
    rand = random.sample(num, 100)
    for i in range(m):
        random_index = rand[i]
        xi = X_b[random_index:random_index + 1]
        yi = Y[random_index:random_index + 1]
        # Stochastic gradient: computed from a single sample
        gradients = xi.T.dot(xi.dot(theta) - yi)
        # Decayed learning rate
        learning_rate = learning_schedule(epoch + 1)
        theta = theta - learning_rate * gradients

print(theta)
- Execution results show:
5.3 Mini-batch Gradient Descent
Stochastic gradient descent gives up the speed-up that comes from vectorized computation over many samples, so it is not used very often; mini-batch gradient descent recovers most of that benefit.
- Implementation code
import numpy as np
import random

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

n_epochs = 500
# learning rate (kept constant here)
a = 0.03
m = 100
theta = np.random.randn(2, 1)

batch_num = 5
batch_size = m // 5

# One epoch = one pass over all m samples
for epoch in range(n_epochs):
    # Process the data one mini-batch at a time
    for i in range(batch_num):
        start = i * batch_size
        end = (i + 1) * batch_size
        xi = X_b[start:end]
        yi = y[start:end]
        # Mini-batch gradient: averaged over batch_size samples
        gradients = 1 / batch_size * xi.T.dot(xi.dot(theta) - yi)
        learning_rate = a
        theta = theta - learning_rate * gradients

print(theta)
- Execution results show:
5.4 Mini-batch Gradient Descent Optimization: Learning Rate Decay
With mini-batches, the gradients are noisy, so training may not converge exactly; instead the parameters oscillate around the minimum. To address this, we gradually reduce the learning rate so that the oscillation is confined to as small a region as possible.
1 epoch = 1 traversal of all data
- Learning rate decay formula: learning_rate = t0 / (t + t1), where t is the total number of parameter updates performed so far (this is the learning_schedule function in the code below).
- Implementation code
import numpy as np
import random

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

n_epochs = 500
t0, t1 = 5, 50
m = 100

def learning_schedule(t):
    # Learning rate decay: alpha = t0 / (t + t1)
    return float(t0) / (t + t1)

theta = np.random.randn(2, 1)

batch_num = 5
batch_size = m // 5

# One epoch = one pass over all m samples
for epoch in range(n_epochs):
    for i in range(batch_num):
        start = i * batch_size
        end = (i + 1) * batch_size
        xi = X_b[start:end]
        yi = y[start:end]
        gradients = 1 / batch_size * xi.T.dot(xi.dot(theta) - yi)
        # The learning rate shrinks as the total number of updates grows
        learning_rate = learning_schedule(epoch * m + i)
        theta = theta - learning_rate * gradients

print(theta)
- Execution results show:
6 Features and Polynomial Regression
6.1 Overfitting
Overfitting occurs when the model has too many parameters (θ) relative to the available training data. The value of the loss function on the training set may be close to zero, yet the model fails to generalize to new data.
6.2 Solutions to Overfitting
- Solution 1: Reduce the number of features (usually not the preferred option)
  1) Manually select which features to keep
  2) Use a model-selection algorithm
- Solution 2: Regularization
  Keep all the features, but reduce the magnitude of the parameters θ_j.
6.3 Feature Scaling
When predicting house prices, let's assume that we don't know the area of the house, but we know the length and width of the house.
- Model design: h_θ(x) = θ_0 + θ_1 · (house length) + θ_2 · (house width)
- Plot with the features unscaled
- Plot with the features scaled
Note: if we use a polynomial regression model, feature scaling is necessary before running the gradient descent algorithm, because the powers of a feature (x, x², x³, ...) can differ by several orders of magnitude.
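A minimal sketch (not from the original notes) of combining polynomial feature expansion with feature scaling; the use of sklearn's Pipeline with PolynomialFeatures, StandardScaler, and SGDRegressor is an assumed set-up rather than the method used later in these notes:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import SGDRegressor

# Simulated data: a quadratic relationship
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)

# Scale the polynomial features so that x and x^2 are on a similar scale
# before running (stochastic) gradient descent
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(max_iter=1000),
)
model.fit(X, y)
print(model.predict([[1.5]]))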
6.4 Regularization
- How do we make θ_3 and θ_4 effectively unwanted, i.e. keep them from influencing the model?
First, we can add penalty terms for θ_3 and θ_4 to the loss function (for example 1000·θ_3² + 1000·θ_4²); if the loss function is to be minimized, θ_3 and θ_4 are then forced to be as small as possible.
Generalizing this idea to all parameters gives regularization, with the formula:
J(θ) = 1/(2m) · [ Σ_{i=1..m} (h_θ(x^(i)) − y^(i))² + λ · Σ_{j=1..n} θ_j² ]
where λ is the regularization parameter that trades off fitting the data against keeping the parameters small.
6.5 The Difference Between L1 and L2 Regularization
- L1 regularization tends to shrink some feature weights exactly to zero, effectively dropping those features (sparse solutions).
- L2 regularization tends to keep all features, shrinking their weights without eliminating them.
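As a small sketch (not from the original notes) of this difference, fitting Lasso (L1) and Ridge (L2) on the same simulated data and comparing coefficients shows L1 driving the irrelevant ones to zero; the data and alpha value are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
# Five features, but only the first two actually influence y
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.randn(100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 (Lasso) typically drives the irrelevant coefficients exactly to zero,
# while L2 (Ridge) only shrinks them toward zero
print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)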
7 Regularization Algorithms and Code Implementation
7.1 Ridge regression
7.1.1 Algorithmic Understanding
7.1.2 Implementation Formula
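As a reference (not reproduced from the original notes; exact scaling conventions vary between libraries), the ridge cost adds an L2 penalty to the mean squared error:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{j=1}^{n} \theta_j^2$$

Here α is the regularization strength, corresponding to the alpha argument in the code below.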
7.1.3 Code Implementation
- Two methods of ridge regression:
""" //ridge regression //Method 1: Ridge regression uses L2 regularization """ import numpy as np from sklearn.linear_model import Ridge from sklearn.linear_model import SGDRegressor X = 2 * np.random.rand(100, 1) y = 4 + 3 * X + np.random.randn(100, 1) # Alpha is the alpha in the penalty term, solver is the method of processing data, auto is automatically selected according to data, svd is the analytical solution, sag is the random gradient descent. ridge_reg = Ridge(alpha=1, solver='auto') # learning process ridge_reg.fit(X, y) # Forecast print(ridge_reg.predict([[1.5], [2], [2.5]])) # Print intercept print(ridge_reg.intercept_) # Printing coefficient print(ridge_reg.coef_) """ //Method 2: Ridge regression and SGD & penalty = 2 are equivalent """ sgd_reg = SGDRegressor(penalty='l2') sgd_reg.fit(X, y.ravel()) print(sgd_reg.predict([[1.5], [2], [2.5]])) # Print intercept print("W0=", sgd_reg.intercept_) # Printing coefficient print("W1=", sgd_reg.coef_)
7.2 Lasso regression
7.2.1 Algorithmic Understanding
7.2.2 Implementation Formula
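As a reference (not from the original notes; scaling conventions vary), the Lasso cost adds an L1 penalty to the mean squared error:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{j=1}^{n} |\theta_j|$$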
7.2.3 Code Implementation
""" Lasso regression Lasso Used is l1 Regularization """ import numpy as np from sklearn.linear_model import Lasso from sklearn.linear_model import SGDRegressor X = 2 * np.random.rand(100, 1) y = 4 + 3 * X + np.random.randn(100, 1) lasso_reg = Lasso(alpha=0.15) lasso_reg.fit(X, y) print(lasso_reg.predict([[1.5]])) print(lasso_reg.coef_) sgd_reg = SGDRegressor(penalty='l1', n_iter=1000) sgd_reg.fit(X, y.ravel()) print(sgd_reg.predict([[1.5]])) print(sgd_reg.coef_)
7.3 Elastic Net Regression
7.3.1 Algorithmic Understanding
7.3.2 Implementation Formula
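As a reference (not from the original notes; scaling conventions vary), the elastic net cost mixes the L1 and L2 penalties with a ratio r, corresponding to l1_ratio in the code below:

$$J(\theta) = \mathrm{MSE}(\theta) + r\,\alpha \sum_{j=1}^{n} |\theta_j| + \frac{1-r}{2}\,\alpha \sum_{j=1}^{n} \theta_j^2$$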
7.3.3 Code Implementation
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor

X = 2 * np.random.rand(100, 1)
Y = 4 + 3 * X + np.random.randn(100, 1)

# alpha is the overall penalty strength, l1_ratio mixes the L1 and L2 penalties
elastic_reg = ElasticNet(alpha=0.15, l1_ratio=0.5)
elastic_reg.fit(X, Y)
print(elastic_reg.predict([[1.5]]))
print(elastic_reg.coef_)
print(elastic_reg.intercept_)

# SGDRegressor with penalty='elasticnet' is the SGD counterpart
elastic_reg = SGDRegressor(penalty='elasticnet')
elastic_reg.fit(X, Y.ravel())
print(elastic_reg.predict([[1.5]]))
print(elastic_reg.coef_)
8 Normal Equation and Gradient Descent Comparison
Gradient descent:
- Requires choosing a suitable learning rate α
- Requires many iterations
- Works well even when the number of features n is large

Normal equation:
- No need to choose a learning rate α
- No iterations required
- Needs to compute (XᵀX)⁻¹, the inverse of X transposed times X
- Slow when n is large, since the matrix inversion is expensive
Summary: as a rule of thumb, once the number of features reaches about 10,000, it is better to switch from the normal equation to gradient descent.
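As a minimal sketch (not from the original notes), the normal equation θ = (XᵀX)⁻¹·Xᵀ·y can be applied directly to the kind of simulated data used in the earlier examples and compared with the gradient descent results:

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# Normal equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
# theta_best should be close to [[4], [3]], the values used to generate the data
print(theta_best)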
8.1 Polynomial Regression Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# 1. Data preparation
# sample size
m = 100
X = 6 * np.random.rand(m, 1) - 3
Y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)

# 2-1. Expand the input into polynomial features so the quadratic model
#      becomes a multivariate linear regression
# degree: the polynomial degree used to expand the features
poly_features = PolynomialFeatures(degree=2, include_bias=False)
# fit_transform = fit() + transform(); here transform adds the x^2 column
X_poly = poly_features.fit_transform(X)

# 2-2. Fit an ordinary linear regression on the expanded features
line_reg = LinearRegression()
line_reg.fit(X_poly, Y)
print(line_reg.coef_)
print(line_reg.intercept_)
8.2 Plots for Different Polynomial Degrees
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# 1. Data preparation
# sample size
m = 100
X = 6 * np.random.rand(m, 1) - 3
# sort X so the fitted curves are drawn from left to right
X = np.sort(X, axis=0)
Y = 7 * X ** 2 + 5 * X + 2 + np.random.randn(m, 1)

# plt.plot(X, Y, 'b.')
# plt.show()

# Map each polynomial degree to a line/marker style
d = {1: 'g-', 2: 'r.', 10: 'y*'}
# d = {2: 'g-'}

for i in d:
    # 2-1. Expand the input into polynomial features of degree i
    poly_features = PolynomialFeatures(degree=i, include_bias=False)
    # fit_transform = fit() + transform()
    X_poly = poly_features.fit_transform(X)
    print(X_poly)

    # 2-2. Fit a linear regression on the expanded features
    line_reg = LinearRegression()
    line_reg.fit(X_poly, Y)
    print(line_reg.coef_)
    print(line_reg.intercept_)

    y_predict = line_reg.predict(X_poly)
    plt.plot(X_poly[:, 0], y_predict, d[i])

plt.show()