1 Multivariable Linear Regression Application Scenario
So far we have discussed the single-variable (single-feature) regression model. Now we add more features to the housing-price model, such as the number of rooms and the number of floors, to form a multivariable model.
1.1 Univariate Linear Regression Case
- Model: h_θ(x) = θ_0 + θ_1·x
1.2 Multivariate Linear Regression Case
- Model: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n
- New concepts:
  - n: the number of features
  - m: the number of training samples
  - x^(i): the feature vector (input) of the i-th training sample
  - x^(i)_j: the value of feature j in the i-th training sample

For example, with four features per house:

x^(1) = [40, 1, 1, 10]    x^(2) = [96, 2, 1, 5]    x^(3) = [135, 3, 2, 20]

and individual feature values are indexed as:

x^(1)_1 = 40    x^(1)_2 = 1    ...
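As a small sketch (not part of the original notes), the three example samples can be stacked into a design matrix in NumPy and indexed in the same way:

import numpy as np

# Stack the three example samples into a design matrix:
# each row is one sample x^(i), each column one feature
X = np.array([
    [40, 1, 1, 10],    # x^(1)
    [96, 2, 1, 5],     # x^(2)
    [135, 3, 2, 20],   # x^(3)
])

m, n = X.shape   # m = 3 samples, n = 4 features
print(m, n)
print(X[0, 1])   # x^(1)_2 = 1 (NumPy indexing is 0-based)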
2 Multivariate Gradient Descent
- Model: h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + ... + θ_n·x_n (with x_0 = 1, this is θᵀx)
- Parameters: θ_0, θ_1, ..., θ_n
- Loss function: J(θ_0, θ_1, ..., θ_n) = 1/(2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²
- Gradient descent update (repeated until convergence): θ_j := θ_j − α · ∂J(θ)/∂θ_j, updating all θ_j (j = 0, 1, ..., n) simultaneously
2.1 Univariate gradient descent (n = 1)
Repeat until convergence:
θ_0 := θ_0 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
θ_1 := θ_1 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)
2.2 Multivariate gradient descent (n > 1)
Repeat until convergence, updating all j = 0, 1, ..., n simultaneously (with x^(i)_0 = 1):
θ_j := θ_j − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)_j
2.3 Multivariate Batch Gradient Descent Code
import numpy as np

# 1. Simulated data
X1 = 2 * np.random.randn(100, 1)
X2 = 4 * np.random.rand(100, 1)
X3 = 6 * np.random.rand(100, 1)
y = 4 + 3 * X1 + 4 * X2 + 5 * X3 + np.random.randn(100, 1)

# 2. Implement the batch gradient descent algorithm
# np.c_ concatenates the columns: (100,1) + (100,1) + (100,1) + (100,1) -> (100, 4)
X_b = np.c_[np.ones((100, 1)), X1, X2, X3]

# Initialize theta; four values need to be learned (bias + 3 feature weights)
theta = np.random.randn(4, 1)

# Set the learning rate and the number of iterations
learning_rate = 0.1
n_iterations = 1000

# Apply the gradient descent update repeatedly
for iteration in range(n_iterations):
    # Gradient = (1/m) * X^T · (X·theta - y)
    gradients = 1 / 100 * X_b.T.dot(X_b.dot(theta) - y)
    # theta = theta - learning_rate * gradient
    theta = theta - learning_rate * gradients

print(theta)
- Code execution results: the printed theta should come out close to [4, 3, 4, 5], the intercept and coefficients used to generate the simulated data.
3 Gradient Descent in Practice I: Feature Scaling
3.1 Problems encountered by gradient descent method
When we face a problem with multiple features, we need to ensure that those features are on similar scales; this helps the gradient descent algorithm converge faster. Feature scaling ensures that all features are on a similar order of magnitude.
Taking the housing-price problem as an example, suppose we use two features: x_1 = the area of the house (0-400 m²) and x_2 = the number of bedrooms (1-5). Using the two parameters as the horizontal and vertical axes, we can draw the contour map of the cost function. The contours appear very elongated, and the gradient descent algorithm needs many iterations to converge.
3.2 Solutions
- Solution 1: Try to scale all features into roughly the range −1 to 1, for example by dividing by the maximum value:
  x_1 = housing area / 400
  x_2 = number of bedrooms / 5
- Solution 2: Mean normalization
  Replace each feature x_i with (x_i − μ_i) / s_i, where μ_i is the mean of the feature and s_i is its maximum value; s_i can also be the standard deviation or the range (maximum − minimum).
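As a minimal sketch (not from the original notes), both scalings can be applied with a few lines of NumPy; the sample values below are illustrative and follow the housing example above:

import numpy as np

# Hypothetical raw features: area in m^2 (0-400) and number of bedrooms (1-5)
area = np.array([40.0, 96.0, 135.0, 320.0])
bedrooms = np.array([1.0, 2.0, 3.0, 5.0])

# Solution 1: divide by the maximum so each feature falls roughly in [0, 1]
area_scaled = area / 400
bedrooms_scaled = bedrooms / 5

# Solution 2: mean normalization, (x - mean) / (max - min)
area_norm = (area - area.mean()) / (area.max() - area.min())
bedrooms_norm = (bedrooms - bedrooms.mean()) / (bedrooms.max() - bedrooms.min())

print(area_scaled, bedrooms_scaled)
print(area_norm, bedrooms_norm)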
4 Gradient Descent in Practice II: Learning Rate
4.1 Problems encountered in gradient descent method
The number of iterations required for gradient descent to converge varies from model to model and cannot be predicted in advance. We can plot the cost function against the number of iterations to observe when the algorithm converges.
Each iteration of the gradient descent algorithm is affected by the learning rate:
- If the learning rate is too small, many iterations are needed to reach convergence.
- If the learning rate is too large, an iteration may overshoot the minimum, so the cost function may not decrease at every step and the algorithm may fail to converge.
4.2 Solutions
- Test for convergence automatically, for example by checking whether the decrease of the cost function in one iteration falls below a threshold (e.g. 0.001); in practice it is usually better to look at the plot of cost versus iterations.
- Try selecting α from values such as ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ... (each step roughly 3× the previous value).
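As a small sketch (not from the original notes), the candidate learning rates can be compared by running batch gradient descent on the same kind of simulated data as above and checking the final cost; the cost function and data generation here are assumptions for illustration:

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = 100

def cost(theta):
    # Mean squared error cost J(theta) = 1/(2m) * sum((X·theta - y)^2)
    errors = X_b.dot(theta) - y
    return float(1 / (2 * m) * errors.T.dot(errors))

for alpha in [0.001, 0.01, 0.1, 0.3]:
    theta = np.zeros((2, 1))
    costs = []
    for iteration in range(200):
        gradients = 1 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - alpha * gradients
        costs.append(cost(theta))
    # A too-small alpha leaves the cost high after 200 iterations;
    # a suitable alpha drives it close to the noise floor
    print(alpha, costs[-1])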
5 Gradient Descent Algorithms (Supplementary)
5.1 Summary of the Three Gradient Descent Variants
How to choose?
- Small training set (roughly fewer than 2000 samples): use batch gradient descent.
- Larger training set: use mini-batch gradient descent with a mini-batch size such as 64, 128, 256, 512, or 1024, chosen so the batch fits in CPU/GPU memory.
5.2 Stochastic Gradient Descent
The idea of stochastic gradient descent: treat the m samples as m individual parts and use a single sample for each gradient step. In other words, with m samples, batch gradient descent performs only one parameter update per pass over the data, whereas stochastic gradient descent performs m updates.
- Implementation code
import numpy as np
import random

X = 2 * np.random.rand(100, 1)
Y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# In each epoch, all m samples are processed one at a time
n_epochs = 1000
# initial learning rate
a0 = 0.1
# decay rate
decay_rate = 1

def learning_schedule(epoch_num):
    """Learning rate decay as a function of the epoch number."""
    return (1.0 / (decay_rate * epoch_num + 1)) * a0

# Initialize theta with random values
theta = np.random.randn(2, 1)

num = [i for i in range(100)]
m = 100

for epoch in range(n_epochs):
    # Shuffle the sample indices so each epoch visits them in a random order
    rand = random.sample(num, 100)
    for i in range(m):
        random_index = rand[i]
        xi = X_b[random_index:random_index + 1]
        yi = Y[random_index:random_index + 1]
        # Stochastic gradient: computed from a single sample
        gradients = xi.T.dot(xi.dot(theta) - yi)
        # Decayed learning rate
        learning_rate = learning_schedule(epoch + 1)
        theta = theta - learning_rate * gradients

print(theta)
- Execution results show:
5.3 Mini-batch Gradient Descent
Stochastic gradient descent gives up the speed-up that comes from vectorized computation over many samples, so it is not used very often; mini-batch gradient descent recovers most of that benefit.
- Implementation code
import numpy as np
import random

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

n_epochs = 500
# learning rate (kept constant here)
a = 0.03
m = 100
theta = np.random.randn(2, 1)

batch_num = 5
batch_size = m // 5

# One epoch = one pass over all m samples
for epoch in range(n_epochs):
    # Process the data one mini-batch at a time
    for i in range(batch_num):
        start = i * batch_size
        end = (i + 1) * batch_size
        xi = X_b[start:end]
        yi = y[start:end]
        # Mini-batch gradient: averaged over batch_size samples
        gradients = 1 / batch_size * xi.T.dot(xi.dot(theta) - yi)
        learning_rate = a
        theta = theta - learning_rate * gradients

print(theta)
- Execution results show:
5.4 Mini-batch Gradient Descent Optimization: Learning Rate Decay
With mini-batches, the gradients are noisy, so training may not converge exactly; instead the parameters oscillate around the minimum. To address this, we gradually reduce the learning rate so that the oscillation is confined to as small a region as possible.
1 epoch = 1 traversal of all data
- Learning rate decay formula: learning_rate = t0 / (t + t1), where t is the total number of parameter updates performed so far (this is the learning_schedule function in the code below).
- Implementation code
import numpy as np
import random

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

n_epochs = 500
t0, t1 = 5, 50
m = 100

def learning_schedule(t):
    # Learning rate decay: alpha = t0 / (t + t1)
    return float(t0) / (t + t1)

theta = np.random.randn(2, 1)

batch_num = 5
batch_size = m // 5

# One epoch = one pass over all m samples
for epoch in range(n_epochs):
    for i in range(batch_num):
        start = i * batch_size
        end = (i + 1) * batch_size
        xi = X_b[start:end]
        yi = y[start:end]
        gradients = 1 / batch_size * xi.T.dot(xi.dot(theta) - yi)
        # The learning rate shrinks as the total number of updates grows
        learning_rate = learning_schedule(epoch * m + i)
        theta = theta - learning_rate * gradients

print(theta)
- Execution results show:
6 Features and Polynomial Regression
6.1 Overfitting
Overfitting occurs when the model has too many parameters (θ) relative to the available training data. The value of the loss function on the training set may be close to zero, yet the model fails to generalize to new data.
6.2 Solutions to Overfitting
- Solution 1: Reduce the number of features (usually not the preferred option)
  1) Manually select which features to keep
  2) Use a model-selection algorithm
- Solution 2: Regularization
  Keep all the features, but reduce the magnitude of the parameters θ_j.
6.3 Feature Scaling
When predicting house prices, let's assume that we don't know the area of the house, but we know the length and width of the house.
- Model design: h_θ(x) = θ_0 + θ_1 · (house length) + θ_2 · (house width)
- Plot with the features unscaled
- Plot with the features scaled
Note: if we use a polynomial regression model, feature scaling is necessary before running the gradient descent algorithm, because the powers of a feature (x, x², x³, ...) can differ by several orders of magnitude.
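A minimal sketch (not from the original notes) of combining polynomial feature expansion with feature scaling; the use of sklearn's Pipeline with PolynomialFeatures, StandardScaler, and SGDRegressor is an assumed set-up rather than the method used later in these notes:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import SGDRegressor

# Simulated data: a quadratic relationship
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(100)

# Scale the polynomial features so that x and x^2 are on a similar scale
# before running (stochastic) gradient descent
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(max_iter=1000),
)
model.fit(X, y)
print(model.predict([[1.5]]))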
6.4 Regularization
- How do we make θ_3 and θ_4 effectively unwanted, i.e. keep them from influencing the model?
First, we can add penalty terms for θ_3 and θ_4 to the loss function (for example 1000·θ_3² + 1000·θ_4²); if the loss function is to be minimized, θ_3 and θ_4 are then forced to be as small as possible.
Generalizing this idea to all parameters gives regularization, with the formula:
J(θ) = 1/(2m) · [ Σ_{i=1..m} (h_θ(x^(i)) − y^(i))² + λ · Σ_{j=1..n} θ_j² ]
where λ is the regularization parameter that trades off fitting the data against keeping the parameters small.
6.5 The Difference Between L1 and L2 Regularization
- L1 regularization tends to shrink some feature weights exactly to zero, effectively dropping those features (sparse solutions).
- L2 regularization tends to keep all features, shrinking their weights without eliminating them.
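As a small sketch (not from the original notes) of this difference, fitting Lasso (L1) and Ridge (L2) on the same simulated data and comparing coefficients shows L1 driving the irrelevant ones to zero; the data and alpha value are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
# Five features, but only the first two actually influence y
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.randn(100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 (Lasso) typically drives the irrelevant coefficients exactly to zero,
# while L2 (Ridge) only shrinks them toward zero
print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)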
7 Regularization Algorithms and Code Implementation
7.1 Ridge regression
7.1.1 Algorithmic Understanding
7.1.2 Implementation Formula
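As a reference (not reproduced from the original notes; exact scaling conventions vary between libraries), the ridge cost adds an L2 penalty to the mean squared error:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{j=1}^{n} \theta_j^2$$

Here α is the regularization strength, corresponding to the alpha argument in the code below.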
7.1.3 Code Implementation
- Two methods of ridge regression:
""" //ridge regression //Method 1: Ridge regression uses L2 regularization """ import numpy as np from sklearn.linear_model import Ridge from sklearn.linear_model import SGDRegressor X = 2 * np.random.rand(100, 1) y = 4 + 3 * X + np.random.randn(100, 1) # Alpha is the alpha in the penalty term, solver is the method of processing data, auto is automatically selected according to data, svd is the analytical solution, sag is the random gradient descent. ridge_reg = Ridge(alpha=1, solver='auto') # learning process ridge_reg.fit(X, y) # Forecast print(ridge_reg.predict([[1.5], [2], [2.5]])) # Print intercept print(ridge_reg.intercept_) # Printing coefficient print(ridge_reg.coef_) """ //Method 2: Ridge regression and SGD & penalty = 2 are equivalent """ sgd_reg = SGDRegressor(penalty='l2') sgd_reg.fit(X, y.ravel()) print(sgd_reg.predict([[1.5], [2], [2.5]])) # Print intercept print("W0=", sgd_reg.intercept_) # Printing coefficient print("W1=", sgd_reg.coef_)
7.2 Lasso regression
7.2.1 Algorithmic Understanding
7.2.2 Implementation Formula
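As a reference (not from the original notes; scaling conventions vary), the Lasso cost adds an L1 penalty to the mean squared error:

$$J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{j=1}^{n} |\theta_j|$$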
7.2.3 Code Implementation
""" Lasso regression Lasso Used is l1 Regularization """ import numpy as np from sklearn.linear_model import Lasso from sklearn.linear_model import SGDRegressor X = 2 * np.random.rand(100, 1) y = 4 + 3 * X + np.random.randn(100, 1) lasso_reg = Lasso(alpha=0.15) lasso_reg.fit(X, y) print(lasso_reg.predict([[1.5]])) print(lasso_reg.coef_) sgd_reg = SGDRegressor(penalty='l1', n_iter=1000) sgd_reg.fit(X, y.ravel()) print(sgd_reg.predict([[1.5]])) print(sgd_reg.coef_)
7.3 Elastic Net Regression
7.3.1 Algorithmic Understanding
7.3.2 Implementation Formula
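As a reference (not from the original notes; scaling conventions vary), the elastic net cost mixes the L1 and L2 penalties with a ratio r, corresponding to l1_ratio in the code below:

$$J(\theta) = \mathrm{MSE}(\theta) + r\,\alpha \sum_{j=1}^{n} |\theta_j| + \frac{1-r}{2}\,\alpha \sum_{j=1}^{n} \theta_j^2$$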
7.3.3 Code Implementation
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor

X = 2 * np.random.rand(100, 1)
Y = 4 + 3 * X + np.random.randn(100, 1)

# alpha is the overall penalty strength, l1_ratio mixes the L1 and L2 penalties
elastic_reg = ElasticNet(alpha=0.15, l1_ratio=0.5)
elastic_reg.fit(X, Y)
print(elastic_reg.predict([[1.5]]))
print(elastic_reg.coef_)
print(elastic_reg.intercept_)

# SGDRegressor with penalty='elasticnet' is the SGD counterpart
elastic_reg = SGDRegressor(penalty='elasticnet')
elastic_reg.fit(X, Y.ravel())
print(elastic_reg.predict([[1.5]]))
print(elastic_reg.coef_)
8 Normal Equation and Gradient Descent Comparison
Gradient descent:
- Requires choosing a suitable learning rate α
- Requires many iterations
- Works well even when the number of features n is large

Normal equation:
- No need to choose a learning rate α
- No iterations required
- Needs to compute (XᵀX)⁻¹, the inverse of X transposed times X
- Slow when n is large, since the matrix inversion is expensive
Summary: as a rule of thumb, once the number of features reaches about 10,000, it is better to switch from the normal equation to gradient descent.
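As a minimal sketch (not from the original notes), the normal equation θ = (XᵀX)⁻¹·Xᵀ·y can be applied directly to the kind of simulated data used in the earlier examples and compared with the gradient descent results:

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# Normal equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
# theta_best should be close to [[4], [3]], the values used to generate the data
print(theta_best)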
8.1 Polynomial Regression Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# 1. Data preparation
# sample size
m = 100
X = 6 * np.random.rand(m, 1) - 3
Y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)

# 2-1. Expand the input into polynomial features so the quadratic model
#      becomes a multivariate linear regression
# degree: the polynomial degree used to expand the features
poly_features = PolynomialFeatures(degree=2, include_bias=False)
# fit_transform = fit() + transform(); here transform adds the x^2 column
X_poly = poly_features.fit_transform(X)

# 2-2. Fit an ordinary linear regression on the expanded features
line_reg = LinearRegression()
line_reg.fit(X_poly, Y)
print(line_reg.coef_)
print(line_reg.intercept_)
8.2 Plots for Different Polynomial Degrees
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# 1. Data preparation
# sample size
m = 100
X = 6 * np.random.rand(m, 1) - 3
# sort X so the fitted curves are drawn from left to right
X = np.sort(X, axis=0)
Y = 7 * X ** 2 + 5 * X + 2 + np.random.randn(m, 1)

# plt.plot(X, Y, 'b.')
# plt.show()

# Map each polynomial degree to a line/marker style
d = {1: 'g-', 2: 'r.', 10: 'y*'}
# d = {2: 'g-'}

for i in d:
    # 2-1. Expand the input into polynomial features of degree i
    poly_features = PolynomialFeatures(degree=i, include_bias=False)
    # fit_transform = fit() + transform()
    X_poly = poly_features.fit_transform(X)
    print(X_poly)

    # 2-2. Fit a linear regression on the expanded features
    line_reg = LinearRegression()
    line_reg.fit(X_poly, Y)
    print(line_reg.coef_)
    print(line_reg.intercept_)

    y_predict = line_reg.predict(X_poly)
    plt.plot(X_poly[:, 0], y_predict, d[i])

plt.show()