Notes on the Introduction to Deep Learning from Scratch course - Lesson 1 (introduction and the house price regression case)

Posted by jameslloyd on Thu, 10 Feb 2022 20:47:09 +0100

Introduction to Deep Learning from Scratch, Lesson 1

The relationship between artificial intelligence, machine learning and deep learning

artificial intelligence

As the name suggests, artificial intelligence is a field of science and technology that develops theories, methods, techniques and application systems for simulating, extending and augmenting human intelligence. Because this definition only states the goal and does not constrain the method, the many approaches and branches devised to realize it have turned AI into something of a "hodgepodge" discipline.

machine learning

Machine learning studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills, reorganize existing knowledge structures, and continuously improve their own performance. Its workflow can be divided into two steps, training and prediction, which are analogous to the familiar processes of induction and deduction.


The process of machine learning is essentially the same as the process of discovering Newton's second law, and can be divided into three stages: hypothesis, evaluation and optimization:

Hypothesis: from observations of acceleration a and force F, assume that a and F are linearly related, i.e. a = w ⋅ F.
Evaluation: the fit to the known observations should be good, i.e. the result of w ⋅ F should be as close as possible to the observed a.
Optimization: among all possible values of the parameter w, find that w = 1/m makes the evaluation best (fits the observed samples best).

In essence, machine learning is about finding the parameters that fit one large formula to the data.
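To make this concrete, here is a minimal NumPy sketch (not part of the original notes) that "learns" the parameter w in a = w ⋅ F from noisy observations; the synthetic data, the mass of 2 kg and the closed-form least-squares step are illustrative assumptions.

import numpy as np

# Hypothesis: a = w * F. We observe force F and acceleration a for an object of mass 2 kg.
np.random.seed(0)
F = np.linspace(1.0, 10.0, 50)                 # applied force (N)
a = F / 2.0 + np.random.normal(0, 0.05, 50)    # observed acceleration with measurement noise

# Evaluation + optimization: pick the w that minimizes the squared error between w*F and a.
# For this one-parameter linear model the optimum has a closed form.
w = np.sum(F * a) / np.sum(F * F)
print(w)   # close to 0.5, i.e. w = 1/m for m = 2 kg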

Deep learning

Compared with traditional machine learning algorithms, deep learning has the same theoretical structure, namely model hypothesis, evaluation function and optimization algorithm. The fundamental difference lies in the complexity of the hypothesis:

When a photo of a beautiful face is fed in, the computer receives only a matrix of numbers. The complexity of the transformation from raw pixels to a high-level semantic concept such as "beauty" is unimaginable! This transformation can no longer be expressed by an explicit mathematical formula, so researchers drew on the structure of neurons in the human brain to design neural network models.

Basic concepts of neural network

An artificial neural network contains multiple layers, such as convolutional layers, fully connected layers and LSTM layers, and each layer contains many neurons. A nonlinear neural network with more than three layers can be called a deep neural network. Broadly speaking, a deep learning model can be regarded as a mapping function from input to output, for example from an image to a high-level semantic concept ("beauty"). In theory, a sufficiently deep neural network can fit arbitrarily complex functions.

Neural networks are therefore well suited to learning the internal regularities and representation levels of sample data, and they work well on text, image and speech tasks.


Neuron: each node in a neural network is called a neuron and consists of two parts:
Weighted sum: a weighted sum of all inputs.
Nonlinear transformation (activation function): the weighted sum is passed through a nonlinear function, which gives the neuron nonlinear modeling capability.

Multilayer connection: a large number of such nodes arranged in different layers and connected together form a multilayer structure called a neural network.
Forward computation: the process of computing the output from the input, proceeding from the front of the network to the back.
Computation graph: a graphical representation of the computational logic of a neural network; it can also be written out as a formula.
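As an illustration (not from the original notes), the sketch below implements a single neuron as just described: a weighted sum followed by a nonlinear activation. The input values, weights and the choice of a sigmoid activation are arbitrary assumptions.

import numpy as np

def neuron(x, w, b):
    # Part 1: weighted sum of all inputs
    z = np.dot(x, w) + b
    # Part 2: nonlinear transformation (here a sigmoid activation, chosen arbitrarily)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # example inputs
w = np.array([0.1, 0.4, -0.2])   # example weights
print(neuron(x, w, b=0.3))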

Case study: Boston house prices

Boston house price prediction is a classic machine learning task; its status is comparable to that of "Hello World" in programming.

House prices in Boston are affected by many factors. The dataset contains 13 factors that may influence the price, together with the median price of that type of house:

Prediction problems can be divided into regression tasks and classification tasks according to whether the predicted output is a continuous real value or a discrete label. Since house price is a continuous value, house price prediction is clearly a regression task.

Linear regression prediction model

The linear regression model is the simplest regression model. It assumes that the relationship between the house price and the various factors can be expressed by a formula of the form y = x1*w1 + x2*w2 + ... + x13*w13 + b.

Solving the model means fitting the weights wj and the bias term b from the data.

In this case we generally use the mean squared error as the loss function (it is both reasonable and easy to optimize): MSE = (1/N) * sum over the N samples of (predicted value - true value)^2.

In the standard structure of a neural network, each neuron consists of a weighted sum and a nonlinear transformation, and multiple neurons are then arranged and connected to form a network.

The linear regression model can be regarded as a minimal special case of a neural network: a single neuron with only a weighted sum and no nonlinear transformation (and no network structure at all), as follows:

model building

Building the model generally involves five steps: data processing, model design, training configuration, the training process, and saving the model.

data processing

Data processing consists of five parts: importing the data, reshaping the data, splitting the dataset, normalizing the data, and wrapping everything in a load_data function. Only after preprocessing can the data be consumed by the model.

# Import the required packages
import numpy as np
import json
# Read in training data
datafile = './work/housing.data'
data = np.fromfile(datafile, sep=' ')
data


The imported data is one-dimensional, so we reshape it into 14 columns (13 influencing factors X and one output Y).

# Here, reshape the original data into the form of N x 14
feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE','DIS', 
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
feature_num = len(feature_names)
data = data.reshape([data.shape[0] // feature_num, feature_num])

# View data
x = data[0]
print(x.shape)
print(x)


The dataset is split into a training set and a test set. The training set is used to determine the model parameters, and the test set is used to evaluate the model.

ratio = 0.8
offset = int(data.shape[0] * ratio)
training_data = data[:offset]
training_data.shape

Data set normalization

# Calculate the maximum, minimum and average values of the training set
maximums, minimums, avgs = training_data.max(axis=0), \
                           training_data.min(axis=0), \
                           training_data.sum(axis=0) / training_data.shape[0]
# Normalize the data (min-max scaling)
for i in range(feature_num):
    #print(maximums[i], minimums[i], avgs[i])
    data[:, i] = (data[:, i] - minimums[i]) / (maximums[i] - minimums[i])

Wrapping everything in a load_data function

Encapsulate the above data processing operations into a load_data function so that the model can call it later. The implementation is as follows.

def load_data():
    # Import data from file
    datafile = './work/housing.data'
    data = np.fromfile(datafile, sep=' ')

    # Each data includes 14 items, of which the first 13 items are influencing factors and the 14th item is the corresponding median house price
    feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', \
                      'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
    feature_num = len(feature_names)

    # Reshape the original data into a shape like [N, 14]
    data = data.reshape([data.shape[0] // feature_num, feature_num])

    # Split the original data set into training set and test set
    # Here, 80% of the data are used for training and 20% for testing
    # There must be no intersection between training and testing
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]

    # Calculate the maximum, minimum and average of the training set
    maximums, minimums, avgs = training_data.max(axis=0), training_data.min(axis=0), \
                                 training_data.sum(axis=0) / training_data.shape[0]

    # Normalize the data
    for i in range(feature_num):
        #print(maximums[i], minimums[i], avgs[i])
        data[:, i] = (data[:, i] - minimums[i]) / (maximums[i] - minimums[i])

    # Split the normalized data into training and test sets
    training_data = data[:offset]
    test_data = data[offset:]
    return training_data, test_data

This completes the data processing stage.

Note: the test set is normalized using the maximum and minimum values of the training set, so that both sets are scaled to the same range.
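The point of this note can be made explicit with a small self-contained sketch (illustrative only, using made-up random data rather than the Boston dataset): the statistics come from the training rows only and are then applied to the test rows.

import numpy as np

# 'raw' stands for an un-normalized [N, 14] array (random data for illustration)
raw = np.random.rand(100, 14)
offset = int(raw.shape[0] * 0.8)
train, test = raw[:offset], raw[offset:]
t_max, t_min = train.max(axis=0), train.min(axis=0)
train_norm = (train - t_min) / (t_max - t_min)
test_norm = (test - t_min) / (t_max - t_min)   # the test set uses the TRAINING set's min/max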

verification:

# get data
training_data, test_data = load_data()
x = training_data[:, :-1]
y = training_data[:, -1:]

model design

Model design is one of the key elements of a deep learning model. Also called network structure design, it corresponds to the hypothesis space of the model, i.e. it implements the model's "forward computation" from input to output.

If both the input features and the output prediction are represented as vectors, with the input feature x having 13 components and y having 1 component, then the weight parameter w has shape 13 × 1.
It can be initialized as follows:

w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, -0.1, -0.2, -0.3, -0.4, 0.0]
w = np.array(w).reshape([13, 1])

For a complete linear regression formula we also need to initialize the bias b; here it is arbitrarily assigned the initial value -0.2. The complete output of the linear regression model is then z = x*w + b. This process of computing the output value from the features and parameters is called "forward computation".

The prediction process above is expressed below as a class. The class member variables include the parameters w and b, and a forward function ("forward computation") carries out the computation from features and parameters to the predicted output. The code is as follows.

class Network(object):
    def __init__(self, num_of_weights):
        # Initial value of randomly generated w
        # In order to maintain the consistency of the results of each run of the program,
        # A fixed random number seed is set here
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)
        self.b = 0.
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z

Based on the Network class definition, the model computation proceeds as follows.

net = Network(13)
x1 = x[0]
y1 = y[0]
z = net.forward(x1)
print(z)

Training configuration

After the model design is complete, we need to find the optimal parameter values through training, i.e. measure the quality of the model with a loss function. The training configuration is another key element of a deep learning model.

The mean squared error formula is: Loss = (1/N) * sum over the N samples of (z - y)^2, where z is the predicted value and y is the true value.

Mean squared error is commonly used for regression problems, while the cross-entropy loss is commonly used for classification problems.
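For comparison, here is a small sketch (not part of the original notes) of the cross-entropy loss mentioned above, computed for a toy 3-class classification example with made-up probabilities and labels.

import numpy as np

# Toy example: predicted class probabilities for 2 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
labels = np.array([0, 2])          # true class indices

# Cross-entropy: average negative log-probability assigned to the true class
loss = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
print(loss)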

When computing the loss we need to take every sample into account, so we sum the loss of each individual sample and divide by the total number of samples N.

The code is defined as follows:

class Network(object):
    def __init__(self, num_of_weights):
        # Initial value of randomly generated w
        # In order to maintain the consistency of the results of each run of the program, a fixed random number seed is set here
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)
        self.b = 0.
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, z, y):
        error = z - y
        cost = error * error
        cost = np.mean(cost)
        return cost

Test it as follows:

net = Network(13)
# Here, the predicted value and loss function of multiple samples can be calculated at one time
x1 = x[0:3]
y1 = y[0:3]
z = net.forward(x1)
print('predict: ', z)
loss = net.loss(z, y1)
print('loss:', loss)

Training process

The training process is another key element of the deep learning model. Its goal is to make the defined loss function as small as possible, i.e. to find parameter values w and b at which the loss reaches a minimum.

At an extreme point of the loss curve the slope is 0, i.e. the derivative of the function is 0 there. The w and b that minimize the loss function should therefore satisfy dLoss/dw = 0 and dLoss/db = 0.

However, this system of equations is easy to solve only for simple models; once the model contains nonlinear transformations it becomes hard to solve directly. To address this, we introduce a more general numerical method: gradient descent.

Gradient descent method

In reality there are many functions that are easy to evaluate in the forward direction but hard to invert; they are called one-way functions and are widely used in cryptography. A combination lock is a good analogy: it is easy to check whether a key is correct (finding y from x is easy), but even with the lock mechanism in hand it is hard to work out the correct key (finding x from y is hard).

Minimizing the loss is rather like a blind person trying to walk from a mountain peak down to the valley. They cannot see where the valley is (they cannot analytically solve for the parameters at which the derivative of the loss is 0), but they can feel the slope around their feet (the derivative, or gradient, at the current point). The minimum of the loss function can therefore be found by starting from the current parameter values and repeatedly stepping downhill until the lowest point is reached. This is the "blind downhill" method, better known as gradient descent.

Example: consider the case where only the two parameters w5 and w9 vary.

net = Network(13)
losses = []
# Only plot the loss over the region where parameters w5 and w9 lie in [-160, 160], which contains the extremum of the loss function
w5 = np.arange(-160.0, 160.0, 1.0)
w9 = np.arange(-160.0, 160.0, 1.0)
losses = np.zeros([len(w5), len(w9)])

# Compute the loss for each pair of parameter values in the grid
for i in range(len(w5)):
    for j in range(len(w9)):
        net.w[5] = w5[i]
        net.w[9] = w9[j]
        z = net.forward(x)
        loss = net.loss(z, y)
        losses[i, j] = loss

# Use matplotlib to plot w5, w9 and the corresponding loss as a 3D surface
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)

w5, w9 = np.meshgrid(w5, w9)

ax.plot_surface(w5, w9, losses, rstride=1, cstride=1, cmap='rainbow')
plt.show()


Observe that the surface above has a "smooth" slope, which is one of the reasons we choose mean squared error as the loss function. Figure 6 compares the loss curves of the mean squared error and the absolute error (where each sample's error is accumulated without squaring) when there is only one parameter dimension.


The advantages of using mean squared error are:
(1) The curve is differentiable at its lowest point.
(2) The closer we get to the lowest point, the gentler the slope of the curve becomes, which helps us judge from the current gradient how close we are to the minimum (e.g. whether to gradually reduce the step size to avoid overshooting it).

The gradient descent scheme is implemented as follows:
(1) Randomly choose a set of initial parameter values.
(2) Move to a nearby point where the loss is smaller.
(3) Repeat step 2 until the loss hardly decreases any more.

Basic calculus tells us that the function value decreases fastest along the direction opposite to the gradient: the gradient at a point is the direction of steepest ascent of the function, so stepping in the opposite direction gives the fastest decrease.
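A tiny self-contained check of this claim (illustrative values only, not from the original notes): for the simple convex function f(w) = w^2, stepping against the gradient lowers f, while stepping along the gradient raises it.

def f(w):
    return w ** 2          # a simple convex function

def grad_f(w):
    return 2 * w           # its derivative (gradient)

w, eta = 3.0, 0.1
print(f(w))                       # 9.0
print(f(w - eta * grad_f(w)))     # smaller: step against the gradient
print(f(w + eta * grad_f(w)))     # larger: step along the gradient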

Computing the gradient

The loss function is the mean squared error defined above. To keep the gradient expressions simple, a factor of 1/2 is commonly introduced, i.e. Loss = (1/2N) * sum of (z - y)^2; the constant factor only rescales the gradient and does not change where the minimum is.

For a single sample (x, y) with prediction z, the gradient with respect to weight wj is then (z - y) * xj, and the gradient with respect to b is (z - y), which is exactly what the code below computes.

First check the shape of each piece of data:

x1 = x[0]
y1 = y[0]
z1 = net.forward(x1)
print('x1 {}, shape {}'.format(x1, x1.shape))
print('y1 {}, shape {}'.format(y1, y1.shape))
print('z1 {}, shape {}'.format(z1, z1.shape))

# Calculate the gradient of w0 according to the formula above
gradient_w0 = (z1 - y1) * x1[0]
print('gradient_w0 {}'.format(gradient_w0))

# Calculate the gradient of w1
gradient_w1 = (z1 - y1) * x1[1]
print('gradient_w1 {}'.format(gradient_w1))
# and so on for the remaining weights

Gradient calculation using Numpy

Thanks to NumPy's broadcasting mechanism (vector and matrix operations are written the same way as scalar operations), the gradient can be computed much more concisely. The result is a 13-dimensional vector in which each component is the gradient for the corresponding feature dimension.

gradient_w = (z1 - y1) * x1
print('gradient_w_by_sample1 {}, gradient.shape {}'.format(gradient_w, gradient_w.shape))


Similarly, we can compute the gradient contribution of another single sample, and then of several samples at once.

x3 = x[2]
y3 = y[2]
z3 = net.forward(x3)
gradient_w = (z3 - y3) * x3
print('gradient_w_by_sample3 {}, gradient.shape {}'.format(gradient_w, gradient_w.shape))

# Note that this is the data of three samples at a time, not the third sample
x3samples = x[0:3]
y3samples = y[0:3]
z3samples = net.forward(x3samples)

print('x {}, shape {}'.format(x3samples, x3samples.shape))
print('y {}, shape {}'.format(y3samples, y3samples.shape))
print('z {}, shape {}'.format(z3samples, z3samples.shape))

For the case of N samples, we can directly compute every sample's contribution to the gradient in the following way; this is the convenience brought by NumPy broadcasting.

z = net.forward(x)
gradient_w = (z - y) * x
print('gradient_w shape {}'.format(gradient_w.shape))
print(gradient_w)


Each row of gradient_w above represents one sample's contribution to the gradient. According to the gradient formula, the total gradient is the average of the per-sample contributions.

# axis=0 averages over the rows (samples): each column is summed and divided by the number of rows
gradient_w = np.mean(gradient_w, axis=0)
print('gradient_w ', gradient_w.shape)
print('w ', net.w.shape)
print(gradient_w)
print(net.w)


np.mean eliminates the 0th dimension, so to keep the shape consistent with w (which has shape [13, 1]) we use the following code:

gradient_w = gradient_w[:, np.newaxis]
print('gradient_w shape', gradient_w.shape)

Complete gradient computation code

z = net.forward(x)
gradient_w = (z - y) * x
gradient_w = np.mean(gradient_w, axis=0)
gradient_w = gradient_w[:, np.newaxis]
gradient_w


gradient_b = (z - y)
gradient_b = np.mean(gradient_b)
# b is a scalar, so np.mean directly gives the scalar gradient we need
gradient_b
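As a sanity check on the formula above, one can compare this analytic gradient with a numerical finite-difference estimate. The sketch below is an addition, assuming the x, y, net and gradient_w variables from the preceding code are still in scope; the perturbation size eps is an arbitrary choice. Because the gradient (z - y) * x corresponds to a loss carrying a 1/2 factor, the numerical loss is taken as 0.5 * net.loss(...).

# Sanity check: compare the analytic gradient above with a finite-difference estimate
eps = 1e-4
w0_backup = net.w[0][0]

net.w[0][0] = w0_backup + eps
loss_plus = 0.5 * net.loss(net.forward(x), y)
net.w[0][0] = w0_backup - eps
loss_minus = 0.5 * net.loss(net.forward(x), y)
net.w[0][0] = w0_backup  # restore the original weight

numeric_grad_w0 = (loss_plus - loss_minus) / (2 * eps)
print('analytic:', gradient_w[0][0], 'numeric:', numeric_grad_w0)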

Now write the above computation of the gradients of w and b into a gradient function of the Network class. The implementation is as follows.

class Network(object):
    def __init__(self, num_of_weights):
        # Initial value of randomly generated w
        # In order to maintain the consistency of the results of each run of the program, a fixed random number seed is set here
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)
        self.b = 0.
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, z, y):
        error = z - y
        num_samples = error.shape[0]
        cost = error * error
        cost = np.sum(cost) / num_samples
        return cost
    
    def gradient(self, x, y):
        z = self.forward(x)
        gradient_w = (z-y)*x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (z - y)
        gradient_b = np.mean(gradient_b)
        
        return gradient_w, gradient_b

verification:

# Call the gradient function defined above to calculate the gradient
# Initialize network
net = Network(13)
# Set [w5, w9] = [-100., -100.]
net.w[5] = -100.0
net.w[9] = -100.0

z = net.forward(x)
loss = net.loss(z, y)
gradient_w, gradient_b = net.gradient(x, y)
gradient_w5 = gradient_w[5][0]
gradient_w9 = gradient_w[9][0]
print('point {}, loss {}'.format([net.w[5][0], net.w[9][0]], loss))
print('gradient {}'.format([gradient_w5, gradient_w9]))

Finding a point where the loss is smaller

Next we study how to update the parameters using the gradient. First, move a small step in the direction opposite to the gradient to reach the next point P1, and observe how the loss changes.

# On the [w5, w9] plane, move to the next point P1 along the opposite direction of the gradient
# Define moving step eta
eta = 0.1
# Update parameters w5 and w9
net.w[5] = net.w[5] - eta * gradient_w5
net.w[9] = net.w[9] - eta * gradient_w9
# Recalculate z and loss
z = net.forward(x)
loss = net.loss(z, y)
gradient_w, gradient_b = net.gradient(x, y)
gradient_w5 = gradient_w[5][0]
gradient_w9 = gradient_w[9][0]
print('point {}, loss {}'.format([net.w[5][0], net.w[9][0]], loss))
print('gradient {}'.format([gradient_w5, gradient_w9]))

eta: controls how far the parameter values move along the opposite direction of the gradient at each step, i.e. the step size, also known as the learning rate.
You may wonder why we normalized the input features to a uniform scale earlier: it is so that a single, uniform step size is appropriate for all parameters.
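A tiny illustration of this point (made-up feature scales, not the Boston data): when features live on very different scales, the gradient components also differ by orders of magnitude, so no single learning rate suits all parameters; after min-max normalization the components become comparable.

import numpy as np

np.random.seed(0)
x_raw = np.stack([np.random.rand(100),              # feature on a scale of ~1
                  np.random.rand(100) * 1000], 1)   # feature on a scale of ~1000
w_true = np.array([[2.0], [0.003]])
y = x_raw.dot(w_true)

w = np.zeros([2, 1])
grad = np.mean((x_raw.dot(w) - y) * x_raw, axis=0)
print(grad)        # the two components differ by orders of magnitude

x_norm = (x_raw - x_raw.min(0)) / (x_raw.max(0) - x_raw.min(0))
grad_norm = np.mean((x_norm.dot(w) - y) * x_norm, axis=0)
print(grad_norm)   # after normalization the components are comparable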

Encapsulating a train function

class Network(object):
    def __init__(self, num_of_weights):
        # Initial value of randomly generated w
        # In order to maintain the consistency of the results of each run of the program, a fixed random number seed is set here
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights,1)
        self.w[5] = -100.
        self.w[9] = -100.
        self.b = 0.
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, z, y):
        error = z - y
        num_samples = error.shape[0]
        cost = error * error
        cost = np.sum(cost) / num_samples
        return cost
    
    def gradient(self, x, y):
        z = self.forward(x)
        gradient_w = (z-y)*x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (z - y)
        gradient_b = np.mean(gradient_b)        
        return gradient_w, gradient_b
    
    def update(self, gradient_w5, gradient_w9, eta=0.01):
        self.w[5] = self.w[5] - eta * gradient_w5
        self.w[9] = self.w[9] - eta * gradient_w9
        
    def train(self, x, y, iterations=100, eta=0.01):
        points = []
        losses = []
        for i in range(iterations):
            points.append([self.w[5][0], self.w[9][0]])
            z = self.forward(x)
            L = self.loss(z, y)
            gradient_w, gradient_b = self.gradient(x, y)
            gradient_w5 = gradient_w[5][0]
            gradient_w9 = gradient_w[9][0]
            self.update(gradient_w5, gradient_w9, eta)
            losses.append(L)
            if i % 50 == 0:
                print('iter {}, point {}, loss {}'.format(i, [self.w[5][0], self.w[9][0]], L))
        return points, losses

# get data
train_data, test_data = load_data()
x = train_data[:, :-1]
y = train_data[:, -1:]
# Create network
net = Network(13)
num_iterations=2000
# Start training
points, losses = net.train(x, y, iterations=num_iterations, eta=0.01)

# Draw the change trend of the loss function
plot_x = np.arange(num_iterations)
plot_y = np.array(losses)
plt.plot(plot_x, plot_y)
plt.show()

Extending training to all parameters

The gradient descent demonstrated on the 2-D plane only involved the parameters w5 and w9, but the complete house price model must solve for all the parameters w and b.

class Network(object):
    def __init__(self, num_of_weights):
        # Initial value of randomly generated w
        # In order to maintain the consistency of the results of each run of the program, a fixed random number seed is set here
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)
        self.b = 0.
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, z, y):
        error = z - y
        num_samples = error.shape[0]
        cost = error * error
        cost = np.sum(cost) / num_samples
        return cost
    
    def gradient(self, x, y):
        z = self.forward(x)
        gradient_w = (z-y)*x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (z - y)
        gradient_b = np.mean(gradient_b)        
        return gradient_w, gradient_b
    
    def update(self, gradient_w, gradient_b, eta = 0.01):
        self.w = self.w - eta * gradient_w
        self.b = self.b - eta * gradient_b
        
    def train(self, x, y, iterations=100, eta=0.01):
        losses = []
        for i in range(iterations):
            z = self.forward(x)
            L = self.loss(z, y)
            gradient_w, gradient_b = self.gradient(x, y)
            self.update(gradient_w, gradient_b, eta)
            losses.append(L)
            if (i+1) % 10 == 0:
                print('iter {}, loss {}'.format(i, L))
        return losses

# get data
train_data, test_data = load_data()
x = train_data[:, :-1]
y = train_data[:, -1:]
# Create network
net = Network(13)
num_iterations=1000
# Start training
losses = net.train(x,y, iterations=num_iterations, eta=0.01)

# Draw the change trend of the loss function
plot_x = np.arange(num_iterations)
plot_y = np.array(losses)
plt.plot(plot_x, plot_y)
plt.show()

Stochastic gradient descent

In the program above, each computation of the loss and gradient uses the full dataset. For the Boston house price task the dataset is small (only 404 training samples), but in practical problems datasets are often very large, and using all the data for every update is inefficient; it is like using a sledgehammer to crack a nut.

Since each update only moves the parameters a small amount in the opposite direction of the gradient, the direction does not need to be perfectly accurate. A reasonable solution is to randomly draw a small portion of the data each time to represent the whole, and compute the gradient and loss on that portion to update the parameters. This method is called Stochastic Gradient Descent (SGD). The core concepts are as follows:

Mini-batch: the batch of data drawn in each iteration is called a mini-batch.
batch_size: the number of samples contained in a mini-batch is called the batch_size.
Epoch: during training, samples are drawn mini-batch by mini-batch; once the entire dataset has been traversed, one round (epoch) of training is complete. When starting training, the number of epochs num_epochs and the batch_size are passed in as parameters.

# get data
train_data, test_data = load_data()
train_data.shape

train_data contains 404 samples in total. With batch_size=10, we take samples 0-9 as the first mini-batch and name it train_data1.

train_data1 = train_data[0:10]
train_data1.shape

Use the data in train_data1 (samples 0-9) to compute the gradient and update the network parameters.

net = Network(13)
x = train_data1[:, :-1]
y = train_data1[:, -1:]
loss = net.train(x, y, iterations=1, eta=0.01)
loss

Following this approach, keep taking out new mini-batches and gradually updating the network parameters.

Next, split train_data into multiple mini-batches of size batch_size, as shown in the code below: the 404 samples are divided into 404 / 10 + 1 = 41 mini-batches, the first 40 of which contain 10 samples each, while the last one contains only 4 samples.

batch_size = 10
n = len(train_data)
mini_batches = [train_data[k:k+batch_size] for k in range(0, n, batch_size)]
print('total number of mini_batches is ', len(mini_batches))
print('first mini_batch shape ', mini_batches[0].shape)
print('last mini_batch shape ', mini_batches[-1].shape)

Extensive experiments show that a model is influenced more strongly by the data it sees last: as training approaches its end, the last few batches have a greater impact on the model parameters. To prevent this ordering effect (the model "memorizing" the sample order) from harming the training result, the samples need to be shuffled.

# Create a new array
a = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
a = a.reshape([6, 2])
print('before shuffle\n', a)
np.random.shuffle(a)
print('after shuffle\n', a)

From the output we can see that the array elements are shuffled along the 0th dimension, while the order within the 1st dimension remains unchanged: the number 2 is still next to the number 1 and the number 8 is still next to the number 7, but the row [3, 4] is no longer next to [1, 2]. We first shuffle and split the training data and train on each mini-batch in turn, and then integrate this SGD logic into the train function of the Network class; the complete code follows.

# get data
train_data, test_data = load_data()

# Disrupt sample order
np.random.shuffle(train_data)

# train_data is divided into multiple mini_batch
batch_size = 10
n = len(train_data)
mini_batches = [train_data[k:k+batch_size] for k in range(0, n, batch_size)]

# Create network
net = Network(13)

# Train on each mini_batch in turn
for mini_batch in mini_batches:
    x = mini_batch[:, :-1]
    y = mini_batch[:, -1:]
    loss = net.train(x, y, iterations=1)

Key points of modification:

Each randomly drawn mini-batch is fed into the model for parameter training. The core of the training process is a two-level loop:

The outer loop controls how many times the whole sample set is traversed; each traversal is called an "epoch". The code is:
for epoch_id in range(num_epochs):

The inner loop iterates over the mini-batches that the sample set is split into during each traversal; each pass over one mini-batch is called an "iter" (iteration). The code is:
for iter_id, mini_batch in enumerate(mini_batches):

Inside the two-level loop is the classic four-step training process: forward computation -> compute loss -> compute gradient -> update parameters.
The final code is as follows:

import numpy as np

class Network(object):
    def __init__(self, num_of_weights):
        # Initial value of randomly generated w
        # In order to maintain the consistency of the results of each run of the program, a fixed random number seed is set here
        #np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)
        self.b = 0.
        
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    
    def loss(self, z, y):
        error = z - y
        num_samples = error.shape[0]
        cost = error * error
        cost = np.sum(cost) / num_samples
        return cost
    
    def gradient(self, x, y):
        z = self.forward(x)
        N = x.shape[0]
        gradient_w = 1. / N * np.sum((z-y) * x, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = 1. / N * np.sum(z-y)
        return gradient_w, gradient_b
    
    def update(self, gradient_w, gradient_b, eta = 0.01):
        self.w = self.w - eta * gradient_w
        self.b = self.b - eta * gradient_b
            
                
    def train(self, training_data, num_epochs, batch_size=10, eta=0.01):
        n = len(training_data)
        losses = []
        for epoch_id in range(num_epochs):
            # Shuffle the data before the start of each epoch,
            # then draw batch_size samples at a time
            np.random.shuffle(training_data)
            # Split the training data so that each mini_batch contains batch_size samples
            mini_batches = [training_data[k:k+batch_size] for k in range(0, n, batch_size)]
            for iter_id, mini_batch in enumerate(mini_batches):
                #print(self.w.shape)
                #print(self.b)
                x = mini_batch[:, :-1]
                y = mini_batch[:, -1:]
                a = self.forward(x)
                loss = self.loss(a, y)
                gradient_w, gradient_b = self.gradient(x, y)
                self.update(gradient_w, gradient_b, eta)
                losses.append(loss)
                print('Epoch {:3d} / iter {:3d}, loss = {:.4f}'.
                                 format(epoch_id, iter_id, loss))
        
        return losses

# get data
train_data, test_data = load_data()

# Create network
net = Network(13)
# Start training
losses = net.train(train_data, num_epochs=50, batch_size=100, eta=0.1)

# Draw the change trend of the loss function
plot_x = np.arange(len(losses))
plot_y = np.array(losses)
plt.plot(plot_x, plot_y)
plt.show()

Rewriting the code with the PaddlePaddle framework

# Load PaddlePaddle, NumPy and related libraries
import paddle
from paddle.nn import Linear
import paddle.nn.functional as F
import numpy as np
import os
import random

paddle: the main PaddlePaddle library. Aliases of commonly used APIs are kept in the paddle root namespace; currently this covers all APIs under the paddle.tensor and paddle.framework directories.

paddle.nn: networking-related APIs, such as the fully connected layer Linear, the convolution layer Conv2D, the recurrent neural network LSTM, the loss function CrossEntropyLoss, the activation function ReLU, etc.

Linear: the fully connected layer of a neural network, i.e. the basic neuron structure consisting of a weighted sum of all inputs. In the house price prediction task, a single-layer neural network (one fully connected layer) is used to implement the linear regression model.

paddle.nn.functional: like paddle.nn, it contains networking-related APIs, such as linear, the activation function relu, etc. Modules with the same name in the two packages provide the same functionality and essentially the same performance. The difference is that the modules under paddle.nn are classes, each of which can hold its own learnable parameters, whereas the modules under paddle.nn.functional are functions whose required inputs must be passed in manually. In practice, for layers with learnable parameters such as convolution and fully connected layers it is recommended to use the paddle.nn modules, while operations without learnable parameters, such as activation functions and pooling, can use the functions under paddle.nn.functional directly.
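The difference can be seen in a short sketch (an added illustration assuming PaddlePaddle 2.x, not part of the original tutorial): the class style creates a layer object, while the functional style is a plain function call.

import paddle
import paddle.nn.functional as F

x = paddle.to_tensor([[-1.0, 2.0, -3.0]])

# Class style: instantiate a layer object (for other layer types it would also hold parameters)
relu_layer = paddle.nn.ReLU()
print(relu_layer(x))

# Functional style: call the stateless function directly
print(F.relu(x))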

data processing

The data processing is independent of the framework:

def load_data():
    # Import data from file
    datafile = './work/housing.data'
    data = np.fromfile(datafile, sep=' ', dtype=np.float32)

    # Each data includes 14 items, of which the first 13 items are influencing factors and the 14th item is the corresponding median house price
    feature_names = [ 'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', \
                      'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV' ]
    feature_num = len(feature_names)

    # Reshape the original data into a shape like [N, 14]
    data = data.reshape([data.shape[0] // feature_num, feature_num])

    # Split the original data set into training set and test set
    # Here, 80% of the data are used for training and 20% for testing
    # There must be no intersection between training and testing
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]

    # Calculate the maximum, minimum and average values of the train dataset
    maximums, minimums, avgs = training_data.max(axis=0), training_data.min(axis=0), \
                                 training_data.sum(axis=0) / training_data.shape[0]
    
    # Record the normalization parameters of the data and normalize the data during prediction
    global max_values
    global min_values
    global avg_values
    max_values = maximums
    min_values = minimums
    avg_values = avgs

    # Normalize the data
    for i in range(feature_num):
        data[:, i] = (data[:, i] - avgs[i]) / (maximums[i] - minimums[i])

    # Division ratio of training set and test set
    training_data = data[:offset]
    test_data = data[offset:]
    return training_data, test_data

model design

The essence of model definition is defining the network structure of the linear regression. It is recommended to define the model network by writing a Python class that inherits from the paddle.nn.Layer parent class and defines an __init__ function and a forward function. The forward function is the function designated by the framework to implement the forward computation logic, and it is executed automatically when the model instance is called; the network layers used in forward must be declared in __init__.

The implementation involves two steps:

Define the __init__ function: declare each network layer in the class's initialization function.
Define the forward function: build the network structure, implement the forward computation, and return the prediction result; in this task, the predicted house price.

class Regressor(paddle.nn.Layer):

    # self represents the instance of the class itself
    def __init__(self):
        # Initialize some parameters in the parent class
        super(Regressor, self).__init__()
        
        # Define a fully connected layer with input dimension 13 and output dimension 1
        self.fc = Linear(in_features=13, out_features=1)

    # Forward computation of the network
    def forward(self, inputs):
        x = self.fc(inputs)
        return x

Training configuration


1. Declare an instance of the defined Regressor model and set it to training mode.
2. Use the load_data function to load the training data and test data.
3. Set the optimization algorithm and learning rate; the optimizer is stochastic gradient descent (SGD) with a learning rate of 0.01.
The training configuration code is as follows:

# Declare a well-defined linear regression model
model = Regressor()
# Open model training mode
model.train()
# Load data
training_data, test_data = load_data()
# Define the optimization algorithm and use random gradient descent SGD
# The learning rate is set to 0.01
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())

When we implemented the neural network model with plain Python and NumPy, we had to write a lot of code for gradient descent; with the PaddlePaddle framework, setting up the optimizer only requires defining SGD, which greatly simplifies the process.

Training process

The training process uses two nested loops:

Inner loop: traverses the whole dataset once, one batch at a time. If the dataset has 1000 samples and each batch contains 10 samples, traversing the dataset once takes 1000 / 10 = 100 batches, i.e. the inner loop executes 100 times.

for iter_id, mini_batch in enumerate(mini_batches):

Outer loop: defines how many times the dataset is traversed, set by the parameter EPOCH_NUM.

for epoch_id in range(EPOCH_NUM):

The batch size affects the training result. If the batch is too large, memory consumption and computation time increase while the training effect does not improve much (each step only moves the parameters a little in the opposite direction of the gradient, so the direction does not need to be especially accurate); if the batch is too small, the samples in each batch are not statistically meaningful and the computed gradient direction may deviate considerably. Since the house price training dataset is small, we set the batch size to 10.

EPOCH_NUM = 10   # Sets the number of outer cycles
BATCH_SIZE = 10  # Set batch size

# Define outer loop
for epoch_id in range(EPOCH_NUM):
    # Shuffle the order of the training data before each epoch
    np.random.shuffle(training_data)
    # Split the training data, and each batch contains 10 pieces of data
    mini_batches = [training_data[k:k+BATCH_SIZE] for k in range(0, len(training_data), BATCH_SIZE)]
    # Define inner loop
    for iter_id, mini_batch in enumerate(mini_batches):
        x = np.array(mini_batch[:, :-1]) # Get training data of current batch
        y = np.array(mini_batch[:, -1:]) # Get the training label of the current batch (real house price)
        # Convert numpy data into the form of propeller dynamic diagram tensor
        house_features = paddle.to_tensor(x)
        prices = paddle.to_tensor(y)
        
        # Forward calculation
        predicts = model(house_features)
        
        # Calculate loss
        loss = F.square_error_cost(predicts, label=prices)
        avg_loss = paddle.mean(loss)
        if iter_id%20==0:
            print("epoch: {}, iter: {}, loss is: {}".format(epoch_id, iter_id, avg_loss.numpy()))
        
        # Back propagation
        avg_loss.backward()
        # Minimize loss and update parameters
        opt.step()
        # Clear gradient
        opt.clear_grad()

Save model

# Save the model parameters with the file name LR_model.pdparams
paddle.save(model.state_dict(), 'LR_model.pdparams')
print("The model is saved successfully, and the model parameters are saved in LR_model.pdparams in")

Testing the model

The load_one_example function extracts one sample from the test set to use as a test sample. The implementation is as follows.

def load_one_example():
    # Randomly select one sample from the test set loaded above as the test data
    idx = np.random.randint(0, test_data.shape[0])
    idx = -10  # the random choice is then overridden with a fixed index so the result is reproducible
    one_data, label = test_data[idx, :-1], test_data[idx, -1]
    # Modify the data shape to [1,13]
    one_data =  one_data.reshape([1,-1])

    return one_data, label

# The argument is the path where the model parameters were saved
model_dict = paddle.load('LR_model.pdparams')
model.load_dict(model_dict)
model.eval()

# Take one sample from the test set as input
one_data, label = load_one_example()
# Convert data to variable format of dynamic graph 
one_data = paddle.to_tensor(one_data)
predict = model(one_data)

# The results are inverse normalized
predict = predict * (max_values[-1] - min_values[-1]) + avg_values[-1]
# Inverse normalization of label data
label = label * (max_values[-1] - min_values[-1]) + avg_values[-1]

print("Inference result is {}, the corresponding label is {}".format(predict.numpy(), label))