Neuron
A neuron and a perceptron are essentially the same thing. The difference is that when we say perceptron, the activation function is a step function, whereas when we say neuron, the activation function is usually the sigmoid function or the tanh function, as shown in the figure below:
The output of a neuron is computed in the same way as that of a perceptron. Suppose a neuron's input is the vector $\vec{x}$ and its weight vector is $\vec{w}$ (with $w_b$ as the bias term). If the activation function is the sigmoid function, its output is:

$$y = \mathrm{sigmoid}(\vec{w}^T\vec{x}) \qquad \text{(equation 1)}$$

The sigmoid function is defined as follows:

$$\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$$

Substituting it into the previous equation gives:

$$y = \frac{1}{1+e^{-\vec{w}^T\vec{x}}}$$
The sigmoid function is a nonlinear function with range (0, 1). Its graph is shown in the figure below.
The derivative of the sigmoid function is:

$$y = \mathrm{sigmoid}(x), \qquad y' = y(1-y)$$
As you can see, the derivative of the sigmoid function has a very convenient property: it can be expressed in terms of the sigmoid function itself. Once the value of the sigmoid function has been computed, computing the value of its derivative is therefore very cheap.
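As a quick illustration of this property, here is a minimal NumPy sketch (the function names sigmoid and sigmoid_derivative are ours, purely for illustration):

import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)); its range is (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative can be written using the function value itself:
    # sigmoid'(x) = y * (1 - y), where y = sigmoid(x)
    y = sigmoid(x)
    return y * (1.0 - y)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25 = 0.5 * (1 - 0.5)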
What is a neural network
A neural network is simply a number of neurons connected according to certain rules. The figure above shows a fully connected (FC) neural network. Looking at the figure, we can see that its rules include:
- Neurons are arranged in layers. The leftmost layer is called the input layer and is responsible for receiving the input data; the rightmost layer is called the output layer, from which we read the network's output. The layers between the input layer and the output layer are called hidden layers, because they are invisible from the outside.
- There is no connection between neurons in the same layer.
- Each neuron in layer N is connected to all neurons in layer N-1 (this is what "fully connected" means), and the outputs of the neurons in layer N-1 are the inputs of the neurons in layer N.
- Each connection has a weight.
These rules define the structure of a fully connected neural network. There are many other kinds of neural networks, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), which have different connection rules.
Calculating the output of a neural network
A neural network is really just a function from an input vector $\vec{x}$ to an output vector $\vec{y}$, i.e.:

$$\vec{y} = f_{network}(\vec{x})$$
To compute the network's output from its input, we first assign each element $x_i$ of the input vector $\vec{x}$ to the corresponding neuron of the input layer, then compute the value of every neuron of every layer forward according to equation 1 until the values of all neurons in the final output layer have been computed. Finally, the output vector $\vec{y}$ is obtained by concatenating the values of the output-layer neurons.
Next, let's illustrate this process with an example. We first number each unit of the neural network.
As shown in the figure above, the input layer has three nodes, numbered 1, 2 and 3; the four nodes of the hidden layer are numbered 4, 5, 6 and 7; finally, the two nodes of the output layer are numbered 8 and 9. Because our network is fully connected, each node is connected to all nodes in the previous layer. For example, hidden-layer node 4 is connected to input-layer nodes 1, 2 and 3, and the weights of these connections are $w_{41}$, $w_{42}$ and $w_{43}$. So how do we compute the output value $a_4$ of node 4?
In order to compute the output value of node 4, we must first obtain the output values of all of its upstream nodes (nodes 1, 2 and 3). Nodes 1, 2 and 3 belong to the input layer, so their output values are just the elements of the input vector $\vec{x}$ itself. According to the correspondence shown in the figure above, the output values of nodes 1, 2 and 3 are $x_1$, $x_2$ and $x_3$ respectively. We require that the dimension of the input vector equals the number of input-layer neurons; which input node corresponds to which element of the input vector can be decided freely. Assigning $x_1$ to node 2, for example, is not wrong, but it has little value other than confusing yourself.
Once we have the output values of nodes 1, 2 and 3, we can compute the output value $a_4$ of node 4 according to equation 1:

$$a_4 = \mathrm{sigmoid}(\vec{w}^T\vec{x}) = \mathrm{sigmoid}(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + w_{4b})$$

In the formula above, $w_{4b}$ is the bias term of node 4, which is not drawn in the figure, and $w_{41}$, $w_{42}$, $w_{43}$ are the weights of the connections from nodes 1, 2 and 3 to node 4 respectively. When numbering a weight $w_{ji}$, we put the index $j$ of the target node first and the index $i$ of the source node second.
Similarly, we can compute the output values $a_5$, $a_6$ and $a_7$ of nodes 5, 6 and 7. With the output values of the four hidden-layer nodes in hand, we can then compute the output value $y_1$ of output-layer node 8:

$$y_1 = a_8 = \mathrm{sigmoid}(w_{84}a_4 + w_{85}a_5 + w_{86}a_6 + w_{87}a_7 + w_{8b})$$

In the same way we can compute the value of $y_2$. Now the output values of all output-layer nodes have been computed, and we have obtained the network's output vector $\vec{y} = [y_1, y_2]^T$ for the input vector $\vec{x}$. We also see that the dimension of the output vector equals the number of output-layer neurons.
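To make the node-by-node computation concrete, here is a minimal Python sketch of the forward pass for node 4 and node 8 of the network above; the input and weight values are made up purely for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up input values x1, x2, x3 and weights (for illustration only)
x1, x2, x3 = 0.5, 0.1, -0.2

# Hidden node 4: weights w41, w42, w43 and bias w4b (equation 1)
w41, w42, w43, w4b = 0.1, -0.3, 0.2, 0.05
a4 = sigmoid(w41 * x1 + w42 * x2 + w43 * x3 + w4b)

# Suppose nodes 5, 6 and 7 have been computed the same way:
a5, a6, a7 = 0.4, 0.6, 0.55

# Output node 8: weighted sum of all hidden outputs plus its bias w8b
w84, w85, w86, w87, w8b = 0.3, -0.1, 0.2, 0.1, 0.0
y1 = sigmoid(w84 * a4 + w85 * a5 + w86 * a6 + w87 * a7 + w8b)
print(a4)
print(y1)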
Matrix representation of neural networks
The computation of a neural network becomes very convenient when it is written in matrix form (and, of course, it looks a lot classier). Let's first look at the matrix representation of the hidden layer.
First, we write out the computations of the four hidden-layer nodes one by one:

$$a_4 = \mathrm{sigmoid}(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + w_{4b})$$
$$a_5 = \mathrm{sigmoid}(w_{51}x_1 + w_{52}x_2 + w_{53}x_3 + w_{5b})$$
$$a_6 = \mathrm{sigmoid}(w_{61}x_1 + w_{62}x_2 + w_{63}x_3 + w_{6b})$$
$$a_7 = \mathrm{sigmoid}(w_{71}x_1 + w_{72}x_2 + w_{73}x_3 + w_{7b})$$

Next, we define the network's input vector $\vec{x}$ and the weight vector $\vec{w}_j$ of each hidden-layer node. Let

$$\vec{x} = [x_1, x_2, x_3, 1]^T$$
$$\vec{w}_4 = [w_{41}, w_{42}, w_{43}, w_{4b}], \quad \vec{w}_5 = [w_{51}, w_{52}, w_{53}, w_{5b}], \quad \vec{w}_6 = [w_{61}, w_{62}, w_{63}, w_{6b}], \quad \vec{w}_7 = [w_{71}, w_{72}, w_{73}, w_{7b}]$$

and let $f$ denote the sigmoid function. Substituting into the previous set of equations gives:

$$a_4 = f(\vec{w}_4\cdot\vec{x}), \quad a_5 = f(\vec{w}_5\cdot\vec{x}), \quad a_6 = f(\vec{w}_6\cdot\vec{x}), \quad a_7 = f(\vec{w}_7\cdot\vec{x})$$

Now we put the four formulas that compute $a_4, a_5, a_6, a_7$ into a single matrix, with each formula as one row, so that their computation can be expressed with matrices. Let

$$W = \begin{bmatrix}\vec{w}_4\\ \vec{w}_5\\ \vec{w}_6\\ \vec{w}_7\end{bmatrix}, \qquad \vec{a} = [a_4, a_5, a_6, a_7]^T$$

Substituting into the previous set of equations gives:

$$\vec{a} = f(W\cdot\vec{x}) \qquad \text{(equation 2)}$$
In equation 2, $f$ is the activation function (the sigmoid function in this case), $W$ is the weight matrix of a layer, $\vec{x}$ is the input vector of the layer, and $\vec{a}$ is the output vector of the layer. Equation 2 says that what each layer of a neural network does is left-multiply its input vector by a weight matrix (a linear transformation) to obtain a new vector, and then apply the activation function to that vector element by element.
Every layer performs the same kind of computation. For example, for a neural network with one input layer, one output layer and three hidden layers, suppose the weight matrices are $W_1, W_2, W_3, W_4$, the outputs of the three hidden layers are $\vec{a}_1, \vec{a}_2, \vec{a}_3$, the network input is $\vec{x}$, and the network output is $\vec{y}$, as shown in the figure below:

Then the output vector of each layer can be computed as:

$$\vec{a}_1 = f(W_1\cdot\vec{x})$$
$$\vec{a}_2 = f(W_2\cdot\vec{a}_1)$$
$$\vec{a}_3 = f(W_3\cdot\vec{a}_2)$$
$$\vec{y} = f(W_4\cdot\vec{a}_3)$$
This is how the output value of a neural network is computed.
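Equation 2, applied layer by layer, maps directly onto code. Below is a minimal NumPy sketch; the layer sizes and random weights are made up, and bias terms are omitted for brevity (they could be folded into each W by appending a constant 1 to the layer's input, as in the derivation above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weight_matrices, x):
    # Apply a = f(W . a_prev) layer by layer (equation 2)
    a = x
    for W in weight_matrices:
        a = sigmoid(np.dot(W, a))
    return a

# A made-up 3-4-4-2 network: each shape is (next layer size, previous layer size)
np.random.seed(0)
Ws = [np.random.uniform(-0.1, 0.1, (4, 3)),
      np.random.uniform(-0.1, 0.1, (4, 4)),
      np.random.uniform(-0.1, 0.1, (2, 4))]
print(forward(Ws, np.array([0.5, 0.1, -0.2])))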
Training of neural network
Now we need to know how to obtain the weight on each connection of a neural network. A neural network is a model, and these weights are the model's parameters, i.e. what the model has to learn. On the other hand, the connection pattern of the network, the number of layers, and the number of nodes per layer are not learned; they are set by hand in advance. Such hand-set parameters are called hyperparameters.
Next, we will introduce the training algorithm of neural network: back propagation algorithm.
Back propagation algorithm
We first introduce the back propagation algorithm intuitively and present its derivation at the end. Readers may skip the derivation entirely; not knowing how to derive it does not stop you from writing training code for a neural network. In fact, there are already many mature open-source neural network implementations, so apart from practice you may never need to write one yourself.
We take supervised learning as the setting for explaining the back propagation algorithm. In the article Zero basis introduction to deep learning (2) - linear units and gradient descent we introduced what supervised learning is; take another look if you have forgotten. In addition, we assume that the activation function $f$ of the neurons is the sigmoid function (the formulas differ for other activation functions; see the Derivation of back propagation algorithm section).
Assume each training sample is $(\vec{x}, \vec{t})$, where the vector $\vec{x}$ is the sample's features and $\vec{t}$ is the sample's target value.

First, using the sample features $\vec{x}$ and the algorithm introduced in the previous section, we compute the output $a_i$ of every hidden-layer node and the output $y_i$ of every output-layer node of the network.
Then, we calculate the error term of each node according to the following method:
- For an output-layer node $i$:

$$\delta_i = y_i(1-y_i)(t_i-y_i) \qquad \text{(equation 3)}$$

Here $\delta_i$ is the error term of node $i$, $y_i$ is the output value of node $i$, and $t_i$ is the target value of the sample for node $i$. For example, according to the figure above, output-layer node 8 has output value $y_1$ and the sample's target value is $t_1$, so substituting into the formula gives the error term $\delta_8$ of node 8:

$$\delta_8 = y_1(1-y_1)(t_1-y_1)$$
- For a hidden-layer node $i$:

$$\delta_i = a_i(1-a_i)\sum_{k\in Downstream(i)} w_{ki}\,\delta_k \qquad \text{(equation 4)}$$

Here $a_i$ is the output value of node $i$, $w_{ki}$ is the weight of the connection from node $i$ to its downstream node $k$, and $\delta_k$ is the error term of downstream node $k$. For example, for hidden-layer node 4:

$$\delta_4 = a_4(1-a_4)(w_{84}\delta_8 + w_{94}\delta_9)$$
Finally, update the weight on each connection:

$$w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\,x_{ji} \qquad \text{(equation 5)}$$

Here $w_{ji}$ is the weight of the connection from node $i$ to node $j$, $\eta$ is a constant called the learning rate, $\delta_j$ is the error term of node $j$, and $x_{ji}$ is the input that node $i$ passes to node $j$. For example, the weight $w_{84}$ is updated as follows:

$$w_{84} \leftarrow w_{84} + \eta\,\delta_8\,a_4$$

Similarly, the weight $w_{41}$ is updated as follows:

$$w_{41} \leftarrow w_{41} + \eta\,\delta_4\,x_1$$

The input corresponding to a bias term is always 1. For example, the bias term $w_{4b}$ of node 4 is updated as follows:

$$w_{4b} \leftarrow w_{4b} + \eta\,\delta_4$$
We have now described how to compute each node's error term and how to update the weights. Clearly, to compute a node's error term, we first need the error terms of all nodes in the layer downstream of it. This means the error terms must be computed starting from the output layer and then backwards, layer by layer, through the hidden layers until the hidden layer adjacent to the input layer. That is what the name back propagation refers to. Once the error terms of all nodes have been computed, all weights can be updated according to equation 5.
The above is the basic back propagation algorithm, which is not very complex. Have you figured it out?
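To tie equations 3, 4 and 5 together, here is a minimal NumPy sketch of one training step for a network with a single hidden layer. The function and variable names are ours, the toy sizes are arbitrary, and bias terms are omitted to keep the sketch short:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_one_sample(W_hidden, W_output, x, t, eta):
    # Forward pass
    a = sigmoid(np.dot(W_hidden, x))      # hidden-layer outputs
    y = sigmoid(np.dot(W_output, a))      # output-layer outputs

    # Equation 3: output-layer error terms
    delta_out = y * (1 - y) * (t - y)
    # Equation 4: hidden-layer error terms (weighted sum of downstream deltas)
    delta_hidden = a * (1 - a) * np.dot(W_output.T, delta_out)

    # Equation 5: w_ji <- w_ji + eta * delta_j * x_ji
    W_output += eta * np.outer(delta_out, a)
    W_hidden += eta * np.outer(delta_hidden, x)
    return W_hidden, W_output

np.random.seed(0)
Wh = np.random.uniform(-0.1, 0.1, (4, 3))   # 3 inputs -> 4 hidden nodes
Wo = np.random.uniform(-0.1, 0.1, (2, 4))   # 4 hidden -> 2 output nodes
train_one_sample(Wh, Wo, np.array([0.5, 0.1, -0.2]), np.array([0.9, 0.1]), 0.3)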
Derivation of back propagation algorithm
The back propagation algorithm is really just an application of the chain rule. However, this simple and seemingly obvious method was only invented and popularized nearly 30 years after Rosenblatt proposed the perceptron algorithm. On this point, Bengio commented:
Many seemingly obvious ideas become obvious only after the fact.
Next, we use the chain rule to derive the back propagation algorithm, i.e. equations 3, 4 and 5 from the previous section.
Warning: what follows is the most formula-heavy part of the article. Read it at your own discretion; nobody is forcing you.
Following the usual machine learning routine, we first define the objective function of the neural network and then use the stochastic gradient descent optimization algorithm to find the parameter values that minimize it.
We take the sum of squared errors over all output-layer nodes of the network as the objective function:

$$E_d = \frac{1}{2}\sum_{i\in outputs}(t_i - y_i)^2$$

where $E_d$ denotes the error on sample $d$, $t_i$ is the target value of output node $i$, and $y_i$ is its output value.
Then we optimize this objective function with the stochastic gradient descent algorithm introduced in the article Zero basis introduction to deep learning (2) - linear units and gradient descent:

$$w_{ji} \leftarrow w_{ji} - \eta\frac{\partial E_d}{\partial w_{ji}}$$

Stochastic gradient descent requires the partial derivative (i.e. the gradient) of the error $E_d$ with respect to each weight $w_{ji}$. How do we compute it?
Observing the figure above, we find that the weight $w_{ji}$ can only affect the rest of the network through the input value of node $j$. Let $net_j$ be the weighted input of node $j$, i.e.

$$net_j = \vec{w}_j\cdot\vec{x}_j = \sum_i w_{ji}\,x_{ji}$$

$E_d$ is a function of $net_j$, and $net_j$ is a function of $w_{ji}$. By the chain rule we get:

$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\,x_{ji}$$

where $x_{ji}$ is the input that node $i$ passes to node $j$, i.e. node $i$'s output value.

For $\frac{\partial E_d}{\partial net_j}$, we have to distinguish between output-layer and hidden-layer nodes.
Output layer weight training
For an output-layer node, $net_j$ can only affect the rest of the network through the output value $y_j$ of node $j$; that is, $E_d$ is a function of $y_j$, and $y_j$ is a function of $net_j$, where $y_j = \mathrm{sigmoid}(net_j)$. So we can apply the chain rule again:

$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial y_j}\frac{\partial y_j}{\partial net_j}$$

Consider the first factor:

$$\frac{\partial E_d}{\partial y_j} = \frac{\partial}{\partial y_j}\,\frac{1}{2}\sum_{i\in outputs}(t_i - y_i)^2 = -(t_j - y_j)$$

Consider the second factor:

$$\frac{\partial y_j}{\partial net_j} = \frac{\partial\,\mathrm{sigmoid}(net_j)}{\partial net_j} = y_j(1-y_j)$$

Substituting the two factors back in gives:

$$\frac{\partial E_d}{\partial net_j} = -(t_j - y_j)\,y_j(1-y_j)$$

If we define $\delta_j = -\frac{\partial E_d}{\partial net_j}$, i.e. the error term of a node is the negative of the partial derivative of the network error with respect to that node's input, then substituting gives:

$$\delta_j = (t_j - y_j)\,y_j(1-y_j)$$

which is exactly equation 3.

Plugging this derivation into the stochastic gradient descent update gives:

$$w_{ji} \leftarrow w_{ji} - \eta\frac{\partial E_d}{\partial w_{ji}} = w_{ji} + \eta\,\delta_j\,x_{ji}$$

which is exactly equation 5.
Hidden layer weight training
Now we derive $\frac{\partial E_d}{\partial net_j}$ for hidden-layer nodes.

First, we define $Downstream(j)$, the set of all nodes directly downstream of node $j$. For example, for node 4 the direct downstream nodes are nodes 8 and 9. We can see that $net_j$ can only influence $E_d$ by influencing $Downstream(j)$. Let $net_k$ be the weighted input of a downstream node $k$ of node $j$; then $E_d$ is a function of the $net_k$, and each $net_k$ is a function of $net_j$. Because there are several $net_k$, we apply the total derivative formula and derive:

$$\frac{\partial E_d}{\partial net_j} = \sum_{k\in Downstream(j)}\frac{\partial E_d}{\partial net_k}\frac{\partial net_k}{\partial net_j} = \sum_{k\in Downstream(j)}(-\delta_k)\frac{\partial net_k}{\partial a_j}\frac{\partial a_j}{\partial net_j} = \sum_{k\in Downstream(j)}(-\delta_k)\,w_{kj}\,a_j(1-a_j) = -a_j(1-a_j)\sum_{k\in Downstream(j)}\delta_k w_{kj}$$

Because $\delta_j = -\frac{\partial E_d}{\partial net_j}$, substituting gives:

$$\delta_j = a_j(1-a_j)\sum_{k\in Downstream(j)}\delta_k w_{kj}$$

which is exactly equation 4.
——Math formula alarm cleared——
So far we have derived the back propagation algorithm. Note that the training rules we just derived assume a sigmoid activation function, a sum-of-squares error, a fully connected network and the stochastic gradient descent optimization algorithm. With a different activation function, a different error measure, a different connection structure or a different optimization algorithm, the specific training rules will also be different. In every case, however, the training rules are derived in the same way, using the chain rule.
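For example (our own illustration, not part of the original derivation), if the activation function were tanh instead of sigmoid, only the derivative factor in the error terms would change, because $\tanh'(x) = 1 - \tanh^2(x)$:

$$\delta_i = (1 - y_i^2)(t_i - y_i) \quad \text{(output layer)}, \qquad \delta_i = (1 - a_i^2)\sum_{k\in Downstream(i)} w_{ki}\,\delta_k \quad \text{(hidden layer)}$$

The weight update rule of equation 5 stays the same.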
Implementation of neural network
For the complete code, please refer to GitHub: https://github.com/hanbt/learn_dl/blob/master/bp.py (python2.7)
Now, we need to implement a basic fully connected neural network according to the previous algorithm, which does not require much code. We still use object-oriented design here.
First, let's make a basic model:
As shown in the figure above, the neural network can be decomposed into five domain objects:

- Network: the neural network object, which provides the external API. It consists of several Layer objects and Connection objects.
- Layer: a layer object, which consists of several nodes.
- Node: a node object, which computes and records the node's own information (such as its output value $a$ and error term $\delta$) as well as the upstream and downstream connections associated with the node.
- Connection: each Connection object records the weight of one connection.
- Connections: simply a collection of Connection objects, providing a few collection operations.
Node implementation is as follows:
# The Node class records and maintains the node's own information and the
# upstream and downstream connections related to this node, and implements
# the calculation of the output value and the error term.
class Node(object):
    def __init__(self, layer_index, node_index):
        '''
        Construct a node object.
        layer_index: index of the layer the node belongs to
        node_index: index of the node within the layer
        '''
        self.layer_index = layer_index
        self.node_index = node_index
        self.downstream = []
        self.upstream = []
        self.output = 0
        self.delta = 0

    def set_output(self, output):
        '''
        Set the node's output value. Used when the node belongs to the input layer.
        '''
        self.output = output

    def append_downstream_connection(self, conn):
        '''
        Add a connection to a downstream node.
        '''
        self.downstream.append(conn)

    def append_upstream_connection(self, conn):
        '''
        Add a connection to an upstream node.
        '''
        self.upstream.append(conn)

    def calc_output(self):
        '''
        Calculate the node's output according to equation 1.
        '''
        output = reduce(lambda ret, conn: ret + conn.upstream_node.output * conn.weight, self.upstream, 0)
        self.output = sigmoid(output)

    def calc_hidden_layer_delta(self):
        '''
        When the node belongs to a hidden layer, calculate delta according to equation 4.
        '''
        downstream_delta = reduce(
            lambda ret, conn: ret + conn.downstream_node.delta * conn.weight,
            self.downstream, 0.0)
        self.delta = self.output * (1 - self.output) * downstream_delta

    def calc_output_layer_delta(self, label):
        '''
        When the node belongs to the output layer, calculate delta according to equation 3.
        '''
        self.delta = self.output * (1 - self.output) * (label - self.output)

    def __str__(self):
        '''
        Print node information.
        '''
        node_str = '%u-%u: output: %f delta: %f' % (self.layer_index, self.node_index, self.output, self.delta)
        downstream_str = reduce(lambda ret, conn: ret + '\n\t' + str(conn), self.downstream, '')
        upstream_str = reduce(lambda ret, conn: ret + '\n\t' + str(conn), self.upstream, '')
        return node_str + '\n\tdownstream:' + downstream_str + '\n\tupstream:' + upstream_str
The ConstNode object implements a node whose output is the constant 1 (used to compute the bias term):
class ConstNode(object):
    def __init__(self, layer_index, node_index):
        '''
        Construct a node object.
        layer_index: index of the layer the node belongs to
        node_index: index of the node within the layer
        '''
        self.layer_index = layer_index
        self.node_index = node_index
        self.downstream = []
        self.output = 1

    def append_downstream_connection(self, conn):
        '''
        Add a connection to a downstream node.
        '''
        self.downstream.append(conn)

    def calc_hidden_layer_delta(self):
        '''
        When the node belongs to a hidden layer, calculate delta according to equation 4.
        '''
        downstream_delta = reduce(
            lambda ret, conn: ret + conn.downstream_node.delta * conn.weight,
            self.downstream, 0.0)
        self.delta = self.output * (1 - self.output) * downstream_delta

    def __str__(self):
        '''
        Print node information.
        '''
        node_str = '%u-%u: output: 1' % (self.layer_index, self.node_index)
        downstream_str = reduce(lambda ret, conn: ret + '\n\t' + str(conn), self.downstream, '')
        return node_str + '\n\tdownstream:' + downstream_str
The Layer object is responsible for initializing a layer. It also acts as the container for the Node objects and provides operations on the Node collection.
class Layer(object):
    def __init__(self, layer_index, node_count):
        '''
        Initialize a layer.
        layer_index: layer index
        node_count: number of nodes in the layer
        '''
        self.layer_index = layer_index
        self.nodes = []
        for i in range(node_count):
            self.nodes.append(Node(layer_index, i))
        self.nodes.append(ConstNode(layer_index, node_count))

    def set_output(self, data):
        '''
        Set the layer's output. Used when the layer is the input layer.
        '''
        for i in range(len(data)):
            self.nodes[i].set_output(data[i])

    def calc_output(self):
        '''
        Calculate the layer's output vector.
        '''
        for node in self.nodes[:-1]:
            node.calc_output()

    def dump(self):
        '''
        Print layer information.
        '''
        for node in self.nodes:
            print node
The Connection object is mainly responsible for recording the weight of the Connection and the upstream and downstream nodes associated with the Connection.
class Connection(object):
    def __init__(self, upstream_node, downstream_node):
        '''
        Initialize the connection; the weight is initialized to a small random number.
        upstream_node: the connection's upstream node
        downstream_node: the connection's downstream node
        '''
        self.upstream_node = upstream_node
        self.downstream_node = downstream_node
        self.weight = random.uniform(-0.1, 0.1)
        self.gradient = 0.0

    def calc_gradient(self):
        '''
        Calculate the gradient.
        '''
        self.gradient = self.downstream_node.delta * self.upstream_node.output

    def get_gradient(self):
        '''
        Get the current gradient.
        '''
        return self.gradient

    def update_weight(self, rate):
        '''
        Update the weight according to the gradient descent algorithm.
        '''
        self.calc_gradient()
        self.weight += rate * self.gradient

    def __str__(self):
        '''
        Print connection information.
        '''
        return '(%u-%u) -> (%u-%u) = %f' % (
            self.upstream_node.layer_index,
            self.upstream_node.node_index,
            self.downstream_node.layer_index,
            self.downstream_node.node_index,
            self.weight)
Connections object that provides Connection collection operations.
class Connections(object):
    def __init__(self):
        self.connections = []

    def add_connection(self, connection):
        self.connections.append(connection)

    def dump(self):
        for conn in self.connections:
            print conn
The Network object, which provides the external API:
class Network(object):
    def __init__(self, layers):
        '''
        Initialize a fully connected neural network.
        layers: a list giving the number of nodes in each layer of the network
        '''
        self.connections = Connections()
        self.layers = []
        layer_count = len(layers)
        for i in range(layer_count):
            self.layers.append(Layer(i, layers[i]))
        for layer in range(layer_count - 1):
            connections = [Connection(upstream_node, downstream_node)
                           for upstream_node in self.layers[layer].nodes
                           for downstream_node in self.layers[layer + 1].nodes[:-1]]
            for conn in connections:
                self.connections.add_connection(conn)
                conn.downstream_node.append_upstream_connection(conn)
                conn.upstream_node.append_downstream_connection(conn)

    def train(self, labels, data_set, rate, iteration):
        '''
        Train the neural network.
        labels: array of training sample labels; each element is one sample's label
        data_set: two-dimensional array of training sample features; each element is one sample's features
        rate: learning rate
        iteration: number of training passes over the data set
        '''
        for i in range(iteration):
            for d in range(len(data_set)):
                self.train_one_sample(labels[d], data_set[d], rate)

    def train_one_sample(self, label, sample, rate):
        '''
        Internal function: train the network with one sample.
        '''
        self.predict(sample)
        self.calc_delta(label)
        self.update_weight(rate)

    def calc_delta(self, label):
        '''
        Internal function: calculate each node's delta.
        '''
        output_nodes = self.layers[-1].nodes
        for i in range(len(label)):
            output_nodes[i].calc_output_layer_delta(label[i])
        for layer in self.layers[-2::-1]:
            for node in layer.nodes:
                node.calc_hidden_layer_delta()

    def update_weight(self, rate):
        '''
        Internal function: update the weight of each connection.
        '''
        for layer in self.layers[:-1]:
            for node in layer.nodes:
                for conn in node.downstream:
                    conn.update_weight(rate)

    def calc_gradient(self):
        '''
        Internal function: calculate the gradient of each connection.
        '''
        for layer in self.layers[:-1]:
            for node in layer.nodes:
                for conn in node.downstream:
                    conn.calc_gradient()

    def get_gradient(self, label, sample):
        '''
        Compute the gradient on each connection of the network for one sample.
        label: sample label
        sample: sample input
        '''
        self.predict(sample)
        self.calc_delta(label)
        self.calc_gradient()

    def predict(self, sample):
        '''
        Predict the output value for an input sample.
        sample: array of the sample's features, i.e. the network's input vector
        '''
        self.layers[0].set_output(sample)
        for i in range(1, len(self.layers)):
            self.layers[i].calc_output()
        return map(lambda node: node.output, self.layers[-1].nodes[:-1])

    def dump(self):
        '''
        Print network information.
        '''
        for layer in self.layers:
            layer.dump()
Gradient check
How do you make sure the neural network you wrote has no bugs? This is actually a very important question. On the one hand, you may work hard on an algorithm and get disappointing results: is the algorithm itself wrong, or is the implementation buggy? Locating such a problem can take a lot of time and energy. On the other hand, because of the complexity of a neural network, we can hardly know its inputs and outputs in advance, so development practices such as TDD (test-driven development) do not seem feasible.

The answer is to use gradient checking to confirm that the program is correct. The idea of gradient checking is as follows:
For the gradient descent algorithm:

$$w_{ji} \leftarrow w_{ji} - \eta\frac{\partial E_d}{\partial w_{ji}}$$

the key is that $\frac{\partial E_d}{\partial w_{ji}}$, the partial derivative of $E_d$ with respect to $w_{ji}$, must be computed correctly. Recall the definition of the derivative:

$$f'(\theta) = \lim_{\epsilon\to 0}\frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}$$

For any function $f$ we can use the right-hand side to approximate its derivative. We regard $E_d$ as a function of the weight $w_{ji}$, i.e. $E_d = E_d(w_{ji})$, so by the definition above $\frac{\partial E_d}{\partial w_{ji}}$ should equal:

$$\lim_{\epsilon\to 0}\frac{E_d(w_{ji}+\epsilon)-E_d(w_{ji}-\epsilon)}{2\epsilon}$$

If we set $\epsilon$ to a small number (for example $10^{-4}$), the above can be written as:

$$\frac{\partial E_d}{\partial w_{ji}} \approx \frac{E_d(w_{ji}+\epsilon)-E_d(w_{ji}-\epsilon)}{2\epsilon} \qquad \text{(equation 6)}$$

We can use equation 6 to compute the gradient $\frac{\partial E_d}{\partial w_{ji}}$ numerically and compare it with the gradient value computed by our neural network code. If the difference between the two is very small, our code is correct.
Below is the code for the gradient check. To check whether the gradient of a weight $w_{ji}$ is correct, we take the following steps:
- First, train the neural network on one sample, which gives us the gradient of every weight.
- Add a small value ($10^{-4}$) to $w_{ji}$ and recompute the network's error on this sample.
- Subtract a small value ($10^{-4}$) from $w_{ji}$ and recompute the network's error on this sample.
- Compute the expected gradient value according to equation 6 and compare it with the gradient obtained in the first step. They should agree almost exactly (at least to 4 significant digits).
Of course, we can repeat the above process to check each weight. Multiple samples can also be used to repeat the inspection.
def gradient_check(network, sample_feature, sample_label):
    '''
    Gradient check.
    network: the neural network object
    sample_feature: the sample's features
    sample_label: the sample's label
    '''
    # Compute the network error (halved sum of squared errors)
    network_error = lambda vec1, vec2: \
        0.5 * reduce(lambda a, b: a + b,
                     map(lambda v: (v[0] - v[1]) * (v[0] - v[1]),
                         zip(vec1, vec2)))

    # Get the gradient of each connection of the network for the current sample
    # (label first, then sample, matching get_gradient's signature)
    network.get_gradient(sample_label, sample_feature)

    # Check the gradient of each weight
    for conn in network.connections.connections:
        # Get the gradient stored on this connection
        actual_gradient = conn.get_gradient()

        # Add a small value to the weight and compute the network error
        epsilon = 0.0001
        conn.weight += epsilon
        error1 = network_error(network.predict(sample_feature), sample_label)

        # Subtract a small value and compute the network error
        conn.weight -= 2 * epsilon  # it was increased once above, so subtract twice here
        error2 = network_error(network.predict(sample_feature), sample_label)

        # Estimate the gradient numerically via equation 6 (error2 uses w - epsilon and
        # error1 uses w + epsilon, matching the sign convention of calc_gradient)
        expected_gradient = (error2 - error1) / (2 * epsilon)

        # Restore the original weight
        conn.weight += epsilon

        # Print both values for comparison
        print 'expected gradient: \t%f\nactual gradient: \t%f' % (
            expected_gradient, actual_gradient)
Neural networks in practice: handwritten digit recognition
For this task we use the MNIST dataset, which is very popular in the field. MNIST has 60,000 training samples of handwritten digits. We will use it to train our neural network and then use the trained network to recognize handwritten digits.
Handwritten digit recognition is a relatively simple task. A digit can only be one of 0-9, so this is a 10-class classification problem.
Determining the hyperparameters
We first need to decide the number of layers of the network and the number of nodes in each layer. For the first question there is in fact no theoretical method; everyone just picks something based on experience, and if you have no experience you just pick a value. You can then try a few different values, train networks with different numbers of layers, and see which works best. Now you may understand why deep learning is called a craft: some of it is exasperating, and some of it genuinely takes skill.

Still, there are some basic principles we can hold on to. We know that more layers make the network more capable, but we also know that the more layers there are, the harder the network is to train. For a fully connected network, it is best not to use more than three hidden layers. So we can first try a network with just one hidden layer; after all, a smaller model trains faster (and when you are just starting to play with models, you want to see results quickly).
The number of input-layer nodes is determined for us. Each MNIST training sample is a 28×28 image with 784 pixels in total, so the input layer has 784 nodes, one per pixel.
The number of output-layer nodes is also determined. Because this is a 10-class problem, we use 10 nodes, one per class. Among the 10 output nodes, the class corresponding to the node with the largest output value is the model's prediction.
The number of hidden-layer nodes is not fixed; anything from 1 to a million is possible. Here are a few empirical formulas:

$$m = \sqrt{n+l} + \alpha$$
$$m = \log_2 n$$
$$m = \sqrt{n\,l}$$

where $m$ is the number of hidden-layer nodes, $n$ the number of input-layer nodes, $l$ the number of output-layer nodes, and $\alpha$ a constant between 1 and 10.
Therefore, we can first set the number of hidden layer nodes according to the above formula. If we have time, we can set different node numbers and train them separately to see which effect is the best. Let's take one first and set the number of hidden layer nodes to 300.
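As a quick sanity check of the empirical formulas above (our own arithmetic, using $n = 784$ and $l = 10$): $\sqrt{784+10}+\alpha \approx 28+\alpha$, $\log_2 784 \approx 9.6$, and $\sqrt{784\times 10} \approx 89$. They give quite different numbers, which is a reminder that such formulas are only rough starting points; trying several values and comparing results is what actually settles the choice.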
A three-layer 784×300×10 fully connected network has $300\times(784+1) + 10\times(300+1) = 238510$ parameters in total! Neural networks are powerful partly because they provide a very simple way to deploy a huge number of parameters; today there are even super-large-scale neural networks with tens of billions of parameters trained on hundreds of billions of samples. But because MNIST has only 60,000 training samples, too many parameters make it easy to overfit, which actually hurts performance.
Model training and evaluation
The MNIST dataset also contains 10,000 test samples. We first train the network with the 60,000 training samples, then test it with the test samples and compute the recognition error rate:

$$\text{error rate} = \frac{\text{number of misclassified samples}}{\text{total number of samples}}$$

We evaluate the accuracy every 10 epochs of training, and stop training once the accuracy starts to drop (i.e. overfitting sets in).
Code implementation
First, we have to turn the MNIST data into a form the neural network can accept. The file format of the MNIST training set is described on the official website and will not be repeated here. Each training sample is a 28×28 image; we flatten it row by row into a 784-dimensional vector. Each label is a value from 0 to 9; we convert it into a 10-dimensional one-hot-style vector: if the label value is $n$, we set dimension $n$ of the vector (dimensions numbered from 0) to 0.9 and all other dimensions to 0.1. For example, the vector [0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] represents the value 2.
The following is the code for processing MNIST data:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import struct
from bp import *
from datetime import datetime


# Base class for data loaders
class Loader(object):
    def __init__(self, path, count):
        '''
        Initialize the loader.
        path: path of the data file
        count: number of samples in the file
        '''
        self.path = path
        self.count = count

    def get_file_content(self):
        '''
        Read the file content.
        '''
        f = open(self.path, 'rb')
        content = f.read()
        f.close()
        return content

    def to_int(self, byte):
        '''
        Convert an unsigned byte character to an integer.
        '''
        return struct.unpack('B', byte)[0]


# Image data loader
class ImageLoader(Loader):
    def get_picture(self, content, index):
        '''
        Internal function: read one image from the file content.
        '''
        start = index * 28 * 28 + 16
        picture = []
        for i in range(28):
            picture.append([])
            for j in range(28):
                picture[i].append(
                    self.to_int(content[start + i * 28 + j]))
        return picture

    def get_one_sample(self, picture):
        '''
        Internal function: convert an image into the sample's input vector.
        '''
        sample = []
        for i in range(28):
            for j in range(28):
                sample.append(picture[i][j])
        return sample

    def load(self):
        '''
        Load the data file and get the input vectors of all samples.
        '''
        content = self.get_file_content()
        data_set = []
        for index in range(self.count):
            data_set.append(
                self.get_one_sample(
                    self.get_picture(content, index)))
        return data_set


# Label data loader
class LabelLoader(Loader):
    def load(self):
        '''
        Load the data file and get the label vectors of all samples.
        '''
        content = self.get_file_content()
        labels = []
        for index in range(self.count):
            labels.append(self.norm(content[index + 8]))
        return labels

    def norm(self, label):
        '''
        Internal function: convert a label value into a 10-dimensional label vector.
        '''
        label_vec = []
        label_value = self.to_int(label)
        for i in range(10):
            if i == label_value:
                label_vec.append(0.9)
            else:
                label_vec.append(0.1)
        return label_vec


def get_training_data_set():
    '''
    Get the training data set.
    '''
    image_loader = ImageLoader('train-images-idx3-ubyte', 60000)
    label_loader = LabelLoader('train-labels-idx1-ubyte', 60000)
    return image_loader.load(), label_loader.load()


def get_test_data_set():
    '''
    Get the test data set.
    '''
    image_loader = ImageLoader('t10k-images-idx3-ubyte', 10000)
    label_loader = LabelLoader('t10k-labels-idx1-ubyte', 10000)
    return image_loader.load(), label_loader.load()
The network's output is a 10-dimensional vector. If the element with the largest value is element $n$ (numbered from 0), then the network's recognition result is $n$. Here is the code:
def get_result(vec):
    max_value_index = 0
    max_value = 0
    for i in range(len(vec)):
        if vec[i] > max_value:
            max_value = vec[i]
            max_value_index = i
    return max_value_index
We use the error rate to evaluate the network. The following is the code implementation:
def evaluate(network, test_data_set, test_labels):
    error = 0
    total = len(test_data_set)

    for i in range(total):
        label = get_result(test_labels[i])
        predict = get_result(network.predict(test_data_set[i]))
        if label != predict:
            error += 1
    return float(error) / float(total)
Finally, we implement our training strategy: evaluate the accuracy once every 10 rounds of training, and terminate the training when the accuracy begins to decline. The following is the code implementation:
def now():
    # Small helper for timestamped log lines
    return datetime.now().strftime('%c')


def train_and_evaluate():
    last_error_ratio = 1.0
    epoch = 0
    train_data_set, train_labels = get_training_data_set()
    test_data_set, test_labels = get_test_data_set()
    network = Network([784, 300, 10])
    while True:
        epoch += 1
        network.train(train_labels, train_data_set, 0.3, 1)
        print '%s epoch %d finished' % (now(), epoch)
        if epoch % 10 == 0:
            error_ratio = evaluate(network, test_data_set, test_labels)
            print '%s after epoch %d, error ratio is %f' % (now(), epoch, error_ratio)
            if error_ratio > last_error_ratio:
                break
            else:
                last_error_ratio = error_ratio


if __name__ == '__main__':
    train_and_evaluate()
I tested this on my machine: one epoch takes about 9000 seconds, so the code clearly needs a lot of performance optimization (such as vectorized programming). Training takes a long time; you can upload the script to a server and run it inside a tmux session. To avoid losing work to an unexpected interruption, we save the learned parameter values to disk every 10 epochs so that training can be resumed later. (Code omitted.)
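The checkpointing code is omitted in the original; below is one possible minimal sketch (ours, not the author's) that saves and restores the weight of every connection of the object-oriented Network with pickle, so an interrupted run can be resumed:

import cPickle as pickle  # Python 2; use "import pickle" on Python 3

def save_weights(network, path):
    # Save the weight of every connection, in creation order
    weights = [conn.weight for conn in network.connections.connections]
    with open(path, 'wb') as f:
        pickle.dump(weights, f)

def load_weights(network, path):
    # Restore weights into a Network built with the same layer sizes
    with open(path, 'rb') as f:
        weights = pickle.load(f)
    for conn, w in zip(network.connections.connections, weights):
        conn.weight = w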
Vectorized programming
For the complete code, please refer to GitHub: https://github.com/hanbt/learn_dl/blob/master/fc.py (python2.7)
After such a long training run, we may well think: there must be a better way! Yes, and that means saying goodbye to object-oriented programming and adopting a style better suited to deep learning algorithms: vectorized programming. There are two main reasons. First, we don't really need objects such as Node and Connection; we can implement the mathematical computation directly. Second, underlying libraries optimize vector operations (sometimes with dedicated hardware such as GPUs), so the program runs much faster. For this reason, in the world of deep learning we always try to express computations as vector operations. A good programmer does not cling to whatever paradigm happens to be familiar, but learns and uses the most suitable one.
Next, we re-implement the fully connected neural network above in a vectorized style.
First, we need to express all the calculations in the form of vectors. For fully connected neural networks, there are three main calculation formulas.
For the forward computation, equation 2 is already a vectorized expression:

$$\vec{a} = f(W\cdot\vec{x}) \qquad \text{(equation 2)}$$

where $f$ denotes the sigmoid function.
For the backward computation, we need to express equations 3 and 4 with vectors:

$$\vec{\delta} = \vec{y}\circ(1-\vec{y})\circ(\vec{t}-\vec{y}) \qquad \text{(equation 7, output layer)}$$
$$\vec{\delta}^{(l)} = \vec{a}^{(l)}\circ(1-\vec{a}^{(l)})\circ W^T\vec{\delta}^{(l+1)} \qquad \text{(equation 8, hidden layer)}$$

In equation 8, $\vec{\delta}^{(l)}$ is the error-term vector of layer $l$, $W^T$ is the transpose of the weight matrix $W$, and $\circ$ denotes element-wise multiplication.
We also need vectorized expressions for the gradients of the weight matrix $W$ and the bias vector $\vec{b}$, i.e. a vectorized version of equation 5:

$$w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\,x_{ji}$$

The corresponding vectorized form is:

$$W \leftarrow W + \eta\,\vec{\delta}\,\vec{x}^T$$

and the vectorized update of the bias term is:

$$\vec{b} \leftarrow \vec{b} + \eta\,\vec{\delta}$$
Now, based on the formulas above, we implement a FullConnectedLayer class that performs the forward and backward computation of a fully connected layer:
import numpy as np


# Implementation of a fully connected layer
class FullConnectedLayer(object):
    def __init__(self, input_size, output_size,
                 activator):
        '''
        Constructor.
        input_size: dimension of this layer's input vector
        output_size: dimension of this layer's output vector
        activator: activation function
        '''
        self.input_size = input_size
        self.output_size = output_size
        self.activator = activator
        # Weight matrix W
        self.W = np.random.uniform(-0.1, 0.1,
                                   (output_size, input_size))
        # Bias term b
        self.b = np.zeros((output_size, 1))
        # Output vector
        self.output = np.zeros((output_size, 1))

    def forward(self, input_array):
        '''
        Forward computation.
        input_array: input vector; its dimension must equal input_size
        '''
        # Equation 2
        self.input = input_array
        self.output = self.activator.forward(
            np.dot(self.W, input_array) + self.b)

    def backward(self, delta_array):
        '''
        Backward computation of the gradients of W and b.
        delta_array: error terms passed back from the next layer
        '''
        # Equation 8
        self.delta = self.activator.backward(self.input) * np.dot(
            self.W.T, delta_array)
        self.W_grad = np.dot(delta_array, self.input.T)
        self.b_grad = delta_array

    def update(self, learning_rate):
        '''
        Update the weights with the gradient descent algorithm.
        '''
        self.W += learning_rate * self.W_grad
        self.b += learning_rate * self.b_grad
The above class replaces the original Layer, Node, Connection and other classes in one fell swoop, which not only makes the code easier to understand, but also runs hundreds of times faster.
Now, we modify the Network class to use FullConnectedLayer:
# Sigmoid activation function class
class SigmoidActivator(object):
    def forward(self, weighted_input):
        return 1.0 / (1.0 + np.exp(-weighted_input))

    def backward(self, output):
        return output * (1 - output)


# Neural network
class Network(object):
    def __init__(self, layers):
        '''
        Constructor.
        '''
        self.layers = []
        for i in range(len(layers) - 1):
            self.layers.append(
                FullConnectedLayer(
                    layers[i], layers[i+1],
                    SigmoidActivator()
                )
            )

    def predict(self, sample):
        '''
        Predict with the neural network.
        sample: input sample
        '''
        output = sample
        for layer in self.layers:
            layer.forward(output)
            output = layer.output
        return output

    def train(self, labels, data_set, rate, epoch):
        '''
        Training function.
        labels: sample labels
        data_set: input samples
        rate: learning rate
        epoch: number of training epochs
        '''
        for i in range(epoch):
            for d in range(len(data_set)):
                self.train_one_sample(labels[d],
                                      data_set[d], rate)

    def train_one_sample(self, label, sample, rate):
        self.predict(sample)
        self.calc_gradient(label)
        self.update_weight(rate)

    def calc_gradient(self, label):
        delta = self.layers[-1].activator.backward(
            self.layers[-1].output
        ) * (label - self.layers[-1].output)
        for layer in self.layers[::-1]:
            layer.backward(delta)
            delta = layer.delta
        return delta

    def update_weight(self, rate):
        for layer in self.layers:
            layer.update(rate)
Now, the Network class is much cleaner. Let's train the MNIST dataset again with our new code.
Summary
By now you have completed another long stretch of the journey, and you should understand the basic principles of neural networks. If all went well, you can even implement one yourself and use it to solve real problems. If it felt hard, don't be discouraged: this article is an important watershed, and once you have fully digested it you can hold your own both with genuine beginners and with self-styled experts.

As an introduction to deep learning, this article also marks the end of the first half. In this half you have learned the basic concepts of machine learning and neural networks and can solve some simple problems (such as handwritten digit recognition, which from a traditional standpoint is actually not simple at all). Moreover, once the basic concepts are in place, the rest of the learning becomes much easier.

In the second half we will get to the "deep" part of deep learning. We have discussed neural networks, but not yet deep neural networks. Depth brings more powerful capabilities, and also more problems; without understanding these problems and their solutions, you cannot really claim to have gotten started with deep learning.

There are now many open-source neural network libraries whose capabilities go far beyond ours, so you will not need to implement your own network. We reinvented the wheel from scratch in the first half so that you would understand the basic principles of neural networks and could therefore pick up these tools very quickly. In the second half we change strategy: rather than starting from scratch, we will use existing tools wherever possible.

In the next article we introduce networks with different structures, such as the famous convolutional neural network (CNN). It has produced remarkable results in the fields of image and speech, and research applying it to natural language processing is also in full swing. In a sense, its success has greatly strengthened people's confidence in deep learning.
References

- Tom M. Mitchell, Machine Learning (Chinese translation by Zeng Huajun, China Machine Press)
- CS224N / Ling 284, Neural Networks for Named Entity Recognition
- LeCun et al., Gradient-Based Learning Applied to Document Recognition, 1998