Convolutional neural networks: overall structure, convolution layer, pooling layer, and a Python implementation

Posted by jennyp on Thu, 06 Jan 2022 06:19:20 +0100

1, Overall structure

In a fully connected neural network, every neuron in one layer is connected to every neuron in the adjacent layer; this is called full connection. Previously, full connection was implemented with the Affine layer.

For instance:

Fully connected neural network structure:

Structure of convolutional neural network CNN:

A CNN adds Conv (convolution) layers and Pooling layers: the earlier Affine-ReLU connections are replaced by Conv-ReLU-Pooling connections.

In a CNN, the layers close to the output still use the earlier Affine-ReLU combination, and the final output layer uses the Affine-Softmax combination. This is a common CNN structure.

2, Convolution layer

The problem with a fully connected network is that it ignores the shape of the data. An image is three-dimensional, but before it reaches a fully connected layer it must be flattened into one dimension, so the spatial information carried by the shape is discarded: a fully connected layer cannot exploit shape-related information.

The convolution layer solves this problem by passing shaped data, such as three-dimensional data, from layer to layer.

The convolution layer performs the convolution operation, analogous to filtering in image processing.

For the input data, the convolution operation slides the filter window across the input at a fixed interval; at each position it multiplies the filter elements by the corresponding input elements and sums them up, the so-called multiply-accumulate operation.
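As an illustration (not code from the text), this multiply-accumulate over a sliding window can be sketched naively; `conv2d_naive` is a hypothetical helper name:

```python
import numpy as np

def conv2d_naive(x, w):
    """Naive 2-D convolution (cross-correlation), stride 1, no padding."""
    H, W = x.shape
    FH, FW = w.shape
    out_h, out_w = H - FH + 1, W - FW + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # multiply-accumulate over the current window
            out[i, j] = np.sum(x[i:i + FH, j:j + FW] * w)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
out = conv2d_naive(x, w)
print(out.shape)  # (2, 2)
```

A (3, 3) filter over a (4, 4) input yields a (2, 2) output, matching the padding discussion below.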

In the fully connected neural network, the parameters include weight and bias.

In a CNN, the values inside each filter are the weight parameters, and each filter also has a bias parameter. Training the CNN means finding these parameters so that it learns well.

Sometimes padding is applied before the convolution: fixed values (typically zeros) are placed around the input data. Padding is mainly used to adjust the output size. When a (3,3) filter is applied to (4,4) input data, the output is (2,2), two elements smaller than the input. If the output shrank with every convolution, at some point it could no longer be convolved and the network could not pass data forward. With padding of 1 in this example, the convolution output is also (4,4), so the convolution can pass data to the next layer while keeping the spatial size unchanged.

Increasing the stride makes the output smaller; increasing the padding makes the output larger.

Suppose the input size is (H, W), the filter size is (FH, FW), the output size is (OH, OW), the padding is P, and the stride is S. The output size is then:

OH = (H + 2P - FH) / S + 1
OW = (W + 2P - FW) / S + 1
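A quick check of this formula in code; `conv_output_size` is an illustrative helper, not part of the implementation shown later:

```python
def conv_output_size(size, filter_size, pad=0, stride=1):
    # OH = (H + 2P - FH) / S + 1, applied per spatial dimension
    return (size + 2 * pad - filter_size) // stride + 1

# (4, 4) input, (3, 3) filter, no padding: the output shrinks to (2, 2)
print(conv_output_size(4, 3))         # 2
# with padding 1 the spatial size is preserved: (4, 4) -> (4, 4)
print(conv_output_size(4, 3, pad=1))  # 4
```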

Convolution of multidimensional data:

Input data with C channels, height H, and width W has shape (C, H, W).

A filter with C channels, height FH, and width FW has shape (C, FH, FW).

After the convolution, one feature map is output. You can picture a small block sliding inside a large block: because the channel count C of the filter matches that of the input, the convolution result is two-dimensional.

To produce multiple output channels, use multiple filters, as shown in the figure below. The output is then passed to the next layer as a block of feature maps.

Adding a bias, as shown in the following figure: the bias is a small block of FN values, one per filter.

If batch processing is added, that is, N pieces of data are processed at once, the data stored at each layer gains one more dimension, as shown in the following figure.
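The shape bookkeeping above can be sketched in NumPy. The names `x`, `w`, `b` and the loop-based convolution below are illustrative assumptions, not the article's implementation (which uses im2col, introduced later):

```python
import numpy as np

N, C, H, W = 2, 3, 7, 7   # batch of N inputs, each with C channels
FN, FH, FW = 5, 3, 3      # FN filters, each spanning all C input channels
S, P = 1, 0               # stride and padding

x = np.random.randn(N, C, H, W)
w = np.random.randn(FN, C, FH, FW)
b = np.random.randn(FN)   # one bias value per filter -> FN bias values

OH = (H + 2 * P - FH) // S + 1
OW = (W + 2 * P - FW) // S + 1
out = np.zeros((N, FN, OH, OW))
for i in range(OH):
    for j in range(OW):
        patch = x[:, :, i:i + FH, j:j + FW]  # (N, C, FH, FW)
        # each filter collapses its (C, FH, FW) block to one number
        out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
out += b.reshape(1, FN, 1, 1)
print(out.shape)  # (2, 5, 5, 5): (N, FN, OH, OW)
```

The output carries one extra batch dimension N in front of the (FN, OH, OW) block described above.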

3, Pooling layer

Pooling is an operation that shrinks the spatial (height and width) dimensions. Max pooling takes the maximum value of each target region and places it in the corresponding element of the output data. The pooling layer has no parameters to learn, and the number of channels does not change. Pooling is robust to small shifts in the input data: when the input shifts slightly, the pooling layer often returns the same output as for the unshifted input.
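A minimal max-pooling example: a 2x2 window with stride 2, done here with a reshape trick rather than the im2col approach used later in the article:

```python
import numpy as np

x = np.array([[1., 2., 5., 0.],
              [3., 4., 1., 2.],
              [0., 1., 2., 3.],
              [6., 0., 1., 4.]])

# 2x2 max pooling, stride 2: take the max of each non-overlapping window.
# reshape groups the array into 2x2 blocks; axes 1 and 3 index within a block.
out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(out)
# [[4. 5.]
#  [6. 4.]]
```

Each output element is the maximum of one 2x2 region, and no parameter is involved.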

4, Python implementation of the convolution and pooling layers

For convolution, the im2col function expands each small block (filter window) of the large input block into one row, so the convolution becomes a matrix product with the flattened filter, as shown in the figure below.

For pooling, im2col expands the input in the same way: each small block inside the large block becomes one row (the figure is omitted here).

im2col implementation code:

import numpy as np

def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    """Expand the input into a 2-D matrix of flattened filter windows.

    input_data : 4-D input array of shape (batch size, channels, height, width)
    filter_h : filter height
    filter_w : filter width
    stride : stride
    pad : padding

    Returns col, a 2-D array of shape (N*out_h*out_w, C*filter_h*filter_w).
    """
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1

    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))

    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            # gather every stride-th pixel seen by the (y, x) filter cell
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col
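Assuming the im2col above behaves as described (it is reproduced below so the snippet runs on its own), a quick sanity check of the output shape:

```python
import numpy as np

def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    """Expand the input into rows of flattened filter windows (as above)."""
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
    return col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)

x1 = np.random.rand(1, 3, 7, 7)       # 1 image, 3 channels, 7x7
col1 = im2col(x1, 5, 5, stride=1, pad=0)
print(col1.shape)  # (9, 75): 3x3 output positions, each row one 3*5*5 patch

x2 = np.random.rand(10, 3, 7, 7)      # batch of 10
col2 = im2col(x2, 5, 5, stride=1, pad=0)
print(col2.shape)  # (90, 75)
```

Each row is one filter-sized patch flattened to C*FH*FW elements; the number of rows is N times the number of output positions.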

During backpropagation, the inverse of the im2col processing is performed using the col2im function.

def col2im(col, input_shape, filter_h, filter_w, stride=1, pad=0):
    """Inverse of im2col: fold a 2-D column matrix back into image form.

    col : 2-D array produced by im2col (or a gradient of the same shape)
    input_shape : shape of the original input data (example: (10, 1, 28, 28))
    filter_h : filter height
    filter_w : filter width
    stride : stride
    pad : padding
    """
    N, C, H, W = input_shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    col = col.reshape(N, out_h, out_w, C, filter_h, filter_w).transpose(0, 3, 4, 5, 1, 2)

    img = np.zeros((N, C, H + 2*pad + stride - 1, W + 2*pad + stride - 1))
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            # overlapping windows accumulate, so use += rather than assignment
            img[:, :, y:y_max:stride, x:x_max:stride] += col[:, :, y, x, :, :]

    return img[:, :, pad:H + pad, pad:W + pad]
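To see why the accumulation with += matters, feed col2im a column matrix of all ones: the result counts how many windows cover each pixel (col2im is restated below so the sketch runs standalone):

```python
import numpy as np

def col2im(col, input_shape, filter_h, filter_w, stride=1, pad=0):
    """Fold a 2-D column matrix back into image form (as above)."""
    N, C, H, W = input_shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    col = col.reshape(N, out_h, out_w, C, filter_h, filter_w).transpose(0, 3, 4, 5, 1, 2)
    img = np.zeros((N, C, H + 2*pad + stride - 1, W + 2*pad + stride - 1))
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            img[:, :, y:y_max:stride, x:x_max:stride] += col[:, :, y, x, :, :]
    return img[:, :, pad:H + pad, pad:W + pad]

# A col of all ones: every window contributes 1 to each pixel it covers,
# so the result counts how many 2x2 windows overlap each pixel of a 3x3 input.
col = np.ones((4, 4))  # 2x2 output positions x (1*2*2) patch elements
img = col2im(col, (1, 1, 3, 3), 2, 2, stride=1, pad=0)
print(img[0, 0])
# [[1. 2. 1.]
#  [2. 4. 2.]
#  [1. 2. 1.]]
```

The center pixel is covered by all four 2x2 windows, which is exactly the accumulation backpropagation needs when windows overlap.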

Implementation of the convolution layer and pooling layer:

class Convolution:
    def __init__(self, W, b, stride=1, pad=0):
        self.W = W
        self.b = b
        self.stride = stride
        self.pad = pad
        # Intermediate data (used in backward)
        self.x = None   
        self.col = None
        self.col_W = None
        # Gradient of weight and bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        FN, C, FH, FW = self.W.shape
        N, C, H, W = x.shape
        out_h = 1 + int((H + 2*self.pad - FH) / self.stride)
        out_w = 1 + int((W + 2*self.pad - FW) / self.stride)

        col = im2col(x, FH, FW, self.stride, self.pad)
        col_W = self.W.reshape(FN, -1).T

        out =, col_W) + self.b
        out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)

        self.x = x
        self.col = col
        self.col_W = col_W

        return out

    def backward(self, dout):
        FN, C, FH, FW = self.W.shape
        dout = dout.transpose(0,2,3,1).reshape(-1, FN)

        self.db = np.sum(dout, axis=0)
        self.dW =, dout)
        self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)

        dcol =, self.col_W.T)
        dx = col2im(dcol, self.x.shape, FH, FW, self.stride, self.pad)

        return dx

class Pooling:
    def __init__(self, pool_h, pool_w, stride=1, pad=0):
        self.pool_h = pool_h
        self.pool_w = pool_w
        self.stride = stride
        self.pad = pad
        self.x = None
        self.arg_max = None

    def forward(self, x):
        N, C, H, W = x.shape
        out_h = int(1 + (H - self.pool_h) / self.stride)
        out_w = int(1 + (W - self.pool_w) / self.stride)

        col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
        col = col.reshape(-1, self.pool_h*self.pool_w)

        arg_max = np.argmax(col, axis=1)
        out = np.max(col, axis=1)
        out = out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)

        self.x = x
        self.arg_max = arg_max

        return out

    def backward(self, dout):
        dout = dout.transpose(0, 2, 3, 1)
        pool_size = self.pool_h * self.pool_w
        dmax = np.zeros((dout.size, pool_size))
        dmax[np.arange(self.arg_max.size), self.arg_max.flatten()] = dout.flatten()
        dmax = dmax.reshape(dout.shape + (pool_size,)) 
        dcol = dmax.reshape(dmax.shape[0] * dmax.shape[1] * dmax.shape[2], -1)
        dx = col2im(dcol, self.x.shape, self.pool_h, self.pool_w, self.stride, self.pad)
        return dx
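The key step in Pooling.backward is scattering the upstream gradient back to the argmax positions; a stripped-down illustration of that line with hand-made numbers (these values are not from the text):

```python
import numpy as np

# Two pooling windows, each flattened to a row of 4 candidates.
col = np.array([[1., 4., 2., 3.],
                [5., 0., 7., 6.]])
arg_max = np.argmax(col, axis=1)   # forward remembered [1, 2]
dout = np.array([10., 20.])        # upstream gradient, one value per window

# Backward: route each window's gradient to its max position, zero elsewhere.
dmax = np.zeros_like(col)
dmax[np.arange(arg_max.size), arg_max] = dout
print(dmax)
# [[ 0. 10.  0.  0.]
#  [ 0.  0. 20.  0.]]
```

Only the element that produced the forward maximum receives a gradient, which is exactly what `dmax[np.arange(self.arg_max.size), self.arg_max.flatten()] = dout.flatten()` does above.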

Topics: Python Machine Learning neural networks CNN