Implementing a simple convolutional neural network from scratch

Posted by jimpat on Mon, 07 Feb 2022 14:16:53 +0100

First, the convolution formula.
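The formula referred to here did not survive extraction; for reference, the standard definition of convolution, in its continuous and discrete forms, is:

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
\qquad
(f * g)[n] = \sum_{m} f[m]\, g[n - m]
```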

Maybe you know it, maybe you don't, or you have seen it without really understanding it. Either way, not knowing the meaning of this formula won't stop you from implementing a convolution. I won't go into its precise mathematical meaning here, since many people explain it more clearly and thoroughly than I could. What I want to show you is how to implement the convolution operation in a convolutional neural network.

When it comes to convolutional neural networks, the operations you hear about most are convolution, activation and pooling. Take VGG16, a classic network model: it is essentially built by stacking three operations, convolution + activation + pooling.

So what are they, and how are they implemented? This time, let's walk through their implementation with NumPy.


Convolution

Convolution in a neural network means sliding a convolution kernel over your image matrix. In the figure above, the gray matrix is the input image matrix, whose numbers represent pixel values, and the green matrix is the convolution kernel; the green area over the gray matrix is the region currently covered by the kernel. The distance the kernel moves at each step is called the stride; in the figure above the stride is 1. The operation itself is to multiply the kernel with the corresponding positions of the covered region and add up the products. The blue matrix at the end is the result of this convolution.
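The sliding multiply-and-add described above can be sketched in a few lines of NumPy. The 4x4 image and 3x3 kernel values here are made up to mirror the figure:

```python
import numpy as np

# Hypothetical 4x4 image and 3x3 kernel, as in the figure
image = np.array([[1, 2, 3, 0],
                  [4, 5, 6, 1],
                  [7, 8, 9, 2],
                  [3, 2, 1, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

stride = 1
out_size = (image.shape[0] - kernel.shape[0]) // stride + 1  # (4 - 3) / 1 + 1 = 2
out = np.zeros((out_size, out_size))
for x in range(out_size):
    for y in range(out_size):
        window = image[x:x + 3, y:y + 3]     # area covered by the kernel
        out[x, y] = np.sum(window * kernel)  # multiply corresponding positions and add
```

With stride 1 and no padding, the 4x4 input shrinks to a 2x2 output, just as in the figure.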


In the convolution above, the original matrix is 4x4, but after convolution it becomes a 2x2 matrix. This is because the pixels on the edge can never sit at the center of the convolution kernel, and the kernel cannot extend beyond the edge of the image.
Sometimes this result is unacceptable: we often want the input and output sizes to match. To solve this, you can pad the original matrix before convolution, that is, fill in some values along the matrix boundary to enlarge it. The padding value is usually 0.

For a convolution kernel of size 3, filling one ring of zeros around the edge is usually enough. For kernels of other sizes, the number of rings of zeros needed to keep the output the same size as the input differs. The padding formula is:

P = (K - 1) / 2

where P is the number of rings to fill and K is the side length of the convolution kernel, which is usually odd.

Multichannel convolution

As we all know, a color picture has three channels, R, G and B. How do we convolve in this multi-channel situation? Very simply: however many channels the input image has, our convolution kernel has the same number of channels.

As with single-channel convolution, slide each channel of the kernel over the corresponding channel of the picture, multiply and add at corresponding positions, and finally sum the convolution results of the three channels.
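The "sum across channels" step can be checked directly: multiplying the whole 3-channel window by the 3-channel kernel and summing gives the same value as summing the three per-channel results. The shapes below are made up for illustration:

```python
import numpy as np

# Hypothetical 3-channel 5x5 image and a matching 3-channel 3x3 kernel
image = np.random.randn(3, 5, 5)
kernel = np.random.randn(3, 3, 3)

# One output pixel: multiply-and-add over the whole 3-channel window
window = image[:, 0:3, 0:3]
pixel = np.sum(window * kernel)

# Same value, computed channel by channel and then summed
per_channel = [np.sum(window[c] * kernel[c]) for c in range(3)]
assert np.isclose(pixel, sum(per_channel))
```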

code implementation

import numpy as np

#input: input data, input_channel: number of channels of the input, out_channel: number of channels of the output feature map, kernel_size: size of the convolution kernel, stride: step size
def convolution(input,input_channel,out_channel,kernel_size,stride):
    kernel = np.random.randn(out_channel,input_channel,kernel_size,kernel_size)     #Create the convolution kernels
    padding = int((kernel_size - 1) / 2)        #Calculate the size of the padding
    padding_input = []
    # Zero-pad each channel of the input matrix
    for i in range(input_channel):
        padding_input.append(np.pad(input[i], ((padding, padding), (padding, padding)), 'constant', constant_values=(0, 0)))
    padding_input = np.array(padding_input)
    #The output size is calculated from the padded input size, kernel size and stride
    out_size = int((len(input[0])+2*padding-kernel_size)/stride+1)
    # Create a zero-filled output matrix
    out = np.zeros((out_channel,out_size,out_size))

    for i in range(out_channel):
        out_x = 0
        out_y = 0
        x_end = padding_input.shape[1] - padding - 1  # Last valid kernel-center position

        x = padding
        y = padding
        while x<=x_end:
            if y>padding_input.shape[2]-padding-1:      #When the kernel passes the right boundary, move down one step
                y = padding
                x = x+stride
                out_y = 0
                out_x = out_x + 1
                if x>x_end:                             #When the kernel also passes the bottom boundary, this output channel is done
                    break
            #Convolution operation: multiply the covered window by the kernel and sum over all channels
            out[i][out_x][out_y] = np.sum(padding_input[:,x-padding:x+padding+1,y-padding:y+padding+1]*kernel[i])

            y = y+stride
            out_y += 1

    return out
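The output-size arithmetic used above can be sanity-checked in isolation. `conv_out_size` here is a hypothetical helper, not part of the original code:

```python
# Output size: (n + 2p - k) // s + 1, where n is the input side length,
# k the kernel side length, p the padding and s the stride
def conv_out_size(n, k, p, s):
    return (n + 2 * p - k) // s + 1

# A 4x4 input with a 3x3 kernel, one ring of padding and stride 1 keeps its size;
# the same input without padding shrinks to 2x2
```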


Activation

Activation applies the activation function to every value in the matrix. Take ReLU as an example: ReLU leaves values greater than 0 unchanged and sets values less than 0 to 0:

code implementation

def ReLu(input):
    out = np.maximum(0,input)
    return out
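A quick sketch of what this does elementwise, on a made-up 2x2 input:

```python
import numpy as np

# ReLU elementwise: negative values become 0, non-negative values pass through
x = np.array([[-1.0, 2.0],
              [0.5, -3.0]])
out = np.maximum(0, x)
# out → [[0. , 2. ],
#        [0.5, 0. ]]
```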


Pooling

Pooling is actually very similar to convolution: a kernel slides over the picture and performs some operation on the area it covers. The difference is that a convolution kernel contains numbers that take part in the computation, while a pooling kernel is empty. The most common pooling operations are maximum pooling and average pooling.

Maximum pooling

It selects the maximum value in the area covered by the kernel.

Average pooling

It takes the average of the values in the area covered by the kernel.

code implementation

Take maximum pooling as an example.

#input: input data, pooling_size: size of the pooling window, stride: step size
def pooling(input,pooling_size,stride):
    out_size = int((len(input[0])-pooling_size)/stride+1)   #Calculate the size of the pooled output matrix
    out = np.zeros((input.shape[0],out_size,out_size))      #Initialize the output matrix, one plane per channel
    # Start pooling for each channel
    for i in range(input.shape[0]):
        out_x = 0
        out_y = 0
        in_x = 0
        in_y = 0
        #Start sliding
        while True:
            if out_y>=out_size:         #When the window passes the right boundary, move down one step
                in_y = 0
                out_y = 0
                in_x = in_x+stride
                out_x = out_x+1
                if out_x==out_size:     #When the window also passes the bottom boundary, this channel is done
                    break
            #Pooling operation: take the maximum of the covered area
            out[i][out_x][out_y] = np.max(input[i,in_x:in_x+pooling_size,in_y:in_y+pooling_size])
            in_y = in_y+stride
            out_y += 1
    return out
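Average pooling differs from the code above only in the reduction applied to each window: `np.mean` instead of `np.max`. On a single made-up 2x2 window:

```python
import numpy as np

# One 2x2 window: the two pooling variants reduce it differently
window = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
max_pooled = np.max(window)   # maximum pooling keeps the largest value: 4.0
avg_pooled = np.mean(window)  # average pooling keeps the mean: 2.5
```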


Above are the three channels of the output of one convolution operation.

Topics: Computer Vision Deep Learning CNN