Analysis of YOLOv5 network modules

Posted by monkeymade on Wed, 05 Jan 2022 11:14:30 +0100


[personal learning notes, please correct any errors]

YOLOv5 code repository address:

1, Conv module

Before looking at the other modules, we first need the most basic building block in YOLOv5, the Conv module.

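import torch
import torch.nn as nn


class Conv(nn.Module):
    # Standard convolution block: Conv2d -> BatchNorm2d -> activation
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        # autopad (defined below) chooses the padding when p is None
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.Hardswish() if act else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        # Forward pass that skips BN, for use once BN has been fused into the convolution
        return self.act(self.conv(x))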

The Conv module here is the combination of [convolution] + [BN] + [activation]. The activation function is [Hardswish]; [nn.Identity] can be understood simply as a placeholder layer that returns its input unchanged, used when [act] is False.

In the code above, [autopad(k, p)] is an automatic padding function:

def autopad(k, p=None):  # kernel, padding
    # Pad to 'same'
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p
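
To see what this padding achieves, the sketch below recomputes the standard convolution output-size formula in plain Python, so it runs without PyTorch. The helper `conv_out_size` is a hypothetical name introduced only for illustration, and it assumes dilation 1 and an integer kernel size.

```python
def autopad(k, p=None):  # kernel, padding
    # Same rule as above: default to half the kernel size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p


def conv_out_size(n, k, s=1, p=None):
    """Output length of a convolution along one axis (dilation 1).

    Hypothetical helper for illustration, not part of YOLOv5.
    """
    p = autopad(k, p)
    return (n + 2 * p - k) // s + 1


# Odd kernels with stride 1 keep the size unchanged
print(conv_out_size(640, k=1))       # 640
print(conv_out_size(640, k=3))       # 640
# With stride 2 the size is halved
print(conv_out_size(640, k=3, s=2))  # 320
```

For an even kernel such as k=2 the same formula gives 641 for an input of 640, so the 'same' behaviour holds only for odd kernel sizes.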

Conv operates as follows:

2, Focus module

class Focus(nn.Module):
    # Focus width/height information into channel space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super(Focus, self).__init__()
        # 4 times the number of channels
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):
        # x(b,c,h,w) -> (b,4c,h/2,w/2): concatenate the four interleaved sub-images on dim 1
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))

The operation of the Focus module is shown in the figure

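The Focus module takes every other pixel along both the height and the width on each of the three RGB channels, as shown in the figure above, so that each channel yields four sub-sampled channels. In other words, the height and width are halved and the number of channels becomes four times larger: an input of shape (b, c, h, w) becomes (b, 4c, h/2, w/2) before it enters the convolution.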
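
The slicing pattern can be checked on a tiny example. The sketch below mimics the four slices of Focus.forward on a single 4x4 channel, using nested Python lists instead of tensors so that it runs without PyTorch. `focus_slices` and `focus_shape` are hypothetical helpers written only for illustration.

```python
def focus_slices(img):
    """Split a 2-D grid (list of rows) into four interleaved sub-grids.

    The order matches the slices in Focus.forward:
    x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]
    """
    return [
        [row[0::2] for row in img[0::2]],  # even rows, even columns
        [row[0::2] for row in img[1::2]],  # odd rows, even columns
        [row[1::2] for row in img[0::2]],  # even rows, odd columns
        [row[1::2] for row in img[1::2]],  # odd rows, odd columns
    ]


def focus_shape(b, c, h, w):
    # Shape after concatenating the four slices along the channel axis
    return (b, 4 * c, h // 2, w // 2)


img = [[r * 4 + c for c in range(4)] for r in range(4)]  # pixel values 0..15
for part in focus_slices(img):
    print(part)
# [[0, 2], [8, 10]]
# [[4, 6], [12, 14]]
# [[1, 3], [9, 11]]
# [[5, 7], [13, 15]]

print(focus_shape(1, 3, 640, 640))  # (1, 12, 320, 320)
```

Every input pixel appears in exactly one of the four sub-grids, so no information is discarded; it is only moved from the spatial dimensions into the channel dimension.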

3, Bottleneck module

Definition of the Bottleneck module:

class Bottleneck(nn.Module):
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

Flowchart of Bottleneck module

A shortcut (residual) connection is used only when [shortcut] is True and the number of input channels [c1] equals the number of output channels [c2].
The first [Conv]: a 1x1 convolution that maps the input to c_ = int(c2 * e) hidden channels, i.e. half the number of output channels with the default e = 0.5.
The second [Conv]: an ordinary 3x3 convolution (it becomes a group convolution if the hyperparameter [g] is changed).
Note: the halving of the channels comes from the parameter [e]. When e == 1, the hidden layer has the same number of channels as the output!
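
The channel bookkeeping described above can be written out in a few lines of plain Python. This is only a sketch of the arithmetic in Bottleneck.__init__; `bottleneck_channels` is a hypothetical helper and does not build any layers.

```python
def bottleneck_channels(c1, c2, shortcut=True, e=0.5):
    """Return the (in, out) channels of each Conv and whether x is added back.

    Hypothetical helper mirroring the arithmetic in Bottleneck.__init__.
    """
    c_ = int(c2 * e)             # hidden channels
    add = shortcut and c1 == c2  # residual add needs matching shapes
    return {"cv1": (c1, c_), "cv2": (c_, c2), "add": add}


print(bottleneck_channels(64, 64))
# {'cv1': (64, 32), 'cv2': (32, 64), 'add': True}
print(bottleneck_channels(64, 128))
# {'cv1': (64, 64), 'cv2': (64, 128), 'add': False}
print(bottleneck_channels(64, 64, e=1.0))
# {'cv1': (64, 64), 'cv2': (64, 64), 'add': True}
```

The second call shows why the shortcut is disabled when c1 != c2: the input has 64 channels and the output 128, so the two tensors could not be added element-wise.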

4, BottleneckCSP module

class BottleneckCSP(nn.Module):
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        # ch_in, ch_out, number, shortcut, groups, expansion
        super(BottleneckCSP, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = nn.Conv2d(c1, c_, 1, 1, bias=False)
        self.cv3 = nn.Conv2d(c_, c_, 1, 1, bias=False)
        self.cv4 = Conv(2 * c_, c2, 1, 1)
        self.bn = nn.BatchNorm2d(2 * c_)  # applied to cat(cv2, cv3)
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        y1 = self.cv3(self.m(self.cv1(x)))  # branch 1: Conv -> n x Bottleneck -> Conv2d
        y2 = self.cv2(x)  # branch 2: plain 1x1 Conv2d
        return self.cv4(self.act(self.bn(torch.cat((y1, y2), dim=1))))

Flow chart of the BottleneckCSP module

The input [x] passes through two different branches. The two results are concatenated along the channel dimension, then go through [BN] + [LeakyReLU] and finally an ordinary [Conv].
Branch 1 (left): a 1x1 [Conv] reduces the channels to c_ = int(c2 * e), half of the output channels by default, followed by n [Bottleneck] modules. These are built with e=1.0, so their hidden layers keep c_ channels. The result then passes through a [Conv2d] layer (without BN or activation function).
Branch 2 (right): a 1x1 [Conv2d] (without BN or activation function) reduces the channels to c_.
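
As a rough check of the description above, the sketch below traces the number of channels at each stage of BottleneckCSP.forward in plain Python. `csp_channel_trace` is a hypothetical helper for illustration; it tracks channel counts only and builds no layers.

```python
def csp_channel_trace(c1, c2, e=0.5):
    """List (stage, channels) pairs following BottleneckCSP.forward.

    Hypothetical helper for illustration, not part of YOLOv5.
    """
    c_ = int(c2 * e)  # hidden channels shared by both branches
    return [
        ("input x", c1),
        ("y1 = cv3(m(cv1(x)))", c_),
        ("y2 = cv2(x)", c_),
        ("cat(y1, y2) -> bn -> act", 2 * c_),
        ("cv4 output", c2),
    ]


for stage, channels in csp_channel_trace(128, 128):
    print(f"{stage:26s} {channels}")
```

With c1 = c2 = 128 and the default e = 0.5, both branches carry 64 channels, the concatenation carries 128, and cv4 returns 128 channels.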

5, SPP module

class SPP(nn.Module):
    # Spatial pyramid pooling layer used in YOLOv3-SPP
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super(SPP, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))

Flow chart of the SPP module

A [Conv] module first halves the number of input channels. Max pooling with three different kernel sizes (5, 9 and 13) is then applied; each pooled output has the same size and number of channels as its input, and all of them are concatenated with the un-pooled feature map along dimension 1. After concatenation, a final [Conv] maps the resulting c_ * 4 channels to c2 channels.

Max pooling with different kernel sizes:

Because every pooling layer uses stride 1 and padding k // 2, the three outputs keep the same height, width and number of channels as the input, which makes the subsequent concatenation straightforward.
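
The size-preserving behaviour of these pooling layers can be illustrated with a 1-D analogue in plain Python. The sketch below assumes the same settings as in SPP (stride 1, padding k // 2, odd k) and pads with negative infinity so that padded positions never win the max. `max_pool_1d` is a hypothetical helper, not a PyTorch function.

```python
def max_pool_1d(seq, k):
    """1-D max pooling with stride 1 and padding k // 2.

    Hypothetical helper for illustration; assumes k is odd.
    """
    p = k // 2
    padded = [float("-inf")] * p + list(seq) + [float("-inf")] * p
    return [max(padded[i:i + k]) for i in range(len(padded) - k + 1)]


row = [1, 3, 2, 5, 4]
print(max_pool_1d(row, 3))  # [3, 3, 5, 5, 5]
print(max_pool_1d(row, 5))  # [3, 5, 5, 5, 5]

# Channel count entering cv2 for the default k=(5, 9, 13), with c1 = 512
c_ = 512 // 2
print(c_ * (3 + 1))  # 1024
```

Both pooled rows have the same length as the input, so they can be stacked with it directly. A larger kernel spreads each maximum over a wider neighbourhood.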

6, Detect module

The entire class definition code is not given here.

# Class initialization function
# nl: number of detection layers, na: anchors per layer, no: outputs per anchor
self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv

# In the forward function
for i in range(self.nl):
    x[i] = self.m[i](x[i])  # conv
    bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
    x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

Flow chart of the Detect module

The input [x] is a list of feature maps at three scales, with shapes as shown in the figure. Each feature map passes through its own 1x1 convolution (held in [nn.ModuleList]), which produces a feature map with the same number of channels at every scale and the same spatial size as its input. The result is then reshaped to:
(batch_size, num_anchors, h, w, (number of classes + confidence + width and height of the predicted box + x and y coordinates of its center point))
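
The numbers 255, 3 and 85 in the code comment follow from simple arithmetic, sketched below in plain Python. The values nc=80 and na=3 are assumptions chosen to match that comment, and `detect_layout` and `split_channel` are hypothetical helpers written only for illustration.

```python
def detect_layout(nc=80, na=3):
    """Return (outputs per anchor, channels of the 1x1 output conv).

    Hypothetical helper; nc=80 and na=3 are assumed to match the
    x(bs,255,20,20) -> x(bs,3,20,20,85) comment in the code.
    """
    no = nc + 5  # 4 box values + 1 confidence + nc class scores
    return no, na * no


def split_channel(ch, no):
    # view(bs, na, no, ny, nx) splits the channel axis row-major,
    # so channel ch maps to (anchor index, field index)
    return divmod(ch, no)


no, channels = detect_layout()
print(no, channels)            # 85 255
print(split_channel(0, no))    # (0, 0)
print(split_channel(85, no))   # (1, 0)
print(split_channel(254, no))  # (2, 84)
```

After the permute, the field index becomes the last dimension, so each of the 3 x 20 x 20 anchor positions holds one vector of 85 values.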


This covers the basic modules used in YOLOv5.

Topics: Computer Vision Deep Learning yolo