1 Overview
This document mainly covers:

- Two classes: Darknet and YOLOLayer
- Several functions: create_modules, get_yolo_layers, load_darknet_weights, save_weights, convert, attempt_download

Darknet's constructor first calls three functions: parse_model_cfg, create_modules, and get_yolo_layers, so let's start with those and then analyze further.
2 import library files
utils contains many important tools, which we will look at in more detail later.
```python
from utils.google_utils import *
from utils.layers import *
from utils.parse_config import *
```
3 parse_model_cfg()
3.1 path correction
This is a function in parse_config.py, which reads the model definition from the cfg file.
If the .cfg suffix is missing, it is added automatically; if the cfg/ path prefix is missing, it is added as well.
```python
# Ensure the model path is well-formed; complete it if parts are omitted
if not path.endswith('.cfg'):  # add .cfg suffix if omitted
    path += '.cfg'
if not os.path.exists(path) and os.path.exists('cfg' + os.sep + path):  # add cfg/ prefix if omitted
    path = 'cfg' + os.sep + path
```
3.2 reading line by line
Read the file line by line:

- `if x` removes blank lines
- `not x.startswith('#')` removes comment lines, so a comment needs its own line
- strip the whitespace at both ends: rstrip removes trailing and lstrip removes leading whitespace
```python
with open(path, 'r') as f:
    lines = f.read().split('\n')
lines = [x for x in lines if x and not x.startswith('#')]
lines = [x.rstrip().lstrip() for x in lines]  # get rid of fringe whitespaces
```
3.3 model definition
Before seeing how it is parsed, let's first look at how a cfg file is written. Blocks are separated by blank lines and comments:
```ini
[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=2
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky
```
Process line by line: when a new module header is read, append a new dictionary to the end of the list and strip the []:
```python
mdefs = []  # module definitions
for line in lines:
    if line.startswith('['):  # this marks the start of a new block
        mdefs.append({})
        mdefs[-1]['type'] = line[1:-1].rstrip()  # remove [] and trailing spaces
```
If it is a convolutional block, batch_normalize is pre-populated with 0 so the key always exists even when a cfg forgets to define it; if the cfg does define it, the value is overwritten later. This seems important; we will come back to it.
```python
        if mdefs[-1]['type'] == 'convolutional':
            # pre-populate with zeros (may be overwritten later); this guards
            # against a cfg that omits batch_normalize
            mdefs[-1]['batch_normalize'] = 0
```
Next, the key/value contents of each module are parsed. Note that anchors only appear in the yolo layer, as shown below:
```ini
[yolo]
mask = 3,4,5
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=7
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
```
After reading the anchors, they need to be grouped in pairs (width, height), hence the reshape:
```python
    else:
        key, val = line.split("=")
        key = key.rstrip()

        if key == 'anchors':  # return nparray
            mdefs[-1][key] = np.array([float(x) for x in val.split(',')]).reshape((-1, 2))  # np anchors
```
The remaining keys are stored as lists, numbers, or strings according to their type:
```python
        elif (key in ['from', 'layers', 'mask']) or (key == 'size' and ',' in val):  # return array
            mdefs[-1][key] = [int(x) for x in val.split(',')]
        else:
            val = val.strip()
            # TODO: .isnumeric() actually fails to get the float case
            if val.isnumeric():  # return int or float
                mdefs[-1][key] = int(val) if (int(val) - float(val)) == 0 else float(val)
            else:
                mdefs[-1][key] = val  # return string
```
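A quick note on the TODO above: str.isnumeric() returns False for strings containing a dot, so float-valued fields such as momentum=0.9 fall through to the string branch. A minimal demonstration:

```python
# str.isnumeric() is False for strings containing '.', so floats are
# stored as strings by the branch above
print('64'.isnumeric())   # True  -> stored as int 64
print('0.9'.isnumeric())  # False -> stored as the string '0.9'
print('.1'.isnumeric())   # False -> stored as the string '.1'
```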
Finally, check whether any unsupported fields appear; if so, an assertion fails:
```python
# Check all fields are supported
supported = ['type', 'batch_normalize', 'filters', 'size', 'stride', 'pad', 'activation', 'layers',
             'groups', 'from', 'mask', 'anchors', 'classes', 'num', 'jitter', 'ignore_thresh',
             'truth_thresh', 'random', 'stride_x', 'stride_y', 'weights_type', 'weights_normalization',
             'scale_x_y', 'beta_nms', 'nms_kind', 'iou_loss', 'iou_normalizer', 'cls_normalizer',
             'iou_thresh', 'probability']

f = []  # fields
```
The loop starts at index 1 (the second entry in the list), because the first block in the cfg file is [net], whose fields are not in the supported list:
```python
for x in mdefs[1:]:
    [f.append(k) for k in x if k not in f]
u = [x for x in f if x not in supported]  # unsupported fields
assert not any(u), "Unsupported fields %s in %s. See https://github.com/ultralytics/yolov3/issues/631" % (u, path)

return mdefs
```
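Putting it together, a quick hypothetical usage sketch (assuming the standard yolov3 cfg layout):

```python
# Hypothetical usage: parse a cfg and inspect the first two blocks
mdefs = parse_model_cfg('cfg/yolov3.cfg')
print(mdefs[0]['type'])             # 'net' (training hyperparameters)
print(mdefs[1]['type'])             # 'convolutional'
print(mdefs[1]['batch_normalize'])  # 1
```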
4 create_modules()
4.1 basic unit
Using the model definitions from the previous step, create_modules builds the module_list. This requires some understanding of nn.ModuleList() and nn.Sequential(). Briefly:

- nn.Sequential() defines a small module whose children execute in sequence
- nn.ModuleList() defines a list for storing various small modules. It behaves like a Python list, but registers the modules so their parameters are tracked

A small sketch contrasting the two follows this list.
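This is a minimal illustration of the difference, not code from the repo:

```python
import torch
import torch.nn as nn

# nn.Sequential executes its children in order as a single module
seq = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.LeakyReLU(0.1),
)

# nn.ModuleList just stores modules (and registers their parameters);
# the calling order is up to you in forward()
mlist = nn.ModuleList([seq, nn.Upsample(scale_factor=2)])

x = torch.randn(1, 3, 32, 32)
for m in mlist:  # you decide how and when each module is applied
    x = m(x)
print(x.shape)   # torch.Size([1, 16, 64, 64])
```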
Since we are studying YOLOv3, we will carefully analyze four module types: convolutional, route, shortcut, and yolo.
4.2 basic understanding of the YOLO network structure
The overall network structure can be found in the linked references, which also give a more specific introduction to the residual layers.
The calculation of the related dimensions will be covered later.
4.3 convolutional
The convolutional block is relatively simple; see the code comments for details:
```python
if mdef['type'] == 'convolutional':
    bn = mdef['batch_normalize']
    filters = mdef['filters']  # number of kernels, which determines the output depth
    k = mdef['size']  # kernel size
    stride = mdef['stride'] if 'stride' in mdef else (mdef['stride_y'], mdef['stride_x'])  # scalar, or per-axis (y, x)
    if isinstance(k, int):  # single-size conv: add an ordinary convolution
        modules.add_module('Conv2d', nn.Conv2d(in_channels=output_filters[-1],
                                               out_channels=filters,
                                               kernel_size=k,
                                               stride=stride,
                                               padding=k // 2 if mdef['pad'] else 0,
                                               groups=mdef['groups'] if 'groups' in mdef else 1,
                                               bias=not bn))
    else:  # multiple-size conv: add a mixed convolution
        modules.add_module('MixConv2d', MixConv2d(in_ch=output_filters[-1],
                                                  out_ch=filters,
                                                  k=k,
                                                  stride=stride,
                                                  bias=not bn))
```
Next come batch normalization and the activation function. Pay special attention here: if a convolutional layer has no batch_normalize, the next layer is a yolo layer, so the index of this layer is recorded in routs:
```python
    if bn:
        modules.add_module('BatchNorm2d', nn.BatchNorm2d(filters, momentum=0.03, eps=1E-4))
    else:
        # no BN means this conv feeds a yolo layer; record its index
        routs.append(i)  # detection output (goes into yolo layer)

    # different types of activation functions
    if mdef['activation'] == 'leaky':  # activation study https://github.com/ultralytics/yolov3/issues/441
        modules.add_module('activation', nn.LeakyReLU(0.1, inplace=True))
    elif mdef['activation'] == 'swish':
        modules.add_module('activation', Swish())
    elif mdef['activation'] == 'mish':
        modules.add_module('activation', Mish())
```
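To make this concrete, here is a minimal sketch of the block assembled for the [convolutional] example from section 3.3 (filters=16, size=3, stride=1, pad=1, batch_normalize=1); the module names match those used by add_module above:

```python
import torch
import torch.nn as nn

modules = nn.Sequential()
modules.add_module('Conv2d', nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False))
modules.add_module('BatchNorm2d', nn.BatchNorm2d(16, momentum=0.03, eps=1E-4))
modules.add_module('activation', nn.LeakyReLU(0.1, inplace=True))

x = torch.randn(1, 3, 416, 416)
print(modules(x).shape)  # torch.Size([1, 16, 416, 416]); padding=k//2 preserves H and W
```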
4.4 Upsample
Upsampling enlarges the feature map; nn.Upsample defaults to nearest-neighbor interpolation. ONNX_EXPORT is set to False at the start of the file, so only the else branch matters here, and the feature map is usually doubled in size.
```python
elif mdef['type'] == 'upsample':
    if ONNX_EXPORT:  # explicitly state size, avoid scale_factor
        g = (yolo_index + 1) * 2 / 32  # gain
        modules = nn.Upsample(size=tuple(int(x * g) for x in img_size))  # img_size = (320, 192)
    else:
        modules = nn.Upsample(scale_factor=mdef['stride'])
```
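A quick sketch of the usual case, stride=2:

```python
import torch
import torch.nn as nn

# stride=2 doubles the spatial size (nearest-neighbor by default)
up = nn.Upsample(scale_factor=2)
x = torch.randn(1, 256, 13, 13)
print(up(x).shape)  # torch.Size([1, 256, 26, 26])
```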
4.5 route
There are two types of this module, as follows
```ini
[route]
layers = -4

[route]
layers = -1, 61
```
- When the layers attribute has a single value, this layer outputs the feature map at that index; for example, -1 means the previous layer
- When there are two values, this layer outputs the depth-wise concatenation of the two referenced feature maps (their spatial sizes must therefore match)
Specific explanations are in the code comments:
```python
elif mdef['type'] == 'route':  # nn.Sequential() placeholder for 'route' layer
    layers = mdef['layers']
    # Each layer's channel count is appended to output_filters, which starts with 3
    # (the input channels), so positive indices are shifted by +1, while negative
    # indices such as -1 index from the end directly
    filters = sum([output_filters[l + 1 if l > 0 else l] for l in layers])
    # record which layers are referenced (their outputs must be kept in forward_once)
    routs.extend([i + l if l < 0 else l for l in layers])
    # feature-map concatenation
    modules = FeatureConcat(layers=layers)
```
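FeatureConcat is defined in utils/layers.py; based on how it is used here, a minimal sketch of what it does might look like this (the real class may differ in details):

```python
import torch
import torch.nn as nn

class FeatureConcat(nn.Module):
    """Concatenate saved feature maps along the channel dimension."""
    def __init__(self, layers):
        super().__init__()
        self.layers = layers             # indices of the layers to concatenate
        self.multiple = len(layers) > 1  # more than one layer?

    def forward(self, x, outputs):       # outputs: list of saved layer outputs
        if self.multiple:
            return torch.cat([outputs[i] for i in self.layers], 1)
        return outputs[self.layers[0]]
```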
4.6 shortcut layer
This layer works like the residual connections in ResNet: it adds feature maps element-wise (so the sizes of the feature maps must match).
```ini
[shortcut]
from=-3
activation=linear
```
For example, the module above adds the feature map of the previous layer to the one three layers back.
The activation field does not seem to be used in the code.
See the code comments for specific explanations:
```python
elif mdef['type'] == 'shortcut':  # nn.Sequential() placeholder for 'shortcut' layer
    layers = mdef['from']
    # element-wise addition, so the output size equals the previous layer's
    filters = output_filters[-1]
    # record the layers being added
    routs.extend([i + l if l < 0 else l for l in layers])
    # weighted feature fusion
    modules = WeightedFeatureFusion(layers=layers, weight='weights_type' in mdef)
```
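WeightedFeatureFusion also lives in utils/layers.py; this is a minimal sketch of the unweighted case, based on its use here (the real class additionally supports learnable per-layer weights and channel mismatches):

```python
import torch
import torch.nn as nn

class WeightedFeatureFusion(nn.Module):
    """Add saved feature maps to the current one (residual-style)."""
    def __init__(self, layers, weight=False):
        super().__init__()
        self.layers = layers  # indices of the layers to add
        self.weight = weight  # learnable weighting not sketched here

    def forward(self, x, outputs):  # outputs: list of saved layer outputs
        for i in self.layers:
            x = x + outputs[i]      # element-wise addition
        return x
```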
4.7 YOLO layer
- mask selects which of the anchor boxes this yolo layer uses
- anchors lists all the anchor boxes chosen for this experiment
- classes is the number of categories
- The other parameters are not yet well understood
```ini
[yolo]
mask = 3,4,5
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
```
Construct the YOLO layer and initialize the bias of the preceding convolutional layer:
```python
elif mdef['type'] == 'yolo':
    yolo_index += 1
    stride = [32, 16, 8]  # P5, P4, P3 strides
    if any(x in cfg for x in ['panet', 'yolov4', 'cd53']):  # stride order reversed for these cfgs
        stride = list(reversed(stride))
    layers = mdef['from'] if 'from' in mdef else []
    # construct the yolo module
    modules = YOLOLayer(anchors=mdef['anchors'][mdef['mask']],  # anchor list
                        nc=mdef['classes'],  # number of classes
                        img_size=img_size,  # (416, 416)
                        yolo_index=yolo_index,  # 0, 1, 2...
                        layers=layers,  # output layers
                        stride=stride[yolo_index])

    # Initialize preceding Conv2d() bias (https://arxiv.org/pdf/1708.02002.pdf section 3.3)
    try:
        j = layers[yolo_index] if 'from' in mdef else -1
        # if the previous layer is a dropout layer, get the one before it
        if module_list[j].__class__.__name__ == 'Dropout':
            j -= 1
        bias_ = module_list[j][0].bias  # shape(255,)
        bias = bias_[:modules.no * modules.na].view(modules.na, -1)  # shape(3,85)
        bias[:, 4] += -4.5  # obj
        bias[:, 5:] += math.log(0.6 / (modules.nc - 0.99))  # cls (sigmoid(p) = 1/nc)
        module_list[j][0].bias = torch.nn.Parameter(bias_, requires_grad=bias_.requires_grad)
    except:
        print('WARNING: smart bias initialization failure.')
```
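To see what this smart bias initialization amounts to, here is a quick numerical sketch for nc=80 (my own calculation, not from the source):

```python
import math

obj_bias = -4.5
print(1 / (1 + math.exp(-obj_bias)))  # ~0.011: objectness starts near zero

cls_bias = math.log(0.6 / (80 - 0.99))
print(1 / (1 + math.exp(-cls_bias)))  # ~0.0075 ~= 0.6/80: classes start near uniform
```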
4.8 record and return
Add the module to the list and record its output dimension at the same time:
```python
# Register module list and number of output filters
module_list.append(modules)
output_filters.append(filters)
```
Record which layers are routed to, as a boolean mask:
```python
routs_binary = [False] * (i + 1)
for i in routs:
    routs_binary[i] = True
return module_list, routs_binary
```
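For example, a toy illustration with made-up indices:

```python
# Toy example: 5 modules, where the outputs of layers 1 and 3 are needed later
routs = [1, 3]
i = 4  # index of the last module
routs_binary = [False] * (i + 1)
for i in routs:
    routs_binary[i] = True
print(routs_binary)  # [False, True, False, True, False]
```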
5 YOLOLayer()
create_modules instantiates the YOLOLayer class; let's study it next.
First, look at how the class is called.
It receives the masked anchors, the number of classes, the image size, the yolo index, and layers (which appears to be empty for v3). The stride in v3 is 32, 16, and 8 in turn.
```python
modules = YOLOLayer(anchors=mdef['anchors'][mdef['mask']],  # anchor list
                    nc=mdef['classes'],  # number of classes
                    img_size=img_size,  # (416, 416)
                    yolo_index=yolo_index,  # 0, 1, 2...
                    layers=layers,  # output layers
                    stride=stride[yolo_index])
```
5.1 constructor
See the comments for details:
```python
def __init__(self, anchors, nc, img_size, yolo_index, layers, stride):
    super(YOLOLayer, self).__init__()
    self.anchors = torch.Tensor(anchors)  # convert anchors to a tensor
    self.index = yolo_index  # index of this layer in layers
    self.layers = layers  # model output layer indices
    self.stride = stride  # layer stride; how this is used is very important
    self.nl = len(layers)  # number of output layers (3)
    self.na = len(anchors)  # number of anchors (3)
    self.nc = nc  # number of classes (80)
    # output dimension = classes + 5, where 5 is x, y, w, h and confidence
    # [to change the output format, this is the place to modify]
    self.no = nc + 5  # number of outputs (85)
    # initialize grid counts; ng is the number of grid cells
    self.nx, self.ny, self.ng = 0, 0, 0  # initialize number of x, y gridpoints
    # scale the anchors; the stride is the size of each grid cell in pixels
    self.anchor_vec = self.anchors / self.stride
    self.anchor_wh = self.anchor_vec.view(1, self.na, 1, 1, 2)

    # only used for ONNX export
    if ONNX_EXPORT:
        self.training = False
        self.create_grids((img_size[1] // stride, img_size[0] // stride))  # number x, y grid points
```
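As a concrete illustration of the anchor scaling, using the stride-32 anchors from the cfg above:

```python
import torch

# The stride-32 head uses anchors (116,90), (156,198), (373,326);
# dividing by the stride expresses them in grid units
anchors = torch.Tensor([[116, 90], [156, 198], [373, 326]])
anchor_vec = anchors / 32
print(anchor_vec)  # tensor([[ 3.6250,  2.8125], [ 4.8750,  6.1875], [11.6562, 10.1875]])
anchor_wh = anchor_vec.view(1, 3, 1, 1, 2)  # shaped to broadcast over (bs, na, ny, nx, 2)
```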
5.2 forward function
The forward function takes p and out as inputs. p is the output of the preceding network; what out is for is less clear, as it is mainly used in ASFF, so let's ignore it for now.
There are many if branches, most of which are currently false, so we focus directly on the key parts.
- Training part
Here the tensor is only reshaped, with no further processing. We still need to see how the previous layer's output and the loss function handle it.
```python
# p.view(bs, 255, 13, 13) --> (bs, 3, 13, 13, 85)  # (bs, anchors, grid, grid, classes + xywh)
p = p.view(bs, self.na, self.no, self.ny, self.nx).permute(0, 1, 3, 4, 2).contiguous()  # prediction

if self.training:
    return p
```
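A small shape sketch of this reshape, with bs=1, na=3, no=85, and a 13x13 grid (so 255 = 3 * 85):

```python
import torch

bs, na, no, ny, nx = 1, 3, 85, 13, 13
p = torch.randn(bs, na * no, ny, nx)  # raw conv output: (1, 255, 13, 13)
p = p.view(bs, na, no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
print(p.shape)  # torch.Size([1, 3, 13, 13, 85]) = (bs, anchors, grid, grid, classes + xywh)
```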
- Inference part. Training returns the raw predictions directly, whereas inference decodes the results as in the original paper; this is tied to how predictions are compared with the ground truth.
```python
else:  # inference
    io = p.clone()  # inference output
    io[..., :2] = torch.sigmoid(io[..., :2]) + self.grid  # xy
    io[..., 2:4] = torch.exp(io[..., 2:4]) * self.anchor_wh  # wh yolo method
    io[..., :4] *= self.stride
    torch.sigmoid_(io[..., 4:])
    return io.view(bs, -1, self.no), p  # view [1, 3, 13, 13, 85] as [1, 507, 85]
```
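This decoding matches the box equations from the YOLOv3 paper, where (c_x, c_y) is the grid-cell offset (self.grid) and (p_w, p_h) are the anchor sizes in grid units (self.anchor_wh); multiplying by the stride then converts grid units back to pixels:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```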
6 Darknet class
This class is the main content of the file; the other methods and properties exist to support it.
6.1 constructor
It calls the previous functions to complete the basic construction. The purpose of version, seen, and info further down is not yet clear.
```python
def __init__(self, cfg, img_size=(416, 416), verbose=False):
    super(Darknet, self).__init__()
    self.module_defs = parse_model_cfg(cfg)  # load model definitions
    self.module_list, self.routs = create_modules(self.module_defs, img_size, cfg)  # build module list
    self.yolo_layers = get_yolo_layers(self)  # indices of the yolo layers
    # torch_utils.initialize_weights(self)

    # Darknet Header https://github.com/AlexeyAB/darknet/issues/2914#issuecomment-496675346
    self.version = np.array([0, 2, 5], dtype=np.int32)  # (int32) version info: major, minor, revision
    self.seen = np.array([0], dtype=np.int64)  # (int64) number of images seen during training
    self.info(verbose) if not ONNX_EXPORT else None  # print model description
```
6.2 forward function
There are two functions: forward and forward_once.
forward mainly decides whether test-time augmentation is used; if it is, the image is flipped and scaled, and three variants of the original image are run.
```python
def forward(self, x, augment=False, verbose=False):
    # decide whether test-time augmentation is needed
    if not augment:
        return self.forward_once(x)
    else:  # Augment images (inference and test only) https://github.com/ultralytics/yolov3/issues/931
        # run the original image plus a flipped-and-scaled copy and a scaled copy
        img_size = x.shape[-2:]  # height, width
        s = [0.83, 0.67]  # scales
        y = []
        for i, xi in enumerate((x,
                                torch_utils.scale_img(x.flip(3), s[0], same_shape=False),  # flip-lr and scale
                                torch_utils.scale_img(x, s[1], same_shape=False),  # scale
                                )):
            # cv2.imwrite('img%g.jpg' % i, 255 * xi[0].numpy().transpose((1, 2, 0))[:, :, ::-1])
            y.append(self.forward_once(xi)[0])

        # de-augment the outputs so they line up with the original image
        y[1][..., :4] /= s[0]  # scale
        y[1][..., 0] = img_size[1] - y[1][..., 0]  # flip lr
        y[2][..., :4] /= s[1]  # scale
        y = torch.cat(y, 1)
        return y, None
```
forward_once() is called for each image, scaled or not:
```python
def forward_once(self, x, augment=False, verbose=False):
    img_size = x.shape[-2:]  # height, width
    yolo_out, out = [], []
    if verbose:  # verbose prints a lot of extra information
        print('0', x.shape)
        str = ''
```
Augmentation can also be applied inside forward_once (inference and test only):
```python
    # Augment images (inference and test only)
    # this augmentation seems odd: it is only applied at test time
    if augment:  # https://github.com/ultralytics/yolov3/issues/931
        nb = x.shape[0]  # batch size
        s = [0.83, 0.67]  # scales
        x = torch.cat((x,
                       torch_utils.scale_img(x.flip(3), s[0]),  # flip-lr and scale
                       torch_utils.scale_img(x, s[1]),  # scale
                       ), 0)
```
The main processing loop; verbose controls the debug output:
```python
    for i, module in enumerate(self.module_list):
        name = module.__class__.__name__
        if name in ['WeightedFeatureFusion', 'FeatureConcat']:  # sum, concat
            if verbose:
                l = [i - 1] + module.layers  # layers
                sh = [list(x.shape)] + [list(out[i].shape) for i in module.layers]  # shapes
                str = ' >> ' + ' + '.join(['layer %g %s' % x for x in zip(l, sh)])
            x = module(x, out)  # WeightedFeatureFusion(), FeatureConcat()
        elif name == 'YOLOLayer':
            # yolo layers also receive out; why out is needed here ties back to ASFF
            yolo_out.append(module(x, out))
        else:  # run module directly, i.e. mtype = 'convolutional', 'upsample', 'maxpool', 'batchnorm2d' etc.
            x = module(x)

        # save this layer's output only if a later layer routes to it
        out.append(x if self.routs[i] else [])
        if verbose:
            print('%g/%g %s -' % (i, len(self.module_list), name), list(x.shape), str)
            str = ''
```
During training, the raw yolo outputs are returned directly:
```python
    if self.training:  # train
        return yolo_out
    # the ONNX export path has not been fully understood yet
    elif ONNX_EXPORT:  # export
        x = [torch.cat(x, 0) for x in zip(*yolo_out)]
        return x[0], torch.cat(x[1:3], 1)  # scores, boxes: 3780x80, 3780x4
```
At inference time, recall that each yolo layer returned

```python
return io.view(bs, -1, self.no), p  # view [1, 3, 13, 13, 85] as [1, 507, 85]
```

so the output needs processing: split it into its two parts, concatenate the x tensors along dim 1, undo the augmentation if it was applied, and return:
```python
    else:  # inference or test
        # split the output into its two parts
        x, p = zip(*yolo_out)  # inference output, training output
        # concatenate the tensors in x; dim 1 stacks the per-layer predictions
        x = torch.cat(x, 1)  # cat yolo outputs
        if augment:  # de-augment results
            x = torch.split(x, nb, dim=0)
            x[1][..., :4] /= s[0]  # scale
            x[1][..., 0] = img_size[1] - x[1][..., 0]  # flip lr
            x[2][..., :4] /= s[1]  # scale
            x = torch.cat(x, 1)
        return x, p
```
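Tying things together, a hypothetical end-to-end usage sketch (the cfg path and image size are assumptions):

```python
import torch

model = Darknet('cfg/yolov3.cfg', img_size=(416, 416))
model.eval()  # inference mode, so forward returns decoded predictions
img = torch.zeros(1, 3, 416, 416)
with torch.no_grad():
    pred, p = model(img)  # decoded predictions + raw per-layer outputs
print(pred.shape)  # torch.Size([1, 10647, 85]): 3 * (13*13 + 26*26 + 52*52) anchors
```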
6.3 fuse and info
- fuse fuses the Conv2d and BatchNorm2d layers in the model
- info prints information about the model
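The actual fusion helper lives in torch_utils; as a minimal sketch of the underlying idea (folding BN's affine transform into the convolution's weights and bias), assuming a Conv2d followed directly by a BatchNorm2d:

```python
import torch
import torch.nn as nn

def fuse_conv_and_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      kernel_size=conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, groups=conv.groups, bias=True)
    # w_fused = diag(gamma / sqrt(var + eps)) @ w_conv
    w_bn = torch.diag(bn.weight / torch.sqrt(bn.running_var + bn.eps))
    fused.weight.data = torch.mm(w_bn, conv.weight.view(conv.out_channels, -1)).view(fused.weight.shape)
    # b_fused = gamma / sqrt(var + eps) * (b_conv - mean) + beta
    b_conv = torch.zeros(conv.out_channels) if conv.bias is None else conv.bias
    fused.bias.data = torch.mv(w_bn, b_conv - bn.running_mean) + bn.bias
    return fused
```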
7 other functions
- load_darknet_weights loads the model's weights from a darknet .weights file
- save_weights saves the weights in darknet format
- convert converts weight files between the .pt and .weights formats
- attempt_download tries to download a weight file that is not present locally