Reproduce the YOLOv3 data class from scratch

Posted by hyabusa on Sun, 31 Oct 2021 09:45:11 +0100

1 Data organization

(1) coco128 dataset

Here we use the COCO dataset, but the full dataset is too large, so we use the coco128 dataset instead: it contains only 128 pictures, so copying and decompressing it is fast. The coco128 dataset can be downloaded from: https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip
Create a folder named data in the current directory and put the entire coco128 folder into data. The project structure at this point is as follows:

Open data/coco128/labels/train2017/000000000009.txt and you will see its contents.

Each line describes one target: the first number is the category index, the second is the abscissa of the center of the bounding box, the third is the ordinate, the fourth is the width of the box, and the fifth is its height. The center coordinates, width, and height are all normalized to [0, 1].
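As a quick illustration (the label values and image size below are made up, not taken from the coco128 labels), one such line can be decoded back into pixel coordinates like this:

# Decode one YOLO-format label line into pixel coordinates.
# The label values and the image size (640 x 480) are made up for illustration.
line = "23 0.770336 0.489695 0.335891 0.697559"
cls, cx, cy, w, h = line.split()
img_w, img_h = 640, 480
box_w, box_h = float(w) * img_w, float(h) * img_h     # box size in pixels
x1 = float(cx) * img_w - box_w / 2                    # left edge in pixels
y1 = float(cy) * img_h - box_h / 2                    # top edge in pixels
print(int(cls), x1, y1, box_w, box_h)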

In data/coco128/images/train2017 you may see a file named ".DS_Store". Delete this file so it is not mistaken for a sample.
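If you prefer to do the deletion in code, a minimal sketch looks like this (the path is an assumption; adjust it to your own layout):

import os

ds_store = r"data/coco128/images/train2017/.DS_Store"   # assumed path, adjust as needed
if os.path.exists(ds_store):
    os.remove(ds_store)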

(2) Split a validation set out of the training set

Create a new Python script named partition_dataset.py under data/coco128. The idea of the script is: first count how many pictures there are (use os.listdir to collect the picture file names into a list and take its length), then draw 10 random numbers as indexes to pick 10 picture names from that list, write the paths of those 10 pictures into a file named "val_path.txt", and write the paths of the remaining pictures into "train_path.txt". The code is as follows:

import os
import random
random.seed(10)

# Get picture name from picture path
images_path = r"F:\thesis\yolo3_from_scratch\data\coco128\images\train2017"
images_names = os.listdir(images_path)      # Returns the file names in the directory as a list
images_num = len(images_names)

# Randomly obtain 10 numbers as the index of the picture name list, and then obtain the picture name
num_val = 10                # Number of validation samples
idx_img = random.sample(range(0, images_num), num_val)   # Draw num_val non-duplicate random numbers

# Generate the name of the validation set sample to form a list
val_names = [images_names[i] for i in idx_img]
# print(val_names)

# Generate the names of training set samples to form a list
train_names = [images_names[i] for i in range(images_num) if i not in idx_img]
# print(train_names)

# Write the path of the training set sample to the txt file
with open(r"F:\thesis\yolo3_from_scratch\data\coco128\train_path.txt", 'w') as f:
    for file_name in train_names:
        f.write(os.path.join(images_path, file_name)+'\n')

# Write the path of the validation set sample to the txt file
with open(r"F:\thesis\yolo3_from_scratch\data\coco128\val_path.txt", 'w') as f:
    for file_name in val_names:
        f.write(os.path.join(images_path, file_name)+'\n')

After the program is executed, the project structure is:

The train_path.txt file looks like this:

And val_path.txt looks like this:

At this point, the data organization is completed.

2 Creating the dataset class

In the utils directory, create a new file named datasets.py. At this time, the project structure becomes

In datasets.py, first import the required modules:

import os
from PIL import Image
import torch
from torch.utils.data import Dataset
import torchvision.transforms as transforms
import numpy as np
import random
random.seed(0)

Of course, these modules alone are not enough; the rest will be added gradually as we build the class.

In datasets.py, create a dataset class that inherits from the Dataset class in torch.utils.data. A custom Dataset class must implement three functions: __init__, __len__, and __getitem__, which respectively initialize the class, compute the length len(obj), and fetch a single sample and its label by index.
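Before writing the real class, here is a minimal sketch of the pattern (a toy example, not the class we will build):

from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, samples):
        self.samples = samples      # any indexable collection

    def __len__(self):
        return len(self.samples)    # enables len(dataset)

    def __getitem__(self, index):
        return self.samples[index]  # enables dataset[index]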

The initialization function is straightforward. Its code is as follows:

class ListDataset(Dataset):
    def __init__(self, list_path, img_size=416, augment=True, multiscale=True, normalized_labels=True):
        '''
        :param list_path: a txt file, such as the train_path.txt or val_path.txt we wrote earlier
        :param img_size: the height/width the picture will be resized to
        :param augment: whether to use data augmentation
        :param multiscale: whether to perform multi-scale transformation (see self.collate_fn to understand its role)
        :param normalized_labels: whether the labels are already normalized, i.e. whether the bounding box center coordinates, height and width are in [0, 1]
        '''
        with open(list_path, "r") as file:
            self.img_files = file.readlines()   # Read the contents of txt file and read out the sample path

        # The path of the label can be obtained according to the path of the sample. Just change the images in the path name to labels and the suffix to txt
        self.label_files = [
            path.replace("images", "labels").replace(".png", ".txt").replace(".jpg", ".txt")
            for path in self.img_files
        ]
        self.img_size = img_size    # The height and width of the image after being processed into a square (the image should be processed into a square before being input into the model)
        self.max_objects = 100      # Maximum number of targets in a picture
        self.augment = augment
        self.multiscale = multiscale
        self.normalized_labels = normalized_labels
        self.min_size = self.img_size - 3 * 32  # Minimum scale in multi-scale transformation
        self.max_size = self.img_size + 3 * 32  # Maximum scale in multi-scale transformation
        self.batch_count = 0                    # Counts how many batches have been traversed
        # TODO: what is max_objects for?

The length function is also relatively simple, and the code is as follows:

    def __len__(self):
        return len(self.img_files)

The trickier part is the __getitem__ function. It must return the picture and label for a given index, and both must be made into tensors before the function returns. It also has to account for whether the labels are normalized, so it is somewhat involved.

Let's deal with the picture first. The code is as follows:

    def __getitem__(self, index):
        img_path = self.img_files[index % len(self.img_files)].rstrip()     # Gets the pathname of the picture

        # Extract image as PyTorch tensor
        img = transforms.ToTensor()(Image.open(img_path).convert('RGB'))    # Read the picture and convert it to torch tensor
        # Image.open(img_path) reads the image and returns an Image object, which is not an ordinary array
        # convert('RGB') performs channel conversion, because when the image is in RGBA format, Image.open reads it as RGBA

        # Handle images with less than three channels
        if len(img.shape) != 3:
            img = img.unsqueeze(0)
            img = img.expand((3, *img.shape[1:]))
        # The image may be a grayscale image, in which case img.shape is (h, w)
        # After unsqueeze(0), img.shape is (1, h, w)
        # img.expand((3, *img.shape[1:])) is img.expand((3, h, w)), repeating the single channel three times

        _, h, w = img.shape
        h_factor, w_factor = (h, w) if self.normalized_labels else (1, 1)
        # h_factor and w_factor are used later to recover the target's absolute coordinates in the picture
        # If the labels are normalized, the scale factors are the true height and width of the image
        # If they are not normalized, the scale factors are 1

        # Pad to square resolution
        img, pad = pad_to_square(img, 0)        
        _, padded_h, padded_w = img.shape
        

Here we meet the pad_to_square function, which uses the letterbox algorithm to pad the picture into a square. We can add the pad_to_square code after the ListDataset class:

import torch.nn.functional as F
def pad_to_square(img, pad_value):
    """
    This function pads the picture into a square
    :param img: picture tensor
    :param pad_value: the value used for padding, i.e. the bars added on the left/right or top/bottom
    :return:
    """
    c, h, w = img.shape
    dim_diff = np.abs(h - w)
    # (upper / left) padding and (lower / right) padding
    pad1, pad2 = dim_diff // 2, dim_diff - dim_diff // 2
    
    # Determine padding 
    pad = (0, 0, pad1, pad2) if h <= w else (pad1, pad2, 0, 0)
    # In both (0, 0, pad1, pad2) and (pad1, pad2, 0, 0), the four values mean left, right, top, bottom
    # If h is less than w, the padding goes on the top and bottom; otherwise on the left and right
    # F.pad is used below; its second parameter pad must be a tuple of four numbers
    
    # Add padding
    img = F.pad(img, pad, "constant", value=pad_value)

    return img, pad
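A quick sanity check of pad_to_square (the shape below is made up for illustration): a landscape image is padded on the top and bottom into a square.

import torch

img = torch.rand(3, 375, 500)       # a landscape image: h < w
padded, pad = pad_to_square(img, 0)
print(padded.shape)                 # torch.Size([3, 500, 500])
print(pad)                          # (0, 0, 62, 63): left, right, top, bottom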

Let's go back to the __getitem__ function of the ListDataset class. Since the picture has been letterboxed, the corresponding labels must be transformed as well. The code continues:

        label_path = self.label_files[index % len(self.img_files)].rstrip() # Get label path

        targets = None
        if os.path.exists(label_path):
            f = open(label_path, 'r')
            if f.readlines() != []:
                # Some pictures have no target, but have label files. There is no content in these label files
                # We only deal with tag files with content. For tag files without content, let targets equal None

                boxes = torch.from_numpy(np.loadtxt(label_path).reshape(-1, 5))

                # Extract coordinates for unpadded + unscaled image
                # Get the real coordinates of the bbox's upper-left and lower-right corners in the original picture
                x1 = w_factor * (boxes[:, 1] - boxes[:, 3] / 2)
                y1 = h_factor * (boxes[:, 2] - boxes[:, 4] / 2)
                x2 = w_factor * (boxes[:, 1] + boxes[:, 3] / 2)
                y2 = h_factor * (boxes[:, 2] + boxes[:, 4] / 2)

                # Adjust for added padding
                # Since the image was padded into a square, the coordinates must be shifted by the padding;
                # both corners move by the same offset: pad[0] is the left pad, pad[2] the top pad
                x1 += pad[0]
                y1 += pad[2]
                x2 += pad[0]
                y2 += pad[2]

                # Returns (x, y, w, h)
                # Calculate the normalized center point coordinates and height and width
                boxes[:, 1] = ((x1 + x2) / 2) / padded_w
                boxes[:, 2] = ((y1 + y2) / 2) / padded_h
                boxes[:, 3] *= w_factor / padded_w
                boxes[:, 4] *= h_factor / padded_h

                targets = torch.zeros((len(boxes), 6))
                targets[:, 1:] = boxes          # Columns 1 to 5 hold the class index and the bbox position and size
                # Column 0 of targets, as the collate_fn below shows, is the index of the picture within the batch

                # Apply augmentations
                # Random horizontal flip
                if self.augment:
                    if np.random.random() < 0.5:
                        img, targets = horisontal_flip(img, targets)

            f.close()

        return img_path, img, targets
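To see why the padding adjustment works, here is a hand-worked example with made-up numbers: a 500x375 (w x h) picture containing one normalized box.

# Made-up example: image w=500, h=375, one box (cx, cy, bw, bh) = (0.5, 0.5, 0.2, 0.4)
w_factor, h_factor = 500, 375           # labels are normalized, so the factors equal the image size
x1 = w_factor * (0.5 - 0.2 / 2)         # 200.0
y1 = h_factor * (0.5 - 0.4 / 2)         # 112.5
x2 = w_factor * (0.5 + 0.2 / 2)         # 300.0
y2 = h_factor * (0.5 + 0.4 / 2)         # 262.5
pad = (0, 0, 62, 63)                    # from pad_to_square: h < w, so pad top/bottom
y1, y2 = y1 + pad[2], y2 + pad[2]       # 174.5, 324.5: shifted down by the top pad
padded_w = padded_h = 500
cx = ((x1 + x2) / 2) / padded_w         # 0.5   (unchanged: no horizontal padding)
cy = ((y1 + y2) / 2) / padded_h         # 0.499 (the center moved down and was renormalized)
bh = 0.4 * h_factor / padded_h          # 0.3   (the box is a smaller fraction of the taller image)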

In the code above, the horisontal_flip function appears; it is a self-defined horizontal flip used as a data augmentation. In the utils directory, create a new script named augmentations.py with the following code:

import torch
def horisontal_flip(images, targets):
    """
    Flip horizontally
    :param images: picture tensor
    :param targets: labels
    :return:
    """
    images = torch.flip(images, [-1])   # Flip along the given dimension; -1 is the last dimension, i.e. width
    targets[:, 2] = 1 - targets[:, 2]   # The center abscissa must be mirrored too; since it is normalized, subtracting it from 1 suffices
    return images, targets
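A quick sanity check with made-up tensors: the mirrored center abscissa is 1 minus the original value.

import torch

imgs = torch.rand(3, 416, 416)
targets = torch.tensor([[0., 23., 0.3, 0.6, 0.1, 0.2]])   # batch_idx, class, cx, cy, w, h
f_imgs, f_targets = horisontal_flip(imgs.clone(), targets.clone())
print(f_targets[0, 2])      # tensor(0.7000): the center abscissa is mirrored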

Remember to add the following import to datasets.py:

from utils.augmentations import horisontal_flip

At this time, the project structure is:

Now we can write a test script named datasets_test.py under /yolo3_from_scratch:

# coding=utf-8
import torch
from utils.datasets import ListDataset

train_path = r"F:\thesis\yolo3_from_scratch\data\coco128\train_path.txt"
dataset = ListDataset(train_path, augment=True, multiscale=True)

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True
)

for batch_i, (paths, imgs, targets) in enumerate(dataloader):
    break   # fetching a single batch is enough to trigger the problem

Running this fails: the default collate function raises a RuntimeError because it tries to stack image tensors of unequal sizes.

This happens because the two pictures in a batch differ in size: letterboxing makes each picture square, but the squares themselves can have different side lengths, so they cannot be stacked. We need to add a collate_fn method to the ListDataset class that rescales the pictures of a batch to a common size. For background on collate_fn, see this article:
https://blog.csdn.net/qq_43391414/article/details/120462055

    def collate_fn(self, batch):
        """
        Used to organize data
        :param batch: several times__getitem__A list of the contents returned by the function
                    If say batch_size Is 2, then batch It's a list of two elements,
                    Each element represents once__getitem__Return result of function
                    __getitem__The return value of a function consists of three parts: img_path, img, targets
                    that batch Each element of is a tuple containing img_path, img, targets
        :return:
        """
        paths, imgs, targets = list(zip(*batch))    # the * unpacks batch, so zip effectively transposes it
        # After this line, paths, imgs and targets are tuples;
        # paths, for example, becomes a tuple of the picture paths in the batch

        # Add sample index: write each picture's index within the batch into column 0 of its targets
        for i, boxes in enumerate(targets):
            if boxes is not None:
                boxes[:, 0] = i     # i represents the ith picture in the current batch

        # Remove empty placeholders: pictures without targets have label None
        targets = [boxes for boxes in targets if boxes is not None]  # Keep only the labels that are not None

        targets = torch.cat(targets, 0)     # Concatenate the labels; before this line, targets is a list of tensors

        # Select a new image size every tenth batch
        if self.multiscale and self.batch_count % 10 == 0:  # Every 10 batches, pick a new scale at random
            self.img_size = random.choice(range(self.min_size, self.max_size + 1, 32))
            # The step of 32 in the range call guarantees the randomly chosen size is a multiple of 32,
            # because the backbone downsamples by a factor of 32 and other sizes would not divide evenly

        # Rescale the pictures to the chosen size
        imgs = torch.stack([resize(img, self.img_size) for img in imgs])
        self.batch_count += 1

        # After the images are rescaled, the labels need no change, because they are normalized

        return paths, imgs, targets
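With the default img_size of 416, the candidate sizes that random.choice draws from are easy to enumerate:

img_size = 416
min_size, max_size = img_size - 3 * 32, img_size + 3 * 32
print(list(range(min_size, max_size + 1, 32)))
# [320, 352, 384, 416, 448, 480, 512] -- all multiples of 32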

Two things in collate_fn deserve a closer look: list(zip(*batch)) and resize. First, list(zip(*batch)); you can understand what it does from the following snippet:

# This code has nothing to do with YOLOv3. It just explains what list(zip(*batch)) implements
a = ('a', 25, 1)
b = ('b', 43, 0)
L = [a, b]
m = zip(*L)     # zip object
print(m)
print(list(m))

Output:

<zip object at 0x00000000026828C0>
[('a', 'b'), (25, 43), (1, 0)]

The resize function is also used in collate_fn. Add it to datasets.py:

def resize(image, size):
    """
    Rescales the picture to the specified size
    :param image: picture tensor
    :param size: the target size; both height and width become this value, i.e. the function rescales a square image
    :return:
    """
    image = F.interpolate(image.unsqueeze(0), size=size, mode="nearest").squeeze(0)
    return image
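For example (the shape is made up), a padded 500x500 square is rescaled to the chosen multiscale size:

import torch

img = torch.rand(3, 500, 500)       # a padded square image
print(resize(img, 416).shape)       # torch.Size([3, 416, 416])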

Slightly modify the test script; the modified datasets_test.py is as follows:

# coding=utf-8
import torch
from utils.datasets import ListDataset

train_path = r"F:\thesis\PyTorch-YOLOv3\data\coco\trainvalno5k.txt"
dataset = ListDataset(train_path, augment=True, multiscale=True)

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=dataset.collate_fn,
)

for batch_i, (_, imgs, targets) in enumerate(dataloader):
    print(batch_i)
    print("imgs.shape:", imgs.shape)
    print("targets.shape:", targets.shape)
    break

Output:

0
imgs.shape: torch.Size([2, 3, 384, 384])
targets.shape: torch.Size([6, 6])
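Here imgs.shape is [2, 3, 384, 384] because the first batch always hits the multiscale branch (batch_count % 10 == 0) and 384 happened to be the randomly chosen size. targets has one row per bounding box across the whole batch (6 boxes here), and its 6 columns are the batch-internal image index, the class index, and the normalized cx, cy, w, h.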

In addition, to check how the dataset class handles label files with no content, we can write another test script to verify that pictures with empty label files are processed correctly when data is loaded in batches.

# Test: with batch_size 5, does the 18th batch's targets contain a row whose image index is 1?
# Index 1 would be the second picture of the batch, i.e. 000000000508.jpg, which contains no target
# 000000000508.jpg is on line 87 of train_path.txt, so its dataset index is 86 and it falls in the 18th batch
# Check the image indexes in the first few rows of the returned targets

import torch
from utils.datasets import ListDataset

train_path = r"F:\thesis\yolo3_from_scratch\data\coco128\train_path.txt"
dataset = ListDataset(train_path, augment=True, multiscale=True)
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=5,
    shuffle=False,              # Do not shuffle here; the test relies on the sample order
    collate_fn=dataset.collate_fn,
)

for batch_i, (imgs_path, imgs, targets) in enumerate(dataloader):
    if batch_i < 17:
        continue

    print(targets[:3, :])   # Show only the first three rows; the picture with index 85 has only one target
    # If the second row starts with 2, empty label files are handled correctly; if it starts with 1, something is wrong
    break

# The output shows that the second row of targets starts with 2, meaning 000000000508.txt is handled properly.


Open the train_path.txt file created when dividing the data, and you can see that 502.jpg, 508.jpg and 510.jpg are adjacent.

At this point, the dataset class is established in its basic form. We will continue to add methods and classes to it as needed.

In the next chapter, we will cover training the YOLOv3 model.

Topics: AI Object Detection yolo