This article is a usage guide for single object tracking (SOT) tasks in MMTracking; follow-up guides are also on the way.
Contents of this article
Introduction to SOT tasks
Introduction to SOT datasets
Supported SOT algorithms and datasets
Getting started guide
SiameseRPN++ implementation parsing
1. Introduction to SOT tasks
SOT focuses more on human-computer interaction: given a target of arbitrary category and shape, typically specified by a bbox in the first frame, the algorithm needs to track it continuously through the video.
For example, how do we keep track of a kid who "rides off on his beloved motorcycle" to skip class?
2. Introduction to SOT datasets
At present, the mainstream datasets in the SOT field are OTB100, VOT2018, VOT2020, UAV123, TrackingNet, LaSOT and GOT-10k.
Except for the VOT-series datasets, which use the VOT evaluation protocol, the other datasets generally use the OPE (One Pass Evaluation) protocol. The main metrics of the VOT protocol are EAO, Accuracy and Robustness, while the main metrics of the OPE protocol are Success, Norm Precision and Precision.
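To make the OPE metrics concrete, below is a minimal sketch of how Success and Precision can be computed from per-frame predicted and ground-truth boxes (the official benchmark toolkits differ in details such as threshold steps and the handling of invalid frames):

import numpy as np

def iou(box1, box2):
    # Boxes in [x1, y1, x2, y2] format.
    x1, y1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    x2, y2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

def ope_success_and_precision(pred_bboxes, gt_bboxes):
    """Success = AUC of the IoU-threshold curve; Precision = fraction of
    frames whose center error is within 20 pixels."""
    ious = np.array([iou(p, g) for p, g in zip(pred_bboxes, gt_bboxes)])
    thresholds = np.linspace(0, 1, 21)
    success = np.mean([(ious > t).mean() for t in thresholds])

    pred_centers = np.array([[(p[0] + p[2]) / 2, (p[1] + p[3]) / 2] for p in pred_bboxes])
    gt_centers = np.array([[(g[0] + g[2]) / 2, (g[1] + g[3]) / 2] for g in gt_bboxes])
    dists = np.linalg.norm(pred_centers - gt_centers, axis=1)
    precision = (dists <= 20).mean()
    return success, precision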
3. Supported SOT algorithms and datasets
MMTracking currently supports the following SOT algorithms:
- SiameseRPN++ (CVPR 2019)
Link: https://arxiv.org/abs/1812.11703
For SOT, MMTracking currently supports the OTB100, UAV123, TrackingNet and LaSOT datasets.
4. Getting Started Guide
Next, this article explains in detail how to run the SOT demo, test an SOT model and train an SOT model in MMTracking.
To use MMTracking, you only need to clone the repository from GitHub and configure the environment according to the installation manual. If you run into any problems during installation, feel free to open an issue in MMTracking and we will answer it as soon as possible.
Installation manual:
https://github.com/open-mmlab/mmtracking/blob/master/docs/install.md
It is assumed below that the pre-trained weights have been placed in the checkpoints/ folder under the MMTracking root directory (they can be downloaded from the corresponding config pages).
Run SOT demo
In the MMTracking root directory, just execute the following command to run the SOT demo with the SiameseRPN++ algorithm.
python ./demo/demo_sot.py \
    ./configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py \
    --input ${VIDEO_FILE} \
    --checkpoint checkpoints/siamese_rpn_r50_1x_lasot_20201218_051019-3c522eff.pth \
    --output ${OUTPUT} \
    --show
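Besides the command-line demo, the same model can be driven from Python. The sketch below assumes the init_model and inference_sot helpers exported by mmtrack.apis (please check them against your installed version); the video path and the initial bbox are placeholders.

import mmcv
from mmtrack.apis import inference_sot, init_model

config_file = './configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py'
checkpoint_file = 'checkpoints/siamese_rpn_r50_1x_lasot_20201218_051019-3c522eff.pth'
model = init_model(config_file, checkpoint_file, device='cuda:0')

# Hypothetical input video and initial target bbox in [x1, y1, x2, y2] format.
video = mmcv.VideoReader('demo.mp4')
init_bbox = [371, 411, 450, 646]

for frame_id, frame in enumerate(video):
    result = inference_sot(model, frame, init_bbox, frame_id=frame_id)
    # `result` holds the tracked bbox for this frame; the exact key name
    # depends on the MMTracking version, so check demo/demo_sot.py.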
Test SOT model
Use the following command in the MMTracking root directory to test the SOT model on the LaSOT dataset and evaluate it with the OPE protocol.
./tools/dist_test.sh \
    configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py 8 \
    --checkpoint checkpoints/siamese_rpn_r50_1x_lasot_20201218_051019-3c522eff.pth \
    --out results.pkl \
    --eval track
Train SOT model
Use the following command in the MMTracking root directory to train the SOT model; the model is evaluated with the OPE protocol from the 10th epoch to the 20th epoch (this is controlled by the config, as sketched below).
bash ./tools/dist_train.sh \
    ./configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py 8 \
    --work-dir ./work_dirs/
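The epoch range for evaluation is not set on the command line but in the evaluation field of the config. A rough sketch of that field (the field names follow the MMCV evaluation hook; the exact values shipped in the config may differ):

# Evaluation settings (sketch): evaluate the 'track' metric once per epoch,
# starting from epoch 10 and continuing until the last (20th) epoch.
evaluation = dict(metric=['track'], interval=1, start=10)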
In fact, MMTracking already supports many SOT models and provides public checkpoints for everyone to use. More details are given in the quick start tutorial.
Quick start tutorial:
https://mmtracking.readthedocs.io/en/latest/quick_run.html
5. SiameseRPN++ implementation parsing
The steps above have covered how to run the SOT algorithm. Next, we introduce the implementation of SiameseRPN++ in MMTracking.
Configuration file of SiameseRPN++
model = dict(
    type='SiamRPN',
    backbone=dict(type='SOTResNet'),
    neck=dict(type='ChannelMapper'),
    head=dict(type='SiameseRPNHead'),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(type='MaxIoUAssigner'),
            sampler=dict(type='RandomSampler'),
            num_neg=16,
            exemplar_size=127,
            search_size=255)),
    test_cfg=dict(
        exemplar_size=127,
        search_size=255,
        context_amount=0.5,
        center_size=7,
        rpn=dict(penalty_k=0.05, window_influence=0.42, lr=0.38)))
The configuration file of SiameseRPN++ is shown above. As it shows, SiameseRPN++ consists of five parts (a short sketch of building the model from this config follows the list):
(1) backbone: ResNet-50, used to extract the image feature maps;
(2) neck: ChannelMapper (a few conv layers), used to unify the number of channels of the feature maps from different levels of the ResNet;
(3) head: SiameseRPNHead, used to track the target across adjacent frames;
(4) train_cfg: hyperparameters used when training SiameseRPN++. The assigner assigns positive and negative samples based on IoU, and the sampler samples positive and negative samples according to the assigner's results; num_neg is the number of negative samples to draw;
(5) test_cfg: hyperparameters used when testing SiameseRPN++. The three hyperparameters in rpn have a large impact on performance, so different values generally need to be chosen for different datasets.
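To see how these five parts are assembled, the model can be built directly from the config. A minimal sketch using MMCV's Config and MMTracking's model builder (the name build_model is taken from mmtrack.models; double-check it against your installed version):

from mmcv import Config
from mmtrack.models import build_model

cfg = Config.fromfile('configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py')
# The five parts listed above map one-to-one onto cfg.model:
# backbone, neck, head, train_cfg and test_cfg.
model = build_model(cfg.model)
print(type(model).__name__)           # SiamRPN
print(type(model.backbone).__name__)  # SOTResNet
print(type(model.head).__name__)      # SiameseRPNHead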
Analysis of SiameseRPN++ Head
Since the head is the core of the SiameseRPN++ algorithm, this article walks through the source code of two parts: the forward pass of the head, and the step after the forward pass that produces the final tracking bbox.
The forward part is relatively simple and mainly consists of the following five steps:
Step 1: compute the weighting coefficients for the score maps of different levels.
Step 2: for a given level, apply the correlation head to the template features and the search features to obtain the score map (a sketch of the underlying correlation operation is given below).
Step 3: for a given level, apply the correlation head to the template features and the search features to obtain the regression (bbox) map.
Step 4: aggregate the score maps of different levels using the weighting coefficients from step 1.
Step 5: aggregate the regression maps of different levels using the weighting coefficients from step 1.
These five steps are pasted as comments at the corresponding positions of the forward function in the code below, to make the logic easier to follow.
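Steps 2 and 3 both rely on the correlation head, whose core operation is a depth-wise cross correlation between template and search features. Below is a simplified, self-contained sketch of that operation (the real CorrelationHead additionally wraps it with conv layers):

import torch
import torch.nn.functional as F

def depthwise_correlation(search, template):
    """Depth-wise cross correlation: each channel of `template` (N, C, Hk, Wk)
    is slid over the corresponding channel of `search` (N, C, Hx, Wx)."""
    n, c = template.shape[:2]
    search = search.view(1, n * c, search.size(2), search.size(3))
    kernel = template.view(n * c, 1, template.size(2), template.size(3))
    out = F.conv2d(search, kernel, groups=n * c)
    return out.view(n, c, out.size(2), out.size(3))

# Toy shapes roughly matching the docstrings below: 7x7 template, 31x31 search.
z = torch.randn(1, 256, 7, 7)
x = torch.randn(1, 256, 31, 31)
print(depthwise_correlation(x, z).shape)  # torch.Size([1, 256, 25, 25])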
The part that obtains the final tracking bbox is more complex and includes some post-processing, consisting mainly of the following 10 steps:
Step 1: generate the anchors.
Step 2: generate a 2D Hanning window used to penalize the prediction scores.
Step 3: obtain the score and the predicted bbox corresponding to each anchor.
Step 4: compute the scale penalty of the predicted bboxes.
Step 5: compute the aspect-ratio penalty of the predicted bboxes.
Step 6: penalize the predicted scores with the penalties from steps 4 and 5.
Step 7: penalize the predicted scores with the Hanning window from step 2 (a compact restatement of steps 4 to 7 follows this list).
Step 8: select the predicted bbox with the highest penalized score as the tracking bbox.
Step 9: transform the coordinates of the tracking bbox from the search image back to the original image.
Step 10: smooth the current tracking bbox with the tracking bbox of the previous frame to obtain the final result.
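Before diving into the full code, here is a compact, runnable restatement of steps 4 to 7 on toy values (the tensor names mirror the code below; k and w stand for penalty_k and window_influence from test_cfg; all numbers are made up for illustration):

import torch

def change_ratio(r):
    # Penalties are symmetric: shrinking and growing are punished equally.
    return torch.max(r, 1. / r)

# Toy values, made up for illustration.
k, w = 0.05, 0.42                                              # penalty_k, window_influence
cls_score = torch.tensor([0.9, 0.8])                           # scores of two candidate boxes
scale_penalty = change_ratio(torch.tensor([1.1, 2.0]))         # step 4
aspect_ratio_penalty = change_ratio(torch.tensor([1.0, 1.5]))  # step 5
hanning_window = torch.tensor([0.7, 0.2])                      # window value at each box

# Step 6: penalize boxes whose size/ratio drifts away from the previous bbox.
penalty = torch.exp(-(aspect_ratio_penalty * scale_penalty - 1) * k)
penalty_score = penalty * cls_score
# Step 7: blend with the Hanning window to favor boxes near the previous center.
penalty_score = penalty_score * (1 - w) + hanning_window * w
best_idx = torch.argmax(penalty_score)                         # step 8 picks the winner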
These 10 steps are pasted as comments at the corresponding positions of the get_bbox function, as shown below, to make the logic easier to follow. Since the implementation of get_bbox is somewhat involved, readers are encouraged to read the source code alongside this walkthrough.
@HEADS.register_module()
class SiameseRPNHead(BaseModule):
    """Siamese RPN head.

    This module is proposed in "SiamRPN++: Evolution of Siamese Visual
    Tracking with Very Deep Networks".
    `SiamRPN++ <https://arxiv.org/abs/1812.11703>`_.

    Args:
        anchor_generator (dict): Configuration to build anchor generator
            module.
        in_channels (int): Input channels.
        kernel_size (int): Kernel size of convs. Defaults to 3.
        norm_cfg (dict): Configuration of normalization method after each
            conv. Defaults to dict(type='BN').
        weighted_sum (bool): If True, use learnable weights to weightedly sum
            the output of multi heads in siamese rpn, otherwise, use
            averaging. Defaults to False.
        bbox_coder (dict): Configuration to build bbox coder. Defaults to
            dict(type='DeltaXYWHBBoxCoder', target_means=[0., 0., 0., 0.],
            target_stds=[1., 1., 1., 1.]).
        loss_cls (dict): Configuration to build classification loss. Defaults
            to dict(type='CrossEntropyLoss', reduction='sum',
            loss_weight=1.0).
        loss_bbox (dict): Configuration to build bbox regression loss.
            Defaults to dict(type='L1Loss', reduction='sum', loss_weight=1.2).
        train_cfg (Dict): Training setting. Defaults to None.
        test_cfg (Dict): Testing setting. Defaults to None.
        init_cfg (dict or list[dict], optional): Initialization config dict.
            Defaults to None.
    """

    def __init__(self,
                 anchor_generator,
                 in_channels,
                 kernel_size=3,
                 norm_cfg=dict(type='BN'),
                 weighted_sum=False,
                 bbox_coder=dict(
                     type='DeltaXYWHBBoxCoder',
                     target_means=[0., 0., 0., 0.],
                     target_stds=[1., 1., 1., 1.]),
                 loss_cls=dict(
                     type='CrossEntropyLoss', reduction='sum',
                     loss_weight=1.0),
                 loss_bbox=dict(
                     type='L1Loss', reduction='sum', loss_weight=1.2),
                 train_cfg=None,
                 test_cfg=None,
                 init_cfg=None,
                 *args,
                 **kwargs):
        super(SiameseRPNHead, self).__init__(init_cfg)
        self.anchor_generator = build_prior_generator(anchor_generator)
        self.bbox_coder = build_bbox_coder(bbox_coder)
        self.train_cfg = train_cfg
        self.test_cfg = test_cfg
        self.assigner = build_assigner(self.train_cfg.assigner)
        self.sampler = build_sampler(self.train_cfg.sampler)
        self.fp16_enabled = False

        self.cls_heads = nn.ModuleList()
        self.reg_heads = nn.ModuleList()
        for i in range(len(in_channels)):
            self.cls_heads.append(
                CorrelationHead(in_channels[i], in_channels[i],
                                2 * self.anchor_generator.num_base_anchors[0],
                                kernel_size, norm_cfg))
            self.reg_heads.append(
                CorrelationHead(in_channels[i], in_channels[i],
                                4 * self.anchor_generator.num_base_anchors[0],
                                kernel_size, norm_cfg))

        self.weighted_sum = weighted_sum
        if self.weighted_sum:
            self.cls_weight = nn.Parameter(torch.ones(len(in_channels)))
            self.reg_weight = nn.Parameter(torch.ones(len(in_channels)))

        self.loss_cls = build_loss(loss_cls)
        self.loss_bbox = build_loss(loss_bbox)

    @auto_fp16()
    def forward(self, z_feats, x_feats):
        """Forward with features `z_feats` of exemplar images and features
        `x_feats` of search images.

        Args:
            z_feats (tuple[Tensor]): Tuple of Tensor with shape (N, C, H, W)
                denoting the multi level feature maps of exemplar images.
                Typically H and W equal to 7.
            x_feats (tuple[Tensor]): Tuple of Tensor with shape (N, C, H, W)
                denoting the multi level feature maps of search images.
                Typically H and W equal to 31.

        Returns:
            tuple(cls_score, bbox_pred): cls_score is a Tensor with shape
                (N, 2 * num_base_anchors, H, W), bbox_pred is a Tensor with
                shape (N, 4 * num_base_anchors, H, W). Typically H and W
                equal to 25.
        """
        assert isinstance(z_feats, tuple) and isinstance(x_feats, tuple)
        assert len(z_feats) == len(x_feats) and len(z_feats) == len(
            self.cls_heads)

        # Step 1: compute the weighting coefficients for the score maps of
        # different levels.
        if self.weighted_sum:
            cls_weight = nn.functional.softmax(self.cls_weight, dim=0)
            reg_weight = nn.functional.softmax(self.reg_weight, dim=0)
        else:
            reg_weight = cls_weight = [
                1.0 / len(z_feats) for i in range(len(z_feats))
            ]

        cls_score = 0
        bbox_pred = 0
        for i in range(len(z_feats)):
            # Step 2: apply the correlation head to the template and search
            # features of one level to obtain the score map.
            cls_score_single = self.cls_heads[i](z_feats[i], x_feats[i])
            # Step 3: apply the correlation head to the template and search
            # features of one level to obtain the regression (bbox) map.
            bbox_pred_single = self.reg_heads[i](z_feats[i], x_feats[i])
            # Step 4: aggregate the score maps of different levels with the
            # weighting coefficients from step 1.
            cls_score += cls_weight[i] * cls_score_single
            # Step 5: aggregate the regression maps of different levels with
            # the weighting coefficients from step 1.
            bbox_pred += reg_weight[i] * bbox_pred_single

        return cls_score, bbox_pred

    @force_fp32(apply_to=('cls_score', 'bbox_pred'))
    def get_bbox(self, cls_score, bbox_pred, prev_bbox, scale_factor):
        """Track `prev_bbox` to current frame based on the output of network.

        Args:
            cls_score (Tensor): of shape (1, 2 * num_base_anchors, H, W).
            bbox_pred (Tensor): of shape (1, 4 * num_base_anchors, H, W).
            prev_bbox (Tensor): of shape (4, ) in [cx, cy, w, h] format.
            scale_factor (Tensor): scale factor.

        Returns:
            tuple(best_score, best_bbox): best_score is a Tensor denoting the
                score of `best_bbox`, best_bbox is a Tensor of shape (4, )
                with [cx, cy, w, h] format, which denotes the best tracked
                bbox in current frame.
        """
        score_maps_size = [(cls_score.shape[2:])]
        # Step 1: generate the anchors.
        if not hasattr(self, 'anchors'):
            self.anchors = self.anchor_generator.grid_priors(
                score_maps_size, cls_score.device)[0]
            # Transform the coordinate origin from the top left corner to the
            # center in the scaled feature map.
            feat_h, feat_w = score_maps_size[0]
            stride_w, stride_h = self.anchor_generator.strides[0]
            self.anchors[:, 0:4:2] -= (feat_w // 2) * stride_w
            self.anchors[:, 1:4:2] -= (feat_h // 2) * stride_h

        # Step 2: generate a 2D Hanning window used to penalize the
        # prediction scores.
        if not hasattr(self, 'windows'):
            self.windows = self.anchor_generator.gen_2d_hanning_windows(
                score_maps_size, cls_score.device)[0]

        # Step 3: obtain the score and the predicted bbox corresponding to
        # each anchor.
        H, W = score_maps_size[0]
        cls_score = cls_score.view(2, -1, H, W)
        cls_score = cls_score.permute(2, 3, 1, 0).contiguous().view(-1, 2)
        cls_score = cls_score.softmax(dim=1)[:, 1]

        bbox_pred = bbox_pred.view(4, -1, H, W)
        bbox_pred = bbox_pred.permute(2, 3, 1, 0).contiguous().view(-1, 4)
        bbox_pred = self.bbox_coder.decode(self.anchors, bbox_pred)
        bbox_pred = bbox_xyxy_to_cxcywh(bbox_pred)

        def change_ratio(ratio):
            return torch.max(ratio, 1. / ratio)

        def enlarge_size(w, h):
            pad = (w + h) * 0.5
            return torch.sqrt((w + pad) * (h + pad))

        # Step 4: compute the scale penalty of the predicted bboxes.
        scale_penalty = change_ratio(
            enlarge_size(bbox_pred[:, 2], bbox_pred[:, 3]) / enlarge_size(
                prev_bbox[2] * scale_factor, prev_bbox[3] * scale_factor))

        # Step 5: compute the aspect-ratio penalty of the predicted bboxes.
        aspect_ratio_penalty = change_ratio(
            (prev_bbox[2] / prev_bbox[3]) /
            (bbox_pred[:, 2] / bbox_pred[:, 3]))

        # Step 6: penalize the predicted scores with the penalties from
        # steps 4 and 5.
        penalty = torch.exp(-(aspect_ratio_penalty * scale_penalty - 1) *
                            self.test_cfg.penalty_k)
        penalty_score = penalty * cls_score

        # Step 7: penalize the predicted scores with the Hanning window from
        # step 2.
        penalty_score = penalty_score * (1 - self.test_cfg.window_influence) \
            + self.windows * self.test_cfg.window_influence

        # Step 8: select the predicted bbox with the highest penalized score
        # as the tracking bbox.
        best_idx = torch.argmax(penalty_score)
        best_score = cls_score[best_idx]
        best_bbox = bbox_pred[best_idx, :] / scale_factor

        final_bbox = torch.zeros_like(best_bbox)
        # Step 9: map the bbox center from the search image back to the
        # original image.
        final_bbox[0] = best_bbox[0] + prev_bbox[0]
        final_bbox[1] = best_bbox[1] + prev_bbox[1]

        # Step 10: smooth the current tracking bbox with the tracking bbox of
        # the previous frame to obtain the final result.
        lr = penalty[best_idx] * cls_score[best_idx] * self.test_cfg.lr
        final_bbox[2] = prev_bbox[2] * (1 - lr) + best_bbox[2] * lr
        final_bbox[3] = prev_bbox[3] * (1 - lr) + best_bbox[3] * lr

        return best_score, final_bbox
Hyperparameter search tool
As mentioned above, the performance of SiameseRPN++ at test time is greatly affected by the three hyperparameters in rpn, and different hyperparameters need to be selected for different datasets. Therefore, we provide a hyperparameter search tool for SiameseRPN++. The search script can be found at $MMTracking/tools/analysis/sot/sot_siamrpn_param_search.py. The following shows how to use it.
Use the following command in the MMTracking root directory to search the hyperparameters based on the OPE evaluation criteria on the UAV123 dataset, and the search results will be saved in the ${LOG_FILENAME} file.
./tools/analysis/sot/dist_sot_siamrpn_param_search.sh \
    [${CONFIG_FILE}] [$GPUS] \
    [--checkpoint ${CHECKPOINT}] \
    [--log ${LOG_FILENAME}] \
    [--eval ${EVAL}] \
    [--penalty-k-range 0.01,0.22,0.05] \
    [--lr-range 0.4,0.61,0.05] \
    [--win-infu-range 0.01,0.22,0.05]
Use the following command in the MMTracking root directory to search the hyperparameters based on the OPE evaluation criteria on the OTB100 dataset, and the search results will be saved in the ${LOG_FILENAME} file.
./tools/analysis/sot/dist_sot_siamrpn_param_search.sh \
    [${CONFIG_FILE}] [$GPUS] \
    [--checkpoint ${CHECKPOINT}] \
    [--log ${LOG_FILENAME}] \
    [--eval ${EVAL}] \
    [--penalty-k-range 0.3,0.45,0.02] \
    [--lr-range 0.35,0.5,0.02] \
    [--win-infu-range 0.46,0.55,0.02]
Please also note that none of the results provided in the MMTracking model zoo were obtained with hyperparameter search.
As a member of the MM series, MMTracking will keep updating and strives to grow into a complete video object perception platform as soon as possible, and the voice of the community helps us better understand your needs. If you encounter any problems, have ideas or suggestions, or would like new datasets, methods or tasks to be supported, you are welcome to leave a comment. And please remember: our repo is your forever home!