This article is a usage guide for single object tracking (SOT) tasks in MMTracking; follow-up guides are also on the way.
Contents of this article
Introduction to SOT tasks
Introduction to SOT datasets
Supported SOT algorithms and datasets
Getting started guide
SiameseRPN++ implementation parsing
1. Introduction to SOT tasks
SOT focuses more on human-computer interaction: given a target of arbitrary category and shape, typically specified by a bbox in the first frame, the algorithm needs to track it continuously through the video.
For example, how do we keep track of a kid who "rides off on his beloved motorcycle" to skip class?
2. Introduction to SOT datasets
At present, the mainstream datasets in the SOT field are OTB100, VOT2018, VOT2020, UAV123, TrackingNet, LaSOT and GOT-10k.
Except for the VOT-series datasets, which use the VOT evaluation protocol, the other datasets generally use the OPE (One Pass Evaluation) protocol. The main metrics of the VOT protocol are EAO, Accuracy and Robustness, while the main metrics of the OPE protocol are Success, Norm Precision and Precision.
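To make the OPE metrics concrete, below is a minimal sketch of how Success and Precision can be computed from per-frame predicted and ground-truth boxes (the official benchmark toolkits differ in details such as threshold steps and the handling of invalid frames):

import numpy as np

def iou(box1, box2):
    # Boxes in [x1, y1, x2, y2] format.
    x1, y1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    x2, y2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)

def ope_success_and_precision(pred_bboxes, gt_bboxes):
    """Success = AUC of the IoU-threshold curve; Precision = fraction of
    frames whose center error is within 20 pixels."""
    ious = np.array([iou(p, g) for p, g in zip(pred_bboxes, gt_bboxes)])
    thresholds = np.linspace(0, 1, 21)
    success = np.mean([(ious > t).mean() for t in thresholds])

    pred_centers = np.array([[(p[0] + p[2]) / 2, (p[1] + p[3]) / 2] for p in pred_bboxes])
    gt_centers = np.array([[(g[0] + g[2]) / 2, (g[1] + g[3]) / 2] for g in gt_bboxes])
    dists = np.linalg.norm(pred_centers - gt_centers, axis=1)
    precision = (dists <= 20).mean()
    return success, precision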
3. Supported SOT algorithms and datasets
MMTracking currently supports the following SOT algorithms:
- SiameseRPN++ (CVPR 2019)
Link: https://arxiv.org/abs/1812.11703
For SOT, MMTracking currently supports the OTB100, UAV123, TrackingNet and LaSOT datasets.
4. Getting Started Guide
Next, this article explains in detail how to run the SOT demo, test an SOT model and train an SOT model in MMTracking.
To use MMTracking, you only need to clone the repository from GitHub and configure the environment according to the installation manual. If you run into any problems during installation, feel free to open an issue in MMTracking and we will answer it as soon as possible.
Installation manual:
https://github.com/open-mmlab/mmtracking/blob/master/docs/install.md
It is assumed below that the pre-trained weights have been placed in the checkpoints/ folder under the MMTracking root directory (they can be downloaded from the corresponding config pages).
Run SOT demo
In the MMTracking root directory, just execute the following command to run the SOT demo with the SiameseRPN++ algorithm.
python ./demo/demo_sot.py \
    ./configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py \
    --input ${VIDEO_FILE} \
    --checkpoint checkpoints/siamese_rpn_r50_1x_lasot_20201218_051019-3c522eff.pth \
    --output ${OUTPUT} \
    --show
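Besides the command-line demo, the same model can be driven from Python. The sketch below assumes the init_model and inference_sot helpers exported by mmtrack.apis (please check them against your installed version); the video path and the initial bbox are placeholders.

import mmcv
from mmtrack.apis import inference_sot, init_model

config_file = './configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py'
checkpoint_file = 'checkpoints/siamese_rpn_r50_1x_lasot_20201218_051019-3c522eff.pth'
model = init_model(config_file, checkpoint_file, device='cuda:0')

# Hypothetical input video and initial target bbox in [x1, y1, x2, y2] format.
video = mmcv.VideoReader('demo.mp4')
init_bbox = [371, 411, 450, 646]

for frame_id, frame in enumerate(video):
    result = inference_sot(model, frame, init_bbox, frame_id=frame_id)
    # `result` holds the tracked bbox for this frame; the exact key name
    # depends on the MMTracking version, so check demo/demo_sot.py.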
Test SOT model
Use the following command in the MMTracking root directory to test the SOT model on the LaSOT dataset and evaluate it with the OPE protocol.
./tools/dist_test.sh \
    configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py 8 \
    --checkpoint checkpoints/siamese_rpn_r50_1x_lasot_20201218_051019-3c522eff.pth \
    --out results.pkl \
    --eval track
Train SOT model
Use the following command in the MMTracking root directory to train the SOT model; the model is evaluated with the OPE protocol from the 10th epoch to the 20th epoch (this is controlled by the config, as sketched below).
bash ./tools/dist_train.sh \
    ./configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py 8 \
    --work-dir ./work_dirs/
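The epoch range for evaluation is not set on the command line but in the evaluation field of the config. A rough sketch of that field (the field names follow the MMCV evaluation hook; the exact values shipped in the config may differ):

# Evaluation settings (sketch): evaluate the 'track' metric once per epoch,
# starting from epoch 10 and continuing until the last (20th) epoch.
evaluation = dict(metric=['track'], interval=1, start=10)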
In fact, MMTracking already supports many SOT models and provides public checkpoints for everyone to use. More details are given in the quick start tutorial.
Quick start tutorial:
https://mmtracking.readthedocs.io/en/latest/quick_run.html
5. SiameseRPN++ implementation parsing
The steps above have covered how to run the SOT algorithm. Next, we introduce the implementation of SiameseRPN++ in MMTracking.
Configuration file of SiameseRPN++
model = dict(
    type='SiamRPN',
    backbone=dict(type='SOTResNet'),
    neck=dict(type='ChannelMapper'),
    head=dict(type='SiameseRPNHead'),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(type='MaxIoUAssigner'),
            sampler=dict(type='RandomSampler'),
            num_neg=16,
            exemplar_size=127,
            search_size=255)),
    test_cfg=dict(
        exemplar_size=127,
        search_size=255,
        context_amount=0.5,
        center_size=7,
        rpn=dict(penalty_k=0.05, window_influence=0.42, lr=0.38)))
The configuration file of SiameseRPN++ is shown above. As it shows, SiameseRPN++ consists of five parts (a short sketch of building the model from this config follows the list):
(1) backbone: ResNet-50, used to extract the image feature maps;
(2) neck: ChannelMapper (a few conv layers), used to unify the number of channels of the feature maps from different levels of the ResNet;
(3) head: SiameseRPNHead, used to track the target across adjacent frames;
(4) train_cfg: hyperparameters used when training SiameseRPN++. The assigner assigns positive and negative samples based on IoU, and the sampler samples positive and negative samples according to the assigner's results; num_neg is the number of negative samples to draw;
(5) test_cfg: hyperparameters used when testing SiameseRPN++. The three hyperparameters in rpn have a large impact on performance, so different values generally need to be chosen for different datasets.
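To see how these five parts are assembled, the model can be built directly from the config. A minimal sketch using MMCV's Config and MMTracking's model builder (the name build_model is taken from mmtrack.models; double-check it against your installed version):

from mmcv import Config
from mmtrack.models import build_model

cfg = Config.fromfile('configs/sot/siamese_rpn/siamese_rpn_r50_1x_lasot.py')
# The five parts listed above map one-to-one onto cfg.model:
# backbone, neck, head, train_cfg and test_cfg.
model = build_model(cfg.model)
print(type(model).__name__)           # SiamRPN
print(type(model.backbone).__name__)  # SOTResNet
print(type(model.head).__name__)      # SiameseRPNHead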
Analysis of SiameseRPN++ Head
Since the head is the core of the SiameseRPN++ algorithm, this article walks through the source code of two parts: the forward pass of the head, and the step after the forward pass that produces the final tracking bbox.
The forward part is relatively simple and mainly consists of the following five steps:
Step 1: compute the weighting coefficients for the score maps of different levels.
Step 2: for a given level, apply the correlation head to the template features and the search features to obtain the score map (a sketch of the underlying correlation operation is given below).
Step 3: for a given level, apply the correlation head to the template features and the search features to obtain the regression (bbox) map.
Step 4: aggregate the score maps of different levels using the weighting coefficients from step 1.
Step 5: aggregate the regression maps of different levels using the weighting coefficients from step 1.
These five steps are pasted as comments at the corresponding positions of the forward function in the code below, to make the logic easier to follow.
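Steps 2 and 3 both rely on the correlation head, whose core operation is a depth-wise cross correlation between template and search features. Below is a simplified, self-contained sketch of that operation (the real CorrelationHead additionally wraps it with conv layers):

import torch
import torch.nn.functional as F

def depthwise_correlation(search, template):
    """Depth-wise cross correlation: each channel of `template` (N, C, Hk, Wk)
    is slid over the corresponding channel of `search` (N, C, Hx, Wx)."""
    n, c = template.shape[:2]
    search = search.view(1, n * c, search.size(2), search.size(3))
    kernel = template.view(n * c, 1, template.size(2), template.size(3))
    out = F.conv2d(search, kernel, groups=n * c)
    return out.view(n, c, out.size(2), out.size(3))

# Toy shapes roughly matching the docstrings below: 7x7 template, 31x31 search.
z = torch.randn(1, 256, 7, 7)
x = torch.randn(1, 256, 31, 31)
print(depthwise_correlation(x, z).shape)  # torch.Size([1, 256, 25, 25])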
The part that obtains the final tracking bbox is more complex and includes some post-processing, consisting mainly of the following 10 steps:
Step 1: generate the anchors.
Step 2: generate a 2D Hanning window used to penalize the prediction scores.
Step 3: obtain the score and the predicted bbox corresponding to each anchor.
Step 4: compute the scale penalty of the predicted bboxes.
Step 5: compute the aspect-ratio penalty of the predicted bboxes.
Step 6: penalize the predicted scores with the penalties from steps 4 and 5.
Step 7: penalize the predicted scores with the Hanning window from step 2 (a compact restatement of steps 4 to 7 follows this list).
Step 8: select the predicted bbox with the highest penalized score as the tracking bbox.
Step 9: transform the coordinates of the tracking bbox from the search image back to the original image.
Step 10: smooth the current tracking bbox with the tracking bbox of the previous frame to obtain the final result.
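Before diving into the full code, here is a compact, runnable restatement of steps 4 to 7 on toy values (the tensor names mirror the code below; k and w stand for penalty_k and window_influence from test_cfg; all numbers are made up for illustration):

import torch

def change_ratio(r):
    # Penalties are symmetric: shrinking and growing are punished equally.
    return torch.max(r, 1. / r)

# Toy values, made up for illustration.
k, w = 0.05, 0.42                                              # penalty_k, window_influence
cls_score = torch.tensor([0.9, 0.8])                           # scores of two candidate boxes
scale_penalty = change_ratio(torch.tensor([1.1, 2.0]))         # step 4
aspect_ratio_penalty = change_ratio(torch.tensor([1.0, 1.5]))  # step 5
hanning_window = torch.tensor([0.7, 0.2])                      # window value at each box

# Step 6: penalize boxes whose size/ratio drifts away from the previous bbox.
penalty = torch.exp(-(aspect_ratio_penalty * scale_penalty - 1) * k)
penalty_score = penalty * cls_score
# Step 7: blend with the Hanning window to favor boxes near the previous center.
penalty_score = penalty_score * (1 - w) + hanning_window * w
best_idx = torch.argmax(penalty_score)                         # step 8 picks the winner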
These 10 steps are pasted as comments at the corresponding positions of the get_bbox function, as shown below, to make the logic easier to follow. Since the implementation of get_bbox is somewhat involved, readers are encouraged to read the source code alongside this walkthrough.
@HEADS.register_module()
class SiameseRPNHead(BaseModule):
    """Siamese RPN head.

    This module is proposed in "SiamRPN++: Evolution of Siamese Visual
    Tracking with Very Deep Networks".
    `SiamRPN++ <https://arxiv.org/abs/1812.11703>`_.

    Args:
        anchor_generator (dict): Configuration to build anchor generator
            module.
        in_channels (int): Input channels.
        kernel_size (int): Kernel size of convs. Defaults to 3.
        norm_cfg (dict): Configuration of normalization method after each
            conv. Defaults to dict(type='BN').
        weighted_sum (bool): If True, use learnable weights to weightedly sum
            the output of multi heads in siamese rpn, otherwise, use
            averaging. Defaults to False.
        bbox_coder (dict): Configuration to build bbox coder. Defaults to
            dict(type='DeltaXYWHBBoxCoder', target_means=[0., 0., 0., 0.],
            target_stds=[1., 1., 1., 1.]).
        loss_cls (dict): Configuration to build classification loss. Defaults
            to dict(type='CrossEntropyLoss', reduction='sum',
            loss_weight=1.0).
        loss_bbox (dict): Configuration to build bbox regression loss.
            Defaults to dict(type='L1Loss', reduction='sum', loss_weight=1.2).
        train_cfg (Dict): Training setting. Defaults to None.
        test_cfg (Dict): Testing setting. Defaults to None.
        init_cfg (dict or list[dict], optional): Initialization config dict.
            Defaults to None.
    """

    def __init__(self,
                 anchor_generator,
                 in_channels,
                 kernel_size=3,
                 norm_cfg=dict(type='BN'),
                 weighted_sum=False,
                 bbox_coder=dict(
                     type='DeltaXYWHBBoxCoder',
                     target_means=[0., 0., 0., 0.],
                     target_stds=[1., 1., 1., 1.]),
                 loss_cls=dict(
                     type='CrossEntropyLoss', reduction='sum',
                     loss_weight=1.0),
                 loss_bbox=dict(
                     type='L1Loss', reduction='sum', loss_weight=1.2),
                 train_cfg=None,
                 test_cfg=None,
                 init_cfg=None,
                 *args,
                 **kwargs):
        super(SiameseRPNHead, self).__init__(init_cfg)
        self.anchor_generator = build_prior_generator(anchor_generator)
        self.bbox_coder = build_bbox_coder(bbox_coder)
        self.train_cfg = train_cfg
        self.test_cfg = test_cfg
        self.assigner = build_assigner(self.train_cfg.assigner)
        self.sampler = build_sampler(self.train_cfg.sampler)
        self.fp16_enabled = False

        self.cls_heads = nn.ModuleList()
        self.reg_heads = nn.ModuleList()
        for i in range(len(in_channels)):
            self.cls_heads.append(
                CorrelationHead(in_channels[i], in_channels[i],
                                2 * self.anchor_generator.num_base_anchors[0],
                                kernel_size, norm_cfg))
            self.reg_heads.append(
                CorrelationHead(in_channels[i], in_channels[i],
                                4 * self.anchor_generator.num_base_anchors[0],
                                kernel_size, norm_cfg))

        self.weighted_sum = weighted_sum
        if self.weighted_sum:
            self.cls_weight = nn.Parameter(torch.ones(len(in_channels)))
            self.reg_weight = nn.Parameter(torch.ones(len(in_channels)))

        self.loss_cls = build_loss(loss_cls)
        self.loss_bbox = build_loss(loss_bbox)

    @auto_fp16()
    def forward(self, z_feats, x_feats):
        """Forward with features `z_feats` of exemplar images and features
        `x_feats` of search images.

        Args:
            z_feats (tuple[Tensor]): Tuple of Tensor with shape (N, C, H, W)
                denoting the multi level feature maps of exemplar images.
                Typically H and W equal to 7.
            x_feats (tuple[Tensor]): Tuple of Tensor with shape (N, C, H, W)
                denoting the multi level feature maps of search images.
                Typically H and W equal to 31.

        Returns:
            tuple(cls_score, bbox_pred): cls_score is a Tensor with shape
                (N, 2 * num_base_anchors, H, W), bbox_pred is a Tensor with
                shape (N, 4 * num_base_anchors, H, W). Typically H and W
                equal to 25.
        """
        assert isinstance(z_feats, tuple) and isinstance(x_feats, tuple)
        assert len(z_feats) == len(x_feats) and len(z_feats) == len(
            self.cls_heads)

        # Step 1: compute the weighting coefficients for the score maps of
        # different levels.
        if self.weighted_sum:
            cls_weight = nn.functional.softmax(self.cls_weight, dim=0)
            reg_weight = nn.functional.softmax(self.reg_weight, dim=0)
        else:
            reg_weight = cls_weight = [
                1.0 / len(z_feats) for i in range(len(z_feats))
            ]

        cls_score = 0
        bbox_pred = 0
        for i in range(len(z_feats)):
            # Step 2: apply the correlation head to the template and search
            # features of one level to obtain the score map.
            cls_score_single = self.cls_heads[i](z_feats[i], x_feats[i])
            # Step 3: apply the correlation head to the template and search
            # features of one level to obtain the regression (bbox) map.
            bbox_pred_single = self.reg_heads[i](z_feats[i], x_feats[i])
            # Step 4: aggregate the score maps of different levels with the
            # weighting coefficients from step 1.
            cls_score += cls_weight[i] * cls_score_single
            # Step 5: aggregate the regression maps of different levels with
            # the weighting coefficients from step 1.
            bbox_pred += reg_weight[i] * bbox_pred_single

        return cls_score, bbox_pred

    @force_fp32(apply_to=('cls_score', 'bbox_pred'))
    def get_bbox(self, cls_score, bbox_pred, prev_bbox, scale_factor):
        """Track `prev_bbox` to current frame based on the output of network.

        Args:
            cls_score (Tensor): of shape (1, 2 * num_base_anchors, H, W).
            bbox_pred (Tensor): of shape (1, 4 * num_base_anchors, H, W).
            prev_bbox (Tensor): of shape (4, ) in [cx, cy, w, h] format.
            scale_factor (Tensor): scale factor.

        Returns:
            tuple(best_score, best_bbox): best_score is a Tensor denoting the
                score of `best_bbox`, best_bbox is a Tensor of shape (4, )
                with [cx, cy, w, h] format, which denotes the best tracked
                bbox in current frame.
        """
        score_maps_size = [(cls_score.shape[2:])]
        # Step 1: generate the anchors.
        if not hasattr(self, 'anchors'):
            self.anchors = self.anchor_generator.grid_priors(
                score_maps_size, cls_score.device)[0]
            # Transform the coordinate origin from the top left corner to the
            # center in the scaled feature map.
            feat_h, feat_w = score_maps_size[0]
            stride_w, stride_h = self.anchor_generator.strides[0]
            self.anchors[:, 0:4:2] -= (feat_w // 2) * stride_w
            self.anchors[:, 1:4:2] -= (feat_h // 2) * stride_h

        # Step 2: generate a 2D Hanning window used to penalize the
        # prediction scores.
        if not hasattr(self, 'windows'):
            self.windows = self.anchor_generator.gen_2d_hanning_windows(
                score_maps_size, cls_score.device)[0]

        # Step 3: obtain the score and the predicted bbox corresponding to
        # each anchor.
        H, W = score_maps_size[0]
        cls_score = cls_score.view(2, -1, H, W)
        cls_score = cls_score.permute(2, 3, 1, 0).contiguous().view(-1, 2)
        cls_score = cls_score.softmax(dim=1)[:, 1]

        bbox_pred = bbox_pred.view(4, -1, H, W)
        bbox_pred = bbox_pred.permute(2, 3, 1, 0).contiguous().view(-1, 4)
        bbox_pred = self.bbox_coder.decode(self.anchors, bbox_pred)
        bbox_pred = bbox_xyxy_to_cxcywh(bbox_pred)

        def change_ratio(ratio):
            return torch.max(ratio, 1. / ratio)

        def enlarge_size(w, h):
            pad = (w + h) * 0.5
            return torch.sqrt((w + pad) * (h + pad))

        # Step 4: compute the scale penalty of the predicted bboxes.
        scale_penalty = change_ratio(
            enlarge_size(bbox_pred[:, 2], bbox_pred[:, 3]) / enlarge_size(
                prev_bbox[2] * scale_factor, prev_bbox[3] * scale_factor))

        # Step 5: compute the aspect-ratio penalty of the predicted bboxes.
        aspect_ratio_penalty = change_ratio(
            (prev_bbox[2] / prev_bbox[3]) /
            (bbox_pred[:, 2] / bbox_pred[:, 3]))

        # Step 6: penalize the predicted scores with the penalties from
        # steps 4 and 5.
        penalty = torch.exp(-(aspect_ratio_penalty * scale_penalty - 1) *
                            self.test_cfg.penalty_k)
        penalty_score = penalty * cls_score

        # Step 7: penalize the predicted scores with the Hanning window from
        # step 2.
        penalty_score = penalty_score * (1 - self.test_cfg.window_influence) \
            + self.windows * self.test_cfg.window_influence

        # Step 8: select the predicted bbox with the highest penalized score
        # as the tracking bbox.
        best_idx = torch.argmax(penalty_score)
        best_score = cls_score[best_idx]
        best_bbox = bbox_pred[best_idx, :] / scale_factor

        final_bbox = torch.zeros_like(best_bbox)
        # Step 9: map the bbox center from the search image back to the
        # original image.
        final_bbox[0] = best_bbox[0] + prev_bbox[0]
        final_bbox[1] = best_bbox[1] + prev_bbox[1]

        # Step 10: smooth the current tracking bbox with the tracking bbox of
        # the previous frame to obtain the final result.
        lr = penalty[best_idx] * cls_score[best_idx] * self.test_cfg.lr
        final_bbox[2] = prev_bbox[2] * (1 - lr) + best_bbox[2] * lr
        final_bbox[3] = prev_bbox[3] * (1 - lr) + best_bbox[3] * lr

        return best_score, final_bbox
Hyperparameter search tool
As mentioned above, the performance of SiameseRPN++ at test time is greatly affected by the three hyperparameters in rpn, and different hyperparameters need to be selected for different datasets. Therefore, we provide a hyperparameter search tool for SiameseRPN++. The search script can be found at $MMTracking/tools/analysis/sot/sot_siamrpn_param_search.py. The following shows how to use it.
Use the following command in the MMTracking root directory to search the hyperparameters based on the OPE evaluation criteria on the UAV123 dataset, and the search results will be saved in the ${LOG_FILENAME} file.
./tools/analysis/sot/dist_sot_siamrpn_param_search.sh \
    [${CONFIG_FILE}] [$GPUS] \
    [--checkpoint ${CHECKPOINT}] \
    [--log ${LOG_FILENAME}] \
    [--eval ${EVAL}] \
    [--penalty-k-range 0.01,0.22,0.05] \
    [--lr-range 0.4,0.61,0.05] \
    [--win-infu-range 0.01,0.22,0.05]
Use the following command in the MMTracking root directory to search the hyperparameters based on the OPE evaluation criteria on the OTB100 dataset, and the search results will be saved in the ${LOG_FILENAME} file.
./tools/analysis/sot/dist_sot_siamrpn_param_search.sh \
    [${CONFIG_FILE}] [$GPUS] \
    [--checkpoint ${CHECKPOINT}] \
    [--log ${LOG_FILENAME}] \
    [--eval ${EVAL}] \
    [--penalty-k-range 0.3,0.45,0.02] \
    [--lr-range 0.35,0.5,0.02] \
    [--win-infu-range 0.46,0.55,0.02]
Please also note that none of the results provided in the MMTracking model zoo were obtained with hyperparameter search.
As a member of the MM series, MMTracking will keep updating and strives to grow into a complete video object perception platform as soon as possible, and the voice of the community helps us better understand your needs. If you encounter any problems, have ideas or suggestions, or would like new datasets, methods or tasks to be supported, you are welcome to leave a comment. And please remember: our repo is your forever home!