Posted by jmdavis on Tue, 18 Jan 2022 15:57:07 +0100

1.6 test

Learning objectives:

  • Understand the process of network testing
  • Be able to write the network testing code

The network test code is mainly in tools/test.py. The contents of this file are shown in the following figure:

Next, we will introduce the above contents.


The main function is the program entry point for network testing. Its execution flow is: read the configuration -> set up log output to a log file -> load the network model -> load the weight file -> load the data -> track. During tracking, it also decides whether target segmentation is performed, as shown in the following figure:

The implementation code is as follows:

def main():
    # Get command line parameter information
    global args, logger, v_id
    args = parser.parse_args()
    # Get the configuration from the configuration file: mainly the network structure, hyperparameters, etc.
    cfg = load_config(args)
    # Initialize logging and write the log output to a file on disk
    init_log('global', logging.INFO)
    if args.log != "":
        add_file_handler('global', args.log, logging.INFO)
    # Enter the relevant configuration information into the log file
    logger = logging.getLogger('global')

    # setup model
    # Build the network architecture
    if args.arch == 'Custom':
        from custom import Custom
        model = Custom(anchors=cfg['anchors'])
    else:
        parser.error('invalid architecture: {}'.format(args.arch))
    # Load the network weights
    if args.resume:
        assert isfile(args.resume), '{} is not a valid file'.format(args.resume)
        model = load_pretrain(model, args.resume)
    # Switch to evaluation mode so that dropout and similar layers are disabled
    model.eval()
    # Select the device and move the model to it
    device = torch.device('cuda' if (torch.cuda.is_available() and not args.cpu) else 'cpu')
    model = model.to(device)
    # setup dataset: load the dataset
    dataset = load_dataset(args.dataset)

    # VOS or VOT? Only these three datasets support mask output
    if args.dataset in ['DAVIS2016', 'DAVIS2017', 'ytb_vos'] and args.mask:
        vos_enable = True  # enable Mask output
    else:
        vos_enable = False

    total_lost = 0  # VOT
    iou_lists = []  # VOS
    speed_list = []
    # Process data
    for v_id, video in enumerate(dataset.keys(), start=1):
        if != '' and video !=
        # If vos_enable is True, call track_vos
        if vos_enable:
            # If the test data is 'DAVIS2017' or 'ytb_vos', multi-target tracking is enabled
            iou_list, speed = track_vos(model, dataset[video], cfg['hp'] if 'hp' in cfg.keys() else None,
                                 args.mask, args.refine, args.dataset in ['DAVIS2017', 'ytb_vos'], device=device)
            iou_lists.append(iou_list)
        else:
            # If vos_enable is False, call track_vot
            lost, speed = track_vot(model, dataset[video], cfg['hp'] if 'hp' in cfg.keys() else None,
                             args.mask, args.refine, device=device)
            total_lost += lost
        speed_list.append(speed)

    # report final result
    if vos_enable:
        for thr, iou in zip(thrs, np.mean(np.concatenate(iou_lists), axis=0)):
            logger.info('Segmentation Threshold {:.2f} mIoU: {:.3f}'.format(thr, iou))
    else:
        logger.info('Total Lost: {:d}'.format(total_lost))

    logger.info('Mean Speed: {:.2f} FPS'.format(np.mean(speed_list)))
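
To make the final report step concrete, here is a minimal sketch of how per-video IoU tables could be combined into one mIoU per segmentation threshold. It assumes that each entry of iou_lists is an array with one row per object and one column per threshold, and that thrs holds the thresholds. Both the shapes and the numbers are made up for illustration and are not confirmed by the code above.

```python
import numpy as np

# Hypothetical thresholds and per-video IoU tables
# (rows: objects, columns: thresholds).
thrs = np.array([0.30, 0.35, 0.40])
iou_lists = [
    np.array([[0.70, 0.72, 0.71]]),      # video 1, one object
    np.array([[0.50, 0.52, 0.49],
              [0.60, 0.62, 0.63]]),      # video 2, two objects
]

# Stack the objects of all videos, then average each threshold column.
mean_iou = np.mean(np.concatenate(iou_lists), axis=0)

for thr, iou in zip(thrs, mean_iou):
    print('Segmentation Threshold {:.2f} mIoU: {:.3f}'.format(thr, iou))
```

With this layout every object counts equally, regardless of which video it comes from.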


This function obtains the image window of the tracking target and adjusts the target frame. If the target frame is outside the image, expand the image and modify the coordinates of the target frame. The code is as follows:

def get_subwindow_tracking(im, pos, model_sz, original_sz, avg_chans, out_mode='torch'):
    """
    Get the image window (patch) around the tracking target
    :param im: Image to crop from
    :param pos: Target center position
    :param model_sz: Patch size required by the model
    :param original_sz: Target size after adding context
    :param avg_chans: Per-channel mean of the image
    :param out_mode: Output mode
    """
    if isinstance(pos, float):
        # Target center point coordinates
        pos = [pos, pos]
    # Target size
    sz = original_sz
    # Image size
    im_sz = im.shape
    # Distance from boundary to center after expanding background
    c = (original_sz + 1) / 2
    # Judge whether the target exceeds the image boundary. If it exceeds the boundary, fill the image
    context_xmin = round(pos[0] - c)
    context_xmax = context_xmin + sz - 1
    context_ymin = round(pos[1] - c)
    context_ymax = context_ymin + sz - 1
    left_pad = int(max(0., -context_xmin))
    top_pad = int(max(0., -context_ymin))
    right_pad = int(max(0., context_xmax - im_sz[1] + 1))
    bottom_pad = int(max(0., context_ymax - im_sz[0] + 1))
    # Image filling changes the origin of the image and calculates the coordinates of the filled image block
    context_xmin = context_xmin + left_pad
    context_xmax = context_xmax + left_pad
    context_ymin = context_ymin + top_pad
    context_ymax = context_ymax + top_pad

    # zzp: a more easy speed version
    r, c, k = im.shape
    # In case of filling, the target position needs to be re assigned
    if any([top_pad, bottom_pad, left_pad, right_pad]):
        # Generate an all zero array of the same size as the filled image
        te_im = np.zeros((r + top_pad + bottom_pad, c + left_pad + right_pad, k), np.uint8)
        # Assign a value to the original image area
        te_im[top_pad:top_pad + r, left_pad:left_pad + c, :] = im
        # Assign the filled area to the mean value of the image
        if top_pad:
            te_im[0:top_pad, left_pad:left_pad + c, :] = avg_chans
        if bottom_pad:
            te_im[r + top_pad:, left_pad:left_pad + c, :] = avg_chans
        if left_pad:
            te_im[:, 0:left_pad, :] = avg_chans
        if right_pad:
            te_im[:, c + left_pad:, :] = avg_chans
        # Crop the target patch from the padded image
        im_patch_original = te_im[int(context_ymin):int(context_ymax + 1), int(context_xmin):int(context_xmax + 1), :]
    else:
        # No padding is needed: crop the patch directly from the original image
        im_patch_original = im[int(context_ymin):int(context_ymax + 1), int(context_xmin):int(context_xmax + 1), :]
    # If the size of the target patch differs from the model input size, resize it with OpenCV
    if not np.array_equal(model_sz, original_sz):
        im_patch = cv2.resize(im_patch_original, (model_sz, model_sz))
    else:
        im_patch = im_patch_original
    # cv2.imshow('crop', im_patch)
    # cv2.waitKey(0)
    # If the output mode is 'torch', convert the patch with im_to_torch (which reorders its channels); otherwise return im_patch directly
    return im_to_torch(im_patch) if out_mode in 'torch' else im_patch
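
The padding arithmetic is easier to follow in isolation. The sketch below repeats only that part of get_subwindow_tracking as a standalone helper. The name window_padding and the example numbers are hypothetical and are used purely for illustration.

```python
def window_padding(pos, original_sz, im_h, im_w):
    """Return (left, top, right, bottom) padding for a square window.

    Repeats the arithmetic of get_subwindow_tracking for a window with
    side original_sz centered at pos = (x, y) in an im_h x im_w image.
    """
    c = (original_sz + 1) / 2
    xmin = round(pos[0] - c)
    xmax = xmin + original_sz - 1
    ymin = round(pos[1] - c)
    ymax = ymin + original_sz - 1
    left = int(max(0., -xmin))
    top = int(max(0., -ymin))
    right = int(max(0., xmax - im_w + 1))
    bottom = int(max(0., ymax - im_h + 1))
    return left, top, right, bottom


# A window near the left border needs padding on the left only.
pads_border = window_padding((10, 50), 41, im_h=100, im_w=200)
# A window well inside the image needs no padding at all.
pads_inside = window_padding((100, 50), 41, im_h=100, im_w=200)
print(pads_border, pads_inside)
```

When all four values are zero, the function can crop from the original image directly, which is the cheaper path.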


This method generates the anchors used to track the target: it converts each anchor to center point, width and height form and places a copy of the anchor set at every position of the score map.

The code is as follows:

def generate_anchor(cfg, score_size):
    """
    Generate anchors
    :param cfg: Anchor configuration
    :param score_size: Side length of the classification score map
    :return: Generated anchors
    """
    # Initialize anchor
    anchors = Anchors(cfg)
    # Get generated anchors
    anchor = anchors.anchors
    # Get the upper left and lower right coordinates of each anchor
    x1, y1, x2, y2 = anchor[:, 0], anchor[:, 1], anchor[:, 2], anchor[:, 3]
    # Convert anchor to the form of center point coordinates and width and height
    anchor = np.stack([(x1+x2)*0.5, (y1+y2)*0.5, x2-x1, y2-y1], 1)
    # Get the stride of the generated anchors
    total_stride = anchors.stride
    # Get the number of anchors
    anchor_num = anchor.shape[0]
    # Replicate the anchor group once for every position of the score map
    anchor = np.tile(anchor, score_size * score_size).reshape((-1, 4))
    # With the offset ori, xx and yy take the image center as the origin
    ori = - (score_size // 2) * total_stride
    xx, yy = np.meshgrid([ori + total_stride * dx for dx in range(score_size)],
                         [ori + total_stride * dy for dy in range(score_size)])
    xx, yy = np.tile(xx.flatten(), (anchor_num, 1)).flatten(), \
             np.tile(yy.flatten(), (anchor_num, 1)).flatten()
    # Set the center coordinates of every anchor
    anchor[:, 0], anchor[:, 1] = xx.astype(np.float32), yy.astype(np.float32)
    return anchor
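
The tiling and grid steps can be checked on a tiny example. The sketch below skips the Anchors class and starts from two hand-written anchors in (cx, cy, w, h) form. The anchor sizes, score_size = 5 and total_stride = 8 are illustrative assumptions, not values from the configuration.

```python
import numpy as np

score_size, total_stride = 5, 8   # illustrative values
# Two hypothetical anchors in (cx, cy, w, h) form.
anchor = np.array([[0, 0, 64, 32],
                   [0, 0, 32, 64]], dtype=np.float32)
anchor_num = anchor.shape[0]

# Repeat every anchor once per score-map position:
# rows 0..24 hold anchor 0, rows 25..49 hold anchor 1.
anchor = np.tile(anchor, score_size * score_size).reshape((-1, 4))

# Grid of center offsets, measured from the image center.
ori = -(score_size // 2) * total_stride
xx, yy = np.meshgrid([ori + total_stride * dx for dx in range(score_size)],
                     [ori + total_stride * dy for dy in range(score_size)])
xx = np.tile(xx.flatten(), (anchor_num, 1)).flatten()
yy = np.tile(yy.flatten(), (anchor_num, 1)).flatten()
anchor[:, 0], anchor[:, 1] = xx.astype(np.float32), yy.astype(np.float32)

print(anchor.shape)
print(anchor[0], anchor[-1])
```

The result has one row per anchor per score-map position, ordered anchor by anchor, which matches the order of the flattened network output used later.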


siamese_init creates the tracker state dictionary, state. The contents of state are shown in the following figure:

TrackerConfig holds the configuration, net is the network model, window is the penalty window, and the remaining entries describe the tracked target.

The code implementation is as follows:

def siamese_init(im, target_pos, target_sz, model, hp=None, device='cpu'):
    """
    Initialize the tracker and build the state dictionary from the target information
    :param im: Currently processed image
    :param target_pos: Target location
    :param target_sz: Target size
    :param model: Trained network model
    :param hp: Hyperparameters
    :param device: Device to run on
    :return: Tracker state dictionary
    """

    # Initialize state dictionary
    state = dict()
    # Sets the width and height of the image
    state['im_h'] = im.shape[0]
    state['im_w'] = im.shape[1]
    # Configure the relevant parameters of the tracker
    p = TrackerConfig()
    # Update parameters
    p.update(hp, model.anchors)
    # Update parameters
    # Get network model
    net = model
    # Update the parameters of the tracker according to the network parameters, mainly anchors
    p.scales = model.anchors['scales']
    p.ratios = model.anchors['ratios']
    p.anchor_num = model.anchor_num
    # Generate anchor
    p.anchor = generate_anchor(model.anchors, p.score_size)
    # Average of images
    avg_chans = np.mean(im, axis=(0, 1))
    # Compute the width, height and side length of the exemplar crop z according to the configured context amount
    wc_z = target_sz[0] + p.context_amount * sum(target_sz)
    hc_z = target_sz[1] + p.context_amount * sum(target_sz)
    s_z = round(np.sqrt(wc_z * hc_z))
    # initialize the exemplar
    z_crop = get_subwindow_tracking(im, target_pos, p.exemplar_size, s_z, avg_chans)
    # Converting it to a Variable allows back propagation in PyTorch
    z = Variable(z_crop.unsqueeze(0))
    # Special handling template
    # Set penalty window used
    if p.windowing == 'cosine':
        # Using the outer product of hanning window to generate cosine window
        window = np.outer(np.hanning(p.score_size), np.hanning(p.score_size))
    elif p.windowing == 'uniform':
        window = np.ones((p.score_size, p.score_size))
    # Each anchor has a corresponding penalty window
    window = np.tile(window.flatten(), p.anchor_num)
    # Update information to state dictionary
    state['p'] = p
    state['net'] = net
    state['avg_chans'] = avg_chans
    state['window'] = window
    state['target_pos'] = target_pos
    state['target_sz'] = target_sz
    return state
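
Two pieces of siamese_init are pure arithmetic and can be tried without a model: the side length of the exemplar crop and the penalty window. The sketch below wraps them in two hypothetical helpers. The default context_amount = 0.5 and the values score_size = 25 and anchor_num = 5 are placeholders chosen for illustration, not values read from TrackerConfig.

```python
import numpy as np


def exemplar_side(target_sz, context_amount=0.5):
    """Side length of the square exemplar crop, context included."""
    wc_z = target_sz[0] + context_amount * sum(target_sz)
    hc_z = target_sz[1] + context_amount * sum(target_sz)
    return round(np.sqrt(wc_z * hc_z))


def penalty_window(score_size, anchor_num, windowing='cosine'):
    """Flattened penalty window, repeated once per anchor."""
    if windowing == 'cosine':
        # Outer product of two Hanning windows gives a 2-D cosine window.
        window = np.outer(np.hanning(score_size), np.hanning(score_size))
    else:  # 'uniform'
        window = np.ones((score_size, score_size))
    return np.tile(window.flatten(), anchor_num)


s_z = exemplar_side([100, 50])
window = penalty_window(score_size=25, anchor_num=5)
print(s_z, window.shape)
```

The cosine window peaks at the center of the score map, so candidates close to the previous target position are favored.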


This method tracks the target: starting from the state built by siamese_init, it obtains the target tracking box and then calls track_mask (with segmentation) or track (without segmentation) for target tracking.

1. Function prototype

def siamese_track(state, im, mask_enable=False, refine_enable=False, device='cpu', debug=False):
    """
    Track the target
    :param state: Target state
    :param im: Image frame to track in
    :param mask_enable: Whether to predict a mask
    :param refine_enable: Whether feature fusion is performed
    :param device: Device to run on
    :param debug: Whether to run in debug mode
    :return: Updated state dictionary of the tracked target
    """

2. Current status of target

Get the current state of the target. In debug mode, the search area is drawn on the image.

    # Get the target state
    p = state['p']
    net = state['net']
    avg_chans = state['avg_chans']
    window = state['window']
    target_pos = state['target_pos']
    target_sz = state['target_sz']
    # The width, height and size of the tracking frame containing surrounding information
    wc_x = target_sz[1] + p.context_amount * sum(target_sz)
    hc_x = target_sz[0] + p.context_amount * sum(target_sz)
    s_x = np.sqrt(wc_x * hc_x)
    # Scale factor from the tracking box (with context) to the input size of the template branch
    scale_x = p.exemplar_size / s_x
    # The detection area is obtained using the same proportion as the template branch
    d_search = (p.instance_size - p.exemplar_size) / 2
    pad = d_search / scale_x
    s_x = s_x + 2 * pad
    # Expand the detection box to include surrounding information
    crop_box = [target_pos[0] - round(s_x) / 2, target_pos[1] - round(s_x) / 2, round(s_x), round(s_x)]
    # If debug
    if debug:
        # Copy picture
        im_debug = im.copy()
        # Convert crop_box to integer coordinates (np.int0 was removed in NumPy 2.0)
        crop_box_int = np.array(crop_box).astype(int)
        # Draw it on the picture
        cv2.rectangle(im_debug, (crop_box_int[0], crop_box_int[1]),
                      (crop_box_int[0] + crop_box_int[2], crop_box_int[1] + crop_box_int[3]), (255, 0, 0), 2)
        # Picture display
        cv2.imshow('search area', im_debug)
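
The search-region arithmetic can be sketched as a standalone function. The name search_region is hypothetical, and the defaults exemplar_size = 127, instance_size = 255 and context_amount = 0.5 are assumed values for illustration rather than values taken from TrackerConfig.

```python
import numpy as np


def search_region(target_pos, target_sz, exemplar_size=127,
                  instance_size=255, context_amount=0.5):
    """Return (scale_x, s_x, crop_box) for the search area."""
    wc_x = target_sz[1] + context_amount * sum(target_sz)
    hc_x = target_sz[0] + context_amount * sum(target_sz)
    s_x = np.sqrt(wc_x * hc_x)              # side of the context box
    scale_x = exemplar_size / s_x           # image pixels -> model pixels
    d_search = (instance_size - exemplar_size) / 2
    pad = d_search / scale_x                # extra margin in image pixels
    s_x = s_x + 2 * pad                     # side of the search area
    crop_box = [target_pos[0] - round(s_x) / 2,
                target_pos[1] - round(s_x) / 2,
                round(s_x), round(s_x)]
    return scale_x, s_x, crop_box


scale_x, s_x, crop_box = search_region([200, 150], [100, 50])
print(scale_x, s_x, crop_box)
```

Because the margin is added in image pixels at the same scale, the enlarged search area maps exactly onto instance_size pixels once it is resized, that is s_x * scale_x equals instance_size.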

3. Target tracking

Depending on whether the target is to be segmented, call track_mask or track for target tracking.

    # Crop the search region around the target and scale it to the model input size
    x_crop = Variable(get_subwindow_tracking(im, target_pos, p.instance_size, round(s_x), avg_chans).unsqueeze(0))
    # Call the network for target tracking
    if mask_enable:
        # Target tracking with segmentation
        score, delta, mask = net.track_mask(x_crop.to(device))
    else:
        # Only target tracking, no segmentation
        score, delta = net.track(x_crop.to(device))

4. Classification and regression

The results of target regression and classification are obtained from the RPN network, as shown in the figure below:

The results returned by the RPN network are not the real position, width and height of the detection box, but offsets relative to the anchor:

    delta[0] = (tx - ax) / aw
    delta[1] = (ty - ay) / ah
    delta[2] = ln(tw / aw)
    delta[3] = ln(th / ah)

Where: delta[0], delta[1], delta[2], delta[3] are the results returned by the network model. In the following code, they are the delta[0], delta[1], delta[2], delta[3] on the right side of each assignment. What we need is the predicted position of the target box, that is, tx, ty, tw, th (the delta[0], delta[1], delta[2], delta[3] on the left side of each assignment in the following code). ax, ay, aw and ah are the center point, width and height of the anchor.

In the following code the anchor values come from p.anchor. Solving the relations above for the prediction gives:

    tx = delta[0] * aw + ax
    ty = delta[1] * ah + ay
    tw = exp(delta[2]) * aw
    th = exp(delta[3]) * ah

The code is as follows:

    # Target box regression result (reshape it to 4 x N)
    delta = delta.permute(1, 2, 3, 0).contiguous().view(4, -1).data.cpu().numpy()
    # Target classification result (reshape it to N x 2 and apply softmax over the two classes)
    score = F.softmax(score.permute(1, 2, 3, 0).contiguous().view(2, -1).permute(1, 0), dim=1).data[:,
    # Calculate the center point coordinates of the target frame, delta[0],delta[1], and width delta[2] and height delta[3].
    delta[0, :] = delta[0, :] * p.anchor[:, 2] + p.anchor[:, 0]
    delta[1, :] = delta[1, :] * p.anchor[:, 3] + p.anchor[:, 1]
    delta[2, :] = np.exp(delta[2, :]) * p.anchor[:, 2]
    delta[3, :] = np.exp(delta[3, :]) * p.anchor[:, 3]
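
As a quick check of the decoding step, the sketch below applies the same four assignments to a single hand-made anchor. The helper name decode_boxes and all numbers are hypothetical. Unlike the code above, it writes into a new array instead of overwriting delta in place.

```python
import numpy as np


def decode_boxes(delta, anchor):
    """Turn network offsets into boxes.

    delta:  array of shape (4, N) with the raw regression output
    anchor: array of shape (N, 4) in (cx, cy, w, h) form
    Returns an array of shape (4, N) holding cx, cy, w, h per candidate.
    """
    boxes = np.empty_like(delta, dtype=np.float64)
    boxes[0] = delta[0] * anchor[:, 2] + anchor[:, 0]   # center x
    boxes[1] = delta[1] * anchor[:, 3] + anchor[:, 1]   # center y
    boxes[2] = np.exp(delta[2]) * anchor[:, 2]          # width
    boxes[3] = np.exp(delta[3]) * anchor[:, 3]          # height
    return boxes


anchor = np.array([[0., 0., 64., 32.]])
delta = np.array([[0.5], [-0.25], [0.0], [np.log(2.0)]])
boxes = decode_boxes(delta, anchor)
print(boxes.ravel())
```

An all-zero delta returns the anchor itself, which is a convenient sanity check for the sign and index conventions.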

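5. Scale penalty

The next step is to select the best candidate using the scale penalty and the cosine window. First, the scale penalty is computed. k is a hyperparameter, r is the aspect ratio, and s is the equivalent side length:

    penalty = exp(-(max(r / r', r' / r) * max(s / s', s' / s) - 1) * k)

Here r and s belong to the candidate box, and r' and s' belong to the target from the previous frame. The equivalent side length is computed as follows: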

    def sz(w, h):
        """
        Calculate the equivalent side length
        :param w: width
        :param h: height
        :return: Equivalent side length
        """
        pad = (w + h) * 0.5
        sz2 = (w + pad) * (h + pad)
        return np.sqrt(sz2)

    def sz_wh(wh):
        """
        Calculate the equivalent side length
        :param wh: Array of width and height
        :return: Equivalent side length
        """
        pad = (wh[0] + wh[1]) * 0.5
        sz2 = (wh[0] + pad) * (wh[1] + pad)
        return np.sqrt(sz2)

Next, the penalty is applied, and the candidate with the highest penalized score is selected as the final target tracking box.

    # Symmetric ratio: how far r is from 1 in either direction
    def change(r):
        """
        Compare r and 1/r element-wise and take the maximum
        :param r: Ratio or array of ratios
        :return: Element-wise maximum of r and 1/r
        """
        return np.maximum(r, 1. / r)

    # size penalty
    target_sz_in_crop = target_sz*scale_x
    s_c = change(sz(delta[2, :], delta[3, :]) / (sz_wh(target_sz_in_crop)))  # scale penalty
    r_c = change((target_sz_in_crop[0] / target_sz_in_crop[1]) / (delta[2, :] / delta[3, :]))  # ratio penalty
    # p.penalty_k is a hyperparameter
    penalty = np.exp(-(r_c * s_c - 1) * p.penalty_k)
    # Penalize the classification scores
    pscore = penalty * score

    # cos window (motion model)
    # Window penalty: superimpose a window distribution value according to a certain weight
    pscore = pscore * (1 - p.window_influence) + window * p.window_influence
    # Index of the best penalized score
    best_pscore_id = np.argmax(pscore)
    # Map the best prediction back to the scale of the original image
    pred_in_crop = delta[:, best_pscore_id] / scale_x
    # Calculate the learning rate lr used to smooth the size update
    lr = penalty[best_pscore_id] * score[best_pscore_id] *  # lr for OTB
    # Calculate the position and size of the target: obtain the position and size of the target according to the predicted offset
    res_x = pred_in_crop[0] + target_pos[0]
    res_y = pred_in_crop[1] + target_pos[1]

    res_w = target_sz[0] * (1 - lr) + pred_in_crop[2] * lr
    res_h = target_sz[1] * (1 - lr) + pred_in_crop[3] * lr
    # Location and size of target
    target_pos = np.array([res_x, res_y])
    target_sz = np.array([res_w, res_h])
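
The effect of the penalty is easiest to see with two candidates that have the same raw score. The sketch below reuses the formulas above in a standalone function. The name penalized_scores and the values penalty_k = 0.1 and window_influence = 0.4 are assumptions made for illustration, not the tracker's configured hyperparameters.

```python
import numpy as np


def change(r):
    # Symmetric ratio: max(r, 1/r) is 1 only when nothing changed.
    return np.maximum(r, 1. / r)


def sz(w, h):
    # Equivalent side length of a box with context padding.
    pad = (w + h) * 0.5
    return np.sqrt((w + pad) * (h + pad))


def penalized_scores(score, cand_w, cand_h, prev_w, prev_h, window,
                     penalty_k=0.1, window_influence=0.4):
    s_c = change(sz(cand_w, cand_h) / sz(prev_w, prev_h))   # scale change
    r_c = change((prev_w / prev_h) / (cand_w / cand_h))     # ratio change
    penalty = np.exp(-(r_c * s_c - 1) * penalty_k)
    pscore = penalty * score
    pscore = pscore * (1 - window_influence) + window * window_influence
    return penalty, pscore


score = np.array([0.9, 0.9])
cand_w = np.array([50., 100.])   # candidate 0 keeps the old shape
cand_h = np.array([50., 25.])    # candidate 1 is much wider and flatter
window = np.array([0.5, 0.5])
penalty, pscore = penalized_scores(score, cand_w, cand_h, 50., 50., window)
best = int(np.argmax(pscore))
print(penalty, pscore, best)
```

A candidate with unchanged size and aspect ratio keeps a penalty of exactly 1, so with equal raw scores the candidate closest to the previous shape wins.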

6. Segmentation

In this part, the target is segmented. The mask is produced either with the refine module or directly, without it.

    # If segmentation is enabled
    if mask_enable:
        # Position of the best prediction: np.unravel_index converts a flat index (or an array of flat indices) into a tuple of coordinate arrays
        best_pscore_id_mask = np.unravel_index(best_pscore_id, (5, p.score_size, p.score_size))
        delta_x, delta_y = best_pscore_id_mask[2], best_pscore_id_mask[1]
        # Whether to perform feature fusion
        if refine_enable:
            # Call track_refine to run the refine module: the target mask is obtained from the 1 × 1 × 256 feature vector and the feature maps from before downsampling
            mask = net.track_refine((delta_y, delta_x)).to(device).sigmoid().squeeze().view(
                p.out_size, p.out_size).cpu().data.numpy()
        else:
            # Without fusion, the mask data is generated directly
            mask = mask[0, :, delta_y, delta_x].sigmoid(). \
                squeeze().view(p.out_size, p.out_size).cpu().data.numpy()
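
The index conversion at the start of this step can be tried on its own. In the sketch below the sizes (5 anchors, a 25 x 25 score map) and the flat index are illustrative assumptions.

```python
import numpy as np

anchor_num, score_size = 5, 25   # illustrative sizes
# Flat index of anchor 2 at row 3, column 7 of the score map.
best_pscore_id = 2 * score_size * score_size + 3 * score_size + 7

anchor_idx, delta_y, delta_x = np.unravel_index(
    best_pscore_id, (anchor_num, score_size, score_size))
print(anchor_idx, delta_y, delta_x)
```

The row index becomes delta_y and the column index becomes delta_x, which is why the code above reads positions 1 and 2 of the returned tuple in that order.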

From the segmentation result, the location and size of the tracked target are then refined.

First, apply an affine transformation to the image:

        def crop_back(image, bbox, out_sz, padding=-1):
            """
            Affine transformation of the image
            :param image: Image
            :param bbox: Box given as (x, y, width, height) that defines the transformation
            :param out_sz: Output size
            :param padding: Padding value
            :return: Result of the affine transformation
            """
            # Construct transformation matrix
            # Scale coefficient
            a = (out_sz[0] - 1) / bbox[2]
            b = (out_sz[1] - 1) / bbox[3]
            # Translation 
            c = -a * bbox[0]
            d = -b * bbox[1]
            mapping = np.array([[a, 0, c],
                                [0, b, d]]).astype(float)
            # Affine transformation
            crop = cv2.warpAffine(image, mapping, (out_sz[0], out_sz[1]),
            return crop
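
To see what the transformation matrix does, the sketch below builds the same 2 x 3 matrix and applies it to individual points with NumPy instead of warping an image. The helper names crop_back_matrix and apply_affine and the example box are hypothetical.

```python
import numpy as np


def crop_back_matrix(bbox, out_sz):
    """Build the 2 x 3 affine matrix used by crop_back."""
    a = (out_sz[0] - 1) / bbox[2]   # horizontal scale
    b = (out_sz[1] - 1) / bbox[3]   # vertical scale
    c = -a * bbox[0]                # horizontal translation
    d = -b * bbox[1]                # vertical translation
    return np.array([[a, 0, c],
                     [0, b, d]], dtype=float)


def apply_affine(mapping, x, y):
    """Map a source point (x, y) to output coordinates."""
    return mapping @ np.array([x, y, 1.0])


bbox = [10, 20, 100, 50]            # x, y, width, height
mapping = crop_back_matrix(bbox, out_sz=(201, 101))
top_left = apply_affine(mapping, 10, 20)
bottom_right = apply_affine(mapping, 110, 70)
print(top_left, bottom_right)
```

The top-left corner of the box lands on the output origin and the opposite corner lands on the last output pixel, so the box is stretched to fill the whole output.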

Then, after the affine transformation of the segmentation result, the minimum-area bounding rectangle of its contour is computed to obtain the position of the target, so that the resulting target box adapts as the target moves.

        # Ratio of the side length of the detection area box to the input size of the model: scaling factor
        s = crop_box[2] / p.instance_size
        # Predicted template area box
        sub_box = [crop_box[0] + (delta_x - p.base_size / 2) * p.total_stride * s,
                   crop_box[1] + (delta_y - p.base_size / 2) * p.total_stride * s,
                   s * p.exemplar_size, s * p.exemplar_size]
        # Scaling factor
        s = p.out_size / sub_box[2]
        # Background box
        back_box = [-sub_box[0] * s, -sub_box[1] * s, state['im_w'] * s, state['im_h'] * s]
        # affine transformation 
        mask_in_img = crop_back(mask, back_box, (state['im_w'], state['im_h']))
        # Mask results are obtained
        target_mask = (mask_in_img > p.seg_thr).astype(np.uint8)
        # Find the contours; the return value depends on the cv2 version
        if cv2.__version__[-5] == '4':
            # OpenCV 4 returns two values, while OpenCV 3 returns three
            contours, _ = cv2.findContours(target_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        else:
            _, contours, _ = cv2.findContours(target_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        # Get the area of each contour
        cnt_area = [cv2.contourArea(cnt) for cnt in contours]
        if len(contours) != 0 and np.max(cnt_area) > 100:
            # Get the contour with the largest area
            contour = contours[np.argmax(cnt_area)]  # use max area polygon
            # Reshape to N x 2
            polygon = contour.reshape(-1, 2)
            # pbox = cv2.boundingRect(polygon)  # Min Max Rectangle
            # After getting the smallest circumscribed rectangle, find the four vertices of the rectangle
            prbox = cv2.boxPoints(cv2.minAreaRect(polygon))  # Rotated Rectangle

            # box_in_img = pbox
            # Get tracking box
            rbox_in_img = prbox
        else:  # empty mask
            # The location is obtained according to the predicted target position and size
            location = cxy_wh_2_rect(target_pos, target_sz)
            # Get the four vertices of the tracking box
            rbox_in_img = np.array([[location[0], location[1]],
                                    [location[0] + location[2], location[1]],
                                    [location[0] + location[2], location[1] + location[3]],
                                    [location[0], location[1] + location[3]]])

7. Tracking results

Get the location and size of the target, and update the information into the state object.

    # Clamp the position and size of the target to the image
    target_pos[0] = max(0, min(state['im_w'], target_pos[0]))
    target_pos[1] = max(0, min(state['im_h'], target_pos[1]))
    target_sz[0] = max(10, min(state['im_w'], target_sz[0]))
    target_sz[1] = max(10, min(state['im_h'], target_sz[1]))
    # Update state object
    state['target_pos'] = target_pos
    state['target_sz'] = target_sz
    state['score'] = score[best_pscore_id]
    state['mask'] = mask_in_img if mask_enable else []
    state['ploygon'] = rbox_in_img if mask_enable else []
    return state
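
The clamping at the end keeps the result inside the image and prevents the box from collapsing. The sketch below isolates that step. The helper name clamp_target and the example values are hypothetical, while the minimum size of 10 pixels is the one used in the code above.

```python
import numpy as np


def clamp_target(target_pos, target_sz, im_w, im_h, min_sz=10):
    """Keep the center inside the image and the size within limits."""
    pos = np.array([max(0, min(im_w, target_pos[0])),
                    max(0, min(im_h, target_pos[1]))])
    size = np.array([max(min_sz, min(im_w, target_sz[0])),
                     max(min_sz, min(im_h, target_sz[1]))])
    return pos, size


pos, size = clamp_target([-5, 500], [3, 1000], im_w=640, im_h=480)
print(pos, size)
```

Without the lower bound, a run of poor predictions could shrink the box towards zero, and the next search region would be too small to recover the target.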

6. Network test

During network test, execute the following at the terminal:

cd $SiamMask/experiments/siammask_sharp
bash test_mask_refine.sh config_vot.json SiamMask_VOT.pth VOT2016 0

The contents of test_mask_refine.sh are as follows:

# Check whether the fourth argument is an empty string. If it is, the parameters still need to be supplied
if [ -z "$4" ]
  then
    # The echo command prints a string on the terminal
    echo "Need input parameter!"
    echo "Usage: bash `basename "$0"` \$CONFIG \$MODEL \$DATASET \$GPUID"
    exit
# fi is the end of the if statement, equivalent to end if
fi
# Location of the project root directory
ROOT=/Users/yaoxiaoying/Documents/01-work/ vision /03.Intelligent transportation/04.Single target tracking/SiamMask-master
# Setting environment variables
# Create log path
mkdir -p logs
# Set parameters
config=$1
model=$2
dataset=$3
gpu=$4
# Run test.py. The input parameters are: (1) config file: config_vot.json; (2) model: SiamMask_VOT.pth; (3) dataset: VOT2016; (4) gpu: 0
CUDA_VISIBLE_DEVICES=$gpu python -u $ROOT/tools/ \
    --config $config \
    --resume $model \
    --mask --refine \
    --dataset $dataset 2>&1 | tee logs/test_$dataset.log

# 2>&1: redirect standard error to standard output
# tee logs/test_$dataset.log: print the standard output on the terminal and also write it to the log file

Test results:

[2020-02-10 21:02:20,] Total Lost: 44
[2020-02-10 21:02:20,] Mean Speed: 5.85 FPS


  • Network testing uses data to evaluate the performance of the network. The process is to load the data and the model, then track the targets in the data
  • The network test code mainly includes
    • siamese_init: tracker initialization
    • siamese_track: track the target

