Wen @ 000007
Preface
Hello everyone. Today we are opening a new series of MMDetection articles, in which we will walk you through some atypical operation skills.
These atypical operations arise for various reasons: some come from the needs of internal and community users, and some from the needs of reproducing algorithms. We hope that after studying this series, users will feel more comfortable doing extension development with MMDetection and can pull off all kinds of operations with ease.
This article is the first in the series on atypical operations. The skills covered are:
- How to set different learning rates for different layers and freeze specific layers
- How to gracefully use multi-image data augmentation in training
- How to adjust the data preprocessing pipeline and switch losses in real time during training
Note: This article assumes some familiarity with MMDetection itself, which you can gain from the official documentation or Zhihu articles. The atypical operation methods described here may only apply to MMDetection v2.21.0 and earlier versions; as MMDetection keeps evolving, there will likely be more elegant solutions in the future.
1 How to set different learning rates for different layers and freeze specific layers
This question often comes up in issues. MMDetection does support setting different learning rates for different layers and freezing specific layers. The core mechanism is the optimizer constructor, and MMCV provides a DefaultOptimizerConstructor that handles most of the requirements users run into in everyday use.
To set different learning rates for different layers, refer to the DETR algorithm's configs/detr/detr_r50_8x2_150e_coco.py configuration file.
```python
optimizer = dict(
    type='AdamW',
    lr=0.0001,
    weight_decay=0.0001,
    paramwise_cfg=dict(
        custom_keys={'backbone': dict(lr_mult=0.1, decay_mult=1.0)}))
```
The above configuration multiplies the initial learning rate of the backbone part of DETR by 0.1, i.e., the backbone learning rate is 10 times smaller than that of the head. Similarly, you can refer to the Swin Transformer algorithm's configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py configuration file.
```python
optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=0.0001,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_cfg=dict(
        custom_keys={
            'absolute_pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }))
```
This sets the decay coefficient of layers whose parameter names contain the specified keys to 0, i.e., no weight decay is applied to them.
As for freezing a specific layer, this currently only works for modules without BN layers. Fortunately, most FPN and Head modules have no BN layer, so in most cases users can indirectly freeze a layer by setting its lr_mult to 0.
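As a rough illustration (not taken from an official config; the key 'neck' is only an example and must match a module prefix in your own model), such a config fragment could look like this:

```python
optimizer = dict(
    type='SGD',
    lr=0.02,
    momentum=0.9,
    weight_decay=0.0001,
    paramwise_cfg=dict(
        custom_keys={
            # lr_mult=0 zeroes the learning rate of all parameters whose
            # name contains 'neck'; decay_mult=0 also disables weight decay
            'neck': dict(lr_mult=0., decay_mult=0.)
        }))
```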
1.1 DefaultOptimizerConstructor
First of all, it should be emphasized that the role of an optimizer constructor is to set different optimization hyperparameters for different layers of the model. The most common configuration is:
```python
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
```
This means that all layers share the same hyperparameters. When the optimizer is built, the code looks like this:
```python
def build_optimizer(model, cfg):
    optimizer_cfg = copy.deepcopy(cfg)
    # If the user has not specified a custom constructor,
    # use DefaultOptimizerConstructor
    constructor_type = optimizer_cfg.pop('constructor',
                                         'DefaultOptimizerConstructor')
    # Take out paramwise_cfg
    paramwise_cfg = optimizer_cfg.pop('paramwise_cfg', None)
    # Instantiate the optimizer constructor
    optim_constructor = build_optimizer_constructor(
        dict(
            type=constructor_type,
            optimizer_cfg=optimizer_cfg,
            paramwise_cfg=paramwise_cfg))
    # Return a PyTorch optimizer object
    optimizer = optim_constructor(model)
    return optimizer
```
The core code of DefaultOptimizerConstructor, slightly simplified, is:
```python
@OPTIMIZER_BUILDERS.register_module()
class DefaultOptimizerConstructor:

    def __init__(self, optimizer_cfg, paramwise_cfg=None):
        self.optimizer_cfg = optimizer_cfg
        self.paramwise_cfg = {} if paramwise_cfg is None else paramwise_cfg
        # Base configuration of the optimizer itself
        self.base_lr = optimizer_cfg.get('lr', None)
        self.base_wd = optimizer_cfg.get('weight_decay', None)

    def add_params(self, params, module, prefix='', is_dcn_module=None):
        # These parameters are the important ones
        bias_lr_mult = self.paramwise_cfg.get('bias_lr_mult', 1.)
        bias_decay_mult = self.paramwise_cfg.get('bias_decay_mult', 1.)
        norm_decay_mult = self.paramwise_cfg.get('norm_decay_mult', 1.)
        dwconv_decay_mult = self.paramwise_cfg.get('dwconv_decay_mult', 1.)
        bypass_duplicate = self.paramwise_cfg.get('bypass_duplicate', False)
        dcn_offset_lr_mult = self.paramwise_cfg.get('dcn_offset_lr_mult', 1.)

        for name, param in module.named_parameters(recurse=False):
            param_group = {'params': [param]}
            if not param.requires_grad:
                params.append(param_group)
                continue
            # Build a new parameter group for the user-defined keys
            ...
            # Add to the parameter groups
            params.append(param_group)

        # Recurse into all child modules
        for child_name, child_mod in module.named_children():
            child_prefix = f'{prefix}.{child_name}' if prefix else child_name
            self.add_params(
                params,
                child_mod,
                prefix=child_prefix,
                is_dcn_module=is_dcn_module)

    # When called, a PyTorch optimizer object is returned
    def __call__(self, model):
        optimizer_cfg = self.optimizer_cfg.copy()
        # If paramwise_cfg is not specified, the global configuration is used
        if not self.paramwise_cfg:
            optimizer_cfg['params'] = model.parameters()
            return build_from_cfg(optimizer_cfg, OPTIMIZERS)

        # Build the parameter groups
        params = []
        self.add_params(params, model)
        optimizer_cfg['params'] = params
        return build_from_cfg(optimizer_cfg, OPTIMIZERS)
```
As can be seen from the parameters above, DefaultOptimizerConstructor can, for example:
- bias_lr_mult: multiply the learning rate of the bias parameters of specific layers or all layers by a coefficient
- bias_decay_mult: multiply the weight decay of the bias parameters of specific layers or all layers by a coefficient
- The other options (norm_decay_mult, dwconv_decay_mult, dcn_offset_lr_mult, ...) work similarly; see the sketch below
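As a minimal sketch (the coefficient values are chosen arbitrarily for illustration), these options are used directly inside paramwise_cfg:

```python
optimizer = dict(
    type='SGD',
    lr=0.02,
    momentum=0.9,
    weight_decay=0.0001,
    paramwise_cfg=dict(
        bias_lr_mult=2.,        # biases get twice the base learning rate
        norm_decay_mult=0.,     # no weight decay on normalization layers
        dwconv_decay_mult=0.))  # no weight decay on depthwise conv layers
```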
When the user specifies custom_keys, DefaultOptimizerConstructor traverses the model parameters and checks, by string matching, whether any key of custom_keys appears in the parameter name. If it does, the user-specified coefficients are applied to that parameter group. Since this is a plain substring match, users should make sure their keys are specific enough, otherwise unintended parameters may also be matched. For example, suppose the user only wants to customize the learning rate of module a.b.c, but the model also contains a module named a.b.d; if the custom key is set to a.b, it will match a.b.d as well, producing an extra match. The brief core implementation is as follows:
```python
# Sort alphabetically first, then by length in descending order,
# so that longer (more specific) keys are matched first
sorted_keys = sorted(sorted(custom_keys.keys()), key=len, reverse=True)

for name, param in module.named_parameters(recurse=False):
    for key in sorted_keys:
        if key in f'{prefix}.{name}':
            lr_mult = custom_keys[key].get('lr_mult', 1.)
            param_group['lr'] = self.base_lr * lr_mult
            if self.base_wd is not None:
                decay_mult = custom_keys[key].get('decay_mult', 1.)
                param_group['weight_decay'] = self.base_wd * decay_mult
            break
```
1.2 Solutions for freezing specific layers
For modules without BN layers, users can simply set lr_mult of the layers they want to freeze to 0. However, once BN is involved, lr=0 stops the learnable parameters from updating, but the running mean and variance still change, so true freezing is not achieved.
When the layers to be frozen contain BN and the goal cannot be achieved by modifying the configuration alone, there are currently two customization approaches:
- Customize an OptimizerConstructor, or inherit from DefaultOptimizerConstructor, and handle the logic internally
- Directly set requires_grad=False on the parameters of the layers to be frozen when building the model, and switch those modules to eval mode
The second approach is the simplest and most direct, and is recommended for ordinary users. If you are comfortable writing a custom OptimizerConstructor, customizing it directly is recommended, as it is more general.
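A minimal sketch of the second approach is shown below (MyDetector and backbone.layer1 are purely illustrative, not actual MMDetection classes or modules); the key points are setting requires_grad=False and keeping the frozen module in eval mode even after model.train() is called:

```python
import torch.nn as nn


def freeze_module(module: nn.Module):
    """Freeze a module: stop gradient updates and fix BN running stats."""
    module.eval()  # BN layers stop updating running mean/var in eval mode
    for param in module.parameters():
        param.requires_grad = False


class MyDetector(nn.Module):
    """Illustrative detector skeleton, not real MMDetection code."""

    def __init__(self, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.head = head
        freeze_module(self.backbone.layer1)  # hypothetical sub-module to freeze

    def train(self, mode=True):
        # The runner calls model.train() every epoch, so the frozen part
        # must be switched back to eval mode here as well.
        super().train(mode)
        self.backbone.layer1.eval()
        return self
```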
2 How to gracefully use multi-image data augmentation in training
Typical examples of augmentations that use multiple images at once are Mosaic and MixUp. Mosaic reads 4 images at a time, feeds each of them through the training augmentation pipeline, and finally merges them into one large image as output. Before Mosaic was supported, the MMDetection pipeline did not handle this atypical paradigm, and it was hard for users to support it directly.
Following the principle of extension rather than modification, we wanted to support multi-image augmentation without significantly changing the MMDetection pipeline. Therefore, similar to ConcatDataset, we created a multi-image dataset wrapper, MultiImageMixDataset, located in mmdet/datasets/dataset_wrappers.py. Its core implementation is:
```python
@DATASETS.register_module()
class MultiImageMixDataset:

    def __getitem__(self, idx):
        results = copy.deepcopy(self.dataset[idx])
        for (transform, transform_type) in zip(self.pipeline,
                                               self.pipeline_types):
            # If the current transform has a get_indexes method, call it
            if hasattr(transform, 'get_indexes'):
                # Returns the indexes of the extra images
                indexes = transform.get_indexes(self.dataset)
                if not isinstance(indexes, collections.abc.Sequence):
                    indexes = [indexes]
                # Fetch the raw data corresponding to those images
                mix_results = [
                    copy.deepcopy(self.dataset[index]) for index in indexes
                ]
                results['mix_results'] = mix_results

            # The transform then receives several images at once,
            # augments them together and returns the merged image
            results = transform(results)

            if 'mix_results' in results:
                results.pop('mix_results')

        return results
```
For example, Mosaic augmentation needs to receive four images and output one image each time, so the Mosaic class only needs to implement get_indexes to return the indexes of four samples; its augmentation function is then called to output one large image. If users have other similar requirements, implementing get_indexes and __call__ is enough.
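A minimal sketch of such a custom transform is given below (the class name RandomPairBlend and its naive blending logic are invented for illustration; only the get_indexes/__call__ contract matters, and a real transform would also merge boxes and labels):

```python
import random

import numpy as np

from mmdet.datasets.builder import PIPELINES


@PIPELINES.register_module()
class RandomPairBlend:
    """Hypothetical two-image transform: blend the current image with one extra image."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha

    def get_indexes(self, dataset):
        # Return the index (or a list of indexes) of the extra samples to load.
        return random.randint(0, len(dataset) - 1)

    def __call__(self, results):
        # MultiImageMixDataset has already placed the extra samples here.
        extra = results['mix_results'][0]
        img = results['img'].astype(np.float32)
        extra_img = extra['img'].astype(np.float32)
        h = min(img.shape[0], extra_img.shape[0])
        w = min(img.shape[1], extra_img.shape[1])
        # Naive pixel blend on the overlapping region.
        img[:h, :w] = (1 - self.alpha) * img[:h, :w] \
            + self.alpha * extra_img[:h, :w]
        results['img'] = img.astype(np.uint8)
        return results
```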
Note: the prerequisite for get_indexes to be called is that you use MultiImageMixDataset. A frequent issue report is that adding Mosaic to the configuration file does not take effect; this is because MultiImageMixDataset must be used at the same time, for example:
```python
train_pipeline = [
    dict(type='Mosaic', img_scale=img_scale, pad_val=114.0),
    dict(
        type='RandomAffine',
        scaling_ratio_range=(0.1, 2),
        border=(-img_scale[0] // 2, -img_scale[1] // 2)),
    dict(
        type='MixUp',
        img_scale=img_scale,
        ratio_range=(0.8, 1.6),
        pad_val=114.0),
    dict(type='YOLOXHSVRandomAug'),
    dict(type='RandomFlip', flip_ratio=0.5),
    # According to the official implementation, multi-scale
    # training is not considered here but in the
    # 'mmdet/models/detectors/yolox.py'.
    dict(type='Resize', img_scale=img_scale, keep_ratio=True),
    dict(
        type='Pad',
        pad_to_square=True,
        # If the image is three-channel, the pad value needs
        # to be set separately for each channel.
        pad_val=dict(img=(114.0, 114.0, 114.0))),
    dict(type='FilterAnnotations', min_gt_bbox_wh=(1, 1), keep_empty=False),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]

train_dataset = dict(
    type='MultiImageMixDataset',
    dataset=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True)
        ],
        filter_empty_gt=False,
    ),
    pipeline=train_pipeline)
```
3 How to adjust the data preprocessing pipeline and switch losses in real time during training
This requirement mainly comes from reproducing the YOLOX algorithm, but I expect some advanced users have it as well, so this article describes the current practice in some detail.
In the YOLOX algorithm, the author uses data augmentations including Mosaic, MixUp and ColorJitter. After epoch 285 (i.e., for the last 15 epochs), Mosaic and MixUp must be turned off and an additional L1 loss must be enabled.
For this kind of requirement, the most reasonable approach is to write a corresponding hook; hooks were designed precisely to solve such extension needs gracefully.
Therefore, we added a new YOLOXModeSwitchHook class to implement the functionality above.
```python
@HOOKS.register_module()
class YOLOXModeSwitchHook(Hook):

    def __init__(self,
                 num_last_epochs=15,
                 skip_type_keys=('Mosaic', 'RandomAffine', 'MixUp')):
        self.num_last_epochs = num_last_epochs
        self.skip_type_keys = skip_type_keys
        self._restart_dataloader = False

    def before_train_epoch(self, runner):
        epoch = runner.epoch
        train_loader = runner.data_loader
        model = runner.model
        if is_module_wrapper(model):
            model = model.module
        if (epoch + 1) == runner.max_epochs - self.num_last_epochs:
            runner.logger.info('No mosaic and mixup aug now!')
            # The dataset pipeline cannot be updated when persistent_workers
            # is True, so we have to force the dataloader's worker processes
            # to restart. This is a very hacky approach.
            # Switch the pipeline
            train_loader.dataset.update_skip_type_keys(self.skip_type_keys)
            if hasattr(train_loader, 'persistent_workers'
                       ) and train_loader.persistent_workers is True:
                train_loader._DataLoader__initialized = False
                train_loader._iterator = None
                self._restart_dataloader = True
            runner.logger.info('Add additional L1 loss now!')
            # Add the extra loss
            model.bbox_head.use_l1 = True
        else:
            # Once the restart is complete, we need to restore
            # the initialization flag.
            if self._restart_dataloader:
                train_loader._DataLoader__initialized = True
```
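In the YOLOX configs the hook is enabled through custom_hooks, roughly as follows (the exact priority value may differ between versions):

```python
custom_hooks = [
    dict(
        type='YOLOXModeSwitchHook',
        num_last_epochs=15,  # disable Mosaic/MixUp for the last 15 epochs
        priority=48)
]
```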
The above code also touches on the problem that, once a DataLoader has started multiple worker processes, its main attributes cannot be modified from outside. Since this problem is fairly subtle, it is described in detail below.
3.1 Why DataLoader attributes cannot be modified in real time with multiple workers, and the solution
To describe this problem clearly, two important DataLoader parameters need to be explained first:
- num_workers: the number of worker processes. If set to 0, only the main process is used; if greater than 0, several worker subprocesses are started to speed up dataset iteration, which can significantly speed up training.
- persistent_workers: whether the worker processes started last time are kept alive, i.e., whether their resources are released after the current DataLoader iteration finishes and the workers are restarted at the next iteration. If set to True, the workers are never released and are reused, which can significantly reduce the overhead of repeatedly spinning up dataloaders.
PyTorch's commonly recommended practice is to set num_workers to the number of CPU cores (or half of it) and persistent_workers to True.
So under this best practice, what goes wrong when the pipeline is modified during training? If you have no idea, look at the following example first:
```python
import numpy as np
from torch.utils.data import DataLoader, Dataset


class SimpleDataset(Dataset):

    def __init__(self):
        self.img_shape = (10, 10)

    def __getitem__(self, index):
        return np.ones(self.img_shape)

    def __len__(self):
        return 10


def main(num_worker, persistent_workers):
    dataset = SimpleDataset()
    dataloader = DataLoader(
        dataset,
        num_workers=num_worker,
        batch_size=2,
        persistent_workers=persistent_workers)

    for _ in range(2):
        print('start epoch')
        for i, data_batch in enumerate(dataloader):
            print(data_batch.shape)
            if i == 1:
                # Change the shape at i=1, i.e. the second iteration
                dataloader.dataset.img_shape = (20, 20)
        print('end epoch')

        # Change the shape for the second epoch
        dataloader.dataset.img_shape = (25, 25)


if __name__ == '__main__':
    main(num_worker=2, persistent_workers=True)
```
In the above code we want to achieve two things:
- Within each dataloader epoch, change the image shape to (20, 20) at the second iteration
- At the beginning of the second epoch, change the image shape to (25, 25)
With num_worker=2 and persistent_workers=True, the program outputs:
```
# num_worker=2, persistent_workers=True
start epoch
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
end epoch
start epoch
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
end epoch
```
Neither of the two expectations is met; this is exactly the problem stated above: a DataLoader cannot have its internal attributes modified in real time once worker processes are started.
If we set num_worker=0 and persistent_workers=True, i.e., do not start worker processes, the result is:
```
# num_worker=0, persistent_workers=True
ValueError: persistent_workers option needs num_workers > 0
```
Since persistent_workers must be used together with worker processes, we can only set num_worker=0 and persistent_workers=False:
```
# num_worker=0, persistent_workers=False
start epoch
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 20, 20])  # meets expectation
torch.Size([2, 20, 20])
torch.Size([2, 20, 20])
end epoch
start epoch
torch.Size([2, 25, 25])  # meets expectation
torch.Size([2, 25, 25])
torch.Size([2, 20, 20])  # meets expectation
torch.Size([2, 20, 20])
torch.Size([2, 20, 20])
end epoch
```
With num_worker=0 and persistent_workers=False, all requirements are met, which shows that the problem lies entirely in the worker processes.
What happens with num_worker=2 and persistent_workers=False?
```
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
end epoch
start epoch
torch.Size([2, 25, 25])  # meets expectation
torch.Size([2, 25, 25])
torch.Size([2, 25, 25])
torch.Size([2, 25, 25])
torch.Size([2, 25, 25])
end epoch
```
Clearly, num_worker=2 with persistent_workers=False meets only one requirement. To sum up:
- num_worker=0, persistent_workers=True is not allowed, because persistent_workers must be used with worker processes
- num_worker=0, persistent_workers=False can meet all requirements
- num_worker=2, persistent_workers=False can meet requirement 2
- num_worker=2, persistent_workers=True cannot meet any requirements
It can also be seen that persistent_workers only affects whether the worker processes are restarted; once the workers are running it makes no further difference, while num_worker directly controls how many worker processes there are.
In Python, once subprocesses are started, the main process and the subprocesses are completely isolated: modifying data in one process has no effect on the others unless the data is explicitly shared.
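A tiny standalone sketch of this isolation, independent of DataLoader (the Config class is invented purely for illustration): an attribute change made in the parent after the child process starts is never seen by the child.

```python
import multiprocessing as mp
import time


class Config:
    shape = (10, 10)


cfg = Config()


def worker():
    time.sleep(1)
    # The child works on its own copy of cfg taken at process start time.
    print('child sees:', cfg.shape)    # prints (10, 10)


if __name__ == '__main__':
    p = mp.Process(target=worker)
    p.start()
    cfg.shape = (20, 20)               # modified only in the parent process
    print('parent sees:', cfg.shape)   # prints (20, 20)
    p.join()
```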
So with num_worker=2 and persistent_workers=True, how can we meet the requirements? In fact, requirement 1 cannot be achieved simply by modifying DataLoader attributes, but requirement 2 can.
The solution starts from the role of persistent_workers: when it is set to True, the requirements cannot be met precisely because the worker processes are never recreated. So what if we force them to be rebuilt?
```python
def main(num_worker, persistent_workers):
    dataset = SimpleDataset()
    dataloader = DataLoader(
        dataset,
        num_workers=num_worker,
        batch_size=2,
        persistent_workers=persistent_workers)

    for _ in range(2):
        print('start epoch')
        for i, data_batch in enumerate(dataloader):
            print(data_batch.shape)
            if i == 1:
                # Change the shape at i=1, i.e. the second iteration
                dataloader.dataset.img_shape = (20, 20)
        print('end epoch')

        # Change the shape for the second epoch
        dataloader.dataset.img_shape = (25, 25)

        # Add the following code to force the workers to restart
        if hasattr(dataloader, 'persistent_workers'
                   ) and dataloader.persistent_workers is True:
            dataloader._DataLoader__initialized = False
            dataloader._iterator = None
```
Running again with num_worker=2 and persistent_workers=True gives:
```
start epoch
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
torch.Size([2, 10, 10])
end epoch
start epoch
torch.Size([2, 25, 25])  # meets expectation
torch.Size([2, 25, 25])
torch.Size([2, 25, 25])
torch.Size([2, 25, 25])
torch.Size([2, 25, 25])
end epoch
```
The core idea is to force the iterator to be rebuilt. The pipeline-switching requirement of YOLOX is realized with exactly this trick.
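For reference, the dataset-side half of the switch is the update_skip_type_keys method that the hook calls on MultiImageMixDataset. A simplified, standalone sketch of the idea (the real implementation in mmdet/datasets/dataset_wrappers.py differs in details):

```python
import copy


class TinyMixDataset:
    """Toy wrapper illustrating the skip mechanism, not actual mmdet code."""

    def __init__(self, dataset, pipeline):
        self.dataset = dataset
        self.pipeline = pipeline  # list of transform callables
        self.pipeline_types = [t.__class__.__name__ for t in pipeline]
        self._skip_type_keys = None

    def update_skip_type_keys(self, skip_type_keys):
        # Called by the hook, e.g. with ('Mosaic', 'RandomAffine', 'MixUp').
        self._skip_type_keys = skip_type_keys

    def __getitem__(self, idx):
        results = copy.deepcopy(self.dataset[idx])
        for transform, transform_type in zip(self.pipeline,
                                             self.pipeline_types):
            if (self._skip_type_keys is not None
                    and transform_type in self._skip_type_keys):
                continue  # this transform has been switched off
            results = transform(results)
        return results
```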
If you have any questions, feel free to leave a comment. For example, what if you really need requirement 1? That problem can also be solved; if there is enough interest, we can cover it in a later article~
4 Summary
This article analyzed three atypical operation skills in MMDetection:
- How to set different learning rates for different layers and freeze specific layers
- How to gracefully use multi-image data augmentation in training
- How to adjust the data preprocessing pipeline and switch losses in real time during training
I believe many people have run into these three problems when using MMDetection, which is why this article answers them in detail. If you still have doubts, you can leave a message under the article, and we will actively reply and follow up.
The above are only the atypical operation skills that I think are worth highlighting. If you have other suggestions or items to add, please leave a message!
In upcoming articles we will also cover the following topics. Stay tuned!
- How to gracefully turn on mixed precision training through the configuration
- How to use timm's backbone network in MMDetection
- Why the pipeline of the val workflow comes from the train dataset
- Significance and function of DataContainer
- How to initialize parameters gracefully through configuration
- How to quickly locate the common error in distributed training that some model parameters were not used in computing the loss
- Correct use of EMA Hook
- How to gracefully add plug-ins to ResNet to improve performance