Training Mask-RCNN 10x Faster with LAMB

By Steve Thomas

[Banner image]

The biggest obstacle to training state-of-the-art object detection models is cycle time. Even with a relatively small dataset like COCO and a standard network like Mask-RCNN with a ResNet-50 backbone, convergence can take over a week using synchronous stochastic gradient descent (SGD) on 8 NVIDIA Tesla V100s. Since small tweaks to implementations or hyperparameters can lead to drastically different results, it could take years to tune a model trained on a large dataset like the Open Images Detection Dataset V5.

Theoretically, the easiest way to decrease cycle time is to train on more GPUs, thereby increasing the effective batch size, and to scale the learning rate (LR) linearly [2]. In practice, it’s not so simple. Even assuming one has the resources and know-how to set up the infrastructure, or an Engine ML account, naively increasing the batch size and LR causes training to diverge.
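
As a concrete sketch of the linear scaling rule from [2], the arithmetic looks like this (the base values below are illustrative, not the exact settings from our runs):

```python
# Linear LR scaling rule [2]: grow the LR by the same factor as the batch size.
# Illustrative values only; not the exact settings from our experiments.
base_lr = 0.02          # LR tuned for the reference batch size
base_batch_size = 16    # e.g. 8 GPUs x 2 images per GPU
batch_size = 128        # e.g. 64 GPUs x 2 images per GPU

scaled_lr = base_lr * batch_size / base_batch_size  # 0.16
```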

The go-to recipe for combating large-batch instability is an LR warmup: start with a small LR and slowly increase it until reaching the target. Although this fixes the instability problem, it has been shown that for many networks, such as BERT or ResNet, beyond a certain batch size linear LR scaling with a warmup degrades the model’s ability to generalize.
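
A minimal sketch of a linear warmup schedule of the kind described above (the helper name `warmup_lr` and the `warmup_ratio` default are assumptions for illustration, not our exact settings):

```python
def warmup_lr(step, warmup_steps, target_lr, warmup_ratio=0.1):
    """Linearly ramp the LR from warmup_ratio * target_lr up to target_lr,
    then hold it there (decay schedules are applied separately)."""
    if step >= warmup_steps:
        return target_lr
    progress = step / warmup_steps
    return target_lr * (warmup_ratio + (1.0 - warmup_ratio) * progress)

# Example: a 500-step warmup toward a target LR of 0.16.
lrs = [warmup_lr(s, warmup_steps=500, target_lr=0.16) for s in range(1000)]
```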

Obviously, this is problematic for us at Engine ML. Our goal is to enable teams of any size to command the power and scale of the largest tech companies by unlocking scalable deep learning. Many machine learning engineers, who often utilize servers with 1-4 GPUs, believe that large scale training leads to bad performance. This is incorrect. For example, the ensemble of networks currently at the top of the COCO leaderboard was trained on 128 GPUs [7] and the winning team of the object detection track in the 2019 Open Images Challenge trained on up to 512 GPUs [1].

Nonetheless, doubt remains, and understandably so, given the lack of access to large GPU clusters. Although much work has been done recently on large-batch stochastic optimization methods, notably the development of the Layer-wise Adaptive Rate Scaling (LARS) [6] and Layer-wise Adaptive Moments Based (LAMB) [5] optimizers, object detection has been largely neglected in the literature. That is why we decided to run a few experiments ourselves and correct any misconceptions. As far as we can tell, we are the first to publish metrics for an object detection network trained with large batches using LAMB, making our results both exciting and an impetus for further experimentation.

Optimizers

To date, the most popular optimizer for training convolutional neural networks is SGD with momentum. In plain SGD, each weight in the network is updated at step t by subtracting a fraction, the LR, of the stochastic gradient of the loss with respect to that weight.1

$$w_{t+1} = w_t - \eta \, \nabla L(w_t)$$

SGD with momentum adds a fraction of the previous update step to the current one, damping noisy oscillations between minibatches, before the weights are updated.

$$v_t = \mu \, v_{t-1} + \nabla L(w_t)$$

$$w_{t+1} = w_t - \eta \, v_t$$
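
In code, one step of the update above looks roughly like this (a PyTorch-style sketch; weight decay is omitted, as in footnote 1):

```python
import torch

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum step, mirroring the equations above."""
    velocity.mul_(momentum).add_(grad)  # v_t = mu * v_{t-1} + grad
    w.sub_(lr * velocity)               # w_{t+1} = w_t - lr * v_t
    return w, velocity
```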

When the LR is large, as it is when scaled linearly to account for larger batch sizes, the update can be proportionally larger than the weights themselves, potentially causing the loss to diverge. This insight led You et al. to develop LARS. In LARS, each layer of the network uses a "trust coefficient" to compute a local LR. The "trust coefficient" is the L2 norm of the layer's weights divided by the L2 norm of its gradients.

$$\eta_{\text{local}} = \eta \, \frac{\lVert w \rVert_2}{\lVert \nabla L(w) \rVert_2}$$

The update rule remains the same as SGD's, except the local LR replaces the global LR. The magnitude of each layer's update no longer depends on the magnitude of its gradient: layers with larger weights receive larger updates, allowing the model to converge more quickly.
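
Below is a per-layer sketch of that scheme, following the description above; a full LARS implementation [6] adds weight decay and other details omitted here, and the small epsilon is an assumption for numerical safety.

```python
import torch

def lars_step(w, grad, velocity, global_lr, momentum=0.9, eps=1e-9):
    """SGD-with-momentum step using LARS's layer-wise local LR."""
    local_lr = global_lr * w.norm() / (grad.norm() + eps)  # global LR * trust coefficient
    velocity.mul_(momentum).add_(grad)
    w.sub_(local_lr * velocity)
    return w, velocity
```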

Although LARS was successful at training models with large batch sizes on ImageNet, it did not work well for attention models like BERT. This led You et al. to apply to Adaptive Moment Estimation (Adam) the same technique that turned SGD into LARS. Adam tracks running averages of the first two moments of the gradient, its mean and uncentered variance. Rather than updating the weights by a fraction of the gradient, as in SGD, Adam updates them by a fraction of the moving average of the gradient, normalized by the square root of the running variance.2

$$m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, \nabla L(w_t)$$

$$v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, \big(\nabla L(w_t)\big)^2$$

$$w_{t+1} = w_t - \eta \, \frac{m_t}{\sqrt{v_t} + \epsilon}$$
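
One Adam step, mirroring the equations above (a sketch; bias correction is omitted, as noted in footnote 2):

```python
import torch

def adam_step(w, grad, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: update the running moments, then apply the normalized update."""
    m.mul_(beta1).add_((1 - beta1) * grad)         # running mean of the gradient
    v.mul_(beta2).add_((1 - beta2) * grad * grad)  # running uncentered variance
    w.sub_(lr * m / (v.sqrt() + eps))              # normalized update
    return w, m, v
```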

In LAMB, the "trust coefficient," or "trust ratio," is the L2 norm of the weights divided by the L2 norm of the Adam update, giving the following local LR.

$$\eta_{\text{local}} = \eta \, \frac{\lVert w \rVert_2}{\left\lVert \frac{m_t}{\sqrt{v_t} + \epsilon} \right\rVert_2}$$

As with LARS, the update remains the same as Adam's, except the local LR is used instead of the global LR.3
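
Putting the pieces together, here is a sketch of one LAMB step as described above; the Apex implementation we actually used [3] differs in some details (see footnote 3), and bias correction and weight decay are again omitted.

```python
import torch

def lamb_step(w, grad, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step scaled by a per-layer trust ratio ||w|| / ||adam_update||."""
    m.mul_(beta1).add_((1 - beta1) * grad)
    v.mul_(beta2).add_((1 - beta2) * grad * grad)
    adam_update = m / (v.sqrt() + eps)
    trust_ratio = w.norm() / (adam_update.norm() + eps)
    w.sub_(lr * trust_ratio * adam_update)
    return w, m, v
```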

Experiments

Using the Open MMLab Detection Toolbox and Benchmark [9], we trained Mask-RCNN on the COCO dataset for 12 epochs with SGD with momentum, LARS, and LAMB, using a ResNet-50 backbone pretrained on ImageNet. As in the COCO object detection competition, we used mAP@[.50:.05:.95] to gauge each model's ability to generalize. Our source of truth is the MMDetection paper, whose numbers outperform those of the original Mask-RCNN paper: the baseline for Mask-RCNN with ResNet-50 is an mAP of 0.37 on the COCO validation set after 12 epochs of training on 8 GPUs with a batch size of 16. Our results on 64 GPUs with a batch size of 128 can be found below.4 The only hyperparameters we tuned were the LR and the duration of the warmup. All jobs used a step LR decay policy that decreased the LR by 90% after epochs 8 and 11; all other hyperparameters were set to the defaults described in the original papers.5
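
For reference, the relevant fragment of an MMDetection-style config for the 8-GPU SGD baseline schedule looks roughly like the sketch below; field names follow MMDetection's conventions, and the LR and warmup values we used for the 64-GPU jobs are not reproduced here.

```python
# Sketch of an MMDetection-style config fragment (baseline 1x schedule).
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
lr_config = dict(
    policy='step',            # drop the LR by 90% ...
    step=[8, 11],             # ... after epochs 8 and 11
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=1.0 / 3)
total_epochs = 12
```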

[Results table and graph: mAP@[.50:.05:.95] and training time for SGD with momentum, LARS, and LAMB at batch size 128 on 64 GPUs.]

*We trained all models for 12 epochs because our goal was to replicate the results of the MMDetection paper and show that larger batch sizes do not degrade model performance. Models trained for longer continue to improve: for example, one model trained with SGD for 24 epochs achieved an mAP of 0.382.

Conclusion

With Engine ML, we were able to replicate Mask-RCNN's baseline performance while reducing training time from a week to half a day, without having to tweak anything other than the LR.6 Unlike image classification or NLP networks, where batch sizes on 64 GPUs can reach thousands of samples, object detection networks use a batch size of only 2 images per GPU. At that scale, vanilla SGD with momentum with a tuned LR and warmup is enough to stabilize training without degrading the model's ability to generalize.

That said, LAMB lives up to its claim of being "a general purpose optimizer that works for both small and large batches." With no hyperparameter tuning, LAMB outperforms SGD with momentum and matches the baseline trained on 8 GPUs. To our knowledge, LAMB is the first adaptive optimizer to outperform SGD with momentum on convolutional models.

LARS underperforms, but this is expected: the model trained with LARS is still learning. As with the image classification networks trained in the LARS paper, the baseline performance should be attainable simply by increasing the number of epochs.

We therefore recommend LAMB as an alternative to SGD with momentum for object detection tasks if your current model is unstable or suffers performance degradation when scaling.

If you are interested in running your own large scale experiments on Engine ML, please reach out on our contact page.

Resources

  1. 1st Place Solutions for OpenImage2019 - Object Detection
  2. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
  3. Apex: A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
  4. An overview of gradient descent optimization algorithms
  5. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
  6. Large Batch Training of Convolutional Networks
  7. MegDet: A Large Mini-Batch Object Detector
  8. MMDetection: Open MMLab Detection Toolbox and Benchmark
  9. Open MMLab Detection Toolbox and Benchmark
  10. Pretraining BERT with Layer-wise Adaptive Learning Rates

The title image is licensed under creative commons. It has been modified from its original source.


  1. In all the following optimizer equations, weight decay is omitted for brevity and clarity. A weight decay of 0.0001 was used for the SGD and LAMB experiments, and 0.0005 for the LARS experiments.

  2. I did not include bias correction for the sake of brevity. For a more thorough overview of Adam, see [4].

  3. The implementation of LAMB used in the reported experiments is that included in NVIDIA’s Apex library [3]. A discussion of NVIDIA’s implementation of LAMB and how it slightly differs from the paper can be found on NVIDIA’s developer blog [10].

  4. All jobs were run on K80s. Training time could be cut nearly in half on V100s.

  5. Because of the computational overhead of syncing batch normalization statistics across model replicas, the only batch normalization used was in the backbone, where the means and variances were frozen to their ImageNet values. According to the MMDetection paper [8], experiments using sync batch normalization performed no better than those without it. Conversely, according to the MegDet paper [7], from the current leaders of the COCO object detection track, sync batch normalization was the key to achieving state-of-the-art results. More experiments in this direction remain for a future date.

  6. Because Engine ML allows users to easily run jobs on AWS spot instances and supports automatic restarts, resuming training from the most recent checkpoint when instances are preempted, all jobs were run for approximately ⅓ of the on-demand cost.