4. Experiments
4.1. Experimental Setup
To investigate the early-bird ticket hypothesis in Transformer models, we conducted experiments on four different architectures: ViT, Swin-T, GPT-2, and RoBERTa. The experiments were performed using the following setup:
Hardware. The experiments were conducted on the Georgia Tech PACE ICE clusters, using A100, A40, or V100 GPUs with a minimum of 32 GB of memory, depending on availability.
Software. The experiments were implemented in PyTorch, with the Hugging Face libraries used for the language models.
Datasets. For the vision transformers, we used the CIFAR-10 dataset. For fine-tuning the language models, we used the IMDb dataset.
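To make this setup concrete, the sketch below shows one way the data and models could be loaded with the stated libraries. The specific checkpoints (e.g. roberta-base), the 224x224 resize, and the preprocessing are illustrative assumptions rather than details reported above.

```python
import torchvision
import torchvision.transforms as T
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Vision side: CIFAR-10, resized to the 224x224 resolution typical for ViT/Swin-T.
transform = T.Compose([T.Resize(224), T.ToTensor()])
cifar_train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform)

# Language side: IMDb sentiment classification for fine-tuning GPT-2 / RoBERTa.
imdb = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
```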
4.2. Results and Analysis
4.2.1 ViT
For the ViT model, the early-bird ticket was found around epoch 20 [3a]. When retrained, the pruned model with a pruning ratio of 0.1 (p = 0.1) nearly recovered the baseline performance [2a], achieving an accuracy of 84.3% compared to the unpruned baseline of 85.11%. The model with a higher pruning ratio of 0.3 (p = 0.3) also came close to the baseline, with an accuracy of 82.05%. These results illustrate the trade-off between model sparsity and performance, indicating that a moderate pruning ratio can yield significant resource savings while maintaining comparable accuracy.
Figure 3. Heatmaps and mask distance plots at p = 0.1 (left) and p = 0.3 (right) for all models.
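To illustrate how such an early-bird ticket can be detected, the sketch below computes the mask distance underlying Figure 3: the model is pruned at ratio p after each epoch and consecutive binary masks are compared. The restriction to nn.Linear layers, the use of global L1 magnitude pruning, and the 0.1 distance threshold are assumptions for illustration, not the paper's exact settings.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def pruning_mask(model, amount):
    """Global magnitude pruning of all nn.Linear weights; returns a flat binary mask."""
    snapshot = copy.deepcopy(model)  # prune a copy so training is unaffected
    params = [(m, "weight") for m in snapshot.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
    return torch.cat([m.weight_mask.flatten() for m, _ in params])

def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two binary masks (the quantity plotted in Figure 3)."""
    return (mask_a != mask_b).float().mean().item()

def check_early_bird(model, prev_mask, p=0.1, threshold=0.1):
    """Call once per epoch; declares an early-bird ticket when the mask stabilizes."""
    mask = pruning_mask(model, p)
    found = prev_mask is not None and mask_distance(mask, prev_mask) < threshold
    return found, mask
```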
4.2.2 Swin-T
Similar to ViT, the early-bird ticket for the Swin-T model was found around epoch 20 [3b]. When retrained, the p = 0.1 model fully recovered the baseline performance, achieving an accuracy of 89.54% compared to the unpruned baseline of 89.5%. Interestingly, the p = 0.3 model also nearly recovered the baseline, with an accuracy of 88.95%. These results suggest that the Swin-T architecture is particularly well suited to the early-bird ticket hypothesis, maintaining high performance even under substantial pruning.
4.2.3 GPT-2
For GPT-2, we focused on identifying early-bird tickets during the fine-tuning stage. Remarkably, the early-bird ticket was discovered as early as epoch 2 of fine-tuning [3c]. When fine-tuned with pruning, both the p=0.1 and p=0.3 pruned models achieved a validation accuracy of 83.4%, slightly surpassing the unpruned baseline accuracy of 83.3% [2c]. This finding highlights the potential for early-bird tickets to emerge rapidly during the fine-tuning process, enabling efficient adaptation of pre-trained language models to downstream tasks.
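A minimal sketch of such pruning-aware fine-tuning for GPT-2 is given below. The pruning scope, the choice of checkpoint, and the Conv1D import path (which can vary across transformers versions) are assumptions; the Hugging Face GPT-2 implements its attention and MLP projections as Conv1D modules rather than nn.Linear, so both module types are included.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer
from transformers.pytorch_utils import Conv1D  # GPT-2 projections are Conv1D, not nn.Linear

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.config.pad_token_id = tokenizer.eos_token_id  # GPT-2 has no pad token by default

# Prune 10% of the weights in the linear/Conv1D projection layers by magnitude.
p = 0.1
targets = [(m, "weight") for m in model.modules() if isinstance(m, (nn.Linear, Conv1D))]
prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=p)

# The pruning reparameterization keeps the mask applied while the model is
# fine-tuned (e.g. with the Hugging Face Trainer); calling prune.remove() on
# each target afterwards bakes the zeros into the weights permanently.
```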
4.2.4 RoBERTa
Similar to GPT-2, we identified the early-bird ticket for RoBERTa at epoch 2 of the fine-tuning stage [3d]. When fine-tuned with pruning, the p = 0.1 and p = 0.3 pruned models achieved validation accuracies of 86.0% and 86.02%, respectively [2d]. Although these accuracies are lower than the unpruned baseline accuracy of 93.9%, the pruned models still maintain a high level of performance while significantly reducing the computational requirements. This gap is likely related to the model's architecture and may explain why GPT-2 recovers its baseline performance more fully. The architectural differences between RoBERTa and GPT-2, such as RoBERTa's use of dynamic masking and a different pre-training objective, may contribute to the variation in their ability to recover performance after pruning [7].
4.2.5 Memory Usage
In addition to the performance evaluation, we analyzed the memory usage of the pruned models compared to their unpruned counterparts. Table 1 presents the memory usage comparison for each model. The relative memory reduction was roughly the same at both pruning levels (p = 0.1 and p = 0.3). Other metrics, such as FLOPs, parameter count, and inference time, showed no change: unstructured pruning leaves the architecture intact and only reduces the memory needed to store the full-precision weights. The unpruned ViT model consumed 157.26 MB of memory, while the pruned models (p = 0.1 and p = 0.3) required only 83.61 MB, a reduction of 46.8%. Similarly, Swin-T achieved a memory reduction of 49.0%, with the unpruned model consuming 423.43 MB and the pruned models 216.03 MB. For the language models, GPT-2 exhibited a memory reduction of 20.6%, and RoBERTa a notable reduction of 26.9%, comparing the unpruned and p = 0.1 pruned models. These results highlight the substantial memory savings achieved through pruning, making the models more resource-efficient while maintaining performance comparable to the unpruned baselines. An architecture-specific pruning method could yield different results than pruning only the linear layers.
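As an illustration of where such savings come from, the sketch below compares the footprint of dense full-precision weights with the same weights stored in sparse form after pruning. This is one possible way to make the measurement, not the exact procedure behind Table 1.

```python
import torch

def dense_megabytes(model):
    """Footprint of all parameters stored densely in full precision."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

def sparse_megabytes(model):
    """Footprint if zeroed (pruned) 2-D weights are stored in sparse COO form instead."""
    total = 0
    for p in model.parameters():
        if p.dim() == 2 and bool((p == 0).any()):
            sp = p.detach().to_sparse()
            total += sp.values().numel() * sp.values().element_size()    # surviving weights
            total += sp.indices().numel() * sp.indices().element_size()  # their coordinates
        else:
            total += p.numel() * p.element_size()
    return total / 1e6

# Usage on an already-pruned model:
# print(f"dense: {dense_megabytes(model):.2f} MB, sparse: {sparse_megabytes(model):.2f} MB")
```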
4.2.6 Discussion
The experimental results provide strong evidence for the existence of early-bird tickets in Transformer models across both vision and language domains. The early-bird tickets were consistently found within the first few epochs of training or fine-tuning, indicating the potential for significant resource optimization and cost reduction.
The performance of the pruned models obtained from early-bird tickets was comparable to or even surpassed the unpruned baselines in some cases. This suggests that the early-bird ticket hypothesis holds true for Transformer architectures, and that pruning can be effectively applied to reduce the computational requirements without compromising performance.
Furthermore, the comparative analysis across different Transformer models highlights the generalizability of the early-bird ticket phenomenon. The successful identification of early-bird tickets in ViT, Swin-T, GPT-2, and RoBERTa demonstrates the applicability of this approach to a wide range of Transformer architectures.
However, it is important to note that the optimal pruning ratio may vary depending on the specific model and task. While higher pruning ratios can lead to greater resource savings, they may also result in a slight degradation in performance.
5. Conclusion
In this research, we investigated the early-bird ticket hypothesis in Transformer models across vision and language domains. By employing a methodology based on iterative pruning, masked distance calculation, and selective retraining, we successfully identified early-bird tickets in various Transformer architectures. Our experimental results demonstrate that these early-bird tickets can achieve comparable or even superior performance to the unpruned models while significantly reducing the computational requirements. The consistent emergence of early-bird tickets within the first few epochs of training or fine-tuning highlights the potential for substantial resource optimization and cost reduction in Transformer model development.
This study contributes to the growing body of research on efficient training strategies for Transformer models and paves the way for further exploration of the early-bird ticket hypothesis across a wider range of architectures and tasks. By leveraging the insights gained from this research, practitioners can develop more efficient and accessible Transformer models, enabling their deployment in resource-constrained environments and accelerating the progress of natural language processing and computer vision applications.
References
[1] Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained BERT networks. Advances in Neural Information Processing Systems, 33:15834–15846, 2020.
[2] Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang, and Jingjing Liu. EarlyBERT: Efficient BERT training via early-bird lottery tickets. arXiv preprint arXiv:2101.00063, 2020.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[5] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[6] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28, 2015.
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[8] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? Advances in Neural Information Processing Systems, 32, 2019.
[9] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[10] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243, 2019.
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[12] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. arXiv preprint arXiv:1910.04732, 2019.
[13] Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957, 2019.
Author:
(1) Shravan Cheekati, Georgia Institute of Technology ([email protected]).