Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have long been used for various natural language processing (NLP) tasks. But these models are large, especially because of the input and output embedding parameters. In the past two years, the field of NLP has made significant progress, as is evident from the GLUE and SuperGLUE leaderboards. Transformer-based models like Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT-2), Multi-task Deep Neural Network (MT-DNN), XLNet, and the Text-to-Text Transfer Transformer (T5) have been major contributors to this success. But these models are humongous in size: BERT (340M parameters), GPT-2 (1.5B parameters), MegatronLM (8.3B parameters), T5 (11B parameters). The T5 model alone is 21.7GB in size. Real-world applications, however, demand small model sizes, low response times, and low power consumption. In this tutorial, our aim is to discuss six types of methods for compressing such models for text, in order to enable their deployment in real industry NLP applications and projects. The six types of methods are: pruning, quantization, knowledge distillation, parameter sharing, matrix decomposition, and other Transformer-based methods. Given the critical need for building applications with efficient and small models, and the large amount of recently published work in this area, we believe that this tutorial is very timely. As can be seen from the long list of referenced papers, in this tutorial we will organize the related work done by the ‘deep learning for NLP’ community in the past few years, present it as a coherent story, and summarize the research advances in the field of model compression for text.
The tutorial will be offered at CIKM'20.
Detailed Tutorial Outline
Here is a brief outline of the tutorial with relevant references.
- Need for compression of deep learning models for text.
- Broad overview of popular ways of model compression.
- Pruning: Pruning methods aim at sparsifying weight matrices in neural networks. Methods differ based on what is pruned and the actual logic used to prune. Given a matrix, one can prune some entries, rows/columns (i.e., neurons), blocks, or heads (the matrix itself). We will also talk about static versus dynamic pruning. (A code sketch follows the sub-topics below.)
- Zero-out weights: [10, 13, 20, 22, 23, 35, 51, 67]
- Pruning neurons: [25, 43, 54, 73]
- Pruning blocks: [6, 19, 44]
- Pruning heads: 
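To make the zero-out-weights family concrete, here is a minimal PyTorch sketch of unstructured magnitude pruning; the function name, matrix size, and 90% sparsity target are illustrative choices of ours rather than the recipe of any one cited paper:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of a weight matrix.

    sparsity is the fraction of entries to remove (0.9 keeps only 10%).
    """
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight.clone()
    # Threshold at the k-th smallest absolute value; prune everything at or below it.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask

W = torch.randn(512, 512)
W_pruned = magnitude_prune(W, sparsity=0.9)
print(f"achieved sparsity: {(W_pruned == 0).float().mean().item():.2f}")  # ~0.90
```

Pruning neurons, blocks, or heads follows the same pattern, except the mask zeroes whole rows/columns, tiles, or attention heads instead of individual entries.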
- Quantization: Network quantization compresses the original network by reducing the number of bits required to represent each weight. Weights can be quantized to two values (binary), three values (ternary), or multiple bits. The quantization can be uniform or non-uniform. (A code sketch follows the sub-topics below.)
- Binarized networks: [1, 12, 28, 66]
- Ternarized networks: [1, 27, 45, 64]
- Quantized networks: [1, 5, 10, 24, 29, 30, 33, 52, 53, 72]
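As an illustration of the uniform case, here is a hedged sketch of symmetric per-tensor quantization; the single per-tensor scale and the function names are our own simplifications (binary and ternary networks use dedicated schemes):

```python
import torch

def uniform_quantize(weight: torch.Tensor, num_bits: int = 8):
    """Symmetric uniform quantization with one scale per tensor (2 <= num_bits <= 8)."""
    qmax = 2 ** (num_bits - 1) - 1               # e.g. 127 for 8 bits
    scale = weight.abs().max() / qmax            # one float scale per tensor
    q = torch.clamp(torch.round(weight / scale), -qmax, qmax)
    return q.to(torch.int8), scale               # ints plus a single float

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

W = torch.randn(256, 256)
q, scale = uniform_quantize(W)
err = (W - dequantize(q, scale)).abs().max().item()
print(f"max reconstruction error: {err:.4f}")    # small at 8 bits
```

Storing int8 values plus one float scale already cuts memory roughly 4x relative to float32, before any further compression.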
- Knowledge Distillation (KD): Knowledge distillation is a model compression method in which a small model is trained to mimic a pre-trained, larger model (or an ensemble of models). Methods vary based on the following factors: number of teachers, number of students, what is transferred from teacher to student (intermediate representations, soft target distribution, combined loss across multiple words, etc.), usage of unlabeled data, and whether the weights in the student are quantized. (A code sketch follows the sub-topics below.)
- Learning from intermediate representations: 
- Multi-class KD using soft target distribution: [3, 26]
- Word-level and sequence-level KD, Sequence-Level Interpolation: 
- Reduced vocabulary in student: 
- Distilling Transformers: [55, 58]
- Quantized distillation: 
- Ensemble to single-model distillation: [26, 38, 57]
- Multiple student models (co-distillation): [2, 70]
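To ground the soft-target idea, here is a minimal sketch of the classic distillation loss of Hinton et al. [26]; the temperature T and mixing weight alpha shown are illustrative defaults, not values prescribed by the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft part: match the teacher's temperature-softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard loss
    # Hard part: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10, requires_grad=True)  # small student
teacher_logits = torch.randn(8, 10)                       # frozen teacher
labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```

The variants listed above change what goes into the two terms (intermediate representations, sequence-level distributions, multiple teachers or students), but most share this overall shape.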
- Parameter sharing: Model size can be reduced by sharing parameters. Methods differ depending on which parameters are shared, the technique used to share parameters, and the level at which sharing is performed. (A code sketch follows the sub-topics below.)
- Cross layer sharing for pretraining/finetuning setting: 
- Cross layer sharing with encoder-decoder tasks: 
- Weight sharing via hash functions: [8, 40]
- Parameter sharing in the embedding matrix: [9, 17, 36, 37, 56]
- Sharing the low-rank factor across layers: 
- Toeplitz-like structured matrices: 
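As one concrete pattern, here is a hedged sketch of cross-layer sharing in the spirit of ALBERT and the Universal Transformer: a single Transformer layer is applied repeatedly, so the parameter count stays constant in depth. The module, its sizes, and the assumption of PyTorch 1.9+ (for batch_first) are ours:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One Transformer encoder layer reused at every depth."""

    def __init__(self, d_model=256, nhead=4, num_layers=12):
        super().__init__()
        # A single set of layer weights, regardless of num_layers.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # same weights applied at each depth
        return x

model = SharedEncoder()
x = torch.randn(2, 16, 256)  # (batch, seq_len, d_model)
print(model(x).shape)        # torch.Size([2, 16, 256])
print(sum(p.numel() for p in model.parameters()))  # cost of one layer, not twelve
```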
- Matrix decomposition: Network parameters can be significantly reduced by factorizing large matrices into multiple smaller components. Methods differ in the type of factorization technique, the matrices being factorized, and the property of the weight matrix being exploited. (A code sketch follows the sub-topics below.)
- Low rank factorization: [18, 65]
- Factorized embedding parameterization: 
- Block-Term tensor decomposition: [41, 69]
- Singular Value Decomposition: 
- Joint factorization of recurrent and inter-layer weight matrices: 
- Tensor train decomposition: [18, 31, 60]
- Sparse factorization: 
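As a baseline illustration of low-rank factorization, here is a sketch that replaces a weight matrix W with the product of two thin factors obtained by truncated SVD; the function name and the rank are illustrative choices, and torch.linalg.svd assumes PyTorch 1.8+:

```python
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Approximate an m x n matrix W as A @ B with A: m x rank, B: rank x n.

    Storage drops from m*n to rank*(m+n) parameters.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

W = torch.randn(1024, 1024)
A, B = low_rank_factorize(W, rank=64)
rel_err = ((W - A @ B).norm() / W.norm()).item()
print(f"relative approximation error: {rel_err:.3f}")
```

At rank 64, a 1024x1024 matrix shrinks from about 1.05M to 131K parameters (roughly 8x); in practice the factors are typically fine-tuned after the decomposition to recover accuracy.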
- Other Transformer compression methods: Recently, special methods have been proposed specifically for the compression of Transformer-based networks. These essentially require structural modifications to the basic Transformer architecture.
- Vocabulary compression: 
- Quaternion attention model and Quaternion Transformers: 
- Deep equilibrium models: 
- Star Transformers: 
- Applications: In this section, we will discuss the application and success of various model compression methods across popular NLP tasks.
- Language modeling: [7, 24, 28, 29, 41, 43, 64, 72]
- Machine translation: [5, 32, 41, 51, 53, 57, 73]
- Sentiment analysis: [1, 24, 38, 52, 53]
- Question answering: [20, 33, 52]
- Natural language inference: [38, 52]
- Paraphrasing: 
- Image captioning: [13, 23, 69]
- Handwritten character recognition: 
- Summary and future trends.
Researchers in the field of applied deep learning will benefit the most, as this tutorial will give them an exhaustive overview of the research in the direction of practical deep learning. We believe that the tutorial will give newcomers a complete picture of the current work, introduce important research topics in this field, and inspire them to learn more. Practitioners and people from industry will clearly benefit from the discussions, both from the methods perspective and from the point of view of the applications where such mechanisms are starting to be deployed. This is an intermediate-level tutorial: we assume the audience knows some basic deep learning architectures. Prerequisite knowledge includes introductory-level deep learning, specifically recurrent neural network models and Transformers, along with a basic understanding of natural language processing and machine learning concepts.
Presenters
Manish Gupta, Microsoft AI & Research, India
Vasudeva Verma, IIIT Hyderabad
Sonam Damani, Microsoft AI & Research, India
Kedhar Nath Narahari, Microsoft AI & Research, India
References
[1] Md Zahangir Alom, Adam T Moody, Naoya Maruyama, Brian C Van Essen, and Tarek M Taha. Effective quantization approaches for recurrent neural networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.
[2] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
[3] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.
[4] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. arXiv preprint arXiv:1909.01377, 2019.
[5] Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, and Vikram Saletore. Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532, 2019.
[6] Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, and Lintao Zhang. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 63–72. ACM, 2019.
[7] Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998, 2018.
[8] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
[9] Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. Compressing neural language models by sparse word representations. arXiv preprint arXiv:1610.03950, 2016.
[10] Robin Cheong and Robel Daniel. transformers.zip: Compressing transformers with pruning and quantization. Technical report, Stanford University, Stanford, California, 2019.
[11] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[12] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[13] Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. Grow and prune compact, fast, and accurate LSTMs. arXiv preprint arXiv:1805.11797, 2018.
[14] Sonam Damani, Kedhar Nath Narahari, Ankush Chatterjee, Manish Gupta, and Puneet Agrawal. Optimized Transformer models for FAQ answering. In The 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2020. To appear.
[15] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004, 2015.
[18] Artem M Grachev, Dmitry I Ignatov, and Andrey V Savchenko. Compression of recurrent neural networks for efficient language modeling. Applied Soft Computing, 79:354–362, 2019.
[19] Scott Gray, Alec Radford, and Diederik P Kingma. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.
[20] Fu-Ming Guo, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. Reweighted proximal pruning for large-scale language representation. arXiv preprint arXiv:1909.12486, 2019.
[21] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. Star-Transformer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1315–1325, 2019.
[22] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[23] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, et al. DSD: Dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381, 2016.
[24] Qinyao He, He Wen, Shuchang Zhou, Yuxin Wu, Cong Yao, Xinyu Zhou, and Yuheng Zou. Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176, 2016.
[25] Tianxing He, Yuchen Fan, Yanmin Qian, Tian Tan, and Kai Yu. Reshaping deep neural network for fast decoding by node-pruning. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 245–249. IEEE, 2014.
[26] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[27] Lu Hou and James T Kwok. Loss-aware weight quantization of deep networks. arXiv preprint arXiv:1802.08635, 2018.
[28] Lu Hou, Quanming Yao, and James T Kwok. Loss-aware binarization of deep networks. arXiv preprint arXiv:1611.01600, 2016.
[29] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
[30] Supriya Kapur, Asit Mishra, and Debbie Marr. Low precision RNNs: Quantizing RNNs without losing accuracy. arXiv preprint arXiv:1710.07706, 2017.
[31] Valentin Khrulkov, Oleksii Hrinchuk, Leyla Mirvakhabova, and Ivan Oseledets. Tensorized embedding layers for efficient model compression. arXiv preprint arXiv:1901.10787, 2019.
[32] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
[33] Maximilian Lam. Word2Bits: Quantized word vectors. arXiv preprint arXiv:1803.05651, 2018.
[34] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[35] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[36] Xiang Li, Tao Qin, Jian Yang, and Tie-Yan Liu. LightRNN: Memory and computation-efficient recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4385–4393, 2016.
[37] Zhongliang Li, Raymond Kulhanek, Shaojun Wang, Yunxin Zhao, and Shuang Wu. Slim embedding layers for recurrent neural language models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[38] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482, 2019.
[39] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.
[40] Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. Learning compact recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5960–5964. IEEE, 2016.
[41] Xindian Ma, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, and Ming Zhou. A tensorized transformer for language modeling. arXiv preprint arXiv:1906.09777, 2019.
[42] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? arXiv preprint arXiv:1905.10650, 2019.
[43] Kenton Murray and David Chiang. Auto-sizing neural networks: With applications to n-gram language models. arXiv preprint arXiv:1508.05051, 2015.
[44] Sharan Narang, Eric Undersander, and Gregory Diamos. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782, 2017.
[45] Joachim Ott, Zhouhan Lin, Ying Zhang, Shih-Chii Liu, and Yoshua Bengio. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902, 2016.
[46] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
[47] Rohit Prabhavalkar, Ouais Alsharif, Antoine Bruguier, and Lan McGraw. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5970–5974. IEEE, 2016.
[48] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[49] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[50] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[51] Abigail See, Minh-Thang Luong, and Christopher D Manning. Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274, 2016.
[52] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. arXiv preprint arXiv:1909.05840, 2019.
[53] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068, 2017.
[54] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[55] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355, 2019.
[56] Jun Suzuki and Masaaki Nagata. Learning compact neural word embeddings by parameter space sharing. In IJCAI, pages 2046–2052, 2016.
[57] Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. Multilingual neural machine translation with knowledge distillation. arXiv preprint arXiv:1902.10461, 2019.
[58] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136, 2019.
[59] Yi Tay, Aston Zhang, Luu Anh Tuan, Jinfeng Rao, Shuai Zhang, Shuohang Wang, Jie Fu, and Siu Cheung Hui. Lightweight and efficient neural natural language processing with quaternion networks. arXiv preprint arXiv:1906.04393, 2019.
[60] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing recurrent neural network with tensor train. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4451–4458. IEEE, 2017.
[61] Ehsan Variani, Ananda Theertha Suresh, and Mitchel Weintraub. WEST: Word encoded sequence transducers. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7340–7344. IEEE, 2019.
[62] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.
[63] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
[64] Peiqi Wang, Xinfeng Xie, Lei Deng, Guoqi Li, Dongsheng Wang, and Yuan Xie. HitNet: Hybrid ternary recurrent neural network. In Advances in Neural Information Processing Systems, pages 604–614, 2018.
[65] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. arXiv preprint arXiv:1910.04732, 2019.
[66] Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. Alternating multi-bit quantization for recurrent neural networks. arXiv preprint arXiv:1802.00150, 2018.
[67] Yafeng Yang, Kaihuan Liang, Xuefeng Xiao, Zecheng Xie, Lianwen Jin, Jun Sun, and Weiying Zhou. Accelerating and compressing LSTM based model for online handwritten Chinese character recognition. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 110–115. IEEE, 2018.
[68] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
[69] Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9378–9387, 2018.
[70] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
[71] Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. Extreme language model compression with optimal subwords and shared projections. arXiv preprint arXiv:1909.11687, 2019.
[72] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 32(4):667–682, 2017.
[73] Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.