## Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network

Hsing-Hung Chou, Institute of Communications Engineering, National Tsing Hua University, Taiwan, paul8301526@gmail.com; Ching-Te Chiu, Institute of Communications Engineering, National Tsing Hua University, Taiwan; Yi-Ping Liao, Institute of Computer Science, National Tsing Hua University, Taiwan

Suggested Citation
Hsing-Hung Chou, Ching-Te Chiu and Yi-Ping Liao (2021), "Cross-layer knowledge distillation with KL divergence and offline ensemble for compressing deep neural network", APSIPA Transactions on Signal and Information Processing: Vol. 10: No. 1, e18. http://dx.doi.org/10.1017/ATSIP.2021.16

Publication Date: 17 Nov 2021
© 2021 Hsing-Hung Chou, Ching-Te Chiu and Yi-Ping Liao

Keywords
Deep convolutional model compression; Knowledge distillation; Transfer learning

This article is published under the terms of the Creative Commons Attribution licence.

I. INTRODUCTION
II. RELATED WORK
III. PROPOSED ARCHITECTURE
IV. EXPERIMENTAL RESULTS
V. DISCUSSION
VI. CONCLUSION

#### Abstract

Deep neural networks (DNNs) have been applied successfully to many tasks, including image classification, object detection, and semantic segmentation. However, a DNN model with a huge number of parameters and a high computational cost is difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that consists of three parts. First, we propose a cross-layer matrix to extract more features from the teacher model. Second, we adopt the Kullback-Leibler (KL) divergence in an offline environment so that the student model finds a wider, more robust minimum. Finally, we propose an offline ensemble of pre-trained teachers to teach the student model. Because the dimensions of the teacher and student models may not match, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to relax this constraint. We conducted experiments with VGG and ResNet models on the CIFAR-100 dataset. With VGG-11 as the teacher model and VGG-6 as the student model, Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and a $3.5\times$ computation rate. With ResNet-32 as the teacher model and ResNet-8 as the student model, Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and a $5.27\times$ computation rate. In addition, we conducted experiments on the ImageNet $64\times 64$ dataset. With MobileNet-16 as the teacher model and MobileNet-9 as the student model, Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and a $2.05\times$ computation rate.
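To make the three ingredients named in the abstract concrete, here is a minimal PyTorch sketch, not the authors' released code: a temperature-softened KL-divergence distillation loss, a $1\times 1$ convolution that maps student features to the teacher's channel dimension, and logit averaging over an offline ensemble of frozen teachers. The channel counts (64 and 256), the MSE feature-matching term, and the temperature `T=4.0` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T*T factor rescales gradients back to the magnitude of the
    # hard-label cross-entropy loss (standard in Hinton-style distillation).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# A 1x1 convolution lifts a student feature map (assumed 64 channels here)
# to the teacher's channel count (assumed 256) so that feature maps of
# mismatched dimensions can be compared.
adapter = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=1, bias=False)

def feature_loss(student_feat, teacher_feat):
    """Match adapted student features to teacher features (MSE is an assumption)."""
    return F.mse_loss(adapter(student_feat), teacher_feat)

def offline_ensemble_logits(teachers, x):
    """Average the logits of several frozen, pre-trained teacher models."""
    with torch.no_grad():
        return torch.stack([t(x) for t in teachers], dim=0).mean(dim=0)
```

In a training loop, the student's total loss would combine `kd_kl_loss` against `offline_ensemble_logits(teachers, x)` with the usual cross-entropy on ground-truth labels; the exact weighting and the cross-layer pairing of feature maps are specified in the paper itself.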

DOI: 10.1017/ATSIP.2021.16

#### Related publications

This article is part of the APSIPA Transactions on Signal and Information Processing special issue "Deep Neural Networks: Representation, Interpretation, and Applications".