APSIPA Transactions on Signal and Information Processing > Vol 12 > Issue 1

Convolutional Neural Networks Inference Memory Optimization with Receptive Field-Based Input Tiling

Weihao Zhuang, Kobe University, Japan, zhuangweihao@stu.kobe-u.ac.jp; Tristan Hascoet, Kobe University, Japan; Xunquan Chen, Kobe University, Japan; Ryoichi Takashima, Kobe University, Japan; Tetsuya Takiguchi, Kobe University, Japan; Yasuo Ariki, Kobe University, Japan
Suggested Citation
Weihao Zhuang, Tristan Hascoet, Xunquan Chen, Ryoichi Takashima, Tetsuya Takiguchi and Yasuo Ariki (2023), "Convolutional Neural Networks Inference Memory Optimization with Receptive Field-Based Input Tiling", APSIPA Transactions on Signal and Information Processing: Vol. 12: No. 1, e3. http://dx.doi.org/10.1561/116.00000015

Publication Date: 18 Jan 2023
© 2023 W. Zhuang, T. Hascoet, X. Chen, R. Takashima, T. Takiguchi and Y. Ariki
Keywords: Convolutional neural network, memory optimization, receptive field


Open Access

This is published under the terms of CC BY-NC.


Deep learning now plays an indispensable role in many fields, including computer vision, natural language processing, and speech recognition. Convolutional Neural Networks (CNNs) have demonstrated excellent performance in computer vision tasks thanks to their powerful feature-extraction capability. However, because larger models have shown higher accuracy, recent state-of-the-art CNN models consume ever more resources. This paper investigates a conceptual approach to reducing the memory consumption of CNN inference. Our method processes the input image as a sequence of carefully designed tiles within the lower subnetwork of the CNN, so as to minimize peak memory consumption while keeping the end-to-end computation unchanged. The method introduces a trade-off between memory consumption and computation, and is particularly suitable for high-resolution inputs. Our experimental results show that the proposed method reduces the memory consumption of MobileNetV2 by up to 5.3 times. For ResNet50, one of the most commonly used CNN models in computer vision tasks, memory consumption is reduced by up to 2.3 times.
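The tiling idea summarized above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation. It assumes a one-dimensional input, stride-1 convolutions without padding, and integer weights, and it uses the length of the largest activation as a crude proxy for peak memory. All names (`conv1d_valid`, `run_subnetwork`, `run_tiled`) and all values are hypothetical.

```python
def conv1d_valid(x, kernel):
    """Stride-1 convolution (cross-correlation) without padding, on a list."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def run_subnetwork(x, kernels):
    """Apply the layers in sequence and report the largest activation length."""
    peak = len(x)
    for kernel in kernels:
        x = conv1d_valid(x, kernel)
        peak = max(peak, len(x))
    return x, peak


def receptive_field(kernels):
    """Receptive field size of a stack of stride-1 convolutions."""
    return 1 + sum(len(k) - 1 for k in kernels)


def run_tiled(x, kernels, tile):
    """Produce the same output tile by tile; each tile reads its own halo."""
    rf = receptive_field(kernels)
    out_len = len(x) - rf + 1
    out, peak = [], 0
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # Outputs [start, stop) depend only on inputs [start, stop + rf - 1).
        y, p = run_subnetwork(x[start:stop + rf - 1], kernels)
        out.extend(y)
        peak = max(peak, p)
    return out, peak


# Hypothetical weights and input, chosen only for illustration.
kernels = [[1, 2, 1], [1, 0, -1], [2, 1, 2, 1, 2]]
signal = [(7 * i) % 11 for i in range(64)]

full_out, full_peak = run_subnetwork(signal, kernels)
tiled_out, tiled_peak = run_tiled(signal, kernels, tile=8)
print(tiled_out == full_out, full_peak, tiled_peak)
```

In this toy setting the tiled pass reproduces the untiled output exactly, while the largest activation held at any time shrinks from 64 to 16 elements. The price is that neighbouring tiles re-read and recompute their overlapping halos, which is the memory versus computation trade-off the abstract refers to.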