By Songze Li, University of Southern California, USA, songzeli@usc.edu | Salman Avestimehr, University of Southern California, USA, avestimehr@ee.usc.edu
We introduce the concept of “coded computing”, a novel computing paradigm that utilizes coding theory to effectively inject and leverage data/computation redundancy to mitigate several fundamental bottlenecks in large-scale distributed computing, namely communication bandwidth, straggler’s (i.e., slow or failing nodes) delay, privacy and security bottlenecks. More specifically, for MapReduce based distributed computing structures, we propose the “Coded Distributed Computing” (CDC) scheme, which injects redundant computations across the network in a structured manner, such that in-network coding opportunities are enabled to substantially slash the communication load to shuffle the intermediate computation results. We prove that CDC achieves the optimal tradeoff between computation and communication, and demonstrate its impact on a wide range of distributed computing systems from cloud-based datacenters to mobile edge/fog computing platforms.
Secondly, to alleviate the straggler effect that prolongs the executions of distributed machine learning algorithms, we utilize the ideas from error correcting codes to develop “Polynomial Codes” for computing general matrix algebra, and “Lagrange Coded Computing” (LCC) for computing arbitrary multivariate polynomials. The core idea of these proposed schemes is to apply coding to create redundant data/computation scattered across the network, such that completing the overall computation task only requires a subset of the network nodes returning their local computation results. We demonstrate the optimality of Polynomial Codes and LCC in minimizing the computation latency, by proving that they require the least number of nodes to return their results.
Finally, we illustrate the role of coded computing in providing security and privacy in distributed computing and machine learning. In particular, we consider the problems of secure multiparty computing (MPC) and privacy-preserving machine learning, and demonstrate how coded computing can be leveraged to provide efficient solutions to these critical problems and enable substantial improvements over the state of the art.
To illustrate the impact of coded computing on real world applications and systems, we implement the proposed coding schemes on cloud-based distributed computing systems, and significantly improve the run-time performance of important benchmarks including distributed sorting, distributed training of regression models, and privacy-preserving training for image classification. Throughout this monograph, we also highlight numerous open problems and exciting research directions for future work on coded computing.
Recent years have witnessed a rapid growth of large-scale machine learning and big data analytics, facilitating the developments of data intensive applications like voice/image recognition, real-time mapping services, autonomous driving, social networks, and augmented/virtual reality. These applications are supported by cloud infrastructures composed of large datacenters.
The large scale distributed machine learning/data analytics systems provide the necessary processing power to handle these applications, but suffer three major performance bottlenecks; namely, communication, straggler and security.
In this ground-breaking monograph, the authors introduce the novel concept of Coded Computing. Coded Computing exploits coding theory to optimally inject and leverage data/task redundancy in distributed computing systems, creating coding opportunities to overcome the bottlenecks.
After introducing the reader to the core of the problem, the authors describe in detail each of the bottlenecks that can be overcome using Coded Computing. The monograph provides an accessible introduction into how this new technique can be used in developing large-scale computing systems.