
Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

By Songze Li, University of Southern California, USA, songzeli@usc.edu | Salman Avestimehr, University of Southern California, USA, avestimehr@ee.usc.edu

 
Suggested Citation
Songze Li and Salman Avestimehr (2020), "Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning", Foundations and Trends® in Communications and Information Theory: Vol. 17: No. 1, pp 1-148. http://dx.doi.org/10.1561/0100000103

Publication Date: 20 Aug 2020
© 2020 Songze Li and Salman Avestimehr
 
Subjects
Coding theory and practice,  Information theory and computer science,  Distributed computing,  Privacy-preserving systems,  Security architectures
 


Abstract

We introduce the concept of “coded computing”, a novel computing paradigm that utilizes coding theory to effectively inject and leverage data/computation redundancy to mitigate several fundamental bottlenecks in large-scale distributed computing, namely the communication bandwidth, straggler (i.e., slow or failing node) delay, privacy, and security bottlenecks. More specifically, for MapReduce-based distributed computing structures, we propose the “Coded Distributed Computing” (CDC) scheme, which injects redundant computations across the network in a structured manner, such that in-network coding opportunities are enabled to substantially slash the communication load of shuffling the intermediate computation results. We prove that CDC achieves the optimal tradeoff between computation and communication, and demonstrate its impact on a wide range of distributed computing systems, from cloud-based datacenters to mobile edge/fog computing platforms.
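
To make the coded shuffle concrete, here is a minimal sketch (our illustration, not the monograph's implementation) of the canonical three-node CDC example: each file is mapped at r = 2 nodes, so every node can multicast an XOR of intermediate-value halves that is simultaneously useful to both other nodes. The file placement, key assignment, and byte-level values below are illustrative assumptions.

```python
# Minimal sketch of the 3-node Coded Distributed Computing (CDC) shuffle.
# Nodes 1,2,3; files f1,f2,f3; node k reduces key k.
# Placement (computation load r = 2): node 1 maps {f1,f2},
# node 2 maps {f2,f3}, node 3 maps {f3,f1}.
# v[q, n] denotes the intermediate value for key q from file n.

import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Fake intermediate values (8-byte map outputs), split into halves.
v = {(q, n): os.urandom(8) for q in (1, 2, 3) for n in (1, 2, 3)}
half = lambda x, i: x[:4] if i == 0 else x[4:]

# Node k is missing exactly one value: the one for key k from the file
# it did not map. Both other nodes mapped that file, so each missing
# value is known to the two potential senders.
# a = v[1,3] (needed by node 1), b = v[2,1] (node 2), c = v[3,2] (node 3).
a, b, c = v[1, 3], v[2, 1], v[3, 2]

# Coded multicasts: each node XORs halves of the two values it holds
# that the other two nodes need. One transmission serves two receivers.
msg1 = xor(half(b, 0), half(c, 0))  # node 1 sends; nodes 2 and 3 decode
msg2 = xor(half(c, 1), half(a, 0))  # node 2 sends; nodes 3 and 1 decode
msg3 = xor(half(a, 1), half(b, 1))  # node 3 sends; nodes 1 and 2 decode

# Node 1 decodes its missing value a = v[1,3] using side information
# (it mapped f1 and f2, hence already knows b and c):
a_rec = xor(msg2, half(c, 1)) + xor(msg3, half(b, 1))
assert a_rec == a

# Shuffle load: 3 half-value multicasts = 1.5 values, versus 3 values
# for the uncoded shuffle -- a reduction by exactly the factor r = 2.
```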

Second, to alleviate the straggler effect that prolongs the execution of distributed machine learning algorithms, we utilize ideas from error-correcting codes to develop “Polynomial Codes” for computing general matrix algebra, and “Lagrange Coded Computing” (LCC) for computing arbitrary multivariate polynomials. The core idea of these schemes is to apply coding to create redundant data/computation scattered across the network, such that completing the overall computation task only requires a subset of the network nodes to return their local results. We demonstrate the optimality of Polynomial Codes and LCC in minimizing the computation latency, by proving that they require the least possible number of nodes to return their results.
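
As a toy instance of this recovery-threshold idea, the following sketch applies a polynomial code to matrix-vector multiplication (a simplification of the general matrix-algebra setting, with real arithmetic standing in for the finite-field arithmetic a production scheme would use). Any k of the n workers' results suffice to decode A·b, so up to n − k stragglers can simply be ignored.

```python
# Sketch: straggler-tolerant coded matrix-vector multiplication.
# Split A row-wise into k blocks A_0..A_{k-1} and encode worker i's
# share as the polynomial evaluation A~(x_i) = sum_j A_j * x_i**j.
# Worker i returns A~(x_i) @ b, an evaluation of the degree-(k-1)
# polynomial p(x) = sum_j (A_j @ b) * x**j, so ANY k returned results
# determine p and hence every block of A @ b.

import numpy as np

k, n = 3, 5                      # k needed out of n workers (2 stragglers)
A = np.random.randn(6, 4)        # 6 rows -> k blocks of 2 rows each
b = np.random.randn(4)
blocks = np.split(A, k)          # A_0, A_1, A_2

xs = np.arange(1.0, n + 1)       # distinct evaluation points x_1..x_n
shares = [sum(Aj * x**j for j, Aj in enumerate(blocks)) for x in xs]

# Each (non-straggling) worker computes its local product.
results = {i: shares[i] @ b for i in range(n)}

# Suppose workers 1 and 3 straggle; decode from any k survivors.
done = [0, 2, 4]
V = np.vander(xs[done], k, increasing=True)   # Vandermonde system in x
coeffs = np.linalg.solve(V, np.stack([results[i] for i in done]))
Ab = np.concatenate(coeffs)      # coeffs[j] recovers blocks[j] @ b

assert np.allclose(Ab, A @ b)
```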

Finally, we illustrate the role of coded computing in providing security and privacy in distributed computing and machine learning. In particular, we consider the problems of secure multiparty computation (MPC) and privacy-preserving machine learning, and demonstrate how coded computing can be leveraged to provide efficient solutions to these critical problems, enabling substantial improvements over the state of the art.
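
To give a flavor of the underlying machinery, below is a minimal sketch of Shamir secret sharing over a prime field, the classical MPC primitive on which Lagrange-coded approaches to privacy build; the field modulus and threshold parameters are illustrative choices, not values from the monograph.

```python
# Sketch: Shamir secret sharing over GF(p). A secret s is the constant
# term of a random degree-(t-1) polynomial; any t shares reconstruct s
# by Lagrange interpolation at x = 0, while any t-1 reveal nothing.

import random

P = 2**61 - 1  # a Mersenne prime, used here as an illustrative field size

def share(secret: int, t: int, n: int):
    """Split `secret` into n shares with reconstruction threshold t."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    evaluate = lambda x: sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P
    return [(x, evaluate(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return secret

s = 123456789
shares = share(s, t=3, n=5)
assert reconstruct(shares[:3]) == s   # any 3 shares suffice
assert reconstruct(shares[-3:]) == s
```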

To illustrate the impact of coded computing on real-world applications and systems, we implement the proposed coding schemes on cloud-based distributed computing systems, significantly improving the run-time performance of important benchmarks including distributed sorting, distributed training of regression models, and privacy-preserving training for image classification. Throughout this monograph, we also highlight numerous open problems and exciting research directions for future work on coded computing.

DOI: 10.1561/0100000103
ISBN: 978-1-68083-704-9 (paperback), 162 pp., $99.00
ISBN: 978-1-68083-705-6 (e-book, PDF), 162 pp., $140.00
Table of contents:
1. Introduction
2. Coding for Bandwidth Reduction
3. Coding for Straggler Mitigation
4. Coding for Security and Privacy
Acknowledgements
Appendices
A Proof of Lemma 3.6
References

Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

Recent years have witnessed a rapid growth of large-scale machine learning and big data analytics, enabling the development of data-intensive applications such as voice/image recognition, real-time mapping services, autonomous driving, social networks, and augmented/virtual reality. These applications are supported by cloud infrastructures composed of large datacenters.

Large-scale distributed machine learning and data analytics systems provide the processing power these applications require, but suffer from three major performance bottlenecks: communication, stragglers, and security.

In this ground-breaking monograph, the authors introduce the novel concept of Coded Computing, which exploits coding theory to optimally inject and leverage data/task redundancy in distributed computing systems, creating coding opportunities that overcome these bottlenecks.

After introducing the reader to the core of the problem, the authors describe in detail each of the bottlenecks that can be overcome using Coded Computing. The monograph provides an accessible introduction to how this new technique can be used in developing large-scale computing systems.

 