Foundations and Trends® in Theoretical Computer Science > Vol 2 > Issue 4

Algorithms and Data Structures for External Memory

By Jeffrey Scott Vitter, Department of Computer Science, Purdue University, USA, jsv@purdue.edu

 
Suggested Citation
Jeffrey Scott Vitter (2008), "Algorithms and Data Structures for External Memory", Foundations and Trends® in Theoretical Computer Science: Vol. 2: No. 4, pp. 305-474. http://dx.doi.org/10.1561/0400000014

Publication Date: 09 Jun 2008
© 2008 J. S. Vitter
 
Subjects
Design and analysis of algorithms
 

In this article:
Preface 
1 Introduction 
2 Parallel Disk Model (PDM) 
3 Fundamental I/O Operations and Bounds 
4 Exploiting Locality and Load Balancing 
5 External Sorting and Related Problems 
6 Lower Bounds on I/O 
7 Matrix and Grid Computations 
8 Batched Problems in Computational Geometry 
9 Batched Problems on Graphs 
10 External Hashing for Online Dictionary Search 
11 Multiway Tree Data Structures 
12 Spatial Data Structures and Range Search 
13 Dynamic and Kinetic Data Structures 
14 String Processing 
15 Compressed Data Structures 
16 Dynamic Memory Allocation 
17 External Memory Programming Environments 
Conclusions 
Notations and Acronyms 
References 

Abstract

Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this manuscript, we survey the state of the art in the design and analysis of algorithms and data structures for external memory (or EM for short), where the goal is to exploit locality and parallelism in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory.

For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices, geometric data, and graphs.
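As a rough illustration of the merging paradigm mentioned above, the following Python sketch sorts a stream of integers in two phases: it forms sorted runs that each fit in a bounded amount of memory, writes them to temporary files, and then merges the runs while streaming them back from disk. The function name `external_sort` and the parameter `run_size` are hypothetical names chosen for this sketch and do not come from the monograph. The sketch assumes a single disk and assumes that all runs can be merged in one pass; it does not model multiple disks, disk striping, or the I/O-optimal variants the survey discusses.

```python
import heapq
import os
import tempfile


def external_sort(items, run_size):
    """Yield the integers in `items` in sorted order.

    Illustrative sketch only (names are hypothetical):
    Phase 1 (run formation): hold at most `run_size` items in memory,
    sort them, and write each sorted run to its own temporary file.
    Phase 2 (merging): stream all runs from disk and merge them with a heap.
    """
    run_paths = []

    def write_run(run):
        # Sort one memory-sized chunk and store it as a text file, one item per line.
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{x}\n" for x in sorted(run))
        run_paths.append(path)

    run = []
    for x in items:
        run.append(x)
        if len(run) == run_size:
            write_run(run)
            run = []
    if run:
        write_run(run)

    files = [open(p) for p in run_paths]
    try:
        # Each stream reads its run lazily, so only one item per run
        # (plus file buffers) needs to be in memory during the merge.
        streams = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort([5, 3, 9, 1, 4, 8, 2], run_size=3))` forms three runs and merges them into one sorted list. With many runs, a practical implementation would limit the merge fan-in to what fits in memory and use several merge passes.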

In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide convenient means in online data structures to make effective use of the data accessed from disk. We also re-examine some of the above EM problems in slightly different settings: when the data items are moving, when the data items are variable-length (such as character strings), when the data structure is compressed to save space, or when the allocated amount of internal memory can change dynamically.

Programming tools and environments are available for simplifying the EM programming task. We report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than other methods used in practice.

DOI:10.1561/0400000014
Paperback ISBN: 978-1-60198-106-6, 180 pp., $99.00
E-book (.pdf) ISBN: 978-1-60198-107-3, 180 pp., $125.00

Algorithms and Data Structures for External Memory

Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. Algorithms and Data Structures for External Memory surveys the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. It considers a variety of EM paradigms for solving batched and online problems efficiently in external memory, and it describes several useful paradigms for the design and implementation of efficient EM algorithms and data structures. The problem domains considered include sorting, permuting, FFT, scientific computing, computational geometry, graphs, databases, geographic information systems, and text and string processing. The book is an invaluable reference for anybody interested in, or conducting research in, the design, analysis, and implementation of algorithms and data structures.

 
TCS-014