1 Introduction
1 Introduction
One of the fundamental breakthroughs of Codd’s relational data model [33] was the identification of how to take a declarative, logic-based formulation of a query and convert it into an algebraic query evaluation tree. As described in every database
textbook, this enabled physical data independence and promised many benefits: the database administrator and the DBMS optimizer
became free to choose among many different storage formats and execution plans to answer a declarative query. The challenge,
since then, has been how to deliver on these promises -- regardless of where or how the data is laid out, how complex the
query is, and how unpredictable the operating environment is.
This challenge has spurred 30 years of query processing research. Cost-based query optimization, pioneered by Selinger et al. [102] in System R and refined by generations of database researchers and developers, has been tremendously effective in addressing
the needs of relational DBMS query processing: one can get excellent performance for queries over data with few correlations,
executed in a relatively stable environment, given sufficient statistical information.
However, when even one of these characteristics is not present, the System R-style optimize-then-execute model begins to break
down: as noted in [69], optimizer error begins to build up at a rate exponential in the size of the query. As the database field has broadened to
consider more general data management, including querying autonomous remote data sources, supporting continuous queries over data streams, encoding and retrieving
XML data, supporting OLAP and data mining operations, and combining text search with structured query capabilities, the weaknesses
of the traditional optimization model have begun to show themselves.
In response, there has been a surge of interest in a broad array of techniques termed adaptive query processing (AQP). AQP addresses the problems of missing statistics, unexpected correlations, unpredictable costs, and dynamic data by
using feedback to tune execution. It is one of the cornerstones of so-called autonomic database management systems, although it also generalizes to many other contexts, particularly at the intersection of database
query processing and the Web.
The spectrum of adaptive query processing techniques has been quite broad: they may span multiple query executions or adapt
within the execution of a single query; they may affect the query plan being executed or the scheduling of operations within
the plan; they have been developed for improving performance of local DBMS queries (e.g., [75, 87, 112]), for processing distributed and streaming data (e.g., [6, 72, 88, 92, 101]), and for performing distributed query execution (e.g., [115]).
This survey is an attempt to cover the fundamental issues, techniques, costs, and benefits of adaptive query processing. We
begin with a broad overview of the field, identifying the dimensions of adaptive techniques. Then we focus our analysis on
the spectrum of approaches available to adapt query execution at runtime -- primarily in a non-streaming context. Where possible,
we focus on simplifying and abstracting the key concepts of each technique, rather than reproducing the full details available
in the papers; we consider generalizations of the specific published implementations. Our goal is to identify the strengths
and limitations of the different techniques, demonstrate when they are most useful, and suggest possible avenues of future
research.
In the rest of the section, we present a brief overview of query processing in relational database systems (Section 1.1) and elaborate on the reasons behind the push toward adaptivity (Section 1.2); we then present a road map for the rest of the survey (Section 1.3), and briefly discuss the related surveys of interest (Section 1.4).