[A version of this post appears on the O’Reilly Strata blog.]
Interactive query analysis for (Hadoop scale data) has recently attracted the attention of many companies and open source developers – some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing, and a combination of other techniques including data co-partitioning, caching (into main memory), runtime code generation, and columnar storage.
One approach that hasn’t been exploited as much is sampling. By this I mean employing samples to generate approximate answers, and speed up execution. Database researchers have written papers on approximate answers, but few working (downloadable) systems are actually built on this approach.
Approximate query engine from U.C. Berkeley’s Amplab
An interesting, open source database released yesterday0 uses sampling to scale to big data. BlinkDB is a massively-parallel, approximate query system from UC Berkeley’s Amplab. It uses a series of data samples to generate approximate answers. Users compose queries by specifying either error bounds or time constraints, BlinkDB uses sufficiently large random samples to produce answers. Because random samples are stored in memory1, BlinkDB is able to provide interactive response times:
Decision making and uncertainty
BlinkDB uses sampling to provide fast query response times. A side-effect is that query results are accompanied by error bars2. While it’s not the central goal of BlinkDB, approximate query systems change (in a good way) how decision makers think of business intelligence. Since answers in BlinkDB come with error bars, focus shifts from obsessing over the “one true answer”, to recognizing that variability and noise are present in data sets3. This requires a certain amount of reorientation as error representation lead to more nuanced discussions (imagine a scenario4 where an error bar spans a region between “fine” and “disaster”).
Sampling as an alternative
As data sets continue to grow, other interactive query systems may start adopting the sampling approach central to BlinkDB. Other areas of data science routinely use sampling: stats students learn early on the difference between a “statistic” & a “parameter”. Within machine-learning, a popular approach to scaling up algorithms is sampling5. When prototyping a data processing pipeline, I sometimes use small random samples to get a sense6 of the data. Sampling is also starting to appear in commercial big data systems. Users of Datameer build their analysis workflows using samples, and when they’re ready, run their MapReduce jobs against massive data sets.
- Tracking the progress of large-scale Query Engines
- Tightly integrated engines streamline Big Data analysis
- Single server systems can tackle big data
(0) The current release is developer alpha 0.1 – the BlinkDB team will be on a fast release track for the next few of months.
(1) While entire data sets are often too large, samples often fit in memory.
(2) BlinkDB uses statistical closed forms (to estimate values such as AVG, COUNT, SUM, VARIANCE, STDEV) as well as bootstrapping to estimate errors. The current alpha release will only support closed forms and the next release (in early Sept.) will have bootstrap integration. For bootstrapping, BlinkDB will use a recently developed test (“automatic bootstrap test”) to evaluate the quality of bootstrap estimates. In cases when the “automatic bootstrap test” indicates the bootstrap may not produce accurate estimates, BlinkDB may use a different error approximation method, suggest a modification to the query, or ask the user to run the query against the entire data set.
(3) Note that BlinkDB regards the entire database as the truth (true distribution). But by introducing error bars into BI reports, it at least introduces the notion of variability and noise.
(4) In such a scenario, one can imagine a decision maker calling for further investigation to tighten up or further understand the error bars.
(5) One example is to “sample dimensions” in nearest neighbors.
(6) But as Matei Zaharia pointed out to me recently, when working with massive data sets, there aren’t good tools that let you quickly check whether your data wrangling step produced exactly what you wanted.