[A version of this post appears on the O’Reilly Strata blog.]
Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL (Impala), analytics (Cloudera ML + a partnership with SAS), and as of early this week, real-time search. The economics that led to Hadoop dominating batch processing are permeating other types of analytics.
Another collection of open source, Hadoop-compatible analytic engines, the Berkeley Data Analytics Stack (BDAS), is being built just across the San Francisco Bay. Starting with a batch-processing framework that’s faster than MapReduce (Spark), it now includes interactive SQL (Shark) and real-time analytics (Spark Streaming). Sometime this summer, frameworks for machine learning (MLbase) and graph analytics (GraphX) will be released. A cluster manager (Mesos) and an in-memory file system (Tachyon) allow users of other analytic frameworks to leverage the BDAS platform. (The Python data community is looking at Tachyon closely.)
Next up: Applications
Many developers aren’t familiar with the intricacies of deploying, managing (1), and tuning distributed systems. The good news is that as the infrastructure gets simpler, companies can start focusing on building interesting applications. I’m starting to hear of many more researchers and startups building interesting solutions on top of one of these integrated platforms (BDAS, Cloudera, and other Hadoop distributions).
One can create Big Data applications by cobbling together different (“best-of-breed”) systems, but it’s usually (2) much easier to use engines built on top of the same platform. There’s a tradeoff: in many cases it’s hard (3) to beat highly optimized/targeted solutions. It’s easier to use an integrated stack, but you likely have to sacrifice a little bit of performance in exchange. I suspect that in many use cases, the performance of integrated platforms will be “good enough”, and convenience will trump performance. Over time, analytic engines built on top of BDAS and Hadoop will improve and the performance gap will narrow further.
A good place to learn more about interesting Big Data applications (and how they’re built) is the combined Hadoop World + Strata conference in NYC this October.
(1) The Hadoop community is doing a good job on this front with Ambari, Cloudera Manager and other tools.
(2) If you’re willing to use their services, cloud platforms like Infochimps, Amazon, Google, and Microsoft are starting to make it easier to assemble different systems. In addition, there are companies like Datastax that integrate different systems in their offerings.
(3) Recent examples of this performance vs. convenience tradeoff: interactive (Hadoop) query engines vs. MPP databases, and GraphX vs. GraphLab.