[A version of this post appears on the O’Reilly Data blog and Forbes.]
As someone who tends to mix and match many different tools, I count not having to configure and assemble a suite of tools as a big win, so I’m really liking the recent trend towards more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data Hub to include Spark (1) and Spark Streaming. Users benefit by gaining automatic access to the analytic engines that come with Spark (2). Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.
Another recent example is Dendrite (3) – an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front-end that leverages AngularJS into a graph exploration and analysis tool for business analysts.
Users of Spark explore Spark Streaming because similar code for batch (Spark) can, with minor modification, be used for realtime (Spark Streaming) computations. Along these lines, Summingbird – an open source library from Twitter – offers something similar for Hadoop MapReduce and Storm. With Summingbird, programs that look like Scala collection transformations can be executed in batch (Scalding) or realtime (Storm).
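To make the batch/realtime reuse idea concrete, here is a minimal sketch in plain Python. It does not use the Spark or Summingbird APIs; the function names (count_words, run_batch, run_streaming) are hypothetical and chosen only for illustration. The point it tries to show is that the same transformation can be applied to a whole dataset or to successive micro-batches, provided partial results can be merged associatively.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Shared logic: count words in any iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines: List[str]) -> Counter:
    """Batch mode: process the full dataset in one pass."""
    return count_words(lines)


def run_streaming(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Streaming mode: apply the same logic to each micro-batch and merge.

    Counter addition is associative, so partial counts can be combined
    as they arrive and still match the batch answer.
    """
    running: Counter = Counter()
    for batch in micro_batches:
        running = running + count_words(batch)
        yield running  # snapshot of the totals seen so far


if __name__ == "__main__":
    data = ["spark streaming", "spark batch", "storm streaming"]
    batch_result = run_batch(data)
    # Feed the same data as two micro-batches and keep the last snapshot.
    *_, stream_result = run_streaming([data[:1], data[1:]])
    print(batch_result == stream_result)
```

Only the driver differs between the two modes; count_words is untouched. Real systems add concerns this sketch ignores, such as fault tolerance, windowing, and distributed execution.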
In some instances the underlying techniques from one set of tools make their way into others. The DeepDive team at Stanford recently revamped their information extraction and natural language understanding system. But techniques used in DeepDive have already found their way into many other systems, including MADlib, Cloudera Impala, “a product from Oracle”, and Google Brain.
Related content:
- Upcoming Strata Santa Clara keynotes by Amr Awadallah (of Cloudera) and Matei Zaharia (of Databricks)
- Spark, Spark Streaming and other components of the Berkeley Data Analytics stack will be featured in two Strata Santa Clara tutorials: talks (morning) and training (afternoon)
- Carlos Guestrin, founder and CEO of GraphLab, will lead a Strata Santa Clara tutorial
- DeepDive came out of Chris Re’s research group (now at Stanford): Chris will give a talk at Strata Santa Clara
- Summingbird co-creator Oscar Boykin will talk about Algebird at the upcoming Hardcore Data Science day.
(1) Full disclosure: I am an advisor to Databricks – a startup commercializing Spark.
(2) Some potential applications of Spark and Spark Streaming include stream processing and mining, interactive and iterative computing, machine learning, and graph analytics.
(3) Hat tip to Danny Bickson.