Data Science Tools: Fast, easy to use, and scalable

[A version of this post appears on the O’Reilly Strata blog.]

Here are a few observations based on conversations I had during the just-concluded Strata Santa Clara conference.

Spark is attracting attention
I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.

SQL is alive and well
Impala’s well-received launch at Strata NYC last fall confirmed the strong interest in interactive analytics (ad hoc query-and-response) within the Hadoop ecosystem. The list of solutions for querying big data (stored in HDFS) continues to grow, with CitusDB and Pivotal HD joining Impala, Shark, Hadapt, and cloud-based alternatives BigQuery, Redshift, and Qubole. I use Shark and have been impressed by its speed and ease of use. (I’ve heard similar things about Impala’s speed recently.)

Business Intelligence reboot (again)
QlikTech and Tableau had combined 2012 revenues of more than $450M. Their products are easy-to-use analysis tools that let users visually explore data and share charts and dashboards. Both use in-memory technologies to speed up query response and visualization rendering times. Both run only on Microsoft Windows.

Startups that draw inspiration from these two successful companies are targeting much larger data sets – in the case of Datameer, Platfora, and Karmasphere, massive data sets stored in HDFS. Platfora has been generating buzz with its fast in-memory, columnar data store, custom HTML5 visualization package, and emphasis on tools that let users interact with massive data. Datameer continues to quietly rack up sales – it closed 2012 with more than $10M in revenues. Strata Startup Showcase (audience choice) winner SiSense offers a hardware-optimized business analytics platform that delivers fast processing times by efficiently utilizing disk, RAM, and CPU.

Scalable machine-learning and analytics are going to get simpler (1)
H2O is a new, open source machine-learning platform from 0xdata. It can use data stored in HDFS or flat files and comes with a few distributed algorithms (random forests, GLM, and a few others). H2O also has tools for rudimentary exploratory data analysis and wrangling. Users can navigate the system using a web browser or a command-line interface. Just like Revolution Analytics’ ScaleR, users can interact with H2O using R code (limited to the subset of models and algorithms available). H2O is also available via REST/JSON interfaces.

What I found intriguing (2) was SkyTree’s acquisition of AdviseAnalytics – a desktop software product designed to make statistical data analysis accessible. (AdviseAnalytics was founded by Leland Wilkinson, creator of the popular Systat software package and author of the Grammar of Graphics.) The system, now called SkyTree Adviser, provides a GUI that emphasizes tasks (cluster, classify, compare, etc.) over algorithms. In addition, it produces results that include short explanations of the underlying statistical methods (power users can opt for concise results similar to those produced by standard statistical packages). Finally, SkyTree Adviser users benefit from the vast number of algorithms available – the system uses ensembles or finds optimal algorithms. (The MLbase optimizer will perform the same type of automatic “model selection” for distributed algorithms.)

SkyTree now offers users an easy-to-use tool for analytic explorations over medium-sized data sets (SkyTree Adviser), and a server product for building and deploying algorithms against massive amounts of data. Throw in MLbase and Hazy, and I can see the emergence of several large-scale machine-learning tools (3) for non-technical users.
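
To make the idea of automatic “model selection” a bit more concrete, here is a minimal Python sketch. It is not how SkyTree Adviser or the MLbase optimizer work internally (I have no details on either); it just assumes three toy candidate models and picks the one with the lowest cross-validated error. The function names and the synthetic data are made up for illustration.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the average of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares for a single input variable."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbor: predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical candidate pool; a real system would have many more entries.
CANDIDATES = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}


def cv_error(fit, xs, ys, folds=5):
    """Mean squared error over k interleaved cross-validation folds."""
    errors = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test)
    return mean(errors)


def select_model(xs, ys, candidates=CANDIDATES):
    """Score every candidate and return the best name plus all scores."""
    scores = {name: cv_error(fit, xs, ys) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Synthetic, roughly linear data (seeded so runs are repeatable).
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]

best, scores = select_model(xs, ys)
print("selected:", best)
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"  {name}: {score:.4f}")
```

The user only supplies data and asks for a prediction task; the choice of algorithm is made by the scoring loop. That is the task-over-algorithm emphasis described above, in miniature.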

Reproducibility of Data Science Workflows

Data scientists tend to use many tools and the frequent context-switching is a drag on their productivity. An important side-effect is that it’s often challenging to document and reproduce analysis projects that involve many steps and tools.

Data scientists who rely on the Python data stack (NumPy, SciPy, pandas, NLTK, etc.) should check out Wakari from Continuum Analytics. It’s a cloud-based service that takes care of many details, including data management and package and version management, while insulating the user from the intricacies of Amazon Web Services.

Loom is a just-released data management system that initially targets users of Hadoop (and R). By letting users track lineage and data provenance, Loom makes it easier to recreate multi-step data analysis projects.

Next up: See you at PyData Silicon Valley, March 19-20.

(1) In previous posts I detailed why I like GraphChi/GraphLab and why I’m excited about MLbase. Two other open source projects are worth highlighting: Mahout has many more algorithms, but VW generates more enthusiastic endorsements from users I’ve spoken with. However, the sparse documentation and the many command-line options make it tough to get going in VW. (A forthcoming O’Reilly book should make VW more accessible.) For users who want to roll their own, I’ve written a few simple, distributed machine-learning algorithms in Spark, and found it quite fast for batch training and scoring.
(2) Update (3/18/2013): I removed this from the original version of this post, and re-inserted it following the official launch of SkyTree Adviser.
(3) BI tools like Datameer already come with simple analytic functions available through a GUI.
