[A version of this post appears on the O’Reilly Strata blog.]
A new set of tools makes it easier to do a variety of data analysis tasks. Some require no programming, while others make it easier to combine code, visuals, and text in the same workflow. They enable users who aren’t statisticians or data geeks to do data analysis. While most of the focus is on enabling the application of analytics to data sets, some tools also help users with the often tricky task of interpreting results. In the process, users are able to discern patterns and evaluate the value of data sources by themselves, and only call upon expert data analysts (1) when faced with non-routine problems.
Visual Analysis and Simple Statistics
Three SaaS startups – DataHero, DataCracker, Statwing – make it easy to perform simple data wrangling, visual analysis, and statistical analysis. All three (particularly DataCracker) appeal to users who analyze consumer surveys. Statwing and DataHero simplify the creation of Pivot Tables (2) and suggest charts (3) that work well with your data. Statwing users are also able to execute a few standard statistical tests and view the results in plain English (detailed statistical outputs are also available).
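For readers who haven't built one by hand, here is a minimal sketch of what a pivot table computes. It is not how Statwing or DataHero work internally: the survey rows, the column meanings, and the `pivot_mean` helper are all made up for illustration, and only the Python standard library is used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey responses: (region, age group, satisfaction score 1-5).
responses = [
    ("East", "18-34", 4),
    ("East", "35+", 3),
    ("West", "18-34", 5),
    ("West", "35+", 2),
    ("East", "18-34", 2),
    ("West", "35+", 4),
]

def pivot_mean(rows):
    """Group values by (row key, column key) and average each cell."""
    cells = defaultdict(list)
    for row_key, col_key, value in rows:
        cells[(row_key, col_key)].append(value)
    return {key: mean(values) for key, values in cells.items()}

table = pivot_mean(responses)
row_keys = sorted({r for r, _ in table})
col_keys = sorted({c for _, c in table})

# Print the pivot table: one row per region, one column per age group.
print("region".ljust(8) + "".join(c.rjust(8) for c in col_keys))
for r in row_keys:
    print(r.ljust(8) + "".join(f"{table[(r, c)]:8.1f}" for c in col_keys))
```

The tools above wrap this kind of grouping and averaging in a point-and-click interface, which is most of the appeal for survey analysts.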
Statistics and Machine-learning
BigML and Datameer’s Smart Analytics are examples of recent tools that make it easy for business users to apply machine-learning algorithms to data sets (massive data sets, in the case of Datameer). It makes sense to offload routine data analysis tasks to business analysts, and I expect other vendors such as Platfora and ClearStory to provide similar capabilities in the near future.
In an earlier post I described Skytree Adviser, a tool that lets users apply statistics and machine-learning techniques to medium-sized data sets. It provides a GUI that emphasizes tasks (cluster, classify, compare, etc.) over algorithms, and produces results that include short explanations of the underlying statistical methods (power users can opt for concise results similar to those produced by standard statistical packages). Users also benefit from not having to choose optimal algorithms (Skytree Adviser automatically uses ensembles or finds optimal algorithms). As MLbase matures, it will include a declarative language (4) that will shield users from having to select and code specific algorithms. Once the declarative language is hidden behind a UI, it should feel similar to Skytree Adviser. Furthermore, MLbase implements distributed algorithms, so it scales to much larger data sets (terabytes) than Skytree Adviser.
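To make the "tasks over algorithms" idea concrete, here is a toy sketch in which the user asks for a predictor and the code picks the algorithm. It does not reflect how Skytree Adviser or MLbase actually choose models: the two candidate models, the leave-one-out scoring, and every function name (`predict_task`, `fit_mean`, `fit_line`, `loo_error`) are my own assumptions, and it selects a single model rather than building an ensemble.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the average of the training targets."""
    avg = mean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Ordinary least-squares line through the training points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

# Hypothetical candidate pool; a real tool would search far more options.
CANDIDATES = {"mean baseline": fit_mean, "linear fit": fit_line}

def loo_error(fit, xs, ys):
    """Leave-one-out mean squared error for one candidate."""
    errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append((model(xs[i]) - ys[i]) ** 2)
    return mean(errors)

def predict_task(xs, ys):
    """Task-level entry point: the caller asks for a predictor, not an algorithm."""
    scores = {name: loo_error(fit, xs, ys) for name, fit in CANDIDATES.items()}
    best = min(scores, key=scores.get)
    return best, CANDIDATES[best](xs, ys)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]
name, model = predict_task(xs, ys)
print(f"Selected: {name}; prediction at x=7 is {model(7.0):.1f}")
```

The caller never names an algorithm, which is the property that matters here. A declarative language or a GUI is just a friendlier front end to the same kind of search.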
Several commercial databases offer in-database analytics – native (possibly distributed) analytic functions that let users perform computations (via SQL) without having to move data to another tool. Along those lines, MADlib is an open source library of scalable analytic functions, currently deployable on Postgres and Greenplum. MADlib includes functions for doing clustering, topic modeling, statistics, and many other tasks.
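The in-database pattern can be sketched with Python's built-in `sqlite3` module standing in for a real analytic database. This is not MADlib and does not use any MADlib function; the `orders` table, its columns, and the values are hypothetical. The point is only that the computation is expressed in SQL and runs where the data lives.

```python
import sqlite3

# An in-memory SQLite database stands in for Postgres or Greenplum.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 120.0), ("East", 80.0), ("West", 200.0), ("West", 100.0), ("West", 60.0)],
)

# The aggregation runs inside the database engine; only summary rows come back.
summary = conn.execute(
    """
    SELECT region, COUNT(*) AS n, AVG(amount) AS avg_amount, MAX(amount) AS max_amount
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for region, n, avg_amount, max_amount in summary:
    print(f"{region}: n={n}, avg={avg_amount:.1f}, max={max_amount:.1f}")

conn.close()
```

With a library like MADlib the SQL would call richer analytic functions (clustering, topic modeling, and so on), but the shape of the workflow is the same: send the query to the data rather than pulling the data out to another tool.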
Notebooks: Unifying code, text, and visuals
Tools have also gotten better for users who don’t mind doing some coding. IPython notebooks are popular among data scientists who use the Python programming language. By letting you intermingle code, text, and graphics, IPython is a great way to conduct and document data analysis projects. In addition, pydata (“python data”) enthusiasts have access to many open source data science tools, including scikit-learn (for machine-learning) and StatsModels (for statistics). Both are well documented (scikit-learn has documentation that other open source projects would envy), making it super easy for users to apply advanced analytic techniques to data sets.
IPython technology isn’t tied to Python, and other frameworks are beginning to leverage this popular interface (there are early efforts from the GraphLab, Spark, and R communities). GraphLab now has a startup focused on further improving its usability; IPython integration and a Python API are the first of many features designed to make it accessible to a broader user base.
One language that integrates tightly with IPython is Julia – a high-level, high-performance dynamic programming language for technical computing. IJulia is backed by a full IPython kernel that lets you interact with Julia and build graphical notebooks. In addition Julia now has many libraries for doing simple to advanced data analysis (to name a few: GLM, Distributions, Optim, GARCH). In particular, Julia boasts over 200 packages, a package manager, active mailing lists, and great tools for working with data (e.g., DataFrames and read/writedlm). IJulia should help this high-performance programming language reach an even wider audience.
Related posts:
- Statwing simplifies data analysis
- MLbase: Scalable machine-learning made accessible
- Improving options for unlocking your graph data
- 11 Essential Features that Visual Analysis Tools Should Have
(1) Many routine data analysis tasks will soon be performed by business analysts, using tools that require little to no programming. I’ve recently noticed that the term data scientist is being increasingly used to refer to folks who specialize in analysis (machine-learning or statistics). With the advent of easy-to-use analysis tools, a data scientist will hopefully once again mean someone who possesses skills that cut across several domains.
(2) Microsoft PowerPivot allows users to work with large data sets (billions of rows), but as far as I can tell, mostly retains the Excel UI.
(3) Users often work with data sets with many variables so “suggesting a few charts” is something that many more visual analysis tools should start doing (DataHero highlights this capability). Yet another feature I wish more visual analysis tools would provide: novice users would benefit from having brief descriptions of charts they’re viewing. This idea comes from playing around with BrailleR.
(4) The initial version of their declarative language (MQL) and optimizer are slated for release this winter.