[A version of this post appears on the O’Reilly Strata blog.]
An integrated data stack boosts productivity
As I noted in my previous post, Python programmers willing to go “all in” have Python tools that cover most of data science. Lest I be accused of oversimplification: a Python programmer still needs to commit to learning a non-trivial set of tools (1). I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. Being able to stick with the same programming language and environment is a definite productivity boost: it takes less “setup time” to explore data using different techniques (visualization, statistics, machine learning).
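To make the “less setup time” point concrete, here is a minimal sketch of one Python session that moves from wrangling to statistics to machine learning without leaving the environment. The data is synthetic and the column names (`x`, `y`, `y_centered`) are made up for illustration; it assumes NumPy, Pandas, SciPy, and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data, invented purely for illustration: y is roughly 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)
df = pd.DataFrame({"x": x, "y": y})

# Wrangling (Pandas): filter rows and derive a new column.
subset = df[df["x"] > 1.0].assign(y_centered=lambda d: d["y"] - d["y"].mean())

# Statistics (SciPy): Pearson correlation between the two columns.
r, p_value = stats.pearsonr(subset["x"], subset["y"])

# Machine learning (scikit-learn): fit a linear model on the same DataFrame.
model = LinearRegression().fit(subset[["x"]], subset["y"])
slope = float(model.coef_[0])
intercept = float(model.intercept_)

print(f"rows={len(subset)} r={r:.3f} slope={slope:.2f} intercept={intercept:.2f}")
```

The same `subset` object feeds every step, so there is no export or format conversion between tools. A plot could be added in the same session with matplotlib.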
Multiple tools and languages can impede reproducibility and flow
On the other end of the spectrum are data scientists who mix and match tools, and use packages and frameworks from several languages. Depending on the task, data scientists can avail themselves of tools that are scalable, performant, require less code (2), and contain a lot of features. On the other hand, this approach requires a lot more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce (3) analysis projects, and impedes knowledge transfer (4) within a team of data scientists. Frequent context-switching also makes it more difficult to be in a state of flow, as one has to think about implementation and package details instead of exploring data. It can be harder to discover interesting stories in your data if you’re constantly having to think about what you’re doing. (It’s still possible, you just have to concentrate a bit harder.)
Some tools that cover a range of data science tasks
More tools that integrate different data science tasks are starting to appear. SAS has long provided tools for data management and wrangling, business intelligence, visualization, statistics, and machine learning. For massive (5) data sets, a new alternative to SAS is ScaleR from Revolution Analytics. Within ScaleR, programmers use R for data wrangling (rxDataStep), data visualization (basic viz functions for big data), and statistical analysis (it comes with a variety of scalable statistical algorithms).
Startup Alpine Data Labs lets users connect to a variety of data sources, manage their data science workflows, and access a limited set of advanced algorithms. Upstart BI vendors Datameer and Platfora provide data wrangling and visualization tools. Datameer also provides easy data integration with a variety of structured and unstructured data sources, analytic functions, and PMML to execute predictive analytics. The release of MLbase this summer adds machine learning to the BDAS/Spark stack, which currently covers data processing, interactive (SQL) analysis, and streaming analysis.
What does your data science toolkit look like? Do you mainly use one stack or do you tend to “mix and match”?
(1) This usually includes matplotlib or Bokeh, scikit-learn, Pandas, SciPy, and NumPy. And because Python is a general-purpose language, you can even use it for data acquisition (e.g. web crawlers or web services).
(2) An example would be using R for viz or stats.
(3) This pertains to all data scientists, but is particularly important to those among us who use a wide variety of tools. Unless you document things properly, even very recent analysis projects can be hard to redo when you’re using many different tools.
(4) Regardless of the tools you use, everything starts with knowing something about the lineage and provenance of your data set – something Loom attempts to address.
(5) A quick and fun tool for exploring smaller data sets is the just-released SkyTree Adviser. After users perform data processing and wrangling in another tool, SkyTree Adviser exposes machine learning, statistics, and statistical graphics through an interface that is accessible to business analysts.