Data Science tools: Are you “all in” or do you “mix and match”?

[A version of this post appears on the O’Reilly Strata blog.]

An integrated data stack boosts productivity
As I noted in my previous post, Python programmers willing to go “all in” have Python tools that cover most of data science. Lest I be accused of oversimplification, a Python programmer still needs to commit to learning a non-trivial set of tools. I suspect that once they invest the time to learn the Python data stack, they tend to stick with it unless they absolutely have to use something else. But being able to stick with the same programming language and environment is a definite productivity boost: it requires less “setup time” to explore data using different techniques (viz, stats, ML).
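
To make the point concrete, here is a minimal sketch of an “all in” session, with pandas for wrangling, matplotlib for plotting, and scikit-learn for modeling; the file name and column names are hypothetical:

```python
# A minimal sketch of an integrated Python workflow: wrangling, stats, viz,
# and ML in one session. "sales.csv", "ad_spend", and "revenue" are
# hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales.csv")                         # wrangle: load the data
print(df.describe())                                  # stats: quick summary

df.plot(x="ad_spend", y="revenue", kind="scatter")    # viz: scatter plot
plt.savefig("ad_spend_vs_revenue.png")

model = LinearRegression()                            # ML: fit a simple model
model.fit(df[["ad_spend"]], df["revenue"])
print(model.coef_, model.intercept_)
```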

Multiple tools and languages can impede reproducibility and flow
On the other end of the spectrum are data scientists who mix and match tools, and use packages and frameworks from several languages. Depending on the task, data scientists can avail themselves of tools that are scalable, performant, require less code, and contain a lot of features. On the other hand, this approach requires a lot more context-switching, and extra effort is needed to annotate long workflows. Failure to document things properly makes it tough to reproduce analysis projects, and impedes knowledge transfer within a team of data scientists. Frequent context-switching also makes it more difficult to be in a state of flow, as one has to think about implementation/package details instead of exploring data. It can be harder to discover interesting stories with your data if you’re constantly having to think about what you’re doing. (It’s still possible, you just have to concentrate a bit harder.)
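
For contrast, here is a hedged sketch of a single step in a “mix and match” workflow, where Python shells out to a hypothetical R script; the script and file names are placeholders, and the comments are exactly the kind of annotation such hand-offs require:

```python
# One step in a mixed-language workflow: Python hands a dataset off to an
# R script for modeling and reads the results back. "fit_model.R",
# "features.csv", and "results.csv" are hypothetical placeholders.
import subprocess

# Without a note like this, the next analyst has no way of knowing that the
# coefficients in results.csv were produced by R, or by which script.
subprocess.check_call(["Rscript", "fit_model.R", "features.csv", "results.csv"])

with open("results.csv") as f:    # back in Python for the next stage
    print(f.read())
```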

Continue reading

Python data tools just keep getting better

[A version of this post appeared on the O’Reilly Strata blog.]

Here are a few observations inspired by conversations I had during the just-concluded PyData conference.

The Python data community is well-organized:
Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data analytics projects. The supported projects are currently Python-based, but in principle NumFOCUS is an entity that can be used to support related efforts from other communities.

It’s getting easier to use the Python data stack:
There are tools that facilitate the dissemination and sharing of code and programming environments. IPython notebooks allow Python code and markup in the same document; they are used to record and share complex workflows and are used heavily for (conference) tutorials. As the data stack grows, one of the major pain points is getting all the packages to work properly together (version compatibility is a common issue); in particular, setting up environments where all the pieces work together can be a pain. There are now a few solutions that address this issue: Anaconda and the cloud-based Wakari from Continuum Analytics, and the cloud computing platform PiCloud.
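
One low-tech complement to these tools, sketched below, is to record the exact versions of the stack at the top of a notebook so it can be rerun in a compatible environment later; the particular packages listed are just an assumed baseline:

```python
# Print the versions of the core data stack so a notebook or script can be
# reproduced later. The package list is an assumption; adjust it to whatever
# your workflow actually imports.
import sys
import numpy, scipy, pandas, matplotlib, IPython

for name, mod in [("numpy", numpy), ("scipy", scipy), ("pandas", pandas),
                  ("matplotlib", matplotlib), ("IPython", IPython)]:
    print("%-12s %s" % (name, mod.__version__))
print("%-12s %s" % ("python", sys.version.split()[0]))
```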

There are many more visualization tools to choose from:
The 2D plotting library matplotlib is the first tool enthusiasts turn to, but as I learned at the conference, there are a number of other options available. Continuum Analytics recently introduced the companion packages Bokeh and Bokeh.js, which simplify the creation of static and interactive visualizations using Python. In particular, Bokeh is the equivalent of ggplot (it even has an interface that mimics ggplot). With NodeBox, programmers use Python code to create sketches and interactive visualizations similar to those produced by Processing.
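
As a baseline, here is the kind of static 2D figure most people start with in matplotlib (the data is made up for illustration); Bokeh and NodeBox aim at the interactive and sketch-oriented cases this sort of plot doesn’t cover:

```python
# A minimal static 2D plot with matplotlib, typically the first stop before
# reaching for Bokeh (interactive, ggplot-like) or NodeBox (Processing-style
# sketches). The data below is made up.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), label="cos(x)")
plt.xlabel("x")
plt.ylabel("value")
plt.legend()
plt.savefig("static_plot.png")
```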

Continue reading

No single DBMS will meet all your needs

Only a few years ago, many companies that I encountered used MySQL (or Postgres) for everything! Folks got things to work, but had problems running simple queries against their big data sets. Shortly after that, a new generation of MPP database startups came along (Greenplum, Aster Data, Netezza), then a flurry of NoSQL databases, and Hadoop emerged. Nowadays companies have a variety of systems optimized for their different workloads.

Christmas 2004 seems to have marked the turning point for Amazon. A crisis during the critical holiday season led to the creation of Dynamo – a system that went on to influence other NoSQL databases like Riak and Voldemort, and later Amazon’s DynamoDB service.

We now believe that when it comes to selecting a database, no single database technology – not even one as widely used and popular as a relational database like Oracle, Microsoft SQL Server or MySQL – will meet all database needs. A combination of NoSQL and relational database may better service the needs of a complex application. Today, DynamoDB has become very widely used within Amazon and is used every place where we don’t need the power and flexibility of relational databases like Oracle or MySQL. As a result, we have seen enormous cost savings, on the order of 50% to 90%, while achieving higher availability and scalability as our internal teams have moved many of their workloads onto DynamoDB.
Werner Vogels, CTO of Amazon

Continue reading

Data Science Tools: Fast, easy to use, and scalable

[A version of this post appears on the O’Reilly Strata blog.]

Here are a few observations based on conversations I had during the just-concluded Strata Santa Clara conference.

Spark is attracting attention
I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here), and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API alongside Spark’s native Scala interface and Java API.
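
As a rough illustration of what the new Python API looks like, here is a minimal word-count sketch run against a local file; the input path and application name are placeholders:

```python
# A minimal word count using Spark's Python API (PySpark), added in 0.7.
# "notes.txt" and the application name are placeholders.
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")
counts = (sc.textFile("notes.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):    # bring a small sample back to the driver
    print("%s: %d" % (word, count))
```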

Continue reading