[A version of this post appeared on the O’Reilly Strata blog.]
Here are a few observations inspired by conversations I had during the just-concluded PyData conference (1).
The Python data community is well-organized:
Besides conferences (PyData, SciPy, EuroSciPy), there is a new non-profit (NumFOCUS) dedicated to supporting scientific computing and data analytics projects. The supported projects are currently all Python-based, but in principle NumFOCUS is an entity that can be used to support related efforts from other communities.
It’s getting easier to use the Python data stack:
There are tools that facilitate the dissemination and sharing of code and programming environments. IPython (2) notebooks allow Python code and markup in the same document. Notebooks are used to record and share complex workflows and are used heavily for (conference) tutorials. As the data stack grows, one of the major pain points is getting all the packages to work properly together: version compatibility is a common issue, and setting up an environment where all the pieces work together can be a pain. There are now a few solutions that address this issue: Anaconda and cloud-based Wakari from Continuum Analytics, and cloud computing platform PiCloud.
There are many more visualization tools to choose from:
The 2D plotting tool matplotlib is the first tool enthusiasts turn to, but as I learned at the conference, there are a number of other options available. Continuum Analytics recently introduced companion packages Bokeh and Bokeh.js that simplify the creation of static and interactive visualizations using Python. In particular, Bokeh is the Python equivalent of ggplot (it even has an interface that mimics ggplot). With Nodebox, programmers use Python code to create sketches and interactive visualizations that are similar to those produced by Processing.
Large-scale data processing and wrangling tools have improved:
Pandas and PyTables are already popular, and there was very strong interest in the forthcoming Blaze project at the conference. Other options include the Disco Project, a data processing platform that includes an implementation of Map/Reduce, and PySpark, the Python API for the Spark data analytics framework.
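For readers new to the Map/Reduce model mentioned above, here is a toy, single-process sketch in plain Python. It is not Disco's or PySpark's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are made up for illustration, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, chosen only for illustration.
lines = ["pandas and pytables", "pandas and blaze"]
counts = word_count(lines)
print(counts)
```

The point of the pattern is that map_phase and reduce_phase only ever see one line or one key at a time, which is what lets a platform run them in parallel.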
There are viable tools for large-scale data analytics:
Scikit-learn (machine-learning library) and scikit-image (image processing) are used by many academic research groups and companies. Both have extensive libraries of algorithms, and come with lots of examples to help users get started (3). Another tool written in Python focuses on deployment (4): Augustus is an open source system for building and scoring scalable data mining and statistical algorithms. Augustus produces and consumes PMML, and includes components for simple data wrangling (users can embed Python code for data processing in their PMML files).
In addition, new tools like H2O and wise.io plan to make their massively scalable algorithms accessible via Python. Frameworks that expose distributed algorithms to Python programmers include GraphLab (Python/Jython interface) and Spark (algorithms (5) in Scala that are accessed via PySpark). Finally, there are also tools that let Python programmers target GPUs for parallel programming: NumbaPro and PyCUDA.
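To give a flavor of the scikit-learn workflow mentioned above, here is a minimal sketch of the usual fit/predict/score pattern. The choice of dataset (the bundled iris data), classifier (k-nearest neighbors), and split parameters are my own assumptions for illustration, and the import paths follow the current scikit-learn layout, which may differ in older releases.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small dataset that ships with scikit-learn.
iris = load_iris()

# Hold out a quarter of the rows for evaluation; the seed is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Every scikit-learn estimator follows the same fit / predict interface.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # fraction of correct predictions
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm usually means changing only the line that constructs the model, which is a large part of why the library is convenient for prototyping.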
Next up: see you at the SAS Global Forum next month.
(1) The event drew about 300 attendees and is one of three PyData conferences scheduled this year (Boston in the summer, NYC in the fall).
(2) The new language Julia and IPython are starting to work well together.
(3) In practice these tools let Python programmers efficiently develop prototypes that are later re-implemented (in another language) and optimized before being deployed to production.
(4) Data scientists tend not to focus on the deployment and maintenance of “models”. The Hazy project may change this mindset.
(5) A suite of distributed algorithms will be available upon the release of MLbase on Spark.