As I noted in a recent post on reproducing data projects, notebooks have become popular tools for maintaining, sharing, and replicating long data science workflows. Much of that is due to the popularity of IPython (1). In development since 2001, IPython grew out of the scientific computing community and has slowly added features that appeal to data scientists.
Roots in academic scientific computing
As IPython creator Fernando Perez noted in his “historical retrospective”, exploratory analysis in a scientific setting requires a solid interactive environment. After years of development IPython has become a great tool for interacting with data. IPython also addresses other important pain points for scientists – reproducibility and collaboration – issues that are equally important to data scientists working in industry.
IPython is more than just Python
With an interactive widget architecture that’s 100% language-agnostic, IPython is now used by many other programming language communities (2), including Julia, Haskell, F#, Ruby, Go, and Scala. If you’re a data scientist who likes to mix and match languages, you can create, maintain, and share multi-language data projects in IPython.
IPython is routinely used with tools for data wrangling, advanced analytics, and large-scale computing
Data visualization options continue to improve
Users have long used Python visualization tools (e.g., matplotlib and MayaVi) within IPython. More recently, I’ve seen impressive IPython notebooks with embedded interactive visualizations that were built with tools like plotly and bokeh.
Sharing and collaboration are easy
IPython notebooks are rapidly gaining adoption for instructional purposes, and it’s no surprise that they’re already being used in many programming and data science courses (e.g., CS109 at Harvard, Python for Data Science at UC Berkeley, and Software Carpentry). More recently, entire books have been written with them as well: each chapter of a recent (3) signal processing textbook was originally an IPython notebook. Sharing is facilitated by simple, built-in tools for exporting IPython notebooks into different formats including slides, HTML, LaTeX, and JSON.
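Part of what makes notebooks easy to share is that a notebook file is itself a JSON document. The following is a minimal sketch using only the Python standard library. It assumes the version 4 notebook layout (top-level `cells`, `metadata`, `nbformat`, and `nbformat_minor` keys), which is not described in this post, and the helper names are hypothetical, chosen only for illustration.

```python
import json


def markdown_cell(text):
    """Build a Markdown cell (hypothetical helper, not an IPython API)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Build an unexecuted code cell (hypothetical helper, not an IPython API)."""
    return {
        "cell_type": "code",
        "execution_count": None,  # the cell has not been run yet
        "metadata": {},
        "outputs": [],            # no stored output
        "source": source,
    }


def make_notebook(cells):
    """Wrap a list of cells in a minimal notebook document.

    Assumption: the version 4 notebook layout.
    """
    return {
        "nbformat": 4,
        "nbformat_minor": 0,
        "metadata": {},
        "cells": cells,
    }


notebook = make_notebook([
    markdown_cell("# A tiny shared analysis"),
    code_cell("print(1 + 1)"),
])

# Serialize to the JSON text that would be stored in a notebook file,
# then read it back to confirm nothing is lost in the round trip.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)
```

Because the whole document is plain JSON, it can be emailed, diffed, or kept under version control like any other text file, and the export tools can work from the same structure.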
Community, Ecosystem, and Funding
IPython has an active community of developers (4) and users who come from academia, the public sector, and industry. Check out the many interesting notebooks listed in this community gallery – a recent favorite is this notebook that replicated XKCD sketches!
Development is funded by individual contributors (tax-deductible donations are handled by NumFOCUS), the Alfred P. Sloan Foundation, the NSF, Microsoft, and the Simons Foundation. In addition, companies (including Enthought, Microsoft, Continuum.io, GraphLab, and Dataiku) and institutions (e.g., MIT’s StarCluster) continue to build tools that integrate IPython.
Learn more at Strata Santa Clara 2014
- 2013 Data Science Salary Survey: Tools, Trends, What Pays (and What Doesn’t) for Data Professionals – Python was popular among survey respondents.
- Reproducing Data Projects
- Why is building custom recommender systems hard? Does it have to be?
(0) This post is based on an extended conversation with IPython’s creator, Fernando Perez.
(1) Fernando Perez has a nice summary of the evolution of IPython.
(2) IPython support for R and Matlab is in the planning stages. IPython language kernels include IJulia, IHaskell, IFSharp, IRuby, and IScala.
(3) Another textbook worth citing: Probabilistic Programming & Bayesian Methods for Hackers. O’Reilly’s Mining the Social Web book is accompanied by an extensive collection of IPython notebooks.
(4) IPython has 300 committers.