[A version of this post appears on the O’Reilly Data blog.]
I use a variety of tools for advanced analytics; most recently I’ve been using Spark (and MLlib), R, scikit-learn, and GraphLab. When I need to get something done quickly, I’ve been turning to scikit-learn for my first-pass analysis. For access to high-quality, easy-to-use implementations (1) of popular algorithms, scikit-learn is a great place to start. So much so that I often encourage new and seasoned data scientists to try it whenever they’re faced with analytics projects that have short deadlines.
I recently spent a few hours with one of scikit-learn’s core contributors, Olivier Grisel. We had a free-flowing discussion where we talked about machine-learning, data science, programming languages, big data, Paris, and … scikit-learn! Along the way, I was reminded why I’ve come to use (and admire) the scikit-learn project.
Commitment to documentation and usability
One of the reasons I started (2) using scikit-learn was its nice documentation (which I hold up as an example for other communities and projects to emulate). Contributions to scikit-learn are required to include narrative examples along with sample scripts that run on small data sets. Besides good documentation, there are other core tenets that guide the community’s overall commitment to quality and usability: the global API is safeguarded, all public APIs are well documented, and, when appropriate, contributors are encouraged to expand the coverage of unit tests.
Models are chosen and implemented by a dedicated team of experts
scikit-learn’s stable of contributors includes experts in machine-learning and software development. A few of them (including Olivier) are able to devote a portion of their professional working hours to the project.
Covers most machine-learning tasks
Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (such as clustering, classification, and regression). And since scikit-learn is developed by a large community of developers and machine-learning experts, promising new techniques tend to be included in fairly short order.
Because scikit-learn is a curated library, users don’t have to choose from multiple competing implementations of the same algorithm (a problem that R users often face). To assist users who struggle to choose between different models, Andreas Muller created a simple flowchart.
Python and Pydata
I recently wrote about Python’s popularity among data scientists and engineers in the SF Bay Area, and it does appear to be the language preferred by many data scientists. Python’s interpreter allows users to interact and play with data sets, and from the outset this made the language attractive to data analysts. More importantly, an impressive set of Python data tools (pydata) has emerged over the last few years (I wrote about the pydata ecosystem earlier this year).
Many data scientists work regularly with several (3) pydata tools including scikit-learn, IPython, and matplotlib. A common practice when using scikit-learn is to create matplotlib charts to evaluate data quality or debug a model. Users are also starting to share multi-step analytic projects, using IPython notebooks that embed results and outputs from different pydata components (4).
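As one illustration of charting to debug a model, here is a minimal sketch that plots training accuracy against cross-validated accuracy as a decision tree is allowed to grow deeper. The data set (scikit-learn's bundled digits), the model, the depth range, and the output file name are all assumptions chosen for the example, not something prescribed by the project.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Small bundled data set, used here only for illustration.
X, y = load_digits(return_X_y=True)
depths = np.arange(1, 11)

# Score the tree at each depth on the training folds and the held-out folds.
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
)
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

# A widening gap between the two curves is a common sign of overfitting.
fig, ax = plt.subplots()
ax.plot(depths, train_mean, marker="o", label="training accuracy")
ax.plot(depths, valid_mean, marker="o", label="cross-validated accuracy")
ax.set_xlabel("max_depth")
ax.set_ylabel("accuracy")
ax.legend()
fig.savefig("validation_curve.png")
```

In an IPython notebook you would typically drop the `Agg` line and let the figure render inline next to the code that produced it.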
One other sign that Python has emerged as the preferred language of data scientists: new analytic tools like Spark (PySpark), GraphLab (GraphLab notebook), and Adatao all support Python.
scikit-learn is a machine-learning library. Its goal is to provide a set of common algorithms to Python users through a consistent interface. This means that hard choices have to be made as to what fits into the project. For example, the community recently decided that deep learning had enough specialized requirements (large (5) number of hyper-parameters; computation on GPU introduces new complex software dependencies) that it was best included in a new project. scikit-learn developers have instead opted to implement baseline neural networks as building blocks (Multilayer Perceptron and Restricted Boltzmann Machines).
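To make the "consistent interface" point concrete, here is a minimal sketch in which three unrelated classifiers are trained and scored by the same loop, because each one exposes the same `fit` and `score` methods. The choice of models, their settings, and the iris data set are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Three very different algorithms, all driven through the same methods.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # same call for every estimator
    scores[name] = model.score(X_test, y_test)   # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

Swapping one algorithm for another is a one-line change to the dictionary, which is a large part of why a first-pass analysis goes quickly.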
scikit-learn scales to most data problems
The knock on Python is speed (6) and scale. It turns out that while scale can be a problem, it may not come up as often as some detractors claim (7). Many problems can be tackled using a single (big memory) server, and well-designed software that runs on a single machine can blow away distributed systems. Other techniques like sampling or ensemble (8) learning can also be used to train models on massive data sets.
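The partition-and-aggregate flavor of ensemble learning mentioned here can be sketched in a few lines: split the training rows into disjoint subsets, fit a small model on each subset independently, and average the predictions. This is an illustrative sketch on synthetic data, not how scikit-learn's own ensemble estimators are implemented; the number of partitions, the tree depth, and the use of averaged class probabilities are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a data set too large to fit in one pass.
X, y = make_classification(
    n_samples=6000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1. Partition the training rows into disjoint, shuffled subsets.
n_parts = 10
rng = np.random.default_rng(0)
parts = np.array_split(rng.permutation(len(X_train)), n_parts)

# 2. Fit a simple model on each subset independently.
models = [
    DecisionTreeClassifier(max_depth=5, random_state=i).fit(X_train[idx], y_train[idx])
    for i, idx in enumerate(parts)
]

# 3. Aggregate by averaging the predicted class probabilities.
proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
ensemble_pred = models[0].classes_[proba.argmax(axis=1)]

ensemble_acc = (ensemble_pred == y_test).mean()
single_acc = models[0].score(X_test, y_test)
print(f"single partition model: {single_acc:.3f}")
print(f"averaged ensemble:      {ensemble_acc:.3f}")
```

Because each model sees only its own partition, the fitting step can run one subset at a time (or in parallel) without ever holding the full training set in a single model's working memory.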
But there are occasions when the combination of raw data size and workflow dictates my choice of tools. I sometimes turn to machine-learning tools that integrate with my data wrangling and ETL tools (e.g., Spark, MapReduce, Scalding). For advanced analytics against really large data sets, I end up using distributed frameworks like Spark (9).
Learn more at Strata Santa Clara 2014
If you’re new to machine-learning and are interested in learning about pydata tools, you should consider attending Olivier’s upcoming tutorial at Strata Santa Clara. Olivier is a popular speaker and instructor within the pydata community, and in his tutorial you’ll learn how to train, evaluate, and tune several machine-learning models using scikit-learn and other pydata components.
(1) Models and algorithms are implemented by experts and are peer reviewed by the scikit-learn developer community.
(2) I needed a RandomForest package and read what scikit-learn had to offer – I was off and running pretty quickly.
(3) As I noted in my earlier post, installing multiple components of the pydata stack has gotten a lot easier.
(4) Speaking of reproducibility: Paris startup Dataiku has a nice product that, among other things, automatically generates an IPython notebook.
(5) Deep learning model parameters typically involve tree-structured configuration files instead of a flat list of parameter values used in other scikit-learn models. More recently, some scikit-learn developers decided that structured prediction models (typically used in computer vision and NLP) are better suited for a new project called PyStruct.
(6) It tends to be slower than JVM languages, Go, Julia, and C/C++. On the other hand, the use of tools like numpy and Cython can make Python just as fast for numerical processing workloads “without sacrificing its high level language features and expressiveness”.
(7) As I noted in a post about GraphChi, well-designed software that works with sparse matrices (that “don’t store zeroes”) can handle really large problems.
(8) Ensemble learning involves partitioning data into smaller subsets, fitting simple models on those subsets independently, and aggregating models to build a final, complex model. Olivier also mentioned recent work within scikit-learn “… to optimize multi-core and memory usage when training forests of randomized trees in parallel on a single node”.
(9) This assumes I have access to the algorithm I need, via MLlib or something I’ve written for my own use. But more often than not, choosing to do ML from within Spark is more about having a consistent workflow, not data size: usually the data set one gets after data wrangling (filtering and aggregating) and numerical feature construction is small enough to fit in the memory of a single (big memory) server.