Structured streaming comes to Apache Spark 2.0

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Michael Armbrust on enabling users to perform streaming analytics, without having to reason about streaming.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

With the release of Spark 2.0, streaming becomes much more accessible to users. By adopting a continuous processing model (treating a stream as an infinite, continuously growing table), the developers of Spark have enabled users of its SQL and DataFrame APIs to extend their analytic capabilities to unbounded streams.
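
To make the "infinite table" idea concrete, here is a minimal PySpark sketch of a structured streaming query. The input path, schema, and windowed aggregation are hypothetical illustrations, not details from the episode.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Hypothetical event schema; any source with a timestamp column would do.
schema = (StructType()
          .add("user", StringType())
          .add("action", StringType())
          .add("ts", TimestampType()))

# Treat a directory of JSON files as an unbounded table that grows as files arrive.
events = spark.readStream.schema(schema).json("/data/events")

# Ordinary DataFrame operations apply to the stream: a windowed count per action.
counts = events.groupBy(window(events.ts, "10 minutes"), events.action).count()

# Continuously maintain the aggregate; "complete" mode re-emits the full result each trigger.
query = (counts.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("action_counts")
         .start())
```

The same groupBy and count would work unchanged on a static DataFrame; the engine simply maintains the result incrementally as new data arrives.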

Within the Spark community, Databricks engineer Michael Armbrust is well known for having led the long-term project to move Spark’s interactive analytics engine from Shark to Spark SQL. (Full disclosure: I’m an advisor to Databricks.) Most recently, he has turned his efforts to helping introduce a much simpler stream processing model to Spark (“structured streaming”).

Tackling these problems at large scale, in a popular framework with many production deployments, is challenging. So think of Spark 2.0 as the opening salvo. Just as it took a few versions before a majority of Spark users moved over to Spark SQL, I expect the new structured streaming framework to improve and mature over the next few releases of Spark.

Here are some highlights from our conversation:


Don’t overlook simpler techniques and algorithms

Even in areas and domains where deep learning excels, simpler approaches are worth examining.

A while back I noted that there are several considerations when evaluating machine learning models:

Accuracy is the main objective, and a lot of effort goes toward raising it. But in practice, trade-offs have to be made, and other considerations play a role in model selection. Speed (to train and score) is important if the model is to be used in production. Interpretability is critical if a model has to be explained for transparency reasons (“black boxes” are always an option, but they are opaque by definition). Simplicity is important for practical reasons: if a model has “too many knobs to tune” and optimizations have to be done manually, it might be too involved to build and maintain in production.

While deep learning has emerged as a technique capable of producing state-of-the-art results across several domains and types of data, it is far from the optimal choice in some situations. Simple techniques can sometimes produce comparable results, and they fare better along the other dimensions listed above (speed, interpretability, and simplicity).
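
As a concrete illustration of the kind of simple baseline worth trying first, here is a quick sketch using bag-of-words features and a linear classifier (scikit-learn assumed; the tiny inline dataset is purely illustrative).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative toy data: short product reviews with positive/negative labels.
texts = ["great phone, love the camera",
         "battery died after a week",
         "screen is sharp and bright",
         "stopped working, very disappointed"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words (unigrams + bigrams) feeding a logistic regression.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(texts, labels)

# Trains in seconds, and the learned coefficients are directly inspectable.
print(baseline.predict(["camera is great but battery is weak"]))
```

If a baseline like this gets within a point or two of a deep model, the speed and interpretability arguments above often tip the balance.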

A few weeks ago, I tweeted a few examples where simpler approaches outperformed deep learning (or, in the case of the last tweet, bag-of-words + CNNs outperforming RNNs). These examples struck a chord, which prompted me to collect them into a post. It also got me thinking that I should solicit similar examples from my readers: please leave your favorite example in the comments below, and I will update this post with the best suggestions.


Recent trends in recommender systems

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Danny Bickson on recommenders, data science, and applications of machine learning.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with Danny Bickson, co-founder and VP at Dato, and the principal organizer of the Data Science Summit (full disclosure: I’m a member of the conference organizing committee). Among machine learning students and practitioners, recommender systems have become something of a canonical use case and application. One of the early and popular building blocks was GraphLab’s collaborative filtering toolkit, a library originally written and maintained by Bickson. He has kept tabs on the latest developments in recommenders and continues to help organize workshops on related topics throughout the world.

In recent years Bickson has turned his focus toward helping companies deploy machine learning systems in production in a wide range of real-world settings. Here are some highlights from our conversation:

Building a toolkit for collaborative filtering

It was kind of accidental. I was working on my Ph.D.—a lot of linear models, like linear systems of equations and iterative solvers. Matrix factorization, which is the base algorithm behind collaborative filtering, is closely related to linear systems. It can be thought of as some kind of extension, and it’s more powerful. … When I was at CMU, I heard a lecture by a guy who’s now a researcher at Facebook, who actually worked on what they call Bayesian Tensor Factorization. This work drew me toward the domain, and I started to look into it. His code was in MATLAB, so I tried to re-implement it on our system, GraphLab.

Initially, when we started the project, we had what we call a framework, which is like an API for graph analytics. But we found out that not many people are interested in just writing code for a framework, because it’s very low level and it’s not that intuitive. … Once we started to package algorithms on top of the framework, we became way more popular, because people wanted to use pre-made building blocks.

One of the reasons behind the success of this toolkit was that we started to compete in a relatively well-known competition, the ACM KDD Cup, back in 2011. … When we started to compete using our code, we actually did something counter-intuitive: we shared our code during the competition, and then people who downloaded it could improve their own results. That got us very quickly to hundreds of downloads, and a lot of companies were involved in this competition, so that opened a lot of doors for us in industry.
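
Bickson’s point that matrix factorization sits underneath collaborative filtering is easy to see in code. Here is a minimal SGD-based sketch (not GraphLab’s implementation) using NumPy; the ratings triples and hyperparameters are purely illustrative.

```python
import numpy as np

# Synthetic (user, item, rating) observations.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_items, k = 3, 3, 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
V = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                   # error on the observed entry
        U[u] += lr * (err * V[i] - reg * U[u])  # gradient step on user factors
        V[i] += lr * (err * U[u] - reg * V[i])  # gradient step on item factors

# Predicted score for a user/item pair the model never observed directly.
print(U[2] @ V[0])
```

Real systems add user/item biases, implicit-feedback handling, and more careful regularization, but the core update is this small.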

Recommender systems

The cornerstone of recommender systems research was the Netflix competition, which, I guess, most of us know. … That was for movie recommendations. The main assumption was that you have information about user-to-movie interactions and their scores. That’s the kind of problem we are all very familiar with. There are hundreds of research papers; it is a well-explored domain where we are very good.

The areas that need a bit more attention are those where you have additional data … [where] you also know the day of the week, the time, which type of iPhone the user had, the user’s age and zip code, the item’s color and price, and so on. Once you throw in more information, of course, you can build richer models, but then the complexity goes up.

… You can have models that rely on user behavior. You can have separate models that rely on activity data, like finding cars similar to the ones previously sold, and so on. There are models based on text descriptions of products. We have models based on user reviews of products: text reviews and sentiment. There are models that even take into account images of products. But the most interesting models are hybrid models that combine many types of inputs, because companies have very rich information. Currently, they’re using just a small fraction of that information to make predictions, but once they gather more information, they can build better, more accurate models. That is what’s most interesting to me personally.
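
To illustrate the hybrid idea at the simplest possible level, here is a toy sketch that blends a latent-factor (collaborative) score with a content-based similarity score. The vectors and blend weight are hypothetical; real systems learn the combination rather than hand-weighting it.

```python
import numpy as np

def hybrid_score(user_factors, item_factors, user_profile, item_features, alpha=0.7):
    """Weighted blend of a latent-factor (CF) score and a content-based cosine score."""
    cf_score = user_factors @ item_factors          # dot product of learned latent factors
    content_score = (user_profile @ item_features) / (
        np.linalg.norm(user_profile) * np.linalg.norm(item_features) + 1e-9)
    return alpha * cf_score + (1 - alpha) * content_score

# Hypothetical latent factors from a factorization model, plus side-information
# vectors (e.g., one-hot encodings of price bucket, color, category).
u_factors = np.array([0.9, -0.2]); i_factors = np.array([1.1, 0.3])
u_profile = np.array([1.0, 0.0, 1.0]); i_features = np.array([1.0, 1.0, 0.0])
print(hybrid_score(u_factors, i_factors, u_profile, i_features))
```

In practice the combination is usually learned, for example with factorization machines or a gradient-boosted blend, rather than fixed by hand.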

Deep learning using Dato

As you know, deep learning is one of the hottest techniques in machine learning, so we did want to have a foothold in this domain. So far, we have an initial version, which supports convolutional neural networks. But the good news is that we hired some of the people behind MXNet, which is one of the emerging deep learning platforms, and there you have a lot of other algorithms, including RNNs, and you also have features like support for multiple GPUs.

Editor’s note: Danny Bickson will present a talk entitled “New trends in recommender systems” at Strata + Hadoop World London 2016.
