Practical applications of reinforcement learning in industry

[A version of this post appears on the O’Reilly Radar.]

An overview of commercial and industrial applications of reinforcement learning.

The flurry of headlines surrounding AlphaGo Zero (the most recent version of DeepMind’s AI system for playing Go) means interest in reinforcement learning (RL) is bound to increase. Next to deep learning, RL is among the most followed topics in AI. For most companies, RL is something to investigate and evaluate, but few organizations have identified use cases where it may play a role. As we enter 2018, I want to briefly describe areas where RL has been applied.

RL is confusingly used to refer to both a set of problems and a set of techniques, so let’s first settle on what RL will mean for the rest of this post. Generally speaking, the goal in RL is to learn how to map observations and measurements to a set of actions while trying to maximize some long-term reward. This usually involves applications in which an agent interacts with an environment while trying to learn optimal sequences of decisions. In fact, many of the initial applications of RL are in areas where automated sequential decision-making has long been sought. RL poses a different set of challenges from traditional online learning: you often have some combination of delayed feedback, sparse rewards, and (most importantly) agents that are able to affect the environments with which they interact.
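
To make the terminology concrete, here is a minimal tabular Q-learning sketch on a made-up five-state “corridor” (the environment, actions, and numbers are all invented purely for illustration, not drawn from any production system): an agent repeatedly observes its state, picks an action, receives a sparse, delayed reward, and updates its value estimates.

```python
import random

# Toy "corridor" environment: states 0..4, start at state 0, and the only
# reward comes from reaching state 4. Everything here is made up to illustrate
# the agent/environment loop and sparse, delayed rewards described above.
N_STATES = 5
ACTIONS = [0, 1]                        # 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Apply an action and return (next_state, reward, done) for the toy corridor."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done    # reward only at the far end

def greedy(q_values):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q_values)
    return random.choice([a for a, q in enumerate(q_values) if q == best])

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit current estimates, occasionally explore
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(Q[state])
        nxt, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

print(Q)  # the learned values favor moving right, toward the rewarding terminal state
```
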
Continue reading “Practical applications of reinforcement learning in industry”

Machine learning at Spotify: You are what you stream

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Christine Hung on using data to drive digital transformation and recommenders that increase user engagement.

In this episode of the Data Show, I spoke with Christine Hung, head of data solutions at Spotify. Prior to joining Spotify, she led data teams at the NY Times and at Apple (iTunes). Because she has led teams at three different companies, I wanted to hear her thoughts on digital transformation, and I wanted to know how she approaches the challenge of building, managing, and nurturing data teams.

I also wanted to learn more about what goes into building a recommender system for a popular consumer service like Spotify. Engagement should clearly be the most important metric, but there are other considerations, such as introducing users to new or “long tail” content.

Here are some highlights from our conversation:

Recommenders at Spotify

For us, engagement always comes first. At Spotify, we have a couple hundred people who are just focused on user engagement, and this is the group that creates personalized playlists, like Discover Weekly or your Daily Mix, for you. We know our users love discovery and see Spotify as a very important platform for them to discover something new, but there are also times when people just want to have some music played in the background that fits the mood. But again, we don’t have a specific agenda in terms of what we should push for. We want to give you what you want so that you are happy, which is why we invested so much in understanding people through music. If we believe you might like some “long tail” content, we will recommend it to you because it makes you happy, but we can also do the same for top 100 tracks if we believe you will enjoy them.

Music is like a mirror

Music is like a mirror, and it tells people a lot about who you are and what you care about, whether you like it or not. We love to say “you are what you stream,” and that is so true. As you can imagine, we invest a lot in our machine learning capabilities to predict people’s preferences and context, and of course, all the data we use to train the model is anonymized. We take in large amounts of anonymized training data to develop these models, and we test them out with different use cases, analyze results, and use what we learn to improve those models.

Just to give you my personal example to illustrate how it works, you can learn a lot about me just by me telling you what I stream. You will see that I use my running playlist only during the weekend in early mornings, and I have a lot of children’s songs streamed at my house between 5 p.m. and 7 p.m. I also have a lot of tango and salsa playlists that I created and followed. So what does that tell you? It tells you that I am probably a weekend runner, which means I have some kind of affinity for fitness; it tells you that I am probably a mother and play songs for my child after I get home from work; it also tells you that I somehow like tango and salsa, so I am probably a dancer, too. As you can see, we are investing a lot into understanding people’s context and preferences so we can start capturing different moments of their lives. And, of course, the more we understand your context, your preferences, and what you are looking for, the better we can customize your playlists for you.

The current state of Apache Kafka

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Neha Narkhede on data integration, microservices, and Kafka’s roadmap.

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “The Age of Machine Learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics—including machine learning—remains an area of focus for most companies. It turns out “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. Tools that make it easier to create and productionize data refinement pipelines over both batch and streaming data sources free analysts and data scientists to focus on analytics that unlock value from data.

On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.
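
For readers who haven’t used Kafka, here is a minimal sketch of the publish/subscribe pattern that makes it useful for ingestion and integration. It uses the kafka-python client against an assumed local broker and a made-up “page-views” topic; it is purely illustrative and not tied to anything discussed in the episode.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: applications publish events to a topic instead of writing
# directly to every downstream system (warehouse, Hadoop, stream processor).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker for the sketch
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "u123", "page": "/pricing"})
producer.flush()

# Consumer side: each downstream system subscribes to the same topic and reads
# at its own pace, which is what decouples producers from consumers.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # stop after one record; a real consumer would keep polling
```
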

Here are some highlights from our conversation:

The first engineering project that made use of Apache Kafka

If I remember correctly, we were putting Hadoop into place at LinkedIn for the first time, and I was on the team that was responsible for that. The problem was that all our scripts were actually built for another data warehousing solution. The question was, are we going to rewrite all of those scripts and now sort of make them Hadoop specific? And what happens when a third and a fourth and a fifth system is put into place?

So, the initial motivating use case was: ‘we are putting this Hadoop thing into place. That’s the new-age data warehousing solution. It needs access to the same data that is coming from all our applications. So, that is the thing we need to put into practice.’ This became Kafka’s very first use case at LinkedIn. From there, because that was very easy and I actually helped move one of the very first workloads to Kafka, it was hardly difficult to convince the rest of the LinkedIn engineering team to start moving over to Kafka.
Continue reading “The current state of Apache Kafka”

Building a natural language processing library for Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: David Talby on a new NLP library for Spark, and why model development starts after a model gets deployed to production.

When I first discovered and started using Apache Spark, a majority of my use cases involved unstructured text. The absence of libraries meant rolling my own NLP utilities and, in many cases, implementing machine learning algorithms myself (this was pre-deep learning, and MLlib was much smaller). I’d always wondered why no one had bothered to create an NLP library for Spark when so many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries.
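
To give a sense of what “rolling your own” looked like, here is a hedged sketch using only the generic feature transformers that ship with Spark MLlib (the app name and sample sentence are made up): you get tokens, stop-word filtering, and term counts, and anything more linguistic (sentence splitting, lemmatization, entity extraction) was up to you.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer

spark = SparkSession.builder.appName("roll-your-own-nlp").getOrCreate()
df = spark.createDataFrame(
    [("Spark made it easy to scale text processing but the NLP pieces were DIY",)],
    ["text"],
)

# MLlib's text tools are generic feature transformers; chaining them gives you
# tokens and term counts, and everything more linguistic had to be hand-written.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    StopWordsRemover(inputCol="words", outputCol="filtered"),
    CountVectorizer(inputCol="filtered", outputCol="features"),
])
pipeline.fit(df).transform(df).select("filtered", "features").show(truncate=False)
```
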

In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago, I mentioned to Talby the need for an NLP library within Spark; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception BigDL received and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will become a standard tool among Spark users.
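
For comparison, here is roughly what an equivalent pipeline looks like with the Spark NLP library. I am sketching from the library’s published examples, so the exact class and module names are assumptions that may differ between releases; treat this as illustrative rather than canonical.

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

# sparknlp.start() returns a SparkSession with the library's jars loaded.
spark = sparknlp.start()
df = spark.createDataFrame(
    [("Spark NLP adds linguistic annotators on top of Spark ML pipelines.",)],
    ["text"],
)

# Annotators plug into the same Pipeline abstraction as MLlib transformers.
pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol("text").setOutputCol("document"),
    Tokenizer().setInputCols(["document"]).setOutputCol("token"),
    Normalizer().setInputCols(["token"]).setOutputCol("normalized"),
])
pipeline.fit(df).transform(df).select("normalized.result").show(truncate=False)
```
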

Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.

Here are some highlights from our conversation:

The state of NLP in Spark

Here are your two choices today. Either you want to leverage all of the performance and optimization that Spark gives you, which means you want to stay basically within the JVM, and you want to use a Java-based library. In which case, you have options that include OpenNLP, which is open source, or Stanford NLP, which requires licensing in order to use in a commercial product. These are older and more academically oriented libraries. So, they have limitations in performance and what they do.
Continue reading “Building a natural language processing library for Apache Spark”

Machine intelligence for content distribution, logistics, smarter cities, and more

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Rhea Liu on technology trends in China.

In this episode of the Data Show, I spoke with Rhea Liu, an analyst at China Tech Insights, a new research firm that is part of Tencent’s Online Media Group. If there’s one place where AI and machine learning are discussed even more than in the San Francisco Bay Area, it’s China. Each time I go to China, there are new applications that weren’t widely available just the year before. This year, bike sharing was impossible to miss, mobile payments seemed to be accepted everywhere, and people kept pointing out nascent applications of computer vision (facial recognition) to identity management and retail (unmanned stores).

I wanted to consult local market researchers to help make sense of some of the things I’ve been observing from afar. Liu and her colleagues have put out a series of interesting reports highlighting some of these important trends. They also have an annual report—Trends & Predictions for China’s Tech Industry in 2018—that Liu will discuss in her keynote and talk at Strata Data Singapore in December.

Here are some highlights from our conversation:
Continue reading “Machine intelligence for content distribution, logistics, smarter cities, and more”

How companies can navigate the age of machine learning

[A version of this post appears on the O’Reilly Radar.]

To become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.

Over the last few years, the data community has focused on gathering and collecting data, building infrastructure for that purpose, and using data to improve decision-making. We are now seeing a surge in interest in advanced analytics and machine learning across many industry verticals.

In this post, I share slides and notes from a talk I gave this past September at Strata Data NYC offering suggestions to companies interested in adding machine learning capabilities. The information stems from conversations with practitioners, researchers, and entrepreneurs at the forefront of applying machine learning across many different problem domains.
Continue reading “How companies can navigate the age of machine learning”

Vehicle-to-vehicle communication networks can help fuel smart cities

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Bruno Fernandez-Ruiz on the importance of building the ground control center of the future.

In this episode of the Data Show, I spoke with Bruno Fernandez-Ruiz, co-founder and CTO of Nexar. We first met when he was leading Yahoo! technical teams charged with delivering a variety of large-scale, real-time data products. His new company is helping build out critical infrastructure for the emerging transportation sector.

While some question whether vehicle-to-everything (V2X) communication is necessary to get to fully autonomous vehicles, Nexar is already paving the way by demonstrating how a vehicle-to-vehicle (V2V) communication network can be built efficiently. As Fernandez-Ruiz points out, there are many applications for such a V2V network (safety being the most obvious one). I’m particularly fascinated by what such a network, and the accompanying data, opens up for future, smarter cities. As I pointed out in a post on continuous learning, simulations are an important component of training AI applications. It seems reasonable to expect that the data sets collected by V2V networks will be useful for the smart city planners of the future.

Continue reading “Vehicle-to-vehicle communication networks can help fuel smart cities”