How machine learning will accelerate data management systems

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Tim Kraska on why ML will change how we build core algorithms and data structures.

In this episode of the Data Show, I spoke with Tim Kraska, associate professor of computer science at MIT. To take advantage of big data, we need scalable, fast, and efficient data management systems. Database administrators and users often find themselves tasked with building index structures (“indexes” in database parlance), which are needed to speed up data access.

Some common examples include:

  • B-Trees—used for range queries (e.g., retrieve all sales orders within a certain time frame)
  • Hash maps—used for key-based lookups
  • Bloom filters—used to check whether an element or piece of data is present in a set (a toy sketch follows this list)
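
To make the last item concrete, here is a minimal Bloom filter sketch in Python (an illustration, not a production implementation): a handful of salted hashes set bits in a fixed-size bit array, and a membership test can return false positives but never false negatives. Like the other classic structures above, it knows nothing about the distribution of the data it stores.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes set bits in a fixed-size bit array.
    Membership tests can return false positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("order-42")
print(bf.might_contain("order-42"))   # True
print(bf.might_contain("order-999"))  # almost certainly False
```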

Index structures take up space in a database, so you need to be selective about what to index, and traditional indexes do not take advantage of the underlying data distribution. I’ve worked in settings where an administrator or expert user carefully implements an indexing strategy for a data warehouse based on the most important and common queries.

Indexes are really models or mappings—for instance, a Bloom filter can be thought of as a classification problem (is this element in the set or not?). In a recent paper, Kraska and his collaborators approach indexing as a learning problem. As a result, they are able to build indexes that take the underlying data distribution into account, are smaller in size (thus allowing for a more liberal indexing strategy), and execute faster. Software and hardware for computation are getting cheaper and better, so using machine learning to create index structures is something that may indeed become routine.
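
To illustrate the intuition (a toy sketch only, not the recursive model index described in the paper), the snippet below fits a simple linear model that maps a key to its approximate position in a sorted array and records the worst-case prediction error, so a lookup only needs to scan a small window around the model's guess:

```python
# Toy "learned index": fit a model that maps a key to its approximate position
# in a sorted array, then correct with a bounded local search around the guess.

def fit_linear(keys):
    # Least-squares fit of position ~ a * key + b over the sorted keys.
    n = len(keys)
    xs, ys = keys, range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    # Track the worst-case prediction error so lookups know how far to search.
    max_err = max(abs((a * x + b) - y) for x, y in zip(xs, ys))
    return a, b, int(max_err) + 1

def lookup(keys, model, key):
    a, b, max_err = model
    guess = int(a * key + b)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    for i in range(lo, hi):          # bounded scan around the prediction
        if keys[i] == key:
            return i
    return None

keys = sorted(range(0, 10_000, 7))   # a simple, regular key distribution
model = fit_linear(keys)
print(lookup(keys, model, 700))      # position of key 700
```

When the model captures the key distribution well, the "index" amounts to a couple of model parameters plus an error bound, which is where the space savings come from.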

The state of AI adoption

[A version of this post appears on the O’Reilly Radar.]

An overview of adoption, and suggestions to companies interested in AI technologies.

Artificial intelligence (AI) has attracted a lot of media coverage recently, and companies are rushing to figure out how AI technologies will impact them. Much of the coverage is devoted to research breakthroughs or new product offerings. But how are companies integrating AI into their underlying businesses? In this post, we share slides and notes from a talk we gave this past September at the AI Conference in San Francisco, offering an overview of the state of adoption and some suggestions to companies interested in implementing AI technologies.

Slide courtesy of Ben Lorica. Data source: Google Trends

Much of the renewed interest in AI can be attributed to deep learning. Breakthroughs in deep learning (particularly as applied to computer vision and speech) have excited people about the possibilities of modern AI applications. The result is that companies are beginning to examine applications of deep learning to data they are already familiar with, while also considering data types (such as images, audio, and video) they have yet to take advantage of.

Practical applications of reinforcement learning in industry

[A version of this post appears on the O’Reilly Radar.]

An overview of commercial and industrial applications of reinforcement learning.

The flurry of headlines surrounding AlphaGo Zero (the most recent version of DeepMind’s AI system for playing Go) means interest in reinforcement learning (RL) is bound to increase. Next to deep learning, RL is among the most closely followed topics in AI. For most companies, RL is something to investigate and evaluate; few organizations have identified use cases where it may play a role. As we enter 2018, I want to briefly describe areas where RL has been applied.

The term RL is used, somewhat confusingly, to refer to both a set of problems and a set of techniques, so let’s first settle on what RL will mean for the rest of this post. Generally speaking, the goal in RL is to learn how to map observations and measurements to a set of actions while trying to maximize some long-term reward. This usually involves applications in which an agent interacts with an environment while trying to learn an optimal sequence of decisions. In fact, many of the initial applications of RL are in areas where automated sequential decision-making has long been sought. RL poses a different set of challenges from traditional online learning, in that you often face some combination of delayed feedback, sparse rewards, and (most importantly) agents that are able to affect the environments with which they interact.
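
As a minimal illustration of that framing (a toy sketch, not drawn from any of the industrial applications discussed in the post), here is tabular Q-learning on a made-up six-state corridor, where the only reward arrives at the final state and the agent has to learn the whole sequence of decisions from delayed, sparse feedback:

```python
import random

# Toy tabular Q-learning on a tiny "corridor": the agent starts at state 0 and
# only receives a reward when it reaches the last state, so feedback is delayed
# and sparse, and an entire sequence of decisions must be learned.
N_STATES, ACTIONS = 6, [0, 1]            # action 0 = move left, action 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else state + 1
    if nxt == N_STATES - 1:              # goal reached: reward 1 and the episode ends
        return nxt, 1.0, True
    return nxt, 0.0, False               # every other transition yields zero reward

def greedy(q_row):
    best = max(q_row)                    # break ties randomly so early episodes still explore
    return random.choice([a for a in ACTIONS if q_row[a] == best])

Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                     # training episodes
    state = 0
    for _ in range(200):                 # cap episode length for safety
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy(Q[state])
        nxt, reward, done = step(state, action)
        target = reward + (0.0 if done else GAMMA * max(Q[nxt]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = nxt
        if done:
            break

# After training, the greedy policy should choose "right" in every non-terminal state.
print([Q[s].index(max(Q[s])) for s in range(N_STATES - 1)])
```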

Machine learning at Spotify: You are what you stream

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Christine Hung on using data to drive digital transformation and recommenders that increase user engagement.

In this episode of the Data Show, I spoke with Christine Hung, head of data solutions at Spotify. Prior to joining Spotify, she led data teams at the NY Times and at Apple (iTunes). Because she has led teams at three different companies, I wanted to hear her thoughts on digital transformation, and I wanted to know how she approaches the challenge of building, managing, and nurturing data teams.

I also wanted to learn more about what goes into building a recommender system for a popular consumer service like Spotify. Engagement should clearly be the most important metric, but there are other considerations, such as introducing users to new or “long tail” content.
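
As a toy illustration of that trade-off (my own sketch, not Spotify’s system; the track names, scores, and tail_weight parameter are all invented), the snippet below re-ranks candidate tracks by blending a hypothetical predicted-engagement score with a small bonus for less-streamed, long-tail content:

```python
import math

# Toy re-ranking sketch: blend a predicted engagement score with a modest bonus
# for less-popular ("long tail") tracks, so the ranking isn't driven by engagement alone.
candidates = {
    # track: (predicted_engagement in [0, 1], total play count)
    "global_hit":   (0.80, 5_000_000),
    "niche_tango":  (0.78, 12_000),
    "new_release":  (0.75, 800),
}

def score(engagement, play_count, tail_weight=0.25):
    # Rarer tracks get a larger bonus; log scaling keeps the bonus modest.
    novelty = 1.0 / math.log10(play_count + 10)
    return engagement + tail_weight * novelty

ranked = sorted(candidates.items(),
                key=lambda kv: score(*kv[1]), reverse=True)
for track, (eng, plays) in ranked:
    print(f"{track:12s} engagement={eng:.2f} plays={plays:>9,} "
          f"score={score(eng, plays):.3f}")
```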

Here are some highlights from our conversation:

Recommenders at Spotify

For us, engagement always comes first. At Spotify, we have a couple hundred people who are just focused on user engagement, and this is the group that creates personalized playlists, like Discover Weekly or your Daily Mix, for you. We know our users love discovery and see Spotify as a very important platform for them to discover something new, but there are also times when people just want to have some music playing in the background that fits the mood. But again, we don’t have a specific agenda in terms of what we should push for. We want to give you what you want so that you are happy, which is why we invested so much in understanding people through music. If we believe you might like some “long tail” content, we will recommend it to you because it makes you happy, but we can also do the same for top 100 tracks if we believe you will enjoy them.

Music is like a mirror

Music is like a mirror, and it tells people a lot about who you are and what you care about, whether you like it or not. We love to say “you are what you stream,” and that is so true. As you can imagine, we invest a lot in our machine learning capabilities to predict people’s preferences and context, and of course, all the data we use to train the model is anonymized. We take in large amounts of anonymized training data to develop these models, and we test them out with different use cases, analyze results, and use what we learn to improve those models.

Just to give you a personal example of how it works, you can learn a lot about me just by me telling you what I stream. You will see that I use my running playlist only during the weekend in early mornings, and I have a lot of children’s songs streamed at my house between 5 p.m. and 7 p.m. I also have a lot of tango and salsa playlists that I created and followed. So what does that tell you? It tells you that I am probably a weekend runner, which means I have some kind of affinity for fitness; it tells you that I am probably a mother and play songs for my child after I get home from work; it also tells you that I somehow like tango and salsa, so I am probably a dancer, too. As you can see, we are investing a lot into understanding people’s context and preferences so we can start capturing different moments of their lives. And, of course, the more we understand your context, your preferences, and what you are looking for, the better we can customize your playlists for you.
