A framework for building and evaluating data products

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Pinterest data scientist Grace Huang on lessons learned in the course of machine learning product launches.


In this episode of the Data Show, I spoke with Grace Huang, data science lead at Pinterest. With its combination of a large social graph, enthusiastic users, and multimedia data, I’ve long regarded Pinterest as a fascinating lab for data science. Huang described the challenge of building a sustainable content ecosystem and shared lessons from the front lines of machine learning product launches. We also discussed recommenders, the emergence of deep learning as a technique used within Pinterest, and the role of data science within the company.

Here are some highlights from our conversation:

Using machine learning to strengthen content ecosystems

Pinterest content is a giant, complicated corpus that has very rich metadata associated with it. If you build a recommendation system where there’s a lot of bias in it, over time you can start showing just a particular corner of that corpus to the world, because you think your user might find a piece of that corner of content particularly engaging. This is an issue when you’re basing your algorithms only on your existing users.

When Pinterest first started out, we had a very strong user base around particular user demographics. That part of the content corpus became very well curated, which made those content pieces rank really high in our machine learning products. Then we had to start consciously thinking about how to combat that problem because otherwise, over time, you’re just going to build a product that only appeals to that segment of users.

From the user perspective, you want to make sure you’re creating a corpus that covers enough in terms of topics and interests, in terms of different languages people speak, in terms of different cultural backgrounds. Then, I think on the content side, we have the same problem where fresher, newer content may have trouble competing with older content that’s been around for a long time and has really good historical performance.

Maintaining this healthy ecosystem involves creating mechanisms to jump-start new content so we can show it enough times to quickly learn whether or not it’s high quality, and whether or not it might be relevant for certain segments of users. We then want to be able to use that information very efficiently to drive our downstream products.
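To make the jump-start idea concrete, here is a minimal sketch of one common approach: reserving a small share of feed slots for content that has not yet been shown often. It is an illustration only, not a description of Pinterest’s system. The function name `build_feed`, the fields `id`, `score`, and `impressions`, and the parameters `explore_fraction` and `min_impressions` are all hypothetical.

```python
import random


def build_feed(candidates, feed_size, explore_fraction=0.2,
               min_impressions=100, rng=None):
    """Fill a feed, reserving some slots for rarely shown content.

    candidates: list of dicts with hypothetical keys 'id', 'score'
    (model relevance score) and 'impressions' (times shown so far).
    """
    rng = rng or random.Random()

    # Content that has not been shown enough to judge its quality yet.
    fresh = [c for c in candidates if c["impressions"] < min_impressions]

    # Reserve a fraction of the feed for randomly chosen fresh content.
    n_explore = min(len(fresh), int(feed_size * explore_fraction))
    explored = rng.sample(fresh, n_explore)
    explored_ids = {c["id"] for c in explored}

    # Fill the remaining slots by model score, as a plain ranker would.
    rest = sorted(
        (c for c in candidates if c["id"] not in explored_ids),
        key=lambda c: c["score"],
        reverse=True,
    )
    return explored + rest[: feed_size - n_explore]
```

With `explore_fraction=0.2` and a feed of 10 items, up to two slots go to fresh content regardless of its score, so it can accumulate the impressions needed to evaluate it. The remaining slots are still ranked purely by score.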

Building data products: Three anti-patterns

The first one is: do not build a model only for your users today. You have to think about your users tomorrow as well.

Second, it’s really easy to build a system where the rich get richer. It’s often not by design, and there are a lot of techniques out there to prevent it from happening. It’s very subtle, and it takes a long time for this rich-get-richer effect to build up and become observable. You have to be very vigilant about it. …

The third anti-pattern is that you might find yourself optimizing not quite the right thing. You can get exactly what you wish for with a machine learning system. It’s very good at optimizing a goal that you specify. But that goal may not necessarily correlate with the ultimate goal. Keeping your ultimate goal in mind and evaluating your products with the ultimate goal, instead of your intermediate goal, is really important. For example, I think short-term metrics are easier to optimize toward. But they may or may not correlate with a long-term goal like retention.
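One way to stay vigilant about a rich-get-richer effect is to track how concentrated impressions are across content over time. The sketch below uses the Gini coefficient as a concentration measure. This is a choice made here for illustration, not a metric mentioned in the conversation, and the function name `gini` and its input format are assumptions.

```python
def gini(impressions):
    """Gini coefficient of non-negative impression counts.

    Returns 0.0 when every item gets the same number of impressions
    and approaches 1.0 as impressions concentrate on a few items.
    """
    values = sorted(impressions)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0

    # Standard formula on sorted values: weight each value by its rank.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

Computed on, say, weekly impression counts per item, a value that keeps rising would suggest that exposure is drifting toward a shrinking set of content, which is worth investigating before it shows up in longer-term goals.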

Related resources: