Topic Models: Past, Present, Future

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: David Blei, co-creator of one of the most popular tools in text mining and machine learning.

I don’t remember when I first came across topic models, but I do remember being an early proponent of them in industry. I came to appreciate how useful they were for exploring and navigating large amounts of unstructured text, and was able to use them, with some success, in consulting projects. When an MCMC algorithm for fitting topic models came out, I even cooked up a Java program that I came to rely on (up until Mallet came along).

I recently sat down with David Blei, co-author of the seminal paper on topic models, who remains one of the leading researchers in the field. We talked about the origins of topic models, their applications, improvements to the underlying algorithms, and his new role in training data scientists at Columbia University.

Generating features for other machine learning tasks

Blei frequently interacts with companies that use ideas from his group’s research projects. He noted that people in industry often use topic models for “feature generation.” An added bonus is that topic models produce features that are easy to explain and interpret:

“You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.”
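
To make that “representation in terms of the discovered topics” pattern concrete, here is a minimal sketch in Python: fit a topic model, take each document’s topic proportions as its representation, and feed that representation to a downstream classifier. The toy corpus, labels, and number of topics are my own illustrative assumptions, not anything from the interview.

```python
# Topic proportions as interpretable features for a downstream classifier.
# Corpus, labels, and n_components are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "the team won the championship game last night",
    "the league signed a new television rights deal",
    "the company reported strong quarterly earnings",
    "investors reacted to the merger announcement",
]
labels = [0, 0, 1, 1]  # e.g., 0 = sports desk, 1 = business desk

pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    # Each document becomes a vector of topic proportions -- the
    # interpretable representation used as features downstream.
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
pipeline.fit(docs, labels)
print(pipeline.predict(["quarterly profits beat expectations"]))
```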


Time-turner: Strata San Jose 2015, day 2

[Our friends at Dato created an interesting content-based Strata session recommender. Check it out here.]

There are so many good talks happening at the same time that it’s impossible not to miss out on some good sessions. But imagine I had a time-turner necklace and could actually “attend” 2 (maybe 3) concurrent sessions. Taking into account my current personal interests and tastes, here’s how my day would look:

10:40 a.m.

11:30 a.m.

1:30 p.m.

2:20 p.m.

4 p.m.

Time-turner: Strata San Jose 2015, day 1

[Our friends at Dato created an interesting content-based Strata session recommender. Check it out here.]

There are so many good talks happening at the same time that it’s impossible not to miss out on some good sessions. But imagine I had a time-turner necklace and could actually “attend” 2 (maybe 3) concurrent sessions. Taking into account my current personal interests and tastes, here’s how my day would look:

10:40 a.m.

11:30 a.m.

1:30 p.m.

2:20 p.m.

4 p.m.

4:50 p.m.

Hardcore Data Science: 2015 California

Ben Recht and I hosted another great edition of Hardcore Data Science yesterday. From the very first talk, the room was full, the audience was attentive, and the energy in the room was high. It remained that way throughout the day.

This time around, I spent more time documenting the day on Twitter – enjoy!

Update (2015-02-25): Related tweets from the conference.

Update (2015-03-01): The largest Spark production deployments are in China. We were fortunate to have a speaker from Tencent willing to take the time to fly over and present at the conference during Chinese New Year.

Forecasting events, from disease outbreaks to sales to cancer research

[A version of this post appears on the O’Reilly Radar blog.]

The O’Reilly Data Show Podcast: Kira Radinsky on predicting events using machine learning, NLP, and semantic analysis.

Editor’s note: One of the more popular speakers at Strata + Hadoop World, Kira Radinsky was recently profiled in the new O’Reilly Radar report, Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education.

When I first took over organizing Hardcore Data Science at Strata + Hadoop World, one of the first speakers I invited was Kira Radinsky. Radinsky had already garnered international recognition for her work forecasting real-world events (disease outbreaks, riots, etc.). She’s currently the CTO and co-founder of SalesPredict, a start-up using predictive analytics to “understand who’s ready to buy, who may buy more, and who is likely to churn.”

I recently had a conversation with Radinsky, and she took me through the many techniques and subject domains from her past and present research projects. In grad school, she helped build a predictive system that combined newspaper articles, Wikipedia, and other open data sets. Through fine-tuned semantic analysis and NLP, Radinsky and her collaborators devised new metrics of similarity between events. The techniques she developed for that predictive software system are now the foundation of applications across many areas.
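
To give a feel for what a “metric of similarity between events” means, here is a deliberately crude baseline sketch in Python. It is not Radinsky’s method, which relies on much richer semantic resources; this version only scores pairs of event descriptions with TF-IDF cosine similarity.

```python
# Illustrative only: bag-of-words similarity between event descriptions.
# NOT Radinsky's metric -- just a generic stand-in to make the idea concrete.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

events = [
    "cholera outbreak follows severe flooding in coastal region",
    "heavy storms cause flooding and displacement in river delta",
    "central bank raises interest rates to curb inflation",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(events)
print(cosine_similarity(vectors).round(2))  # pairwise event-similarity matrix
```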

Ask Us Anything at Strata+Hadoop World

One of the most popular sessions at last year’s Strata+Hadoop World in Barcelona was Spark Camp. Midway through this sold-out immersion day, it occurred to me that an additional Q&A session would be helpful to attendees. With minimal prodding from Paco Nathan, the other instructors of Spark Camp gladly agreed, and so we ended up offering the first Ask Us Anything (AUA) session at one of our conferences.

We received great feedback from that first AUA session and decided to try again. At next week’s Strata+Hadoop World in San Jose, several groups of instructors will offer AUA sessions:

If you plan to attend Strata San Jose and are interested in any of the topics above, make time to meet the experts in these subjects and ask your questions in person.

Network structure and dynamics in online social systems

Understanding information cascades, viral content, and significant relationships.

[A version of this post appears on the O’Reilly Radar blog.]

I rarely work with social network data, but I’m familiar with the standard problems confronting data scientists who work in this area. These include questions pertaining to network structure, viral content, and the dynamics of information cascades.

At last year’s Strata + Hadoop World NYC, Cornell Professor and Nevanlinna Prize Winner Jon Kleinberg walked the audience through a series of examples from social network analysis, looking at the content of shared photos and text, as well as the structures of the networks. It was a truly memorable presentation from one of the foremost experts in network analysis. Each of the problems he discussed would be of interest to marketing professionals, and the analytic techniques he described were accessible to many data scientists. What struck me is that while these topics are easy to describe, framing the right question requires quite a bit of experience with the underlying data.

Predicting whether an information cascade will double in size

Can you predict whether a piece of information (say, a photo) will be shared only a few times or hundreds (if not thousands) of times? Large cascades are very rare, which makes predicting eventual size difficult. You either default to a pathological answer (after all, most pieces of information are shared only once), or you create a balanced data set (composed of an equal number of small and large cascades) and end up solving an artificial task.

Thinking of a social network as an information transport layer, Kleinberg and his colleagues instead set out to track the evolution of cascades. In the process, they framed an interesting balanced algorithmic prediction problem: given a cascade of size k, predict whether it will reach size 2k (it turns out that 2k is roughly the median final size of a cascade, conditional on its reaching size k).
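
To make that balanced framing concrete, here is a minimal sketch in Python: observe each cascade up to its first k reshares, compute a few simple features at that point, and predict whether it eventually doubles. The synthetic cascade records and the handful of temporal/structural features are my own illustrative assumptions, standing in for the much richer feature sets used in the actual research.

```python
# "Will this cascade double?" -- balanced prediction task, observed at size K.
# Synthetic data and features are illustrative assumptions, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

K = 5  # observe each cascade up to its first K reshares

def features(reshare_times, root_followers):
    """Simple temporal/structural features at the moment the cascade hits size K."""
    times = np.asarray(reshare_times[:K])
    return [
        times[-1] - times[0],     # time taken to reach K reshares
        np.mean(np.diff(times)),  # mean gap between consecutive reshares
        root_followers,           # audience of the original poster
    ]

# Hypothetical cascades: (reshare timestamps, root's follower count, final size).
cascades = [
    ([0, 1, 2, 3, 4], 50, 6),
    ([0, 5, 20, 40, 90], 10, 5),
    ([0, 1, 1, 2, 2], 2000, 40),
    ([0, 2, 4, 9, 15], 300, 12),
    ([0, 10, 30, 60, 120], 80, 7),
    ([0, 1, 3, 4, 6], 900, 25),
]

X = np.array([features(t, f) for t, f, _ in cascades])
y = np.array([final >= 2 * K for _, _, final in cascades])  # did it double?

clf = LogisticRegression()
print(cross_val_score(clf, X, y, cv=3))  # balanced task, so ~0.5 is the chance baseline
```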