[A version of this post appears on the O’Reilly Radar blog.]
The O’Reilly Data Show Podcast: Kira Radinsky on predicting events using machine learning, NLP, and semantic analysis.
Editor’s note: One of the more popular speakers at Strata + Hadoop World, Kira Radinsky was recently profiled in the new O’Reilly Radar report, Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education.
When I first took over organizing Hardcore Data Science at Strata + Hadoop World, one of the first speakers I invited was Kira Radinsky. Radinsky had already garnered international recognition for her work forecasting real-world events (disease outbreaks, riots, etc.). She’s currently the CTO and co-founder of SalesPredict, a start-up using predictive analytics to “understand who’s ready to buy, who may buy more, and who is likely to churn.”
I recently had a conversation with Radinsky, and she took me through the many techniques and subject domains from her past and present research projects. In grad school, she helped build a predictive system that combined newspaper articles, Wikipedia, and other open data sets. Through fine-tuned semantic analysis and NLP, Radinsky and her collaborators devised new metrics of similarity between events. The techniques she developed for that predictive software system are now the foundation of applications across many areas.
The challenges of prediction: from news headlines to cholera
Early versions of the predictive system did not yield interesting results until Radinsky and her collaborators discovered the additional insights they could derive from correlations:
“The problem was, when we were looking at only patterns of causality, we used to have only trivial things. I’ll give an example. An Iranian professor was killed, and the system would output ‘and the funeral would be held,’ which is correct. You would even find it in the news … The problem is that when you build a system based on only those causality patterns, you only train it on what people already know … The next step that we did was add correlations in addition to what people already know, as cause and effect. We had this graph of causality, and we added additional correlations. Again, we don’t know their cause and effect, but we are going to use them when trying to predict future news events.
… Cholera is a waterborne disease … The system knew that storms can cause cholera. Again, not all storms cause cholera. So, we did storyline detection. We took all the news, and were trying to align the stories in a way that the same storyline or the same articles in the same topic would be aligned. This is a very well-studied academic topic, and we applied it in a way that would actually work for finding correlations of predictions and would look for correlations in similar storylines. What we found is that in all the storylines, in the discussed storms that eventually caused cholera, you would find that two years before that, you would have a drought in those areas. This is very surprising. The thing is, it was based on around six examples from Angola since 2006, which is not a lot of examples.
… In Bangladesh, since I think 1964, there were 90 significant cases of cholera. In 84% of them, before that, you had a drought. The thing is, what’s in common between Bangladesh and Angola? What we found out is that in countries with low GDP, not surprising, poor countries have high chances of cholera. Countries with low concentrations of water, they have this pattern of drought and then two years later storms, and then cholera. This is very surprising because, again, cholera is a waterborne disease. I would expect it to happen in places that have a lot of water.”
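To make the "in 84% of them, before that, you had a drought" style of statistic concrete, here is a minimal Python sketch that computes the share of events preceded by a candidate precursor within a look-back window. It is only an illustration under stated assumptions, not the system Radinsky describes: the function name `share_preceded`, the 730-day window, and all of the dates are hypothetical, and the real work involved storyline detection over news archives rather than clean lists of dates.

```python
from datetime import date


def share_preceded(events, precursors, max_gap_days=730):
    """Return the fraction of events that have at least one precursor
    in the max_gap_days before them (hypothetical helper)."""
    if not events:
        return 0.0
    hits = 0
    for ev in events:
        # Count the event if any precursor falls inside the look-back window.
        if any(0 < (ev - p).days <= max_gap_days for p in precursors):
            hits += 1
    return hits / len(events)


# Made-up toy dates for illustration; these are NOT real outbreak records.
droughts = [date(2004, 6, 1), date(2009, 3, 1)]
outbreaks = [date(2006, 2, 1), date(2010, 9, 1), date(2015, 1, 1)]

print(share_preceded(outbreaks, droughts))
```

With these toy dates, two of the three outbreaks fall within two years of a drought. A share like this only flags a correlation worth investigating; as the quote notes, a handful of examples is not a lot, and the pattern says nothing by itself about cause and effect.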
Predictive analytics for sales: no black boxes
In some domains, you need models that are easy to explain and interpret. Radinsky explained:
“The way the sales process usually works between two businesses, is they get a big list of potential leads, potential people that can buy from them — either people registering on their website, people giving them their business card, random names sometimes. They get, let’s say, a list of 20,000 people, but there are only five sales reps. They need to start calling them and generating opportunities to actually start closing deals with them. This is how this world works. … [Our system] tells them which lead is going to close and the size of the deal they can expect from them so they can actually manage the pricing. It’s similar with customers you already have: what’s the probability of churn. The issue with that is that when you’re building a prediction system for somebody to use … it has to be [explainable] in natural language … Nobody likes black boxes. Even when you try to predict future news events and you don’t explain why or what’s the pattern behind that, there’s no action item that they act on.”
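One simple way to avoid a black box is to use a model whose score decomposes into per-feature contributions that can be turned into a sentence. The sketch below assumes a logistic model with hand-picked weights. The feature names, weights, and the `score_lead` and `explain` helpers are all hypothetical, and nothing here reflects how SalesPredict's system actually works.

```python
import math

# Hypothetical weights; in practice they would be learned from past deals.
WEIGHTS = {
    "visited_pricing_page": 1.2,
    "company_size_log10": 0.5,
    "days_since_contact": -0.03,
}
BIAS = -2.0


def score_lead(features):
    """Return (probability of closing, per-feature contributions to the logit)."""
    contributions = {
        name: WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS
    }
    logit = BIAS + sum(contributions.values())
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, contributions


def explain(contributions, top_n=2):
    """Describe the top_n largest contributions in plain language."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    parts = []
    for name, value in ranked[:top_n]:
        direction = "raises" if value > 0 else "lowers"
        parts.append(f"{name} {direction} the score by {abs(value):.2f}")
    return "; ".join(parts)


# A made-up lead for illustration.
lead = {"visited_pricing_page": 1, "company_size_log10": 3.0, "days_since_contact": 10}
prob, contribs = score_lead(lead)
print(f"{prob:.2f}", explain(contribs))
```

Sorting leads by `prob` gives a small team a call order for a long list, and the text from `explain` gives each rep a reason they can act on.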
Cancer research: same algorithms, new predictions
We closed by discussing recent applications of predictive analytics to medicine. Radinsky described how she recently teamed up with medical researchers to see if her techniques and tools can be used in the fight against cancer:
“Today, we’re working with doctors to try to predict different types of cancers using exactly the same algorithms. They’re providing us data about patients since 1975, like blood samples that were taken for those patients every year — similar to a sales process where you get some kind of input from your customers on a yearly basis, if you have a long period of interaction with them. Based on those, we’re trying to predict who’s going to have cancer or not in the next 20 or 30 years, based on this historical data.”