From search to distributed computing to large-scale information extraction

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

February 2016 marks the 10th anniversary of Hadoop, at a time when many IT organizations actively use Hadoop and/or one of the open source big data projects that originated after it and, in some cases, depend on it.

During the latest episode of the O’Reilly Data Show Podcast, I had an extended conversation with Mike Cafarella, assistant professor of computer science at the University of Michigan. Along with Strata + Hadoop World program chair Doug Cutting, Cafarella is the co-founder of both Hadoop and Nutch. In addition, Cafarella was the first contributor to HBase.

We talked about the origins of Nutch, Hadoop (HDFS, MapReduce), and HBase, and about his decision to pursue an academic career and step away from these projects. Cafarella’s pioneering contributions to open source search and distributed systems fit neatly with his work in information extraction. We discussed a new startup he recently co-founded, ClearCutAnalytics, to commercialize a highly regarded academic project for structured data extraction (full disclosure: I’m an advisor to ClearCutAnalytics). As I noted in a previous post, information extraction (from a variety of data types and sources) is an exciting area that will lead to the discovery of new features (i.e., variables) that may end up improving many existing machine learning systems.

Here are a few snippets from our conversation:

The early days of Nutch and Hadoop

In the timeframe of the first year of Nutch’s existence, basically 2002 to 2003, we had a good amount of success in building up the rudiments of the search engine … we had the crawler; the indexer, which was based substantially on Lucene; the front end; the ranker; and so on. … We felt that was decent, but what was really holding us back was the scalability of the index construction. We were really held back by the fact that we were doing it on just a single machine.

… We spent about a summer working on this distributed indexing mechanism … We finished it, and I felt pretty good about it, and then something like two hours later, we read the Google File System paper and realized, “Boy, actually that would be pretty handy; we could really use that.” We threw out a chunk of it, implemented a very early version of the ideas in that paper, which we called HDFS, the Hadoop Distributed File System.

A decade later: Watching Hadoop mature and get adopted

I guess on the one hand, I never would’ve thought that [Hadoop would be so popular 10 years later]. … In the short term, it looked like it was impossible, in many ways, that it could work. On the other hand, I always thought it had to happen. The techniques were useful enough that I knew they would get broad acceptance. The surprising thing to me is that it happened through Hadoop rather than some other mechanism.

… The experience with Nutch that really helped us [was] when we saw those original papers, it was clear to me that was the right way to do it and it would have very broad applications. That I never really had any question in my mind about. I am kind of surprised that it was code that we wrote, and eventually many other people rewrote, that eventually did the trick.

Large-scale information extraction at very high accuracy

DeepDive is another mechanism for information extraction. This area as an area of academic interest has been around at least since the early 90s but DeepDive to me is really a remarkable project because of its ability to get structured data out of unstructured information at very high accuracy.

… Under the covers, we use statistical methods just like everyone else, and there’s some training. The emphasis in DeepDive is a little bit different from a lot of machine learning projects in a few different ways. First of all, from the user’s point of view, in many ways it doesn’t look like a machine learning project because we try to put forward the idea that the emphasis is on features rather than the machine learning internals. We want the user to focus as much as possible on getting high accuracy by writing great features and not try to muck with the guts of the internals.

… We talk about DeepDive focusing on bringing the features first and really allowing the user of DeepDive to focus on: what is the piece of information that will most improve my extractions? The reason that that’s possible is a probabilistic inference core that can go to very, very large numbers of variables — more than previous efforts have been able to obtain.
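To make the "features first" idea concrete, here is a minimal sketch in Python. It is not DeepDive's API: the feature functions, the tiny training set, and the spouse-relation task are all hypothetical, and plain logistic regression stands in for DeepDive's probabilistic inference core. The point is only the division of labor: the user edits the list of feature functions, while the learning code stays fixed.

```python
import math

# Hypothetical feature functions (not DeepDive's API): each inspects a
# candidate sentence and returns 1.0 if the cue is present, else 0.0.
def mentions_married(text):
    return 1.0 if "married" in text.lower() else 0.0

def mentions_wife_or_husband(text):
    t = text.lower()
    return 1.0 if ("wife" in t or "husband" in t) else 0.0

def mentions_sibling(text):
    t = text.lower()
    return 1.0 if ("brother" in t or "sister" in t) else 0.0

# The only part a user is expected to touch: add or refine features here.
FEATURES = [mentions_married, mentions_wife_or_husband, mentions_sibling]

def featurize(text):
    return [f(text) for f in FEATURES]

def predict(weights, bias, x):
    # Logistic model: probability that the sentence expresses a spouse relation.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=500, lr=0.5):
    # Generic learner (stochastic gradient descent on log loss); it knows
    # nothing about the task beyond the feature vectors it is handed.
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text)
            err = predict(weights, bias, x) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Made-up labeled sentences: 1 = spouse relation, 0 = anything else.
TRAINING = [
    ("Alice married Bob in 2001.", 1),
    ("Carol and her husband Dave attended.", 1),
    ("Erin is the sister of Frank.", 0),
    ("Grace met Heidi at a conference.", 0),
]

weights, bias = train(TRAINING)
p_spouse = predict(weights, bias, featurize("Ivan and his wife Judy moved to Ann Arbor."))
p_other = predict(weights, bias, featurize("Mallory emailed her brother Niaj."))
print(f"spouse-like sentence: {p_spouse:.2f}, other sentence: {p_other:.2f}")
```

In this toy setup, improving accuracy means writing a better feature function and appending it to `FEATURES`; `train` and `predict` never change.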


You can listen to our entire interview in the SoundCloud player above, or subscribe through Stitcher, SoundCloud, TuneIn, or iTunes.
