[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show Podcast: Mike Tung on large-scale structured data extraction, intelligent systems, and the importance of knowledge databases.
Extracting structured information from semi-structured or unstructured data sources (“dark data”) is an important problem. One can take it a step further by attempting to automatically build a knowledge graph from the same data sources. Knowledge databases and graphs are built using (semi-supervised) machine learning, and then subsequently used to power intelligent systems that form the basis of AI applications. The more advanced messaging and chat bots you’ve encountered rely on these knowledge stores to interact with users.
In this episode of the Data Show, I spoke with Mike Tung, founder and CEO of Diffbot – a company dedicated to building large-scale knowledge databases. Diffbot is at the heart of many web applications, and it’s starting to power a wide array of intelligent applications. We talked about the challenges of building a web-scale platform for doing highly accurate, semi-supervised, structured data extraction. We also took a tour through the AI landscape, and the early days of self-driving cars.
Here are some highlights from our conversation:
Building the largest structured database of knowledge
If you think about the Web as a virtual world, there are more pixels on the surface area of the Web than there are square millimeters on the surface of the earth. As a surface for computer vision and parsing, it’s amazing, and you don’t have to actually build a physical robot in order to traverse the Web. It is pretty tricky though.
… For example, Google has a knowledge graph team that, as I’m sure your listeners are aware, came from a startup that was building something called Freebase, which is crowdsourced, kind of like a Wikipedia for data. They’ve continued to build upon that at Google, adding more and more human curators. … It’s a mix of software, but there are definitely thousands and thousands of people who actually contribute to their knowledge graph. In contrast, we are a team of 15 of the top AI people in the world. We don’t have anyone who’s curating the knowledge. All of the knowledge is completely synthesized by our AI system. When our customers use our service, they’re directly using the output of the AI. There’s no human involved in the loop of our business model.
… Our high level goal is to build the largest structured database of knowledge. The most comprehensive map of all of the entities and the facts about those entities. The way we’re doing it is by combining multiple data sources. One of them is the Web, so we have this crawler that’s crawling the entire surface area of the Web.
Knowledge component of an AI system
If you look at other groups doing AI research, a lot of them are focused on very much the same academic style of research, which is coming up with new algorithms and publishing to sort of the same conferences. If you look at some of these industrial AI labs, they’re doing the same kind of work that they would be doing in academia, whereas what we’re doing, in terms of building this large data set, would not have happened without starting this effort. … I think you need really good algorithms, and you also need really good data.
… One of the key things we believe is that it might be possible to build a human-level reasoning system if you just had enough structured information to do it on.
… Basically, the semantic web vision never really got fully realized because of the chicken-and-egg problem. You need enough people to annotate data, and annotate it for the purpose of the semantic web—to build a comprehensiveness of knowledge—and not for the actual purpose, which is perhaps showing web pages to end users.
Then, with this comprehensiveness of knowledge, people can build a lot of apps on top of it. Then the idea would be this virtuous cycle where you have a bunch of killer apps for this data, and then that would prompt more people to tag more things. That virtuous cycle never really got going in my view, and there have been a lot of efforts to do that over the years with RDF/RSS and things like that.
… What we’re trying to do is basically take the annotation aspect out of the hands of humans. The idea here is that these AI algorithms are good enough that we can actually have AI build the semantic web.
Leveraging open source projects: WebKit and Gigablast
… Roughly, what happens when our robot first encounters a page is we render the page in our own customized rendering engine, which is a fork of WebKit that’s basically had its face ripped off. It doesn’t have all the human niceties of a web browser, and it runs much faster than a browser because it doesn’t need those human-facing components. … The other difference is we’ve instrumented the whole rendering process. We have access to all of the pixels on the page for each XY position. … [We identify many] features that feed into our semi-supervised learning system. Then millions of lines of code later, out comes knowledge.
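[Editor’s illustration, not from the conversation: a minimal sketch of how rendered layout geometry could be turned into numeric features for a learning system, in the spirit of the instrumented rendering described above. The `RenderedBlock` record, the `visual_features` function, and the specific features are hypothetical and chosen for illustration; they are not Diffbot’s actual schema or feature set.]

```python
from dataclasses import dataclass


@dataclass
class RenderedBlock:
    """Hypothetical record for one page element after layout (an assumption)."""
    text: str
    x: float          # left edge, in pixels
    y: float          # top edge, in pixels
    width: float      # in pixels
    height: float     # in pixels
    font_size: float  # in pixels


def visual_features(block, page_width, page_height, median_font_size):
    """Convert a block's rendered geometry into numeric features.

    The feature choices are illustrative assumptions, not a known
    production feature set.
    """
    return {
        # Horizontal center of the block, as a fraction of page width.
        "rel_x_center": (block.x + block.width / 2) / page_width,
        # Top edge of the block, as a fraction of page height.
        "rel_y_top": block.y / page_height,
        # Share of the page's pixel area covered by the block.
        "area_fraction": (block.width * block.height) / (page_width * page_height),
        # Font size relative to the page's typical font size.
        "font_ratio": block.font_size / median_font_size,
        # Number of characters of visible text.
        "text_length": len(block.text),
    }


# Example: a wide block in large type near the top of a page.
headline = RenderedBlock("Example headline", x=100, y=50, width=800, height=60, font_size=32)
features = visual_features(headline, page_width=1000, page_height=2000, median_font_size=16)
print(features)
```

[Feature vectors of this kind could then be passed to a classifier that labels blocks as, say, title, author, or body text. That labeling step is omitted here.]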
… Our VP of search, Matt Wells, is the founder of the Gigablast search engine. Years ago, Gigablast competed against Google and Inktomi and AltaVista and others. Gigablast actually had a larger real-time search index than Google at that time. Matt is a world expert in search and has been developing his C++ crawler Gigablast for, I would say, almost a decade. … Gigablast scales much, much better than Lucene. I know because I’m a former user of Lucene myself. It’s a very elegant system. It’s a fully symmetric, masterless system. It has its own UDP-based communications protocol. It includes a full web crawler and indexer. It has real-time search capability.
Editor’s note: Mike Tung is on the advisory committee for the upcoming O’Reilly Artificial Intelligence conference.