Commercial speech recognition systems in the age of big data and deep learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Yishay Carmiel on applications of deep learning in text and speech.

Building intelligent applications with deep learning and TensorFlow

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Rajat Monga on the current state of TensorFlow and training large-scale deep neural networks.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with Rajat Monga, who serves as a director of engineering at Google and manages the TensorFlow engineering team. We talked about how he ended up working on deep learning, the current state of TensorFlow, and the applications of deep learning to products at Google and other companies.

Here are some highlights from our conversation:

Using AI to build a comprehensive database of knowledge

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Mike Tung on large-scale structured data extraction, intelligent systems, and the importance of knowledge databases.


Extracting structured information from semi-structured or unstructured data sources (“dark data”) is an important problem. One can take it a step further by attempting to automatically build a knowledge graph from the same data sources. Knowledge databases and graphs are built using (semi-supervised) machine learning, and then subsequently used to power intelligent systems that form the basis of AI applications. The more advanced messaging and chat bots you’ve encountered rely on these knowledge stores to interact with users.
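The idea of a knowledge database powering a chat bot can be made concrete with a toy sketch. Below, facts live as (subject, predicate, object) triples and a simple lookup answers entity questions. The class, entity names, and schema here are invented for illustration; they are not Diffbot's actual data model.

```python
# A minimal triple store: facts are (subject, predicate, object),
# and queries retrieve objects for a given subject and predicate.
from collections import defaultdict

class KnowledgeStore:
    def __init__(self):
        # subject -> list of (predicate, object) pairs
        self._facts = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._facts[subject].append((predicate, obj))

    def query(self, subject, predicate=None):
        """Return all facts about a subject, or just one predicate's values."""
        facts = self._facts.get(subject, [])
        if predicate is None:
            return facts
        return [o for p, o in facts if p == predicate]

store = KnowledgeStore()
store.add("Ada Lovelace", "occupation", "mathematician")
store.add("Ada Lovelace", "born", "1815")
print(store.query("Ada Lovelace", "born"))  # ['1815']
```

A bot answering "when was Ada Lovelace born?" reduces to exactly this kind of lookup once the question has been parsed into a subject and predicate.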

In this episode of the Data Show, I spoke with Mike Tung, founder and CEO of Diffbot – a company dedicated to building large-scale knowledge databases. Diffbot is at the heart of many web applications, and it’s starting to power a wide array of intelligent applications. We talked about the challenges of building a web-scale platform for doing highly accurate, semi-supervised, structured data extraction. We also took a tour through the AI landscape, and the early days of self-driving cars.

Here are some highlights from our conversation:

Building the largest structured database of knowledge

If you think about the Web as a virtual world, there are more pixels on the surface area of the Web than there are square millimeters on the surface of the earth. As a surface for computer vision and parsing, it’s amazing, and you don’t have to actually build a physical robot in order to traverse the Web. It is pretty tricky though.

… For example, Google has a knowledge graph team—I’m sure your listeners are aware it came from a startup that was building something called Freebase, which is crowdsourced, kind of like a Wikipedia for data. They’ve continued to build upon that at Google, adding more and more human curators. … It’s a mix of software, but there are definitely thousands and thousands of people that actually contribute to their knowledge graph. Whereas in contrast, we are a team of 15 of the top AI people in the world. We don’t have anyone that’s curating the knowledge. All of the knowledge is completely synthesized by our AI system. When our customers use our service, they’re directly using the output of the AI. There’s no human involved in the loop of our business model.

… Our high level goal is to build the largest structured database of knowledge. The most comprehensive map of all of the entities and the facts about those entities. The way we’re doing it is by combining multiple data sources. One of them is the Web, so we have this crawler that’s crawling the entire surface area of the Web.

Knowledge component of an AI system

If you look at other groups doing AI research, a lot of them are focused on much the same academic style of research, which is coming up with new algorithms and publishing to the same conferences. If you look at some of these industrial AI labs, they’re doing the same kind of work they would be doing in academia, whereas what we’re doing, in terms of building this large data set, would not have existed without this effort. … I think you need really good algorithms, and you also need really good data.

… One of the key things we believe is that it might be possible to build a human-level reasoning system, if you just had enough structured information to do it on.

… Basically, the semantic web vision never really got fully realized because of the chicken-and-egg problem. You need enough people to annotate data, and annotate it for the purpose of the semantic web—to build a comprehensiveness of knowledge—and not for the actual purpose, which is perhaps showing web pages to end users.

Then, with this comprehensiveness of knowledge, people can build a lot of apps on top of it. Then the idea would be this virtuous cycle where you have a bunch of killer apps for this data, and then that would prompt more people to tag more things. That virtuous cycle never really got going in my view, and there have been a lot of efforts to do that over the years with RDF/RDFS and things like that.

… What we’re trying to do is basically take the annotation aspect out of the hands of humans. The idea here is that these AI algorithms are good enough that we can actually have AI build the semantic web.
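What "AI doing the annotation" might look like can be sketched in miniature: a pattern-based extractor that turns free text into triples, with no human tagging involved. The patterns, relations, and example sentences below are hypothetical stand-ins for the far more sophisticated learning systems described in the conversation.

```python
# A toy machine annotator: regex patterns extract
# (subject, relation, object) triples from raw text.
import re

PATTERNS = [
    (re.compile(r"(?P<s>[A-Z][\w ]+?) was founded by (?P<o>[A-Z][\w ]+)"),
     "founded_by"),
    (re.compile(r"(?P<s>[A-Z][\w ]+?) is headquartered in (?P<o>[A-Z][\w ]+)"),
     "headquarters"),
]

def extract_triples(text):
    """Scan text with each pattern and emit matched triples."""
    triples = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(text):
            triples.append((m.group("s").strip(), relation, m.group("o").strip()))
    return triples

text = "Diffbot was founded by Mike Tung. Acme Corp is headquartered in Springfield."
print(extract_triples(text))
```

Real systems learn their extractors from data rather than hand-writing rules, but the output, a growing pile of structured triples with no human in the loop, is the same in spirit.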

Leveraging open source projects: WebKit and Gigablast

… Roughly, what happens when our robot first encounters a page is we render the page in our own customized rendering engine, which is a fork of WebKit that’s basically had its face ripped off. It doesn’t have all the human niceties of a web browser, and it runs much faster than a browser because it doesn’t need those human-facing components. … The other difference is we’ve instrumented the whole rendering process. We have access to all of the pixels on the page for each XY position. … [We identify many] features that feed into our semi-supervised learning system. Then millions of lines of code later, out comes knowledge.
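The step from rendered pixels to learning-system features can be illustrated with a small sketch. The `RenderedBox` structure and the particular features below are invented for this example; Diffbot's instrumented WebKit fork exposes far richer signals, but the shape of the idea, visual geometry becoming a feature vector, is the same.

```python
# Each rendered element carries pixel coordinates and visual attributes;
# those become features for a downstream (semi-supervised) learner.
from dataclasses import dataclass

@dataclass
class RenderedBox:
    text: str
    x: int          # left edge in pixels
    y: int          # top edge in pixels
    width: int
    height: int
    font_size: int

def visual_features(box, page_width=1280):
    """Turn one rendered element into a feature dictionary."""
    return {
        "area": box.width * box.height,
        "center_x_ratio": (box.x + box.width / 2) / page_width,
        "above_fold": box.y < 600,  # visible without scrolling?
        "font_size": box.font_size,
        "text_len": len(box.text),
        "looks_like_title": box.font_size >= 24 and box.y < 300,
    }

headline = RenderedBox("Building a knowledge graph", x=100, y=80,
                       width=800, height=40, font_size=32)
print(visual_features(headline))
```

A large, high-on-the-page, big-font box is probably a title; combine thousands of such weak visual cues and a learner can start labeling page regions reliably.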

… Our VP of search, Matt Wells, is the founder of the Gigablast search engine. Years ago, Gigablast competed against Google and Inktomi and AltaVista and others. Gigablast actually had a larger real-time search index than Google at that time. Matt is a world expert in search and has been developing his C++ crawler Gigablast for, I would say, almost a decade. … Gigablast scales much, much better than Lucene. I know because I’m a former user of Lucene myself. It’s a very elegant system. It’s a fully symmetric, masterless system. It has its own UDP-based communications protocol. It includes a full web crawler, indexer. It has real-time search capability.
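The core structure behind a real-time search index like the one described can be sketched as an inverted index that accepts new documents and makes them searchable immediately. Gigablast itself is a distributed, masterless C++ system; this single-node Python toy only shows the central idea.

```python
# A minimal inverted index: term -> set of document ids,
# with AND semantics across query terms and real-time inserts.
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # term -> doc ids containing it
        self._docs = {}

    def add(self, doc_id, text):
        """Index a document; it is searchable as soon as this returns."""
        self._docs[doc_id] = text
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result

idx = InvertedIndex()
idx.add("d1", "real time web search")
idx.add("d2", "web crawler and indexer")
print(idx.search("web search"))  # {'d1'}
```

The hard parts Gigablast solves, sharding postings across symmetric nodes, a custom UDP transport, crawl scheduling, are all about making this simple structure work at web scale.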

Editor’s note: Mike Tung is on the advisory committee for the upcoming O’Reilly Artificial Intelligence conference.

Hardcore Data Science, California 2016

Ben Recht and I organized another great edition of Hardcore Data Science in San Jose today. As I was preparing to host the track, I had an inkling we had another outstanding sequence of presentations. The day covered hot topics like deep neural networks, practical advice on how to do data science & machine learning at scale, feature engineering, graphs, anomaly detection, structured data extraction, and many other topics at the heart of A.I. From the very first talk, sessions were well attended, the audience was attentive, and the energy in the room was high – and it remained that way throughout the day. A summary can be found below.


Welcome to Intelligence Matters

Casting a critical eye on the exciting developments in the world of AI

[A version of this post appears on the O’Reilly Radar blog and Forbes.]

Editor’s note: this post was co-authored by Ben Lorica and Roger Magoulas

Today the O’Reilly Radar is kicking off Intelligence Matters (IM), a new series exploring current issues in artificial intelligence, including the connection between artificial intelligence, human intelligence and the brain. IM offers a thoughtful take on recent developments, including a critical, and sometimes skeptical, view when necessary.

True AI has been “just around the corner” for 60 years, so why should O’Reilly start covering AI in a big way now? As computing power catches up to scientific and engineering ambitions, and as our ability to learn directly from sensory signals — i.e., big data — increases, intelligent systems are having a real and widespread impact. Every Internet user benefits from these systems today — they sort our email, plan our journeys, answer our questions, and protect us from fraudsters. And, with the Internet of Things, these systems have already started to keep our houses and offices comfortable and well-lit, our data centers running more efficiently, our industrial processes humming, and are even driving our cars.

Of course, these systems don’t exist in a vacuum; in fact, some of the most fascinating aspects of machine intelligence arise from their deep interconnections with other technologies. The impact of big data and the Internet of Things will both be magnified once these massive information streams can be interpreted and acted upon by truly intelligent systems.