[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show Podcast: Rajat Monga on the current state of TensorFlow and training large-scale deep neural networks.
In this episode of the O’Reilly Data Show, I spoke with Rajat Monga, who serves as a director of engineering at Google and manages the TensorFlow engineering team. We talked about how he ended up working on deep learning, the current state of TensorFlow, and the applications of deep learning to products at Google and other companies.
Here are some highlights from our conversation:
Deep learning at Google
There’s not going to be too many areas left that run without machine learning that you can program. The data is too much, there’s just too much for humans to handle. … Over the last few years, and this is something we’ve seen at Google, we’ve seen hundreds of products move to deep learning, and gain from that. In some cases, these are products that were actually applying machine learning that had been using traditional methods for a long time and had experts. For example, search, we had hundreds of signals in there, and then we applied deep learning. That was the last two years or so.
For somebody who is not familiar with deep learning, my suggestion would be to start from an example that is closest to your problem, and then try to adapt it to your problem. Start simple, don’t go to very complex things, there are many things you can do, even with simple models.
TensorFlow makes deep learning more accessible
At Google, I would say there are the machine learning researchers who are pushing machine learning research, then there are data scientists who are focusing on applying machine learning to their problems … We have a mix of people—some are people applying TensorFlow to their actual problems.
They don’t always have a machine learning background. Some of them do, but a large number of them don’t. They’re usually developers who are good at writing software. They know maybe a little bit of math so they can pick it up, in some cases not that much at all, but who can take these libraries if there are examples. They start from those examples, maybe ask a few questions on our internal boards, and then go from there. In some cases they may have a new problem, they want some inputs on how to formulate that problem using deep learning, and we might guide them or point them to an example of how you might approach their problem. Largely, they’ve been able to take TensorFlow and do things on their own. Internally, we are definitely seeing these tools and techniques being used by people who have never done machine learning before.
Synchronous and asynchronous methods for training deep neural networks
When we started out back in 2011, everybody was using stochastic gradient descent. It’s extremely efficient in what it does, but when you want to scale beyond 10 or 20 machines, it makes it hard to scale, so what do we do? At that time there were a couple of papers. One was on the HOGWILD! approach that people had done on a single machine … That was very interesting. We thought, can we make this work across the network, across many, many machines? We did some experiments and started tuning it, and it worked well. We were actually able to scale it to a large number of workers, hundreds of workers in some cases across thousands of machines, and that worked pretty well. Over time, we’d always had another question: is the asynchronous nature actually helping or making things worse? Finally last year, we started to experiment and try to understand what’s happening, and as part of that, we realized if we could do synchronous well, it actually is better.
… With the asynchronous stuff, we had these workers and they would work completely independently of each other. They would just update things on the parameter server when they had gradients, they would send it back to the parameter server, it would update, and then fetch the next set of parameters.
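To make the asynchronous pattern concrete, here is a minimal single-process sketch. It is not TensorFlow code and not the system described in the conversation: the `ParameterServer` class, the `gradient` and `train_async` helpers, the toy linear model, and all hyperparameters are assumptions chosen for illustration. Each simulated worker computes a gradient on its own, possibly stale, copy of the parameters, pushes it to the server, and then fetches the latest values.

```python
import random


class ParameterServer:
    """Hypothetical in-memory stand-in for a parameter server."""

    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def fetch(self):
        # Workers receive a snapshot copy of the current parameters.
        return list(self.params)

    def push(self, grads):
        # Apply the update immediately, without waiting for other workers.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]


def gradient(params, batch):
    """Gradient of mean squared error for the toy model y = w * x + b."""
    w, b = params
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [gw, gb]


def train_async(data, num_workers=4, steps=3000, lr=0.02, seed=0):
    rng = random.Random(seed)
    server = ParameterServer([0.0, 0.0], lr)
    # Every worker starts from its own snapshot of the parameters.
    snapshots = [server.fetch() for _ in range(num_workers)]
    for _ in range(steps):
        # A random worker finishes next, standing in for uneven machine speeds.
        k = rng.randrange(num_workers)
        batch = rng.sample(data, 8)
        # The gradient is computed on that worker's possibly stale snapshot.
        server.push(gradient(snapshots[k], batch))
        # Only this worker refreshes its copy; the others keep working on theirs.
        snapshots[k] = server.fetch()
    return server.params


# Noise-free toy data drawn from y = 3x + 1.
data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = train_async(data)
print(w, b)
```

Nothing in the loop waits on any particular worker, which is the property that makes the approach tolerant of slow or failed machines. The cost is that some updates are computed from out-of-date parameters.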
… From a systems perspective, it’s nice, because it scales very, very well. It’s okay if a few workers died, that’s fine, all the others will continue to make progress. Now, with the synchronous approach, what we want to do is to send parameters out to all the workers, have them compute gradients, send those back, combine those together, and then apply them. Now, across many machines, you can do this, but the issue is if some of them start to slow down or fail, what happens then? That’s always a tricky thing with the synchronous approach, and that’s hard to scale. That’s probably the biggest reason people hadn’t pushed toward this earlier.
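The synchronous steps described above (send parameters out, compute gradients, combine them, apply them) can be sketched in the same toy style. Again, this is a single-process illustration under assumptions, not TensorFlow's implementation: `train_sync`, `gradient`, the linear model, and the hyperparameters are all hypothetical.

```python
import random


def gradient(params, batch):
    """Gradient of mean squared error for the toy model y = w * x + b."""
    w, b = params
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [gw, gb]


def train_sync(data, num_workers=4, steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0]
    for _ in range(steps):
        # 1. Send the same parameters to every worker.
        # 2. Each worker computes a gradient on its own mini-batch.
        grads = [gradient(params, rng.sample(data, 8))
                 for _ in range(num_workers)]
        # 3. Barrier: the step cannot proceed until all gradients are in.
        avg = [sum(g[i] for g in grads) / num_workers
               for i in range(len(params))]
        # 4. Apply the combined gradient once.
        params = [p - lr * a for p, a in zip(params, avg)]
    return params


# Noise-free toy data drawn from y = 3x + 1.
data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = train_sync(data)
print(w, b)
```

In this simulation the barrier in step 3 is free. On a real cluster it is where a slow or failed worker would stall the whole step, which is the scaling difficulty mentioned above.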
Related resources
- Hello, TensorFlow: Building and training your first TensorFlow graph from the ground up
- TensorFlow for poets: How to build your own image classifier with no coding
- In my conversation with Rajat Monga, I alluded to these recent papers on asynchronous and synchronous methods for training deep neural networks: (1) Revisiting Distributed Synchronous SGD, (2) Asynchrony begets Momentum, with an Application to Deep Learning, (3) Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs