[A version of this post appears on the O’Reilly Radar.]
The O’Reilly Data Show Podcast: Christopher Nguyen on the early days of Apache Spark, deep learning for time-series and transactional data, innovation in China, and AI.
In this episode of the O’Reilly Data Show, I spoke with Christopher Nguyen, CEO and co-founder of Arimo. Nguyen and Arimo were among the first adopters and proponents of Apache Spark, Alluxio, and other open source technologies. Most recently, Arimo’s suite of analytic products has relied on deep learning to address a range of business problems.
Here are some highlights from our conversation:
The early days of Arimo and Apache Spark
When we started Arimo (our company name then was Adatao), the vision was about big data and machine learning. At the time, the industry had just refactored itself into what I call the ‘big data layer’—big data in the sense of the layer at the bottom, the storage layer. I knew there needed to be the ‘big compute’ layer. This was obvious from looking from the Google perspective—there’s a need for a big compute system. But a big compute system didn’t yet exist outside Google, so we were going to build one at Adatao in order to enable applications on top of it.
I knew that a big compute system had to take advantage of memory (‘in-memory’), and memory costs had been dropping to a level where it could be adopted in large quantities. The timing was right, and that helped the decision on Spark. When we did a survey before we started architecting and building the system, we came across many, many systems, and Spark stood out in particular; we looked at probably 15 different permutations out there and found that Spark had the right architecture. We were quite excited.
… I’m a big proponent of another AMPLab project, Alluxio (formerly known as Tachyon). My take on Alluxio is that today, we tend to think of it as a memory-based distributed storage. I think the future of Alluxio is much brighter when you flip the adjective and the noun and say that it’s actually a storage-backed distributed memory system, a shared memory. When we have full data-center-scale computing, we will need a shared memory layer to serve as the shared memory for all compute units.
Deep learning can be applied to many business problems
Companies should care about deep learning because it will increasingly become the critical competitive weapon. It is a machine learning technique on the one hand, but it’s also going to be encompassing all of society. You hear a lot about AI advancements and so on. I think it behooves companies to, at the very least, pay attention to it, begin to apply it, and have it as part of their DNA going forward. Because, unlike many things where there are a hundred things to bet on, there are a few things that you know very clearly. Given the right mechanics and the right perspective, you know that deep learning is going to be the way of the future. I can tell you deep learning is definitely part of it.
… There are a lot of classes of problems that apply to all companies. For example, not every company has an image-recognition problem of scale, but I’ll bet every company has transactional data, time series, a transaction log. There’s a lot of insight you can gain as there are a lot of patterns hidden in that transaction log (the time-series data), that companies can learn from. Intuitively, you know the patterns are in there. Either you have some basic tools to discern those patterns or you don’t have tools at all. Deep learning is a way to extract insight from those patterns and make predictions about the next likely behavior—for example, the probability of purchase or future cash flows from transaction data.
Let’s divide the world into before deep learning on time series and after deep learning on time series. Techniques exist for processing time series, yet I would say they’re too hard. It’s possible, but it costs too much or it’s too hard using traditional techniques, and it yields a value that hasn’t been worth the investment. Deep learning flips that around and gets you much larger value for less effort.
Here’s why I see deep learning applied to time series as being fundamentally different as a technology. There’s a lot of discussion about how deep learning is doing 5% to 10% better than previous techniques. The area where it is significantly better is in time-series modeling. People may have a lot of experience in techniques like ARIMA and signal processing, and so on, but the reason you need a lot of that staff is because those techniques require them. Don’t get me wrong, that staff is still very valuable.
With deep learning, particularly recurrent neural networks like Long Short-Term Memory (LSTM) networks, which are relatively new applied techniques that can model time series in a much more natural way, you don’t have to specify arbitrary windows. You don’t have to look for five-and-a-half or six-day patterns. With these recurrent networks, you’re able to feed all of the time series into the network, and it’ll figure out where the patterns are. It actually does (relatively speaking) reduce the need for the staff that were needed for other techniques.
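To make the "no arbitrary windows" point concrete, here is a minimal sketch of an LSTM forward pass in plain Python. It is not from the interview and is not Arimo's implementation: the class name `TinyLSTMCell`, the hidden size, and the random, untrained weights are all assumptions chosen for illustration. It shows only the mechanics: the same cell consumes a series of any length, one observation at a time, and carries its own memory forward instead of relying on a hand-picked window.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Hypothetical single-layer LSTM cell with scalar inputs.

    Weights are random and untrained; this illustrates the data flow only.
    """

    def __init__(self, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # Four gates: input (i), forget (f), output (o), candidate (g).
        # Each row is [weight for x_t, weights for h_{t-1}..., bias].
        self.weights = {
            gate: [
                [rng.uniform(-0.5, 0.5) for _ in range(hidden_size + 2)]
                for _ in range(hidden_size)
            ]
            for gate in ("i", "f", "o", "g")
        }

    def _preactivation(self, gate, x, h):
        return [
            row[0] * x + sum(w * hj for w, hj in zip(row[1:-1], h)) + row[-1]
            for row in self.weights[gate]
        ]

    def step(self, x, h, c):
        """Consume one observation and update hidden state h and cell state c."""
        i = [sigmoid(v) for v in self._preactivation("i", x, h)]
        f = [sigmoid(v) for v in self._preactivation("f", x, h)]
        o = [sigmoid(v) for v in self._preactivation("o", x, h)]
        g = [math.tanh(v) for v in self._preactivation("g", x, h)]
        # The forget gate decides how much old memory to keep,
        # the input gate decides how much of the new candidate to store.
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

    def encode(self, series):
        """Feed an entire series of any length; return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in series:
            h, c = self.step(x, h, c)
        return h


cell = TinyLSTMCell(hidden_size=4)
short_series = [0.1, 0.4, 0.3]                      # 3 observations
long_series = [0.1 * (t % 7) for t in range(50)]    # 50 observations
short_summary = cell.encode(short_series)
long_summary = cell.encode(long_series)
print(len(short_summary), len(long_summary))  # prints "4 4"
```

Both series, despite their different lengths, are reduced to a fixed-size summary vector without specifying a window. In practice, the weights would be learned from data using a deep learning framework, and the summary would feed a prediction layer (for example, a purchase probability).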
- Innovation from China: what it means for machine intelligence and AI (Christopher Nguyen’s keynote at Strata + Hadoop World Beijing)
- The Future of Machine Intelligence: perspectives from leading practitioners
- Deep Learning: A practitioner’s approach
- The Deep Learning video collection: 2016
- Hands-on machine learning with scikit-learn and TensorFlow