Deep learning that’s easy to implement and easy to scale

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Anima Anandkumar on MXNet, tensor computations and deep learning, and techniques for scaling algorithms.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Anima Anandkumar, a leading machine learning researcher, and currently a principal research scientist at Amazon. I took the opportunity to get an update on the latest developments on the use of tensors in machine learning. Most of our conversation centered around MXNet—an open source, efficient, scalable deep learning framework. I’ve been a fan of MXNet dating back to when it was a research project out of CMU and UW, and I wanted to hear Anandkumar’s perspective on its recent progress as a framework for enterprises and practicing data scientists.

Here are some highlights from our conversation:
Continue reading “Deep learning that’s easy to implement and easy to scale”

Architecting big data applications in the cloud

The O’Reilly Data Show podcast: Jai Ranganathan on the Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.


[This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Given the quick pace of innovation in the data ecosystem, we like to take a step back from the details of individual components, architecture, and applications, in order to take a wider view of the landscape of big data. This allows us to evaluate the progress of technology and infrastructure along the way, shifting our attention from the details of individual components like Spark and Kafka, to larger trends.

Some of the larger trends we’ve been exploring include the capabilities of distributed machine learning and the tradeoffs and design decisions involved in cloud architecture and stream processing.

In this episode of the O’Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about the trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and hardware trends:

Large-scale machine learning

This sounds a bit like this should already exist in really good form right now, but one of the things that I’m really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do do this, I think relative to what you can experience from a singular environment learning scikit-learn or R, the set of things you can do in a distributed fashion is limited. …  It’s not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. … And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that’s the final answer, but we do need systems just to provide the kind of depth that you typically are used to in the singular environment. That’s just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.

Architecting data applications in the cloud

There are some fundamental design principles behind the original HDFS implementation, which don’t actually work in the cloud. For example, this notion that data locality is fundamental to this system design; it starts changing in the cloud when you’re looking at these large cloud providers — they are doing all these software-defined networking tricks and they can do bisectional bandwidth, like 40 gigs per second, across their data center … suddenly, you’re talking about moving hundreds of terabytes of data back and forth from a storage to a compute layer without any huge performance penalties. Suddenly, their performance is disadvantageous to this, but it’s not as bad as you think. Some of the core design principles in Hadoop have to change when you think about this kind of new data center design. … The cloud part is really interesting, but really what to me is interesting is there’s a fundamental shift in the way data centers are being designed, which we have to make sure that Hadoop stays designed to capitalize on.

… A lot of the work we do on the cloud is to optimize working with these object stores effectively. Obviously, you still need some local storage for things like spill, but that’s not really the same as a distributed file system. Then, it’s really a question of getting all the frameworks to run really effectively against an object store.

Paying attention to hardware trends

When I joined Cloudera, a customer who was going crazy and buying the most expensive hardware was buying 64 gigabytes of RAM. On that 64 gigabytes of RAM, they also had 12 disk spindles with two terabytes each and 24 terabytes of disk. At this point, today, many of my customers buy 246 gigabytes of RAM or even potentially 384 gigabytes to 512 gigabytes of RAM. The amount of disk is still exactly the same. Because disks don’t spin faster and you still want a certain level of throughput, you’re still looking at 24 terabytes of disk in your machine. Already in just two years, we have seen it go from 64 to 512, potentially. I don’t think this trend is going to stop, and we are suddenly going to be looking at, within three years, one-terabyte RAM machines.

… What we’re finding is that in a lot of the things we do at Cloudera, like Kudu or Impala, fundamentally, we really care about wringing performance out of the CPU. A lot of this will be like, ‘can I do vectorize operations?’ and ‘can I make sure to take advantage of my L2 cache mode effectively?’ because that allows my CPU to spend more efficiently. It really changes the
bottleneck from the I/O subsystem to the CPU subsystem, and everything you can do to eke out performance there really matters.

… Project Tungsten is basically in the Spark community to do more CPU-efficient things, whether that’s vectorizing stuff, whether that’s actually effectively moving away from managed memory to managing by buffers, so you can actually have much more efficient handling of memory, so you can get better CPU efficiency as well.

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes,SoundCloud, RSS

Related resources:

Financial analytics as a service

[A version of this post appears on the O’Reilly Strata blog.]

In relatively short order Amazon’s internal computing services has become the world’s most successful cloud computing platform. Conceived in 2003 and launched in 2006, AWS grew quickly and is now the largest web hosting company in the world. With the recent addition of Kinesis (for stream processing), AWS continues to add services and features that make it an attractive platform for many enterprises.

A few other companies have followed a similar playbook: technology investments that benefit a firm’s core business, is leased out to other companies, some of whom may operate in the same industry. An important (but not well-known) example comes from finance. A widely used service provides users with clean, curated data sets and sophisticated algorithms with which to analyze them. It turns out that the world’s largest asset manager makes its investment and risk management systems available to over 150 pension funds, banks, and other institutions. In addition to the $4 trillion managed by BlackRock, the company’s Aladdin Investment Management system is used to manage1 $11 trillion in additional assets from external managers.

BlackRock: Aladdin

Just as AWS has been adopted by e-commerce companies, some of Aladdin’s users are BlackRock’s peers in the asset-management industry. In the case of Aladdin2, asset managers have come to value it’s collection3 of high-quality historical data and analytics (including Monte-Carlo simulations and stress tests). In recent years, the amount of assets that rely on Aladdin grew by about $1 trillion per year. To put these numbers in context, the 6,000 computers that comprise Aladdin keep an eye on about “7% of the world’s $225 trillion of financial assets”. About 17,000 traders worldwide have access to Aladdin.

Continue reading “Financial analytics as a service”