Deep learning for Apache Spark

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Jason Dai on BigDL, a library for deep learning on existing data frameworks.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Jason Dai, CTO of big data technologies at Intel and co-chair of Strata + Hadoop World Beijing. Dai and his team are prolific and longstanding contributors to the Apache Spark project. Their early contributions to Spark tended to be on the systems side and included the Netty-based shuffle, the fair scheduler, and “yarn-client” mode. Recently, they have been contributing tools for advanced analytics. In partnership with major cloud providers in China, they’ve written implementations of algorithmic building blocks and machine learning models that let Apache Spark users scale to extremely high-dimensional models and large data sets. They achieve this scalability by exploiting properties like data sparsity and by using Intel’s Math Kernel Library (MKL). Along the way, they’ve gained valuable experience and insight into how companies deploy machine learning models in real-world applications.
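
To make the sparsity point concrete, here is a minimal sketch of how Spark MLlib represents high-dimensional sparse features. It is my own illustration (the dimensionality and values are made up), not code from Dai’s team or from BigDL: because only the nonzero entries of each vector are stored, a model over millions of dimensions stays tractable.

```python
# A toy sketch of sparse, high-dimensional features in Spark MLlib.
# Only the nonzero indices and values of each vector are stored.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import SparseVector
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparse-sketch").getOrCreate()

dim = 1_000_000  # hypothetical feature dimensionality
rows = [
    (1.0, SparseVector(dim, {3: 1.0, 901_422: 2.5})),  # 2 of 1M entries stored
    (0.0, SparseVector(dim, {3: 0.5, 17: 1.0})),
]
df = spark.createDataFrame(rows, ["label", "features"])

# Fit a (toy) high-dimensional linear model; a real workload would
# stream far more rows in from distributed storage.
model = LogisticRegression(maxIter=5).fit(df)
print(model.coefficients.numNonzeros())

spark.stop()
```

Libraries like the ones Dai’s team builds push this idea further, pairing sparse representations with optimized native kernels such as MKL.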

When I predicted that 2017 would be the year the big data and data science communities start exploring techniques like deep learning in earnest, I was relying on conversations with many members of those communities. I also knew that Dai and his team were at work on a distributed deep learning library for Apache Spark. This evolution, from basic infrastructure to machine learning applications and now to applications backed by deep learning models, is to be expected.

Once you have a platform and a team that can deploy machine learning models, it’s natural to begin exploring deep learning. As I’ve highlighted in recent episodes of this podcast (here and here), companies are beginning to apply deep learning to time-series data, event data, text, and images. Many of these same companies have already invested in big data technologies (many of which are open source) and employ data scientists and data engineers who are comfortable with these tools.
Continue reading

The key to building deep learning solutions for large enterprises

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Adam Gibson on the importance of ROI, integration, and the JVM.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

As data scientists add deep learning to their arsenals, they need tools that integrate with existing platforms and frameworks. This is particularly important for those who work in large enterprises. In this episode of the Data Show, I spoke with Adam Gibson, co-founder and CTO of Skymind, and co-creator of Deeplearning4J (DL4J). Gibson has spent the last few years developing the DL4J library and community, while simultaneously building deep learning solutions and products for large enterprises.

Here are some highlights:

Continue reading

How big compute is powering the deep learning rocketship

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Greg Diamos on building computer systems for deep learning and AI.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

Specialists describe deep learning as akin to a rocketship that needs a really big engine (the model) and a lot of fuel (the data) in order to go anywhere interesting. To get a better understanding of the issues involved in building compute systems for deep learning, I spoke with one of the foremost experts on this subject: Greg Diamos, senior researcher at Baidu. Diamos has long worked to combine advances in software and hardware to make computers run faster. In recent years, he has focused on scaling deep learning to help advance the state of the art in areas like speech recognition.

A big model, combined with big data, necessitates big compute, and at least at the bleeding edge of AI, researchers have gravitated toward high-performance computing (HPC) or supercomputer-like systems. Most practitioners use systems with multiple GPUs (or other accelerators, such as ASICs and FPGAs) and software libraries that make it easy to run fast deep learning models on top of them.
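
To illustrate what most of these libraries do under the hood, here is a conceptual sketch of synchronous data parallelism, the strategy multi-GPU training commonly relies on. This is a toy NumPy simulation (the model, data, and “devices” are all made up), not any particular library’s API: each device computes gradients on its shard of a batch, the gradients are averaged, and a single weight update is applied.

```python
# A toy NumPy simulation of synchronous data parallelism: each "device"
# computes gradients on its shard of a batch, the gradients are averaged
# (an all-reduce), and a single weight update is applied.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(10)                         # model weights
X = rng.normal(size=(256, 10))           # one global batch
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)

def shard_gradient(Xs, ys, w):
    """Least-squares gradient computed on one device's shard."""
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

n_devices = 4                            # stand-in for four GPUs
for step in range(100):
    grads = [shard_gradient(Xs, ys, w)
             for Xs, ys in zip(np.array_split(X, n_devices),
                               np.array_split(y, n_devices))]
    w -= 0.05 * np.mean(grads, axis=0)   # average ("all-reduce"), then update

print(np.mean((X @ w - y) ** 2))         # training loss after 100 steps
```

In real systems, the all-reduce step runs over fast interconnects between GPUs, and that communication cost is a large part of why researchers at the bleeding edge gravitate toward HPC-style hardware.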

In keeping with the convenience-versus-performance tradeoff discussions that play out in many enterprises, there are other efforts that fall more into the big data camp than the HPC camp. In upcoming posts, I’ll highlight groups of engineers and data scientists who are starting to use these techniques and are creating software to run them on the software and hardware infrastructure already common in the big data community.

Continue reading

2017 will be the year the data science and big data communities engage with AI technologies

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: A look at some trends we’re watching in 2017.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

This episode consists of excerpts from a recent talk I gave at a conference commemorating the end of the UC Berkeley AMPLab project. This section pertains to recent trends in data and AI. For a complete list of trends we’re watching in 2017, as well as regular doses of highly curated resources, subscribe to our Data and AI newsletters.

As 2016 draws to a close, I see the big data and data science communities beginning to engage with AI-related technologies, particularly deep learning. By early next year, there will be new tools that specifically cater to data scientists and data engineers who aren’t necessarily experts in these techniques. While the AI research community continues to tackle fundamental problems, these new tools will make some recent breakthroughs in AI much more accessible and convenient for the data community to use.


Introducing model-based thinking into AI systems

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Vikash Mansinghka on recent developments in probabilistic programming.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Vikash Mansinghka, research scientist at MIT, where he leads the Probabilistic Computing Project, and co-founder of Empirical Systems. I’ve long wanted to introduce listeners to recent developments in probabilistic programming, and I found the perfect guide in Mansinghka.

Probability is the mathematical language for representing, modeling, and manipulating uncertainty, and probabilistic programming provides frameworks for expressing probabilistic models as computer programs. This family of tools and techniques distinguishes between models and inference procedures, and in the process encourages the kind of model-based thinking that may inform the design of future artificial intelligence systems and supplement current data- and compute-intensive systems that rely primarily on large-scale pattern recognition.
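
A minimal, self-contained sketch may help make that separation concrete. This is my own toy example in plain Python (not Mansinghka’s system, and not a real probabilistic programming language): the model is an ordinary generative program, and inference is a separate, generic procedure that can be applied to any such program.

```python
# The model is an ordinary generative program; inference is a separate,
# generic procedure that works on any such program.
import random

def model():
    """Generative program: an unknown coin bias, then 10 flips."""
    p = random.random()                        # prior: p ~ Uniform(0, 1)
    flips = [random.random() < p for _ in range(10)]
    return p, flips

def rejection_inference(model, observed, n=100_000):
    """Generic inference: keep latent draws whose simulated data match."""
    accepted = []
    for _ in range(n):
        p, flips = model()
        if flips == observed:
            accepted.append(p)
    return accepted

observed = [True] * 8 + [False] * 2            # we saw 8 heads in 10 flips
posterior = rejection_inference(model, observed)
print(sum(posterior) / len(posterior))         # roughly 0.75, the posterior mean
```

Real probabilistic programming systems replace the brute-force rejection step with far more efficient inference algorithms, but the division of labor is the same: the modeler writes the generative program, and the framework supplies the inference.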

Below are highlights from my conversation with Mansinghka:
Continue reading

Visual tools for overcoming information overload

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Dafna Shahaf on information cartography and AI, and Sam Wang on probabilistic methods for forecasting political elections.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this special two-segment episode of the Data Show, I spoke with Dafna Shahaf, assistant professor at the School of Computer Science and Engineering at the Hebrew University of Jerusalem. Her research focuses on tools and techniques for overcoming information overload, an area of increasing importance in an attention economy. With the U.S. presidential election right around the corner, I also included a conversation between Jenn Webb, host of the O’Reilly Radar Podcast, and Sam Wang, co-founder of the Princeton Election Consortium and professor of neuroscience and molecular biology at Princeton University.
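
For a flavor of the probabilistic approach to election forecasting, here is a toy Monte Carlo sketch. The states, win probabilities, and electoral vote counts below are invented for illustration and have nothing to do with the Princeton Election Consortium’s actual model: we simulate many elections from per-state win probabilities and tally electoral votes.

```python
# A toy Monte Carlo election forecast: simulate many elections from
# per-state win probabilities and tally electoral votes. All numbers
# below are invented for illustration.
import random

# state -> (electoral votes, probability candidate A wins that state)
states = {"FL": (29, 0.55), "OH": (18, 0.48), "PA": (20, 0.60),
          "NC": (15, 0.45), "VA": (13, 0.65)}
SAFE_A, NEEDED = 200, 270   # hypothetical safe electoral votes and threshold

trials, wins = 100_000, 0
for _ in range(trials):
    ev = SAFE_A + sum(v for v, p in states.values() if random.random() < p)
    if ev >= NEEDED:
        wins += 1

print(f"P(candidate A wins) ~ {wins / trials:.3f}")
```

A real forecast would, among other things, model correlations between state outcomes rather than treating them as independent, as this toy version does.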

Below are highlights from my conversation with Dafna Shahaf:
Continue reading

Why businesses should pay attention to deep learning

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Christopher Nguyen on the early days of Apache Spark, deep learning for time-series and transactional data, innovation in China, and AI.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with Christopher Nguyen, CEO and co-founder of Arimo. Nguyen and Arimo were among the first adopters and proponents of Apache Spark, Alluxio, and other open source technologies. Most recently, Arimo’s suite of analytic products has relied on deep learning to address a range of business problems.

Continue reading