Combine the development experience of a laptop with the scale of the cloud

Ray Summit

Highlights from opening keynotes at the 2021 Ray Summit.

A newly released report from McKinsey forecasts an upcoming explosion in AI applications across all industries and domains. With this surge in demand to incorporate machine learning (ML) and AI into software, developers now have an array of open source and commercial software tools and components at their disposal. But building, deploying, and maintaining AI applications using disparate components remains challenging for most teams. Opening keynotes at the Ray Summit highlighted recent developments in the Ray ecosystem that included tools to allow engineers to combine dev tools on their laptops with the scale of the cloud.

Ray is a general-purpose distributed computing platform and an ecosystem that is emerging as an ideal substrate for building machine learning and AI applications. In his opening keynote at today’s Ray Summit, Ray’s co-creator and Anyscale co-founder Robert Nishihara pointed out the growing number of libraries integrating with Ray. The benefits of Ray integration flow directly to users of these libraries, as these examples mentioned by Nishihara illustrate:

  • Uber sped up machine learning model training by 5X by deploying XGBoost on Ray.

  • Amazon sped up an important Dask data processing workload by up to 13X by deploying Dask on Ray, and scaled the job to a cluster seven times larger than the one it had been using. With Dask on Ray, the end-to-end process now costs only 30% of what it originally did, and Amazon saved over half a million dollars as a result.

Nishihara listed several other examples of how Ray brings common infrastructure to the production ML ecosystem. He noted that by integrating with Ray, libraries can utilize Ray’s elasticity and autoscaling, and hook into other libraries, including Tune for hyperparameter tuning or Ray Serve for model serving.

Improving performance at a single stage or for a single library is compelling, but the ultimate goal is to simplify end-to-end development of ML and AI applications. To illustrate how difficult it is to build, deploy, and manage AI applications, UC Berkeley Professor and Anyscale co-founder Ion Stoica walked through three types of applications that many companies encounter:

Three distinct examples to illustrate the complexity of AI applications. From Ion Stoica’s keynote “Making Distributed Computing Easy” at Ray Summit 2021 (raysummit.org).

As Stoica noted, there are two general approaches to building machine learning and AI applications: (1) build a complete application from scratch, or (2) stitch together several libraries and distributed systems. The first option requires significant and continuing investments in talent and tooling, and is thus beyond reach for most companies. The more common route is to stitch together frameworks, particularly since there are now many open source and commercial solutions that address the various aspects of AI applications. As we noted in a previous post, the second option requires mastering different APIs and tools in order to build, test, deploy, monitor, and manage your application. While this is still challenging, it’s more accessible than the first option for most companies.

Another important consideration when stitching together frameworks is performance. As Stoica noted, developers incur overhead when exchanging data between systems that store data in different formats. Travis Addair, who previously led the Uber team responsible for building Uber’s deep learning infrastructure, raised similar points in a recent conversation we had:

    “When I was the tech lead at the deep learning training team at Uber, one of the things I did was to shift a lot of our platform off of a bifurcated Spark/Horovod model into a single Ray-based model. The goal was to get to a point where feature processing and distributed training all happens through a single Ray-based pipeline. So, you can get rid of the separate stages—some stages involve Spark, others involve Horovod—which lets you do more optimization between stages, such as feature transformation and model training. Fusing these steps together into a single graph definition results in a single entity that we can then serve for real-time serving as well.”

The case for building AI applications with Ray and Anyscale

This morning’s keynotes focused on a central theme: it’s much easier and more efficient to build machine learning applications with Ray. Both Nishihara and Stoica highlighted the growing number of libraries that integrate with Ray, which gives developers options for each stage of the machine learning lifecycle, from development and testing through deployment:

Ray Ecosystem and the Anyscale platform
A growing number of libraries integrate with Ray. From Ion Stoica’s keynote “Making Distributed Computing Easy” at Ray Summit 2021 (raysummit.org).

A key benefit of Ray is that integrated libraries do not exist in isolation; they can be used together seamlessly in a common application. This allows developers to use a single system (Ray) to implement any machine learning and AI application, end to end. A growing number of companies are even using Ray as the foundation of their ML platforms. Because developers only need to deal with a single system, applications are easier to develop, test, deploy, and manage.

Applications built with Ray are also faster and more efficient, as data can be exchanged between libraries using Ray’s in-memory object store rather than being converted between the formats of separate systems. And as Uber learned, fusing an application into a single computational graph also leads to better optimizations during development and deployment.
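To make the object-store point concrete, here is a minimal sketch, assuming Ray is installed (pip install ray). The function names normalize, mean, and run_pipeline are hypothetical and do not come from the keynotes; only ray.init, ray.put, ray.remote, and ray.get are Ray's core API. Two stages hand an intermediate result to each other by reference instead of writing it out to an external system between steps.

```python
from typing import List


def normalize(rows: List[float]) -> List[float]:
    """Stage 1 (hypothetical): scale values into the range [0, 1]."""
    lo, hi = min(rows), max(rows)
    span = (hi - lo) or 1.0  # avoid dividing by zero on constant input
    return [(r - lo) / span for r in rows]


def mean(rows: List[float]) -> float:
    """Stage 2 (hypothetical): reduce the normalized values to one number."""
    return sum(rows) / len(rows)


def run_pipeline(rows: List[float]) -> float:
    """Chain both stages as Ray tasks, handing data off by reference."""
    import ray  # imported lazily so the pure functions work without Ray

    # Start Ray locally unless it is already initialized in this process.
    ray.init(ignore_reinit_error=True)

    rows_ref = ray.put(rows)  # place the input in Ray's object store

    # Passing an ObjectRef as a task argument lets Ray resolve it to the
    # underlying value for the task, so the driver never has to fetch the
    # intermediate result itself.
    norm_ref = ray.remote(normalize).remote(rows_ref)
    mean_ref = ray.remote(mean).remote(norm_ref)

    return ray.get(mean_ref)  # only the final result returns to the driver
```

Calling run_pipeline([1.0, 2.0, 3.0]) runs both stages as Ray tasks. The stage functions stay plain Python, which is what allows separate libraries to share data this way without agreeing on an interchange format.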

Image: QuantumBlack’s end-to-end RL application development process for simulating hydrofoil designs for sailboats relied heavily on the Ray ecosystem. From Nic Hohn’s keynote “Making Boats Fly with AI on Ray” at raysummit.org.

Toward an infinite laptop

At last year’s Ray Summit, Anyscale previewed tools that offered the convenience and ease of development on a laptop combined with the power of the cloud. This morning’s keynotes provided updates on this vision.

Nishihara highlighted Ray Client, a new feature that streamlines workflows for developers building distributed applications. Ray Client makes it easy to burst from a laptop (or your web server, or your CI testing environment) to a cluster. The latest release of Ray also introduces tools (“Environments”) that make it easy to configure the environment, libraries, and dependencies that an application requires on a cluster.
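The laptop-to-cluster switch described above can be sketched as follows. This is a rough illustration rather than the workflow shown in the keynote: it assumes a recent Ray release in which ray.init accepts a ray:// address for Ray Client and a runtime_env dictionary for dependencies. The helper names, the host head-node.example.com, and the package list are hypothetical; 10001 is the default Ray Client server port.

```python
from typing import Any, Dict, Optional, Sequence


def build_init_kwargs(
    cluster_host: Optional[str] = None,
    pip_packages: Sequence[str] = (),
    port: int = 10001,
) -> Dict[str, Any]:
    """Build keyword arguments for ray.init (hypothetical helper).

    With no cluster_host the address is omitted, so Ray runs on the laptop.
    """
    kwargs: Dict[str, Any] = {}
    if cluster_host:
        # A ray:// address tells ray.init to connect through Ray Client.
        kwargs["address"] = f"ray://{cluster_host}:{port}"
    if pip_packages:
        # A runtime environment asks Ray to install these packages for the
        # workers that run this application's tasks.
        kwargs["runtime_env"] = {"pip": list(pip_packages)}
    return kwargs


def connect(cluster_host: Optional[str] = None,
            pip_packages: Sequence[str] = ()):
    """Initialize Ray locally or against a remote cluster (assumes Ray)."""
    import ray  # imported lazily so the helper above works without Ray

    return ray.init(**build_init_kwargs(cluster_host, pip_packages))
```

Calling connect() keeps everything on the local machine, while connect("head-node.example.com", ["xgboost"]) would send the same tasks to a cluster. The application code that follows the call does not change, which is the point of the feature.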

The opening keynotes were capped off with a demo illustrating how easy it is to build a complex end-to-end ML application on the Anyscale platform. Anyscale engineer Edward Oakes showed how to build a low-latency support chatbot that can scale to thousands of concurrent users. This chatbot combined machine learning, NLP, and custom business logic. Oakes created an initial version on his laptop, scaled the same code to the cloud, and deployed and monitored a working app, all within his 20-minute talk.

The vision is to create an “infinite laptop”: an environment where the ease of development on a laptop is combined with the power of the cloud. How does the “infinite laptop” vision stack up against reality? Over the last few weeks, I’ve been using the Anyscale platform to train and tune machine learning and deep learning models. For model training and hyperparameter tuning with Ray and the Anyscale platform, I really do have an “infinite laptop” for all intents and purposes. I am able to use my favorite dev tools and libraries, and initially develop my program on my laptop. When I’m ready to burst to the cloud, I simply add one line of code and the exact same program (training and hyperparameter tuning) runs on the cloud using whatever amount of compute resources I specify. There’s no need for DevOps or navigating cloud platform configuration details.

Developers want the ease of developing on a laptop combined with the scale of the cloud. From Ion Stoica’s keynote “Making Distributed Computing Easy” at Ray Summit 2021 (raysummit.org).

What this means for ML and AI application development teams

The kinds of libraries that have integrated with Ray now span the entire machine learning lifecycle. This morning’s keynotes spelled out such clear benefits to library creators and users that it seems reasonable to expect more libraries to integrate with Ray in the future (Apache Airflow just announced an integration).

As more libraries and frameworks join the Ray ecosystem, developers can expect their favorite frameworks to work together seamlessly and efficiently. Most importantly, developers will be able to build, debug, deploy, and manage applications using a single distributed platform (Ray). Moreover, the Ray community and Anyscale are building tools that will make it easier for developers to bridge two worlds: the development experience of laptops and desktops, and the scale of the cloud.

Ben Lorica is an advisor to Anyscale and is co-chair of the Ray Summit.
