ML platform designers need to meet current challenges and plan for future workloads.
By Ben Lorica and Ion Stoica.
[This post originally appeared on the Anyscale blog.]
As machine learning gains a foothold in more and more companies, teams are struggling with the intricacies of managing the machine learning lifecycle.
The typical starting point is to give each data scientist a Jupyter notebook backed by a GPU instance in the cloud and to have a separate team manage deployment and serving, but this approach breaks down as the complexity of the applications and the number of deployments grow.
As a result, more teams are looking for machine learning platforms. Several startups and cloud providers are beginning to offer end-to-end machine learning platforms including AWS (SageMaker), Azure (Machine Learning Studio), Databricks (MLflow), Google (Cloud AI Platform), and others. Many other companies choose to build their own including Uber (Michelangelo), Airbnb (BigHead), Facebook (FBLearner), Netflix, and Apple (Overton).
Figure 1: Ray can be used to build pluggable components for machine learning platforms. The boxes in blue are components for which there are some libraries built on top of Ray.In our previous post we examined the reasons behind the growing popularity of Ray. Many Ray users are building ML tools and platforms. This is an active space: several startups and cloud platforms are beginning to offer end-to-end machine learning platforms including Amazon, Databricks, Google, and Microsoft. But conversations with people in the community keep driving users to Ray to solve many challenges left unsolved.
Ray’s collection of libraries can be used alongside other ML platform components. And unlike monolithic platforms, users have the flexibility to use one or more of the existing Ray libraries, or to use Ray to build their own libraries. A partial list of machine learning platforms that already incorporate Ray include:
- Azure Machine Learning
- Amazon SageMaker
- Futurewei uses Ray in its ML cloud services.
- Ant Financial uses Ray for streaming and online learning.
- Facebook’s Classy Vision uses Ray for distributed training.
- Intel’s Analytic Zoo allows users to directly run Ray programs on Apache Hadoop/YARN.
In this post, we share insights derived from conversations with many ML platform builders. We list features that will be critical to ensuring that your ML platform is well-positioned for modern AI applications. We also make the case that developers building ML platforms and ML components should consider Ray because it has an ecosystem of standalone libraries that can be used to address some of the items we list below.
Developers and machine learning engineers use a variety of tools and programming languages (R, Python, Julia, SAS, etc.). But with the rise of deep learning, Python has become the dominant programming language for machine learning. So if anything, an ML platform needs to support Python and the Python ecosystem.
As a practical matter, developers and machine learning engineers rely on many different Python libraries and tools. The most widely used libraries include deep learning tools (TensorFlow, PyTorch), machine learning and statistical modeling libraries (scikit-learn, statsmodel), NLP tools (spaCy, Hugging Face, AllenNLP), and model tuning (Hyperopt, Tune).
Because it integrates seamlessly in the Python ecosystem, many developers are leveraging Ray for building machine learning tools. It is a general purpose distributed computing platform that can be used to easily scale existing Python libraries and applications. Ray also has a growing collection of standalone libraries available to Python developers.
As we noted in a previous post (“The Future of Computing is Distributed”) the “demands of machine learning applications are increasing at a breakneck speed”. The rise of deep learning and new workloads means that distributed computing will be common for machine learning. Unfortunately many developers have relatively little experience in distributed computing.
Scaling and distributed computation are areas where Ray has been helpful to many users we’ve spoken with. Ray allows developers to focus on their applications instead of on the intricacies of distributed computing. Using Ray brings several benefits to developers needing to scale machine learning applications:
- Ray is a platform that lets you easily scale your existing applications to a cluster. This could be as simple as scaling your favorite library to a compute cluster (see this recent post on using Ray to scale scikit-learn). Or it could involve using the Ray API to scale an existing program to a cluster. The latter scenario is one we’ve seen happen for applications in NLP, online learning, fraud detection, financial time-series, OCR, and many other use cases.
- RaySGD simplifies distributed training for PyTorch and TensorFlow. This is good news for the many companies and developers who struggle with training or tuning large neural networks.
- Instead of spending time on DevOps, a built-in cluster launcher makes it simple to setup a Ray cluster.
Extensibility for New Workloads
Modern AI platforms are notoriously compute hungry. In a previous post, we noted that model tuning is an important part of the machine learning development process:
“You don’t train a model just once. Typically, the quality of a model depends on a variety of hyperparameters, such as the number of layers, the number of hidden units, and the batch size. Finding the best model often requires searching among various hyperparameter settings. This process is called hyperparameter tuning, and it can be very expensive.”
Developers can choose from several libraries for tuning models. One of the more popular tools is Tune, a scalable hyperparameter tuning library built on top of Ray. Tune runs on a single node or on a cluster, and has quickly become one of the more popular libraries in the Ray ecosystem.
Reinforcement learning (RL) is another area worth highlighting. Many of the recent articles about RL pertain to game play (Atari, Go, multiplayer video games) or to applications in industrial settings (e.g., data center efficiency). But as we noted in a previous post, there are emerging applications in recommendations and personalization, simulation and optimization, financial time series, and public policy.
RL is compute intensive, complex to implement and scale, and as such many developers will want to simply use libraries. Ray provides a simple, highly scalable library (RLlib) that developers and machine learning engineers across several organizations are already using in production.
Tools Designed for Teams
As companies begin to use and deploy more machine learning models, teams of developers will need to be able to collaborate with each other. They will need access to platforms that enable both sharing and discovery. When considering an ML platform, consider the key stages of model development and operations, and assume that teams of people with different backgrounds will collaborate during each of those phases.
For example, feature stores (first introduced by Uber in 2017) are useful because they allow developers to share and discover features that they might otherwise not have thought about. Teams also need to be able to collaborate during the model development lifecycle. This includes managing, tracking, and reproducing experiments. The leading open source project in this area is MLflow, but we’ve come across users of other tools like Weights & Biases and comet.ml, as well as users who have built their own tools to manage ML experiments.
Enterprises will require additional features – including security and access control. Model governance and model catalogs (analogs of similar systems for managing data) are also going to be required as teams of developers build and deploy more models.
First Class MLOps
MLOps is a set of practices focused on productionizing the machine learning lifecycle. It is a relatively new term that draws ideas from continuous integration (CI) and continuous deployment (CD), two widely used software development practices. A recent Thoughtworks post listed some key considerations for establishing CD for machine learning (CD4ML). Some key items for CI/CD for machine learning include reproducibility, experiment management and tracking, model monitoring and observability, and more. There are a few startups and open source projects that offer MLOps solutions including Datatron, Verta, TFX, and MLflow.
Ray has components that would be useful for companies moving towards CI/CD or building CI/CD tools for machine learning. It already has libraries for key stages of the ML lifecycle: training (RaySGD), tuning (Tune), and deployment (Serve). Having access to libraries that work seamlessly together will allow Ray users to more readily bring CI/CD methods into their MLOps practice.
We listed key elements that machine learning platforms should possess. Our baseline assumptions are that Python will remain the language of choice for ML, distributed computing will increasingly be needed, new workloads like hyperparameter tuning and RL will need to be supported, and tools for enabling collaboration and MLOps need to be available.
With these assumptions in mind, we believe that Ray will be the foundation of future ML platforms. Many developers and engineers are already using Ray for their machine learning applications, including training, optimization, and serving. Ray not only addresses current challenges like scale and performance but is well positioned to support future workloads and data types. Ray, and the growing number of libraries built on top of it, will help scale ML platforms for years to come. We look forward to seeing what you build on top of Ray!