Click to learn more about co- author Ion Stoica.
Click to learn more about co- author Ben Lorica.
Machine learning platform designers need to meet current challenges and plan for future workloads.
As machine learning gains a foothold in more and more companies, teams are struggling with the intricacies of managing the machine learning lifecycle.
The typical starting point is to give each data scientist a Jupyter notebook backed by a GPU instance in the cloud and to have a separate team manage deployment and serving, but this approach breaks down as the complexity of the applications and the number of deployments grow.
As a result, more teams are looking for machine learning platforms. Several startups and cloud providers are beginning to offer end-to-end machine learning platforms, including AWS (SageMaker), Azure (Machine Learning Studio), Databricks (MLflow), Google (Cloud AI Platform), and others. Many other companies choose to build their own, including Uber (Michelangelo), Airbnb (BigHead), Facebook (FBLearner), Netflix, and Apple (Overton).
Many users are building ML tools and platforms. This is an active space: Several startups and cloud platforms are beginning to offer end-to-end machine learning platforms, including Amazon, Databricks, Google, and Microsoft. But conversations with people in the community keep driving users back to this tool to solve many challenges left unsolved.
Ray’s collection of libraries can be used alongside other ML platform components. And unlike monolithic platforms, users have the flexibility to use one or more of the existing libraries or can use it to build their own libraries. A partial list of machine learning platforms that already incorporate such features include:
- Azure Machine Learning
- Amazon SageMaker
- Futurewei uses it in its ML cloud services
- Ant Financial uses it for streaming and online learning
- Facebook’s Classy Vision utilizes it for distributed training
- Intel’s Analytics Zoo allows users to directly run the programs on Apache Hadoop/YARN
- Flambé
In this post, we will share insights derived from conversations with many ML platform builders. We will list features that will be critical to ensuring that your ML platform is well-positioned for modern AI applications. We will also make the case that developers building ML platforms and ML components should consider such a platform because it has an ecosystem of standalone libraries that can be used to address some of the items we list below.
Ecosystem Integration
Developers and machine learning engineers use a variety of tools and programming languages (R, Python, Julia, SAS, etc.). But with the rise of deep learning, Python has become the dominant programming language for machine learning. So if anything, an ML platform needs to support Python and the Python ecosystem.
As a practical matter, developers and machine learning engineers rely on many different Python libraries and tools. The most widely used libraries include deep learning tools (TensorFlow, PyTorch), machine learning and statistical modeling libraries (scikit-learn, statsmodels), NLP tools (spaCy, Hugging Face, AllenNLP), and model tuning (Hyperopt, Tune).
Because it integrates seamlessly in the Python ecosystem, many developers are leveraging Ray for building machine learning tools. It is a general-purpose distributed computing platform that can be used to easily scale existing Python libraries and applications. It also has a growing collection of standalone libraries available to Python developers.
Easy Scaling
As we noted in The Future of Computing is Distributed, the “demands of machine learning applications are increasing at a breakneck speed.” The rise of deep learning and new workloads means that distributed computing will be common for machine learning. Unfortunately, many developers have relatively little experience in distributed computing.
Scaling and distributed computation are areas where Ray has been helpful to many users we’ve spoken with. It allows developers to focus on their applications instead of on the intricacies of distributed computing. Using it brings several benefits to developers needing to scale machine learning applications:
- It’s a platform that lets you easily scale your existing applications to a cluster. This could be as simple as scaling your favorite library to a compute cluster (see this recent post on using it to scale scikit-learn). Or it could involve using the API to scale an existing program to a cluster. The latter scenario is one we’ve seen happen for applications in NLP, online learning, fraud detection, financial time-series, OCR, and many other use cases.
- RaySGD simplifies distributed training for PyTorch and TensorFlow. This is good news for the many companies and developers who struggle with training or tuning large neural networks.
- Instead of spending time on DevOps, a built-in cluster launcher makes it simple to set up a cluster.
Extensibility for New Workloads
Modern AI platforms are notoriously compute hungry. In The Future of Computing is Distributed post referenced above, we noted that model tuning is an important part of the machine learning development process:
“You don’t train a model just once. Typically, the quality of a model depends on a variety of hyperparameters, such as the number of layers, the number of hidden units, and the batch size. Finding the best model often requires searching among various hyperparameter settings. This process is called hyperparameter tuning, and it can be very expensive.” Ion Stoica
Developers can choose from several libraries for tuning models. One of the more popular tools is Tune, a scalable hyperparameter tuning library built on top of Ray. Tune runs on a single node or on a cluster and has quickly become one of the more popular libraries in this ecosystem.
Reinforcement learning (RL) is another area worth highlighting. Many of the recent articles about RL pertain to gameplay (Atari, Go, multiplayer video games) or to applications in industrial settings (e.g., data center efficiency). But, as we’ve previously noted, there are emerging applications in recommendations and personalization, simulation and optimization, financial time series, and public policy.
RL is compute-intensive, complex to implement and scale, and, as such, many developers will want to simply use libraries. Ray provides a simple, highly scalable library (RLlib) that developers and machine learning engineers across several organizations are already using in production.
Tools Designed for Teams
As companies begin to use and deploy more machine learning models, teams of developers will need to be able to collaborate with each other. They will need access to platforms that enable both sharing and discovery. When considering an ML platform, consider the key stages of model development and operations, and assume that teams of people with different backgrounds will collaborate during each of those phases.
For example, feature stores (first introduced by Uber in 2017) are useful because they allow developers to share and discover features that they might otherwise not have thought about. Teams also need to be able to collaborate during the model development lifecycle. This includes managing, tracking, and reproducing experiments. The leading open source project in this area is MLflow, but we’ve come across users of other tools like Weights & Biases and Comet, as well as users who have built their own tools to manage ML experiments.
Enterprises will require additional features — including security and access control. Model governance and model catalogs (analogs of similar systems for managing data) are also going to be required as teams of developers build and deploy more models.
First Class MLOps
MLOps is a set of practices focused on productionizing the machine learning lifecycle. It is a relatively new term that draws ideas from continuous integration (CI) and continuous deployment (CD), two widely used software development practices. A recent Thoughtworks post listed some key considerations for establishing CD for machine learning (CD4ML). Some key items for CI/CD for machine learning include reproducibility, experiment management and tracking, model monitoring and observability, and more. There are a few startups and open source projects that offer MLOps solutions, including Datatron, Verta, TFX, and MLflow.
Ray has components that would be useful for companies moving towards CI/CD or building CI/CD tools for machine learning. It already has libraries for key stages of the ML lifecycle: training (RaySGD), tuning (Tune), and deployment (Serve). Having access to libraries that work seamlessly together will allow users to more readily bring CI/CD methods into their MLOps practice.
Summary
We listed key elements that machine learning platforms should possess. Our baseline assumptions are that Python will remain the language of choice for ML, distributed computing will increasingly be needed, new workloads like hyperparameter tuning and RL will need to be supported, and tools for enabling collaboration and MLOps need to be available.
With these assumptions in mind, we believe that Ray will be the foundation of future ML platforms. Many developers and engineers are already using it for their machine learning applications, including training, optimization, and serving. It not only addresses current challenges like scale and performance but is well-positioned to support future workloads and data types. Ray, and the growing number of libraries built on top of it, will help scale ML platforms for years to come. We look forward to seeing what you build.