RayML using the Kubernetes operator

RayML + Kubernetes = Finally, a truly scalable Distributed ML solution

Background (from the Ray website): Ray is an open-source distributed execution framework that makes it easy to scale your applications and to leverage state-of-the-art machine learning libraries. Ray provides a simple, universal API for building distributed applications (with Python and Java APIs).

I first came across RayML on the Software Engineering Daily podcast in July 2020.
In September 2020, I attended Ray Summit 2020, organized by Anyscale (a startup founded by the creators of Ray from UC Berkeley's RISELab, the successor to the AMPLab that produced Apache Spark and Databricks).
In one of the Ray Summit sessions about running Ray on k8s, several attendees asked the Anyscale/Ray team to revive the k8s operator, which they did in the short span of 3 months; it's now available in a beta-ish phase.
Engineering work is still being done for different use cases, and the Anyscale/Ray team has allocated time and resources to enhancements that enable k8s as a cluster manager.

Review: This post is primarily about my experience setting up and running ML workloads using the Ray k8s operator.

Benefits of using the Ray k8s operator as a cluster manager:
1. Provides the ability to run production-grade ML workloads.
2. The autoscaler horizontally scales worker pods.
3. The dashboard provides CPU/memory resource usage and logs.
4. Network and security policies can be set at the namespace and/or cluster level.
5. Relatively easy dependency management: you can pick and choose versions and underlying package dependencies.
6. More efficient allocation and usage of resources.
7. Single-tenant dedicated clusters.

Setup: I used a GKE cluster with 5 nodes (each with 2 cores and 8 GB of memory) and Helm to set up a Jupyter pod.
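From the Jupyter pod, the notebook attaches to the operator-managed cluster rather than starting its own. A connection sketch, with the caveat that `ray-head` is a hypothetical service name; substitute whatever your operator config creates, and the exact address/port depend on your Ray version and operator setup:

```python
import ray

# From a pod inside the same k8s namespace, point ray.init at the head
# node's service. "ray-head" below is a placeholder service name, not
# something the operator guarantees; check your cluster config.
ray.init(address="ray-head:6379")
```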

Ray Dashboard: The Ray dashboard dynamically refreshes the resource summary (CPU, memory, disk) for the head and worker pods. You can also see the status of each pod, jobs, and very detailed logs in one place. Below, I have 1 head pod and 2 worker pods. (You can pick the minimum and maximum number of workers when creating your single-tenant cluster.)

Ray dashboard dynamically refreshes node/pod summary, status and logs

Ray Workload: I tried a relatively simple hyperparameter tuning job with num_evaluations=20. The autoscaler kicked in and added a new worker pod dynamically.

Ray Dashboard with ML run: Below, I have 1 head pod and 3 worker pods while the hyperparameter tuning job is running.

After the run completed, the cluster automatically scaled down to 2 worker pods.

Summary: Ray is one of the buzziest new technologies (14K stars on GitHub) in recent times, with a lot of potential, as it's built from the ground up for cloud computing and ML. Since Ray provides Java and Python APIs, it could be used beyond ML for use cases such as ETL and other analytics workloads.