Migrating Large-Scale Interactive Compute Workloads to Kubernetes Without Disruption
8 May / Global
Introduction
Millions of people worldwide use Uber daily, generating vast amounts of data on traffic, routes, estimated arrival times, and more. We use this data to learn and to enhance users' experiences with Uber.
We developed DSW (Data Science Workbench), an interactive notebook platform for applied scientists, data scientists, ML engineers, and operations specialists to facilitate this learning. DSW supports data exploration, analysis, model training, workflow scheduling, visualization, and collaboration through a web interface.
Behind the scenes, the DSW team provides access to Jupyter® and RStudio® notebooks by allocating isolated containers with internal tooling and the necessary open-source software. These containers vary in memory, compute, and GPU resources. Each user session offers multiple Python® kernels, Apache Spark™ PySpark, and Sparkmagic kernels, each with an independent environment. Users can install additional Python packages into these environments; before the migration, those packages had to be reinstalled after every container restart.
Managing Python dependencies is complex, and migrating tech stacks without disruption is challenging. In this post, we explore how we migrated 3,500 interactive Jupyter and RStudio user sessions from Peloton—an Apache Mesos®–based container orchestrator—to Kubernetes®, with minimal disruption. We also highlight how intelligently debouncing inotify events helped us track installed and uninstalled Python packages across restarts during the migration.
This blog is the second in a multi-part series about Kubernetes migration use cases. Uber is gradually converging on Kubernetes for batch, stateless, and storage use cases, with the goal of leveraging industry-standard technology, rich built-in functionality, and stability. This blog describes how we migrated DSW batch job workloads from Peloton to Kubernetes. The previous blog described how we migrated all shared stateless workloads to Kubernetes. Future blogs will explore additional workload and architecture migrations.
Motivation Behind the Migration
We’ve reimagined the Uber infrastructure over the years, defining principles in the Crane and Migration to Google Cloud Platform™ initiatives. The infrastructure principles we now follow are:
- Our stack should work seamlessly across on-prem and the cloud without vendor lock-in
- Zone turn-up and infrastructure management should be automated with homogeneous hosts
- Capacity migration from on-prem to the cloud should be simple and centrally managed, preferably within a single container orchestration platform
DSW sessions use a common NFS (network file system) mount for cross-session persistence. To provide these mounts, a lower-level service handles NFS on dedicated host groups. However, these dedicated hosts within Peloton hindered workload portability. Following the principles outlined in the Crane initiative, we aimed to migrate DSW workloads to Kubernetes for better resource management, robust NFS support, and easier cloud migration.
Challenges
Aspects of DSW and Kubernetes presented migration challenges we had to consider.
Modeling Interactive Workloads
DSW session workloads align fairly well with Kubernetes’ built-in Jobs construct, but key differences exist. Kubernetes Jobs are designed for non-interactive, short-lived, multi-container tasks, whereas DSW workloads are interactive, long-lived, and single-container.
Achieving Efficiency Gains
DSW workloads on the Peloton stack ran on dedicated, zonal clusters, risking complete downtime if a zone or cluster failed. This setup also caused load imbalances, with DSW clusters experiencing high load and scheduling delays while other Peloton clusters remained underutilized. Additionally, direct interaction between the DSW platform and dedicated clusters created operational challenges, such as difficulty in decommissioning or maintaining clusters and bringing new ones online without involving both the DSW and Compute teams.
Mounting NFS
Kubernetes offers first-class support for NFS mounts through plugins like Container Storage Interface (CSI) drivers or additional packages alongside the Kubelet service. However, adding such support requires further development and maintenance.
Moving Workloads Without Disruption
DSW offers interactive sessions in Jupyter and RStudio, allowing users to install custom packages. Moving a session container between orchestration platforms requires a restart, which results in losing the in-memory state, local disk data, and installed packages. This significantly impacts productivity and leads to dissatisfaction with the DSW platform.
Design Choices
Uber infrastructure principles and a focus on efficiency, maintenance, and availability drove our migration design choices.
Modeling DSW Sessions as Kubernetes Jobs
Modeling DSW sessions as a CRD (Custom Resource Definition) in Kubernetes offered benefits like total control over pod behavior for long-lived containers. However, maintaining and sustaining a CRD would be challenging, and the existing Kubernetes federation already supported the Job interface, enabling a faster migration. So, we chose to model DSW sessions as Kubernetes Jobs, with the modifications below (see the sketch after this list).
- Parallelism: Specifies the number of pods running in parallel as part of the Job. Since each session is modeled as a Job and only one pod should run at a time, parallelism must be set to 1.
- Completions: The number of pods that must complete successfully for the Job to be considered a success. Since sessions run indefinitely, our pods never exit with a zero exit code, so the completions value should be set to a large number; we use 1,024.
- Restart Policy: The restart policy determines how Kubernetes responds to pod failures. We set restartPolicy to Never, so a failed pod is replaced with a new one; OnFailure would restart the container in the same pod, causing issues. We also set spec.backoffLimit to a high value to keep DSW session containers running despite multiple failures, and configured the Job so that Kubernetes disruptions like evictions don't count against the backoff limit.
- Networking: With dynamic port allocation and the service mesh supported by the Compute team at Uber, plus hostNetwork: true, we enabled service-to-service communication from DSW sessions.
- Secrets and SPIRE Authentication: Additional volumes are mounted in the DSW session container to access ownership-specific secrets and the SPIRE identity.
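As a rough illustration of these settings, the sketch below builds such a Job with the official Kubernetes Python client. It's a minimal sketch, not DSW's actual configuration: the image, namespace, resource requests, and names are placeholders we made up for this example.

```python
# Minimal sketch of a DSW-style session Job; image, names, and resources
# are placeholders, not the real DSW configuration.
from kubernetes import client

def build_session_job(session_id: str) -> client.V1Job:
    container = client.V1Container(
        name="dsw-session",
        image="dsw/session:latest",  # placeholder image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "16Gi"},
        ),
    )
    pod_spec = client.V1PodSpec(
        containers=[container],
        restart_policy="Never",  # a failed pod is replaced, not restarted in place
        host_network=True,       # service-to-service traffic over host networking
    )
    job_spec = client.V1JobSpec(
        parallelism=1,           # one pod per session at a time
        completions=1024,        # sessions never "complete" normally
        backoff_limit=1024,      # tolerate many pod failures
        # A podFailurePolicy can additionally ignore disruptions such as
        # evictions so they don't count against backoff_limit (omitted here).
        template=client.V1PodTemplateSpec(spec=pod_spec),
    )
    return client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"dsw-session-{session_id}"),
        spec=job_spec,
    )
```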

Efficiency Gain Through Kubernetes Federation
The federation layer in batch compute at Uber, called Federator, provides an abstraction layer over Kubernetes batch clusters. Instead of submitting workloads directly to a dedicated cluster as in the Peloton world, DSW now interacts with Federator, which in turn talks to the underlying Kubernetes clusters. The federation layer provides several improvements.
The first improvement is high availability. Federator is a regional service that interacts with multiple Kubernetes clusters across zones. It monitors cluster health by constant polling, ensuring that a cluster or zone failure doesn’t completely disrupt DSW workloads in the region, unlike in the previous Peloton setup.

The second improvement is smarter scheduling. Federator runs a cluster selection algorithm for each incoming workload, choosing the best cluster based on resource availability and current demand. This ensures balanced loads across compute clusters, leading to better SLA guarantees and reduced scheduling wait times.
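Federator's actual selection logic is internal, but a simple score-based heuristic conveys the idea: filter out unhealthy or too-full clusters, then prefer the one with the most headroom and the least queued demand. The ClusterStats fields and scoring below are hypothetical.

```python
# Illustrative only: Federator's real algorithm is internal to Uber.
# ClusterStats and the scoring function are hypothetical.
from dataclasses import dataclass

@dataclass
class ClusterStats:
    name: str
    healthy: bool
    free_cpu: float        # CPUs currently available
    free_mem_gib: float    # memory currently available
    pending_pods: int      # rough proxy for current demand

def pick_cluster(clusters: list[ClusterStats],
                 cpu_request: float, mem_request_gib: float) -> str:
    def score(c: ClusterStats) -> float:
        # Prefer clusters with the most headroom and the least queued demand.
        headroom = min(c.free_cpu / cpu_request, c.free_mem_gib / mem_request_gib)
        return headroom / (1 + c.pending_pods)

    candidates = [c for c in clusters
                  if c.healthy
                  and c.free_cpu >= cpu_request
                  and c.free_mem_gib >= mem_request_gib]
    if not candidates:
        raise RuntimeError("no healthy cluster can fit this workload")
    return max(candidates, key=score).name
```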

Federation also drives operational efficiency. Interacting directly with dedicated Peloton clusters posed challenges during decommissioning, provisioning, secret rotations, and maintenance. With Federator abstracting Kubernetes batch clusters, new Kubernetes clusters can be added or removed from the region without involving the DSW team.
Mounting NFS
On the Peloton orchestrator, DSW sessions used dedicated hosts with an NFS cluster mount via a low-level initialization service. Although Kubernetes offers native NFS support through CSI drivers and plugins, Uber’s early Kubernetes deployment didn’t allow such installations. We maintained the existing approach but eliminated the need for dedicated host groups, opting for a single host mount across the entire Kubernetes fleet.
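In practice, the difference comes down to how the session pod declares its volume. The sketch below, again using the Kubernetes Python client, contrasts the host-mount approach we use today with what a native NFS volume would look like; the paths and server name are placeholders.

```python
# Placeholder paths and server; real mount locations are internal to Uber.
from kubernetes import client

# Current approach: every host in the fleet already mounts the NFS export,
# so the pod simply bind-mounts that host path.
host_mount = client.V1Volume(
    name="dsw-nfs",
    host_path=client.V1HostPathVolumeSource(path="/mnt/dsw-nfs"),
)

# With native NFS support (a CSI driver or the in-tree NFS volume type),
# the pod could reference the NFS server directly, with no host-level setup.
native_nfs = client.V1Volume(
    name="dsw-nfs",
    nfs=client.V1NFSVolumeSource(server="nfs.example.internal", path="/exports/dsw"),
)
```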

Checkpointing User-Installed Packages
We aimed to make the migration as non-disruptive as possible. Moving an interactive workload between container orchestrators required restarting the container, which could disrupt the user experience. We explored Docker® checkpointing features and Kubernetes volume snapshots but found these solutions either unavailable or unsuitable for our performance needs.
Understanding user pain points revealed that repeatedly setting up the development environment after restarts was a major hassle. To address this, we decided to checkpoint installed packages so they persisted across restarts. However, users install packages through several interfaces, such as pip and Poetry for Python and different tools for R.
We focused on monitoring the package installation directories, using a novel method of debouncing inotify events to track installations and uninstallations, and storing this information on NFS. Upon session restart, missing packages are automatically reinstalled during bootstrapping. This approach keeps disruption minimal: only the in-memory state is lost, while notebooks continue running smoothly with all necessary packages intact.
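The sketch below shows the debouncing idea in Python, using the watchdog library as a stand-in for our internal inotify tooling. A single pip or Poetry install touches thousands of files, so instead of reacting to every event, the watcher waits for a quiet period and then snapshots the installed distributions to a manifest on NFS. The paths, quiet period, and manifest format are illustrative, not DSW's actual values.

```python
# Sketch of debounced package-directory watching; paths, the quiet period,
# and the manifest format are illustrative, not DSW's actual values.
import json
import threading
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

SITE_PACKAGES = "/home/user/.local/lib/python3.10/site-packages"
MANIFEST = "/mnt/nfs/dsw/session-123/packages.json"  # persisted on NFS
QUIET_PERIOD_S = 5.0                                 # flush after 5s of silence

class DebouncedPackageTracker(FileSystemEventHandler):
    def __init__(self):
        self._lock = threading.Lock()
        self._timer = None

    def on_any_event(self, event):
        # An install emits a burst of inotify events; restart the timer on
        # each one and only snapshot once the burst settles down.
        with self._lock:
            if self._timer:
                self._timer.cancel()
            self._timer = threading.Timer(QUIET_PERIOD_S, self._snapshot)
            self._timer.start()

    def _snapshot(self):
        # Record currently installed distributions so the bootstrap step can
        # reinstall them after the next restart.
        from importlib.metadata import distributions
        pkgs = sorted({f"{d.metadata['Name']}=={d.version}" for d in distributions()})
        with open(MANIFEST, "w") as f:
            json.dump(pkgs, f, indent=2)

observer = Observer()
observer.schedule(DebouncedPackageTracker(), SITE_PACKAGES, recursive=True)
observer.start()
```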

Accessible Compute Metadata Through UI
The Peloton UI offers features not found in the Kubernetes UI, such as job and resource pool views. Because it's backed by Apache Cassandra®, which offers far greater capacity than etcd®, the Peloton UI supports 30-day historical retention. These strengths make it effective at providing comprehensive observability and management.
DSW users relied on the Peloton UI for detailed Job insights in the Peloton stack. To overcome the limitations of the Kubernetes UI, we implemented near-real-time data replication from etcd to Cassandra. This solution allowed us to support Kubernetes clusters within the same, rebranded Peloton UI, called the Compute UI, preserving and extending essential functionality for Kubernetes environments.
With the Compute UI, users can view detailed information about individual pods, including their current state and logs for running and terminated pods. Compute UI also provides a clear overview of Job statuses, displaying the current state of Jobs, listing all associated pods, and showing their respective states. The UI offers a comprehensive view of the user’s resource pools, including current allocation and reservation metrics. It also lists running and terminated Jobs within each resource pool.
By moving workload states from etcd to Cassandra, we achieved higher data retention and overcame etcd’s storage limitations. This solution ensures users maintain access to critical observability features, matching the Peloton experience and enabling a transparent migration.
Sandbox and Log Viewing Experience
The migration from Peloton to Kubernetes significantly changed the logging and sandbox browsing experience for DSW users. Our goal was to maintain continuity while enhancing accessibility, addressing Kubernetes’ limitations, such as ephemeral container logs and limited historical data access.
We introduced the Browse Sandbox feature, which allows users to view the active file system and access logs in real-time. This was achieved by integrating the Kubernetes API server’s pod/exec API for file system exploration and Kubelet’s container logs API for streaming logs in the Compute UI. For terminated pods, we implemented an archival strategy, zipping and uploading sandbox content and logs to TerraBlob for future access.
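For reference, both capabilities map to standard Kubernetes APIs, which the Kubernetes Python client exposes as shown in this sketch. The pod, namespace, and container names are placeholders, and our Compute UI integration goes through the API server rather than this client directly.

```python
# Sketch of the two APIs behind Browse Sandbox, using the official Kubernetes
# Python client; pod, namespace, and container names are placeholders.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

# File-system exploration via the pod/exec API.
listing = stream(
    core.connect_get_namespaced_pod_exec,
    name="dsw-session-123", namespace="dsw",
    command=["ls", "-la", "/workspace"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
print(listing)

# Container logs, served by the Kubelet behind this API; follow=True would
# stream them live instead of returning a snapshot.
print(core.read_namespaced_pod_log(
    name="dsw-session-123", namespace="dsw",
    container="dsw-session", tail_lines=200))
```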
Development and Deployment Challenges
Initially, the plan was to keep the entire environment on NFS. However, an environment, whether Python or R, consists of many small files, so bootstrapping environments was tedious and consumed a lot of IOPS.

To address this, we pivoted to keeping only the list of additional packages installed during the session's lifetime on NFS, instead of the entire environment.
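The corresponding bootstrap step then replays that list on startup, reinstalling only what the base image doesn't already provide. This is a minimal sketch that matches the hypothetical manifest format used in the tracking sketch above; the path is a placeholder.

```python
# Sketch of the bootstrap step that replays the package manifest after a
# restart; the manifest path and format are illustrative.
import json
import subprocess
import sys
from importlib.metadata import distributions

MANIFEST = "/mnt/nfs/dsw/session-123/packages.json"

def reinstall_missing_packages():
    with open(MANIFEST) as f:
        wanted = set(json.load(f))  # e.g. {"pandas==2.2.1", ...}
    present = {f"{d.metadata['Name']}=={d.version}" for d in distributions()}
    missing = sorted(wanted - present)
    if missing:
        # Install only the packages the base image doesn't already provide.
        subprocess.run([sys.executable, "-m", "pip", "install", *missing], check=True)
```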
We also had to design a new restart API that notifies the end user, checkpoints the session running on Peloton, shuts it down, and respawns it on Kubernetes.
Impact
The DSW workloads are among Uber’s first batch workloads to completely migrate to Kubernetes. The migration opened up opportunities for better resource sharing across teams, an easier path to the cloud, and cost savings. This effort involved migrating several thousand interactive sessions for more than 2,000 users without any major disruption or user complaints.
As Kubernetes at Uber matures, we'll adopt first-class support for NFS through CSI drivers or the nfs-common Debian package on the Kubelet hosts. Then, workloads will become truly portable.
Conclusion
As we modernize infrastructure, enhance reliability, and refine features, DSW platform adoption has increased. This blog discussed how we completed a large batch job workload migration by innovatively using existing technology and developing essential tools. We hope these insights will benefit others undertaking similar migrations.
Acknowledgments
Our continuous effort to provide world-class data exploration tools wouldn’t be possible without the entire Data Science Workbench team: the leadership team members Aasha Medhi, Sandeep Kumar, and Vijay Mann; the engineering team members Anshita Vishwa, Divya Biyani, Mehtab Alam, Praveen Muthusamy, Sparsh Bansal, Sayan Pal, and Yashi Upadhyay; and the compute team members Amit Kumar, Gaurav Kumar, Guru Chaitanya Ganta, Monojit Dey, Rishabh Mishra, and Zeel Patel.
Apache®, Apache Cassandra®, Apache Spark®, Apache Mesos®, and Mesos logo are registered trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Docker® is a registered trademark of Docker, Inc in the United States and/or other countries. No endorsement by Docker, Inc is implied by the use of this mark.
Google Cloud Platform™ is a trademark of Google LLC and this blog post is not endorsed by or affiliated with Google in any way.
Jupyter® and Jupyter® logo are registered trademarks of LF Charities in the United States and/or other countries. No endorsement by LF Charities is implied by the use of these marks.
Kubernetes®, etcd®, and Kubernetes® logo are registered trademarks of the Linux Foundation in the United States and/or other countries. No endorsement by The Linux Foundation is implied by the use of these marks.
Python® and the Python logos are trademarks or registered trademarks of the Python Software Foundation.
The R logo® is a registered trademark of the R Foundation in the United States and/or other countries. No endorsement by The R Foundation is implied by the use of this mark.
RStudio® is a registered trademark of Posit Software in the United States and/or other countries. No endorsement by Posit Software is implied by the use of this mark.

Sayan Pal
Sayan Pal is a Senior Software Engineer on the Uber Data Platform team in Bangalore, India. He focuses on batch workloads, UX, and operational aspects in the Data Science Workbench.

Rishabh Mishra
Rishabh Mishra is a Senior Software Engineer on the Uber Compute Platform team, based in Bangalore, India. He specializes in managing stateless and batch workloads on Kubernetes.
Posted by Sayan Pal, Rishabh Mishra