July 20, 2023

How Uber Optimized Cassandra Operations At Scale

Runtian Liu

Jaydeepkumar Chovatia

Gopal Mor

Introduction

Each day, the Uber app moves millions of people around the world and delivers tens of millions of food and grocery orders. Each trip or delivery order depends on multiple low-latency, highly reliable database interactions. For more than six years, Uber has run open-source Apache Cassandra® as a database service that powers a variety of mission-critical OLTP workloads at Uber scale, with millions of queries per second and petabytes of data.

As we scaled the Cassandra fleet and onboarded critical use cases, we faced numerous operational challenges. This blog describes how we debugged the problems that appear when running Cassandra at scale. It also shows how small, incremental changes to a service compound into significant reliability gains at scale.

Summary

Open-source Apache Cassandra was introduced as a managed service in 2016 to power Uber’s core services. As the service grew over time, we were faced with daunting challenges.

This blog is divided into two sections. The first part summarizes the architecture of the Cassandra deployment within Uber; the second part covers the production challenges we faced as we scaled our Cassandra fleet and how we tackled them.

A Managed Service

Cassandra is run as a managed service by the Cassandra team within Uber, and the team is responsible for the following:

Implementing new features in Cassandra and contributing them back to the community
Integrating Cassandra into Uber’s ecosystem, such as the control plane, configuration management, observability, and alert management
Fixing critical bugs and contributing the fixes to the community
Providing a one-stop shop for Cassandra as a managed solution for Uber’s application teams
Ensuring 99.99% availability and 24/7 support for Uber’s application teams
Modeling data for our partner teams
Guiding teams on best practices while using Cassandra

Scale

Tens of millions of queries per second
Petabytes of data
Tens of thousands of Cassandra nodes
Thousands of unique keyspaces
Hundreds of unique Cassandra clusters, ranging from 6 to 450 nodes per cluster
Clusters that span multiple regions

Architecture

At a high level, a Cassandra cluster spans regions, with the data replicated between them. Cluster orchestration and configuration are powered by “Odin”, Uber’s in-house stateful management system.

Cassandra Framework

The Cassandra framework is an in-house technology powered by our stateful control plane, Odin, and it is responsible for the lifecycle of running Cassandra in Uber’s production environment. The framework adheres to Odin’s standards and focuses specifically on Cassandra’s functionality. It hides the complexity of how Cassandra is supposed to function behind one-click operations, such as seed node selection, rolling restart, capacity adjustments, node replacement, decommission, node start/stop, and nodetool commands.

Cassandra Client

Both the Go and Java open-source Cassandra clients have been forked and integrated with the Uber ecosystem. The clients discover their initial contact points through the service discovery mechanism, so there is no need to hardcode Cassandra endpoints. They also provide additional observability (e.g., capturing query fingerprints and mapping key DDL/DML features) that helps with debugging production issues and with future decision-making.

Service Discovery

Service discovery plays a crucial role in a large-scale deployment. Through the integration of Uber’s in-house service discovery and stateful platform, each Cassandra cluster is uniquely identified, and its nodes can be discovered in real time.

Whenever a Cassandra node changes its status (such as Up, Down, or Decommission), the framework notifies service discovery of the change. Applications (i.e., Cassandra clients) use service discovery as their first contact point to connect to the associated Cassandra nodes. As the nodes change their status, service discovery adjusts the list accordingly.
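As a rough illustration of this flow, the Python sketch below models a discovery registry in memory. The class name ServiceDiscovery, its methods, and the status strings are hypothetical names chosen for the example; they are not the API of Uber’s service discovery or of the Cassandra clients.

```python
from typing import Dict, List


class ServiceDiscovery:
    """Toy in-memory registry; a stand-in for a real discovery service."""

    def __init__(self) -> None:
        # cluster name -> {node address: last reported status}
        self._status: Dict[str, Dict[str, str]] = {}

    def notify(self, cluster: str, node: str, status: str) -> None:
        """Called by the (hypothetical) framework on every status change."""
        self._status.setdefault(cluster, {})[node] = status

    def contact_points(self, cluster: str) -> List[str]:
        """Nodes a client may use as initial contact points: only those that are up."""
        nodes = self._status.get(cluster, {})
        return sorted(node for node, status in nodes.items() if status == "UP")


discovery = ServiceDiscovery()
discovery.notify("trips", "node-1", "UP")
discovery.notify("trips", "node-2", "UP")
discovery.notify("trips", "node-2", "DECOMMISSION")  # node-2 leaves the list
print(discovery.contact_points("trips"))
```

Keeping only the last reported status per node is the simplest possible model; a production registry would also need to handle stale or missing reports.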

Workloads

Cassandra As A Service powers a variety of workloads at Uber, each serving its own use cases. The workloads range from read-skewed to mixed to write-skewed, so it is impossible to characterize them with a single pattern. One commonality is that the majority of them use Cassandra’s TTL (time-to-live) feature.

Challenges at Scale

Since the Cassandra service’s inception at Uber, it has continued to grow year-over-year, and more critical use cases have been added to the service. As we scaled, we were hit with significant reliability challenges. Some of the challenges are highlighted below.

Unreliable Node Replacement

Reliable node replacement is an integral part of any large-scale fleet. Daily node replacement predominantly occurs due to the following two reasons:

Hardware failures
Fleet optimization

There are less common situations that trigger substantial node replacement, such as changes to the deployment topology or disaster recovery.

Cassandra’s graceful node replacement can be done simply by decommissioning the existing node (nodetool decommission) and then adding a new node. In practice, we ran into a few problems:

Node decommission stuck forever
Node addition failing intermittently
Data inconsistency

These issues do not occur with every node replacement, but even a small failure rate can stall our entire fleet and pile operational overhead onto teams. For example, a 95% success rate means 5 failures out of every 100 node replacements. If 500 nodes are replaced every day, that is 25 manual operations a day, roughly the workload of 2 engineers dedicated only to recovering from these ad-hoc failures. Over time, the problem would have become so severe that we would have had to pause most of our automation, stalling other initiatives; we could only have continued emergency operations that could not be avoided, such as replacements after hardware failure.
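The arithmetic behind this estimate is simple enough to sketch in a few lines of Python. The function name is ours, chosen for illustration; the inputs are the 500 replacements per day and the success rates quoted in this article.

```python
def expected_manual_ops(replacements_per_day: int, success_rate: float) -> float:
    """Expected number of replacements per day that need manual recovery."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return replacements_per_day * (1.0 - success_rate)


# 95% success rate versus 99.99% success rate, at 500 replacements per day.
for rate in (0.95, 0.9999):
    ops = expected_manual_ops(500, rate)
    print(f"success rate {rate:.2%}: about {ops:.2f} manual operations per day")
```

At a 95% success rate this works out to roughly 25 manual operations per day, which is why even a few percentage points of reliability matter at fleet scale.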

This is how we attacked this primary problem.

Issues Due to Cassandra Hints

Cassandra does not clean up hint files for orphan nodes. Say node N1 is a legitimate member of the ring and has stored hint files locally for a peer that used to be part of the Cassandra ring but no longer is. Even in this case, N1 does not purge those hint files. To make matters worse, when N1 is decommissioned, it transfers all of these orphan hint files to its successor. Over time, the orphan hint files kept growing, resulting in terabytes of garbage hint files.

On top of that, the decommission code path has a rate limiter whose speed is inversely proportional to the number of nodes. So, in a really large cluster, a decommissioning node that needs to transfer terabytes of hint files to its successor could take multiple days.

To solve this problem, we changed a few things in Cassandra and the ecosystem:

Proactively purged the hint files belonging to orphan nodes in the ring
Dynamically adjusted the hint transfer rate limiter (hinted_handoff_throttle_in_k), so in case of a huge backlog, the hint transfer would finish in hours instead of lingering for days
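To see why the backlog and the cluster size both matter, here is a minimal Python sketch of the transfer-time arithmetic. It assumes a simplified model in which the configured throttle is divided by the number of peer nodes (node count minus one); the exact divisor, the function names, and all numbers are assumptions for illustration, not measurements or code from our fleet.

```python
import math


def effective_rate_kb_per_s(throttle_kb_per_s: float, node_count: int) -> float:
    """Assumed model: the configured throttle is shared across the peer nodes."""
    peers = max(1, node_count - 1)
    return throttle_kb_per_s / peers


def transfer_hours(backlog_gb: float, throttle_kb_per_s: float, node_count: int) -> float:
    """Hours needed to ship a hint backlog at the effective rate."""
    backlog_kb = backlog_gb * 1024 * 1024
    return backlog_kb / effective_rate_kb_per_s(throttle_kb_per_s, node_count) / 3600


def throttle_for_deadline(backlog_gb: float, node_count: int, deadline_hours: float) -> int:
    """Smallest throttle (KB/s) that drains the backlog within the deadline."""
    backlog_kb = backlog_gb * 1024 * 1024
    peers = max(1, node_count - 1)
    return math.ceil(backlog_kb * peers / (deadline_hours * 3600))


# Hypothetical inputs: 50 GB of hints, a 100-node cluster, a 6-hour target.
needed = throttle_for_deadline(backlog_gb=50, node_count=100, deadline_hours=6)
print(f"throttle needed: {needed} KB/s")
print(f"resulting transfer time: {transfer_hours(50, needed, 100):.2f} hours")
```

Under this model, doubling the number of peers doubles the transfer time for a fixed throttle, which is the effect a dynamically adjusted throttle is meant to counteract.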

These changes improved our node replacement reliability significantly and reduced the replacement time by an order of magnitude. Soon, however, we realized that we were hitting the same problem again.

The team was unsure whether the above solution had actually worked. Upon digging further, we discovered that the decommission step was intermittently erroring out, most commonly because of some other parallel activity, such as a rolling restart during regular fleet upgrades.

Unfortunately, the control plane could not probe Cassandra for the decommission state, because no such JMX metrics were exposed. We observed a similar pattern in the node bootstrap phase. We identified this gap and improved the Cassandra bootstrap and decommission code paths to expose their state through JMX. With this additional knowledge, our control-plane layer can probe the current status of the decommission or bootstrap phase and take the necessary action instead of blocking forever.
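The control-plane side of this idea can be sketched as a bounded polling loop. In the Python sketch below, get_state is a hypothetical callable standing in for a JMX read, and the state names (IN_PROGRESS, COMPLETED, FAILED, TIMED_OUT) are invented for the example; the actual attribute names exposed by Cassandra may differ.

```python
import time
from typing import Callable

TERMINAL_STATES = {"COMPLETED", "FAILED"}  # hypothetical state names


def wait_for_operation(
    get_state: Callable[[], str],
    timeout_s: float,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the operation reaches a terminal state or the deadline passes.

    Returns the terminal state, or "TIMED_OUT" so the caller can retry,
    alert, or roll back instead of blocking forever.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            return "TIMED_OUT"
        sleep(poll_interval_s)


# Usage with a canned sequence of states in place of a real JMX read.
states = iter(["IN_PROGRESS", "IN_PROGRESS", "FAILED"])
result = wait_for_operation(lambda: next(states), timeout_s=60, sleep=lambda _: None)
print(result)
```

The important design choice is that every outcome, including a timeout, is returned to the caller as an explicit value that automation can act on.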

With the above change and a few more fixes, our node replacement became 99.99% reliable and completely automated, and the median replacement time was reduced significantly. Thanks to these improvements, our automation now replaces tens of thousands of nodes in just a few weeks!

The Error Rate of Cassandra’s Lightweight Transactions

A few business use cases rely heavily on Cassandra’s Lightweight Transactions (LWT) at scale. Every other week, those use cases suffered elevated error rates caused by a recurring LWT error.

It was a general belief that Cassandra’s LWT was unreliable. To put an end to this degradation of our business once and for all, our team started a focused effort.

This error can occur when there is more than one pending range, and one way to get there is to trigger multiple node replacements at the same time. We thoroughly analyzed our entire control plane for such a scenario and fixed a few corner cases. Eventually, we exhausted all our control-plane scenarios. We then moved our focus to Cassandra, and further analysis revealed that a single Cassandra node believed there were two token range movements, while the majority believed there was only one. We added a metric in Cassandra, followed by an alert, to probe various stats and logs when the issue happened, and we were lucky to catch a Gossip exception on the culprit node, thrown because DNS resolution failed in the code.

That exception led us to the root cause. When a new node (N2) is brought up as part of a replacement, we need to supply the leaving node’s hostname (N1) to the Cassandra JVM as cassandra.replace_address_first_boot=N1. Even after N2 successfully joins the ring, the Gossip code path on N2 continues to resolve N1’s IP address. As expected, at some point N1’s DNS resolution throws an exception, leaving Cassandra’s caches on N2 out of sync with the other nodes. At that point, restarting N2 is the only solution.

We improved the error handling inside the Gossip protocol. As a result, Cassandra’s LWT became robust, and an issue that used to occur every couple of weeks has not been seen even once in the last twelve months or so!
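The general principle, containing a lookup failure instead of letting it disturb in-memory state, can be illustrated in Python. This is not the Cassandra patch: the EndpointResolver class and its fallback-to-last-known-address policy are assumptions made purely for the example.

```python
import socket
from typing import Callable, Dict, Optional


class EndpointResolver:
    """Resolves hostnames and tolerates lookup failures (illustrative only)."""

    def __init__(self, resolve: Callable[[str], str] = socket.gethostbyname) -> None:
        self._resolve = resolve
        self._last_known: Dict[str, str] = {}

    def address_of(self, hostname: str) -> Optional[str]:
        """Return the current address, or the last known one if lookup fails."""
        try:
            address = self._resolve(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError. The failure is
            # contained here, so the caller's state is left untouched.
            return self._last_known.get(hostname)
        self._last_known[hostname] = address
        return address


# Usage with a fake resolver: the first lookup succeeds, later ones fail,
# as happens once a replaced host's DNS record is removed.
calls = {"count": 0}

def flaky_lookup(hostname: str) -> str:
    calls["count"] += 1
    if calls["count"] > 1:
        raise OSError("name resolution failed")
    return "10.0.0.1"

resolver = EndpointResolver(resolve=flaky_lookup)
print(resolver.address_of("n1.example"))
print(resolver.address_of("n1.example"))
```

Whether falling back to a cached address is appropriate depends on the caller; the point of the sketch is only that the exception does not propagate into unrelated state.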

Repair is Like Compaction

One complaint we heard from our stakeholders was data inconsistency caused by data resurrection, which in turn was due to sluggish anti-entropy (Cassandra repairs).

Anti-entropy (Cassandra repair) is important for every Cassandra cluster because it fixes data inconsistencies. Frequent data deletions and downed nodes are common causes of data inconsistency. There are a few open-source orchestration solutions that trigger repair externally, but at Uber we wanted to rely less on a control plane-based solution. Our belief was that repair should be an integral part of Cassandra itself, very much like Compaction. With that goal in mind, we set out to build repair orchestration inside Cassandra itself that repairs the entire ring, one node after another.

At a high level, a dedicated thread pool is assigned to the repair scheduler. The repair scheduler inside Cassandra maintains a new replicated table under the system_distributed keyspace. This table stores the repair history for all the nodes, such as when each node was last repaired. The scheduler picks the node(s) whose last repair is the oldest and continues the orchestration to ensure that every table and all of its token ranges are repaired. The algorithm can also run repairs simultaneously on multiple nodes, and it splits the token range into sub-ranges, with the necessary retries to handle transient failures. Over time, the automatic repair has become so reliable that it runs as soon as we start a Cassandra cluster, like Compaction, and does not require any manual intervention.
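The three ideas in this description (picking the node repaired longest ago, splitting a token range into sub-ranges, and retrying transient failures) can be sketched in Python as follows. All function names are hypothetical, the repair history is modeled as a plain dictionary rather than a replicated table, and wrap-around token ranges are deliberately ignored; the scheduler inside Cassandra is considerably more involved.

```python
from typing import Callable, Dict, List, Tuple

TokenRange = Tuple[int, int]


def pick_next_nodes(last_repaired: Dict[str, float], parallelism: int = 1) -> List[str]:
    """Nodes with the oldest last-repair timestamp go first (ties broken by name)."""
    ordered = sorted(last_repaired, key=lambda node: (last_repaired[node], node))
    return ordered[:parallelism]


def split_range(start: int, end: int, parts: int) -> List[TokenRange]:
    """Split (start, end] into contiguous sub-ranges of near-equal size."""
    if parts < 1 or end <= start:
        raise ValueError("need parts >= 1 and end > start")
    parts = min(parts, end - start)
    step, extra = divmod(end - start, parts)
    bounds = [start]
    for i in range(parts):
        bounds.append(bounds[-1] + step + (1 if i < extra else 0))
    return list(zip(bounds[:-1], bounds[1:]))


def repair_with_retry(
    sub_range: TokenRange,
    repair_fn: Callable[[TokenRange], None],
    max_attempts: int = 3,
) -> bool:
    """Run repair_fn, retrying when it raises RuntimeError (a stand-in for a transient failure)."""
    for _ in range(max_attempts):
        try:
            repair_fn(sub_range)
            return True
        except RuntimeError:
            continue
    return False


history = {"node-a": 300.0, "node-b": 100.0, "node-c": 200.0}
for node in pick_next_nodes(history, parallelism=2):
    for sub_range in split_range(0, 1000, parts=4):
        repaired = repair_with_retry(sub_range, lambda r: None)
        print(node, sub_range, "ok" if repaired else "failed")
```

Smaller sub-ranges keep each repair session short, so a transient failure costs only one small retry rather than a restart of the whole range.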

Because this fully automated repair scheduler lives inside Cassandra, there is no dependency on the control plane, which reduced our operational overhead significantly. It also brought our p99 repair duration down from tens of days to a single-digit number of days!

Conclusion

In this article, we have showcased Cassandra’s genesis at Uber and the importance of incremental changes to stabilize a large-scale fleet. We have also taken a deep dive into the architecture and explained how the entire Cassandra service was designed and has been running successfully for so many years without impacting stringent SLAs. In the next part of this series, we will focus on how we are taking Cassandra’s reliability to the next level.

Header Image Attribution: The “Wolverine Peak summit ridge” image is covered by a CC BY 2.0 license and is credited to Paxson Woelber.
