
Introduction
M3DB is a scalable, open-source distributed time-series datastore. It is used as persistent storage for the M3 stack, which enables metrics-based observability for all Uber services. These metrics are subsequently used for generating real-time alerts.
It has been three years since we last deployed the open source version of M3DB into Uber’s cloud. Since that deployment, the open source project has added a series of performance enhancements to the server code.
Scope of Work
One of the reasons we have been running on an older version of M3DB despite the open source project introducing significant improvements is that we didn’t have a well-tested verification and rollout process for M3DB.
One of the primary goals of this effort was to figure out a repeatable and reliable way of rolling out M3DB into Uber’s cloud so that we are able to upgrade more consistently. We wanted to automate as many steps as possible while simultaneously coming up with a list of improvements for the next iterations of upgrading the M3DB fleet.
Scale & performance considerations
There is a basic integration suite available in the open source project, but it does not exercise Uber’s scale (1 billion write RPS, 80 million read RPS, and approximately 6,000 servers).
Given the scale at which M3DB operates within Uber, even a 5-10% regression in performance, efficiency, or reliability could pose a serious risk to our observability platform. Evaluating the correctness of newer versions of M3DB is meaningless unless they are tested at production-analogous scale.
Further, while operating at scale, the M3DB cluster is commonly in a dynamic state: nodes are added, removed, or replaced often, almost always triggered by underlying infrastructure automation. The integration suite therefore needs to test functionality, correctness, and performance under these chaotic scenarios as well.
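To make this concrete, here is a minimal sketch, in Go, of the kind of test the suite needs: load keeps flowing while a node replacement is in flight, and the test asserts that the write success rate stays above a threshold. The `replaceNode` and `writeSample` helpers, the cluster name, and the thresholds are hypothetical stand-ins, not code from the M3 project or our harness.

```go
package chaos

import (
	"testing"
	"time"
)

// replaceNode and writeSample are hypothetical hooks into the test harness;
// in practice they would call our infrastructure automation and the M3DB client.
func replaceNode(cluster, nodeID string) error       { return nil }
func writeSample(cluster string, seriesID int) error { return nil }

// TestWritesDuringNodeReplace keeps writing while a node replacement is in
// flight and fails if the observed write success rate drops below a threshold.
func TestWritesDuringNodeReplace(t *testing.T) {
	const (
		cluster        = "m3db-test-cluster"
		minSuccessRate = 0.999
		testDuration   = 10 * time.Minute
	)

	done := make(chan error, 1)
	go func() { done <- replaceNode(cluster, "node-7") }() // trigger the topology change

	var total, failed int
	deadline := time.Now().Add(testDuration)
	for i := 0; time.Now().Before(deadline); i++ {
		total++
		if err := writeSample(cluster, i%100000); err != nil {
			failed++
		}
	}

	if err := <-done; err != nil {
		t.Fatalf("node replace failed: %v", err)
	}
	if rate := 1 - float64(failed)/float64(total); rate < minSuccessRate {
		t.Fatalf("write success rate %.4f fell below %.4f during replace", rate, minSuccessRate)
	}
}
```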
Quantifying concrete gains for Uber workloads
The changes in the open source version have shown performance improvements, but have never been tested in an environment as large and complex as Uber’s. There are no numbers available to indicate the exact gains we would see by moving our workloads to the latest version. Given the effort involved in vetting and releasing newer versions of M3DB, we need to justify the ROI before undertaking it.
So part of the testing effort for a newer version includes running it under Uber production-analogous traffic and comparing key performance indicators against the version currently running in production, to see how the newer open source version holds up at Uber’s scale.
Rollout challenges
Given the large cluster sizes (approximately 100 clusters with 5,500 nodes in total) and the high allocation per node (currently 28 cores and 230 GB of memory per node), rolling out any change to M3DB also presents challenges in terms of reliability and operability. We want a release process that is safe, stable, and requires the least amount of human effort or intervention.
Rollout Strategy
We have around one hundred M3DB clusters running in production, sharded by retention period and availability zone. Each cluster can have hundreds of nodes, and the vast majority average around 100 nodes per cluster. While the clusters themselves can be upgraded in parallel, within a cluster we have the added constraint that only one node can be upgraded at a time, to avoid data loss, since writes within Uber are performed at a two-thirds majority quorum.
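As a rough sketch of that orchestration policy, assuming hypothetical `listClusters`, `nodesOf`, and `upgradeNode` helpers in place of our actual deployment tooling: clusters proceed in parallel, while nodes within a cluster proceed strictly one at a time, so that at most one replica of any shard is unavailable and the two-of-three write quorum can still be met.

```go
package rollout

import "sync"

// Hypothetical helpers standing in for our deployment tooling and topology store.
func listClusters() []string                 { return []string{"m3db-zone-a-48h", "m3db-zone-b-30d"} }
func nodesOf(cluster string) []string        { return []string{"node-1", "node-2", "node-3"} }
func upgradeNode(cluster, node string) error { return nil }

// UpgradeFleet upgrades clusters in parallel, but nodes within a cluster
// strictly one at a time: with a replication factor of 3 and a two-replica
// write quorum, at most one replica of any shard may be down at once.
func UpgradeFleet() {
	var wg sync.WaitGroup
	for _, cluster := range listClusters() {
		wg.Add(1)
		go func(cluster string) {
			defer wg.Done()
			for _, node := range nodesOf(cluster) {
				if err := upgradeNode(cluster, node); err != nil {
					return // halt this cluster's rollout; other clusters continue
				}
				// In practice we also wait for the node to bootstrap and for its
				// shards to report healthy before moving on to the next node.
			}
		}(cluster)
	}
	wg.Wait()
}
```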
Testing Strategy
We have considered and implemented two broad testing strategies:
- Offline mode, where we run M3DB under various simulated, synthetic test loads and benchmark key client performance indicators such as correctness, latency, and success rate. Additionally, we benchmark server performance characteristics such as server response time, bootstrap duration, tick duration, and data persistence across restarts and replacements.
- Online mode, where we run M3DB under duplicated, sampled production traffic and watch for performance and correctness regressions relative to the existing production clusters.
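As a rough illustration of the online mode, the sketch below deterministically samples production writes by series ID and duplicates them to a shadow cluster running the candidate version. The `Datapoint` type, the `forwardToShadow` hook, and the 5% sample rate are illustrative assumptions, not our actual ingestion code.

```go
package shadow

import "hash/fnv"

// Datapoint is a simplified stand-in for a single metric write.
type Datapoint struct {
	SeriesID  string
	Timestamp int64
	Value     float64
}

// forwardToShadow is a hypothetical hook that writes to the shadow cluster
// running the candidate M3DB version.
func forwardToShadow(dp Datapoint) {}

// sampleRate controls what fraction of production series is mirrored.
const sampleRate = 0.05

// MaybeMirror samples deterministically by series ID, so a mirrored series is
// always mirrored; reads on the shadow cluster then stay comparable to
// production reads for that subset of series.
func MaybeMirror(dp Datapoint) {
	h := fnv.New32a()
	h.Write([]byte(dp.SeriesID))
	if float64(h.Sum32())/float64(^uint32(0)) < sampleRate {
		forwardToShadow(dp)
	}
}
```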
Offline Mode
For the offline mode, we needed an M3DB benchmark tool that generates load at scale. The various scenarios against which the cluster needs to be tested are simulated by tweaking the input parameters of the benchmark tool or by tweaking the cluster setup as necessary. To mimic, or even surpass, our production load, we deployed the benchmark tool as a distributed service running on powerful, high-capacity machines. All benchmarks were run against a newly created M3DB cluster of 10 nodes.
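The benchmark tool is internal, but its shape is roughly the following sketch: a config of tunable knobs (series cardinality, per-worker write rate, worker count, duration) drives a pool of concurrent writers. The knob names and the `writeDatapoint` call are hypothetical placeholders for the real parameters and the real M3DB client session.

```go
package bench

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Config holds the knobs we tweak per test scenario; all values are illustrative.
type Config struct {
	Workers      int           // concurrent writers per benchmark instance
	Series       int           // unique series IDs to spread writes across
	WritesPerSec int           // target write rate per worker
	Duration     time.Duration // how long to sustain the load
}

// writeDatapoint is a hypothetical stand-in for a write through the M3DB client session.
func writeDatapoint(seriesID string, ts time.Time, value float64) error { return nil }

// Run fans the configured load out across workers and reports the failure count.
func Run(cfg Config) {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for w := 0; w < cfg.Workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ticker := time.NewTicker(time.Second / time.Duration(cfg.WritesPerSec))
			defer ticker.Stop()
			deadline := time.Now().Add(cfg.Duration)
			for t := range ticker.C {
				if t.After(deadline) {
					return
				}
				id := fmt.Sprintf("bench.series.%d", rand.Intn(cfg.Series))
				if err := writeDatapoint(id, t, rand.Float64()); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	fmt.Printf("benchmark finished, %d failed writes\n", failed)
}
```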
Test Scenarios
a. Benchmark an M3DB cluster at the older version – We run multiple benchmark tests against the older version and record the latencies and correctness of the M3DB cluster to establish baseline numbers.
b. Run topological operations – Run the benchmark suite while topological operations (node additions, removals, and replacements) are taking place within the cluster.
c. Roll out the new version – Run the benchmark suite while the new version is being deployed.
d. Benchmark the newer version – As with the older version above, we run the exact same benchmark tests against the newer version and collect the same metrics.
e. Benchmark correctness – Run the benchmark first in a write-only mode and record what was written to a file; later, run the benchmark in a verification mode, which performs reads once the deployment of the newer version is complete.
Points “a” and “b” above are run once when the cluster is on the older version of M3DB and once when the cluster is on the newer version.
Points “c” and “d” are required during a rollout, to cover the in-transit test scenarios.
Point “e” covers correctness, which is an essential requirement (a minimal sketch of this flow follows the list below):
- When an upgrade happens, the old server flushes all of its data to disk and the new server version starts reading that same data. We need to make sure that the new node is able to read the data flushed by the older version.
- We need to benchmark correctness in cases where the new version holds the data in the buffered state as well as the flushed state.
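Here is a minimal sketch of that write-then-verify flow, assuming the write-only pass has already dumped its records to a JSON file, and using a hypothetical `readDatapoint` helper in place of the real M3DB client read path.

```go
package verify

import (
	"encoding/json"
	"os"
	"time"
)

// Record captures what the write-only pass wrote, so the verification pass
// can check it after the upgrade, covering both buffered and flushed data.
type Record struct {
	SeriesID  string    `json:"series_id"`
	Timestamp time.Time `json:"timestamp"`
	Value     float64   `json:"value"`
}

// readDatapoint is a hypothetical stand-in for a read through the M3DB client.
func readDatapoint(seriesID string, ts time.Time) (float64, error) { return 0, nil }

// Verify replays the recorded writes as reads against the upgraded cluster and
// returns the IDs of series whose values no longer match what was written.
func Verify(recordFile string) ([]string, error) {
	data, err := os.ReadFile(recordFile)
	if err != nil {
		return nil, err
	}
	var records []Record
	if err := json.Unmarshal(data, &records); err != nil {
		return nil, err
	}

	var mismatches []string
	for _, r := range records {
		got, err := readDatapoint(r.SeriesID, r.Timestamp)
		if err != nil || got != r.Value {
			mismatches = append(mismatches, r.SeriesID)
		}
	}
	return mismatches, nil
}
```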