Uber uses MySQL as the underlying database engine for Schemaless and Docstore, our distributed databases. By default, MySQL uses the most popular InnoDB engine, a B+Tree structure for data storage. MyRocks is a MySQL storage engine that integrates with RocksDB, an open source project. The RocksDB store is based on the log-structured merge-tree (or LSM tree) and is optimized for fast storage and combines outstanding space and write efficiency with acceptable read performance.
The Uber Storage Platform team has migrated all Schemaless instances and some Docstore instances to MyRocks since 2019. In this post, we are going to talk about the journey of our migration to MyRocks.
Schemaless/Docstore are serving business-critical use cases across multiple lines of business and platforms e.g. rides, eats, grocery, maps, money movement etc. Both store tens of petabytes of data and serve tens of millions of requests/second. With the growth of Uber’s business, these numbers are only getting larger and the disk space has become the bottleneck: some nodes alone contain more than several terabytes, which makes it harder for some day to day operational tasks, such as data resync and backup.
Disk usage was becoming the bottleneck resource and causing CPU and memory resources to be underutilized. This was leading to overall higher total cost of ownership (TCO) for running Schemaless. It was thus critical for us to address this issue and bend the cost curve.
One way to address such an issue is splitting nodes to distribute disk space, however that is not only time consuming but also runs into the sharding limit on some instances, meaning node splitting is not a long term sustainable solution for us given our scale and growth. We tried to use InnoDB table compression to alleviate disk pressure, which gave us only about 5% disk space saving since data has already been compressed in our query engine layer before being stored in the database and additional compression will not yield much.
While looking out for a solution, we did a benchmark with MyRocks, which achieved over 30% disk space saving with appropriate latency. One of the main reasons for this is that RocksDB does not use a fixed page size and hence can achieve efficient compression with least wastage. Also, InnoDB uses B+ tree index which can get fragmented leading to additional disk space usage.
Essentially MyRocks is just MySQL with the RocksDB storage engine. It has the same client protocols and there is no change required in our query engine layer and clients, so we reached the conclusion that MyRocks would be the best choice to support Uber’s fast-growing business resulting in an increase of database footprint.
The migration of data and nodes at such a large scale while maintaining strict SLAs along with consistency, availability, and transparency is always difficult. The challenges include:
- A detailed automatic migration strategy with no impact on downstream service or user and manual intervention, including the rollback plan
- Ensuring read performance is acceptable after migration
- Large scale of MySQL 8.0 upgrade before migration to adapt MyRocks limitations. We were running MySQL 5.7 to start with.
- Quickly workaround for any bugs in MyRocks
Upgrade and Migration
The topology of a Schemaless/Docstore instance has data divided into shards which are mapped to a configurable number of physical partitions. Each partition has multiple data copies within a single region and runs Raft consensus protocol for consistency. Each partition is also replicated across geographical regions.
To ensure the migration was transparent to the stakeholders without impact, we designed an incremental migration strategy, from single partition and to multiple partitions, from single region to all regions. Between each node migration, we validated the data integrity and monitored latency and error rate to make sure there was no performance degradation and all these steps were done automatically.
MySQL 8.0 Upgrade
Uber uses XtraBackup for data resync and backup. Xtrabackup started supporting MyRocks from the MySQL 8.0 version, so as the first step of the MyRocks migration process, we needed to upgrade our fleet to MySQL 8.0. Another motivation for the upgrade was because Docstore planned to use MySQL JSON features provided by MySQL 8.0.
We performed in-place MySQL upgrade with just binary replacement. The overall steps were as below:
- Upgrade one follower on single partition in single region
- Upgrade all followers on the same partition in the same region
- Upgrade leader on the same partition in the same region
- Repeat step 1-3 on all other partitions in the same region
- Repeat step 1-4 in another region
It took several months to upgrade MySQL from 5.6 to 5.7 in 2018 at Uber, but from MySQL 5.7 to 8.0, it took over a year. We spent lots of time working around the bugs we hit in MySQL 8.0 and performance degradation, which included:
- Higher memory usage in MySQL 8.0 compared to MySQL 5.7
- Different execution plan due to MySQL optimizer change, such as bug 104083
- MySQL upgrade took so long to populating data dictionary tables if there were too many tables
- Threads stuck waiting on THD::release_resources PS-7525
- InnoDB corruption of an index tree/tablespace free space
We simultaneously migrated MySQL to MyRocks on the nodes that have been upgraded to MySQL 8.0. Different from the MySQL 8.0 in-place upgrade, the MyRocks migration strategy was similar to the MySQL logical upgrade: we added the extra nodes in one region using Xtrabackup and started the migration process there.
The node migration included engine conversions on each table, and several validations to make sure the MyRocks node provides the same data consistency and query performance as InnoDB node.
- First we needed to build a node. We continued to use Xtrabackup as it has been used in Uber as a standard and proved itself faster and more reliable than other ways such as mysqldump.
- To make sure there were no corrupted or missing tables during the migration, we collected checksum for each table, and the number of tables.
- With RocksDB bulk load mode, we altered all user tables to RocksDB Engine using multiple threads.
- After all tables were altered, we collected the same information as step 2, but on MyRocks.
- It was time to compare the information between InnoDB and MyRocks that we had collected from step 2 and 4. We saw some cases when altering a table to the RocksDB engine, if the MySQL got killed by out-of-memory during the waiting for handler commit phase, some tables were missing after MySQL restarted.
- Analyzing tables to regenerate the statistics on MyRocks. It is not guaranteed that every execution plan will still be the same after conversion, so analyzing the tables can avoid such execution plan changes, which may result in SQL performance degradation.
- To compare the performance, we replayed the live production queries on a MyRocks node to compare the result. We hit an issue where the same query returned a different result on InnoDB and RocksDB engine (PS-7722). We also replayed the queries to compare the execution plan and the execution time to ensure there was no performance degradation.
If the first node was converted to MyRocks successfully in a partition and passed all the validations, we will continue with the rest of the nodes in the same partition.
- Add the second MyRocks node as follower
- Add the third MyRocks follower node (at this point the partition has 3 MyRocks and 2 MySQL nodes)
- Promote the MyRocks as leader to take write traffic from services
- Remove the MySQL nodes one by one
We repeated the same steps on all partitions in parallel in one region, and after it was fully converted to MyRocks, we paused the process and compared the performance between 2 regions (e.g., latency, error rate, CPU/IO utilization, etc.). We still could rollback and failover the traffic into the MySQL region if needed.
If everything looked good in the MyRocks region, we could start the migration in another region. In this step, we would not add an extra MyRocks node, as everything has been verified in the previous region–we just replaced the MySQL node with MyRocks to save the extra capacity needed. Finally, we finished the MyRocks migration on an instance.
By migrating to MyRocks we saw disk space savings of over 30% across the board. However, what we learned from the migration is that not all the use cases are good candidates that can be migrated to MyRocks. We observed higher CPU usage and increased disk IO utilization on some instances as a trade-off of disk space saving.
The whole upgrade and migration took longer than the initial planned timeline. One of the reasons is MyRocks is still considered to be a very new database engine; there is no mature community from whom we can draw support, and we need to either wait for the patch from Percona or spend time to research and find our own workarounds whenever we run into issues.
It took several years to finish the Schemaless migration. We have also migrated some Docstore instances since then. Compared to Schemaless, we have seen much larger disk space savings with Docstore, since the data is not compressed before being stored. We plan to continue migrating Docstore to MyRocks.
We plan to focus on tuning MyRocks to ensure higher efficiency and performance for our dominant Schemaless and Docstore workloads. We also plan to solidify our overall backup recovery story for MyRocks.
If you are interested in working on large-scale distributed storage systems and taking on the challenge and responsibility of powering Uber’s products worldwide, please explore and apply to the open Storage team positions on Uber’s Careers page.