

Introduction
Uber leverages storage technologies to run its real-time operations. This includes storing online data in open-source databases like MySQL®, Apache Cassandra®, etcd®, and Apache Zookeeper®, as well as Uber’s homegrown storage solutions like Docstore and Schemaless hosted on top of Uber’s Stateful Platform.
Database backup recovery is vital for Uber’s business continuity and disaster recovery scenarios. This supports use cases like outage mitigation, recovery from data corruption, and forensics and regulatory compliance. It also supports production-like environment simulation for load testing, data integrity, and security.
Uber’s online storage solutions hold tens of petabytes of data serving critical business operations, and they handle millions to billions of requests per second across Uber’s global business verticals. At that scale, close to 100 PB of data is backed up from databases at different intervals, and TBs to PBs of data can be restored within a few minutes to a few hours.
This blog describes recent developments in Uber’s robust backup recovery system for online databases.
Challenges
Evolving Uber’s backup recovery system at scale presented multiple challenges:
- Rudimentary backup scheduling: Previously, backups were naively scheduled in a periodic, best-effort manner. Scheduling took no account of current network or host resources, prioritization, or rate limiting, and offered little observability across the system. This produced spiky backup workload utilization, causing repeated issues, and resuming backups after a failure was slow.
- Ad-hoc recovery process: The recovery process wasn’t systematically defined; it lived in unreviewed scripts and outdated runbooks. With periodic database system upgrades, the process was prone to going stale.
- No practice of exercising recovery drills: No restore process or systematic workloads were defined, and there were no regular drills to validate that recovery would work properly.
- New recovery objectives: In the past, the RPO (Recovery Point Objective) for some technologies ranged from 7-21 days, and the RTO (Recovery Time Objective) ranged from unknown to days for large databases. With multiple optimizations at the scheduling layer, across infrastructure layers, and in technology-specific backup and recovery processes, most of the database fleet now has reliable backups: RPO has been reduced to a range of 4-24 hours, and restore throughput has reached 300 TB per hour, bringing RTO down from days to hours.
This blog describes some of the optimizations we’ve implemented to address such challenges.
Architecture
The enhanced Backup Recovery system on top of Uber’s Stateful Platform provides a unified experience by abstracting all aspects of managing the backup restore ecosystem and its underlying components. These include centralized adaptive scheduling of backup workloads across the fleet of stateful databases. Each backup workload backs up the current database snapshot, propagates overall state, and monitors backups. On the restore side, the system conducts periodic recovery testing to verify the completeness and correctness of existing backups through the underlying restore automation. Together, this establishes a Continuous Backup, Continuous Recovery (CBCR) framework.
The Backup Recovery system uses a snapshot-based backup and recovery architecture to manage the Stateful fleet for improved disaster recovery.

The Backup Recovery system contains these components:
- Continuous Backup: A centralized orchestrator that schedules backups periodically across the stateful technologies with configurable database backup policies. It adaptively spreads out backup workloads to preserve network reliability and safety.
- Continuous Restore: A centralized orchestrator that tests restores periodically across stateful technologies with configurable database restore policies. It keeps restore workloads healthy by verifying the correctness and completeness of backup snapshots.
- Backup framework: A unified backup driver that integrates storage database–specific plugins to execute database snapshotting logic and upload to Uber’s blob storage.
- Restore framework: A unified restore driver that integrates storage database–specific plugins to download backups and load snapshots into the database.
- Technology mainstream workloads: Any stateful technology comes with mainstream components like a Technology Manager/Worker, database workload, and other sidecar workloads important for conducting business operations. Manager/Worker orchestrates and follows the goal-state-driven architecture.
- Uber Blobstore: An object storage back end built for large-scale uploads/downloads with configurable policies. Serves as a virtualization layer over multiple cloud storage options.
Continuous Backup
Continuous Backup takes reliable backups of all databases continuously. It allows backups to be taken at a sufficient rate while controlling storage costs and network usage.
Time Machine is a critical component of the Continuous Backup framework, serving as the centralized orchestrator for backups. It functions as a global, adaptive scheduler, periodically triggering backup workloads. This system uploads at least a few PBs of data daily. Uber’s networking bandwidth is shared across multiple Uber business-critical workloads in online/offline flows. Time Machine addresses the challenge of scheduling backups adaptively without disrupting the serving stack’s network usage.
Time Machine has an optimal selection engine with inbuilt client-side rate-limiting to decide which databases to back up in a given scheduling cycle. It considers multiple signals like:
- Backup freshness criteria
- Dynamic network and host infrastructure availability
- Historical trends of backup consumption
- Business peak/off-peak network utilization
- Backup rate-limiting policies per storage technology, prioritizing critical databases
- Geographical location, its availability, and utilization levels
This intelligent scheduling allows fleet-wide database backups to be efficiently and evenly distributed at scale. Upon making optimal selection decisions, Time Machine invokes database-specific triggers to initiate runtime backup workloads, providing reliable backups, integrity, and protection.
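The selection logic above can be pictured as a filter-sort-budget loop. This is a minimal sketch, not Uber’s implementation: the `Database` fields, the freshness SLO, and the per-cycle gigabyte budget (a stand-in for client-side rate limiting against available network capacity) are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Database:
    name: str
    priority: int              # lower value = more critical tier
    hours_since_backup: float  # backup freshness signal
    est_backup_gb: float       # historical backup-size trend

def select_for_cycle(databases, freshness_slo_hours, cycle_budget_gb):
    """Pick the most critical, most overdue databases that fit the
    per-cycle network budget (a stand-in for client-side rate limiting)."""
    # Filter: only databases whose last backup is stale against the SLO.
    stale = [db for db in databases if db.hours_since_backup >= freshness_slo_hours]
    # Sort: most critical tier first, then most overdue.
    stale.sort(key=lambda db: (db.priority, -db.hours_since_backup))
    selected, used_gb = [], 0.0
    for db in stale:
        if used_gb + db.est_backup_gb > cycle_budget_gb:
            continue  # defer to a later cycle instead of spiking the network
        selected.append(db)
        used_gb += db.est_backup_gb
    return selected
```

Deferred databases naturally rise to the top of later cycles as their staleness grows, which is what evens out the fleet-wide backup load.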

The backup process follows these steps:
- Continuous Backup (or Time Machine) has a global scheduler module that runs periodically to schedule backup workloads.
- Backup Scheduler in each periodic cycle invokes 3 sequential phases: discovery, selection, and trigger.
- The discovery phase runs a fleet-wide scan across the Stateful fleet to compile all possible databases to back up.
- The selection phase applies decision-making criteria (multiple filter and sort rules) to deduce the final set of eligible databases to back up. This makes the process adaptive to network and host infrastructure reliability.
- The trigger phase decides the backup mode (full or incremental) and invokes the backup workload per technology-specific behavior.
- The technology plugin interface is defined per storage technology for operations required while executing Continuous Backup phases. This includes database-level metadata, backup history, running backup workloads, and current fleet status.
- For each backup trigger, the corresponding technology-specific Manager/Workers, powered by Uber’s Stateful Platform, orchestrate the backup workload through to completion. They spawn on-demand backup sidecar workloads and share the desired state, then verify completeness and publish the actual state to the stateful metadata store.
- Each backup workload for any storage technology performs database snapshotting logic and uploads to Uber’s blob storage.
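The per-technology plugin interface described in the steps above can be sketched as an abstract class that the scheduler phases call into. The method names and the fake MySQL plugin below are hypothetical, chosen only to illustrate the shape of such an interface.

```python
from abc import ABC, abstractmethod

class BackupPlugin(ABC):
    """Per-technology hooks invoked across the Continuous Backup phases.
    Method names here are illustrative, not Uber's actual interface."""

    @abstractmethod
    def discover(self):
        """Discovery phase: return database descriptors from the fleet scan."""

    @abstractmethod
    def backup_history(self, database):
        """Selection phase: return recent backup records for this database."""

    @abstractmethod
    def trigger(self, database, mode):
        """Trigger phase: start a 'full' or 'incremental' backup workload
        via the technology's Manager/Worker; return a workload handle."""

class FakeMySQLPlugin(BackupPlugin):
    """In-memory stand-in showing how one technology would plug in."""
    def __init__(self):
        self.triggered = []

    def discover(self):
        return ["orders_db", "riders_db"]

    def backup_history(self, database):
        return []  # no prior backups recorded in this toy example

    def trigger(self, database, mode):
        self.triggered.append((database, mode))
        return f"workload:{database}:{mode}"
```

Keeping all technology-specific behavior behind one interface is what lets a single global scheduler drive MySQL, Cassandra, etcd, and Zookeeper backups uniformly.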
Backup Framework

The Backup framework is a generic backup driver integrated with technology-specific plugins to execute database snapshotting logic and upload efficiently and reliably to Uber’s blob storage. A runtime backup workload runs as an on-demand sidecar container in parallel to mainstream containers like technology node-worker and database containers. A node is a unit of stateful workload running as a containerized cgroup powered by the host fleet, following a goal-state-driven architecture.
Here’s how the backup flow works:
- For any database, a node worker serves as the primary workload responsible for orchestration and guarding database health. It shares the incoming backup’s goal state as input to the runtime backup workload and oversees the backup lifecycle.
- The backup driver orchestrates the extraction of database snapshot files to back up. It uploads them incrementally with rate-limiting and data integrity checks, checkpointing the upload file states to the backup index. It cleans them up post-upload to avoid spiking disk usage.
- The backup driver also pushes the backup index to blob storage, which helps deduplicate immutable files across snapshot versions to build incremental/differential backups. It references all files, forming a logically full backup.
- The backup driver’s monitoring hook terminates the workload when node resources are highly utilized, to avoid disrupting production traffic.
- The backup driver finishes by sharing the actual state back to the worker via the backup I/O volume, which eventually syncs it to the stateful metadata store.
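One way a backup index can deduplicate immutable files across snapshot versions is to record a content digest per file and upload only files whose digest changed, while the new index still references every file so the backup remains logically full. The sketch below assumes a simple filename-to-SHA-256 mapping; the real index format isn’t described in this post.

```python
import hashlib

def build_incremental_index(local_files, previous_index):
    """Build the next backup index plus the list of files to upload.

    local_files: {filename: file bytes} from the current snapshot.
    previous_index: {filename: sha256 hex digest} from the last backup.
    The returned index references *every* file (a logically full backup),
    but only changed or new files need to be uploaded."""
    new_index, to_upload = {}, []
    for name, data in local_files.items():
        digest = hashlib.sha256(data).hexdigest()
        new_index[name] = digest
        if previous_index.get(name) != digest:
            to_upload.append(name)  # changed or new: must be uploaded
        # unchanged immutable files are only referenced, not re-uploaded
    return new_index, to_upload
```

Because each index is self-describing, any single backup can be restored without replaying a chain of deltas.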
Here’s how the database snapshotting logic works for each technology:
- MySQL-based storage technologies, including MySQL® and Uber’s home-grown Docstore/Schemaless, use the Percona XtraBackup™ snapshot toolkit with a customized, efficient differential backup solution.
- Cassandra uses a Medusa™-like differential backup setup using the nodetool snapshot toolkit.
- etcd uses etcd-clientv3 to take a point-in-time snapshot.
- Zookeeper backs up the latest snapshot.<zxid> file.
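The per-technology tooling above roughly maps to the command dispatch below. The flags and paths are only approximations of the public tools; etcd is shown via the `etcdctl` CLI equivalent of the clientv3 snapshot call, and the Zookeeper entry is a placeholder for copying the latest snapshot.<zxid> file out of the data directory.

```python
def snapshot_command(technology, target_dir):
    """Map a storage technology to an approximation of its snapshot tooling.
    Flags and paths are illustrative, not Uber's actual invocations."""
    commands = {
        "mysql": ["xtrabackup", "--backup", f"--target-dir={target_dir}"],
        "cassandra": ["nodetool", "snapshot", "--tag", "backup"],
        "etcd": ["etcdctl", "snapshot", "save", f"{target_dir}/etcd.snap"],
        "zookeeper": ["cp", "snapshot.<zxid>", target_dir],  # placeholder file name
    }
    return commands[technology]
```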
Restore Framework
The Restore framework is technology-agnostic, like the Backup framework, and is designed to enable automated, consistent restoration. Its modular design makes the framework extensible and adaptable to different database architectures. Furthermore, the framework eliminates manual intervention, reduces recovery time, and minimizes the risk of human error in both in-place and out-of-place recovery scenarios.
The framework provides a generic driver with extensible database technology-specific plugins to define key components. The Manifest Provider plugin is customized to fetch a database-specific backup index manifest as described above, where each backup is associated with a detailed backup index. The backup index references all snapshot files from different backup types (full, incremental, or differential) and files needed during a restore. Database Loader defines the database ingestion logic tailored to the database architecture (MySQL, Cassandra, etcd) to bring the database to a usable state.
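The generic driver plus its two plugins might look like the following sketch. `ManifestProvider`, `DatabaseLoader`, and the in-memory fakes are illustrative names, not Uber’s actual interfaces.

```python
from abc import ABC, abstractmethod

class ManifestProvider(ABC):
    @abstractmethod
    def fetch_index(self, backup_id):
        """Return the backup index: every file the restore needs."""

class DatabaseLoader(ABC):
    @abstractmethod
    def load(self, files):
        """Ingest snapshot files and bring the database to a usable state."""

def restore(backup_id, provider, loader, download):
    """Generic restore driver: resolve the manifest, download every
    referenced file, then hand off to the technology-specific loader."""
    index = provider.fetch_index(backup_id)
    files = {name: download(name) for name in index}
    loader.load(files)
    return sorted(files)

# In-memory fakes showing how a technology would plug in.
class FakeProvider(ManifestProvider):
    def fetch_index(self, backup_id):
        return ["ibdata1", "binlog.000001"]

class FakeLoader(DatabaseLoader):
    def __init__(self):
        self.loaded = None

    def load(self, files):
        self.loaded = files
```

The driver never learns how any database ingests its files; swapping `FakeLoader` for a MySQL, Cassandra, or etcd loader is the whole extension story.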
Here’s how the database loading process works for each technology:
- MySQL-based restore processes extract and prepare the backup using tools like Percona XtraBackup so MySQL can start smoothly with the right data and settings.
- Cassandra restores entail downloading the backup files, typically SSTables, and loading them into the Cassandra engine.
- etcd/Zookeeper restores follow a similar process of placing backup snapshots into their designated database directories and leveraging their specific snapshot loader libraries.

The Restore framework’s integration with the Continuous Restore framework facilitates continuous validation and readiness for real-world situations.
Continuous Restore
The Continuous Restore framework confirms data completeness and correctness by frequently validating restored backups. The framework provides intelligent scheduling with configurable restore test cadences for both regular and ad-hoc validation. The scheduling logic considers hardware resource availability to prevent production impact.
Restore testing strategies include dedicated database testing and random database testing. With dedicated testing, it uses predefined databases with known data to perform detailed, end-to-end restoration and validation. With randomized testing, production-scale databases are selected with defined characteristics for broader validation of restore workflows under real-world conditions.
Post-restore, the framework performs robust data verification, including file integrity validations and byte-by-byte data comparisons for known dedicated databases. It also collects detailed insights, including restore success rates, restoration rates against the backup footprint, data integrity validation results, and multiple performance-oriented signals, which are reported for monitoring and analysis.
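The file-integrity side of this verification can be sketched as a digest comparison between restored files and the digests recorded in the backup index. The index layout here (filename to SHA-256) is an assumption for illustration.

```python
import hashlib

def verify_restore(restored_files, backup_index):
    """Post-restore file-integrity check: every file referenced by the
    backup index must exist and hash to the recorded digest.
    Returns a list of (filename, reason) failures; empty means healthy."""
    failures = []
    for name, expected_digest in backup_index.items():
        data = restored_files.get(name)
        if data is None:
            failures.append((name, "missing"))
        elif hashlib.sha256(data).hexdigest() != expected_digest:
            failures.append((name, "corrupt"))
    return failures
```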
The restore evaluation process follows four phases:
- Discovery/Selection: Identifies databases eligible for restore testing based on criteria like tier, size, and the last successful backup. Applies filtering and prioritization rules to determine which databases will be restored, resulting in an even distribution of workloads. This also takes into account the restore evaluation strategy—dedicated or random—providing flexibility in the evaluation framework.
- Trigger: Executes the restore process by creating temporary clusters for testing. It sets up a new cluster and triggers a restore, leveraging the extensible restore framework. This phase acts like the system under test, responsible for the actual restoration from a backup.
- Validation: Runs validations based on restore evaluation strategy. For dedicated databases, data is compared against predefined datasets. For random databases, backup integrity is validated.
- Reporting: Once a test completes, the system generates a detailed report and cleans up any temporary resources created during the process.
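The four phases above compose into a small pipeline. The callables below (select, restore, validate, report, cleanup) are illustrative stand-ins for the components just described; the point of the sketch is that temporary test resources are torn down whether or not validation succeeds.

```python
def restore_evaluation_cycle(fleet, select, restore, validate, report, cleanup):
    """One Continuous Restore cycle: Discovery/Selection -> Trigger ->
    Validation -> Reporting, tearing down the temporary test cluster
    whether or not validation succeeds."""
    reports = []
    for db in select(fleet):      # discovery/selection picks eligible databases
        cluster = restore(db)     # trigger: temporary cluster restored from backup
        try:
            passed = validate(db, cluster)    # strategy-specific validation
            reports.append(report(db, passed))
        finally:
            cleanup(cluster)      # reporting phase also cleans up temp resources
    return reports
```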

Benefits of the Continuous Restore framework include:
- Operational resilience: Systems are restore-ready, reducing the risk of downtime.
- Compliance and audit support: Generates automated reports to meet recovery and compliance requirements.
- Data assurance: Validates data integrity and restoration processes to promote reliability.
- Actionable insights: Provides visibility into restore performance and highlights potential areas for improvement.
By continuously validating restore processes, our framework reinforces disaster recovery readiness, safeguards critical data, and strengthens system resilience at scale.
Conclusion
Backup Recovery for the stateful fleet at Uber is crucial for Uber’s business continuity and disaster recovery plan. We hope to explore additional system developments in future blogs.
Cover Photo Attribution: “What’s in the database” by Lotta Holmström is licensed under CC BY-SA 2.0.
Apache®, Apache Cassandra®, and Apache Zookeeper® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
etcd® is a registered trademark of the Linux Foundation in the United States and/or other countries. No endorsement by The Linux Foundation is implied by the use of these marks.
MySQL is a registered trademark of Oracle® and/or its affiliates.
Percona™ and Percona XtraBackup™ are trademarks of Percona, LLC.

Arjav Jain
Arjav Jain is a Senior Software Engineer on the Storage Platform at Uber. He has deep knowledge in building and operating database backup recovery at scale for highly distributed storage solutions for relational/non-relational database technologies. He led the efforts for the evolution of the Continuous Backup system architecture as well as similar significant contributions on the restore path and SLOs. He is actively involved in various database backup recovery initiatives and use cases.

Shivam Vijay
Shivam Vijay is a Senior Software Engineer on the Backup and Restore Platform at Uber. He’s been deeply involved in building backup and restore flows from the ground up, facilitating strict SLAs for RPO and RTO for recoveries across data stores.

Debadarsini Nayak
Debadarsini Nayak is a Senior Engineering Manager, providing leadership for various storage technologies.

Mohammed Khatib
Mohammed Khatib is a Senior Staff Software Engineer at Uber, contributing to the reliability and performance of the storage infrastructure. He specializes in disaster recovery planning and implementation, as well as improving the efficiency of large-scale storage systems. His 15 years of experience span various aspects of distributed and storage systems, from pub-sub messaging and power capping techniques to non-volatile memory. Dr. Mohammed holds a PhD in Computer Science and has (co)authored 15+ peer-reviewed publications and 5 US patents.

Ramnik Jain
Ramnik Jain is a Senior Staff Software Engineer at Uber, leading, contributing, and providing guidance across various storage technology initiatives across the Uber Stateful Platform and multiple other cross-functional initiatives.
Posted by Arjav Jain, Shivam Vijay, Debadarsini Nayak, Mohammed Khatib, Ramnik Jain