Locking Down the Fleet: Encryption at Rest and Disk Isolation at Scale
August 14 / Global
Introduction
At Uber, we run databases on our Stateful Platform, Odin. Teams of engineers are responsible for database technologies such as MySQL®, Redis®, Apache Cassandra®, Schemaless, and others on behalf of Uber's 5,000 software engineers. Odin provides 3.45 exabytes of disk space across 100,000 hosts. This blog post describes how we achieved fleet-level encryption at rest while providing transparency for stakeholders.
Background
Historically, Uber relied on local disks for data storage on the Stateful Platform (Odin). These local disks were merged into a single RAID0 device with a single file system shared between multiple workloads.
This is shown in Figure 1.
As servers got larger, we colocated more workloads on each host to ensure high resource utilization and capacity efficiency. In some cases, we have more than 100 workloads running on a single host!
To provide disk isolation and per-workload encryption, we moved from a shared file system to one volume per workload. This move had a large impact. On the negative side, we lost the ability to share disk space between workloads as an elastic resource, and we needed more complex logic to manage the creation, maintenance, and cleanup of volumes.
However, the benefits vastly outweighed the negatives. With dynamic volume management, we gained the ability to tune each workload's file system and performance configuration, such as inode ratios and read-ahead values.
We also moved the volume configuration higher up the stack, such that the Odin teams can change volume layouts independently of other teams. This reduces the lead time for customizations.
And finally, we enabled encryption-at-rest compliance, with encryption strategies that are configurable per workload, even if they’re co-located on the same host. We can have unique encryption keys per workload.
A dynamic volume manager also reduced host layout variety and ensured compatibility with future Kubernetes® initiatives.
Choosing an Encryption Strategy
Data at rest can be encrypted at the application, file system, or block level. Application-level encryption offers flexibility but is complex and resource-intensive. File system encryption is more transparent but still exposes metadata and requires configuration. Block-level (full disk) encryption secures everything, including metadata and free space, with a simple, uniform setup.
At the platform level, it is more practical to remain agnostic about the importance of data, applying a single encryption standard as broadly as possible. That’s why we favor block-level encryption over the selective flexibility of higher layers.
The Linux® kernel provides device-mapper functionality, which allows mapping physical block devices onto higher-level virtual block devices. This functionality is used by LVM to provide logical volume management. For encryption purposes, the Linux kernel provides dm-crypt, a transparent disk encryption subsystem that's part of the device-mapper infrastructure. To standardize the physical storage of the encrypted data, Linux has a standard on-disk format, LUKS, which facilitates compatibility among distributions and provides secure management of encryption keys.
The primary purpose of LUKS is to transform a block device/volume into an encrypted container, called a LUKS container. A LUKS container includes the LUKS headers and a bulk encrypted area.
For all operations with LUKS containers, we use the cryptsetup utility, which provides convenient mechanisms to set up disk encryption.
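As a rough illustration of these building blocks (not our production tooling), the sketch below shells out to cryptsetup to turn a block device into a LUKS container and open it as a mapped device. The device path, mapping name, and key file are hypothetical.

```python
import subprocess


def run(*args: str) -> None:
    """Run a command and raise if it fails."""
    subprocess.run(args, check=True)


def create_luks_container(device: str, name: str, key_file: str) -> str:
    """Format `device` as a LUKS2 container and open it under /dev/mapper/<name>."""
    # Write the LUKS header and set up the bulk encrypted data area.
    run("cryptsetup", "luksFormat", "--type", "luks2",
        "--batch-mode", "--key-file", key_file, device)
    # Open the container; dm-crypt exposes the decrypted view as a virtual block device.
    run("cryptsetup", "open", "--key-file", key_file, device, name)
    return f"/dev/mapper/{name}"


if __name__ == "__main__":
    # Hypothetical logical volume and per-workload key file.
    mapped = create_luks_container("/dev/vg0/workload-a", "workload-a-crypt",
                                   "/etc/keys/workload-a.key")
    print("encrypted device available at", mapped)
```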
Logical Volume Per Workload
On the RAID0 layout, when a workload is scheduled on a host, we create a directory on the shared file system and bind-mount this directory to the workload containers.
On LVM hosts, we first need to create a logical volume (LV), then apply the LUKS (encryption) formatting, create a file system, and finally mount the device for the workload. This limits the workload's disk space to the underlying logical volume, rather than the entire disk.
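A minimal sketch of that provisioning sequence, assuming a hypothetical volume group vg0, ext4, and per-workload key files; the real Odin host agent is considerably more involved:

```python
import subprocess


def run(*args: str) -> None:
    subprocess.run(args, check=True)


def provision_workload_volume(vg: str, workload: str, size: str,
                              key_file: str, mountpoint: str) -> None:
    lv_path = f"/dev/{vg}/{workload}"
    mapped = f"/dev/mapper/{workload}-crypt"
    # 1. Carve a dedicated logical volume out of the volume group.
    run("lvcreate", "--name", workload, "--size", size, vg)
    # 2. Wrap it in a LUKS container with a per-workload key.
    run("cryptsetup", "luksFormat", "--type", "luks2", "--batch-mode",
        "--key-file", key_file, lv_path)
    run("cryptsetup", "open", "--key-file", key_file, lv_path, f"{workload}-crypt")
    # 3. Create a file system tuned for the workload (inode ratio shown as an example knob).
    run("mkfs.ext4", "-i", "16384", mapped)
    # 4. Mount it where the workload's container expects its data directory.
    run("mount", mapped, mountpoint)


# Example (hypothetical names and sizes):
# provision_workload_volume("vg0", "workload-a", "200G",
#                           "/etc/keys/workload-a.key", "/var/odin/workload-a")
```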
Disk usage can vary over time. Some use cases may grow, while others shrink. Some may have seasonality based on different events, such as the end of the fiscal year, or similar. On RAID0 hosts, because there’s no disk space isolation, this growth can largely be absorbed on the host, as these patterns are unlikely to be correlated across use cases. As one workload on the host reduces its disk usage, that space is immediately available for other workloads on the host to consume.
For a workload with its own logical volume, when utilization approaches the initially allocated size, we need to extend the volume and grow the file system to prevent running out of disk space. Symmetrically, if a workload reduces its disk usage, that disk space won't be available to other workloads until the relevant logical volume has been shrunk.
The size change needs to be propagated through the whole stack: LV, LUKS, and file system. While extending and shrinking the LV and LUKS layers are supported, shrinking a file system is more complex: ext4 supports it only when the file system is unmounted, and XFS, which we used at the time, doesn't support shrinking at all.
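In code form, an extension has to walk the same stack in order. A hedged sketch for an ext4-backed volume (an XFS volume would use xfs_growfs on the mount point instead of resize2fs); names and sizes are illustrative:

```python
import subprocess


def run(*args: str) -> None:
    subprocess.run(args, check=True)


def extend_workload_volume(vg: str, workload: str, grow_by: str, key_file: str) -> None:
    lv_path = f"/dev/{vg}/{workload}"
    mapped = f"/dev/mapper/{workload}-crypt"
    # 1. Grow the logical volume.
    run("lvextend", "--size", f"+{grow_by}", lv_path)
    # 2. Tell LUKS/dm-crypt to pick up the new size of the underlying device.
    run("cryptsetup", "resize", "--key-file", key_file, f"{workload}-crypt")
    # 3. Grow the file system online to fill the mapped device.
    run("resize2fs", mapped)  # for XFS: run("xfs_growfs", mountpoint)


# extend_workload_volume("vg0", "workload-a", "50G", "/etc/keys/workload-a.key")
```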
There are two modes of storage allocation on LVM: thin and thick provisioning. Thin provisioning mirrors the RAID0 setup, in the sense that each workload is allocated a file system with the full volume group size. The underlying volume will then grow and shrink with usage. This gives some operational simplicity, but can cause unpredictable performance, and there’s still no disk space isolation. Thick provisioning, by contrast, pre-allocates the full storage upfront, ensuring predictable performance but potentially wasting unused space.
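For concreteness, the two modes differ mainly in how the logical volume is carved out of the volume group. A hedged sketch, assuming a hypothetical volume group vg0:

```python
import subprocess


def run(*args: str) -> None:
    subprocess.run(args, check=True)


# Thick provisioning: the full size is reserved in the volume group up front.
run("lvcreate", "--name", "workload-a", "--size", "200G", "vg0")

# Thin provisioning: volumes draw blocks from a shared pool and can be overcommitted.
run("lvcreate", "--size", "900G", "--thinpool", "pool0", "vg0")
run("lvcreate", "--thin", "--virtualsize", "1T", "--name", "workload-b", "vg0/pool0")
```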
We went with thick provisioning for these reasons:
- File system overhead: File systems add ~1% metadata overhead. Creating multiple VG-sized file systems significantly reduces usable disk space, especially problematic with large datasets or limited storage.
- Management complexity: Thin volumes require managing a thin pool and complicate IO isolation, as relocating the pool can affect performance.
- Error handling: When out of space, thin volumes trigger EIO errors, which can disrupt applications.
- Free space monitoring: Workloads can’t accurately assess free space on thin volumes, complicating capacity planning and shard assignment.
- Performance measurement: Kernel-level overhead makes thin-volume performance harder to measure and analyze.
- Shrinking operations: Thin volumes don’t auto-shrink—manual TRIM operations are needed, adding maintenance and possible performance impact.
The complexities associated with file system overhead, management, error handling, performance measurement, and maintenance can outweigh the advantages in certain scenarios, leading us to explore alternative solutions.
Odin Control Plane
Odin has a control plane that, among other things, is responsible for assigning resources to workloads. When a new workload needs to start, the Odin control plane decides where to run it. The relevant component in this context is DDMS (Dynamic Disk Management System), the system that decides how much disk space is set aside on a host for each individual workload. This matters because we want to balance two concerns: workloads must not run out of disk space, and we must not waste resources at the fleet level.

The control plane maintains a forecast for each individual workload based on its historical disk usage. Applying this forecast ensures that each workload gets the resources it needs without requiring manual intervention from operators. As teams manage tens of thousands of workloads, this is a key capability for reducing toil. DDMS can also change how much disk space is assigned to an already-running workload, since, as described above, needs can change over time.
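The details of DDMS's forecasting model are beyond the scope of this post, but the shape of the decision is roughly: forecast usage, add headroom, and round up to the disk units the control plane reasons in. The sketch below is purely illustrative; the peak-based heuristic, the headroom factor, and the unit size are assumptions, not the actual DDMS logic.

```python
import math


def allocation_for(history_bytes: list[int], headroom: float = 1.2,
                   unit: int = 100 * 2**30) -> int:
    """Pick a disk allocation from historical usage.

    Hypothetical heuristic: take the observed peak, add headroom, and round up
    to whole "disk units" so the control plane can reason in coarse increments.
    """
    peak = max(history_bytes)
    target = peak * headroom
    return math.ceil(target / unit) * unit


# Example: a workload that peaked at 350 GiB gets 5 x 100 GiB units with 20% headroom.
history = [200 * 2**30, 310 * 2**30, 350 * 2**30]
print(allocation_for(history) // 2**30, "GiB")  # prints: 500 GiB
```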
For growing workloads, DDMS will increase the allocation until the workload can no longer fit on the host, either because of other workloads or because it has simply outgrown the host. In that case, DDMS will move the workload to a host where it fits.
For workloads that are shrinking on RAID0, no action needs to be taken, as the disk space is released immediately. This also means that the control plane doesn't need to know or care about host-level operations and can work at a high level of abstraction. At the same time, the host agent did no work to ensure that disk space was available for individual workloads.
Disk space isolation through thick provisioning shifted this paradigm: volume resizing is now a real-world operation that impacts workloads. The host agent and the control plane need to be more tightly coupled.
To Shrink or Not To Shrink
We chose not to support in-place volume shrinking. Shrinking the file system involves data copying and requires downtime for the workload. This means we need to balance the speed of the file system operation, which impacts the disruption budget for the entire cluster, against preventing noisy-neighbor issues for colocated workloads. Many factors affect how quickly we can shrink a file system: hardware capabilities, load on the system, how much disk space is free on the host, and so on. As a result, we have no stable, easy way to determine how quickly a file system can be resized. How we currently model and think about downtime fits better with treating the downtime as part of moving the workload. At the same time, we have begun our transition towards Kubernetes, and Kubernetes doesn't support in-place volume shrinking.
So, we had to remodel the disk autoscaling system so it stopped relying on the ability to shrink volumes. We wound up with a design where there’s a decentralized volume extension loop running as part of our host agent that monitors the free space on each LV and triggers an extension when necessary.
Validating the decision logic was challenging—we needed to balance space efficiency with avoiding disk exhaustion. Too many extensions add overhead, and workload sizes vary widely. Some technologies, like Elasticsearch® and Cassandra, rely on static file system size for shard or data decisions, which conflicts with dynamic volumes. To address this, we made volume extension logic configurable per technology.
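A simplified sketch of such a decentralized extension loop, with per-technology thresholds as the configuration knob; the policies, step sizes, and helper names are illustrative rather than Odin's actual agent code:

```python
import os
import time
from dataclasses import dataclass


@dataclass
class ExtensionPolicy:
    min_free_fraction: float  # extend when free space drops below this fraction
    step_bytes: int           # how much to grow per extension


# Hypothetical per-technology overrides.
POLICIES = {
    "default": ExtensionPolicy(min_free_fraction=0.15, step_bytes=50 * 2**30),
    "cassandra": ExtensionPolicy(min_free_fraction=0.30, step_bytes=100 * 2**30),
}


def needs_extension(mountpoint: str, policy: ExtensionPolicy) -> bool:
    st = os.statvfs(mountpoint)
    free = st.f_bavail * st.f_frsize
    total = st.f_blocks * st.f_frsize
    return free / total < policy.min_free_fraction


def extension_loop(volumes: dict[str, str], interval_s: int = 60) -> None:
    """`volumes` maps mountpoint -> technology name."""
    while True:
        for mountpoint, tech in volumes.items():
            policy = POLICIES.get(tech, POLICIES["default"])
            if needs_extension(mountpoint, policy):
                # In the real agent this would propagate through LV -> LUKS -> FS,
                # and respect the host-level assignment ceiling described in the
                # next section.
                print(f"extend {mountpoint} by {policy.step_bytes} bytes")
        time.sleep(interval_s)
```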
Extension Loop
We require a tighter coupling between the control plane and the host agent. Namely, as the host agent extends the logical volume to facilitate growth, we must ensure that this doesn’t violate the guarantee that a workload can grow up to its allocation.
As an example, consider the host in this figure, which comes with 6 disk units. The control plane assigns disk units based on its forecasting, and it never hands out the last disk unit. This reserved unit is called the host buffer, and it gives automation room to resolve disk growth issues before workloads run out of disk space. In this example, 1 unit is assigned to the blue workload and 2 units are assigned to the green workload. The disk extension loop is aware of this assignment, which means the blue workload can grow to 4 disk units without compromising the green workload, and the green workload can grow to 5 disk units without compromising the blue workload. Because our forecasting is of high quality, only workloads that behave abnormally can run out of disk space, and the blast radius of any misbehaving workload is contained to other abnormally behaving workloads on that same host.
A simplified view of the extension loop's logic can therefore be represented as shown in Figure 4.
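In code form, the ceiling the extension loop has to respect boils down to everything on the host minus what the control plane has promised to everyone else. A small sketch using the disk-unit example above:

```python
def extension_ceiling(host_units: int, assignments: dict[str, int], workload: str) -> int:
    """Max units `workload` may grow to without eating into other workloads' assignments."""
    others = sum(units for name, units in assignments.items() if name != workload)
    return host_units - others


assignments = {"blue": 1, "green": 2}   # 1 unit stays unassigned as the host buffer
print(extension_ceiling(6, assignments, "blue"))   # 4
print(extension_ceiling(6, assignments, "green"))  # 5
```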
Optimizing LUKS Extension
Our initial benchmarking of the effect of volume extensions on a workload's IO didn't show anything dramatic, so we started rolling out LVM to our fleet in a controlled, gradual fashion. The rollout started with less-critical workloads and grew in scope to include both more workloads and more critical ones. During this rollout, we started seeing elevated latencies, or more precisely IO stalls, for some of the workloads. We decided to pause the migration while root-causing this performance regression.
Access to host- and workload-specific metrics available fleet-wide allowed us to pinpoint the worst offenders, where we saw the largest IO stalls. It turned out that IO stalls are proportional to the amount of dirty pages and the speed of the drives on a host.
We dug into the issue and read the cryptsetup and dm-crypt code bases. As a result, we found suboptimal behavior in how cryptsetup performed the extension of a dm-crypt device. The culprit was how device suspend and resume operations were executed: by default, they always locked the file system and waited for a full flush of all dirty pages. While these actions are required for most modifications of a dm-crypt device, they can safely be skipped when only performing an extension. We proposed the optimization, and it was accepted and merged into the cryptsetup main branch.
Tuning Encryption
Our workloads have very different disk access patterns. During pre-production benchmarking, we observed that the standard encryption configuration caused unacceptable latencies for some of our technologies, so we started looking for optimizations that would address the problem.
Reading the cryptsetup man page revealed several options that could potentially improve encryption performance. After benchmarking each of them, we pinpointed two flags (no_read_workqueue and no_write_workqueue) that dramatically improved the situation for the affected workloads. These flags bypass the extra kernel encryption queues and perform encryption in an almost synchronous fashion.
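For reference, cryptsetup exposes these as --perf-* options in recent cryptsetup and kernel versions. A hedged sketch of opening a LUKS2 device with the workqueues bypassed; the device, mapping name, and key path are hypothetical:

```python
import subprocess


def open_with_sync_crypto(device: str, name: str, key_file: str) -> None:
    # no_read_workqueue / no_write_workqueue bypass dm-crypt's extra kernel queues,
    # so most crypto work happens inline with the issuing I/O.
    # --persistent (LUKS2 only) stores the flags in the header so later opens keep them.
    subprocess.run([
        "cryptsetup", "open",
        "--perf-no_read_workqueue",
        "--perf-no_write_workqueue",
        "--persistent",
        "--key-file", key_file,
        device, name,
    ], check=True)


# open_with_sync_crypto("/dev/vg0/workload-a", "workload-a-crypt",
#                       "/etc/keys/workload-a.key")
```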
Disabling the extra crypt queues had two more benefits:
- CPU time spent on encryption/decryption operations is accounted to the workload that initiated them.
- Since we use cpusets, the encryption/decryption operations have limited noisy-neighbor CPU impact on other workloads on the host.
These features aren’t foolproof—especially for reads. Even with no_read_workqueue, decryption may be offloaded because devices like NVMe complete reads in an interrupt context, where FPU use (needed for some algorithms) isn’t allowed. So, decryption is deferred to a separate thread.
These flags aren’t a cure-all. They help most with small I/O, but can hurt performance with large I/O. Benchmarking on target hardware is essential, as CPU architecture matters—what works well on x86 might not on ARM. In fact, on ARM, enabling crypt queues sometimes greatly improved performance.
Spiky Workloads
Some workloads grow steadily, making disk allocation predictable—but not all do. One of our heaviest write workloads saw disk usage spike due to overlap between long-running compaction and backup processes. Backups lock sstable files, preventing cleanup and temporarily doubling disk use.
In shared file systems, this was manageable. But in LVM, where shrinking isn’t practical, it meant allocating 2× capacity—unacceptable for large, high-throughput workloads.
Conclusion
The journey of enabling logical volume management on Odin was a challenging one, with a broad set of considerations. The migration itself was a herculean task where we continuously had to balance a safe, reliable rollout, with the need to close the migration as quickly as possible. After all, any ongoing migration puts a strain on the system and adds to the complexity. We took the opportunity to modernize our approach to host layout management, including preparing for future Kubernetes adoption through wrapping our LVM solution into the CSI interface. In the end, we stand with encryption-at-rest, more dynamic host layouts, and a better integration between host agent and control plane. This shows how technical host-level details can be tied directly to business outcomes and how desired business outcomes influence low-level decisions.
Cover Photo Attribution: Drawn by Johan Abildskov
Apache®, Apache Cassandra®, and the star logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Elasticsearch is a registered trademark of Elasticsearch BV.
Kubernetes® and its logo are registered trademarks of The Linux Foundation® in the United States and other countries. No endorsement by The Linux Foundation is implied by the use of these marks.
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
MySQL is a registered trademark of Oracle® and/or its affiliates.
Redis is a registered trademark of Redis Ltd. Any rights therein are reserved to Redis Ltd. Any use by Uber is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Uber.

Ivan Shibitov
Ivan Shibitov is a Senior Software Engineer at Uber, based in Aarhus, Denmark. He is part of the team building Odin, Uber’s stateful container orchestration platform. Ivan focuses on the host-level control plane, managing compute and storage resources, and occasionally diving into Linux internals.

Johan Abildskov
Johan Abildskov is a Developer Advocate II at Uber. He works in Aarhus, Denmark as part of the team that builds Odin, Uber’s stateful container orchestration platform. Johan’s daily focus is on enabling product and platform engineers to be as productive as possible.
Posted by Ivan Shibitov, Johan Abildskov