Zone-Failure-Resilient OpenSearch® at Uber
Senior Software Engineer
Introduction
ZFR (Zone Failure Resilience) is a non-negotiable requirement for distributed systems like OpenSearch® that power use cases across all tiers at Uber. To be considered zone-failure resilient, an OpenSearch deployment must be able to withstand the complete loss of a zone without impacting core functions such as querying and data ingestion.
This blog describes how we achieved ZFR by combining OpenSearch’s native shard-allocation awareness features with Uber’s in-house isolation-group infrastructure, built on top of Odin, Uber’s container orchestration platform.
Background
Uber requires OpenSearch clusters to survive a full-zone failure without losing query or ingestion capability. To achieve this, Uber combines OpenSearch’s shard allocation awareness features with Uber’s isolation group infrastructure, which ensures nodes are evenly and predictably distributed across failure domains. Using forced shard allocation awareness, OpenSearch avoids aggressive shard rebalancing when a zone goes down, preserving stability in surviving zones.
With at least 3 shard copies and 5 cluster manager nodes, the system can tolerate a full zone outage plus an additional node failure while still retaining quorum and preventing data loss.
Together, these mechanisms provide a robust, predictable, and resilient OpenSearch deployment model across Uber’s infrastructure.
Understanding Isolation Groups
An isolation group is a logical partitioning mechanism used to distribute the nodes of an OpenSearch cluster across different failure domains (zones or racks) in a more controlled and even way. Key properties of isolation groups are:
- Failure domain awareness. Isolation groups are designed to map cleanly onto physical failure domains such as zones or racks. Ideally, no two isolation groups share a zone; when they must, the scheduler applies a placement penalty.
- Role-level balancing. Nodes are balanced across isolation groups per role (like data nodes, cluster manager nodes). This ensures that critical roles aren’t accidentally concentrated in a single failure domain.
- Stable group membership. A node and its replacement node are guaranteed to belong to the same isolation group, even if the underlying physical host changes. This stability is important for consistent shard placement semantics.
At Uber, most technologies, including OpenSearch, use 3 isolation groups. Because each zone maps to exactly one isolation group, a single zone failure removes at most about one third of capacity.
The number of physical zones is typically larger than 3. Balancing nodes evenly across all physical zones directly would be complex, as we’d need even capacity across all zones, which is why isolation groups are introduced as an abstraction layer.
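To make the capacity arithmetic concrete, here is a minimal sketch of per-role balancing across isolation groups. The round-robin assignment, the group names, the node names, and both helper functions are assumptions for illustration only; they don't describe how Odin actually places nodes.

```python
from collections import Counter
from itertools import cycle

# Hypothetical isolation group names; the real identifiers are not stated here.
ISOLATION_GROUPS = ["ig-1", "ig-2", "ig-3"]


def assign_groups(nodes_by_role):
    """Round-robin each role's nodes across the isolation groups."""
    assignment = {}
    for role, nodes in nodes_by_role.items():
        # A fresh cycle per role keeps every role balanced independently.
        for node, group in zip(nodes, cycle(ISOLATION_GROUPS)):
            assignment[node] = (role, group)
    return assignment


def worst_case_loss(assignment, role):
    """Largest fraction of a role's nodes that sits in one isolation group."""
    counts = Counter(g for r, g in assignment.values() if r == role)
    return max(counts.values()) / sum(counts.values())


nodes = {
    "data": [f"data-{i}" for i in range(9)],
    "cluster_manager": [f"cm-{i}" for i in range(5)],
}
assignment = assign_groups(nodes)
print(worst_case_loss(assignment, "data"))             # 9 nodes split 3/3/3
print(worst_case_loss(assignment, "cluster_manager"))  # 5 nodes split 2/2/1
```

When a role's node count isn't a multiple of 3, as with the 5 cluster manager nodes in this example, the largest group holds slightly more than a third of that role.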

Figure 1: Shard allocation awareness can help with even distribution of shard copies across the isolation groups for baseline resilience against single-zone failure.
Shard Allocation Awareness
OpenSearch supports distributing copies of a shard evenly across logical groupings of nodes using shard allocation awareness. This is achieved by:
- Defining custom node attributes in opensearch.yml.
- Enabling shard allocation awareness based on those attributes at the cluster level.
Figure 2: The technology worker sidecar container fetches the node’s isolation group attribute and sets it in opensearch.yml.
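The two configuration steps can be sketched as follows. `node.attr.<name>` and `cluster.routing.allocation.awareness.attributes` are standard OpenSearch settings, but the attribute name `isolation_group`, the value `ig-1`, and the helper functions are assumptions chosen for illustration, not the names used in production.

```python
import json

# Assumed attribute name; the actual name used at Uber is not stated.
AWARENESS_ATTRIBUTE = "isolation_group"


def node_attribute_line(group):
    """Line a sidecar could write into opensearch.yml for one node."""
    return f"node.attr.{AWARENESS_ATTRIBUTE}: {group}"


def awareness_settings():
    """Request body for PUT _cluster/settings that enables awareness."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": AWARENESS_ATTRIBUTE
        }
    }


print(node_attribute_line("ig-1"))
print(json.dumps(awareness_settings(), indent=2))
```

The node attribute is static per node and read at startup, while the awareness setting is applied once at the cluster level.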
With this configuration, if there are 5 total copies of a shard (1 primary and 4 replicas), OpenSearch will place them across the 3 isolation groups in a pattern like 2, 2, and 1. The difference in shard copy counts between any two isolation groups will never be greater than one.
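The "never differs by more than one" property is plain integer division. The sketch below is only an illustration of the arithmetic; `spread_copies` is a hypothetical helper, and it doesn't predict which isolation group ends up with the smaller count.

```python
def spread_copies(total_copies, groups):
    """Split shard copies across groups so counts differ by at most one."""
    base, extra = divmod(total_copies, groups)
    # The first `extra` groups each receive one additional copy.
    return [base + 1 if i < extra else base for i in range(groups)]


print(spread_copies(5, 3))  # [2, 2, 1]
print(spread_copies(3, 3))  # [1, 1, 1]
```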
However, shard allocation awareness assumes a reasonably balanced pool of nodes behind each attribute value. If nodes aren’t balanced across isolation groups and the cluster can’t maintain even shard distribution due to resource constraints, some shard copies will remain unassigned. This is why even node distribution across isolation groups is a critical prerequisite.
At Uber, all OpenSearch indices are configured with a minimum of 2 replicas (3 total shard copies). Shard allocation awareness ensures that these 3 copies are distributed across 3 isolation groups, providing the baseline resilience needed to survive single-zone or single-group failures without data loss.
Forced Shard Allocation Awareness
Shard allocation awareness works well under normal conditions, but zone failure introduces additional challenges. Consider the case where all nodes belonging to a single isolation group go down.
OpenSearch detects missing shard copies and, by default, will try to redistribute them evenly across the remaining isolation groups. This can trigger large-scale data movement across surviving nodes, consuming substantial disk I/O, CPU, and network bandwidth. The resulting load can cause instability, increased query latency, or even cascading failures in the remaining zones.
To prevent this behavior, OpenSearch provides forced shard allocation awareness. The cluster is explicitly configured with the full set of expected attribute values (for example, all 3 isolation groups), not just the ones currently present. If one isolation group disappears (due to zone failure), OpenSearch knows that there are missing attribute values and refuses to over-allocate shards onto the remaining groups. Shards that belong to the missing group remain unassigned, and the cluster typically transitions to a yellow state instead of aggressively rebalancing.
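As a rough sketch, forced awareness adds one more cluster setting that lists every expected attribute value. The setting key pattern `cluster.routing.allocation.awareness.force.<attribute>.values` is standard OpenSearch; the attribute name `isolation_group` and the group names are assumptions for illustration.

```python
import json

AWARENESS_ATTRIBUTE = "isolation_group"     # assumed attribute name
EXPECTED_GROUPS = ["ig-1", "ig-2", "ig-3"]  # assumed group names


def forced_awareness_settings():
    """Request body for PUT _cluster/settings with forced awareness."""
    force_key = (
        "cluster.routing.allocation.awareness.force."
        f"{AWARENESS_ATTRIBUTE}.values"
    )
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": AWARENESS_ATTRIBUTE,
            # List all expected groups, including ones that may be offline.
            force_key: EXPECTED_GROUPS,
        }
    }


print(json.dumps(forced_awareness_settings(), indent=2))
```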
This approach has two key benefits:
- Protects surviving zones. The healthy zones avoid sudden, heavy rebalancing loads and remain stable and performant for existing traffic.
- Controlled recovery semantics. Unassigned shards are only allocated when:
  - The failed zone/isolation group recovers, and nodes with the same attribute value return, or
  - An administrator explicitly removes or updates the forced awareness configuration (for example, after adding new capacity in surviving zones and deciding to redistribute shards).
Forced shard allocation awareness lets us trade immediate full replication for cluster stability during a zone failure event.
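The trade-off can be illustrated with a simplified model of the per-group cap on copies of one shard. This is a sketch under stated assumptions, not OpenSearch’s actual allocation code: it assumes the cap is the copy count divided by the number of known groups, rounded up, and that each live group has enough nodes to host its share. Both function names are hypothetical.

```python
import math


def max_copies_per_group(total_copies, live_groups, expected_groups=None):
    """Cap on copies of one shard per group (simplified model)."""
    # Forced awareness counts every expected group, including missing ones.
    groups = set(live_groups) | set(expected_groups or [])
    return math.ceil(total_copies / len(groups))


def unassigned_after_failure(total_copies, live_groups, expected_groups=None):
    """Copies that cannot be placed on the live groups under the cap."""
    cap = max_copies_per_group(total_copies, live_groups, expected_groups)
    assignable = min(total_copies, cap * len(live_groups))
    return total_copies - assignable


live = ["ig-1", "ig-2"]                # ig-3 is down
all_groups = ["ig-1", "ig-2", "ig-3"]

# Without forced awareness the lost copy is rebuilt on the survivors.
print(unassigned_after_failure(3, live))              # 0
# With forced awareness the lost copy stays unassigned.
print(unassigned_after_failure(3, live, all_groups))  # 1
```

In this model, the single unassigned copy is what keeps the cluster yellow while sparing the surviving groups the recovery traffic.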
Zone and Node Failure Resiliency
The most critical resilience scenario is a zone failure immediately followed by an additional node failure in a different zone.
Data Node Resiliency
For data nodes, our design is straightforward. All OpenSearch/Elasticsearch® Tier 3 and below clusters at Uber maintain at least 3 copies of each shard. These copies are distributed across 3 isolation groups using shard allocation awareness.
In a zone and node failure scenario, a zone failure removes one isolation group and one shard copy. A subsequent node failure in a different zone may remove a second shard copy. The third shard copy, hosted in the remaining isolation group, is still available.
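The scenario can be walked through for a single shard with a small sketch. The node names, group names, and the `surviving_copies` helper are hypothetical; the placement assumes one copy per isolation group, as described above.

```python
def surviving_copies(placement, failed_group, failed_node):
    """Nodes still holding a copy after a group failure and a node failure.

    `placement` maps node name -> isolation group for one shard's copies.
    """
    return [
        node
        for node, group in placement.items()
        if group != failed_group and node != failed_node
    ]


# One copy of the shard in each isolation group.
placement = {"data-a": "ig-1", "data-b": "ig-2", "data-c": "ig-3"}

left = surviving_copies(placement, failed_group="ig-1", failed_node="data-b")
print(left)  # ['data-c']
```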
Cluster Manager Node Resiliency
With only 3 cluster manager nodes, losing 2 of them costs the cluster its quorum and brings the entire cluster to a standstill (a red state).
To address this, OpenSearch supports dynamically shrinking the voting configuration to the set of available cluster manager nodes via cluster.auto_shrink_voting_configuration: true.
We use this feature by running 5 cluster manager nodes instead of 3. A failure sequence with 5 cluster manager nodes will be as follows:
- Start with 5 cluster manager nodes, distributed across 3 isolation groups (and therefore across multiple zones).
- A zone goes down. Up to 2 cluster manager nodes may be lost, leaving at least 3 surviving.
- cluster.auto_shrink_voting_configuration kicks in and shrinks the voting configuration to these 3 available nodes.
- A new leader is elected with a quorum of 2 out of 3.
- If 1 more cluster manager fails afterward (the zone and node scenario), 2 nodes remain, which still meets the quorum of 2 out of 3 needed to elect and maintain a leader.
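The sequence above comes down to majority arithmetic, sketched below. This is a simplified model, not OpenSearch’s election logic: it assumes the voting configuration shrinks immediately after each failure, stays odd, and never drops below 3. The `simulate` helper and its parameters are hypothetical.

```python
def majority(voting_size):
    """Votes needed to win an election in a voting configuration."""
    return voting_size // 2 + 1


def simulate(failures, total=5, auto_shrink=True):
    """Apply node losses in order; record (live, voting_size, has_quorum)."""
    live, voting = total, total
    history = []
    for lost in failures:
        live -= lost
        quorum = live >= majority(voting)
        if quorum and auto_shrink:
            # Simplified: shrink to the largest odd size <= live, minimum 3.
            odd = live if live % 2 == 1 else live - 1
            voting = min(voting, max(3, odd))
        history.append((live, voting, quorum))
    return history


# Zone failure takes 2 cluster managers, then 1 more node fails.
print(simulate([2, 1]))                     # [(3, 3, True), (2, 3, True)]
print(simulate([2, 1], auto_shrink=False))  # [(3, 5, True), (2, 5, False)]
# Starting from 3 nodes, losing 2 leaves no quorum.
print(simulate([2], total=3))               # [(1, 3, False)]
```

Without shrinking, the second failure leaves 2 live nodes against a voting configuration of 5, which needs 3 votes, so quorum is lost in this model.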
Impact
Previously, using physical zone IDs led to yellow states in clusters because physical zones often have uneven node counts, and OpenSearch’s awareness logic would fail to find valid placements for every shard.
IGs (Isolation Groups) solved this by providing a balanced abstraction. Every IG is guaranteed to have the same number of nodes, ensuring 100% shard assignment and green cluster health. We also eliminated disk skew and hot nodes caused by physical zone asymmetry.
Conclusion
By combining isolation groups for physical placement and shard allocation awareness for logical placement, we’ve decoupled OpenSearch’s resilience from the complexities of physical data center topology. This architecture ensures that even when a zone failure is followed by an additional node failure, Uber’s search and analytics infrastructure remains stable, performant, and, most importantly, available.
Acknowledgments
We’d like to thank the Uber Odin Platform, which provided many out-of-the-box features that supported this project.
Cover Photo Attribution: “Life Jacket Ring” by nymphofox is licensed under CC BY 2.0.
OpenSearch is a registered trademark of LF Projects, LLC (under the Linux Foundation) in the United States and/or other countries. No endorsement by LF Projects, LLC is implied by the use of these marks.
Elasticsearch is a registered trademark of Elasticsearch B.V. in the United States and/or other countries. No endorsement by Elasticsearch B.V. is implied by the use of these marks.
Smit Patel
Senior Software Engineer
Smit Patel is a Senior Software Engineer on the Search Platform team at Uber. His work is focused on developing features that improve resiliency, performance, and experience for OpenSearch use cases.