May 4, 2026

The 5 Layers Every Cloud Commitment Depends On

Kamran Zargahi

VP of Engineering


Introduction

A long-term hyperscale cloud commitment is a distributed systems problem with a ten-figure price tag. Signing with a cloud platform is the beginning of a massive engineering initiative, one that requires fluency in a set of technical domains that many organizations only discover after the ink is dry.

A cloud commitment at this scale is like a construction project. The team that delivers it is the one that can stand on an empty lot and see the building. As an engineering sequence: you study the site before you pour the foundation, you size the utility feeds before you design the electrical, you resolve the structural load before you spec the mechanical, and you design the plumbing after you know where everything above it sits. Each layer constrains the one above it. Skipping that sequence is how projects crumble later.

Cloud infrastructure at this scale follows the same discipline across 5 physical layers. You start with regional and zonal topology: where you build and why, what the fault boundaries are, and what latency the geography imposes on your data-write paths. From there, power: what the facility draws, how redundancy is architected, and whether the hyperscaler's energy supply can sustain failures. Then comes the fit-for-purpose compute ecosystem: not what the provider offers by default, but what our workload profile requires in terms of regional load balancing, memory-to-core ratios, and silicon generation. That drives SKU and hardware selection, which leads to qualifying specific instance families. And finally, network topology and traffic routing: how packets move between zones, regions, and services, and what the egress cost structure looks like. Getting the sequence right is critical. In the following sections, I dive deeper into each step.

1. Regional and Availability Zone Topology

The first decision in any large cloud deployment is which regions to anchor in. This is a systems engineering decision with consequences that are expensive to reverse at scale. A region defines the blast radius for most classes of failure, and choosing one requires an honest accounting of what cross-region latency imposes on write paths, which is critical for stateful services. Measured Round-Trip Time (RTT) between nearby regions like us-east-1 and us-east-2 runs in the low single-digit milliseconds on the provider backbone. There's an important distinction in how inter-availability-zone and inter-region latency figures are reported: published RTT numbers frequently reflect NIC-to-NIC measurements at the physical layer, while application-observable latency is VM-to-VM and includes hypervisor jitter, noisy-neighbor effects, and switch and SmartNIC driver packet-processing overheads. Under contention in a shared multi-tenant environment, these can add several hundred microseconds to the long tail above the bare-metal baseline.
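
As a minimal sketch of what that NIC-to-NIC versus VM-to-VM distinction looks like in practice, the snippet below measures application-observed round-trip latency between two VMs with a simple TCP echo loop and reports the tail. The peer address, port, and sample count are placeholder assumptions; a real qualification run would use a purpose-built tool and pinned instance placement.

```python
import socket
import statistics
import time

PEER = ("10.0.2.15", 7)   # hypothetical echo endpoint in the other AZ or region
SAMPLES = 1000

def measure_rtt_us(peer, samples):
    """Measure application-observed RTT (VM-to-VM), not the published NIC-to-NIC figure."""
    rtts = []
    with socket.create_connection(peer, timeout=2) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(samples):
            start = time.perf_counter()
            sock.sendall(b"x")
            sock.recv(1)                                  # wait for the 1-byte echo
            rtts.append((time.perf_counter() - start) * 1e6)
    return sorted(rtts)

if __name__ == "__main__":
    rtts = measure_rtt_us(PEER, SAMPLES)
    print(f"p50 = {statistics.median(rtts):.0f} us")
    print(f"p99 = {rtts[int(0.99 * len(rtts))]:.0f} us")  # tail includes hypervisor jitter
```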

Beyond geographical considerations and latency measurement, it's important to remember that cloud regions aren't identical in their service offerings. Managed-service availability, SKU capacity depth, compliance certifications, and availability zone (AZ) count all vary from region to region, which can pose a challenge to quorum-based data architectures. For instance, you may find a managed database available in region A, but the same offering may not exist in region B, because the provider hasn't seen enough demand for it there. Service availability is often governed by a service matrix policy, an artifact that is critical to study before any commitment is made to a specific region.

Within a region, AZs are the fault isolation domains, typically sited 10 kilometers or more apart and interconnected by dedicated fiber. In our experience, measured inter-AZ RTT ranges from approximately 0.39 milliseconds to 2.42 milliseconds. That range is sufficient for synchronous replication, but the variance matters for stateful quorum-based designs, where write latency scales directly with synchronization RTT. The zone label problem compounds this further. One major hyperscaler maps physical AZs randomly to zone names per account, meaning us-east-1a in one account may map to a completely different physical facility than us-east-1a in another. And it goes deeper than naming. For instance, a given AZ may consist of a single building with several data halls, or many data halls spread across multiple buildings in the same region. The type of data halls also varies significantly, from traditional raised-floor designs to modern liquid-cooled facilities purpose-built for AI workloads. With that level of physical complexity, if you care about computational symmetry, and at Uber we do, landing the right capacity across that geographical location starts to feel like a game of Tetris®. None of this necessarily disqualifies a region or zone, but it demands thorough understanding and a clear mitigation plan before a single workload moves to the cloud.
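
To make the quorum point concrete, here is a small illustrative calculation (my own sketch, not Uber tooling): with replication to replicas in other AZs, a write commits when a majority of replicas have acknowledged, so its latency is driven by the k-th fastest replica round trip. The RTT values below are hypothetical, chosen to sit inside the measured 0.39-2.42 ms range.

```python
# Illustrative only: commit latency for a quorum write given per-replica RTTs.
def quorum_write_latency_ms(replica_rtts_ms, quorum_size):
    """A write commits once `quorum_size` acks arrive, so its latency is
    bounded below by the quorum_size-th fastest replica round trip."""
    acks = sorted(replica_rtts_ms)
    return acks[quorum_size - 1]

# Hypothetical 3-replica placement across AZs (local replica acks almost immediately).
rtts = [0.05, 0.39, 2.42]                               # ms, within the measured inter-AZ range
print(quorum_write_latency_ms(rtts, quorum_size=2))     # majority of 3 -> 0.39 ms
print(quorum_write_latency_ms(rtts, quorum_size=3))     # all replicas  -> 2.42 ms
```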

2. Power First

Power is the only truly inelastic constraint in data center infrastructure. You can't burst past it, and you can't borrow against it. And when you exceed limits, there's often no way to gracefully degrade. It's typically a hard down! That's why any serious evaluation of a cloud partner begins with a rigorous audit of their power architecture and operational approach: how the facility is fed, what the redundancy topology looks like (2N, N+1, distributed, or block redundancy), whether the power chain is backed by on-site generation, and critically, whether the hyperscaler has removed grid dependency entirely through dedicated supply. Our physical infrastructure and data center team regularly audits the full electrical and mechanical stack: UPS topology, PDU ratings, cooling plant capacity, and generator fuel runtime.
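
As a back-of-the-envelope illustration of what those redundancy topologies protect (my own sketch, not an audit artifact; module counts and ratings are hypothetical):

```python
# Illustrative: how much critical load a power train protects under N+1 vs. 2N.
def n_plus_1_capacity_mw(module_mw, modules):
    """N+1: one spare module; protected load is what survives losing any single module."""
    return module_mw * (modules - 1)

def two_n_capacity_mw(module_mw, modules_per_side):
    """2N: two fully independent sides; protected load is one side's full capacity."""
    return module_mw * modules_per_side

# Hypothetical 2.5 MW UPS modules.
print(n_plus_1_capacity_mw(2.5, modules=3))         # 5.0 MW protected with 3 modules
print(two_n_capacity_mw(2.5, modules_per_side=2))   # 5.0 MW protected, but at 2x the modules
```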

We also use power as a migration validation instrument. For instance, if a 5 MW on-premises footprint migrates to the cloud and the equivalent workload draws more than 5 MW in the target environment, that delta is a critical signal. It means either the migration sizing calculations are wrong, or the hyperscaler's infrastructure is operating outside its designed efficiency target. A well-executed migration from a typical Uber data center (PUE ~1.4) to a hyperscale facility (PUE ~1.2) should reduce total power consumption by roughly 15% for the equivalent compute. So, based on our experience, a 5 MW on-premises workload should land at approximately 4.25 MW in the cloud, if the hyperscaler's environment truly realizes the advantages of scale. When it doesn't, the power delta is the fastest diagnostic path, and it has caught misconfigured deployments, over-provisioned capacity, and outsized stranded resources that silently burn power. Without this test, those problems surface months later as unexplained cost variances.
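
The arithmetic behind that check is simple enough to encode. A minimal sketch, assuming constant IT load and PUE as the only variable that changes between environments, using the figures from the text:

```python
# Expected facility power after migration, assuming the IT load is unchanged and
# PUE (total facility power / IT power) is the only variable that changes.
def expected_cloud_power_mw(onprem_total_mw, onprem_pue, cloud_pue):
    it_load_mw = onprem_total_mw / onprem_pue
    return it_load_mw * cloud_pue

expected = expected_cloud_power_mw(5.0, onprem_pue=1.4, cloud_pue=1.2)
print(f"expected cloud draw ~= {expected:.2f} MW")   # ~4.3 MW, i.e. roughly the 15% reduction above

# If the measured draw materially exceeds this, either the sizing is wrong or the
# target environment is running outside its designed efficiency envelope.
```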

3. Fit-for-Purpose Compute Ecosystem

The default cloud architecture, and the consumption model most providers push, assumes a customer operating on provider-defined primitives: general-purpose virtualized SKUs at fixed memory-to-core ratios (usually 4:1), managed database services with opaque storage engines, CI/CD pipelines integrated at the provider's abstraction layer, and horizontal autoscaling driven by typical CPU and memory signals. These defaults can all work, but they aren't application-aware. A complex, sophisticated application has its own set of signals.

A fit-for-purpose cloud ecosystem starts from an entirely different point. Rather than mapping existing workloads onto available SKU families, the mapping works in the opposite direction. The compute, memory, and I/O profile and the latency SLAs define the hardware requirement, and the hardware requirement drives the final design on the cloud. This means the machine design is a variable: it could be virtualized or bare metal; it may use a custom memory-to-core ratio (7:1 or 8:1) for memory-bound services; and silicon generations are selected based on measured performance indicators for x86 or ARM architectures, with different throughput and latency behavior on latency-critical paths. For AI and ML workloads, it extends to GPU memory bandwidth, the NVLink fabric available in the cloud, and the GPU-to-CPU ratio on inference clusters.
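
A minimal sketch of that inversion, with made-up workload numbers and a toy requirement structure (not Uber's internal model): the workload profile is the input, and the hardware requirement falls out of it.

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    peak_cores: int
    peak_memory_gib: int
    p99_latency_budget_ms: float

def hardware_requirement(w: WorkloadProfile) -> dict:
    """Derive the machine shape from the workload, not the other way around."""
    ratio = max(4, round(w.peak_memory_gib / w.peak_cores))   # e.g. 7:1 or 8:1 when memory-bound
    return {
        "memory_to_core_ratio": ratio,
        "deployment": "bare_metal" if w.p99_latency_budget_ms < 5 else "virtualized",
        "silicon": "benchmark x86 and ARM, select on measured price-performance",
    }

profile = WorkloadProfile(peak_cores=96, peak_memory_gib=768, p99_latency_budget_ms=3.0)
print(hardware_requirement(profile))   # {'memory_to_core_ratio': 8, 'deployment': 'bare_metal', ...}
```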

Most importantly, portability is a first-class operational capability. We can redirect traffic from one region to another, from one zone to another, from on-premises to cloud or cloud to on-premises, in full or in increments, within minutes. We can drain a zone completely or bleed it down gradually, in any combination, in any direction, in real time. That level of control is only possible because our abstraction layers sit above the provider's native layer. This is the differentiating capability: it provides operational autonomy, and it defines the purpose that the architecture and migration strategy need to fit.
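
As an illustrative sketch of what incremental draining looks like at the routing-weight level (hypothetical weights and zone names, not Uber's actual control plane):

```python
import copy

# Hypothetical per-zone routing weights for one service; values sum to 100.
weights = {"onprem-dca": 40, "cloud-region-a-az1": 30, "cloud-region-a-az2": 30}

def drain(weights, zone, step_pct):
    """Shift `step_pct` of traffic out of `zone`, redistributed evenly to the remaining zones."""
    new = copy.deepcopy(weights)
    moved = min(new[zone], step_pct)
    new[zone] -= moved
    others = [z for z in new if z != zone and new[z] > 0]
    for z in others:
        new[z] += moved / len(others)
    return new

# Bleed the on-prem footprint down in 10% increments instead of a hard cutover.
for _ in range(4):
    weights = drain(weights, "onprem-dca", 10)
    print(weights)
```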

4. SKU Selection and Silicon-Level Awareness

The instance type and SKU selection layer is where a surprisingly large amount of value is either captured or lost. On the surface, it appears to be a matter of choosing the right vCPU count, memory, and storage configuration. In practice, it requires understanding what's running underneath and how its characteristics interact with workload behavior. As the easier efficiency gains disappear, more of the remaining value has to be engineered across the hardware-software boundary through tighter co-design. SKU selection is increasingly a systems problem spanning silicon characteristics, virtualization design, operating system behavior, runtime behavior, storage topology, and workload shape.

Modern cloud instance families are built on different silicon generations with materially different performance characteristics, and those differences don't affect all workloads equally. The transition from x86 architecture (Intel® or AMD®) to provider-designed ARM silicon like AWS Graviton®, Ampere®, or Google Axion® brings real differences in instruction throughput, memory bandwidth, and price-to-performance ratio. An added layer of complexity kicks in when your bare-metal or virtualization design prescribes a specific system configuration, particular core-to-memory ratios, or page-size settings that don't map cleanly onto the provider's physical machine design. In those cases, resources get stranded on the physical host, and how the cost of that stranded capacity is allocated becomes an important design and planning question. This is where hardware and software co-design becomes important. Rather than treating cloud SKUs as fixed products to be consumed as-is, Uber increasingly works with cloud providers and silicon partners to translate production workload characteristics into better-fit system configurations and instance specifications. The objective isn't only to benchmark what exists, but to influence what becomes production-ready for Uber.
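
To illustrate the stranding problem with made-up numbers (a sketch, not any provider's actual bin-packing): if the VM shape you prescribe doesn't divide evenly into the physical host, the remainder is stranded on that host.

```python
# Illustrative: capacity stranded when a prescribed VM shape doesn't pack a host cleanly.
def stranded_on_host(host_cores, host_mem_gib, vm_cores, vm_mem_gib):
    vms = min(host_cores // vm_cores, host_mem_gib // vm_mem_gib)   # whichever dimension runs out first
    return {
        "vms_per_host": vms,
        "stranded_cores": host_cores - vms * vm_cores,
        "stranded_mem_gib": host_mem_gib - vms * vm_mem_gib,
    }

# Hypothetical host (128 cores, 512 GiB) with an 8:1 memory-to-core VM shape (16 cores, 128 GiB).
print(stranded_on_host(128, 512, vm_cores=16, vm_mem_gib=128))
# -> memory runs out at 4 VMs, leaving 64 cores stranded on the host
```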

An application can behave differently across ARM and x86 silicon, even when the application code itself is unchanged. In managed runtimes such as Java, differences in memory bandwidth, cache hierarchy, page-size configuration, garbage collection behavior, and synchronization costs can all influence observed performance. Some compute-heavy numerical and throughput-oriented workloads may see 20% to 50% better price-performance on a current-generation ARM instance than on the equivalent x86 instance. A latency-sensitive workload with tight p99 requirements may tell a different story. A concrete example of this workload-aware approach was our collaboration with OCI™ on the Ampere A4 instance family, where Uber's workload insights helped shape instance configurations around real production demand. Together, we evaluated tradeoffs across CPU frequency, deployment model, core counts, memory-to-core ratios, and local versus remote storage. For latency-sensitive stateless services, the priority was high per-core performance and simpler VM-based deployment. For throughput-oriented storage systems, the emphasis shifted toward parallelism, memory efficiency, and storage locality.
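
A minimal sketch of that price-performance comparison, with hypothetical benchmark throughput and on-demand prices rather than measured Uber numbers:

```python
# Illustrative price-performance comparison between two instance candidates.
def price_performance(throughput_rps, price_per_hour):
    """Requests per second delivered per dollar-hour; higher is better."""
    return throughput_rps / price_per_hour

x86_ppp = price_performance(throughput_rps=52_000, price_per_hour=2.45)   # hypothetical figures
arm_ppp = price_performance(throughput_rps=55_000, price_per_hour=1.85)   # hypothetical figures

print(f"ARM advantage: {arm_ppp / x86_ppp - 1:.0%}")   # ~40%, within the 20-50% band cited above
# A p99-sensitive service needs the same comparison run on tail latency, not just throughput.
```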

SKU qualification, testing specific workloads against specific instance families before committing to them at scale, is foundational engineering work, even when the tests are synthetic. It involves representative load testing, application-level profiling, and measuring silicon and instance behavior under sustained load, during failures, and at burst conditions. It also requires establishing full-stack readiness: whether a platform is mature across the OS, kernel, runtimes, storage path, orchestration layer, and application stack, and whether it performs well enough in practice to justify adoption. The same principle now extends beyond CPUs into GPUs and accelerators. Micro-efficiency work such as GPU partitioning with MIG (multi-instance GPU), along with early evaluation of platforms like AMD Instinct™ MI300X, AWS Inferentia™, and AWS Trainium™, reinforces the same point: silicon optionality only creates value when it is backed by readiness.
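
As a toy example of the kind of measurement a qualification run produces (a sketch with a synthetic, CPU-bound stand-in for the workload, not Uber's harness):

```python
import random
import statistics
import time

def synthetic_request():
    """Stand-in for one representative unit of work; real runs replay production-shaped load."""
    t0 = time.perf_counter()
    sum(i * i for i in range(random.randint(5_000, 20_000)))   # CPU-bound burst
    return (time.perf_counter() - t0) * 1000                   # latency in ms

def qualify(samples=5_000):
    latencies = sorted(synthetic_request() for _ in range(samples))
    return {
        "p50_ms": statistics.median(latencies),
        "p99_ms": latencies[int(0.99 * samples)],
    }

# Run the same harness on each candidate SKU and compare the tails, not the averages.
print(qualify())
```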

This work matters more now for 4 reasons. First, the silicon landscape is expanding quickly, and Uber needs a focused function to evaluate, qualify, and accelerate adoption of the right platforms. Second, the easy efficiency gains are largely behind us, so better performance now has to be engineered end-to-end through tighter hardware and software co-design. Third, broader silicon optionality improves resilience and reduces dependence on any single vendor path. And fourth, silicon choices influence not only cost and performance, but also reliability, because their effects show up across the full stack and ultimately in production behavior. At a deployment scale where SKU choice governs the per-hour cost of thousands of instances, the ROI on this work is material.

5. Network Topology and Traffic Routing

Network architecture in large cloud deployments is where the gap between expected behavior and observed reality can be the widest. The default scenarios most often work; it's the edge cases that matter when you care about how your own physical network should interact with hyperscalers in a symmetrical manner across regions, as well as how your services are served in a symmetrical manner within a cloud region.

What tends to get discovered late: ephemeral port ranges for stateful connections behave differently than anticipated when combined with connection tracking. Under high connection churn, which is common in microservices architectures, port exhaustion and availability become issues before other resources such as CPU or memory become constraints. The physical characteristics of regions and zones, as well as the size of zones on a per-core basis in a multi-tenant environment, play a role in network latency, as does the capacity of 100 Gbps or 400 Gbps ports for east-west traffic routing.
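
A back-of-the-envelope model for the port-exhaustion point (my own sketch, using the default Linux ephemeral range and TIME_WAIT duration as assumptions):

```python
# Rough estimate of when a client exhausts ephemeral ports toward a single destination.
# Assumes the default Linux range (net.ipv4.ip_local_port_range = 32768-60999) and that
# closed connections linger in TIME_WAIT (~60 s) before the port becomes reusable.

def max_new_connections_per_sec(port_range=(32768, 60999), time_wait_s=60):
    ports = port_range[1] - port_range[0] + 1          # ~28k usable ports per (dst IP, dst port)
    return ports / time_wait_s

limit = max_new_connections_per_sec()
print(f"~{limit:.0f} new connections/sec per destination before exhaustion")   # ~470/sec
# High-churn microservice call patterns can hit this long before CPU or memory are constrained.
```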

Egress economics and data transfer costs are the most consistently underestimated line items in large cloud deployments, and they will break the economics if left unattended. The reason is that the cost model reflects the cloud providers' data-gravity principle, which penalizes data heavily when it leaves (egresses) a boundary, and a deployment at this scale generally has a large volume of east-west traffic that needs to cross those boundaries. Traffic patterns are rarely measured precisely enough in pre-migration modeling to predict these costs accurately, especially without visibility into the consumers of that traffic.

Intra-zone traffic between resources within the same availability zone is generally not charged for data transfer. This creates a meaningful architectural incentive: services that communicate frequently should be co-located within a zone, not just within a region. This is zonal isolation, a powerful concept not just from an egress-economics perspective, but also from the perspective of reliability and isolating fault domains.

Inter-zone traffic between availability zones within the same region carries a per-GB charge on most major providers. At the traffic volumes generated by large-scale services, this number is not negligible. 

Inter-region traffic, between regions, is priced at a significantly higher tier. An acceptable network topology should attempt to avoid inter-region traffic. Cloud egress is typically metered and priced per GB, meaning costs fluctuate directly with traffic volume. One mitigation is to move cross-region traffic onto your own physical network infrastructure over dedicated ports, where the cost of each port is fixed rather than metered. The architectural implication is that synchronous cross-region data patterns at high volumes require explicit justification.
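
A minimal cost model for those three tiers, with placeholder per-GB rates (actual pricing varies by provider and region, so the numbers here are assumptions):

```python
# Illustrative monthly data-transfer cost by traffic tier; per-GB rates are placeholders.
RATES_PER_GB = {
    "intra_zone": 0.00,    # typically not charged
    "inter_zone": 0.01,    # per-GB within a region (placeholder rate)
    "inter_region": 0.02,  # higher tier (placeholder rate)
}

def monthly_transfer_cost(tb_per_month_by_tier):
    return {tier: tb * 1024 * RATES_PER_GB[tier] for tier, tb in tb_per_month_by_tier.items()}

# Hypothetical service moving 500 TB/month east-west, split across tiers.
costs = monthly_transfer_cost({"intra_zone": 300, "inter_zone": 150, "inter_region": 50})
print(costs, "total:", sum(costs.values()))
# Co-locating chatty services in one zone shifts volume into the free tier.
```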

Conclusion

A cloud commitment at this scale is a distributed systems problem that stays live for the duration of the commitment. That means reconciling architectural changes and drift before they become cost variances that nobody can explain. The physical-layer decisions covered in this post (region and zone topology, power architecture, fit-for-purpose compute, SKU qualification, and network design) are constraints that compound if left unattended. Get them right, and everything built on top of them is on solid ground. Get them wrong, and you spend the lifetime of the commitment managing consequences instead of extracting value.

The physical layer is only half the story. In the next blog, I shift to the software layer: how managing your own control planes, engineering service-level scaling, and building abstractions between software and hardware allow you to extract the full value of the underlying infrastructure. And underpinning all of it, the physical and the software layers alike, is a contracting layer that most organizations discover too late: the legal commitments that either protect the system you've engineered or quietly work against it. That's a thread I'll pick up as well.

Acknowledgments

Thank you to Jean He, Paul Thies, Tor Kyaagba, and Nav Kankani for being key guides for Uber on our cloud journey.

The cover photo for this article was created by AI with ChatGPT.

AMD and AMD Instinct are trademarks of Advanced Micro Devices, Inc. 

Ampere is a registered trademark of Ampere Computing, LLC. 

AWS Graviton and AWS Inferentia are trademarks of Amazon Web Services or its affiliates. 

Google Axion Processor is a registered trademark of Google, LLC. 

Intel is a registered trademark of Intel Corporation. 

Java is a registered trademark of Oracle® and/or its affiliates.

Tetris is a registered trademark of Tetris Holding, LLC. 

TRAINIUM is a trademark of Amazon Technologies, Inc.

Written by

Kamran Zargahi

VP of Engineering

Kamran is the VP of Engineering at Uber responsible for cloud engineering.
