How Uber, OCI™, and Ampere® Co-Optimized OCI AmpereOne® M A4 Compute
December 15 / Global
Introduction
At Uber, we innovate where software meets hardware, building a deep understanding of our workloads so we can shape infrastructure around how we actually consume compute, not how we provision it. This starts at the silicon layer, where we now collaborate with chip providers to optimize silicon and tailor OCI (Oracle Cloud Infrastructure™) instance configurations for Uber’s real workloads.
Uber’s migration to Arm-based cloud infrastructure exemplifies deep cross-company collaboration that harnesses the strengths of OCI, Ampere Computing, and Uber Engineering. In February 2023, Uber began transitioning from on-premises data centers to the cloud using OCI and Google Cloud Platform™, taking on the dual challenge of shifting massive workloads and introducing Arm-powered compute instances into a previously x86-dominated environment that supports thousands of microservices and storage databases.
OCI’s leadership in adopting Ampere Arm-based processors reflected a strategic decision to drive cloud efficiency, with Ampere offering higher performance per watt, reduced energy consumption, and greater compute density—factors that benefit hyperscale providers and their enterprise customers. For Uber, partnering with OCI and Ampere brought the dual advantages of energy efficiency and hardware diversity, aligning with Uber’s sustainability goals and supporting operational flexibility in a dynamic supply chain landscape.
Over the past 24 months, Uber, OCI, and Ampere have worked closely on numerous technical challenges and key innovations to dramatically improve price-performance curves for a large footprint of Uber workloads, setting the stage for supporting multi-architecture environments for years to come.
In this blog, we discuss key innovations born from this partnership: the OCI AmpereOne® M A4 instance family, which incorporates key optimizations and features informed by Uber Engineering’s experience with the previous generations of OCI Ampere A1 and A2 instances.
Background
To fully appreciate the learnings and insights that went into designing the OCI Ampere A4 instances, it’s important to understand the technical challenges encountered with the previous-generation instances, as workloads were onboarded to the Arm architecture for the first time in Uber’s history.
OCI Ampere A1 and A2 instances provided the foundation for the Uber platform to become Arm-ready. Uber implemented a strategic, multi-phase approach to enable Arm-based Ampere systems in its predominantly x86 environment, focusing on deep infrastructure, build, and deployment changes to support multi-architecture workloads.
It was during the early phases of onboarding that many of these learnings emerged. Throughout this journey, the Uber, Oracle Cloud Infrastructure, and Ampere engineering teams worked closely to innovate and resolve issues, enabling the onboarding of additional workloads. As we got closer to testing production workload services, we learned more about how Uber’s software code interacted with the Arm instruction set architecture, insights that provided a crucial customer perspective for Ampere’s next-generation silicon development and OCI’s instance configurations.
Key Learnings
Right out of the gate, when we started synthetic benchmarking with SPECjbb2015®, we noticed that dual-socket systems showed reduced performance for latency-sensitive operations (critical-jOPS, critical Java operations per second). Go benchmarks also showed sensitivity in operation execution speeds. Collaborative debugging identified the root cause: limited cross-socket link bandwidth in multi-socket systems. With the high core counts per socket offered by Ampere®, there was no longer a need for multi-socket servers that introduce communication latency between processors.
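To illustrate the kind of Go microbenchmark that surfaces this sensitivity, here’s a minimal sketch using the standard testing package; simulateRequest is a hypothetical stand-in for a latency-sensitive service operation, not Uber’s actual code. On a dual-socket host, per-operation latency at high parallelism tends to inflate relative to a single-socket run because of cross-socket traffic.

```go
package bench

import (
	"sync"
	"testing"
)

// simulateRequest is a hypothetical stand-in for a latency-sensitive
// service operation (deserialize, compute, serialize).
func simulateRequest(buf []byte) int {
	sum := 0
	for _, b := range buf {
		sum += int(b)
	}
	return sum
}

// BenchmarkRequestParallel reports per-operation latency under
// concurrency; comparing runs at different -cpu values hints at
// how cross-socket bandwidth affects scaling.
func BenchmarkRequestParallel(b *testing.B) {
	buf := make([]byte, 64<<10) // 64 KiB shared read-only working set
	var mu sync.Mutex
	total := 0
	b.RunParallel(func(pb *testing.PB) {
		local := 0
		for pb.Next() {
			local += simulateRequest(buf)
		}
		mu.Lock()
		total += local
		mu.Unlock()
	})
	_ = total // keep the result live so the loop isn't optimized away
}
```

Running `go test -bench=. -cpu=1,8,64` shows how ns/op changes as the benchmark spreads across more cores.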
Uber’s initial cohort of Go services showed out-of-memory (OOM) errors even with the GOMEMLIMIT setting applied. We isolated the degradation in Go performance as mainly an artifact of the silicon’s smaller Translation Lookaside Buffers (TLBs) and cache sizes, which led to more page faults and slower garbage collection.
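For readers unfamiliar with the knob involved, Go’s soft memory limit can be set through the GOMEMLIMIT environment variable or programmatically via runtime/debug. The sketch below is illustrative (the 4 GiB limit and GOGC value are arbitrary examples, not Uber’s settings); the relevant interaction is that GC cycles triggered near the limit become more expensive when TLB misses and page faults slow the collector down.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to GOMEMLIMIT=4GiB: a soft limit the Go runtime
	// tries to stay under by running the GC more aggressively.
	prev := debug.SetMemoryLimit(4 << 30) // limit in bytes
	fmt.Printf("previous memory limit: %d bytes\n", prev)

	// GOGC still applies alongside the limit; a lower value trades
	// CPU for headroom under the limit (value here is illustrative).
	debug.SetGCPercent(75)
}
```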
Uber’s compute platform also runs a set of very latency-sensitive services with single-threaded performance requirements that can’t be addressed through horizontal scaling alone, for example, services running the Gurobi® mixed-integer programming solver, which rely on the highest clock speeds. Uber services using Gurobi saw slower performance on the OCI A1 systems than on the existing on-premises x86 servers due to the fixed 3.0 GHz single-core frequency. The Ampere Altra® CPU doesn’t support turbo clocking of cores, which meant that workloads not running on all cores left unused resources stranded.
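A quick way to see why fixed-frequency cores cap such workloads is a single-threaded, compute-bound loop whose throughput tracks per-core clock speed closely. The sketch below is an illustrative probe, not the Gurobi workload itself:

```go
package main

import (
	"fmt"
	"time"
)

// singleThreadScore runs a tight, cache-resident integer loop on one
// goroutine. Its throughput tracks single-core clock speed, so a core
// fixed at 3.0 GHz sets a hard ceiling on this score no matter how
// many other cores sit idle.
func singleThreadScore(d time.Duration) uint64 {
	deadline := time.Now().Add(d)
	var x, ops uint64 = 1, 0
	for time.Now().Before(deadline) {
		for i := 0; i < 1_000_000; i++ {
			x = x*6364136223846793005 + 1442695040888963407 // LCG step
		}
		ops++
	}
	_ = x // keep the computation live
	return ops
}

func main() {
	fmt.Printf("approx. %d M iterations/s on one core\n",
		singleThreadScore(time.Second))
}
```

Comparing this score across instance types isolates single-core frequency from core count, which is exactly the dimension where turbo-less parts fall behind for these services.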
Applying Learnings to Design OCI A4 Instances from Silicon to System
OCI Ampere A1 and A2 instances provided the foundation to onboard Arm-based instances in Uber’s fleet. In addition, the learnings from benchmarks and running workloads at scale in production highlighted an opportunity to improve the price-performance of the OCI Ampere instance family. These insights catalyzed deeper collaboration to optimize the instances from silicon to system.
These learnings led Uber Engineering to share detailed workload characteristics and performance targets with Ampere architects, which were instrumental in informing the design and optimization of the AmpereOne® M silicon. Oracle and Ampere engineering also enabled a Flex SKU option, based on Uber’s feedback, for CPUs with the highest single-thread performance.
| Feature | Ampere® Altra® | AmpereOne® M |
| --- | --- | --- |
| Cores | 32-128 | 96 (Flex SKU for OCI), up to 192 (chip) |
| Clock Speed | Up to 3.0 GHz | Up to 3.6 GHz (96-core model) |
| L2 Cache/Core | 1 MB | 2 MB |
| System Cache | 16/32 MB | 64 MB |
| Memory | 8x DDR4-3200, up to 4 TB | 12x DDR5-5600, up to 1.5 TB |
| PCIe (Peripheral Component Interconnect Express) | Up to 128 Gen4 lanes | 96 Gen5 lanes |
Table 1: A summary of the specs for AmpereOne M, highlighting how Ampere incorporated key optimizations and features informed by Uber’s feedback.
Our collaboration focused on optimizing the OCI Ampere A4 instance family. Uber’s services at scale provided critical insights, enabling OCI and Ampere to translate observed workload attributes into optimal system configurations and instance specifications. In this three-way co-optimization, we explored combinations of high-frequency Flex SKUs (3.6 GHz versus 3.2 GHz), bare-metal and virtual-machine deployments, 96-192 cores, multiple core-to-memory ratios using 12-channel DDR5, and local NVMe (Non-Volatile Memory Express) versus remote block storage.
We anchored on two broad workload classes. For latency-sensitive stateless services, we prioritized per-core performance and predictable garbage collection: higher CPU frequency, about 8 GB of memory per vCore, and VM deployment to avoid NUMA and I/O tuning and keep migrations from the previous generation simple. For throughput-oriented storage systems, we favored parallelism and memory efficiency at efficient clocks, with flexible memory-to-core ratios that align with observed working-set sizes to improve packing efficiency without over-provisioning, and storage sized to workload locality (local NVMe when it matters, remote block when it doesn’t) to meet demand without stranding capacity.
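To make the mapping concrete, here’s a hedged sketch of how workload attributes might translate into a shape choice. The type, function, thresholds, and field values are ours for illustration; they mirror the options discussed above and are not an OCI API.

```go
package main

import "fmt"

// Shape is an illustrative description of an instance configuration,
// not an OCI resource type.
type Shape struct {
	ClockGHz    float64
	MemPerVCore int // GB of memory per vCore
	Deployment  string
	Storage     string
}

// chooseShape encodes the two broad classes discussed above:
// latency-sensitive stateless services versus throughput-oriented
// storage systems.
func chooseShape(latencySensitive bool, memGBPerCore int, needsLocalIO bool) Shape {
	if latencySensitive {
		// Favor per-core speed and simple migration: high-frequency
		// Flex SKU, ~8 GB/vCore, VM to avoid NUMA/I/O tuning.
		return Shape{ClockGHz: 3.6, MemPerVCore: 8, Deployment: "VM", Storage: "remote block"}
	}
	// Favor parallelism at efficient clocks, with a memory ratio
	// matched to the observed working set.
	s := Shape{ClockGHz: 3.2, MemPerVCore: memGBPerCore, Deployment: "bare metal", Storage: "remote block"}
	if needsLocalIO {
		s.Storage = "local NVMe"
	}
	return s
}

func main() {
	fmt.Printf("stateless service: %+v\n", chooseShape(true, 8, false))
	fmt.Printf("storage system:    %+v\n", chooseShape(false, 12, true))
}
```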
Conclusion
The journey from the first Arm-based deployments on OCI Ampere A1 and A2 to the collaborative optimization of OCI Ampere A4 represents more than an architectural upgrade. It’s a transformation in how Uber approaches infrastructure innovation. By engaging deeply from silicon to system, Uber Engineering has evolved from adapting to existing hardware to influencing the next generation of compute, optimized for our real workloads.
This collaboration between Uber, Oracle Cloud Infrastructure, and Ampere showcases what’s possible when software, systems, and silicon capabilities are precisely aligned to meet demanding workload needs. OCI Ampere A4 instances aren’t just faster or more efficient; they embody a deeper integration between platform needs and hardware design, setting the foundation for Uber’s future of performance-driven, energy-efficient, and cost-optimized computing at scale.
To learn more about the new OCI Ampere A4 Compute, visit the Oracle Cloud Infrastructure Launch Blog and the Ampere Launch Blog.
Acknowledgments
We’d like to thank the engineers across teams at Uber, Oracle, and Ampere for their contributions. From Uber: Maz Zabaneh, Dan Song, Kshtiij Doshi, Shashi Aluru, Ben Wang, Amulya Nanjajjar.
Cover Photo Attribution: Ampere Computing LLC.
Ampere Computing, Ampere Altra®, and AmpereOne® M are registered trademarks of Ampere Computing LLC.
Google Cloud Platform™ is a trademark of Google LLC and this blog post is not endorsed by or affiliated with Google in any way.
Gurobi Optimization, Gurobi, and the Gurobi logo and design are registered trademarks or trademarks of Gurobi Optimization, LLC.
Oracle, Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates.
SPEC® and the benchmark name SPECjbb® are registered trademarks of the Standard Performance Evaluation Corporation.
Stay up to date with the latest from Uber Engineering—follow us on LinkedIn for our newest blog posts and insights.

Vikrant Soman
Vikrant Soman is a Technical Lead at Uber, where he’s responsible for the hardware technology strategy for the company’s multi-cloud fleet. Additionally, he spearheads projects focused on enhancing fleet reliability and efficiency.

Dan Royal
Dan Royal is Senior Engineering Manager at Uber and leads the Engineering Efficiency team. His areas of focus have been resource utilization, price-performance, cost attribution, demand planning, and infrastructure strategy.

Nav Kankani
Nav Kankani is a Senior Engineering Manager and Platform Architect on the Uber Infrastructure team. He has been working in the areas of AI/ML, hyperscale cloud platforms, storage systems, and the semiconductor industry for the past 23 years. He is also a named inventor on over 21 US patents.
Posted by Vikrant Soman, Dan Royal, Nav Kankani, Kamran Zargahi