Backend

Crane: Uber’s Next-Gen Infrastructure Stack

September 15, 2022 / Global
  1. The size of our server fleet was growing rapidly, and our tooling and teams weren’t able to keep up. Many operations for managing servers were still manual. The automated tooling we did have was constantly breaking down. Both the manual operations and automated tooling were frequent outage culprits. In addition, operational load was taking a severe toll on teams, which meant less time for them to work on fundamental software fixes, leading to a vicious cycle.
  2. Fleet size growth came with the need to expand into more data centers/availability zones. What little tooling existed for turning up new zones was ad hoc, with the vast majority of the work being manual and diffused across many different infrastructure teams. Turning up a new zone took multiple months across dozens of teams and hundreds of engineers. In addition, circular dependencies between infrastructure components often led to awkward bootstrapping problems that were difficult to solve.
  3. Our server fleet consisted mostly of on-prem machines, with limited ability to take advantage of additional capacity available in the cloud. We had a single, fledgling cloud zone, but manual operations meant we were not really taking full advantage of it.
  • Our stack should work equally well across cloud and on-prem environments.
  • Our cloud infrastructure should work regardless of cloud provider, allowing us to leverage cloud server capacity across multiple providers.
  • Migration of capacity between cloud providers should be trivial and managed centrally.
  • Our tooling should evolve to support a fleet of 100,000+ servers, including:
    • Homogeneity of hosts at the base OS layer, allowing us to centrally manage the operating system running on every host. All workloads should be containerized with no direct dependencies on the host operating system.
    • Servers should be treated like cattle, not pets. We should be able to handle server failures gracefully and completely automatically.
    • The detection and remediation of bad hardware should be completely automated. A single server failing should not lead to an outage or even an alert.
    • A high-quality host catalog should give us highly accurate information about every host in our fleet¹.
    • Any changes to the infrastructure stack should be aware of various failure domains (server/rack/zone), and be rolled out safely.
  • Zone turn-up should be automated. Our current stack required dozens of engineers and more than 6 months to turn up a new zone. We wanted one engineer to be able to turn up a zone in roughly a week.
  • Circular dependencies among infrastructure components led to their own set of operational and turn-up challenges. These needed to be eliminated.
Figure 1: Layers framework
Figure 2: A sample of hosts from our host catalog
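To make the catalog concrete, here is a minimal sketch of what a single catalog record might carry, assuming hypothetical field names (rack, disk_config, owner_team, and so on) rather than Crane's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostRecord:
    """One entry in a hypothetical host catalog (illustrative fields only)."""
    hostname: str      # e.g., "compute1234-zone1"
    zone: str          # data center / availability zone
    rack: str          # failure domain within the zone
    provider: str      # "onprem" or a cloud provider name
    cpu_arch: str      # "x86_64", "arm64", ...
    disk_config: str   # name of the pre-provisioned disk/partition layout
    owner_team: str    # team the host is currently allocated to
    state: str         # lifecycle state, e.g., "in_service"


# Example record; all values are made up for illustration.
host = HostRecord("compute1234-zone1", "zone1", "zone1-rack17", "onprem",
                  "x86_64", "4x2tb-nvme", "storage-platform", "in_service")
```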
Figure 3: High-level life cycle of a host. On-prem and cloud-specific flows are abstracted away.
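Figure 3's flow can be thought of as a small state machine. The sketch below models it that way, with state names and transitions that are illustrative assumptions rather than Crane's actual life cycle, and with the provider-specific steps hidden behind each transition:

```python
from enum import Enum, auto


class LifecycleState(Enum):
    # Hypothetical states; the real on-prem/cloud flows are abstracted away.
    REQUESTED = auto()       # capacity requested (on-prem order or cloud API call)
    PROVISIONING = auto()    # OS image installed, disks partitioned
    IN_SERVICE = auto()      # allocated to a team and running workloads
    QUARANTINED = auto()     # suspected bad hardware, drained of workloads
    DECOMMISSIONED = auto()  # returned to the pool or the cloud provider


# Allowed transitions in this sketch; anything else is rejected.
TRANSITIONS = {
    LifecycleState.REQUESTED: {LifecycleState.PROVISIONING},
    LifecycleState.PROVISIONING: {LifecycleState.IN_SERVICE, LifecycleState.QUARANTINED},
    LifecycleState.IN_SERVICE: {LifecycleState.QUARANTINED, LifecycleState.DECOMMISSIONED},
    LifecycleState.QUARANTINED: {LifecycleState.IN_SERVICE, LifecycleState.DECOMMISSIONED},
    LifecycleState.DECOMMISSIONED: set(),
}


def advance(current: LifecycleState, target: LifecycleState) -> LifecycleState:
    """Move a host to a new lifecycle state, enforcing the transition table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```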
  • Failure domain diversity: We will attempt to give teams hardware that is located in a different failure domain (e.g., a server rack) than where their current hosts are. We also allow teams to specify minimum diversity requirements (i.e., “never put more than 5 of my hosts in the same rack”).
  • Disk configuration: All hosts are pre-provisioned, including the partitioning and file systems used for their disks. We ensure that teams get hosts with a disk configuration matching what they have specified in their config.
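As an illustration of the rack-diversity rule, the following sketch checks a proposed allocation against a hypothetical "no more than N hosts per rack" constraint; the function name and the way racks are passed in are assumptions, not Crane's allocator API:

```python
from collections import Counter


def violates_rack_diversity(existing_racks, candidate_racks, max_per_rack=5):
    """Return True if adding the candidate hosts would put more than
    max_per_rack of a team's hosts into any single rack."""
    counts = Counter(existing_racks)
    counts.update(candidate_racks)
    return any(count > max_per_rack for count in counts.values())


# Example: a team already has 4 hosts in rack "zone1-rack17"; adding 2 more
# there would exceed a "never more than 5 per rack" requirement.
assert violates_rack_diversity(["zone1-rack17"] * 4, ["zone1-rack17", "zone1-rack17"])
assert not violates_rack_diversity(["zone1-rack17"] * 4, ["zone1-rack42"])
```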
Figure 4: Dominator/subd architecture
Figure 5: The process for detecting a bad host
  1. General confusion: What constitutes bad hardware is somewhat subjective. Which SEL events are truly important? Which SMART counters (and thresholds for those counters) are most appropriate for monitoring disk health? Does out-of-date firmware count as a hardware problem? In a multi-cloud world, there are also host health events provided by each cloud vendor. How do we react to these in a holistic manner?
  2. Duplicate effort: Because every team was responsible for their own hardware, we often ended up solving the same problems over and over again.
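One way to picture the holistic reaction those questions call for is a single routine that folds SMART counters, SEL events, and cloud vendor health events into one verdict. In the sketch below, every threshold and signal name is invented for illustration and is not what Crane actually checks:

```python
def is_host_unhealthy(smart_counters, sel_events, cloud_events):
    """Aggregate hardware-health signals into a single verdict.

    smart_counters: dict of SMART attribute name -> raw value
    sel_events:     list of event strings from the BMC system event log
    cloud_events:   list of health events reported by the cloud provider
    All thresholds below are illustrative, not production values.
    """
    reasons = []

    # Disk health: a non-zero reallocated sector count is a common warning sign.
    if smart_counters.get("reallocated_sector_count", 0) > 0:
        reasons.append("disk: reallocated sectors detected")

    # Baseboard events: treat uncorrectable memory errors as failures.
    if any("Uncorrectable ECC" in event for event in sel_events):
        reasons.append("memory: uncorrectable ECC error in SEL")

    # Cloud vendor signals: degraded-instance or maintenance events.
    if any(event in {"instance-degraded", "scheduled-maintenance"} for event in cloud_events):
        reasons.append("provider: host flagged by cloud vendor")

    return (len(reasons) > 0, reasons)
```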
Figure 6: Undesirable Cluster Configuration
  • The developer is saved from writing what is essentially the same config 4 times
  • Schema uniformity of config is enforced
  • Copy and paste errors are prevented
  • It’s immediately clear what differs between clusters and what is the same
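A minimal sketch of that generation idea, written here as plain Python with hypothetical names but in the spirit of a Starlark (.star) file: each cluster states only what makes it different, and the shared schema is filled in programmatically:

```python
# Shared defaults: written once instead of copy-pasted into every cluster file.
BASE_CONFIG = {
    "agent_version": "1.42.0",
    "max_unavailable": 1,
    "health_check_interval_s": 30,
}

# Per-cluster overrides: only the genuine differences are spelled out.
CLUSTER_OVERRIDES = {
    "zone1": {},
    "zone2": {},
    "cloud-us-east": {"max_unavailable": 2},
    "cloud-eu-west": {"agent_version": "1.41.9"},
}


def generate_cluster_configs():
    """Expand the base config plus overrides into one full config per cluster."""
    return {
        cluster: {**BASE_CONFIG, **overrides}
        for cluster, overrides in CLUSTER_OVERRIDES.items()
    }


if __name__ == "__main__":
    for cluster, config in generate_cluster_configs().items():
        print(cluster, config)
```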
Figure 8: rollout.yaml
Figure 9: agent_version.star
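Figures 8 and 9 show rollout.yaml and agent_version.star. Without reproducing their actual contents, a staged, failure-domain-aware rollout in the same spirit might be described as an ordered list of zone batches with a bake period between them; the stage names, zones, and timings below are assumptions:

```python
# Hypothetical staged rollout plan: zones are upgraded in order, a small
# canary first, with a soak period before each subsequent batch.
ROLLOUT_PLAN = [
    {"stage": "canary", "zones": ["zone1"], "bake_time_hours": 24},
    {"stage": "batch1", "zones": ["zone2"], "bake_time_hours": 12},
    {"stage": "batch2", "zones": ["cloud-us-east", "cloud-eu-west"], "bake_time_hours": 0},
]


def zones_in_order(plan):
    """Flatten the plan into the order zones would receive the new version."""
    return [zone for stage in plan for zone in stage["zones"]]


assert zones_in_order(ROLLOUT_PLAN)[0] == "zone1"  # the canary zone goes first
```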
  • Expanding our cloud abstraction layer to support more cloud providers
  • Dramatically improving the types of signals we can use to detect bad hardware
  • Detecting bad network devices in addition to bad servers
  • Supporting new CPU architectures (e.g., arm64)
  • Executing centralized kernel and OS distro upgrades
  • Taking advantage of elastic cloud capacity
Kurtis Nusbaum

Kurtis Nusbaum is a Senior Staff Engineer on the Uber Infrastructure team in Seattle. He works on highly-reliable distributed systems for managing Uber’s server fleet.

Tim Miller

Tim Miller is a Software Engineer on the Uber Infrastructure team in Seattle, focused on the operating system, kernel, security and upgrades of Uber's production hosts.

Brandon Bercovich

Brandon Bercovich is a Software Engineer on the Uber Infrastructure team in San Francisco. He works on the capacity management systems for allocating servers across Uber’s server fleet.

Bharath Siravara

Bharath Siravara is an Engineering Director on the Uber Infrastructure team in Seattle. He supports several teams including server fleet management and Uber’s Compute Platform.

Posted by Kurtis Nusbaum, Tim Miller, Brandon Bercovich, Bharath Siravara

Category: