Uber has multiple, domain-specific products to manage and distribute configuration changes at runtime across our many systems. These configuration products cater to different use cases: some have a web UI that can be used by non-engineers to change product configuration for different cities, and others expose a Git-based interface that primarily caters to engineers.
While these domain-specific configuration products have different applications, they share common parts that can be consolidated for simplicity and to reduce the overhead of operations, maintenance, and compliance. This article will cover how we consolidated and streamlined our underlying configuration and rollout mechanisms, including some of the interesting challenges we solved along the way, and the efficiencies we achieved by doing so.
Configuration System Definition
Taking a step back, because our existing systems were so spread out and bespoke, we needed to first define a configuration system, so we could reduce the duplication of the common parts among them.
At a high level, configuration systems commonly consist of at least these 5 parts:
- A configuration store used to store configuration
- A generator that maps the configuration store into a consumable form
- A distribution pipeline that transfers generated configurations to downstream consumers
- A rollout mechanism that ensures that the distribution is executed safely, including policies to monitor specific configuration changes (i.e., what is their impact on a given availability zone), and triggers for rollbacks
- Configuration clients that consume generated configurations
Existing Configuration Distribution Pipeline Issues
We specifically focused on the configuration distribution pipelines and associated rollout mechanisms as described in steps 3 and 4. Below are some of the problems with the existing systems that we sought to address:
One of the old configuration distribution pipelines used pull-based instead of push-based distribution mechanisms. That means that each layer of the system would have to keep pulling the data periodically. This didn’t scale well as it put pressure on the networking stack and could potentially saturate the network at the host layer.
Having multiple distribution pipelines that are nearly doing the same thing is expensive in hardware. This is especially true running at Uber scale, since the configuration agents are running over 150,000+ hosts.
Each of the distribution pipelines will need to have dedicated resources on the host, which means we will be scheduling fewer services on the machine. Furthermore, the dedicated resources were doubled or tripled as each agent had its own dedicated resources on each host (not to mention the engineering resources dedicated to maintaining each distribution pipeline).
Propagation Latency Guarantees
The existing distribution pipelines did not provide clear guarantees for configuration propagation latency, which meant there could be no clear SLA. Nor was there a clear relationship between latency, configuration update size, and the number of consumers. Moreover, there are latency-sensitive stakeholders that will be impacted if propagation latency is higher than expected.
Incremental Rollout Support
Some of the distribution pipelines didn’t provide a way to distribute changes incrementally. The only offering was to distribute the changes globally across the fleet (global blast). Adding support for incremental rollout was challenging due to the design of these old pipelines.
This wasn’t aligned with Uber’s goal of having safe and reliable configuration and code rollouts.
|Configuration Property||The smallest unit the system works on, identified by a key containing a value–i.e., some bytes that could either contain a primitive value, but typically will contain a blob (e.g., JSON or YAML)|
|Configuration Namespace||A collection of related properties–configuration consumers will typically subscribe at the namespace granularity|
|Configuration Domain||Configuration for a specific configuration product will typically reside within the same configuration domain, which can be seen as a collection of related namespaces|
|Stakeholder||The producer or consumer of one of the configuration namespaces or domains|
Introducing The UCDP Pipeline
The Unified Configuration Distribution Platform (UCDP) is a platform at Uber to unify our many configuration distribution pipelines. It’s designed from the ground up to be efficient, reliable, stable, and generic, so domain-specific configuration products can be built on top of it.
The platform provides an API to distribute the configuration to all hosts within Uber with specified rollout pace. In addition, it provides a configuration client library that workloads (services, batch jobs, etc.) can use to read configuration and subscribe to configuration updates. It also has an agent running on all of Uber’s production servers, which makes configuration available to the workloads running on these servers.
To interact with the system, there are two types of clients a stakeholder can use: producer and consumer.
The producer client is an abstraction on top of the Deployer RPC APIs. It allows the stakeholders to either send a full snapshot update of a configuration namespace or send a patch on top of the existing snapshot where only some of the configuration properties in the namespace are updated.
The consumer client allows the stakeholders to subscribe to one or more configuration namespaces. The client gives the caller an in-memory cache of the configuration namespace as well as the ability to register callbacks on updates. The client abstracts all the details about creating and maintaining a lease on the configuration as well as how to read in configuration changes from a local disk cache.
UCDP consists of three layers: global, zonal, and host.
The global layer is the global control plane of the system. The service at the global layer (the Deployer) offers an API for producers to distribute configuration updates. The Deployer API endpoints are wrapped in an easier-to-use producer client library, which stakeholders use to distribute configuration changes.
In each availability zone we have a fanout layer (Distributor) that is responsible for fetching and distributing configuration changes to the hosts where it is needed. The distributor uses the goalstate store (Apache Zookeeper) to coordinate goalstate updates between the distributor instances in the zone to provide a consistent view for all the instances. When a goalstate is updated the distributor will fetch and unpack the associated configuration artifact(s).
Uber hosts can host different kinds of workloads based on the platform (stateful or stateless compute platforms). Those platforms schedule one or more workload items per host. These workloads can be moved around over time. The result is that only a subset of the total dynamic runtime configurations are required on each host.
To handle this we have added location-aware configuration distribution, so that the workload requiring configuration needs to subscribe to that config. We cache the data on disk to retain the data in case of a restart.
How the Layers Interact with Each Other
Let’s go through the lifecycle of a configuration entity.
The producer will use the UCDP producer client to build and deploy configuration updates. An example of such a producer is Flipr.
The Deployer will upload the configuration update to the artifact store and then update the global deploystate. Depending on the rollout policy, the update will be pushed to either a single zone or multiple zones at once.
- When the zone receives the update it will first update the zonal goalstate.
- All the Distributor instances are listening to the zonal goalstate updates, so when they receive an update signal they will download, unpack, and save a copy of the update configuration namespace in memory.
- The Distributors will then send a patch with the configuration update to the host-agents with active subscribers.
- When the host-agent receives the update, it will write the update to disk based on the expected action. The clients will then act on an inotify event in a filewatch and read the change from disk.
Interesting Problems to Solve
When building a system like UCDP a lot of interesting problems arise. Here are some examples of challenges we have encountered and how we dealt with them.
Keeping Track of Consumers’ Status
Considering that UCDP host-agent is running on 150,000+ hosts, without some optimizations, restarting the host-agent or Distributors would cause us to send snapshots of data to all host-agents, which is an expensive process due to the huge volume. This is a problem specifically for Flipr, as they require several hundreds of megabytes of configuration data to be present on all hosts.
As mentioned earlier, the host-agent keeps a copy of the data on disk, so when the host-agent container is upgraded, it scans the data on disk and sends the information back to the Distributor to which it is connected. The information includes the configuration namespace, version, and checksum to verify data integrity. If this doesn’t match what is on the Distributors, we will send a fresh snapshot of the configuration namespace. Otherwise, we register the information and send only future incremental patches.
The same will happen when the Distributor container is upgraded: the host-agent will be disconnected and will connect to another Distributor instance with the same data.
UCDP consolidated the distribution pipelines into one with putting Uber’s potential growth in mind. As UCDP is push-based instead of pull-based (described above), this was a big difference from the old distribution pipelines. It helped to significantly reduce the load on the network stack, as roughly 25% of the network calls across our data centers were from the different pipelines checking for configuration updates.
Furthermore, introducing a caching layer at the zone level helped us to increase the startup time of the service during restarts/upgrades. Previously, the old distribution pipelines had to re-fetch configuration data from the global layer on restart/upgrades which caused the instance to take up to multiple hours to warm up and go back to service.
In UCDP, the data is cached on disk. When the service starts up, it reads the data on disk, and compares what is cached to the source of truth and backfills any inconsistencies. This helped us to reduce the startup time from a couple of hours to a couple of minutes. It will still take time to bootstrap the service instance for the first time, as it will need to download and cache the data on disk, but it’s faster in the following restarts.
Propagation Latency Guarantees
Propagation latency guarantees are one of the big challenges we face and we try to improve upon. When talking about configuration distribution, not all use cases are the same. For example, a small configuration domain distributed to thousands of hosts is different from a large configuration domain distributed to a small set of hosts.
Distributing a patch update is different from distributing a full snapshot to stakeholders. Configuration size, type (patch or snapshot), and the number of subscribers are all factors in measuring the latency–though not all of them are equal.
We solved the first problem by defining an SLA based on configuration update size and type (snapshot or patch).
We have three configuration categories: small, medium, and large. Small updates have a faster end-to-end propagation latency, medium updates have higher propagation latency, while large updates have no propagation latency guarantees at all.
We are still working on improving this further, by incorporating the number of subscribers into our SLA guarantees.
UCDP is a business-critical system; without it, dynamic control of Uber’s business would not be possible.
The system has over 1.5 million clients subscribing for configuration changes across over 150,000 hosts. With more than 400,000 weekly configuration deploys, this results in over 350TiB of data being distributed on a weekly basis. Moreover, the system is storing 277 configuration domains and 71,000 configuration namespaces.
Without proper safeguards, a single stakeholder could potentially unintentionally put so much load on the system that it would degrade performance for other stakeholders or potentially bring down the entire system. Thus, we have implemented multiple measures to isolate stakeholders and prevent any one use case from affecting the system’s overall performance:
- Rate limiter to protect the system from the excessive load from a single stakeholder.
- Outbound data limiter–the amount of data produced by our stakeholders can impact the system’s overall performance. We have executed load tests on the system and the downstream dependencies to measure the performance to have an estimation of how much load our system can handle. Thus, we have set a limit to the amount of outbound data to the zonal level.
- Zookeeper child node limit–Zookeeper has a limit to the number of immediate child nodes a parent node can have. It’s capped at the size of the bytes returned when calling Zookeeper API (1MB) [ref]. Going above the limit can cause Zookeeper to be unstable as Zookeeper is designed to store data on the order of kilobytes in size. Thus, we have designed the structure of Zookeeper nodes to be at two levels. The first level is for the configuration domains and the second level is for the configuration namespaces. So, if a single domain reaches the limit, other domains will not be impacted.
Our goal next is to focus on having statistics of the different configuration domains and namespaces. We need to have more visibility on knowing which service is using which configuration, as the relationship between configuration domains/namespaces and services is many to many. Knowing in which zones and regions they are used, how many subscribers each configuration namespace has, and more, will allow us to improve the safety and experience of rollouts.
Furthermore, we want to bridge the gap between code and configuration rollout and make the experience seamless and safe. Code rollout at Uber has many safety features (e.g., waiting between zones/regions to verify service health, automatic rollback on alert, integration, and load testing as part of rollout, and more) that we plan to leverage for configuration rollouts using a shared Change Management component.
With UCDP as a generic configuration distribution system, we were able to reduce duplication between domain-specific configuration products. By replacing less efficient, pull-based distribution systems, we were able to gain significant efficiency wins, distributing the same configuration changes with lower propagation latency and using an order of magnitude less CPU compared to the previous system.
In this post we described at a high level what UCDP is and how it works. In a future post we will build on that and focus on security and how we ensure safe code and configuration rollout at Uber under our new, unified system.
Ahmed Magdy is a Senior Software Engineer on Uber’s Configuration Platform Team. He is currently focusing on improving the reliability and efficiency of the configuration deployment for Uber’s microservices.
Jacob Damgaard is a Staff Software Engineer at Uber where he is the Tech Lead on the Configuration Platform Team.
CheckEnv: Fast Detection of RPC Calls Between Environments Powered by Graphs
September 13 / Global