Uber’s many software systems require a high volume of changes every day. Because of our systems’ size and complexity, it is a significant challenge to implement these changes without unintended consequences, which can ultimately slow down developer productivity. Flipr is a big part of Uber’s solution to this problem. Flipr is a tool that we created for dynamic configuration management, covering feature flags, allowlists, incremental rollouts, and other advanced use cases. In this post, we will describe the architecture and features of Flipr, and we’ll show how we can use it to make a large volume of changes quickly and easily, without sacrificing reliability.
Configuration Systems and Runtime Context
A configuration system controls a set of values that affect how the application behaves and that can be changed without changing the code. Typically this key-value map is stored in a file, database, or service, so it can be updated independently from the code, and a client library performs the lookup. The configuration system’s main benefit is that it enables you to make changes to the application behavior without having to recompile, redeploy, or redistribute code.
In Flipr we refer to these key and value pairs as properties. A client library lookup function might use JSON to map a string key to a boolean value, but other data types can also be used.
The get function retrieves the config value specified by the key.
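To make this concrete, here is a minimal sketch of such a lookup in Python. It is an illustration only, not the actual Flipr client: the `SimpleConfigClient` class, the property keys, and the inline JSON document are all assumptions made for this example.

```python
import json

# Hypothetical property store. In practice this could be a file, database, or service.
PROPERTIES_JSON = """
{
  "feature_456_enabled": true,
  "name_textbox_max_chars": 256
}
"""


class SimpleConfigClient:
    """Illustrative one-to-one key-value config client (not the Flipr API)."""

    def __init__(self, raw_json):
        self._properties = json.loads(raw_json)

    def get(self, key, default=None):
        """Return the value stored under `key`, or `default` if the key is absent."""
        return self._properties.get(key, default)


client = SimpleConfigClient(PROPERTIES_JSON)
feature_enabled = client.get("feature_456_enabled")
```

Because the value lives in data rather than in code, changing `feature_456_enabled` only requires updating the stored JSON.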
A simple configuration system like this could be used to store configuration for use cases like:
- “Feature #456 is enabled”
- “The character size limit of the ‘name’ textbox is 256 characters”
- “The URL of the map service API is http://mapservice:8080/v2”
The wide range of applications for this includes feature flagging, networking configuration, circuit breaking, and emergency operational control.
This kind of config system is based on a one-to-one key value map, and so it can be harder to represent more complex configuration use cases without changing code, or having a very large key space. For example, we might not easily be able to store:
- “Enable this feature #456 for all members of the ‘admin’ group who are currently making a request from New York”
- “Lock down the property API for requests, with a ‘temporarily unavailable’ error message, between 4pm and 5pm on Wednesday”
These use cases require information which is not known until runtime, such as group membership of the user, or time of day. One way to maintain the ability to change this kind of configuration without updating code is to enhance the configuration system to be able to return different values based on runtime information. This is where Flipr differs from some other configuration systems: Flipr properties can have multiple values per key (i.e., a multimap instead of a map), depending on the runtime context.
The signature of the get function changes to include a runtime context parameter.
With the addition of this new parameter, properties can now include rules that decide whether the value should be overridden, depending on the context. We call these overrides exceptions, and the rules within the exceptions are known as constraints. The client library still retrieves the latest version of the property, but the value it returns can now differ depending on the runtime context.
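A context-aware lookup of this kind could be sketched as follows. This is a simplified model rather than Flipr’s real evaluation logic: the property layout (`value`, `exceptions`, `constraints`), the first-match-wins ordering, and the equality-only constraints are assumptions for illustration.

```python
# Hypothetical property: a default value plus an ordered list of exceptions.
PROPERTY = {
    "key": "feature_456_enabled",
    "value": False,
    "exceptions": [
        # Every constraint in an exception must match for its value to apply.
        {"constraints": {"group": "admin", "city": "New York"}, "value": True},
    ],
}


def get(prop, context=None):
    """Return the first matching exception's value, else the default value."""
    context = context or {}
    for exception in prop.get("exceptions", []):
        constraints = exception["constraints"]
        if all(context.get(name) == expected for name, expected in constraints.items()):
            return exception["value"]
    return prop["value"]


admin_in_new_york = get(PROPERTY, {"group": "admin", "city": "New York"})
admin_elsewhere = get(PROPERTY, {"group": "admin", "city": "Boston"})
no_context = get(PROPERTY)
```

The same key now yields True for an admin in New York and False for everyone else, without any change to the calling code.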
Rolling Out a Feature with Flipr
Let’s walk through an example to illustrate how Flipr might be used in practice. Feature flags are a well-known technique that developers use to safely and quickly roll out code changes. First we deploy the service and app code, like any standard change, but with the new code and features gated by a feature flag. This flag can be as simple as an if statement based on a boolean value that is retrieved from the config system.
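A gate like this might look roughly as follows in service code. The `StubConfigClient`, the handler functions, and the property key are hypothetical stand-ins so that the sketch runs on its own; only the idea of calling a get function with a key and a runtime context comes from the description above.

```python
class StubConfigClient:
    """Stand-in for a config client; ignores the context and returns stored values."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key, context=None, default=False):
        return self._values.get(key, default)


def old_behavior(request):
    return "old"


def new_behavior(request):
    return "new"


def handle_request(client, request):
    # The new code path only runs when the flag evaluates to True for this request.
    context = {"cityId": request["cityId"]}
    if client.get("feature_456_enabled", context=context):
        return new_behavior(request)
    return old_behavior(request)


flag_off = handle_request(StubConfigClient({"feature_456_enabled": False}), {"cityId": 1})
flag_on = handle_request(StubConfigClient({"feature_456_enabled": True}), {"cityId": 1})
```

Defaulting to False when the property is missing keeps the old behavior as the safe fallback.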
When the code is deployed, there is no change to the application behavior, because all the new behavior is gated by the feature flag, and initially the property is set to False, with no exceptions.
Next we can update the property to enable the feature for 2 specific users in the city with the ID 1. The property keeps the same default value as before, but now specifies that the value should be True if the context includes a key cityId that has the value 1.
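The two versions of the property might look something like the following. The JSON field names (`value`, `exceptions`, `constraints`) and the `evaluate` helper are assumptions for illustration and may not match Flipr’s actual schema; only the cityId constraint from the description is modeled, and user-level constraints could be added in the same way.

```python
import json

# Initial state: the flag is off for everyone.
INITIAL_PROPERTY = json.loads("""
{
  "key": "feature_456_enabled",
  "value": false,
  "exceptions": []
}
""")

# Updated state: same default, plus an exception for the city with ID 1.
UPDATED_PROPERTY = json.loads("""
{
  "key": "feature_456_enabled",
  "value": false,
  "exceptions": [
    {"constraints": {"cityId": 1}, "value": true}
  ]
}
""")


def evaluate(prop, context):
    """Return the first matching exception's value, else the default value."""
    for exception in prop["exceptions"]:
        if all(context.get(k) == v for k, v in exception["constraints"].items()):
            return exception["value"]
    return prop["value"]


before_update = evaluate(INITIAL_PROPERTY, {"cityId": 1})
after_update_city_1 = evaluate(UPDATED_PROPERTY, {"cityId": 1})
after_update_city_2 = evaluate(UPDATED_PROPERTY, {"cityId": 2})
```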
Whenever the property is updated, the changes are distributed to all hosts that need them. The next time the client.get function is called, it will use the latest updated rule, and the service will start using the new code for particular requests.
If something goes wrong, we can revert or turn off the change by updating the value back to False. Because the config is changed via a network call and doesn’t involve code changes, we don’t have to wait for long code deploys or for binaries to ship to the app store; it just takes seconds.
Once the feature is enabled, we can monitor our metrics, run integration tests, and look for problems. If everything looks good, the next step is to enable the feature for a larger group of users, like all registered beta testers in the city. Then, if that change passes verification, we could keep expanding the audience to multiple cities, and eventually all users. At any step we are still able to roll back to the old behavior if any issues are detected. This approach of incrementally expanding the use of the feature allows us to roll out and test features quickly and safely.
Feature flagging is just one example of how Flipr is used. The same rules-based config can be used for allowlists and denylists, experiments, geographical or time-based targeting, and many other configuration use cases.
In this section we will briefly describe each of Flipr’s main components. In general, Flipr is just a service, with a UI and a client library for users. Our infrastructure’s large scale also necessitates a set of components to help distribute the data, which helps with reliability.
Customers predominantly interact via the UI. Here’s a screenshot of the property exception builder UI, where a user can create and update properties with exceptions:
The UI also has extensive features for managing and updating properties, including rollout, rollback, peer reviews, permission management, and history.
Flipr supports an extensive API so that other services can use Flipr programmatically. This API is exposed using the standard Uber software networking stack. The API directly supports the UI and the Gateway services, as well as some special use cases that are not available in the UI, such as emergency control and operational features used by our Production Engineering teams.
Scaling out to so many servers could put a lot of pressure on the backend service, but because most use cases are read-only, we can use a fan-out cache of gateway services that keep up-to-date copies cached, so the backend doesn’t get directly hit with traffic.
The replication to gateways is asynchronous, and so making changes and reading via the client is eventually consistent. In practice the eventually-consistent model with a fan-out cache has proven to be scalable, reliable and performant, even for fleet-wide config at Uber scale. Flipr is also moving to a new distribution system that is based on a subscription model to achieve this same scale, but with a more efficient use of resources.
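The trade-off described here can be illustrated with a toy model. The `Backend` and `Gateway` classes below are hypothetical and greatly simplified (a real system would refresh on a schedule or via subscriptions, over the network); they only show how cached reads shield the backend and why reads are eventually consistent.

```python
class Backend:
    """Toy source of truth that counts how often it is read."""

    def __init__(self):
        self._data = {}
        self.snapshot_calls = 0

    def write(self, key, value):
        self._data[key] = value

    def snapshot(self):
        self.snapshot_calls += 1
        return dict(self._data)


class Gateway:
    """Toy cache: serves reads locally and only talks to the backend on refresh."""

    def __init__(self, backend):
        self._backend = backend
        self._cache = {}

    def refresh(self):
        self._cache = self._backend.snapshot()

    def get(self, key, default=None):
        return self._cache.get(key, default)


backend = Backend()
gateways = [Gateway(backend) for _ in range(3)]

backend.write("feature_456_enabled", True)
stale_read = gateways[0].get("feature_456_enabled", False)  # not yet replicated

for gateway in gateways:
    gateway.refresh()  # asynchronous in a real system

fresh_reads = [g.get("feature_456_enabled", False) for g in gateways for _ in range(100)]
```

Three hundred client reads cost the backend only three snapshot calls, at the price of a short window in which a gateway can return the previous value.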
As described in the Gateway section, host agents are part of the fan-out cache for Flipr. The host agents are responsible for pulling data from Gateways and persisting changes to disk. One big advantage of persisting to disk is that if something goes wrong with the gateways, the backend, the network, or elsewhere, the clients can continue to function, as they just keep reading from the on-disk copy. The host agents are also used for things like metrics, consistency monitoring, and abstracting the fetch parts of code out of clients to help with upgrades and changes. There may be multiple containers running on a host, but a single host agent keeps all the required config up to date.
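The persist-to-disk behavior could be sketched like this. The function names, the JSON on-disk format, and the use of `ConnectionError` to signal an unreachable gateway are all assumptions for this example, not details of the real host agent.

```python
import json
import os
import tempfile


def persist_snapshot(path, properties):
    """Write the snapshot atomically so readers never see a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(properties, tmp_file)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path):
    with open(path) as snapshot_file:
        return json.load(snapshot_file)


def sync_once(path, fetch_from_gateway):
    """Pull from the gateway and persist; on failure, keep serving the disk copy."""
    try:
        properties = fetch_from_gateway()
    except ConnectionError:
        return load_snapshot(path)
    persist_snapshot(path, properties)
    return properties


def unreachable_gateway():
    raise ConnectionError("gateway unreachable")


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "properties.json")
    first = sync_once(path, lambda: {"feature_456_enabled": True})
    second = sync_once(path, unreachable_gateway)  # served from disk
```

Writing to a temporary file and renaming it means a client reading the file concurrently sees either the old snapshot or the new one, never a partial write.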
The client libraries enable reading Flipr config data for all the supported languages at Uber. These libraries read the disk format into memory and watch for updates. They are also responsible for evaluating the constraints and exceptions whenever the services request a config value. The main functionality of the clients is the get function we saw in the example, but there are often other versions of this, like type-safe get functions in the Golang library. Some of the client libraries also provide utility functions for calling the backend API to make writes and updates via the API.
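As an illustration of what type-safe get functions provide, here is a small sketch in Python. The `TypedClient` class and its method names are hypothetical and are not the Golang library’s API; the point is only that a typed getter fails loudly when a property holds a value of the wrong type.

```python
class TypedClient:
    """Illustrative client exposing typed getters over in-memory properties."""

    def __init__(self, properties):
        self._properties = dict(properties)

    def _get_typed(self, key, expected_type, default):
        value = self._properties.get(key, default)
        # bool is a subclass of int in Python, so reject bools unless bool is expected.
        if isinstance(value, bool) and expected_type is not bool:
            raise TypeError(f"{key!r} is bool, expected {expected_type.__name__}")
        if not isinstance(value, expected_type):
            raise TypeError(
                f"{key!r} is {type(value).__name__}, expected {expected_type.__name__}"
            )
        return value

    def get_bool(self, key, default=False):
        return self._get_typed(key, bool, default)

    def get_int(self, key, default=0):
        return self._get_typed(key, int, default)

    def get_str(self, key, default=""):
        return self._get_typed(key, str, default)


client = TypedClient({"feature_456_enabled": True, "name_textbox_max_chars": 256})
enabled = client.get_bool("feature_456_enabled")
max_chars = client.get_int("name_textbox_max_chars")
```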
The constraints system for writing exceptions is very flexible. It allows rules to be written that take into account many logical dimensions, such as geography, city ID, user ID, driver ID, vehicle ID, device type, app version, time, and experiment treatment. All of these can be combined with boolean operations to write arbitrarily complex exceptions.
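One way to picture how such constraints compose is as small predicates over the runtime context, combined with AND, OR, and NOT. The helper names below (`equals`, `one_of`, `all_of`, `any_of`, `negate`) and the context keys are assumptions for this sketch and do not reflect Flipr’s actual constraint syntax.

```python
def equals(name, expected):
    return lambda context: context.get(name) == expected


def one_of(name, allowed):
    allowed_values = tuple(allowed)
    return lambda context: context.get(name) in allowed_values


def all_of(*constraints):
    """Logical AND over constraints."""
    return lambda context: all(c(context) for c in constraints)


def any_of(*constraints):
    """Logical OR over constraints."""
    return lambda context: any(c(context) for c in constraints)


def negate(constraint):
    """Logical NOT of a constraint."""
    return lambda context: not constraint(context)


# Cities 1 or 2, on iOS or Android, excluding one app version.
rule = all_of(
    one_of("cityId", [1, 2]),
    any_of(equals("deviceType", "ios"), equals("deviceType", "android")),
    negate(equals("appVersion", "1.0.0")),
)

matches = rule({"cityId": 1, "deviceType": "ios", "appVersion": "2.3.1"})
excluded_version = rule({"cityId": 1, "deviceType": "ios", "appVersion": "1.0.0"})
other_city = rule({"cityId": 3, "deviceType": "android", "appVersion": "2.3.1"})
```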
Properties can also have types, defined by a schema. Flipr enforces the schema when editing values in the UI to reduce the risk of runtime errors.
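A simplified version of this kind of check is sketched below. It assumes a property is a dict with a default `value` and a list of `exceptions`, and that the schema is just a Python type; Flipr’s real schema format is not described here, so treat all of these names as hypothetical.

```python
def matches_type(value, expected_type):
    # bool is a subclass of int in Python, so handle it explicitly.
    if isinstance(value, bool):
        return expected_type is bool
    return isinstance(value, expected_type)


def validate_property(prop, expected_type):
    """Return a list of error messages; an empty list means the edit is valid."""
    candidates = [("default value", prop["value"])]
    for index, exception in enumerate(prop.get("exceptions", [])):
        candidates.append((f"exception {index}", exception["value"]))
    return [
        f"{label}: expected {expected_type.__name__}, got {type(value).__name__}"
        for label, value in candidates
        if not matches_type(value, expected_type)
    ]


valid_edit = {
    "value": False,
    "exceptions": [{"constraints": {"cityId": 1}, "value": True}],
}
invalid_edit = {
    "value": False,
    "exceptions": [{"constraints": {"cityId": 1}, "value": "true"}],  # string, not bool
}

valid_errors = validate_property(valid_edit, bool)
invalid_errors = validate_property(invalid_edit, bool)
```

Checking exception values as well as the default matters, because a mistyped override would otherwise only surface at runtime for the requests it matches.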
Flipr has a lot of features that help keep the system running reliably. It has a consistency checking system that makes sure the files on disk are consistent with the backend. There are metrics, alerts and monitoring for all components in the stack to make sure the on-call engineers know immediately if something is wrong. There are tools for mitigating inconsistencies and dealing with emergencies, such as tools that bust caches and force updates. There is also an on-call schedule of engineers ensuring that the system and customers are supported around the clock.
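A consistency check of this kind could, for example, compare checksums of each host’s on-disk snapshot against the backend’s copy. The sketch below is an assumption about how such a check might work, not a description of Flipr’s implementation; the host names and function names are made up for the example.

```python
import hashlib
import json


def checksum(properties):
    """Hash a canonical JSON encoding so key order does not affect the result."""
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_inconsistent_hosts(backend_properties, host_snapshots):
    """Return the hosts whose snapshot differs from the backend's copy."""
    expected = checksum(backend_properties)
    return sorted(
        host for host, snapshot in host_snapshots.items() if checksum(snapshot) != expected
    )


backend_properties = {"feature_456_enabled": True, "name_textbox_max_chars": 256}
host_snapshots = {
    "host-a": {"name_textbox_max_chars": 256, "feature_456_enabled": True},
    "host-b": {"feature_456_enabled": False, "name_textbox_max_chars": 256},  # stale
}

inconsistent = find_inconsistent_hosts(backend_properties, host_snapshots)
```

A host flagged this way could then be targeted by the mitigation tools, such as forcing an update.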
Flipr manages over 350K active properties, with approximately 150K changes per week. This configuration data is used by over 700 services at Uber across 50K+ hosts, generating around 3 million QPS for our backend systems.
To make Flipr safe, reliable and compliant with security and privacy requirements, Flipr has extensive safety, reliability and auditing features. Peer reviews are required for most Flipr changes and this is enforced via standardized code review tools and custom UI in Flipr. There are access control and permissions systems to ensure that only authorized users can view and update config. Rollouts are performed incrementally across physical dimensions, so that engineers can make changes gradually and safely, rather than applying changes globally, which reduces the blast radius of any unintended consequences. There are also integrations with our monitoring systems so that config can be automatically rolled back if problems are detected after making a change. In the near future we’re also adding some exciting integration testing features that will enable some very easy-to-use pre-production testing of config changes.
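The combination of incremental rollout and automatic rollback can be sketched as a simple loop. The stage names, callbacks, and health check below are hypothetical; in practice the health signal would come from monitoring systems rather than a hard-coded function.

```python
def rollout(stages, apply_stage, is_healthy, rollback_stage):
    """Apply a change stage by stage; undo every applied stage if a check fails."""
    applied = []
    for stage in stages:
        apply_stage(stage)
        applied.append(stage)
        if not is_healthy(stage):
            for done in reversed(applied):
                rollback_stage(done)
            return False
    return True


STAGES = ["zone-a", "zone-b", "zone-c"]  # hypothetical physical dimensions

# Scenario 1: every stage is healthy, so the change reaches all zones.
enabled_ok = set()
succeeded = rollout(STAGES, enabled_ok.add, lambda stage: True, enabled_ok.discard)

# Scenario 2: a problem is detected in zone-b, so the change is rolled back
# before it ever reaches zone-c.
enabled_bad = set()
touched = []


def apply_and_record(stage):
    touched.append(stage)
    enabled_bad.add(stage)


failed = rollout(STAGES, apply_and_record, lambda stage: stage != "zone-b", enabled_bad.discard)
```

Stopping at the first unhealthy stage is what limits the blast radius: zones later in the sequence never see the faulty change.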
This post was a high-level overview of Flipr: what it is and how it works. In the next post, we’ll cover some real use cases, demonstrating how Flipr has helped Uber remain nimble in keeping our systems up to date without causing problems that interfere with our day-to-day productivity. If this was interesting to you, why not come and work with us? We’re hiring!