A few months back, we discussed Uber’s decision to abandon its monolithic codebase in favor of a modular, flexible microservice architecture. Since then, we’ve devoted many thousands of engineering hours to expanding this ecosystem of Uber microservices (several hundred and counting), written in a variety of languages and using many different frameworks. This ongoing refactor is a huge undertaking, so we took the opportunity to adopt a new suite of technologies for building microservices at Uber. With a tech stack and standards well suited for the SOA migration, we have streamlined service development at Uber.
Starting a New Service
In a rapidly growing engineering organization, it can be difficult to keep track of all the efforts underway. This kind of growth demands a process to prevent duplicating work across teams. At Uber, we have solved this problem by requiring that authors of new services submit a Request for Comments (RFC), a high-level proposal of the new service that outlines its purpose, architecture, dependencies, and other implementation details for the rest of Uber Engineering to discuss. The RFC serves two purposes: 1) to solicit feedback for improving the service’s quality as it’s developed and 2) to prevent duplicate efforts or expose opportunities for collaboration.
Several other engineers who are familiar with the domain review the service’s design. Once feedback has been incorporated into the service proposal, the fun and games of building the service begin.
Implementing a New Service
Tincup, our currency and exchange rate service, is a great example of how a microservice at Uber is implemented. Tincup is the interface for up-to-date currency and exchange rate data. It serves two major endpoints: one to get a currency object and a second to get the current exchange rate for a given currency (per USD). These endpoints are necessary because Uber is a global business. Exchange rates change often, and we facilitate transactions in nearly 60 currencies.
Bootstrapping Microservices with New Technologies
Rewriting all logic related to currencies and exchange rates while building Tincup provided a good opportunity to reevaluate some design decisions at Uber that were made long ago. We used a flurry of new frameworks, protocols, and conventions to implement Tincup.
First, we addressed the overall structure of code related to currencies and exchange rates. At Uber, we have modified the persistence layer of many datasets (like this one) several times in recent years, and each change was long and cumbersome. We have learned from this process that, where possible, it is best to separate persistence-layer specifics from application logic. This leads to an application development approach we refer to as MVCS, which extends the common MVC approach to include a service layer where application logic lives. By isolating the application logic in the service layer from the rest of the application, the persistence layer can evolve or be replaced without refactoring business logic; only the code dealing directly with storing and reading the data needs to change.
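To make the layering concrete, here is a minimal sketch of the MVCS idea; all class and function names are illustrative, not Uber's actual code:

```python
# Minimal MVCS sketch: the service layer holds application logic and
# depends only on a narrow repository interface, so the persistence
# layer can be swapped (e.g., PostgreSQL -> UDR) without touching it.
# All names here are illustrative, not Uber's actual code.
from abc import ABC, abstractmethod


class ExchangeRateRepository(ABC):
    """Persistence-layer interface: the only surface the service sees."""

    @abstractmethod
    def get_rate(self, currency_code: str) -> float: ...


class InMemoryRateRepository(ExchangeRateRepository):
    """One possible backend; a UDR-backed class would implement the
    same interface without any change to the service layer."""

    def __init__(self, rates):
        self._rates = dict(rates)

    def get_rate(self, currency_code):
        return self._rates[currency_code]


class CurrencyService:
    """Service layer: application logic, persistence-agnostic."""

    def __init__(self, repo: ExchangeRateRepository):
        self._repo = repo

    def convert_to_usd(self, amount: float, currency_code: str) -> float:
        # Rates are stored per USD, so divide to get the USD amount.
        rate = self._repo.get_rate(currency_code)
        return amount / rate


repo = InMemoryRateRepository({"EUR": 0.9, "USD": 1.0})
service = CurrencyService(repo)
print(round(service.convert_to_usd(90.0, "EUR"), 2))  # 100.0
```

Swapping the datastore then means writing one new repository class against the same interface, while `CurrencyService` stays untouched.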
Second, we considered the persistence layer for currencies and exchange rates. Before Tincup, this data was stored in a relational PostgreSQL database with incremental integer IDs. However, this method of data storage does not allow for globally replicated data across all of Uber’s data centers and thus does not align with our effort for an all-active (simultaneously serving trips from all data centers) architecture. Since currencies and exchange rates are required to be accessible from all data centers, we swapped out the persistence layer to use UDR, Uber’s globally replicated scalable datastore.
Anticipating Concerns of Microservice Growth
After deciding on the design changes specific to currencies and exchange rates, we addressed the new concerns that naturally arise when the number of microservices in the engineering ecosystem increases.
Blocking on network I/O is a serious concern that can lead to uWSGI worker starvation. If all requests to services like Tincup are synchronous, the degradation of one service risks a ripple effect that impacts every caller. To prevent blocking, we adopted Tornado, an event-loop-based asynchronous framework for Python. Since we were tearing a great amount of code out of the (Flask) monolithic codebase, it was important to minimize risk by choosing an asynchronous framework that would let much of the existing application logic remain the same. Tornado met that requirement: it lets code look synchronous while performing non-blocking I/O. (Alternatively, to address the aforementioned I/O issue, many service owners are giving a new language a Go.)
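The pattern Tornado enables can be sketched with the standard-library asyncio event loop, which follows the same coroutine style; `fetch_rate` is a hypothetical stand-in for a real downstream call, not Tincup's code:

```python
# Synchronous-looking but non-blocking I/O, sketched with the stdlib
# asyncio event loop (Tornado's coroutines follow the same pattern).
# fetch_rate is a hypothetical stand-in for a network call to a service.
import asyncio
import time


async def fetch_rate(currency: str) -> float:
    # Simulate a slow downstream service; the worker is NOT blocked
    # while this awaits, so other requests can be served meanwhile.
    await asyncio.sleep(0.1)
    return {"EUR": 0.9, "GBP": 0.8}[currency]


async def main():
    start = time.monotonic()
    # Reads top-to-bottom like synchronous code, yet both calls
    # overlap on the event loop instead of serializing.
    eur, gbp = await asyncio.gather(fetch_rate("EUR"), fetch_rate("GBP"))
    elapsed = time.monotonic() - start
    print(f"EUR={eur} GBP={gbp} in ~{elapsed:.1f}s")


asyncio.run(main())
```

Because the two awaits overlap, the pair completes in roughly one sleep interval rather than two; a blocking worker would have serialized them.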
What was once a single API call may fan out into a great number of calls to microservices. To facilitate the discovery of other services and the identification of points of failure in the large ecosystem, Uber microservices use open source TChannel over Hyperbahn. TChannel, a network multiplexing and framing protocol for RPC developed in-house, provides a protocol for clients and servers, with Hyperbahn's intelligent routing mesh connecting the two. Together they solve a few core issues that arise in a world of microservices:
- Service discovery. All producers and consumers register themselves with the routing mesh. Consumers access producers by name instead of needing to know about hosts or ports.
- Fault tolerance. The routing mesh tracks metrics like failure rates and SLA violations. It can detect unhealthy hosts and subsequently remove them from the pool of available hosts.
- Rate limiting and circuit breaking. These features ensure bad requests and slow responses from clients don’t cause cascading failures.
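TChannel and Hyperbahn implement these protections inside the routing mesh. As a toy illustration of the circuit-breaking idea only (not TChannel's actual code), a client-side breaker might look like:

```python
# Toy circuit breaker: after `threshold` consecutive failures the
# circuit opens and calls fail fast for `cooldown` seconds, shielding
# callers from a degraded downstream service. Illustrative only;
# TChannel/Hyperbahn implement this logic inside the routing mesh.
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Fail fast without touching the unhealthy service.
                raise CircuitOpenError("circuit open: failing fast")
            # Cooldown elapsed: close the circuit and allow a trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

Failing fast once the breaker trips is what stops a slow or erroring service from tying up every caller's workers and cascading the failure upstream.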
Since the number of service calls grows rapidly, it is necessary to maintain a well-defined interface for every call. We knew we wanted to use an IDL for managing this interface, and we ultimately decided on Thrift. Thrift forces service owners to publish strict interface definitions, which streamlines the process of integrating with services. Calls that do not abide by the interface are rejected at the Thrift level instead of leaking into a service and failing deeper within the code. This strategy of publicly declaring your interface emphasizes the importance of backwards compatibility, since multiple versions of a service’s Thrift interface could be in use at any given time. The service author must not make breaking changes, and instead must only make non-breaking additions to the interface definition until all consumers are ready for deprecation.
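Tincup's actual Thrift definitions are not public, but a hypothetical IDL covering its two endpoints might look like the following; all field names and types are illustrative:

```thrift
/**
 * Hypothetical Thrift IDL for a currency service's two endpoints.
 * Names and fields are illustrative, not Uber's published interface.
 */
struct Currency {
  1: required string code    // ISO 4217 code, e.g. "EUR"
  2: required string symbol  // display symbol, e.g. "€"
}

service Tincup {
  Currency getCurrency(1: string code)

  /** Current exchange rate for the given currency, per USD. */
  double getExchangeRate(1: string code)
}
```

With such a definition published, a consumer sending a malformed request is rejected at the Thrift layer, and a breaking change is as visible as editing the IDL itself; non-breaking evolution means adding new optional fields or new methods rather than changing existing ones.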
Preparing Tincup for the Big Leagues: Production
Finally, as Tincup's implementation neared completion, we used a few helpful tools to prepare it for the production environment:
First, we acknowledged that Uber’s traffic varies with the time of day, day of week, and day of year. We see huge peaks at expected times, like New Year’s Eve and Halloween, so we must ensure that services can handle this increased load before we launch them. As required at Uber when launching a new service, we used Hailstorm, our internally built load-testing service, to load test Tincup’s endpoints and determine shortcomings and breaking points.
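Hailstorm itself is internal to Uber, but the core idea of a load test can be sketched: drive many concurrent requests at an endpoint and record latency statistics to find where response times degrade. Here `call_endpoint` is a hypothetical stand-in for a real request:

```python
# Minimal load-test sketch (Hailstorm itself is internal to Uber):
# fire N requests with bounded concurrency and report latencies to
# find the point where response times start to degrade.
import asyncio
import statistics
import time


async def call_endpoint():
    # Hypothetical stand-in for an HTTP/RPC call to a service endpoint.
    await asyncio.sleep(0.01)


async def run_load(concurrency: int, requests: int):
    latencies = []
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async def one():
        async with sem:
            start = time.monotonic()
            await call_endpoint()
            latencies.append(time.monotonic() - start)

    await asyncio.gather(*(one() for _ in range(requests)))
    return statistics.median(latencies), max(latencies)


p50, worst = asyncio.run(run_load(concurrency=20, requests=100))
print(f"p50={p50 * 1000:.0f}ms worst={worst * 1000:.0f}ms")
```

Ramping `concurrency` upward run by run, and watching the tail latency rather than the median, is what surfaces a service's breaking point before real peak traffic does.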
Next, we considered another major goal of Uber Engineering: using hardware more efficiently. Since Tincup is a relatively lightweight service, it can easily share machines with other microservices. Sharing is caring, right? Well, not always: we still want to ensure that each service runs independently and doesn’t affect other services running on the same machine. To prevent that problem, we deployed Tincup using uContainer (Docker at Uber) for resource isolation and limits. As its name implies, uContainer leverages Linux containers and Docker to containerize Uber services. It wraps a service in an isolated environment to guarantee that the service runs consistently, regardless of the other processes running on the same host. uContainer extends Docker’s capabilities by adding 1) features for more flexible builds and 2) tools for more visibility into the Docker containers.
Finally, to prepare for outages and network connectivity issues that inevitably arise in production, we used an internal tool called uDestroy to unleash controlled chaos on our services. By simulating disruption on our own terms, we gain visibility into our systems’ resilience. Since we periodically and purposefully disrupt our systems as they evolve, we can identify vulnerabilities and continually work to improve durability.
We learned several lessons about expanding the SOA through building Tincup:
- Migrating consumers is a long, slow process, so make it as easy as you can. Provide code examples. Budget time to walk people through this migration.
- Learning a tech stack is best on a small service. Tincup’s application logic is very simple, which allowed developers to focus on learning the new tech stack rather than the detailed migration of business logic.
- Devoting the initial time to develop extensive unit and integration tests pays dividends down the road. Debugging issues with the code is much easier (and less stressful!) if done in a development environment.
- Load test as early and often as you can. There’s nothing worse than finding out that your system can’t handle peak traffic after you have spent weeks or months implementing.
Microservices at Uber
Uber’s SOA migration has presented opportunities for many people to own services, even those with limited industry experience. Owning a service is a big responsibility, but Uber’s open, knowledge-sharing culture makes picking up a new set of technologies and owning a codebase a rewarding and valuable experience.