Like many startups, Uber began its journey with a monolithic architecture, built for a single offering in a single city. Back then, all of Uber was our UberBLACK option and our “world” was San Francisco. Having one codebase seemed “clean” and solved our core business problems, which included connecting drivers with riders, billing, and payments. It was reasonable to keep all of Uber’s business logic in one place. As we rapidly expanded into more cities and introduced new products, this quickly changed.
As core domain models grew and new features were introduced, our components became tightly coupled, and encapsulation became hard to enforce, which made separation of concerns difficult. Continuous integration turned into a liability because deploying the codebase meant deploying everything at once. Our engineering team grew rapidly, which meant handling not only more requests but also a significant increase in developer activity. Adding new features, fixing bugs, and resolving technical debt all in a single repo became extremely difficult. Tribal knowledge was required before attempting to make a single change.
Moving to a SOA
We decided to follow the lead of other hyper-growth companies—Amazon, Netflix, SoundCloud, Twitter, and others—and break up the monolith into multiple codebases to form a service-oriented architecture (SOA). Specifically, since the term SOA tends to mean a variety of different things, we adopted a microservice architecture. This design pattern enforces the development of small services dedicated to specific, well-encapsulated domain areas. Each service can be written in its own language or framework, and can have its own database or lack thereof.
Migrating from a monolithic codebase to a distributed SOA solved many of our problems, but it created a few new ones as well. These problems fall into three main areas: obviousness (how easy a service is to find and use), safety, and resiliency.
With 500+ services, finding the appropriate service becomes arduous. Once it is identified, how to use the service is not obvious, since each microservice is structured in its own way. Services providing REST or RPC endpoints (where you can access functionality within that domain) typically offer weak contracts, and in our case these contracts vary greatly between microservices. Adding JSON Schema to a REST API can improve safety and the process of developing against the service, but it is not trivial to write or maintain. Finally, these solutions do not provide any guarantees regarding fault tolerance or latency. There’s no standard way to handle client-side timeouts and outages, or to ensure that an outage of one service does not cause cascading outages. The overall resiliency of the system suffers as a result. As one developer put it, we “converted our monolithic API into a distributed monolithic API”.
It has become clear that we need a standard way of communicating that provides type safety, validation, and fault tolerance. Other goals include:
- Simple ways to provide client libraries
- Cross-language support
- Tunable default timeouts and retry policies
- Efficient testing and development
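As an illustration of the "tunable default timeouts and retry policies" goal, here is a minimal Python sketch of a client-side wrapper. It is not Uber's library. The names `CallPolicy`, `call_with_policy`, and `make_flaky`, along with the default values, are assumptions made up for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class CallPolicy:
    """Hypothetical per-client defaults; the values are illustrative only."""
    timeout_s: float = 1.0   # budget handed to each attempt
    max_attempts: int = 3    # total tries, including the first
    backoff_s: float = 0.1   # base delay, doubled after each failed attempt


def call_with_policy(fn: Callable[[float], T],
                     policy: CallPolicy = CallPolicy(),
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(timeout_s), retrying on TimeoutError or ConnectionError."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn(policy.timeout_s)
        except (TimeoutError, ConnectionError):
            if attempt == policy.max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(policy.backoff_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def make_flaky(failures: int) -> Callable[[float], str]:
    """Build a fake remote call that times out `failures` times, then succeeds."""
    state = {"left": failures}

    def fetch(timeout_s: float) -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("simulated slow service")
        return "ok"

    return fetch
```

Calling `call_with_policy(make_flaky(2))` succeeds on the third attempt under the default policy. A service owner could ship a different `CallPolicy` with the client library so that consumers get sensible defaults without writing their own retry loops.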
At this stage in our hyper-growth, Uber engineers continue to evaluate technologies and tools to fit our goals. One thing we do know is that using an existing Interface Definition Language (IDL) that provides lots of pre-built tooling from day one is ideal.
We evaluated the existing tools and found that Apache Thrift (made popular by Facebook and Twitter) met our needs best. Thrift is a set of libraries and tools for building scalable cross-language services. To accomplish this, datatypes and service interfaces are defined in a language-agnostic file. Then, code is generated to abstract the transport and encoding of RPC messages between services written in all of the languages we support (Python, Node, Go, etc.).
In addition to Thrift, we’re creating lifecycle tooling to publish these clients to packaging systems (such as pip for Python and npm for Node). Discovering and contributing to the service then becomes a manageable task. Service clients also act as learning tools, in addition to docs and wikis.
The most compelling argument for Thrift is its safety. Thrift guarantees safety by binding services to use strict contracts. The contract describes how to interact with that service including how to call service procedures, what inputs to provide, and what output to expect. In the following Thrift IDL we have defined a service Zoo with a function makeSound that takes a string animalName and returns a string or throws an exception.
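A sketch of what such a contract might look like, with the IDL held in a Python string next to a hand-written stand-in for the handler that would implement it. Only `Zoo`, `makeSound`, and `animalName` come from the description above. The exception name `AnimalNotFound`, its `message` field, and the sample animals are assumptions, and real Thrift tooling would generate the Python types rather than have them written by hand.

```python
# Hypothetical Thrift IDL matching the description in the text.
ZOO_IDL = """
exception AnimalNotFound {
  1: string message
}

service Zoo {
  string makeSound(1: string animalName) throws (1: AnimalNotFound notFound)
}
"""


class AnimalNotFound(Exception):
    """Plain-Python stand-in for the exception type Thrift would generate."""


class ZooHandler:
    """Hand-written stand-in for a handler implementing the Zoo service."""

    _SOUNDS = {"cow": "moo", "duck": "quack"}  # illustrative data only

    def makeSound(self, animalName: str) -> str:
        # The contract promises a string in and a string out, or a declared error.
        if not isinstance(animalName, str):
            raise TypeError("animalName must be a string")
        try:
            return self._SOUNDS[animalName]
        except KeyError:
            raise AnimalNotFound(f"no such animal: {animalName}") from None
```

A caller reading the IDL knows the procedure name, the argument type, the return type, and the one declared failure mode without reading the service's source.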
Adhering to a strict contract means less time is spent figuring out how to communicate with a service and dealing with serialization. In addition, as a microservice evolves we do not have to worry about interfaces changing suddenly, and are able to deploy services independently from consumers. This is very good news for Uber engineers. We’re able to move on to other projects and tools since Thrift solves the problem of safety out of the box.
Lastly, we drew inspiration from fault tolerance and latency libraries in other companies facing similar challenges, such as Netflix’s Hystrix library and Twitter’s Finagle library, to tackle the problem of resiliency. With those libraries in mind, we wrote libraries that ensure clients are able to deal with failure scenarios successfully (which will be discussed in more detail in a future post).
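One pattern that Hystrix helped popularize is the circuit breaker, which fails fast once a dependency looks unhealthy instead of letting callers pile up behind it. The following is a generic, minimal Python sketch of that pattern under assumed thresholds. It is not Hystrix's or Finagle's API, and it is not the library described above. All names here are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            # Cool-down has elapsed: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each outbound client call in `breaker.call(...)` means that a struggling downstream service produces quick, explicit errors in its callers rather than a chain of slow timeouts.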
Tradeoffs and Where We’re Headed
Of course, no solution is perfect. Thrift’s toolset is relatively young, and tools for Python and Node are not abundant. There is a risk that a lot of time will be invested in creating these tools. Additionally, there is no higher-level support for headers. Authentication and cross-service tracing, for example, are two challenging problems, since higher-level metadata would have to be passed in with every call.
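To make the header limitation concrete, here is a small Python sketch of what passing metadata in on every call looks like. The `RequestMeta` type, the two functions, and the fare numbers are all hypothetical. The point is only that each procedure has to accept the metadata and forward it by hand.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class RequestMeta:
    """Hypothetical metadata that every procedure must accept explicitly."""
    auth_token: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Stand-in for a tracing backend: records (trace_id, procedure) pairs.
TRACE_LOG: List[Tuple[str, str]] = []


def lookup_rate(meta: RequestMeta, city: str) -> int:
    """Downstream call: cents per mile, with made-up numbers."""
    TRACE_LOG.append((meta.trace_id, "lookup_rate"))
    return {"san_francisco": 250}.get(city, 200)


def quote_fare(meta: RequestMeta, city: str, miles: int) -> int:
    """Upstream call: must forward meta by hand to keep the trace connected."""
    TRACE_LOG.append((meta.trace_id, "quote_fare"))
    return lookup_rate(meta, city) * miles
```

If any procedure in the chain forgets to forward `meta`, the trace breaks and the downstream service loses the caller's credentials, which is why first-class header support would be preferable.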
Dismantling our well-worn monolith has been a long time coming. While it has been a key component that enabled our explosive growth in the past, it has grown cumbersome and difficult to scale further and maintain.
Our goal for the remainder of 2015 is to get rid of this repo entirely—promoting clear ownership, offering better organizational scalability, and providing more resilience and fault tolerance through our commitment to microservices.