Update: This article discusses the lower half of the stack. For the rest, see Part II: The Edge and Beyond.
Uber’s mission is transportation as reliable as running water, everywhere, for everyone. To make that possible, we create and work with complex data. Then we bundle it up neatly as a platform that enables drivers to get business and riders to get around.
While we want Uber’s UI to be simple, we engineer complex systems behind it to stay up, handle difficult interactions, and serve massive amounts of traffic. We’ve broken up the original monolithic architecture into many parts to scale with growth. With hundreds of microservices that depend on each other, drawing a diagram of how Uber works at this point is wildly complicated, and it all changes rapidly. What we can cover in a two-part article is the stack we used as of spring 2016.
Uber Engineering’s Challenges: No Free Users, Hypergrowth
We have the same global-scale problems as some of the most successful software companies, but 1) we’re only six years old, so we haven’t solved them yet, and 2) our business is based in the physical world in real time.
Unlike freemium services, Uber has only transactional users: riders, drivers, and now eaters and couriers. People rely on our technology—to make money, to go where they need to go—so there’s no safe time to pause. We prioritize availability and scalability.
As we expand on the roads, our service must scale. Our stack’s flexibility encourages competition so the best ideas can win. These ideas aren’t necessarily unique. If a strong tool exists, we use it until our needs exceed its abilities. When we need something more, we build in-house solutions. Uber Engineering has responded to growth with tremendous adaptability, creativity, and discipline in the past year. For 2016, we have even bigger plans. By the time you read this, much will have changed, but this is a snapshot of what we’re using now. Through our descriptions, we hope to demonstrate our philosophy around using tools and technologies.
Uber’s Tech Stack
Instead of a tower of restrictions, picture a tree. Looking at the technologies across Uber, you see a common stack (like a tree trunk) with different emphases for each team or engineering office (its branches). It’s all made of the same stuff, but tools and services bloom differently in various areas.
We’ll start from the bottom (worked for Drake).
This first article focuses on the Uber platform, meaning everything that powers the broader Uber Engineering organization. Platform teams create and maintain things that enable other engineers to build programs, features, and the apps you use.
Infrastructure and Storage
Our business runs on a hybrid cloud model, using a mix of cloud providers and multiple active data centers. If one data center fails, trips (and all the services associated with trips) fail over to another one. We assign cities to the geographically closest data center, but every city is backed up on a different data center in another location. This means that all of our data centers are running trips at all times; we have no notion of a “backup” data center. To provision this infrastructure, we use a mix of internal tools and Terraform.
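The city-to-data-center assignment described above can be sketched in a few lines of Python. This is purely a hypothetical illustration of the scheme, not Uber's provisioning code: pick the geographically closest data center as a city's primary, and the closest remaining one as its backup, so every data center serves live trips.

```python
import math

# Hypothetical data center coordinates (lat, lon); the names and
# locations are illustrative, not Uber's real sites.
DATA_CENTERS = {
    "dc-west": (37.77, -122.42),
    "dc-east": (40.71, -74.01),
}

def distance_km(a, b):
    # Great-circle distance via the haversine formula.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def assign(city_coords):
    """Return (primary, backup): the nearest data center, then the
    nearest *different* one, so no city depends on a single site."""
    ranked = sorted(DATA_CENTERS,
                    key=lambda dc: distance_km(city_coords, DATA_CENTERS[dc]))
    return ranked[0], ranked[1]

primary, backup = assign((34.05, -118.24))  # Los Angeles
```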
Our needs for storage have changed with growth. A single Postgres instance got us through our infancy, but as we grew so quickly, we needed to increase available disk storage and decrease system response times.
We currently use Schemaless (built in-house on top of MySQL), Riak, and Cassandra. Schemaless is for long-term data storage; Riak and Cassandra meet high-availability, low-latency demands. Over time, Schemaless instances replace individual MySQL and Postgres instances.
We use Redis for both caching and queuing. Twemproxy provides scalability of the caching layer without sacrificing cache hit rate via its consistent hashing algorithm. Celery workers process async workflow operations using those Redis instances.
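Consistent hashing is what lets Twemproxy grow or shrink the cache tier without invalidating most of the cache: each server owns many points on a ring, and removing a server only remaps the keys that lived on its points. A toy Python sketch of the idea (Twemproxy's real ketama implementation differs in detail):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: a key maps to the first server point
    at or past the key's own hash position on the ring."""

    def __init__(self, servers, points_per_server=100):
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key):
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
smaller = HashRing(["redis-1", "redis-2"])
keys = [f"rider:{i}" for i in range(1000)]
# Dropping redis-3 leaves every key that wasn't on redis-3 in place,
# which is why the cache hit rate survives topology changes.
stable = all(smaller.get(k) == ring.get(k)
             for k in keys if ring.get(k) != "redis-3")
```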
Our services interact with each other and mobile devices, and those interactions are valuable for internal uses like debugging as well as business cases like dynamic pricing. For logging, we use multiple Kafka clusters, and the data is archived into Hadoop and/or a file storage web service before it expires from Kafka. This data is also ingested in real time by various services and indexed into an ELK stack for searching and visualizations (ELK stands for Elasticsearch, Logstash, and Kibana).
We use Docker containers on Mesos to run our microservices with consistent configurations scalably, with help from Aurora for long-running services and cron jobs. One of our infrastructure teams, Application Platform, produced a template library that builds services into shippable Docker images.
Routing and Service Discovery
Our service-oriented architecture (SOA) makes service discovery and routing crucial to Uber’s success. Services must be able to communicate with each other in our complex network. We’ve used a combination of HAProxy and Hyperbahn to solve this problem. Hyperbahn is part of a collection of open source software developed at Uber: Ringpop, TChannel, and Hyperbahn all share a common mission to add automation, intelligence, and performance to a network of services.
Legacy services use local HAProxy instances to route JSON over HTTP requests to other services, with front-end web server NGINX proxying to servers in the back end. This well-established way of transferring data makes troubleshooting easy, which was crucial throughout several migrations to newly developed systems in the last year.
However, we’re prioritizing long-term reliability over debuggability. Alternative protocols to HTTP (like SPDY, HTTP/2, and TChannel) along with interface definition languages like Thrift and Protobuf will help evolve our system in terms of speed and reliability. Ringpop, a consistent hashing layer, brings cooperation and self-healing to the application level. Hyperbahn enables services to find and communicate with others simply and reliably, even as services are scheduled dynamically with Mesos.
Instead of archaically polling to see if something has changed, we’re moving to a pub-sub pattern (publishing updates to subscribers). HTTP/2 and SPDY more easily enable this push model. Several poll-based features within the Uber app will see a tremendous speedup by moving to push.
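The pub-sub pattern itself is simple to sketch. Here is a minimal in-process version in Python, purely illustrative of the push model (in practice the push rides on protocols like HTTP/2 and SPDY across the network):

```python
from collections import defaultdict

class Broker:
    """Minimal in-process pub-sub: subscribers register a callback per
    topic and updates are pushed to them immediately, instead of each
    consumer polling to see whether anything changed."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subs[topic]:
            callback(message)

broker = Broker()
seen = []
broker.subscribe("trip.updates", seen.append)
broker.publish("trip.updates", {"trip_id": 1, "status": "en_route"})
# `seen` now holds the update, with no polling loop involved.
```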
Development and Deploy
Phabricator powers a lot of internal operations, from code review to documentation to process automation. We search through our code on OpenGrok. For Uber’s open source projects, we develop in the open using GitHub for issue tracking and code reviews.
Uber Engineering strives to make development simulate production as closely as possible, so we develop mostly on virtual machines running on a cloud provider or a developer’s laptop. We built our own internal deployment system to manage builds. Jenkins does continuous integration. We combined Packer, Vagrant, Boto, and Unison to create tools for building, managing, and developing on virtual machines. We use Clusto for inventory management in development. Puppet manages system configuration.
We constantly work to build and maintain stable communication channels, not just for our services but also for our engineers. For information discovery, we built uBlame (a nod to git-blame) to keep track of which team owns a particular service, and Whober for looking up names, faces, contact information, and organizational structure. We use an in-house documentation site that autobuilds docs from repositories using Sphinx. An enterprise alerting service pages our on-call engineers to keep systems running. Most developers run OS X on their laptops, and most of our production instances run Linux with Debian Jessie.
Languages
At the lower levels, Uber’s engineers primarily write in Python, Node.js, Go, and Java. We started with two main languages: Node.js for the Marketplace team, and Python for everyone else. These first languages still power most services running at Uber today.
We adopted Go and Java for high performance reasons. We provide first-class support for these languages. Java takes advantage of the open source ecosystem and integrates with external technologies, like Hadoop and other analytics tools. Go gives us efficiency, simplicity, and runtime speed.
We rip out and replace older Python code as we break up the original code base into microservices. An asynchronous programming model gives us better throughput. We use Tornado with Python, but Go’s native support for concurrency is ideal for most new performance-critical services.
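The throughput win from an asynchronous model is easy to demonstrate: concurrent calls to a slow downstream resource cost roughly one call's latency rather than the sum of all of them. A small sketch using Python's standard asyncio (we use Tornado in production; asyncio stands in here because it is self-contained):

```python
import asyncio
import time

async def fetch(i):
    # Stand-in for a slow downstream call (network, disk, another service).
    await asyncio.sleep(0.05)
    return i

async def main():
    # Issue all 20 calls concurrently; done sequentially they would
    # take about 1 second, concurrently closer to 0.05 seconds.
    return await asyncio.gather(*(fetch(i) for i in range(20)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
```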
We write tools in C and C++ when it’s necessary (like for high-efficiency, high-speed code at the system level). We use software that’s written in those languages—HAProxy, for example—but for the most part, we don’t actually work in them.
And, of course, those working at the top of the stack write in languages beyond Java, Go, Python, and Node.
To make sure that our services can handle the demands of our production environment, we’ve developed two internal tools: Hailstorm and uDestroy. Hailstorm drives integration tests and simulates peak load during off-peak times, while uDestroy intentionally breaks things so we can get better at handling unexpected failures.
Our employees use a beta version of the app to continuously test new developments before they reach users. We made an app feedback reporter to catch any bugs before we roll out to users. Whenever we take a screenshot in the Uber apps, this feature prompts us to file a bug-fix task in Phabricator.
Engineers who write backend services are responsible for their operations. If they write code that breaks in production, they get paged. We use Nagios for monitoring, tied to an alerting system for notifications.
Aiming for the best availability and 1 billion rides per day, site reliability engineers focus on getting services what they need to succeed.
Observability
Observability means making sure Uber as a whole, and its different parts, are healthy. Mostly developed by our New York City office, a collection of systems acts as the eyes, ears, and immune system of Uber Engineering around the world.
We developed M3 in Go to collect and store metrics from every part of Uber Engineering (every server, host service, and piece of code).
After we collect the data, we look for trends. We built dashboards and graphs by modifying Grafana to more expressively contextualize information. Every engineer watching a dashboard tends to care about data in a particular location or region, around a set of experiments, or related to a certain product. We added data slicing and dicing to Grafana.
Argos, our in-house anomaly detection tool, examines incoming metrics and compares them to predictive models based on historical data to determine whether current data is within the expected bounds.
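A crude stand-in for Argos's idea can be written with historical mean and standard deviation playing the role of the predictive model; this is only a sketch of the bounds-checking concept, as Argos itself uses far richer models:

```python
import statistics

def expected_bounds(history, k=3.0):
    """Derive (low, high) bounds from historical data: k standard
    deviations around the historical mean."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean - k * stdev, mean + k * stdev

def is_anomalous(value, history, k=3.0):
    """Flag an incoming metric that falls outside the expected bounds."""
    low, high = expected_bounds(history, k)
    return not (low <= value <= high)

# Hypothetical per-minute trip counts for one city.
history = [100, 103, 98, 101, 99, 102, 100, 97]
is_anomalous(101, history)  # within historical bounds
is_anomalous(10, history)   # trips far below trend: anomaly
```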
Acting on Metrics
Uber’s μMonitor tool enables engineers to view this information, set thresholds (either static or Argos’s smart thresholds), and take action on them. If a stream of data goes out of bounds, say trips dropping below a certain threshold in some city, that information gets passed to the Common Action Gateway, our automated response system. Instead of paging engineers when there’s a problem, it does something about it, reducing the problem’s duration. If a deploy causes a problem, the rollback is automatic.
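The monitor-to-gateway flow might be sketched as follows; all names here are illustrative, not Uber's real APIs:

```python
def monitor(metric_stream, threshold, action):
    """Sketch of threshold monitoring feeding an automated response:
    when a metric crosses a static threshold, fire an action (e.g. a
    rollback) instead of paging an engineer first."""
    fired = []
    for point in metric_stream:
        if point < threshold:
            fired.append(action(point))
    return fired

def rollback(value):
    # Placeholder for the automated response system's action.
    return f"rollback triggered at trips={value}"

# Hypothetical per-minute trip counts; one dips below the threshold.
alerts = monitor([120, 115, 40, 118], threshold=100, action=rollback)
```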
Most of our observability tools are kept within Uber because they’re specific to our infrastructure, but we hope to extract and open source the general-purpose parts soon.
Using Data Creatively
Storm and Spark crunch data streams into useful business metrics. Our data visualization team creates reusable frameworks and applications to consume visual data.
Mapping and experimentation teams rely on data visualization to convert data into clear, sensible information. City operations teams can see the drivers in their city flowing in real time as cars on a map instead of deriving insights from tedious SQL queries.
We also develop visualization frameworks whose charting components other technologies, like R, Shiny, and Python, can access. We want high-data-density visualizations that perform smoothly in the browser; to achieve both goals, we developed open source, WebGL-based visualization tools.
Uber’s maps teams prioritize the datasets, algorithms, and tooling around map data, display, routing, and systems for collecting and recommending addresses and locations. Map Services runs on a primarily Java-based stack.
The highest-volume service in this area is Gurafu, which provides a set of utilities for working with road map data and improves efficiency and accuracy by providing more sophisticated routing options. Gurafu is fronted by µETA, which adds a business logic layer on top of raw ETAs (things like experiment group segmentation). Both Gurafu and µETA are web services built on the Dropwizard framework.
Our business and customers rely on highly accurate ETAs, so Map Services engineers spend a lot of time making these systems more correct. We perform ETA error analysis to identify and fix sources of error. And beyond accuracy, the scale of the problem is interesting: every second, systems across the organization make huge numbers of decisions using ETA information. As the latency of those requests must be on the order of 5 milliseconds, algorithmic efficiency becomes a big concern. We have to be concerned with the way we allocate memory, parallelize computations, and make requests to slow resources like the system disk or the data center network.
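At its simplest, ETA error analysis reduces to comparing predicted and actual trip times over a batch of completed trips. A toy Python version computing mean absolute error and mean absolute percentage error (the real pipeline slices errors by city, route type, time of day, and so on):

```python
def eta_errors(predicted, actual):
    """Return (MAE in seconds, MAPE in percent) for a batch of trips."""
    pairs = list(zip(predicted, actual))
    mae = sum(abs(p - a) for p, a in pairs) / len(pairs)
    mape = sum(abs(p - a) / a for p, a in pairs) / len(pairs) * 100
    return mae, mape

# Hypothetical ETAs versus observed trip durations, in seconds.
predicted = [300, 480, 600]
actual = [330, 450, 600]
mae, mape = eta_errors(predicted, actual)
```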
Map Services also powers all of the backend technology behind the search boxes in the rider and driver app. The technologies include the autocomplete search engine, the predictions engine, and the reverse geocoding service. The autocomplete search engine allows high-speed, locally-biased location search for places and addresses. Our predictions engine uses machine learning to predict the rider’s destination based on a combination of user history and other signals. Predictions account for ~50% of destinations entered. The reverse geocoder determines the user’s location based on GPS, which we augment with additional information for suggested Uber pickup spots based on our overall trip history.
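Locally biased search can be illustrated with a toy ranking: prefix-match candidate place names, then sort matches by their distance from the user so nearby places surface first. The places and the squared-coordinate scoring below are made up for illustration; the production engine is far more sophisticated:

```python
PLACES = [
    # (name, lat, lon) -- a made-up sample index.
    ("Ferry Building", 37.7955, -122.3937),
    ("Ferry Point Park", 40.8121, -73.8290),
    ("Fernwood Cafe", 37.8601, -122.5583),
]

def autocomplete(prefix, user_lat, user_lon, limit=2):
    """Prefix-match names, then rank nearer results first using squared
    coordinate distance as a cheap proxy for true geo distance."""
    matches = [p for p in PLACES
               if p[0].lower().startswith(prefix.lower())]
    matches.sort(key=lambda p: (p[1] - user_lat) ** 2
                               + (p[2] - user_lon) ** 2)
    return [name for name, _, _ in matches[:limit]]

# A San Francisco user typing "fer" sees local places first; a New York
# user typing the same prefix sees Ferry Point Park instead.
autocomplete("fer", 37.77, -122.42)
```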
Above this, we enter the parts of the stack that interact with your phone. Stay tuned for the next post. While Uber’s technologies and challenges will likely change, our mission and culture of overcoming them will last. Want to be a part of it?
Photo Credit: “Chapman’s Baobab” by Conor Myhrvold, Botswana.
Header Explanation: Baobab trees are renowned for their resilience, longevity and thick trunk and branches. Chapman’s Baobab in the Kalahari desert is one of Africa’s oldest trees.