Uber’s busy 2019 included our billionth delivery of an Uber Eats order, 24 million miles covered by bike and scooter riders on our platform, and trips to top destinations such as the Empire State Building, the Eiffel Tower, and the Golden Gate Bridge. Behind the scenes of all this activity however, there is a story about data, and the innovations we made to our data infrastructure to support our platform services.
At our scale and global scope, supporting the Uber platform 24/7 in real-time means petabytes of data coursing through our infrastructure. We not only use this data to show people scooter locations or up-to-date restaurant menus, but continually analyze aggregated, anonymous trends to see where our services can run more smoothly.
One promising area we found to improve efficiency involved applying the practice of data science to our infrastructure, which enabled us to compute optimum datastore and hardware usage. We also launched new projects internally to better manage our data, maintaining its freshness and quality in the face of continual growth. And we built a new analytics platform so we can garner critical business insights in seconds.
While not the totality of our data platform work in 2019, these projects represent a critical tip-of-the-iceberg.
Using data science to optimize our data platform
Building an effective data infrastructure goes far beyond simply setting up databases and filling them with data. For some of our use cases, new data comes in every second of every day and records need to be continually updated. In others, data arrives at a slower cadence and requires fewer updates. Likewise, some of our analytics need real-time data, while others rely on historic data patterns.
These differing use cases open the door for optimization through data science, calculating cost functions and other means of determining the best way to store data. In one such optimization project, our Data Science and Data Warehouse teams worked together to analyze the utility of tables in our data warehouse, determining which needed to remain available to our low-latency analytics engine, and which could be off-boarded to less costly options.
The models our data scientists developed took into account factors such as the number of queries and users of individual tables, their maintenance costs, and dependencies between tables. Off-loading specific tables with low utility resulted in a potential 30 percent reduction in our database cost, and our teams are currently considering how to apply artificial intelligence to take this work further.
Similarly, our teams considered how to optimize Vertica, a popular interactive data analytics platform in use at Uber. As our needs evolved, the most straightforward solution would involve duplicating Vertica clusters in our infrastructure. However, that solution was not cost effective.
Instead, we devised a means of partially replicating our Vertica clusters, distributing data between them in a more efficient manner. Our Data Science team came up with a cost function to show how partial replication could work, while the Data Warehouse team built components to make the solution work seamlessly in production. This solution was able to significantly reduce overall disk consumption by over 30 percent, while continuing to provide the same level of compute scalability and database availability.
With data infrastructure at our scale, even small optimizations result in huge gains, requiring fewer resources while speeding up queries, and ultimately making Uber’s services run more smoothly.
Serving multiple disparate business lines, such as Uber Freight, facilitating cargo hauling between carriers and shippers, and Uber Eats, connecting delivery persons, restaurants, and eaters, requires equally disparate data. Understanding the full lifecycle of data, from its origination, various transformations, to its final destination is essential to guarantee data quality and fidelity.
Our internal product, uLineage, tracks where data comes from, where it’s been, and how it’s been transformed. This comprehensive system maintains information from the moment data comes in through an Uber service, as it’s transported by Apache Kafka, and when it’s stored in Apache Hive. uLineage propagates data freshness and quality information while supporting advanced search and graph settings.
Of more limited scope but equal value to our operations, we also launched a global index for large Apache Hadoop tables, another part of our Big Data infrastructure. Our global index serves as a bookkeeping component, showing services where particular pieces of data are stored so they can make updates or surface query results. Using Apache HBase, a non-relational distributed database, for our global index ensures high throughput, strong consistency, and horizontal scalability.
We built another component for our Apache Hadoop data lake, DBEvents, to serve as a change data capture system. We designed DBEvents around three principles: data freshness, data quality, and efficiency. DBEvents captures data during ingestion from sources such as MySQL, Apache Cassandra, and Schemaless, updating our Hadoop data lake. This solution manages petabytes of data through standardized changelogs, ensuring that services have a uniform understanding of the data available.
Analyzing the data around rideshare, bicycle, and scooter trips taken through our platform or any of our other services, serves both troubleshooting and proactive improvement. For example, if the data shows that riders are waiting longer than average to connect to driver-partners, we can analyze the problem in real-time and see if our operations teams can help. We can also analyze historic data about our services and develop new features to improve them, such as making it easier for riders to find driver-partners in crowded pick-up areas.
Our need for timely and usable analytics led us to build a new platform, which currently powers several business-critical dashboards. This platform, whose dashboard is depicted in Figure 2, above, consolidates existing real-time analytics systems, including AresDB and Apache Pinot, into one unified self-serve platform, letting internal analytics users submit queries through Presto, an open source, distributed SQL query engine supported by our engineers.
Given its support for SQL queries, the new platform also makes data analytics more accessible for our internal users making business-critical decisions. Equally as important, it delivers results with low latency, letting us act on issues quickly.
Building for the future
Maintaining a data infrastructure that reliably supports quality and freshness is an important part of Uber’s future. Data must be as accurate and as timely as possible to support services on our platform. We worked hard over the past years to build infrastructure that scales to our global, 24/7 operations. A key part of our current work involves making our infrastructure run efficiently and support all of our internal users.
Our engineers made great advances throughout 2019, many more than are documented in this article. We look forward to further optimizing our data infrastructure in the new year and making Uber’s platform services run better than ever.
Interested in contributing to the growth of Uber’s Data Platform in 2020 and beyond? Apply for a role on one of our data teams!