Data powers Uber
Uber has revolutionized how the world moves by powering billions of rides and deliveries connecting millions of riders, businesses, restaurants, drivers, and couriers. At the heart of this massive transportation platform are Big Data and Data Science, which power everything Uber does: better pricing and matching, fraud detection, lower ETAs, and experimentation. Petabytes of data are collected and processed every day, and thousands of users derive insights and make decisions from this data to build and improve these products.
Problems beyond scale
While we are able to scale our data systems, we previously didn’t focus enough on some vital data problems that become even more important at scale. Some of the specific problems that emerged include:
- Data Duplication: We didn’t have a source of truth for some critical data and metrics, which led to duplication, inconsistency, and a lot of confusion at consumption time about which data and metrics to use. Consumers had to compensate by doing a lot of due diligence, taking time away from solving business problems. The problem was exacerbated by the hundreds of thousands of datasets created using self-service tools, with no obvious indication of which datasets were important.
- Discovery Issues: Discovering the data among hundreds of thousands of datasets was hard without rich metadata and faceted search. Poor discovery resulted in duplicate datasets, repetitive work, and inconsistent answers depending on what data was used to answer a question.
- Disconnected Tools: Data flows through many tools, systems, and orgs. But our tools did not integrate with each other and led to duplication of effort and poor developer experience — for instance, one would have to copy and paste documentation and owner information across multiple tools; developers could not confidently change schemas because it wasn’t obvious how it was consumed downstream.
- Logging Inconsistencies: Logging on mobile devices was done manually; logs lacked a structure that would enable easy, consistent measurement of actual user behavior, so user behavior had to be inferred, which is inefficient and error-prone.
- Lack of Process: The lack of data engineering processes across teams resulted in varying levels of maturity. There was no consistency across teams in how data quality was defined, measured, or maintained.
- Lack of Ownership & SLAs: Datasets were not properly owned — they often came without quality guarantees; SLAs for bug fixes were inconsistent; and on-call and incident management were far from what we practiced for services.
These problems are not unique to Uber — based on our conversations with engineers and data scientists in other companies, these are common problems, particularly for those that grew very fast. While services and service quality tend to get more focus due to immediate visibility in failures/breakage, data and related tools often tend to take a backseat. But fixing them and bringing them on par with the level of rigor in service tooling/management becomes extremely important at scale, especially if data plays a critical role in product functionality and innovation, as it does for Uber.
Need for a holistic approach with data
The picture below shows the high-level flow of data from mobile apps and services to our data warehouse and the ultimate consumption surfaces. Our initial piecemeal, reactive attempts to fix problems only at the points in the data flow where issues surfaced addressed the symptoms rather than the root cause. We realized that we needed a holistic approach that solved these issues end-to-end. Our goal was to restructure the data logging systems, tools, and processes to provide a step-function change in data quality across Uber. We brought together teams spanning the end-to-end data flow stack, including engineers and data scientists from each part of the stack, and ended up modifying over 20 existing systems.
To keep the effort focused on holistic thinking, we took a “slice” of critical data related to trip and session information on the Rider app, and attempted to create a source of truth (SoT) for it, as well as fix the logging on the app, the tools to process the data, the data itself, and the processes necessary to maintain it as SoT.
Approaching data from first principles
Unlike services that try to hide data and expose narrow interfaces outside services, offline data in the warehouse is more about exposing data from related services and domains to be analyzed together. A key realization we had was that in order to do this well, we should address not just the tools for data, but also the people and process aspects of data. So we came up with a few guiding principles:
- Data as code: Data should be treated as code. Creation, deprecation, and critical changes to data artifacts should go through a design review process with appropriate written documents in which consumers’ views are taken into account. Schema changes have mandatory reviewers who sign off before changes land. Schema reuse/extension is preferred to creating new schemas. Data artifacts have tests associated with them and are continuously tested. These are practices we normally apply to service APIs, and we should extend the same rigor to data.
- Data is owned: Data is code and all code must be owned. Each data artifact should have a clear owner, a clear purpose, and should be deprecated when its utility is over.
- Data quality is known: Data artifacts must have SLAs for data quality, SLAs for bugs, and incident reporting and management just like we do for services. The owner is responsible for upholding those SLAs.
- Accelerate data productivity: Data tools must be designed to optimize collaboration between producers and consumers, with mandatory owners, documentation, and reviewers where necessary. Data tools must integrate well with other related tools, passing the necessary metadata along seamlessly. Data tools should meet the same developer-grade standards as services, offering the ability to write and run tests before landing changes, to test changes in a staging environment before rolling them out to production, and to integrate well with the existing monitoring/alerting ecosystem.
- Organize for data: Teams should aim to be staffed as “full-stack,” so the necessary data engineering talent is available to take a long view of the data’s whole life cycle. While there are complicated datasets that can be owned by more central teams, most teams that produce data should aim toward local ownership. We should have the necessary training material and prioritize training engineers to be reasonably well versed with basic data production and consumption practices. Finally, team leads should be accountable for the ownership and quality of the data that they produce and consume.
In the rest of this article, we will highlight some of the most useful and interesting takeaways from our experience with the program.
Data Quality and Tiers
We have experienced a lot of toil due to poor data quality. We have seen instances where inaccurate measurement in experiments led to extensive manual labor and lost productivity spent validating and correcting the data. This, it turns out, is a problem that is becoming more common with the widespread use of big data — an IBM study and HBR estimate that businesses suffer substantial losses due to poor-quality data.
To reduce the toil and negative business impact, we wanted to develop a common language and framework for talking about data quality so anyone can produce or consume data with consistent expectations. In order to do this, we developed two main concepts: standard data quality checks and definition of dataset tiers.
Data quality is a complicated topic with many facets worthy of in-depth examination, so we’ll limit our discussion to the areas in which we have made significant progress, and leave the others for future discussion. The context in which data is produced and used at Uber played a significant role in selecting which areas of data quality to focus on; some of that context transfers to other settings, and some does not. Common questions that data producers and consumers face at Uber include: How should we trade off between analyzing the freshest data and the most complete data? Given that we run pipelines in parallel in different data centers, how should we reason about the consistency of data across DCs? What semantic quality checks should run on a given dataset? We wanted to choose a set of checks that provided a framework for reasoning about these questions.
Data Quality Checks
After several iterations, we landed on five main types of data quality checks described below. Every dataset must come with these checks and a default SLA configured:
- Freshness: the time delay between when data is produced and when it is 99.9% complete in the destination system, including a watermark for completeness (3 nines by default), since optimizing for freshness alone without considering completeness leads to poor-quality decisions
- Completeness: % of rows in the destination system compared to the # of rows in the source system
- Duplication: % of rows that have duplicate primary or unique keys, defaulting to 0% duplicate in raw data tables, while allowing for a small % of duplication in modeled tables
- Cross-data-center consistency: % of data loss when a copy of a dataset in the current datacenter is compared to the copy in another datacenter
- Semantic checks: captures critical properties of fields in the data such as null/not-null, uniqueness, # of distinct values, and range of values
Dataset owners can choose to provide different SLAs, with appropriate documentation and reasoning to consumers — for instance, depending on the nature of the dataset one might want to sacrifice completeness for freshness (think streaming datasets). Similarly, consumers can choose to consume datasets based on these metrics — run pipelines based on completeness triggers rather than simply based on time triggers.
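As an illustration, the completeness and duplication checks above reduce to simple row and key counts compared against an SLA. The sketch below uses hypothetical helper names and is not Uber’s actual tooling:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    value: float       # measured percentage
    threshold: float   # SLA threshold, in percent
    passed: bool

def completeness_check(source_rows: int, dest_rows: int,
                       sla_pct: float = 99.9) -> CheckResult:
    """% of source-system rows that arrived in the destination system."""
    pct = 100.0 * dest_rows / source_rows if source_rows else 100.0
    return CheckResult("completeness", pct, sla_pct, pct >= sla_pct)

def duplication_check(keys, max_dup_pct: float = 0.0) -> CheckResult:
    """% of rows whose primary/unique key occurs more than once
    (default SLA of 0% duplicates, as for raw tables)."""
    counts = Counter(keys)
    dup_rows = sum(c for c in counts.values() if c > 1)
    pct = 100.0 * dup_rows / len(keys) if keys else 0.0
    return CheckResult("duplication", pct, max_dup_pct, pct <= max_dup_pct)
```

A consumer could then gate a downstream pipeline on `completeness_check(...).passed` rather than on a wall-clock trigger, mirroring the completeness-trigger idea above.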
We are continuing to work on more sophisticated checks in terms of consistency of concepts across datasets and anomaly detection on top of the checks above on time dimension.
In addition to quality measures, it’s also necessary to associate datasets with different degrees of importance to the business, so we can easily highlight the most important data. We already did this for services by assigning “tiers” based on business criticality, so we applied the same approach to data. These tiers help determine the impact of outages and provide guidelines on which tiers of data should be used for which purposes. For instance, if some data impacts compliance, revenue, or brand, then it should be marked as tier 1 or tier 2. Temporary datasets created by users for ad hoc exploration are less critical and are marked as tier 5 by default; they can be deleted after some fixed time if unused. Tiers also determine the level of incident that must be filed and the SLA for fixing bugs filed against the dataset. A byproduct of tiering is a systematic inventory of the data assets on which we rely to make business-critical decisions. Another benefit of this exercise was the explicit deduplication of datasets that were similar or no longer served as a source of truth. Finally, the visibility enabled by tiering helped us refactor datasets toward better modeling and a coherent data grain and level of normalization.
We have developed automation to generate “tiering reports” for orgs that show which datasets need tiering, usage of tiered data, etc., and that serve as a measure of an organization’s “data health.” We also track these metrics as part of our “eng excellence” metrics. With more adoption and feedback, we are continually iterating on the exact definitions and measurement methodologies, improving them further.
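To make the tiering discussion concrete, here is a toy sketch of how a “tiering report” could summarize an org’s data health. The field names and report shape are our own illustration, not Uber’s actual report format:

```python
from enum import IntEnum

class DataTier(IntEnum):
    """Business criticality of a dataset (1 = most critical)."""
    TIER_1 = 1   # e.g. compliance/revenue/brand impact
    TIER_2 = 2
    TIER_3 = 3
    TIER_4 = 4
    TIER_5 = 5   # ad hoc/temporary; eligible for cleanup when unused

def tiering_report(datasets):
    """Summarize an org's 'data health': the share of datasets with a
    tier assigned, and how many are business-critical (tier 1 or 2)."""
    total = len(datasets)
    tiered = [d for d in datasets if d.get("tier") is not None]
    critical = sum(1 for d in tiered if d["tier"] <= DataTier.TIER_2)
    return {
        "total": total,
        "tiered_pct": 100.0 * len(tiered) / total if total else 0.0,
        "critical": critical,
    }
```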
Data Quality Tools
Simply having these definitions isn’t sufficient if we don’t automate them and make them easy to use and apply. We consolidated multiple existing data quality tools into one tool that implements these definitions. We automatically generate tests where it makes sense (for raw data, which are dumps of Kafka topics into the warehouse, we can automatically generate four of the five categories of tests, all except semantic tests) and made it easy to create new tests with minimal input from dataset owners. While these standard checks provide a minimal set of tests for each dataset, the tool is also flexible enough for producers to create new tests by simply providing a SQL query. We learned many interesting lessons on how to scale these tests with low overhead, the abstractions that make it easy to build a suite of tests for a dataset, when to schedule tests to reduce false positives and noisy alerts, how these tests apply to streaming datasets, and much more, which we hope to publish in a future post.
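As a rough illustration of the “test as a SQL query” idea, a producer-defined test can be modeled as a query that returns violating rows. The names below are hypothetical and not the actual tool’s API:

```python
from dataclasses import dataclass

@dataclass
class SqlTest:
    """A dataset test expressed as a SQL query; every returned row is
    a violation, so an empty result means the test passes."""
    name: str
    dataset: str
    query: str

# Example semantic check: trip fares must never be negative.
fare_check = SqlTest(
    name="non_negative_fare",
    dataset="warehouse.trips",
    query="SELECT trip_id FROM warehouse.trips WHERE fare < 0",
)

def run_test(test: SqlTest, execute_sql) -> bool:
    """execute_sql is any callable that runs a query and returns rows,
    keeping the test runner independent of the query engine."""
    violations = execute_sql(test.query)
    return len(violations) == 0
```

Keeping the engine behind a callable is one way such a tool could target Hive, Presto, or Spark without changing test definitions.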
Databook and Metadata
As described before, we have hundreds of thousands of datasets and thousands of users. If we count other data assets (reports, ML features, metrics, dashboards, etc.), the number of assets we manage is even larger. We wanted to ensure that: a) consumers make decisions using the right data, and b) producers make smart decisions to evolve data, prioritize bug fixes, etc. To do this, we need a single catalog that collects metadata about all data assets and presents the right information to users depending on their needs. In fact, we realized that poor discovery had previously led to a vicious cycle in which producers and consumers created duplicate, redundant datasets that were then abandoned.
We wanted to present comprehensive metadata to users about every data artifact (table, column, metric):
- Basic metadata: such as documentation, ownership information, pipelines, source code that produced the data, sample data, lineage, and tier of the artifact
- Usage metadata: statistics on who used it, when, popular queries, and artifacts that are used together
- Quality metadata: tests on the data, when they run, which ones passed, and the aggregate SLA provided by the data
- Cost metadata: resources used to compute and store the data, including monetary cost
- Bugs and SLAs: bugs filed against the artifact, incidents, recent alerts, and overall SLA in responding to issues by owners
Creating this single metadata catalog and providing a powerful UI, with context-aware search and discovery is critical to enable collaboration between producers and consumers, reduce toil in using data, and uplevel data quality overall.
Toward this goal, we completely revamped both the backend and the UI of our in-house metadata catalog, Databook. We standardized the metadata vocabulary, made it easy to add new metadata attributes to existing entities, designed for extensibility so that new entity types can be defined with minimal onboarding effort, integrated most of our critical tools into this system, and published their metadata into this central place, connecting the dots between various data assets, tools, and users. The revamped UI presents information cleanly and supports easier ways for users to filter and narrow down to the data they need. Tool usage increased sharply after these improvements. We covered these changes in detail in the blog post Turning Metadata Into Insights with Databook.
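For illustration only, a single catalog entry combining the metadata categories listed above might be modeled like this. This is a toy sketch, not Databook’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DataArtifact:
    """Toy catalog entry covering basic, usage, quality, cost,
    and bug/SLA metadata for one data asset."""
    name: str
    owner: str                      # owning team, not an individual
    tier: int                       # 1 (most critical) .. 5 (ad hoc)
    documentation: str = ""
    upstream_lineage: list = field(default_factory=list)
    queries_30d: int = 0            # usage metadata
    tests_total: int = 0            # quality metadata
    tests_passing: int = 0
    monthly_cost_usd: float = 0.0   # cost metadata
    open_bugs: int = 0              # bugs and SLAs

    def quality_score(self) -> float:
        """Toy aggregate: fraction of tests currently passing."""
        return self.tests_passing / self.tests_total if self.tests_total else 0.0
```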
App Context Logging
To understand and improve the product, it is critical that our app logging capture the actual user experience. We want to measure the user experience, not infer it, but each team having its own custom logging method led to inconsistencies in how the user experience was measured. We wanted to standardize how logging is done across teams in the entire app, and even “platformize” logging so that developers need to think less about capturing the information that is necessary across all product features, such as: what is shown to the user, the state of the app when the user interacted with it, the type of interaction, and the interaction duration.
After digging into how mobile apps are built at Uber, we realized that our mobile app development framework, RIBs (previously open sourced), already has a natural structure built into it that can provide critical information about the state of the app as the user experiences it. Automatically capturing the hierarchy of RIBs (roughly, think of them as components) gives us the state of the app and which RIBs are currently active. Different screens in the app map to different hierarchies of RIBs.
Building on this intuition, we developed a library that captures the current RIB hierarchy, serializes it, and automatically attaches it to every analytics event fired from the app. The backend gateway that receives these events implements a lightweight mapping from the RIB hierarchy to a flexible set of metadata (such as screen names and names for stages in the app). This metadata can be evolved independently by producers or consumers to add more information, without relying on mobile app changes (which are slow and costly due to a build-and-release cycle of multiple weeks). The gateway attaches this additional metadata to the analytics events, alongside the serialized state, before writing to Kafka. The mapping is also available via an API on the gateway so that warehouse jobs can backfill the data when the mapping evolves.
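The hierarchy capture and the gateway-side mapping can be sketched roughly as follows. Python is used here purely for illustration (the real library lives in the mobile apps and gateway), and all names and mapping values are made up:

```python
from dataclasses import dataclass, field

@dataclass
class Rib:
    """Minimal stand-in for a RIB node; real RIBs live in the mobile app."""
    name: str
    children: list = field(default_factory=list)

def active_path(root: Rib) -> str:
    """Serialize the active hierarchy, e.g. 'Root/LoggedIn/Home'.
    (Here we just follow the first child; the real framework tracks
    which RIBs are actually attached.)"""
    parts, node = [], root
    while node:
        parts.append(node.name)
        node = node.children[0] if node.children else None
    return "/".join(parts)

# Gateway-side mapping from serialized hierarchy to product metadata,
# evolvable without a mobile release (values invented for illustration).
HIERARCHY_TO_SCREEN = {"Root/LoggedIn/Home": {"screen": "home"}}

def enrich(event: dict) -> dict:
    """Attach mapped metadata to an analytics event before Kafka."""
    meta = HIERARCHY_TO_SCREEN.get(event.get("rib_path"), {})
    return {**event, **meta}
```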
Beyond the core problems above, we had to solve a few other problems, which we won’t cover in detail here, such as: optimizing the serialized RIB hierarchy to reduce the analytics payload size, making the mapping efficient, keeping the mapping correct as the app changes via a custom testing framework, several intricacies in mapping RIB trees to state correctly, standardizing on the screen and state names, etc.
While this library didn’t completely solve all the logging problems we set out to solve, it did provide a structure for logs that made a lot of analytics easier, as described below. We are iterating on this library to solve the other problems outlined.
Rider Funnel Analytics
Using the data produced from the logging framework above, we were able to greatly simplify funnel analysis on the Rider behavior. We built a dashboard in a matter of hours, which would have taken us several weeks in the past. This data is currently powering a lot of experiment monitoring and other dashboards to understand user behavior.
Metric Standardization
When we started Data180, we had many metric repositories in the company. We evaluated the pros and cons of these solutions and standardized on a single repository called uMetric. It is more than a repository: it has advanced capabilities such as letting users focus on a YAML-format definition while it takes much of the toil away by generating the queries for different query systems (Hive/Presto/Spark), generating streaming and batch pipelines for the metric, creating data quality tests automatically, and so on. This system is receiving wider adoption and we are investing to enhance it further. We are automating duplicate and near-duplicate metric detection, integrating the system with Databook and other data consumption surfaces so that consumers can consume metric results directly instead of copying and running the SQL for metrics (where it is easy to make mistakes and create duplicate metrics by tweaking SQL), improving its self-service nature, detecting bugs before diffs land, and more. This standardization has helped us significantly reduce duplication and confusion at consumption time. This system is described in detail in the blog post The Journey Towards Metric Standardization.
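To make the idea concrete, here is a hedged sketch of a declarative metric definition and the kind of query generation such a system performs. The field names and format are invented for illustration and are not uMetric’s actual schema:

```python
import textwrap

# Hypothetical metric definition, as it might appear after parsing a
# YAML file (this is not uMetric's real format).
METRIC = {
    "name": "completed_trips",
    "description": "Count of trips that reached the completed state.",
    "source_table": "warehouse.trips",
    "aggregation": "COUNT",
    "filter": "status = 'completed'",
    "dimension": "city_id",
}

def to_sql(metric: dict) -> str:
    """Generate one batch query from the definition; a real system
    would emit dialect-specific SQL for Hive/Presto/Spark, plus
    streaming pipelines and quality tests."""
    return textwrap.dedent(f"""\
        SELECT {metric['dimension']}, {metric['aggregation']}(*) AS {metric['name']}
        FROM {metric['source_table']}
        WHERE {metric['filter']}
        GROUP BY {metric['dimension']}""")
```

Generating the SQL from one definition, rather than letting each consumer hand-write it, is what removes the copy-and-tweak failure mode described above.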
Other tooling and process changes
In addition to the changes listed above, we implemented several other tooling and process changes to improve our data culture, briefly described here:
Shared data model: To avoid duplication in the schema definition of the same concepts, which is common, we improved the schema definition tools to allow importing and sharing of existing types and data models. We are now building additional features and processes to drive up the adoption of shared data models and reduce the creation of duplicate and near-duplicate data models.
Mobile analytics mandatory code reviewers and unit testing: We reorganized the schemas for mobile analytics events and allowed producers and consumers to add themselves as mandatory reviewers to avoid rolling out changes without proper reviews and notification. We also built a mobile logging testing framework to ensure data tests are run at build time.
Mandatory ownership: We improved data tools and surfaces at the root of data production (schema definition, Kafka topic creation, pipelines that create data, metric creation, dashboard creation, etc.) to make ownership information mandatory when we cannot automatically infer the owner. The ownership information is further standardized to a single service across the company and keeps track of teams and orgs, not just individual creators. This change eliminated new unowned data. We further ran heuristics to assign owners to “abandoned” datasets that didn’t have owners or owners who were no longer at the company, putting us on track to reach 100% ownership coverage.
Integration across tools: We integrated tools so that once documentation, ownership, and other critical metadata are set at the source tools, it flows across all downstream tools seamlessly. We integrated pipeline tools with standard alerting and monitoring tools, so there is consistency in how alerts are generated and managed for services and data pipelines.
We started with the hypothesis that thinking about data holistically — considering the full end-to-end flow of data across people and systems — can lead to higher overall data quality. We believe this effort has shown strong evidence in favor of that hypothesis. However, the initial work is merely the start of our transformational journey toward a better data culture. Building on its success, we rolled the program out to different orgs and apps across Uber. Program teams focus on tiering, building source-of-truth datasets, upleveling data quality, and meeting data SLAs, while platform teams continue to improve the tools mentioned above and more. Both work together to improve processes toward establishing a strong data culture at Uber. Some examples of ongoing work include:
- Many more foundational improvements to the tools and more automation to support different data quality checks; more integration to reduce toil
- Enhancements to the app logging framework to further capture more visual information about what the user actually “saw” and “did” on the app
- Process and tooling improvements to improve collaboration between producers and consumers
- Enforcement of life cycle on data assets so that unused and unnecessary artifacts are removed from our systems
- Further adoption of the principles outlined above in the day-to-day data development workflow of engineers and data scientists