Simplifying Data and Product Integrations with a Data Abstraction Layer
Staff Software Engineer
Senior Software Engineer
Senior Software Engineer
Introduction
If you work with data regularly, how often have you run into this?
The widget_metrics_v2 table is the source of truth for widget metrics. Prior to 2025, use the widget_metrics_v1 table instead. There are some differences between these tables—the count_widgits metric used to be called num_widgets in v1, and average_widgit_score is only available in v2.
Situations like this are very common, and tooling just can’t keep up with constantly evolving product needs and data models. Queries are tightly coupled with the structure and topology of the underlying datasets, so making changes is often slow and difficult.
What if we took a page out of the programming playbook and introduced an abstraction? In programming parlance, an abstraction hides complexity and implementation details behind a simple interface. Abstractions let consumers focus on what something does rather than how it is done, and can make it much easier to develop systems with low coupling. In this blog, we describe a data abstraction layer—or DAL—that we built at Uber to make it easier for consumers to access all kinds of data while empowering data producers to evolve models over time. Although designed to be flexible and domain agnostic, we focused initially on a critical driver of business value for Uber: ads.
Case Study: Advertiser Reporting
Advertiser reporting is a use case that significantly benefits from this concept of an abstraction. When advertisers run ad campaigns through Uber, we gather data on how their ads are performing. One critical use case for this data is reporting, where advertisers can explore their performance data, slice and dice, compare metrics, and so on. Data is collected in near-real-time, but advertisers can also query historical data for long-term comparisons and trend analysis.
Data presentations vary from charts to tables and more. Most are configurable—advertisers can specify a time range, select campaigns, and choose the dimensions to break down on and the metrics to compute. Supporting this with traditional tools is a monumental task given the request flexibility and heterogeneity of data storage. Before onboarding to the DAL, it used to take anywhere from several weeks to a couple of months to build out a new report.

Figure 1: Performance graph from the ads manager campaign management UI.
Enter the DAL
The DAL is an RPC service that sits between a data consumer (in this case, the advertiser-facing front ends) and a data producer. Its responsibilities include accepting requests, determining where to source data, orchestrating queries, and assembling a coherent response. It does this through a combination of configuration files that declaratively specify data locations and relationships, and a sophisticated table resolution process that creates an execution plan to satisfy the request.

Figure 2: High-level component diagram of the DAL showing the data flow when handling a request.
Requesting Data
The DAL exposes a FetchData endpoint for data retrieval. The request specifies a table name, schema, time range, and filters of interest. For example, Figure 3 shows a request to get basic ad performance data for an account, broken down by campaign.
Figure 3: Example request to retrieve basic ad performance data.
Figure 4: Example response.
Figure 5: Example account-level response.
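To make the shape of such a request concrete, here is a minimal Python sketch. It is not the DAL's actual API: the class and field names, the column names (campaign_id, account_id, impressions, clicks), and the account value are all assumptions chosen for illustration. Only the four ingredients (table name, schema, time range, filters) come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Filter:
    column: str    # hypothetical: the dimension to filter on
    values: tuple  # accepted values for that dimension


@dataclass(frozen=True)
class FetchDataRequest:
    table: str         # logical table name
    dimensions: tuple  # schema: columns to break down on
    metrics: tuple     # schema: columns to compute
    start: datetime    # time range start (inclusive)
    end: datetime      # time range end (exclusive)
    filters: tuple = ()

    def validate(self) -> None:
        """Reject requests that cannot be resolved at all."""
        if self.start >= self.end:
            raise ValueError("time range must be non-empty")
        if not self.metrics:
            raise ValueError("at least one metric is required")


# Basic ad performance for one account, broken down by campaign.
request = FetchDataRequest(
    table="ads.advertiser_metrics",
    dimensions=("campaign_id",),
    metrics=("impressions", "clicks"),
    start=datetime(2025, 1, 1, tzinfo=timezone.utc),
    end=datetime(2025, 1, 8, tzinfo=timezone.utc),
    filters=(Filter("account_id", ("acct-123",)),),
)
request.validate()
```

Dropping campaign_id from the dimensions would turn the same request into an account-level one, which is the kind of flexibility the schema-driven interface is meant to provide.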
Behind the Scenes
A FetchData request is made against a logical table—an expression of a data interface that defines the shape of data without describing where or how to get it. Let’s take an example.
The ads.advertiser_metrics logical table exposes the schema shown in Figure 6.
Figure 6: Schema excerpt for the ads.advertiser_metrics logical table.
If you look at the metric columns, you’ll notice that they have a source attribute. This contains a list of the physical table candidates that provide the metric. This list serves as the starting point for seeding the candidates in table resolution.
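The seeding step can be sketched as follows. The configuration layout is an assumption written as a Python dict rather than the DAL's real config format, and ads.daily_metrics, the dimension names, and the metric names are hypothetical. Only ads.advertiser_metrics, ads.realtime_metrics, and the source attribute appear in this post.

```python
# Hypothetical logical table definition; only the `source` idea is from the post.
LOGICAL_TABLE = {
    "name": "ads.advertiser_metrics",
    "dimensions": ["account_id", "campaign_id"],
    "metrics": {
        "impressions": {"source": ["ads.realtime_metrics", "ads.daily_metrics"]},
        "clicks": {"source": ["ads.realtime_metrics", "ads.daily_metrics"]},
    },
}


def seed_candidates(logical_table, requested_metrics):
    """Collect the physical table candidates for the requested metrics.

    Candidates are returned in first-seen order without duplicates.
    """
    candidates = []
    for metric in requested_metrics:
        for table in logical_table["metrics"][metric]["source"]:
            if table not in candidates:
                candidates.append(table)
    return candidates


seeded = seed_candidates(LOGICAL_TABLE, ["clicks", "impressions"])
```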
Table Resolution
Table resolution is the process by which a query plan is generated for a request based on the logical table being queried, the time range, and the requested schema. It occurs over several phases, where a set of rules is applied to physical dataset candidates based on the columns in the request to determine the right tables to query. The major phases are:
- Schema eligibility. A candidate is discarded unless it contains every requested dimension and at least one requested metric.
- Dataset availability. Tables define an availability window in terms of freshness and retention. Candidates whose availability window doesn’t intersect with the request interval are discarded.
- Column continuity. For each metric, construct a mapping from time range to candidate. When these intervals overlap, prefer candidates that are closer in cardinality to that of the request.
From this result, we now have a set of physical tables, columns, and time ranges to query.
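The three phases can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the DAL's implementation: time is modeled as half-open ranges of integer day numbers, "closer in cardinality" is interpreted as "fewest dimensions beyond those requested", and the Candidate type, the table ads.daily_metrics, and all column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    dimensions: frozenset
    metrics: frozenset
    available: tuple  # half-open (start_day, end_day) availability window


def subtract(interval, taken):
    """Return the parts of `interval` not covered by any interval in `taken`."""
    pieces = [interval]
    for t_lo, t_hi in taken:
        remaining = []
        for lo, hi in pieces:
            if t_hi <= lo or hi <= t_lo:  # no overlap, keep the piece whole
                remaining.append((lo, hi))
                continue
            if lo < t_lo:
                remaining.append((lo, t_lo))
            if t_hi < hi:
                remaining.append((t_hi, hi))
        pieces = remaining
    return pieces


def resolve(candidates, dims, metrics, interval):
    dims, metrics = set(dims), set(metrics)
    start, end = interval

    # Phase 1: schema eligibility (every dimension, at least one metric).
    eligible = [c for c in candidates
                if dims <= c.dimensions and metrics & c.metrics]

    # Phase 2: dataset availability (window must intersect the request).
    eligible = [c for c in eligible
                if c.available[0] < end and start < c.available[1]]

    # Phase 3: column continuity (per metric, map time ranges to candidates).
    plan = {}
    for metric in sorted(metrics):
        providers = [c for c in eligible if metric in c.metrics]
        # Assumed preference: fewer extra dimensions means less rollup work.
        providers.sort(key=lambda c: len(c.dimensions - dims))
        covered, segments = [], []
        for c in providers:
            window = (max(start, c.available[0]), min(end, c.available[1]))
            free = subtract(window, covered)
            segments.extend((lo, hi, c.name) for lo, hi in free)
            covered.extend(free)
        plan[metric] = sorted(segments)
    return plan


daily = Candidate("ads.daily_metrics",
                  frozenset({"account_id", "campaign_id"}),
                  frozenset({"impressions", "clicks"}), (0, 97))
realtime = Candidate("ads.realtime_metrics",
                     frozenset({"account_id", "campaign_id", "ad_id"}),
                     frozenset({"impressions", "clicks"}), (92, 100))

plan = resolve([realtime, daily], ["campaign_id"], ["clicks"], (90, 100))
```

In this example both tables can serve days 92 through 96. The daily table wins the overlap because its dimensions are closer to the request, and the realtime table fills in the most recent days that the daily table does not yet cover.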
Figure 7 contains excerpts from the physical tables mentioned above.
Figure 7: Excerpts from an example physical table.
- Tables describe their availability windows. The availability of the realtime tables overlaps with that of the daily table for data from 3 to 8 days ago.
- Each table has the same dimensions. This is important for result assembly, described later.
- Metrics describe how they can be rolled up. All these metrics use the SUM transform, but other aggregation functions are available.
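As a small illustration of how freshness and retention could translate into an availability window, consider the sketch below. The function names and the specific freshness and retention values are assumptions; they were picked so that the overlap comes out as 3 to 8 days ago, matching the observation above.

```python
from datetime import datetime, timedelta, timezone


def availability_window(now, freshness, retention):
    """Return (oldest, newest) timestamps a table can serve at time `now`.

    `freshness` is how far the table lags behind real time, and
    `retention` is how long data is kept before it is dropped.
    """
    return now - retention, now - freshness


def overlap(a, b):
    """Return the intersection of two windows, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


now = datetime(2025, 6, 10, tzinfo=timezone.utc)
# Hypothetical settings: realtime data is kept for 8 days,
# and the daily table lags by 3 days.
realtime = availability_window(now, timedelta(minutes=5), timedelta(days=8))
daily = availability_window(now, timedelta(days=3), timedelta(days=365))
shared = overlap(realtime, daily)
```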
For the above request and configuration, table resolution produces the output shown in Figure 8.
Figure 8: Output of table resolution for the example request.
The example here is somewhat simplified. Physical table dependency graphs can be multiple layers deep, and logical tables can reference physical tables across multiple database technologies.
Query Generation and Execution
After table resolution, the query engine is responsible for orchestrating data retrieval. Each table undergoes the following steps:
- A query is generated to match the schema and request semantics, possibly doing aggregations, applying predicates, and running transformations.
- The query is sent to a query runner that submits the query, waits for completion, and retrieves the response.
Both operations are database-specific, with implementations for our internal OLAP database, Docstore, Apache Hive™, and more. Note that these queries can be mixed for a single consumer request. The DAL can retrieve data from both the OLAP database and Docstore simultaneously, for example.
Here’s the query generated for ads.realtime_metrics for the above example. The other two tables produce similar queries with slightly different schemas and predicates.
Figure 9: Query generated for the ads.realtime_metrics table for the example request.
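For readers without access to the figure, the query generation step could look roughly like the sketch below. It is an assumption-laden illustration, not the DAL's generator: the SQL dialect, the `?` placeholder style, the event_time column, and the dimension and metric names are all hypothetical. Identifiers are assumed to come from trusted configuration, while request values are bound as parameters.

```python
def build_query(table, dimensions, metrics, time_column, start, end, filters):
    """Build a parameterized aggregate query for one physical table.

    `metrics` maps a column name to its rollup function (for example "SUM").
    `filters` maps a column name to the list of accepted values.
    Returns the SQL text and the list of bound parameters.
    """
    select = list(dimensions) + [
        f"{func}({column}) AS {column}" for column, func in metrics.items()
    ]
    where = [f"{time_column} >= ?", f"{time_column} < ?"]
    params = [start, end]
    for column, values in filters.items():
        marks = ", ".join("?" for _ in values)
        where.append(f"{column} IN ({marks})")
        params.extend(values)

    sql = (f"SELECT {', '.join(select)} FROM {table} "
           f"WHERE {' AND '.join(where)}")
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql, params


sql, params = build_query(
    table="ads.realtime_metrics",
    dimensions=["campaign_id"],
    metrics={"clicks": "SUM"},
    time_column="event_time",
    start="2025-01-01",
    end="2025-01-08",
    filters={"account_id": ["acct-123"]},
)
```

A database-specific implementation would differ mainly in dialect details and in how the resulting query is submitted by the query runner.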
Result Assembly
In result assembly, the individual responses to each query are assembled into a final result. There are several kinds of operations in result assembly that are conditionally invoked depending on the outcome of table resolution.
- Concatenation. Assembly concatenates datasets that are temporally adjacent. For example, if table resolution indicates that data should be read from a daily table and an hourly table, the responses from each will be concatenated together.
- Joins. Complementary datasets—meaning datasets that individually cover a subset of the metric columns—are joined together using the dimensions of each row.
- Rollups. Although aggregations are usually performed during query execution, in some situations data must be rolled up during result assembly.
- Scalar transformations. We also support a variety of other types of transformations, including virtual columns, normalizations, and reference joins.
Once assembled, the response is sent to the client.
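The concatenation, rollup, and join operations can be sketched over plain lists of row dicts. This is a minimal illustration that assumes SUM rollups and an outer-style join; the function names, column names, and sample rows are hypothetical, and the DAL's actual join semantics are not described in this post.

```python
def concatenate(*results):
    """Append the rows of temporally adjacent result sets."""
    rows = []
    for result in results:
        rows.extend(result)
    return rows


def rollup(rows, dimensions, metrics):
    """Sum the metric columns over rows that share the same dimension values."""
    totals = {}
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        bucket = totals.setdefault(key, {m: 0 for m in metrics})
        for m in metrics:
            bucket[m] += row.get(m, 0)
    return [dict(zip(dimensions, key), **values)
            for key, values in totals.items()]


def join(left, right, dimensions):
    """Merge complementary datasets row by row on their shared dimensions.

    Rows present on only one side are kept (an assumed outer join).
    """
    merged = {}
    for row in list(left) + list(right):
        key = tuple(row[d] for d in dimensions)
        merged.setdefault(key, {}).update(row)
    return list(merged.values())


daily_rows = [{"campaign_id": "c1", "clicks": 10},
              {"campaign_id": "c2", "clicks": 4}]
realtime_rows = [{"campaign_id": "c1", "clicks": 3}]
spend_rows = [{"campaign_id": "c1", "spend": 7.5}]

clicks = rollup(concatenate(daily_rows, realtime_rows),
                ["campaign_id"], ["clicks"])
report = join(clicks, spend_rows, ["campaign_id"])
```

The join relies on every table exposing the same dimensions, which is why that property of the physical tables matters for assembly.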
Using the DAL in Advertiser Reporting
The simplicity and flexibility of the DAL enable powerful use cases for advertiser reporting. Most experiences are backed by a single logical table, ads.advertiser_metrics. The UI sends a schema describing the shape of the data it wants to see. Figure 10 shows some examples of this.
Figure 10: Sample report definitions showing their schemas and predicates.
Advertisers can also use a report builder to specify exactly which dimensions and metrics to include (similar to a pivot table), as well as a third-party API for programmatic integrations. Through the DAL, these implementations are greatly simplified, because the inherently dynamic nature of these experiences closely matches the capabilities provided by FetchData.
Outcome and Next Steps
Usage of the DAL for advertiser reporting had a dramatic impact on the turnaround time for new reports. What used to be a multi-week process can now be accomplished in under two days. Furthermore, this has enabled much more sophisticated tooling for advertisers that would have been infeasible under the original architecture, with dynamic experiences ranging from user-specified filters to full-fledged report builders.
There are many exciting ways in which the DAL continues to evolve. Some of these include:
- Use case onboarding. The DAL was initially built for advertiser reporting, but was designed from the beginning to be general-purpose and domain-agnostic. It has since evolved to serve as a general DAL for ads and expanded into non-ads domains. We expect this trend to continue with time.
- Additional database integrations. Connecting the DAL to other databases is relatively straightforward via a connector interface. This makes it easier to connect data that spans different database technologies, a crucial value add for the DAL.
- Asynchronous requests. Async is useful in situations where you need to return larger volumes of data. The DAL extends relatively easily into some aspects of asynchronous execution, but some other parts require a bit of sophistication.
- Extensibility mechanisms. While the typical request flow is fully generalized, domain-specific requirements are inevitable. The DAL provides a few extension points in the form of decorators that allow its behavior to be customized in a controlled manner to accommodate these requirements.
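To give a feel for the connector interface and decorator-style extension points mentioned in the list above, here is a hypothetical Python sketch. None of these names come from the DAL: Connector, InMemoryConnector, fetch, and with_row_limit are invented for illustration, and the toy "query" format stands in for real database-specific query text.

```python
from typing import Callable, Protocol


class Connector(Protocol):
    """Hypothetical per-database integration: generate a query, then run it."""

    def generate(self, table: str, columns: list) -> str: ...

    def run(self, query: str) -> list: ...


class InMemoryConnector:
    """Toy connector backed by a dict, standing in for a real database."""

    def __init__(self, tables: dict):
        self._tables = tables

    def generate(self, table, columns):
        # Toy query format: "<table>:<col>,<col>,...".
        return f"{table}:{','.join(columns)}"

    def run(self, query):
        table, columns = query.split(":")
        wanted = columns.split(",")
        return [{c: row[c] for c in wanted} for row in self._tables[table]]


def fetch(connector: Connector, table: str, columns: list) -> list:
    """Database-agnostic retrieval: the connector hides the specifics."""
    return connector.run(connector.generate(table, columns))


def with_row_limit(limit: int) -> Callable:
    """Extension point sketch: a decorator that caps the rows a fetch returns."""
    def decorate(fetch_fn):
        def wrapped(*args, **kwargs):
            return fetch_fn(*args, **kwargs)[:limit]
        return wrapped
    return decorate


connector = InMemoryConnector({"t": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]})
all_rows = fetch(connector, "t", ["a"])
limited_rows = with_row_limit(1)(fetch)(connector, "t", ["a"])
```

Supporting a new database in this sketch means writing one more class with the same two methods, while the decorator changes behavior without touching the general request flow.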
Conclusion
We discussed the implementation of a data abstraction layer that sits between the producers and consumers of datasets, enabling teams at Uber to move fast and build powerful experiences on top of data. We also described a case study of how the DAL was used to solve complex problems in advertiser reporting.
Acknowledgments
Cover Photo Attribution: “Beautiful aerial view of the teotihuacan pyramid of the sun and Moon in Mexico” by Marie Hernandez is licensed to Uber.
Apache®, Apache® Hive™, and Hive™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.