June 11, 2026

Simplifying Data and Product Integrations with a Data Abstraction Layer

Daniel Musgrave

Staff Software Engineer

James Yang

Senior Software Engineer

Narendra Kumar

Senior Software Engineer

Share this article

Introduction

If you work with data regularly, how often have you run into this?

The widget_metrics_v2 table is the source of truth for widget metrics. Prior to 2025, use the widget_metrics_v1 table instead. There are some differences between these tables—the count_widgits metric used to be called num_widgets in v1, and average_widgit_score is only available in v2.

Situations like this are very common, and tooling just can’t keep up with constantly evolving product needs and data models. Queries are tightly coupled with the structure and topology of the underlying datasets, so making changes is often long, slow, and difficult.

What if we took a page out of the programming playbook and introduced an abstraction? In programming parlance, an abstraction hides complexity and implementation details behind a simple interface. Abstractions let consumers focus on what something does rather than how it is done, and can make it much easier to develop systems with low coupling. In this blog, we describe a data abstraction layer—or DAL—that we built at Uber to make it easier for consumers to access all kinds of data while empowering data producers to evolve models over time. Although designed to be flexible and domain agnostic, we focused initially on a critical driver of business value for Uber: ads.

Case Study: Advertiser Reporting

Advertiser reporting is a use case that significantly benefits from this concept of an abstraction. When advertisers run ad campaigns through Uber, we gather data on how their ads are performing. One critical use case for this data is reporting, where advertisers can explore their performance data, slice and dice, compare metrics, and so on. Data is collected in near-real-time, but advertisers can also query historical data for long-term comparisons and trend analysis.

Data presentations vary from charts to tables and more. Most are configurable—advertisers can specify a time range, select campaigns, and choose the dimensions to break down on and the metrics to compute. Supporting this with traditional tools is a monumental task given the request flexibility and heterogeneity of data storage. Before onboarding to the DAL, it used to take anywhere from several weeks to a couple of months to build out a new report.

Figure 1: Performance graph from the ads manager campaign management UI.
Enter the DAL

The DAL is an RPC service that sits between a data consumer (in this case, the advertiser-facing front ends) and a data producer. Its responsibilities include accepting requests, determining where to source data, orchestrating queries, and assembling a coherent response. It does this through a combination of configuration files that declaratively specify data locations and relationships, and a sophisticated table resolution process that creates an execution plan to satisfy the request.

Figure 2: High-level component diagram of the DAL showing the data flow when handling a request.
Requesting Data

The DAL exposes a FetchData endpoint for data retrieval. The request specifies a table name, schema, time range, and filters of interest. For example, Figure 3 shows a request to get basic ad performance data for an account, broken down by campaign.

A code snippet displaying a JSON-like query structure for retrieving advertiser metrics from the "ads.advertiser_metrics" table, specifying schema fields, a date interval from January 1 to January 8, 2025, and a predicate filtering by "ad_account_uuid".

Figure 3: Example request to retrieve basic ad performance data.

The request returns a response like the one shown in Figure 4.

Table displaying advertising campaign data with columns: ad_campaign_uuid, sum_ad_spend_local_e5, count_impressions, and count_clicks. Two rows show partial UUIDs, ad spend values of 25040 and 102698, impressions of 35 and 217, and clicks of 2 and 11.

Figure 4: Example response.

Removing ad_campaign_uuid from the above request produces the same data at the account level.

Table with three columns labeled 'sum_ad_spend_local_e5', 'count_impressions', and 'count_clicks'. The corresponding values in the row below are 127738, 252, and 13.

Figure 5: Example account-level response.

Notice how metrics are automatically rolled up to match the new schema—the client didn’t need to do anything other than omit the dimension. This reflects a fundamental design principle of the DAL: the request specifies what data the consumer wants to see. The service is responsible for how the data is retrieved and computed.

Behind the Scenes

A FetchData request is made against a logical table—an expression of a data interface that defines the shape of data without describing where or how to get it. Let’s take an example.

The ads.advertiser_metrics logical table exposes the schema shown in Figure 6.

YAML configuration for 'ads.advertiser_metrics' with dimensions for ad_account_uuid, ad_campaign_uuid, and ad_group_uuid (all type UUID), and metrics for count_impressions and count_clicks (type BigInt, sourced from ads.realtime_metrics and ads.daily_metrics), plus sum_ad_spend_local_e5 (type CurrencyAmount, sourced from ads.realtime_ad_spend and ads.daily_metrics).

Figure 6: Schema excerpt for the ads.advertiser_metrics logical table.

The dimensions and metrics elements are sets of columns of a particular grouping type. Dimensions and metrics generally behave as expected—dimensions are classifications used to determine the grouping key for aggregations, and metrics are measurements that can (usually) be aggregated.

If you look at the metric columns, you’ll notice that they have a source attribute. This contains a list of the physical tables candidates that provide the metric. This serves as the starting point for seeding the candidates in table resolution.

Table Resolution

Table resolution is the process by which a query plan is generated for a request based on the logical table being queried, the time range, and the requested schema. It occurs over several phases, where a set of rules is applied to physical dataset candidates based on the columns in the request to determine the right tables to query. The major phases are

Schema eligibility. Candidates that don’t contain every dimension or any metric in the request are discarded.
Dataset availability. Tables define an availability window in terms of freshness and retention. Candidates whose availability window doesn’t intersect with the request interval are discarded.
Column continuity. For each metric, construct a mapping from time range to candidate. When these intervals overlap, prefer candidates that are closer in cardinality to that of the request.

From this result, we now have a set of physical tables, columns, and time ranges to query.

Figure 7 contains excerpts from the physical tables mentioned above.

YAML configuration for ad metrics, including sections for realtime engagement, realtime ad spend, and historical metrics. Each section defines time series parameters (freshness and retention), dimensions (ad account, campaign, and group UUIDs), and metrics (impressions, clicks, ad spend) with data types and aggregation methods.

Figure 7: Excerpts from an example physical table.

Explaining every part of these configurations is beyond the scope of this blog, but a few things to call out:

Tables describe their availability windows. The realtime tables have overlapping availability with the daily table 3-8 days ago.
Each table has the same dimensions. This is important for result assembly, described later.
Metrics describe how they can be rolled up. All these metrics use the SUM transform, but other aggregation functions are available.

For the above request and configuration, table resolution produces the output shown in Figure 8.

Comparison of three tables: Table 1 (ads.realtime_metrics) and Table 2 (ads.realtime_ad_spend) both use ad_campaign_uuid as a dimension and cover the time range 2026-01-05 to 2026-01-08. Table 1 tracks count_impressions and count_clicks, while Table 2 tracks sum_ad_spend_local_e5. Table 3 (ads.daily_metrics) also uses ad_campaign_uuid, covers 2026-01-01 to 2026-01-05, and tracks count_impressions, count_clicks, and sum_ad_spend_local_e5.

Figure 8: Output of table resolution for the example request.

That is, this request will result in three queries being executed, one against each of the realtime tables to retrieve a subset of the metrics for the most recent 3 days, and another against the daily table for the remaining 4 days.

The example here is somewhat simplified. Physical table dependency graphs can be multiple layers deep, and logical tables can reference physical tables across multiple database technologies.

Query Generation and Execution

After table resolution, the query engine is responsible for orchestrating data retrieval. Each table undergoes the following steps:

A query is generated to match the schema and request semantics, possibly doing aggregations, applying predicates, and running transformations.
The query is sent to a query runner that submits the query, waits for completion, and retrieves the response.

Both operations are database-specific, with implementations for our internal OLAP database, Docstore, Apache Hive™, and more. Note that these queries can be mixed for a single consumer request. The DAL can retrieve data from both the OLAP database and Docstore simultaneously, for example.

Here’s the query generated for ads.realtime_metrics for the above example. The other two tables produce similar queries with slightly different schemas and predicates.

SQL query selecting ad account and campaign UUIDs, summing impressions and clicks from ads.realtime_metrics where the account UUID matches 'abcd1234...' and the timestamp is between January 5 and January 8, 2026.

Figure 9: Query generated for the ads.realtime_metrics table for the example request.

The engine executes queries in parallel to minimize latency. After a response has been received from each query, we move to the final stage: result assembly.

Result Assembly

In result assembly, the individual responses to each query are assembled into a final result. There are several kinds of operations in result assembly that are conditionally invoked depending on the outcome of table resolution.

Concatenation. Assembly concatenates datasets that are temporally adjacent. For example, if table resolution indicates that data should be read from a daily table and an hourly table, the responses from each will be concatenated together.
Joins. Complementary datasets—meaning datasets that individually cover a subset of the metric columns—are joined together using the dimensions of each row.
Rollups. Although aggregations are usually performed during query execution, in some situations data must be rolled up during result assembly.
Scalar transformations. We also support a variety of other types of transformations, including virtual columns, normalizations, and reference joins.

Once assembled, the response is sent to the client.

Using the DAL in Advertiser Reporting

The simplicity and flexibility of the DAL enables powerful use cases for advertiser reporting. Most experiences are backed by a single logical table, ads.advertiser_metrics. The UI sends a schema describing the shape of the data it wants to see. Figure 10 shows some examples of this.

Matrix chart displaying which schema fields and predicates are included in various advertising performance and summary reports, with checked boxes indicating inclusion for each report type.

Figure 10: Sample report definitions showing their schemas and predicates.

Notice how the definitions of these reports are very similar, with only slight modifications to the schema and/or predicate needed to produce an entirely different result. By convention, performance reports are a time-series showing how metrics change over time, while summary reports show the same data aggregated into a single data point per metric. Other reports can be produced quite easily with similarly small modifications. For example, an advertiser can get a performance report across all campaigns in an account by including both ad_campaign_uuid and day_of in the schema simultaneously.

Advertisers can also use a report builder to specify exactly which dimensions and metrics to include—similar to a pivot table—as well as a third-party API for programmatic integrations. Through the DAL these implementations become greatly simplified, as the inherently dynamic nature of the experiences matches closely with the capabilities provided by FetchData.

Outcome and Next Steps

Usage of the DAL for advertiser reporting had a dramatic impact on the turnaround time for new reports. What used to be a multi-week process can now be accomplished in under two days. Furthermore, this has enabled much more sophisticated tooling for advertisers that would have been infeasible under the original architecture, with dynamic experiences ranging from user-specified filters to full-fledged report builders.

There are many exciting ways in which the DAL continues to evolve. Some of these include:

Use case onboarding. The DAL was initially built for advertiser reporting, but was designed from the beginning to be general-purpose and domain-agnostic. It has since evolved to serve as a general DAL for ads and expanded into non-ads domains. We expect this trend to continue with time.
Additional database integrations. Connecting the DAL to other databases is relatively straightforward via a connector interface. This makes it easier to connect data that spans across different database technologies, a crucial value add for the DAL.
Asynchronous requests. Async is useful in situations where you need to return larger volumes of data. The DAL extends relatively easily into some aspects of asynchronous execution, but some other parts require a bit of sophistication.
Extensibility mechanisms. While the typical request flow is fully generalized, domain-specific requirements are inevitable. The DAL provides a few extension points in the form of decorators that allow its behavior to be customized in a controlled manner to facilitate this complexity.

Conclusion

We discussed the implementation of a Data Abstraction Layer that provides an abstraction between the producer and consumer of datasets, which enables teams at Uber to move fast and build powerful experiences on top of data. We also described a case study of how the DAL was used to solve complex problems in advertiser reporting.

Acknowledgments

Cover Photo Attribution: “Beautiful aerial view of the teotihuacan pyramid of the sun and Moon in Mexico” by Marie Hernandez is licensed to Uber.

Apache®, Apache® Hive™, and Hive™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

If you’re interested in solving technically challenging problems like this one, Uber is hiring!

Stay up to date with the latest from Uber Engineering—follow us on LinkedIn for our newest blog posts and insights.

Written by

Daniel Musgrave

Staff Software Engineer

Daniel Musgrave is a Staff Software Engineer at Uber focused on building generic, powerful, and expressive data-adjacent solutions.

James Yang

Senior Software Engineer

James Yang is a Senior Software Engineer in the Uber Ads team working in measurement, attribution, and advertiser reporting.

Narendra Kumar

Senior Software Engineer

Narendra Kumar is a Senior Software Engineer on the Uber Ads team, focused on ad serving and infrastructure. He previously worked across advertiser reporting, data warehousing, and recommendations.

748 articles

Filter by:

All categories