Data / ML, Engineering

How Uber Achieves Operational Excellence in the Data Quality Experience

5 August 2021 / Global
| Description | Mocked Queries | Mocked Assertion |
|---|---|---|
| Daily supply data count should be no less than 10,000 | SELECT COUNT(*) FROM supply WHERE datestr = CURRENT_DATE; | query_value > 10,000 |
| Supply data count should be consistent week over week | SELECT COUNT(*) FROM supply WHERE datestr = CURRENT_DATE; | ABS(query0_value - query1_value) / query1_value < 1% |
| | SELECT COUNT(*) FROM supply WHERE datestr = DATE_ADD('day', -7, CURRENT_DATE); | |

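
The mocked assertions above can be read as small predicates over the values returned by the queries. The following is a minimal sketch, not the platform's implementation: the function names, the default thresholds as parameters, and the handling of a zero baseline are assumptions made for illustration.

```python
def min_count_ok(query_value: int, threshold: int = 10_000) -> bool:
    """First mocked assertion: query_value > 10,000."""
    return query_value > threshold


def week_over_week_ok(
    query0_value: int, query1_value: int, tolerance: float = 0.01
) -> bool:
    """Second mocked assertion:
    ABS(query0_value - query1_value) / query1_value < 1%.

    query0_value is today's count, query1_value the count from 7 days ago.
    """
    if query1_value == 0:
        # Assumption: with an empty baseline the ratio is undefined,
        # so this sketch treats the test as failed.
        return False
    return abs(query0_value - query1_value) / query1_value < tolerance
```

For example, counts of 10,050 today and 10,000 a week ago differ by 0.5% and pass, while 10,200 against 10,000 differs by 2% and fails.
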
| Category | Prerequisite | Test formula |
|---|---|---|
| Freshness | Completeness metric is available | current_ts - latest_ts where data is 99.9% complete < freshnessSLA |
| Completeness | Upstream and downstream datasets are 1:1 mapped | downstream_row_count / upstream_row_count > completenessSLA |
| | Offline job to ingest sampled upstream data to Hive | sampled_row_count_in_hive / total_sampled_row_count > completenessSLA |
| Duplicates | Primary key provided by data producers | 1 - primary_key_count / total_row_count < duplicatesSLA |
| Cross-datacenter Consistency | N/A | min(row_count, row_count_other_dc) / row_count > consistencySLA |
| | Offline job to calculate Bloom-Filter value for sampled data in both DCs | intersection(primary_key_count, primary_key_count_other_dc) / primary_key_count > consistencySLA |
| Others | N/A | User-generated custom tests. No standard formula. |

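
To make the standard test formulas concrete, here is a rough sketch of four of them as plain functions. The function and parameter names are hypothetical, timestamps are assumed to be epoch seconds, and the zero-row handling is an assumption; the formulas themselves follow the table.

```python
def freshness_ok(current_ts: float, latest_complete_ts: float,
                 freshness_sla_seconds: float) -> bool:
    """current_ts - latest_ts (where data is 99.9% complete) < freshnessSLA.

    Assumption: both timestamps are epoch seconds.
    """
    return current_ts - latest_complete_ts < freshness_sla_seconds


def completeness_ok(downstream_row_count: int, upstream_row_count: int,
                    completeness_sla: float) -> bool:
    """downstream_row_count / upstream_row_count > completenessSLA."""
    if upstream_row_count == 0:
        # Assumption: nothing upstream means the ratio is undefined; fail.
        return False
    return downstream_row_count / upstream_row_count > completeness_sla


def duplicates_ok(primary_key_count: int, total_row_count: int,
                  duplicates_sla: float) -> bool:
    """1 - primary_key_count / total_row_count < duplicatesSLA.

    primary_key_count is taken to be the number of distinct primary keys.
    """
    if total_row_count == 0:
        # Assumption: an empty table contains no duplicates.
        return True
    return 1 - primary_key_count / total_row_count < duplicates_sla


def cross_dc_consistency_ok(row_count: int, row_count_other_dc: int,
                            consistency_sla: float) -> bool:
    """min(row_count, row_count_other_dc) / row_count > consistencySLA."""
    if row_count == 0:
        # Assumption: no local rows means the ratio is undefined; fail.
        return False
    return min(row_count, row_count_other_dc) / row_count > consistency_sla
```

As a usage example, a table with 1,000 rows and 990 distinct primary keys has a duplicate ratio of about 1%, so it passes a 2% duplicates SLA.
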
Figure 1: Data Quality Platform architecture
Figure 2: Pruning obsolete tables based on lineage
Figure 3: Example AST structure for a test that compares the difference between two queries
Figure 4: Example Data Quality Incident based on sustain period
Figure 5: Incident Manager workflow
Figure 6: Sample Data Quality dashboard in Databook
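
Figure 3 refers to an AST for a test that compares the difference between two queries. Since the figure itself is not reproduced here, the following is only an illustrative sketch of how such a tree might be represented and evaluated for the assertion ABS(query0_value - query1_value) / query1_value < 1%. The node classes (`Const`, `QueryValue`, `Abs`, `BinOp`) and the `evaluate` function are hypothetical and are not taken from the Data Quality Platform.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class QueryValue:
    name: str  # key of a query result, e.g. "query0_value"


@dataclass(frozen=True)
class Abs:
    operand: "Node"


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "-", "/", "<"
    left: "Node"
    right: "Node"


Node = Union[Const, QueryValue, Abs, BinOp]


def evaluate(node: Node, values: Dict[str, float]) -> Union[float, bool]:
    """Recursively evaluate the tree against the query results."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, QueryValue):
        return values[node.name]
    if isinstance(node, Abs):
        return abs(evaluate(node.operand, values))
    if isinstance(node, BinOp):
        left = evaluate(node.left, values)
        right = evaluate(node.right, values)
        if node.op == "-":
            return left - right
        if node.op == "/":
            return left / right
        if node.op == "<":
            return left < right
        raise ValueError(f"unsupported operator: {node.op}")
    raise TypeError(f"unknown node type: {type(node).__name__}")


# ABS(query0_value - query1_value) / query1_value < 1%
WEEK_OVER_WEEK = BinOp(
    "<",
    BinOp(
        "/",
        Abs(BinOp("-", QueryValue("query0_value"), QueryValue("query1_value"))),
        QueryValue("query1_value"),
    ),
    Const(0.01),
)
```

Calling `evaluate(WEEK_OVER_WEEK, {"query0_value": 10050, "query1_value": 10000})` walks the tree bottom-up and yields the boolean result of the comparison at the root.
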
Ying Zou

Ying Zou is the Engineering Manager for the uMetric Computation team at Uber. She was previously the Engineering Manager for the Marketplace Data Foundation team, where she led the standardization of data quality on top of existing solutions and brought the platform to Uber scale.

Wei Yan

Wei Yan is a Senior Software Engineer on the Marketplace Data Foundation team. She is a main contributor to the Data Quality Platform.

Maggie Ying

Maggie Ying is an Engineering Manager for the Uber for Business Employee Products team. She was previously the Tech Lead for the Data Quality team. She pioneered data quality at Uber by introducing the first self-serve data assertion platform, which was used across the company.

Sanjay Sundaresan

Sanjay Sundaresan is a Senior Software Engineer on the Data Quality team at Uber. He is currently working on rearchitecting the data quality service to solve Uber’s future data quality challenges.

Sriharsha Chintalapani

Sriharsha Chintalapani is a Senior Staff Software Engineer and the Tech Lead for Data Platforms at Uber. The Data Quality Platform provides quality checks and alerts for Uber's data assets (tables, metrics).

Isabel Geracioti

Isabel Geracioti is a former Software Engineer on the Metadata Platform team under Data Platform at Uber. She worked on metadata-based projects including data quality and lineage.

Posted by Ying Zou, Wei Yan, Maggie Ying, Sanjay Sundaresan, Sriharsha Chintalapani, Isabel Geracioti