Engineering, Data / ML

Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi

March 16, 2023 / Global
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
| Scenario | How is it Handled? |
| --- | --- |
| Incremental read on a single source | Use Apache Hudi's incremental reader and upsert to the target table |
| Incremental read + join with multiple raw data tables | Use Apache Hudi's incremental read on the main table and perform left outer join on other raw data tables with T-24 hr incremental pull data |
| Incremental read + join with multiple derived and lookup tables | Use Apache Hudi's incremental read on the main table and perform left outer join on other derived tables fetching only the affected partitions |
| Backfills use case | Use snapshot read on single or multiple tables within etl_start_date and etl_end_date |
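The scenarios in the table above map onto a small set of Apache Hudi datasource options. The following is a minimal sketch rather than Uber's actual framework code: the helper names, the table name, and the column names are hypothetical, and the 24-hour lookback is expressed as a commit instant in the `yyyyMMddHHmmss` layout. The option keys themselves are standard Hudi Spark datasource settings.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Layout of Hudi commit instants on the timeline (seconds granularity).
HUDI_INSTANT_FORMAT = "%Y%m%d%H%M%S"


def lookback_instant(now: datetime, hours: int = 24) -> str:
    """Return the commit instant `hours` before `now` (the T-24 hr pull boundary)."""
    return (now - timedelta(hours=hours)).strftime(HUDI_INSTANT_FORMAT)


def incremental_read_options(begin_instant: str) -> Dict[str, str]:
    """Options for reading only records committed after `begin_instant`."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def snapshot_read_options() -> Dict[str, str]:
    """Options for a full snapshot read, as used for backfills."""
    return {"hoodie.datasource.query.type": "snapshot"}


def upsert_write_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: Optional[str] = None,
) -> Dict[str, str]:
    """Options for upserting a DataFrame into the target table."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    if partition_field is not None:
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
    return options


# Hypothetical usage with PySpark (paths and names are placeholders):
#   begin = lookback_instant(datetime.utcnow())
#   changes = (spark.read.format("hudi")
#              .options(**incremental_read_options(begin))
#              .load(source_path))
#   (changes.write.format("hudi")
#    .options(**upsert_write_options("dim_driver", "driver_uuid", "updated_at", "datestr"))
#    .mode("append")
#    .save(target_path))
```

For a backfill, the same pipeline would switch to `snapshot_read_options()` and filter the source on the range between `etl_start_date` and `etl_end_date`.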
| Type of Table | How is it Handled? |
| --- | --- |
| Partitioned | Use upsert to apply only the incremental updates |
| | Use insert overwrite to update all affected partitions when performing backfill operation |
| | Use targeted merge/update statements for non-incremental columns using Apache Spark SQL |
| Non-partitioned | Use upsert to apply only the incremental updates |
| | Use insert overwrite when joining incremental rows with full outer join on target table to update both the incremental and non-incremental columns (to avoid DQ issues on non-incremental columns) |
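The difference between the two write operations in the table above can be illustrated with a toy in-memory model. This is a sketch of the semantics only, not Hudi code: the table is modeled as a dictionary of partitions, the row fields are hypothetical, and it assumes a record never moves between partitions.

```python
from copy import deepcopy
from typing import Dict, List

Row = Dict[str, str]
Table = Dict[str, Dict[str, Row]]  # partition -> {record key -> row}


def upsert(table: Table, rows: List[Row]) -> Table:
    """Insert or update rows by key; records not present in `rows` are kept."""
    result = deepcopy(table)
    for row in rows:
        result.setdefault(row["partition"], {})[row["key"]] = row
    return result


def insert_overwrite(table: Table, rows: List[Row]) -> Table:
    """Replace every partition that appears in `rows`; other partitions are kept."""
    result = deepcopy(table)
    for partition in {row["partition"] for row in rows}:
        result[partition] = {}  # drop the old contents of each affected partition
    for row in rows:
        result[row["partition"]][row["key"]] = row
    return result


target: Table = {
    "2023-03-14": {"d9": {"partition": "2023-03-14", "key": "d9", "status": "active"}},
    "2023-03-15": {
        "d1": {"partition": "2023-03-15", "key": "d1", "status": "active"},
        "d2": {"partition": "2023-03-15", "key": "d2", "status": "active"},
    },
}
incoming = [{"partition": "2023-03-15", "key": "d1", "status": "inactive"}]

after_upsert = upsert(target, incoming)
after_overwrite = insert_overwrite(target, incoming)
```

In this model, the upsert changes only `d1` and leaves `d2` in place, while the insert overwrite leaves the `2023-03-15` partition holding only the incoming rows. That is why insert overwrite suits backfills that recompute whole partitions, and upsert suits incremental runs that carry only changed records.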
Figure 6
Figure 7
Figure 8
Figure 9
| Pipeline | vcore_seconds | memory_seconds | Cost | Run Time (mins) |
| --- | --- | --- | --- | --- |
| Batch ETL of Dimensional Driver Table | 3,129,130 | 23,815,200 | $11.39 | 220 |
| Incremental ETL of Dimensional Driver Table | 1,280,928 | 6,427,500 | $2.44 | 39 |
| Difference | 1,848,202 | 17,387,700 | $8.95 | 181 |
| % Improvement | 59.06% | 73.01% | 78.57% | 82.27% |
| Batch ETL of Driver Status Fact Table | 2,162,362 | 5,658,785 | $3.30 | 94 |
| Incremental ETL of Driver Status Fact Table | 1,640,438 | 3,862,490 | $2.45 | 48 |
| Difference | 521,924 | 1,796,295 | $0.85 | 46 |
| % Improvement | 24.13% | 31.74% | 25.75% | 48.93% |
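The "% Improvement" rows can be reproduced from the batch and incremental rows as (batch - incremental) / batch. The sketch below is an illustration, with hypothetical names, that uses integer arithmetic and truncates to two decimal places; truncation rather than rounding is an assumption, chosen because it matches the published percentages. Costs are expressed in cents to avoid floating-point error.

```python
from typing import Dict, Tuple


def truncated_percent(batch: int, incremental: int) -> str:
    """Percent reduction relative to `batch`, truncated to two decimals."""
    basis_points = (batch - incremental) * 10_000 // batch
    return f"{basis_points // 100}.{basis_points % 100:02d}%"


# (batch, incremental) pairs taken from the table; cost is in cents.
dimensional_driver: Dict[str, Tuple[int, int]] = {
    "vcore_seconds": (3_129_130, 1_280_928),
    "memory_seconds": (23_815_200, 6_427_500),
    "cost_cents": (1_139, 244),
    "run_time_mins": (220, 39),
}
driver_status_fact: Dict[str, Tuple[int, int]] = {
    "vcore_seconds": (2_162_362, 1_640_438),
    "memory_seconds": (5_658_785, 3_862_490),
    "cost_cents": (330, 245),
    "run_time_mins": (94, 48),
}

for name, metrics in [("dimensional driver", dimensional_driver),
                      ("driver status fact", driver_status_fact)]:
    for metric, (batch, incremental) in metrics.items():
        print(f"{name:>20} {metric:<15} {truncated_percent(batch, incremental)}")
```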
Figure 10
Figure 11
Vinoth Govindarajan

Vinoth Govindarajan is a former Staff Software Engineer on the Global Data Warehouse team. As a data infrastructure engineer, he worked to lower latency and bridge the gap between online systems and the data warehouse by designing incremental ETL frameworks for derived datasets. Alongside his work, he contributes to a variety of open-source projects such as Apache Hudi and dbt-spark.

Saketh Chintapalli

Saketh Chintapalli is a Software Engineer on the Global Data Warehouse team. His work primarily lies in data platform engineering, specifically in building reliable tooling and infrastructure for efficient data processing and pipelines, ranging from batch to real-time workflows.

Yogesh Saswade

Yogesh Saswade is a Software Engineer on Uber's Delivery Data Solutions Team and the subject-matter expert (SME) for menu datasets. He worked on optimizing the performance (SLA and cost) of high-volume batch workloads to achieve near-real-time analytics using Apache Hudi and the Lakehouse ETL framework. He drove the YARN queue segregation initiative to achieve a scalable, federated resource structure. He is currently working on standardizing large-scale catalog data.

Aayush Bareja

Aayush Bareja is a Software Engineer working on the Uber Eats Delivery Data Solutions Team. He excels in using the Big Data stack to efficiently obtain canonical data for various analytical workloads, including batch, incremental, and real-time processing using technologies such as HDFS, Spark, Hive, Apache Flink, and Piper.

Posted by Vinoth Govindarajan, Saketh Chintapalli, Yogesh Saswade, Aayush Bareja