Engineering, Data / ML

How Uber Migrated from Hive to Spark SQL for ETL Workloads

June 12 / Global
Figure 1: Hive ecosystem.
Figure 2: Shadow testing with AMS.
Figure 3: Query in Hive.
Figure 4: Query in Spark SQL.
Figure 5: Query translation design.
Figure 6: Query in Hive.
Figure 7: Query in Spark SQL.
Figure 8: Payload in Hive.
Figure 9: Payload in Spark SQL.
Figure 10: Payload in Hive.
Figure 11: Payload in Spark SQL.
Figure 12: Sampling using hash.
Figure 13: Data validation flow.
Figure 14: Hive merge files config.
Figure 15: Original plan.
Figure 16: Modified plan.
Kumudini Kakwani

Kumudini Kakwani is a Staff Software Engineer at Uber, currently working on the Spark team. With a strong background in building data platforms, Kumudini has previously worked on a range of impactful projects, including Hive, Data Observability, and Machine Learning platforms.

Akshayaprakash Sharma

Akshayaprakash Sharma is a Staff Software Engineer at Uber, currently working on the Data Observability team. Akshaya has previously worked on Hive, Spark, Vertica, and Data Reporting Tools.

Nimesh Khandelwal

Nimesh Khandelwal is a Senior Software Engineer on the Spark team. He focuses on projects that modernize and optimize the Spark ecosystem at Uber.

Aayush Chaturvedi

Aayush Chaturvedi is a Senior Software Engineer at Uber, currently part of the Spark team, where he contributes to optimizing large-scale data infrastructure. He has also worked on improving Uber Maps coverage and pickup/drop-off efficiency.

Chintan Betrabet

Chintan Betrabet is a Senior Software Engineer working on the Uber Data Platform team. He has worked on building automated solutions to monitor and measure data quality at scale for all critical datasets at Uber.

Suprit Acharya

Suprit Acharya is a Senior Manager on Uber’s Data Platform team, leading Batch Data Compute Engine (Spark, Hive), Observability, Efficiency, and Data Science Platforms.

Posted by Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya