
Fast Copy-On-Write within Apache Parquet for Data Lakehouse ACID Upserts

June 29, 2023 / Global
Figure 1: Logical and physical file views of table upserts
Figure 2: Row-level index for Apache Parquet 
Figure 3: Comparison of traditional copy-on-write in Apache Hudi and new copy-on-write 
Figure 4: The new copy-on-write within a Parquet file
Figure 5: Benchmarking results of the new copy-on-write compared with traditional Delta Lake
Xinli Shang

Xinli Shang is a Manager on the Uber Big Data Infra team, Apache Parquet PMC Chair, Presto Committer, and Uber Open Source Committee member. He leads the Apache Parquet community and contributes to several other communities. He also leads several initiatives on data formats for storage efficiency, security, and performance, and is passionate about tuning large-scale services for performance, throughput, and reliability.

Kai Jiang

Kai Jiang is a Senior Software Engineer on Uber’s Data Platform team. He has been working on the Spark ecosystem and on encryption and efficiency for big data file formats. He is also a contributor to Apache Beam, Parquet, and Spark.

Huicheng Song

Huicheng Song is a Staff Software Engineer at Uber. He focuses on big data file formats and on building automated systems that ensure various compliance requirements are met at large scale.

Jianchun Xu

Jianchun Xu is a Staff Software Engineer on Uber's Data Infra team. He mainly works on big data infra and data security. He also has extensive experience in service deployment platforms, developer tools, and web/JavaScript engines.

Mohammad Islam

Mohammad Islam is a Distinguished Engineer at Uber. He currently works within the Engineering Security organization to enhance the company's security, privacy, and compliance measures. Before his current role, he co-founded Uber’s big data platform. Mohammad is the author of an O'Reilly book on Apache Oozie and serves as a Project Management Committee (PMC) member for Apache Oozie and Tez.

Posted by Xinli Shang, Kai Jiang, Huicheng Song, Jianchun Xu, Mohammad Islam