Backend, Data / ML, Engineering

Cost Efficiency @ Scale in Big Data File Format

25 January 2022 / Global
Figure 1: Apache Parquet File Format Structure
Figure 2: Space savings when translating to ZSTD
Query
Q1    546,077    652,543    562,616
Q2    870,472    639,184    213,914
Q3    240,781    353,926    191,614
Q4    132,490    271,814     93,082
Q5    337,208    380,638    109,012
Figure 3: Query Performance comparison among ZSTD/SNAPPY/GZIP
Figure 4: Size reduction in percent vs. compression level after translating the compression from GZIP to ZSTD
Figure 5: Write time vs. compression level after translating the compression from GZIP to ZSTD
Figure 6: Read time vs. compression level after translating the compression from GZIP to ZSTD
Figure 7: Using Column Pruning Tool to translate tables
Figure 8: Size reduction comparison for different columns sorting
Xinli Shang

Xinli Shang is the ex–Apache Parquet™ PMC Chair, a Presto® committer, and a member of Uber’s Open Source Committee. He leads several initiatives advancing data format innovation for storage efficiency, security, and performance. Xinli is passionate about open-source collaboration, scalable data infrastructure, and bridging the gap between research and real-world data platform engineering.

Kai Jiang

Kai Jiang is a Senior Software Engineer on Uber’s Data Platform team. He works on the Spark ecosystem and on the encryption and efficiency of big data file formats. He is also a contributor to Apache Beam, Parquet, and Spark.

Zheng Shao

Zheng Shao is a Distinguished Engineer at Uber. He focuses on big data cost efficiency, multi-region data infrastructure, and on-prem/cloud architecture. He is also an Apache Hadoop PMC member and an Emeritus Apache Hive PMC member.

Mohammad Islam

Mohammad Islam is a Distinguished Engineer at Uber. He currently works within the Engineering Security organization to enhance the company's security, privacy, and compliance measures. Before his current role, he co-founded Uber’s big data platform. Mohammad is the author of an O'Reilly book on Apache Oozie and serves as a Project Management Committee (PMC) member for Apache Oozie and Tez.

Posted by Xinli Shang, Kai Jiang, Zheng Shao, Mohammad Islam