
Cost Efficiency @ Scale in Big Data File Format

25 January 2022 / Global
Figure 1: Apache Parquet File Format Structure
Figure 2: Space savings when translating to ZSTD
Query
Q1    546,077    652,543    562,616
Q2    870,472    639,184    213,914
Q3    240,781    353,926    191,614
Q4    132,490    271,814     93,082
Q5    337,208    380,638    109,012
Figure 3: Query Performance comparison among ZSTD/SNAPPY/GZIP
Figure 4: Size reduction (in percent) vs. compression level after translating the compression from GZIP to ZSTD
Figure 5: Write time vs. compression level after translating the compression from GZIP to ZSTD
Figure 6: Read time vs. compression level after translating the compression from GZIP to ZSTD
Figure 7: Using Column Pruning Tool to translate tables
Figure 8: Size reduction comparison for different column sort orders
Xinli Shang

Xinli Shang is a Manager on the Uber Big Data Infra team, Apache Parquet PMC Chair, Presto Committer, and Uber Open Source Committee member. He leads the Apache Parquet community and contributes to several other communities. He also leads several initiatives on data formats for storage efficiency, security, and performance, and is passionate about tuning large-scale services for performance, throughput, and reliability.

Kai Jiang

Kai Jiang is a Senior Software Engineer on Uber’s Data Platform team. He has been working on the Spark ecosystem and on big data file format encryption and efficiency. He is also a contributor to Apache Beam, Parquet, and Spark.

Zheng Shao

Zheng Shao is a Distinguished Engineer at Uber. His focus is on big data cost efficiency as well as data infra multi-region and on-prem/cloud architecture. He is also an Apache Hadoop PMC member and an Emeritus Apache Hive PMC member.

Mohammad Islam

Mohammad Islam is a Distinguished Engineer at Uber. He currently works within the Engineering Security organization to enhance the company's security, privacy, and compliance measures. Before his current role, he co-founded Uber’s big data platform. Mohammad is the author of an O'Reilly book on Apache Oozie and serves as a Project Management Committee (PMC) member for Apache Oozie and Tez.

Posted by Xinli Shang, Kai Jiang, Zheng Shao, Mohammad Islam