Schedule rides in advance

Selective Column Reduction for DataLake Storage Cost Efficiency

20 September 2023 / Global

Background

As Uber continues to grow, the sheer volume of data we handle and the associated demands for data access have multiplied. This rapid expansion in data size has given rise to escalating costs in terms of storage and compute resources. Consequently, we have encountered various challenges such as increased hardware requirements, heightened resource consumption, and potential performance issues like out-of-memory errors and prolonged garbage collection pauses.

To address these challenges and optimize storage cost efficiency, we have embarked on several initiatives. These include the implementation of Time to Live (TTL) policies for aging partitions, the adoption of tiered storage strategies to transition data from hot/warm to cold storage tiers, and the optimization of data size at the file format level. In this blog, our primary focus will be on the latter aspect—reducing data size within the Apache Parquet^™ file format by selectively deleting unused large size columns.

When tables are created, table owners enjoy the flexibility of defining their desired columns without strict governance. However, it is common for certain columns to remain unused, resulting in the unnecessary occupation of significant storage space. Through careful analysis, we have uncovered substantial data savings on a large scale simply by removing unused columns from specific tables.

Introduction of Apache Parquet^™ Data Structure

Since Apache Parquet^™ is our main data lake storage file format, the column deletion is implemented in Parquet^™ file format.

Apache Parquet^™ is a columnar file format, meaning that all data belonging to the same column is stored together in a column chunk. The diagram below illustrates the comparison between row-oriented storage and column-oriented storage. In this table, there are 4 columns, and each column contains 4 fields.

In the row-oriented storage, the data is stored row by row.

Figure 2: Row-oriented storage on disk or network

In the column-oriented storage, the data is stored column by column.

Figure 3: Column-oriented storage on disk or network

Inside Apache Parquet^™, the data is partitioned into multiple row groups. Each row group contains one or more column chunks, which correspond to a column in the dataset. The data within each column chunk is stored in the form of pages. A block consists of pages and represents the smallest unit that needs to be read entirely to access a single record. Within a page, except for the dictionary page encoding, each field is appended with its value, repetitive level, and definition level.

The diagram below illustrates the structure of a Parquet^™ file. For further details, please refer to the official Parquet^™ file format documentation.

Possible Solution

To remove selected columns from existing data, one common approach is to use Spark^™ for reading the data, eliminating the unnecessary columns in memory, and writing back the modified dataset. However, let’s delve into the process of how a Parquet^™ file is changed, assuming there is no reshuffling of data across files, as reshuffling can consume additional resources due to its expense.

The steps involved in modifying a Parquet^™ file include decryption, decompression, decoding, and record disassembly during the reading phase. These operations are necessary to access and process the data within the file. Similarly, during the writing phase, record assembly, encoding, compression, and encryption steps are performed to prepare the data for storage. It’s important to note that these steps are computationally expensive and contribute significantly to the time required for reading and writing Parquet^™ files. The resource-intensive nature of these operations can impact overall performance and efficiency, prompting the need for a more efficient solution to remove columns without relying heavily on these steps.

Figure 4: The process of translating file to remove columns with original Parquet^™ reader and writer

In order to optimize the process, it is crucial to find a method that minimizes the reliance on the costly steps mentioned above. By doing so, we can reduce the computational overhead and improve the overall performance of column removal in Parquet^™ files. This optimization becomes even more important when dealing with large datasets where the impact of these operations is magnified. Finding a streamlined approach to selectively remove columns without resorting to extensive data processing steps can significantly enhance the efficiency and speed of the process.

Selective Column Reduction Solution

Since Parquet^™ is a columnar file format and the data in the same column is stored together, instead of performing the expensive steps above, we can just copy the original data for a given column and send it back to disk and bypass those steps. For the columns that need to be removed, we can just skip and jump to the next column that doesn’t need to be reduced. In the end, the offset will be changed and we can update metadata to reflect that. In this process, there would be no data shuffling needed at all. The data in the same file will be output to a new file that contains all the data except for the ones in the column to be removed.

Figure 5: The new process of translating file to remove columns with selective column pruner

The “Copy & Skip” process can be described as below. Assume columns col_b and col_c are to be removed. The hybrid reader and writer started with copying col_a and there are no expensive steps like encryption, compression, encoding, etc. Instead, it is just literally byte buffer copy. When starting col_b, we see col_b is the column to be eliminated, so we move on to the next column and find col_c also needs to be removed, then we move on to col_d and start copying. In the end, the new file contains two columns’ data (col_a and col_d) and excludes columns (col_b and col_c).

From the above Copy & Skip process, we can see it is vastly improved by bypassing the expensive steps. It is worth noting that there will be no reshuffling data needed, which could be another huge improvement because reshuffling is also very expensive.

Figure 6: The copy & Skip process in selective column prunnner

Benchmarking

We have benchmarking tests with a local host for reading files that have sizes from 1G, 100M, and 1M in bytes. As a comparison, we use Apache Spark^™ as the baseline for the test. From the chart below, we can see the Selective Column Pruner is ~27x faster when the file is at ~1G and ~9x faster when the file size is 1M.

Parallelization of Execution

So far we have discussed the improvement on a single file which builds up the foundation for reducing columns in a dataset. But in reality, a dataset is organized as tables and sometimes is further divided into partitions that usually have a list of files. We need an execution engine to parallelize the execution of the Selective Column Pruner. Apache Spark^™ is the distributed execution engine that is widely used. It is a little confusing that in the above section, we use Spark^™ for executing the reduction directly. Here, we just use the Spark^™ Core to parallelize the execution instead of relying on its Data Source to operate on data directly.

The code samples are as below:

Figure 8: Code snippet for parallel execution in Spark^™

During execution, generally we deal with the dataset as the unit. For each dataset, we list all the partitions if it is a partitioned table, and inside the partition we list all the files under it, and then assign tasks to each file to execute the column reduction using the Selective Column Pruner we discussed above. The dataset snapshot can be described as below:

Figure 9: Using Column Pruning Tool to translate tables

Conclusion

In summary, our selective column reduction approach within the Parquet^™ file format has proved to be a highly efficient solution for optimizing data storage. By eliminating unused columns, we can reduce storage costs, minimize resource consumption, and enhance overall system performance. As Uber continues to evolve, we remain dedicated to exploring innovative ways to improve data management practices and maximize the value of our data assets while minimizing associated infrastructure costs.

Apache^®, Apache Parquet^™, Parquet™, Apache Spark^™, and Spark^™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

Cover Image Attribution: Image by Ian and under CC BY 2.0.

Xinli Shang

Xinli Shang is a Manager on the Uber Big Data Infra team, Apache Parquet PMC Chair, Presto Commmiter, and Uber Open Source Committee member. He is leading the Apache Parquet community and contributing to several other communities. He is also leading several initiatives on data format for storage efficiency, security, and performance. He is also passionate about tuning large-scale services for performance, throughput, and reliability.

Kai Jiang

Kai Jiang is a Senior Software Engineer on Uber’s Data Platform team. He has been working on Spark Ecosystem and Big Data file format encryption and efficiency. He is also a contributor to Apache Beam, Parquet, and Spark.

Ryan Chen

Ryan Chen is a Software Engineer on the Uber Data Infra team. He primarily works on Kafka and ZooKeeper, with additional experience in Spark and Parquet from this project.

Jing Zhao

Jing Zhao is a Principal Engineer on the Data team at Uber. He is a committer and PMC member of Apache Hadoop and Apache Ratis.

Mingmin Chen

Mingmin Chen is a Director of Engineering and head of the Data Infrastructure Engineering team at Uber. He has been leading the team to build and operate Hadoop data lake to power multiple Exabytes of data in storage, Kafka infrastructure to power tens of trillions messages per day, and compute infrastructure to power hundreds of thousands compute jobs per day. His team builds highly scalable, highly reliable yet efficient data infrastructure with innovative ideas while leveraging many open-source technologies such as Hadoop (HDFS/YARN), Hudi, Kafka, Spark, Flink, Zookeeper etc.

Mohammad Islam

Mohammad Islam is a Distinguished Engineer at Uber. He co-leads the Data cost-efficiency effort and also leads Data Security and compliance efforts. He is an Apache Oozie and Tez PMC member.

Karthik Natarajan

Karthik Natarajan is a Senior Manager, Engineering at Uber. He leads Ingestion and Datalake storage - Hudi within the Uber Data Platform Organization.

Ajit Panda

Ajit Panda is a Senior Manager, Engineering at Uber. He leads the Security, privacy and compliance engineering within the Uber Data Platform Organization.

Posted by Xinli Shang, Kai Jiang, Ryan Chen, Jing Zhao, Mingmin Chen, Mohammad Islam, Karthik Natarajan, Ajit Panda

Category:

Engineering

Data / ML