Uber Submits Hudi, an Open Source Big Data Library, to The Apache Software Foundation
April 19, 2019 / Global
The ability to manage and access petabytes of data quickly is crucial to the scalable growth of an entire data ecosystem. Still, this combined need for scale and speed does not always naturally fit into existing batch and streaming system architectures.
Developed in 2016 under the codename “Hoodie,” Hudi was built to address inefficiencies across ingest and ETL pipelines that required upsert and incremental consumption primitives in Uber’s Big Data ecosystem. To share these benefits with the broader Big Data community, Uber open sourced Hudi in 2017.
In January 2019, we submitted Hudi to the Apache Incubator, thereby furthering our open source commitment and ensuring the long-term sustainability and growth of Apache Hudi under The Apache Software Foundation’s open governance and guidance.
“Given Uber’s use of so many great Apache projects, we believe The Apache Way of community-driven, open source development will enable us to evolve Apache Hudi in collaboration with a diverse set of contributors,” said Vinoth Chandar, co-creator of Hudi. “We look forward to working with The Apache Software Foundation to implement best practices and bring new ideas to the project.”
Over time and with the help of the Big Data open source community, Hudi has evolved into a general-purpose Big Data storage system that enables:
- Snapshot isolation between ingestion and query engines, including Apache Hive, Presto, and Apache Spark
- Support for rollbacks and savepoints to recover datasets
- Automatic management of file sizes and layout to optimize query performance and directory listings
- Near real-time ingestion to feed queries with fresh data
- Asynchronous compaction of both real-time and columnar data
In a testament to its scalability, Hudi currently manages over 4,000 tables storing several petabytes of data at Uber, while lowering Apache Hadoop warehouse access latencies from several hours to under 30 minutes. Hudi also powers hundreds of incremental data pipelines at lower costs and with greater efficiency than previous solutions used by the company.
Going forward, the project will live with The Apache Software Foundation. Please check out the Apache Hudi project page for technical documentation and community engagement guidelines.
Posted by Uber