
Introduction
Apache Hive™ on Apache Spark™ has been the preferred engine for ETL workloads at Uber. Hive on Spark supports a wide range of use cases across various verticals like compliance, financial reporting, planning, forecasting, fraud, and risk analysis. Before the migration, there were about 18,000 Hive ETL workflows generating around 5 million queries per month, contributing a significant percentage of Uber’s total YARN usage. Additionally, Hive was used for interactive use cases, handling around 150,000 interactive queries monthly.
This blog talks about our migration journey from Hive to Apache Spark SQL™ and the challenges faced on the way.
Motivation
We decided to move from Hive to Spark SQL for compute efficiency and modernization. On Spark 3, Spark SQL outperforms Hive for the same query thanks to features like adaptive query execution, dynamic partition pruning, and more. When we compared Spark SQL with Hive, initial workload results showed up to 4x performance benefits from Spark SQL.
Spark SQL also offers a robust and active OSS community. In contrast, Hive is becoming obsolete in the OSS community. At Uber, we use Hive on Spark, which has been discontinued in OSS since Hive 3.
This change also helps us simplify batch analytics at Uber. A lot of Uber teams have shifted Hive workloads to Spark for better efficiency.
Architecture
To understand where we started, Figure 1 shows the components of and around the Hive ecosystem at Uber.
- Piper: Enables users to execute Hive queries via scheduled or trigger-based workflows.
- Naga: Proxy service for submitted Hive queries.
- Serverless Hive: Naga submits a Hive payload directly to YARN, bypassing HiveServer, enabling better reliability and zero-downtime deployments.
- HiveCLIActions: SDK used via Piper workflows, which executes queries through a proxy (Naga), supporting both serverless (via Spark) and serverful modes for different payloads.
- HiveDBIActions: SDK used via Piper workflows, which directly connects to Hive servers using Oracle® Java Database Connectivity, mainly for DDL commands or queries requiring data streaming.
- HiveETL Framework: Simplifies ETL workflow execution by automating Hive task creation (stage, promote) based on user-provided configurations and queries.
- QueryBuilder: Offers a UI for executing Hive Select queries interactively.
- Beeline: Terminal-based tool for executing ad-hoc Hive queries.
Migration Strategy
At Uber, there are more than 18K workflows that run about 200K Hive queries (~50K unique payloads) in a day, distributed across 440 teams. Asking users to migrate their workflows from Hive to Spark SQL would have required a lot of developer effort and time across Uber. Therefore, we decided to perform the bulk of this migration in an automated way with minimum involvement from end users. We decided on the following migration strategy:
- For parallel shadow testing, Hive workflows would run on Spark SQL in shadow mode and be validated against the production runs.
- Once the shadow testing for the workflow was successful, any Hive queries submitted from the subsequent workflow runs would be translated dynamically and run on Spark SQL.
- The source code of the workflow would then be updated to submit Spark SQL instead of Hive.
We developed three core components to facilitate the shadow testing:
- Query Translation Service: Translates Hive configs and queries into SparkSQL.
- Data Validation Service: Compares output datasets generated by Hive and their corresponding SparkSQL translations to ensure data correctness.
- Auto Migration Service: Orchestrates the end-to-end shadow testing process for all production Hive workflows.
Automated Migration Service (AMS)
To automate the migration of Hive workloads to Spark SQL, we developed a new service – AMS (Automated Migration Service). AMS is a centralized solution that validates Hive queries on Spark SQL by translating them to Spark SQL and running them in shadow mode. It compares the output of both queries and ensures that query performance remains the same, if not better.
Design
AMS orchestrates the shadow testing by monitoring production Hive workflows and triggering the corresponding Spark SQL shadow runs. Once both Hive and Spark SQL queries finish, it triggers data and performance validations. After enough runs of all the queries belonging to a workflow have been validated, the workflow is marked READY_TO_MIGRATE.
Query Translation Service
There were two major challenges in the translation of HiveQL to Spark SQL: syntax differences and safe shadow runs.
The first challenge was the difference between the syntax of Spark SQL and HiveQL. If the user writes basic queries, the syntax for both query engines remains the same. But when a user writes complex queries with named structs, lateral explode, and more, we need to translate these queries to execute them successfully in Spark SQL, as each query engine has a different syntax for these constructs. Figure 3 shows a basic example of the syntax differences between Hive and Spark SQL.
These differences become more profound as the complexity of queries increases.
The second challenge was around safe shadow runs. When a query is executed in shadow mode on Spark SQL, its syntax needs to be compatible with Spark SQL, and the table names, database names, and locations need to be modified so that production data isn’t impacted. However, this replacement can itself lead to other errors like “Table not found,” which also need to be handled.
Existing Solutions
When exploring open-source solutions, we came across LinkedIn® Coral, a query translation service. However, with its current architecture, using Coral poses some challenges. Coral only supports SELECT-based queries, with no support for DDL or DML queries. It also doesn’t cover all SELECT constructs, such as grouping sets or GROUP BY on column index. And it depends on RelNode from Apache Calcite™, which makes it very restrictive to extend support for different queries.
We decided to reuse a few modules from Coral and devised a new system that can address all the challenges listed above and provide an exhaustive and extensible solution that can cater to other use cases at Uber.
Proposed Design
Taking inspiration from Coral, we designed the Query Translation Service, shown in Figure 5.
The Input Query Engine Side Transformations layer aims to parse the query and generate a SqlNode. To generate a SqlNode, it follows these steps:
- The query is initially parsed to generate an AST (abstract syntax tree) representation of the query.
- AST nodes are transformed to SqlNode via a visitor pattern.
- The SqlNode generated after this stage is query engine agnostic.
- Because the SqlNode is visited via a visitor pattern, transformers can then be applied to it as needed: changing table names, or removing, modifying, and replacing nodes.
The Target Query Engine Side Transformations layer aims to generate the target query engine’s query from SqlNode. It follows these steps:
- The SqlNode generated from the previous step is visited and transformed into a SqlNode specific to the target query engine.
- The query-specific SqlNode is then unparsed using a query engine-specific dialect and the target query string is returned.
Note that we removed the dependency on RelNode that was part of Coral, as not all constructs in Hive and Spark SQL could be expressed with it.
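To make the two-stage flow concrete, here’s a minimal sketch of the SqlNode round trip using Apache Calcite’s parser, a SqlShuttle visitor, and a dialect-aware unparser. The shadow-renaming rule, the amsdb/_ams naming, and the use of Calcite’s default parser are illustrative assumptions; the production service builds the AST with Hive’s own parser and applies a much richer set of transformers.

```scala
import org.apache.calcite.sql.{SqlIdentifier, SqlNode}
import org.apache.calcite.sql.dialect.SparkSqlDialect
import org.apache.calcite.sql.parser.SqlParser
import org.apache.calcite.sql.util.SqlShuttle
import scala.jdk.CollectionConverters._

// Illustrative transformer: rewrite two-part identifiers (db.table) to shadow
// equivalents. A real transformer would only touch identifiers in table
// positions (FROM, INSERT INTO, ...), not column references.
class ShadowTableShuttle extends SqlShuttle {
  override def visit(id: SqlIdentifier): SqlNode =
    if (id.names.size == 2) {
      val table = id.names.get(1)
      new SqlIdentifier(List("amsdb", s"${table}_ams").asJava, id.getParserPosition)
    } else {
      id
    }
}

object TranslateForShadowRun {
  def main(args: Array[String]): Unit = {
    val hiveQuery = "SELECT col1 FROM proddb.my_table WHERE ds = '2024-05-01'"

    // 1. Parse the query into an engine-agnostic SqlNode tree.
    val parsed: SqlNode = SqlParser.create(hiveQuery).parseQuery()

    // 2. Apply input-side transformations via the visitor pattern.
    val rewritten = parsed.accept(new ShadowTableShuttle)

    // 3. Unparse with the target engine's dialect to get the Spark SQL string.
    println(rewritten.toSqlString(SparkSqlDialect.DEFAULT).getSql)
  }
}
```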
With this, our QTS (Query Translation Service) can cater to the following use cases:
Use case 1: Hive to Spark SQL translations.
Use case 2: Multi-query support in a single request with context of previous queries. Say we want to translate multiple queries at once (we call a collection of these queries a payload). QTS can receive a payload and return a translated payload.
Use case 3: Safe translations for shadow runs. For shadow runs, we need to replace table names, database names, and locations, and make sure the replacement itself doesn’t cause unnecessary failures.
In the example in Figure 11, we changed the database name to a shadow database named amsdb and changed the table name by appending the suffix _ams. We also replaced other_table. We need to add a create statement so the payload doesn’t fail with a table-not-found error.
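As a sketch of how such guard statements could be generated, assuming the service tracks which tables were remapped to shadow names (the helper, the LIKE-based schema clone, and the table names are illustrative, not the exact statements AMS emits):

```scala
// Hypothetical helper: for every table the shadow payload references under a
// shadow name but never creates itself, prepend a schema-only clone so the
// translated queries don't fail with a table-not-found error.
def guardStatements(remapped: Map[String, String]): Seq[String] =
  remapped.map { case (prodTable, shadowTable) =>
    s"CREATE TABLE IF NOT EXISTS $shadowTable LIKE $prodTable"
  }.toSeq

val shadowPayload: Seq[String] =
  guardStatements(Map("proddb.other_table" -> "amsdb.other_table_ams")) ++ Seq(
    "INSERT OVERWRITE TABLE amsdb.other_table_ams SELECT * FROM amsdb.my_table_ams"
  )
```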
Data Validation Service
This migration aimed to leverage Spark’s performance benefits without altering the existing business logic. However, ensuring that the data generated by Spark queries matched the expectations set by Hive queries posed a unique challenge.
Automated Data Consistency Validation
To address this challenge, we used Uber’s in-house Data Quality Platform, a central data monitoring and alerting solution. The platform allows users to create assertions that can be executed on a schedule or on-demand, alerting users whenever an assertion fails. For our specific problem, we developed a new assertion model that executed a series of validations to compare the data produced by Hive and Spark.
The assertion model was designed to perform a chain of validations, each contributing to the final verdict on whether the Hive and Spark datasets match. Here’s how it works:
- Row count validation: Compare the number of rows in each dataset. If the counts differ, we short-circuit the execution, since an obvious difference has already been identified.
- Data comparison: Row-level checksum comparisons. To determine whether the data is exactly equal, we use row-level checksums, as they offer the most granular insights and the best debugging experience. This approach helps distinguish genuine mismatches from false positives effectively.
The benefits of this model are that it enables precise identification of mismatched rows, making debugging straightforward. It also provides the flexibility to define acceptable thresholds for mismatches (like the percentage of differing rows).
One drawback of this model is that it’s computationally intensive due to the overhead of checksum transformations and cross-system comparisons. It also presents high storage costs for capturing mismatched rows, especially in scenarios involving significant discrepancies.
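For illustration, a row-count short circuit followed by a row-level checksum comparison might look like the sketch below in Spark. The table names, the checksum construction, and the except-based diff are assumptions for the sketch, not the exact logic of the Data Quality Platform assertion.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Add a per-row checksum computed over all columns. Nulls, column order, and
// duplicate rows need more careful handling in practice.
def withChecksum(df: DataFrame): DataFrame =
  df.withColumn("row_checksum", sha2(concat_ws("\u0001", df.columns.map(col): _*), 256))

val hiveOut  = spark.table("proddb.orders")      // production (Hive-written) output
val sparkOut = spark.table("amsdb.orders_ams")   // shadow (Spark-written) output

// 1. Row count validation: short-circuit on an obvious difference.
if (hiveOut.count() != sparkOut.count()) {
  println("Row counts differ; skipping checksum comparison")
} else {
  // 2. Row-level checksums: rows present on one side but not the other.
  val prodChecksums   = withChecksum(hiveOut).select("row_checksum")
  val shadowChecksums = withChecksum(sparkOut).select("row_checksum")
  println(s"Rows only in production output: ${prodChecksums.except(shadowChecksums).count()}")
}
```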
Consistent Sampling
When comparing sampled data, it’s important to sample consistently across both sources, so that the same rows (if they match) are selected on each side and we don’t report false mismatches simply because different rows were chosen from each source.
To achieve this, we identify sampling key columns using either primary/unique columns or all columns, if no primary keys are defined. Based on the numeric hash of the sample key(s), we select only the rows that have a hash value less than the sampling threshold. Figure 12 shows an example. It uses delivery_trip_uuid and user_uuid as the sample keys and selects only rows where the clause holds true.
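A sketch of such a filter in Spark, with hypothetical table names and a 10% threshold (the production hash function and threshold may differ):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Deterministic filter applied identically to both sides: hash the sample keys,
// map the hash into [0, 100), and keep only rows below the sampling threshold.
val sampleFilter = "pmod(hash(delivery_trip_uuid, user_uuid), 100) < 10"  // ~10% sample

val prodSample   = spark.table("proddb.deliveries").where(sampleFilter)
val shadowSample = spark.table("amsdb.deliveries_ams").where(sampleFilter)
```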
Figure 13 summarizes the entire validation flow.
Challenges
We faced some challenges when implementing this solution.
One challenge was floating point arithmetic. Aggregation functions like SUM or AVG on floating-point columns often introduce precision errors, leading to frequent mismatches between production and shadow datasets. To mitigate this, a mismatch tolerance was introduced, and results were rounded to a specified precision. However, this solution wasn’t foolproof and required manual effort to identify and exclude such columns from validation.
We also had to handle stringified JSON. Many datasets store JSON as string columns, which often causes mismatches during validation. To address this, a custom UDF was developed to sort the JSON keys before computing checksums, ensuring consistency.
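One way to implement such a UDF with Jackson, registered for use in the validation queries (the function name and null handling are illustrative):

```scala
import com.fasterxml.jackson.databind.{ObjectMapper, SerializationFeature}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Re-serialize the JSON with map keys sorted so that semantically identical
// documents produce identical strings, and therefore identical checksums.
val mapper = new ObjectMapper()
  .configure(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS, true)

spark.udf.register("sort_json_keys", (json: String) =>
  if (json == null) null
  else mapper.writeValueAsString(mapper.readValue(json, classOf[Object])))
```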
Columns derived from non-deterministic functions such as ROW_NUMBER, RAND, COLLECT_LIST, COLLECT_SET, and CURRENT_TIMESTAMP also introduced variability, causing mismatches.
Identifying and excluding these columns required manual intervention during the validation process.
Lastly, validation mismatches often occurred due to frequent updates to input datasets or circular dependencies, as there could be a lag or runtime differences between production and shadow runs.
Bridging the Gap Between Hive and Spark
We encountered many failures at different stages of shadow testing because of differences between Hive and Spark. In this section, we’ll talk about those differences and how we bridged the gap between the two, so that the migration could be done without user intervention and without requiring any query changes.
Execution Failures
There are multiple DDL query patterns that Hive supports but Spark SQL doesn’t. For example, dropping partitions using < or > comparators. To support these patterns in Spark, we used HiveDriver in the Spark SQL runner app to run these queries instead of spark.sql.
Spark also didn’t support GROUP BY on complex data types like maps and structs. Since this was a heavily used feature in Hive at Uber, we ported the open PR for SPARK-34819 from open source into Uber’s fork to add this support.
Increasing executor memory and spark.sql.shuffle.partitions resolved most of the OOM issues we faced.
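For reference, the knobs involved look like this, with purely illustrative values (the right numbers depend on the workload, and executor memory must be set at application launch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.memory", "8g")            // illustrative value
  .config("spark.sql.shuffle.partitions", "4000")   // illustrative value
  .enableHiveSupport()
  .getOrCreate()
```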
Data Mismatches
There are certain feature differences in Hive and Spark, which resulted in different output values. Here are the approaches we took to resolve them.
We added Hive-compatible behavior in Spark behind a config. One of these behaviors was boolean/string conversion:
| Hive | Spark |
| --- | --- |
| true => “TRUE” | true => “true” |
| false => “FALSE” | false => “false” |
| “” => false | “false” => false |
| “any_string” => true | “true” => true |
Another was the conversion/coercion between timestamps and BigInt/double values:
| Hive | Spark |
| --- | --- |
| cast(1714542982 as timestamp) => 1970-01-20 20:15:42.982 | cast(1714542982 as timestamp) => 2024-05-01 11:26:22 |
The last behavior was skipping the header in CSV tables. Hive respects the table property skip.header.line.count when reading a CSV table, while Spark doesn’t.
We also translated some expressions from Hive to Spark SQL, including casts to varchar and certain built-in functions. In Hive, cast as varchar(x) truncates the string to the specified length, but Spark doesn’t support varchar. To mitigate this, we translated such expressions to use the substr function instead, achieving the same behavior.
Certain built-ins in Hive and Spark behave differently. For example, regexp_like returns null for a null pattern in Spark but returns false in Hive. For such cases, we registered Hive’s built-in as a temporary function and used it within the translated Spark SQL query.
Performance Gaps
When we migrated workflows from Hive to Spark, we observed that the number of files created by Spark was much higher than with Hive. This is because Hive runs an additional merge-files stage at the end of the query, controlled by its small-file merge configs, which merges smaller files into bigger ones.
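For reference, a typical set of Hive small-file merge settings looks like the following; hive.merge.sparkfiles enables the merge stage for Hive on Spark, hive.merge.smallfiles.avgsize is the average-size threshold that triggers it, and hive.merge.size.per.task is the target size of the merged files (values shown are illustrative, not Uber’s production settings):

```
hive.merge.sparkfiles=true
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
```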
The high file count could cause NameSpaceQuota issues for a lot of users and overwhelm the HDFS NameNode.
Spark has a rebalance feature, which introduces a shuffle to reduce the number of partitions according to the specified AdvisoryPartitionSizeInBytes. We built upon this feature and introduced a rebalance node in the logical plan of all Spark SQL queries before the DataWritingCommand.
We tweaked the AQE rules CoalesceShufflePartitions and OptimizeSkewInRebalancePartitions to coalesce/split partitions based on the file target size for rebalance instead of AdvisoryPartitionSizeInBytes. This way, we could create bigger files without impacting the parallelism of the rest of the job.
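A rough sketch of what inserting that rebalance node can look like against open-source Spark 3 internals; the rule, the matched write command, and the injection point are assumptions for illustration, not Uber’s exact implementation:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RebalancePartitions}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand

// Insert a RebalancePartitions node between the write command and its child
// query so AQE can coalesce/split the final shuffle into target-sized files.
object InsertRebalanceBeforeWrite extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    case cmd: InsertIntoHadoopFsRelationCommand
        if !cmd.query.isInstanceOf[RebalancePartitions] =>
      cmd.copy(query = RebalancePartitions(Seq.empty, cmd.query))
  }
}

// One possible wiring via session extensions; the right injection point
// depends on the Spark version and the write path being targeted.
class RebalanceExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectPostHocResolutionRule(_ => InsertRebalanceBeforeWrite)
}
```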
Hive and Spark also use different hashing algorithms for bucketing. Spark supports both HiveHash and MurmurHash when writing to bucketed tables, but ends up creating too many files (#tasks * #buckets) when using HiveHash. To resolve this, we added support for HiveHash in Rebalance, so the hashing algorithm stays the same when rebalanced partitions are created, resulting in a 1:1 mapping between Hive bucket files and Spark partitions after rebalance.
Lastly, after migrating from Hive to Spark, downstream workflows encountered FileNotFoundExceptions due to Spark’s use of unique UUIDs for file names, unlike Hive’s fixed patterns. This caused failures in in-flight readers when files were deleted during partition overwrites. To fix this, we implemented partition snapshot isolation in Spark for selected datasets with a high number of downstream reads.
Conclusion
We successfully migrated all interactive and ETL workflows to Spark SQL, moving about 5 million monthly queries. This migration resulted in an overall 50% reduction in runtime and resource usage.
Cover Photo Attribution: “04. Battery Tunnel Empty” by MTA New York City Transit / Leonard Wiggins is licensed under CC BY 2.0.
Apache®, Apache Calcite™, Apache Spark™, Apache Spark SQL™, Apache Hive™, and the star logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
LinkedIn® is a registered trademark of LinkedIn Corporation and its affiliates in the United States and/or other countries.
Oracle®, Java, MySQL, and NetSuite are registered trademarks of Oracle® and/or its affiliates.

Kumudini Kakwani
Kumudini Kakwani is a Staff Software Engineer at Uber, currently working on the Spark team. With a strong background in building data platforms, Kumudini has previously worked on a range of impactful projects, including Hive, Data Observability, and Machine Learning platforms.

Akshayaprakash Sharma
Akshayaprakash Sharma is a Staff Software Engineer at Uber, currently working on the Data Observability Team. Akshaya has previously worked on Hive, Spark, Vertica and Data Reporting Tools.

Nimesh Khandelwal
Nimesh Khandelwal is a Senior Software Engineer on the Spark team. He is focused on projects on modernizing and optimizing the Spark ecosystem at Uber.

Aayush Chaturvedi
Aayush Chaturvedi is a Senior Software Engineer at Uber, currently part of the Spark team, where he contributes to optimizing large-scale data infrastructure. He has also worked on improving Uber Maps coverage and creating pickup/drop-off efficiency.

Chintan Betrabet
Chintan Betrabet is a Senior Software Engineer working on the Uber Data Platform team. He has worked on building automated solutions to monitor and measure data quality at scale for all critical datasets at Uber.

Suprit Acharya
Suprit Acharya is a Senior Manager on Uber’s Data Platform team, leading Batch Data Compute Engine (Spark, Hive), Observability, Efficiency, and Data Science Platforms.
Posted by Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya