Engineering, Data / ML

Building Uber’s Data Lake: Batch Data Replication Using HiveSync

4 September / Global
Figure 1: Bi-regional architecture of the Uber data lake.
Figure 2: HiveSync architecture: control plane and data plane separation.
Figure 3: HiveSync replication service.
Figure 4: Data Reparo analyzes and resolves inconsistencies across data centers (DC1) and (DC2).
Figure 5: Creation of audit logs for replication via Hive Metastore Event Listener.
Figure 6: FSM for a Copy Partition job.
Figure 7: Replication server workflow.
Figure 8: Directed acyclic graph of replication jobs.
Figure 9: Static sharding across multiple servers.
Figure 10: Dynamic sharding for a single server.
Radhika Patwari

Radhika Patwari is a Software Engineer II on the HiveSync team at Uber. She has worked on enhancing DistCp performance and is currently focused on developing disaster recovery solutions for SRC (Single Region Compute) failover drills.

Trivedhi Talakola

Trivedhi Talakola is a Software Engineer on the HiveSync team at Uber. He has worked on bootstrapping Hudi datasets to secondary regions and on resolving performance issues in their incremental replication.

Rajan Jaiswal

Rajan Jaiswal is a Software Engineer I on the HiveSync team at Uber. He has contributed to the cloud migration efforts for Uber’s batch data lake, authored the chain replication and topology analyzer, and is one of the lead contributors to the system’s reliability.

Chayanika Bhandary

Chayanika Bhandary is a former Senior Software Engineer on the HiveSync team at Uber. She contributed to improving the replication SLA of datasets and to efficiently fixing inconsistencies between replicas.

Mukesh Verma

Mukesh Verma is a Staff Software Engineer on the HiveSync team at Uber. He re-architected HiveSync for horizontal scalability, kickstarted cloud replication, and led key infrastructure upgrades, including the DistCp 3.x migration.

Sanjay Sundaresan

Sanjay Sundaresan is a Senior Engineering Manager for the batch storage infra team. He leads the HiveSync and Cloud Migration teams at Uber.

Posted by Radhika Patwari, Trivedhi Talakola, Rajan Jaiswal, Chayanika Bhandary, Mukesh Verma, Sanjay Sundaresan