Engineering, Data / ML

How Uber Scaled Data Replication to Move Petabytes Every Day

29 January / Global
Figure 1: High-level DistCp architecture.
Figure 2: DistCp copying from the /src/ directory to the /dest/ directory.
Figure 3: HiveSync architecture: data replication workflow using DistCp.
Figure 4: Increased HDFS FSUtils latency directly affects the DistCp Copy Listing task.
Figure 5: Threads blocked on RPC calls.
Figure 6: Multiple parallel calls from different copy job requests create contention on the HDFS client.
Figure 7: Copy Listing and Input Splitting processes moved from the HiveSync server (client) to the Application Master (AM).
Figure 8: Observed 90% reduction in DistCp job submission time.
Figure 9: p99 latency averaged around 10 minutes for the busiest replication server, even after moving this task to the Application Master.
Figure 10: Copy Listing task V2 workflow.
Figure 11: Improvement in Copy Listing latency on a HiveSync server with 6 threads.
Figure 12: Copy Committer task V2 workflow.
Figure 13: Mean concat latency dropped by 97.29% using 10 threads.
Figure 14: More than 50% of DistCp jobs are assigned a single mapper each.
Figure 15: Uber Job workflow.
Figure 16: Scale of HiveSync across on-premise and cloud data centers.
Figure 17: Data migrated from on-premise to cloud via the HiveSync service.
Abhay Yadav

Abhay Yadav is a Senior Software Engineer on the HiveSync team at Uber. He has been working on improving the reliability and performance of HiveSync cross-DC replication.

Radhika Patwari

Radhika Patwari is a Software Engineer II on the HiveSync team at Uber. She has worked on enhancing DistCp performance and is currently focused on developing disaster recovery solutions for SRC (Single Region Compute) failover drills.

Sanjay Sundaresan

Sanjay Sundaresan is a former Senior Staff Engineer for the Batch Storage Infra team. He led the HiveSync and Cloud Migration teams at Uber.

Posted by Abhay Yadav, Radhika Patwari, Sanjay Sundaresan