AI, Data / ML, Engineering

Elastic Distributed Training with XGBoost on Ray

July 7, 2021 / Global
Fig 1. Comparison of XGBoost Ray fault tolerance modes in the event of a single node failure during training. In practice, elastic training can often reduce total training time significantly with minimal impact on accuracy.
Condition     | Impacted Workers | Validation Log Loss | Train Error | Train time (s)
Fewer workers | 0                | 0.405359            | 0.132591    | 1441.44
Fewer workers | 1                | 0.406041            | 0.133103    | 1227.45
Fewer workers | 2                | 0.405873            | 0.132885    | 1249.45
Fewer workers | 3                | 0.405081            | 0.132205    | 1291.54
Non-elastic   | 1                | 0.405618            | 0.132754    | 2205.95
Non-elastic   | 2                | 0.405076            | 0.132403    | 2226.96
Non-elastic   | 3                | 0.405618            | 0.132754    | 2033.94
Elastic       | 1                | 0.40556             | 0.1327      | 1231.58
Elastic       | 2                | 0.405885            | 0.132655    | 1197.55
Elastic       | 3                | 0.405487            | 0.132481    | 1259.37
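
For reference, the sketch below shows how elastic fault tolerance can be configured with the open source XGBoost Ray library; the toy dataset, actor counts, and failure limits are illustrative assumptions, not the benchmark configuration reported above.

```python
# A minimal sketch of elastic, fault-tolerant distributed training with
# XGBoost Ray. Dataset, actor counts, and failure limits are illustrative.
from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

ray_params = RayParams(
    num_actors=4,           # distributed training workers (Ray actors)
    cpus_per_actor=1,
    elastic_training=True,  # keep training on fewer actors if one is lost
    max_failed_actors=3,    # tolerate up to 3 simultaneously failed actors
    max_actor_restarts=3,   # re-integrate restarted actors into the run
)

evals_result = {}
booster = train(
    params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    dtrain=train_set,
    evals=[(train_set, "train")],
    evals_result=evals_result,
    num_boost_round=100,
    ray_params=ray_params,
)
booster.save_model("model.xgb")
```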
Fig 3. An illustration of Successive Halving. Each iteration drops the worst half of trials and doubles the budget for the rest until it spends the maximum budget. Each line corresponds to a single trial or configuration. (Source: https://www.automl.org/blog_bohb/)
Technique | Mean Efficiency Gain (%) per Study | Mean Accuracy Change (%) of Best Candidate per Study
          | 16.3                               | +0.06
          | 52.8                               | -0.54
          | 68.9                               | -0.59
          | 69.9                               | -0.78
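
As a concrete illustration of Successive Halving, the sketch below uses Ray Tune's asynchronous variant (ASHA); the objective function and search space are placeholder assumptions rather than anything from the studies above.

```python
# A minimal sketch of asynchronous Successive Halving (ASHA) with Ray Tune.
# The objective and search space are placeholders for a real training loop.
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def objective(config):
    score = 0.0
    for _ in range(100):
        score += config["lr"] * 0.01   # stand-in for one epoch of training
        tune.report(mean_accuracy=score)

scheduler = ASHAScheduler(
    metric="mean_accuracy",
    mode="max",
    max_t=100,           # maximum budget (iterations) per trial
    grace_period=1,      # minimum budget before a trial can be stopped
    reduction_factor=2,  # keep roughly the best half at each rung
)

analysis = tune.run(
    objective,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=32,      # number of sampled configurations (trials)
    scheduler=scheduler,
)
print(analysis.get_best_config(metric="mean_accuracy", mode="max"))
```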
Fig 5. A graphical representation of using dynamic resource allocation to improve model training performance under time constraints. HyperSched progressively reduces and eventually eliminates exploration in favor of deeper exploitation of fewer trials by dynamically allocating more parallel resources.
Fig 6. The Ray Estimator expects a serializable function and its set of required arguments. When Estimator.fit() is called, the Estimator uses a backend to start a remote cluster and execute the serializable function on that cluster.
Fig 7. High-level overview of the flow from Spark (DataFrames) to Ray (distributed training) and back to Spark (Transformer). Ray Estimator encapsulates this complexity within the Spark Estimator interface.
Fig 8. The packaged model returned by the Ray Estimator is a native Spark Transformer that can be used in a batch or real-time serving pipeline.
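
The core mechanism behind Figs 6-8, shipping a serializable function and its arguments to a Ray cluster for remote execution, can be sketched with plain Ray primitives; the function, arguments, and resource settings below are illustrative assumptions, and the Ray Estimator wraps this kind of pattern, plus the Spark DataFrame handoff, behind the Spark Estimator/Transformer interface.

```python
# A minimal sketch of remotely executing a serializable function on a Ray
# cluster, using only core Ray primitives. The function and its arguments are
# illustrative; the Ray Estimator described above packages this pattern
# behind the Spark Estimator/Transformer interface.
import ray

ray.init()  # or ray.init(address="auto") to attach to an existing cluster

@ray.remote(num_cpus=2)
def train_fn(params, num_boost_round):
    # Stand-in for the real serializable training function, e.g. an
    # XGBoost Ray training job launched on this cluster.
    return {"params": params, "num_boost_round": num_boost_round}

# Serialize the function and its required arguments, execute remotely, and
# collect the result that would then be packaged as a Spark Transformer.
result = ray.get(train_fn.remote({"eta": 0.1, "max_depth": 6}, 100))
print(result)
```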
Fig 9. The Ray Estimator can remotely execute any serializable function. Ray Tune and XGBoost Ray provide functions that cleanly abstract away the complexity of running distributed hyperparameter search using nested distributed training jobs.
Fig 10. A representation of using Ray Tune to run nested distributed XGBoost Ray jobs for distributed hyperparameter optimization.
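
The pattern in Figs 9 and 10 can be sketched with the open source APIs: each Ray Tune trial launches a nested distributed XGBoost Ray training job, and each trial reserves resources for its nested actors through the RayParams object. The dataset and search space below are illustrative.

```python
# A minimal sketch of nested distributed training: Ray Tune launches trials,
# and each trial runs a distributed XGBoost Ray job with its own actors.
from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train
from ray import tune

ray_params = RayParams(num_actors=4, cpus_per_actor=1)

def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)
    evals_result = {}
    train(
        params=config,
        dtrain=train_set,
        evals=[(train_set, "train")],   # eval results are reported back to Tune
        evals_result=evals_result,
        num_boost_round=100,
        ray_params=ray_params,
    )

analysis = tune.run(
    train_model,
    config={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 3e-1),
        "max_depth": tune.randint(2, 8),
    },
    metric="train-error",
    mode="min",
    num_samples=8,
    # Reserve enough resources for each trial's nested training actors.
    resources_per_trial=ray_params.get_tune_resources(),
)
print("Best hyperparameters:", analysis.best_config)
```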
Michael Mui

Michael Mui is a Staff Software Engineer on Uber AI's Machine Learning Platform team. He works on distributed training infrastructure, hyperparameter optimization, model representation, and evaluation. He also co-leads Uber's internal ML Education initiatives.

Xu Ning

Xu Ning is a Senior Engineering Manager in Uber’s Seattle Engineering office, currently leading multiple development teams in Uber’s Michelangelo Machine Learning Platform. He previously led Uber's Cherami distributed task queue, Hadoop observability, and Data security teams.

Kai Fricke

Kai Fricke is a Software Engineer at Anyscale, the company behind the distributed computing platform Ray. He mostly works on developing Machine Learning libraries on top of Ray, such as Ray Tune and RaySGD.

Amog Kamsetty

Amog Kamsetty is a Software Engineer at Anyscale where he works on building distributed training libraries and integrations on top of Ray. He previously completed his MS degree at UC Berkeley working with Ion Stoica on Machine Learning for database systems.

Richard Liaw

Richard Liaw is a Software Engineer at Anyscale, where he currently leads development efforts for distributed Machine Learning libraries on Ray. He is one of the maintainers of the open source Ray project and is on leave from the PhD program at UC Berkeley.

Posted by Michael Mui, Xu Ning, Kai Fricke, Amog Kamsetty, Richard Liaw