Tasked with training the machine learning models that power the sensing and perception systems used by our Advanced Technology Group (ATG) and Maps organizations, Alex and his team built Horovod while in the process of developing Uber’s in-house deep learning platform. As existing open source deep learning solutions were unable to meet our desired performance, usability, and scale, the team concluded that all these frameworks needed was a little extra help.
If his own team had these challenges, Alex reasoned, it was highly likely that researchers outside of Uber were faced with similar problems, and he wanted to give back. Released in September 2017, the Horovod framework makes it faster and easier for AI practitioners to train their TensorFlow, Keras, and PyTorch models with only six lines of code.
Currently, Horovod is leveraged by organizations such as NVIDIA, which uses Horovod to scale testing of its GPUs, and the Oak Ridge National Laboratory, a supercomputing-focused research institute sponsored by the U.S. Department of Energy. Over the course of 2018, Horovod was integrated with various deep learning ecosystems, including AWS, Google, Azure, and IBM Watson. Inspired by this wide adoption, InfoWorld named Horovod one of the year’s best open source software projects in machine learning.
We sat down with Alex to discuss his path to building Horovod, the future of his team’s successful project, and what he enjoys most about being a member of the AI open source community:
How did you first get interested in machine learning?
I studied AI and ML at university in Moscow; at the time, those fields were predominantly reserved for academia and not widely used in production environments.
Prior to Uber, I worked on Microsoft’s Bing Data Mining team. We leveraged data analytics, and I thought it would be so cool to build models that could predict data trends instead of just doing analytics. Machine learning and deep learning were becoming very popular, and we started incorporating more and more of those types of technologies into our products.
In this role, my team built the metrics platform leveraged by our A/B experimentation solution to analyze experiments that teams would run on Bing.com. We made software to analyze them at scale, with thousands of experiments running at any point in time.
What brought you to Uber Seattle?
Uber was one of the few companies at the time that started working on deep learning in Seattle, and that was a great opportunity. The Uber Seattle Engineering office was working on a variety of interesting engineering projects, one of which was supporting our AI, self-driving, and maps teams by building solutions to accelerate their ML models.
How did you first get interested in open source?
I really liked how big companies like Google and Baidu were open sourcing their innovations in deep learning, because giving back to the broader AI community accelerates the field for everyone by leaps and bounds.
I approached developing Horovod with this same mindset: it was never a question of whether or not we would open source it. When I joined Uber, I was really excited to get involved with this community and open source our work.
What inspired you to build Horovod?
Horovod was a component of the third iteration of Uber’s deep learning platform. During the first iteration, you would make a ZIP file of your code, drop it onto the platform, press the train button, and after 10 minutes it would either train your model or give you an exception. Our users, in other words, members of Uber ATG, AI Labs, and our Maps team, really didn’t like that because they would get feedback very late in the training cycle.
The second iteration of this platform allowed users to interactively train their models and then, if they wanted to run the model at scale, they would use a TensorFlow parameter server. We tried to pitch it to many teams at Uber, but none of them were able to use it because the UX was suboptimal. We, the engineers that built it, were even struggling with using it!
By the third iteration, which featured Horovod, we understood how distributed deep learning worked and how to make it scale. In this third iteration, we applied some newer techniques we learned from Baidu, which was using MPI, and Facebook, which published a paper about scaling vision model training to 256 GPUs. Leveraging these techniques, we realized that instead of having our own version of TensorFlow, we could accomplish the same things by just making our platform an addition to TensorFlow that you could install regardless of what TensorFlow version you want to use. This way, users wouldn’t have to install a custom version of TensorFlow.
That prompted the idea to open source our third iteration, Horovod, as an independent piece of software. Given our experiences, we thought that a lot of teams would have the same problem, both inside and outside Uber, and it would be nice to help them.
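For readers curious about the MPI-style technique mentioned above: Horovod's public documentation describes averaging gradients across workers with a ring-allreduce, building on Baidu's work. The following is a minimal, single-process Python sketch of that idea, not Horovod's actual code. The function name ring_allreduce_average and the list-of-lists stand-in for workers are assumptions made for illustration; a real deployment exchanges tensors between separate processes over MPI or a similar communication library.

```python
from typing import List


def ring_allreduce_average(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate ring-allreduce averaging of one gradient vector per worker.

    Hypothetical helper for illustration only: every "worker" is just a
    Python list, and "sending" is a list copy inside one process.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    data = [list(g) for g in worker_grads]  # each worker's local buffer

    # Split the vector into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]

    def chunk(c: int) -> range:
        return range(bounds[c], bounds[c + 1])

    # Phase 1, scatter-reduce: in step s, worker r sends chunk (r - s) % n
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Snapshot all outgoing messages first to mimic simultaneous sends.
        messages = []
        for r in range(n):
            c = (r - step) % n
            messages.append((c, [data[r][i] for i in chunk(c)]))
        for r, (c, payload) in enumerate(messages):
            dst = (r + 1) % n
            for i, value in zip(chunk(c), payload):
                data[dst][i] += value
    # Worker r now holds the fully summed chunk (r + 1) % n.

    # Phase 2, allgather: in step s, worker r forwards chunk (r + 1 - s) % n,
    # and the neighbour overwrites its copy with the completed sum.
    for step in range(n - 1):
        messages = []
        for r in range(n):
            c = (r + 1 - step) % n
            messages.append((c, [data[r][i] for i in chunk(c)]))
        for r, (c, payload) in enumerate(messages):
            dst = (r + 1) % n
            for i, value in zip(chunk(c), payload):
                data[dst][i] = value

    # Every worker holds the full sum; divide by n to get the average.
    return [[value / n for value in buf] for buf in data]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]]
    for rank, averaged in enumerate(ring_allreduce_average(grads)):
        print(rank, averaged)
```

In this sketch each worker only ever talks to its right-hand neighbour and sends one chunk per step, so no single node has to receive every worker's full gradient the way a central parameter server would.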
How did Horovod evolve to become framework agnostic?
Initially, Horovod was just built for use with TensorFlow, but later, some of the teams fell in love with Keras and PyTorch, which allowed them to leverage openly published research more easily, and that prompted us to add support for both of those frameworks.
This enabled us to have the same way of scaling deep learning model training across multiple frameworks. As an infrastructure team at Uber, we were responsible for helping teams scale their training, and maintaining a single, framework-agnostic infrastructure would make our lives much easier.
To my knowledge, Horovod is one of the few framework-agnostic solutions that scales other deep learning frameworks. Both TensorFlow and PyTorch have their own mechanisms for distributed training which are evolving in a direction similar to Horovod. However, user experience across frameworks varies and requires tweaks to the infrastructure. Our goal was to have the same infrastructure for any framework that we wanted to scale.
What specific business areas does Uber use Horovod for?
We use Horovod for most deep learning solutions at Uber. Nearly every team at Uber that works with sophisticated deep learning models trains with Horovod. From the Uber Advanced Technologies Group to our Maps teams, it’s used primarily to train sensing, perception, and forecasting systems. Without Horovod, our teams would hit a limit on how much training they can do in a single day. Since they work through so much data, they need something that will help them scale their training, and Horovod fits this role.
How do other companies use Horovod?
Most organizations that use Horovod leverage it to train deep learning models, or integrate it with their deep learning products for their own customers. NVIDIA and Amazon use Horovod very heavily. IBM uses it both as part of its open source deep learning solution, FfDL, and in its IBM Watson Studio, which is its managed offering. Databricks features it in their deep learning offering as well.
How does your team make design decisions for Horovod?
We make decisions based on the needs of both internal and external project contributors. For instance, NVIDIA and Amazon use Horovod to scale deep learning in their ecosystems, and frequently contribute code back.
We love collaborating with these companies and our other contributors to help them optimize their model training with Horovod. We encourage all users to give back to Horovod, because the only way our project can improve is if we’re able to meet the needs of our contributors.
Over the past year, Horovod has been adopted by several companies and research institutions, and was just accepted as an official hosted project of the LF Deep Learning Foundation. What has the open source community’s response to Horovod been?
Once we open sourced the project, the initial response to Horovod was very positive. It was very cool to see my first open source project reach so many people and be adopted so quickly by so many deep learning teams across industry and academia. Now, when I go to conferences, people actually know of Horovod and they’re excited to integrate with it or use it or at least reference it in their research. All these things make me really happy.
When we agreed that Horovod was going to join the LF Deep Learning Foundation, we were really excited. This decision recognizes the impact that Horovod has made on the community, and it will help us further increase the number of contributions from the community so that we can improve our project. Feedback is probably one of the best parts of open sourcing Horovod because at the end of the day, we just want deep learning to advance faster. With teams across the world using the project, it will. It’s sort of like a very, very distributed form of training!
What is most gratifying about being a project lead in the open source community?
Developing Horovod made me a big open source enthusiast. I believe that open source is improving technology as a whole because it lets companies waste less energy on re-inventing wheels all the time. Instead, engineers spend energy on making new innovations.
How would you define Uber’s open source culture?
When I started working on Horovod, I was actually very surprised how easy it was to join the open source community at Uber. I was welcomed with open arms to both pitch ideas to Uber’s Open Source Committee and to execute on them.
In my role at Uber, I get to spend part of my time working with the Horovod open source community by participating in conferences, educating teams outside of Uber on how to use the software, and improving our offerings, taking into account both internal and external feature requests and working with contributors to incorporate their changes. Uber’s enthusiasm around open source is inspiring.
What advice would you give to an engineer who hasn’t open sourced anything yet and is debating whether or not to open source their code?
I would say, after you open source your code, follow up on every piece of user feedback, issue, or bug raised because they are way more important than your own opinions on your code. If you don’t nurture your community, it will be much harder to grow and improve your project beyond the scope of your own resources.
We want your feedback! Try out Horovod for yourself and accelerate your deep learning ecosystem.