Reliability at Scale: Engineering an Uneventful New Year’s Eve

30 November 2017 / Global
Software engineers Matt Schallert, Katie Tezapsidis, Carissa Blossom, and Tom Croucher discuss what it takes to prepare our systems for holidays and other high traffic events.
While most spend New Year’s Eve watching the ball drop or celebrating with friends (festive party hat in tow), Uber Engineering has historically treated the holiday like a final exam. With users worldwide relying on Uber for safe and reliable transportation to and from their celebrations, New Year’s Eve often marks our highest traffic event of the year, with Halloween as a close second. Our ability to handle user traffic on both holidays is the culmination of many months of planning, load testing, and future-proofing our systems.
Uber Engineering built our “Uber World View” tool to track demand across our global markets.
At Uber’s current scale, however, major events like Halloween and New Year’s Eve are anything but special. With lessons learned from holidays past and an ever-growing suite of on-call tools at the ready, our networks are available, reliable, and elastic enough to handle high traffic loads year-round.
On the back-end, several teams are responsible for maintaining the reliability of our networks, but two stand out—Site Reliability and Observability Engineering: 
Site Reliability Engineering (SRE) partners with Uber’s development teams to improve product reliability while growing and operating infrastructure at scale.
Observability Engineering is dedicated to providing metrics, tracing, and alerting for all the other engineering teams at Uber.
Located in offices across the globe, these teams work to ensure continuity and scalability across Uber’s services by creating and maintaining applications as well as triaging and fixing issues in real time.
Now that we know who is responsible for helping you make it to your New Year’s Eve soiree safely and on time (sequins intact), what does it take to prepare for holidays and other large-scale events? Below, members of our Site Reliability and Observability Engineering teams discuss how we keep our networks extensible and reliable in the face of peak traffic. 
 
Matt Schallert, Software Engineer, Observability Engineering
Figure 2: Uber Engineering maintains several dashboards that help us gauge the health of our services. These come in particularly handy during our preparations for high traffic events like New Year’s Eve and Halloween.
What is your role on the Observability Engineering team?
Matt Schallert (MS): I work on building and operating Uber’s metrics infrastructure to give engineers visibility into high resolution metrics across billions of unique time series. Teams at Uber use these metrics to measure everything happening with a service and the infrastructure it runs on. These metrics drive real-time alerts and anomaly detection, and track the health of services, hardware, and infrastructure.
In the lead up to large-scale events, engineers need to measure more things than normal and we have to scale to that demand. Our goal is to allow every engineer to measure, tune and mitigate to make sure we always provide the best user experience possible, even during times of peak traffic.
Why are Halloween and New Year’s Eve such important holidays for Uber?
MS: These are important nights for Uber because many people rely on us to have a safe, efficient, and less stressful way of getting around. Particularly in the United States, Halloween helps us test our services and infrastructure for the traffic increase we see globally during New Year’s Eve. In this way, Halloween and New Year’s Eve planning are deliberately and strategically tied together. Since Halloween precedes New Year’s Eve by a few months, it gives us time to apply our learnings from October to get ready for what is historically our biggest night of the year.
How does Uber load test in preparation for high traffic events? 
MS: The Site Reliability Engineering team runs large-scale event drills to simulate the platform running at our predicted trip volume for the event. This simulated trip volume exercises every dependent service and flow of the user experience from requesting a ride to billing after the end of a trip. If a service begins to degrade during the drill, we pause the exercise to address the issue. 
It should come as no surprise that our team’s highest priority during these tests is to fix any bottlenecks before the next one so that we can continue to iterate and strengthen our network. The frequency of these drills increases as we get closer to the event, until we’re eventually running drills multiple times a day.
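The drill loop described above can be sketched roughly as follows. This is a minimal illustration only, not Uber’s tooling: the function `run_drill`, the `error_rate_at` callback, the step count, and the 1 percent error threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Optional, Tuple


def run_drill(
    target_rps: int,
    steps: int,
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> Tuple[List[int], Optional[int]]:
    """Ramp simulated load toward target_rps in equal steps.

    Returns the load levels that passed and, if the drill was paused,
    the load level at which degradation was observed (otherwise None).
    """
    passed: List[int] = []
    for step in range(1, steps + 1):
        load = target_rps * step // steps
        # error_rate_at stands in for real measurements taken during a drill.
        if error_rate_at(load) > max_error_rate:
            return passed, load  # pause so the bottleneck can be addressed
        passed.append(load)
    return passed, None


# Hypothetical service that starts failing above 800 requests per second.
def fake_service(load: int) -> float:
    return 0.0 if load <= 800 else 0.05


passed, paused_at = run_drill(target_rps=1000, steps=5, error_rate_at=fake_service)
print(passed, paused_at)
```

Returning the load level at which the drill paused gives the next drill a concrete number to beat once the bottleneck has been fixed.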
 
Katie Tezapsidis, Software Engineer, Observability
Katie Tezapsidis (L) and Matt Schallert (R), members of Uber Engineering’s Observability team, discuss new tools and techniques for making monitoring system health more efficient.
Can you give us an overview of the preparations required in the lead up to Halloween and NYE?
Katie Tezapsidis (KT): The capacity planning process starts months in advance, with Uber’s data scientists using sophisticated machine learning techniques to analyze information such as the number of riders we have supported on the platform during previous high traffic events. This analysis is then used to develop a forecasting model that compares the current year’s trajectory with the previous year’s. This process gives us two bands, an upper predicted value and a lower predicted value. We then conduct multiple load tests at the highest predicted value leading up to the event. We also run these load tests with failure simulations, such as failed regions, to help us anticipate potential technical issues and to test our response processes.
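To make the idea of an upper and lower band concrete, here is a deliberately simple sketch. It is not the forecasting model described above: it assumes, purely for illustration, that the band is obtained by scaling last year’s peak by the smallest and largest year-over-year growth ratios seen in recent weeks. The function name `forecast_band` and all of the numbers are hypothetical.

```python
from typing import Sequence, Tuple


def forecast_band(
    last_year_peak: float,
    this_year_weekly: Sequence[float],
    last_year_weekly: Sequence[float],
) -> Tuple[float, float]:
    """Scale last year's peak by the min and max year-over-year growth ratios."""
    if len(this_year_weekly) != len(last_year_weekly) or not this_year_weekly:
        raise ValueError("need matching, non-empty weekly series")
    ratios = [cur / prev for cur, prev in zip(this_year_weekly, last_year_weekly)]
    return last_year_peak * min(ratios), last_year_peak * max(ratios)


# Made-up weekly trip counts for the same weeks in two consecutive years.
lower, upper = forecast_band(
    last_year_peak=1000.0,
    this_year_weekly=[150.0, 180.0, 200.0],
    last_year_weekly=[100.0, 100.0, 100.0],
)
load_test_target = upper  # load tests run at the highest predicted value
print(lower, upper, load_test_target)
```

Testing at the upper value rather than the midpoint leaves headroom in case real traffic grows faster than the trend suggests.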
How do individual teams prepare for large-scale load testing?
KT: In addition to system-wide drills, individual teams load test their own services independently by using Uber’s self-service load test platform. Our load testing platform provides a harness for executing and scaling up load tests written in a variety of languages. For example, a microservice that indexes nearby drivers will index large volumes of geographically-dispersed mock drivers to validate workload at a higher volume. This on-demand platform allows teams to run drills on a regular basis and during specific events like canary deployments.
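As a rough illustration of the mock-driver example, the sketch below generates geographically dispersed mock drivers and indexes them by location. Everything in it is an assumption made for the example: the names `mock_drivers` and `build_index`, the one-degree grid used as a stand-in for a real geospatial index, and the driver count.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Driver = Tuple[str, float, float]  # (driver id, latitude, longitude)


def mock_drivers(count: int, seed: int = 0) -> List[Driver]:
    """Generate mock drivers spread uniformly over the globe."""
    rng = random.Random(seed)  # seeded so that repeated runs are comparable
    return [
        (f"mock-driver-{i}", rng.uniform(-90.0, 90.0), rng.uniform(-180.0, 180.0))
        for i in range(count)
    ]


def build_index(drivers: List[Driver], cell_degrees: float = 1.0) -> Dict[Cell, List[str]]:
    """Bucket drivers into a simple latitude/longitude grid."""
    index: Dict[Cell, List[str]] = defaultdict(list)
    for driver_id, lat, lng in drivers:
        cell = (int(lat // cell_degrees), int(lng // cell_degrees))
        index[cell].append(driver_id)
    return index


index = build_index(mock_drivers(10_000))
indexed_total = sum(len(ids) for ids in index.values())
print(len(index), indexed_total)
```

A real load test would also measure latency and resource usage while raising the driver count. This sketch only shows the shape of the workload.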
All of these preparations are important, but what if something happens to our systems during production? How does Uber detect performance degradations in real time?
KT: The most visible degradations are detected by a system that runs outside our network and exercises product flows. In addition, each team monitors its services using extensive whitebox monitoring, which looks at the state of our internal systems. We have created over 100,000 real-time alerts that will automatically mitigate these issues. If any mitigation is unsuccessful, our On-Call Dashboard will page an on-call engineer to resolve the issue.
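The escalation path described here, where an alert first attempts automatic mitigation and pages a person only if that fails, might look something like the following sketch. The `Alert` class, the `evaluate` function, and the alert names are hypothetical and are not taken from Uber’s alerting system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    name: str
    is_firing: Callable[[], bool]  # checks a metric against a threshold
    mitigate: Callable[[], bool]   # returns True if automatic mitigation worked


def evaluate(alerts: List[Alert], page: Callable[[str], None]) -> List[str]:
    """Run each firing alert's mitigation; page on-call only if it fails."""
    paged: List[str] = []
    for alert in alerts:
        if not alert.is_firing():
            continue
        if not alert.mitigate():
            page(alert.name)
            paged.append(alert.name)
    return paged


# Hypothetical alerts: one quiet, one mitigated automatically, one escalated.
pages: List[str] = []
alerts = [
    Alert("cache-hit-rate", is_firing=lambda: False, mitigate=lambda: True),
    Alert("queue-depth", is_firing=lambda: True, mitigate=lambda: True),
    Alert("db-latency", is_firing=lambda: True, mitigate=lambda: False),
]
paged = evaluate(alerts, page=pages.append)
print(paged)
```

Keeping the paging step behind the mitigation step means an engineer is only interrupted for problems the automation could not resolve.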
 
Carissa Blossom, Software Engineer, Marketplace SRE
Carissa Blossom is a site reliability engineer on the Uber Marketplace team.
The Uber Marketplace team is responsible for tackling some of the company’s hardest quantitative problems by creating and maintaining technologies that power our real time marketplace of drivers, riders, and other users. How does the Marketplace team prepare for high traffic events like Halloween and NYE? What precautions do you take to future-proof our systems?
Carissa Blossom (CB): For major holidays like Halloween and New Year’s Eve, we project traffic one and a half months in advance to ensure that we can handle demand with comfortable wiggle room in case real traffic scales faster than our projections. As far as specific preparations go, transitioning from bare metal servers to containerized services running on Mesos has been a major part of our future-proofing, and it has enabled us to auto-scale our services.
How far in advance do you prepare for major holidays like Halloween and NYE? 
CB: In the past, we started preparing for these events months in advance, and on the big days themselves, people would take on special shifts to monitor the system. We have now progressed to the point that these systems are already well-prepared for major events and little additional work is needed on the day of these events.
My favorite SRE analogy draws the connection with the preparedness element of firefighting. Like firefighting in the physical world, the majority of SREs’ jobs aren’t about putting out that one big fire but about working to prevent these fires by developing and maintaining the systems, like sprinklers in your home, which help mitigate issues as quickly and efficiently as possible.
What is the trickiest part about preparing for major holidays? 
CB: High traffic events are both exciting and challenging because they are the main test for the extensibility of our systems. The less work that needs to be done in the days leading up to big events, the better our preparations. Uber Engineering has reached a place where insanely high traffic events like Halloween and New Year’s Eve come and go without SRE having to do anything extra or out of the ordinary for our apps to work smoothly. 
Our preparations are considered successful if the user experience on these holidays is as reliable and seamless as at any other time a trip is requested. On the flip side, the most rewarding part of high traffic events for me is how connected to the customer base they make you feel. I love hearing from drivers about how well the day went for them and knowing that I played a part in making that happen.
What are some of the day-to-day preparations the SRE team does to ensure that our systems are ready for primetime? 
CB: One important cross-team effort is our Capacity Safety Team, which includes volunteers from Marketplace, Maps, SRE, and other teams involved in the core Uber trip flow. We run weekly drills using Hailstorm, an internal tool that uses test accounts to simulate extra traffic on the system. The point of these drills is to stress the system up to a set number of riders and drivers on trips in a single data center, thereby identifying any issues that might emerge live under similar conditions.
Through these drills, we have been able to make major holidays a non-event because we have already tested that the system and microservices can easily handle the holiday traffic. As our services evolve, we’re focusing less on preparing for a single event and more on empowering the system to scale itself. 
 
Tom Croucher, Staff Software Engineer, Site Reliability Engineering
Tom Croucher is a staff software engineer on Uber’s Site Reliability Engineering team, responsible for improving and maintaining reliability across our services.
How did this year’s Halloween preparations compare to previous years?
Tom Croucher (TC): I joined Uber in 2014, and during my first few years, it was all hands on deck: everyone was at the office, actively monitoring. Since then, our preparation for high traffic events has matured to keep up with the scale and scope of our services. This year, I was on-call on the Saturday before Halloween and was able to sign off before peak traffic, which speaks to our level of preparedness.
From a technical perspective, success is experiencing firsthand the abilities of these systems to remain robust no matter the circumstances. Every year we have an informal bet about when our peak will be; this year we weren’t even watching the system to see who’d won because we were so confident we’d be notified if anything went off the rails. 
How many teams are involved in preparing our systems for Halloween and NYE? 
TC: All of them. Historically, we have always done less prep for New Year’s Eve than for Halloween because Halloween is our test run. We can use metrics from Halloween to gauge how much traffic we can expect on New Year’s Eve, and as such, how each team should prepare.
One thing that is great about Uber Engineering culture is that engineers who work on a particular system are responsible for running that system. For core systems, we have a number of SRE teams embedded on feature and product engineering teams that partner on load testing, infrastructure, capacity, and performance analysis. Our core SRE team gives non-SREs the technical tools to succeed, but we let them solve their problems since they are in the right place to understand the specifics of their systems.
What is the biggest piece of advice you would give other companies preparing for their own high traffic events? 
TC: Ultimately, a lot of your success comes down to planning. If you’re deliberate about how you build and grow your systems with reliability and scalability in mind, events like Halloween and New Year’s Eve aren’t a big deal; they happen every year! While traffic growth isn’t 100 percent predictable, the trends and patterns from year to year become something you can understand.
If you’re prepared, nothing should be a surprise. Moving from a reactionary mindset to a planning mindset requires discipline but it is so worth it once you get there.
https://www.youtube.com/watch?v=KxDWs1JRU70?rel=0&showinfo=0
If engineering an uneventful New Year’s Eve interests you, consider applying for a role on our team! 
Matt Schallert and Katie Tezapsidis are software engineers on Uber’s Observability Engineering team. Carissa Blossom is a software engineer on Uber’s Marketplace SRE team, and Tom Croucher is a staff engineer on Uber’s SRE team. 
Molly Vorwerck
Molly Vorwerck is the Eng Blog Lead and a senior program manager on Uber's Tech Brand Team, responsible for overseeing the company's technical narratives and content production. In a previous life, Molly worked in journalism and public relations. In her spare time, she enjoys scouring record stores for Elvis Presley records, reading and writing fiction, and watching The Great British Baking Show.
Posted by Molly Vorwerck