Engineering a Job-based Forecasting Workflow for Observability Anomaly Detection

May 16, 2018 / Global

At Uber, we combine real-time systems monitoring with intelligent alerting mechanisms to ensure the availability and reliability of our apps. In our push to empower our engineers to author more accurate alerts, Uber’s Observability Applications team sought to introduce alert backtesting—the ability to determine if, and when, a given alert configuration would have gone off in the past, thereby making it easier to predict future alerts.

Our team’s goal of building this functionality motivated us to entirely overhaul our anomaly detection platform’s workflow, reimagining the process by which engineers can both request new forecasts and ensure we account for past periods of time. The resulting pipeline, which revolves around a formalized notion of a forecasting job, enables intuitive and performant backfilling of forecasts, paving the way to more intelligent alerting.

uMonitor and alert backtesting

To maintain the reliability of our services, our on-call engineers need to be notified and alerted as quickly as possible in the event of an outage or other abnormal behavior. uMonitor, Uber Engineering’s alert authorship and management platform, enables that oversight by letting engineers define alerts on their service’s key metrics.

At its core, a uMonitor alert consists of a metric query, a set of alerting thresholds, and a set of one or more actions to execute in the event that the monitored metric exceeds its corresponding thresholds. These actions can be thought of as hooks that, for example, might allow the alert in question to page the team responsible for that service with a push notification for extreme metric deviations, while only sending a non-intrusive email for relatively minor ones. On top of these baseline attributes, uMonitor also provides authors with a suite of robust configuration options that further bolster alerting behavior in more nuanced ways—for example, by alerting with respect to dynamic thresholds (as opposed to traditional static ones) via anomaly detection, or by suppressing alerts until a threshold violation has been sustained for a certain amount of time.

Figure 1: uMonitor enables engineers at Uber to be notified of abnormal behavior in production.
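The core alert structure described above can be sketched in a few lines. This is an illustrative assumption, not uMonitor's actual schema; all field names, thresholds, and the `evaluate` method are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Illustrative sketch of a uMonitor-style alert: a metric query, a set of
# thresholds, and actions (hooks) to execute per severity. Names are assumed.
@dataclass
class Alert:
    metric_query: str                     # e.g. "svc.latency.p99" (hypothetical)
    warn_threshold: float                 # minor deviation -> low-urgency action
    critical_threshold: float             # extreme deviation -> page the on-call
    sustain_secs: int = 0                 # suppress until violation is sustained
    actions: Dict[str, Callable] = field(default_factory=dict)

    def evaluate(self, value: float) -> Optional[str]:
        """Return the severity this metric value triggers, if any."""
        if value >= self.critical_threshold:
            return "critical"
        if value >= self.warn_threshold:
            return "warn"
        return None
```

Registering a paging hook under `"critical"` and an email hook under `"warn"` would mirror the tiered actions described above.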


As a result of this flexibility, alerts at Uber comprise myriad free-floating parameters, all of which the alert author controls. In fact, uMonitor is so customizable that the Observability Applications team even uses it to monitor itself (via a separate instance, of course). For instance, the team configured an alert that pages its on-call engineer whenever global alert executions grind to a halt—an indicator that something likely went wrong in the system itself.

With this flexibility, however, comes complexity; for some teams, managing, maintaining, and refining their alerts on uMonitor can become as involved a project as the services they are monitoring. Furthermore, functionality such as anomaly detection, while crucial for certain classes of alerts, introduces a level of opacity that can be difficult for engineers to accept or trust. As such, it can sometimes be difficult for engineers to anticipate the circumstances under which a given alert might fire, or to judge whether their configuration is optimal; sometimes, the fix for a suboptimal alert is as simple as a configuration change hiding in plain sight.

With this challenge in mind, we sought to implement alert backtesting in uMonitor. This feature lets engineers “dry run” a given alert configuration to visualize how it would have responded to recent, historical production behavior, enabling a responsive feedback loop by which engineers can gradually fine-tune their alerts before finally saving them. We believe that alert backtesting will help remove some of the opacity and disconnect inherent in the alert authorship process, ultimately allowing engineers to define more accurate and actionable alerts for Uber’s services.

Inefficiency in isolation

Since uMonitor’s anomaly detection functionality relies on dynamic thresholds generated by a separate forecasting layer, F3, we were unable to simply roll out this feature to users. We realized that the set of assumptions that underpinned F3’s current workflow prevented us from releasing alert backtesting to all uMonitor users without imposing a severe burden on the underlying metrics system—the storage and query ecosystem that F3 both supports and is supported by.

Since its inception in late 2015, F3 had offered an exclusively iterative workflow: relative to a specific point in time, consumers could request a new forecast for the immediate near future. By design, F3’s sole responsibility was to generate new forecasts one by one, in isolation.

Backtesting alerts that utilize anomaly detection, on the other hand, requires the ability to guarantee that any given time series metric is fully covered by forecasts over a given time range. The ideal system would backfill historical forecasts for any sub-ranges that are missing them. Such functionality would have immediate applications. For instance, consider an engineer who would like to determine whether anomaly detection would improve their alerts’ signal-to-noise ratios. These alerts have never been subjected to anomaly detection, and therefore have no pre-existing historical forecasts in the forecast store to compare against. With a dedicated backfill API, the engineer could simply request that historical forecasts be created for their desired time range, immediately providing baseline forecast values for comparison.

Admittedly, a dedicated forecast backfill API is not strictly necessary, given that consumers can simply query the forecast store, which would provide them with all the information they need to compute the time ranges that are missing forecasts and send the corresponding requests to F3. In fact, we used this method to implement a preliminary version of uMonitor’s backtesting functionality. This approach, however, unearthed a more fundamental inefficiency with F3 that was impossible to work around so long as forecasting requests were handled in isolation.

That inefficiency related to F3’s relationship with the underlying metrics system. For each forecast request, F3 requires a specific set of historical data for the corresponding time series metric; this data forms the basis of F3’s subsequent forecasting computations. Historically, F3 queried the metrics system for those windows of historical data during every forecasting operation—even when the data requirements for those operations had significant overlap. Under this naive implementation, our team noticed that, asymptotically speaking, roughly 90 percent of all requested data would be redundant.

Figure 2: Having forecasting jobs that are responsible for the same time series at adjacent time ranges retrieve their own requisite data results in serious redundancy, burdening the metrics system.
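One way to see how the redundancy approaches 90 percent: if each forecasting operation fetches a trailing window of W hours of history, and operations for the same series run every s hours, then consecutive fetches share W − s hours of data. The numbers below are assumed purely for illustration, not Uber's actual parameters:

```python
# Illustrative arithmetic (assumed parameters): the fraction of each fetch
# that a previous fetch for the same time series has already retrieved.
def redundant_fraction(window_hours: float, step_hours: float) -> float:
    return 1.0 - step_hours / window_hours

# For example, a 10-hour history window refreshed hourly re-fetches
# 90 percent of its data on every operation.
```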


This redundancy represented a non-trivial amount of additional overhead on the metrics system that our team decided was unacceptable. We concluded that, were we to launch alert backtesting to uMonitor’s users without addressing these underlying issues, we would introduce the risk of intermittent and completely arbitrary spikes in load on the metric query service. If the query service, at some point, were unable to absorb such a spike, F3 risked bringing it down entirely, overwhelming the very ecosystem that it was intended to bolster.

Optimizing in aggregate

As such, our team’s efforts to implement universal alert backtesting in uMonitor required us to entirely re-approach F3’s forecasting workflow. We needed a design that supported not only the service’s canonical, iterative workflow, but also a separate bulk backfilling mechanism.

We noted that F3’s fundamental operation is to produce forecasts for time series metrics that are valid over bounded time ranges. In other words, while the service was able to individually execute separate forecasting operations in isolation, it was missing a layer of abstraction that would allow it to collectively process these operations.

With this requirement in mind, we decided to formally introduce the service to the notion of a forecasting job. This new formalized abstraction of F3’s fundamental operations brought the ability for the service to not just execute them, but also optimize around them. By concretely outlining the attributes that uniquely define and identify an atomic forecasting operation—attributes such as the query-string corresponding to the time series metric under forecast, the covered time range, as well as the specific model that the forecast values correspond to—the service could now perform computations on those jobs in aggregate. This new capability paves the way for several key optimizations that collectively lead to a much more performant, effective, and intuitive API for forecast backfilling.
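As a sketch, the job abstraction might look like a small value object keyed on exactly those identifying attributes. The field names here are assumptions, not F3's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of the forecasting-job abstraction: a job is uniquely
# identified by the metric's query string, the time range its forecast
# covers, and the model that produced the forecast values.
@dataclass(frozen=True)
class ForecastJob:
    query: str    # time series metric under forecast
    start: int    # covered range start (unix seconds, inclusive)
    end: int      # covered range end (unix seconds, exclusive)
    model: str    # forecasting model identifier

# Because jobs are hashable value objects, the service can now reason about
# them in aggregate: dedupe them, diff them against a historical log, etc.
```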

For instance, this job framework empowers the service to natively synchronize and coordinate forecasting operations. By keeping track of which jobs have been kicked off, completed, or are currently in progress—whether in some in-memory cache or a dedicated database layer—F3 can determine if any inbound request is redundant with another that was made previously or is currently being processed.
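A minimal sketch of that coordination, assuming an in-memory set stands in for the cache or database layer (a real deployment would use shared, persistent storage):

```python
# Hypothetical sketch of job-level coordination: a registry that records
# which jobs have run or are in flight, keyed by their identifying
# attributes, and rejects redundant inbound requests.
class JobRegistry:
    def __init__(self):
        self._seen = set()  # (query, start, end, model) tuples

    def submit(self, query: str, start: int, end: int, model: str) -> bool:
        """Return True if the job is new and should run, False if redundant."""
        key = (query, start, end, model)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```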

That same mechanism also allows the service to operate at a higher level of abstraction than its previous, more interactive use case. Consider, for example, the case where a consumer of F3 would like to guarantee for some metric that an entire time range has full forecast coverage. With an awareness of what forecasts have been created in the past, as well as which ones are currently being created, F3 can now efficiently determine which sub-ranges are missing forecasts and work to fill in the gaps with the appropriate number and specification of jobs.

Figure 3: By maintaining a persistent record of which forecasting jobs are either completed or in progress, F3 can infer what jobs are required in order to “fill in the gaps.” Overlaps between jobs for any given point in time, as shown here, are tolerable.
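The gap-filling computation amounts to interval subtraction: removing the already-covered sub-ranges from the requested range. A minimal sketch, representing each range as a half-open (start, end) pair:

```python
# Hypothetical sketch of gap computation: given the sub-ranges already
# covered by completed or in-flight jobs, find the sub-ranges of a requested
# window that still need forecasting jobs.
def missing_ranges(requested, covered):
    """requested: (start, end); covered: iterable of (start, end) pairs."""
    start, end = requested
    gaps = []
    cursor = start
    for c_start, c_end in sorted(covered):
        if c_start > cursor:
            gaps.append((cursor, min(c_start, end)))
        cursor = max(cursor, c_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps
```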


As previously discussed, this job framework lets F3 operate on forecasting operations in aggregate. This capability has immediate operational payoff by enabling data pooling. Consider, for instance, a case in which a given forecasting request results in multiple forecasting jobs, each with its own historical data requirements. If those jobs cover time ranges that are adjacent or near-adjacent, fulfilling each job’s data requirements in isolation results in significant redundancy in network traffic and query load. Jobs as abstractions over forecasting operations enable us to, for example, consolidate the data requirements of each job into one overarching set of query windows, as illustrated in Figure 4. This consolidated data request can then be sent all at once to the query service, cutting out significant unnecessary load while still ensuring that each forecasting job gets the data that it needs.

Figure 4: Allowing forecasting jobs to subselect their data post-query from a common pool minimizes unnecessary load on the metrics system due to forecasting.
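Consolidating the jobs' data requirements is, in essence, a merge of overlapping or adjacent intervals. A sketch, with each query window as a half-open (start, end) pair:

```python
# Hypothetical sketch of data pooling: merge the per-job data windows into
# the minimal set of query windows, so F3 can issue one consolidated
# request to the metrics system instead of one request per job.
def consolidate(windows):
    """Merge overlapping or adjacent (start, end) windows."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```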


Tying it all together: The new workflow

The introduction of the job data type lets us drastically generalize F3’s forecasting pipeline to natively and efficiently support the new backfilling use case. While the iterative workflow remains largely untouched, the optimizations beneath it are significant.

Now, with a historical log of all previously and currently executing forecast jobs, F3 can provide uMonitor and other consumers with a simple, intuitive API that asks only for the time range over which they would like coverage for a given time series. F3 can then simply partition that time range into its requisite jobs with respect to those sub-ranges that are already accounted for, whether by pre-existing forecasts or by jobs currently in progress. With this set of jobs, the service can now calculate the minimum comprehensive set of historical data that will account for all of its requirements.

By making a single request to the metrics system for that data, the service can then partition the results as needed, processing each job with only the slices it requires. Finally, each job’s completion is logged to F3’s historical job log, effectively communicating to all future forecasting requests that those time ranges are fully accounted for and need no further processing. This newly introduced capacity for data pooling in our production deployment of F3 has allowed us to reduce the service’s burden on the underlying time series metrics store by as much as 90 percent.

Figure 6: By introducing an ability to sub-select data from a common pool of historical data, we reduce F3’s burden on the underlying metrics store by as much as 90% (figure not drawn to scale).
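The post-query subselection step can be as simple as filtering the pooled series down to each job's own window. A sketch, assuming datapoints arrive as sorted (timestamp, value) pairs:

```python
# Hypothetical sketch of post-query subselection: after one consolidated
# fetch, each forecasting job slices out only the datapoints that fall in
# its required half-open [start, end) window.
def subselect(pooled, start, end):
    """pooled: list of (timestamp, value) pairs, sorted by timestamp."""
    return [(t, v) for t, v in pooled if start <= t < end]
```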


In other words, the introduction of one simple data structure to the system was the key to a suite of optimizations and derivative features that, collectively, led to not just a more intuitive service-level API, but also heightened reliability and efficiency.


Despite its trappings of time series forecasting and anomaly detection, this project ultimately came down to a question of API design. uMonitor’s need to support backtesting of alerts that use anomaly detection, without detrimentally affecting other crucial components of our observability ecosystem, forced us to entirely redesign F3’s workflow, giving rise to questions such as: what information or details are users responsible for providing? And, with those assumptions in place, what kind of interface would make the most sense from their perspective?

This design approach turned out to be crucial for building a solution that is both performant and intuitive. By deciding on the ideal end state irrespective of the current state of affairs and working backwards, we were able to pinpoint exactly which features were missing from the service that would enable the behavior we wanted. Those missing features boiled down to a single data structure: one that formalized the service’s fundamental operation and therefore allowed the service to optimize around it. That data structure, along with the optimizations built on its central idea, allowed us to achieve significant scalability gains for F3. As a result of this project, F3 and our Observability Anomaly Detection platform are in a much better position to address the increasingly sophisticated alerting and monitoring needs of our engineers and the Uber platform as a whole.

If you would like to empower engineers at Uber to monitor their production systems, the Observability Applications team would love to speak with you.

Many thanks to Shreyas Srivatsan on the Observability Applications team for advising this project.