Omphalos, Uber’s Parallel and Language-Extensible Time Series Backtesting Tool
January 24, 2018

Operating in more than 600 cities worldwide, Uber leverages aggregate data about user preferences to improve rider and driver experiences. Over time, these anonymously aggregated data points generate millions of time series, which can be viewed broadly or at a fine level of time granularity. These time series forecasts are essential to much of our decision making, from marketplace optimization and cost modeling to hardware capacity planning.
To enable fast, flexible, and accurate forecasting and provide greater reliability and consistency in our models, Uber’s Forecasting Platform team created Omphalos. Omphalos is a time series backtesting framework that generates efficient and accurate comparisons of forecasting models across languages and streamlines our model development process, thereby improving the customer experience. In this article, we discuss the design, implementation, and applications of this new framework.
Forecasting at Uber
When leveraging time series data, forecasting algorithms often require chronological testing, otherwise referred to as backtesting. Simply put, models applying these algorithms must not be trained on data from dates later than the start of the forecast horizon.
A common practice arbitrarily splits a time series into training and validation sets while preserving chronology, similar to randomly choosing one training pass (as displayed in Figure 1). Given the high variance of Uber's business metrics and the relatively short history of our data, however, we found this method unreliable. For example, because our business grows rapidly year over year, time series models trained with and without data from New Year's Eve show significant differences in forecasting performance.
Two forms of backtesting
To achieve a more consistent measure of model performance, we designed a backtesting procedure that applies cross-validation logic accounting for the sequential nature of the dataset.
This procedure applies two types of backtesting: sliding window and expanding window. While both have their use cases, the sliding window form achieves a favorable balance between model accuracy and training time, especially when testing high frequency data such as daily and hourly time series. The expanding window form, on the other hand, is used more often for weekly, monthly, or quarterly time series, where the number of historical points is limited.
Next, we outline how these two types of backtesting can be applied to forecasting:
Sliding window
The sliding window method requires three hyperparameters: training window size, forecasting window size (horizon), and sliding steps, detailed below (with a code sketch following the list):
- Training window size: the number of data points included in a training pass
- Forecasting window size: the number of data points to include in forecasting
- Sliding steps: the number of data points skipped from one pass to another
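To make the mechanics concrete, here is a minimal Go sketch of how sliding window passes can be enumerated. The names and types are illustrative only, not Omphalos' actual internals:

```go
package main

import "fmt"

// Pass describes one backtest pass as half-open index ranges into a series.
type Pass struct {
	TrainStart, TrainEnd int // training window: [TrainStart, TrainEnd)
	TestStart, TestEnd   int // forecast window: [TestStart, TestEnd)
}

// slidingWindows enumerates passes over a series of length n: a fixed-size
// training window slides forward by step points between passes, with the
// forecast window immediately following it.
func slidingWindows(n, trainSize, horizon, step int) []Pass {
	var passes []Pass
	for start := 0; start+trainSize+horizon <= n; start += step {
		passes = append(passes, Pass{
			TrainStart: start,
			TrainEnd:   start + trainSize,
			TestStart:  start + trainSize,
			TestEnd:    start + trainSize + horizon,
		})
	}
	return passes
}

func main() {
	// 30 daily points, a 14-day training window, a 7-day horizon, sliding by 7.
	for _, p := range slidingWindows(30, 14, 7, 7) {
		fmt.Printf("train [%d,%d) -> forecast [%d,%d)\n",
			p.TrainStart, p.TrainEnd, p.TestStart, p.TestEnd)
	}
}
```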
Expanding window
The expanding window form requires four hyperparameters: starting window size, ending window size, forecasting window size, and expanding steps, outlined below (with a companion sketch after the list):
- Starting window size: the number of data points included in the first training pass
- Ending window size: the number of data points included in the last training pass
- Forecasting window size: the number of data points included in forecasting
- Expanding steps: the number of data points added to the training time series from one pass to another
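Continuing the sliding window sketch above (and reusing its hypothetical Pass type), the expanding window variant grows the training window rather than sliding it:

```go
// expandingWindows enumerates passes whose training window is anchored at the
// start of the series and grows from startSize to endSize by step points per
// pass. Illustrative only; not Omphalos' actual code.
func expandingWindows(n, startSize, endSize, horizon, step int) []Pass {
	var passes []Pass
	for size := startSize; size <= endSize && size+horizon <= n; size += step {
		passes = append(passes, Pass{
			TrainStart: 0,
			TrainEnd:   size,
			TestStart:  size,
			TestEnd:    size + horizon,
		})
	}
	return passes
}
```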
Measuring forecast accuracy
Once a time series forecasting algorithm is tested using either of these two backtesting methods, Omphalos automatically generates arrays of forecasted values and corresponding actual values from which we can calculate various measures of forecast accuracy.
While these measures typically assume a single training pass, we needed to tailor them to better reflect model performance in a backtesting setting. Additionally, since Uber operates in hundreds of cities across the globe, model accuracy can vary considerably from city to city. For example, in a typical forecast looking two weeks ahead, we have observed a large range of Weighted Mean Absolute Percentage Error (wMAPE, a performance measure we use) when the same model is applied to different cities. Therefore, we needed a reliable way to summarize overall model performance to determine which model we should use.
To meet these challenges, we developed an automatic model evaluation strategy that compares the median wMAPE by city and training pass, while setting a threshold for the least accurate wMAPE and median bias. A histogram of wMAPE can also be used during a manual check, if necessary.
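The post does not give a formula for wMAPE, but a common formulation weights absolute errors by the magnitude of the actuals: wMAPE = Σ|actual − forecast| / Σ|actual|. A Go sketch under that assumption, continuing the file above:

```go
// wMAPE computes a weighted mean absolute percentage error for one training
// pass: the sum of absolute errors divided by the sum of absolute actuals
// (add "math" to the imports). This is one common formulation; the post does
// not document the exact weighting Uber uses.
func wMAPE(actual, forecast []float64) float64 {
	var absErr, absActual float64
	for i := range actual {
		absErr += math.Abs(actual[i] - forecast[i])
		absActual += math.Abs(actual[i])
	}
	return absErr / absActual
}
```

Per-pass, per-city values like these can then be aggregated into the median wMAPE comparisons described above.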
With this new technique at the ready, we just needed a backtesting tool to apply it.
Building Omphalos
While it is usually not difficult to implement a backtesting procedure in one language, it takes much more effort to align model performance measures across languages, or even between two data scientists on the same team. As a result, we decided to create a language-extensible framework that could apply our automatic model evaluation strategy; model performance should be comparable across languages as long as the same backtesting configuration is used.
We also needed the framework to effectively orchestrate dozens of algorithms running against hundreds (or even thousands) of time series while remaining lightweight. To accomplish this, we chose to write the framework in Go because of its robustness and scalability.
Incorporating language-extensibility
Given the variety of languages used by Uber’s engineering and data science teams, we needed to ensure that Omphalos could be used with languages beyond Go. In the industry at large, R has long been the default language in time series forecasting as a result of its well-known forecast package, and Python has emerged as an important player in this domain as well due to its well-supported machine learning packages like Scikit-Learn and TensorFlow. For these reasons, we decided to build interfaces to support these three languages with the capability to incorporate others in the future.
When configuring an algorithm for testing with Omphalos, we require three pieces of information (illustrated in the sketch after this list):
- The filepath to the algorithm entry code
- A model generation function that accepts the input data frame
- A forecast function that accepts the model and forecast horizon, and returns the quantile forecast
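As an illustration, these three requirements might map onto a configuration entry like the following Go struct. All field names and example values are hypothetical; the post does not document Omphalos' actual configuration schema:

```go
// AlgorithmConfig sketches how the three requirements above might appear in a
// backtest configuration. Names and example values are hypothetical.
type AlgorithmConfig struct {
	EntryPath    string // filepath to the algorithm entry code, e.g., "algos/arima.R"
	ModelFunc    string // model generation function that accepts the input data frame
	ForecastFunc string // forecast function that accepts the model and horizon
}
```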
The console layer orchestrates model training in different languages using Go's os/exec package. This system gives us the flexibility to test any algorithm that supports these requirements, reuse models that do not require retraining at every backtest step (e.g., neural networks), and use common libraries (e.g., applying the forecast package in R as the forecast function).
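For example, a console layer can shell out to an R algorithm with os/exec along these lines. The command-line convention shown here (training CSV path and horizon as arguments, forecasts printed to stdout) is an assumption for illustration, not Omphalos' actual contract:

```go
package main

import (
	"fmt"
	"os/exec"
)

// runPass invokes one training-and-forecast pass of an R algorithm as a
// subprocess via os/exec. The script reads the training CSV, fits a model,
// and prints its forecasts to stdout; this protocol is hypothetical.
func runPass(script, trainCSV string, horizon int) ([]byte, error) {
	cmd := exec.Command("Rscript", script, trainCSV, fmt.Sprint(horizon))
	return cmd.Output() // captured stdout holds the forecasted values
}
```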
So far, we have tested ten R, three Python, and four Go time series forecasting algorithms, as listed in Table 1.
Robustness and scalability
In addition to language-extensibility, we wanted Omphalos to be robust and scalable enough to meet Uber’s forecasting demands. More specifically:
- The tool should harness available computational capacity to provide faster iteration and better feedback to our users.
- It should be able to run locally on a workstation or remotely on a CPU/GPU enhanced server.
- It should be easy for users to adapt their algorithms to the framework's interface.
To meet these requirements, we designed the core of Omphalos as a command-line console layer that loads a user-defined configuration file and time series data, feeds that data into forecasting algorithms implemented in different languages, collects the forecasted values, and generates comprehensive backtesting reports.
Given Go's concurrent design, the console we built can coordinate and supervise many single-threaded R and Python algorithms to fully utilize all available CPUs. Omphalos also facilitates the concurrent backtesting of time series, with the console layer spawning X × Y goroutines for X algorithms and Y time series, each responsible for backtesting one algorithm against one time series.
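A minimal sketch of that fan-out pattern, assuming a run callback that performs one algorithm-series backtest (a production console would likely also bound concurrency and collect errors):

```go
package main

import "sync"

// backtestAll launches one goroutine per (algorithm, time series) pair,
// mirroring the X-by-Y scheme described above, and waits for all to finish.
func backtestAll(algos, series []string, run func(algo, ts string)) {
	var wg sync.WaitGroup
	for _, a := range algos {
		for _, s := range series {
			wg.Add(1)
			go func(algo, ts string) {
				defer wg.Done()
				run(algo, ts) // backtest one algorithm against one series
			}(a, s)
		}
	}
	wg.Wait()
}
```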
Forecasting results from each training pass are then stored separately and used to calculate summary metrics such as median wMAPE.
Using Omphalos
Omphalos has two main functionalities that allow us to more quickly and efficiently backtest time series: comparing models within and across languages to determine the best fit for a given dataset and leveraging our AutoForecaster API to test forecasting accuracy.
Model comparison
Given Uber's variety of forecasting use cases, we needed a tool that could seamlessly compare models across forecast metrics, algorithm runtimes, and error rates. With Omphalos, we can compare the performance of neural networks in TensorFlow (Python) against traditional statistical algorithms in R and a lightweight Go implementation that runs in roughly 500 milliseconds.
To test Omphalos’ efficiency, we compared eight popular forecasting algorithms in R to our AutoForecaster API (an in-house tool that provides accurate predictions for any given time series), using daily completed trip counts from twenty cities around the globe.
Given this capability, Omphalos has become an integral part of our model development process. Now, data scientists can compare their own model to many existing ones without any extra effort; once a given model is proven most effective, it can quickly be adopted by others.
AutoForecaster
Our Forecasting Platform team built AutoForecaster to provide accurate predictions for any given time series with minimal input, enabling data scientists across the company to apply forecasting to a broad set of use cases. To gauge the performance of this algorithm, however, we needed to test it on a wide range of time series over different time periods, and Omphalos was an excellent candidate for the job.
To test its performance, we wanted to determine whether AutoForecaster could provide forecasts for hundreds of thousands of time series across hundreds of user-defined dashboards, all at once. Our tests included building a histogram of error performance (as depicted in Figure 4), identifying types of time series that performed very well or very poorly, and ensuring that rejected time series (those that produce unreliable forecasts) were flagged. A simple calculation showed that running these forecasts serially, at an average of 2 seconds per forecast, would take 155 hours, or about 6.5 days (roughly 280,000 forecasts × 2 seconds ≈ 560,000 seconds).
By taking advantage of Omphalos' ability to orchestrate concurrent processes, however, we split the forecast jobs across 24 cores on a single server, with each core handling one time series at a time. With other methods, a single forecast may take only seconds, but testing time series at this scale often takes days. With Omphalos, running these forecasts in R and Python took only about a quarter of a day (155 / 24 ≈ 6.5 hours), a roughly 150-hour improvement over the serial process.
Next, we tested this dataset on Omphalos with our Go implementation of AutoForecaster. This resulted in faster performance (0.5 seconds per forecast on average) and better concurrency management, reducing the feedback time to one hour, a 154-hour improvement over the serial process. This type of performance improvement allows us to incorporate full dataset backtesting into a continuous integration pipeline, leading to more accurate forecasting.
Future work
Although Omphalos was originally designed as a language-extensible and parallelized backtesting framework for time series forecasting algorithms, we have identified two areas where we could further leverage this tool:
Constructing empirical prediction intervals
Prediction intervals generated by forecasting algorithms are typically too narrow, leading to an underestimation of risk; moreover, since we often use ensembles of forecasting models, it can be tricky to generate ensemble prediction intervals. Thus, we are exploring the possibility of using Omphalos to construct empirical prediction intervals to better inform forecast-related business decisions.
Onboarding anomaly detection algorithms
Forecasting and anomaly detection are inherently connected. In the future, we plan to add new features to Omphalos so that both sliding window and expanding window backtesting methods can be directly applied to anomaly detection algorithms. This will greatly expand our forecasting capabilities, in turn facilitating improved user experiences across our services.
If you are interested in pushing the boundaries of time series forecasting or anomaly detection, consider applying for a role on our team!
Header Image/Figure 3 Attribution: The Go Gopher is a mascot of the Go programming language and was created by Renée French. The R logo is used under the terms of CC-BY-SA 4.0, as instructed by The R Foundation.
Roy Yang
Roy Yang is a data scientist on Uber's Forecasting Platform team.