Analyzing Experiment Outcomes: Beyond Average Treatment Effects

7 November 2018 / Global

At Uber, we test most new features and products with the help of experiments in order to understand and quantify their impact on our marketplace. The analysis of experimental results traditionally focuses on calculating average treatment effects (ATEs). 
Since averages reduce an entire distribution to a single number, however, any heterogeneity in treatment effects will go unnoticed. Instead, we have found that calculating quantile treatment effects (QTEs) allows us to effectively and efficiently characterize the full distribution of treatment effects and thus capture the inherent heterogeneity in treatment effects when thousands of riders and drivers interact within Uber’s marketplace. 
Besides providing a more nuanced picture of the effect of a new algorithm, this analysis is relevant to our business because people remember negative experiences more strongly than positive ones (see Baumeister et al. (2001)). In this article, we describe what QTEs are, how exactly they provide additional insights beyond ATEs, why they are relevant for a business such as Uber’s, and how we calculate them.
Differentiating between QTEs and ATEs
To better understand how QTEs differ from ATEs, let us focus on a specific example. Assume that we want to analyze the impact of an improved algorithm for matching a rider with the most appropriate driver given a specific destination. 
For this hypothetical example, assume that the outcome metric of interest is the time it takes the driver to pick up the rider, also called the estimated time to arrival (ETA). Using the potential outcomes framework developed by Professor Donald B. Rubin (see Imbens and Rubin (2015)), we denote the assignment of rider $i$ to the treatment algorithm with $W_i = 1$ and $W_i = 0$ otherwise. We denote the potential outcome for each individual as $Y_i(W_i)$. That is, $Y_i(0)$ is the ETA for rider $i$ under the incumbent or control algorithm, and $Y_i(1)$ is the ETA under the new or treatment algorithm. Of course, we only observe one outcome for rider $i$ because we cannot assign them to both the new and the old algorithm. We denote the observed outcome as $Y_i^{obs}$ with
$$Y_i^{obs} = Y_i(W_i) = W_i \, Y_i(1) + (1 - W_i) \, Y_i(0).$$
Also, define the function
$$F_{Y(w)}(y) = \Pr\left(Y_i(w) \le y\right)$$
with $w \in \{0, 1\}$. In other words, $F_{Y(1)}$ is the cumulative distribution function (CDF) of ETAs under the new algorithm, and $F_{Y(0)}$ is the CDF of ETAs under the incumbent algorithm.
By far the most widely used way of characterizing the difference in outcomes is by focusing on the (population) ATE, i.e., $\tau = E\left[Y_i(1) - Y_i(0)\right]$.
Even though we do not observe the same rider under both algorithms, assuming the experiment design satisfies a set of regularity assumptions, we can estimate the ATE by comparing the average ETA of those exposed to the new algorithm to the average ETA of those exposed to the incumbent algorithm.
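As a minimal sketch of this difference-in-means estimator, the snippet below uses simulated ETAs in seconds. The function name `estimate_ate` and all numbers are assumptions made for illustration, not part of any production analysis.

```python
import random
from statistics import mean


def estimate_ate(treated_outcomes, control_outcomes):
    """Difference in sample means between the treatment and control groups."""
    return mean(treated_outcomes) - mean(control_outcomes)


# Hypothetical ETAs in seconds; both groups share the same true mean (300),
# but the treatment group has a wider spread.
rng = random.Random(0)
control_etas = [rng.gauss(300, 60) for _ in range(5000)]
treated_etas = [rng.gauss(300, 90) for _ in range(5000)]

ate_hat = estimate_ate(treated_etas, control_etas)
print(f"Estimated ATE: {ate_hat:.2f} seconds")
```

Because both simulated groups have the same true mean, the estimate should land close to zero even though the two distributions differ in spread.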
Averages are effective at summarizing a lot of information into a single number. We might learn, for example, that the average ETA of the new algorithm is no different from the average ETA of the old algorithm (an ATE of zero). But does this really mean there is no meaningful difference between the two algorithms? Given the large amount of aggregated and anonymized data leveraged by teams at Uber, can we do better than just analyzing the ATE?
ATEs do not allow us to understand heterogeneity in treatment effects
Precisely because averages reduce all information into a single number, they can mask some of the subtleties of the underlying distributions. For example, imagine that Figure 1, below, depicts the distribution of ETAs across riders for the treatment group (blue solid line) and the control group (red dashed line). Both distributions have the same mean and, thus, the ATE would be zero. However, the figure also reveals that the right-hand tail of ETAs under the new algorithm is much fatter than under the old algorithm. That is, a number of riders experience ETAs far longer than the longest ETAs under the old algorithm. These longer ETAs under the new algorithm are balanced by many shorter ETAs, as seen by the increased mass towards the left tail of the treatment distribution.
Figure 1: The result of a hypothetical experiment shows that the distribution of ETAs generated by the new algorithm is wider than the one generated under the old algorithm. Both short and long ETAs are more common under the new algorithm. 
Note that this heterogeneity in treatment effects across riders need not be due to observable components like the location of the request, the time of day, or the weather. If this were the case, we could imagine a slightly more complex experiment analysis that would control for these factors and might yield sufficiently informative ATEs conditional on them. But in fact, the sheer number of drivers and riders interacting with each other in Uber’s marketplace suggests that there will be heterogeneity in treatment effects that cannot be explained by any observable factors. It is in this scenario that QTEs really provide additional insights not found by simply looking at the ATE, even after conditioning on any imaginable observable factor.
Ignore this heterogeneity at your own peril
But even if there are differences in treatment effects across riders, do they matter for the business? Is it business-relevant that some riders experience longer ETAs under the new algorithm, or is all that matters that riders experience no difference in ETAs on average? 
Since most riders interact with the Uber platform on multiple occasions, they experience different ETAs over time. And research suggests that negative experiences loom larger in people’s memories than positive experiences. That is, even though a given rider experiences on average the same ETAs generated by the new algorithm, the fact that there will be a number of ETAs that are longer than under the incumbent algorithm may lead to that particular rider thinking that ETAs have gotten worse. This implies that accounting for the difference in outcome distributions beyond comparing the average ETAs is important for the business, which is where QTEs enter the picture.
Quantile treatment effects allow us to capture this heterogeneity
In order to capture the idea that long ETAs have gotten longer, we define the QTE as the difference in a specific quantile $q$ of the outcome distribution under treatment and that same quantile of the outcome distribution under control. That is,
$$\tau_q = F_{Y(1)}^{-1}(q) - F_{Y(0)}^{-1}(q),$$
where $F_{Y(w)}^{-1}$ denotes the quantile function (the inverse CDF) of ETAs under the treatment ($w = 1$) or control ($w = 0$) algorithm.
Using the same distributions for ETAs as in Figure 1, Figure 2, below, depicts graphically the QTE for the 95th percentile, i.e., $\tau_{0.95}$. Note that the QTE defined in this way cannot tell us what the difference in ETA for a specific rider is. In other words, the QTE as defined here does not allow us to learn how long the ETA generated by the new algorithm is for a specific rider whose ETA was at the 95th percentile under the incumbent algorithm. It only allows us to compare the 95th percentile of ETAs in the distribution across all riders for the treatment group to the 95th percentile in the distribution across all riders of the control group. But because we do not observe the same rider under both algorithms, we cannot say anything about the correlation between $Y_i(0)$ and $Y_i(1)$ for a given rider $i$ (without making any further assumptions). Thus, all we can hope to learn from an experiment is information about the marginal distributions of the outcomes of interest.
Figure 2: The 95th percentile ETA under the new algorithm is larger than the 95th percentile ETA under the incumbent algorithm, leading to a positive QTE. 
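To make the definition concrete, here is a minimal sketch that estimates the QTE at the 95th percentile by differencing the empirical quantiles of the two groups. The helper names and the simulated ETAs are assumptions for illustration only. The quantile convention used (the smallest observed value whose empirical CDF reaches the target) is one of several reasonable choices.

```python
import math
import random


def empirical_quantile(values, q):
    """Smallest observed value whose empirical CDF is at least q (0 < q <= 1)."""
    ordered = sorted(values)
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]


def estimate_qte(treated, control, q):
    """Difference between the q-th quantiles of the two outcome samples."""
    return empirical_quantile(treated, q) - empirical_quantile(control, q)


# Hypothetical ETAs in seconds: same mean, wider spread under treatment.
rng = random.Random(1)
control_etas = [rng.gauss(300, 60) for _ in range(5000)]
treated_etas = [rng.gauss(300, 90) for _ in range(5000)]

qte_95 = estimate_qte(treated_etas, control_etas, 0.95)
print(f"Estimated QTE at the 95th percentile: {qte_95:.1f} seconds")
```

In this simulation the average ETAs are essentially identical across groups, yet the estimated QTE at the 95th percentile is clearly positive, which is the pattern the figure illustrates.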
Given the large amounts of data that we can analyze after an experiment, we can, of course, calculate the QTE for many different quantiles, for example from the 1st through the 99th. If we plot all of them in a single figure, the resulting figure might look like Figure 3, below:
Figure 3: Plotting the QTEs on the vertical axis against the quantiles shows that they are negative up until about the 60th percentile and positive above the 60th percentile. This is another way of seeing that both short and long ETAs are more frequent under the new algorithm compared to the incumbent one. 
The figure shows that, as seen from the inspection of the two different outcome distributions in Figure 1, the QTE was negative for low quantiles and positive for high quantiles. In other words, short and long ETAs are both more frequent under the new algorithm. 
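A curve like this can be sketched by repeating the quantile comparison across percentiles. The snippet below is an illustration on simulated data, with hypothetical helper names. Because the simulated distributions are symmetric, the sign change occurs near the median rather than near the 60th percentile shown in the figure.

```python
import math
import random


def empirical_quantile(ordered, q):
    """Quantile of an already sorted list: smallest value with empirical CDF >= q."""
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]


def qte_curve(treated, control, percentiles=range(1, 100)):
    """Map each percentile p to the estimated QTE at quantile p / 100."""
    treated_sorted = sorted(treated)
    control_sorted = sorted(control)
    return {
        p: empirical_quantile(treated_sorted, p / 100)
        - empirical_quantile(control_sorted, p / 100)
        for p in percentiles
    }


# Hypothetical ETAs in seconds: same mean, wider spread under treatment.
rng = random.Random(2)
control_etas = [rng.gauss(300, 60) for _ in range(5000)]
treated_etas = [rng.gauss(300, 90) for _ in range(5000)]

curve = qte_curve(treated_etas, control_etas)
for p in (5, 25, 50, 75, 95):
    print(f"QTE at the {p}th percentile: {curve[p]:+.1f} seconds")
```

Sorting each sample once and reusing it for all percentiles keeps the cost of the full curve close to that of a single quantile comparison.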
Figures like this have allowed us to gain much more nuanced insights into the impacts of our experiments at Uber. For example, analyses of QTEs have allowed us to detect deteriorations to our marketplace from specific algorithms. These deteriorations occurred at extreme outcomes for a metric and were easily detected in the QTE at the 95th percentile. At the same time, the ATE was small enough so as not to raise any concerns.
Calculating QTEs through quantile regression
Similar to using linear regression to calculate ATEs, we can use quantile regression to calculate QTEs (see Koenker (2005)). One advantage of doing so is the ability to rely on existing literature, cited below, that develops robust inference methods for the estimates, comparable to robust inference for linear regression.
Whereas linear regression models the conditional mean function of the outcome of interest, quantile regression models the conditional quantile function. In order to estimate the QTE, we specify the conditional quantile function
$$Q_{Y^{obs}}(q \mid W_i) = \alpha(q) + \beta(q) \, W_i,$$
where $W_i$ is the treatment indicator.
Then $\alpha(q) = F_{Y(0)}^{-1}(q)$ and $\beta(q) = F_{Y(1)}^{-1}(q) - F_{Y(0)}^{-1}(q) = \tau_q$ (see Koenker (2005)). Thus, a quantile regression of the outcome of interest on a constant and a treatment indicator allows us to estimate the QTE at the $q$-th quantile, just like a linear regression of the same type estimates the ATE.
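The equivalence between this regression and a simple comparison of group quantiles can be checked numerically. With only a constant and a treatment indicator, the check-loss objective splits into one term per group, so the sketch below minimizes each term by brute force over the observed values. All function names and data are hypothetical, and the group size (101) is chosen so that the minimizer at the 90th percentile is unique.

```python
import math
import random


def check_loss(u, q):
    """Check (pinball) loss: q * u for u >= 0 and (q - 1) * u for u < 0."""
    return u * (q - (1.0 if u < 0 else 0.0))


def total_loss(values, level, q):
    return sum(check_loss(v - level, q) for v in values)


def fit_constant_plus_indicator(y, w, q):
    """Quantile regression of y on a constant and a 0/1 indicator w.

    The objective separates by group, and a minimizer can always be found
    among the observed values, so a search over data points suffices here.
    """
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    alpha = min(control, key=lambda a: total_loss(control, a, q))
    treated_level = min(treated, key=lambda c: total_loss(treated, c, q))
    return alpha, treated_level - alpha


def empirical_quantile(values, q):
    ordered = sorted(values)
    return ordered[max(math.ceil(q * len(ordered)) - 1, 0)]


rng = random.Random(3)
w = [0] * 101 + [1] * 101
y = [rng.gauss(300, 60) for _ in range(101)] + [rng.gauss(300, 90) for _ in range(101)]

q = 0.9
alpha_hat, beta_hat = fit_constant_plus_indicator(y, w, q)
qte_direct = empirical_quantile(y[101:], q) - empirical_quantile(y[:101], q)
print(f"Regression coefficient: {beta_hat:.3f}, quantile difference: {qte_direct:.3f}")
```

The brute-force search is quadratic in the group size and only meant to illustrate the equivalence, not to scale.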
Similar to linear regression coefficients, quantile regression coefficients can be determined as the solution to a specific optimization problem. For a given quantile $q$, the coefficient estimates $\hat{\alpha}(q)$ and $\hat{\beta}(q)$ are the solution to
$$\left(\hat{\alpha}(q), \hat{\beta}(q)\right) = \arg\min_{a, b} \sum_{i=1}^{N} \rho_q\left(Y_i^{obs} - a - b \, W_i\right),$$
where $N$ is the number of observations,
$$\rho_q(u) = u \left(q - 1\{u < 0\}\right),$$
and $1\{\cdot\}$ is the indicator function (see Koenker (2005)). In contrast to the case of linear regression, the objective function for quantile regression is not differentiable, and there are several different ways of calculating the minimum. One possibility is to write the minimization problem as a linear program and use an appropriate solver. At Uber, however, we solve the optimization through an algorithm suggested by David R. Hunter and Kenneth Lange in an article for the Journal of Computational and Graphical Statistics (Hunter and Lange (2000)). By developing an efficient implementation of this algorithm using optimized linear algebra routines, we have found that this algorithm scales quite well to the often millions of observations we need to analyze for a single experiment.
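To give a flavor of the majorize-minimize (MM) idea, here is a minimal pure-Python sketch for the special case of a constant plus a treatment indicator. It is not Uber's implementation and omits refinements discussed in the paper. The assumption made here is that each iteration minimizes a quadratic surrogate of the check loss, which reduces to weighted least squares with weights `1 / (eps + |residual|)` and an extra linear term `(2q - 1)`. The function name, `eps`, the iteration limits, and the data are all illustrative choices.

```python
import math
import random


def mm_quantile_regression(y, w, q, eps=1e-6, max_iter=300, tol=1e-8):
    """MM-style iterations for the model  quantile_q(y | w) = a + b * w,  w in {0, 1}."""
    n = len(y)
    n_treated = sum(w)
    shift = 2.0 * q - 1.0
    a, b = sum(y) / n, 0.0  # start from the overall mean and no effect
    for _ in range(max_iter):
        s_w = s_wt = s_wy = s_wty = 0.0
        for yi, wi in zip(y, w):
            # Weight from the current residual; eps keeps it finite.
            weight = 1.0 / (eps + abs(yi - a - b * wi))
            s_w += weight
            s_wy += weight * yi
            if wi:
                s_wt += weight
                s_wty += weight * yi
        # Solve (X'WX) theta = X'Wy + (2q - 1) X'1 for X = [1, w] in closed form.
        rhs0 = s_wy + shift * n
        rhs1 = s_wty + shift * n_treated
        det = s_wt * (s_w - s_wt)
        new_a = s_wt * (rhs0 - rhs1) / det
        new_b = (s_w * rhs1 - s_wt * rhs0) / det
        converged = abs(new_a - a) + abs(new_b - b) < tol
        a, b = new_a, new_b
        if converged:
            break
    return a, b


def empirical_quantile(values, q):
    ordered = sorted(values)
    return ordered[max(math.ceil(q * len(ordered)) - 1, 0)]


# Hypothetical ETAs in seconds: same mean, wider spread under treatment.
rng = random.Random(4)
w = [0] * 1000 + [1] * 1000
y = [rng.gauss(300, 60) for _ in range(1000)] + [rng.gauss(300, 90) for _ in range(1000)]

a_hat, b_hat = mm_quantile_regression(y, w, 0.95)
qte_direct = empirical_quantile(y[1000:], 0.95) - empirical_quantile(y[:1000], 0.95)
print(f"MM estimate of the QTE: {b_hat:.2f}, direct quantile difference: {qte_direct:.2f}")
```

Each iteration only needs weighted sums over the data, which is the kind of computation that maps naturally onto optimized linear algebra routines when the model has more covariates.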
By characterizing quantile regression coefficients as the solution to a minimization problem, we can derive their limiting distributions using the theory for (non-differentiable) M-estimators. With the limiting distribution, we can then derive confidence intervals for the QTEs. Similar to the case for linear regression, a number of robust inference results are available in the literature. For example, there are results for inference that is robust to heteroskedasticity (Kim and White (2003)), autocorrelation (Gregory et al. (2018)), and clustering (Parente and Santos Silva (2015)).
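As a simple illustration of putting an interval around a QTE estimate, the sketch below uses a percentile bootstrap with independent resampling within each group. This is a swapped-in technique for illustration, not one of the analytic or robust methods cited above, and it would not be appropriate for autocorrelated or clustered data. Function names, the number of replications, and the data are assumptions.

```python
import math
import random


def empirical_quantile(values, q):
    ordered = sorted(values)
    return ordered[max(math.ceil(q * len(ordered)) - 1, 0)]


def estimate_qte(treated, control, q):
    return empirical_quantile(treated, q) - empirical_quantile(control, q)


def bootstrap_qte_interval(treated, control, q, n_boot=400, level=0.95, seed=0):
    """Percentile-bootstrap interval for the QTE, resampling each group i.i.d."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        treated_star = rng.choices(treated, k=len(treated))
        control_star = rng.choices(control, k=len(control))
        draws.append(estimate_qte(treated_star, control_star, q))
    tail = (1.0 - level) / 2.0
    return empirical_quantile(draws, tail), empirical_quantile(draws, 1.0 - tail)


# Hypothetical ETAs in seconds: same mean, wider spread under treatment.
rng = random.Random(5)
control_etas = [rng.gauss(300, 60) for _ in range(1000)]
treated_etas = [rng.gauss(300, 90) for _ in range(1000)]

qte_hat = estimate_qte(treated_etas, control_etas, 0.95)
lower, upper = bootstrap_qte_interval(treated_etas, control_etas, 0.95)
print(f"QTE at the 95th percentile: {qte_hat:.1f} (95% interval {lower:.1f} to {upper:.1f})")
```

For experiments with millions of observations, resampling is far more expensive than intervals based on the limiting distribution, which is one reason analytic results are attractive at that scale.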
Moving forward
Quantile treatment effects (QTEs) enable data scientists at Uber to better identify when degradations in our algorithms lead to, for example, longer rider pick-up times, offering a more precise alternative to average treatment effects (ATEs). This increased precision in analyzing the effects of experiments then allows us to refine the mechanics behind estimated times of arrival (ETAs) and other metrics in a more targeted way, leading to an improved rider experience on our platform.
If tackling some of the industry’s biggest data science challenges interests you, consider applying for a role on our team!
Acknowledgments
Akshay Jetli, Stephan Langer, and Yash Desai were instrumental for the technical implementation of the ideas discussed in this article. In addition, I profited from many helpful discussions with Sergey Gitlin.
References: 
Baumeister, R. F.; E. Bratslavsky; C. Finkenauer; and K. D. Vohs, (2001), “Bad Is Stronger Than Good,” Review of General Psychology, 5(4), 323 – 370. 
Imbens, G., and Rubin, D., (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge: Cambridge University Press. 
Gregory, K. B.; Lahiri, S. N.; and Nordman, D. J., (2018), “A smooth block bootstrap for quantile regression with time series,” Annals of Statistics, 46(3), 1138 – 1166.
Hunter, D. R., and K. Lange, (2000), “Quantile Regression via an MM Algorithm,” Journal of Computational and Graphical Statistics, 9, 60-77.
Kim, T. H., and H. White, (2003), “Estimation, Inference, and Specification Testing for Possibly Misspecified Quantile Regressions,” in Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later, edited by T. Fomby and R. C. Hill, 107–132, New York (NY): Elsevier.
Koenker, R., (2005), Quantile Regression, New York: Cambridge University Press.
Parente, P. and Santos Silva, J. (2015), “Quantile Regression with Clustered Data,” Journal of Econometric Methods, 5(1), 1-15.