Uber is a worldwide marketplace of services, processing thousands of monetary transactions every second. As a marketplace, Uber takes on all of the risks associated with payment processing. Uber partners who use the marketplace to provide services are paid for their work even if Uber was unable to collect the payment. Fraud response is thus a very important operational component of Uber’s global marketplace.
Industry-wide, payment fraud losses are measured in terms of the fraction of gross amounts processed. Though only a small fraction of gross bookings, these losses impact profits significantly. Furthermore, if a fraudulent activity is not discovered and mitigated immediately, it could soon be further exploited, resulting in serious losses for the company. These dynamics make early fraud detection vital to the company’s financial health.
Modern fraud detection systems are a combination of classic 1980s AI (also known as an “expert system”) and modern machine learning. We would like to share the journey on how we build the best-in-class automatic fraud detection system and process, leveraging both machine algorithms and human knowledge.
Explainability Of Decisions
It’s important to understand the value of expert-driven rules in fraud detection. Explainability is of paramount importance when it comes to fraud detection. Uber plays the role of the societal operating system. Incorrect fraud-related decisions can be disruptive not just to individual users, but to whole communities.
Fraud cannot be stopped by some black-box AI algorithms. The reliance on the engagement of human experts and human-readable rules gives the opportunity to humanize the process, examine past decisions, and have full traceability for every decision.
Signals of Payment Fraud
Following are the 2 primary payment fraud types that we are concerned with:
- DNS stands for “do not settle”, which occurs when the order has been fulfilled or a trip has been completed, but the payment could not be collected
- Chargeback happens when the payment has been collected, but the user disputes the transaction, and the bank refunds the money
In both cases, the losses detected change over time. If an analyst analyzes losses from transactions made on one day, these losses can change depending on what day they look at the data. This is because the data may not be fully mature. Payments that are initially uncollected can be collected after the date of the transaction. Alternatively, users may only choose to dispute losses days or weeks after the initial transaction date. Therefore, losses for a single day may increase or decrease over time. However, when enough time has passed from the initial transaction date, the fluctuations in losses will be minimal, and these losses are termed fully mature.
Therefore, when analysts see early spikes in immature losses, they need to make a judgment call on whether or not these spikes are due to increased activity on the platform, or are signals indicative of fraudulent activity.
The Role of Risk Analysts in Manual Fraud Detection
Fraud analysts use thousands of fraud signals to manually set up the rules that stop various fraudulent operations. Some signals come from the marketplace transaction and some are produced by various models. This process is highly manual, labor-intensive, and time-consuming. However, the challenge with using supervised learning models to stop fraud is that they are great at catching past fraud patterns, but can’t stop unseen and evolved fraud patterns. Therefore, analysts play an integral role in analyzing, detecting, and mitigating fraud at Uber.
What is RADAR?
RADAR is an AI fraud detection and mitigation system with humans in the loop.
RADAR monitors fine-grained segments of Uber’s marketplace, detects the start of a fraud attack, and generates a rule to stop it. The fraud analysts are involved to review the rule and approve it as needed.
Uber has to deal with millions of unpaid orders in every region across the globe, and across many different payment methods used by its customer base. Manual investigation has the benefit of catching new fraud MOs that have never been observed before or are difficult to detect algorithmically, but it is time- and labor-intensive.
By introducing RADAR, we can ingest and analyze fraud patterns at a much faster rate, and only pull in analysts to investigate and mitigate the most complex instances of fraud. Therefore, we get the best of both worlds: the ability to respond to and mitigate new instances of fraud through continued use of manual review, while simplifying existing fraud detection and mitigation processes through a data-driven AI approach with RADAR.
RADAR fraud protection rules are generally short-lived and targeted reactions to the unexpected attacks. Each rule is monitored and carefully measured for relevance. Eventually the more generic models and rules cover the pattern detected by RADAR.
Environment and Engineering Processes
RADAR runs on top of Apache Spark® and Peloton. Peloton helps us easily schedule the jobs and scale as needed to process the data. Apache Spark® provides an easy-to-use Python® and Scala API, which lets us separate data processing code from data access code. This makes it possible to do unit and integration testing of data processing and AI code.
We have a rigorous process of automated testing for RADAR data pipelines, where synthetic data is used to validate the anomaly detection algorithms. Our assumptions about data generation processes and attack patterns are described in the code and probabilistic programming models. This makes the engineering team much more responsive to changes in the data generation processes and attack patterns.
RADAR detects fraud attacks by analyzing the activity time-series from the Uber payment platform and converts data insights into human action.
We look at the data in 2 different time dimensions. We will define some terminology to discuss time series data below:
- OT – “Order time” when the specific order has been fulfilled.
- PSMT – “Payment settlement maturity time” signifies the time when the payment processing data comes in. Depending on the specific payment processing system the payment can take days and weeks to completely settle.
The goal of the RADAR data pipeline is to collect data for activity time-series generation, implemented using the Apache Spark® SQL API and the rolling window function.
- RADAR uses AthenaX to continuously aggregate the risk-enriched payment stream data from live payment streams
- The data is aggregated on Hive from Kafka streams using Marmaray
- RADAR builds and persists single-dimensional hourly time series in OT dimension RADAR currently uses selective slices for PSMT dimension for explainability
The time series generation component uses Apache Spark® window API and custom aggregation functions to generate all the time series in one shuffle cycle.
The logic is expressed in Apache PySpark® API, which makes it possible to unit test the data processing on the local development machine and make it part of the CI/CD cycle without having access to the secure data.
Uber Orbit Model for Time Series Decomposition
Orbit is an open-source Python® package for Bayesian time series modeling. It implements a suite of structural Bayesian time series models while providing a general interface for Bayesian time series forecasting. This package is written and designed from a strict object-oriented perspective with the goals of reusability, ease of maintenance, and high efficiency.
RADAR uses the KTR (Kernel-based Time-varying Regression) model inside orbit for training, inferences, and forecasting of time series models. The time-varying coefficient model is an important tool to explore and detect the dynamic pattern within time series. By nature, KTR is a general additive model (GAM), which makes the decomposition natural and interpretable. The time series can be decomposed into trend, seasonality, and regression components (if any).
Detection of the Attack
We build the time series decomposition model for each one-dimensional time series in the OT dimension for each aggregate signal and for each PSMT slice. A fraud attack manifests itself as a time series anomaly. We collect dozens of short-term and long-term anomaly signals, which are combined to determine if the fraud attack is in progress. The anomaly is defined as a significant variation from the forecast. We use the Uber Orbit model as described above for time-series forecasting in OT dimension used to establish the anomaly baseline.
Smart Thresholds and Prioritization
After we detect time series anomalies, we need to determine which attacks in which segments might lead to significant losses. This way, we can choose to prioritize the most severe attacks in terms of losses for analysts to review. When RADAR detects anomalous losses in the time series decomposition model, these losses have not settled into full maturity yet.
Therefore, after we detect these anomalous losses, we need to project them to full maturity in order to predict how they will settle over time. Now we use the PSMT dimension to forecast the losses. We are using the losses from historical data to produce this forecast.
Next, we apply the forecast as a smart threshold to predict if the attack severity will be anomalous at full maturity. If we exceed this smart threshold at full maturity, then we will choose to prioritize this loss.
Once we have filtered the losses down to the most severe anomalies, we engage the analyst team by using Jira as the task-tracking tool. However, all anomalies, whether or not they were prioritized or selected for Jira, are logged in the backend so that we have a historical record of all prioritization decisions.
Jira as a Feedback Loop Mechanism
Once we post the relevant and prioritized tasks to Jira, it is up to the analysts to review these anomalies and utilize them in their fraud investigations. By working closely with these analysts, we are able to create a strong feedback loop where we can analyze how well our methodology measures up against the existing status quo, and fine-tune RADAR to improve its performance.
Converting Data Insights into Action
Generally, relevance is a very important component of practically any anomaly detection system. Too many irrelevant signals create a noise of alerts, to which people are unable to respond. Our 2-dimensional forecasting system allows us to prioritize the alerts and make them relevant and actionable.
RADAR Protect is a component that finds the pattern of the attack and makes it easier to stop it.
The system finds the activity patterns, which lead to fraud. At the core of the system, there is a pattern-mining algorithm called FP-Growth. It has the capability of discovering common patterns within the categorical feature space.
Feature selection is an important component of the associative rule mining process. In our experiments, we had to deal with the exploding number of patterns identified by the algorithm if the feature selection process is not done carefully. The challenge with the feature selection process is that it has to be fully automated.
We started from the classic MRMR algorithm and introduced a few modifications, which take advantage of the specifics of our dataset. Specifically, for associative rule mining feature relevance:
- We are working exclusively with categorical features—numerical features are converted to categorical via various means
- Our dataset is extremely imbalanced, so we can get more aggressive at eliminating some features
- We are able to take advantage of the imbalance in feature value distributions
Redundancy removal requires the calculation of mutual information, which is quadratic in terms of the number of features. Our attempts to use the existing scalable feature selection methods did not work because it does not support high-cardinality features, which is important for the pattern discovery in the case of imbalanced classes. After multiple experiments, we came to the conclusion that this specific piece works better when scaled vertically and not horizontally.
We will explain further how we are combining the vertical and horizontal parallelism to achieve full utilization of our resources while dealing with scalability bottlenecks.
Finding Patterns to Generate Rules
In order to find patterns, we have to encode all selected feature-value pairs as unique items (one-hot encoding), so we can find common fraud patterns.
This transformation converts our dataset from a list of rows to a list of itemsets, where each item represents a particular key-value pair combination. Finding the most common set of key-value pairs will lead to the discovery of a fraud rule.
FP-Growth uses the distributed mergeable data structure called FP-Tree to accelerate the mining of common subsets (patterns) of the elements. FP-Growth produces the dataset of patterns, which are commonly associated with fraudulent activities. Unfortunately, not all rules are helpful for us. Many patterns are too broad and cover many good trips. Launching these rules will produce too many false positives and be disruptive to the customer experience.
The second step in pattern finding is the rule selection process. The rule candidates are validated to find those that pass the high precision requirements needed to be enabled.
Rule selection is similar to the query percolation problem in the sense that we have a fixed set of records and a wide range of queries, which need to be matched and scored against this fixed set of records.
We found that our feature selection helps to narrow down the list of viable rules. After a few experiments, we found that this percolation works better on the vertically scaled Apache Spark® driver and does not use Apache Spark® horizontal scaling.
We follow the 2-step process of percolating on the sample dataset to preselect the rule candidates and then validating the candidates on the full dataset. The FP-Growth algorithms use horizontal Apache Spark® scaling. The sample-based percolation uses vertical scaling on the Apache Spark® driver. The follow-up validation uses horizontal scaling running traditional SQL queries, which are generated based on the rules.
Final rules are presented to analysts for review. Feedback from the analysts on the rule is incorporated into the system to improve the rule generation process.
Scaling RADAR Protect
As we can see, the RADAR Protect pipeline consists of different steps, each using different scaling models. Some tasks utilize a large cluster and some are executed on the driver. This produces uneven utilization, leaving the large clusters underutilized for some periods of time.
To solve this problem, we use parallel execution of Protect jobs. The list of detected fraud attacks goes through the blocking queue and we allow parallel execution of the jobs to fully utilize the available resources.
The anomaly alerts and rules from RADAR provide a concrete groundwork to action on fraudulent users. RADAR uses Uber’s rule engine, Mastermind, to push rules into production. Mastermind maintains a repository of thousands of rules and has been upgraded with several new capabilities like optimized rapid evaluation of rules, suggestive actioning, and the addition of several new metrics.
However, even the small mistakes could lead to incorrect actioning on thousands of users on the Uber platform. This is where the human comes into the loop. Using the outputs of RADAR, pre-written queries, and pre-evaluated metrics, the analysts review the rules and the actions before productionalizing them. This ensures that the false positives and unnecessary incidents are avoided, because user experience at Uber is of paramount importance.
In this blog, we presented the RADAR system and how it brings together many components of Uber’s technical ecosystem to solve a complex business problem. More importantly, it shows how technology can help optimize and streamline the existing labor-intensive tasks to make room for focusing on more advanced problems and being creative.
While this system is actively utilized now, it should not be thought of as complete or static. Rather, it is one step in a long process of the experimentation, which continues to this day.
While some will see this blog as a technological reference, the important lesson to learn from this project is that successful implementation of machine learning projects requires bringing together technology, data, algorithms, and people. Our system serves humans and has humans in the loop.
Uber risk analysts play a very important role in this process and this technological solution would not be possible without their involvement and continued feedback.