Building Scalable Streaming Pipelines for Near Real-Time Features
August 24, 2021 / Global
Background
Uber is committed to providing reliable services to customers across our global markets. To achieve this, we rely heavily on machine learning (ML) to make informed decisions, such as forecasting and surge pricing. As a result, the real-time streaming pipelines that generate the data and features for ML have become increasingly popular and important.
At Uber, we leverage Apache Flink to build these real-time streaming pipelines, and we have built platforms such as Gairos and AthenaX to simplify their development. However, many challenges remain, such as scalability, arising from either the complexity of the computation or the volume of real-time data to be processed.
In this article, we will use the pipelines that generate demand and supply features as an example to introduce some of the challenges we faced and how we solved them. In particular, we will explain how we tune these real-time pipelines with the performance tuning framework.
Architecture
The figure below shows the high-level architecture: streaming pipelines in Apache Flink are responsible for feature computation and ingestion. The rest of this article discusses these pipelines in detail.

Feature Computation
This section details how to aggregate raw events, such as the demand and supply events, by their geospatial and temporal dimensions, as well as by the global product type (UberX, etc.), for any given hexagon (cf. here). The simplified computation algorithm is as follows:
- Count the number of raw events from distinct riders and drivers by hexagon and global product type in a 1-minute window
- Apply the Kring Smooth to multiple rings, up to ring 20 (discussed later), on the 1-minute window
- Aggregate the smoothed values of each ring on multiple sliding window sizes up to 32 minutes
In total, one real-time pipeline generates 54 features for a hexagon each minute, using the combinations of 9 ring sizes (0, 1, 2, 3, 4, 5, 10, 15, 20) and 6 window sizes (1, 2, 4, 8, 16, 32 minutes).
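As a quick illustration of how these combinations yield 54 features, the short Python sketch below simply enumerates the (ring, window) pairs. The feature names are made up for illustration only and are not the pipeline's actual naming scheme.

```python
from itertools import product

RING_SIZES = [0, 1, 2, 3, 4, 5, 10, 15, 20]   # Kring Smooth ring sizes
WINDOW_SIZES_MIN = [1, 2, 4, 8, 16, 32]       # sliding window sizes, in minutes

# Every (ring, window) pair yields one feature per hexagon per minute.
features = [f"demand_ring{r}_window{w}m"
            for r, w in product(RING_SIZES, WINDOW_SIZES_MIN)]

assert len(features) == 54
print(features[:3])
# ['demand_ring0_window1m', 'demand_ring0_window2m', 'demand_ring0_window4m']
```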
Next, we discuss step 2 of the algorithm:
Kring Smooth
The Kring Smooth process calculates the geospatial aggregation by broadcasting the event counts of a hexagon to its Kring neighbours. In other words, the feature value of a hexagon for a particular ring takes into account the event counts from all hexagons within that ring.
In order to calculate the feature value aggregated on ring R for a given hexagon H, the equation is:

$$f(H, R) = \frac{\sum_{i=0}^{R} \sum_{j=1}^{Num(i)} f(N_{ij}, 0)}{\sum_{i=0}^{R} Num(i)}$$

Where:
- Num(i) is the number of hexagons of ring i
- N_ij is the j-th hexagon of ring i (ring 0 contains only the hexagon H itself)
- f(H, 0) is the number of events originating from hexagon H
So let’s check the following example to see how to compute the values of 3 features—ring 0, ring 1, and ring 2 of hexagon A—following the equation. In this example, B1 and B2 are the only ring-1 hexagons of A with events, and C1, C2, and C3 are the only ring-2 hexagons with events; every other neighbouring hexagon has a count of 0:
Num(0) = 1
Num(1) = 6
Num(2) = 12
f(A, 0) = 1
f(A, 1) = (f(A, 0) + f(B1, 0) + f(B2, 0)) / (Num(0) + Num(1)) = (1 + 2 + 1) / 7 = 4 / 7
f(A, 2) = (f(A, 0) + f(B1, 0) + f(B2, 0) + f(C1, 0) + f(C2, 0) + f(C3, 0)) / (Num(0) + Num(1) + Num(2)) = (1 + 2 + 1 + 3 + 2 + 1) / (1 + 6 + 12) = 10 / 19

The pipeline follows this equation to calculate the feature values for multiple ring sizes, up to ring 20.
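To make the smoothing concrete, here is a minimal Python sketch of the ring aggregation described above. It assumes the per-hexagon raw counts and the ring membership around the target hexagon are already available (e.g., from a geospatial index); the function and variable names are illustrative and not the pipeline's actual code.

```python
def kring_smooth(raw_counts, rings, max_ring):
    """Compute f(H, R) for R = 0..max_ring.

    raw_counts: dict mapping hexagon id -> f(hexagon, 0), the 1-minute event count
    rings:      list where rings[i] is the list of hexagon ids forming ring i
                around the target hexagon (rings[0] is the hexagon itself)
    Returns a dict mapping ring size R -> smoothed feature value f(H, R).
    """
    smoothed = {}
    total_events = 0    # running sum of event counts over rings 0..R
    total_hexagons = 0  # running sum of Num(0) + ... + Num(R)
    for r in range(max_ring + 1):
        ring_hexagons = rings[r]
        total_events += sum(raw_counts.get(h, 0) for h in ring_hexagons)
        total_hexagons += len(ring_hexagons)
        smoothed[r] = total_events / total_hexagons
    return smoothed


# Reproducing the worked example for hexagon A:
raw_counts = {"A": 1, "B1": 2, "B2": 1, "C1": 3, "C2": 2, "C3": 1}
rings = [
    ["A"],                                             # ring 0: Num(0) = 1
    ["B1", "B2"] + [f"b{k}" for k in range(4)],        # ring 1: Num(1) = 6
    ["C1", "C2", "C3"] + [f"c{k}" for k in range(9)],  # ring 2: Num(2) = 12
]
print(kring_smooth(raw_counts, rings, max_ring=2))
# {0: 1.0, 1: 0.571... (4/7), 2: 0.526... (10/19)}
```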
Temporal Aggregation
After the Kring Smooth completes for a one-minute window, step 3 of the algorithm is to further aggregate the smoothed event counts on larger windows, up to 32 minutes. In order to calculate the aggregation on a larger window for a given hexagon H, the equation is:
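Before presenting the equation, here is a rough Python sketch of this temporal step. It assumes, purely for illustration, that the larger-window value is the average of the per-minute smoothed values inside that window, which may differ from the exact aggregation given by the equation; the class and method names are hypothetical.

```python
from collections import deque


class SlidingWindowAggregator:
    """Keep the most recent one-minute smoothed values f(H, R) for one
    (hexagon, ring) pair and emit aggregates for several window sizes.

    Assumption (for illustration only): the larger-window feature is the
    average of the per-minute smoothed values inside that window.
    """

    WINDOW_SIZES = [1, 2, 4, 8, 16, 32]  # minutes

    def __init__(self, max_window=32):
        self.values = deque(maxlen=max_window)  # most recent value is last

    def add_minute(self, smoothed_value):
        self.values.append(smoothed_value)

    def aggregate(self):
        recent = list(self.values)
        out = {}
        for w in self.WINDOW_SIZES:
            window = recent[-w:]  # last w minutes (fewer if not yet filled)
            out[w] = sum(window) / len(window) if window else None
        return out


agg = SlidingWindowAggregator()
for minute_value in [4 / 7, 0.6, 0.5]:  # three minutes of smoothed ring-1 values
    agg.add_minute(minute_value)
print(agg.aggregate())  # e.g. {1: 0.5, 2: 0.55, 4: 0.557..., ...}
```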