Measuring Performance for iOS Apps at Uber Scale

20 April 2023 / Global

At Uber, we obsess over delivering highly performant and reliable experiences to our partners and customers. We treat a degradation in app performance the same way as any other functional regression.

Before investing effort into improving performance and reliability, we need to understand how our app is performing in production. We use various tools to derive metrics that allow us to measure app performance and reliability. We also establish baselines to ensure that new feature development is consistent with the high bar we hold for performance and reliability.

App Performance

This article is the first in a series discussing how we measure performance at Uber and the different challenges Uber faces in terms of scalability when measuring different performance metrics. Today we’ll be focusing on measuring app startup performance on iOS, but future articles will cover other performance metrics and other mobile platforms.

At Uber, we monitor multiple critical metrics, ranging from UI flow latency to memory usage and from bandwidth to UI jank. In this article, we’ll be talking about a critical industry-standard metric: app launch time. It is one of the major performance metrics that directly impact the customer experience, and we’ve found that end users have limited tolerance for a slow app when they want to reach their destination fast.

For app launch, we specifically measure cold app start duration, which is made up of the first app process creation, our main.swift code being initialized, various network calls to fetch real-time content, and the first rendering pass needed to populate the screen. A cold start can occur after a fresh install, after an app update, on the first app launch since device reboot, or when the app was killed or otherwise evicted from memory in a previous session. This is in contrast to a hot launch, in which the app is already initialized and in memory and is simply brought from the background to the foreground.

In addition to signposting these key flows within the app for post-cold-launch metrics, we have also established a data pipeline to ingest hitch rate and hang rate metrics from Apple, which provides deeper insight into what’s going on at the OS layer. Below, we’ll discuss how we measure these performance metrics and what tools and processes we’ve built to ensure we never regress.

Startup Latency

To determine the app’s startup latency, we measure the time it takes from when the user taps the icon to when the first responder view is available to the user. The first responder view in this context is the first view that accepts user input.

With the introduction of pre-warming in iOS 15, the OS may load the app process into memory ahead of time, based on device conditions, in anticipation of the user launching the app. This reduces how long the user has to wait until the app is usable on their next intentional launch. It also complicated how Uber measured iOS cold startup latency, since it was no longer feasible to signpost the time from process creation to first responder view availability. This led us to redesign Uber’s process for measuring cold startup latency.

Startup Measurement Pre-iOS 15

Before the introduction of pre-warming, we utilized a simple method of measuring cold startup latency where we logged the time between specific events (“signposting”) during startup and used those times to establish the complete duration for cold startup latency.

To do this, we divided our startup sequence into two major spans of measurement, the sum of which makes up our cold startup latency:

Pre-main: This is the time from process creation until the main() function is called in the application. We use a Mach kernel call in our main.swift entry point to get the process creation time.

Post-main: This is the duration from the point the main() function is called until our first screen is interactive.

These “pre-main” and “post-main” spans were further divided into other sub-spans based on the phases of cold startup latency we wanted to measure. This is what our old startup measurement would look like (sans sub-spans):

Figure 1: Illustration of the cold launch measurement spans before app pre-warming
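
The two-span bookkeeping described above can be sketched in a few lines. This is a language-agnostic illustration written in Python, not the app's client code: the `LaunchTrace` class, the event names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchTrace:
    """Collects named timestamps (in seconds) logged during one cold start."""

    marks: dict = field(default_factory=dict)

    def mark(self, event: str, timestamp: float) -> None:
        # Record when a launch event happened; the first write wins.
        self.marks.setdefault(event, timestamp)

    def span(self, start: str, end: str) -> float:
        # Duration between two previously recorded events.
        return self.marks[end] - self.marks[start]


# Hypothetical timestamps for one session, relative to process creation.
trace = LaunchTrace()
trace.mark("process_created", 0.0)
trace.mark("main_called", 0.5)
trace.mark("first_screen_interactive", 2.0)

pre_main = trace.span("process_created", "main_called")
post_main = trace.span("main_called", "first_screen_interactive")
cold_start = pre_main + post_main  # the sum of the two spans
```

Sub-spans fit the same model: each additional phase is one more `mark` call, and any pair of marks yields a duration.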

With the introduction of pre-warming in iOS 15, the main() entry point for the app may be invoked before the user taps the app icon, which starts the launch sequence but pauses it before the call to UIApplicationMain. Because our spans did not account for this pause, our pre-main measurements became inflated, producing a misleading 130% increase in our total launch measurement and leaving us with unreliable app launch metrics.

Startup Measurement in the World of Pre-Warming

With pre-warming, our “signposted” startup latency metrics were no longer reliable, leading us to investigate more deterministic means of measuring our app’s startup latency.

We ultimately decided to leverage MXAppLaunchMetric from Apple’s MetricKit, but we had to solve several problems before it could be adopted for our use case:

MXAppLaunchMetric measures the time between app process initialization and the didFinishLaunch() call of our app. While this can be a suitable proxy for startup latency, it was a regression from our prior measurement, which included the time taken for the initial application UI to render.
MetricKit data is reported per user as a daily aggregate (covering the past 24 hours), whereas our prior measurements had per-session granularity.

To tackle both of these issues, we built a new metric pipeline that collected MetricKit data at a user level and aggregated it with our custom signposted measurements at a session level to establish a new metric for startup latency.

Because the concept of pre-main was no longer valid in our measurements, we came up with new definitions for the phases of app startup latency:

Pre-launch: Instead of our custom measurement for pre-main, we now use MXAppLaunchMetric to get the time from process instantiation to didFinishLaunch().
Post-launch: Because we want to measure the full startup experience, we still rely on custom time tracing from didFinishLaunch() until the point in the app where we determine that the UI is fully rendered.

Figure 2: Illustration of the cold launch measurement spans accounting for app pre-warming

Similar to our previous spans, we further divided pre-launch and post-launch spans based on the app launch sequence. For instance, to add support for Uber apps that handled scene state transitions through scene delegates, we divided our post-launch span measurement into sub-spans “PostLaunchBeforeWindow” and “PostLaunchAfterWindow”.

Figure 3: Illustration of the cold launch measurement spans accounting for app pre-warming with scene delegates

Since we rely on MetricKit for pre-launch metrics, we can only get this at a user level every 24 hours. However, because we measure post-launch metrics ourselves, we have the flexibility to measure the post-launch latency per user session. In the next few sections, we explain how this flexibility helped us get a more complete picture of our app’s startup metrics.

Joining User-based and Session-based Metrics

We made the decision early on not to process MetricKit data on the client, but rather to send this data in a semi-structured JSON format to the backend. For MXAppLaunchMetric, the data is represented as a histogram, in the form of MXHistogram, with each bin/bucket indicating the number of times the app launch metric fell within a certain range of values during the 24-hour reporting period.

By sending the complete histogram, we increased the amount of data we were sending to the backend, but that also gave us an increased amount of flexibility in processing the data:

Using the buckets provided in MXHistogram, we can analyze how many times a user experiences a high launch time (e.g., the number of pre-launch times greater than 5 seconds for a single user).
It also allowed us to offload computation from users’ devices and decoupled changes to available MetricKit data from our regular weekly app releases.

The snippet below illustrates what the MXAppLaunchMetric-related parts of the MetricKit data dump might look like.

Figure 4: Sample of histogram data representing MetricKit app launch times

Figure 5: Visual representation of collected histogram data representing MetricKit app launch times

With this data, bucketStart is the beginning value for the interval for a bucket, bucketEnd is the ending value for the interval, and bucketCount is the number of observed samples that fall within that bucket.
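
As a rough sketch of how such a histogram might be read on the backend, the example below parses a hypothetical payload fragment and counts launches slower than 5 seconds for one user. The bucket values are invented, and the surrounding key names and the "1,000 ms" string format are assumptions modeled on MetricKit's simulated payloads rather than taken from production data.

```python
import json

# Hypothetical payload fragment for one user's 24-hour report.
payload = json.loads("""
{
  "applicationLaunchMetrics": {
    "histogrammedTimeToFirstDrawKey": {
      "histogramNumBuckets": 3,
      "histogramValue": {
        "0": {"bucketCount": 12, "bucketStart": "1,000 ms", "bucketEnd": "2,000 ms"},
        "1": {"bucketCount": 5, "bucketStart": "2,000 ms", "bucketEnd": "5,000 ms"},
        "2": {"bucketCount": 2, "bucketStart": "5,000 ms", "bucketEnd": "8,000 ms"}
      }
    }
  }
}
""")


def parse_ms(text: str) -> float:
    # Assumes an en-US style value such as "1,000 ms".
    return float(text.replace(",", "").removesuffix("ms").strip())


def launches_slower_than(histogram: dict, threshold_ms: float) -> int:
    # Count samples in buckets that start at or above the threshold.
    return sum(
        bucket["bucketCount"]
        for bucket in histogram["histogramValue"].values()
        if parse_ms(bucket["bucketStart"]) >= threshold_ms
    )


histogram = payload["applicationLaunchMetrics"]["histogrammedTimeToFirstDrawKey"]
slow_launches = launches_slower_than(histogram, 5000)
```

Because only bucket boundaries are known, a bucket that straddles the threshold cannot be split exactly. This sketch counts only buckets that begin at or above it.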

To make sense of this histogram data at an aggregate level across our production users, we convert it to a scalar value. We use a simple approach, calculating the average for each histogrammed metric, where n in the following equation represents the number of buckets.

Figure 6: Equation used to convert histogrammed MetricKit data into a scalar value
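
One form consistent with this description is a count-weighted average over the n buckets. Representing each bucket by its midpoint is an assumption made here for illustration:

$$
\text{average} = \frac{\sum_{i=1}^{n} \frac{\text{bucketStart}_i + \text{bucketEnd}_i}{2} \cdot \text{bucketCount}_i}{\sum_{i=1}^{n} \text{bucketCount}_i}
$$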

Using this equation, we were then able to get a scalar value that represents the average for any histogrammed metric provided by MetricKit.
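
A minimal sketch of this conversion follows. It assumes the buckets have already been parsed into numeric (start, end, count) tuples and that each bucket is represented by its midpoint. The function name and sample values are hypothetical.

```python
def histogram_mean(buckets):
    """Count-weighted mean of bucket midpoints; buckets are (start, end, count)."""
    total = sum(count for _, _, count in buckets)
    if total == 0:
        # An empty histogram has no meaningful average.
        raise ValueError("histogram has no samples")
    weighted = sum((start + end) / 2 * count for start, end, count in buckets)
    return weighted / total


# Hypothetical launch-time buckets in milliseconds.
buckets_ms = [(1000, 2000, 12), (2000, 5000, 5), (5000, 8000, 2)]
average_launch_ms = histogram_mean(buckets_ms)
```

The empty-histogram check matters in practice, since a user who never launched the app in a reporting window would otherwise cause a division by zero.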

We set up a data pipeline that combined the following two metrics at a user level:

The ingested data from MetricKit, which represents the user’s average startup latency, i.e., the pre-launch data over 24 hours, converted to a scalar value.
The ingested data from our custom post-launch measurements, which is collected for each session.

Combining these two metrics at a user level gave us a more complete startup latency metric. We also store this data’s 50th, 75th, 90th, and 95th percentile aggregation in a separate database, giving us a more holistic view of the user’s startup latency over time.
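
The join and percentile aggregation might look like the sketch below. How the two metrics are actually combined is not specified here, so this assumes the simplest option: average each user's post-launch sessions, then add that user's pre-launch scalar. All names and values are hypothetical, and the percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict

# Hypothetical inputs, all in milliseconds: one MetricKit-derived scalar per
# user, and one custom post-launch measurement per session.
pre_launch_avg_ms = {"user_a": 800.0, "user_b": 1200.0}
post_launch_sessions_ms = [
    ("user_a", 1000.0),
    ("user_a", 1400.0),
    ("user_b", 900.0),
    ("user_c", 700.0),  # no MetricKit data for this user, so it is skipped
]


def startup_latency_per_user(pre_launch, sessions):
    # Average each user's post-launch sessions, then add the pre-launch scalar.
    by_user = defaultdict(list)
    for user, post_ms in sessions:
        by_user[user].append(post_ms)
    return {
        user: pre_launch[user] + sum(values) / len(values)
        for user, values in by_user.items()
        if user in pre_launch
    }


def percentile(values, pct):
    # Nearest-rank percentile: the smallest value with at least pct% of the
    # data at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latency = startup_latency_per_user(pre_launch_avg_ms, post_launch_sessions_ms)
summary = {p: percentile(latency.values(), p) for p in (50, 75, 90, 95)}
```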

Processing MetricKit Data

Because we send MetricKit JSON data to the backend for processing, the data pipeline has to handle the complexities of operating on data at large scale. To sanitize this data, the pipeline pre-processes millions of rows of semi-structured JSON, accounting for different locale strings, missing data, data type conversions, and timestamp conversions from local time to standard time. After sanitization, it converts all memory-related data to MB.
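
A few of these sanitization steps can be sketched as small helpers. The input formats are assumptions made for illustration: values arrive as strings such as "512,000 kB", the decimal separator for the reporting locale is known, memory units are decimal, timestamps carry a numeric UTC offset, and "standard time" means UTC.

```python
from datetime import datetime, timezone

# Assumed decimal unit sizes, expressed in kilobytes.
_KB_PER_UNIT = {"kB": 1, "MB": 1_000, "GB": 1_000_000}


def parse_number(text, decimal_sep="."):
    # Keep digits, a minus sign, and the locale's decimal separator; grouping
    # characters (commas, dots, spaces) are dropped.
    kept = "".join(ch for ch in text if ch.isdigit() or ch in ("-", decimal_sep))
    return float(kept.replace(decimal_sep, "."))


def memory_to_mb(text, decimal_sep="."):
    # Missing data stays missing rather than becoming zero.
    if text is None or not text.strip():
        return None
    value, unit = text.rsplit(None, 1)
    return parse_number(value, decimal_sep) * _KB_PER_UNIT[unit] / 1_000


def to_utc(timestamp):
    # Expects a value like "2023-04-20 09:30:00 +0200".
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S %z")
    return parsed.astimezone(timezone.utc)
```

For example, `memory_to_mb("512,000 kB")` yields 512.0, and `parse_number("1.010,5", decimal_sep=",")` yields 1010.5 for a locale that swaps the separators.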

While processing these rows, we also convert histogram values, such as app startup latency and hang rate, to scalar values. This pre-processed data is stored in an intermediate dataset where we map each JSON key from MetricKit data to a more structured relational dataset column.

Figure 7: Example mapping of MetricKit JSON exit metrics to columns
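
One simple way to derive such a mapping is to flatten the nested JSON and turn each key path into a column name. The naming scheme below (snake case, joined with double underscores) and the sample values are hypothetical, not the pipeline's actual schema.

```python
import re


def to_column(path):
    # "cumulativeNormalAppExitCount" -> "cumulative_normal_app_exit_count"
    snake = [re.sub(r"(?<!^)(?=[A-Z])", "_", part).lower() for part in path]
    return "__".join(snake)


def flatten(node, path=()):
    """Yield (column_name, value) pairs for every leaf in a nested JSON object."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    else:
        yield to_column(path), node


# Hypothetical exit-metrics fragment from one payload.
exit_metrics = {
    "applicationExitMetrics": {
        "foregroundExitData": {
            "cumulativeNormalAppExitCount": 7,
            "cumulativeMemoryPressureExitCount": 1,
        }
    }
}
row = dict(flatten(exit_metrics))
```

Deriving column names mechanically means a new MetricKit key becomes a new column without a client release, in line with the decoupling goal described earlier.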

With this aggregated data, we can answer some questions that MetricKit may not answer by itself, such as:

How our launch, hang, or any other MetricKit histogrammed metrics change in aggregate from one app version release to another. These metrics help us to catch regressions by configuring thresholds and alerts.
How our launch, hang, or any other MetricKit histogrammed metrics compare from one device or OS version to another. This helps us monitor any regressions introduced by new OS versions and how the app performs on low-end devices.
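
A threshold check of the kind described above could be as small as the sketch below. The 5% allowance, the version labels, and the latency values are hypothetical.

```python
def regressed(baseline_ms, candidate_ms, max_increase_pct=5.0):
    """True when the candidate exceeds the baseline by more than the allowance."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    increase_pct = (candidate_ms - baseline_ms) / baseline_ms * 100
    return increase_pct > max_increase_pct


# Hypothetical p50 startup latency (ms) per app version, in release order.
p50_by_version = {"1.0": 2000.0, "1.1": 2050.0, "1.2": 2300.0}
versions = list(p50_by_version)

# Compare each release against the one before it and collect offenders.
alerts = [
    new
    for old, new in zip(versions, versions[1:])
    if regressed(p50_by_version[old], p50_by_version[new])
]
```

The same comparison works across device models or OS versions by swapping the dictionary keys.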

Concluding Thoughts

By re-architecting how we measure our startup latency, we can more reliably measure this metric in the world of pre-warming. These changes have allowed us to leverage existing OS instrumentation for latency while still capturing a more holistic measurement for user-perceived latency than what the OS can capture by itself.

We use this new startup latency as a guardrail metric to ensure that hundreds of code changes and dozens of features launching weekly do not regress our startup latency beyond a baseline. In addition, this data is used for opportunity sizing for improvements to our app’s launch sequence. For example, since we measure the discrete sub-spans that make up our startup sequence, we can determine the impact of optimistically pre-fetching information on startup latency and determine the relative business impact.

For the next blog post in the series, we intend to showcase how we measure and tackle other reliability issues, such as memory leaks causing out-of-memory (OOM) issues and app responsiveness on mobile across Uber.

We hope that our learnings are helpful for other teams wanting to measure their app’s performance and reliability.