
This article was written in collaboration with the Kotlin team at JetBrains.
At Uber, we strive to maintain a modern tech stack in all our applications. A natural progression in the Android space was to start adopting Kotlin, a modern multi-platform programming language and an increasingly popular alternative for Android development that fully interoperates with Java.
However, with over 20 Android applications and more than 2,000 modules in our Android monorepo, Uber’s Mobile Engineering team had to carefully evaluate the impact of adopting something as significant as a new language. This evaluation covered many important facets, including developer productivity, interoperability, runtime and build performance overhead, developer interest, and static code analysis. On top of these developer considerations, we had to ensure that this decision didn’t impact the Uber user experience on our Android apps.
To facilitate the success of this adoption, we launched an initiative, in collaboration with JetBrains, to measure Kotlin build performance at scale across different project structures, a process that informed our decisions for best practices for Android development.
Design considerations
The goal was simple: measure Kotlin build performance at scale and understand the tradeoffs of different project structures. To achieve this, we established the following conditions for our model structures:
- Code should be functionally equivalent. This does not necessarily mean that the Kotlin and Java sources are identical in implementation, just that they reflect functional parity with how we would realistically write them in that language (for example, a Gson TypeAdapter in Java vs. a Moshi JsonAdapter in Kotlin; see the sketch after this list).
- Code should be non-trivial. Trivial examples are not enough, as they often do not reflect real world conditions. To most accurately execute our tests, we needed to leverage non-trivial code we would use in production environments.
- There should be lots of large, diverse modules. We wanted to measure not only across a large volume of projects, but also how individual Kotlin projects scale with size.
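To make the functional-parity point concrete, the following is a minimal sketch of the kind of Kotlin value type a generator could emit using Moshi’s codegen annotation; the type and field names are hypothetical, and the Java-only equivalent would be a plain POJO serialized through a Gson TypeAdapter.

```kotlin
import com.squareup.moshi.Json
import com.squareup.moshi.JsonClass

// Hypothetical generated Kotlin value type. The Java-only variant would be a
// plain POJO serialized through a Gson TypeAdapter instead of a Moshi JsonAdapter.
@JsonClass(generateAdapter = true)
data class RiderProfile(
    @Json(name = "id") val id: String,
    @Json(name = "first_name") val firstName: String,
    // Optional fields might map to nullable properties with defaults.
    val rating: Double? = null
)
```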
We were in a unique position to perform such a measurement because we generate our network API models and services for Android from Apache Thrift specifications. On the wire, these are sent as JSON using a Retrofit/OkHttp/Gson based network stack. Structs/exceptions/unions are generated as standard value types (POJOs). We generate one project per .thrift file, and projects can depend on other generated projects that match the Thrift “include” statements.
This project structure results in 1.4 million lines of code (LoC) across 354 different projects that we can compare. Additionally, since the code is generated, we can control the morphology of these projects; for instance, we can generate them with only Java code, only Kotlin code, or a mix of both, and enable or disable annotation processors, among other combinations.
Based on this configurability, we came up with a matrix of 13 different scenarios to gain a fine-grained understanding of different project structures and tooling tradeoffs:

We named the process of generating the 354 projects for each of the 13 configurations an experiment. In total, we successfully ran 129 experiments.
Additional design notes and caveats
Below, we highlight some additional design considerations and knowledge we had in mind before embarking on this project:
- Uber already used Buck as our Android/Java build system, so we did not test using tools like the Kotlin Gradle daemon and incremental Kapt.
- During this benchmark, builds were clean and the cache was turned off. Buck caches the results of already computed rules to speed up future builds, which is exactly what you want to avoid while benchmarking in order to reduce variability between runs.
- Buck’s multi-threaded build was turned off. Since we cannot assume that the work performed by each thread throughout a build execution is deterministic, we did not want multi-threaded mode to interfere with the timings from the compiler thread.
- We wanted to measure pure kotlinc/javac performance, and as such, did not use Kotlin’s compiler avoidance functionality. Compiler avoidance/caching mechanisms can vary significantly between build systems, so we decided not to index on it for this project.
- We executed our experiments on our CI machines because the experiments were slow to run and our CI boxes are much more powerful than personal machines.
- These are pure JVM projects. Android projects may have other considerations such as resources, R classes, android.jar and Android Gradle Plugin.
- Buck support for Kotlin was added by the open source community and is not actively maintained. This may have performance implications, in the sense that Buck’s Kotlin support may not be as heavily optimized as first-party tools.
- Analysis that involves project size comparison was done entirely at the source code level, and no generated bytecode was taken into account.
- Our build performance data relates to compilation time rather than build time. Build times are tightly coupled to the build system in use, e.g., Gradle Incremental Builds or Buck Parallel building. We wanted our analysis to be build system agnostic and keep the focus as close to kotlinc vs. javac as possible.
Based on these considerations, we created a project generation workflow that enabled us to develop hundreds of models with which to compare build performance for our new Kotlin-based applications.
Project generation workflow
Our standard model generation pipeline is a simple command line interface around a project generator. It reads from a directory of Thrift specs, infers project dependencies, and then generates a flat set of projects that reflect those specs. These projects, in turn, contain a Buck file with a custom genrule that invokes the code generator to produce the appropriate source files for the project. During our build performance experiments, we ran all the code gen pieces separately so that the only measured piece was the compilation step.

To simplify this setup, we created a `BuildPerfType` enum with the aforementioned matrix and added a `--build-perf` option to the project generation CLI. With that in place, all the analysis script had to do was run a single project generation command with the desired build performance type.
Given our stack’s usage of Buck, we leverage OkBuck to wrap Buck. Each BuildPerfType member contains all the information required to generate a project for that specification, including custom arguments to kotlinc, dependencies (Kotlin stdlib, Kapt, etc.), and fine-grained arguments and language controls for the code generation.
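As a rough illustration only, such an enum might look like the sketch below; the member names and properties here are hypothetical and do not correspond to the actual 13 configurations.

```kotlin
// Hypothetical sketch of a build performance configuration enum; the real
// BuildPerfType has 13 members and carries more detail (kotlinc arguments,
// dependency lists, code generation controls, etc.).
enum class BuildPerfType(
    val javaPercent: Int,        // share of code generated as Java
    val kotlinPercent: Int,      // share of code generated as Kotlin
    val useKapt: Boolean,        // whether annotation processing runs through Kapt
    val extraKotlincArgs: List<String> = emptyList()
) {
    JAVA_ONLY(javaPercent = 100, kotlinPercent = 0, useKapt = true),
    KOTLIN_ONLY(javaPercent = 0, kotlinPercent = 100, useKapt = true),
    KOTLIN_NO_KAPT(javaPercent = 0, kotlinPercent = 100, useKapt = false),
    MIXED_HALF_AND_HALF(javaPercent = 50, kotlinPercent = 50, useKapt = true)
}
```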
At the code generation level, we implemented support for generating Java and Kotlin code using JavaPoet and KotlinPoet. We had already implemented a flexible plugin system in the code generator to support custom post-processing, so adding the necessary controls for these new variants was easy.
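For readers unfamiliar with KotlinPoet, the snippet below is a minimal, standalone sketch of how a generator can emit a Kotlin data class programmatically; the package and type names are placeholders, not our actual generator code.

```kotlin
import com.squareup.kotlinpoet.FileSpec
import com.squareup.kotlinpoet.FunSpec
import com.squareup.kotlinpoet.KModifier
import com.squareup.kotlinpoet.PropertySpec
import com.squareup.kotlinpoet.TypeSpec

fun main() {
    // Build a simple data class with a single constructor property.
    val userType = TypeSpec.classBuilder("User")
        .addModifiers(KModifier.DATA)
        .primaryConstructor(
            FunSpec.constructorBuilder()
                .addParameter("id", String::class)
                .build()
        )
        .addProperty(
            PropertySpec.builder("id", String::class)
                .initializer("id")
                .build()
        )
        .build()

    // Wrap it in a file and print the generated source.
    FileSpec.builder("com.example.generated", "User")
        .addType(userType)
        .build()
        .writeTo(System.out)
}
```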
To generate mixed source sets, we added the ability to specify exactly which Thrift elements should be generated in each language. For Kapt-less generation, we added optional direct generation of the classes that would otherwise be produced during annotation processing, namely Dagger factories (example) and Moshi Kotlin models (based on this pull request).
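To illustrate what “direct generation” replaces, here is a simplified, hypothetical sketch of the kind of JsonAdapter that Moshi’s annotation processor would normally produce for a Kotlin model; real generated adapters handle more cases (defaults, nullability errors, and so on).

```kotlin
import com.squareup.moshi.JsonAdapter
import com.squareup.moshi.JsonReader
import com.squareup.moshi.JsonWriter
import com.squareup.moshi.Moshi

data class City(val id: String, val name: String)

// Simplified stand-in for the adapter that Moshi's annotation processor would
// generate; emitting it directly from the code generator removes the need for Kapt.
class CityJsonAdapter(moshi: Moshi) : JsonAdapter<City>() {
    private val options = JsonReader.Options.of("id", "name")
    private val stringAdapter = moshi.adapter(String::class.java)

    override fun fromJson(reader: JsonReader): City {
        var id: String? = null
        var name: String? = null
        reader.beginObject()
        while (reader.hasNext()) {
            when (reader.selectName(options)) {
                0 -> id = stringAdapter.fromJson(reader)
                1 -> name = stringAdapter.fromJson(reader)
                else -> { reader.skipName(); reader.skipValue() }
            }
        }
        reader.endObject()
        return City(requireNotNull(id), requireNotNull(name))
    }

    override fun toJson(writer: JsonWriter, value: City?) {
        requireNotNull(value)
        writer.beginObject()
        writer.name("id"); stringAdapter.toJson(writer, value.id)
        writer.name("name"); stringAdapter.toJson(writer, value.name)
        writer.endObject()
    }
}
```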
Figure 2, below, shows the distribution of the generated projects based on their size, as measured by the number of files. On average, there are 27 files per project (i.e., the average total number of files across all 354 projects in the 13 build performance types), and the average number of lines per file is 200 (i.e., the average number of lines per project, counting code, comments, and blank lines across Java and Kotlin sources, divided by the average number of files).

Experiment execution
To run our experiment, we took the following steps:
- Instrument the process. This mainly means going into our build system and making it emit the metrics we needed for this analysis.
- Consolidate the data. We had to agree on a format for the data before shipping it to the database. The way this data is indexed would directly impact our ability to build visualizations in Kibana, our front-end system.
- Ship it to our in-house databases. At Uber, there are multiple databases for this sort of metric, each one optimized for specific scenarios. We chose Elasticsearch and Kibana for this experiment because the visualizations we wanted were easier to build there.
- Repeat it consistently. We needed enough data to eliminate any outliers that could skew the analysis.
A Python script orchestrated the experiment execution; the orchestration language has no impact on the measured performance and was chosen based on team familiarity. Build performance data was provided by our build system in the form of a Chrome trace file, and although this is a standard feature of the build system, we still had to modify our internal Buck fork in order to associate the data we needed with the context of the project in which it was being collected.
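As an illustrative sketch only (the actual orchestration was a Python script), the snippet below shows the general idea of pulling compiler-related event durations out of a Chrome trace and consolidating them into a per-project record; the event-name filter, file path, and record shape are assumptions, not our actual schema.

```kotlin
import com.squareup.moshi.Moshi
import com.squareup.moshi.Types
import java.io.File

// Hypothetical consolidated record; the real schema shipped to our databases differs.
data class CompileRecord(
    val experimentId: String,
    val buildPerfType: String,
    val project: String,
    val compileTimeMicros: Long
)

fun main() {
    // Chrome trace files are JSON arrays of events with fields such as
    // "name", "ph" (phase), and "dur" (duration in microseconds).
    val json = File("buck-trace.json").readText() // hypothetical path
    val type = Types.newParameterizedType(
        List::class.java,
        Types.newParameterizedType(Map::class.java, String::class.java, Any::class.java)
    )
    val adapter = Moshi.Builder().build().adapter<List<Map<String, Any>>>(type)
    val events = adapter.fromJson(json).orEmpty()

    val compileMicros = events
        .filter {
            val name = it["name"] as? String ?: return@filter false
            // Event-name filter is an assumption; real Buck trace event names differ.
            "kotlinc" in name || "javac" in name
        }
        .sumOf { ((it["dur"] as? Double) ?: 0.0).toLong() }

    val record = CompileRecord(
        experimentId = "2019-01-01T00:00", // hypothetical
        buildPerfType = "KOTLIN_ONLY",     // hypothetical
        project = "model-rider",           // hypothetical
        compileTimeMicros = compileMicros
    )
    println(record)
}
```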

For the project morphology-related data, e.g., the number of files; the number of blank, comment, and code lines; and the number of generated classes and interfaces, we used a mix of the Count Lines of Code (CLoC) CLI and regular expressions. This analysis looks at the generated project source files, not their compiled bytecode. Once all data was collected, it was assembled into a single JSON file and committed to a separate Git repository. This entire process was run in the CI environment every two hours for about two weeks. Afterwards, another part of the script was responsible for synchronizing the results repository and shipping the data to our in-house databases, where it could be analyzed.
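The actual morphology analysis relied on the CLoC CLI for line counts plus regular expressions; the sketch below merely illustrates the kind of regex-based counting involved, with simplified patterns and a hypothetical source root.

```kotlin
import java.io.File

// Rough illustration of regex-based morphology counting; these patterns are
// simplifications and the real analysis used the CLoC CLI for line counts.
private val classRegex = Regex("""^\s*(?:public\s+)?(?:data\s+|abstract\s+|final\s+)?class\s+\w+""")
private val interfaceRegex = Regex("""^\s*(?:public\s+)?interface\s+\w+""")

fun main() {
    var classes = 0
    var interfaces = 0
    var blankLines = 0

    File("generated-project/src") // hypothetical source root
        .walkTopDown()
        .filter { it.extension == "kt" || it.extension == "java" }
        .forEach { file ->
            file.forEachLine { line ->
                when {
                    line.isBlank() -> blankLines++
                    classRegex.containsMatchIn(line) -> classes++
                    interfaceRegex.containsMatchIn(line) -> interfaces++
                }
            }
        }

    println("classes=$classes interfaces=$interfaces blankLines=$blankLines")
}
```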
