Introduction
At Uber, we have been operating an Apache Hadoop®-based data analytics platform since 2015. As adoption picked up exponentially in 2016, we decided to secure our platform with Kerberos authentication. Since then, Kerberos has become a critical component of our security infrastructure, supporting not only Uber’s Hadoop ecosystem but also other services that are considered mission-critical to our tech stack.
Uber has a large and diverse tech stack deployed on thousands of machines and storing more than 300 PB of analytics data. Systems using Kerberos authentication include YARN, HDFS, Apache Hive™, Apache Kafka®, Apache Zookeeper™, Presto®, Apache Spark™, Apache Flink®, and Apache Pinot™, to name a few. Besides open-source systems, many of the internally developed platforms and services also use Kerberos for authentication. All these services need to securely communicate with each other, as well as provide secure access for all their user clients.
In this blog post, we’ll share some of the key challenges and the solutions that enabled us to scale adoption of Kerberos for authentication at Uber.
Kerberos
Before we go into describing what steps we have taken to scale Kerberos at Uber, let’s take a high level view of Kerberos.
Kerberos is a computer network authentication protocol. It is named after Cerberus, the three-headed watchdog of Hades in Greek mythology. The protocol was first developed at MIT in the early 1980s, and the current version release (v5) dates to 1993. Though it’s quite an old protocol, its advantages have allowed Kerberos to meet distributed system requirements without any major changes since the original design.
As the Kerberos name suggests, the protocol has 3 core actors:
- Key Distribution Center (KDC): the Kerberos server that stores principals (users or services) with their credentials and allows them to prove their identity to each other
- Client: a process run by a user or service that needs access to some resource
- Resource Server: a server that provides a resource to network clients
The protocol works by exchanging tickets between the actors:
- Each Kerberos network service authenticates with the KDC, usually during startup, and keeps its authentication ticket valid while running
- If a Client needs to access some network resource protected by Kerberos, it first logs in with the KDC and is granted an expiring authentication ticket (TGT, the ticket-granting ticket). The TGT allows the client to request access to any resource without re-supplying its credentials (SSO, more about it later)
- With this ticket, the Client requests a service ticket from the KDC for a specific Resource Server
- Now the Client can authenticate with the Resource Server by presenting the service ticket
One important thing worth mentioning is that Kerberos allows its clients to authenticate using either of two methods: with a password or with a secure key file called a ‘keytab’.
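To make the flow concrete, below is a minimal client-side sketch in Java using the JDK’s built-in JAAS and GSS-API support. It assumes the host’s krb5.conf points at a reachable KDC, and the principal names and keytab path are placeholders only:

```java
import java.security.PrivilegedExceptionAction;
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosTicket;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;
import javax.security.auth.login.LoginContext;
import org.ietf.jgss.GSSContext;
import org.ietf.jgss.GSSManager;
import org.ietf.jgss.GSSName;
import org.ietf.jgss.Oid;

public class KerberosClientSketch {
  public static void main(String[] args) throws Exception {
    // Step 1: log in to the KDC with a keytab (or a password) and obtain a TGT.
    Map<String, String> options = new HashMap<>();
    options.put("useKeyTab", "true");
    options.put("keyTab", "/path/to/uber_eats.keytab");  // placeholder path
    options.put("principal", "uber_eats@DATA.UBER.COM"); // placeholder principal
    options.put("storeKey", "true");
    options.put("doNotPrompt", "true");

    Configuration jaasConfig = new Configuration() {
      @Override
      public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
        return new AppConfigurationEntry[] {
          new AppConfigurationEntry(
              "com.sun.security.auth.module.Krb5LoginModule",
              AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
              options)
        };
      }
    };

    LoginContext login = new LoginContext("kerberos-client", null, null, jaasConfig);
    login.login(); // talks to the KDC; the Subject now holds the TGT
    Subject subject = login.getSubject();
    subject.getPrivateCredentials(KerberosTicket.class)
        .forEach(t -> System.out.println("Ticket for: " + t.getServer()));

    // Steps 2-3: with the TGT, request a service ticket for a specific Resource
    // Server and build the initial authentication token to send to it.
    Subject.doAs(subject, (PrivilegedExceptionAction<Void>) () -> {
      GSSManager manager = GSSManager.getInstance();
      Oid krb5 = new Oid("1.2.840.113554.1.2.2"); // Kerberos v5 mechanism OID
      GSSName server = manager.createName(
          "hdfs/us-west-123.internal@DATA.UBER.COM", GSSName.NT_USER_NAME);
      GSSContext context = manager.createContext(
          server, krb5, null, GSSContext.DEFAULT_LIFETIME);
      byte[] token = context.initSecContext(new byte[0], 0, 0);
      // `token` is what the client sends to the Resource Server, which verifies
      // it with its own keytab (and can reply for mutual authentication).
      System.out.println("Initial GSS token: " + token.length + " bytes");
      return null;
    });
  }
}
```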
Kerberos has some important advantages that allowed it to be widely adopted by various systems, which is why we use it at Uber:
- Hadoop ecosystem (used heavily at Uber) natively supports Kerberos-based authentication
- Credentials are never sent over the network unencrypted
- Tickets have a limited lifespan, so after a ticket expires a client has to authenticate with the KDC again to get a new one
- It’s a Single Sign-On (SSO) protocol, meaning that a client needs to authenticate to KDC once to be able to access any Kerberos-protected resource later while the ticket is valid
- All clients and servers/services are mutually authenticated
- Tickets can be bound to host information, so that one compromised host does not affect security on other hosts in the network
- We can build a hierarchy of independent Kerberos realms (domains) by setting up trust relationships between them
- The system is proven, reliable, and has a good track record of making infrastructures secure
Some specific reasons why Hadoop chose Kerberos as its main authentication protocol can be found in this paper. For more details about Kerberos, refer to RFC 4120 or Wikipedia.
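That native support in the Hadoop ecosystem means a client only needs a principal and a keytab to participate. A minimal sketch using Hadoop’s standard UserGroupInformation API (the principal and keytab path are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class HadoopKerberosLogin {
  public static void main(String[] args) throws IOException {
    // Enable Kerberos authentication for this Hadoop client.
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Log in with a keytab; subsequent HDFS/YARN/Hive calls made in this
    // process are authenticated as this principal.
    UserGroupInformation.loginUserFromKeytab(
        "uber_eats@DATA.UBER.COM", "/path/to/uber_eats.keytab");
    System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
  }
}
```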
Standardizing Account Types
Authentication starts with identities. As Uber went through a hyper growth stage, the number of employees and teams leveraging the Data platform on a daily basis grew rapidly. Without a proper process in place, it became hard for us to guide users across various departments and job functions (operations, data scientists, analysts, engineers) to leverage the appropriate accounts in the right way to accomplish their tasks.
The authentication process involves two parties: client and server. Whereas a user is always a client, a service can act as either party. We had many different use cases and account types, which required some standardization to support growth, so we started by establishing the following nomenclature to communicate best practices to our users:
| Account Type | Description |
| --- | --- |
| User Account | For personnel (employee) access and ad hoc data exploration. Example: employee-account@uber.com |
| Service Account | For running daily pipelines, which the team will keep maintaining even when the original pipeline author leaves the company. A service account is backed by a Linux group of the same name so that project members can be added to the group and gain access to the data written by the service account. Example: uber_eats@DATA.UBER.COM |
| System Account | For systems that make up the Data infrastructure stack. Unlike service accounts, no group members are added here, to avoid any security risks. Example: hdfs@DATA.UBER.COM |
| Service-Host Account | Scopes the principal and keytab credential to a specific {service, host} combination. In the event that the keytab is compromised, we would need to rotate only that specific {service, host} keytab, instead of rotating the service keytab across several hosts (in contrast to a Service Account). This is typically used by services that are deployed across a large fleet of hosts. Example: hdfs/us-west-123.internal@DATA.UBER.COM |
Besides defining accounts and describing them in wikis, we built self-serve capabilities for creating service accounts and using them in access control policies. While onboarding new datasets or ETL jobs, if a service is new or does not yet have an account, the owning team is redirected to the self-serve account creation process. Newly created accounts are made available for use within 24 hours.
We also ensure that a corresponding keytab is automatically generated and distributed to edge hosts with the appropriate permissions.
Accounts Propagation
Uber relies on Microsoft Active Directory (“Uber AD”) for user and group management. Uber AD is the source of truth for personnel accounts and group membership information. As we mentioned earlier, service and system accounts are represented as groups in AD with specific properties to distinguish them from regular user groups.
It is typical within a company to have multiple Kerberos realms for separation of concerns. The Data org, as an internal organization inside Uber, decided to have a dedicated Kerberos setup: the MIT Kerberos implementation, backed by an OpenLDAP® database. Data Kerberos stores principals for hosts, service accounts, and system accounts. Uber’s personnel accounts (employee@uber.com) can authenticate with Kerberos to access Data resources, facilitated by the one-way trust with Uber AD (a different realm).
Our Data services can be deployed either in Docker containers or as regular OS processes on bare-metal hosts. To reduce the load on Active Directory, we implemented a user and group sync process to propagate users and their group memberships to the entire Data stack.
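The sync service itself is internal, but as a rough illustration, reading group memberships out of AD over LDAP can be sketched as below; the endpoint, bind account, and base DN are placeholders, not our actual values:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class AdGroupSyncSketch {
  public static void main(String[] args) throws Exception {
    Hashtable<String, String> env = new Hashtable<>();
    env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.PROVIDER_URL, "ldaps://ad.example.internal:636");      // placeholder endpoint
    env.put(Context.SECURITY_AUTHENTICATION, "simple");
    env.put(Context.SECURITY_PRINCIPAL, "svc-ad-sync@example.internal");   // placeholder bind account
    env.put(Context.SECURITY_CREDENTIALS, System.getenv("AD_SYNC_PASSWORD"));

    DirContext ctx = new InitialDirContext(env);
    SearchControls controls = new SearchControls();
    controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
    controls.setReturningAttributes(new String[] {"sAMAccountName", "member"});

    // Pull every group and its members; a real sync would materialize these as
    // local users/groups (or a cache) on the Data hosts instead of printing them.
    NamingEnumeration<SearchResult> results =
        ctx.search("DC=example,DC=internal", "(objectClass=group)", controls); // placeholder base DN
    while (results.hasMore()) {
      SearchResult group = results.next();
      Attribute members = group.getAttributes().get("member");
      System.out.println(group.getAttributes().get("sAMAccountName") + " -> "
          + (members == null ? 0 : members.size()) + " members");
    }
    ctx.close();
  }
}
```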
This setup for accounts and their propagation laid the foundation for employees and services to better leverage our authentication and authorization infrastructure.
Kerberos Architecture
Currently, our Kerberos infrastructure provides authentication for over 100,000 principals in total. To support that request volume and maintain low latencies, we deploy multiple instances of Kerberos servers in all regions.
While Kerberos servers can be deployed in various configurations, we chose a multi-provider setup for high availability: all provider nodes act independently, contain the same information, and can be easily swapped with each other.
Each provider forms an independent cluster together with multiple KDC consumer nodes. All the data from a provider is replicated to its cluster’s consumer nodes. Kerberos supports multiple approaches to set up replication between its nodes: it can be done by the KDC itself or by the backend database. We use OpenLDAP as the database for the KDC, so all the Kerberos replication relies on OpenLDAP. To replicate the data, only consumer nodes need to know about a provider node, which lets us easily add or remove consumer nodes from a cluster. Promoting a consumer node to a provider is also a relatively easy and quick operation in this setup.
To avoid write conflicts between provider nodes, we run only a single provider in read-write mode at any point in time, while all other provider nodes are read-only.
Each cluster has its consumer nodes added as DNS A records under the same region-based DNS name (kdc.region-1.data.uber, kdc.region-2.data.uber, etc.), so all the clients authenticate to Kerberos using short, well-known DNS names instead of specific host names. Meanwhile, the provider nodes are hidden from the clients and are only accessible within the Kerberos infrastructure services.
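For example, a JVM-based client can point its Kerberos library at the regional alias through standard system properties (non-JVM clients would put the same names in krb5.conf); the realm and alias below reuse the example names above:

```java
public class KerberosClientConfig {
  public static void main(String[] args) {
    // Resolve the KDC through the regional DNS alias rather than a specific
    // consumer host; any consumer node behind the alias can serve the request.
    System.setProperty("java.security.krb5.realm", "DATA.UBER.COM");
    System.setProperty("java.security.krb5.kdc", "kdc.region-1.data.uber");
  }
}
```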
The regional clusters, with a multi-provider setup and tens of consumer nodes, enabled us to scale Kerberos to satisfy the request volume and increase adoption. In the next section, we will describe the supporting infrastructure around this Kerberos deployment.
Keytab Distribution Pipeline
Overview
When Kerberos was set up back in 2016, a new operational task that the team had to take on was registering new principals and distributing keytabs (files with principal credentials) securely to the requesting team. Our initial approach involved manually logging on to Kerberos kadmin hosts to register accounts and generate keytabs, then scp-ing them to a shared location predetermined with the team. This approach had multiple problems:
- The number of ad hoc requests to generate new principals and keytabs introduced more work for a team already overwhelmed with production incidents and infrastructure scalability issues
- The manual provisioning of principals and keytabs stood in the way of the vision of achieving fully automated deployments of services
- Credentials were passed from hand to hand without any strict, well-defined process, creating security issues
- Each manual interaction with kadmind posed a risk of accidentally deleting or corrupting production data
In a previous blog, we described how we did a major makeover, containerizing the entire Hadoop infrastructure stack, to enable us to scale and operate Hadoop. As part of the larger company-wide effort to achieve fully automated infrastructure, we automated the generation and distribution of keytabs using a system that we internally call the Keytab Distribution Pipeline.
The Keytab Distribution Pipeline (KDP) is primarily responsible for the following:
- Observing predefined sources for changes in metadata, which is then used as input to create/delete Kerberos principals
- Creating and deleting keytabs corresponding to the principals from above and ensuring that they are made available in a secure vault
- Providing and maintaining a client that can securely fetch the keytabs from the secure vault
We invested heavily in simplifying the KDP onboarding process for any new service: the work that a service owner team needs to do involves defining a config (~3 lines) and adding a client call in the service codebase (another ~3 lines). The ease of onboarding to KDP and the automated management of keytabs contributed heavily to the growing adoption of Kerberos.
There are 2 major categories of use cases for keytabs:
- Services that use service or system accounts to run automated pipelines or jobs
- Large-fleet services (HDFS, YARN, etc.) that use service-host principals (e.g., hdfs/us-west-123) for authentication when nodes are added to service clusters
Focusing on ease of onboarding for the above-mentioned use cases, we identified and integrated KDP automation with 3 systems:
Kerberos-Bridge
The first component in the pipeline is the Kerberos-Bridge service (a “bridge” between the Kerberos infrastructure and external metadata providers). It periodically fetches service and host metadata from the metadata providers and, based on it, updates the corresponding “keytab entries” in an internal Git repository. Keytab entries are not actual Kerberos keytabs yet; they are metadata for the keytabs, containing the principal name, the owner information used to set correct permissions for the keytab, and some other supporting information. Choosing Git as the internal repository allows us to use its built-in features, such as auditing through Git history, rollbacks with Git reverts, access control, and peer reviews when manual changes to the entries are needed.
Keytab-Manager
Now we have reached the core part of the pipeline: the Kerberos provider host (provider hosts are hidden from the clients, and only KDP can interact with them). We run the following processes on the host: KDC (used for authentication and not relevant in this pipeline), KDC Admin (kadmin, the client used to administer Kerberos principals and keytabs), OpenLDAP, and our Uber-internal Keytab-Manager service.
Keytab-Manager is the glue between the open-source KDC/OpenLDAP systems and our internal Uber services. Similar to Kerberos-Bridge, it monitors the Keytab Entries Repository for any new changes. Once Keytab-Manager sees a change in the repository (e.g., a new keytab entry `hdfs/us-west-123@DATA.UBER.COM` was added, meaning the HDFS service was deployed on a new host, us-west-123), it creates a new Kerberos principal with the same name by calling the local kadmin process, which is backed by the OpenLDAP database. After the principal is created, the service generates a corresponding keytab file (hdfs_us-west-123.keytab).
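Keytab-Manager itself is internal to Uber, so the sketch below only illustrates the underlying kadmin operations described above, under the assumption that they are driven by shelling out to kadmin.local on the provider host:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class KeytabManagerSketch {

  /** Creates the principal in the LDAP-backed KDC database and exports its keys. */
  static void createPrincipalAndKeytab(String principal, Path keytabFile)
      throws IOException, InterruptedException {
    // Register the principal with a random key (no password ever leaves the KDC).
    run(List.of("kadmin.local", "-q", "addprinc -randkey " + principal));
    // Export the principal's keys into a keytab file for distribution.
    run(List.of("kadmin.local", "-q", "ktadd -k " + keytabFile + " " + principal));
  }

  private static void run(List<String> command) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(command).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("Command failed: " + command);
    }
  }

  public static void main(String[] args) throws Exception {
    createPrincipalAndKeytab(
        "hdfs/us-west-123@DATA.UBER.COM", Path.of("/tmp/hdfs_us-west-123.keytab"));
  }
}
```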
The keytab file is securely stored in the Secrets Store with appropriately granular permissions, which are later enforced when keytabs are retrieved. The Keytab-Manager service also monitors the Secrets Store, updating it with any new or removed keytab files and setting their permissions based on the owners in the corresponding keytab entries.
Keytab-Fetcher
The final component in our pipeline is Keytab-Fetcher, a library that any service can integrate into its codebase to fetch its keytab from the Secrets Store. The orchestration system agents (deployed on the entire Uber fleet) are responsible for leveraging the Keytab-Fetcher to download the keytab(s) and make them available to the service containers.
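The fetcher’s actual API is internal and not described in this post; purely as an illustration, its shape might resemble the hypothetical interface below (all names here are ours, not the real library’s):

```java
import java.nio.file.Path;

/**
 * Hypothetical sketch of a keytab fetcher client; the real Uber library, its
 * name, and its API are internal and may look nothing like this.
 */
public interface KeytabFetcher {

  /**
   * Authenticates to the Secrets Store using the workload's SPIRE-issued
   * identity (see below), downloads the keytab the caller is permitted to
   * read, and writes it to the destination with restrictive file permissions.
   */
  Path fetchKeytab(String account, Path destination) throws Exception;
}
```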
The entire KDP is secured by SPIRE, which allows services to authenticate before updating and retrieving secrets from the Secrets Store. The Secrets Store then only allows a service to access keytabs that it is permitted to read.
KDP has been in production for over 2 years at Uber, generating and distributing over 1,000 keytabs per week. The high degree of automation freed the team from being involved in registering principals and distributing keytabs, and it also enables us to rotate keytabs as needed. We spend a minimal amount of time monitoring KDP dashboards to ensure the entire automation system is working as expected.
Concluding Thoughts
The architecture we have chosen for Kerberos and the automation we have built with the Keytab Distribution Pipeline improve developer productivity, the security of our Data stack, and reliability (as everything is now automated), and enable us to easily scale authentication for Data services.
In our next blog post, we intend to showcase how we manage rotation of keytabs across the fleet through a centralized automated system.
Apache®, Apache Hadoop, Hadoop®, Apache Hive™, Hive™, Apache Kafka®, Kafka®, Apache Zookeeper™, Zookeeper™, Apache Spark™, Spark™, Apache Flink®, Flink®, Apache Pinot™, and Pinot™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
Alexander Gulko
Alexander is a Staff Software Engineer working on the Data Security team based in Seattle, WA. While the team is responsible for all aspects of data security and compliance, he primarily focuses on Authentication and leads various initiatives in that area.
Matt Mathew
Matt is a Sr. Staff Engineer on the Engineering Security team at Uber. He currently works on various projects in the security domain. Previously, he led the initiative to containerize and automate Data infrastructure at Uber.
Posted by Alexander Gulko, Matt Mathew