Stay up to date with the latest from Uber Engineering

Automating Kerberos Keytab Rotation at Uber

June 18 / Global

Introduction
We previously published a blog post about how we scaled adoption of MIT Kerberos™ at Uber. We built an automation system called KDP (Keytab Distribution Pipeline) that generates and distributes Kerberos keytabs (credentials) to systems that must authenticate with Kerberos.
With the help of this system, we drove adoption of Kerberos authentication for several critical use cases. Some of the key use cases include:
Fetching the search indexes that power Uber Eats search from Apache HDFS™
Enabling Apache Flink® security for several hundred streaming analytics applications
Authentication for all 250 Apache ZooKeeper™ clusters used by data infrastructure
Authentication for batch analytics infrastructure comprising 20+ systems
Growth in use cases drove up the number of keytabs to over 100,000 over 5 years. Due to the volume and security requirements of the use cases it supports, we chose to enable periodic rotation for Kerberos keytabs. 
Rotating over 100,000 Kerberos keytabs presents significant challenges, primarily due to two key factors:
Scale and complexity: Keytabs are distributed across thousands of applications and nodes, making manual rotation infeasible. There are no Kerberos APIs, open-source tools, or industry references for automating this at such scale, necessitating a custom solution. Rotating keytabs also requires coordination with the Kerberos KDC and synchronization with the client, since Kerberos is based on symmetric key cryptography.
Risk of service disruption: Keytab rotation invalidates the previous key version (kvno) immediately, so any delay in distributing the updated keytab can lead to failed authentication. For applications powering customer-facing features, this could result in service disruptions. Timing the rotation precisely is critical to avoid any downtime.
To address these challenges, we designed and built robust, failure-domain-aware automation that safely orchestrates keytab rotation without human intervention. In this blog, we’ll walk through the architecture and solutions that enabled us to achieve safe, scalable keytab rotation. Note that this blog assumes in-depth knowledge of Kerberos and its protocol.
Architecture
Keytabs are distributed to different systems (or workloads) using Uber’s internal SMP (Secret Management Platform). With 100,000 keytabs distributed to tens of thousands of workloads, it’s impractical to rotate keytabs manually. SMP provides a pluggable framework for enabling automatic rotation of secrets at scale.
SMP keeps track of all the secrets at Uber. For every secret, it records important metadata such as the last updated time, the secret type, and the secret provider that’s responsible for generating and deleting the secret. Additionally, SMP holds the rotation policy (N days) for every secret type. A component called the Secret Lifecycle Manager schedules Cadence workflows that call the appropriate Secret Provider as a secret gets closer to its rotation date.
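The scheduling decision described above can be sketched in a few lines of Python. This is a minimal illustration, not SMP's actual code: the `SecretRecord` fields, the `ROTATION_POLICY_DAYS` table, the 90-day period, and the 7-day lead time are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SecretRecord:
    """Hypothetical subset of the metadata tracked per secret."""
    name: str
    secret_type: str
    provider: str
    last_updated: datetime


# Assumed rotation policy: secret type -> rotation period in days (the "N days").
ROTATION_POLICY_DAYS = {"kerberos_keytab": 90}


def due_for_rotation(secret, now, lead_time=timedelta(days=7)):
    """Return True once the secret is within lead_time of its rotation date."""
    period = timedelta(days=ROTATION_POLICY_DAYS[secret.secret_type])
    rotation_date = secret.last_updated + period
    return now >= rotation_date - lead_time


keytab = SecretRecord(
    name="example-keytab",
    secret_type="kerberos_keytab",
    provider="kerberos-bridge",
    last_updated=datetime(2025, 1, 1),
)
print(due_for_rotation(keytab, now=datetime(2025, 2, 1)))   # well before the rotation date
print(due_for_rotation(keytab, now=datetime(2025, 3, 28)))  # inside the lead-time window
```

In a real system a scheduler would run a check like this over all secrets and start a rotation workflow for each one that is due.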

Figure 1: Keytab architecture.
In the context of Kerberos keytabs, there are two main components within KDP that work together to manage the life cycle of keytabs. The existing Kerberos Bridge was modified to implement the Secret Provider interface. With this, Kerberos Bridge can generate, rotate, and delete keytabs. Kerberos Bridge receives requests from the Secret Lifecycle Manager to rotate keytabs.
An on-host component called Keytab Manager resides alongside the Kerberos KDC (Key Distribution Center). This component is responsible for issuing commands to change passwords and generate new keytabs as requested. KDP keeps the internal state of keytabs issued for different principals and how they are consumed by different workloads. The Keytab Manager is responsible for uploading the latest keytab to the Secret Store. Workloads that need to authenticate using keytabs are configured to fetch keytabs periodically from the Secret Store.
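To make the Secret Provider integration more concrete, here is a hedged sketch of what such an interface could look like. The post only states that the provider can generate, rotate, and delete keytabs, so the method names, signatures, and the `InMemoryKerberosBridge` class are hypothetical. The toy implementation tracks only the current kvno per principal and does not talk to a KDC.

```python
from typing import Protocol


class SecretProvider(Protocol):
    """Hypothetical shape of a pluggable secret provider."""

    def generate(self, principal: str) -> int: ...
    def rotate(self, principal: str) -> int: ...
    def delete(self, principal: str) -> None: ...


class InMemoryKerberosBridge:
    """Toy provider: stores the current kvno per principal in a dict."""

    def __init__(self):
        self.kvnos = {}

    def generate(self, principal):
        # A newly created principal starts at kvno 1.
        self.kvnos[principal] = 1
        return 1

    def rotate(self, principal):
        # Rotating the key bumps the key version number.
        self.kvnos[principal] += 1
        return self.kvnos[principal]

    def delete(self, principal):
        del self.kvnos[principal]


def run_rotation(provider: SecretProvider, principal: str) -> int:
    """What a lifecycle workflow might do when a secret is due."""
    return provider.rotate(principal)


bridge = InMemoryKerberosBridge()
bridge.generate("app/host-1@EXAMPLE.COM")
print(run_rotation(bridge, "app/host-1@EXAMPLE.COM"))
```

Keeping the lifecycle logic behind a small interface like this is what lets one scheduler drive rotation for many secret types.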
With this integration between SMP and KDP, we were able to scale out automatic rotation of keytabs without any human involvement. The next set of challenges we had to solve were around ensuring automation does the work safely without disrupting our systems.
Minimizing Auth Failures During Rotation
We examined the Kerberos protocol to identify where failures could occur if a keytab on either the client or the server side were rotated while a Kerberos request was in progress. Here’s how a kvno mismatch during keytab rotation affects each exchange:
AS_REQ: A client authenticating with the KDC to get a TGT can fail if the client is using an older keytab. This can happen when the client’s password has been changed on the KDC side (kvno 2), but the client still uses the old keytab (kvno 1) to make the AS_REQ.
TGS_REQ: The keytab isn’t involved in this part of the protocol, so it’s not a failure point for keytab rotation.
AP_REQ: A client’s request to a server can fail if the client presents a service ticket that the KDC encrypted with an older server kvno. This can happen when the KDC issued the service ticket under the server’s kvno 1, but the server’s password has since been rotated and its keytab only contains kvno 2, so the server can’t decrypt the ticket.

Figure 2: Kerberos protocol.
We’ll go into details about the problem for the two possible failure cases and describe how we minimized the chances of failure.

During AS_REQ
To implement Keytab rotation safely with the above architecture, we put together a timeline chart to understand the sequence of events that would happen during rotation.

Figure 3: Timeline of keytab rotation during AS_REQ.
The timeline steps were: 
t0: Keytab 1 is generated with kvno 1.
t1: The application/client uses keytab 1 to obtain TGT 1. The TGT is valid for 24 hours.
t2: SMP invokes KDP to rotate the keytab. KDP rotates the principal’s key to kvno 2, generates a new keytab 2 with kvno 2, and uploads it to the Secret Store.
t3: The application fetches keytab 2 from the Secret Store.
t4: The application uses keytab 2 to obtain TGT 2. This typically happens when TGT 1 is close to expiry.
Between t2 and t3, the application can’t obtain a new TGT, as keytab 1 is invalidated and keytab 2 isn’t yet available to the application. If t2 < t4 < t3, the application will experience authentication errors and possible downtime. To avoid this situation, we had to:
Ensure TGT 1 has sufficient remaining validity before rotating the keytab. We use Kerberos audit logs to identify the principal’s last login time and estimate how much validity TGT 1 has left on the application side before performing the rotation. If TGT 1 is close to expiration, we don’t rotate the keytab and instead wait for the next login before rotating it.
Minimize the time between t2 and t3 by making the application fetch the latest keytab as soon as it’s available. The application periodically fetches the keytab from the Secret Store based on a cron schedule. We have fine-tuned the schedule so that the latest keytab is fetched every 30 seconds.
We implemented this logic in the automation code to minimize the chance of authentication failures. For certain highly available applications where authentication failure was unacceptable, we worked with the application owner to provide two different keytabs with different principals, and ensured that only one of them is rotated by our automation. This gave the application a fallback: retry authentication with the second keytab if authentication fails with the first one.
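The pre-rotation check can be illustrated with a small Python sketch. The 24-hour TGT lifetime and the 30-second fetch interval come from the text above. The one-hour `SAFETY_MARGIN` and the function names are assumptions for this example, and `last_login` stands in for the value read from the Kerberos audit logs.

```python
from datetime import datetime, timedelta

TGT_LIFETIME = timedelta(hours=24)      # from the timeline: a TGT is valid for 24 hours
FETCH_INTERVAL = timedelta(seconds=30)  # from the text: keytabs are fetched every 30 seconds
SAFETY_MARGIN = timedelta(hours=1)      # assumed extra buffer, not from the original post


def remaining_tgt_validity(last_login, now):
    """Estimate how long the TGT obtained at last_login remains valid."""
    return max(last_login + TGT_LIFETIME - now, timedelta(0))


def safe_to_rotate(last_login, now):
    """Rotate only if the current TGT outlives the t2..t3 gap plus a margin."""
    return remaining_tgt_validity(last_login, now) > FETCH_INTERVAL + SAFETY_MARGIN


last_login = datetime(2025, 6, 1, 0, 0)
print(safe_to_rotate(last_login, now=datetime(2025, 6, 1, 6, 0)))    # 18 hours of validity left
print(safe_to_rotate(last_login, now=datetime(2025, 6, 1, 23, 30)))  # 30 minutes of validity left
```

When the check returns False, the automation would skip the keytab for now and revisit it after the next login appears in the audit logs.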

During AP_REQ
In Uber’s environment, the HDFS NameNode, YARN ResourceManager, and Presto® Coordinator are examples of components that act as servers in the Kerberos context. HDFS clients connect to the NameNode to read files, YARN clients connect to the ResourceManager to submit jobs, and Presto clients connect to the Coordinator to submit queries.
A client obtains a service ticket from the KDC before it connects to the server. The service ticket is encrypted with the server’s key. During rotation, the client may have received and cached a service ticket encrypted with server kvno 1, while the server’s keytab may contain only kvno 2 by the time it receives the request. This leads to failures, because the server can’t use its kvno 2 key to decrypt a service ticket that was encrypted with kvno 1.
To avoid this failure on the server side, we ensure that the server keytab has both previous and current kvno (illustrated by a yellow bar in Figure 4), so it can try both keys to decrypt service tickets during the transition phase.

Figure 4: Timeline of keytab rotation during AP_REQ.
The timeline steps were: 
t0: Keytab 1 is generated with kvno 1 for the server (for example, an HDFS NameNode hdfs/us-west-123@DATA.UBER.COM).
t1: The application (like an HDFS client) obtains a service ticket 1 to the server encrypted with kvno 1.
t2: SMP requests to rotate the server keytab.
In KDC, a new key kvno 2 is generated for the server.
Keytab 2 is generated and contains both kvno 1 and kvno 2. It’s uploaded to the Secret Store and fetched by the server.
t3: The application sends service ticket 1 to the server and the server accepts it. The server knows about both kvno 1 and kvno 2 at this time.
t4: Service ticket 1 expires.
t5: The application obtains a service ticket 2 to the server encrypted with kvno 2.
t6: Keytab 3 is generated and contains kvno 2 only. It’s uploaded to the Secret Store and fetched by the server.
This is triggered 24 hours after t2. At this time, all service tickets issued before t2 and encrypted with kvno 1 have expired.
t7: The application sends service ticket 2 to the server and the server accepts it. The server only knows about kvno 2 at this time.
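The three keytab generations in this timeline can be modeled with a toy Python sketch. It only tracks which kvnos a server keytab contains. The key material is a placeholder string, and `ServerKeytab`, `rotate`, and `prune` are hypothetical names rather than KDP's API.

```python
from dataclasses import dataclass, field


@dataclass
class ServerKeytab:
    """Toy model of a keytab: maps kvno -> key material (placeholder strings)."""
    keys: dict = field(default_factory=dict)

    def can_decrypt(self, ticket_kvno):
        # The server can decrypt a ticket only if it holds the key for that kvno.
        return ticket_kvno in self.keys


def rotate(keytab, new_key):
    """t2: add the new kvno while keeping the previous one (keytab 2)."""
    new_kvno = max(keytab.keys) + 1
    return ServerKeytab({**keytab.keys, new_kvno: new_key})


def prune(keytab):
    """t6: keep only the newest kvno once older tickets have expired (keytab 3)."""
    newest = max(keytab.keys)
    return ServerKeytab({newest: keytab.keys[newest]})


keytab1 = ServerKeytab({1: "key-v1"})
keytab2 = rotate(keytab1, "key-v2")   # holds kvno 1 and kvno 2
keytab3 = prune(keytab2)              # holds kvno 2 only

print(keytab2.can_decrypt(1), keytab2.can_decrypt(2))  # t3: old and new tickets both accepted
print(keytab3.can_decrypt(1), keytab3.can_decrypt(2))  # t7: only kvno 2 tickets accepted
```

The 24-hour gap between the rotate and prune steps is what guarantees that no valid ticket still depends on kvno 1 when it is dropped.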
To distinguish between the two cases, we identified which applications behave as clients, as servers, or as both, and captured that in KDP’s metadata. This enabled us to automate the necessary logic to cover both cases described above.
Ensuring Safety of Automation
We took several measures to ensure the safety of automating keytab rotation. 
Minimizing Blast Radius
Some containerized applications ran on thousands of nodes using a single Kerberos principal/keytab. If the rotation failures described in the previous section were to occur for this single keytab, a significant part of the application, if not all of it, would very likely fail.
To prevent this scenario, we worked with the application owner to migrate from a single keytab to node-specific principals/keytabs. This increased the number of keytabs but enabled us to rotate them safely in smaller batches based on the application’s fault domains (described below). This limited any possible disruption to the application’s existing fault domains.
For critical applications, we introduced a fallback keytab derived from a different principal (other than the node-level principal) that the application can fall back to in case of failures. This two-keytab approach reduced the possibility of disruptions even further.
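On the application side, the two-keytab fallback amounts to a simple retry loop. The sketch below is an illustration under assumptions: `login_with_fallback`, `AuthenticationError`, and the keytab paths are hypothetical, and `fake_login` is a stand-in for whatever Kerberos login call the application really uses.

```python
class AuthenticationError(Exception):
    """Raised when a Kerberos login attempt fails."""


def login_with_fallback(login, keytab_paths):
    """Try each keytab in order and return the first successful login result."""
    errors = []
    for path in keytab_paths:
        try:
            return login(path)
        except AuthenticationError as exc:
            errors.append(f"{path}: {exc}")
    raise AuthenticationError("all keytabs failed: " + "; ".join(errors))


PRIMARY = "/secrets/node.keytab"       # hypothetical path; rotated automatically
FALLBACK = "/secrets/fallback.keytab"  # hypothetical path; different principal


def fake_login(keytab_path):
    # Simulates the window where the primary keytab is stale after a rotation.
    if keytab_path == PRIMARY:
        raise AuthenticationError("kvno mismatch")
    return f"TGT obtained with {keytab_path}"


print(login_with_fallback(fake_login, [PRIMARY, FALLBACK]))
```

Because the fallback principal differs from the node-level principal, a rotation of one keytab never invalidates the other.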

Introducing Rate Limits
The majority of systems that depend on Kerberos are managed by Odin, a Kubernetes®-like deployment system. Odin standardizes common concepts such as availability zones, clusters, and roles for nodes, and exposes this information and workload placement metadata in a queryable manner. 
We amended KDP to cache this metadata, particularly the information about fault domains. This enabled us to build a simple rate-limiting feature. The objective of this feature is to avoid rotating keytabs for more than N nodes of a particular role within a given application cluster at a time. For critical control plane roles (Figure 5), which are fewer (tens) in number, we applied more restrictive rate limits and time delays between rotations. For less-critical data plane roles, which are larger in number (tens of thousands), we applied less restrictive rate limits to rotate their keytabs at scale, as shown in Figure 6.

Figure 5: Rate limit config.
This simple feature took roughly 2 weeks to implement, yet it has been the most effective safeguard against the automation system causing accidental chaos for critical applications that rely on Kerberos when their keytabs get rotated.
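A rate limit of this kind can be approximated with a per-(cluster, role) counter. The following Python sketch is an assumption-laden illustration: the role names, the limit values, and the node records are invented for the example, and the time delays between rotations mentioned above are not modeled.

```python
from collections import defaultdict

# Assumed limits: max nodes of each role rotated per batch within one cluster.
RATE_LIMITS = {"namenode": 1, "datanode": 2}


def next_batch(pending, limits, default_limit=1):
    """Pick nodes to rotate now, at most limits[role] per (cluster, role)."""
    picked = []
    counts = defaultdict(int)
    for node in pending:
        group = (node["cluster"], node["role"])
        if counts[group] < limits.get(node["role"], default_limit):
            counts[group] += 1
            picked.append(node)
    return picked


pending = [
    {"name": "nn1", "cluster": "cluster-a", "role": "namenode"},
    {"name": "nn2", "cluster": "cluster-a", "role": "namenode"},
    {"name": "dn1", "cluster": "cluster-a", "role": "datanode"},
    {"name": "dn2", "cluster": "cluster-a", "role": "datanode"},
    {"name": "dn3", "cluster": "cluster-a", "role": "datanode"},
]
print([n["name"] for n in next_batch(pending, RATE_LIMITS)])  # ['nn1', 'dn1', 'dn2']
```

Nodes left out of a batch stay pending and are considered again in the next round, so a bad rotation can only affect a bounded slice of any role.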

Cluster-Based Allowlist for Safe Rollout
Each technology (like HDFS, YARN, or Presto) has many clusters of varying importance, each characterized by its tier. When rolling out keytab rotation for a technology, we used a cluster-based allowlist to enable gradual, controlled adoption. This approach enabled us to start with the least critical test clusters and progressively expand to more important ones. By doing so, we could learn from errors early on and refine our rollout strategy and rate limits before we enabled keytab rotation for critical tier clusters.

Figure 6: Allowlist and rate-limit config.
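The allowlist gate itself can be very small. The sketch below shows one possible shape, assuming a mapping from technology to the set of clusters where rotation is enabled. The cluster names and the `rotation_enabled` helper are hypothetical and do not reflect the actual config format in Figure 6.

```python
# Assumed allowlist: technology -> clusters where rotation has been enabled.
ROTATION_ALLOWLIST = {
    "hdfs": {"hdfs-test-1", "hdfs-staging-1"},
    "presto": {"presto-test-1"},
}


def rotation_enabled(technology, cluster, allowlist=ROTATION_ALLOWLIST):
    """Return True only if the cluster has been explicitly opted in."""
    return cluster in allowlist.get(technology, set())


candidates = [
    ("hdfs", "hdfs-test-1"),
    ("hdfs", "hdfs-prod-1"),
    ("yarn", "yarn-test-1"),
]
for technology, cluster in candidates:
    print(technology, cluster, rotation_enabled(technology, cluster))
```

Defaulting to "not enabled" for unknown technologies and clusters means a rollout widens only when someone adds an entry.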
Enabling Automatic Deletion of Keytabs
With provisioning and rotation of keytabs automated, the last logical step was to enable automatic deletion of keytabs. 
To enable safe deletion of keytabs, we relied on some of the key features we built for rotation, like:
Kerberos audit logs to identify which principals and keytabs are safe for deletion. 
Rate limits to ensure any accidental deletion of keytabs in use doesn’t take down the entire application.
One pattern needs special handling: some servers use their keytab only to decrypt the service ticket in the AP_REQ sent by the client. Such a server never logs in with Kerberos (that is, there’s no AS_REQ from the server), so instead of its last login we look for TGS_REQ entries in which clients request service tickets for this server.
We look for a set of signals, such as the existence of the system that consumed the keytabs, the time since the last login, and the rate limit on the cluster, before proceeding to delete the keytab from the Secret Store and the associated principal from the KDC. With this, we enabled automatic deletion of keytabs.
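One plausible way to combine these signals is sketched below in Python. The post does not spell out the exact rule, so the logic, the 90-day `IDLE_THRESHOLD`, and the field names are assumptions. Here `last_activity` means the principal's last AS_REQ, or for server-only principals the last TGS_REQ requesting a ticket for it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

IDLE_THRESHOLD = timedelta(days=90)  # assumed idle period; not from the original post


@dataclass(frozen=True)
class DeletionSignals:
    consumer_exists: bool               # does the consuming system still exist?
    last_activity: Optional[datetime]   # last relevant audit-log entry, if any
    recent_cluster_deletions: int       # deletions already done in this cluster window
    cluster_deletion_limit: int         # rate limit for the cluster


def safe_to_delete(signals, now):
    """Delete only when the keytab looks unused and the rate limit allows it."""
    if signals.consumer_exists:
        return False
    if signals.last_activity is not None and now - signals.last_activity < IDLE_THRESHOLD:
        return False
    return signals.recent_cluster_deletions < signals.cluster_deletion_limit


now = datetime(2025, 6, 1)
idle = DeletionSignals(False, datetime(2025, 1, 1), 0, 5)
recently_used = DeletionSignals(False, datetime(2025, 5, 25), 0, 5)
print(safe_to_delete(idle, now), safe_to_delete(recently_used, now))
```

Requiring every signal to agree errs on the side of keeping a keytab, which is the cheaper mistake compared with deleting one that is still in use.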
Conclusion
Kerberos was originally introduced at Uber back in 2016 as we were scaling our Hadoop infrastructure to support analytics and machine learning use cases. Since then, we’ve scaled adoption of Kerberos to satisfy several critical use cases for Uber. We also built support for rotating all Kerberos keytabs automatically through integrations with the Secret Management Platform. Our metrics suggest that this system has rotated over 30,000 keytabs in a month at its peak.
However, it’s time to explore the next major step-function change for Kerberos at Uber. We’ve been busy evolving our architecture and systems to be cloud-native. We’ve also been investing in hardening and scaling Uber’s Workload Identity/PKI infrastructure. This opens up further opportunities to explore solutions such as PKINIT (introduced in RFC 4556), which can replace long-lived keytabs with certificate-based authentication. This brings us several benefits:
Eliminates the coordination overhead (due to symmetric cryptography) between the KDC and the client during keytab rotation. PKINIT is based on PKI (asymmetric cryptography), which increases the safety and reliability of rotation compared to keytabs.
Enables us to confidently reduce the certificate lifetime to be much shorter (hours) than the existing keytab rotation frequency (months), thereby improving the security posture of the systems that rely on Kerberos.
Aligns more closely with the rest of Uber’s modernization efforts, where we’re relying more and more on PKI and X.509 certificates for authentication.
In a future blog, we’ll describe key learnings from our experience modernizing authentication for data infrastructure.
Cover Photo Attribution: The cover photo was generated using OpenAI® ChatGPT.
Apache®, Hadoop®, HDFS™, Flink®, YARN, and ZooKeeper® are trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by Apache Software Foundation is implied by the use of these marks.
Kerberos™ is a trademark of the Massachusetts Institute of Technology (MIT) in the United States and/or other countries. No endorsement by MIT is implied by the use of these marks.
Kubernetes® is a registered trademark of The Linux Foundation. No endorsement by The Linux Foundation is implied by the use of these marks.
Presto® is a registered trademark of LF Projects, LLC. No endorsement by LF Projects, LLC is implied by the use of these marks.

Junyan Guo

Junyan Guo is a Senior Software Engineer on the Data Security team at Uber. He currently leads development on Uber’s Kerberos infrastructure and also works on AI Security and Compliance at Uber.

Matt Mathew

Matt is a Sr. Staff Engineer on the Engineering Security team at Uber. He currently works on various projects in the security domain. Previously, he led the initiative to containerize and automate Data infrastructure at Uber.

Posted by Junyan Guo, Matt Mathew

Category:

Engineering

Backend

Security