
Containerizing the Beast – Hadoop NameNodes in Uber’s Infrastructure

January 26 / Global


There are several online references on how to run Apache Hadoop® (referred to as “Hadoop” in this article) in Docker for demo and test purposes. In a previous blog post, we described how we containerized Uber’s production Hadoop infrastructure spanning 21,000+ hosts. 

HDFS NameNode is the most performance-sensitive component within large multi-tenant Hadoop clusters. We had deferred NameNode containerization to the end of our containerization journey to leverage learnings from containerizing other Hadoop components. In this blog post, we’ll share our experience on how we containerized HDFS NameNodes and architected a zero-downtime migration for 32 clusters.


Following are the motivating factors that led us to containerize HDFS NameNodes at Uber:

  1. Consolidation on a single unit of deployment, Docker® containers: Across Uber, we were adopting Crane, Uber’s next-gen infra stack. As part of this effort, we standardized on containers as the unit of deployment. We had been maintaining a hybrid deployment model: all DataNodes were containerized, while HDFS control plane nodes (NameNodes, ZKFC, JournalNodes) were deployed using Debian packages on bare metal. Consolidating on Docker containers meant fewer build pipelines and a streamlined deployment process, thereby reducing the cognitive overhead on the team.
  2. Adoption of next-gen hardware: Our existing NameNode servers were reaching end of life and Uber’s HDFS cluster requirements had outgrown the existing servers’ capacity. We worked closely with our partners within Uber’s Metal engineering team to design and develop a server that satisfies the performance requirements for NameNodes. The new generation servers are compatible with Crane in new zones we had built out. Hence, adopting these new servers also meant standardizing on containers to be Crane compatible. 


As we worked on the initial phase of the NameNode containerization and hardware upgrade effort, we foresaw the following three core technical challenges:

  1. Service Discovery: HDFS is the storage layer for the entire batch analytics stack at Uber. Most clients access HDFS using a pair of hard-coded NameNode hostnames (in hdfs-site.xml). The variety of such clients (languages, versions, custom ones) across Uber, used by 750k+ daily analytical jobs/queries and thousands of services, made it cumbersome to reliably perform any migration that requires hostname and/or IP address changes. Any hostname/IP change risks breaking some set of clients due to unexpected behaviors. A smaller version of the service discovery problem exists among the different components within an HDFS cluster: NameNodes, JournalNodes, ZKFC, DataNodes, HDFS Routers, and ObserverNameNodes. We had to devise a method to standardize client access to HDFS and its components and make it resilient to host/IP changes of components.
  2. Addressing Performance Concerns: With 500M+ blocks per cluster, our NameNodes are configured with Java® heap sizes of over 250GB. Java inside Docker is notorious for memory management disparities, so containerizing the NameNode posed additional risks that could deteriorate cluster performance and its ability to serve traffic. We had to monitor heap usage and other vitals meticulously over the course of the migration, which led us to run many iterations of load testing and tune several settings to adhere to our SLA guarantees.
  3. Zero Downtime Migration: At Uber, we hold ourselves to a high standard for service availability during migrations. With SLAs for low-latency ingestion into the data lake (within 3-6 hours for certain datasets), we cannot afford prolonged downtime for production Hadoop clusters. Over the past 6 years, we have not taken downtime for any of the migrations or upgrades we have performed on Hadoop. To keep that record, we had to devise a migration strategy that avoids downtime and defines rollback criteria for fast mitigation. We will cover how we conceived the migration strategy in the latter half of this blog post.


The old setup involved bare-metal deployment of HDFS control plane components (NameNodes, JournalNodes, etc.). The deployment process allowed mutations on the host in all forms, including host configs, binaries, cluster configs, and automation scripts. This mutability often led to divergences across nodes/clusters and to manual errors. The added overhead of operations, on-call issues, and change tracking hindered our velocity in making changes to HDFS control planes and HDFS clients. It also delayed other efforts (such as OS and kernel upgrades) that require host-level upgrades. This experience significantly influenced our new architecture and migration strategy for HDFS control planes.

It was evident that the previous model of operating NameNodes was not going to fare well in the long run. We started the effort by conducting research, which involved studying existing metrics and running benchmarks and feasibility analyses. Our initial approach was to work independently (in parallel) on the NameNode hardware upgrade and its containerization. As our confidence grew during the prototype phase of the project, we decided to combine the efforts and attempt both in one single migration. To achieve this, we organized our work into the following three high-level parallel workstreams:

  1. Container Deployment
  2. Maintaining Service Discovery
  3. Load Testing for Scale

Container Deployment

Hadoop Worker

HDFS control plane components require a set of files and secrets, plus minimal housekeeping automation, to function. To meet these requirements, we developed a new program (also deployed as a container) called Hadoop Worker. The Hadoop Worker is among the first Docker containers deployed on the host after provisioning. It does the following:

  • The Hadoop Worker is aware of the host’s cluster attribution. It works with the Cluster Management system (described in this blog) to populate the ‘hosts files’ (i.e., dfs.hosts and dfs.hosts.exclude). The Worker periodically updates these files to assist the Cluster Management system in adding or decommissioning DataNodes from the cluster. 
  • It works with the Keytab Distribution system (described in this blog) to generate and set up the keytabs required by the HDFS control plane components. 
  • It performs a set of housekeeping tasks for HDFS clusters such as keeping backups of audit logs for compliance reasons, and backups of NameNode fsimage for disaster recovery and analytical purposes.
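The hosts-file bookkeeping described in the first bullet can be sketched as follows. This is a minimal illustration, not Uber’s actual Worker code: the function signature and file layout are assumptions, but the dfs.hosts / dfs.hosts.exclude semantics match stock HDFS, where a decommissioning DataNode must remain in dfs.hosts while also appearing in dfs.hosts.exclude.

```python
from pathlib import Path

def refresh_hosts_files(active, decommissioning, conf_dir):
    """Write dfs.hosts (DataNodes allowed to register) and dfs.hosts.exclude
    (DataNodes the NameNode should drain), one hostname per line."""
    conf = Path(conf_dir)
    # dfs.hosts must contain decommissioning nodes too, or the NameNode
    # treats them as dead instead of draining them gracefully.
    (conf / "dfs.hosts").write_text(
        "\n".join(sorted(active | decommissioning)) + "\n")
    (conf / "dfs.hosts.exclude").write_text(
        "\n".join(sorted(decommissioning)) + "\n")
```

A worker would invoke this periodically and then trigger a refresh (e.g., `hdfs dfsadmin -refreshNodes`) so the NameNode picks up the new membership.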

HDFS Docker Container

The HDFS control plane consists of a set of components: NameNode, ZKFC, and JournalNodes. Each component runs in its own Docker container, co-located across 3 hosts: host1: (nn + jn + zkfc), host2: (nn + jn + zkfc), host3: (jn). We reused the same Docker image that we had developed for DataNodes, tweaked so that the desired control plane component can be started in the foreground (instead of as a daemon) using appropriate Docker run command inputs. The Docker image consists of the Hadoop binary, all HDFS clusters’ configs, and a custom high-performance NSS library to resolve UserGroupInformation. Necessary environment variables are provided to the Docker run command to auto-configure the *-site.xml files at deployment time.
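The env-var-to-*-site.xml step might look like the sketch below. The `HDFS_CONF_` prefix and the underscores-to-dots mapping are illustrative assumptions (a convention common in community Hadoop images), not Uber’s exact mechanism:

```python
import os
from xml.sax.saxutils import escape

def render_site_xml(environ, prefix="HDFS_CONF_"):
    """Turn variables like HDFS_CONF_dfs_nameservices=prod into
    <property> entries; underscores map back to dots in property names."""
    props = []
    for key, value in sorted(environ.items()):
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].replace("_", ".")
        props.append(
            f"  <property>\n    <name>{escape(name)}</name>\n"
            f"    <value>{escape(value)}</value>\n  </property>"
        )
    return "<configuration>\n" + "\n".join(props) + "\n</configuration>\n"

# At container start, something like this would write hdfs-site.xml:
# open("/etc/hadoop/conf/hdfs-site.xml", "w").write(render_site_xml(os.environ))
```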

We made the containers other than the NameNode (including Hadoop Worker and sidecars) lightweight and capped their combined memory usage at roughly 5GB. This let us give the rest of the memory available on the host to the NameNode container. For example, the NameNode of our largest production cluster (675M+ blocks) is configured with a 320GB JVM heap on a host with 384GB of memory.

Figure 1: Container Deployment on HDFS NameNode Host

Sidecar Containers

Besides the HDFS containers, we also launch sidecar containers that perform auxiliary tasks. One sidecar, jmx-exporter (based on prometheus/jmx_exporter), parses data from the local HDFS containers’ JMX endpoints and exposes it as metrics to power our monitoring dashboards and alerting systems. Another sidecar, auditlog-streamer, tails audit logs from the NameNode and ships them off the host for real-time analytics.
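Conceptually, the jmx-exporter sidecar scrapes the NameNode’s HTTP /jmx endpoint (which returns JSON) and flattens selected beans into metric samples. The sketch below shows the idea; the bean and attribute names are real Hadoop RPC metrics, while the metric-naming scheme and port are illustrative assumptions:

```python
import json
from urllib.request import urlopen

def scrape_rpc_metrics(jmx_json,
                       wanted=("RpcQueueTimeNumOps", "RpcQueueTimeAvgTime")):
    """Extract RPC queue-time attributes from a parsed /jmx payload.
    Hadoop exposes these under beans named ...RpcActivityForPort<NNNN>."""
    samples = {}
    for bean in jmx_json.get("beans", []):
        if "RpcActivity" not in bean.get("name", ""):
            continue
        for attr in wanted:
            if attr in bean:
                samples[f"hadoop_rpc_{attr}"] = bean[attr]
    return samples

def scrape_namenode(host="localhost", port=9870):  # port is an assumption
    """Fetch and parse the live /jmx endpoint of a local NameNode."""
    with urlopen(f"http://{host}:{port}/jmx") as resp:
        return scrape_rpc_metrics(json.load(resp))
```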

To make the setup developer friendly, we developed a CLI tool (that wraps around Docker) to run common operational commands against HDFS control plane nodes.

Figure 2: Developer Friendly CLI

Maintaining Service Discovery

An HDFS cluster setup consists of the following major components: NameNodes (with ZKFC), JournalNodes, ObserverNameNodes, and DataNodes, besides the HDFS clients that access the cluster directly or through HDFS Routers. Each of these components discovers the others via DNS lookups. All of Uber’s HDFS clusters are secured through Kerberos, so the authentication flow also involves DNS lookups.

In the past few years, we faced issues on multiple occasions where changes to the hostname and/or IP address of a NameNode during migrations caused a large number of clients (that cached IP addresses) to fail, or caused authentication issues within the cluster. Learning from past experience, we wanted to make migrations of HDFS control plane nodes as seamless as possible. To achieve this, we started sketching out the DNS-based discovery performed by different HDFS components.

Figure 3: DNS based Discovery performed by different HDFS components

Based on this analysis, we devised two broad sets of changes:

1. Large Number of Clients accessing NameNodes (or ObserverNameNodes)

Components that have a large number of instances (clients and DataNodes) must use a single DNS record that resolves to multiple nodes for NameNode discovery. This ensures that the logic to find the Active NameNode works with any number of IP addresses resolvable through the DNS A record. It enables migration processes such as adding a third NameNode host before removing the existing Standby NameNode host in NameNode HA-enabled clusters. But it also meant changing the tens of thousands of DataNodes and HDFS clients spread across several applications and services of both batch and long-running nature.

We updated the HDFS client logic so that the client re-resolves the DNS A record when there are exceptions in resolving IP addresses, and ensured that Kerberos authentication works as expected. We also updated the DataNode logic so that DataNodes can discover and communicate with all NameNodes (including ObserverNameNodes) after resolving the DNS A record. The upgrade of DataNodes was within the team’s control.
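The client-side behavior can be sketched as below: resolve one DNS A record to all NameNode IPs, probe each for the Active, and re-resolve on a full miss so that newly added NameNode hosts are picked up without a config change. The probe is a placeholder; real HDFS clients detect the Active via the ClientProtocol RPC (a Standby responds with StandbyException).

```python
import socket

def resolve_namenodes(record):
    """Return every IPv4 address behind a single DNS A record."""
    infos = socket.getaddrinfo(record, 8020, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def find_active(record, probe, max_attempts=2):
    """Try each resolved IP; on a full miss, re-resolve and retry.
    probe(ip) is a placeholder returning True for the Active NameNode."""
    for _ in range(max_attempts):
        for ip in resolve_namenodes(record):
            if probe(ip):
                return ip
    raise RuntimeError(f"no active NameNode behind {record}")
```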

We standardized on supporting HDFS clients in 3 different languages (Java, Golang, and Python) at Uber. With the new patches to support DNS A record with multiple IP addresses, we released new client versions in all 3 languages. The vast majority of Uber’s codebase is hosted in language-specific monorepos. We leveraged the monorepos to update the majority of the clients to the new version. However, we had to invest in a company-wide effort spanning ~5 months to get all the service owners to confirm and roll out their services with the updated client. From this point on, any underlying host migrations for NameNodes are transparent to DataNodes or HDFS Clients.

2. Discovery among HDFS control plane components

Control plane components that interact with each other must not fail if the A record or IP address of an individual host changes. This is because any host within the Crane infrastructure is fungible: the service running on a host must be portable to another host with a different hostname (A record and IP).

We updated the logic where control plane components communicate with each other to honor DNS CNAMEs such that change in IP address or A record does not result in failures. The DNS CNAMEs are updated through configs as nodes are moved across hosts.

Figure 4: DNS records used for discovery among HDFS components
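The CNAME indirection can be sketched as follows. Components address peers by stable per-role names that are repointed when a node moves hosts, and resolve them at call time rather than caching an IP. The role names and the resolve hook here are illustrative assumptions:

```python
import socket

# Stable per-role CNAMEs, managed via configs (names are hypothetical).
ROLE_CNAMES = {
    "nn1": "nn1.prod-cluster.example.com",
    "nn2": "nn2.prod-cluster.example.com",
}

def peer_address(role, resolve=socket.gethostbyname):
    """Resolve a peer's stable CNAME at call time (no caching), so a
    repointed CNAME takes effect on the next connection attempt."""
    return resolve(ROLE_CNAMES[role])
```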

We rolled out these two big changes across all clusters and applications as part of the preparatory steps before performing the actual migration of HDFS control planes (NameNodes, JournalNodes, etc.). This gave us the flexibility to proceed with the migration without any downtime and interruptions for HDFS.

Load Testing for Scale

HDFS NameNodes see a total of 50k operations (RPC requests) per second per cluster, or 500k+ per second per region (~10 production clusters per region). With a major migration on the horizon that involved moving NameNodes to new servers and into containers, we conducted rigorous load testing across various operational scenarios, including NameNode failover, DataNode decommission, etc. We chose the open-source performance testing tool Dynamometer, in conjunction with Uber’s internal load generation tool, HailStorm. The setup involved the following:

  • Dynamometer to replay production traffic on a test NameNode. The production traffic pattern was sourced from audit logs of NameNodes in existing production clusters.
  • HailStorm was set up with a custom multi-process Load Generator, which can generate arbitrary load with high concurrency, as if it were coming from several hundred distinct users of Hadoop. 
  • Test NameNode deployed using a Docker container on the new server, hooked up to the monitoring system that we use for production NameNodes.

Figure 5: Load Testing Setup

This test setup gave us enough flexibility to tune parameters along the following three dimensions: 

  1. Load profile based on different types of HDFS clusters we maintain at Uber (e.g., heavy writes for temporary files cluster and heavy read traffic for batch analytics cluster) 
  2. Load factor (or replay rate factor) that can increase the load on the Test NameNode by using multipliers on observed production traffic from audit logs
  3. Duration of the test, 1 hour to 1 day
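The replay-rate-factor idea (dimension 2) can be sketched in a few lines: given timestamped audit-log operations, a factor of 2.0 halves the inter-operation gaps so the same trace exerts twice the load. Dynamometer does this against a real NameNode; this standalone sketch just computes the replay schedule.

```python
def replay_schedule(ops, rate_factor=1.0):
    """ops: list of (timestamp_seconds, operation) pairs from audit logs.
    Returns (delay_from_start, operation) pairs at the scaled rate:
    a higher rate_factor compresses the timeline, increasing load."""
    if not ops:
        return []
    t0 = ops[0][0]
    return [((ts - t0) / rate_factor, op) for ts, op in ops]
```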

We ran several iterations of load tests with various combinations of the above parameters and gathered metrics. The notable ones include:

  • Replaying 1 hour of production traffic with 2x replay rate factor to assess if the test NameNode is able to handle additional workload.
  • Reproducing 1 day of production traffic behavior along with HailStorm tests to simulate the periodic peak traffic typical of scheduled batch workloads. The HailStorm test stressed the NameNode up to ~100,000 operations per second for a 10-minute window every hour. Figures 6 and 7 below showcase results from this test iteration.

Figure 6: Monitoring RpcQueueTimeNumOps during load tests

Figure 7: Monitoring RpcQueueTimeAvgTime during load tests

Throughout the tests, we closely monitored key metrics collected from the test NameNode’s JMX endpoints. We kept a close watch on the NameNode rpc-metrics, RpcQueueTimeNumOps and RpcQueueTimeAvgTime, among a few others. Under the generated load, we observed that the containerized NameNode on the new server could handle a peak load of ~110k operations per second while maintaining latencies consistently below 100 ms over the 24-hour duration. These results were more than satisfactory and gave us the confidence to proceed with the migration.

Architecting the Migration

The most challenging part of this effort was coming up with a strategy for a seamless migration, transparent to Hadoop’s thousands of customers. After several design discussions and failed attempts, we devised the migration strategy described in Figure 8 below. Every HDFS cluster has 3 servers that host the control plane nodes. The overall process consists of 4 steps, where each node is diligently moved from one host to another.

Figure 8: Migration Process

An incident during the migration would have had a large impact on the entire batch analytics stack, so we were very meticulous with the migration process. Inspired by the book The Checklist Manifesto, the team collectively defined an exhaustive list of items to validate at each stage of the migration. The overall checklist had 60+ items, including but not limited to: checking container versions, verifying component health metrics, ensuring inter-component connectivity, and inspecting service logs. We developed scripts to run through parts of the checklist to reduce the cognitive overhead on engineers.

This strategy proved quite beneficial:

  1. The checklist helped define the success and failure criteria for each step of the migration. 
  2. Rollback was possible at any stage of the migration process. The checklist, along with its criteria, provided clear guidelines for deciding whether to roll back or fix forward. 
  3. The process guarantees that at any point in time, there is one Active NameNode in the cluster and availability of the cluster with Quorum Journal Manager is not compromised. Moreover, it guarantees that clients accessing HDFS do not see degradations as nodes are being migrated to different hosts.
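In spirit, the checklist-driven flow reduces to the sketch below: each check is a named predicate, any failure marks the step as a rollback candidate instead of proceeding, and the checks themselves are illustrative placeholders rather than Uber’s actual validations.

```python
def run_checklist(checks):
    """checks: list of (name, zero-arg callable) pairs.
    Returns (all_passed, names_of_failed_checks)."""
    failures = [name for name, check in checks if not check()]
    return (not failures, failures)

def next_action(passed):
    # Success criteria met -> proceed to the next migration stage;
    # otherwise stop and decide between rollback and fix-forward.
    return "proceed" if passed else "evaluate-rollback"
```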

Including test, staging, pre-production, and production clusters, we had 32 clusters to migrate. With this strategy, we were able to migrate 4 clusters a week toward the end of the migration. The entire migration took ~3 months, including the time taken to resolve issues we did not anticipate, and was completed with no impact to production Hadoop customers.

Issues that Surprised Us

Just like other large system overhauls, we encountered issues that caught us off guard during the migration. Following are the notable ones:

  1. Failures Introduced by Heterogeneous Hardware: Since late 2018, we have scaled HDFS by introducing ObserverNameNodes to support the astronomical volume of read requests from services like Presto. Our migration process involves stages where HDFS components run on different generations of hardware. At stage D of the migration, we saw a major improvement in active NameNode performance and latencies. However, on some of our busiest clusters, we noticed that the Standby NameNodes and ObserverNameNodes running on older hardware started to lag significantly in consuming edit logs from the JournalNodes. (JournalNodes on the old hardware could similarly have failed to keep up with the active NameNode’s edit log writes; we did not encounter this, but it was a possibility.) The lag led to client-side failures for clients reading data from stale ObserverNameNodes. This happened because heterogeneous hardware was being used across different components in the cluster. To mitigate the issue, we reduced the duration spent in stage D of the migration process (Figure 8) and upgraded all HDFS control plane components to the latest hardware.
  2. Kerberos Continues to Surprise Us: As part of our strategy to scale HDFS, we introduced HDFS Router-based federation for a subset of use cases. Our major focus was to make service discovery work within the HDFS cluster and for clients that access HDFS directly. However, we missed the Routers, which authenticate with Kerberos to access HDFS NameNodes. Routers see a change in the principal name as NameNodes are moved from one host to another (the hdfs/${hostname-fqdn} principal). This led to NameNode principal validation failures when Routers polled for NameNode health status. The issue surfaced in our pre-production environment while we were hardening our migration strategy. 
  3. Automation Interfering with Migration: At Uber, we have achieved a high degree of automation for HDFS DataNodes. DataNodes are constantly decommissioned by automation, without any human involvement, for balancing cluster capacities, host upgrades, and hardware repairs. It is very common to see DataNodes being decommissioned at any given point in time (see Figure 9 below). Automated DataNode decommissions overlapping with a NameNode migration increased the startup time for the NameNode: the NameNode could not receive the necessary percentage of block reports from the decommissioning DataNodes, which caused it to remain in safe mode for a prolonged time. To mitigate this, we paused DataNode automation while NameNodes were being migrated.

Figure 9: DataNodes being decommissioned by automation during the 2022 year-end holidays
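The safe-mode condition behind the third surprise is simple to state: the NameNode leaves safe mode only once reported blocks reach a threshold fraction of total blocks (dfs.namenode.safemode.threshold-pct, which defaults to 0.999 in stock Hadoop). DataNodes mid-decommission hold back block reports, keeping the ratio below threshold. A sketch of that check:

```python
def can_leave_safe_mode(reported_blocks, total_blocks, threshold_pct=0.999):
    """True once the fraction of blocks reported by live DataNodes
    meets dfs.namenode.safemode.threshold-pct (Hadoop default 0.999)."""
    if total_blocks == 0:
        return True  # nothing to wait for on an empty namespace
    return reported_blocks / total_blocks >= threshold_pct
```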


Containerizing HDFS NameNodes marks a huge milestone for the Data team at Uber. As of today, the entire Hadoop and larger Data infrastructure runs in Docker containers. Complex components like NameNodes, large fleets such as DataNodes and NodeManagers, and 350,000+ unique daily YARN applications run in containers. Embracing immutable infrastructure with containers has enabled better automation that the team continues to benefit from every day. 

Migrating to new hardware along with the containerization effort has brought significant performance improvements. Our NameNode latencies have seen a 10x improvement (~200 ms to ~20 ms on average), as seen in Figures 10 and 11 below. This has helped us onboard more traffic and reduced the overall engineering effort needed to support larger use cases. Every couple of years, we used to find ourselves in some variation of firefighting mode to contain NameNode latencies and scale HDFS. As a result of the work we have done so far, we have seen a major decrease in on-call burden, ending our firefighting season.

Figure 10: RpcQueueTimeAvgTime prior to migration

Figure 11: RpcQueueTimeAvgTime post migration

With the additional breathing room we have earned, we are currently focusing on the next set of strategic projects for the Data team. One project is to adopt a server type with universal high-density disks in all of our clusters for better storage efficiency; currently, we use different server types in different clusters to satisfy different throughput requirements. We also have work underway to improve IO per PiB of storage through caching and intelligent data colocation within clusters based on data access frequency. Finally, we are working on abstracting Cloud Object Storage behind HDFS APIs to facilitate Uber’s cloud journey.


The work portrayed in this article was only possible due to consistent effort from engineers across several teams at Uber. We would like to thank everyone who has supported us through our journey, and also everyone who reviewed and assisted in improving this article.

Apache Hadoop®, and Hadoop® are either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries.

Debian is a registered trademark owned by Software in the Public Interest, Inc. Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries. Docker, Inc. and other parties may also have trademark rights in other terms used herein.

Java is a registered trademark of Oracle and/or its affiliates.