LeakProf: Featherlight In-Production Goroutine Leak Detection

November 10, 2022 / Global

Go is a popular programming language used in microservice development, with one of the key features being first-class support for concurrency. Given its burgeoning popularity, it is no surprise that Uber adopted it: the Go monorepo is a centerpiece of its development platform, housing a significant portion of Uber’s codebase in the form of critical business logic, support libraries, or key components of the infrastructure.

Go’s concurrency model is built on goroutines–lightweight threads. Any function call prefixed with the “go” keyword launches the function asynchronously. Goroutines are used abundantly in Go codebases owing to low syntactical overhead and resource requirements, with programs often involving tens to hundreds or thousands of goroutines simultaneously. Two or more goroutines can communicate with each other via message passing over channels, a paradigm inspired by Hoare’s Communicating Sequential Processes. While traditional shared-memory communication is still an option, the development team behind Go encourages users to favor channels, arguing for higher tolerance against data races when used properly.

Goroutine Leaks

An unfortunate side effect of goroutines is goroutine leaks. A key component of channel semantics is blocking, where a channel operation stalls execution of a goroutine until a rendezvous is achieved (i.e., a communication partner is found). More specifically, for an unbuffered channel, the sender blocks until the receiver arrives at the channel and vice versa. A goroutine may be blocked forever trying to send or receive on a channel; such a situation where a blocked goroutine never gets unblocked is referred to as a goroutine leak. When too many goroutines leak, the consequences can be severe. Leaking goroutines consume resources such as memory from being released or reclaimed. Note, buffered channels can also cause a goroutine leak once the buffer is full.

Programmatic errors (e.g., complex control flow, early returns, timeouts), can lead to mismatch in communication between goroutines, where one or more goroutines may block but no other goroutine will ever create the necessary conditions for unblocking. Goroutine leaks prevent the garbage collector from reclaiming the associated channel, goroutine stack, and all reachable objects of the permanently blocked goroutine. Long-running services where small leaks accumulate over time exacerbate the problem.

Go distribution does not offer any out-of-the-box solution for detecting goroutine leaks, whether during compilation or at runtime. Detecting goroutine leaks is non-trivial as they may depend on complex interactions/interleavings between several goroutines, or otherwise rare runtime conditions. Several proposed static analysis techniques [1, 2, 3] are prone to imprecision, both reporting false positives or incurring false negatives. Other proposals such as goleak employ dynamic analysis during testing, which may reveal several blocking errors, but their efficacy depends on the comprehensive coverage of code paths and thread schedules. Exhaustive coverage is infeasible at a large scale; for example, certain configuration flags changing code paths in production are not necessarily tested by unit-tests.

Detecting goroutine leaks is complicated, especially in complex production codes using numerous libraries with thousands of goroutines at runtime employing numerous channels. There is a dire need to detect these leaks in production code. Such a tool needs to meet the following requirements:

It should not introduce high overheads because it will be used in production workloads. High overhead both hurts SLAs and consumes more computational resources.
It should have a low false positive rate; false leaks waste developer time.

A Practical, Lightweight Solution to In-Production Goroutine Leaks

We take a practical approach to detecting goroutine leaks for long-running programs in production that meets the aforementioned criterion. Our premise and critical observation are the following:

If a programs has a non-trivial amount of goroutine leaks, it will be eventually visible via a high number of goroutine count blocked on some channel operation
Only a few source code locations (involved in channel operations) account for most leaking goroutines
Rare goroutine leaks, while not ideal, incur low overhead and may be ignored

Point #1 is justified by a spike in the number of goroutines for programs with leaks. Point #2 simply states that not all channel operations are leaky, but the source location of significant leak causes must be exercised frequently to expose the leak. Since any leaked goroutine persists for the remainder of a service’s lifetime, repeatedly encountering leaky scenarios eventually contributes to a large build-up of blocked goroutines. This is especially true for leaks caused by concurrency operations reachable via many different execution paths and/or in loops. Finally, point #3 is a practical consideration. If a leak-inducing operation is rarely encountered, the impact it has on memory build-up is less likely to be severe. Based on these pragmatic observations, we designed LeakProf, a reliable leak indicator with few false positives and a minimum runtime overhead.

Implementation of LeakProf

LeakProf periodically profiles the call stacks of currently running goroutines (obtained with pprof). Inspecting the call stacks for particular configurations indicates whether a goroutine is blocked on specific operations such as channel send, channel receive, and select. These are relatively straightforward to recognize as well-known blocking functions in the Go runtime. After aggregating blocked goroutines by the source locations of channel operations, an egregious number of blocked goroutines at a single source location, determined by a configurable threshold, is treated as a potential goroutine leak.

Our approach is neither sound nor complete. It, respectively, incurs both false negatives (i.e., not all leaks are guaranteed to be found), and false positives (i.e., reports may not represent a real leak). False negatives are caused whenever a program does not exercise leaky scenarios at runtime, or if the number of leaks does not exceed the configurable threshold. Conversely, false positives lead to spurious reports whenever a high number of goroutines are deliberately blocked as a natural consequence of the intended program semantics, and not due to a leak (e.g., heartbeat operations with high latency). To improve filtering of potential false positives, we are continuously developing complementary lightweight heuristics based on static analysis. For example, reporting a suspicious select statement involves analyzing its AST to determine whether one of the case branches involves waiting on known non-blocking operations (e.g., involving tickers or timers provided by the Go standard library); if this property is satisfied, the supposed leak will not be reported, regardless of the number of blocked goroutines, as it is a definite false positive. It may also feature a configurable list of known false positives. Regardless, the approach works very well in practice to detect non-trivial leaks with a large impact on production services, as evidenced below.

When deployed at Uber, LeakProf leverages profiling information to collect blocked goroutine information and automatically notifies service owners for suspicious concurrency operations. Operations are deemed suspicious if the number of blocked goroutines exceeds a given threshold, and only if they are part of the Uber code base. The efficacy of this approach was proven quickly, finding 10 critical leaking goroutines, and only 1 false positive. Addressing 2 of the defects each resulted in a 2.5x and a 5x reduction in the service’s peak memory, with the service owners voluntarily reducing the container memory requirements by 25%.

Leaky Go Program Patterns

An analysis of goroutine leaks found in production has relieved the following common patterns of leaky code patterns.

Premature Function Return

This leak pattern is caused whenever several goroutines are expected to communicate, but some code paths return prematurely without participating in the channel communication, leaving the other participant waiting forever. This occurs when communication partners do not account for all of each other’s possible execution paths.