Zero-Growth Stack, Real Gains: How Stack Allocation Can Save 10% CPU in Go
Introduction
At Uber, around 65% of our services are written in Go®, which accounts for more than 2 million cores. At that scale, a 1% efficiency improvement across the board is worth several million dollars. Today, we’re showcasing how we improved one service’s utilization by 10% and how we plan to replicate it.
The Go runtime employs several mechanisms to keep memory usage low, but at scale their CPU cost can outweigh the savings. One such mechanism is stack expansion: goroutine stacks start small, which allows far more concurrency than OS threads, and grow whenever a goroutine outgrows its current size. Repeated stack expansion wastes CPU, so pre-allocating stacks at the right size is essential.
Deep Dive
Go is the most-used language at Uber in terms of capacity. Go has its own runtime that replaces threads with goroutines (think of them as a lightweight alternative to threads). One key difference between them is how they handle stack usage. An OS thread created with pthread_create uses 2 MB of stack. If the RLIMIT_STACK resource limit is set to "unlimited", a per-architecture value is used for the stack size: 2 MB on most architectures and 4 MB on POWER and SPARC-64. On the other hand, Go uses 2 KB as the default initial size for a goroutine.
If 2 KB isn’t enough, Go has a special procedure that checks whether a function call is about to overflow the stack. When that happens, the runtime creates a new stack of double the size and copies the contents of the previous one over.
Looking at Figure 1, line 2 compares the stack pointer against the threshold; if we’ve exceeded it, the code jumps to runtime.morestack.
Go 1.19 introduced adaptive stack sizing, which tracks the ongoing average stack size and uses that information to set the initial stack of new goroutines. This helps to some extent, because goroutines no longer start at the smallest possible size, but stack growth can still be quite expensive for some applications.
Then, what other options did we have?
Option #1: Goroutine Pooling
Other teams at Uber have used this solution in the past. The M3 team improved CPU usage by keeping a pool of goroutines and sending work to them. It works really well for them because the worker pool has only one responsibility.
The disadvantages of this option are that the required code changes make the code slightly more complex, take time, and require knowledge of the application in question. Channel communication also has a cost, and the channel’s size is static (under heavy load, do we block or drop?).
Option #2: Customizing the Runtime
Let’s dive into the Go source code. The way the average stack size works is shown in Figure 2.
Figure 2: Go runtime starting stack size implementation.
There’s a method that computes the value and stores it in startingStackSize. Additionally, there’s a way to disable the execution of this method with debug.adaptivestackstart.
At this point, we had 2 options: add public methods to the runtime or leverage private linking to expose private methods. In Uber’s case, we decided to use the private linking approach because it touches the runtime the least.
First, we needed to patch the runtime to allow private linking to the debug global variable and startingStackSize global variable. This was needed because linkname requires a contract since go1.23 and these two places don’t have it in the official source code (example).
Then, we created public methods in our internal modules to modify those variables.
We needed the debug variable to disable adaptive stack sizing, so the runtime wouldn’t override our changes. We also disabled stack shrinking, which has its own CPU cost.
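The linking side can be sketched as follows. This is a hedged illustration, not our exact internal code: it only links against a toolchain patched to add the matching //go:linkname contract on the runtime side (required since Go 1.23), and the variable’s type must match your runtime’s definition.

```go
package stacksize

import (
	_ "unsafe" // required for go:linkname
)

// startingStackSize mirrors the runtime's private variable of the same
// name. On a stock toolchain this pull-style linkname will not link.
//
//go:linkname startingStackSize runtime.startingStackSize
var startingStackSize uintptr

// SetStartingStackSize overrides the initial goroutine stack size.
// Call it as early as possible in main, before most goroutines start,
// and with adaptive stack sizing disabled so the runtime doesn't
// overwrite the value.
func SetStartingStackSize(bytes uintptr) {
	startingStackSize = bytes
}
```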
To simplify stack tuning, we decided to use a static value to minimize runtime impact. As soon as the application starts, we modify the value. Then we leverage our configuration system to inject the stack size through per-case exceptions. This allows testing by runtime_environment, zone, region, and so on.
The only disadvantage to this approach is that the Go community could change how it defines the average stack size, and our code would stop working. To reduce the risk of crashes and issues in the future, we use build tags.
The current version exposes methods that allow modifying the runtime, while a boilerplate file with empty implementations of all the methods targets the upcoming Go version. Then, on every new Go version, we must check the private structures to ensure that debug hasn’t changed and that startingStackSize is still there.
Production Example
The profile of one of our top services showed around 10% of CPU time coming from stack growth.
Their stack memory usage was very low, as shown in Figure 8.
Most instances were in the 12-16 MB bucket. This service had 16GB of memory per instance, so it had plenty of room to grow.
We knew that we could use memory to reduce CPU usage, so we looked at different approaches to determine how much stack memory to allocate per goroutine.
Approach #1: Manual Mode
The first approach we considered was to start with 4KB, deploy, check the profile, and if the copystack frame was still high, go to the next value, 8KB, rinse and repeat.
Obviously this isn’t scalable across many applications, and it could have the opposite effect if the average stack size were higher than the initial 4 KB.
Approach #2: Manual Mode With a Metric
The second approach we considered was the same as the first, but added instrumentation to our services to emit the current average starting stack size, so that we could use that information to tune. If the current avg was 4KB and we still saw copystack in the profile, it meant we could go directly to 8KB, reducing the number of steps.
This approach is better, but still requires significant effort because we needed to wait for deployments to happen.
Approach #3: Get the Value at Runtime
Without going into too much detail on how to get the current stack size at runtime (cgo or private linking help here), this approach wasn’t feasible due to the overhead it brings and because we’d need to know where the right place is to call it (which method is the one causing copystack).
Approach #4: Profiler
Returning to Figure 4, the profiler already gave us the information we wanted. Any stack trace that reached this block was the one we needed to see how large it was. The question was, how could we automate this process? How could we do it without adding runtime instrumentation?
Figure 6 depicts our overall workflow of profile-guided stack-size tuning.
Figure 6. Entire feedback loop to tune stack sizes.
To read the stack traces, we use the pprof library. Once we have the profiles, we can filter out any stack trace that doesn’t go through the copystack function. After that, the analyzer needs to know how much stack space a specific function uses. It has 2 options:
- go tool objdump <binary>: easier to implement, but it’s slow and not fully integrated (call a subprocess and ensure the Go binary is available).
- Create our own assembly reader: better for a final production solution. We do this using a library to disassemble a binary.
The first requirement is to ensure that we build our Go program with debug symbols, and that we could read those symbols in Go.
Figure 7. Code to get debug symbols from binary.
Then we need to go through the stack trace previously computed by the pprof library, and for each function, get its stack usage.
Figure 8. Code to get stack usage of each method in a stacktrace.
In the code above, we go through all the assembly instructions for each function; the first sub instruction we find is the one that reserves the amount of stack memory the function needs. Then we only need to decode the instruction’s operand to get the amount.
Figure 9. Output from stack analyzer.
The stack analyzer prints all the stack traces with their total stack usage, ordered by usage. If we analyze the top trace, we see that it uses around 19 KB, but since the stack size must be a power of 2, the correct value is 32 KB.
Additionally, the analyzer can provide useful information about each function. For instance, we found that go.uber.org/yarpc/internal/observability.(*Middleware).Call was consuming 2.6KB of stack memory. This is a middleware, so it would be called by every single request. Therefore, we can optimize it to reduce its usage. The change is available on GitHub®, where we reduced it to 600 bytes.
Impact on Production
We started with close to a 10% impact on the service.
Figure 10. Stack growth cost for the same service as above.
After deploying the change that moves the initial stack size from 2 KB to 32 KB, stack growth dropped to less than 1% of CPU.
Figure 11: Stack growth cost for the same service as above after a static stack size of 32KB.
Most instances use around 50MB of stack, with a few reaching 200MB. The container has 16GB of memory, so 200MB overhead is less than 2%.
Figure 12. Stack memory usage for the same service as above after static stack size to 32KB.
The impact on CPU was significant (1 − 150/180 ≈ 16.7%).
Figure 13. CPU consumption with 1 week delta (yellow is 1 week before).
Next Steps
We plan to continue onboarding more services to this solution. Identifying the correct value for the stack is time-consuming, so we need to automate the process to find the best candidates. Signals to look for are:
- High CPU consumption from stack growth
- Relatively low stack memory usage or enough memory to grow
Additionally, we can use the analyzer to optimize hot functions that consume high amounts of stack.
Conclusion
Go’s unique stack expansion mechanism typically provides an ideal balance of low memory usage and efficient CPU performance. However, at Uber’s scale, where 1% efficiency gains represent millions of dollars in savings, even highly optimized runtimes can benefit from specialized tuning. While our current approach involves internal modifications to the runtime, the significant performance gains demonstrate that these efforts are well worth the investment until a native, official solution is available.
Acknowledgments
Special thanks to the Go team that helped us with the runtime patch, which made this possible.
Cover Photo Attribution: “Colorful software or web code on a computer monitor” by Markus Spiske is covered by the Unsplash license.
Go and the Go logo are trademarks of Google LLC.
GitHub is a registered trademark of GitHub, Inc, a subsidiary of Microsoft.
Stay up to date with the latest from Uber Engineering—follow us on LinkedIn for our newest blog posts and insights.
Cristian Velazquez
Sr Staff Engineer