Zero-Growth Stack, Real Gains: How Stack Allocation Can Save 10% CPU in Go
Introduction
At Uber, around 65% of our services are written in Go®, which accounts for more than 2 million cores. At that scale, a 1% efficiency improvement across the board is worth several million dollars. Today, we’re showcasing how we improved one service’s utilization by 10% and how we plan to replicate it.
The Go runtime employs several mechanisms to keep memory usage low, but at scale their CPU cost can outweigh the savings. One such mechanism is stack expansion: goroutine stacks start small, which allows far more concurrency than OS threads, and grow whenever a goroutine outgrows its current size. Repeated stack expansion wastes CPU, so pre-allocating stacks at the right size is essential.
Deep Dive
Go is the most-used language at Uber in terms of capacity. Go has its own runtime that replaces threads with goroutines (think of them as a lightweight alternative to threads). One key difference between them is how they handle stack usage. An OS thread created with pthread_create uses 2 MB of stack. If the RLIMIT_STACK resource limit is set to "unlimited", a per-architecture value is used for the stack size: 2 MB on most architectures and 4 MB on POWER and SPARC-64. On the other hand, Go uses 2 KB as the default initial size for a goroutine.
If 2 KB isn’t enough, Go has a special procedure that checks whether a function call is about to overflow the stack. When that happens, the runtime creates a new stack of double the size and copies the contents of the previous one over.
Looking at Figure 1, line 2 compares the stack pointer against the threshold; if we’ve exceeded it, the code jumps to runtime.morestack.
Go 1.19 introduced adaptive stack sizing, which tracks the ongoing average stack size and uses that information to set the initial stack of new goroutines. This helps to some extent, because goroutines no longer start at the smallest possible size, but stack growth can still be quite expensive for some applications.
Then, what other options did we have?
Option #1: Goroutine Pooling
Other teams at Uber have used this solution in the past. The M3 team improved CPU usage by keeping a pool of goroutines and sending work to them. It works really well for them because the worker pool has only one responsibility.
The disadvantages of this option are that the required code changes make the code slightly more complex, take time, and require knowledge of the application in question. Channel communication also has a cost, and the channel’s size is static (under heavy load, do we block or drop?).
Option #2: Customizing the Runtime
Let’s dive into the Go source code. The way the average stack size works is shown in Figure 2.
Figure 2: Go runtime starting stack size implementation.
There’s a method that computes the value and stores it in startingStackSize. Additionally, there’s a way to disable the execution of this method with debug.adaptivestackstart.
At this point, we had 2 options: add public methods to the runtime or leverage private linking to expose private methods. In Uber’s case, we decided to use the private linking approach because it touches the runtime the least.
First, we needed to patch the runtime to allow private linking to the debug global variable and startingStackSize global variable. This was needed because linkname requires a contract since go1.23 and these two places don’t have it in the official source code (example).
Then, we created public methods in our internal modules to modify those variables.
We needed the debug variable to disable adaptive stack sizing, so the runtime wouldn’t override our changes. We also disabled stack shrinking, which has its own CPU cost.
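The linking side can be sketched as follows. This is a hedged illustration, not our exact internal code: it only links against a toolchain patched to add the matching //go:linkname contract on the runtime side (required since Go 1.23), and the variable’s type must match your runtime’s definition.

```go
package stacksize

import (
	_ "unsafe" // required for go:linkname
)

// startingStackSize mirrors the runtime's private variable of the same
// name. On a stock toolchain this pull-style linkname will not link.
//
//go:linkname startingStackSize runtime.startingStackSize
var startingStackSize uintptr

// SetStartingStackSize overrides the initial goroutine stack size.
// Call it as early as possible in main, before most goroutines start,
// and with adaptive stack sizing disabled so the runtime doesn't
// overwrite the value.
func SetStartingStackSize(bytes uintptr) {
	startingStackSize = bytes
}
```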
To simplify stack tuning, we decided to use a static value to minimize runtime impact. As soon as the application starts, we modify the value. Then we leverage our configuration system to inject the stack size through per-case exceptions. This allows testing by runtime_environment, zone, region, and so on.
The only disadvantage to this approach is that the Go community could change how it defines the average stack size, and our code would stop working. To reduce the risk of crashes and issues in the future, we use build tags.
The current version exposes methods that allow modifying the runtime, while a boilerplate file with empty implementations of all the methods targets the upcoming Go version. Then, on every new Go version, we must check the private structures to ensure that debug hasn’t changed and that startingStackSize is still there.
Production Example
The profile of one of our top services showed around 10% of CPU time coming from stack growth.
Their stack memory usage was very low, as shown in Figure 8.
Most instances were in the 12-16 MB bucket. This service had 16GB of memory per instance, so it had plenty of room to grow.
We knew that we could use memory to reduce CPU usage, so we looked at different approaches to determine how much stack memory to allocate per goroutine.
Approach #1: Manual Mode
The first approach we considered was to start with 4KB, deploy, check the profile, and if the copystack frame was still high, go to the next value, 8KB, rinse and repeat.
Obviously this isn’t scalable across many applications, and it could have the opposite effect if the average stack size were higher than the initial 4 KB.
Approach #2: Manual Mode With a Metric
The second approach we considered was the same as the first, but added instrumentation to our services to emit the current average starting stack size, so that we could use that information to tune. If the current avg was 4KB and we still saw copystack in the profile, it meant we could go directly to 8KB, reducing the number of steps.
This approach is better, but still requires significant effort because we needed to wait for deployments to happen.
Approach #3: Get the Value at Runtime
Without going into too much detail on how to get the current stack size at runtime (cgo or private linking help here), this approach wasn’t feasible due to the overhead it brings and because we’d need to know where the right place is to call it (which method is the one causing copystack).
Approach #4: Profiler
Returning to Figure 4, the profiler already gave us the information we wanted. Any stack trace that reached this block was the one we needed to see how large it was. The question was, how could we automate this process? How could we do it without adding runtime instrumentation?
Figure 6 depicts our overall workflow of profile-guided stack-size tuning.
Figure 6. Entire feedback loop to tune stack sizes.
To read the stack traces, we use the pprof library. Once we have the profiles, we can filter out any stack trace that doesn’t go through the copystack function. After that, the analyzer needs to know how much stack space a specific function uses. It has 2 options:
- go tool objdump <binary>: easier to implement, but it’s slow and not fully integrated (call a subprocess and ensure the Go binary is available).
- Create our own assembly reader: better for a final production solution. We do this using a library to disassemble a binary.
The first requirement is to ensure that we build our Go program with debug symbols, and that we could read those symbols in Go.
Figure 7. Code to get debug symbols from binary.
Then we need to go through the stack trace previously computed by the pprof library, and for each function, get its stack usage.
Figure 8. Code to get stack usage of each method in a stacktrace.
In the code above, we go through all the assembly instructions for each function; the first sub instruction we find is the one that reserves the amount of stack memory the function needs. Then we only need to decode the instruction’s operand to get the amount.
Figure 9. Output from stack analyzer.
The stack analyzer prints all the stack traces with their total stack usage, ordered by usage. If we analyze the top trace, we see that it uses around 19 KB, but since the stack size must be a power of 2, the correct value is 32 KB.
Additionally, the analyzer can provide useful information about each function. For instance, we found that go.uber.org/yarpc/internal/observability.(*Middleware).Call was consuming 2.6KB of stack memory. This is a middleware, so it would be called by every single request. Therefore, we can optimize it to reduce its usage. The change is available on GitHub®, where we reduced it to 600 bytes.
Impact on Production
We started with close to a 10% impact on the service.
Figure 10. Stack growth cost for the same service as above.
After deploying the change that moves the initial stack size from 2 KB to 32 KB, stack growth dropped to less than 1% of CPU.
Figure 11: Stack growth cost for the same service as above after a static stack size of 32KB.
Most instances use around 50MB of stack, with a few reaching 200MB. The container has 16GB of memory, so 200MB overhead is less than 2%.
Figure 12. Stack memory usage for the same service as above after static stack size to 32KB.
The impact on CPU was significant (1 − 150/180 ≈ 16.7%).
Figure 13. CPU consumption with 1 week delta (yellow is 1 week before).
Next Steps
We plan to continue onboarding more services to this solution. Identifying the correct value for the stack is time-consuming, so we need to automate the process to find the best candidates. Signals to look for are:
- High CPU consumption from stack growth
- Relatively low stack memory usage or enough memory to grow
Additionally, we can use the analyzer to optimize hot functions that consume high amounts of stack.
Conclusion
Go’s unique stack expansion mechanism typically provides an ideal balance of low memory usage and efficient CPU performance. However, at Uber’s scale, where 1% efficiency gains represent millions of dollars in savings, even highly optimized runtimes can benefit from specialized tuning. While our current approach involves internal modifications to the runtime, the significant performance gains demonstrate that these efforts are well worth the investment until a native, official solution is available.
Acknowledgments
Special thanks to the Go team that helped us with the runtime patch, which made this possible.
Cover Photo Attribution: “Colorful software or web code on a computer monitor” by Markus Spiske is covered by the Unsplash license.
Go and the Go logo are trademarks of Google LLC.
GitHub is a registered trademark of GitHub, Inc, a subsidiary of Microsoft.
Stay up to date with the latest from Uber Engineering—follow us on LinkedIn for our newest blog posts and insights.
Cristian Velazquez
Sr Staff Engineer