Fixing Go’s Linker: An Unexpected Journey into ARM64, DWARF, and Linker Internals

February 16, 2023 / Global

Introduction

We encountered an unusual problem recently at Uber with Golang™ debugging, as our engineers began transitioning to Apple® Silicon hardware, which uses the ARM64 Instruction Set Architecture (ISA), rather than the x86/AMD64 ISA many of us have been using for many years now. This required some rather complex debugging of the toolchain itself by Uber engineers. This post will showcase the analysis techniques, and dive into some topics including:

DWARF
ARM64 limitations
The linking process and linker internals
Low level inspection of object files and executables

Some engineers with Apple® M1s were reporting they were unable to set breakpoints or step through their Go™ programs in any IDE. This was true only for some programs, and for those we found that we could not get line information using the Delve debugger, which is the underlying debugger for IDEs such as VSCode and Goland. Normally breaking on main.main of fooService would look like:

Figure 1: Interactive debug of *main* function of fooService compiled for AMD64

We can see the original source code, and even disassemble the code to see which line in the source code each instruction originated from.

Debugging the ARM64 build of the exact same program, however, showed the debugger unable to resolve any source information:

Figure 2: Interactive debug of *main* function of fooService compiled for ARM64

DWARF

What actually allows the debugger to map program addresses to source code files and line numbers, amongst many other things (variables, etc.)? The most commonly used standard is DWARF, which is used by Go and embedded in the final native system binary; for Macs, this is a Mach-O executable file.

The section of the DWARF spec that deals with mapping source to program addresses is described as the Line Program Table (LPT) in chapter 6.2. Think of the LPT as a list of micro-instructions that run in a simple state machine to produce the address-to-file+line mapping. This allows the LPT to remain very compact.
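To make the state-machine idea concrete, here is a deliberately simplified sketch. It is not the real DWARF opcode encoding (which is far more compact and has many more registers); the op names and the two-register state are assumptions made purely for illustration. A short program of micro-instructions expands into an address-to-line table:

```go
package main

import "fmt"

// op is one micro-instruction of a toy line program. The kinds used here
// are hypothetical stand-ins for the real DWARF opcodes.
type op struct {
	kind  string // "advance_pc", "advance_line" or "emit"
	value int64
}

// row is one entry of the expanded address-to-line table.
type row struct {
	addr uint64
	line int64
}

// run executes prog against a two-register state machine (address, line)
// and returns the rows it emits. As in DWARF, the line register starts at 1.
func run(start uint64, prog []op) []row {
	addr, line := start, int64(1)
	var rows []row
	for _, o := range prog {
		switch o.kind {
		case "advance_pc":
			addr += uint64(o.value)
		case "advance_line":
			line += o.value
		case "emit":
			rows = append(rows, row{addr, line})
		}
	}
	return rows
}

func main() {
	prog := []op{
		{"advance_line", 4}, {"emit", 0},
		{"advance_pc", 8}, {"advance_line", 1}, {"emit", 0},
		{"advance_pc", 4}, {"advance_line", -2}, {"emit", 0},
	}
	for _, r := range run(0x1000, prog) {
		fmt.Printf("%#x -> line %d\n", r.addr, r.line)
	}
}
```

Only deltas are stored, which is why the table stays small even for large functions.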

We wrote a tool, dwarfmachodebug, that dumps the LPT information, and ran it against both architecture binaries:

Figure 3: Dumping LPT of *main* function of fooService compiled for ARM64

Figure 4: Dumping LPT of *main* function of fooService compiled for AMD64

The difference in output (the AMD64 dump covers every line with no errors, while the ARM64 dump does not) suggests the DWARF/LPT information is corrupted in some way on ARM64, but not AMD64.

DWARF in the Build

Where does DWARF information come from?

In Go, the unit of compilation (CU) is the package. A package’s .go files get compiled by the Go compiler into platform-independent object files¹, which are then linked by the Go linker (not GCC/LLVM/etc.²) into a platform-dependent binary (e.g., Mach-O for Mac, ELF for Linux), as shown.

Figure 5: The process of building helloworld

¹ This file format was re-engineered in Go1.15

² The system’s native “external” linker may still be required for certain architectures and platforms, when CGo is used, and in other less common scenarios

We can see that the call to fmt.Println() is now resolved by the linker.³

The DWARF information is added in each compilation unit by the compiler, in a section known as the Auxiliary Symbols (AuxSyms), and then merged together by the linker, just like the code.

In fact, we can watch the whole build process. Let’s build Go’s hello world example, which has a call from main to the fmt package. The flags used here are:

-x – prints all the build commands
-work – prints the temporary build dir ($WORK) and does not delete it
-gcflags='all=-N -l' – typical debug flags; disables optimisations and inlining to make this example easier to study

Figure 6: Verbose build of helloworld on ARM64

³ There is an excellent 20-part series on linker design, by Golang contributor Ian Lance Taylor.

In this trimmed output, the compile (green) and link (red) commands are highlighted, as well as the line in the build file (purple) that references our helloworld object file that was just compiled (_pkg_.a).

We can see that only helloworld.go is compiled; the rest of the program is linked from cached builds of fmt and the other packages.

Figure 7: Disassembly of the *main* function of *helloworld* ARM64 compiled object file

This is the ARM64 disassembly of the helloworld object file before it is linked. We can clearly see line information is obtained by the objdump tool via DWARF. But because this is pre-link, the call to fmt.Println is a CALL⁴ with a 0 offset, because the compiler doesn’t know where fmt.Println will be in the final executable; instead it inserts a Reloc (relocation) to ask the linker for help.

It’s the linker’s job to fill this in. Let’s dissect the Reloc, [0:4]R_CALLARM64:fmt.Println. It tells the linker which bytes to patch, what type of instruction is being relocated, and what the target is (function, global variable, etc.). In this case, [0:4] says the entire 4-byte instruction needs modifying, R_CALLARM64 says it is of the 26-bit immediate type, and the destination is the function fmt.Println.

In fact, if we look at the final helloworld executable, we can see the linker has patched the instruction as we expect:

Figure 8: Disassembly of the *main* function of *helloworld* ARM64 linked executable file

⁴ Note: Go uses the mnemonic CALL due to its Plan9 origins; this is normally referred to as BL (Branch-and-Link)
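The patching step can be sketched in a few lines. This is an illustration of the BL encoding, not the Go linker’s code: encodeBL is a hypothetical helper name, and the range check is deliberately left out. A BL instruction is the opcode 0x94000000 with the signed word offset to the target stored in the low 26 bits, so an unresolved call with a zero offset is simply 0x94000000:

```go
package main

import "fmt"

// encodeBL returns the 32-bit ARM64 BL instruction that calls target from
// an instruction located at pc. Both addresses are assumed to be 4-byte
// aligned and within range of each other.
func encodeBL(pc, target uint64) uint32 {
	const opBL = 0x94000000 // BL with an all-zero imm26 field
	// The offset is counted in 4-byte instructions, not in bytes.
	words := (int64(target) - int64(pc)) >> 2
	return opBL | uint32(words)&0x03ffffff
}

func main() {
	// Hypothetical addresses, chosen only to show the arithmetic.
	fmt.Printf("%#x\n", encodeBL(0x1000, 0x1000)) // placeholder before relocation
	fmt.Printf("%#x\n", encodeBL(0x1000, 0x1010)) // forward call, 4 instructions ahead
	fmt.Printf("%#x\n", encodeBL(0x1000, 0x0ffc)) // backward call, 1 instruction behind
}
```

Negative offsets rely on two’s complement: masking to 26 bits keeps the sign bit of the field set.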

Analysis

Using what we have learned, can we identify any peculiarities in the ARM64 binaries vs the AMD64 ones? If we compare the main() function disassemblies, there is one thing that stands out, which is the call to fx.New (Uber’s open source dependency injection framework for Go).

Figure 9: Side by side comparison of the ARM64 vs AMD64 calls to *fx.New* in *main()* of *fooService* linked executable

What is this +0-tramp0 suffix on fx.New that is seen only on ARM64?

Figure 10: Disassembly of the *fx.New+0-tramp0* code in the ARM64 linked executable file of *fooService*

Essentially, it is a mysterious jump to a (large) PC relative address.⁵ We investigate in the debugger:

Figure 11: Interactive debug step through of the call to *fx.New+0-tramp0* of the ARM64 linked executable file of *fooService*

⁵ ADRP instruction

Note that the debugger cannot resolve the fx.New+0-tramp0 instructions to any source code, but we know there are 3 instructions there, so we Step Instruction (si) 3+1 times and find we end up at fx.New.

But why this indirection?

Let’s go back to the variants of the CALL instruction the compiler uses on each architecture.

Figure 12: Instructions the compiler uses to make the call to *fx.New* for ARM64 and AMD64 in executable *fooService*

The Go compiler’s ARM64 code generator chose the BL imm26 instruction, whose encoding leaves 26 bits for the relative call offset. Since ARM64⁶ is a RISC ISA with fixed 32-bit instructions, and all instructions are 32-bit aligned, the 2 lowest bits of the offset are implicitly zero, giving a total of 28 bits of signed offset. This means the linker can only relocate calls that are within ±2^27 bytes (±128MB)⁷ in relative distance.

By comparison, AMD64 is a variable-length CISC ISA, and the Go compiler’s AMD64 code generator chooses CALL with a sign-extended rel32 immediate, which is 5 bytes long and allows relative calls of ±2GB.

⁶ Note: there is currently no Thumb® (16-bit instructions) on ARM64 (aka AArch64). Thumb® is an extension to ARM (aka AArch32) only

⁷ The linker further restricts this to ±124MB to reserve space for various other structures

Figure 13: The limitations in call distance of AMD64 vs ARM64

In some cases, ±128MB won’t be enough. For example, the call site to fx.New in main.main we saw was at 0x10fe8afd0, while the real address of fx.New was 0x104c99fc0. The difference is roughly 180MB (0xb1f1010 bytes), which exceeds the limit of the BL instruction.
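The arithmetic is easy to check. The sketch below uses the two addresses quoted above; fitsBL and blReach are hypothetical names, and the check uses the architectural ±2^27 limit rather than the linker’s slightly tighter one:

```go
package main

import "fmt"

// blReach is the distance a BL can cover in either direction:
// 26 encoded bits plus 2 implicit zero bits, one of them the sign.
const blReach = 1 << 27

// fitsBL reports whether a call at pc can reach target with a single BL.
func fitsBL(pc, target uint64) bool {
	d := int64(target) - int64(pc)
	return d >= -blReach && d < blReach
}

func main() {
	const callSite, fxNew = 0x10fe8afd0, 0x104c99fc0
	d := int64(fxNew) - int64(callSite)
	fmt.Printf("displacement: %d bytes (%.1f MiB)\n", d, float64(d)/(1<<20))
	fmt.Println("fits in BL:", fitsBL(callSite, fxNew))
}
```

The displacement comes out at about 178 MiB backwards, comfortably outside the 128 MiB a BL can reach.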

The linker solves this problem by inserting trampolines when necessary, which as the name suggests, consist of a jump to the target. The 3-instruction trampoline we saw in ARM64 allows ±4GB jumps (yes, 33 bits). Let’s see how the binaries compare:

Figure 14: Why the ARM64 version of *fooService* needs a trampoline from *main* to *fx.New*

We can see the trampoline uses 3 instructions (12 bytes) to perform the extended jump. This is why it is impractical to use this form everywhere⁸, when perhaps only a small number of calls need trampolining.

⁸ There is a technique known as Linker Relaxation, where the compiler always emits the longest possible sequence, and the linker “relaxes” it to a shorter sequence at link time. This is incredibly difficult (and link time intensive) to implement even with certain constraints; thus Go does not use it.
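Where the extra reach comes from can be sketched as follows. The text confirms only that the trampoline starts with an ADRP; the sketch assumes the common pattern of an ADRP to load the target’s 4KB page, an ADD to supply the low 12 bits, and a branch through the register. The function names are hypothetical, and the trampoline is placed at the call-site address purely for illustration:

```go
package main

import "fmt"

// splitForADRP splits target into the signed page delta that ADRP adds to
// the current 4KB page, and the low 12 bits that a following ADD supplies.
func splitForADRP(pc, target uint64) (pages int64, low12 uint64) {
	pages = int64(target>>12) - int64(pc>>12)
	low12 = target & 0xfff
	return pages, low12
}

// resolve mimics the address computed by ADRP followed by ADD.
func resolve(pc uint64, pages int64, low12 uint64) uint64 {
	return uint64(int64(pc&^0xfff)+pages<<12) + low12
}

// fitsADRP reports whether the page delta fits ADRP's signed 21-bit field.
// 21 bits of pages times 4KB per page gives the 33-bit, ±4GB reach.
func fitsADRP(pages int64) bool {
	return pages >= -(1<<20) && pages < 1<<20
}

func main() {
	const tramp, fxNew = 0x10fe8afd0, 0x104c99fc0
	pages, low12 := splitForADRP(tramp, fxNew)
	fmt.Printf("page delta %d, low bits %#x, in range: %v\n", pages, low12, fitsADRP(pages))
	fmt.Printf("resolved target: %#x\n", resolve(tramp, pages, low12))
}
```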

Trampolines and DWARF Info

As interesting as we hope you found this, we need to figure out why trampoline insertion breaks the DWARF LPT. The strategy we’ll adopt is starting with DWARF LPT and then figuring out how it is not cooperating with trampolines by using differential debugging (debugging the link of good vs. broken binary and bisecting the main linker flow).

DWARF LPT Generation

Per compile unit (package), LPT data is gathered by writelines. In particular, its loop goes through each function in the package by its symbol index and obtains the AuxSyms.

Figure 15: The high level process the Go linker (as of Go 1.19) uses to insert DWARF LPT

Figure 16: Portions of the relevant code used to insert the DWARF LPT inside the Go linker

In any Go function where a trampoline was inserted, the IsExternal() condition was true, meaning that empty results were returned and only a partial LPT was generated for that function. We now need to understand how trampolines are changing this condition.

Trampoline Insertion

Trampoline insertion code can be found by searching through the Go linker source. In particular, a two-pass optimistic-pessimistic strategy is used, where the linker tries without trampolines and if this looks like it could possibly fail, it switches to a more complex strategy that considers the use of trampolines.
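The shape of that strategy can be sketched with a toy model. This is not the linker’s algorithm: the types and names are hypothetical, distances are measured between function start addresses rather than from the call instruction, and the space taken by the trampolines themselves is ignored. It only shows the cheap optimistic check followed by the per-call pessimistic one:

```go
package main

import "fmt"

type fn struct {
	name string
	size uint64
}

// call records a caller and callee as indexes into the function list.
type call struct{ from, to int }

// layout assigns sequential addresses starting at base.
func layout(base uint64, fns []fn) []uint64 {
	addrs := make([]uint64, len(fns))
	for i, f := range fns {
		addrs[i] = base
		base += f.size
	}
	return addrs
}

// callsNeedingTrampolines returns the calls that cannot be made directly
// when a call instruction can reach at most reach bytes.
func callsNeedingTrampolines(fns []fn, calls []call, reach uint64) []call {
	// Optimistic pass: if all the code fits within reach, every call is
	// trivially in range and no per-call work is needed.
	var total uint64
	for _, f := range fns {
		total += f.size
	}
	if total <= reach {
		return nil
	}
	// Pessimistic pass: check each call against the actual layout.
	addrs := layout(0, fns)
	var far []call
	for _, c := range calls {
		from, to := addrs[c.from], addrs[c.to]
		dist := from - to
		if to > from {
			dist = to - from
		}
		if dist >= reach {
			far = append(far, c)
		}
	}
	return far
}

func main() {
	fns := []fn{{"main.main", 16}, {"pkg.big", 200}, {"fx.New", 16}}
	calls := []call{{0, 1}, {0, 2}}
	for _, c := range callsNeedingTrampolines(fns, calls, 128) {
		fmt.Printf("%s -> %s needs a trampoline\n", fns[c.from].name, fns[c.to].name)
	}
}
```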

Figure 17: The high-level process the Go linker (as of Go 1.19) uses to assign function addresses

An interesting discovery here is that there is a variable, FlagDebugTramp, which appears to force trampoline insertion, and is set by a not-well-documented command line flag, debugtramp. Thus, we can reproduce the problem 100% of the time, even for helloworld, and we could file a detailed issue with reproduction steps at this time.

Figure 18: Reproducing the issue on any trivial ARM64 Go binary (*helloworld*)

The key observation to make here is that the linker has created new code. We also know the linker creates a symbol, or name for the trampoline (e.g., fmt.Fprintln+0-tramp0). This happens in architecture-specific trampoline code for each trampoline.

But further along this function we see some interesting behavior:

In particular, this call takes the global function symbol index symIdx and copies the symbol to the heap. Like other linkers, Go’s linker mmap()s the input object files read-only for efficiency; if it must change a function symbol (e.g., to change a relocation target to a trampoline), it has to deep copy the symbol into the heap so it can be modified, and mark it as external. This is essentially “copy-on-write.”

Figure 19A, 19B, 19C: Portions of the relevant code used to insert trampolines inside the Go linker

The problem is that, as we saw, the DWARF generation phase skips over any symbol marked as external.

Solution

With this in mind, we can’t change this core part of the linker design.

But the false assumption made in GetFuncDwarfAuxSyms looks like the real problem. Indeed, this is a classic one-line bug fix.

Figure 20: A simple change to the DWARF LPT generator that fixes the bug.
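For readers without the figure to hand, the interaction can be modelled in miniature. Everything in this sketch is hypothetical (the sym type, cloneToHeap, and both lookup functions); it is a toy model of the behaviour described above, not the linker’s code or the actual patch:

```go
package main

import "fmt"

// sym is a toy stand-in for a function symbol known to the linker.
type sym struct {
	name     string
	external bool     // true once the symbol has been copied to the heap
	auxSyms  []string // DWARF aux symbols emitted by the compiler
}

// cloneToHeap models copy-on-write: the copy can be modified (for example
// to retarget a relocation at a trampoline) and is marked external.
func cloneToHeap(s sym) sym {
	s.external = true
	return s
}

// auxSymsBuggy assumes that an external symbol never has DWARF aux symbols,
// which stops being true once trampolines force a copy.
func auxSymsBuggy(s sym) []string {
	if s.external {
		return nil
	}
	return s.auxSyms
}

// auxSymsFixed drops that assumption and returns the aux symbols the
// function had before it was copied.
func auxSymsFixed(s sym) []string {
	return s.auxSyms
}

func main() {
	orig := sym{name: "main.main", auxSyms: []string{"info", "lines"}}
	patched := cloneToHeap(orig) // what happens when a trampoline is inserted
	fmt.Println("buggy lookup finds:", len(auxSymsBuggy(patched)), "aux symbols")
	fmt.Println("fixed lookup finds:", len(auxSymsFixed(patched)), "aux symbols")
}
```

In this model, any function whose relocations are touched loses its line information under the buggy lookup, which matches the partial LPT we observed.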

Of course, we should write a test to reproduce the bug and verify the fix.

Conclusion

It’s remarkable that this bug existed for so long (about a year), but one hypothesis is that ARM64 has traditionally been used in smaller embedded systems with smaller binaries, and that the combination of the recent emergence of ARM64 on personal computers with the larger Go binaries used by companies like Uber has begun to surface this problem.

It’s also interesting to note that trampolines were very common in the past. In the pre-32-bit era, C had near and far keywords. Trampolines were very common then, because a near pointer could not reach beyond 64KB on the 8086, and a more expensive far pointer was needed to reach up to 1MB. This problem has returned in the new RISC era, due to the fixed size of instructions.

The Go team were very receptive to this submission, and this fix was backported into go1.19.1 and go1.18.6 within a month. We would like to thank Cherry Mui and Than McIntosh of Google for their prompt code reviews and suggestions.

A closing thought is that readers may be wondering why some of our binaries became so large. Whilst there are several reasons, one in particular relates to the way the linker detects what content in all of the input object files is reachable through a process known as “dead code analysis”. If time permits, I would like to do a follow up blog on the limitations of the Go linker’s dead code analysis, and what we can do to improve it.

Jeremy Quirke

Jeremy Quirke is a Senior Staff Engineer at Uber working primarily on the underlying technology behind Earner Upfront Pricing, and is passionate about improving engineering tooling.
