> Responses to my publication submissions often claimed such problems did not exist
I see this often even in communities of software engineers, where people who are unaware of certain limitations at scale will announce that the research is unnecessary
Sure! But there's a sleight of hand in the numbers here where we're talking about 25GB binaries with debuginfo and then 2GB maximum offsets in the .text section. Of a 25GB binary, probably 24.5GB is debuginfo. You have to get into truly huge binaries before >2GB calls become an issue.
(I wonder, but have no particular insight into whether, LTO builds can do smarter things here -- most calls are local, but the handful of far calls could use the more expensive spelling.)
At Google I worked with one statistics aggregation binary[0] that was ~25GB stripped. The distributed build system wouldn't even build the debug version because it exceeded the maximum configured size for any object file. I never asked if anyone had tried factoring it into separate pipelines but my intuition is that the extra processing overhead wouldn't have been worth splitting the business logic that way; once the exact set of necessary input logs are in memory you might as well do everything you need to them given the dramatically larger ratio of data size to code size.
… on the x86 ISA because it encodes the 32-bit jump/call offset directly in the opcode.
Whilst most RISC architectures do allow PC-relative branches, the offset is relatively small as 32-bit opcodes do not have enough room to squeeze a large offset in.
«Long» jumps and calls are indirect branches / calls done via registers where the entirety of 64 bits is available (address alignment rules apply in RISC architectures). The target address has to be loaded / calculated beforehand, though. Available in RISC and x86 64-bit architectures.
To be fair, this is with debug symbols. Debug builds of Chrome were in the 5GB range several years ago; no doubt that’s increased since then. I can remember my poor laptop literally running out of RAM during the linking phase due to the sheer size of the object files being linked.
Why are debug symbols so big? For C++, they’ll include detailed type information for every instantiation of every type everywhere in your program, including the types of every field (recursively), method signatures, etc. etc., along with the types and locations of local variables in every method (updated on every spill and move), line number data, etc. etc. for every specialization of every function. This produces a lot of data even for “moderate”-sized projects.
Worse: for C++, you don’t win much through dynamic linking because dynamically linking C++ libraries sucks so hard. Templates defined in header files can’t easily be put in shared libraries; ABI variations mean that dynamic libraries generally have to be updated in sync; and duplication across modules is bound to happen (thanks to inlined functions and templates). A single “stuck” or outdated .so might completely break a deployment too, which is a much worse situation than deploying a single binary (either you get a new version or an old one, not a broken service).
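To make the parent's point about C++ debug info concrete, here is a small, hedged illustration (not from the thread; the file name and the inspection command in the comments are just one plausible way to look at it): every distinct instantiation drags its own full DWARF description along.

```cpp
// Toy example: each distinct instantiation below gets its own complete DWARF
// description -- members, methods, iterators, allocator machinery, locals of
// every inline helper -- even though the source is a few lines. One way to
// inspect the result, assuming an LLVM toolchain is available:
//   clang++ -g -c types.cpp && llvm-dwarfdump --statistics types.o
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::vector<int>> a;
std::map<int, std::map<std::string, double>> b;
std::map<std::vector<std::string>, std::string> c;

int main() { return static_cast<int>(a.size() + b.size() + c.size()); }
```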
I've hit the same thing in Rust, probably for the same reasons.
Isn't the simple solution to use detached debug files?
I think Windows and Linux both support them. That's how mobile platforms like Android and iOS get useful crash reports out of small binaries: they just upload the stack trace and some service like Sentry translates that back into source line numbers. (It's easy to do manually too)
I'm surprised the author didn't mention it first. A 25 GB exe might be 1 GB of code and 24 GB of debug crud.
> Isn't the simple solution to use detached debug files?
It should be. But the tooling for this kind of thing (anything to do with executable formats including debug info and also things like linking and cross-compilation) is generally pretty bad.
The problem is that when a final binary is linked everything goes into it. Then, after the link step, all the debug information gets stripped out into the separate symbols file. That means at some point during the build the target binary file will contain everything. I can not, for example, build clang in debug mode on my work machine because I have only 32 GB of memory and the OOM killer comes out during the final link phase.
Of course, separate binaries files make no difference at runtime since only the LOAD segments get loaded (by either the kernel or the dynamic loader, depending). The size of a binary on disk has little to do with the size of a binary in memory.
> The problem is that when a final binary is linked everything goes into it
I don't think that's the case on Linux, when using -gsplit-dwarf the debug info is put in separate files at the object file level, they are never linked into binaries.
Yes, but it can be more of a pain keeping track of pairs. In production though, this is what's done. And given a fault, the debug binary can be found in a database and used to gdb the issue given the core. You do have to limit certain online optimizations in order to have useful tracebacks.
This also requires careful tracking of prod builds and their symbol files... A kind of symbol db.
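For readers who haven't done this before, a minimal sketch of the two separation approaches discussed in this sub-thread; `app.cpp` and the output names are made up, and exact flags depend on your toolchain.

```cpp
// app.cpp is a stand-in program; the commented commands are the point.
//
// (a) Post-link split into a detached debug file:
//   g++ -g -O2 app.cpp -o app
//   objcopy --only-keep-debug app app.debug    # keep the DWARF on the side
//   objcopy --strip-debug app                  # ship this smaller binary
//   objcopy --add-gnu-debuglink=app.debug app  # let gdb locate app.debug
//   addr2line -e app.debug 0x401a2f            # symbolize a crash address later
//
// (b) Compile-time split with -gsplit-dwarf: the bulk of the DWARF goes into
//     .dwo files next to the object files and never reaches the final link:
//   g++ -g -gsplit-dwarf -O2 -c app.cpp
//   g++ app.o -o app
//   # optionally pack the .dwo files with dwp / llvm-dwp for distribution
int main() { return 0; }
```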
To be fair, they worked at Google, their engineering decisions are not normal. They might just decide that 25 GiB binaries are worth a 0.25% speedup at start time, potentially resulting in tens of millions of dollars' worth of difference. Nobody should do things the way Google does, but it's interesting to think about.
The overall size wouldn't get smaller just because it is dynamically linked, on the contrary (because DLLs are a dead code elimination barrier). 25 GB is insane either way; something must have gone horribly wrong very early in the development process (also, why even ship with debug information included? That doesn't make sense in the first place).
Won't make a bit of difference because everything is in a sort of container (not Docker) anyway. Unless you're suggesting those libraries be distributed as a base image to every possible Borg machine your app can run on, which is an obvious non-starter.
You can use thunks/trampolines. lld can make them for some architectures, presumably also for x86_64. Though I don't know why it didn't in your case.
But, like the large code model, it can be expensive to add trampolines, both in icache performance and in raw execution time if a trampoline sits on a particularly hot path.
> With this information, the necessity of code-models feels unecessary [sic]. Why trigger the cost for every callsite when we can do-so piecemeal as necessary with the opportunity to use profiles to guide us on which methods to migrate to thunks.
Does the linker have access to the same hotness information that the compiler uses during PGO? Well -- presumably it could, even if it doesn't now. But it would be like a heuristic with a hotness threshold? Do linkers "do" heuristics?
Note, sections without the SHF_ALLOC flag, such as `.debug_*` sections, do not contribute to the relocation distance pressure.
Many 10+GiB binaries (likely due to not using split DWARF) might have much smaller code+data and be nowhere near the limit.
However, Google, Meta, and ByteDance have encountered the x86-64 relocation distance issue with their huge C++ server binaries.
To my knowledge industry users in other domains haven't run into this problem.
To address this, Google adopted the medium code model approximately two years ago for its sanitizer and PGO instrumentation builds.
CUDA fat binaries also caused problems. I suggest that linker script `INSERT BEFORE/AFTER` for orphan sections (https://reviews.llvm.org/D74375 ) served as a key mitigation.
I hope that a range extension thunk ABI, similar to AArch64/Power, is defined for the x86-64 psABI.
It is better than the current long branch pessimization we have with -mcmodel=large.
---
It seems that nobody has run into this .eh_frame_hdr implementation limitation yet
* `.eh_frame_hdr -> .text`: GNU ld and ld.lld only support 32-bit offsets (`table_enc = DW_EH_PE_datarel | DW_EH_PE_sdata4;`) as of Dec 2025.
shameless plug: if you want to understand the content of this post better, first read the first half of my article on jumps [1] (up to syscall). goes into detail about relocations and position-independent code.
I've seen terrible, terrible binary sizes with Eigen + debug symbols, due to how Eigen lazy evaluation works (I think). Every math expression ends up as a new template instantiation.
In terms of compile times, boost geometry is somehow worse. You're encouraged to import boost/geometry.hpp, which includes every module, which stalls compile times by several seconds just to parse all the templates. It's not terrible if you include just the headers you need, but that's not the "default" that most people use.
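A toy expression-template sketch (emphatically not Eigen's real implementation) of why every distinct expression becomes a distinct type that the compiler then has to describe in the debug info:

```cpp
// Each operator+ below produces a brand-new type, so a + b + c + d has type
// Sum<Sum<Sum<Vec, Vec>, Vec>, Vec>. Every nesting level is another class the
// debug info must describe; real libraries nest far deeper.
#include <array>
#include <cstddef>

using Vec = std::array<double, 4>;

template <class L, class R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }  // toy: deliberately unconstrained

int main() {
    Vec a{1, 2, 3, 4}, b{}, c{}, d{};
    // Evaluate within one full expression so the reference-holding temporaries stay alive.
    return static_cast<int>((a + b + c + d)[0]);
}
```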
Oh man, that first paragraph: “Such problems don’t exist…” What a gaslighting response to a publication submission. The least they could do is ask where this problem emerges, and you can hand-wave your answer without revealing business IP.
Also, we, as an industry of software engineers, need to re-examine these hard defaults we thought would never be reached, such as the .text limits.
> I had observed binaries beyond 25GiB, including debug symbols. How is this possible? These companies prefer to statically build their services to speed up startup and simplify deployment. Statically including all code in some of the world’s largest codebases is a recipe for massive binaries.
I am very sympathetic to wanting nice static binaries that can be shipped around as a single artifact[0], but... surely at some point we have to ask if it's worth it? If nothing else, that feels like a little bit of a code smell; surely if your actual executable code doesn't even fit in 2GB it's time to ask if that's really one binary's worth of code or if you're actually staring at like... a dozen applications that deserve to be separate? Or get over it the other way and accept that sometimes the single artifact you ship is a tarball / OCI image / EROFS image for systemd[1] to mount+run / self-extracting archive[2] / ...
[0] Seriously, one of my background projects right now is trying to figure out if it's really that hard to make fat ELF binaries.
[1] https://systemd.io/PORTABLE_SERVICES/
[2] https://justine.lol/ape.html > "PKZIP Executables Make Pretty Good Containers"
This is something that always bothered me while I was working at Google too: we had an amazing compute and storage infrastructure that kept getting crazier and crazier over the years (in terms of performance, scalability and redundancy) but everything in operations felt slow because of the massive size of binaries. Running a command line binary? Slow. Building a binary for deployment? Slow. Deploying a binary? Slow.
The answer to an ever-increasing size of binaries was always "let's make the infrastructure scale up!" instead of "let's... not do this crazy thing maybe?". By the time I left, there were some new initiatives towards the latter and the feeling that "maybe we should have put limits much earlier" but retrofitting limits into the existing bloat was going to be exceedingly difficult.
There's a lot of tooling built on static binaries:
- google-wide profiling: the core C++ team can collect data on how much of fleet CPU % is spent in absl::flat_hash_map re-bucketing (you can find papers on this publicly)
- crashdump telemetry
- dapper stack trace -> codesearch
Borg literally had to pin the bash version because letting the bash version float caused bugs. I can't imagine how much harder debugging L7 proxy issues would be if I had to follow a .so rabbit hole.
I can believe shrinking binary size would solve a lot of problems, and I can imagine ways to solve the .so versioning problem, but for every problem you mention I can name multiple other probable causes (eg was startup time really execvp time, or was it networked deps like FFs).
We are missing tooling to partition a huge binary into a few larger shared objects.
My https://maskray.me/blog/2023-05-14-relocation-overflow-and-c... (linked by the author, thanks! Though I maintain lld/ELF rather than "wrote" it - it's the work of many engineers) covers this.
Quoting the relevant paragraphs below:
## Static linking
In this section, we will deviate slightly from the main topic to discuss static linking. By including all dependencies within the executable itself, it can run without relying on external shared objects. This eliminates the potential risks associated with updating dependencies separately.
Certain users prefer static linking or mostly static linking for the sake of deployment convenience and performance aspects:
* Link-time optimization is more effective when all dependencies are known. Providing shared object information during executable optimization is possible, but it may not be a worthwhile engineering effort.
* Profiling techniques are more efficient dealing with one single executable.
* The traditional ELF dynamic linking approach incurs overhead to support [symbol interposition](https://maskray.me/blog/2021-05-16-elf-interposition-and-bsy...).
* Dynamic linking involves PLT and GOT, which can introduce additional overhead. Static linking eliminates the overhead.
* Loading libraries in the dynamic loader has a time complexity `O(|libs|^2*|libname|)`. The existing implementations are designed to handle tens of shared objects, rather than a thousand or more.
Furthermore, the current lack of techniques to partition an executable into a few larger shared objects, as opposed to numerous smaller shared objects, exacerbates the overhead issue.
In scenarios where the distributed program contains a significant amount of code (related: software bloat), employing full or mostly static linking can result in very large executable files. Consequently, certain relocations may be close to the distance limit, and even a minor disruption (e.g. add a function or introduce a dependency) can trigger relocation overflow linker errors.
> We are missing tooling to partition a huge binary into a few larger shared objects
Those who do not understand dynamic linking are doomed to reinvent it.
There’s no way my proxy binary actually requires 25GB of code, or even the 3GB it is. Sounds to me like the answer is a tree shaker.
Google implemented the C++ equivalent of a tree shaker in their build system around 2009.
The front-end services, to be "fast", AFAIK probably include nearly all the services you need in order to avoid hops -- so you can't really shake that much away.
I think Google of all companies could build a good autostripper, reducing binaries by partially loading assembly on misses. It can't be much slower than shovelling a full monorepo's worth of assembly plus symbols into RAM.
The low-hanging fruit is just not shipping the debuginfo, of course.
Is compressed debug info a thing? It seems likely to compress well, and if it's rarely used then it might be a worthwhile thing to do?
It is: https://maskray.me/blog/2022-01-23-compressed-debug-sections
But the compression ratio isn't magical (approx. 1:0.25, for both zlib and zstd in the examples given). You'd probably still want to set aside debuginfo in separate files.
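For the curious, roughly what this looks like in practice; the file name is a placeholder, and zstd support is newer than zlib, so treat the flag choice as toolchain-dependent.

```cpp
// demo.cpp is a stand-in; the commands sketch compressed debug sections.
//   g++ -g -O2 -gz=zlib demo.cpp -o demo   # or -gz=zstd on newer GCC/Clang
//   readelf -S demo                        # compressed .debug_* sections carry the 'C' flag
// Debuggers decompress transparently, but as noted above the ratio is modest
// compared with simply splitting the debug info out.
int main() { return 0; }
```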
Small brained primate comment.
With embedded firmware you only flash the .text and data to the device, but you can still debug using the .elf file. In my case, if I get a bus fault I'll pull the offending address off the stack and use binutils and the .elf to show me who was naughty. I think if you have a crash dump you should be able to make sense of things as long as you keep the unstripped .elf file around.
Maybe I am missing something, but why didn't they just leverage dynamic libraries?
When I was at Google, on an SRE team, here is the explanation that I was given.
Early on Google used dynamic libraries. But weird things happen at Google scale. For example Google has a dataset known, for fairly obvious reasons, as "the web". Basically any interesting computation with it takes years. Enough to be a multiple of the expected lifespan of a random computer. Therefore during that computation, you have to expect every random thing that tends to go wrong, to go wrong. Up to and including machines dying.
One of the weird things that becomes common at Google scale, are cosmic bit flips. With static binaries, you can figure out that something went wrong, kill the instance, launch a new one, and you're fine. That machine will later launch something else and also be fine.
But what happens if there was a cosmic bit flip in a dynamic library? Everything launched on that machine will be wrong. This has to get detected, then the processes killed and relaunched. Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that...winds up broken for the same reason! Often the killed process will relaunch on the bad machine, failing again! This will continue until someone reboots the machine.
Static binaries are wasteful. But they aren't as problematic for the infrastructure as detecting and fixing this particular condition. And, according to SRE lore circa 2010, this was the actual reason for the switch to static binaries. And then they realized all sorts of other benefits. Like having a good upgrade path for what would normally be shared libraries.
> But what happens if there was a cosmic bit flip in a dynamic library?
I think there were more basic reasons we didn't ship shared libraries to production.
1. They wouldn't have been "shared", because every program was built from its own snapshot of the monorepo, and would naturally have slightly different library versions. Nobody worried about ABI compatibility when evolving C++ interfaces, so (in general) it wasn't possible to reuse a .so built at another time. Thus, it wouldn't actually save any disk space or memory to use dynamic linking.
2. When I arrived in 2005, the build system was embedding absolute paths to shared libraries into the final executable. So it wasn't possible to take a dynamically linked program, copy it to a different machine, and execute it there, unless you used a chroot or container. (And at that time we didn't even use mount namespaces on prod machines.) This was one of the things we had to fix to make it possible to run tests on Forge.
3. We did use shared libraries for tests, and this revealed that ld.so's algorithm for symbol resolution was quadratic in the number of shared objects. Andrew Chatham fixed some of this (https://sourceware.org/legacy-ml/libc-alpha/2006-01/msg00018...), and I got the rest of it eventually; but there was a time before GRTE, when we didn't have a straightforward way to patch the glibc in prod.
That said, I did hear a similar story from an SRE about fear of bitflips being the reason they wouldn't put the gws command line into a flagfile. So I can imagine it being a rationale for not even trying to fix the above problems in order to enable dynamic linking.
> Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that...winds up broken for the same reason!
I did see this failure mode occur for similar reasons, such as corruption of the symlinks in /lib. (google3 executables were typically not totally static, but still linked libc itself dynamically.) But it always seemed to me that we had way more problems attributable to kernel, firmware, and CPU bugs than to SEUs.
Thanks. It is nice to hear another perspective on this.
But here is a question. How much of SEUs not being problems were because they weren't problems? Versus because there were solutions in place to mitigate the potential severity of that kind of problem? (The other problems that you name are harder to mitigate.)
Memory and disk corruption definitely were a problem in the early days. See https://news.ycombinator.com/item?id=14206811 for example. I also recall an anecdote about how the search index basically became unbuildable beyond a certain size due to the probability of corruption, which was what inspired RecordIO. I think ECC RAM and transport checksums largely fixed those problems.
It's pretty challenging for software to defend against SEUs corrupting memory, especially when retrofitting an existing design like Linux. While operating Forge, we saw plenty of machines miscompute stuff, and we definitely worried about garbage getting into our caches. But my recollection is that the main cause was individual bad CPUs. We would reuse files in tmpfs for days without reverifying their checksums, and while we considered adding a scrubber, we never saw evidence that it would have caught much.
Maybe the CPU failures were actually due to radiation damage, but as they tended to be fairly sticky, my guess is something more like electromigration.
As a developer depending on the infrastructure and systems you guys make reliable every day inside Google, Bless You. Truly.
When Forge has a problem, I might as well go on a nature hike.
In Azure - which I think is at Google scale - everything is dynamically linked. Actually a lot of Azure is built on C# which does not even support static linking...
Statically linking being necessary for scaling does not pass the smell test for me.
I never worked for Google, but have seen some strange things like bit flips at more modest scales. From the parent description, it looks like defaulting to static binaries helps speed up troubleshooting by removing the “this should never happen, but statistically will happen every so often” class of bugs.
As I see it, the issue isn’t requiring static compiling to scale. It’s requiring it to make troubleshooting or measuring performance at scale easier. Not required, per se, but very helpful.
Exactly. SRE is about monitoring and troubleshooting at scale.
Google runs on a microservices architecture. It's done that since before that was cool. You have to do a lot to make a microservices architecture work. Google did not advertise a lot of that. Today we have things like Data Dog that give you some of the basics. But for a long time, people who left Google faced a world of pain because of how far behind the rest of the world was.
Azure's devops record is not nearly as good as Google's was.
The biggest datasets that ChatGPT is aware of being processed in complex analytics jobs on Azure are roughly a thousand times smaller than an estimate of Google's regularly processed snapshot of the web. There is a reason why most of the fundamental advancements in how to parallelize data and computations - such as map-reduce and BigTable - all came from Google. Nobody else worked at their scale before they did. (Then Google published it, and people began to implement it. Then failed to understand what was operationally important to making it actually work at scale...)
So, despite how big it is, I don't think that Azure operates at Google scale.
For the record, back when I worked at Google, the public internet was only the third largest network that I knew of. Larger still was the network that Google uses for internal API calls. (Do you have any idea how many API calls it takes to serve a Google search page?) And larger still was the network that kept data synchronized between data centers. (So, for example, you don't lose your mail if a data center goes down.)
perhaps that's why azure has such a bad reputation in the devops crowd.
Does AWS have a good reputation in devops? Because large chunks of AWS are built on Java - which also does not offer static linking (bundling a bunch of *.jar files into one exe does not count as static linking). Still does not pass the smell test.
In AWS, only the very core Infra-as-a-Service that they dogfood can be considered "good". Everything else that's more Platform-as-a-Service can be considered a half-baked leaky abstraction. Anything they release as "GA", especially around re:Invent, should be avoided for a minimum of 6 months to a year, since it's more like a public beta with some guaranteed bugs.
In AWS, only the very core Infra-as-a-Service that they dogfood can be considered "good" - large chunks of which are, by the way, written in Java. I think you are proving my point...
Which just means Java isn't affected? Or your definition - which doesn't count bundled-but-not-shared jars as static linking - is wrong, since they achieve the same effect.
> But what happens if there was a cosmic bit flip in a dynamic library?
You'd need multiple of those, because you have ECC. Not impossible, but getting all those dice rolled the same way requires even bigger scale than Google's.
Sounds like Google should put their computers at Homestake
One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely. In environments like Google’s it's important to know that what you have deployed to production is exactly what you think it is.
See for more: https://google.github.io/building-secure-and-reliable-system...
> One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely.
It depends.
If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something has been accidentally left out due to an unaccounted-for artefact.
One solution could be bundling the binary or related multiple binaries with the operating system image but that would incur a multidimensional overhead that would be unacceptable for most people and then we would be talking about «an application binary statically linked into the operating system» so to speak.
> If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something has been accidentally left out due to an unaccounted-for artefact.
The whole point of Binary Provenance is that there are no unaccounted-for artifacts: Every build should produce binary provenance describing exactly how a given binary artifact was built: the inputs, the transformation, and the entity that performed the build. So, to use your example, you'll always know which artefacts were linked against that bad version of libc.
See https://google.github.io/building-secure-and-reliable-system...
I am well aware of and understand that.
However,
> […] which artefacts were linked against that bad version of libc.
There is one libc for the entire system (a physical server, a virtual one, etc.), including the application(s) that have been deployed into an operating environment.
In the case of the entire operating environment (the OS + applications) being statically linked against a libc, the entire operating environment has to be re-linked and redeployed as a single concerted effort.
In dynamically linked operating environments, only the libc needs to be updated.
The former is a substantially more laborious and inherently more risky effort unless the organisation has achieved a sufficiently large scale where such deployment artefacts are fully disposable and the deployment process is fully automated. Not many organisations practically operate at that level of maturity and scale, with FAANG or similar scale being a notable exception. It is often cited as an aspiration, yet the road to that level of maturity is winding and fraught with many real-life shortcuts, which result in the binary provenance being ignored or rendered irrelevant. The expected aftermath is, of course, a security incident.
What is the point you're trying to make?
I claimed that Binary Provenance was important to organizations such as Google where it is important to know exactly what has gone into the artefacts that have been deployed into production. You then replied "it depends" but, when pressed, defended your claim by saying, in effect, that binary provenance doesn't work in organizations with immature engineering practices that don't actually follow the practice of enforcing Binary Provenance.
But I feel like we already knew that practices don't work unless organizations actually follow them.
So what was your point?
Sounds like Google could really use Nix
> https://systemd.io/PORTABLE_SERVICES/
Systemd and portable?
Portable across systemd/Linux systems, yes:)
What's wild to me is not using -gsplit-dwarf to have separate debug info and "normal-sized" binaries
Google contributed the code, and the entire concept, of DWARF fission to both GCC and LLVM. This suggests that rather than overlooking something obvious that they'll be embarrassed to learn on HN, they were aware of the issues and were using the solutions before you'd even heard of them.
A case of the left hand not knowing what the right hand is doing?
There's no contradiction, no missing link in the facts of the story. They have a huge program, it is 2GiB minus epsilon of .text, and a much larger amount of DWARF stuff. The article is about how to use different code models to potentially go beyond 2GiB of text, and the size of the DWARF sections is irrelevant trivia.
> They have a huge program, it is 2GiB minus epsilon of .text,
but the article says 25+GiB including debug symbols, in a single binary?
Also, I appreciate your enthusiasm in assuming that because some people do something in an organization, it is applied consistently everywhere. Hell, if it were Microsoft, other departments would try to shoot down the "debug tooling optimization" dept.
Yes, the 25GB figure in the article is basically irrelevant to the 2GB .text section concern. Most ELF files that size are 95%+ debuginfo.
Yes, and that's what I'm saying: I find it crazy not to split the debug info out. At least on my machine it really makes a noticeable difference in load time whether I load a binary which is ~2GB with debug info in or the same binary which is ~100MB with debug info out.
Doesn't make any difference in practice. The debug info is never mapped into memory by the loader. This only matters if you want to store the two separately, i.e. lazy-load debug symbols if needed.
this is just not true. I just tried with one of my binaries which is 3.2G unstripped, and 150MB-ish stripped. Unstripped takes 23 seconds until the window shows up, stripped takes ~a second
There is something wacky going on with your system, or the program is written in a way that makes it traverse the debug info if it is present. What program is it?
For example I can imagine desktop operating system antivirus/integrity checks having this effect.
ELF is just a container format and you can put literally anything into one of its sections. Whether the DWARF sections are in "the binary" or in another named file is really quite beside the point.
If you have 25gb of executables then I don’t think it matters if that’s one binary executable or a hundred. Something has gone horribly horribly wrong.
I don’t think I’ve ever seen a 4gb binary yet. I have seen instances where a PDB file hit 4gb and that caused problems. Debug symbols getting that large is totally plausible. I’m ok with that at least.
Llamafile (https://llamafile.ai) can easily exceed 4GB due to containing LLM weights inside. But remember, you cannot run >4GB executable files on Windows.
I did: it was a Spring Boot fat jar with an NLP component. I had to deploy it to the biggest instance AWS could offer, and the costs were enormous.
Java bytecode is always dynamically linked.
Still, if I remember correctly I had to reserve 6 GB of memory so that the JVM could actually start.
A few ps3 games I've seen had 4GB or more binaries.
This was a problem because code signing meant it needed to be completely replaced by updates.
> A few ps3 games I've seen had 4GB or more binaries.
Is this because they are embedding assets into the binary? I find it hard to believe anyone was carrying around enough code to fill 4GB in the PS3 era...
I assume so, there were rarely any other files on the disc in this case.
It varied between games, one of the battlefields (3 or bad company 2) was what I was thinking of. It generally improved with later releases.
The 4GB file size was significant, since it meant I couldn't run them from a backup on a fat32 usb drive. There are workarounds for many games nowadays.
If you haven't seen a 25GB binary with debuginfo, you just aren't working in large, templated, C++ codebases. It's nothing special there.
Not quite. I very much work in large, templated, C++ codebases. But I do so on windows where the symbols are in a separate file the way the lord intended.
Debug symbol size shouldn't be influencing relocation jump distances - debug info has its own ELF section.
Regardless of whether you're FAANG or not, nothing you're running should require an executable with a 2 GB large .text section. If you're bumping into that limit, then your build process likely lacks dead code elimination in the linking step. You should be using LTO for release builds. Even the traditional solution (compile your object files with -ffunction-sections and link with --gc-sections) does a good job of culling dead code at function-level granularity.
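A minimal sketch of the function-level garbage collection described above; the file and symbol names are invented, and exact flags vary between GCC/Clang and linkers.

```cpp
// Pasted as one file for brevity; treat the two halves as lib.cpp and
// main.cpp. With -ffunction-sections every function lands in its own
// section, and --gc-sections drops the sections nothing references:
//   g++ -O2 -ffunction-sections -fdata-sections -c lib.cpp main.cpp
//   g++ -Wl,--gc-sections lib.o main.o -o app
//   nm app | grep never_called   # should be gone
// -flto goes further: the compiler sees the whole program at link time and
// can prune (and inline) across translation units.

// lib.cpp
int used() { return 42; }
int never_called() { return 7; }  // unreferenced -> its section is discarded

// main.cpp
int used();
int main() { return used(); }
```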
Google Chrome ships as a 500 MB binary on my machine, so if you're embedding a web browser, that's how much you need minimum. Now tack on whatever else your application needs and it's easy to see how you can go past 2 GB if you're not careful. (To be clear, I am not making a moral judgment here, I am just saying it's possible to do. Whether it should happen is a different question.)
Do you have some special setup?
Chromium is in the hundred and something MB range on mine last I looked. Might expand to more on install.
I just checked Google Chrome Framework on my Mac, it was a little over 400 MB. Although now that I think about it it's probably a universal binary so you can cut that in half?
Yea looks like Chrome ships a universal binary with both x86_64 and arm64.
Makes sense; Chromium on my Fedora system takes up 234 MB.
FAANGs were deeply involved in designing LTO. See, e.g.,
https://research.google/pubs/thinlto-scalable-and-incrementa...
And other refs.
And yet...
Google also uses identical code folding. It's a pretty silly idea that a shop that big doesn't know about the compiler flags.
Google is made up of many thousands of individuals. Some experts will be aware of all those flags; some won't. On my team, many didn't know about those details, as they were handled by other build teams for specific products or entire domains at once.
But since each product in some domains had to actively enable those optimizations for itself, they were occasionally forgotten, and I found a few missing in the app I worked for (but not directly on).
ICF seems like a good one to keep in the box of flags people don't know about, because like everything in life it's a tradeoff, and keeping that one problematic artifact under 2 GiB is pretty much the only non-debatable use case for it.
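For context, a minimal sketch of what ICF does, assuming ld.lld (gold has an equivalent flag); `f` and `g` are hypothetical functions that happen to compile to identical machine code:

    // Build (sketch): clang++ -O2 -ffunction-sections icf.cpp \
    //                         -fuse-ld=lld -Wl,--icf=all -o app
    // --icf=all folds sections with identical contents and relocations, so f
    // and g share one copy of the code. The tradeoff: &f == &g afterwards,
    // which breaks code that relies on distinct function addresses (hence
    // the more conservative --icf=safe).
    int f(int x) { return x * 2 + 1; }
    int g(int y) { return y * 2 + 1; }   // identical codegen; folded with f

    int main() { return f(1) + g(2); }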
> We would like to keep our small code-model. What other strategies can we pursue?
Move all the hot BBs near each other, right?
Facebook's solution: https://github.com/llvm/llvm-project/blob/main/bolt%2FREADME...
Google's:
https://lists.llvm.org/pipermail/llvm-dev/2019-September/135...
But for x86_64, as of right now, if even a single call needs more than 31 bits of reach, you have to upgrade the whole code section to the large code model.
BOLT, AFAIU, is more about cache locality (putting hot code near other hot code) than about breaking the 2 GiB barrier.
Why? Can't the linker or post-link optimizer reduce all near calls, leaving the more complicated mov with immediate form only where required?
Once the compiler has generated a 32-bit relative jump with an R_X86_64_PLT32 relocation, it’s too late. (A bit surprising for it to be a PLT relocation, but it does make some sense upon reflection, and the linker turns it into a direct call if you’re statically linking.) I think only RISC-V was brave enough to allow potentially size-changing linker relaxation, and incidentally they screwed it up (the bug tracker says “too late to change”, which brings me great sadness given we’re talking about a new platform).
On x86-64 it would probably be easier to point the relative call to a synthesized trampoline that does a 64-bit one, but it seems nobody has bothered thus far. You have to admit that sounds pretty painful.
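A sketch of what such a synthesized trampoline could look like (hypothetical for x86-64 today; the instruction sequences are shown as comments):

    // far_target is an assumed function that ends up more than 2 GiB away.
    extern void far_target();

    void caller() { far_target(); }

    // The compiler keeps emitting the cheap 5-byte form:
    //     call  far_target              ; rel32, +/-2GiB reach
    // A linker that supported x86-64 range-extension thunks could retarget
    // that call at a nearby stub:
    //     __far_target_thunk:           ; placed within reach of caller
    //         movabs r11, offset far_target
    //         jmp    r11                ; tail-jump; ret still returns to caller
    // Picking a scratch register (r11 here) is part of why this needs an
    // agreed-upon ABI, but only out-of-range calls would pay the long form.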
> The simplest solution however is to use -mcmodel=large which changes all the relative CALL instructions to absolute JMP.
Makes sense, but in the assembly output just after, there is not a single JMP instruction. Instead, CALL <immediate> is replaced with putting the address in a 64-bit register, then CALL <register>, which makes even more sense. But why mention the JMP thing then? Is it a mistake or am I missing something? (I know some calls are replaced by JMP, but that's done regardless of -mcmodel=large)
I would assume loose language, referring to a CALL as a JMP. However, of the two reasons given to dislike the large code model, register pressure isn't relevant to that particular snippet.
It's performing a call, ABIs define registers that are not preserved over calls; writing the destination to one of those won't affect register pressure.
I think the author is just noting that the construction is similar to an 8-byte JMP instruction. The text now reads:
> The simplest solution however is to use -mcmodel=large which changes all the relative CALL instructions to absolute 64bit ones; kind of like a JMP.
(We still need to use CALL in order to push a return address.)
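For the curious, a rough sketch of the two forms, assuming x86-64 GCC/Clang and a plain non-PIC build; `callee` is a hypothetical stand-in:

    extern void callee();              // hypothetical
    void caller() { callee(); }

    // Default small code model: one 32-bit PC-relative call, +/-2GiB reach.
    //     call   callee
    //
    // -mcmodel=large: materialize the full 64-bit address first, then call
    // indirectly through the register; no literal JMP for an ordinary call.
    //     movabs rax, offset callee
    //     call   rax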
> Responses to my publication submissions often claimed such problems did not exist
I see this often even in communities of software engineers, where people who are unaware of certain limitations at scale will announce that the research is unnecessary.
Sure! But there's a sleight of hand in the numbers here, where we're talking about 25 GB binaries with debuginfo and then 2 GB maximum offsets in the .text section. Of those 25 GB binaries, probably 24.5 GB is debuginfo. You have to get into truly huge binaries before >2GB calls become an issue.
(I wonder, but have no particular insight into, whether LTO builds can do smarter things here -- most calls are local, and the handful of far calls could use the more expensive spelling.)
At Google I worked with one statistics aggregation binary[0] that was ~25GB stripped. The distributed build system wouldn't even build the debug version because it exceeded the maximum configured size for any object file. I never asked if anyone had tried factoring it into separate pipelines but my intuition is that the extra processing overhead wouldn't have been worth splitting the business logic that way; once the exact set of necessary input logs are in memory you might as well do everything you need to them given the dramatically larger ratio of data size to code size.
[0] https://research.google/pubs/ubiq-a-scalable-and-fault-toler...
> […] 2GB maximum offsets in the .text section
… on the x86 ISA because it encodes the 32-bit jump/call offset directly in the opcode.
Whilst most RISC architectures do allow PC-relative branches, the offset is relatively small, as 32-bit opcodes do not have enough room to squeeze a large offset in.
«Long» jumps and calls are indirect branches / calls done via registers where the entirety of 64 bits is available (address alignment rules apply in RISC architectures). The target address has to be loaded / calculated beforehand, though. Available in RISC and x86 64-bit architectures.
25 GiB for a single binary sounds horrifying
at some point surely some dynamic linking is warranted
To be fair, this is with debug symbols. Debug builds of Chrome were in the 5GB range several years ago; no doubt that’s increased since then. I can remember my poor laptop literally running out of RAM during the linking phase due to the sheer size of the object files being linked.
Why are debug symbols so big? For C++, they’ll include detailed type information for every instantiation of every type everywhere in your program, including the types of every field (recursively), method signatures, etc. etc., along with the types and locations of local variables in every method (updated on every spill and move), line number data, etc. etc. for every specialization of every function. This produces a lot of data even for “moderate”-sized projects.
Worse: for C++, you don’t win much through dynamic linking because dynamically linking C++ libraries sucks so hard. Templates defined in header files can’t easily be put in shared libraries; ABI variations mean that dynamic libraries generally have to be updated in sync; and duplication across modules is bound to happen (thanks to inlined functions and templates). A single “stuck” or outdated .so might completely break a deployment too, which is a much worse situation than deploying a single binary (either you get a new version or an old one, not a broken service).
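A tiny sketch of the instantiation point above (the dump command in the comment is just one way to see the effect):

    // Each distinct instantiation drags a full type description into the
    // DWARF: the class itself, every member, every method signature, plus
    // the nested pair/allocator/iterator types it pulls in.
    #include <map>
    #include <string>

    std::map<int, std::string> a;      // one full debug-info tree
    std::map<std::string, int> b;      // a second, largely parallel tree

    // g++ -g -c types.cpp && llvm-dwarfdump --statistics types.o
    // shows how much of the object file is debug info for code this trivial.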
I've hit the same thing in Rust, probably for the same reasons.
Isn't the simple solution to use detached debug files?
I think Windows and Linux both support them. That's how platforms like Android and iOS get useful crash reports out of small binaries: they just upload the stack trace, and some service like Sentry translates that back into source line numbers. (It's easy to do manually too.)
I'm surprised the author didn't mention it first. A 25 GB exe might be 1 GB of code and 24 GB of debug crud.
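The classic GNU workflow for detaching them, as a sketch (commands shown as comments, assuming binutils):

    // app.cpp -- trivial stand-in program
    #include <cstdio>
    int main() { std::puts("hello"); }

    // g++ -g -O2 app.cpp -o app
    // objcopy --only-keep-debug app app.debug    # full DWARF, kept server-side
    // objcopy --strip-debug app                  # the small binary you ship
    // objcopy --add-gnu-debuglink=app.debug app  # lets gdb locate app.debug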
> Isn't the simple solution to use detached debug files?
It should be. But the tooling for this kind of thing (anything to do with executable formats including debug info and also things like linking and cross-compilation) is generally pretty bad.
> I think Windows and Linux both support them.
Detached debug files have been the default (only?) option in MS's compiler since at least the 90s.
I'm not sure at what point it became hip to do that around Linux.
Since at least October 2003 on Debian:
[1] "debhelper: support for split debugging symbols"
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=215670
[2] https://salsa.debian.org/debian/debhelper/-/commit/79411de84...
Can't debug symbols be shipped as separate files?
The problem is that when a final binary is linked, everything goes into it. Then, after the link step, all the debug information gets stripped out into the separate symbols file. That means at some point during the build the target binary file will contain everything. I cannot, for example, build clang in debug mode on my work machine, because I have only 32 GB of memory and the OOM killer comes out during the final link phase.
Of course, separate debug files make no difference at runtime, since only the LOAD segments get loaded (by either the kernel or the dynamic loader, depending). The size of a binary on disk has little to do with the size of a binary in memory.
> The problem is that when a final binary is linked everything goes into it
I don't think that's the case on Linux, when using -gsplit-dwarf the debug info is put in separate files at the object file level, they are never linked into binaries.
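A sketch of that flow, assuming GCC or Clang:

    // g++ -g -gsplit-dwarf -c big.cpp   # bulk of the DWARF goes to big.dwo
    // g++ big.o -o app                  # the link never sees the .dwo contents
    // dwp -e app -o app.dwp             # optional: bundle the .dwo files
    //                                   #   (binutils dwp; llvm-dwp is similar)
    // The executable keeps only small skeleton debug sections that point at
    // the .dwo files, so the full debug info never has to pass through the
    // linker or exist inside one giant file.
    int main() { return 42; }            // stand-in for big.cpp's contents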
Yes, but it can be more of a pain keeping track of pairs. In production though, this is what's done. And given a fault, the debug binary can be found in a database and used to gdb the issue given the core. You do have to limit certain online optimizations in order to have useful tracebacks.
This also requires careful tracking of prod builds and their symbol files... A kind of symbol db.
Yes, absolutely. Debuginfo doesn't impact .text section distances either way, though.
I’ve seen LLVM-dependent builds hit well over 30 GB. At that point they started breaking several package managers.
To be fair, they worked at Google, their engineering decisions are not normal. They might just decide that 25 GiB binaries are worth a 0.25% speedup at start time, potentially resulting in tens of millions of dollars' worth of difference. Nobody should do things the way Google does, but it's interesting to think about.
The overall size wouldn't get smaller just because it is dynamically linked; on the contrary (because DLLs are a dead-code-elimination barrier). 25 GB is insane either way; something must have gone horribly wrong very early in the development process. (Also, why even ship with debug information included? That doesn't make sense in the first place.)
Won't make a bit of difference, because everything is in a sort of container (not Docker) anyway. Unless you're suggesting those libraries be distributed as a base image to every possible Borg machine your app can run on, which is an obvious non-starter.
> What other strategies can we pursue?
You can use thunks/trampolines. lld can make them for some architectures, presumably also for x86_64. Though I don't know why it didn't in your case.
But, like the large code model, adding trampolines can be expensive, both in icache performance and in raw execution time if a trampoline sits in a particularly hot path.
follow-up: https://fzakaria.com/2025/12/29/huge-binaries-i-thunk-theref...
> With this information, the necessity of code-models feels unecessary [sic]. Why trigger the cost for every callsite when we can do-so piecemeal as necessary with the opportunity to use profiles to guide us on which methods to migrate to thunks.
Does the linker have access to the same hotness information that the compiler uses during PGO? Well -- presumably it could, even if it doesn't now. But it would be like a heuristic with a hotness threshold? Do linkers "do" heuristics?
In many ways that is what the PLT is also.
This is what my next post will explore. I ran into some issues with the GOT that I'll have to explore solutions for.
I'm writing this for myself mostly. The whole idea of code models, when you have thunks, feels unnecessary.
Note: sections without the SHF_ALLOC flag, such as `.debug_*` sections, do not contribute to the relocation distance pressure. Many 10+ GiB binaries (likely due to not using split DWARF) might have much smaller code+data and be nowhere near the limit.
However, Google, Meta, and ByteDance have encountered x86-64 relocation distance issue with their huge C++ server binaries. To my knowledge industry users in other domains haven't run into this problem.
To address this, Google adopted the medium code model approximately two years ago for its sanitizer and PGO instrumentation builds. CUDA fat binaries also caused problems. I suggest that linker script `INSERT BEFORE/AFTER` for orphan sections (https://reviews.llvm.org/D74375 ) served as a key mitigation.
I hope that a range extension thunk ABI, similar to AArch64/Power, is defined for the x86-64 psABI. It is better than the current long branch pessimization we have with -mcmodel=large.
---
It seems that nobody has run into this .eh_frame_hdr implementation limitation yet:
* `.eh_frame_hdr -> .text`: GNU ld and ld.lld only support 32-bit offsets (`table_enc = DW_EH_PE_datarel | DW_EH_PE_sdata4;`) as of Dec 2025.
25 GB seems excessive, but I keep the basic compile toolchain as statically compiled executables. It simply works better when things go awry.
Shameless plug: if you want to understand the content of this post better, first read the first half of my article on jumps [1] (up to syscall). It goes into detail about relocations and position-independent code.
[1] https://gpfault.net/posts/asm-tut-4.html
I've seen terrible, terrible binary sizes with Eigen + debug symbols, due to how Eigen lazy evaluation works (I think). Every math expression ends up as a new template instantiation.
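A sketch of why, assuming Eigen (the type in the comment is heavily abbreviated; the real spelling is far longer):

    #include <Eigen/Dense>

    // Same-shaped matrices so the expression is shape-safe.
    Eigen::MatrixXd f(const Eigen::MatrixXd& a, const Eigen::MatrixXd& b) {
        // The type of `expr` is roughly
        //   CwiseBinaryOp<sum, CwiseBinaryOp<product, ...>, CwiseUnaryOp<abs, ...>>
        // -- a brand-new instantiation per distinct expression, and debug
        // info describes every one of them in full.
        auto expr = a.cwiseProduct(b) + b.cwiseAbs();
        return expr;   // evaluation into a plain MatrixXd happens here
    }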
Eigen is one of the worst libraries when it comes to both exe size and compile times. <shudder>
In terms of compile times, boost geometry is somehow worse. You're encouraged to import boost/geometry.hpp, which includes every module, which stalls compile times by several seconds just to parse all the templates. It's not terrible if you include just the headers you need, but that's not the "default" that most people use.
boost is on my “do not ever use ever oh my god what are you doing stop it” list. It’s so bad.
Same.
Oh man, that first paragraph. “Such problems don’t exist…” What a gaslighting response to a publication submission. The least they could do is ask where the problem emerges, and you could give a hand-wavy answer without revealing business IP.
Also, we as an industry of software engineers need to re-examine these hard limits we assumed would never be reached, such as the .text limits.
Anyway, very good read.
The HN de-sensationalize algo for submission titles needs tweaking. Original title is simply "Huge Binaries".
Agreed. "Binaries" is a bit too sensational for my taste; this can be further optimized.
"Files So Big They Might As Well Be Trinaries".
"Bins"? :)
01.
Why not?
False