For what it's worth, the whole answer of claude.ai:

The debug sections compressed with --compress-debug-sections=zlib are decompressed:

1. At runtime by the debugger (like GDB) when it needs to access the debug information:
   - When setting breakpoints
   - When doing backtraces
   - When inspecting variables
   - During symbol resolution
2. When tools need to read debug info:
   - During coredump analysis
   - When using tools like addr2line
   - During source-level debugging
   - When using readelf with the -w option

The compression is transparent to these tools - they automatically handle the decompression when needed. The sections remain compressed on disk, and are only decompressed in memory when required.
This helps reduce the binary size on disk while still maintaining full debugging capabilities, with only a small runtime performance cost when the debug info needs to be accessed.
The decompression is handled by the libelf/DWARF libraries that these tools use to parse the ELF files.
The analysis looks rather half-finished. They did not analyze why so much memory was consumed: whether it is a cache that persists after the first call, temporary working memory, or an accumulating memory leak. Nor did they explain why it uses so much memory at all.
I couldn't find any other complaints about Rust backtrace printing consuming a lot of memory, which I would have expected if this was normal behaviour. So I wonder if there is anything special about their environment or use case?
I would assume that the same OOM problem would arise when printing a panic backtrace. Either their instance has enough memory to print backtraces, or it doesn't. So I don't understand why they only disable lib backtraces.
Hello,
You can see my other comment https://news.ycombinator.com/item?id=42708904#42756072 for more details.
But yes, the cache does persist after the first call, the resolved symbols stay in the cache to speed up the resolution of next calls.
Regarding the why, it is mainly because
1. this app is a gRPC server and contains a lot of generated code (you can investigate binary bloat in Rust with https://github.com/RazrFalcon/cargo-bloat)
2. and that we ship our binary with debug symbols, with those options
```
ENV RUSTFLAGS="-C link-arg=-Wl,--compress-debug-sections=zlib -C force-frame-pointers=yes"
```
For the panic, indeed, I had the same question on Reddit. For this particular service, we don't expect panics at all; it is just that by default we ship all our Rust binaries with backtrace enabled. And we have added an extra API endpoint to trigger a caught panic on purpose for other apps, to be sure our sizing is correct.
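To make the "lib backtraces vs. panic backtraces" distinction concrete, here is a small sketch that models the environment-variable rules documented for `std::backtrace::Backtrace::capture()` and for the default panic hook. The two function names are hypothetical helpers made up for illustration; they only mirror the documented rules and do not call into std, and I am assuming anyhow follows the same rules because it relies on the standard backtrace type on recent toolchains.

```rust
/// Models the documented rules for `std::backtrace::Backtrace::capture()`:
/// RUST_LIB_BACKTRACE wins if set, otherwise RUST_BACKTRACE is consulted,
/// and "0" means disabled. With neither variable set, capture is disabled.
/// (Hypothetical helper, not a std API.)
fn lib_backtrace_enabled(rust_lib_backtrace: Option<&str>, rust_backtrace: Option<&str>) -> bool {
    match (rust_lib_backtrace, rust_backtrace) {
        (Some(lib), _) => lib != "0",
        (None, Some(bt)) => bt != "0",
        (None, None) => false,
    }
}

/// Models the default panic hook, which only looks at RUST_BACKTRACE.
/// (Hypothetical helper, not a std API.)
fn panic_backtrace_enabled(rust_backtrace: Option<&str>) -> bool {
    matches!(rust_backtrace, Some(v) if v != "0")
}

fn main() {
    // Example setting: RUST_BACKTRACE=1 and RUST_LIB_BACKTRACE=0,
    // i.e. keep panic backtraces but skip them for error values.
    let (lib, bt) = (Some("0"), Some("1"));
    println!("lib backtraces enabled:   {}", lib_backtrace_enabled(lib, bt));
    println!("panic backtraces enabled: {}", panic_backtrace_enabled(bt));
}
```

With that combination a panic would still pay the symbolization cost, which is why sizing for the panic case matters even when library backtraces are off.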
In the article they talk about how printing an error from the anyhow crate in debug format creates a full backtrace, which leads to an OOM error. This happens even with 4 GB of memory.
Why does creating a backtrace need such a large amount of memory? Is there a memory leak involved as well?
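A minimal sketch of where the cost shows up, using only the standard library rather than anyhow. As far as I understand, capturing records the raw frames, and the expensive part (reading debug info to turn addresses into names) happens when the backtrace is formatted. `capture_and_render` is a hypothetical helper name for illustration.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};

/// Captures a backtrace regardless of the RUST_BACKTRACE settings and
/// renders it to a string. Formatting is the step that needs symbol names.
fn capture_and_render() -> (BacktraceStatus, String) {
    // `force_capture` ignores the environment variables.
    let bt = Backtrace::force_capture();
    let status = bt.status();
    // Rendering the backtrace is what triggers symbolization.
    let rendered = bt.to_string();
    (status, rendered)
}

fn main() {
    let (status, rendered) = capture_and_render();
    println!("status: {status:?}, rendered {} bytes", rendered.len());
}
```

Printing an error type that embeds such a backtrace with `{:?}` would go through the same formatting path, assuming the error stores a standard backtrace.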
I don't think the 4 GiB instance actually ran into an OOM error. They merely observed a 400 MiB memory spike. The crashing instances were limited to 256 and later 512 MiB.
(Assuming that the article incorrectly used Mib when they meant MiB. Used correctly b=bit, B=byte)
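To put numbers on the bit/byte point, a quick arithmetic sketch. The limits used (256 MiB, 512 MiB, 4 GiB, 400 MiB) are the figures mentioned in this thread; `mebibits_to_bytes` is a hypothetical helper.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;

/// Converts mebibits (Mib) to bytes: one byte is eight bits.
fn mebibits_to_bytes(mebibits: u64) -> u64 {
    mebibits * MIB / 8
}

fn main() {
    println!("256 MiB = {} bytes", 256 * MIB);
    println!("512 MiB = {} bytes", 512 * MIB);
    println!("4 GiB   = {} bytes", 4 * GIB);
    // Read literally, "400 Mib" would only be 50 MiB.
    println!("400 Mib = {} MiB", mebibits_to_bytes(400) / MIB);
}
```

So taken literally, a 400 Mib spike would be eight times smaller than the 400 MiB spike that was presumably meant.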
Well, based on the article, if there was a memory leak then they should see the steady increase in memory consumption, which was not the case.
The only explanation I can see (if their conclusion is accurate) is that the end result of the symbolization is more than 400MB additional memory consumption (which is a lot in my opinion), however the process of the symbolization requires more than 2GB additional memory (which is incredibly a lot).
The author replied with additional explanations, so it seems that the additional 400MB were needed because the debug symbols were compressed.
Sorry if the article is misleading.
The first increase of the memory limit was not 4G, but something roughly around 300MiB/400MiB, and the OOM did happen again with this setting.
This led to a 2nd increase, to 4Gi, to be sure the app would not get OOM killed when the behavior gets triggered. We needed the app to be alive/running for us to investigate the memory profiling.
Regarding the increase of 400MiB, yeah it is a lot, and it was a surprise to us too. We were not expecting such an increase. There are, I think, 2 reasons behind this.
1. This service is a gRPC server, which has a lot of generated code, so lots of symbols
2. We compile the binary with debug symbols and a flag to compress the debug symbol sections to avoid having a huge binary, which may be part of this issue.
> 2. we compile the binary with debug symbols
Symbols are usually included even with debug level 0, unless stripped[0]. And debuginfo is configurable at several levels[1]. If you've set it to 2/full, try dropping to a lower level; that might also result in less data to load for the backtrace implementation.
[0] https://users.rust-lang.org/t/difference-between-strip-symbo...
[1] https://doc.rust-lang.org/cargo/reference/profiles.html#debu...
Thanks, was not aware there was granularity for debuginfo ;)
> we compile the binary with debug symbols and a flag to compress the debug symbols sections to avoid having huge binary.
How big are the uncompressed debug symbols? I'd expected processing uncompressed debug symbols to happen via a memory mapped file, while compressed debug symbols probably need to be extracted to anonymous memory.
https://github.com/llvm/llvm-project/issues/63290
Normal build
what we ship
The diff is more impressive on some bigger projects.
The compressed symbols sound like the likely culprit. Do you really need a small executable? The uncompressed symbols need to be loaded into RAM anyway, and if it is delayed until it is needed then you will have to allocate memory to uncompress them.
I will give it a shot next week to try out ;P
For this particular service, the size does not really matter. For others, it makes more of a difference (several hundred MB), and as we deploy on customers' infra, we want image sizes to stay reasonable. For now, we apply the same build rules for all our services to stay consistent.
Maybe I'm not communicating well. Or maybe I don't understand how the debug symbol compression works at runtime. But my point is that I don't think you are getting the tradeoff you think you are getting. The smaller executable may end up using more RAM. Usually at the deployment stage, that's what matters.
Smaller executables are more for things like reducing distribution sizes, or reducing process launch latency when disk throughput is the issue. When you invoke compression, you are explicitly trading off runtime performance in order to get the benefit of smaller on-disk or network transmission size. For a hosted service, that's usually not a good tradeoff.
It is most likely me reading too quickly. I was caught off guard by the article gaining traction on a Sunday, and as I have other duties during the weekend, I am reading/responding only when I can sneak in.
For your comment, I think you are right that compressing the debug symbols adds to the peak memory, but I think you are mistaken in assuming the debug symbols are decompressed when the app/binary is started/loaded. As far as I can tell, decompression only happens when this section is accessed by a debugger or equivalent. It is not the same thing as when the binary is fully compressed, like with upx for example.
I have done a quick sanity check on my desktop. I got:
From RSS memory at startup I get ~128 MB, and after the panic at peak I get ~627 MB.
When compiled with those flags
From RSS memory at startup I get ~128 MB, and after the panic at peak I get ~474 MB.
So the peak is indeed taller when the debug section is compressed, but the binary in memory when started is roughly equivalent. (virtual mem too)
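For anyone wanting to repeat this kind of check from inside the process, here is a rough Linux-only sketch that reads the resident set size from `/proc/self/status` before and after rendering a backtrace. The helper names are hypothetical, and this is not necessarily how the numbers above were collected. Note that `VmRSS` is the current value; the `VmHWM` field of the same file reports the peak.

```rust
use std::backtrace::Backtrace;
use std::fs;

/// Extracts the `VmRSS` value (in kB) from the text of /proc/<pid>/status.
fn parse_vm_rss_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|n| n.parse().ok())
}

/// Reads the current RSS of this process; returns None off Linux.
fn current_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kib(&status)
}

fn main() {
    let before = current_rss_kib();
    // Rendering forces symbolization, which is the step under suspicion.
    let rendered = Backtrace::force_capture().to_string();
    let after = current_rss_kib();
    println!(
        "rendered {} bytes; RSS before: {before:?} kB, after: {after:?} kB",
        rendered.len()
    );
}
```

On a tiny test binary the difference will be small; the interesting comparison is the same binary built with and without compressed debug sections.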
I had a hard time finding a source that would validate my belief regarding when the debug symbols are decompressed. But based on https://inbox.sourceware.org/binutils/20080622061003.D279F3F... and the help of claude.ai, I would say it is only when those sections are accessed.
Thanks for running these checks. I’m learning from this too!
Container images use compression too, so having the debug section uncompressed shouldn't actually make the images any bigger.
> Sorry if the article is misleading.
I don't think the article is misleading, but I do think it's a shame that all the interesting info is saved for this hackernews comment. I think it would make for a more exciting article if you included more of the analysis along with the facts. Remember, as readers we don't know anything about your constraints/system.
It was a deliberate choice on my part: I wanted the article to stay focused on the how, not so much on the why. But I agree, even though the context is specific to us, many people wanted to know more about the surroundings and why it happened. I wanted to explain the method ¯\_(ツ)_/¯
Thank you for this in-depth reply! Your answer makes a lot of sense. Also thank you for writing the article!
Can't they print something that llvm-symbolizer would pick up offline?
Yes, that is typically the way to go.
Collecting a call stack only requires unwinding information (which is usually already present for C++ exceptions / Rust panics), not full debug symbols. This gives you a list of instruction pointers. (on Linux, the glibc `backtrace` function can help with this)
Print those instruction pointers in a relative form (e.g. "my_binary+0x1234") so that the output is independent of ASLR.
The above is all that needs to happen on the production/customer machines, so you don't need to ship debug symbols -- you can ship `strip`ped binaries.
On your own infrastructure, keep the original un-stripped binaries around. We use a script involving elfutils' eu-addr2line with those original binaries to turn the module+relative_address stack trace into a readable symbolized stack trace.
I wasn't aware of llvm-symbolizer yet, seems like that can do the same job as eu-addr2line.
(There's also binutils' addr2line but in my experience that didn't work as well as eu-addr2line)
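The "relative form" step above can be sketched like this. `format_relative` is a hypothetical helper, and the module name, load base, and instruction pointers are made-up values for illustration; finding the real load base of a module at runtime is platform-specific and not shown here.

```rust
/// Formats an instruction pointer as "module+0xOFFSET", where the offset is
/// relative to the address the module was loaded at. This makes the output
/// independent of ASLR. Returns None if the pointer lies below the base.
fn format_relative(module: &str, load_base: usize, ip: usize) -> Option<String> {
    let offset = ip.checked_sub(load_base)?;
    Some(format!("{module}+{offset:#x}"))
}

fn main() {
    // Made-up load base and instruction pointers for illustration.
    let load_base = 0x40_0000;
    let frames = [0x40_1234, 0x40_5678, 0x41_0abc];
    for ip in frames {
        match format_relative("my_binary", load_base, ip) {
            Some(line) => println!("{line}"),
            None => println!("<ip {ip:#x} outside my_binary>"),
        }
    }
}
```

The printed offsets can then be fed, together with the matching un-stripped binary, to an offline symbolizer such as eu-addr2line or llvm-symbolizer.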
I once had a faulty Python-based AI image generator running on my machine that used all 64 gigs of RAM and OOMed with a memory dump written to the filesystem. This is no fun when that happens. But mostly these kinds of bugs are misconfigurations or bad code, never-ending while loops, whatever.
Ahh, I did this in Python before I learned about Cursor and Sourcegraph's Cody. I'd use a template where I provide a tree of my project structure, and then put code file contents in my template file, and then have a full repo in one giant markdown file. This only worked for smaller projects, but it worked damn well to provide the full context to an LLM to then ask questions about my code :)
Reminds me a bit of a post my colleague wrote a while back: https://www.svix.com/blog/heap-fragmentation-in-rust-applica...
HN discussion: https://news.ycombinator.com/item?id=35469632
Yep, I just recently discovered that pretty much any long-living app in Rust should switch to jemalloc for this reason.
It's kind of shocking that debug-formatting a backtrace allocates enough memory to OOM the process, though. What's going on there?
Great one! I would also recommend this hands-on guide for diagnosing memory leaks in Rust applications. It explains how to enable heap profiling in jemalloc, collect memory allocation data, and generate flame graphs for analysis. https://greptime.com/blogs/2024-01-18-memory-leak#diagnosing...