I used to work on a legacy piece of business software, written mostly in C. One recurring issue we noticed was that compiling a "debug" version of the binary so we could run it under GDB often fixed whatever problem the program was having, such that we were no longer able to reproduce the issue. So it wasn't uncommon for me to have this conversation with my manager:
Me: Customer is having their program X hit a SEGFAULT when they run it.
Manager: Did you transfer over a GDB version to backtrace the core dump?
Me: Yep.
Manager: What did it say?
Me: Well, when I run the same series of steps through the GDB version, it doesn't segfault, so I have nothing to backtrace.
Manager: Huh...
Me: What would you like me to do?
Manager: Move the production binary aside, drop in the GDB version. Leave it there.
Heisenbugs[1] can be incredibly frustrating. Computers are supposed to be deterministic. Even when bugs are incredibly complex[2], it's possible to systematically investigate iff the behavior of the system is deterministic.
I once had to add 7 NOP instructions at the beginning of the bootstrap/"BIOS" code I wrote for a Z80 clone. I couldn't understand why test programs seemed to crash[3] about 15% of the time. Later, I discovered the same behavior in code that had previously worked. I spent over two weeks trying to investigate, which only produced more confusion, as the behavior would sometimes go away or get much worse at random with each change I made.
I finally found the bug using a (hardware) logic analyzer to watch[4] what the CPU was doing on the memory bus. Something wasn't finished resetting inside the CPU. Any instructions that ran too early would trash the internal state of the CPU, causing later instructions to have problems like asserting multiple chip-select pins. Multiple ROM/RAM chips would then try to drive the bus at once, and everything dies. Which instruction this happened on depended on which instructions were run while the CPU was still resetting. The NOPs simply delay startup long enough to let the CPU stabilize.
[1] http://www.catb.org/~esr/jargon/html/H/heisenbug.html
[2] http://www.catb.org/~esr/jargon/html/M/mandelbug.html
[3] "crash" == CPU locked hard with no activity on the memory buss until /RESET was grounded by the watchdog timer (or me)
[4] with a 100-pin PQFP clip-on probe that wouldn't stay attached
Bugs disappearing in debug builds is common to all C and C++ codebases. It's the nature of undefined behavior and its interaction with modern optimizers.
I'm in the camp that says all builds should at least be built with symbols, so you can still debug release builds. If you're doing closed-source development, maybe strip them out of the distributed version while keeping a private copy for decoding coredumps from the field.
Here's what I've done: First, you should have a tagged version in the source code control system for everything you ever ship. That should include the tools used by the build process. So you should be able to reproduce the shipped version to the byte.
Second, gdb can give you a stack trace with addresses even if there are no symbols. And if you compiled with gcc, you can pass the flag "-Wl,-M" to the link step, and it will spew a link map to stdout, which you can pipe somewhere. With that file, you can figure out which functions the addresses in the stack trace correspond to.
If you really want to go hardcore, once you figure out which function the problem is in, you can compile that file with "-Wa,-ahlms=<filename>.asm.out" to get an assembly listing for that file. With a bit of hex arithmetic, you can find the assembly instruction that corresponds to the crash. The hard part is correlating that back to the source code, since I don't know of a way in gdb to get that assembly output with the C/C++ source as comments.
> The hard part is correlating that back to the source code, since I don't know of a way in gdb to get that assembly output with the C/C++ source as comments.
objdump -S can do that.
Thanks. I've been looking for something like that off and on for the better part of 10 years.
Just be careful about the operand order differences between objdump and whatever your compiler spits out. They're not necessarily the same syntax.
-fverbose-asm, IIRC. Not always exactly accurate, but lots better than nothing.
You have uninitialized variables.
If you run debug builds, the compiler fetches an uninitialized variable out of memory that the OS nicely zeroed before it gave it to the process. If you run even the simplest optimizer, it will say: "Well, obviously, no one cares what this variable starts with, because they didn't initialize it. So let's just use the garbage value left over in this free-at-the-moment register." Expunge your uninitialized variables, and then show me the disassembly code that the compiler got wrong. (Once upon a time I managed a piece of the validation for a C/C++/FORTRAN compiler suite. I have had this conversation more than once. :)
Exception: Real-time code with timing-specific hardware interactions. But that isn't an optimizer bug, that's a design issue.
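To make the parent's point concrete, here is a minimal, self-contained C sketch (the function and numbers are invented for illustration, not taken from anyone's code above). In an unoptimized build the uninitialized counter often lands in freshly zeroed stack memory and the function appears to work; an optimized build may keep it in whatever register is free, and the leftover garbage changes the result:

    #include <stdio.h>
    #include <stddef.h>

    /* BUG: 'count' is never initialized. Reading it is undefined
     * behavior; what actually happens depends on the build. */
    static int sum_first_three(const int *values, size_t n)
    {
        int total = 0;
        int count;                                      /* uninitialized */

        for (size_t i = 0; i < n && count < 3; i++) {   /* reads garbage */
            total += values[i];
            count++;
        }
        return total;
    }

    int main(void)
    {
        int v[] = { 1, 2, 3, 4, 5 };
        /* A debug build typically prints 6 (count happens to start at 0);
         * an optimized build may print 0, 15, or anything else. */
        printf("%d\n", sum_first_three(v, 5));
        return 0;
    }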
I agree that this is quite likely. But there is more subtle evilness in undefined behavior. E.g. for "foo(bar(), baz())", it's not clear whether bar() or baz() is executed first (or maybe it is now, but wasn't in earlier iterations of the standard).
OTOH, I'm wondering if GP is building with -Wall and has eliminated all warnings? At least that's what we're doing, and this yields good results.
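For the evaluation-order point above, a small self-contained sketch (these foo/bar/baz bodies are invented; only the names come from the comment). The order in which the arguments are evaluated is unspecified, so the observable result can legitimately differ between compilers and optimization levels:

    #include <stdio.h>

    static int counter = 0;

    static int bar(void) { return ++counter; }   /* side effect */
    static int baz(void) { return ++counter; }   /* side effect */

    static void foo(int a, int b)
    {
        printf("a=%d b=%d\n", a, b);
    }

    int main(void)
    {
        /* May print "a=1 b=2" or "a=2 b=1"; both are allowed. */
        foo(bar(), baz());
        return 0;
    }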
There's all sorts of other things that can hard fault a release build and not a debug build.
Yes, yes. Very true, optimizers can break in interesting ways. And yet, 99.9% of the time when some whiner says: “The optimizer broke my software!” it is an uninitialized variable.
The other 0.1% of failures kept my team busy enough. But I never allocated any time to your problem until you proved that you had no uninitialized variables.
The struggle is real! Compiling with debug enabled does change timing enough to hide (or surface!) race conditions.
The best way I've found to deal with this is:
1. Compile with -O2 (or for firmware, -Os) by default, and generate debug symbols in the ELF file. Crank it up to -O3 only on a file-by-file basis with #pragma (see the sketch after this list).
2. Collect the process memory when an issue is hit
3. Debug it asynchronously using the debug symbols + the memory.
That way, the debugging & the running are decoupled and the timing is stable.
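As a sketch of step 1 (assuming GCC; the file and function are hypothetical), a single translation unit can be bumped to -O3 with a file-scope pragma while the rest of the project stays at -Os -g, so the ELF still carries debug symbols for post-mortem work:

    /* Hypothetical hot-path file in a firmware tree built with -Os -g.
     * The GCC-specific pragma raises only this translation unit to -O3;
     * a per-function __attribute__((optimize("O3"))) works too. */
    #pragma GCC optimize ("O3")

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative hot loop; the name and algorithm are made up. */
    uint32_t checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = (sum << 1) ^ buf[i];
        return sum;
    }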
We're building "release with debug info" and then strip the debug info into separate files that are bundled into *-dbg packages like Debian does.
That way the backtraces can be decoded. Although sometimes they are a bit butchered because it's a release build (we use the same config while developing and it's usable).
I ran into this earlier this week. It turned out a macro that was implemented as a wrapper function in debug builds linked differently in the ship build. I don't see this first-hand very often, but I'd be surprised if this kind of thing weren't a fairly common cause of ship/debug differences.
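A generic sketch of how that kind of divergence can happen (this is not the poster's code; TRACE and read_more are invented): in a debug build the macro is a real wrapper function that gets compiled and linked, while in a ship build the whole invocation, including the evaluation of its arguments, disappears:

    #include <stdio.h>
    #include <stdarg.h>

    #ifndef NDEBUG
    /* Debug build: TRACE() is a real function call. */
    static void trace_impl(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
        fputc('\n', stderr);
    }
    #define TRACE(...) trace_impl(__VA_ARGS__)
    #else
    /* Ship build (-DNDEBUG): the call and its arguments compile away. */
    #define TRACE(...) ((void)0)
    #endif

    static int read_more(void) { return 42; }   /* stand-in for real I/O */

    int main(void)
    {
        int n = 0;
        /* Classic trap: the assignment only happens in the debug build. */
        TRACE("read %d bytes", (n = read_more()));
        printf("n = %d\n", n);   /* 42 in debug, 0 in ship */
        return 0;
    }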
What a terrible idea. The problem which manifested as a crash before might now manifest in some number silently changing instead. And it's not like it's impossible to debug using a core dump of a "release"-compiled binary, just harder. Thousands of app developers do it every day when looking at crash reports from the field.
There are certain software bugs that cause undefined behavior. Debugging tools such as GDB can only get you so far in identifying and solving such bugs. That’s where static analysis can help.
In this post, we’ll walk through setting up GDB for the following environment:
A Nordic nRF52840 development kit
Hoping this guide will be a useful reference for diving into the QMK[0] port to nRF52840 by Sekigon[1]. Scratch-built keyboards are my hobby and it would be nice to have something beefier than an atmega32u4 handling both the keymap/layout and bluetooth.
[0] https://docs.qmk.fm/
[1] https://github.com/sekigon-gonnoc/qmk_firmware
Good luck! I'm a big fan of the nRF52840! It's a great fit for a BT+USB keyboard. Hit us up in the blog post comments if you struggle with following Mohammad's instructions.
What I want, and what I think GDB really needs if people are to use its Python APIs more frequently, is a way to easily manage, share, install, and collaborate on other people's .gdbinit and GDB Python scripts.
I've written a number of these at my previous companies, and they all got loaded by default whenever a developer was debugging (basically by wrapping gdb in a 'make gdb' style call). That works for internal development at a company where everyone is using the same flow, but it's nearly impossible in the real world, as everyone has a different setup.
I'm assuming numerous companies have written a Linked List GDB Printer (such as https://github.com/chrisc11/debug-tips/blob/master/gdb/pytho...) or a script that prints useful information from the global variables particular to the RTOS running. These are all great, but they are a pain to install.
Is there really no better way to share these across the Internet other than "Copy / Paste this text into your .gdbinit"? I'm thinking it would be possible and relatively painless to share these scripts through PyPI, require them, and load them like a normal package, but I haven't seen this approach taken.
With modern JTAG probes and available software (like GDB or OpenOCD), it's unthinkable to go back to debugging with software emulators, in-circuit emulators, or bare LEDs and a serial port (sometimes you don't even have one).
Being able to debug firmware is very important to me, to the point that whether a chip (or family) is supported by OpenOCD becomes a determining factor when choosing a component for a new project. Especially if the project is (or could be) open sourced.
I don't like proprietary debuggers. If a chip requires or forces me or my company to purchase a Segger XYZ probe or a custom IDE, it isn't discarded right away, but it goes back to the bottom of the pile.
When there is no alternative, I find myself writing OpenOCD flash drivers even before starting any FW development.
who needs GDB when you have the good old fashioned print statement... O.o
At my first job, we didn't have GDB or a console, but you could set a test point high. In the worst case, you attached an LED. In the best case, you maybe had a scope handy to log the data.
In the games industry, I worked with an older programmer who developed games on the Sega Genesis back in the 90s. He said changing the background colour[1] was often the "simplest" way of debugging state or branches.
[1] https://segaretro.org/Sega_Mega_Drive/Palettes_and_CRAM#Back...
In my youth, we Amiga demoscene coders used this method too. You could basically measure the CPU consumption of a subroutine by setting a particular color on entry, then resetting it on exit. "Raster lines" basically became a measure of time.
Pretty sure I optimized by putting a piece of tape at the exit line on the TV, then trying to make the color bar more narrow. :)
This can even be used for profiling on some systems because the background color can be changed in the middle of a frame with a single register write. So you can basically turn the left/right borders into a color-coded CPU utilization gauge.
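The idea looks roughly like this in C (everything here is hypothetical: the register address, the colour values, and the function names; real hardware usually needs a platform-specific port/register sequence):

    #include <stdint.h>

    /* Hypothetical memory-mapped background-colour register. */
    #define BG_COLOR (*(volatile uint16_t *)0x00F00000u)

    #define COLOR_BLACK 0x0000u   /* idle / free CPU time      */
    #define COLOR_RED   0x000Eu   /* time spent in music code  */
    #define COLOR_GREEN 0x00E0u   /* time spent in game logic  */

    void frame_update(void)
    {
        BG_COLOR = COLOR_RED;     /* border shows red while this runs */
        /* update_music(); */

        BG_COLOR = COLOR_GREEN;   /* ...then green for this part... */
        /* update_game_logic(); */

        BG_COLOR = COLOR_BLACK;   /* black = headroom until vblank */
    }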
Heh. I once had a problem that I could detect at a certain point in software, but couldn't figure out how I got there. I used a test point to trigger a logic analyzer that was attached to the address bus. Scrolling back, I could see enough of the previous instruction fetches to figure it out. (This was before instruction caches on the 68000 line of processors.)
GDB lets you add as many print statements as you'd like to a binary you can't change, for one.
Closed source indeed, it's not easy to manually insert some prints into binaries :D Perhaps a linked-in library with hooks could do it... but I'd say GDB is the way to go indeed!
if GDB isn't the "good old fashioned", then I need to update my frame of reference ;)