I used to work on a legacy piece of business software, written mostly in C. One recurring issue we noticed was that compiling a "debug" version of the binary so we could run it under GDB often fixed whatever problem the program was having, such that we were no longer able to reproduce the issue. So it wasn't uncommon for me to have this conversation with my manager:
Me: Customer is having their program X hit a SEGFAULT when they run it.
Manager: Did you transfer over a GDB version to backtrace the core dump?
Me: Yep.
Manager: What did it say?
Me: Well, when I run the same series of steps through the GDB version, it doesn't segfault, so I have nothing to backtrace.
Manager: Huh...
Me: What would you like me to do?
Manager: Move the production binary aside, drop in the GDB version. Leave it there.
Heisenbugs[1] can be incredibly frustrating. Computers are supposed to be deterministic. Even when bugs are incredibly complex[2], it's possible to systematically investigate iff the behavior of the system is deterministic.
I once had to add 7 NOP instructions at the beginning of the bootstrap/"BIOS" code I wrote for a Z80 clone. I couldn't understand why test programs seemed to crash[3] about 15% of the time. Later, I discovered the same behavior in code that had previously worked. I spent over two weeks trying to investigate, which only produced more confusion, as the behavior would sometimes go away or get much worse at random with each change I made.
I finally found the bug using a (hardware) logic analyzer to watch[4] what the CPU was doing on the memory bus. Something wasn't finished resetting inside the CPU. Any instructions that ran too early would trash the internal state of the CPU, causing later instructions to have problems like asserting multiple chip-select pins. Multiple ROM/RAM chips would then try to drive the bus at once, and everything dies. Which instruction this happened on depended on which instructions were run while the CPU was still resetting. The NOPs simply delay startup long enough to let the CPU stabilize.
[1] http://www.catb.org/~esr/jargon/html/H/heisenbug.html
[2] http://www.catb.org/~esr/jargon/html/M/mandelbug.html
[3] "crash" == CPU locked hard with no activity on the memory buss until /RESET was grounded by the watchdog timer (or me)
[4] with a 100-pin PQFP clip-on probe that wouldn't stay attached
Bugs disappearing in debug builds is common to all C and C++ codebases. It's the nature of undefined behavior and its interaction with modern optimizers.
I'm in the camp that says all builds should at least be built with symbols, so you can still debug release builds. If you're doing closed-source development, maybe strip them out of the distributed version while keeping a private copy for decoding coredumps from the field.
Here's what I've done: First, you should have a tagged version in the source code control system for everything you ever ship. That should include the tools used by the build process. So you should be able to reproduce the shipped version to the byte.
Second, gdb can give you a stack trace with addresses even if there are no symbols. And if you compiled with gcc, you can pass the flag "-Wl,-M" to the link step, and it will spew a link map to stdout, which you can pipe somewhere. With that file, you can figure out which functions the addresses in the stack trace correspond to.
If you really want to go hardcore, once you figure out which function the problem is in, you can compile that file with "-Wa,-ahlms=<filename>.asm.out" to get an assembly listing for that file. With a bit of hex arithmetic, you can find the assembly instruction that corresponds to the crash. The hard part is correlating that back to the source code, since I don't know of a way in gdb to get that assembly output with the C/C++ source as comments.
> The hard part is correlating that back to the source code, since I don't know of a way in gdb to get that assembly output with the C/C++ source as comments.
objdump -S can do that.
Thanks. I've been looking for something like that off and on for the better part of 10 years.
Just be careful about the operand order differences between objdump and whatever your compiler spits out. They're not necessarily the same syntax.
-fverbose-asm, IIRC. Not always exactly accurate, but lots better than nothing.
You have uninitialized variables.
If you run debug builds, the compiler fetches an uninitialized variable out of memory that the OS nicely zeroed before it gave it to the process. If you run even the simplest optimizer, it will say: "Well, obviously, no one cares what this variable starts with, because they didn't initialize it. So let's just use the garbage value left over in this free-at-the-moment register." Expunge your uninitialized variables, and then show me the disassembly code that the compiler got wrong. (Once upon a time I managed a piece of the validation for a C/C++/FORTRAN compiler suite. I have had this conversation more than once. :)
Exception: Real-time code with timing-specific hardware interactions. But that isn't an optimizer bug, that's a design issue.
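To make the parent's point concrete, here is a minimal, self-contained C sketch (the function and numbers are invented for illustration, not taken from anyone's code above). In an unoptimized build the uninitialized counter often lands in freshly zeroed stack memory and the function appears to work; an optimized build may keep it in whatever register is free, and the leftover garbage changes the result:

    #include <stdio.h>
    #include <stddef.h>

    /* BUG: 'count' is never initialized. Reading it is undefined
     * behavior; what actually happens depends on the build. */
    static int sum_first_three(const int *values, size_t n)
    {
        int total = 0;
        int count;                                      /* uninitialized */

        for (size_t i = 0; i < n && count < 3; i++) {   /* reads garbage */
            total += values[i];
            count++;
        }
        return total;
    }

    int main(void)
    {
        int v[] = { 1, 2, 3, 4, 5 };
        /* A debug build typically prints 6 (count happens to start at 0);
         * an optimized build may print 0, 15, or anything else. */
        printf("%d\n", sum_first_three(v, 5));
        return 0;
    }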
I agree that this is quite likely. But there is more subtle evilness in undefined behavior. E.g. for "foo(bar(), baz())", it's not clear whether bar() or baz() is executed first (or maybe it is now, but wasn't in earlier iterations of the standard).
OTOH, I'm wondering if GP is building with -Wall and has eliminated all warnings? At least that's what we're doing, and this yields good results.
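For the evaluation-order point above, a small self-contained sketch (these foo/bar/baz bodies are invented; only the names come from the comment). The order in which the arguments are evaluated is unspecified, so the observable result can legitimately differ between compilers and optimization levels:

    #include <stdio.h>

    static int counter = 0;

    static int bar(void) { return ++counter; }   /* side effect */
    static int baz(void) { return ++counter; }   /* side effect */

    static void foo(int a, int b)
    {
        printf("a=%d b=%d\n", a, b);
    }

    int main(void)
    {
        /* May print "a=1 b=2" or "a=2 b=1"; both are allowed. */
        foo(bar(), baz());
        return 0;
    }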
There's all sorts of other things that can hard fault a release build and not a debug build.
Yes, yes. Very true, optimizers can break in interesting ways. And yet, 99.9% of the time when some whiner says: “The optimizer broke my software!” it is an uninitialized variable.
The other 0.1% of failures kept my team busy enough. But I never allocated any time to your problem until you proved that you had no uninitialized variables.
The struggle is real! Compiling with debug enabled does change timing enough to hide (or surface!) race conditions.
The best way I've found to deal with this is:
1. Compile with -O2 (or for firmware, -Os) by default, and generate debug symbols in the ELF file. Crank it up to -O3 only on a file-by-file basis with #pragma (see the sketch after this list).
2. Collect the process memory when an issue is hit
3. Debug it asynchronously using the debug symbols + the memory.
That way, the debugging & the running are decoupled and the timing is stable.
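As a sketch of step 1 (assuming GCC; the file and function are hypothetical), a single translation unit can be bumped to -O3 with a file-scope pragma while the rest of the project stays at -Os -g, so the ELF still carries debug symbols for post-mortem work:

    /* Hypothetical hot-path file in a firmware tree built with -Os -g.
     * The GCC-specific pragma raises only this translation unit to -O3;
     * a per-function __attribute__((optimize("O3"))) works too. */
    #pragma GCC optimize ("O3")

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative hot loop; the name and algorithm are made up. */
    uint32_t checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = (sum << 1) ^ buf[i];
        return sum;
    }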
We're building "release with debug info" and then strip the debug info into separate files that are bundled into *-dbg packages like Debian does.
That way the backtraces can be decoded. Although sometimes they are a bit butchered because it's a release build (we use the same config while developing and it's usable).
I ran into this earlier this week. It turned out a macro that was implemented as a wrapper function in debug builds linked differently in the ship build. I don't see this first-hand very often, but I'd be surprised if this kind of thing weren't a fairly common cause of ship/debug differences.
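A generic sketch of how that kind of divergence can happen (this is not the poster's code; TRACE and read_more are invented): in a debug build the macro is a real wrapper function that gets compiled and linked, while in a ship build the whole invocation, including the evaluation of its arguments, disappears:

    #include <stdio.h>
    #include <stdarg.h>

    #ifndef NDEBUG
    /* Debug build: TRACE() is a real function call. */
    static void trace_impl(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
        fputc('\n', stderr);
    }
    #define TRACE(...) trace_impl(__VA_ARGS__)
    #else
    /* Ship build (-DNDEBUG): the call and its arguments compile away. */
    #define TRACE(...) ((void)0)
    #endif

    static int read_more(void) { return 42; }   /* stand-in for real I/O */

    int main(void)
    {
        int n = 0;
        /* Classic trap: the assignment only happens in the debug build. */
        TRACE("read %d bytes", (n = read_more()));
        printf("n = %d\n", n);   /* 42 in debug, 0 in ship */
        return 0;
    }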
What a terrible idea. The problem which manifested as a crash before might now manifest in some number silently changing instead. And it's not like it's impossible to debug using a core dump of a "release"-compiled binary, just harder. Thousands of app developers do it every day when looking at crash reports from the field.
There are certain software bugs that cause undefined behavior. Debugging tools such as GDB can only get you so far in identifying and solving such bugs. That’s where static analysis can help.
In this post, we’ll walk through setting up GDB for the following environment:
A Nordic nRF52840 development kit
Hoping this guide will be a useful reference for diving into the QMK[0] port to nRF52840 by Sekigon[1]. Scratch-built keyboards are my hobby and it would be nice to have something beefier than an atmega32u4 handling both the keymap/layout and bluetooth.
[0] https://docs.qmk.fm/
[1] https://github.com/sekigon-gonnoc/qmk_firmware
Good luck! I'm a big fan of the nRF52840! It's a great fit for a BT+USB keyboard. Hit us up in the blog post comments if you struggle with following Mohammad's instructions.
What I want, and what I think GDB really needs if people are to use its Python APIs more frequently, is a way to easily manage, share, install, and collaborate on other people's .gdbinit and GDB Python scripts.
I've written a number of these at my previous companies, and they all got loaded by default whenever a developer was debugging (basically by wrapping gdb in a 'make gdb' style call). That works for internal development at a company where everyone is using the same flow, but it's nearly impossible in the real world, as everyone has a different setup.
I'm assuming numerous companies have written a Linked List GDB Printer (such as https://github.com/chrisc11/debug-tips/blob/master/gdb/pytho...) or a script that prints useful information from the global variables particular to the RTOS running. These are all great, but they are a pain to install.
Is there really no better way to share these across the Internet other than "Copy / Paste this text into your .gdbinit"? I'm thinking it would be possible and relatively painless to share these scripts through PyPI, require them, and load them like a normal package, but I haven't seen this approach taken.
With modern JTAG probes and available software (like GDB or OpenOCD), it's unthinkable to go back to debugging with software emulators, in-circuit emulators, or bare LEDs and a serial port (sometimes you don't even have one).
Being able to debug firmware is very important to me, to the point that whether a chip (or family) is supported by OpenOCD becomes a determining factor when choosing a component for a new project. Especially if the project is (or could be) open sourced.
I don't like proprietary debuggers. If a chip requires or forces me or my company to purchase a Segger XYZ probe or a custom IDE, it isn't discarded right away, but it goes back to the bottom of the pile.
When there is no alternative, I find myself writing OpenOCD flash drivers even before starting any FW development.
who needs GDB when you have the good old fashioned print statement... O.o
At my first job, we didn't have GDB or a console, but you could set a test point high. In the worst case, you attached an LED. In the best case, you maybe had a scope handy to log the data.
In the games industry, I worked with an older programmer who developed games on the Sega Genesis back in the 90s. He said changing the background colour[1] was often the "simplest" way of debugging state or branches.
[1] https://segaretro.org/Sega_Mega_Drive/Palettes_and_CRAM#Back...
In my youth, we Amiga demoscene coders used this method too. You could basically measure the CPU consumption of a subroutine by setting a particular color on entry, then resetting it on exit. "Raster lines" basically became a measure of time.
Pretty sure I optimized by putting a piece of tape at the exit line on the TV, then trying to make the color bar more narrow. :)
This can even be used for profiling on some systems because the background color can be changed in the middle of a frame with a single register write. So you can basically turn the left/right borders into a color-coded CPU utilization gauge.
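The idea looks roughly like this in C (everything here is hypothetical: the register address, the colour values, and the function names; real hardware usually needs a platform-specific port/register sequence):

    #include <stdint.h>

    /* Hypothetical memory-mapped background-colour register. */
    #define BG_COLOR (*(volatile uint16_t *)0x00F00000u)

    #define COLOR_BLACK 0x0000u   /* idle / free CPU time      */
    #define COLOR_RED   0x000Eu   /* time spent in music code  */
    #define COLOR_GREEN 0x00E0u   /* time spent in game logic  */

    void frame_update(void)
    {
        BG_COLOR = COLOR_RED;     /* border shows red while this runs */
        /* update_music(); */

        BG_COLOR = COLOR_GREEN;   /* ...then green for this part... */
        /* update_game_logic(); */

        BG_COLOR = COLOR_BLACK;   /* black = headroom until vblank */
    }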
Heh. I once had a problem that I could detect at a certain point in software, but couldn't figure out how I got there. I used a test point to trigger a logic analyzer that was attached to the address bus. Scrolling back, I could see enough of the previous instruction fetches to figure it out. (This was before instruction caches on the 68000 line of processors.)
GDB lets you add as many print statements as you'd like to a binary you can't change, for one.
Closed source indeed, it's not easy to manually insert some prints into binaries :D Perhaps a linked-in library with hooks could do it... but I'd say GDB is the way to go indeed!
if GDB isn't the "good old fashioned", then I need to update my frame of reference ;)