I think I found a Mac kernel bug (2018)

164 points by goranmoomin 3 years ago

alin23 3 years ago

Coincidentally, I also stumbled upon a way to make the kernel of Apple Silicon Macs panic and restart while developing the https://lowtechguys.com website.

I distilled the problem into a repo so it can be reproduced with a single command: https://github.com/alin23/m1-panic

I found it while on Monterey and reported it 2 times through Feedback Assistant, but it still happens on Ventura.

NOTE: Don't try it without saving all your work; it has a very high chance of forcibly restarting your computer.

saagarjha 3 years ago

The bug is in one of these two lines:
https://github.com/apple-oss-distributions/xnu/blob/5c2921b0...
https://github.com/apple-oss-distributions/xnu/blob/5c2921b0...
This uses the macro SLIST_REMOVE (https://github.com/apple-oss-distributions/xnu/blob/5c2921b0...) to remove an item from the linked list. If you look at the code to do this it's a pretty simple linked list traversal: https://github.com/apple-oss-distributions/xnu/blob/5c2921b0.... However, it doesn't have a check for the end of the list, so the item must be in the list, or it's going to walk right off the end and dereference a null pointer. In this case that's exactly what it ended up doing.
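To make the failure mode concrete, here is a minimal Python sketch of a linked-list removal with no end-of-list check. It is not the XNU macro: `Node` and `unchecked_remove` are hypothetical names, and Python's `AttributeError` on `None` stands in for the null-pointer dereference described above.

```python
class Node:
    """One element of a singly linked list (hypothetical stand-in for a list entry)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def unchecked_remove(head, target):
    """Unlink `target` and return the new head, without an end-of-list check.

    If `target` is not in the list, the walk runs off the end: `cur` becomes
    None and `cur.next` raises AttributeError, the Python analogue of
    dereferencing a null pointer.
    """
    if head is target:
        return head.next
    cur = head
    while cur.next is not target:  # never tests whether cur.next is None
        cur = cur.next
    cur.next = target.next
    return head


a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next = b, c
head = unchecked_remove(a, b)  # fine: b is in the list

try:
    unchecked_remove(head, Node("stray"))  # never inserted into the list
except AttributeError:
    print("walked off the end of the list")
```

A defensive version would stop when `cur.next` is `None` and report that the element was missing. The comment's point is that the traversal relies on the caller guaranteeing membership.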
- alin23 3 years ago
  
  That's.. wow, ok. How exactly did you end up in that source code at that specific line?
  I know you're well versed in reverse engineering Darwin, and I'm reading your posts and trying to improve my skills in this daily, but this seems way over my skillset.
  Did you debug this using KDK or m1n1? Do you have a setup always ready for debugging a Darwin kernel?
  
  saagarjha 3 years ago
  
  In theory I have all of those, but currently I have none, so it's manual work. Your best friend in diagnosing a kernel crash is a KDK. If you have one that matches your build, it will have symbols in it. With a little bit of math you can take the backtrace in the crash log and slide it appropriately to match the binary. Personally I use LLDB for this. Here's an example of what this looks like on an x86-64 kernel (Apple silicon has its own math but it's largely the same): https://github.com/saagarjha/unxip/issues/14#issuecomment-10.... The kernel is typically compiled with optimization, so there's a lot of inlining and code folding, but with function names, source files, and instruction offsets it's pretty trivial to match it to the code Apple publishes.
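The "little bit of math" can be sketched roughly as follows. This is only the subtraction idea, assuming the crash log reports a kernel slide alongside the backtrace. The slide and addresses below are made-up illustrative values, `unslide` is a hypothetical helper, and the Apple silicon arithmetic differs in its details.

```python
def unslide(runtime_addresses, kernel_slide):
    """Map runtime addresses from a panic backtrace back to static (unslid) addresses."""
    return [addr - kernel_slide for addr in runtime_addresses]


# Hypothetical values for illustration only; not taken from any real crash log.
KERNEL_SLIDE = 0x12E00000
backtrace = [0xFFFFFF8013123456, 0xFFFFFF80132ABCDE]

for runtime, static in zip(backtrace, unslide(backtrace, KERNEL_SLIDE)):
    print(f"{runtime:#x} -> {static:#x}")
```

The unslid addresses are the ones you would then look up against the symbols in a matching KDK.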
  In this case I do not have a KDK for that build. In fact Apple has been unable to produce one for a couple of months, an inadequacy which I have repeatedly emphasized to them because of how critical they are for stuff like this. Supposedly they are working on it. Whatever; in lieu of that I got to figure out how good the tooling for analyzing kernels is these days, which was my real goal anyways.
  For this crash log I downloaded the IPSW file for your build, 22A400. All of them get linked on The iPhone Wiki, e.g. https://www.theiphonewiki.com/wiki/Firmware/Mac/13.x. Once you unpack the IPSW (it's a zip file) there are compressed kernelcache files inside. Apple changed the format of these this year so most of the tooling breaks on it, but https://github.com/blacktop/ipsw was able to decompress them. Then I loaded it into Binary Ninja, which apparently doesn't support them either, but compiling this person's plugin (+166 submodules, and an LLVM & Boost build) gets it to work: https://github.com/skr0x1c0/binja_kc.
  From there you can load up the faulting address from the crash log and see what the function looks like. In this case, a bunch of junk has been inlined into it but there's a really obvious and fairly unique string reference for "invalid knote %p detach, filter: %d". From there, you can compare it against the actual source code to see which one matches the "shape" of the function you're looking at. I happened to also pull up an older kernel which did have a KDK available and then compared its assembly to the new one to match it up to ptsd_kqops_detach. The disassembly of the crashing code is obviously doing a linked list walk so you can figure out exactly which line it is from that.
  If I wasn't lazy I might also fire up a debugger to see why the function had walked off the end of the list but without KDKs things get pretty bad, not that they're very good to begin with. I don't have a m1n1 setup (I should probably do this at some point) and the things I do have, like remote debugging or the VM GDB stub, are not really worth suffering through for a Hacker News comment.
  
  alin23 3 years ago
  
  Saagar, thank you so much! This is priceless!
  I was in the process of trying to get I²C working through the built-in HDMI port of the Apple Silicon Macs (the one containing the MCDP29xx HDMI-to-DP converter chip) and had been hitting a lot of dead ends while looking at kexts and opaque firmware blobs. This is going to help a lot as the KDK seems to contain logging messages related to DDC that I've never seen before.
  I also found SIP disabled + Frida very useful for debugging without going through the KDK/m1n1 route. Not sure if it also helps with kext code though; I mostly used it for SkyLight and other private frameworks. Still, it's very nice to be able to alter the code while it is running in real time, or sometimes simply log specific function calls with their argument values to get an idea of which action causes which code to run.
  
  saagarjha 3 years ago
  
  Unfortunately patching the kernel or injecting your own code into it is quite difficult, unlike the situation in userspace. Though I haven’t gotten a chance to try it, I think running a kernel debugger through m1n1 is the best strategy for doing dynamic analysis of the kernel.
trollied 3 years ago

Might have been better to report it to the security people. This sort of thing can be exploitable.
- nemetroid 3 years ago
  
  They did report it to Apple, multiple times:
  > I found it while on Monterey and reported it 2 times through Feedback Assistant, but it still happens on Ventura.
  
  lilyball 3 years ago
  
  "to the security people" means emailing the relevant email address, not Feedback Assistant.
  
  nemetroid 3 years ago
  
  If it doesn't end up with the relevant people, that's Apple's problem.
  
  saagarjha 3 years ago
  
  It does eventually. If you want a prompt response you should contact product security.
  
  hackmiester 3 years ago
  
  Shouldn't Apple be the ones who really want to respond promptly? Why should we work around bugs in Apple's issue reporting system?
  
  saagarjha 3 years ago
  
  I don’t really see the problem with getting faster responses by contacting Apple’s security team directly for potential vulnerabilities when compared to the general-purpose bug tracker.
  
  LawTalkingGuy 3 years ago
  
  By an hour or two, maybe. But since the last version of the OS? No.
  
  nigamanth 3 years ago
  
  Usually big companies such as Discord give perks to the bug hunters who find bugs. Apparently Apple doesn't have that. There are probably people at Apple who won't admit that they have bugs, even though every operating system has them; the codebase is too big to be free of bugs or exploits.
  
  SpelingBeeChamp 3 years ago
  
  Huh? Apple has a robust bug bounty program.
  https://security.apple.com/bounty/
- bjoli 3 years ago
  
  I found 2 crashes in OS X back in Yosemite. I have reported them with every release since.
  I have no idea if they work on the ARM Macs, but I will have the ability to check in a couple of days. Probably nothing exploitable, but still a hard crash.
- saagarjha 3 years ago
  
  It’s a null deref.
catiopatio 3 years ago

What’s the feedback ID #?
macshome 3 years ago

Can you put the panic log text into the repo as well?
- alin23 3 years ago
  
  Sure, added here: https://github.com/alin23/m1-panic#panic-crash-report-after-...
metadat 3 years ago

Do you know what the underlying problematic instruction sequence is? Or the precise location where it halts?
- alin23 3 years ago
  
  I have no idea how I could find that, given that the system freezes completely.
  Maybe tracing the CPU instructions using LLDB might be possible, but the bug is most likely in the kernel code so this would not help much.
  
  catiopatio 3 years ago
  
  You can debug the kernel remotely over Ethernet: https://developer.apple.com/documentation/apple-silicon/debu...
  If that still fails, virtualization tools provide debugging interfaces you can use to step the execution of the virtualized CPU; e.g. VMware’s “debugStub” feature.
  
  sharikous 3 years ago
  
  You can't with Apple Silicon. It's shameful in my opinion. You can still load a core dump or view the state after an NMI, but you can't run the kernel under a debugger.
  
  saagarjha 3 years ago
  
  You can, the experience is just not very good.
  
  Firmwarrior 3 years ago
  
  haha, let's just let the Apple engineers on this thread figure that shit out
  
  bri3d 3 years ago
  
  The best public kernel debugger for Apple Silicon is m1n1 from the Asahi project.
- saagarjha 3 years ago
  
  ldr x9, [x8, #0x18]!
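For readers unfamiliar with AArch64 addressing modes, the trailing `!` marks a pre-indexed load: the offset is added to the base register, the load uses that address, and the base register is updated. Below is a toy Python model of that behaviour. The register file, the `memory` dictionary, and the `Fault` exception are all hypothetical, and the null base pointer is an assumption based on the null dereference described elsewhere in this thread.

```python
class Fault(Exception):
    """Raised when the modelled load touches an unmapped address."""


def ldr_pre_index(regs, memory, dst, base, offset):
    """Toy model of `ldr dst, [base, #offset]!` (pre-indexed load with write-back)."""
    address = regs[base] + offset
    if address not in memory:
        raise Fault(hex(address))  # the load faults before anything is written
    regs[dst] = memory[address]
    regs[base] = address           # the trailing `!` writes the address back to the base


# A valid node pointer: the load succeeds and x8 is advanced by 0x18.
regs = {"x8": 0x1000, "x9": 0}
ldr_pre_index(regs, {0x1018: 0xDEAD}, "x9", "x8", 0x18)
print(hex(regs["x9"]), hex(regs["x8"]))

# A null node pointer (assumed): the load targets 0x18, which is unmapped.
try:
    ldr_pre_index({"x8": 0, "x9": 0}, {}, "x9", "x8", 0x18)
except Fault as fault:
    print("fault at", fault)
```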
MuffinFlavored 3 years ago

does it work in zsh or bash? only fish?

throwaway09223 3 years ago

It's surprisingly easy to stumble into crash bugs when playing around with processes.

I remember a decade or two ago I ran into a Linux bug where the kernel would panic if a process was killed with an open descriptor on its /proc entries. That is:

open /proc/$pid/something; kill -9 $pid #kernel crash

We unfortunately discovered this when using fuser in a runscript to kill stale versions of a process, e.g.:

sudo fuser --kill --namespace tcp 80 # kill whatever is listening on port 80

This would reliably cause kernel panics every so often, with one straightforward shell command. This ended up causing a big problem because it was part of a runscript which ran on bootup. But, it normally would do nothing so it went unnoticed until the app in question had a startup problem, leaving a copy of itself dangling listening on the port -- and instead of killing the old instance, it began crashing the entire system in a loop. Oops.

teawrecks 3 years ago

It was also unsurprisingly easy to crash a kernel a decade or two ago.
- jeffparsons 3 years ago
  
  I remember a time when you had to be careful to not reveal your IP address to untrusted peers (e.g. on IRC) because a single specially malformed packet called the "Ping of Death" would reliably crash any internet-connected Windows PC.
  That was a wild time. Nobody talked about security back then. The idea that everything in our lives would eventually run over the internet just wasn't on people's minds.

kayodelycaon 3 years ago

It's freezing the querying of process status, which is very not good, but that isn't the entire kernel. If it was the entire kernel, you wouldn't be able to use Ctrl+C.

anyfoo 3 years ago

Yes. The title of the actual blog post seems more accurate.

hinkley 3 years ago

In the long dark ago there was a program called 'crashme' which would generate and run random code from user space to see if it could cause kernel panics.

jwilk 3 years ago

https://people.delphiforums.com/gjc/crashme.html
- xcdzvyn 3 years ago
  
  I presume there's a low yet non-zero chance this inadvertently messes up something on the FS?
civopsec 3 years ago

Did it predate the “fuzzing” term?
- hinkley 3 years ago
  
  By at least a decade or two. The last time I saw one in the wild was probably around 1998, and it was a very old idea by then.

throwawaaarrgh 3 years ago

It's very easy to freeze a system as a non-root user; cause too many interrupts, consume too many resources, etc. Many kinds of infinite loop will lock a system hard. Hell, you can crash systems with too many packets.

And it's very easy to cause ps to hang. Many different kernel syscalls hang / are blocking. Mostly you see this with kernel features dependent on a resource that doesn't resolve itself, like a stuck disk, network filesystem, etc. But other various quirks of the system can cause blocking.

adrian_b 3 years ago

While what you say is true, these are nonetheless kernel bugs.
The kernel should never let any user process consume so many resources as to cause a system freeze.
The kernel must not only be able to preempt any user process at any time, stopping it from consuming all CPU time, but it must also prevent any user process from completely filling an SSD or HDD, because that can prevent many programs from starting.
- throwawaaarrgh 3 years ago
  
  Preemptibility is an optional kernel design feature. Not all kernels have it and not in all ways. If it's intentionally designed that way it's not a bug. No kernel stops users from filling up disks (though some filesystems have such limits as features, which most of us turn off).
  Pretty much the only kernels that are totally preemptible are RTOS and they still don't stop you from shooting yourself in the foot.
- astrange 3 years ago
  
  A computer's job is to do whatever you ask it to do; that includes using all disk or memory up, if that's what you really want. There's not really a way to prevent breaking the OS (or just using up the battery) without preventing you from using all the computer you bought.
  A phone is different since it always has to be able to make phone calls.
mort96 3 years ago

The things you mention cause "freezing" by asking the kernel to do too much so that it doesn't have time to deal with other stuff. Those issues are unfortunate, but really hard to completely avoid.
The interesting thing here is that the described bug isn't just overloading the kernel with work or starving it of resources, it's something which seems completely innocent.

saagarjha 3 years ago

Past discussion of this bug: https://news.ycombinator.com/item?id=16082861

dang 3 years ago

Thanks! Macroexpanded:
I think I found a Mac kernel bug? - https://news.ycombinator.com/item?id=16250677 - Jan 2018 (117 comments)

anon291 3 years ago

Not surprised. I once wrote some kqueue code in C that not only froze the Mac kernel but crashed the entire computer. I reported the issue to Apple and never really heard back. In my experience, they don't really care as long as all the Mac App Store apps work.

This was one call to kqueue with incorrect (but not particularly malicious, just normal C silliness) arguments, and boom!

amenghra 3 years ago

In 2017, I wrote:
Over the years, I have found four different bugs in Apple's Calculator app. Here is today's wtf.
Switch calculator to Scientific mode (⌘-2). Type: 1, 0, ^, 2, 0, enter, command-c (to copy), command-v (to paste)
Expected result: an amount in $$$ I wouldn't mind having. Actual result: smallest number to appear 6 times in Pascal's triangle
I reported all four and they never acknowledged any of them. They didn't fix any of them either. Doesn't motivate anyone to report more bugs to them.
Some of the bugs got auto-closed after a while. Eventually, the bugs did get fixed, except the clipboard thing ¯\_(ツ)_/¯.
- lilyball 3 years ago
  
  I just tested right now, ⌘C copies the string "1e20" and ⌘V pastes that same string. So yeah it looks like it was fixed.
  
  anamexis 3 years ago
  
  I also just tested now and reproduced the bug. The key thing is that you are pasting back into the calculator – which presumably is just stripping letters.
  The behavior is maybe a bit surprising, but I could also see it being defensible. You can't type "1e20" into the calculator, so why would you be able to paste it in?
  Outside of plain text, I don't think clipboard operations are necessarily expected to be reversible.
  
  amenghra 3 years ago
  
  That’s a good point. Maybe you should be able to type “1e20” in scientific mode? It would make the calculator more full-featured and prevent “losing” data if you copy/paste in the middle of a long calculation.
  
  lilyball 3 years ago
  
  "e" in scientific mode is a shortcut for the "ln" function.
  That said, I tested again, and once again copying "1e20" and pasting it into the Calculator works just fine. It's definitely not treating it the same as pressing each key separately. I'm testing on macOS 13.0.1.
- jwilk 3 years ago
  
  > smallest number to appear 6 times in Pascal's triangle
  You mean exactly 6 times or at least 6 times?
  
  amenghra 3 years ago
  
  Exactly 6 times. What's happening is if the result is 1e<something>, a roundtrip to the clipboard results in 1<something>. I.e. 1e20 becomes 120.
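The round trip and the Pascal's triangle claim can both be checked with a short Python sketch. `paste_digits_only` is a hypothetical model of the paste behaviour (it assumes, as guessed earlier in the thread, that non-digit characters are simply dropped), and `pascal_occurrences` is a brute-force counter written for this illustration.

```python
from math import comb


def paste_digits_only(text):
    """Hypothetical model of the paste: keep digits, drop everything else."""
    return "".join(ch for ch in text if ch.isdigit())


def pascal_occurrences(n):
    """Count how many entries of Pascal's triangle equal n (for n >= 2)."""
    count = 0
    for row in range(1, n + 1):  # interior entries of rows below row n exceed n
        for k in range(1, row // 2 + 1):
            value = comb(row, k)
            if value > n:
                break  # entries only grow as k approaches the middle of the row
            if value == n:
                # Count the mirror entry C(row, row - k) unless k is the exact middle.
                count += 1 if 2 * k == row else 2
    return count


pasted = paste_digits_only("1e20")
print(pasted)                           # -> 120
print(pascal_occurrences(int(pasted)))  # -> 6
```

The six occurrences of 120 are C(10,3), C(16,2) and C(120,1), together with their mirror entries in the same rows.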
JonathonW 3 years ago

Bugs do get fixed (eventually; they're not always timely about it depending on severity), but Apple's feedback systems are and always have been a black hole. Basically, as a reporter, the only time you hear anything back from Apple about a bug report is if they need additional information; nothing else in their process is visible externally (until you go back and retest a few macOS releases later and your bug is or isn't fixed).
catiopatio 3 years ago

Is it still reproducible?
- anon291 3 years ago
  
  I haven't had a mac in years. This was ~2010.

byteduck 3 years ago

My naïve guess is that this is probably some sort of lock contention thing.

enedil 3 years ago

Lock contention usually impacts performance, but not liveness.
- saagarjha 3 years ago
  
  Usually, but in truly pathological cases you may starve something important essentially indefinitely.
mort96 3 years ago

Lock contention from running a syscall once?

IceWreck 3 years ago

Is this fixed now?

kayodelycaon 3 years ago

Tested it on macOS 12.3. It's fixed.
- eesmith 3 years ago
  
  https://github.com/hishamhm/htop/issues/682#issuecomment-377... (the htop issue) says "according to others above, Apple has seemingly fixed this in 10.13.4."

resters 3 years ago

I asked ChatGPT to write some programs like this and it refused!