cryptonector 2 days ago

> Fil-C's pollchecks also support stop-the-world (via the FILC_THREAD_STATE_STOP_REQUESTED bit). This is used for:

> - Implementing fork(2), which needs all threads to stop at a known-good point before the fork child jettisons them.

Makes me wonder how it handles `vfork()`, but I think it needs just a safepoint and no stop-the-world since, after all, the child side of `vfork()` is executing in the same address space as the parent (until exec-or-exit), so it's as though it were the same thread as in the parent (which is stopped waiting for the child to exec-or-exit). All the more reason that `fork()` is evil and `vfork()` much better.

  • pizlonator 2 days ago

    I don't support vfork(2) right now, only fork(2).

    I have some super crazy ideas about how to make vfork(2) work, but I don't particularly like any of them. I'll implement it if I find some situation where vfork(2) is strongly required (so far I haven't found even one such case). The closest is that I've found cases where I have to add the CLOEXEC pipe trick to do error handling the fork(2) way rather than the vfork(2) way.
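
    (For anyone who hasn't seen it: the trick is, roughly, to open a close-on-exec pipe before forking and have the child report exec failure through it -- a sketch of the shape of it, not my actual code:)

        #define _GNU_SOURCE
        #include <errno.h>
        #include <fcntl.h>
        #include <unistd.h>

        /* Returns the child's pid, or -1 with errno set to the child's exec
           error. Illustrative only; real code would check pipe2/fork too. */
        pid_t spawn_with_cloexec_pipe(const char *cmd, char *const argv[])
        {
            int fds[2];
            pipe2(fds, O_CLOEXEC);               /* both ends are close-on-exec */

            pid_t pid = fork();
            if (pid == 0) {
                execvp(cmd, argv);               /* on success the write end closes */
                int err = errno;                 /* exec failed: tell the parent */
                write(fds[1], &err, sizeof err);
                _exit(127);
            }

            close(fds[1]);
            int err = 0;
            ssize_t n = read(fds[0], &err, sizeof err);  /* 0 bytes => exec worked */
            close(fds[0]);
            if (n > 0) { errno = err; return -1; }
            return pid;
        }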

    • kmavm 2 days ago

      Hi Fil! Congrats on all the amazing progress on Fil-C.

      We needed to port all the user-level fork(2) calls to vfork(2) when working on uClinux, a port of Linux to MMU-less microcontrollers [1]. It used to be that paging MMUs were kinda expensive (those TLBs! so much associativity!!!), and the CPU on your printer/ethernet card/etc. might not have that much grit. Nowadays not so much.

      Still. A hard-and-fast use for vfork(2), as requested perhaps.

      [1] http://www.ibiblio.org/lou/old/ViewStation/

      • pizlonator 2 days ago

        It'll be a while before Fil-C is appropriate for use in MMU-less microcontrollers.

    • cryptonector 2 days ago

      Remember that the child side of `vfork(2)` can only do async-signal-safe things, as `vfork(2)` is documented, and it's really as if it were executing in the same thread as the parent (down to thread-locals, I do believe), so `vfork(2)` really shouldn't be all that special for Fil-C! You might want to try it.
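
      (The canonical pattern is tiny -- the child does nothing but exec-or-exit; `path`/`argv`/`envp` here are just placeholders:)

          #include <unistd.h>

          pid_t vfork_exec(const char *path, char *const argv[], char *const envp[])
          {
              pid_t pid = vfork();
              if (pid == 0) {
                  /* child: borrows the parent's address space and stack; only
                     async-signal-safe work is allowed here, then exec-or-exit */
                  execve(path, argv, envp);
                  _exit(127);            /* never return from the vfork()ed frame */
              }
              /* parent resumes here once the child has exec'ed or exited */
              return pid;
          }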

      As u/foota notes, it's important. The point of `vfork(2)` is that it's _fast_. https://news.ycombinator.com/item?id=30502392

      • pizlonator 2 days ago

        > Remember that the child side of `vfork(2)` can only do async-signal-safe things as `vfork(2)` is documented

        And if it does anything that isn't async-signal-safe, then all bets are off.

        So, the Fil-C implementation of vfork(2) would have to have some way of checking (either statically or dynamically) that nothing async-signal-unsafe happens. That's the hard bit. That's why I think that implementing it is crazy.

        I'd have to have so many checks in so many weird places!

        > The point of `vfork(2)` is that it's _fast_

        Yup. That's the reason to have it. But I haven't yet found a situation where a program I want to run is slow because I don't have vfork(2). The closest thing is that bash is visibly slower in Fil-C due to forking, but bash always uses fork(2) anyway - so in that case what I need to do is make my fork(2) impl faster, not implement vfork(2).

        • cryptonector 2 days ago

          > And if it does anything that isn't async-signal-safe, then all bets are off.

          > So, the Fil-C implementation of vfork(2) would have to have some way of checking (either statically or dynamically) that nothing async-signal-unsafe happens. That's the hard bit. That's why I think that implementing it is crazy.

          Not really. See, Fil-C already makes the things that would otherwise be unsafe safe, with the one exception of returning from the caller of `vfork(2)`. Treat the child as if it's the same thread as in the parent and let it do whatever it would do.

          Well, I guess the biggest issue is that you'd have to deal with how to communicate between the child and the GC thread, if you have to. One option is to just not allow the GC to run while a child of `vfork(2)` hasn't exec'ed-or-exit'ed.

          > > The point of `vfork(2)` is that it's _fast_

          > Yup. That's the reason to have it. But I haven't yet found a situation where a program I want to run is slow because I don't have vfork(2). The closest thing is that bash is visibly slower in Fil-C due to forking, but bash always uses fork(2) anyway - so in that case what I need to do is make my fork(2) impl faster, not implement vfork(2).

          Typically it's programs with large RSS that need it, so things like JVMs, which probably wouldn't run under Fil-C.

          • pizlonator 2 days ago

            Simple example that breaks the world:

                void foo(void)
                {
                    vfork();
                }
            
            In this case, the vfork child is returning. Eventually it'll exit. In the meantime, it's clobbered the vfork parent's stack.

            Kaboom

            • cryptonector 2 days ago

              Yes, _that_ is not to be allowed. I think you could have LLVM know about `vfork(2)` and treat that as an error -- heck, clang already knows how to do that, does it not?

              • pizlonator 2 days ago

                That's just one example and the problem isn't just checking this one case, but any generalization of it. It can't be a fully static check because you could achieve similar things with longjmp or exceptions (both of which Fil-C supports).

                And then there are similar issues with heap accesses and safepoints. The vfork child has to participate in safepoints, except that it can't quite (you can't use pthread_mutex/pthread_cond primitives in the vfork child to synchronize with threads in the vfork parent).

                Anyway, I've thought about this a lot, and I could keep coming at you with reasons why it's hard. It's not like I haven't implemented vfork(2) out of some abstract fear.

                • cryptonector 2 days ago

                  Another option is to make `posix_spawn(3)` in the program's C library work with Fil-C to let Fil-C implement it w/o instrumentation in its C library. This would work for me.

                • cryptonector 2 days ago

                  Can we make the C library mutex lock and condvar primitives work in the vfork() child? It's practically like a thread...

                  • pizlonator 2 days ago

                    That might be the way to go but my best ideas for how to make vfork work sidestep all of that.

                    It's just a lot of work. It basically means that the whole Fil-C runtime has to be aware of this alternate mode where we're the vfork child and we have to play by different rules.

                    Anyway, long story short:

                    - Yeah I know vfork(2) is important. It's just not the most important thing on my plate. The constant time crypto situation we were talking about in the other thread is definitely more important, for example!

                    - I know how to do vfork(2) but every version of it that I've come up with is a ton of work and carries the risk of introducing really wacky stability bugs and safety bugs.

                    So, vfork is just a high risk, high cost, medium reward, medium urgency kind of thing. That probably means it'll be a while before I implement it.

                    • cryptonector 2 days ago

                      Understood, and I agree that constant-time assembly-coded crypto is much more important. Moreover, if you make `posix_spawn()` fast that's enough, and you can probably do that by proxying it from the victim program's C library to Fil-C's.
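
                      (For reference, what callers end up with is a single libc call instead of the fork+exec pair -- a plain POSIX sketch, nothing Fil-C-specific:)

                          #include <spawn.h>
                          #include <unistd.h>

                          extern char **environ;

                          /* roughly what a fork()+execvp() pair collapses into;
                             error handling trimmed for brevity */
                          pid_t spawn_cmd(const char *file, char *const argv[])
                          {
                              pid_t pid;
                              int err = posix_spawnp(&pid, file, NULL, NULL, argv, environ);
                              return err ? -1 : pid;
                          }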

        • trws 2 days ago

          The other place it comes up is launchers and resource managers. We actually have a series of old issues and implementation work on flux (a large-scale resource manager for clusters) working around fork becoming a significant bottleneck in parallel launch. IIRC it showed up when we had ~1 GB of memory in use and needed to spawn between 64 and 192 processes per node. That said, we actually didn’t pivot to vfork; we pivoted to posix_spawn for all but the case where we have to change working directory (we had to support old glibc without the attr for that in spawn). If you’re interested I think we did some benchmarking with public results I could dredge up.

          Anyway, much as I have cases where it matters I guess what I’m saying is I think you’re right that vfork is rarely actually necessary, especially since you’d probably have a much easier time getting a faster and still deterministic spawn if it ever actually becomes a bottleneck for something you care about.

          • cryptonector 2 days ago

            > That said, we actually didn’t pivot to vfork, we pivoted to posix_spawn for all but the case where we have to change working directory (had to support old glibc without the attr for that in spawn).

            You can always accomplish that sort of thing by using a helper program that ultimately execs the desired one -- just prefix it and its arguments to the intended argv.
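
            (The helper can be a handful of lines -- chdir, then exec the real command; a hypothetical `chdir-helper`, just for illustration:)

                /* chdir-helper.c: usage is  chdir-helper DIR PROG [ARGS...] */
                #include <stdio.h>
                #include <unistd.h>

                int main(int argc, char **argv)
                {
                    if (argc < 3 || chdir(argv[1]) != 0) {
                        perror("chdir-helper");
                        return 127;
                    }
                    execvp(argv[2], &argv[2]);   /* the real program's argv starts here */
                    perror("chdir-helper: exec");
                    return 127;
                }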

            • trws a day ago

              Quite so. We would have too, but I left out the nasty bit that someone had at one point put a callback argument in an internal launching API that runs between fork and exec. Still working on squashing the last of those.

          • cryptonector 2 days ago

            > The other place it comes up is launchers and resource managers.

            Yes, and typically for the same reasons as in JVMs etc: `vfork()` is just faster.

cryptonector 2 days ago

@pizlonator I wonder if you couldn't bracket all assembly with `filc_exit`/`filc_enter` as you do system calls. When you know the assembly doesn't allocate memory then this should work fine, but... ah, it's the stack allocations that are also interesting, right? So you'd have to ensure that the assembly has enough stack space to execute _and_ that it does not invoke the heap allocator. But that seems doable for things like cryptography in OpenSSL's libcrypto.

  • pizlonator 2 days ago

    Yeah you could do that. It's tantamount to having an FFI. It's super risky though - lots of ways that the assembly code could break Fil-C's assumptions and then you'd get memory safety issues.

    My long term plan is to give folks a way to write memory safe assembly. I have some crazy ideas for how to do that

    • cryptonector 2 days ago

      Yeah, but it would go a looong way to getting us (the public) to use Fil-C in production. Like I'd really like to be able to run OpenSSH using Fil-C and I really don't want to have to worry about the C-coded crypto being non-constant-time.

      What I'd suggest as a medium-term solution is macros that expand to nothing when Fil-C is not in use, but otherwise expand to a Fil-C decoration indicating that the author thinks the assembly (or assembly-coded function) is Fil-C-safe.
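
      Roughly something like this -- the `__FILC__` macro, the annotation string, and the function name are all made up here, just to show the shape of the idea:

          #include <stddef.h>

          #if defined(__FILC__)                 /* hypothetical predefined macro */
          #  define FILC_TRUSTED_ASM \
                 __attribute__((annotate("filc.trusted_asm")))   /* hypothetical marker */
          #else
          #  define FILC_TRUSTED_ASM            /* non-Fil-C builds: expands to nothing */
          #endif

          /* the author asserts this assembly-backed routine is Fil-C-safe */
          FILC_TRUSTED_ASM
          int sha256_block_asm(void *ctx, const void *data, size_t num_blocks);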

      Now... I know, there's more, right, like you burn a register to keep a pointer to the Fil-C thread data structure, and the assembly needs to not step on that, so ok, it's probably harder than I'm making it seem, but maybe not much harder because you can save that pointer as part of the exit and restore it as part of the enter.

      I do trust that your crazy ideas are crazy but workable though!

      • pizlonator 2 days ago

        That's a good point.

        Note that statically checked inline asm is very achievable, so those folks who do constant time crypto by concealing their math operators behind inline asm will get what they need.

        But I guess you really want the OpenSSL out-of-line assembly to work?

        • cryptonector 2 days ago

          > But I guess you really want the OpenSSL out-of-line assembly to work?

          Yes. And that code generally does not invoke the allocator or stray out of the bounds of the given buffers, so at the very least you could mark that assembly as trusted for this mode.

          • pizlonator 2 days ago

            Yeah, I'll have to come up with a decent way of allowing this, at some point.

            But see https://fil-c.org/runtime

            It's really about creating an FFI for Fil-C, because Fil-C has its own calling convention and symbol mangling. It could be a lot of work.

            And, worst case, it ends up being a footgun

            • cryptonector 2 days ago

              I'm sure for many people's $WORK the ability to run OpenSSH built with Fil-C and constant-time crypto would be amazing, and it would be great advertising for Fil-C. But there is no way any of us would run OpenSSH built with Fil-C in production w/o constant-time crypto.

              • pizlonator 2 days ago

                That's good to know.

                If I made the assembly memory safe under Fil-C rules by running it through a transform that inserted additional instructions but did not otherwise change what was happening, would you trust that it's still constant-time?

                • Vecr 2 days ago

                  Yes. Don't branch on key or in-clear data. Otherwise, ok.

                  If a user is doing onion wrapping, they don't want you to branch on the code data either.

                • cryptonector 2 days ago

                  You can reason about it: if the inserted instructions do not depend on a key or plaintext value, then they won't affect the constant-time nature of the implementation of the cryptographic algorithm.

correct_horse 2 days ago

Fil-C seems interesting, and I didn’t understand the details of how multi-threaded garbage collectors worked before reading it (I still don’t, but I’m closer!). The tradeoff between a compacting garbage collector (Java) and what you can bolt onto C without forking LLVM is particularly interesting.

  • pjmlp 2 days ago

    Note that it is very simplistic to say Java has a compacting garbage collector; between Oracle JDK, OpenJDK, Temurin, Azul, PTC, Aicas, microEJ, OpenJ9, GraalVM, Jikes RVM, and the cousin Android, there is a plethora of GC configurations to choose from that go beyond that.

    • whizzter 2 days ago

      Not to mention that many of them have multiple GCs that are selected via startup options.

      Personally though I can't wait for Azul's patents to start running out so that OS vendors can start implementing similar machinery into mainline OS's for all runtimes to use.

      • pjmlp 2 days ago

        Unfortunately that problem is cultural, not patents.

  • cryptonector 2 days ago

    TFA is really good at explaining how a threaded GC works with minimal impact when it's not running.

kragen 2 days ago

This is interesting and looks highly readable! But I don't understand it yet.

cyberax 2 days ago

Polling for the safepoint signal adds overhead to tight loops, so Golang uses asynchronous signals to interrupt threads. It needs this for async preemption, not just GC. It also results in not knowing the exact state of the frame, so it has to scan the last frame of the stack conservatively.

There are no good ways to deal with that currently. Some VMs even used CPU instruction simulators to interpret the code until it hit the known good safepoint.

I had one idea about improving that: have a "doppelganger" mirror of the inner tight loops that is instruction-by-instruction identical to the regular version, except that back-branches are replaced with a call into the GC safepoint handler.

It'd be interesting to try this approach with Fil-C.

  • pizlonator 2 days ago

    > Polling for the safepoint signal adds overhead to tight loop

    The usual trick, which Fil-C doesn't do yet, is to just unroll super tight loops
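
    Roughly, unrolling amortizes the check: you pay for one pollcheck per chunk of iterations instead of one per iteration. A sketch of the shape the compiler aims for, with `filc_pollcheck()` standing in for whatever actually gets emitted:

        #include <stddef.h>

        void filc_pollcheck(void);      /* stand-in for the compiler-inserted check */

        long sum(const long *a, size_t n)
        {
            long s = 0;
            for (size_t i = 0; i < n; ) {
                size_t end = (n - i > 64) ? i + 64 : n;
                for (; i < end; i++)
                    s += a[i];          /* tight body, no per-iteration pollcheck */
                filc_pollcheck();       /* one check per chunk of up to 64 iterations */
            }
            return s;
        }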

    Also, the pollcheck doesn't have to be a load-and-branch. There are other ways. There's a bottomless pit of optimization strategies you can do.

    Conservative scanning wouldn't be sound in Fil-C, because the LLVM passes that run after FilPizlonator could break the fundamental assumptions of conservative scanning

    • cyberax 2 days ago

      What other methods of stopping at safepoints do you think are viable? Another approach is code patching, but it's not possible on iOS, and it's a bad idea in general.

      • pizlonator 2 days ago

        The most commonly used optimization is to have just a load, or just a store, rather than a load-and-branch. Basically you access a page that you mprotect to trigger the handshake.
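
        Conceptually it looks something like this (a sketch of the classic polling-page trick, not Fil-C's actual runtime; 4 KiB page assumed):

            #include <sys/mman.h>

            static void *poll_page;     /* every pollcheck loads from this page */

            /* the pollcheck lowers to a single load, no branch */
            #define POLLCHECK() ((void)*(volatile char *)poll_page)

            static void init_poll_page(void)
            {
                poll_page = mmap(NULL, 4096, PROT_READ,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            }

            /* to request a handshake, revoke access: every thread's next
               pollcheck faults, and the SIGSEGV handler runs the safepoint
               logic and waits until the page is made readable again */
            static void request_handshake(void) { mprotect(poll_page, 4096, PROT_NONE); }
            static void release_handshake(void) { mprotect(poll_page, 4096, PROT_READ); }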

        Unrolling loops is also super common. Recognizing loops that have a bounded runtime is also common.

        My favorite technique to try one day is:

        1. just record where the pollcheck points are but don't emit code there

        2. to handshake with a thread, shoot it with a signal and have the signal handler interpret the machine code starting at the ucontext until it gets to a pollcheck

        • cyberax 2 days ago

          Memory protection games just don't scale. You're modifying global structures, resulting in tons of contention. That's also why compacting GCs prefer to use card marking rather than mprotect.

          About 2. - that's exactly what I meant. The downside is that now you have to write a full x86 emulator, maybe including SIMD instructions. Good luck.

          And that's also what I want to try to avoid. Imagine generating a copy of the innermost loops, with the nearest backbranch replaced with a jump into the GC code. Then from the signal handler you just need to find the offset between the current instruction pointer and the mirror copy, adjust the pointer, and exit from the signal handler.
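
          Something like this for the redirection step (a Linux/x86-64 sketch; `loop_start`/`loop_end`/`mirror_start` mark the two instruction-identical copies and would have to come from the compiler):

              #define _GNU_SOURCE
              #include <signal.h>
              #include <stdint.h>
              #include <ucontext.h>

              extern char loop_start[], loop_end[], mirror_start[];

              static void preempt_handler(int sig, siginfo_t *info, void *ctx)
              {
                  (void)sig; (void)info;
                  ucontext_t *uc = ctx;
                  uintptr_t rip = uc->uc_mcontext.gregs[REG_RIP];
                  if (rip >= (uintptr_t)loop_start && rip < (uintptr_t)loop_end) {
                      /* resume at the same instruction in the mirror copy; its
                         back-branch falls through into the GC safepoint code */
                      uc->uc_mcontext.gregs[REG_RIP] =
                          (uintptr_t)mirror_start + (rip - (uintptr_t)loop_start);
                  }
              }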

          • pizlonator 2 days ago

            > Memory protection games just don't scale. You're modifying global structures, resulting in tons of contention. That's also why compacting GCs prefer to use card marking rather than mprotect.

            I’m describing what production JVMs do. They do it because it’s a speedup for them. Like, those compacting GCs that you’re saying use card marking are using page faults for safe points more often than not.

            It’s true that page protection games don’t result in speedups for generational store barriers. A safe point is not that.