The "clone the entire address space and then call exec" idiom is indeed wildly inefficient -- that's why the horror which is vfork was invented -- but I'm not convinced that putting everything which sits between fork and execve into io_uring (or, as a comment snarkily suggests, eBPF) is the solution. There are just too many things userland might want to do.
I wonder if the best solution lies somewhere in the vicinity of "fork but only copy a small part of the address space". Fork copies the entire address space, only to use a tiny portion and throw away the rest; vfork copies none of it (the page tables are shared between parent and child until exec). If we can identify what memory the child will need to access before calling _exit or exec (say, "the current function and its local variables"), then we could create an address space with just a few page table entries.
Kind of like the "zygote" forking model (early in the main process lifetime, a zygote process gets forked off, and when the main process wants another worker it asks the zygote to fork one off) except that the "zygote" is more like an induced pluripotent stem cell, having been reverted from an adult state.
> I wonder if the best solution lies somewhere in the vicinity of "fork but only copy a small part of the address space"
I think the best solution would be if every relevant syscall took a process handle, so you can run it either in the current process or in a non-started child process.
That's not going to happen on Linux because it would be a radical change to the Linux syscall API. But if one were designing an OS from scratch today I think it would make sense to do things that way.
That is basically how Fuchsia handles it https://fuchsia.dev/fuchsia-src/reference/kernel_objects/pro...
That doesn't sound much different from regular vfork()? It isn't that evil, you just need a small assembly shim (or if you're courageous, a bit of massaging the compiler output) to safely call another function with its own stack frame, as well as some care to disable signal handlers in the child. It's mostly for silly setuid-binary reasons that the libc people tend to dislike it.
Also, there's no way that libc people would want to work with the compiler people to locate the current stack frame to copy. So you'd end up with an assembly shim with a definite stack size anyway.
But Linux doesn't clone the entire address space: it copies the page table, read-only, and if the child attempts to write, then it uses COW.
So if fork/clone is followed immediately by exec/execve/etc., there is minimal copying.
Yes, that's what I said. Fork doesn't copy the data but it does copy the address space. For a large process (say, a database with multiple GB of data) that's a lot of page tables -- many MB of them if you're using 4 kB pages.
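To put rough numbers on "many MB", here is a back-of-the-envelope sketch. It assumes x86-64-style 8-byte page table entries and a densely mapped region, and it counts only the last-level tables (the upper levels add well under 1% on top). The name `leaf_page_table_bytes` is made up for illustration.

```c
#include <stdint.h>

/* Rough size of the last-level page tables needed to map `mapped_bytes`.
 * Assumption: one fixed-size entry per page (8 bytes on x86-64); the
 * upper levels of the page table tree are ignored. */
uint64_t leaf_page_table_bytes(uint64_t mapped_bytes, uint64_t page_size,
                               uint64_t pte_size)
{
    uint64_t pages = (mapped_bytes + page_size - 1) / page_size; /* round up */
    return pages * pte_size;
}
```

Under these assumptions, `leaf_page_table_bytes(4ULL << 30, 4096, 8)` comes out to 8 MiB for a process with 4 GiB mapped, which fork would have to duplicate if it is all private anonymous memory. With 2 MiB huge pages the same mapping needs only 16 KiB.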
You're resurrecting memories from the mists of time, but I seem to remember a common design pattern back in the day for cases like this was to have a persistent lightweight parent that would fork a processing child that could then request the parent perform fork-and-exec operations.
But it's been a while....
Surely it also marks the parent process's pages as COW too? If only the child is RO but the parent still has RW mappings to the same physical pages, writes from the parent will be observed in the child, which is wrong for COW. You either have to copy the pages immediately (in which case there's no COW) or you have to make all mappings to the physical page RO.
The implication of this is that even if you immediately execve in the child, you still have to pay for the cost of setting COW on the entire address space and then later faulting on every single writable page in the parent process. The performance impact might not be massive, but it's not nothing.
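The semantics at stake here can be checked from userland. In the sketch below (names invented for illustration), the parent writes to a variable after fork and then asks the child what it sees. It says nothing about how the kernel achieves the isolation, only that the parent's later writes must not leak into the child.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int probe = 1;

/* Returns the value the child observes after the parent has written 2,
 * or -1 on error. With correct fork semantics the child still sees 1. */
int value_seen_by_child_after_parent_write(void)
{
    int to_child[2], to_parent[2];
    if (pipe(to_child) == -1)
        return -1;
    if (pipe(to_parent) == -1) {
        close(to_child[0]); close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(to_child[0]); close(to_child[1]);
        close(to_parent[0]); close(to_parent[1]);
        return -1;
    }
    if (pid == 0) {
        char go;
        int seen = -1;
        close(to_child[1]); close(to_parent[0]);
        /* Block until the parent reports that its write has happened. */
        if (read(to_child[0], &go, 1) == 1)
            seen = probe;
        ssize_t w = write(to_parent[1], &seen, sizeof seen);
        (void)w;
        _exit(0);
    }

    close(to_child[0]); close(to_parent[1]);
    probe = 2; /* the parent writes after the fork */

    char go = 'x';
    int seen = -1;
    if (write(to_child[1], &go, 1) != 1 ||
        read(to_parent[0], &seen, sizeof seen) != (ssize_t)sizeof seen)
        seen = -1;

    close(to_child[1]); close(to_parent[0]);
    waitpid(pid, NULL, 0);
    probe = 1; /* reset so the function can be called again */
    return seen;
}
```

If only the child's mappings were made read-only, this function could return 2 instead of 1, which is the failure mode described above.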
> You either have to copy the pages immediately (in which case there's no COW) or you have to make all mappings to the physical page RO.
One of these enhanced fork/exec calls stops the parent until the child execs. Then you don't need to touch the parent's page mappings or worry about concurrency. (Although it's not ideal if the parent is threaded.)
I don't think adding this to io_uring is at all bad. But I don't think it is enough to solve the problem, if for no other reason than that it requires using the machinery of io_uring, which adds quite a bit of complexity.
However, maybe I'm missing something, but it seems like Linux already has functionality that could make spawning a process a lot more efficient and thread-safe. My idea is basically to use clone or clone3 to create a new process in a new thread group that shares the original process's memory (that is, with CLONE_VM but not CLONE_THREAD), and to pass a function pointer to call (instead of returning in the child process) and a heap-allocated stack for the child process to use.
Then there is no need to copy the address space, and you can do more things to prep before calling exec, since other threads can still release locks, you can write to memory, etc.
The downsides I see are that you wouldn't be able to safely modify the current environment variables since that would impact the parent process, and there might be some weirdness with the child process having copies of file descriptors instead of the originals. The first is easy to work around though, and the latter probably wouldn't be an issue in most cases.
Another thought I've had is that if there was a more efficient single syscall for spawning a process that combined fork and exec, even if it is a lot less flexible than fork/exec or the io_uring equivalent, something simple could probably meet the needs of most applications and benefit performance and safety in the common case where you don't need complex setup before calling execve.
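The clone-based idea from this comment can be sketched as follows. This is a minimal sketch, not a hardened implementation: `spawn_shared_vm`, `struct spawn_args` and the 64 KiB stack size are invented for illustration. It also adds CLONE_VFORK, which the comment does not ask for, so that the calling thread sleeps until the child has exec'd or exited; that makes it safe to free the stack and to keep the arguments on the caller's stack. Other threads in the parent keep running. Signal masking is skipped, although a real implementation would need it because the child runs in the parent's memory.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024) /* arbitrary size chosen for this sketch */

struct spawn_args {
    const char *path;
    char *const *argv;
    char *const *envp;
};

/* Runs in the child: own stack, but still in the parent's address space. */
static int child_main(void *p)
{
    const struct spawn_args *a = p;
    execve(a->path, a->argv, a->envp);
    _exit(127); /* only reached if execve failed */
}

/* Returns the child's pid (reap it with waitpid), or -1 on error.
 * An execve failure shows up as exit status 127, not as a -1 here. */
pid_t spawn_shared_vm(const char *path, char *const argv[], char *const envp[])
{
    struct spawn_args args = { path, argv, envp };
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on common architectures, so pass its top.
     * CLONE_VM: share memory. No CLONE_THREAD: new thread group.
     * CLONE_VFORK (assumption of this sketch): suspend the caller until
     * the child execs or exits. SIGCHLD: make the child waitable. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &args);

    /* Because of CLONE_VFORK the child no longer uses the stack here. */
    free(stack);
    return pid;
}
```

As far as I know this is close in spirit to what glibc's posix_spawn does internally on Linux, with considerably more care around signals and error reporting.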
Does the glibc clone wrapper not already do this?
I think it does support doing that. But I've never seen that pattern used. And it isn't used in many higher-level implementations.
Possibly just because clone is a Linux-specific API, whereas fork/exec is more portable.
Funnily enough, there's the idea of embryonic processes, which goes along with the zygote naming scheme. It's a similar idea of a small process specifically for forking, but a more comprehensive model.
Described in these nicely written comments by others:
https://news.ycombinator.com/item?id=32794270 https://news.ycombinator.com/item?id=30510318 https://news.ycombinator.com/item?id=29697645
> if we can identify what memory the child will need to access before calling _exit or exec (say, "the current function and its local variables")
I mean, a good deal of the parameters you need for the relevant syscalls are strings, which means it's not sufficient to copy just the stack frame, but all the memory reachable from the stack frame. Which is a nontrivial problem if you're assuming C/C++-style code.
> Kind of like the "zygote" forking model (early in the main process lifetime, a zygote process gets forked off, and when the main process wants another worker it asks the zygote to fork one off) except that the "zygote" is more like an induced pluripotent stem cell, having been reverted from an adult state.
Interestingly enough, I thought of the same concept, except I did not get to implement it (for a few reasons). Is "zygote" a term you made up or is it an established pattern?
Well established pattern. See e.g. in Chromium: https://chromium.googlesource.com/chromium/src/+/HEAD/docs/l...
(I don't think the Chromium developers invented it either, it's just a convenient reference.)
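For readers who haven't seen the pattern, here is a minimal sketch of a zygote (fork server). Everything in it is an assumption made for illustration: the function names, the one-int "job" protocol over a socketpair, and the trivial worker. Real zygotes such as Chromium's pass file descriptors and much richer requests.

```c
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Example worker body; a real one would do the job it is given. */
void example_worker(int job) { (void)job; }

/* Call early, while the address space is still small. Forks the zygote
 * and returns the parent's end of the control socket, or -1 on error. */
int zygote_start(void (*worker)(int job))
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(sv[0]); close(sv[1]);
        return -1;
    }
    if (pid > 0) {          /* main process */
        close(sv[1]);
        return sv[0];
    }

    /* Zygote: wait for requests and fork one worker per request. */
    close(sv[0]);
    signal(SIGCHLD, SIG_IGN); /* let the kernel reap finished workers */
    int job;
    while (read(sv[1], &job, sizeof job) == (ssize_t)sizeof job) {
        pid_t w = fork();
        if (w == 0) {
            close(sv[1]);
            worker(job);
            _exit(0);
        }
        /* Report the worker's pid (or -1) back to the main process. */
        if (write(sv[1], &w, sizeof w) != (ssize_t)sizeof w)
            break;
    }
    _exit(0); /* main process closed its end of the socket */
}

/* Ask the zygote for a new worker; returns the worker's pid or -1. */
pid_t zygote_spawn(int fd, int job)
{
    pid_t w;
    if (write(fd, &job, sizeof job) != (ssize_t)sizeof job)
        return -1;
    if (read(fd, &w, sizeof w) != (ssize_t)sizeof w)
        return -1;
    return w;
}
```

Note that the workers are children of the zygote, not of the main process, so the main process cannot waitpid on them; it only learns their pids.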
This paragraph surprised me:
> Furthermore it is the only reasonable way to keep a reference to a binary and a set of shared libraries that can be exec'ed. In the model used on Windows and Mac, renderers are exec'ed as needed from the chrome binary. However, if the chrome binary, or any of its shared libraries are updated while Chrome is running, we'll end up exec'ing the wrong version. A version x browser might be talking to a version y renderer. Our IPC system does not support this (and does not want to!).
I think the Chrome team overthought this. If you update Firefox and try to perform an action which spawns a new process, it just politely demands the user restart the browser.
And I hate this; it is super inconvenient when auto-updates are enabled. I am glad the Chrome authors went out of their way to fix this.
(The other option would be to convince Linux distributions to implement a special updater, but I am sure implementing the zygote thing was easier.)
I agree with Pavel that extending the clone syscall is a better idea than this patch set. The flexibility that Josh and Gabriel talk about seems wholly unnecessary. In every use of fork-(do stuff)-exec I've ever seen, the below two observations remained true:
1. Everything needed in the "do stuff" part was known prior to the call to fork
2. Any failures in the "do stuff" part would scrap the child process and report an error to the parent process
3. The stuff has to be done in the child to avoid problems, like in shells.
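The observations above correspond to a familiar shape of code. Here is a minimal sketch, assuming a made-up helper `spawn_in_dir` whose only "do stuff" step is a chdir: everything the child needs is passed in before fork, the work happens in the child, and any failure scraps the child and reports errno to the parent through a close-on-exec pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the child's pid once execv has succeeded, or -1 with errno set
 * to whatever failed in the parent or in the child. */
pid_t spawn_in_dir(const char *dir, const char *path, char *const argv[])
{
    int pfd[2];
    if (pipe(pfd) == -1)
        return -1;
    /* A successful exec closes the write end, so the parent reads EOF. */
    if (fcntl(pfd[1], F_SETFD, FD_CLOEXEC) == -1) {
        int saved = errno;
        close(pfd[0]); close(pfd[1]);
        errno = saved;
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        int saved = errno;
        close(pfd[0]); close(pfd[1]);
        errno = saved;
        return -1;
    }
    if (pid == 0) {
        /* Child: the "do stuff" part, then exec. */
        close(pfd[0]);
        if (chdir(dir) == 0)
            execv(path, argv);
        int err = errno; /* chdir or execv failed */
        ssize_t w = write(pfd[1], &err, sizeof err);
        (void)w;
        _exit(127);
    }

    /* Parent: EOF means the exec happened, data means the child failed. */
    close(pfd[1]);
    int err = 0;
    ssize_t n;
    do {
        n = read(pfd[0], &err, sizeof err);
    } while (n == -1 && errno == EINTR);
    close(pfd[0]);
    if (n == 0)
        return pid;

    /* Scrap the child and report its error to the caller. */
    while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
        ;
    errno = (n == (ssize_t)sizeof err) ? err : EIO;
    return -1;
}
```

In a multithreaded program another thread could fork between pipe and fcntl and inherit the descriptors; pipe2 with O_CLOEXEC closes that window on Linux but is left out here to keep the sketch portable.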
This smells like accidental complexity. What's the point if you have to use the chain in a very specific way, and it can only achieve one thing? That could just be made a single op.
This seems like yet another way for ferrying code/state machines into the kernel. We already have bpf.
For now, I'd settle for an RT-safe way to create a new process that then calls execve. AFAIK, this doesn't exist for Linux and may not exist for any *nix kernel at this time (not sure about this second part).
Darwin has a posix_spawn() syscall. I'm not sure if it's RT-safe, but it is actually a syscall - it's not a wrapper for vfork+execve like it is on Linux.
Not RT-safe.
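For reference, the calling side of posix_spawn looks the same whether it is backed by a real syscall or by a libc wrapper. This is a minimal sketch with `run_and_wait` as a made-up helper name, and it makes no claim about RT-safety.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Spawns `path` with `argv` and the current environment, waits for it,
 * and returns its exit status, or -1 if it could not be spawned or did
 * not exit normally. */
int run_and_wait(const char *path, char *const argv[])
{
    pid_t pid;
    /* No file actions and no attributes: the simplest possible spawn.
     * posix_spawn returns an error number directly rather than using -1. */
    int rc = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `run_and_wait("/bin/sh", argv)` with `argv` set to `{"sh", "-c", "exit 5", NULL}` returns 5.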