kccqzy a year ago

Note that this article is really talking about the general case, but in practice a lot of techniques can work if you have narrower requirements or if you have more control over what you run.

For example, in 1.1.4 the author talks about why containers are not a solution, giving three distinct reasons. But if we change our perspective a little bit, none of the three reasons are blocking. The first is that it's not easy; but `docker run` or `podman run` is easy. Even systemd units start in separate control groups to allow you to terminate everything at once. The second reason was about gdb; when was the last time you used gdb in production? If you are using gdb, someone is interactively using the computer and can be relied upon to clean up processes manually. The third reason is that containers are more heavyweight, but there's no need to make every subprocess a separate container: if multiple processes should be managed as a single unit (including the case when we'd want to terminate a whole group of processes) they should run in the same container.

So with a slight change of perspective we find the problem easily solved. It has trade-offs, but it works well enough in practice that only very few purists have a problem with it. Not to diss the author—I think this type of perfectionist thinking is illuminating in terms of API design—but pragmatically it's a solved problem.

  • catern a year ago

    I respect the attempt to shift perspective very much, but you only engaged at a surface level with the reasons I listed.

    >The first is that it's not easy; but `docker run` or `podman run` is easy.

    I was referring to easy use from a full fledged programming language. When you start a subprocess in your programming language of choice, do you always run it in a container? I seriously doubt it, and the reason for that is because it's hard.

    >The second reason was about gdb

    No, the second reason was about user namespaces, which break many things including ptrace, which in turn breaks gdb, as just one example. There's lots of useful tracing and monitoring software which makes occasional use of ptrace.

    >if multiple processes should be managed as a single unit (including the case when we'd want to terminate a whole group of processes) they should run in the same container

    Yes, that's true. And indeed, scripting use cases often have that characteristic, where everything can be terminated at once at the end. You can compare this to missile-style garbage collection: Just never free your memory/processes. Unfortunately, long-lived applications both need to free their memory over time and need to clean up their processes over time.

    • eru a year ago

      > I was referring to easy use from a full fledged programming language. When you start a subprocess in your programming language of choice, do you always run it in a container? I seriously doubt it, and the reason for that is because it's hard.

      I use a (standard) library to start subprocesses in my favourite programming languages.

      For running a container, I'd also use a library.

      Seems about equally hard, no?

    • blincoln a year ago

      Do user namespaces still break ptrace and gdb? I know gdb versions from 8 onward are much better at handling containerised processes, but I don't know if the reason why is related to what you're describing.

  • wahern a year ago

    The same is true for the process group/session + controlling terminal solution: the solution doesn't work recursively (can't do process management downstream), and it also requires child processes to abstain from changing the SIGHUP handler or mask, but in the vast majority of cases none of those limitations are a problem. Combined with POSIX fcntl locks[1] on a PID file, this is my go-to generic solution for Unix-portable[2], multiprocess daemons. The amount of code required in the supervisor component is quite trivial, yet covers almost all of your bases.

    [1] fcntl locks permit querying the PID of the lock holder, so you don't need to write the PID to the file, providing a solution to the PID file race and loaded gun dilemmas. (There's still a race, but the same race exists with Linux containers, and both can be resolved in similar manner--query PID, send SIGSTOP, verify PID association, send SIGKILL or SIGCONT.)

    [2] One of the crucial behaviors, that the kernel atomically sends SIGHUP to all processes in the group if the controlling process terminates, isn't guaranteed by POSIX, but it's the behavior on all Unix I've tried--AIX, FreeBSD, macOS, Linux, NetBSD, OpenBSD, and Solaris.
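
    To make [1] concrete, here's a minimal sketch of the F_GETLK query (hypothetical PID file path, error handling mostly omitted; a sketch, not production code). The STOP/verify/KILL dance from [1] then happens against the returned pid.

        /* Sketch: ask who holds the write lock on a PID file.
           F_GETLK fills in the conflicting lock's details, including l_pid;
           if nothing conflicts, l_type comes back as F_UNLCK. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void) {
            int fd = open("/run/mydaemon.pid", O_RDWR);   /* hypothetical path */
            if (fd < 0) return 1;
            struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            if (fcntl(fd, F_GETLK, &fl) == 0 && fl.l_type != F_UNLCK)
                printf("supervisor holding the lock has pid %ld\n", (long)fl.l_pid);
            else
                printf("nobody holds the lock (daemon not running?)\n");
            close(fd);
            return 0;
        }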

  • temp2022account a year ago

    > Even systemd units start with separate control groups to allow you to terminate everything at once

    My experience has been that _literally_every_ advanced cloud architecture can be simplified to:

    - Host w/1 administration IP on a vlan w/ admin services on that IP, then N customer/service IPs w/ customer+service apps bound to them (11.12.13.14:5432 for customer A postgres, 25.26.27.28:5432 for customer B postgres, whatever)

    - For each IP+service combo, configure the service to use its own directory of storage. Back each storage according to what it needs; the DB is backed by block RAM > SSD partitions > HDD partitions > SMB nonsense.

    - For each Host+customer+service triple, write a python (complex)/shell (simple) deployment.py/sh and health script.py/sh to handle 95% of monthly deployment+maintenance needs.

    Done. Scale to MxN HostsxServices across M IPs by N services over N customers.

  • ilyt a year ago

    Now have a slight change of perspective and imagine you are a program that runs another program, and you want to make sure you stay in control of anything that program could spawn.

shanemhansen a year ago

The author said something that is technically correct but I feel gives the wrong impression to those of us who may be trying to solve the "kill all child processes" problem. There is a simple bash one-liner that ensures child processes are killed, doesn't require root, and uses one of the author's own examples of a process leak:

    unshare -U --map-user=$(id -u) --map-group=$(id -g) -f -p sh -c '{ sleep inf &} &'
The author doesn't share this example because "user namespaces introduce a number of quirks, such as breaking gdb (by breaking ptrace), so they also can't be used by most users". I disagree that the container/unshare approach can't be used by most users. strace works just fine on my machine and gdb is able to attach and print a backtrace. Now it's true that gdb itself prints out a warning when I do this:

    warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
So:

- user namespaces do solve the "kill all child processes" problem

- strace still works

- gdb from outside the namespace isn't fully supported

So if you wanted to debug a child subprocess with gdb you'd presumably have to invoke gdb via unshare so it shared the same pid namespace (not tested).

  • user3939382 a year ago

    > There is a simple bash one-liner…

    > unshare -U --map-user=$(id -u) --map-group=$(id -g) -f -p sh -c '{ sleep inf &} &'

  • nh2 a year ago

    This does not work for me, at least not in the way most people would expect it to.

    When I run:

    > unshare -U "--map-user=$(id -u)" "--map-group=$(id -g)" -f -p sh -c 'command sleep infinity'

    then Ctrl+C has no effect.

    Sending SIGTERM to `unshare` has no effect.

    Sending SIGKILL to `unshare` reparents the inner process `sleep infinity` to my computer's top-level systemd, where it continues to run (process leak).

    So not sure how that ensures "child processes are killed".

    You need at least the `--kill-child` flag:

    > unshare -Ufp --kill-child -- bash -c "command sleep infinity"

    See https://unix.stackexchange.com/questions/393210/why-does-uns...

    I added the `--kill-child` flag to unshare because Linux did not offer a reliable way to kill child processes when pressing the "Cancel build" button in my CI pipeline.

    With the above, SIGKILL against `unshare` will reliably tear it down and everything below it.

    But Ctrl+C still has no effect, and SIGTERM against `unshare` still has no effect. So I agree with the post author that the Linux process API is unreliable. This stuff should be easy.
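
    If you want the same effect from C without going through unshare, the building block (as I understand it, the one `--kill-child` is based on) is prctl(PR_SET_PDEATHSIG, ...). A rough sketch, not the actual unshare source:

        /* Rough sketch: the child asks to be SIGKILLed when its parent dies.
           Combined with a PID namespace (where killing the namespace's PID 1
           takes the whole namespace down), this is what makes teardown reliable. */
        #include <signal.h>
        #include <sys/prctl.h>
        #include <unistd.h>

        int main(void) {
            pid_t pid = fork();
            if (pid == 0) {
                prctl(PR_SET_PDEATHSIG, SIGKILL);
                if (getppid() == 1)          /* parent already died before the prctl */
                    _exit(1);
                execlp("sleep", "sleep", "infinity", (char *)NULL);
                _exit(127);
            }
            pause();   /* when this parent dies (even by SIGKILL), the child dies too */
            return 0;
        }

    Caveats: it only covers the direct child (the PID namespace is what handles the grandchildren), and the setting is cleared when exec'ing set-uid programs.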

jmmv a year ago

I went down this particular rabbit hole of trying to terminate processes reliably a few years ago. The context was termination of tasks and tests by Bazel.

You can find the text at https://jmmv.dev/2019/11/wait-for-process-group.html and in the two extra linked posts in the last paragraph, one providing a solution for Linux and the other for macOS. It isn't pretty or easy, but it's doable to varying degrees of correctness.

cryptonector a year ago

Yup. PIDs are racy unless they refer to your direct children and you haven't reaped them yet. And it goes on.

Windows has a much better process API, except for CreateProcess() (the less said about which the better).

One thing I generally do when I have a multi-process program (one that starts multiple worker processes, say), is to have a pipe with the write end only in the parent process and whose read end the children include in their I/O event loops. That way when the parent exits the children find out and then they too exit. The parent will still try to signal them, but say the parent gets `SIGKILL`ed: the children find out and they exit.
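
A minimal sketch of that pipe arrangement (error handling omitted), in case it isn't obvious how the children "find out":

    /* Parent keeps only the write end; each child keeps only the read end and
       polls it alongside its other fds. EOF on the read end means every copy
       of the write end is gone, i.e. the parent has exited (even via SIGKILL). */
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        pipe(fds);

        if (fork() == 0) {                      /* worker child */
            close(fds[1]);                      /* important: drop the write end */
            struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
            for (;;) {
                poll(&pfd, 1, -1);              /* real code also watches work fds here */
                char c;
                if (read(fds[0], &c, 1) == 0) { /* EOF: parent is gone */
                    fprintf(stderr, "parent exited, shutting down\n");
                    _exit(0);
                }
            }
        }
        close(fds[0]);                          /* parent keeps only the write end */
        sleep(2);                               /* pretend to work, then die */
        return 0;
    }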

  • rand_flip_bit a year ago

    Curious why you think CreateProcess is worse than fork/exec. Sure it takes about a dozen parameters, but is that really the end of the world?!? It’s much much easier to use correctly and doesn’t have nearly as many pitfalls as fork/exec, especially in large processes with lots of memory allocated. I genuinely don’t understand why people dislike it so much.

    • cryptonector a year ago

      I don't think CreateProcess() is worse than fork/exec, and I didn't even mention fork/exec. It's natural to jump to the conclusion that "dislike of CreateProcess()" -> "preference for fork/exec", but I'm a bit of a contrarian: I think fork() is awful: https://news.ycombinator.com/item?id=30502392

      My specific complaint about CreateProcess() is everything to do with how the child process is bootstrapped and how arguments get passed (as a single string! that the child has to parse as if it were a shell! gaaaa!!!).

      • msm_ a year ago

        Correct me if I'm wrong, but this is how Windows' ABI works. Even command line applications have to parse their arguments, and this is just something that C's runtime usually does before main (in contrast to the Linux ABI, where every process gets an argument array).

        • rand_flip_bit a year ago

          Pedantically, C’s runtime doesn’t parse the command line arguments, CommandLineToArgvW does, and that function lives in shell32.dll, not the C runtime (nor is it dependent on the C runtime). Ofc the C runtime is free to ignore that routine and implement parsing itself using the results from GetCommandLine. Either way though, yeah, it has to be parsed in the child process. But that isn’t a huge bottleneck.

      • LgWoodenBadger a year ago

        What else is there besides fork/exec?

        • thayne a year ago

          There is posix_spawn, which on linux I think is implemented using clone (which is more flexible than fork) and exec.

        • cryptonector a year ago

          vfork/exec. posix_spawn(). Yes, CreateProcess(), but with the specific things about it that suck fixed.

          • rand_flip_bit a year ago

            See my other comment on posix_spawn. It has its own issues and TOCTOU bugs that prevent it from being an actual alternative.

            • cryptonector a year ago

              What, O_CLOEXEC? Well, yes, you really need to use the new syscalls that let you set O_CLOEXEC atomically on new FDs.

              • rand_flip_bit a year ago

                There is no such thing that allows you to avoid the race condition. There are a few issues:

                - You can't control every call to open, and thus enforce O_CLOEXEC on all files (by default)

                - Because of this, most files will be leaked into the child, this is likely not desirable, especially since some distros have low open file limits for processes

                - posix_spawn (the replacement for fork/exec) allows you to specify a list of file actions to perform, including opening and closing, this seems like a solution at first glance

                - However, there is a TOCTOU race here, as you need to first make a list of file actions with posix_spawn_file_actions and then call posix_spawn. Note that every file you want to close needs to have its own file action; this means you need to determine all the files that are open and manually add each one. This alone introduces the problem of determining all open files in your process.

                - In a multi-threaded program it is possible for another thread to open a file between the calls to posix_spawn_file_actions and before posix_spawn, thus creating the potential for files to leak into the child.

                - Even in a single threaded program, it is possible for posix_spawn to invoke functions established with pthread_atfork, and atfork handlers are allowed to call signal-safe functions, including but not limited to open. Implementations aren't required to call atfork handlers, and modern glibc doesn't, but this is by no means guaranteed.

                - Therefore, my argument is that posix_spawn cannot be used to create a process with a guaranteed minimal and clean state, and so you are back to square-one with fork/exec.

                The defaults for working with these APIs are just completely wrong, and very hard to get correct. The issues with fork/exec are numerous and nuanced, and most people simply aren't aware of the issues or don't care. There is a specific song and dance that needs to be performed when using fork/exec and usually you want to hide all of that behind a library function... which will look something similar to CreateProcess.... sure you might use the builder pattern to make it look nicer, but you really don't want the fork/exec split.

                Here are few other issues with fork/exec, non-exhaustive:

                - Only signal-safe functions can be invoked between fork and exec. This means you need to be super careful with any stdlib code you invoke between these two (or better yet, just don't).

                - Multithreaded programs cannot call fork without exec. period. The state of objects such as mutexes and condition variables will be inconsistent. This is implied by the above, but I wanted to specifically call this out.

                - Detecting whether exec itself failed (as opposed to the program failing) requires using an extra pipe marked with CLOEXEC; I have seen too much code using a magic exit code instead (which is wrong). See the sketch at the end of this comment.

                - Cleaning up the state of the child process and not accidentally creating a zombie is a bit tricky and there are some race conditions to be aware of. pidfd is not a solution if you need to support older kernels, although helps tremendously.

                - Interaction with signals is a bit messy.

                - When fork is called, all pages will be marked as copy-on-write, this can be slow for processes with lots of memory allocated, and is completely redundant if your goal is to call exec. If other threads exist and are writing to memory, the pages they touch will be copied unnecessarily.

                - Like I harped on earlier, files are inherited by default, not the other way around. You should be required to manually list the fds that you want the child to inherit (likely stdin, stderr and stdout only for 99% of cases).

                - Distinguishing exec failure from exec succeeding but the process then failing requires a CLOEXEC pipe (as above)

                - If exec fails, _exit must be called! You cannot terminate the child in any way that might run destructors or invoke callbacks/handlers, as these can perform I/O and would thus be observable.

                CreateProcess is just much better, and the whole "it takes 12 parameters how awful" argument against it is 100% a non-issue. It isn't 1960 anymore, it's okay to have a function with a name longer than 6 letters and more than 3 parameters.
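
                For reference, the CLOEXEC-pipe trick I mentioned looks roughly like this (a sketch only, error handling mostly omitted; assumes pipe2, which Linux and the BSDs have):

                    /* Tell "exec failed" apart from "program ran and failed":
                       the write end is O_CLOEXEC, so a successful exec closes it
                       and the parent reads EOF; a failed exec writes errno into it. */
                    #define _GNU_SOURCE
                    #include <errno.h>
                    #include <fcntl.h>
                    #include <stdio.h>
                    #include <string.h>
                    #include <sys/wait.h>
                    #include <unistd.h>

                    int main(void) {
                        int p[2];
                        pipe2(p, O_CLOEXEC);

                        pid_t pid = fork();
                        if (pid == 0) {
                            close(p[0]);
                            execlp("no-such-program", "no-such-program", (char *)NULL);
                            int err = errno;               /* exec failed; say why */
                            write(p[1], &err, sizeof err);
                            _exit(127);                    /* _exit, never exit */
                        }
                        close(p[1]);
                        int err;
                        if (read(p[0], &err, sizeof err) == sizeof err)
                            fprintf(stderr, "exec failed: %s\n", strerror(err));
                        /* EOF instead means exec succeeded; wait for the child normally */
                        waitpid(pid, NULL, 0);
                        return 0;
                    }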

                • cryptonector a year ago

                  > There is no such thing that allows you to avoid the race condition. There are a few issues:

                  > - You can't control every call to open, and thus enforce O_CLOEXEC on all files (by default)

                  Eh, you can if you open-code everything that might not call `pipe2(2)`, or `accept4(2)`, etc. It's not great, but it is possible. You can also LD_PRELOAD a shim to make everything do that -- also not great, but possible.

                  You can also do the spawn server thing, which solves the problem, though you need to spawn the spawn server early, and if not, well, yeah.

                  You can also close (on the child side of vfork()) all the FDs you don't know. That mostly works, unless you are running in a context where you need to keep some FDs you don't know about.

                  We don't have a time machine. We only have these workarounds. It's not all bad.

    • jborean93 a year ago

      Most of the complaints I've seen are about the number of args and the complexity of calling it vs something simple like fork. There are a lot of knobs to turn which you need to be explicit about. That's not even getting into the whole ProcThreadAttributeList and the myriad of options it exposes.

      In saying all that I do prefer the `CreateProcess*` APIs on Windows vs the POSIX ones but that might be because I understand the former better.

      • rand_flip_bit a year ago

        It’s fair to say the initial surface area of the API is more complex (at first glance). However, fork/exec don’t really scale that well beyond toy examples you see in blogs or CS101. Thing is, if you are using fork/exec, you are likely also calling a few other functions for things like redirecting IO and dealing with exec failing (creating a pipe with O_CLOEXEC), and you need to remember to use _exit instead of exit (or anything else) if exec fails. Plus there is all the complexity of dealing with the child pid, signals, not accidentally creating a zombie, etc. Ohh not to mention that all the pages in the process's memory need to be marked copy-on-write when fork is called, only for that to be reversed when exec is called. I hope you don’t have dozens or hundreds of gigabytes allocated in a multithreaded program, that will cost you. Ohh and of course, you can’t call anything between fork and exec that isn’t signal safe, so you need to be really careful, especially in languages with implicit allocation or exceptions. Ohh bonus points for the fact that it’s ill-formed for multithreaded programs to call fork without exec since that can leave things like locks in inconsistent states.

        The way both APIs handle file inheritance is absolutely horrendous, especially since most libcs don’t set the necessary flags. posix_spawn doesn't solve this either, since pthread_atfork handlers can open more files in the child, and multithreaded programs can have a TOCTOU bug if another thread opens files between the call to posix_spawn_file_actions and posix_spawn. TBF Windows is actually worse in this regard, since its race condition is a little more subtle. Ironically the best way to manage all of this nonsense is to create a child process first thing in main (another binary), which isn’t multithreaded, closes almost all files (uses a whitelist of fds) upfront, and spawns processes on behalf of the parent when requested (IPC). Ideally you would write this in C, minimize usage of libc, and avoid allocating tons of memory.

        • stevenhuang a year ago

          > Ironically the best to way manage all of this nonsense is to create a child process first thing in main

          That's a sensible strategy. Do you know if this design pattern has a name, of creating this "clean" process image base for forks via ipc?

          • cryptonector a year ago

            I would call it a spawn server, based on the "open server" or "doas" pattern, where you have a process that creates children to setuid() (etc.) which then set up a private IPC service that open()s files and sends back the open FD to the caller.

  • emmelaich a year ago

    There's a fun story about pids and TOCTOU. If you start the process then signal it soon after, you are extremely unlikely to run foul of the race.

    Except ... on AIX at some time, they randomised pid usage. So it was now far more likely!

    The reason they randomised pid usage was because the random number routine used the pid+time+something else!

  • monocasa a year ago

    pidfds solve some of those problems.

    • cryptonector a year ago

      Indeed, they do.

      One can approximate pidfd in multi-process programs on OSes that lack it, but that's about it. pidfd needs to be first-class.
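
      For the record, a small sketch of what "first-class" buys you on Linux (pidfd_open is 5.3+, pidfd_send_signal is 5.1+; raw syscall() used in case libc lacks the wrappers):

          /* The fd pins the identity of the process, so signalling through it
             can never hit an unrelated process that happens to reuse the PID. */
          #define _GNU_SOURCE
          #include <poll.h>
          #include <signal.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>

          int main(void) {
              pid_t pid = fork();
              if (pid == 0) { execlp("sleep", "sleep", "5", (char *)NULL); _exit(127); }

              int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
              if (pidfd < 0) return 1;

              syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);

              struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
              poll(&pfd, 1, -1);            /* becomes readable when the process exits */
              waitpid(pid, NULL, 0);        /* still our unreaped child, so reap it */
              close(pidfd);
              return 0;
          }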

nickdothutton a year ago

It is after reading pieces like these that I'm reminded of how fortunate I am to have had experience of other "serious" Operating Systems, used at scale, in complex and sometimes unfriendly environments. Namely VAX/VMS. Although some might feel the title was a little clickbaity, I enjoyed the article.

  • DeathArrow a year ago

    VMS was released for x86, so if you miss it you can give it a spin.

    https://vmssoftware.com/about/news/2022-07-14-openvms-v92-fo...

    • skissane a year ago

      Thus far the x86 port is only available to paying customers. An x86 hobbyist program is expected very soon now (within the next few days/weeks). Until then, the best x86 option for hobbyist use is probably running the Alpha version under an emulator. (I don't know if any Itanium emulators are available.) Or an emulated VAX, though OpenVMS for VAX is no longer legally available to hobbyists; it's not hard to find if you don't care about the legalities of it.

      • icedchai a year ago

        Any idea how much a license is? I'd legit buy one for personal use, but if it's thousands of dollars I'm better off sticking with my old Alpha.

        • skissane a year ago

          AFAIK, VMS Software doesn’t publish a price list.

          But I expect it would be several thousand dollars. I doubt they really want hobbyists as paying customers because the cost of servicing them is likely to greatly outweigh the revenue potential.

deathanatos a year ago

> 1.1.4 A should run B inside a container

I think the author knows this, but you don't have to start a full-blown container if all you want is to solve the article's stated problem of process leaks. Just use a new pid NS: point 1, the subprocess.run criticism is fixed (it just works); point 2, I don't believe a pid NS requires either root or a user NS; and all that remains is point 3. It doesn't require you to start a separate init, you can be the init, i.e., whatever your top-level service is. IIRC, the only two requirements are handling SIGTERM (which you should probably already be doing) and reaping reparented orphans who then die. But also dumb-init is available? The article notes using a separate init, too: "This init process will do nothing but increase the load on the system, and it will prevent us from directly monitoring the started processes." and … no? dumb-init, in a container I have here that's run for >2 weeks, has used < 20 ms of CPU time. RSS of 522 KiB. You'll be fine. I'm not sure how it "will prevent us from directly monitoring the started processes" — it would live above you in the process tree. You'd monitor the started process the same way you would any started process.

Edit: ah, crap, I've got it wrong. A new PID NS requires root (or user NSes); being a subreaper, I think, maybe does not. But I'm not sure being a subreaper is sufficient; you want the subtree reaped on the subtree root's death.

(I'm also not sure that the subreaper approach is sufficient: if the subreaper itself dies, the processes leak.)
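
For completeness, the subreaper mechanism being discussed is just one prctl (Linux 3.4+); a minimal sketch, which as noted only lets you see and reap the orphans, not kill them as a group:

    /* Orphaned descendants reparent to us (the subreaper) instead of to init,
       so we at least get to wait() on them. If we die, they still leak. */
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        prctl(PR_SET_CHILD_SUBREAPER, 1);

        if (fork() == 0) {          /* intermediate child */
            if (fork() == 0) {      /* grandchild, outlives its parent */
                sleep(1);
                _exit(0);
            }
            _exit(0);               /* orphans the grandchild onto us */
        }

        pid_t p;
        while ((p = wait(NULL)) > 0)    /* reaps the child and, later, the orphan */
            printf("reaped %ld\n", (long)p);
        return 0;
    }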

  • mike_hock a year ago

    The subreaper is also gonna have the same footprint as the pidns init, and is more complicated.

    It's just as flawed a solution as the other flawed solutions. We can accept the subreaper being bug-free as a requirement for this workaround to work, but we can't prevent it from being sigkilled.

dataangel a year ago

> Shell scripts make starting processes trivial, but it's almost unthinkable that, say, bash, would integrate functionality for starting containers, so that every process is started in a container.

Doooooooo it

  • edgyquant a year ago

    Is it me or does this not make sense? Bash glues and pipes together commands, has network access, etc. Every process being a container would require either knowing all commands and being able to ensure containers have proper access (even across pipes), or that containers were so open as to defeat the purpose.

    • ryukafalz a year ago

      Bash may need a high level of authority as it's spawning processes and wiring them together, but those processes don't necessarily need to have as much access as you do themselves.

      Take the venerable `cowsay` for example. Currently, running `cowsay` (as with any other program) can cryptolocker your hard drive, delete all your files, reach out to arbitrary servers on the internet, etc. But how much access to your system does it actually need to do its job? Well, mostly... STDIN, and STDOUT, really.

      Yes, actually doing this is complicated. Another reply has linked to some broad info about object-capability security, but here's a good introduction to the subject: http://habitatchronicles.com/2017/05/what-are-capabilities/

      ...and this paper is an excellent deep dive into a capability system: https://mumble.net/~jar/pubs/secureos/secureos.html

    • lozenge a year ago

      The problem is every executable can impersonate the user, it has access to do anything the user can do, including deleting or encrypting all their files, reading ssh private keys etc. Network access is rarely concerning unless the program has access to credentials.

      • nyrikki a year ago

        Nothing is stopping you from using namespaces, and containers are just namespaces with cgroups etc

        But containers aren't jails, pid and uid remapping is just remapping.

        A huge problem is that containers have to drop capabilities on the honor system. In the default docker mode, running as root, anyone who can launch a container can read from any block device if they don't drop the mknod capability, as an example.

        Actually a privileged container can update the bios or even load arbitrary kernel modules in the host context or change kernel parameters as it is a shared kernel.

        I tried to get the docker folks to add a conf option to disallow privileged containers but they refused.

        You can run in user mode now but most people want persistence and other features that don't allow for that.

        The important point is that if you assume containers are a security feature you are going to have a bad time. Jails were bad enough, and containers are just one step up from chroots as far as security goes.

        namespace isolation is the main benefit of containers.

        Selinux and apparmor are far more appropriate than containers for the security concerns. While I don't personally like selinux, apparmor profiles are pretty easy to write.

        • nyrikki a year ago

          Plus the 'leaks' in the Linux process API are even worse, as each container may run its own tiny init.

          Containers make the first point of the OP far worse by adding way more pid namespaces.

      • Karellen a year ago

        > The problem is every executable can impersonate the user,

        Um, what?

        What do you mean by "impersonate" here? What does a process that does not impersonate the user look like? Do you just mean "executables that run as the user"?

        When you log in, and a shell is started that runs as you, is that shell impersonating the user?

        When you execute commands, as yourself, those commands run with your credentials. Because you ran them. Isn't that, like, the point?

      • bombolo a year ago

        Use firejail then

    • wmf a year ago

      Maybe cgroups would be better than full containers here.

      • GauntletWizard a year ago

        Which cgroups? Containers are not actually a thing in kernel-land. They're a combination of Process, Network, User, and other namespacing.

        • wmf a year ago

          No, cgroups are a separate API from namespaces. https://man7.org/linux/man-pages/man7/cgroups.7.html

          • GauntletWizard a year ago

            You're not wrong, but the point remains - Are you going to limit their CPUs? Are you going to limit their RAM? Network Performance?

            The collection of cgroups and namespaces (and for all that they are different APIs, you almost never use one without the other, so perhaps it's best to refer to the whole group of them as "Containers" or "Containment" to differentiate it from Docker-style containers) is complex and flexible for a reason, even if an absurd proportion of the common cases can be solved with a reasonable set of defaults.

1vuio0pswjnm7 a year ago

"I only know one existing solution that fixes all these problems without sacrificing flexibility or generality.

Use the C utility supervise to start your processes; for Python, you can use its associated Python library."

A C utility written in 1999, last updated in 2001. I'm still using it every day, not always with multilog and svscan.

ben0x539 a year ago

"Simply use systemd/cgroups," they tell me, and now I'm trying to figure out how my systemd service somehow ended up controlling processes for another service.

  • remram a year ago

    Can you create cgroups as an unprivileged user?

    • ben0x539 a year ago

      Oh, good question, I have been running some stuff as systemd user services but I don't actually know if they get real cgroups.

aidenn0 a year ago

Do cgroups solve any of these problems? I was mildly surprised to not see them mentioned.

  • wmf a year ago

    Where the author talks about containers you can mentally substitute cgroups since Linux containers are cgroups + namespaces.

    • rcoveson a year ago

      That's how I look at it too, but lots of people don't look at it that way, hence all the handwaving about "too heavyweight" and "seems like overkill" etc.

      Largely because of Docker and Kubernetes, many think of a container as all of the following:

      1. A cgroup + [all or nearly all of the] unshare-able namespaces

      2. A writable, disposable overlay on top of an immutable "image", which may be lazily downloaded and extracted

      3. A resource managed by a userspace daemon managed by a userspace utility over a socket

      4. Optionally, a seccomp-bpf filter or apparmor profile or something

      But there's a whole useful spectrum between a vanilla process and a Docker container like that. Lots of points on that spectrum still feel highly container-ized but aren't really much more heavyweight than a vanilla process.

      Beyond that, in the point about PID namespaces, the author should mention that there are ultra-light-weight init implementations that are barely a factor in overhead.

  • remram a year ago

    Can you create them at all, as an unprivileged process, or does it need something else to set it up for you like systemd?

    If you have to rely on systemd then arguably it's a systemd solution, not a Unix or Linux one.

  • jwilk a year ago

    They were mentioned in 2.1.1.

dataflow a year ago

It seems there isn't anything written about FD_CLOEXEC and its associated race conditions either, as far as I can tell. Basically it's impossible to portably spawn a subprocess in a safe manner if you don't have sufficient control over all the code running in your process, because you might duplicate file descriptors into the child that you didn't intend to, and that can break things in the parent.
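
Concretely, the race looks like this (the same pattern applies to dup vs. dup3, socket() vs. SOCK_CLOEXEC, and so on; a sketch only):

    #include <fcntl.h>
    #include <unistd.h>

    int open_racy(const char *path) {
        int fd = open(path, O_RDONLY);
        /* window: if another thread fork()+exec()s right here,
           the new child inherits fd and you may never get it back */
        if (fd >= 0)
            fcntl(fd, F_SETFD, FD_CLOEXEC);
        return fd;
    }

    int open_atomic(const char *path) {
        /* the flag is applied atomically with the descriptor's creation */
        return open(path, O_RDONLY | O_CLOEXEC);
    }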

  • jclulow a year ago

    > you might duplicate file descriptors into the child that you might not have intended, and that can break things in the parent.

    This is not true on all systems. For example, on illumos we have fdwalk(3C) and in particular closefrom(3C). These allow you to dup2() your intended descriptors to a contiguous block starting at fd 3 (assuming you keep the stdio descriptors at 0-2) and then close everything from one past the last descriptor you intend the child to inherit.

    https://illumos.org/man/3C/closefrom

    • dataflow a year ago

      Interesting, but how would the subprocess understand the resulting FDs? Doesn't that mean you can't really launch a subprocess that doesn't cooperate? In particular, A launches B, and B launches C, then C would have no idea what the FD numbers represent unless B somehow actively receives that information from A through some custom mechanism and communicates that information to C... right? So if B is a random general-purpose process (say, a Bash or Python process) then you're out of luck? Unless I'm missing something.

      • jclulow a year ago

        I think there are two kinds of transitions:

        - on fork, you are by definition implicitly cooperating with the parent as you are the same software image as the parent; in the case of transitive forks you can be expected, I think, to continue cooperating.

        - on exec, you move from implicit cooperation to the need for explicit cooperation, through some kind of interface contract; e.g., the BASH_XTRACEFD environment variable. It is incumbent upon you to hide the good silverware before making the exec transition, as exec is a trust boundary; like when you sell your house, you don't get to control what the new owner of the process does with it, except through kernel-enforced mechanisms like privileges and resource controls and so on.

        • dataflow a year ago

          I think you're introducing complexity into the discussion that isn't relevant. All I'm saying is if you do something like

            (rm -f /run/lock/temp && exec 5<> /run/lock/temp && echo test123 > /run/lock/temp && perl -e 'system("bash -c \"cat /dev/fd/5\"")')
          
          it needs to work fine, and this shouldn't require any particular cooperation between Perl and Bash.

          FDs need to be inherited correctly... just like environment variables already are.

          • wahern a year ago

            Every explicit or implicit redirection in that shell line is basically a call to dup2. At its core, the shell command line processor is effectively[1] a trivial interpreter loop that maps each successive token or group of tokens to fork, exec, open, or dup2. This is what makes fork/exec elegant as compared to CreateProcess; it's just not apparent unless you appreciate how the parts are intended to work together and what they can accomplish.

            [1] It's literally a small, trivial loop in the original Bourne shell source code. In modern shells job control and other niceties require additional bookkeeping, but those aren't necessarily implicated in your example. I don't have a URL at hand for that source file, but I did find it last year or the year before, so it shouldn't be too hard to find if you're curious.

            • dataflow a year ago

              I'm sorry, I 100% disagree. A trivial interpreter loop requires fewer lines of code with fork/exec than CreateProcess, therefore the former is more elegant? It might make sense to judge artistic works that way, but is that really the correct yardstick for judging the engineering of a system by anyone other than a CS 101 student tasked with writing a basic REPL? That's like saying adding 2 numbers with my abacus is easier than with your calculator, therefore my abacus is more elegant than your calculator.

  • rwmj a year ago

    AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but the possibility that some library might not be using it? (That is to say, *_CLOEXEC if used does not have race conditions)

    However we usually cope with that by closing all unknown/unexpected file descriptors after fork and before exec. Linux even has a system call to make that easier: https://man7.org/linux/man-pages/man2/close_range.2.html
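
    On Linux that step can be a single call; a sketch of the fork-side cleanup, assuming close_range(2) is available (kernel 5.9+, glibc 2.34+ for the wrapper):

        #define _GNU_SOURCE
        #include <unistd.h>

        /* Between fork() and exec(): keep stdin/stdout/stderr, drop everything else. */
        static void sanitize_and_exec(char *const argv[]) {
            close_range(3, ~0U, 0);
            execvp(argv[0], argv);
            _exit(127);    /* exec failed; never run atexit handlers in the child */
        }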

    • dataflow a year ago

      > AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but the possibility that some library might not be using it?

      Not exactly. The problem is that you have to be able to set it atomically from the creation of the file descriptor. Setting it after creation is subject to a race condition where a fork occurs in the interim. There's no portable way to do that, and people often ignore O_CLOEXEC even when there's a platform-dependent way to pass it. (How often do you see dup3() called, for example? And how often do you see higher-level languages and libraries expose this and force callers to make a conscious decision?)

      > However we usually cope with that by closing all unknown/unexpected file descriptors after fork and before exec.

      You can't really do that portably (well, maybe unless you want to call close() billions of times). And even if/when you can do that, you run into the reverse problem, where you might close descriptors that were supposed to be duplicated into the subprocess but that you didn't know about. (One example is when a user performs redirect inside a shell like 2>&3 and wants it to work inside a descendant process - you don't want to just randomly close FDs you don't recognize.)

      • mike_hock a year ago

        TFA talks specifically about Linux (despite the title). AFAIK most if not all Linux APIs provide a way to set CLOEXEC on creation (as in, there is at least one alternative that does). So that's a solution, but of course, every library must adhere to it.

        Or you have to accept that when spawning a new process, you have to know which FDs you want to leak.

        But doing it portably? Lol yeah, forget it. Posix is a farce.

        • wahern a year ago

          > Posix is a farce.

          accept4, dup3, pipe2, F_DUPFD_CLOEXEC, MSG_CMSG_CLOEXEC, and SOCK_CLOEXEC are already in the draft for POSIX Issue 8. Most of these have already been in the BSDs and Solaris for a few years, now, at least. macOS is the major exception.

          Also in the draft are close-on-fork analogs: F_DUPFD_CLOFORK, FD_CLOFORK, MSG_CMSG_CLOFORK, O_CLOFORK, and SOCK_CLOFORK.

          • mike_hock a year ago

            Cool, better late than never.

      • rwmj a year ago

        Oh for sure (and I meant O_CLOEXEC, not FD_CLOEXEC) ...

        We are fairly thorough with using dup3, pipe2 etc in our own code, on platforms that support it. Of course hard to control other people's code.

wmf a year ago

I kept expecting Capsicum to step from behind the curtain but no.

kwhitefoot a year ago

That was interesting and clearly written, I wish all such articles were as clear.

jamesdutc a year ago

I recently wrote an autorunner[1] (like Entr[2] and Watchexec[3]) so I have some recent exposure to this problem. (I will be releasing it on Github shortly.) My autorunner allows running interactive programmes, so it is very sensitive to lingering child processes.

For the purposes of the autorunner, I use approach 1.1.3 (“always write down the pid of every process you start, or otherwise coordinate between A and B”) and leave it to the user to figure out what happens if the child process misbehaves with relation to any processes it starts.

However, I want to point out that approach 1.1.4 (“A should run B inside a container”) is easier to do than one might expect, and I'd like to plug one of my favourite utilities—Bubblewrap[4]. The Bubblewrap documentation says “[y]ou are unlikely to use it directly from the commandline, although that is possible” but I have built some amazing little tools from it.

Try the following invocation:

    bwrap --ro-bind / / --proc /proc --unshare-pid ps
This launches `ps` in a PID namespace with a new `/proc` (since `ps` will read from the host proc otherwise) and the root filesystem mounted readonly. Any processes within the PID namespace should have been created by the immediate command that `bwrap` launched. There are also flags `--die-with-parent` and `--as-pid-1` which can further reduce runtime overhead. If you really need a supervisor process, this can be as simple as a `/bin/sh` script that runs `kill TERM --timeout 1000 KILL` in a loop on everything it sees in `ps`.

As you can see, there's a lot you can do with this tool with significantly lower overhead than using Docker. It has been my goal for some time to extract some of the functionality of Bubblewrap into a Zsh extension to allow accessing these mechanisms with even lower overhead. I think the creation of namespaces is a missing primitive in Linux shells, and being able to quickly construct namespaced environments allows for a style of safe, robust, simple shell scripting. e.g., if you create a mount namespace to run your script, you can actually be looser about parameterising file locations (since the namespace can ensure everything is exactly where you want it to be.)

[1] https://fosstodon.org/@dontusethsicode/110019380909461936

[2] http://eradman.com/entrproject/

[3] https://watchexec.github.io/

[4] https://github.com/containers/bubblewrap

  • jrootabega a year ago

    Looks interesting. Have you needed or found any good ways to detach the wrapped code from the terminal where you first launch the wrapper? (for security mostly) I haven't found a good way to do that with bwrap other than using sudo or su and their pty feature. bwrap's --new-session flag didn't play nice with interactive programs in my attempts.

jiveturkey a year ago

Too bad the article doesn't discuss contracts, the Solaris solution. As the article is very linux focused, I imagine the author is blissfully unaware.

  • Arch-TK a year ago

    The article mentions non-linux solutions to these problems, such as those in FreeBSD, but realistically this is a Unix wide problem as irrespective of what solution any particular flavour of Unix gives for these problems, there's no standard solution.

    • jiveturkey a year ago

      100% agree on the breadth. However, each flavor should learn from what the others have done, especially if you write about it as an educational piece vs. just documentation. I'm just expressing my disappointment that known excellent solutions aren't part of the discussion.

kajaktum a year ago

Aside from having to run a short command line program to get stdout/stderr from, I have never had to use processes. It has always been simple threads. Why and when would I want to spawn a process instead?

  • adrianN a year ago

    For example because you want better isolation guarantees than threads provide. Say you spawn workers running requests for different users with different permissions.

evilotto a year ago

Is basic fork/exec from a large process still slow or have newer apis fixed that?

  • anfilt a year ago

    On Linux fork is pretty fast since it's copy-on-write. Exec or anything similar is gonna be slower. For example, say you want to exec some program in ELF format. exec is likely to involve these things: you likely have to load the file from non-volatile storage, then parse the ELF file and place each section in the correct place in memory, set up other memory maps, the stack, etc. Not just that: the runtime linker probably has to run, loading and linking each .so file and patching code with the correct relocations as needed. Loading a new process is a lot more expensive, whereas fork can reuse a lot of that work, since it's just a clone of the process at the point fork was called, where all of that was already done.

    • evilotto a year ago

      Unless something has changed in the past decade, fork on linux is really slow when you're forking a large multi-GB process because even though the memory is COW all the kernel page tables still need to be copied. This is one reason why large servers will have a separate "launcher" process that is forked early on rather than forking themselves. (the other reason of course is threads).

      • anfilt a year ago

        Well that is needed to support isolated address spaces. It might be possible to share some of the page tables safely. Still, copying page tables is a lot faster than copying many gigs of data around, and for most processes that don't have massive page tables this is pretty fast. You can also use things like MADV_HUGEPAGE to make the page tables smaller.

        Now if this is a problem and the application does not need address space isolation, clone() with the CLONE_VM flag will get rid of it, and clone() is nice in that you can get more granularity/control. That control covers most of the continuum between a thread and a separate process, depending on your needs.

    • pjmlp a year ago

      The problem with UNIX is the same as with Web browsers: in theory POSIX should be the same everywhere, in practice not really.

      So whatever Linux does better, or any other UNIX for that matter, might be a pool of surprises somewhere else.

  • pengaru a year ago

    vfork() / CLONE_VFORK / posix_spawn()
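
    FWIW the posix_spawn() route looks like this (a minimal sketch; modern glibc implements it with CLONE_VM|CLONE_VFORK under the hood, as I understand it, so the parent's page tables are not duplicated):

        #include <spawn.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/wait.h>

        extern char **environ;

        int main(void) {
            pid_t pid;
            char *argv[] = { "echo", "hello from the child", NULL };
            int rc = posix_spawnp(&pid, "echo", NULL, NULL, argv, environ);
            if (rc != 0) {
                fprintf(stderr, "posix_spawnp: %s\n", strerror(rc));
                return 1;
            }
            waitpid(pid, NULL, 0);   /* child is reaped like any fork()ed child */
            return 0;
        }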

chubot a year ago

DJB's self-pipe trick solves the awkwardness of #4 -- waiting for a process exit plus other events non-deterministically:

https://cr.yp.to/docs/selfpipe.html

FWIW a shell is basically two alternating, non-overlapping event loops:

- a select() loop on the input terminal FD for getting keystrokes (e.g. GNU readline)

- the waitpid(-1) loop for running code, i.e. get the next process that exited

It never actually does both at the same time -- it doesn't wait for processes and stream input simultaneously, which is awkward without the self-pipe trick.
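
In case anyone hasn't seen it, the trick is tiny; a sketch of the shape of it (error handling and the terminal side elided):

    /* Self-pipe trick: the SIGCHLD handler writes a byte to a pipe, turning
       "a child exited" into an fd event that poll()/select() can wait on
       right next to the terminal fd. */
    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int self_pipe[2];

    static void on_sigchld(int sig) {
        (void)sig;
        int saved = errno;
        write(self_pipe[1], "x", 1);              /* async-signal-safe */
        errno = saved;
    }

    int main(void) {
        pipe(self_pipe);
        fcntl(self_pipe[0], F_SETFL, O_NONBLOCK);
        fcntl(self_pipe[1], F_SETFL, O_NONBLOCK);  /* the handler must never block */
        signal(SIGCHLD, on_sigchld);

        if (fork() == 0) { execlp("sleep", "sleep", "1", (char *)NULL); _exit(127); }

        struct pollfd fds[2] = {
            { .fd = STDIN_FILENO,  .events = POLLIN },  /* keystrokes */
            { .fd = self_pipe[0],  .events = POLLIN },  /* "some child exited" */
        };
        for (;;) {
            if (poll(fds, 2, -1) < 0)
                continue;                          /* EINTR from the signal: retry */
            if (fds[1].revents & POLLIN) {
                char buf[64];
                while (read(self_pipe[0], buf, sizeof buf) > 0) {}   /* drain */
                pid_t pid;
                while ((pid = waitpid(-1, NULL, WNOHANG)) > 0)
                    printf("child %ld exited\n", (long)pid);
                break;
            }
            /* ...handle terminal input on fds[0] here... */
        }
        return 0;
    }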

---

Regarding adversarial processes, yes you need something like Linux cgroups to solve that problem. In traditional Unix, a process that can run arbitrary code can always escape your attempts to kill it.

IIRC you can start a Linux process in a freezer cgroup, and stop everything in the cgroup. I recall reading the docs for an HPC platform that does that, and I'm sure Docker does it in some way too.

---

I'd be interested in where `supervise` is used in production ... it seems like there is a bigger story behind this article!

(copy of lobste.rs comment)

  • teddyh a year ago

    > It never actually does both at the same time -- it doesn't wait for processes and stream input simultaneously

    Couldn’t you use a signalfd to receive the SIGCHLD, and listen to both the terminal and signalfd at the same time?

    • chubot a year ago

      So there was a correction here, zsh and fish do wait for processes in the interactive part, but apparently not bash/dash/mksh:

      https://lobste.rs/s/om32da/unix_process_api_is_unreliable_un...

      As far as your question, shells don't use signalfd() because it's Linux-only!

      Most shells predate not just signalfd(), but Linux itself :) I wouldn't use it in a newer shell for portability.

      If it ever came up, I probably I would use the self-pipe trick.

      BTW I think there was some argument awhile ago that signalfd() is actually quite bad in the presence of threads, but a shell doesn't have threads so that doesn't matter.

      • teddyh a year ago

        > but apparently not bash

        In bash, try “set -o notify” (same as “set -b”). This will turn on immediate notification of terminations of subjobs.

        So it’s not that bash lacks support for it, it’s that bash wants to be compatible with old Unix shells, but offers other behavior as an option.

ezekiel68 a year ago

There is zero new information in this article that has not been known since the early days of what became Unix. It seems to me that a reasonable summary of the content might be: "It is difficult to keep track of important things properly and no one should be asked to do it."

  • eru a year ago

    Computers should help us keep track of things.

    • ezekiel68 a year ago

      And sure enough, they do. When I fork() a process, I get a return code which may be a child pid. I can hold it as a variable, put it in a hashmap, save it to a file, output it to a logging stream, kill it, or send a signal to it. The API ain't perfect but the world has not come crashing down upon us due to this in the past half-century as far as I have seen.

      • eru a year ago

        Aren't you proving too much here?

        The argument you bring here can also be advanced in favour of Windows 95 running stable enough, can't it?

teo_zero a year ago

I don't understand why the article lists 4 issues, when 2 of them were solved in 2019:

> pidfd is a great solution to the third and fourth problems

So why talk about them? Is it just to prove the superiority of the author's own utility "supervise"?

bolangi a year ago

Where process supervision is required under Unix, you can use systemd, the Linux-only solution pushed by Red Hat, or one of the small supervision suites such as s6 developed by skarnet.org.

  • slondr a year ago

    What happens when s6 crashes, then?

jclulow a year ago

Many of these problems are at least partially solved on other UNIX systems like illumos. To run down the top-level list:

1. It's easy for processes to leak

illumos has contracts[1][2][3], which were developed as part of the Service Management Facility[4]. They are another process grouping abstraction that allows SMF to track trees of processes, whether they daemonise or not. They allow for tracking or ignoring certain events (e.g., a fatal signal sent from outside the contract, a process within the contract that aborts and dumps core, if a particular process, or all processes, terminate in some way) and for doing certain kinds of automatic cleanup (e.g., terminate all processes in the contract when the process that owns the contract is terminated). These are managed by the kernel, so they are effectively inescapable even when not held correctly by a user process.

2. It's impossible to prevent malicious process leaks

This is not really true for us either. Between contracts, and resource controls[5], and privileges[6], I expect one would be able to limit the malicious or accidental escape of processes from supervision or any run-away resource consumption caused by, say, a fork bomb.

3. Processes have global, reusable IDs

This is true on some level, but in practice I think that's just part of UNIX and when you have solved 1-2 and 4 in other ways, it's not actually that bad. If you want to kill everything in a contract you own, even without knowing the full list of pids, you can do that with ct_ctl_abandon(3CONTRACT)[7] which takes a contract file descriptor. The termination action (which could be to tear down all of the processes) will take effect then.

4. Process exit is communicated through signals

This is not entirely true, in the sense that they are by default for classic UNIX applications -- but they need not be. We have forkx(2)[8], which has the FORK_NOSIGCHLD and FORK_WAITPID flags. These request that SIGCHLD is not posted for process termination, and that the classic UNIX wait(2) family of calls will not receive notification or reap children. You can use these from a library, and then manage your own waitid(2) or waitpid(3C) calls on the specific process IDs you are responsible for reaping. You can also use contracts to receive notifications of events about processes within the contract coming and going.

[1]: https://illumos.org/man/5/contract

[2]: https://illumos.org/man/3LIB/libcontract

[3]: https://illumos.org/man/3CONTRACT/

[4]: https://illumos.org/man/7/smf

[5]: https://illumos.org/man/7/resource_controls

[6]: https://illumos.org/man/7/privileges

[7]: https://illumos.org/man/3CONTRACT/ct_ctl_abandon

[8]: https://illumos.org/man/2/forkx

  • pjmlp a year ago

    Yeah, but that is the thing with UNIX wars, there is POSIX and then whatever each UNIX variant does around it.

IAmPaigeAT a year ago

> unreliable and unsafe

Just like the information you’re providing if you can’t be bothered to set up an SSL certificate for a page

userbinator a year ago

Is "unreliable and unsafe" the new "considered harmful"? Because it sure feels like that.

  • calt a year ago

    I think it's quite a bit more descriptive and objective than "considered harmful."

wang_li a year ago

This reads like they have a set of requirements and since the Unix model doesn’t meet their requirements, the Unix model is bad. As opposed to it’s fine for those who have different requirements.

  • Arch-TK a year ago

    I think the article is pretty clear on that, if your requirements do not include "reliable" and "safe" then yes, the unix APIs meet your requirements. But if your requirements include "reliable" and "safe" then they definitely don't.

    • thayne a year ago

      That's overly general. The Unix APIs don't meet your requirements if you need to reliably and safely reap child processes and all of their descendants. And on Linux, there are non-POSIX APIs that help in certain situations (pid namespaces and pidfds).

      And for many, maybe most, applications, leaking a child process isn't really a problem.

      • Arch-TK a year ago

        Leaking child processes is inherently unreliable, it can lead to lots of issues in any non-trivial program, but even in the most trivial of programs it can lead to basic problems like accidentally fork-bombing yourself because you made a small programming error. There's no guaranteed way to avoid that, therefore to a lesser or greater extent, depending on the circumstances, it is not reliable.

        The APIs are inherently unsafe, this is because once you are in a situation where you have leaked a process, you can't safely kill it without risking other processes. In some situations this could even be abused by an attacker to cause other problems.

        So yes, in the most boring of applications where you have a user in front of a computer, there's no need for the reliability of processes having lifetimes bounded by the references held to those processes over their life. Likewise, there's probably no massive need for safety because if a process gets killed unintentionally, the user can fix it. But just because there exist circumstances where neither reliability nor safety are important, doesn't mean that the APIs aren't inherently unreliable and unsafe.

        Furthermore, yes, Linux has solutions for half of these problems, FreeBSD has solutions for other parts of them, and Solaris has solutions for yet others. The article calls out "the UNIX APIs" for a reason. The original APIs are as bad as the article says, and they have to be worked around with OS-specific, non-portable APIs. Which are, by definition, not "standard UNIX APIs" anymore.