fxtentacle a month ago

In case anyone wonders about the why: We have some services where we do updates without dropping a client connection by first spawning an additional server, then moving traffic over to the new server, and then the old server will detect that it's idle and its socket has been removed and terminate itself.

At least, that was the theory. In practice, freak accidents sometimes cause libUSB to hang when the process exits, so the old server keeps its file handles open. That means that even if log rotation deletes the log file, the disk space cannot be reclaimed, and over time the HDD mysteriously fills with deleted files still held open by hung processes. Of course, one could use killall to terminate all those servers. But that is difficult to coordinate so that you only ever terminate the old servers and never accidentally one of the new ones.
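
To illustrate the disk-space effect, here is a minimal sketch, assuming Linux with /proc mounted. The function name is made up for this example. It deletes a file while a descriptor is still open and prints what the kernel reports for that descriptor.

```shell
# Hypothetical demo helper, Linux-specific (relies on /proc).
# Prints the /proc symlink target of a file that was deleted while open.
show_deleted_open_file() {
    sdf_tmp=$(mktemp) || return 1
    exec 3>"$sdf_tmp"           # hold the file open on descriptor 3
    rm -f "$sdf_tmp"            # unlink the name; the data stays allocated
    readlink /proc/self/fd/3    # readlink inherits fd 3; target ends in " (deleted)"
    exec 3>&-                   # closing the last descriptor frees the space
}
```

On a real system, `lsof +L1` is one way to list open files whose link count has dropped to zero, which is what a hung server holding a rotated log looks like.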

That's where bombtag comes in. You first mark all running servers as "should exit in 5 minutes" and then you do the usual hot reload. If any server gets stuck, bombtag will forcibly kill it, first with SIGTERM and then with SIGKILL.
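
The escalation described above can be sketched in plain shell. This is not bombtag's actual implementation, only an illustration of the idea. The function name and its three arguments are assumptions, and it does not protect against the PID being reused by another process during the wait.

```shell
# Hypothetical helper: kill_after PID DELAY GRACE
# Waits DELAY seconds, sends SIGTERM if PID is still alive,
# then sends SIGKILL if it is still alive GRACE seconds later.
kill_after() {
    ka_pid=$1
    ka_delay=$2
    ka_grace=$3
    sleep "$ka_delay"
    # kill -0 delivers no signal; it only checks that the PID exists.
    kill -0 "$ka_pid" 2>/dev/null || return 0
    kill -TERM "$ka_pid" 2>/dev/null
    sleep "$ka_grace"
    if kill -0 "$ka_pid" 2>/dev/null; then
        kill -KILL "$ka_pid" 2>/dev/null
    fi
    return 0
}
```

For example, `kill_after 2239536 300 10 &` would give that PID five minutes to exit on its own and then ten more seconds to react to SIGTERM.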

For added fanciness, I also relocate the env vars and overwrite argv[0] so that inside the "ps" output, bombtag can show what PID it's waiting for and the current countdown in seconds, e.g.

  user 2239536  0.0  0.0  18316  1792 pts/5    TN   23:00   0:00 cat
  user 2239542  0.0  0.0   6092   512 pts/5    SN   23:00   0:00 bombtag -p 2239536 -n cat (29s -> TERM)
  • necovek a month ago

    Why not combine the regular `kill`/`killall` with `at` and `ps`?

    • fxtentacle a month ago

      Because the old and the new processes have exactly the same name and parameters. But we only want to kill the old processes and only after the new ones have started.

      • necovek a month ago

        I must be missing something.

        Why wouldn't something like 'echo "kill -KILL $(ps -C [name] -o pid=)" | at now + 2 minutes' work? You obviously need to run this before the new processes are started up, but it will only do the `kill` 2 minutes later after the new processes are started.

        To expand on what the above command does:

        1. echo is to print out a command to add as the job for the `at` command

        2. the kill command sends the KILL signal (you can also use -9, but I prefer names like TERM, INT, or KILL for clarity). The $(ps ...) substitution "materializes" the PIDs at the moment the line is echoed, so ps does not get re-run after 2 minutes

        3. at command allows you to schedule execution of a command on POSIX systems (see https://linux.die.net/man/1/at)
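
        A slightly more defensive variant of the one-liner, as a sketch only. The function name, the 10 second pause, and `myserver` are assumptions. It takes the PIDs as separate arguments, because ps prints one PID per line and inside double quotes those newlines survive, which would split the job into several lines when more than one process matches.

        ```shell
        # Hypothetical helper: build_kill_job PID...
        # Prints a small job script for `at` that sends TERM, waits, then KILL.
        build_kill_job() {
            [ "$#" -gt 0 ] || return 1
            # Refuse anything that is not a plain number.
            for bkj_p in "$@"; do
                case $bkj_p in
                    ''|*[!0-9]*) return 1 ;;
                esac
            done
            # "$*" joins all PIDs with spaces, so they stay on one line.
            printf 'kill -TERM %s 2>/dev/null\n' "$*"
            printf 'sleep 10\n'
            printf 'kill -KILL %s 2>/dev/null\n' "$*"
        }
        ```

        Usage, run before the new processes start and with atd running: `build_kill_job $(ps -C myserver -o pid=) | at now + 2 minutes`. The PIDs are still captured when the job is scheduled, so PID reuse within the two minutes remains a risk.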