I do something similar (but less portable and more verbose) in C++ sometimes when I want to prototype something. My boilerplate is something like this:
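[The boilerplate snippet itself didn't survive in this thread. A minimal sketch of the shape it takes (the g++ flags and the ".bin" suffix are illustrative, borrowed from the C variant below) could be:]

#if 0
# Shell part: rebuild only if the source is newer than the cached binary.
set -e ;
[ "$0" -nt "$0.bin" ] && g++ -std=c++17 -Wall -Wextra -O2 "$0" -o "$0.bin" ;
exec "$0.bin" "$@" ;
#endif
#include <iostream>

int main()
{
    std::cout << "prototyping...\n";
    return 0;
}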
(The trailing semi-colons in the script part are there to make my editor indent the C++ code properly.)

HA! I was about to share my way, but you seem to have done more or less what I have, just in C instead!
I was having fun with old Golang C code from before they started bootstrapping in Go, and found they were using an interpunct to separate function names, so I decided to goof around with shellscript-like C code:

#if 0
set -e; [ "$0" -nt "$0.bin" ] &&
    gcc -O2 -Wall -Wextra -Wpedantic -std=c11 "$0" -o "$0.bin" -s
exec "$0.bin" "$@"
#endif
#include <stdio.h>

void runtime·sysAlloc(void);

int main(void)
{
    unsigned age = 44;
    printf("I am %u years old.\n", age);
    runtime·sysAlloc();
    return 0;
}

void runtime·sysAlloc(void) {
    printf("Hello from C code, %d!\n", 90);
}
That's exactly how I keep my quick test C or C++ programs. I like how one can keep all the compilation options in the same file this way.
I love the opening #if 0. So many tricks from multilingual quines, maybe they can actually be useful :)
I absolutely love this, super clever!
Based on the title, I was expecting this to be about quines [1].
If you aren't familiar, quines are programs which produce their own source as their only output. They're quite interesting and worth a dive if you haven't explored them before.
My personal favorites are the radiation-hardened variety, which still produce the original pre-modified source even when any single character is removed before the program is run.
[1]: https://en.wikipedia.org/wiki/Quine_(computing)
You can chain 128 quines: https://github.com/mame/quine-relay
Their "keep ruby weird" quine is my favourite: https://www.youtube.com/watch?v=IgF75PjxHHA
Looks ok to me - can't see anything wrong.
rubocop says no offenses :D
Wikipedia's specific radiation-hardened example looks like gobbledygook to me; could you explain it in principle? Also, does it only compensate for removed characters, or also for "bit flips", i.e. a changed character?
The basic principle of radiation-hardened programs is to have as many safety nets as possible. The particular example on Wikipedia does this by, at the topmost layer for example, having two copies of the same string literal and `eval`-ing the longer one (which then uses that literal to complete the quine). Since this `eval` call can itself be disrupted, it first checks whether the two literals are the same and calls a dedicated code path:

/#{eval eval if eval==instance_eval}}/.i rescue##/
This code path is carefully designed so that it always remains valid Ruby code and executes the following line only when `eval` was not called. This kind of principle drives the entire source code.

> also for "bit flips" i.e. a changed character?
No. You can trivially make it invalid by changing the first character (`e`) to, say, `%` or 0xE5. It is generally impossible to make such a program for many languages with fixed classes of code characters.
Thanks! I also found this article[1] helpful, particularly the section describing what happens if parts of the Outside Wrapper get deleted.
[1] https://codegolf.stackexchange.com/a/100785
> the right conceptual basis for build systems: ”build is the first stage of execution”
I have long been thinking the same. And also: "Running tests is the first stage of deploying in production" etc.
In other words: There is often a whole dependency graph between various stages (install toolchain, install 3rd-party dependencies, do code generation, compile, link/bundle, test, run, …) and each of those stages should ideally be a pure function mapping inputs to outputs. In my experience, we often don't do a good job making this graph explicit, nor do we usually implement build steps as pure functions or think of build systems as another piece of software which needs to be subject to the same quality criteria that we apply to any other line of code we write. As a result, developer experience ends up suffering.
Large JavaScript projects are particularly bad in this regard. Dependencies, auto-generated code and build output live right alongside source code, essentially making them global state from the point of view of the build system. The package.json contains dozens of "run" commands which more often than not are an arcane mix of bash scripts invoking JS code. Even worse, those commands typically need to be run in juuust the right order because they all operate on the same global state. There is no notion of one command depending on another. No effort put into isolating build tasks from each other and making them pure. No caching of intermediate results. Pipelines take ages even though they wouldn't have to. Developers get confused because a coworker introduced a new npm command yesterday which now needs to be run before anything else. Ugghhh.
"Build is the first stage of execution" resonates with me too.
It follows that build systems should be written in terms of real, encapsulated APIs that separate interface from implementation.
Make-like build systems are like programming in assembly language: there is no structure that isolates one build step from another. Make offers no way of creating a "local variable" -- an intermediate file that is hidden from any other build step. The filesystem is like memory: as long as you can figure out the filename (address) somehow, you can read and write whatever you want.
Even if you can declare that one build step depends on another, there's usually nothing that prevents a step from reading (or even writing!) files from other build steps that it did not declare a dependency on.
What you want is a build system that lets you define real APIs, with truly private state. For example, cc_library() in Bazel (cxx_library() in Buck2) lets you define a C++ library in terms of the srcs/hdrs/etc, but you don't know how it will be implemented. Other rules and targets can consume the outputs of cc_library(), but only the files that are publicly exposed in the returned providers.
This model brings true encapsulation to build systems, which makes it much more feasible to change implementation details without breaking users.
It's basically impossible to program with no side effects in any general-purpose language, and providing a build-system language likely requires side-effect-rich APIs too (to write files, call external tools, etc.).
Which is to say that you can perfectly well implement a "clean" build setup with make, and provide an "interface" for parts of the system to tie into. Files and directories are as good an input/output interface as any other, and you can do some simple safeguards with permissions on directories and files.
To create a local, intermediate file with make, you create a file under TMPDIR (e.g. with mktemp or tempfile) and remove it at the end, setting its permissions so that nested steps cannot access it.
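A rough sketch of that pattern (file and target names are hypothetical; recipe lines are tab-indented): the preprocessed intermediate only ever exists inside a freshly created mktemp directory (mode 700 by default) and is removed in the same recipe, so no other rule can ever name or depend on it.

# Hypothetical rule: keep the intermediate in a private temp directory
# and clean it up before the recipe finishes.
prog: main.c
	tmpdir=$$(mktemp -d) && \
	gcc -E main.c -o "$$tmpdir/main.i" && \
	gcc -x cpp-output "$$tmpdir/main.i" -o $@ && \
	rm -rf "$$tmpdir"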
I think one build tool that has the fundamental abstraction well thought out is mill (written in Scala). Every task is just an ordinary function written in plain Scala, outputting something. Other functions can use this output by simply calling that function. In many cases the output will simply be a target directory, which is cached when none of the dependencies have changed.
Too bad that mill is JVM-specific. :\
"TCC can also be used to make C scripts, i.e. pieces of C source that you run as a Perl or Python script. Compilation is so fast that your script will be as fast as if it was an executable."
https://bellard.org/tcc/tcc-doc.html
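For anyone who hasn't tried it, the whole trick is a shebang pointing at tcc with -run (the /usr/bin/tcc path here is an assumption; adjust it to wherever tcc is installed):

#!/usr/bin/tcc -run
#include <stdio.h>

/* chmod +x hello.c && ./hello.c */
int main(void)
{
    printf("Hello from a C script\n");
    return 0;
}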
https://github.com/vnmakarov/mir is also great for this: the c2m binary is an optimizing C11 JIT runner, and it even accepts "*.c" as input.
In D, this is a fully supported use case. The DMD compiler comes with an executable called rdmd that can be used as the shebang executable at the top of any D file, and the shebang line itself is codified as valid D syntax.
Walter, is that you?
Nope, that would be https://news.ycombinator.com/user?id=WalterBright
This is completely off topic for the actual article, but I had a bit of fun a while back working out how to make self-executing C and OCaml scripts/programs.
It's an interesting exercise working out what is both a comment in the target language but is also an executable shell script. For C it's reasonably straightforward but for OCaml it's quite subtle:
https://libguestfs.org/nbdkit-cc-plugin.3.html#C-plugin-as-a... https://libguestfs.org/nbdkit-cc-plugin.3.html#Using-this-pl...
Seems to be replicating mainframe JCL and in-stream data sets. Processing instructions and input are combined in a single file. Used all the time for compiling, running utilities etc.
I'm guessing that this (IBM) example is setting the delimiter to '@@' to avoid problems with the comment - JCL also understands the '/*' sequence. I've not seen it used with other languages (COBOL etc.)
https://en.wikipedia.org/wiki/Job_Control_Language#In-stream...

The way I see it, this is something a true built-from-source system could do with their packaging system to enable no-effort code changes for any system utility and true trust in you running what you have source for (other than backdoored hardware).
Debian is pretty far off from this vision (if we also want performant execution), but I wonder how Gentoo, Arch Linux and Nix fare in this regard? Is this something that could be viably built with their current packaging formats?
In Arch at least it is reasonably easy to download the source for a package, modify it locally, build it and install it. Not sure if that's what you are asking for?
I think no matter how easy a separate source package is, it's still a separate thing, and not the same (not as good) as the source being a built-in part of the package.
FreeBSD/Gentoo ports come close: if you pretend that pkg doesn't exist, or at least imagine a world where it's only used for the absolute minimum necessary bootstrap, then ports is probably the closest thing to this.
The source is still actually a separate thing even then, so I think even ports with no pkg usage is still not really there.
Imagine the package itself being the source, the one and only form of the package is the source. If it builds an executable, that executable is actually just an automatically generated throw-away artifact that you don't care about and don't save or distribute. Maybe most normal users don't even know where the compiled bin really lives, buried in some /var tree or something, or maybe even in a kind of kernel level db. All the user ever overtly interacts with is actually the self-building source package. When you want to copy it or delete it etc, that's the only thing you touch and everything else is just automatically managed cache.
Then it's not merely easy to get the source and modify it, it's simply THE package in the first place. If you can even run a thing, then you automatically and indelibly also have the full source to that thing. That would be pretty huge I think.
It would probably result in slow installs and updates like Gentoo or FreeBSD ports, but only if we imagine switching to this today as the only variable changed, out of context. Had we decided to package apps this way all along, the last 40 years of tooling and OS development would have gone into optimizing those pain points instead.
Indeed, but I also don't see why this couldn't extend to the kernel too: it would self-compile upon boot.
There is an obvious bootstrapping issue (there has to be a compiler for any language you want to compile, including the one your compiler is in), but it's certainly interesting food for thought.
BTW I don't mean to imply that it's a good idea necessarily, just that it is an interesting idea.
Sometimes it's useful to do something like:
Even if, for some reason, there are multiple `usr` folders, the use of `env` means it will eventually call the executable.

As for getting rid of the shebang - swapping the ! with a / means that the line and character counts don't change so you get meaningful error messages.
Ah; the /* at the beginning is expanded by the shell to a proper path while the trailing #*/ becomes a shell comment.
what kind of sorcery is this :D
Somewhat adjacent: I recently discovered https://github.com/rofl0r/rcb2 - it can take you quite far without a makefile. And similarly to the OP, it lets you keep the relevant build info right in the source code. (rcb2 is great at the prototype stage, but obviously at some point makefiles are worth spending time on.)
Didn't this exist conceptually anyway as the C shell (csh) where the scripting language was "closer to" C?
https://en.wikipedia.org/wiki/C_shell
It seems like you are on your way to making the C++ shell.
Although the substantial functionality overlap between shells and programming languages has caused many to attempt to combine them, someone a while back (Yossi Kreinin?) gave a reason why that's probably not what we want: a shell should be optimised for quickly doing specific things, and programming languages should be optimised for concisely doing general things. In particular, this suggests that bare words in shells should default to being literals, and variable substitution will take extra characters, while bare words in programming languages should default to being variables, and literal values will take extra characters.
You might all enjoy The International Obfuscated C Code Contest https://www.ioccc.org/
https://www.ioccc.org/years.html
IOCCC is almost irrelevant here, but one particular winning entry is relevant: 2000/tomx [1]. Nowadays the IOCCC is hosted on GitHub Pages and it is hard to look at the verbatim source, so the following is the entire source code:
(All indentations here are actually tabs.)

[1] https://www.ioccc.org/2000/tomx.c
This compiles as both "./tomx.c" and "make -f tomx.c" to produce the executable "tomx"; but I'm not fully clear how?
As a shell script, anything starting with `#`, `true<WS>` and `false<WS>` are meant to be ignored and variable declarations are compatible with Make. The next lines to be executed are therefore `make -f $0 $1` and `exit 0`, as you would expect. Shell scripts are parsed linewise so subsequent lines are ignored.
As a Makefile, again `#` lines are ignored and variable declarations are compatible, but lines starting with `true<WS>` and `false<WS>` have to be a valid construct, so they are made into rules. Make doesn't rely much on the rule's recipe (you can do way more weird things by setting `SHELL` to non-shells, by the way!), so indented lines are essentially ignored unless the rule is triggered. The next lines are the usual stuff for a Makefile. As `true` (and anything that follows, see the next paragraph) would be the first and thus the default rule target, it has to trigger `all` manually, and the final `.PHONY` rule ensures that `all` is always executed.
As C source code, `#` lines are now preprocessor directives, used to hide the first `true` token from the compiler. As an identifier followed by a colon wouldn't be valid C code at the top level, `true` has to be followed by a comment which extends to the second-to-last line. (There is no technical reason for `*/` to be a requisite here, I believe.) The final line is ordinary indented C code, which is part of the `.PHONY` rule, whose recipe is ignored. Since Make would interpret `/*` as a path pattern, it also has to be handled by `.PHONY`.
Nice. I worked out most of it; the C source part was easy and the Makefile was mostly comprehensible (except for the "true"/"false" rules), but the shell script part was what stumped me; now I see the sneaky whitespace in "false :".
A true IOCCC winner.
One tip for folks trying to figure out this (and other multi-modal) code is to use your editor's syntax highlighting based on filetype. So for example in vim, renaming this file to ".sh" or ".mk" gives you the shell script and makefile structures.
Is the linked example a quine or something else? It's a quine, right?
TFA? No, it’s a C / Bourne shell polyglot where the shell part compiles and runs the C part. I’ve also used this technique when I needed to post self-contained examples (e.g. to mailing lists), but I don’t know if people actually appreciated it.
//bin/env gcc $0 -g -o ${0%.} && ./${0%.} ; exit
That's the one I've been using. Feel free to adopt and change it.