Ask HN: A retrofitted C dialect?
Hi, I'm Anqur, a senior software engineer with a varied background in which C development was often an important part of my work. E.g.
1) Games: a Chinese/Vietnamese game with the server and client in C/C++ and Lua for scripting [1]. 2) Embedded systems: switches/routers whose network stacks are written entirely in C [2]. 3) (Networked) file systems: the Ceph FS client, which is a kernel module [3].
(The links include some unnecessary details, but these are real projects I used to work on.)
Recently, Rust vs. C in the kernel has been a hot topic, and one message [4] caught my attention, where it discusses the "Rust" experiment in kernel development:
> I'd like to understand what the goal of this Rust "experiment" is: If we want to fix existing issues with memory safety we need to do that for existing code and find ways to retrofit it.
So for many years I've kept thinking about a new C dialect that retrofits fixes for these problems onto C itself.
Sometimes big systems and software (e.g. OSes, browsers, databases) can be built entirely in other languages like C++, Rust, D, Zig, etc. But typically, as I hinted above, making a good filesystem client requires writing kernel modules (i.e. providing a VFS implementation; I do know FUSE, but I believe it's better to use VFS directly), so switching languages is not always feasible.
And I still love C, for its unique "bare-bone" experience:
1) Just talk to the platform; almost all platforms speak C. Nothing like Rust's PAL (platform-agnostic layer) is needed. 2) Just talk to other languages; C is the lingua franca (except that Go needs no libc by default). Not to mention that if I want WebAssembly to talk to Rust, `extern "C"` is needed in the Rust code. 3) Just a libc, widely available; I write my own data structures carefully. Since one is usually writing some critical component of a bigger system in C, it's fine that there aren't many existing libraries to choose from. 4) I don't need over-generalized generics; my use of generics is quite limited.
So unlike a few `unsafe` blocks in safe Rust, I want something like a few "safe" blocks in an ambient "unsafe" C dialect. But I'm not saying "unsafe" is good or bad; I'm saying we shouldn't frame it as unsafe vs. safe at all. It's C itself: you wouldn't call anything in C "safe" or "unsafe".
Actually, I'm also an expert in implementing advanced type systems; some of my work includes:
1) A row-polymorphic JavaScript dialect [5]. 2) A tiny theorem prover with Lean 4 syntax in less than 1K LOC [6]. 3) A Rust dialect with reuse analysis [7].
Language features like generics, compile-time eval, traits/typeclasses, and bidirectional typechecking are trivial for me; I've successfully implemented them in the projects above.
For the retrofitted C, these features initially come to my mind:
1) Code generation directly to C; no LLVM IR, no machine code. 2) Modules, like C++20 modules, to eliminate the use of headers. 3) Compile-time eval and type-level computation, so that e.g. `malloc(int)` is actually a thing. 4) Tactics-like metaprogramming to generate definitions, acting like type-safe macros. 5) Quantitative types [8] to track the use of resources (pointers, FDs); the typechecker tells the user all the positions where `free` could be inserted, with nothing like RAII. 6) Limited lifetime checking, though some people tell me lifetimes aren't needed in such a language.
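As a rough illustration of point 5, here is what such a checker might reason about in today's C. The function and the obligation comments are hypothetical, sketching how a quantitative type would track a pointer's single `free` obligation:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Hypothetical illustration: a quantitative-type checker would track
   that `buf` carries an obligation of exactly one `free`, and would
   list every program point where that `free` can legally go. */
char *dup_greeting(const char *name) {
    size_t n = strlen(name) + 8;
    char *buf = malloc(n);        /* obligation created: free(buf) owed */
    if (!buf) return NULL;        /* nothing allocated, nothing owed */
    snprintf(buf, n, "hello %s", name);
    return buf;                   /* obligation transferred to the caller */
}
```

The interesting design question is the last line: instead of RAII silently discharging the obligation, the checker would report that the caller now owes the `free`.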
Any further insights? Shall I kick-start such a project? Please, I need your ideas very much.
[1]: https://vi.wikipedia.org/wiki/V%C3%B5_L%C3%A2m_Truy%E1%BB%81...
[2]: https://e.huawei.com/en/products/optical-access/ma5800
[3]: https://docs.ceph.com/en/reef/cephfs/
[4]: https://lore.kernel.org/rust-for-linux/Z7SwcnUzjZYfuJ4-@infr...
[5]: https://github.com/rowscript/rowscript
[6]: https://github.com/anqurvanillapy/TinyLean
In 2014, John Regehr and colleagues proposed what he called Friendly C [0], an attempt to salvage C from UB. A bit more than a year later, he concluded that the project wasn't really feasible because people couldn't agree on the details of what Friendly C should be [1].
In the second post, there's an interesting comment towards the end:
> Luckily there’s an easy away forward, which is to skip the step where we try to get consensus. Rather, an influential group such as the Android team could create a friendly C dialect and use it to build the C code (or at least the security-sensitive C code) in their project. My guess is that if they did a good job choosing the dialect, others would start to use it, and at some point it becomes important enough that the broader compiler community can start to help figure out how to better optimize Friendly C without breaking its guarantees, and maybe eventually the thing even gets standardized. There’s precedent for organizations providing friendly semantics; Microsoft, for example, provides stronger-than-specified semantics for volatile variables by default on platforms other than ARM.
I would argue that this has happened, but not quite in the way he expected. Google (and others) have chosen a way forward, but rather than somehow fixing C they have chosen Rust. And from what I see happening in the tech space, I think that trend is going to continue: love it or hate it, the future is most likely going to be Rust encroaching on C, with C increasingly being relegated to "legacy" status like COBOL and Fortran. In the words of Ambassador Kosh: "The avalanche has already started. It is too late for the pebbles to vote."
0: https://blog.regehr.org/archives/1180 1: https://blog.regehr.org/archives/1287
I think the problem with "friendly C" / "safe C++" proposals is that they come from a place of "I want to continue using what I know in C/C++ but get some of the safety benefits; I'm willing to trade some safety for familiarity." The problem is that the friendly C / safe C++ people picture lies on a spectrum. On one end are people who really just want to keep writing C++98 or C99 and see this as a way to keep the network effects of C/C++ by getting people to write C who otherwise wouldn't. At the other extreme are people willing to significantly rework their codebases for this hypothetical safe C.
The people on one end of this spectrum actually wouldn't accept any of the changes to meaningfully move the needle, while the people on the other end have already moved or are moving to Rust.
Then in the middle you have a large group of people but not one that agrees on which points of compatibility they will give up for which points of safety. If someone just said "Ok, here's the standard variant, deal with it", they might adopt it... but they wouldn't be the ones invested enough to make it and the people who would make it have already moved to other languages.
History has already proven this with Objective-C and C++, and also with TypeScript: while those languages provide stronger safety guarantees than plain old C and JavaScript, there are always people who keep using the old tricks on the new system.
Only removing copy-paste compatibility fixes that.
> Luckily there’s an easy away forward, which is to skip the step where we try to get consensus.
This is true: it's the Benevolent Dictator model versus the rule-by-committee problem.
Committees are notorious for having trouble reaching consensus, because everyone wants to pull in a different direction, often at odds with everyone else.
Benevolent dictators get things done, but it's not necessarily what people want.
And, we live in hope that they stay benevolent.
The problem with "safe pockets in ambient unsafety" is that C and C++ intentionally disallow this model. It doesn't matter what you do to enforce safety within the safe block: the definition of Undefined Behavior means that code elsewhere in your program can violate any guarantees you attempt to enforce. The only ways around this are a language that doesn't transpile to C and doesn't have undefined behavior, like Rust, or a compiler that translates C safely, as Zig attempts to do. Note that Zig still falls short here with unchecked illegal behavior, and rustc has struggled with assumptions about C's undefined behavior propagating into LLVM's backend.
Safe pockets in ambient unsafety does have benefits though. For example, some code has a higher likelihood of containing undefined behavior (code that manipulates pointers and offsets directly, parsing code, code that deals with complex lifetimes and interconnected graphs, etc), so converting just that code to safe code would have a high ROI.
And once you get to the point where a large chunk of code is in safe pockets, any bugs that smell of undefined behavior only require you to look at the code outside of the safe pockets, which hopefully decreases over time.
There are also studies showing that newly written code tends to contain more undefined-behavior bugs simply because of its young age, so writing new code in safe pockets has a lot of benefit there too.
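To make the "safe pocket" idea concrete, here is a minimal sketch in plain C. The helper itself is fully bounds-checked and well-defined, though, as argued above, UB anywhere else in the program can still scribble over the array and undermine the pocket's guarantees:

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* A tiny "safe pocket": every access inside this helper is checked
   and well-defined. The caveat from the comment above still applies:
   UB in other code can invalidate what the pocket guarantees. */
bool get_at(const int *arr, size_t len, size_t i, int *out) {
    if (arr == NULL || i >= len)
        return false;             /* reject instead of invoking UB */
    *out = arr[i];
    return true;
}
```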
There are plenty of attempts at "safe C-like" languages that you can learn from:
C++ has smart pointers. I personally haven't worked with them, but you can probably get very close to "safe C" by mostly working in C++ with smart pointers. Perhaps there is a way to annotate the code (with a .editorconfig) to warn/error when using a raw pointer, except within a #pragma?
> Just talk to the platform, almost all the platforms speak C. Nothing like Rust's PAL (platform-agnostic layer) is needed. 2) Just talk to other languages, C is the lingua franca
C# / .Net tried to do that. Unfortunately, the memory model needed to enable garbage collection makes it far too opinionated to work in cases where straight C shines. (IE, it's not practical to write a kernel in C# / .Net.) The memory model is also so opinionated about how garbage collection should work that C# in WASM can't use the proposed generalized garbage collector for WASM.
Vala is a language inspired by C# that transpiles to C. It uses the GObject system under the hood. (I guess GObjects are used in some Linux GUIs, but I have little experience with them.) GObjects, and thus Vala, are also opinionated about how automatic memory management should work (in this case, reference counting), but from what I remember it might be easier to drop into C in a Vala project.
Objective C is a decent object-oriented language, and IMO, nicer than C++. It allows you to call C directly without needing to write bindings; and you can even write straight C functions mixed in with Objective C. But, like C# and Vala, Objective C's memory model is also opinionated about how memory management should work. You might even be able to mix Swift and Objective C, and merely use Objective C as a way to turn C code into objects.
---
The thing is, if you were to try to retrofit a "safe C" inside of C, you have to be opinionated about how memory management should work. The value of C is that it has no opinions about how your memory management should work; this allows C to interoperate with other languages that allow access to pointers.
> C# / .Net tried to do that. Unfortunately, the memory model needed to enable garbage collection makes it far too opinionated to work in cases where straight C shines. (IE, it's not practical to write a kernel in C# / .Net.)
It was practical enough for Singularity and Midori.
Those projects failed due to lack of leadership support, not technical issues.
Additionally, Android and ChromeOS are what the Longhorn userspace could have looked like if leadership support had been there, instead of rebooting the whole approach with C++ and COM, which persists to this day in Windows desktop land, with WinRT doubling down on that approach and failing as well, again due to leadership.
GObjects are a nightmare: a poor reimplementation of C++ on top of C. You have to know which "unref" function to call and which type to cast to. For all the drawbacks of C++, it would have been less bad than GObject.
It's less that it's opinionated and more that the WASM GC spec is just bad and too rudimentary to be anywhere near enough for the more sophisticated GC implementations found in the JVM and .NET.
It's been a while since I skimmed the proposal. What I remember is that it was "just enough" to be compatible with JavaScript, but didn't have the hooks that C# needs. (I don't remember any mention of the JVM.)
I remember that the C# WASM team wanted callbacks for destructors and type metadata.
Personally, having spent > 20 years working in C#, I find destructors to be a smell of a bigger problem, really only useful for debugging resource leaks. I'd rather turn them off in the WASM apps that I'm working on.
Type metadata is another thing that I think could be handled within the C# runtime: Much like IntPtr is used to encapsulate native pointers, and it can be encapsulated in a struct for type safety when working with native code, there can be a struct type used for interacting with non-C# WASM managed objects that doesn't contain type metadata.
Here's the issue which gives an overview of the problems: https://github.com/WebAssembly/gc/issues/77
Further discussion can be found here: https://github.com/dotnet/runtime/issues/94420
Turning off destructors will not help even a little because the biggest pain points are support for byref pointers and insufficient degree of control over object memory layout.
I'm a lot less experienced than you, but since you're collecting ideas, I'll give my opinion.
For me personally, the biggest improvements that could be made to C aren't about advanced type system stuff. They're things that are technically simple but backwards compatibility makes them difficult in practice. In order of importance:
1) Get rid of null-terminated strings; introduce native slice and buffer types. A slice would be basically struct { T *ptr; size_t count; } and a buffer would be struct { T *ptr; size_t count; size_t capacity; }, though with dedicated syntax to make them ergonomic, perhaps T ^slice and T @buffer. We'd also want buffer -> slice -> pointer decay, beginof/endof/countof/capacityof operators, and of course good handling of type qualifiers.
2) Get rid of errno in favor of consistent out-of-band error handling that would be used in the standard library and recommended for user code too. That would probably involve using the return value for a status code and writing the actual result via a pointer: int do_stuff(T *result, ...).
3) Get rid of the strict aliasing rule.
4) Get rid of various tiny sources of UB. For example, standardize realloc to be equivalent to free when called with a length of 0.
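Points 1 and 2 above can be sketched in today's C. The type and constant names (`int_slice`, `E_RANGE`) are invented for illustration; the proposed `T ^slice` syntax would presumably desugar to something like these structs:

```c
#include <stddef.h>
#include <assert.h>

/* Point 1: what `T ^slice` might desugar to (names are invented). */
typedef struct { int *ptr; size_t count; } int_slice;
/* A buffer additionally tracks capacity. */
typedef struct { int *ptr; size_t count; size_t capacity; } int_buffer;

enum { OK = 0, E_RANGE = 1 };   /* Point 2: status codes, invented here */

/* Out-of-band errors: status in the return value, result via pointer. */
int slice_get(int *result, int_slice s, size_t i) {
    if (i >= s.count) return E_RANGE;
    *result = s.ptr[i];
    return OK;
}
```

With dedicated syntax, the compiler could generate the bounds check and the decay conversions, which is exactly what the structs alone cannot enforce.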
Metaprogramming-wise, my biggest wish would be for a way to enrich programs and libraries with custom compile-time checks, written in plain procedural code rather than some convoluted meta-language. These checks would be very useful for libraries that accept custom (non-printf) format strings, for example. An opt-in linear type system would be nice too.
Tool-wise, I wish there was something that could tell me definitively whether a particular run of my program executed any UB or not. The simpler types of UB, like null pointer dereferences and integer overflows, can be detected now, but I'd also like to know about any violations of aliasing and pointer provenance rules.
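For context on the format-string wish: GCC and Clang already provide a fixed, built-in version of this for the printf family via the `format` attribute; the gap is that libraries cannot define analogous checks for their own, non-printf formats. A minimal sketch of today's mechanism (`log_fmt` is a made-up helper):

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* GCC/Clang check printf-style calls against this declaration at
   compile time; e.g. log_fmt(buf, n, "%s", 42) draws a -Wformat
   warning. Custom format languages get no such help, which is the
   gap the comment above describes. */
#if defined(__GNUC__)
__attribute__((format(printf, 3, 4)))
#endif
static int log_fmt(char *out, size_t n, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int r = vsnprintf(out, n, fmt, ap);
    va_end(ap);
    return r;
}
```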
Love all the ideas here.
I found it might be possible to tackle "strict aliasing" and "pointer provenance" with a type system, and I would head down that path early. The approach might sound like Rust's `MaybeUninit`, but I haven't thought much about it yet.
I've already implemented procedural metaprogramming in a JS dialect of mine [1], and it's trivial to use it to implement compile-time format-string checks. I would improve the whole experience in this new C-like language.
Again, very very practical ideas here. Great thanks!
[1]: https://github.com/rowscript/rowscript/blob/16cb7e1/core/src...
I highly agree that they need to add slice and buffer types to the standard headers, especially given the recently added counted_by attribute.
I definitely feel the strict aliasing rule should be opt-in.
And there is a lot of small UB that could be eliminated here and there.
I'll add: make types first-class objects, and make typeof actually useful.
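A small sketch of how `counted_by` pairs with a slice-ish layout today. The `COUNTED_BY` macro guard is a common pattern so the snippet still builds on compilers without the attribute; the `int_vec` type is invented for illustration:

```c
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

/* counted_by ties a flexible array member to its length field, so
   sanitizers and fortified builds can bounds-check accesses.
   Guarded so the code still compiles where it's unsupported. */
#if defined(__has_attribute)
#  if __has_attribute(counted_by)
#    define COUNTED_BY(f) __attribute__((counted_by(f)))
#  endif
#endif
#ifndef COUNTED_BY
#  define COUNTED_BY(f)
#endif

struct int_vec {
    size_t len;
    int data[] COUNTED_BY(len);   /* checker knows: valid range is [0, len) */
};

struct int_vec *vec_new(size_t len) {
    struct int_vec *v = malloc(sizeof *v + len * sizeof(int));
    if (v) v->len = len;
    return v;
}
```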
There are approaches with at least partly the same goals as yours, e.g. Zig. Personally, I have been working on my own C replacement for some time, which meets many of your points (see https://github.com/micron-language/specification), but the syntax is derived from my Oberon+ language, not from C (even though I've used C and C++ for decades, I don't think it's a good syntax). It has compile-time execution, inlines, and generic modules (no need for macros or a preprocessor). The current version is minimal, but extensions like inheritance, type-bound procedures, Go-like interfaces, and a finally clause (as a simple replacement for RAII or "defer") are already prepared.
> There are approaches e.g. Zig.
Yes! Zig has done a great job on many C-related things, e.g. it has been possible to cross-compile C/C++ projects with the Zig toolchain for years. But I'm still quite stupidly obsessed with source-level compatibility with C. I don't know if that's good, but things like "Zig uses `0xAA` for debugging undefined memory, not C's traditional `0xCC` byte" make me feel Zig isn't "bare-bone" enough for the C world.
> Micron and Oberon+ programming language.
They look absolutely cool to me! The syntax looks inspired by Lua (`end` marker) and OCaml (`of` keyword), CMIIW. The features are pretty nice too. I would look more into the design of generic modules and inheritance, since I'm not sure what a good extensibility feature would look like for C users.
Well, BTW, I noticed you're following only one person on your GitHub profile, and it's Haoran Xu. Any story there, lol? He's just such a genius, making a better LuaJIT, a baseline Python JIT, and a better Python interpreter all happen in real life.
> The syntax looks inspired from Lua (`end` marker) and OCaml (`of` keyword), CMIIW
Oberon+ and Micron are mostly derived from Wirth's Oberon and Pascal lineage. Lua inherited many syntax features from Modula-2 (yet another Wirth language), and also OCaml (accidentally?) shares some keywords with Pascal. If you are interested in even more Lua similarities, have a look at https://github.com/rochus-keller/Luon, which I published recently, but which compiles to LuaJIT and thus serves different use-cases than C.
> I would look into the design of generic modules
I found generic modules to be a good compromise with simplicity in mind; here is an article about some of the motivations and findings: https://oberon-lang.github.io/2021/07/17/considering-generic...
> Haoran Xu, making a better LuaJIT
You mean this project: https://github.com/luajit-remake/luajit-remake? It's a very interesting project, and it seems development has continued after a break of about a year.
> source-level compatibility with C
Not sure if this is exactly what you meant, but in Zig you can #include a C header and then "just" invoke the function: no special FFI syntax or typecasting (except for rich enums and strings). It can produce compatible ASTs for C and Zig.
Notable approaches to compatibility with C include: 1) LLVM, as Rust and Zig did (Zig announced moving off LLVM in 2023), since LLVM IR is good for compatibility and optimization. 2) Other backends like libgccjit; I mention this because rustc has a libgccjit backend, and its author, a libgccjit maintainer, loves its simplicity (one can think of it as a programmable GCC). 3) Code generation directly to C, which is what Koka does.
So I was talking about the direct C codegen approach, but there's still a fair amount of mess: one needs to choose a C standard and know how to verify the generated code.
I have plans to do just this. See N3211 and N3395 for an initial sketch.
https://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_lo...
Here is a sound static analyzer that can identify all memory safety bugs in C/C++ code, among other kinds of bugs:
https://www.absint.com/astree/index.htm
You can use it to produce code that is semi-formally verified to be safe, with no need for extensions. It is used in the aviation and nuclear industries. Given that it is used only by industries where reliability is so important that money is no object, I never bothered to ask how much it costs. Few people outside those industries know it exists. It is a shame that the open-source alternatives only support subsets of what it supports. The computing industry is largely focused on unsound approaches that are easier to do but do not catch all issues.
If you want extensions, here is a version of C that relies on hardware features to detect pointer dereferences to the wrong places through capabilities:
https://github.com/CTSRD-CHERI/cheri-c-programming
It requires special CHERI hardware, although the hardware does exist.
Astree is a pain in the butt. Even if it were free, I'd recommend it to very few people. It's not usable without someone (often a team) being responsible for it full time.
TrustInSoft is the higher quality option, polyspace is the more popular option, and IKOS is probably the best open source option. I've also had luck with tools from Galois Inc and the increasingly dated rv-match tool.
Funny you mentioned TrustInSoft but not its open-source origin (which is still under development and evolving in a different direction), Frama-C. I cannot compare it to IKOS, but their usefulness depends a lot on the type of code and verification needs.
I want to like Frama-C, but I've never managed to actually use it successfully on real projects and the repeated experiences have soured me on it. Getting a recent version installed was a serious chore last time I tried, and no one else is willing to deal with ACSL to get the full value.
Tell me more.
Here's the thing... There have been many of them, and they all died because they didn't provide enough benefit over the status quo. Cyclone https://en.wikipedia.org/wiki/Cyclone_(programming_language) is probably the best-known one. There's Safe C https://www.safe-c.org/ A bit further from just a "dialect", there's OOC https://ooc-lang.github.io/ and Vala https://vala.dev/
But the only thing that really took off was effort to change things at the very base level rather than patch issues: Rust, Zig, Go.
> But the only thing that really took off was effort to change things at the very base level rather than patch issues.
Exactly, that's the most important takeaway I got from all the discussions here: I would be patching issues while people flood in for better general approaches.
I might shift much of my direction, but luckily there are many "lessons" (like OOC) to learn from.
You mentioned D, but are you familiar with D's BetterC?
https://dlang.org/spec/betterc.html
The goal with BetterC is to write D code that's part of a C program. There's no runtime, no garbage collector, or any of that. Of course you lose numerous D features, but that's kind of the point - get rid of the stuff that doesn't work as part of a C program.
In my opinion, SPLint (http://splint.org/) would be a nice approach. It's a way to specify ownership semantics, in/out parameters, etc., and it also allows specifying arbitrary pre- and postconditions. It works by annotating whole functions, their parameters, types, and variables. These are then checked by running splint on the codebase; you can also opt out of several checks via flags or the preprocessor.
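For readers who haven't seen it: SPLint's annotations are stylized comments that ordinary compilers ignore. A small sketch (the function is invented, but `/*@only@*/`, `/*@null@*/`, and `/*@notnull@*/` are real SPLint annotations):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* SPLint reads the stylized comments below; a normal C compiler
   ignores them. The result is marked as owned by the caller (who
   must free it exactly once) and possibly null; the argument is
   marked as never null. Illustrative sketch, not from the docs. */
static /*@only@*/ /*@null@*/ char *dup_str(/*@notnull@*/ const char *s) {
    char *p = malloc(strlen(s) + 1);
    if (p != NULL)
        strcpy(p, s);
    return p;   /* ownership transfers to the caller */
}
```

Running `splint` over code like this flags ownership leaks and null misuse without changing what the C compiler sees.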
My main problem was that it was annoying to add to a project, though only because you need to specify the ownership semantics, not because of the syntax, which is short and readable. Also, the program sometimes crashes and there doesn't seem to be active development.

I believe what programmers actually want is clean, dialect-free C with sidecar files.
It seems people pretty universally dislike type annotations and overly verbose comments, like Ruby's YARD or Java's Javadoc. Also, if your new language doesn't compile with a standard C compiler, kernel usage is probably DOA. That means you want to keep the source code pure C and store additional data in an additional file. That additional file would then contain stuff like pointer type annotations, object lifecycle and lifetime hints, compile-time eval hints, and stuff to make the macros type safe. Ideally, your tool can then use the C code and the sidecar file together to prove that the C code is bug-free and that pointers are handled correctly. That would make your language as safe as Rust to use.
The hardcore C kernel folks can then just look at the C code and be happy. And you and your users use a special IDE to modify the C code and the sidecar file simultaneously, which unlocks all the additional language features. But as soon as you hit save, the editor converts its internal representation back into plain C code. That means, technically, the sidecar file and your IDE are a fancy way of transpiling from whatever you come up with to pure C.
I love this idea so much.
I've gotten stuck many times on "a new language doesn't compile with a standard C compiler", and my solution is much worse than yours: like LuaJIT, which left one unreadable "minilua" C file [1] to bootstrap some stuff, we could keep a C source version of the "new C" compiler and compile things twice. That sounds bad.
For languages with a very advanced type system that compile to C, I can only think of Koka [2], which translates "algebraic effects and handlers" code into pure C, achieving pure-C generators, coroutines, and async/await without setjmp/setcontext. But the generated C code is unreadable; I would definitely think about how to handle the readability and debugging issues with sidecar files.
[1]: https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/host/minilua....
[2]: https://koka-lang.github.io/koka/doc/book.html
It's never a bad idea to have a better C. Despite having alternatives that somewhat work, C is usually just the only logical choice in certain domains, because of a certain freedom to express things memory-wise, which of course has a lot of pitfalls but is needed.
I'd say if you want to make such a language, build embedded or core-OS code with it: things that do MMIO, DMA interactions, and low-level I/O in kernel code or firmware (more embedded).
If you can solve it in that domain, everyone will love you forever.
> Build embedded or core OS code with it. things that do MMIO, DMA interactions, low level IO in kernel code or firmware (more embedded).
I have a friend currently writing a GC in C, and I was using it as a project to test my ideas (where problems with generic containers, compile-time eval and static reflection kicked in very early). But the things you mention sound more important to me. I have more friends working on RISC-V projects; I should find some inspiration in their work. Thanks for the mention!
C is still evolving. Instead of creating a new C dialect, why not try improving C itself? You can prototype new features with Clang and submit a technical proposal to the C committee for review. Regarding "memory safety" specifically, many of the challenges folks face with RAM management are related to bounds checking so consider prototyping a slices concept [1].
[1] https://www.digitalmars.com/articles/C-biggest-mistake.html
The problem with this is that even seemingly basic, obviously desirable proposals can take years of labor and politicking to get through the committee. See JeanHeyd Meneide's valiant struggle to get an #embed preprocessor directive standardized [1] - it took five years, and I'm pretty sure the C++ equivalent (std::embed) is still in the oven.
When faced with that, it's only natural that people lean hard towards dialects and new languages. They move faster (Rust went from a standing start to 1.0 in ~five years) and offer far more freedom.
[1]: https://thephd.dev/finally-embed-in-c23
C2Y, the next C revision, introduced "enable safe programming" into the C standard charter [1]. The C committee is eager for proposals like this.
Adding a new feature, like slices, as a Clang extension would be considerably faster than creating a new dialect or language, and it would be immediately usable by every C codebase building with Clang. Even if the feature is "slow" to be incorporated into the standard, it would still be accessible as a compiler extension in the interim.
[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3223.pdf
Have a look at ATS; it's memory-safe and designed for kernel development. There are kernel and Arduino examples, and fluent C interop.
No tactics metaprogramming, but it'll give you a start.
Oh, I'd heard about it, and oops, I just hate myself for forgetting it.
And the first sentence [1] of its pointer-type introduction says exactly what I've said here...
> [...] greatest motivation behind ATS is to make it employed to construct safe and reliable programs running in OS kernel.
Also found this interesting Reddit thread [2]. Time to bring some old gems back with good ergonomics now, it won't be that hard. Hold my beer for a while.
[1]: https://ats-lang.sourceforge.net/DOCUMENT/INT2PROGINATS/HTML...
[2]: https://www.reddit.com/r/ProgrammingLanguages/comments/uacib...
We seem to have the same desire for a “cleaned up C.” Could you say more about how metaprogramming would work? I doubt you want to put lifetimes into the type system to any degree. The reason C compiles so much quicker than C++ is the lack of features. Every feature must be crucial. Modules are crucial to preserving C.
> We seem to have the same desire for a “cleaned up C.”
That's so great! But it's sad that not enough ideas and arguments have come up here. :'(
> How metaprogramming would work?
When it comes to "tactics" in Coq and Lean 4 (i.e. a DSL to control the typechecker, e.g. declaring a new variable), there are almost equivalent features, like "elaborator reflection" in Idris 1/2 [1] (e.g. create some AST nodes and let the typechecker check whether they're okay), and most importantly, in Scala 3 [2] you can use the `summonXXX` APIs to generate new definitions for the compiler (e.g. automatically create an instance of a JSON-encoding trait if all fields of a record type are given).
So the idea is like: Expose some typechecker APIs to the user, with which one could create well-typed or ready-to-type AST nodes during compile time.
[1]: https://docs.idris-lang.org/en/latest/elaboratorReflection/e...
[2]: https://docs.scala-lang.org/scala3/reference/contextual/deri...
> Lifetime and compilation speed.
Yes, exactly. I was considering features from Featherweight Rust [3]; some subset of it might be partially applied. But yes, one should be super careful about bringing in new features, for the sake of compilation speed.
It's also worth mentioning that the C compiler itself already does some partial "compile-time eval", like constant folding during optimization. I know some techniques [4] for achieving this during typechecking rather than in a separate pass, and things like incremental compilation and related caching could bring benefits here.
[3]: https://dl.acm.org/doi/10.1145/3443420
[4]: https://en.wikipedia.org/wiki/Normalisation_by_evaluation
> Every feature must be crucial.
I'd like to hear more of your ideas on designing such a language too. And, out of curiosity, what's your related context and background?
You may be interested in the new Clang -fbounds-safety extension https://clang.llvm.org/docs/BoundsSafety.html
Woah, I enjoyed that read a lot.
I also learned that LLVM IR has implicit null checks [1], which replace the guard with just a signal handler so it doesn't hurt the branch predictor too much.
So I believe there are many options around "debug" and "release" profiles for bounds and null checking here. A very good design space.
[1]: https://llvm.org/docs/FaultMaps.html#the-implicitnullchecks-...
I think you don't need any rants, but here one goes anyway.
Ditching headers doesn't solve anything, at least if your language's targets include performance or my beloved example, gamedev =). You will have to consume headers until operating systems stop using them. It's a people problem, not a language problem.
Big elephants in the room that I don't see in your list:
1) "threading" was bolted onto languages like C and C++ without much groundwork. Rust kinda has an idea there but its really alien to everything I saw in my entire 20+ career with C++. I am not going to try to explain it here to not get downvoted into oblivion. Just want you to think that threading has to be natural in any language targeting multicore hardware.
2) "optimization" is not optional. Languages also will have to deal with strict aliasing and UB catastrophes. Compilers became real AGI of the industry. There are no smart developers outsmarting optimizing compilers anymore. You either with the big compilers on optimization or your language performance is not relevant. Providing even some ways to control optimization is something sorely missed every time everything goes boom with a minor compiler update.
3) "hardware". If you need performance you need to go back to hardware not hide from it further behind abstract machines. C and C++ lack real control of anything hardware did since 1985. Performant code really needs to be able to have memory pages and cache lines and physical layout controls of machine code. Counter arguments that these hardware things are per platform and therefore outside of language are not really helping. Because they need to be per platform and available in the language.
4) "libc" is a problem. Most of it being used in newly written code has to be escalated straight to bug reporting tool. I used to think that C++ stl was going to age better but not anymore. Assumptions baked into old APIs are just not there anymore.
I guess it does not sound helpful or positive for any new language to deal with those things. I am pretty sure we can kick all those cans down the road if our goal is to keep writing software compatible with PDP that somehow limps in web browser (sorry bad attempt at joking).
> Just want you to think that threading has to be natural in any language targeting multicore hardware.
Parallel execution, and thus parallel programming, will never be natural to any human being. We don't do it, and we can't think it, except by using various cognitive props (diagrams, objects) to help us. You cannot make it natural no matter how strongly you desire it.
Now, there is a different sort of "natural" which might mean something more like "idiomatic relative to other language forms and patterns", and that's certainly a goal that can be widely missed or closely approximated.
Exactly the kind of thoughts and insights I need from more of the users. Thank you for pointing out many concerns.
> Headers.
C++20 modules are left unstable and unused in the major compilers, but they're a standard. And C is ironically perfect for FFI; as I said, almost every programming language speaks C: Rust's WebAssembly API is `extern "C"`, there's JNI in Java and bindings in every scripting language, and even Go, which talks to the OS solely via the syscall ABI, can only make foreign-function calls through cgo. C has been more than just an application/systems language for some sad decades.
> Big elephants.
Since I was in the zoo watching tigers:
Mostly three groups of people are served by a language: application writers, library writers, and compiler writers (the language itself).
I narrowed it down and started "small" to see if people writing programs that cross kernel and user space would have more thoughts about C, since it's the only choice there. That's also my job: I've made a distributed block device (an AWS EBS replacement) using SPDK, a distributed filesystem (a Ceph FS replacement) using FUSE, and a packet-introspection module in a router using DPDK. I know how it feels.
As for the elephants you mentioned, I see them fitting more into general library and application development, so here we go:
> Threading.
Async Rust is painful: Send + Sync + Pin, long signatures full of trait bounds, no async runtime in the standard library, endless battles among 3rd-party runtimes.
I would prefer Go for such problems. Not saying goroutines and channels are perfect (stackful is officially the only choice, and when goroutine stacks somehow become memory-intensive, going stackless is only possible with 3rd-party event loops), but the built-in deadlock and race detection win a lot here. So it just crashes on violations and loops on unknown deadlocks. I would probably go in this direction.
> Optimization, hardware.
I quite don't understand why these concerns are "concerns" here.
It's the mindset of having more known-safe parts in C, like a disallow list, rather than living under a strong set of rules like Rust's allowlist (mark `unsafe` to be nasty). Not making everything reasonable, safe, and generally smart, which is surreal.
C is still, ironically again, the best language for beating assembly on optimized performance, if you know these stories:
- They recently got around a 30% speedup of the CPython interpreter in v3.14.
- The technique was already known 3 years earlier and applied in LuaJIT-Remake, where they remade a Lua interpreter that beats the original handwritten-assembly version, even without inline caching.
- Sub-techniques of it have existed for more than a decade, e.g. in the Haskell LLVM target, and theoretically they existed before C was born.
It is essentially just an approach to matching what the real abstract machine underneath looks like.
> libc.
Like I said, C is more than a language. People need to swap in new allocator algorithms on top of malloc/free; Rust quit using jemalloc by default and just uses malloc instead. Libc is somewhat of a weird de facto interface.
I guess I need to illustrate my points a bit, because I never needed to poke at kernels and my concerns come mostly from large games. I am trying to imagine writing large games in your language, so please bear with me for a moment.
>Modules
Nobody plans to provide other interfaces to OSes/middleware/large established libraries. The economics are just not there.
>Threading
I was not talking about I/O at all. All of what you mention will be miles better in any high-level language, because waiting can be done in any language. Using threads for computation-intensive things is a niche for low-level languages. I would go further and say that copying stuff around and mutexes will also be fine in high-level languages.
>Optimization/Hardware
This is very important to me. I don't know how it could not be relevant to your plan of fixing a low-level language. Here go a couple of examples to try to shake things up.
The strlen implementation in glibc is not written in C. UB simply does not allow implementing the same algorithm, because reading up to the end of a memory page is outside the abstract machine. Also note how sanitizers are implemented to avoid checking the strlen implementation.
Pointer provenance is both present in every major compiler and impossible to define at the moment. You need to decide whether your language goes with the abstract machine, or gcc, or clang, or Linux; none of them agree on it. A good attempt to add a logical model of pointer provenance to the C standard has not produced any results. If you want to read up on that, there was an HN thread about it recently.
>libc
I am pretty sure I can't move you on that. Just consider platforms that require new APIs for everything and have horrendous 'never to be used' shims to be POSIX 'compatible'. You can compile legacy things there, but running them does not make sense. Games tend to run there just fine, because games used to write the relevant low-level code per platform anyway.
> Imagine writing large games in your language.
You don't. Read the features I listed. One ends up with a C alternative frontend (Cfront, if you love bad jokes) with a type system like Zig's, and without any standard library. No hash tables, no vectors. You wouldn't write large games with this.
Like I said, there are three main groups of users; if you're concerned about application writing, ask about that. The rest of the comments talked about possible directions for langdev.
> Modules.
You write C++ and don't know what a standard is. Motivating examples, real-world problems (full and incremental compilation, a better compilation cache instead of precompiled headers), decades spent on discussions. The economics will come for projects with modern C++ features.
> Threading.
If you know Rust and Go, talk about them more. Go creates tasks and uses futexes, over the bare-bones syscall ABI. The higher-level primitives are easy to use, and the tools and runtime are friendly to debugging.
I've written Go components with channels that ran faster than atomics with waits, in a distributed filesystem's metadata server.
On CPU-intensive work, I would talk about things like automatic vectorization, smarter boxing/unboxing, and smarter memory layout (aka levity, e.g. AoS vs SoA). Not a threading niche.
> Strlen implementation and plan of low level programming.
Because I keep talking about designing a general purpose language. One can also use LLVM IR to implement such algorithms.
The design space here is to write these if necessary. Go source code is full of assembly.
> Pointer provenance.
Search for Andras Kovacs' implementation of 2LTT at ICFP 2024 (he actually finished it in 2022), and his dtt-rtcg; you would realize how trivially these features could be implemented "for a new language". I design new languages.
> libc.
Like I said, your happy new APIs invoke malloc.
Good luck with metaprogramming. It looks cool.
No worries, I got your message about the target audience the first time. It's just that language development is an area where I've done some things, and langdev is an open-ended problem. I wish I could express games' needs without wasting time on things games don't care about.
> > Optimization, hardware.
> Quite don’t understand why these concerns are “concerns” here.
One of the most frustrating things about C is that it is generally taught together with assembly, so there is a general conflation between C and assembly, as if C were both "just" some sort of portable assembler and the unique language with that property. The main consequence is that the C abstract machine [1] tends to be assumed to be the model of how processors work, and this ends up creating a lot of friction where the C abstract machine just doesn't match hardware newer than about 40 years old. It can be a little hard to understand just how bad the friction is if you haven't personally run across it, but here are a few examples:
* Registers. C doesn't have a concept of registers [2], and there's not much of an easy way to distinguish between "things that look like a load/store because the abstract machine assumes everything has a memory location" and "no, this is meant to actually issue a hardware load/store" or "this is meant to actually live permanently in a register." There's also minor stuff, like the fact that the language makes it easier to express A[i] (a load from A + i * sizeof(*A)) than &A[i] (just the address A + i * sizeof(*A), no load), which is somewhat annoying if you want to express assembly concepts better.
* SIMD vectors. These are pretty common (at least across desktop, server, and mobile CPUs and GPUs). But C has no way of expressing these types or how to use them, outside of compiler extensions (and there are something like three incompatible versions of those).
* There's a lack of concept of optimization, and concomitant issues like optimization barriers. Some things have slowly moved in (e.g., there's now an attribute to indicate a function call is speculatable), but in general, it's still difficult to tell the compiler to stop doing some optimization that might break your code.
* No hardware speculation barrier concept, and similar other barriers for more exotic concepts like operations depending on the path condition of the function call (cryptographic code, which wants to be constant-time, or SIMT code tends to care about that a lot more).
[1] Or at least what people assume the semantics of the abstract machine are. Let's be frank, the C userbase isn't very good at actually knowing what the C standard does and doesn't guarantee.
[2] Yes, I know about the register keyword. No, it doesn't give C a meaningful concept of registers.
> Async Rust is painful
On the other hand, I've found normal threading in Rust quite simple (generally using a thread pool).
That's true!
Sorry that I didn't much clarify the "pain" though:
It's quite like the experience of using parser combinators in Rust, where you can happily define the grammar and the parsing actions using the existing utilities. But once you have to do some simple wrapping, e.g. make a combinator called `parenthesized` that surrounds an expression with parentheses, the "pain" kicks in: you have to spell out as many trait bounds as possible, since wiring up the type annotations becomes tedious. That came up while I was using a framework like `winnow`.
Async Rust kinda shares similar characteristics: utility functionality can bring in a lot of "type wiring" that could terrify some people (well, I love it though).
Kind of along these lines but for C++: https://docs.carbon-lang.dev/
Thanks for the mention! I heard about Carbon years ago but I'm happy this time I could dig it further for insights now.
It's pretty fun to think that "Carbon is to C++ as Kotlin is to Java". One very important takeaway from all the discussions here is that I cannot ship a language straight to the target (small) community, as it's impossible to control how people decide to use the language. Which means I have to focus a lot on improving the experience of application writing. Carbon would definitely be one of the inspirations.
Oh yeah, and I don't need to handle seamless integration with templates; I'm lucky.
Isn't all C also valid C++? So this would apply?
No: https://en.wikipedia.org/wiki/Compatibility_of_C_and_C%2B%2B
There's the small stuff like "class" is a keyword in C++ so not a valid variable name.
There's the fact that C has continued to evolve so there are new C features that haven't made it into C++ yet (VLAs).
There's stuff that has been implemented differently in both in mostly compatible but sometimes observably different ways (e.g. the types of character and boolean literals)
Interesting, I wonder why I thought that then.
People have been trying this for decades. It's always failed. You can't retrofit safety onto C without breaking compatibility, and if someone is willing to break compatibility they've already switched to Rust.
A C3-to-C compiler could be a proposal.
Ah, that should be good for source-level compatibility. But I'm thinking about extending existing codebases that cross kernel and user space, e.g. DPDK, SPDK, FUSE, kernel modules, etc. I'm curious how C3 would be adopted in such projects.
Start small.
And then? https://github.com/anqurvanillapy/TinyLean
Very, very interesting to me. I always wanted to do something similar for Maude in Go (Python is not a bad choice).
Currently my focus is on data engineering, but I can use it as an inspiration.
I was talking about a C3-to-C translator; that's what I meant by "start small".
This is a dream come true. Please do it, for the love of mankind.
Thanks sooo much! I would definitely do it!
> So unlike a few `unsafe` in a safe Rust, I want something like a few "safe" in an ambient "unsafe" C dialect. But I'm not saying "unsafe" is good or bad, I'm saying that "don't talk about unsafe vs safe", it's C itself, you wouldn't say anything is "safe" or "unsafe" in C.
Eh?
The critical criterion is "does your language make it difficult to write accidental RCEs". There's huge resistance to changing language at all, as we can see from the kernel mailing lists, so in order to go through the huge social pain of encouraging people to use a different language it's got to offer real and significant benefits.
Lifetimes are a solution to memory leaks and use-after-free. Other solutions may exist.
Generics: Go tried to resist generics. It was a mistake. You need to be able to do Container<T> somehow. Do you have an opinion on the dotnet version of generics?
(You mention Ceph: every time I read about it I'm impressed, in that it seems an excellent solution to distributed filesystems, and yet I don't see it mentioned all that often. I'm glad it's survived)
> Offer real and significant benefits.
Yeah! The criterion is just there like you said.
> Do you have an opinion on the dotnet version of generics?
I'm not familiar with the dotnet languages, but I have a lot of experience implementing generics, especially in how to avoid ending up with the poor C++ templates. Also, I believe traits (not C++'s) and typeclasses (Rust/Haskell), which are already proven, are definitely needed to pair with the generics feature.
> Ceph.
My former job was writing a distributed object storage in Go, and it was heavily inspired by Ceph. A subset of the design could bring a lot of benefit to a smaller company, since the initial idea for writing a new one came from the pain of doing rebalancing in Ceph... oh, you want to know the solution? Manual partitioning and migration by our SRE/Ops team, lol. It's unexpectedly effective.
The problem with existing attempts to fix C, like Cyclone, is that they create a new language, but what we really want is C, with improvements. The approach should not be to make a new language, but to augment C with optional new features which can be applied incrementally to existing codebases to improve them.
You should start with a plain old C compiler and add the features you want in ways that fully preserve backward compatibility. Code written with these new features should compile with existing C compilers, not only your own, without changing any semantics. Using an existing compiler rather than yours would just mean not taking advantage of the features you add.
To give an example, let's say you want to augment pointers with some kind of ownership semantics that your compiler can statically check. We can add some type qualifiers along the lines of `restrict`.
We could make `_Owned` and `_Shared` keywords in the dialect compiled by your compiler, but we need the code to still work with an existing compiler. To fix this we can simply define them to mean nothing. Now when you compile with your compiler, it can check that you are not performing a use-after-move, but if you're compiling with an existing compiler, the code will still compile and the checks just won't be done. An alternative syntactic representation could use `[[attributes]]`, which are now part of the C standard, but attributes can only appear in certain places, whereas symbols defined by the preprocessor can appear anywhere.
---
An example of good retrofitting is C#'s addition of non-nullable reference types. Non-nullability is optional, but can be made the default. When not enabled globally, it can be expressed explicitly with `X!`. We can gradually annotate an existing codebase to use non-nullable references, and once we have updated the full translation unit we can enable them by default globally, so that `X` means `X!` instead of `X?`. This approach lets us gradually improve a codebase without having to rewrite it to use the new feature all at once.
Contrast this with Cyclone, which required you to update the full translation unit before the Cyclone compiler could utilize non-nullable types.
If we were to add non-nullable pointers to C, we could take an approach like the above, where we have `void * _Nullable` and `void * _Notnull`, with the default setting for a translation unit provided with a `#pragma` - meaning `void *` without any annotation would default to nullable, but when the pragma is set, they become not-null by default. If, eventually you convert a whole codebase to using non-nullable pointers, you could enable it globally with a compiler switch and omit the pragmas, and from that point onward you would have to explicitly mark pointers that may be null with `_Nullable`.
---
An additional advantage of approaching it this way is that you can focus on the front-end facing features and leave the optimization to an existing compiler.
IMO this is the only sane approach to retrofitting C. You need to be a compatible superset of C. You also need ABI compatibility, because C is the lingua franca other languages use to communicate with the OS.
I also think the C committee should stop trying to add new features into the standard until they've been proven in practice. While many of the proposals[1], such as to clean up various parts of the specification (Slay some earthly demons), are worthwhile, there are some contributors who propose adding X, Y, Z, without an actual implementation of them that can be experimented with, like they're competing with each other to get their pet feature into the standard.
What would be ideal would be if we could take some C26 code and compile it with a C23 compiler, because they added features in ways like the above, where they give additional meaning to the new compiler, but are just annotations that perform no function when compiled with an old compiler.
New features should be implemented and utilized before being considered for standardization. Let various ideas compete and let the best ones win, because prematurely adding features just piles more and more technical debt into the language, and makes it more difficult to add improvements further down the line.
[1]:https://www.open-std.org/jtc1/sc22/wg14/www/docs/?C=M;O=D
I love your gradual approach pretty much. It sounds like gradual typing but not just the typing part.
I used to make a lot of tools with the libclang Python bindings to automate some refactoring chores. I don't remember whether one could expand macros with it, and I was told that the lexer and preprocessor are tangled together in Clang, so it could be quite hard to extend the existing framework.
I would definitely go into this direction, but for now it just looks like some final boss, let me finish some early quests.
I'd recommend using goblint (https://github.com/goblint) as a starting point, rather than Clang.
If the goal is something that can be used to improve existing C code, I have a few thoughts.
To get to memory safety with C:
- Add support for array bounds checking. Ideally with the compiler doing the heavy lifting and proving to itself that most runtime bounds checks are unnecessary.
- Implement trivial dependent types so the compiler can know about the array size field that is passed next to a pointer. AKA
void do_something(size_t size, entry_t ptr[size]);
- Enforce the restrict keyword. This is actually the tricky bit. I have some ideas for a language that is not C, but making it backwards compatible is beyond where I have gotten. My hint is separation logic.
- Allow types to change safely. So that free() can change the type of the pointer passed to it, to be a non-dereferencable pointer (whatever bits it has).
This is an idea from separation logic.
Allowing functions to change types of data safely could also be a safe solution to code that needs type punning today.
I think conceptually modules are great, but if your goal is source compatible changes that bring memory safety then something like modules is an unnecessary distraction.
Any changes that ultimately cannot be implemented in a stock C compiler will not, I think, be preferable to just rewriting the code in a more established language like Rust.
On the other hand I think we are in a local maxima with programming languages and type systems. With everyone busy recombining proven techniques in different ways instead of working on the hard problem of how to have assignment, threading, and memory safety. Plus how to do proofs of interesting program properties with things like asserts.
Unfortunately it appears that only through proof can programs be consistent enough that specific security concerns can be said to not be problems.
What I have seen of Ada SPARK lately has been very tantalizing.
I have a personal project in which I think I have solved the memory-safety problem while still allowing manual memory management and assignment. Unfortunately I am at a stage where everything is mostly clear in my head, but I haven't finished fleshing it out and proving the type system, so I really can't share it yet :-(.
While implementing modules, memory safety, type variables, and functions that can change the types of their pointer arguments, I think I will end up with something simpler than C in most respects.
I keep going "well, that doesn't make any sense today" as I work through all of the details and ask why each thing is done the way it is.
One of those questions is why doesn't C use modules.
> On the other hand I think we are in a local maxima with programming languages and type systems.
I think I gotcha. Oh no.
> But I haven't finished fleshing it out and proving the type system, so I really can't share it yet.
Is there any profile I could follow, to wait for this to happen some day? I've made several pen-and-paper attempts at some of the problems you just mentioned, and further ones would be "strict aliasing", "pointer provenance", and many others that appeared in all the discussions here. It feels like you've already done a lot of work, but I can't find your profile anywhere.