> resulting VM outperforms both my previous Rust implementation and my hand-coded ARM64 assembly
it's always surprising for me how absurdly efficient "highly specialized VM/instruction interpreters" are
like e.g. two independent research projects into how to have better (fast, more compact) serialization in rust ended up with something like a VM/interpreter for serialization instructions leading to both higher performance and more compact code size while still being cable of supporting similar feature sets as serde(1)
(in general monomorphisation and double dispatch (e.g. serde) can bring you very far, but the best approach is like always not the extrem. Neither allays monomorphisation nor dynamic dispatch but a balance between taking advantage of the strength of both. And specialized mini VMs are in a certain way an extra flexible form of dynamic dispatch.)
---
(1): More compact code size on normal to large project, not necessary on micro projects as the "fixed overhead" is often slightly larger while the per serialization type/protocol overhead can be smaller.
(1b): They have been experimental research project, not sure if any of them got published to GitHub, non are suited for usage in production or similar.
It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code
You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
This seems counterintuitive, so I googled it. Apparently, it boils down to instruction cache efficiency and branch prediction, largely. The best content I could find was this post, as well as some scattered comments from Mike Pall of LuaJIT fame:
Interestingly, this is also discussed on a similar blogpost about using Clang's recent-ish [[musttail]] tailcall attribute to improve C++ JSON parsing performance:
> It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code. You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
It is funny, but (like I’ve already mentioned[1] a few months ago) for serialization(-adjacent) formats in particular the preferential position of bytecode interpreters has been rediscovered again and again.
The earliest example I know about is Microsoft’s MIDL, which started off generating C code for NDR un/marshalling but very soon (ca. 1995) switched to bytecode programs (which Microsoft for some reason called “format strings”; these days there’s also typelib marshalling and WinRT metadata-driven marshalling, the latter completely undocumented, but both data-driven). Bellard’s nonfree ffasn1 also (seemingly) uses bytecode, unlike the main FOSS implementations of ASN.1. Protocol Buffers started off with codegen (burying Google user in de/serialization code) but UPB uses “table-driven”, i.e. bytecode, parsing[2].
The most interesting chapter in this long history is in my opinion Swift’s bytecode-based value witnesses[3,4]. Swift (uniquely) has support for ABI compatibility with polymorphic value types, so e.g. you can have a field in the middle of your struct whose size and alignment only become known at dynamic linking time. It does this in pretty much the way you expect[5] (and the same way IBM’s SOM did inheritance across ABI boundaries decades ago): each type has a vtable (“value witness”) full of compiler-generated methods like size, alignment, copy, move, etc., which for polymorphic type instances will call the type arguments’ witness methods and compute on the results. Anyways, here too the story is that they started with native codegen, got buried under the generated code, and switched to bytecode instead. (I wonder—are they going to PGO and JIT next, like hyperpb[6] for Protobuf? Also, bytecode-based serde when?)
You could probably rephrase almost any enum dispatch whatsoever as a "bytecode interpreter" of a sort, especially if run recursively to parse over some kind of sequence. If bytecode helps you achieve a more succinct representation of some program code than the native binary representation for your architecture, it makes sense that this could be faster in some cases.
Not in serde itself, but people have been experimenting with serde alternatives that are bytecode based. Nothing stable as far as I know, just experiments.
Another experiment is https://lib.rs/crates/facet which is a more general derives to generate compile time introspection to generate metadata tables approach.
Yeah, Clang's musttail and preserve_none make interpreter writing much simpler, just make yourself a guaranteed tail call opcode dispatch method (continuation passing style works a treat here), stitch those together using Copy-and-Patch and you have yourself a down and dirty jit compiler.
A new Go protobuf parser [1] made the rounds here eight months ago [2] with a specialized VM that outperforms the default generated protobuf code by 3x.
The article explains most of this, but the key takeaway for beginners once this lands is: With `become` you can write tail calls in Rust and it will promise they either work or don't compile, you can't have the case (which exists in several languages) where you thought you'd written a tail call but you hadn't (or maybe you had but you switched to a different compiler or made a seemingly inconsequential change to the code) and now the stack has overflowed.
Rust has been really good at providing ergonomic support for features we're too used to seeing provided as "Experts only" features with correspondingly poor UX.
"Either it works or doesn't compile" compared to "oops it silently degraded to the less performant thing because an invariant stopped being true" is remarkably similar to how I tend to describe the benefit of having move semantics by default with opt-in copies in Rust compared to in C++ where you need to set up things right and might still accidentally have it copy if you mess it up.
I like this change. I was wondering if I would've preferred to have something on the function signature (eg `tcc_fn foo() ...` as in Tail Call Constrained fn) and when encountering that the rust compiler would make checks about whether the body of the function is tail-callable.
My fear is that adding yet another keyword it might get lost in the sea of keywords that a Rust developer needs to remember. And if recursion is not something you do often you might not reach for it when actually needed.
Having this signal in the function signature means that people would be exposed to it just by reading the documentation and eventually will learn it exists and (hopefully) how to wield it.
The property we care about isn't a property of functions but of callers, so marking a function doesn't help.
`become blah(foo, bar)` is the same thing as `blah(foo, bar)` except that we, the caller are promising that we have nothing further to do and so when blah returns it can return to our caller.
If somebody else calls blah they don't want that behaviour, they might have lots more to do and if they were skipped over that's a disaster.
In some languages it's very obvious when you're going to get TCO anyway, but Rust has what C++ calls RAII, when a function ends all the local variables get destroyed and this may be non-trivial work. Presumably destroying a local i32 is trivial & so is a [u8; 32] but destroying a local String isn't, let alone a HashMap and who knows how complicated it is to destroy a File or a DBQuery or a Foo...
So in a sense "all" become does is try to bring that destruction sooner a little, so it happens before the call, leaving nothing to do afterwards. We weren't using that String any more anyway, lets just destroy it first, and the HashMap? Sure, and oh... no, actually if we destroy that Foo before calling blah which needs the Foo that messes things up... Rust's borrowck comes in clutch here to help us avoid a terrible mistake, our code was nonsense, it doesn't build.
Given that it's not really that uncommon to see something like `pub(crate) async fn foo() ...`, the concern of function signatures starting to get unwieldy feels a lot more concrete than hypotheticals about a "sea of keywords". From looking at the list of keywords in the language currently (listed here: https://doc.rust-lang.org/std/#keywords), I don't really see a whole lot that I think the average Rust programmer would say is a burden to have to remember. At most, `union` and `unsafe` are probably ones that most Rust programmers are not going to need to use directly, and `SelfTy` might look a bit confusing at first due to the fact that the keyword itself is spelled `Self` (and presumably listed in that way to make it easier to differentiate from the `self` entry in the documentation), but even including those I'd argue that probably over half aren't specific to Rust.
As for being in the documentation, I'd argue that might even be an explicit non-goal; it's not clear to me why that should be something considered part of the API. If someone updates their tail recursive function to manually loop instead (or vice-versa), it doesn't seem like that should necessitate the documentation being updated.
I'd actually say that for people learning Rust after something like C or C++ in particular the rare cases where a keyword means something else are the most confusing. In particular `const` in Rust means constant whereas in several languages it means an immutable variable.
const NINE: i32 = // Some arbitrary *constant* expression;
In K&R C this qualifier didn't exist so there's no confusion, but C89, all versions of C++ and several other languages inspired by them use "const" to mean an immutable variable.
I was wondering how they solved the `drop` problem (the fact that `return foo;` is not the last code executed in most functions, because Rust values are all implicitly dropped at the end of scope), and it seems that they cut the Gordian knot so that values are all implicitly dropped before `become` instead. Hope that works out - probably why it's still nightly for now.
What are some examples of macros that your would be able to be written with tail cails? Because macros in Rust can already be recursive (and I've written plenty of ones that take advantage of it over the years), it's not immediately obvious what doors better optimization of tail calls in Rust would open up for them.
I'm not sure how this would be useful in Rust, but macros and tail calls are what allows one to (for example) write iterative loops in Scheme, which doesn't have a native loop syntax.
Maybe the same idea can be used in Rust where some constructs are easier to write in recursive form instead of a loop?
In any case, here's a silly example of a `for-loop` macro in Scheme using a tail call:
(define-syntax for-loop
(syntax-rules ()
((for-loop var start end body ...)
(letrec ((loop (lambda (var)
(unless (>= var end)
body ...
(loop (+ var 1)))))) ; <-- tail call
(loop start)))))
And here's how you'd use it to print the numbers 0 to 9:
(for-loop i 0 10
(display i)
(newline))
This macro expands to a function that calls itself to loop. Since Scheme is guaranteed to have proper tail calls, the calls are guaranteed to not blow the stack.
(Note that you'll probably never see a `letrec` used like this: people would use a named `let`, which is syntax sugar for that exact usage of `letrec`. I wrote it the `letrec` way to make the function explicit).
Last year I was working on a tail-call interpreter (https://github.com/anematode/b-jvm/blob/main/vm/interpreter2...) and found a similar regression on WASM when transforming it from a switch-dispatch loop to tail calls. SpiderMonkey did the best with almost no regression, while V8 and JSC totally crapped out – same finding as the blog post. Because I was targeting both native and WASM I wrote a convoluted macro system that would do a switch-dispatch on WASM and tail calls on native.
Ultimately, because V8's register allocation couldn't handle the switch-loop and was spilling everything, I basically manually outlined all the bytecodes whose implementations were too bloated. But V8 would still inline those implementations and shoot itself in the foot, so I wrote a wasm-opt pass to indirect them through a __funcref table, which prevented inlining.
One trick, to get a little more perf out of the WASM tail-call version, is to use a typed __funcref table. This was really horrible to set up and I actually had to write a wasm-opt pass for this, but basically, if you just naively do a tail call of a "function pointer" (which in WASM is usually an index into some global table), the VM has to check for the validity of the pointer as well as a matching signature. With a __funcref table you can guarantee that the function is valid, avoiding all these annoying checks.
Based on looking at V8's JITed code, there seemed to be a lot of overhead with stack overflow checking, actually. The function prologues and epilogues were just as bloated in the tail-call case. I'll upload some screenshots if I can find them.
Speaking of `become`, I implemented a Copy-And-Patch JIT in Rust just by using this `become` feature too, after reading some articles about how to generate the stencils. I'm still fixing the code but I can release it as some kind of tech demo.
I wonder why they went with a new keyword; I assumed the compiler would opportunistically do TCO when it thinks it's possible, and I figured that the simplest way to require TCO (or else fail compilation) could be done with an attribute.
(Not sure if the article addressed that... I only skimmed it.)
> Even in a release build, the compiler has not optimized out the stack. As we execute more and more operations, the stack gets deeper and deeper until it inevitably overflows.
I feel like I’ve seen elsewhere that the argument there is that you often must have this optimisation working in algos that rely on it or you will get stack overflows. Having a keyword to force it then becomes a very useful thing, vs relying on hopes that future compiler versions and different arch targets will all discover the optimisation opportunity.
> Having a keyword to force it then becomes a very useful thing, vs relying on hopes that future compiler versions and different arch targets will all discover the optimisation opportunity.
Having a way to guarantee TCO/TCE is essential for some cases, yes. GP's question, though, was why a keyword specifically and not a hypothetical attribute that effectively does the same thing.
A keyword seems nicer to me. I think the only reason to use attributes is to avoid the work of adding actual new syntax, but seeing as they've already done that...
> I wonder why they went with a new keyword; I assumed the compiler would opportunistically do TCO when it thinks it's possible, and I figured that the simplest way to require TCO (or else fail compilation) could be done with an attribute.
The first RFC for guaranteed tail calls stated an attribute on `return` was a possible alternative "if and when it becomes possible to attach attributes to expressions" [0]. That was from pre-1.0, though; I believe Rust now supports attributes on at least some expressions, but I don't know when that was added.
The second RFC [1] doesn't seem to discuss keyword vs. attribute, but it does mention that the proof-of-concept implementation "parses `become` exactly how it parses the `return` keyword. The difference in semantics is handled later", so perhaps a keyword is actually simpler implementation-wise?
There's some more discussion on attribute vs. keyword starting here [2], though the attribute being discussed there is a function-level attribute rather than something that effectively replaces a `return`. The consensus seems to be that a function-level attribute is not expressive enough to support the desired semantics, at least. There's also a brief mention of `become` vs. `return` (i.e., new keyword because different semantics).
> does this work well with async or is it purely sync tail calls for now?
The current RFC generally does not allow `become` to be used with `async` for now [0]:
> Tail calling from async functions or async blocks is not allowed. This is due to the high implementation effort as it requires special handling for the async state machine. This restriction can be relaxed by a future RFC.
> Using `become` on a `.await` expression, such as `become f().await`, is also not allowed. This is because `become` requires a function call and `.await` is not a function call, but is a special construct.
> Note that tail calling async functions from sync code is possible but the return type for async functions is `impl Future`, which is unlikely to be interesting.
More accurate title would be to say it is a tail call optimized interpreter. Tail calls alone aren't special b/c what matters is that the compiler or runtime properly reuses caller's frame instead of pushing another call frame & growing the stack.
Maybe, it probably depends on how you're looking at it. The optimization is obvious, I expect any optimizing compiler will TCO all naive tail calls - but the trouble in Rust or C++ or a dozen other languages is that you can so easily write code which you think can be optimized but the compiler either can't see how or can see that it's not possible and (without this keyword) you don't find out about this because growing the stack is a valid implementation of what you wrote even though it's not what you meant.
The "become" keyword allows us to express our meaning, we want the tail call, and, duh, of course the compiler will optimize that if it can be a tail call but also now the compiler is authorized to say "Sorry Dave, that's not possible" rather than grow the stack. Most often you wrote something silly. "Oh, the debug logging happens after the call, that's never going to work, I will shuffle things around".
I wouldn't call it optimized, since that implies that it gains performance due to the tail calls and would work otherwise, but the tail calls are integral to the function of the interpreter. It simply wouldn't work if the compiler can't be forced to emit them.
> Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail-call elimination or tail-call optimization. (https://en.wikipedia.org/wiki/Tail_call)
Questioning standard nomenclature is useful too, as long as it provides insight and is not just bike-shedding. "optimization" (in the context of an optimizing compiler) is generally expected not to alter the semantics of a program.
> but the tail calls are integral to the function of the interpreter
Not really, a trampoline could emulate them effectively where the stack won't keep growing at the cost of a function call for every opcode dispatch. Tail calls just optimize out this dispatch loop (or tail call back to the trampoline, however you want to set it up).
> resulting VM outperforms both my previous Rust implementation and my hand-coded ARM64 assembly
it's always surprising for me how absurdly efficient "highly specialized VM/instruction interpreters" are
like e.g. two independent research projects into how to have better (fast, more compact) serialization in rust ended up with something like a VM/interpreter for serialization instructions leading to both higher performance and more compact code size while still being cable of supporting similar feature sets as serde(1)
(in general monomorphisation and double dispatch (e.g. serde) can bring you very far, but the best approach is like always not the extrem. Neither allays monomorphisation nor dynamic dispatch but a balance between taking advantage of the strength of both. And specialized mini VMs are in a certain way an extra flexible form of dynamic dispatch.)
---
(1): More compact code size on normal to large project, not necessary on micro projects as the "fixed overhead" is often slightly larger while the per serialization type/protocol overhead can be smaller.
(1b): They have been experimental research project, not sure if any of them got published to GitHub, non are suited for usage in production or similar.
It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code
You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
This seems counterintuitive, so I googled it. Apparently, it boils down to instruction cache efficiency and branch prediction, largely. The best content I could find was this post, as well as some scattered comments from Mike Pall of LuaJIT fame:
https://sillycross.github.io/2022/11/22/2022-11-22/
Interestingly, this is also discussed on a similar blogpost about using Clang's recent-ish [[musttail]] tailcall attribute to improve C++ JSON parsing performance:
https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
> It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code. You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
It is funny, but (like I’ve already mentioned[1] a few months ago) for serialization(-adjacent) formats in particular the preferential position of bytecode interpreters has been rediscovered again and again.
The earliest example I know about is Microsoft’s MIDL, which started off generating C code for NDR un/marshalling but very soon (ca. 1995) switched to bytecode programs (which Microsoft for some reason called “format strings”; these days there’s also typelib marshalling and WinRT metadata-driven marshalling, the latter completely undocumented, but both data-driven). Bellard’s nonfree ffasn1 also (seemingly) uses bytecode, unlike the main FOSS implementations of ASN.1. Protocol Buffers started off with codegen (burying Google user in de/serialization code) but UPB uses “table-driven”, i.e. bytecode, parsing[2].
The most interesting chapter in this long history is in my opinion Swift’s bytecode-based value witnesses[3,4]. Swift (uniquely) has support for ABI compatibility with polymorphic value types, so e.g. you can have a field in the middle of your struct whose size and alignment only become known at dynamic linking time. It does this in pretty much the way you expect[5] (and the same way IBM’s SOM did inheritance across ABI boundaries decades ago): each type has a vtable (“value witness”) full of compiler-generated methods like size, alignment, copy, move, etc., which for polymorphic type instances will call the type arguments’ witness methods and compute on the results. Anyways, here too the story is that they started with native codegen, got buried under the generated code, and switched to bytecode instead. (I wonder—are they going to PGO and JIT next, like hyperpb[6] for Protobuf? Also, bytecode-based serde when?)
[1] https://news.ycombinator.com/item?id=44665671, I’m too lazy to copy over the links so refer there for the missing references.
[2] https://news.ycombinator.com/item?id=44664592 and parent’s second link.
[3] https://forums.swift.org/t/sr-14273-byte-code-based-value-wi...
[4] Rexin, “Compact value witnesses in Swift”, 2023 LLVM Dev. Mtg., https://www.youtube.com/watch?v=hjgDwdGJIhI
[5] Pestov, McCall, “Implementing Swift generics”, 2017 LLVM Dev. Mtg., https://www.youtube.com/watch?v=ctS8FzqcRug
[6] https://mcyoung.xyz/2025/07/16/hyperpb/
You could probably rephrase almost any enum dispatch whatsoever as a "bytecode interpreter" of a sort, especially if run recursively to parse over some kind of sequence. If bytecode helps you achieve a more succinct representation of some program code than the native binary representation for your architecture, it makes sense that this could be faster in some cases.
> Also, bytecode-based serde when?
Not in serde itself, but people have been experimenting with serde alternatives that are bytecode based. Nothing stable as far as I know, just experiments.
One early experiment is described in https://sdr-podcast.com/episodes/a-different-serde/ , which used a FORTH inspired stack based VM generated by derive macros at compile time (iirc).
Another experiment is https://lib.rs/crates/facet which is a more general derives to generate compile time introspection to generate metadata tables approach.
Yeah, Clang's musttail and preserve_none make interpreter writing much simpler, just make yourself a guaranteed tail call opcode dispatch method (continuation passing style works a treat here), stitch those together using Copy-and-Patch and you have yourself a down and dirty jit compiler.
A new Go protobuf parser [1] made the rounds here eight months ago [2] with a specialized VM that outperforms the default generated protobuf code by 3x.
[1]: https://mcyoung.xyz/2025/07/16/hyperpb/
[2]: https://news.ycombinator.com/item?id=44591605
I missed that, that could come in handy. Thanks!
Finally! Tail calls! I had to write rust some years ago, and the ocaml person in me itched to get to write tail recursion.
Tail recursion opens up for people to write really really neat looping facilities using macros.
Rust has the become keyword now I believe for TCO.
https://doc.rust-lang.org/std/keyword.become.html
From the first line of the post:
> Last week, I wrote a tail-call interpreter using the become keyword, which was recently added to nightly Rust (seven months ago is recent, right?).
The article explains most of this, but the key takeaway for beginners once this lands is: With `become` you can write tail calls in Rust and it will promise they either work or don't compile, you can't have the case (which exists in several languages) where you thought you'd written a tail call but you hadn't (or maybe you had but you switched to a different compiler or made a seemingly inconsequential change to the code) and now the stack has overflowed.
Rust has been really good at providing ergonomic support for features we're too used to seeing provided as "Experts only" features with correspondingly poor UX.
"Either it works or doesn't compile" compared to "oops it silently degraded to the less performant thing because an invariant stopped being true" is remarkably similar to how I tend to describe the benefit of having move semantics by default with opt-in copies in Rust compared to in C++ where you need to set up things right and might still accidentally have it copy if you mess it up.
I like this change. I was wondering if I would've preferred to have something on the function signature (eg `tcc_fn foo() ...` as in Tail Call Constrained fn) and when encountering that the rust compiler would make checks about whether the body of the function is tail-callable.
My fear is that adding yet another keyword it might get lost in the sea of keywords that a Rust developer needs to remember. And if recursion is not something you do often you might not reach for it when actually needed. Having this signal in the function signature means that people would be exposed to it just by reading the documentation and eventually will learn it exists and (hopefully) how to wield it.
What do you folks think?
The property we care about isn't a property of functions but of callers, so marking a function doesn't help.
`become blah(foo, bar)` is the same thing as `blah(foo, bar)` except that we, the caller are promising that we have nothing further to do and so when blah returns it can return to our caller.
If somebody else calls blah they don't want that behaviour, they might have lots more to do and if they were skipped over that's a disaster.
In some languages it's very obvious when you're going to get TCO anyway, but Rust has what C++ calls RAII, when a function ends all the local variables get destroyed and this may be non-trivial work. Presumably destroying a local i32 is trivial & so is a [u8; 32] but destroying a local String isn't, let alone a HashMap and who knows how complicated it is to destroy a File or a DBQuery or a Foo...
So in a sense "all" become does is try to bring that destruction sooner a little, so it happens before the call, leaving nothing to do afterwards. We weren't using that String any more anyway, lets just destroy it first, and the HashMap? Sure, and oh... no, actually if we destroy that Foo before calling blah which needs the Foo that messes things up... Rust's borrowck comes in clutch here to help us avoid a terrible mistake, our code was nonsense, it doesn't build.
Edited: Improve explanation
Given that it's not really that uncommon to see something like `pub(crate) async fn foo() ...`, the concern of function signatures starting to get unwieldy feels a lot more concrete than hypotheticals about a "sea of keywords". From looking at the list of keywords in the language currently (listed here: https://doc.rust-lang.org/std/#keywords), I don't really see a whole lot that I think the average Rust programmer would say is a burden to have to remember. At most, `union` and `unsafe` are probably ones that most Rust programmers are not going to need to use directly, and `SelfTy` might look a bit confusing at first due to the fact that the keyword itself is spelled `Self` (and presumably listed in that way to make it easier to differentiate from the `self` entry in the documentation), but even including those I'd argue that probably over half aren't specific to Rust.
As for being in the documentation, I'd argue that might even be an explicit non-goal; it's not clear to me why that should be something considered part of the API. If someone updates their tail recursive function to manually loop instead (or vice-versa), it doesn't seem like that should necessitate the documentation being updated.
I'd actually say that for people learning Rust after something like C or C++ in particular the rare cases where a keyword means something else are the most confusing. In particular `const` in Rust means constant whereas in several languages it means an immutable variable.
In K&R C this qualifier didn't exist so there's no confusion, but C89, all versions of C++ and several other languages inspired by them use "const" to mean an immutable variable.
Clojure’s “recur” form works similarly in that it’s guaranteed to be tail recursive, and if it isn’t the compiler will throw an exception.
But I guess it's only works for functions that just call themselves? That's nice, but a very limited subset of TCO.
I was wondering how they solved the `drop` problem (the fact that `return foo;` is not the last code executed in most functions, because Rust values are all implicitly dropped at the end of scope), and it seems that they cut the Gordian knot so that values are all implicitly dropped before `become` instead. Hope that works out - probably why it's still nightly for now.
What are some examples of macros that your would be able to be written with tail cails? Because macros in Rust can already be recursive (and I've written plenty of ones that take advantage of it over the years), it's not immediately obvious what doors better optimization of tail calls in Rust would open up for them.
I'm not sure how this would be useful in Rust, but macros and tail calls are what allows one to (for example) write iterative loops in Scheme, which doesn't have a native loop syntax.
Maybe the same idea can be used in Rust where some constructs are easier to write in recursive form instead of a loop?
In any case, here's a silly example of a `for-loop` macro in Scheme using a tail call:
And here's how you'd use it to print the numbers 0 to 9:
This macro expands to a function that calls itself to loop. Since Scheme is guaranteed to have proper tail calls, the calls are guaranteed to not blow the stack.
(Note that you'll probably never see a `letrec` used like this: people would use a named `let`, which is syntax sugar for that exact usage of `letrec`. I wrote it the `letrec` way to make the function explicit).
I wrote this for (guile) scheme: https://rikspucko.koketteriet.se/bjoli/goof-loop
This is just sugar over tail calls.
Nice post :)
Last year I was working on a tail-call interpreter (https://github.com/anematode/b-jvm/blob/main/vm/interpreter2...) and found a similar regression on WASM when transforming it from a switch-dispatch loop to tail calls. SpiderMonkey did the best with almost no regression, while V8 and JSC totally crapped out – same finding as the blog post. Because I was targeting both native and WASM I wrote a convoluted macro system that would do a switch-dispatch on WASM and tail calls on native.
Ultimately, because V8's register allocation couldn't handle the switch-loop and was spilling everything, I basically manually outlined all the bytecodes whose implementations were too bloated. But V8 would still inline those implementations and shoot itself in the foot, so I wrote a wasm-opt pass to indirect them through a __funcref table, which prevented inlining.
One trick, to get a little more perf out of the WASM tail-call version, is to use a typed __funcref table. This was really horrible to set up and I actually had to write a wasm-opt pass for this, but basically, if you just naively do a tail call of a "function pointer" (which in WASM is usually an index into some global table), the VM has to check for the validity of the pointer as well as a matching signature. With a __funcref table you can guarantee that the function is valid, avoiding all these annoying checks.
The article shows WASM being 1.2-3.7x slower, and your experience confirms it.
Do you have any idea which operations regress the most?
Based on looking at V8's JITed code, there seemed to be a lot of overhead with stack overflow checking, actually. The function prologues and epilogues were just as bloated in the tail-call case. I'll upload some screenshots if I can find them.
Speaking of `become`, I implemented a Copy-And-Patch JIT in Rust just by using this `become` feature too, after reading some articles about how to generate the stencils. I'm still fixing the code but I can release it as some kind of tech demo.
Ah that's great!
I wonder why they went with a new keyword; I assumed the compiler would opportunistically do TCO when it thinks it's possible, and I figured that the simplest way to require TCO (or else fail compilation) could be done with an attribute.
(Not sure if the article addressed that... I only skimmed it.)
From the article:
> Even in a release build, the compiler has not optimized out the stack. As we execute more and more operations, the stack gets deeper and deeper until it inevitably overflows.
That touches on why TCO/TCE is desirable, but it doesn't address why the Rust devs chose to use a keyword for guaranteed TCE.
I feel like I’ve seen elsewhere that the argument there is that you often must have this optimisation working in algos that rely on it or you will get stack overflows. Having a keyword to force it then becomes a very useful thing, vs relying on hopes that future compiler versions and different arch targets will all discover the optimisation opportunity.
> Having a keyword to force it then becomes a very useful thing, vs relying on hopes that future compiler versions and different arch targets will all discover the optimisation opportunity.
Having a way to guarantee TCO/TCE is essential for some cases, yes. GP's question, though, was why a keyword specifically and not a hypothetical attribute that effectively does the same thing.
A keyword seems nicer to me. I think the only reason to use attributes is to avoid the work of adding actual new syntax, but seeing as they've already done that...
> I wonder why they went with a new keyword; I assumed the compiler would opportunistically do TCO when it thinks it's possible, and I figured that the simplest way to require TCO (or else fail compilation) could be done with an attribute.
The first RFC for guaranteed tail calls stated an attribute on `return` was a possible alternative "if and when it becomes possible to attach attributes to expressions" [0]. That was from pre-1.0, though; I believe Rust now supports attributes on at least some expressions, but I don't know when that was added.
The second RFC [1] doesn't seem to discuss keyword vs. attribute, but it does mention that the proof-of-concept implementation "parses `become` exactly how it parses the `return` keyword. The difference in semantics is handled later", so perhaps a keyword is actually simpler implementation-wise?
There's some more discussion on attribute vs. keyword starting here [2], though the attribute being discussed there is a function-level attribute rather than something that effectively replaces a `return`. The consensus seems to be that a function-level attribute is not expressive enough to support the desired semantics, at least. There's also a brief mention of `become` vs. `return` (i.e., new keyword because different semantics).
[0]: https://github.com/rust-lang/rfcs/pull/81/changes
[1]: https://github.com/DemiMarie/rfcs/blob/become/0000-proper-ta...
[2]: https://github.com/rust-lang/rfcs/pull/1888#issuecomment-278...
The current rfc is here: https://github.com/rust-lang/rfcs/pull/3407
It does have some more discussion on keywords vs attributes and other options.
The `become` keyword changes behavior: it drops variables before the return.
nice to see become landing in nightly. does this work well with async or is it purely sync tail calls for now?
> does this work well with async or is it purely sync tail calls for now?
The current RFC generally does not allow `become` to be used with `async` for now [0]:
> Tail calling from async functions or async blocks is not allowed. This is due to the high implementation effort as it requires special handling for the async state machine. This restriction can be relaxed by a future RFC.
> Using `become` on a `.await` expression, such as `become f().await`, is also not allowed. This is because `become` requires a function call and `.await` is not a function call, but is a special construct.
> Note that tail calling async functions from sync code is possible but the return type for async functions is `impl Future`, which is unlikely to be interesting.
[0]: https://github.com/phi-go/rfcs/blob/guaranteed-tco/text/0000...
More accurate title would be to say it is a tail call optimized interpreter. Tail calls alone aren't special b/c what matters is that the compiler or runtime properly reuses caller's frame instead of pushing another call frame & growing the stack.
Maybe, it probably depends on how you're looking at it. The optimization is obvious, I expect any optimizing compiler will TCO all naive tail calls - but the trouble in Rust or C++ or a dozen other languages is that you can so easily write code which you think can be optimized but the compiler either can't see how or can see that it's not possible and (without this keyword) you don't find out about this because growing the stack is a valid implementation of what you wrote even though it's not what you meant.
The "become" keyword allows us to express our meaning, we want the tail call, and, duh, of course the compiler will optimize that if it can be a tail call but also now the compiler is authorized to say "Sorry Dave, that's not possible" rather than grow the stack. Most often you wrote something silly. "Oh, the debug logging happens after the call, that's never going to work, I will shuffle things around".
I wouldn't call it optimized, since that implies that it gains performance due to the tail calls and would work otherwise, but the tail calls are integral to the function of the interpreter. It simply wouldn't work if the compiler can't be forced to emit them.
What I wrote is standard nomenclature
> Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail-call elimination or tail-call optimization. (https://en.wikipedia.org/wiki/Tail_call)
Questioning standard nomenclature is useful too, as long as it provides insight and is not just bike-shedding. "optimization" (in the context of an optimizing compiler) is generally expected not to alter the semantics of a program.
> but the tail calls are integral to the function of the interpreter
Not really, a trampoline could emulate them effectively where the stack won't keep growing at the cost of a function call for every opcode dispatch. Tail calls just optimize out this dispatch loop (or tail call back to the trampoline, however you want to set it up).