Svelte Hacker News

Animats 4 years ago

Go really does do this better. Go is a green thread system.

In Rust, if you're actually doing any compute work, you're stalling out the async system. In Go, if you compute for a while, the scheduler will let someone else run. You can get all the CPUs working. This is well matched to writing web back ends.

Rust's "Async" seems to be designed for a very specific use case - a program running a very large number of network connections, most of which are waiting. If you're doing something where you need all available CPUs to get the work done, it's a bad fit.

anderskaseorg 4 years ago

You make it sound like Rust doesn’t support background CPU computation. Rust has always had threads (https://doc.rust-lang.org/book/ch16-01-threads.html), along with incredibly powerful libraries like Rayon (https://github.com/rayon-rs/rayon) to take advantage of them.
async/await wasn’t specifically targeted for that use case because it didn’t need to be. But you can use threads with async, if you want, using a multi-threaded scheduler like Tokio with the rt-multi-thread feature flag (https://docs.rs/tokio/1.7.0/tokio/runtime/index.html#multi-t...).
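
As a rough illustration of the "Rust has always had threads" point, here is a minimal sketch that spreads CPU-bound work over OS threads using only the standard library. It is a stand-in for what Rayon or a multi-threaded Tokio runtime would do more conveniently; the function name `parallel_sum_of_squares` and the workload are made up for the example.

```rust
use std::thread;

/// Sums the squares of `data` using up to `workers` OS threads (std only).
/// Hypothetical helper, not part of any library mentioned in the thread.
fn parallel_sum_of_squares(data: &[u64], workers: usize) -> u64 {
    // Guard against zero workers and against a zero chunk length.
    let workers = workers.max(1);
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    // Scoped threads may borrow `data` because they are all joined
    // before `thread::scope` returns.
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|x| x * x).sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    let total = parallel_sum_of_squares(&data, 4);
    println!("sum of squares = {total}");
}
```

Each chunk runs on its own kernel thread, so the OS scheduler can preempt it and place it on any free core, independently of whatever an async runtime is doing.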
staticassertion 4 years ago

> In Rust, if you're actually doing any compute work, you're stalling out the async system.
Just use threads? Or add explicit yields.
- Animats 4 years ago
  
  > explicit yields.
  That's so 1984 Macintosh. "Cooperative multitasking".
  The usual effect of that style of programming is stuck window managers.
  
  staticassertion 4 years ago
  
  You still get preemption at the thread level.
  
  pcwalton 4 years ago
  
  So use threads! That's what they're there for.
  Go's solution is threads, just with a particularly idiosyncratic M:N implementation. Rust threads are 1:1. But they're the same concept. You can even get M:N if you really want it (nobody does, which is why mioco gets no use).
  
  jjav 4 years ago
  
  Exactly. It's painful that the same problems get revisited as new things.
  Computation needs to be scheduled to a CPU or core for parallel progress to occur. That means kernel threads.
  
  monocasa 4 years ago
  
  There's nothing wrong with explicit yields within a single process. It's only when the kernel scheduler can't preempt you that you end up with stuck window managers.
- jlokier 4 years ago
  
  If you call a library function, not written by you, where you don't know whether it's going to return in a few nanoseconds or might take a long time, you the caller can't do the right thing yourself.
  Spawning a thread is too inefficient for the fast calls, but not using one for the slow calls is disastrous for the rest of the program's concurrency.
  So you're reliant on the library, not written by you, to always choose an efficient behaviour for each call. In other words, to decide dynamically when to use a thread, and when not to use a thread. And that library is reliant on all of its transitive dependencies in the same way.
  It's a nice ideal, but it's not realistic.
  And it still doesn't work all that efficiently anyway. If you call a library function many times, or just many functions, and any of those calls (outside your knowledge) has some long-running part where a thread ought to be used, to do it efficiently a single thread would span your code's multiple calls, and not just last for one library call.
  Essentially, to know whether to make a library call in a thread or not, and at what level to use a thread, you need to know what the behaviour of that call is going to be in advance. In practice, even when carefully optimising programs, we almost always assume one behaviour or another. Maybe measure, and assume the measured behaviour will remain similar. This doesn't have good adaptive behaviour when the assumption is wrong or out of date, so it doesn't work well for calls to things whose duration varies a lot, and especially not for large systems of calls between different modules, each of which has unpredictable timing.
  I first encountered this issue in a video game engine, long before Go or Rust existed (and before async-await). NPCs usually did a tiny amount of calculation, and there were tens of thousands of them, called every display frame, so this had to be fast. But some of them, known only to themselves, occasionally decided to make network calls, read cached data (which might need network calls or not), or call the filesystem, or do intensive "AI" planning using a burst of CPU. To maintain the frame rate while efficiently processing all NPCs, it was necessary to detect when some deep library call from an NPC had used the NPC's "short call" time budget, and switch it over to a thread.
  In that video game environment, the Go model worked. No matter how each NPC was coded, or by whom, an NPC might slow down if it called something slow, but the game would stay smooth and all other NPCs would continue to update at the full frame rate. The Rust model would result in the whole game frame rate stalling, which is much less acceptable. The threads-when-anything-could-be-slow model was too slow, and the use-threads-when-you-will-be-slow model gave NPC code too much power to affect the system outside the NPC.
  
  staticassertion 4 years ago
  
  What you're arguing for is task preemption, and it has a global cost. If you're willing to pay global costs, Rust is probably not the right language for you.
  
  jlokier 4 years ago
  
  Technically it's not pre-emption. It's activating parallelism on another CPU core when a time condition is triggered, in order to reduce tail latencies of unrelated tasks and prevent starvation. Nothing is directly pre-empted by that event.
  But let's run with pre-emption, because that does apply on single cores.
  No, it doesn't work out to be a global cost, assuming you mean performance (as opposed to developer cognitive cost). I have experience with this - the game engine. We don't half-ass performance in games like that, it's a central factor. We optimise for the highest performance on middle-of-the-road hardware, and measure constantly.
  The Rust model is same or worse on each global performance metric that matters in that scenario. The global metrics are things like graphics throughput, tail latency and variance, interactive response, task starvation. So the "cost" of "pre-emption" ranges from zero to negative.
  I will grant you that pre-emption adds shared resource locking overheads if you're comparing with an async-await model that uses only a single thread for all tasks globally, so locking is not required. (I.e. JavaScript). But that's not the Rust model.
  Pre-emption has a small, positive local cost, in that you need to check when to pre-empt. But that can be made small. You don't need a kernel to do it, and you don't need to be constantly reading clocks either. The cost is a few nanoseconds per microtask, i.e. book-keeping like with any execution loop, and some periodic interrupts or signals, which you might have anyway. Alternatively you can use performance counters on some CPUs. The actual act of pre-empting is also cheap - it's just a coroutine jump with some book-keeping. But it's also rare compared with async task transitions, so the cost doesn't really matter anyway.
  
  littlestymaar 4 years ago
  
  Preemption is costly, because you need to save the entire state of the execution you're preempting, in order to restore it later (which is also, as you should know, what makes context-switches slow). Go and async Rust are both cooperatively scheduled, switching back and forth only at specific yield points, which reduces the amount of state that needs to be saved. Adding such a yield point is costly too though, that's why Go didn't add them inside tight loops until 1.14.[1]
  Regarding scheduling and how task-switching is done, async Rust and Go are fundamentally the same model (and BTW, the authors of Tokio never hid that they took a lot of inspiration from the Go runtime). The three big differences are the following:
  - Go's tasks are stackful, they have their own stack, which is grow-able (the stack is copied into a bigger stack when it's full)
  - Go's yield points are automatically inserted by the compiler based on heuristics which vary depending on the version of the compiler, whereas Rust yield points are manually inserted by the developer. This resonates strongly with the philosophy of both languages (best-effort automatic vs explicit manual control leading to the best performance), which can also be seen on the topic of memory allocation (automatic boxing based on escape analysis vs manual Box pointer).
  - Go only has green threads, which means that if the runtime is failing to keep its promises[1][2], you don't have much alternative. Rust has both async tasks and OS threads, meaning you can choose what works best for your workload.
  [1] https://github.com/golang/go/issues/10958 [2] https://github.com/golang/go/issues/36365
  
  jlokier 4 years ago
  
  > Preemption is costly, because you need to save the entire state of the execution you're preempting, in order to restore it later (which is also, as you should know, what makes context-switches slow).
  In-userspace stackful context switches are not at all slow in a sane userspace-switching runtime. They are approximately a combination of setjmp() and longjmp(), but without saving all registers. Typically a few tens of nanoseconds or less on a modern CPU. Pre-emptive context switches are nearly always stackful, even if non-pre-emptive (async) context switches are stackless in the same runtime.
  A pre-emptive context switch in a userspace co-operative scheduling system is slower than a non-pre-emptive context switch if it is caused by a dedicated interrupt of some kind. In the case of userspace pre-emption, that is generally a signal and return, typically single-digit microseconds.
  However, the cost there is mainly the signal. If the pre-emptive context switch is caused by a signal that triggered anyway for another purpose, the actual context switch is, again, cheap.
  In the model I described (the game environment), pre-emptive context switches are rare. It wouldn't matter if they took longer, because >= 99.99% of context switches are co-operative in that model. The important factor is that no task blows the frame budget or prevents other tasks from running at the full frame rate. In web services a similar target is tail latency, in the presence of diverse tasks you cannot predict in advance.
  > meaning you can choose what works best for your workload.
  Exactly. And as I've tried to explain, for some varying, complex workloads whatever you choose statically has poor metrics; it must be adaptive to maintain good timing metrics.
  > Rust yield points are manually inserted by the developer
  Indeed, and in the case of NPCs in a game engine, or a large program composed of many libraries written by hundreds of different authors (e.g. some web servers), that causes complex, interdependent performance characteristics, where timing behaviour in one of them ruins performance for everything, unless they are isolated using threads, in which case it's too slow. There is no "the developer" who can ensure this doesn't happen.
  (Aside, if Rust's type system helped with this, that would be great, the same way types help with other "programming in the large" safety issues, but it doesn't address timing characteristics as far as I know.)
  > leading to the best performance [..] Rust has both Async tasks and OS threads, meaning you can choose what works best for your workload.
  You could summarise my point as: "For some dynamic workloads, especially in timing-sensitive large programs with many components working independently and unpredictably, neither async tasks nor OS threads perform best for your workload (or even adequately sometimes). The optimal (or required) combination requires some dynamic responses, and cannot be achieved solely by static placement of yield points and thread initiations."
  
  0xjnml 4 years ago
  
  Goroutines are asynchronously preemptible since Go 1.14, released February 25th, 2020.
  https://blog.golang.org/go1.14
  
  pcwalton 4 years ago
  
  I bet if you implemented this system with every NPC getting a separate thread, it would be quite comparable in performance to the Go implementation on a modern OS. Goroutines are heavier, and OS threads lighter, than many people think.
  (This speaks more to the fact that goroutine-per-entity wouldn't be feasible than to the idea that thread-per-entity would be.)
  
  jlokier 4 years ago
  
  I don't know about goroutine performance, and maybe you're right about their weight relative to modern OS threads.
  I'm talking about the scheduling model rather than the specific implementations in Go and Rust. The game engine I worked on predates them both, and at the time, there was no way 10,000+ OS threads could advance in every rendering frame. Just entering and exiting the kernel for each thread would take longer than the frame budget. It had to be a userspace queue.
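
The "short call budget, then switch it over to a thread" idea from the game-engine sub-thread above can be approximated in a few lines. This is only a sketch under strong assumptions: work is modelled as countable steps, the budget is checked cooperatively between steps (the engine described above detected overruns inside a deep library call, which this does not do), and `NpcUpdate`, `run_frame` and the single worker thread are hypothetical names and choices, not anything from the original engine.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical NPC update, modelled as a number of equal work steps.
struct NpcUpdate {
    id: u32,
    steps_left: u32,
}

impl NpcUpdate {
    /// Performs one step of work and reports whether the update is finished.
    fn step(&mut self) -> bool {
        self.steps_left = self.steps_left.saturating_sub(1);
        self.steps_left == 0
    }
}

/// Runs every update inline for at most `budget` steps. Updates that exceed
/// the budget are handed to a worker thread so the frame loop is not held up.
/// Returns (ids finished inline, ids finished on the worker).
fn run_frame(updates: Vec<NpcUpdate>, budget: u32) -> (Vec<u32>, Vec<u32>) {
    let (slow_tx, slow_rx) = mpsc::channel::<NpcUpdate>();
    let worker = thread::spawn(move || {
        let mut done = Vec::new();
        // Finish each handed-over update; the loop ends once the sender is dropped.
        for mut update in slow_rx {
            while !update.step() {}
            done.push(update.id);
        }
        done
    });

    let mut inline_done = Vec::new();
    for mut update in updates {
        let mut used = 0;
        let mut finished = update.steps_left == 0;
        while !finished && used < budget {
            finished = update.step();
            used += 1;
        }
        if finished {
            inline_done.push(update.id);
        } else {
            // Over budget: move the rest of the work off the frame loop.
            slow_tx.send(update).expect("worker thread is alive");
        }
    }
    drop(slow_tx); // Close the channel so the worker can exit.
    (inline_done, worker.join().expect("worker thread panicked"))
}

fn main() {
    let updates = vec![
        NpcUpdate { id: 1, steps_left: 1 },
        NpcUpdate { id: 2, steps_left: 50 },
        NpcUpdate { id: 3, steps_left: 2 },
    ];
    let (fast, slow) = run_frame(updates, 3);
    println!("finished inline: {fast:?}, finished on worker: {slow:?}");
}
```

The cheap updates never touch a thread, and only the one that blows its budget pays for the handoff. What the sketch cannot show is the hard part argued over in the thread: noticing the overrun when the slow code never returns to the loop.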
alexfrydl 4 years ago

What you're describing is not actually a problem with async/await. Rather, async/await places a burden of knowledge on the developer to explicitly avoid this problem. Go makes a tradeoff on overhead in exchange for removing that burden of knowledge.
Explicit cooperative multitasking absolutely can be used effectively for CPU-bound work, but it requires the developer to know how to do that, rather than relying on preemption to cover for them. It's similar to the different pros and cons of garbage collection vs. explicit memory management.
- jlokier 4 years ago
  
  > async/await places a burden of knowledge on the developer to explicitly avoid this problem
  How many large projects do you know where the developers know the CPU run-length histogram of every library dependency and its transitive dependencies? Even if they can measure them, they can't realistically alter many of them.
  And how many library developers do you know who know the CPU run-length histogram expectations of everything that depends on them, as well as their own dependencies?
  In the Go model, the system balances competing modules dynamically in response to unpredictable load patterns, and ensures some amount of fairness. Something which is called, no matter how deep in the call stack and no matter how many steps removed from another module, can do some computation and it doesn't severely affect the rest of the program. In particular, it doesn't cause a giant spike in latency of processing unrelated things.
  In the Rust model, timings of totally unrelated modules have a stronger effect on each other. Unrelated modules are not as decoupled. To be conservative, especially in a library, it's better to avoid any lengthy computations in your async functions, breaking them up into smaller parts just in case something unrelated needs to be able to make progress. In such cases, preemption is more efficient.
  The Rust model also leads to an interesting metastability in library design motivation among independent developers. A library that provides an async API and breaks its work up into many small, non-sequentially-dependent tasks will tend to get a higher share of CPU execution than one which uses fewer large tasks for the same job - because the scheduler is not trying to be fair. So there's an incentive for every library developer to break things up into many small async tasks, to make their own library perform better, even though that is less efficient overall.
  Overall, I think the Rust async scheduling model is better suited to smaller programs than the Go scheduling model.
  
  littlestymaar 4 years ago
  
  > In the Go model, the system balances competing modules dynamically in response to unpredictable load patterns, and ensures some amount of fairness. Something which is called, no matter how deep in the call stack and no matter how many steps removed from another module, can do some computation and it doesn't severely affect the rest of the program. In particular, it doesn't cause a giant spike in latency of processing unrelated things.
  There's some kind of magical thinking here. The Go runtime attempts to hide as much complexity as it can, and while it works OK most of the time, there are a lot of edge cases that the runtime doesn't handle well [1]. And implementing a runtime that handles these things automagically is a hard task, and nasty bugs ensue[2].
  Also, scheduler fairness isn't related to the language itself, since Rust doesn't ship a scheduler; the runtime is a third-party library.
  [1]: https://github.com/golang/go/issues/36365 [2]: https://github.com/golang/go/issues/40722
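
The incentive described in this sub-thread (a library that splits its job into many small tasks gets more turns from a scheduler that is fair per task, not per library) can be seen in a toy simulation. The round-robin queue below is an assumption made for illustration, not how any particular Rust runtime schedules; `Task`, `run_round_robin` and the library names are hypothetical.

```rust
use std::collections::VecDeque;

/// A task owned by a named library, with some units of work left.
struct Task {
    owner: &'static str,
    remaining: u32,
}

/// Runs tasks round-robin, one unit of work per turn. Returns, in order of
/// completion, the tick at which each owner's last task finished.
fn run_round_robin(tasks: Vec<Task>) -> Vec<(&'static str, u32)> {
    let mut queue: VecDeque<Task> = tasks.into();
    let mut finished = Vec::new();
    let mut tick = 0;
    while let Some(mut task) = queue.pop_front() {
        tick += 1;
        task.remaining = task.remaining.saturating_sub(1);
        if task.remaining > 0 {
            // Not done yet: go to the back of the queue (a yield).
            queue.push_back(task);
        } else if queue.iter().all(|t| t.owner != task.owner) {
            // This was the owner's last outstanding task.
            finished.push((task.owner, tick));
        }
    }
    finished
}

fn main() {
    // Both libraries have 8 units of work in total.
    let mut tasks: Vec<Task> = (0..4)
        .map(|_| Task { owner: "split", remaining: 2 })
        .collect();
    tasks.push(Task { owner: "single", remaining: 8 });

    for (owner, tick) in run_round_robin(tasks) {
        println!("{owner} finished at tick {tick}");
    }
}
```

With equal total work, the library that split into four tasks finishes at tick 9 and the one with a single task at tick 16, because the queue hands out turns per task. A scheduler that shared time per library would finish them at roughly the same moment.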
filleduchaos 4 years ago

Concurrency and parallelism are different things (and the existence of an implementation for one does not preclude the other).
lmm 4 years ago

The trouble with a green thread system is that you can't opt out. Like the article's example of everything being implicitly cancellable by a higher-level select, in a green thread system everything is implicitly threadshiftable by a lower-level function.
If your scheduler works right, then green threads are great, but when your scheduler breaks (and eventually it will) they're impossible to fix. Lightweight-but-not-invisible yields keep things as simple as possible, but no simpler.
- pjmlp 4 years ago
  
  Sure you can, and that is one of the design ideas behind Project Loom or .NET Tasks.
  Have a scheduler API available that is able to take those decisions, while providing default schedulers for the most common patterns.
  As with many design decisions, the Go design team just decided not to expose the same level of power to their users.
leshow 4 years ago

> In Rust, if you're actually doing any compute work, you're stalling out the async system. In Go, if you compute for a while, the scheduler will let someone else run. You can get all the CPUs working. This is well matched to writing web back ends.
This is not a choice in Rust but instead a choice that is handled differently in the async runtimes, which are libraries. If I'm not mistaken, async-std (an async runtime in Rust) adds automatic yields similarly to how Go does.
littlestymaar 4 years ago

> In Rust, if you're actually doing any compute work, you're stalling out the async system. In Go, if you compute for a while, the scheduler will let someone else run. You can get all the CPUs working. This is well matched to writing web back ends.
For the first ten years of Go at least, it was pretty easy to stall the system as well, since the scheduler wasn't able to preempt in the middle of a hot loop[1]. This was only fixed in 1.14, last year!
> Rust's "Async" seems to be designed for a very specific use case - a program running a very large number of network connections, most of which are waiting. If you're doing something where you need all available CPUs to get the work done, it's a bad fit.
In Rust, if you have a CPU intensive job to do, you can use threads (and as it's completely orthogonal to async, it combines well with it).
[1]: https://github.com/golang/go/issues/10958

akiselev 4 years ago

> As an aside, I do understand why performance oriented languages like Rust choose the async/await path, as async by default carries a performance penalty / requires a runtime. I don't however accept it in non-performance oriented languages.

Ironically, I think Rust futures could be expanded into Go-like uncolored functions relatively easily without even much of a runtime. Since they're stack allocated by default, driving a future to completion is a matter of calling Future::poll in a while loop and handling the waker callback. I feel like the runtimes are there mostly to provide a unified interface to kernel APIs like io_uring/epoll/etc and ease ownership/task allocation.
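
To make the "calling Future::poll in a while loop and handling the waker callback" remark concrete, here is a minimal single-future executor built only on the standard library, together with a CPU-bound `async fn` that yields explicitly. It is a sketch, not how Tokio or async-std are implemented; `block_on`, `ThreadWaker`, `YieldNow` and `sum_with_yields` are names chosen for this example.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread running `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drives one future to completion on the current thread.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until the waker fires; a spurious wakeup only causes a re-poll.
            Poll::Pending => thread::park(),
        }
    }
}

/// Future that returns `Pending` once, which acts as an explicit yield point.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        // Ask to be polled again right away.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// CPU-bound work that yields to the executor after every `chunk` items.
async fn sum_with_yields(n: u64, chunk: u64) -> u64 {
    let chunk = chunk.max(1); // avoid a zero modulus
    let mut total = 0;
    for i in 1..=n {
        total += i;
        if i % chunk == 0 {
            YieldNow(false).await;
        }
    }
    total
}

fn main() {
    let total = block_on(sum_with_yields(10_000, 1_000));
    println!("total = {total}");
}
```

With a single future the yields buy nothing, but in an executor that holds several tasks each `YieldNow` is a point where another task could be polled. That is the manual counterpart to the compiler-inserted yield points discussed elsewhere in the thread.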