If the author has not already, I would commend to them a search of the literature (or the relevant blog summaries) for the term "implicit parallelism". This was an academic topic from a few years back (my brain does not do this sort of thing very well but I want to say 10-20 years ago) where the hope was that we could just fire some sort of optimization technique at normal code which would automatically extract all the implicit parallelism in the code and parallelize it, resulting in massive essentially-free gains.
In order to do this, the first thing that was done was to analyze existing source code and determine the maximum amount of implicit parallelism in it, assuming exploiting it was free. The attempt basically failed right there. Intuitively we all expect that our code has tons of implicit parallelism that can be exploited. It turns out our intuition is wrong: the maximum amount of parallelism that could be extracted was often in the 2x range, which, even if the parallelization were free, would only be a marginal improvement.
Moreover, it is also often not something terribly amenable to human optimization either.
A game engine might be the best case scenario for this sort of code, but once you put the coordination costs back into the charts, those charts start looking a lot less impressive in practice. I have a rule of thumb that the key to high-performance multithreading is that the cost of the payload behind a given bit of coordination overhead needs to be substantially greater than the cost of the coordination, and a game engine will not necessarily have that characteristic... it may have lots of tasks to be done in parallel, but if they are each too small, the coordination swamps the win.
Back about 20 years ago, DO CONCURRENT went into FORTRAN. It says that the programmer claims all the iterations of a DO loop are non-interfering. This was never really that useful. It holds for matrix multiply, but not much else. It's exactly what you want for the article's use case of adding things up in parallel.
There are a few standard cases for parallelism:
- There's no interaction between tasks at all for long periods. Easiest case. Use case is codebreaking and crypto mining. Best done on special purpose hardware.
- You're running a service with a large number of incoming connections. The connections are essentially independent of each other. This is easy to do concurrently, because the threads don't talk to each other. For N > 1000 or so, this is the useful case for server-side async.
- You have some big area or volume oriented computation, like weather prediction, where there are many local regions being computed, and the regions talk to their neighbors a bit. This is the classic supercomputer application.
- You have lots of little tasks going on, queuing up events for each other. This was the vision Alan Kay had for Smalltalk. It sometimes shows up inside games, and inside discrete event simulations. The internals of some operating systems work this way.
- Anything that runs on a GPU. Very limited interaction between tasks. Until you get to shadows, lighting, occlusion culling, reflections, and anything where you don't want to do N lights x M meshes processing. Vulkan gives you the low-level tools to deal with the interlocking needed to do that, which is why Vulkan is so complicated.
(I can't speak to LLM training; haven't been inside that.)
The Fortran committee botched DO CONCURRENT badly. The requirements imposed on the program only make it safe to execute its iterations in any serial order, but are not sufficient to allow safe parallel execution. So one can write a perfectly conforming DO CONCURRENT loop that will produce wrong answers when actually run in parallel. The problem in the spec seems to have been inadvertent but they have refused to fix it (and don’t seem to understand the problem either.)
I guess you can argue that instruction reordering, SMT/Hyper-threading are already eating the easy wins there. And as you said, it seems like the gains taper off at 2x.
I'm not sure why games would be a good target. They're traditionally very much tied to a single thread, because ironically, passing data to the graphics and display hardware and to multi threaded subroutines like physics all has to be synchronized.
The easiest way to do that without locking a bunch of threads is to let a single thread go as fast as possible through all that main thread work.
If you really want a game focused parallelization framework, look into the Entity Component System pattern. The developer defines the data and mutability flow of various features in the game.
Because the execution ordering is fully known, the frameworks can chunk, schedule, reorder, and fan out the work across threads with less waiting and fewer cache misses.
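To make that scheduling idea concrete, here's a minimal hypothetical sketch in Java (SystemSpec, the component names, and the greedy batching are all illustrative, not any real engine's API): each system declares what it reads and writes, and mutually non-conflicting systems get fanned out across a thread pool.

    import java.util.*;
    import java.util.concurrent.*;

    // Hypothetical mini ECS-style scheduler: each system declares which component
    // types it reads and writes, and the framework runs mutually non-conflicting
    // systems on a thread pool in the same phase.
    public class EcsScheduler {
        record SystemSpec(String name, Set<String> reads, Set<String> writes, Runnable body) {
            boolean conflictsWith(SystemSpec other) {
                // Write/write or read/write overlap on any component type is a conflict.
                return !Collections.disjoint(writes, other.writes())
                    || !Collections.disjoint(writes, other.reads())
                    || !Collections.disjoint(reads, other.writes());
            }
        }

        public static void main(String[] args) throws Exception {
            List<SystemSpec> systems = List.of(
                new SystemSpec("physics",   Set.of("Transform", "Velocity"), Set.of("Transform"), () -> {}),
                new SystemSpec("animation", Set.of("Skeleton"),              Set.of("Pose"),      () -> {}),
                new SystemSpec("ai",        Set.of("Transform"),             Set.of("Velocity"),  () -> {}));

            ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            List<SystemSpec> remaining = new ArrayList<>(systems);
            while (!remaining.isEmpty()) {
                // Greedily pick a batch of systems whose declared access sets don't overlap.
                List<SystemSpec> batch = new ArrayList<>();
                for (SystemSpec s : remaining)
                    if (batch.stream().noneMatch(s::conflictsWith)) batch.add(s);
                remaining.removeAll(batch);

                // Fan the batch out across worker threads; wait before the next phase.
                pool.invokeAll(batch.stream()
                    .map(s -> (Callable<Void>) () -> { s.body().run(); return null; })
                    .toList());
            }
            pool.shutdown();
        }
    }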
"I'm not sure why games would be a good target..."
"If you really want a game focused parallelization framework, look into the Entity Component System pattern."
Exactly that. You can break a lot of modern games nicely into a lot of little things being done to discrete entities very quickly. But there the problem is that it's easy for the things to be too small, meaning you don't have a lot of time to be "clever" in the code.
I'm ignoring the GPU and just looking at CPU for this. GPU is in its own world where parallelization is forced on you comprehensively, in a space forced to be amenable to that.
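Sure there's work you can throw into ECS but it's a paradigm shift that is not implicit and also highlights how much doesn't work.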
Yeah, ECS is the approach I've heard can get games to have sufficient parallelism. Though only if you are careful about sticking to it properly, and with careful management of the dataflow between systems. I think the main thing is, apart from the difficulty of doing the analysis in the first place, if you're writing the code without thinking about parallelism, you will tend to introduce data dependencies all over the place and changing that structure is very difficult.
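Agreed. It exemplifies what parallelism is on the table but also how many more guarantees need to be enforced to get it.
You're almost swinging the pendulum back to a fixed pipeline and I don't think you can get that for free.
ECS is not a game specific thing. It's a redrawing of encapsulation boundaries which can be applied to systems broadly.
https://youtube.com/watch?v=wo84LFzx5nI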
One thing that might help is that state space explosion is mainly caused by mutability. Because each mutation potentially creates branching that can be difficult or impossible to statically analyze.
In other words, imperative programming with const is directly transpilable to/from functional programming.
Using modern techniques like higher order methods, scatter-gather arrays (similar to map-reduce), passing by value via copy-on-write, etc, code can be written that works like piping data between unix executables. Everything becomes a spreadsheet basically.
Which also has implications for multiprocessing. When all data is const, many of the async edge cases that we're normally forced to deal with go away. Loosely that means that we can use techniques similar to double buffering to create new mutated copies where only the part that changed was actually copied. The rest of the data can reference one source of truth. In practice, this turns out to most closely approximate optimal algorithms, where the best-effort imperative code may paint itself into a corner because logic design choices get dictated by the realities of the runtime instead of the problem domain itself, so the optimizer loses opportunities to simplify it.
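A small sketch in Java of that "copy only what changed" idea, using immutable records so every untouched part is shared by reference (Player/Settings/World are just illustrative names, not a library):

    import java.util.List;

    // Immutable world state: "mutation" builds a new record that shares every
    // unchanged part with the previous version by reference.
    record Player(String name, int score) {}
    record Settings(int difficulty, boolean sound) {}
    record World(List<Player> players, Settings settings) {
        // Returns a new World; only the players list is rebuilt, and the Settings
        // object is the very same reference as before (no deep copy).
        World withScore(String name, int newScore) {
            List<Player> updated = players.stream()
                    .map(p -> p.name().equals(name) ? new Player(name, newScore) : p)
                    .toList();
            return new World(updated, settings);
        }
    }

    public class StructuralSharing {
        public static void main(String[] args) {
            World v1 = new World(List.of(new Player("a", 0), new Player("b", 0)),
                                 new Settings(2, true));
            World v2 = v1.withScore("a", 10);

            // v1 is untouched, so other threads can keep reading it safely;
            // v2 shares v1's Settings instance rather than copying it.
            System.out.println(v1.settings() == v2.settings()); // true
            System.out.println(v1.players().get(0).score());    // 0
            System.out.println(v2.players().get(0).score());    // 10
        }
    }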
I find that in my daily work, nearly all of my time is spent untangling highly-imperative mutable code. Because I must hold the entire state in my mind when mutable references (mostly to objects) are passed between functions. My higher-order method solutions often end up shorter, more readable, faster, even more efficient, but harder to explain to junior developers. So I would really like a compiler that implicitly converts imperative code like for() loops and references into const/functional code whose intermediate code (i-code) can be reduced to its simplest form.
This is also my main concern about Rust and other mainstream imperative languages like C# and even Javascript, that they encourage bare-hands methods that go away with const/functional code if the user is willing to accept a doubling or more of memory usage. Which becomes less of an issue over time as memory cost decreases.
Edit: I maybe should have said fork-join instead of scatter-gather, but they are related concepts.
> Using modern techniques like higher order methods, scatter-gather arrays (similar to map-reduce), passing by value via copy-on-write, etc, code can be written that works like piping data between unix executables. Everything becomes a spreadsheet basically.
I have built a decent amount of multithreaded and distributed systems. I agree in principle with everything you wrote in that paragraph. In practice, I find such techniques add a lot of unstable overhead in the processing phase as memory is copied, and again while the results are merged, conflicts resolved, etc. They also lock you into a certain kind of architecture where global state is periodically synchronized. So IMO the performance of these things is highly workload-dependent; for some, it is barely an improvement over serialized, imperative code, and adds a lot of cognitive overhead. For others, it is the obvious and correct choice, and pays huge dividends.
Mostly I find the benefit to be organizational, by providing clear interfaces between system components, which is handy for things developed by teams. But as you said, it requires developers to understand the theory of operation, and it is not junior stuff.
Completely agree that we could use better language & compiler support for such things.
Games have tons of opportunities for parallelism (outside of rendering, which is obviously embarrassingly parallel) if they are designed for it from the get-go. Unfortunately, most game engines make decisions up front that force much of their game logic to be inherently serial for no good reason. It is definitely true that the sort of parallelism you get from architecting for it from the beginning bears little resemblance to what a compiler would be able to extract with an automatic semantics-preserving pass.
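There is some recent work on this too: https://dl.acm.org/doi/10.1145/3632880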
What about pipelining? Something that has clear serial dependencies could still be pipelined to execute separate (but dependent at work boundaries) units of work in parallel.
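> would automatically extract all the implicit parallelism in the code and parallelize it, resulting in massive essentially-free gains.
Sounds like what mojo is attempting to do:
https://www.modular.com/mojo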
This post underscores how traditional imperative language syntax just isn't that well-suited to elegantly expressing parallelism. On the other hand, this is exactly where array languages like APL/J/etc. or array-based frameworks like NumPy/PyTorch/etc. really shine.
The list summation task in the post is just a list reduction, and a reduction can automatically be parallelized for any associative operator. The gory parallelization details in the post are only something the user needs to care about in a purely imperative language that lacks native array operations like reduction. In an array language, the `reduce` function can detect whether the reduction operator is associative and if so, automatically handle the parallelization logic behind-the-scenes. Thus `reduce(values, +)` and `reduce(values, *)` would execute seamlessly without the user needing to explicitly implement the exact subdivision of work. On the other hand, `reduce(values, /)` would run in serial, since division is not associative. Custom binary operators would just need to declare whether they're associative (and possibly commutative, depending on how the parallel scheduler works internally), and they'd be parallelized out-of-the-box.
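For instance, a minimal Java sketch; the stream reduce contract requires an associative accumulator for exactly this reason, which is what lets the runtime split and recombine the work:

    import java.util.stream.LongStream;

    public class AssociativeReduce {
        public static void main(String[] args) {
            // '+' is associative, so the parallel runtime may chop the range into
            // chunks, sum them on different cores, and add the partial sums.
            long sum = LongStream.rangeClosed(1, 10_000_000).parallel().reduce(0L, Long::sum);

            // '*' is associative too (identity 1), so the same applies.
            long product = LongStream.rangeClosed(1, 20).parallel().reduce(1L, (a, b) -> a * b);

            // '/' is not associative; reducing with it in parallel would give
            // nondeterministic results, so that case has to stay serial.
            System.out.println(sum + " " + product);
        }
    }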
If you're willing to let go of imperative syntax, Interaction Nets[0] might be interesting to maximize parallelism where possible. I think Bend[1] is probably the most mature implementation of that idea.
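[0]: https://en.wikipedia.org/wiki/Interaction_nets [1]: https://github.com/HigherOrderCO/Bend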
Sometimes this approach means that you end up with really compact code where you need to do a lot of research and mental modelling to understand if it actually adds up to things being executed in parallel, and if that is good or not.
Interesting - this is the exact problem José Valim cited when creating Elixir: he was working on Rails multi-core performance in 2010 and found that "writing multi-core software, which is software that runs on all cores with Ruby, was not really straightforward."[1] Ruby's GIL meant only one thread executes at a time.
Fleury's arriving at similar conclusions from the C/systems side: make multi-core the default, not an opt-in. Though his approach still requires explicit coordination (LaneIdx(), barriers, range distribution) vs BEAM where the scheduler automatically distributes processes across cores.
Different tradeoffs for different domains, but both are reacting to the same backwards reality where we program single-core-first on multi-core hardware.
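[1] https://www.welcometothejungle.com/en/articles/btc-elixir-jo...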
I genuinely think that writing things in a message-passing/channel-based concurrency setup is often easier than doing things in single-core.
Obviously it's hard to get right, but a lot of the time doing a concurrent-first setup allows for a very clear separation of concerns (at least if you have a proper green threading solution). You can have one thread whose sole job is to do one thing, and then spit back the response onto a channel or respond to a pid. It's elegant and for me I find it fairly straightforward to reason about.
I do a lot of Clojure for my personal projects, and I will import core.async to a lot of projects, even for stuff that could be handled single-threaded, simply because I find it easier to reason about.
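A rough Java analogue of that shape, with blocking queues standing in for channels and one worker thread owning one job (a toy sketch, nowhere near core.async or BEAM):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ChannelWorker {
        // A request carries its own reply queue, so the worker never shares
        // mutable state with callers; it just reads a message and answers it.
        record Request(String text, BlockingQueue<String> replyTo) {}

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Request> inbox = new ArrayBlockingQueue<>(64);

            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        Request r = inbox.take();                    // block until a message arrives
                        r.replyTo().put(r.text().toUpperCase());     // do one job, send the result back
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();              // shut down when interrupted
                }
            });
            worker.start();

            BlockingQueue<String> reply = new ArrayBlockingQueue<>(1);
            inbox.put(new Request("hello from the main thread", reply));
            System.out.println(reply.take());

            worker.interrupt();
        }
    }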
The thing I struggle with is that most userland applications simply don't need multiple physical cores from a capacity standpoint.
Proper use of concepts like async/await for IO bound activity is probably the most important thing. There are very few tasks that are truly CPU bound that a typical user is doing all day. Even in the case of gaming you are often GPU bound. You need to fire up things like Factorio, Cities Skylines, etc., to max out a multicore CPU.
Even when I'm writing web backends I am not really thinking about how I can spread my workload across the cores. I just use the same async/await interface and let the compiler, runtime and scheduler figure the annoying shit out for me.
Task.WhenAll tends to be much more applicable than Parallel.ForEach. If the information your code is interacting with doesn't currently reside in the same physical host, use of the latter is almost certainly incorrect.
I find async a terrible way to write interactive apps, because eventually something will take too long, and then suddenly your app jerks. So I have to keep figuring out manually which tasks need sending to a thread pool, or splitting my tasks into smaller and smaller pieces.
I’m obviously doing something wrong, as the rest of the world seems to love async. Do their programs just do no interesting CPU intensive work?
You're doing nothing wrong other than using async. You'll only become an async fan once you write a program that does no interesting CPU work, just a lot of I/O - because async is great for the case where the computer spends most of its time waiting around for multiple (or possibly even quite a lot more than that) I/O requests to complete, then does a bit of CPU work on completion of each in order to decide what I/O to do next. Then repeat forever.
This stuff is always a pain to do in languages that don't have this sort of facility, because you need to manually structure your code as some kind of state machine. But that's exactly the sort of tedious nonsense the computer can do for you, if only there's the language mechanism to let you explain to it what you want - which is exactly what async is for.
But it is fundamentally a cooperative multitasking mechanism, as code written in this manner expects to run atomically between awaits. So it's impossible for it to take advantage of multiple threads as-is, which is why languages supporting this mechanism don't bother.
If you are doing a lot of CPU work, that's a problem that async was never designed to solve. You need threads.
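A minimal Java sketch of that split: keep the event-handling side non-blocking and push CPU-heavy work onto an explicit pool sized to the core count.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class OffloadCpuWork {
        public static void main(String[] args) throws Exception {
            // Dedicated pool for CPU-bound work, so a long computation never
            // stalls the thread that is handling I/O or UI events.
            ExecutorService cpuPool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            CompletableFuture<Long> result = CompletableFuture
                    .supplyAsync(OffloadCpuWork::expensiveComputation, cpuPool) // runs on cpuPool
                    .thenApply(n -> n * 2);                                     // cheap follow-up

            // The "async" side only waits on the future; it never runs the heavy loop itself.
            System.out.println(result.get());
            cpuPool.shutdown();
        }

        static long expensiveComputation() {
            long acc = 0;
            for (long i = 0; i < 50_000_000L; i++) acc += i % 7;
            return acc;
        }
    }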
Are you using multiple threads or just a single one? Not sure why your application would "jerk" because something takes a long time? If it's in a separate thread, it being async or not shouldn't matter, nor whether it's doing CPU intensive work or just sleeping.
If I’m using threads for each of my tasks, then why do I need async at all? I find mixing async and threads is messy, because it’s hard to take a lock in async code, as that blocks other async code from running. I’m sure this can be done well, but I failed when I tried.
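It depends what framework/language you're using.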
It's messy in something like Python, but mostly transparent in C#. In C# you effectively are using async/await to provide M-N threading ala. GoLang, it's just different syntax.
OP references the C# method Task.WhenAll so they might be assuming other languages are equally capable.
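You can map N-async tasks onto M-threads. This is essentially what Rust does, and if you squint this is how Go works as well.
A go routine is not that different from an async task, except the runtime inserts all the await points for you.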
I have seen a lot of interfaces that I would be fine describing as "jerking" due to async silliness. Largely from odd reflows that result when the code completes in an order that is not "top down" in the function.
It can be convenient to have the code "just wait" for an async spot, but I do think there are some benefits to forcing a bit more structure around it.
Probably. In web development it's usually get data, transform data, send data. That's in both directions, client to server and vice versa. Transformations are almost always simple. Native apps maybe do something more client-side on average, but I won't bet anything more than a cup of coffee on that.
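Rest of the world doesn't love async. Just the loud opinionated people.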
There is a deep literature on this in the High Performance Computing (HPC) field, where researchers traditionally needed to design simulations to run on hundreds to thousands of nodes with up to hundreds of CPU threads each. Computation can be defined as dependency graphs at the function or even variable level (depending on how granular you can make your threads). Languages built on top of LLVM or interpreters that expose AST can get you a long way there.
The article makes for an interesting technical read, and points to some very important issues, such as debugging in a multicore environment.
Even so, I encourage anyone reading this article to keep in mind that there is a massive amount of work dedicated to software support for multicore parallelism.
For anyone interested in digging deeper into multicore parallelism, I recommend taking a look at the CMU approach to teaching algorithms [1]. They teach parallel programming to second-year undergraduates by default, and the students even write parallel implementations, which can achieve good parallel speedups on a multicore system.
Moreover, they do it using either a parallel functional language or a parallel library in modern C++.
Importantly, they get this far by largely hiding details of exactly how parallel tasks are scheduled across cores. There's a scheduler that does it, much like there's a garbage collector that frees unused memory in a memory-managed language.
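[1] https://www.cs.cmu.edu/~15210/docs/book.pdf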
> they do it using either a parallel functional language or a parallel library in modern C++.
IMO this is a weakness, not a strength. The approach of Fleury (low-level C, no libraries) makes his articles a lot more valuable for understanding the fundamentals.
I think the actually interesting, non-obvious part starts at "Redesigning Algorithms For Uniform Work Distribution". All the prior stuff you basically get for free in a functional language that has a thread pool or futures built in. The real question is how you write algorithms or parts of programs in a way that they lend themselves to being run in parallel as small units, with results that easily merge again (parallel map-reduce) or maybe don't even need to be merged at all. That is the really difficult part, aside from transforming some mutating program into FP style and having appropriate data structures.
And then of course the heuristics start to become important. How much parallelism, before overhead eats the speedup?
Another question is energy efficiency. Is it more important to finish calculation as quickly as possible, or would it be OK to need some longer time, but in total calculate less, due to less overhead and no/less merging?
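Ryan is the author of the RAD Debugger: https://github.com/EpicGamesExt/raddebugger
there's a wealth of papers and theses on these topics from the early-mid 90s, for example https://link.springer.com/article/10.1007/BF02577777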
I think a common blindspot that makes it difficult to fully take advantage of modern silicon is making distinctions between parallelism and concurrency in code that are in actuality ambiguous.
The canonical overly reductive examples are parallelism as trivial SIMD loop parallelism and concurrency as multiple threads working on different but related tasks. If those are your only models the opportunities will be limited. Parallelism, in the sense of executing multiple physical operations in a single instruction, is not limited to e.g. an arithmetic operation on arrays of numbers. If you can execute semantically orthogonal threads of logic in the same instruction, you arguably have both concurrency and parallelism at the same time.
For example, one of the most effective models for using AVX-512 is to treat registers as specialized engines for computing on 64-byte structs, where the structs are an arbitrary collection of unrelated types of different sizes. There are idiomatic techniques for doing many semantically orthogonal computations on the fields in these structs in parallel using the same handful of vector instructions. It essentially turns SIMD into MIMD with clever abstractions. A well-known simple example is searching row-structured data by evaluating a collection of different predicates across any number of columns in parallel with a short sequence of vector intrinsics. Query performance is faster than columnar for some workloads. For most code there is substantially more parallelism available within a single thread of execution than you'll see if you are only looking for trivial array parallelism. It is rare to see code that does this but the gains can be large.
On the other hand, I think concurrency is largely a solved problem to the extent that you can use thread-per-core architectures, particularly if you don't rely on shared task queues or work stealing to balance load.
> particularly if you don't rely on shared task queues or work stealing to balance load.
Anywhere I can read up on better load balancing techniques? Or, are we talking about “Know ahead of time how long each task will reliably run so you can statically schedule everything”?
The main challenge here is that a lot of languages have historically treated threading as an afterthought. Python is a good example where support was so limited (due to the GIL, which they are only now in the process of removing) that people mostly just didn't bother with it and just tried to orchestrate processes instead. Languages like go and javascript are really good at async stuff but that's mostly all happening on 1 core. You can of course run these with multiple cores but you have only limited control over which core does what.
Java has had threading from v1. Fun fact, it was all green threads in 1.0. Real threads that were able to use a second CPU (if you had one) did not come until 1.1. And they've now come full circle with a way to use "virtual" threads. Which technically is what they started with 30 years ago. Java also went on a journey of doing blocking IO on threads, jumping through a lot of hoops (nio) to introduce non blocking io, and lately rearchitecting the blocking io such that you can (mostly) pretend your blocking io is non blocking via virtual threads.
That's essentially what project Loom enables. Pretty impressive from a technical point of view but it's a bit of a leaky abstraction with some ugly failure modes (e.g. deadlocks if something happens to use the synchronized keyword deep down in some library). If that happens on a single real thread running a lot of virtual threads, the whole thread and all the virtual threads on it are blocked.
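For reference, the Loom-era shape of this in Java 21+ looks roughly like the following; each blocking call parks the virtual thread rather than tying up a carrier thread (modulo the synchronized-pinning caveat above):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.IntStream;

    public class VirtualThreadsSketch {
        public static void main(String[] args) {
            // One cheap virtual thread per task; blocking inside the task parks
            // the virtual thread and frees the underlying carrier thread.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                IntStream.range(0, 10_000).forEach(i ->
                    executor.submit(() -> {
                        Thread.sleep(100);   // stands in for blocking I/O
                        return i;
                    }));
            } // close() waits for the submitted tasks to finish
        }
    }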
There are other languages on the JVM that use a bit higher level abstractions here. I'm mainly familiar with Kotlin's coroutines. But Scala went there before them of course. What I like in Kotlin's take on this is the notion of structured concurrency, where jobs fork and join in a context and can be scheduled via dispatchers as a lightweight coroutine, a thread pool, or a virtual thread pool (same API; that was kind of the point of Loom). So, it kind of mixes parallelism and concurrency and treats them as conceptually similar.
Structured concurrency is also on the roadmap for Java as I understand it. But a lot of mainstream languages are stuck with more low-level or primitive mechanisms, or use completely different approaches for concurrency and parallelism. That's fine for experts using this for systems programming stuff but not necessarily ideal if we are all going to do multi core by default.
IMHO structured concurrency would be a good match for Python as well. It's early days with the GIL removal, but the threading and multiprocessing modules are a bit dated/primitive. Async was added at some point in the 3.x cycle. But doing both async & threading is going to require something beyond what's there currently.
On the .NET side, there is a dataflow framework based on TPL for structured concurrency, but few people are even aware it exists; it is async/await all over the place nowadays.
Some code should be single core; like for example, a frontend UI for a web application... You don't want to be hoarding all of the user's CPU capacity with your frontend.
But I do like implementing my backends as multi-core by default because it forces me to architect the system in a simple way. In many cases, I find it easier to implement a multi-core approach. The code is often more maintainable and secure when you don't assume that state is always available in the current process. It forces a more FP/stateless approach. Or at least it makes you think really hard about what kind of state you want to keep in memory.
In backends, you usually need to solve having concurrent requests from multiple users, regardless if you need it for performance. From that, the step to using multiple cores is sometimes very small.
I.e. you don't usually need to make a for loop parallel, you can just make sure different requests are parallel.
True, it is a more natural progression to parallelize on the backend as isolating different requests has multiple benefits besides scalability; e.g. security and maintainability. Also, we have more control over the backend environment so it's easier to add external components (e.g data store like Redis) if needed to help keep track of state; there's no need to store everything in process memory... Not to mention that most backends need/use a database already and so it can be used to keep state with relative ease and efficiency without even adding any new service/component.
The approach described in this article is to reverse the good old fork/join, but it would only be practical for simple sub tasks or basic CLI tools, not entire programs.
In the end, using this style is almost the same as doing fork/join, except the setup is somewhat hidden.
Those interested can go look at all of the actual code I’ve written using these techniques, and decide for themselves whether or not it’s practical only for “simple sub tasks or basic CLI tools”:
Based on the title I would have assumed that the programming model would be inverted, but it wasn't. What is needed is something akin to the Haskell model where the evaluation order is unspecified while simultaneously allowing mutation. The way to do this would be a Rust style linear type system where you are allowed to acquire exclusive write access to a region in memory, but not be allowed to perform any side effect and all modifications must be returned as if the function was referentially transparent. This is parallel by default, because you actively have to opt into a sequential execution order if you want to perform side effects.
The barriers to this approach are the same old problems with automatic parallelization.
Current hardware assumes a sequential instruction stream with hardware threads and cores and no hardware primitive in the microsecond range to rapidly schedule code to be executed on another core. This means you must split your program into two identical programs that then are managed by the operating system. This kills performance due to excessive amount of synchronization overhead.
The other problem is that even if you have low latency scheduling, you still need to gather a sufficient amount of work for each thread. Too fine grained and you run into synchronization overhead (no matter how good your hardware is), too coarse grained and you won't be able to spread the load onto all the processors.
There is also a third problem that is lurking in the dark and many developers with the exception of the Haskell community are underestimating: Running programs in a suboptimal order can lead to a massive increase in the instantaneous memory usage to the point where the program can no longer run. Think of a program allocating memory for each request, processing it and then deallocating, then allocating again. What if it accepts all requests in parallel? It will first allocate everything, then process everything and then deallocate everything.
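A common mitigation for that last failure mode is to bound how much work is in flight at once, so peak memory is proportional to the bound rather than to the offered load. A minimal Java sketch (the limit of 8 and the buffer size are arbitrary):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class BoundedInFlight {
        public static void main(String[] args) throws InterruptedException {
            // At most 8 requests hold their (large) buffers at any moment,
            // no matter how many requests arrive.
            Semaphore inFlight = new Semaphore(8);
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            for (int i = 0; i < 1_000; i++) {
                inFlight.acquire();                        // back-pressure: wait for a slot
                pool.submit(() -> {
                    try {
                        byte[] buffer = new byte[1 << 20]; // per-request allocation
                        process(buffer);
                    } finally {
                        inFlight.release();                // free the slot, memory can be reclaimed
                    }
                });
            }
            pool.shutdown();
        }

        static void process(byte[] buffer) { /* work on the buffer */ }
    }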
Yes this is just the GPU programming model without the hw perks (subgroups etc).
I'm impressed if he came up with it on his own, but it's pretty clear from the article that he didn't.
The GPU model works because the GPU is just wide SIMD with automagical scheduling.
To apply this to the CPU might be misguided unless you use SIMD aka like ISPC.
What I have found is that even among talented senior engineers there is massive Dunning-Kruger effect when it comes to performant architecture. They don't know how to do it, and they don't know that they don't know how to do it.
> I have always wanted reasonable performance (though this might appear like "performance programming" to a concerning proportion of the software industry),
This hit me right in the heart.
I'm often the only person on the team who cares about performance, so I am drawn to these performance-related challenges... and it has really hurt my career, because I am then often perceived as some kind of person focused on optimization rather than delivering features.
Like the author I do not focus on performance for performance's sake. Nearly all of the time the right course of action is "do not optimize this piece of code; focus on maintainability and readability instead."
However, sometimes, you really need to design for performance from the beginning.... or you don't have a product.
At my most recent job they were trying to push lots of data through RabbitMQ/Celery for scientific work. This worked for trivial jobs (tens of megabytes) but not for moderate or large ones (hundreds of gigabytes).
To make such a product viable, you really need to consider performance from the start. Celery explicitly tells you not to pass non-trivial amounts of data around: you should be passing pointers to data (database ID's, file paths, S3 URIs, whatever) rather than the actual full-fat data.
The team really struggled with this. Their next proposed solution was "well, okay, we'll store intermediate results in the database and 'optimize' later." Great idea, but this involved 1B+ result objects. Wrong again. You are not serializing 1B+ Python objects, sending them over the wire, and performing 1B+ Redis or Postgres inserts in any reasonable amount of time or memory. Optimize and bulk insert all you want, but that's an absolute dead end.
There aren't a whole lot of options for performantly slinging around hundreds of gigabytes of data. Assuming you can't just run on a monster server with hundreds of GB of RAM (which honestly is often the right answer) you are generally going to be looking at fast on-disk formats like Parquet etc. In any event that's something you really need to design around from the start, not something you sprinkle on at the end.
They're on their second iteration of the architecture right now, and it's slower than the first iteration was. Still no viable product. Shame.
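A toy Java sketch of the "pass pointers to data" point (the URI and names are made up; in the real system this would be the Celery message): the queue message carries only a location, and the worker fetches the data itself.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PassReferencesNotData {
        // The queue message is a few bytes: a reference to where the data lives
        // (file path, S3 URI, database id), never the multi-gigabyte payload itself.
        record Job(String datasetUri) {}

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Job> queue = new ArrayBlockingQueue<>(100);
            queue.put(new Job("s3://results/run42/measurements.parquet")); // made-up URI

            Job job = queue.take();
            // The worker resolves the reference and streams the data from storage
            // itself, instead of having it serialized through the broker.
            System.out.println("worker would stream " + job.datasetUri() + " directly from storage");
        }
    }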
> I'm often the only person on the team who cares about performance, so I am drawn to these performance-related challenges... and it has really hurt my career, because I am then often perceived as some kind of person focused on optimization rather than delivering features.
Pro tip: always turn that kind of thing into a dollar value you can put on your annual review. "My update to X let us use Y fewer EC2 instances, saving us $Z per year." Then it's not some Don Quixote obsession, but a clear fiscal benefit to the company.
This is kind of the gold standard and well worth aspiring to for any given situation.
I've often found it hard to achieve in practice.
A typical issue is that for non-trivial performance improvements (the kind that take several days to several months to achieve), it's hard to estimate the speedup, and therefore the corresponding dollar amount, without actually doing a substantial portion of the work.
Another typical issue is where the performance speedup doesn't correspond directly to a dollar amount. At a recent employer we were working on a scientific product. The optimizations would have affected the size of the datasets able to be handled by our product. So there would have been a dollar impact, but it would have not been some kind of relatively simple AWS math. It would have been measured in potential sales dollars.
It's like making your game run at 60fps instead of 30fps. 60fps is clearly better (all other things being equal) and better games tend to sell better, and we'd like to make the best game possible and sell the most copies possible... but how do we quantify the expected return on this work? Many times, I don't think you can.
I think that software performance is the good kind of vanity metric; speed almost always translates to better end-user experience, but there is a point where it can turn into a pointless rabbit hole.
My experience is that multi-threading has quite a bit of overhead, not necessarily from the scheduling, but from the cache misses, because now everything is unlikely to be in cache; so a naive parallel for can easily end up consuming a ton of CPU resources. It may indeed finish quicker, but use 5x the overall CPU time to do so.
Then there is the other issue that parallel_for suffers from: the "starter" thread has to finish the loop, and it may end up with nothing to do for some time (like when one of the helper threads gets suspended), or it might end up going off to process some other work, causing the entire loop to take much longer to finish. So parallel_for kinda sucks, and I prefer using dependency graphs when I can.
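As a simplified Java illustration of the dependency-graph style (stage names are made up): each stage declares its inputs, and independent stages overlap on whichever workers are free, instead of a single "starter" thread owning a parallel_for.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class DependencyGraph {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // Two independent stages run concurrently...
            CompletableFuture<Integer> physics     = CompletableFuture.supplyAsync(() -> heavy(1), pool);
            CompletableFuture<Integer> pathfinding = CompletableFuture.supplyAsync(() -> heavy(2), pool);

            // ...and the dependent stage starts as soon as both inputs are ready,
            // on whichever worker is free, rather than on a designated starter thread.
            CompletableFuture<Integer> collision =
                    physics.thenCombineAsync(pathfinding, Integer::sum, pool);

            System.out.println(collision.get());
            pool.shutdown();
        }

        static int heavy(int seed) {
            int acc = seed;
            for (int i = 0; i < 10_000_000; i++) acc = acc * 31 + i;
            return acc;
        }
    }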
Imagine if your whole system followed this concept, bloating your thread count by 32x or something. Using multiple threads is not free and scheduling between them all eats performance. Not only that each thread will use up extra memory.
I think a common blindspot that makes it difficult to fully take advantage of modern silicon is making distinctions between parallelism and concurrency in code that are in actuality ambiguous.
The canonical overly reductive examples are parallelism as trivial SIMD loop parallelism and concurrency as multiple threads working on different but related tasks. If those are your only models the opportunities will be limited. Parallelism, in the sense of executing multiple physical operations in a single instruction, is not limited to e.g. an arithmetic operation on arrays of numbers. If you can execute semantically orthogonal threads of logic in the same instruction, you arguably have both concurrency and parallelism at the same time.
For example, one of the most effective models for using AVX-512 is to treat registers as specialized engines for computing on 64-byte structs, where the structs are an arbitrary collection of unrelated types of different sizes. There are idiomatic techniques for doing many semantically orthogonal computations on the fields in these structs in parallel using the same handful of vector instructions. It essentially turns SIMD into MIMD with clever abstractions. A well-known simple example is searching row-structured data by evaluating a collection of different predicates across any number of columns in parallel with a short sequence of vector intrinsics. Query performance is faster than columnar for some workloads. For most code there is substantially more parallelism available within a single thread of execution than you'll see if you are only looking for trivial array parallelism. It is rare to see code that does this but the gains can be large.
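As a toy illustration of the idea (my own sketch, not the actual technique from any particular codebase; requires AVX-512F): sixteen int32 "columns" of one record sit in a single 64-byte register, and one compare evaluates a per-column predicate and hands back all sixteen answers as a bitmask.

    #include <cstdint>
    #include <cstdio>
    #include <immintrin.h>

    // One 64-byte record: sixteen int32 columns (layout invented for the example).
    struct Row { int32_t col[16]; };

    // A single compare answers "col[i] > threshold[i]" for all sixteen columns
    // at once; the result is one bit per column.  Mixing different comparison
    // operators takes a few more masked compares, but the shape is the same.
    __mmask16 match(const Row& r, const Row& thresholds) {
        __m512i row = _mm512_loadu_si512(&r);
        __m512i thr = _mm512_loadu_si512(&thresholds);
        return _mm512_cmpgt_epi32_mask(row, thr);
    }

    int main() {
        Row r{}, t{};
        for (int i = 0; i < 16; ++i) { r.col[i] = i; t.col[i] = 7; }
        std::printf("%04x\n", (unsigned)match(r, t));  // bits 8..15 set
    }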
On the other hand, I think concurrency is largely a solved problem to the extent that you can use thread-per-core architectures, particularly if you don't rely on shared task queues or work stealing to balance load.
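A minimal sketch of what I mean by thread-per-core without shared queues or stealing (the partition-by-key scheme is just an illustration): each worker owns its shard outright, so there is no queue, no lock, and no cross-thread synchronization on the hot path.

    #include <algorithm>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        std::vector<long long> per_core_total(cores, 0);

        std::vector<std::thread> workers;
        for (unsigned c = 0; c < cores; ++c) {
            workers.emplace_back([&, c] {
                // Each worker only ever touches items that hash to its core.
                for (long long item = 0; item < 1'000'000; ++item)
                    if (item % cores == c) per_core_total[c] += item;
            });
        }
        for (auto& t : workers) t.join();

        long long total = 0;
        for (long long x : per_core_total) total += x;
        std::printf("%lld\n", total);  // 499999500000
    }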
> particularly if you don't rely on shared task queues or work stealing to balance load.
Anywhere I can read up on better load balancing techniques? Or, are we talking about “Know ahead of time how long each task will reliably run so you can statically schedule everything”?
The main challenge here is that a lot of languages have historically treated threading as an afterthought. Python is a good example: support was so limited (due to the GIL, which they are only now in the process of removing) that people mostly didn't bother with threads and tried to orchestrate processes instead. Languages like Go and JavaScript are really good at async stuff, but that's mostly all happening on one core. You can of course run these with multiple cores, but you have only limited control over which core does what.
Java has had threading since v1. Fun fact: it was all green threads in 1.0; real threads that could use a second CPU (if you had one) did not arrive until 1.1. And they've now come full circle with a way to use "virtual" threads, which is technically what they started with 30 years ago. Java also went on a journey of doing blocking IO on threads, jumping through a lot of hoops (NIO) to introduce non-blocking IO, and lately rearchitecting blocking IO so that you can (mostly) pretend your blocking IO is non-blocking via virtual threads.
That's essentially what Project Loom enables. Pretty impressive from a technical point of view, but it's a bit of a leaky abstraction with some ugly failure modes (e.g. deadlocks if something deep down in some library happens to use the synchronized keyword). If that happens on a single real (carrier) thread running a lot of virtual threads, that thread and all the virtual threads on it are blocked.
There are other languages on the JVM that use somewhat higher-level abstractions here. I'm mainly familiar with Kotlin's coroutines, but Scala went there before them, of course. What I like in Kotlin's take on this is the notion of structured concurrency, where jobs fork and join in a context and can be scheduled via dispatchers onto lightweight coroutines, a thread pool, or a virtual thread pool (same API, which was kind of the point of Loom). So it mixes parallelism and concurrency and treats them as conceptually similar.
Structured concurrency is also on the roadmap for Java, as I understand it. But a lot of mainstream languages are stuck with lower-level or more primitive mechanisms, or use completely different approaches for concurrency and parallelism. That's fine for experts doing systems programming, but not necessarily ideal if we are all going to do multi-core by default.
IMHO structured concurrency would be a good match for Python as well. It's early days for the GIL removal, and the threading and multiprocessing modules are a bit dated/primitive. Async was added at some point in the 3.x cycle, but doing both async and threading is going to require something beyond what's there currently.
On the .NET side, there is a dataflow framework built on the TPL for structured concurrency, but few people are even aware it exists; it's async/await all over the place nowadays.
Some code should be single-core: a frontend UI for a web application, for example. You don't want your frontend hoarding all of the user's CPU capacity.
But I do like implementing my backends as multi-core by default because it forces me to architect the system in a simple way. In many cases, I find it easier to implement a multi-core approach. The code is often more maintainable and secure when you don't assume that state is always available in the current process. It forces a more FP/stateless approach. Or at least it makes you think really hard about what kind of state you want to keep in memory.
In backends, you usually need to handle concurrent requests from multiple users anyway, regardless of whether you need it for performance. From there, the step to using multiple cores is sometimes very small.
I.e. you usually don't need to make a for loop parallel; you can just make sure different requests run in parallel.
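A toy sketch of that point (no real networking; thread-per-request is the simplest possible scheme, and the names here are invented): the per-request code stays completely sequential, and the parallelism comes from serving many requests at once.

    #include <thread>
    #include <vector>

    // Plain sequential per-request work; nothing inside it is parallelized.
    void handle_request(int request_id) {
        long long total = 0;
        for (int i = 0; i < 1'000'000; ++i) total += i ^ request_id;
        (void)total;  // pretend we wrote a response
    }

    int main() {
        std::vector<std::thread> in_flight;
        for (int id = 0; id < 16; ++id)   // pretend these arrived concurrently
            in_flight.emplace_back(handle_request, id);
        for (auto& t : in_flight) t.join();
    }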
True, it is a more natural progression to parallelize on the backend, since isolating different requests has multiple benefits besides scalability, e.g. security and maintainability. Also, we have more control over the backend environment, so it's easier to add external components (e.g. a data store like Redis) if needed to help keep track of state; there's no need to store everything in process memory. Not to mention that most backends already need/use a database, so it can be used to keep state with relative ease and efficiency without even adding any new service/component.
I think this is less innovative than it seems.
The approach described in this article is to reverse the good old fork/join, but it would only be practical for simple subtasks or basic CLI tools, not entire programs.
In the end, using this style is almost the same as doing fork/join, except the setup is somewhat hidden.
Those interested can go look at all of the actual code I’ve written using these techniques, and decide for themselves whether or not it’s practical only for “simple sub tasks or basic CLI tools”:
https://github.com/EpicGamesExt/raddebugger/blob/c738768e411...
https://github.com/EpicGamesExt/raddebugger/blob/master/src/...
Posting a tl;dr here might stave off some of the dismissive comments based on reading only the headline.
This is good advice. That said, I don't think you should care about people who comment on articles without reading them.
Based on the title I would have assumed that the programming model would be inverted, but it wasn't. What is needed is something akin to the Haskell model, where the evaluation order is unspecified while mutation is still allowed. The way to do this would be a Rust-style linear type system where you are allowed to acquire exclusive write access to a region of memory but not allowed to perform any side effects, and all modifications must be returned as if the function were referentially transparent. This is parallel by default, because you actively have to opt into a sequential execution order if you want to perform side effects.
The barriers to this approach are the same old problems with automatic parallelization.
Current hardware assumes a sequential instruction stream with hardware threads and cores, and offers no microsecond-scale primitive for rapidly scheduling code onto another core. This means you must split your program into two identical programs that are then managed by the operating system, which kills performance due to the excessive synchronization overhead.
The other problem is that even with low-latency scheduling, you still need to gather a sufficient amount of work for each thread. Too fine-grained and you run into synchronization overhead (no matter how good your hardware is); too coarse-grained and you won't be able to spread the load across all the processors.
There is also a third problem lurking in the dark that many developers, with the exception of the Haskell community, underestimate: running programs in a suboptimal order can massively increase instantaneous memory usage, to the point where the program can no longer run. Think of a program that allocates memory for each request, processes it, then deallocates, over and over. What if it accepts all requests in parallel? It will first allocate everything, then process everything, and only then deallocate everything.
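A toy illustration of that third problem (buffer size and request count made up): run sequentially, only one working buffer is ever alive; accept everything in parallel and every buffer is alive at once.

    #include <cstddef>
    #include <thread>
    #include <vector>

    void handle_request() {
        std::vector<std::byte> buffer(10 * 1024 * 1024);  // 10 MB working set
        // ... process, then the buffer is freed on return ...
    }

    int main() {
        const int N = 64;

        // Sequential order: peak memory ~ 10 MB.
        for (int i = 0; i < N; ++i) handle_request();

        // Accept everything in parallel: peak memory ~ N * 10 MB = 640 MB.
        std::vector<std::thread> all;
        for (int i = 0; i < N; ++i) all.emplace_back(handle_request);
        for (auto& t : all) t.join();
    }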
He's describing the CUDA execution model.
More interestingly, it's the same model as Intel's Implicit SPMD Program Compiler (ispc, God that name is awful). But ispc works across SIMD lanes.
Yes, this is just the GPU programming model without the hardware perks (subgroups etc.). I'm impressed if he came up with it on his own, but it's pretty clear from the article that he didn't.
The GPU model works because the GPU is just wide SIMD with automagical scheduling.
Applying this to the CPU might be misguided unless you use SIMD, i.e. something like ISPC.
Everyone wants parallelism until mutexes enter the room.
What I have found is that even among talented senior engineers there is a massive Dunning-Kruger effect when it comes to performant architecture. They don't know how to do it, and they don't know that they don't know how to do it.
This hit me right in the heart. I'm often the only person on the team who cares about performance, so I am drawn to these performance-related challenges... and it has really hurt my career, because I am then often perceived as some kind of person focused on optimization rather than delivering features.
Like the author I do not focus on performance for performance's sake. Nearly all of the time the right course of action is "do not optimize this piece of code; focus on maintainability and readability instead."
However, sometimes you really need to design for performance from the beginning... or you don't have a product.
At my most recent job they were trying to push lots of data through RabbitMQ/Celery for scientific work. This worked for trivial jobs (tens of megabytes) but not for moderate or large ones (hundreds of gigabytes).
To make such a product viable, you really need to consider performance from the start. Celery explicitly tells you not to pass non-trivial amounts of data around: you should be passing pointers to data (database IDs, file paths, S3 URIs, whatever) rather than the actual full-fat data.
The team really struggled with this. Their next proposed solution was "well, okay, we'll store intermediate results in the database and 'optimize' later." Great idea, but this involved 1B+ result objects. Wrong again. You are not serializing 1B+ Python objects, sending them over the wire, and performing 1B+ Redis or Postgres inserts in any reasonable amount of time or memory. Optimize and bulk insert all you want, but that's an absolute dead end.
There aren't a whole lot of options for performantly slinging around hundreds of gigabytes of data. Assuming you can't just run on a monster server with hundreds of GB of RAM (which honestly is often the right answer) you are generally going to be looking at fast on-disk formats like Parquet etc. In any event that's something you really need to design around from the start, not something you sprinkle on at the end.
They're on their second iteration of the architecture right now, and it's slower than the first iteration was. Still no viable product. Shame.
> I'm often the only person on the team who cares about performance, so I am drawn to these performance-related challenges... and it has really hurt my career, because I am then often perceived as some kind of person focused on optimization rather than delivering features.
Pro tip: always turn that kind of thing into a dollar value you can put on your annual review. "My update to X let us use Y fewer EC2 instances, saving us $Z per year." Then it's not some Don Quixote obsession, but a clear fiscal benefit to the company.
This is kind of the gold standard and well worth aspiring to for any given situation.
I've often found it hard to achieve in practice.
A typical issue is that for non-trivial performance improvements (the kind that take several days to several months to achieve), it's hard to estimate the speedup (and therefore the corresponding dollar amount) without actually doing a substantial portion of the work first.
Another typical issue is when the performance speedup doesn't correspond directly to a dollar amount. At a recent employer we were working on a scientific product, and the optimizations would have affected the size of the datasets our product could handle. So there would have been a dollar impact, but it would not have been some kind of relatively simple AWS math; it would have been measured in potential sales dollars.
It's like making your game run at 60fps instead of 30fps. 60fps is clearly better (all other things being equal) and better games tend to sell better, and we'd like to make the best game possible and sell the most copies possible... but how do we quantify the expected return on this work? Many times, I don't think you can.
I think that software performance is the good kind of vanity metric: speed almost always translates to a better end-user experience. But there is a point where it can turn into a pointless rabbit hole.
Somewhat interesting read, if rather long.
My experience is that multi-threading has quite a bit of overhead, not necessarily from the scheduling, but from cache misses, because now everything is unlikely to be in cache. A naive parallel for can easily end up consuming a ton of CPU resources; it may indeed finish quicker, but use 5x the overall CPU time to do so.
Then there is the other issue parallel_for suffers from: the "starter" thread has to finish the loop, and it may end up with nothing to do for some time (like when one of the helper threads gets suspended), or it might go off to process some other work, causing the entire loop to take much longer to finish. So parallel_for kinda sucks, and I prefer using dependency graphs when I can.
> programmers leave an enormous amount of performance on the table by ignoring the fundamentally multi-core reality of their machines.
It's left on the user's table for them to use to run other things.
Imagine if your whole system followed this concept, bloating your thread count by 32x or something. Using multiple threads is not free, and scheduling between them all eats performance. Not only that, each thread uses up extra memory.