Title feels a bit too click-baity. The article first says multi-consumer queues break global ordering (but preserves per-consumer ordering), and then proposes a data structure that completely abandons any ordering, without explaining why that’s ok. It’d be good to at least acknowledge that queues have their uses.
Also if this is really about bags, why not open with that?
I took the article to be saying that mpmc queues are not queues so we can use bags instead because bags are faster. The second claim seems dubious to me though. I checked out the repo which led me to https://alexsaveau.dev/blog/opinions/performance/lockness/at... which proposes new CPU instructions to create atomic bitmasks. As the author notes, the bag structure breaks down a bit because you have to operate on the bit mask with atomic operations which will contend heavily with each other.
Ignoring all the other problems with this article that have been pointed out around definitions, it also claims that lockless is slow anyway. Without giving literally any data.
Good news: it's not slow
Let's take rust, which has an oddly thriving ecosystem of lockless mpsc/mpmc/etc queues, and lots of benchmarks of them in lots of configurations.
The fastest ones easily do at least 30 million elements/second in most configurations. The "slowest" around 5-10 million.
So the fastest is doing about 33ns per element and the slowest somewhere around 100-200ns.
Which is probably why the article offers no data. It's not actually slow. It's actually really fast when done well.
I have always found MPMC ring buffers to have worse tails than pointer-based MPMC queues. These are very unfashionable (especially in rust) but can be made wait-free with a single atomic for readers and writers. The SPSC ring buffer is perfect, but adding either more producers or more consumers makes the whole thing have terrible pathologies.
Reading the article again, this lockless bag idea also needs two atomic writes on every operation.
SPMC ring buffer or SPMC "disruptor" aren't that bad. Multiple producers in a general ring buffer definitely introduce a lot of issues.
The "disruptor" pattern has some significant problems with tail performance (around pathological OS scheduling) if you aren't running one thread per core the way LMAX designed it.
I read "slow" and "fast" in the article as comparative term. "slower than" what the writer has seen in other cases.
Absolute ms doesn't prove much unless you put it in comparison with other best.
Without any data it's impossible to even tell what this means or whether it matters. I.e., even as a comparative term it's useless without real data.
Theoretically better (which is questionable at best in this case) doesn't mean anything useful if it doesn't actually matter in practice, or is always worse in practice.
As noted by other commenters, the point I was trying to get across is that the way we implement lockless channels is suboptimal and could be made faster from a theoretical standpoint.
In my benchmarks[1], the average processing time for an element is 250ns with 4 producers and 4 consumers contending heavily. That's terrible! Even if your numbers are correct, 100ns is a bit faster than two round trips to RAM while 33ns is about three round trips to L3 and ~100x slower than spamming a core with add operations. That's slow.
[1]: https://github.com/SUPERCILEX/lockness/blob/master/bags/benc... $ cargo bench -- 8_threads/std_mpmc
It's only terrible if it actually matters?
Also it doesn't look like you are using any of the optimized lockless implementations like crossfire/etc, so not sure what this actually proves?
Why would lockless be slower? Is it ever slower?
The blog post gatekeeps the definition of "queue" to require a global processing order (not merely a global dequeue order following the FIFO principle), when the vast majority of real-life queues have no such constraint whatsoever.
If you go to the city hall and they have multiple counters, you pull a ticket with your number. There is a global order of dequeue operations. The first person to pull a ticket will be the first to be assigned to a counter. If there is one counter, the second person has to wait for the first person. If there are two counters, then the second person can go to the second counter. If the second counter works faster than the first counter, then the third person will go to the second counter.
According to the blog post this violates a global processing order constraint (created by whom?). From the perspective of such a constraint, the people at the second counter appear to be coming from the future.
This is a very strange way to look at queues, because the primary purpose of a queue is to temporarily buffer data when there is insufficient capacity to process incoming requests. Calling something broken because it fulfills its purpose is a very negative way to approach life.
Yes, the author hasn't heard the term "service discipline" before. Sure, queuing theory is easier if you don't consider all the various service disciplines that real life queues use in practice (both physical queues and digital queues).
Thanks for sharing, I had not! It sounds like "processor sharing" would be the expected mode of operation for lockless queues. But see my comment to the parent, this is not how they work.
In principle there is an infinite number of possible service disciplines, since a discipline is just an arbitrary function that selects an element from a set. Service disciplines can also include randomness. I only know the basics of the mathematics of queueing theory, but my introduction to it gave me the impression that almost anything beyond the most basic disciplines doesn't have closed-form solutions to most questions. Like many models, you can introduce approximations to get close-enough answers. I personally prefer simulation, but that has its own problems.
This is actually a great analogy because it exemplifies the misconceptions people have about lockless queues.
In the example with multiple counters, in real life each counter could shout out a number and have people approach their respective counters in parallel. But this is not how lockless queues work. Instead, the person at the head of the queue holds a baton and when multiple numbers are called, everybody waiting in the queue goes up to the counter of the person holding the baton. Once that head-of-the-queue has made it to the counter, they give the baton to the person behind them who then drags everybody along to their counter. And so on.
The article was arguing for a lockless channel implementation akin to your interpretation of a queue with parallel access to the counters.
> the primary purpose of a queue is to temporarily buffer data
I think if that’s all you care about you’re in the clear.
The article is useful in that it considers a more general set of use cases (people do all kinds of weird things) and highlights that a somewhat intuitive assumption one might have may not hold.
> Acausality within consumers is the only upheld invariant: a consumer will not see any elements prior to the last element it has seen.
This is simply not true. A standard multi-consumer queue is ordered on the consumer side so long as consumers synchronize with each other. This could be as simple as one consumer setting a global flag after receiving message A and another consumer observing that flag before receiving message B. A lockless bag will not have this property.
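A minimal sketch of that flag pattern, assuming the crossbeam-channel crate as the MPMC queue (not the article's structure): consumer 1 dequeues A and raises a flag, consumer 2 waits for the flag before dequeuing, and FIFO order guarantees whatever it then sees was enqueued after A. A bag makes no such promise.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let (tx, rx) = crossbeam_channel::unbounded::<&'static str>();
    let saw_a = Arc::new(AtomicBool::new(false));

    tx.send("A").unwrap();
    tx.send("B").unwrap();

    let rx1 = rx.clone();
    let flag = saw_a.clone();
    let c1 = thread::spawn(move || {
        // Only active receiver at this point, so FIFO hands it the oldest element.
        assert_eq!(rx1.recv().unwrap(), "A");
        flag.store(true, Ordering::Release); // publish "I have dequeued A"
    });

    let c2 = thread::spawn(move || {
        while !saw_a.load(Ordering::Acquire) {
            std::hint::spin_loop(); // wait until consumer 1 has seen A
        }
        // This dequeue happens-after consumer 1's dequeue of A, so with a FIFO
        // queue the element returned here was enqueued after A.
        assert_eq!(rx.recv().unwrap(), "B");
    });

    c1.join().unwrap();
    c2.join().unwrap();
}
```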
Similarly, any of these queues have the nice property that, even if no one synchronizes anything, they’re fair: even under heavy load such that the queue never comes close to emptying, every push attempt will be fairly serviced by a pop without any other synchronization needed.
Attempting to relax either of these could be problematic, especially as a drop-in replacement of the data structure. I suspect that quite a lot of software that uses queues pushes special tokens to indicate changes of global state, and safety and correctness may be violated by reordering. For example, an MPMC system could have a producer push a special “I’m done” token when it’s all done, and the consumers could collectively count the “I’m done” tokens and assume (correctly on a standard MPMC queue) that no more messages will be in the queue once one “I’m done” token per producer has been globally seen. If those tokens come out of a bag too early, then a different synchronization mechanism would be needed.
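Sticking with crossbeam-channel for illustration, the shutdown pattern I have in mind looks roughly like this; it is only sound because per-producer FIFO feeds a single global dequeue order, which is exactly what a bag gives up.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

enum Msg {
    Work(u64),
    Done, // one per producer, pushed after all of its work
}

fn main() {
    const PRODUCERS: usize = 4;
    let (tx, rx) = crossbeam_channel::unbounded::<Msg>();
    let done_seen = Arc::new(AtomicUsize::new(0));

    let producers: Vec<_> = (0..PRODUCERS).map(|p| {
        let tx = tx.clone();
        thread::spawn(move || {
            for i in 0..100u64 {
                tx.send(Msg::Work(p as u64 * 100 + i)).unwrap();
            }
            tx.send(Msg::Done).unwrap(); // nothing from this producer follows
        })
    }).collect();
    drop(tx);

    let consumers: Vec<_> = (0..4).map(|_| {
        let rx = rx.clone();
        let done_seen = done_seen.clone();
        thread::spawn(move || {
            while let Ok(msg) = rx.recv() {
                match msg {
                    Msg::Work(_item) => { /* process */ }
                    Msg::Done => {
                        // Once all Done tokens have been dequeued, FIFO order
                        // guarantees every Work pushed before them was already
                        // dequeued; with a bag, work could still be sitting inside.
                        if done_seen.fetch_add(1, Ordering::AcqRel) + 1 == PRODUCERS {
                            // safe to begin shutdown
                        }
                    }
                }
            }
        })
    }).collect();

    for p in producers { p.join().unwrap(); }
    for c in consumers { c.join().unwrap(); }
}
```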
As other comments note there are plenty of problems restricting queues to ordered execution. For most use cases that simply does not matter. What matters are related features like avoidance of starvation or execution bias. And those are perfectly possible to do with multiple consumers/producers.
> Lockless queues are slow
That particular implementation of a lockless queue may be slow, but that is not globally true for lockless queues. There are variants, for example BBQ, https://www.usenix.org/conference/atc22/presentation/wang-ji..., that have great performance.
I don't know what others are reading.
The article basically says that when you have multiple suppliers or consumers, the "order" of the queue loses meaning. It turns into an "unordered" pool of data. Therefore the focus should shift from maintaining a "queue of data" to a "bag of data".
No, order does not lose meaning. You can have "ordered" or "unordered" or "almost ordered", and pretending otherwise loses meaning.
Even a slightly-misordered queue has very useful fairness guarantees: for a queue of size N, an item will spend O(N) steps in the queue.
"Bags of data" have no guarantees of that sort at all. By their definition, an item can spent an arbitrary time in the "bag", which makes those "bags" pretty useless for many real-time use cases.
As an example, imagine a bag where readers and writers always prefer smaller indices. If an item gets stuck in the last slot, it will never get picked up as long as new items keep coming. This is pretty useless behavior, yet it is totally compatible with the "bag of data" definition.
> The article basically says that when you have multiple suppliers or consumers, the "order" of the queue loses meaning.
The disconnect to me is that the title says "it's not a queue". It is a queue, it's just (as you note) not being used in a way where it being a queue provides any benefit.
I think the key advantage they are describing for their bag is that it allows you to claim an element then operate on that element independently. This avoids head-of-line blocking caused by things like random context switching mid-processing.
However, that can be implemented with an MPMC queue with no exotic operations, precise O(1) behavior, and no size bounds. For a bag with two N-bit bit-vectors, just have two N-entry queues of indices. Dequeue on acquire (giving you an unused index) and enqueue on release (releasing your now-used index). If anybody finishes processing quickly, they release back to the queue for new processing, which means there is no head-of-line blocking. You only wait when all N are actively processing.
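A rough sketch of my reading of that construction, using crossbeam's ArrayQueue as the bounded MPMC queue and a Mutex per slot purely to keep the sketch safe (the holder of an index is the only one touching its slot, so the lock is uncontended):

```rust
use crossbeam::queue::ArrayQueue;
use std::sync::Mutex;

struct SlotPool<T> {
    slots: Vec<Mutex<Option<T>>>, // each slot is owned by whoever holds its index
    free: ArrayQueue<usize>,      // indices available to producers
    ready: ArrayQueue<usize>,     // indices holding data for consumers
}

impl<T> SlotPool<T> {
    fn new(n: usize) -> Self {
        let free = ArrayQueue::new(n);
        for i in 0..n {
            free.push(i).ok();
        }
        Self {
            slots: (0..n).map(|_| Mutex::new(None)).collect(),
            free,
            ready: ArrayQueue::new(n),
        }
    }

    /// Producer side: claim a free slot, fill it, publish its index.
    fn produce(&self, value: T) -> Result<(), T> {
        match self.free.pop() {
            Some(i) => {
                *self.slots[i].lock().unwrap() = Some(value); // uncontended: we own i
                self.ready.push(i).ok();
                Ok(())
            }
            None => Err(value), // all N slots currently in flight
        }
    }

    /// Consumer side: claim a ready slot, drain it, recycle its index.
    fn consume(&self) -> Option<T> {
        let i = self.ready.pop()?;
        let value = self.slots[i].lock().unwrap().take(); // uncontended: we own i
        self.free.push(i).ok();
        value
    }
}
```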
I guess you also still have the tiny window on enqueue between reserving the slot and actually writing out the index where you could tear due to a perfectly sub-optimal context switch.
If you want an exotic operation to make this better, then you would want an uninterruptible two-instruction sequence of: "atomic modify, then use old value as index for indexed store". So that would be an instruction with four parameters: memory location for atomic, atomic modify argument, base memory location for store, register value to store; and the instruction itself would encode the operation and store scale/stride. Importantly, the indexed store does not need to be atomic with respect to the first, it merely needs to be ordered after it with a guarantee that a context switch or interrupt cannot result in partial completion. This is a much weaker requirement than making multiple cache-line loads/stores fully atomic with respect to each other.
Such an operation would improve multi-producer queues with fixed-size elements smaller than a word and would likely be fairly trivial to implement, merely requiring fusion of two existing instructions in a way that prevents interruption between them.
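For reference, here is the two-step sequence such an instruction would fuse, written with today's primitives (a toy multi-producer ring with u32 elements; the consumer side, capacity checks, and a publish flag are all omitted). The gap between the two steps is the interruption window in question:

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

const N: usize = 1024;

struct MpRing {
    tail: AtomicUsize,
    slots: [AtomicU32; N],
}

impl MpRing {
    fn new() -> Self {
        Self {
            tail: AtomicUsize::new(0),
            slots: std::array::from_fn(|_| AtomicU32::new(0)),
        }
    }

    fn push(&self, value: u32) {
        // Step 1: atomic modify - reserve a slot by bumping the tail.
        let i = self.tail.fetch_add(1, Ordering::AcqRel) % N;
        // <-- a context switch here leaves the slot reserved but unwritten -->
        // Step 2: indexed store of the element into the reserved slot. The
        // proposed instruction would guarantee steps 1 and 2 complete without
        // interruption in between, without making them one big atomic.
        self.slots[i].store(value, Ordering::Release);
    }
}
```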
The author has a point, but the global ordering of an MPMC queue is not entirely pointless. Assuming the queue marks the messages with the order, the ordering might be used at a later stage in the pipeline to reorder the processed messages and recover the total order.
Also there are implications for fairness. A queue implemented as a stack might mean that messages pushed during a spike are never processed or experience very high latency.
Finally, it is actually possible to implement a decent lock-free MPSC queue on top of a node-based stack: the consumer pops all elements in one operation, then reverses the list. This O(N) cost is amortized over N pops, so it is actually not bad.
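A minimal sketch of that idea, assuming a hand-rolled Treiber-style stack (a real implementation would also want a Drop impl and explicit Send bounds):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

pub struct StackMpsc<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> StackMpsc<T> {
    pub fn new() -> Self {
        Self { head: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Any number of producers: push onto the stack with a CAS loop.
    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Acquire);
            unsafe { (*node).next = head };
            if self
                .head
                .compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    /// The single consumer: detach the whole stack, then reverse into FIFO order.
    pub fn drain(&self) -> Vec<T> {
        // One atomic swap takes every pushed node (newest first).
        let mut cur = self.head.swap(ptr::null_mut(), Ordering::AcqRel);
        let mut items = Vec::new();
        while !cur.is_null() {
            // Safety: the detached nodes are no longer reachable by producers.
            let Node { value, next } = *unsafe { Box::from_raw(cur) };
            items.push(value);
            cur = next;
        }
        items.reverse(); // stack order (LIFO) -> queue order (FIFO)
        items
    }
}
```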
I did do an actual lock-free MPMC ring buffer implementation as an exercise. I used that to make blocking bounded queues using various synchronization mechanisms, mutex/condvars and eventcounts among others. The eventcount version runs about 8x faster than the mutex version.
Don't these lockless bags break down when it comes to threads not spinning? For instance, if you want to sleep a thread until something was produced?
"They come in four variants: {single,multi}-producer {single,multi}-consumer."
The article makes it sound a bit as if they were all created equal, while I think only MPSC is really used widely. The others are all kind of niche, because usually we have better solutions for the problems they are suitable for.
Fair point, though I'm not sure I agree. MPMC channels underpin pretty much every general task scheduler (take a peek inside tokio or rayon for example). And SPSCs are quite useful for designing custom pipelines. Though I agree that MPMC channels are silly.
Imagine standing in line at the supermarket, or municipality, or anywhere where you stand in line.
The front of the line is near service desk 1. When it’s your turn, you have to walk to service desk 10. As you’re walking to it, service desk 1 frees up, and the person who was behind you in line walks to service desk 1. They arrive in seconds, while you’re still halfway to being served, a few meters away.
Op is pointing out that (programming) queues without locking lack a certain purity which is also lacking in real life queues.