> For example, when the GPU is fully idle, nvidia-smi tells me that it’s only pulling 88W of power.
I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?
Server cards are not optimized for idle power usage. They’re expected to be fully utilized.
For server gear it’s more common to have less dynamic power and voltage switching because it produces more predictable performance and latency.
For GeForce cards you can get similar behavior by setting “Prefer maximum performance” which disables some of the low power states.
CPU sleep states are normally disabled on servers because CPU demand can increase faster than the CPU speed can increase. It is much better to have a $20,000 CPU running at max MHz at all times.
If my gpu is sitting idle, and I mean idle with nothing loaded into its memory, it's sitting at about 18W. If I load in a model that uses nearly all of the memory but that model is idle, it's at 36W. If that model is actively thinking, it's like 118W. I think this is likely due to the GPU being aware that there is real data loaded into memory and turning up the DRAM refresh rate, whereas when nothing is loaded, the dynamic power is as low as possible.
I suspect the act of running nvidia-smi itself prevents the GPU from being put into a low-power state.
From memory this is true and nvml (Nvidia management library) is the way to get stats that doesn't cause the GPU to wake.
Yes, I have some of these cards and AFAICT the HBM2e chips just always run at full speed. I have different variants of the pcie cards and while I can get the gpu itself into a lower power state the memory just runs full tilt. Though I see 40w on my “normal” cards and 60w on the Frankenstein card that thinks it’s an sxm4.
IIRC this was one of the issues with 2/2e, some combination of the various available memory controllers not agreeing on a standard to manage timings and power states. I haven't played around with my Radeon VII in a long while now.
That aside idle power consumption is a driver-to-driver affair from both amd and novideo, sometimes I'm only pulling 15-30W when nothing is happening and other times it decides it needs 110w for a static 500hz screen
I went in expecting to find 'branch prediction'[0] as the answer, but apparently things are even more complex nowadays.
[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...
To be fair, the culprit in the article is _less complex_ than branch prediction: "with random data, bits are flipped often, and bit flips in transistors inherently draw power" is less mental gymnastics than "with random data, the cpu fails to predict the future, causing redundant speculative execution"
But why do we expect random data to result in more bit flips? That seems harder to argue than the mechanics of a basic branch prediction system.
Think about it from the other end. Why would any bits flip at all in the data path of your matrix multiplier when all the matrices are 0?
Sure, when comparing 0’s to anything else. But what about normal distribution to uniform in 0,1? The author hand waves something about signs but it’s not very well reasoned - that’s just a single bit in floats.
And what of the Pi test - I’d expect that to flip many more bits than the 1-bit one.
If the inputs are constant, then all the multiplies are constant and the only thing that toggles is the accumulation. Which explains the pi situation.
Normal vs uniform is less clear, but also not as much of a difference. The argument about signs isn't just about a sign bit, though. The way you negate during accumulation is that you flip all the bits. Only the final float representation is sign+magnitude; the accumulation itself has two's complement steps. I don't actually know the analysis here, just pointing out that it's not that simple.
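To make the negation point concrete, here is a minimal sketch that counts how many bits change when a value flips sign under two encodings. The 32-bit width and the helper names are illustrative assumptions, not a description of any real accumulator.

```python
# Count the bits that toggle when a register goes from +value to -value.
# WIDTH is an arbitrary illustrative width, not a hardware fact.
WIDTH = 32
MASK = (1 << WIDTH) - 1

def popcount(x: int) -> int:
    return bin(x).count("1")

def twos_complement(value: int) -> int:
    """Encode a signed integer as a WIDTH-bit two's complement pattern."""
    return value & MASK

def sign_magnitude(value: int) -> int:
    """Encode a signed integer as one sign bit plus WIDTH-1 magnitude bits."""
    sign = 1 if value < 0 else 0
    return (sign << (WIDTH - 1)) | abs(value)

def flips_on_negate(value: int, encode) -> int:
    """Hamming distance between the encodings of +value and -value."""
    return popcount(encode(value) ^ encode(-value))

tc_flips = flips_on_negate(5, twos_complement)
sm_flips = flips_on_negate(5, sign_magnitude)
print(tc_flips, sm_flips)  # prints "31 1"
```

For a small magnitude like 5, the two's complement pattern changes in 31 of 32 bits, while sign+magnitude changes in one. That is the sense in which a sign change can cost far more than "a single bit".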
Anywhere I can read more about this float accumulation with 2’s complement?
Field effect transistors are basically a capacitor. They store energy.
If you switch a not gate's input from zero to one to zero and so on, the gate capacitance will have to charge and discharge. The entire idea behind CMOS is that if you have n and p channel transistors together, you can take advantage of the fact that electrons are more mobile than holes. Filling and draining electrons gives you a greater switching speed.
If the input stays the same, then the charge at the input inside the flip flop is the same as the charge inside the not gate. No charge differential means no electrons move, so no current flows through the resistance of the internal metal and polysilicon interconnect, nothing heats up, and less power is lost. And no switching obviously happens faster than some switching.
TL;DR If you randomize the data, you will constantly charge and discharge the capacitors.
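A rough way to see this in numbers: count bit toggles between consecutive words on a bus for all-zero, constant, and random data. This is a toy sketch; the 16-bit width, the stream length, and the constant pattern are arbitrary assumptions, and real datapaths have many internal nodes beyond the input bus.

```python
import random

def toggles(stream, width=16):
    """Total bit flips between consecutive words on a `width`-bit bus."""
    mask = (1 << width) - 1
    total = 0
    for prev, cur in zip(stream, stream[1:]):
        total += bin((prev ^ cur) & mask).count("1")
    return total

rng = random.Random(0)  # fixed seed so runs are repeatable
n = 10_000

zeros = [0] * n
constant = [0x4049] * n                        # arbitrary fixed bit pattern
rand = [rng.getrandbits(16) for _ in range(n)]

zero_toggles = toggles(zeros)
const_toggles = toggles(constant)
rand_toggles = toggles(rand)

# Fraction of bus bits that flip per transition (the "activity factor").
activity = rand_toggles / ((n - 1) * 16)
print(zero_toggles, const_toggles, round(activity, 2))
```

Zero and constant streams never toggle, while random data flips about half the bits on every transition. Dynamic power is usually modeled as proportional to activity × C × V² × f, so a higher activity factor means more power at the same clock.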
>I went in expecting to find 'branch prediction'[0]
GPUs do branch prediction? I thought they didn't bother and try to minimize wasted effort by using high amounts of concurrent threads?
They do texture prefetching, which is sorta similar.
I expected a “torch is smart enough to keep track of cases where it just initialized the C in C <= A*B+C to zero, avoiding the add” type situation but I was wrong.
That's exactly what I thought.
I don't think GPUs ever had branch prediction in the first place. You can however run into poor performance due to thread divergence, which is a similar kind of issue (with much less black magic).
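As a minimal sketch of why divergence hurts, assume a simplified lockstep model: a warp runs each side of a branch once if at least one of its threads takes that side, with the other threads masked off. The cycle costs and the warp size of 32 are illustrative assumptions.

```python
def warp_cost(conditions, cost_if=10, cost_else=10):
    """Cycles for one warp executing `if cond: A else: B` in lockstep.

    Simplified model: every side of the branch that any thread takes
    is executed once by the whole warp, inactive threads masked off.
    """
    cost = 0
    if any(conditions):
        cost += cost_if
    if not all(conditions):
        cost += cost_else
    return cost

uniform = warp_cost([True] * 32)                         # all threads agree
divergent = warp_cost([i % 2 == 0 for i in range(32)])   # threads disagree
print(uniform, divergent)  # prints "10 20"
```

No prediction is involved: the cost depends only on whether the threads in a warp agree, which is why it behaves more predictably than CPU branch misprediction.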
I'd have guessed multiply-by-0 and multiply-by-1 can be special-cased to run much faster and simpler code paths, like you'd do when writing MUL for a processor that doesn't have it (I <3 z80)
Hardware engineer here. Special casing the multiply by 0 and multiply by 1 paths is harder than it sounds. In software, the cost of adding special cases is simply performance. You're adding more instructions that execute in sequence on a CPU that already physically exists. Doing this for your multiply case is worthwhile because the speedup is large for 0 and 1 while the cost is not that large (relative to the time taken for the whole multiplication operation) for other values.
Hardware is different. Every operation that can be performed in hardware by a chip needs dedicated circuitry. Special casing 0 and 1 means adding at least OR reduction on each operand and a dedicated multiplexer for every bit of the output. Those transistors use power even when they're not in use (leakage power is a huge issue on modern semiconductor processes). They also degrade timing by adding more gates on critical paths through the multipliers. (The timing issue here is that all operations that happen between one flip-flop and another flip-flop need to finish within one clock cycle.) And unless there are whole blocks of 0's and 1's (this does happen in certain neural networks), you typically won't see a direct speedup anyway. In software terms, the matrix multiply is scheduled as many parallel operations that cannot be accelerated much overall by skipping a few operations in some "threads."
All of this makes zero skipping a nontrivial topic. People do still try to do it but it needs serious consideration as, depending on the application, the case is rarely one-sided.
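The "skipping a few operations in some threads" point can be sketched with a toy model: lanes run in parallel, so the slowest lane sets the total time. The lane count, operand count, and the 10% zero rate below are arbitrary assumptions, and a real multiplier array is pipelined rather than counted in whole cycles like this.

```python
import random

def lane_cycles(operands, skip_zeros):
    """Cycles for one lane: 1 per multiply, 0 for a skipped zero operand."""
    if skip_zeros:
        return sum(1 for x in operands if x != 0)
    return len(operands)

def lockstep_cycles(lanes, skip_zeros):
    """Lanes run in parallel, so the slowest lane sets the total."""
    return max(lane_cycles(lane, skip_zeros) for lane in lanes)

rng = random.Random(0)

# 32 lanes x 64 operands with roughly 10% zeros scattered at random.
scattered = [[0 if rng.random() < 0.1 else 1 for _ in range(64)]
             for _ in range(32)]
# Same shape, but the first half of every lane is a block of zeros.
blocked = [[0] * 32 + [1] * 32 for _ in range(32)]

baseline = lockstep_cycles(scattered, skip_zeros=False)
scattered_skip = lockstep_cycles(scattered, skip_zeros=True)
blocked_skip = lockstep_cycles(blocked, skip_zeros=True)
print(baseline, scattered_skip, blocked_skip)
```

With scattered zeros, the worst lane still does nearly all of its multiplies, so the total barely moves. Only when every lane can skip the same block does the total drop to half.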
You didn't touch on the most important aspect for cost: die area!
How much die space ($) will that circuitry add, when it's probably a statistically near zero chance for your main customers' workload (who has model weights of 0 or 1!?)? And, if you can stomach the cost, what else could you put there instead?
Weights should not be 0 (at least not frequently) but in a ReLU-based neural network, activations are 0 pretty often. You're absolutely right about die area though.
> near zero chance for your main customers' workload
What percent of this hardware is running inference for ReLU models? ;)
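For a feel of how often ReLU activations are zero, here is a small sketch. It assumes zero-mean Gaussian pre-activations, which is an idealization; real networks can sit well above or below this.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

rng = random.Random(0)  # fixed seed so runs are repeatable

# Assumption: pre-activations are standard normal.
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
activations = [relu(x) for x in pre_activations]

zero_fraction = sum(1 for a in activations if a == 0.0) / len(activations)
print(f"{zero_fraction:.2f}")
```

Under that assumption about half of the activations are exactly zero, which is why zero operands are common on the activation side even when the weights are dense.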
Nvidia has added structural sparsity to their GPUs and every time they pull out a flops or tops number, they assume you will use structural sparsity.
The die area argument here makes no sense. Supporting structural sparsity can be done either by duplicating the multipliers, with and without the support, or by having a single general-purpose multiplier that does both, in which case you can have twice as many of them.
Also, in ReLU^2 networks, 90%+ parameters are zero.
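For reference, the structured sparsity Nvidia advertises is a 2:4 pattern: at most two non-zero values in every group of four. The sketch below produces that pattern by magnitude pruning; the function name and the choice of magnitude as the pruning criterion are assumptions for illustration, not Nvidia's tooling.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)  # prints [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the zeros follow a fixed pattern, the hardware knows ahead of time that half the multiplies can be dropped, which is different from detecting arbitrary zeros at runtime.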
I expect the degraded critical path will most likely be worse than a bit of die area. On modern processes you have A LOT of transistors to play with.
Thanks for the detailed explanation, I had no idea about any of this.
I can't tell from the blog, is this actually verified or is it theory and then numbers showing plausibility?
I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.
It’s real, you can measure it yourself on modern Nvidia hardware
I feel like many of the comments missed the point or didn't read the article. What I believe this article is stating (and I've read this many times during my PhD for various reasons), is that the input data distributions affect how many transistor state changes there are during multiplication. Since these events are a large portion of energy loss/heat generation, the clocks won't be throttled as much for certain data patterns.
There was a workshop paper from SC24 that did more experiments around this I believe. I can't find it now though.
Sounds like a side channel attack waiting to happen.
So I guess we'll all be applying a random rotation to our matrices now to obscure their contents, like TurboQuant does. https://arkaung.github.io/interactive-turboquant/#rotation
Not that it super matters, but random hadamards for quantization have been a thing since way before turboquant.
https://arxiv.org/abs/2404.00456
Which llama.cpp now does.
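A minimal sketch of a random Hadamard rotation, for anyone who hasn't seen one: flip signs at random, then multiply by a normalized Hadamard matrix. This is the generic textbook construction (Sylvester), not the specific code in TurboQuant or llama.cpp, and the function names are made up for illustration.

```python
import math
import random

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def random_hadamard_rotate(vec, rng):
    """Apply (1/sqrt(n)) * H * D to vec, where D is a random +/-1 diagonal."""
    n = len(vec)
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    flipped = [s * v for s, v in zip(signs, vec)]
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h * f for h, f in zip(row, flipped))
            for row in hadamard(n)]

rng = random.Random(0)
vec = [0.0] * 16
vec[3] = 8.0  # a single large outlier

rotated = random_hadamard_rotate(vec, rng)

norm_before = math.sqrt(sum(v * v for v in vec))
norm_after = math.sqrt(sum(v * v for v in rotated))
peak_after = max(abs(v) for v in rotated)
print(norm_before, norm_after, peak_after)
```

The rotation is orthogonal, so the norm is unchanged, but the single outlier of 8.0 gets spread into 16 entries of magnitude 2.0. That flattening is what makes the values easier to quantize, and it also scrambles the bit patterns the hardware sees.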
Yeah I am willing to drive faster down a straight flat runway as others ahead of me give an all clear.
When you make it so the computer does not have to compute all possible states of matter it finishes faster.
People have been noticing the effects of this in local LLM inference. Power limiting seems to improve overall performance!
In general, constraints require optimizations and rearchitectures. I'd also expect the RAM shortage, for instance, to have a big impact on the software industry as a whole, especially in games. They will need to make do with what people have: a ps5/pro, or similar in PC power.
I actually think it is a good thing to introduce constraints to AI and the overall tech industry. Hopefully everyone will have to look at improving performance without having to add RAM or increase CPU/GPU performance.
As long as these constraints apply to everyone, rather than being "for thee but not for me", and don't become an instrument for big tech to keep consumers dependent on their infra.
This is not observable from LLM inference, where you would not encounter uniform matrices.
Power limiting does not improve performance but it does improve efficiency. You might be able to get 90% of the performance for only 70% of the power usage, for example. It does not make the card go faster though.
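Working through the example numbers above (90% of the performance at 70% of the power), as a quick sketch:

```python
def perf_per_watt_gain(perf_fraction, power_fraction):
    """Perf/W relative to the uncapped baseline (baseline = 1.0)."""
    return perf_fraction / power_fraction

gain = perf_per_watt_gain(0.90, 0.70)
print(f"{gain:.3f}x perf/W")  # prints "1.286x perf/W"
```

So the card does about 29% more work per joule, while each individual job still takes roughly 11% longer (1 / 0.9).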
When thermal throttling occurs you can perform faster by running slower.
This is precisely because of the efficiency. The lower efficiency of the higher speed triggers a much lower performance sooner.
> When thermal throttling occurs you can perform faster by running slower.
This is not true unless the throttling algorithm is so broken that it's oscillating between extremes.
The parts have a curve of clock speed versus voltage. More clock speed means higher performance. That goes further up the voltage curve, meaning more power.
Throttling just moves the card further down the voltage to clock speed curve. It reduces clock speed, reducing performance.
The cards don't "perform faster by running slower". If you run the card slower, it performs slower.
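The voltage/clock relationship can be sketched with the usual dynamic power model, P ≈ C × V² × f. The linear V/f curve and every constant below are hypothetical, picked only to show the shape; real parts ship with their own measured curves.

```python
def voltage_for_clock(clock_ghz, v_min=0.70, slope=0.25, f_min=1.0):
    """Hypothetical linear V/f curve: volts needed to hold a given clock."""
    return v_min + slope * (clock_ghz - f_min)

def relative_power(clock_ghz, capacitance=1.0):
    """Dynamic power ~ C * V^2 * f, in arbitrary units."""
    v = voltage_for_clock(clock_ghz)
    return capacitance * v * v * clock_ghz

clocks = [1.0, 1.4, 1.8]
# Assumption: performance is proportional to clock speed.
table = [(f, relative_power(f), f / relative_power(f)) for f in clocks]
for f, p, eff in table:
    print(f"{f:.1f} GHz  power={p:.3f}  perf/W={eff:.3f}")
```

In this model, going from 1.0 to 1.8 GHz raises performance by 1.8x and power by about 3x. Moving down the curve always lowers performance, but it lowers power faster.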
With a lower power cap set, it runs cooler, which sometimes allows the GPU to reach higher boost speeds. This is a real effect on gaming GPUs; however, I have no idea if it applies to datacenter GPUs.
>This is not true unless the throttling algorithm is so broken that it's oscillating between extremes.
That algorithm is doing exactly the task I described. If it could temporarily run faster but in a way that would cause oscillation, that literally means it can run faster but is choosing not to, in order to preserve overall performance.
It wouldn't surprise me to see some ML algorithm in silico somewhere to select faster matmul paths on favorable data. Yo dawg, I heard you like AI, so we put some AI in your AI so you can infer while you're inferring.
And there's at least one more level of inception at the data center level, where they use AI to optimize power usage (particularly by predictively controlling cooling, and adaptively rescheduling tasks).
Here is one: An adjustment to weight updates, that makes it more likely for weights to stay uniformly distributed.
~257.5 teraflops for normal distribution, versus ~268 teraflops uniform, reported on the first graph.
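Putting those two figures side by side:

```python
normal_tflops = 257.5   # normal distribution, as quoted above
uniform_tflops = 268.0  # uniform distribution, as quoted above

gap = uniform_tflops - normal_tflops
relative_gap = gap / normal_tflops
print(f"{gap:.1f} TFLOPS, {relative_gap:.1%}")  # prints "10.5 TFLOPS, 4.1%"
```

That's about a 4% throughput difference from the input distribution alone.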
I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.
And for actual runs, from a pre-run sampled curve.
This is old news. AMD was doing this in their CPUs years ago.
Designing for predictable execution flow is one of the advantages of Tenstorrent hardware.
https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...
https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...
https://arxiv.org/html/2604.03279