> The problem is that even for things like consoles, it's usually more "cost efficient" to write normal fast-to-write code that isn't maximally effective, let the compiler do its magic, and call it good enough.
This wasn't always the case. I have a friend who used to work on games for the original PlayStation. I remember him telling me that part of his job (or maybe his whole job) was optimizing the machine code output by the C compiler.
I'm reminded of how Carmack talked about the extra efficiencies available when targeting consoles, because you knew exactly what hardware was available.
It's great that the efficiencies available can be shown to be extractable. The real, much harder, trick is putting together a sufficiently smart compiler to enable them for heterogeneous compute setups.
The demoscene also is an example of how much you can do if you can be absolutely sure exactly what hardware you’re running on.
The problem is that even for things like consoles, it's usually more "cost efficient" to write normal fast-to-write code that isn't maximally effective, let the compiler do its magic, and call it good enough.
Sometimes I dream of what the world would do if we were mystically stuck on exactly the processors we have today, for twenty years.
I've wondered sometimes what software would look like if a crisis took out the ability to build new semiconductors and we had to run all our computing infrastructure on chips salvaged from pregnancy tests, shoplifting tags, cars, old PCs, and other consumer electronics. We'd basically move backwards about 20 years in process technology, and most computers would have speeds roughly equivalent to 90s/00s PCs.
But then, this still wouldn't incentivize building directly to the hardware, because of the need to run on a large variety of different hardware. You're still better off prioritizing portability over performance, and making up for it through reduced scope and ease of development.
You might enjoy Dusk OS and its more extreme sibling Collapse OS: https://duskos.org https://collapseos.org
Funny you say this... this exact thought experiment was going around last month! Laurie Wired [0], a cybersecurity YouTuber, asked it on Twitter and got some interesting replies too!
[0]: https://www.youtube.com/watch?v=L2OJFqs8bUk
This sounds kind of similar to what I've heard about Cuba's relationship with cars, and probably technology generally, after the U.S. embargo. Not sure how true it was/is though.
> I've wondered sometimes what software would look like if a crisis took out the ability to build new semiconductors and we had to run all our computing infrastructure on chips salvaged from pregnancy tests, shoplifting tags, cars, old PCs, and other consumer electronics. We'd basically move backwards about 20 years in process technology, and most computers would have speeds roughly equivalent to 90s/00s PCs.
Don't forget disposable vapes: https://news.ycombinator.com/item?id=45252817
Be careful what you wish for...
Optimizing for the hardware you are on is demonstrably an effort and skill issue. Everyone understands that with enough time and engineers, any piece of software could be optimized better. If only we had large volumes of inexpensive "intelligence" to throw at the problem.
This is one of my back-of-mind hopes for AI. Enlist computers as our allies in making computer software faster. Imagine if you could hand a computer brain your code, and ask it to just make the program faster. It becomes a form of RL problem, where the criteria are 1) a functionally equivalent program 2) that is faster.
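To make that concrete, here is a toy Python sketch of such a reward signal. Equivalence is only spot-checked on random inputs rather than proven, and `reference` / `candidate` are hypothetical stand-ins, not any real system's API.

```python
# Toy sketch of the "reward" such a system might optimize, assuming we have a
# trusted reference implementation and a machine-proposed candidate rewrite.
import random
import time

def reference(xs):            # the original, known-correct program
    return sorted(xs)

def candidate(xs):            # stand-in for an optimized rewrite under test
    return sorted(xs)

def score(candidate, reference, trials=50, size=10_000):
    """Return an average speedup, or None if the candidate ever disagrees."""
    speedups = []
    for _ in range(trials):
        xs = [random.random() for _ in range(size)]
        t0 = time.perf_counter(); want = reference(list(xs)); t1 = time.perf_counter()
        t2 = time.perf_counter(); got = candidate(list(xs)); t3 = time.perf_counter()
        if got != want:                                   # criterion 1: (sampled) equivalence
            return None
        speedups.append((t1 - t0) / max(t3 - t2, 1e-9))   # criterion 2: faster
    return sum(speedups) / len(speedups)

print(score(candidate, reference))
```

A real system would of course need a much stronger equivalence story (property tests, proofs, or at least adversarial inputs) before trusting the rewrite.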
This is what I was thinking, too. For so long, the default mode of operating a software company has been:
"Developer time is so expensive, we need to throw everything under the bus to make developers fast."
The kinds of things often thrown under the bus: Optimizations, runtime speed, memory footprint, disk image size, security, bug fixing, code cleanliness / lint, and so on. The result is crappy software written fast. Now, imagine some hypothetical AI (that we don't have yet) that makes developer time spent on the project trivial.
Optimistically: There might be time for some of these important software activities.
Pessimistically: Companies will continue to throw these things under the bus and just shit out crappy software even faster.
My favorite part of this phenomenon is every company that interviews developers on data structures and algorithms, then puts out a calculator app that takes half a gigabyte of storage and nearly as much RAM to run.
I have not had to use Windows in ages but every time I touch it I am amazed at the fact that it takes like 10-15GB for a bare installation of the latest version, while it does about the same amount of work as XP was able to do in under 1GB. Yes I am aware assets are a thing but has usability increased as a result of larger assets?
The latest iOS update (!) is more than 16 GB… a mobile OS…
It ships with just as many features as Windows 10 which is also in that range, so it's not too surprising.
To be fair, windows has so much backwards compatibility, I'm sure there's a ton of stuff there that's not used by 99.9% of people.
That's a good or a bad thing depending on your perspective
I am fairly certain that if you install every Debian package available it will still be less than 16GB. Windows 10 is a bare OS at that size.
> functionally equivalent
Who confirms what is functionally equivalent?
You can, with some programming languages, require a proof of this (see: Rocq, formerly 'coq').
I think a more interesting case might be showing functional equivalence on some subset of all inputs (because tbh, showing functional equivalence on all inputs often requires "doing certain things the slow way").
An even more interesting case might be "inputs of up to a particular complexity in execution" (which is... very hard to calculate, but likely would mean combining ~code coverage & ~path coverage).
Of course, doing all of that w/o creating security issues (esp. with native code) is an even further out pipe dream.
I'd settle for something much simpler, like "we can automatically vectorize certain loop patterns for particular hardware if we know the hardware we're targeting" from a compiler. That's already hard enough to be basically a pipe dream.
Yeah, restructuring for autovectorization with otherwise equivalent results would be a great example and a good first step.
Notably, for example, C/C++ code is not necessarily functionally equivalent when it's compiled on different platforms.
It's not even guaranteed to be functionally equivalent when compiled on the same hardware with the same compiler etc. Undefined behaviour can do what it wants. (And implementation defined behaviour also has a lot of leeway.)
However, if you stick to only defined behaviour, they are 'functionally equivalent', if your compiler doesn't have a bug.
The Magic does, of course!
It's not just being sure exactly what the hardware is, in demos you have the additional luxury of not being interactive. So you can plan everything exactly out in advance.
This is true of inference too.
> it's usually more "cost efficient" to write normal fast-to-write code that isn't maximally effective, let the compiler do its magic, and call it good enough.
For the last six years my full-time job has largely been optimizing games where most of the team has been working with this mindset. Sometimes someone spends a few days just getting things done, followed by others building on top of it. This leads to systems which are not fast enough and take me weeks or even months to optimize.
We even got together at my last job and created a series of lectures on performance and best practices for everyone, including artists, to get ahead of this type of issue. It was apparently very appreciated, especially among the non-technical staff, who said it was valuable and that they'd had no idea.
Consoles are pretty heterogeneous IRL too, though. You have multiple SKUs (regular and Pro, for example), not to mention most games will also target multiple consoles (PlayStation + Xbox + Switch is a common combo).
So in reality the opportunities to really code against a specific piece of hardware are few and far between...
Heck, then you get into multiple operating modes of the same hardware - the Nintendo Switch has a different perf profile if it's docked vs. not.
This was less true at the time Carmack said it :>
The original Switch launched in 2017; that's plenty of time with a stable platform. In practice, the multiple operating modes can be approached by coding against undocked (handheld) mode and then adding bonus quality for docked.
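A minimal sketch of that pattern, with entirely made-up setting names and values, just to illustrate "tune for handheld, layer bonus quality on top when docked":

```python
# Hypothetical quality tiers: the handheld profile is the baseline target;
# docked mode only ever adds optional extras on top of it.
BASELINE = {"render_scale": 0.66, "shadow_res": 1024, "foliage_density": 0.5}
DOCKED_BONUS = {"render_scale": 1.0, "shadow_res": 2048, "foliage_density": 1.0}

def quality_settings(is_docked: bool) -> dict:
    """Start from the handheld baseline and layer docked-only bonuses on top."""
    settings = dict(BASELINE)
    if is_docked:
        settings.update(DOCKED_BONUS)
    return settings

print(quality_settings(is_docked=False))  # what the game is actually tuned against
print(quality_settings(is_docked=True))   # pure upside when more power is available
```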
This is exactly what I've been doing when optimizing games for the switch.
A handful of variants for consoles is not nearly as bad as the almost limitless variety on PC.
>Sometimes I dream of what the world would do if we were mystically stuck on exactly the processors we have today, for twenty years.
Reminds me of the old American cars in Cuba - https://en.wikipedia.org/wiki/Yank_tank
Cubans benefited from the cars being older, simpler, and robust. Imagine freezing car tech now, with so many electronics, far more parts and built to be replaced relatively quickly!
These older cars broke down all the time. There's a reason old American sit-coms have at least some characters always tinkering with their cars: you needed to do that. Nowadays, cars just work.
That's what you got with BeOS... throw out backward compatibility and build to current best practices... its ability to extract performance out of a 133 MHz processor was amazing.
Even better was the BeBox running BeOS. That was a cool use of a fast dual-CPU PowerPC platform with great graphics. Amiga vibes. But it turns out that humans need software applications more than they need efficient use of the hardware.
It was the story with so many things "back then" - even Itanium was a beast on custom-coded perfect applications.
And don't forget that Sony and Microsoft have compiler teams, working on specialised GCC and LLVM backends, and sometimes upstreaming general improvements.
> The problem is that even for things like consoles, it's usually more "cost efficient" to write normal fast-to-write code that isn't maximally effective, let the compiler do its magic, and call it good enough.
Given all the time and money, there's also a skills gap.
You can use money and time to buy skills.
Unlimited time and money will not make someone like me a John Carmack level programmer. There are a finite number of individuals operating at his level or above and having them hyper optimize code is a poor use of their time.
Oh, I meant more like: if you have enough money, you can employ John Carmack (or similar) for a while.
For hardware that isn't pre-framebuffer, demos seem to be mostly about hyperoptimizing the code in a portable way, much less optimizing to specific hardware timings and quirks.
Rust jobs would actually touch more hard tech rather than being concentrated in crypto scams.
Unless it happens to be something like a PS3 or Saturn.
Despite sentiments around Mojo being negative on HN due to the stack not being OSS, this is the ultimate goal of Modular.
https://signalsandthreads.com/why-ml-needs-a-new-programming...
I listened to that episode, by chance, last week. It was well worth the time to listen.
Not using the NVDEC and NVJPG units to decompress weights into registers? And you say you're using the whole GPU. There are entire blocks on the silicon going idle!
Ha, made me chuckle. For those wondering seriously about this, it's not a viable optimization because weights are not readily compressible via JPEG/DCT, and there are a limited number of these units on the chip, which bottlenecks throughput, meaning it's dwarfed by simply reading uncompressed weights from HBM.
It seems like this is indeed possible using video codecs: https://arxiv.org/abs/2407.00467v1
Good fun. Now I wish RT cores would be programmable with some form of PTX, but for now it's Optix or die. Managed to do fun stuff with it but it's like pulling teeth.
Yeah, but they could be.
I won a GPU hackathon back in 2019 doing something very similar to this, although the other way around: I was compressing weights using hardware modules.
Have a link to this?
Unfortunately no. I have a cool picture, though!
I will have to settle for a picture then :)
Send email (see profile), I'll gladly share more details ^^.
If your workload can't actually use the whole (NVidia) GPU, it is possible to slice it up so that it can be shared between multiple users:
* https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
* https://www.nvidia.com/en-us/technologies/multi-instance-gpu...
Or having multiple processes from one user share it:
* https://docs.nvidia.com/deploy/mps/index.html
AIUI it's only on workstation/server cards though; it's one of the levers they pull to artificially segment their lineup.
How real is the risk of information leakage if I’m on a shared GPU with multiple users?
Contra another comment: fairly low. (Or at least my search-fu has not been able to find any CVEs or published papers about breaking isolation between MIG instances. MPS should generally be used only by one user, so multiple of their own CUDA apps can attach to one (v)GPU.)
MIG is used a lot in HPC and multi-tenancy cloud, where isolation is important. See Figure 1 and §6.2:
* https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
The card is actually sliced into different instances (they show up as different /dev/nvidiaX devices), each with its own SMs, L2, and DRAM, isolated from the others. (MPS is for the same user sharing a GPU instance: it allows multiple CUDA apps to attach, and time-slicing occurs.)
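As a rough, hedged illustration of how that looks from user space (not from the linked docs): MIG slices are enumerable via `nvidia-smi -L` and can be selected per process with `CUDA_VISIBLE_DEVICES`. The parsing below is best-effort, since the exact output format varies by driver version.

```python
# Sketch: list MIG instances reported by the driver and pin this process to one.
# Assumes a MIG-enabled card whose `nvidia-smi -L` output includes lines like
# "  MIG 1g.5gb Device 0: (UUID: MIG-xxxx...)".
import os
import re
import subprocess

def list_mig_uuids():
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
    return re.findall(r"UUID:\s*(MIG-[0-9a-fA-F-]+)", out)

uuids = list_mig_uuids()
if uuids:
    # Restrict CUDA to a single MIG slice; frameworks launched after this
    # (PyTorch, JAX, ...) will see just that slice as "the GPU".
    os.environ["CUDA_VISIBLE_DEVICES"] = uuids[0]
    print("Pinned to MIG instance:", uuids[0])
else:
    print("No MIG instances found (MIG disabled or unsupported card).")
```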
Is anyone actually looking at this platform?
> Is anyone actually looking at this platform?
Question unclear: looking at to use (yes: lots in HPC, hypervisors), or looking at from a security POV (don't know)?
Yeah I'm talking about the latter
I remember a few years ago my hardware security professor suggested we try to implement Rowhammer on GPU. I ended up doing something else, but it looks like someone got there: https://arxiv.org/abs/2507.08166
For MIG it's low; the exploit would have to be exotic.
MPS should only be used where all the workloads trust each other. It is similar to running multiple games on your computer simultaneously.
You cannot use NVLink with MPS or MIG: it is not isolated, and malformed NVLink messages can be authored in userspace and can crash the whole GPU. Some vendors, like Modal, allow you to request NVLink'd shared GPUs anyway.
MIG only makes sense for cloud providers. MPS only makes sense for interactive (read: not ML) workloads. Workloads needing more than 1 GPU cannot use either.
Very real.
https://www.usenix.org/system/files/usenixsecurity24-guo-yan...
https://www.sciencedirect.com/science/article/pii/S016740482...
I do not see MIG mentioned in either paper. I do not think the papers are examining isolation security between instances, which the GP was asking about.
Yeah, I only posted two links from my notes, from when I was looking at this a few months ago. Here's one on MIG.
https://arxiv.org/abs/2207.11428
As per sibling comment, this is about utilization efficiency and not breaking isolation (between MIG instances). The conclusion:
> In this paper, we presented MISO, a technique to leverage the MIG functionality on NVIDIA A100 GPUs to dynamically partition GPU resources among co-located jobs. MISO deploys a learning-based method to quickly find the optimal MIG partition for a given job mix running in MPS. MISO is evaluated using a variety of deep learning workloads and achieves an average job completion time that is lower than the unpartitioned GPU scheme by 49% and is within 10% of the Oracle technique.
That paper doesn't seem to be about security vulnerabilities in MIG, but rather about using it to improve workload efficiency.
MIG virtualization is IMHO weak sauce. Only seven slices. Seven? Extremely limited hardware support. Difficult to configure - like the early days of CUDA. It's been in the works for what, 7 years now, and it's barely functional.
Meanwhile, don’t forget that if your workloads are cooperative, you can put all the processes you want on a single GPU and they’ll happily multitask. No security boundary of course, but who knows how good MIG is at that.
I’d greatly prefer better tools for cooperative GPU sharing like per process memory limits or compute priority levels. Also seems like it should be way easier to implement. As containerization and k8 have proven, there’s a ton of utility in bin packing your own workloads better without rock solid security boundaries.
> MIG virtualization is IMHO weak sauce.
I know several HPC sites that use it: they (e.g.) ordered cookie-cutter server designs/models to simplify logistics, but not all of their users need the complete capabilities, and so they slice/dice some portion into smaller instances for smaller jobs.
E.g.:
* https://hpc.njit.edu/MIG/
* https://www.rc.virginia.edu/2025/07/hpc-maintenance-aug-12-2...
> Only seven slices. Seven?
At some point the slices become so small that they stop being useful. An A100 can have as 'little' as 40G of memory, and you're now down to 5G per instance:
* https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#a10...
> Extremely limited hardware support.
It's a reasonable argument that you'd only need it at the top-end of the hardware: the number of workloads that need all that compute and memory are not that common, so downshifting some hardware to resource slices that are more typical is not crazy. Of course you then upshift when needed: but if you had purchased 'smaller' cards because that's what you thought you (initially) needed, then you're stuck at that level. There's no way for you to upshift/de-downshift.
> Difficult to configure - like the early days of CUDA.
How hard is it to run nvidia-smi?
> Meanwhile, don’t forget that if your workloads are cooperative, you can put all the processes you want on a single GPU and they’ll happily multitask. No security boundary of course, but who knows how good MIG is at that.
The security boundary of MIG is a lot better than MPS's, which basically has no security. I know several folks running HPC clusters that use it to isolate the Slurm workloads of different users. And my search-fu has found no CVEs or published papers jailbreaking out of MIG instances.
> I’d greatly prefer better tools for cooperative GPU sharing like per process memory limits or compute priority levels. Also seems like it should be way easier to implement.
This is what MPS is for:
* https://docs.nvidia.com/deploy/mps/index.html
* https://man.archlinux.org/man/extra/nvidia-utils/nvidia-cuda...
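For what it's worth, MPS does have knobs roughly in that direction. A hedged sketch below; the exact environment variable names and formats should be verified against the MPS documentation for your driver version, and `worker.py` is just a placeholder workload.

```python
# Sketch: launch a worker under MPS with (approximate) per-client resource limits.
# CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps the fraction of SMs a client may use;
# newer drivers also document a per-client device memory limit. Treat both names
# as things to double-check in the MPS docs for your driver.
import os
import subprocess

env = dict(os.environ)
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "25"    # roughly a quarter of the SMs
env["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = "0=8G"   # ~8 GiB on device 0 (newer drivers)

# `worker.py` stands in for whatever CUDA workload you want to contain.
subprocess.run(["python", "worker.py"], env=env, check=True)
```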
Or green contexts
The sentiment in the title resonates, but for consumer GPUs (the article is about server cards).
The recently leaked M5 benchmarks reveal a 35% faster GPU. These improvements compound, so you can get a GPU that's effectively twice as fast by waiting a couple of years.
Modern GPUs are the equivalent of local supercomputers, but the drivers, languages and libraries are still playing catch up. Imagine the audio processing you could do if only you could target that hardware.
Apple gives developers almost all the compute drivers you could want from them. If you can't express your GPU acceleration as a Metal Compute Shader, you probably aren't leaving any GPU horsepower on the table. ANE and MLX will get exposed in higher-level CoreML frameworks, everyone should be happy.
35% raster improvement, it's worth noting, is not super impressive on the GPU side of things. Most raster compute scales quadratically: to double your render resolution you need 4x the GPU power (on paper) to handle the pixel count. That's what, six years of annual iteration? A large component of Apple's and AMD's inability to break into Nvidia's CUDA empire is their obsession with raster optimization in a world where DLSS and FSR exist. It's a noble pursuit, but even as a gamer I've got to admit they're wasting their time. We have software methods that can close the gap in render quality between $100 GPUs and $1000 GPUs, but no such solution for GPGPU compute.
> Imagine the audio processing you could do if only you could target that hardware.
That's an interesting thought. Commercial-grade signal processing relies on FPGAs, and the fintech field adapted them for high-frequency trading. I wonder if we would see signal processing enabled on GPUs for consumers if the GPU drivers were more open.
It should definitely be possible already using CUDA or compute shaders. From a theoretical view, computer graphics is signal processing, but with a signal consisting of up to four color channels across two dimensions. This is the view taken in a lot of papers and practical implementations. After all, a lot of computer graphics is about applying filters (post-processing) such as color grading, anti-aliasing, etc. to this signal.
So, in a very real sense, signal processing is exactly what the GPU is built for and primarily used for.
I recall there have been some efforts in audio DSP on GPUs; audio bandwidth is low, so even transporting the results back to the CPU to be played could be done fast enough to maintain a usable latency.
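As a small, hedged sketch of what that can look like today (assuming a CUDA-capable GPU and the CuPy package; the low-pass filter here is purely illustrative):

```python
# Sketch: crude FFT-based low-pass filter of a 1-second audio buffer on the GPU.
import numpy as np
import cupy as cp

SR = 48_000                                      # sample rate in Hz
CUTOFF = 4_000                                   # keep everything below 4 kHz

noise = np.random.randn(SR).astype(np.float32)   # 1 s of noise, generated on the CPU

x = cp.asarray(noise)                            # copy to GPU memory
spectrum = cp.fft.rfft(x)                        # forward real FFT on the GPU
freqs = cp.fft.rfftfreq(x.size, d=1.0 / SR)
spectrum[freqs > CUTOFF] = 0                     # zero out bins above the cutoff
filtered = cp.fft.irfft(spectrum, n=x.size)      # back to the time domain

out = cp.asnumpy(filtered)                       # copy back for playback on the CPU
print(out.shape, out.dtype)
```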
> It is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way, and we have no intention whatsoever of supporting it.
My favorite type of code
I wish more people posting code would be honest in this way.
Loved this too hahaha..
I know I'm being unfair, but something about the writing style reminds me of this classic:
Transgressing the Boundaries: Towards a Transformative Hermeneutics of Quantum Gravity
https://physics.nyu.edu/faculty/sokal/transgress_v2/transgre...
Ben here -- you may be amused to know that Alan Sokal was my dad's freshman roommate in undergrad!
Ben, as in Benjamin Spector? (As noted below by sciurus.)
Please accept my sheepish apology for taking an unwarranted potshot at your writing style.
OTOH, if I hadn't, none of us would have known about that remarkable connection!
Awesome! We truly have Transgressed the Boundaries.
(And I'm curious... The way you said "Ben here" makes me wonder if I know you?)
(I think he was just introducing himself as Ben Spector, the lead author of the paper.)
D'oh! Thank you, and now I owe Ben an apology.
Only a matter of time until we start seeing bogus Hard Science papers like that, now that we've given the Social Text people the tools they need to take their revenge.
They will argue that we had it coming, and that it serves us right, and maybe they're not wrong.
Already been done: https://www.nationalgeographic.com/pages/article/131003-boha...
> please be warned that this really is research code; it is sensitive to compiler versions, GPU setup, and sometimes even being looked at the wrong way
the writeup is a classic example of what we lose through abstraction and how writing custom (and optimized) code still beats sticking to high-level implementations.
i would go further and say that the "megakernel" written as part of the optimization is highly-model dependent as well.
the whole "cuda moat" is from the generic implementations of the moving parts of the model architecture. at the same time, you lose a lot of performance through the generic code. it is like comparing writing a stock trading algo in next.js vs assembly.
training models is another landscape altogether, so props to those who can quickly adapt to the hardware they got.
Excellent writeup. I like the interpreter. But I can only assume all these ideas have been widely implemented at all significant labs for years, so I'm surprised to see this written in 2025. This is all about taking things to their logical conclusions, not arcane magic. If you're going to spend billions on GPUs, why wouldn't you spend a little on CUDA programmer hours?
> I can only assume all these ideas have been widely implemented at all significant labs for years, right?
Nope.
I was also surprised, when joining such a lab, at how much relatively low-hanging fruit was still available to work on. But the reality is that there is just too much work to do, each piece seemingly super important, and not enough people to do it.
This is what killed IBM PowerPC in the ML market. IBM tried to get in with a faster CPU with NVLink embedded, hoping that would win market share. But what won wasn't a faster machine or a better architecture. A platform with more developers, fewer bugs, and that everyone already knows wins almost all the time. ML/AI developers are less rare today, but still rare.
I'm very willing to believe that. When I hear that they just don't have enough staff for it, I get the impression that they set their hiring bar for engineers too high. Optimising CUDA is quite different from having experience training LLMs.
> they set their hiring bar for engineers too high
Not sure I agree. If you look at the headcount growth of companies like OpenAI, Anthropic, etc., it is super fast. It's already pretty hard to keep everything working smoothly with that rate of employee growth, so going faster than that seems very risky.
Ultimately I think it's mostly caused by the field still being so new. Everything still needs to be optimized, and there just aren't that many very good CUDA programmers to start with; then you need to find one who also has deep knowledge of ML and transformer architectures, which further drains the pool. And when you do find one of them, there are 50 different things they could be working on instead of what's in the article, all equally or more impactful. The constantly evolving architectures also make it hard, and not a great ROI, to go super deep on single-digit-percent optimizations when there is new stuff coming out all the time that can be made an order of magnitude faster.
A good example of that is flash attention: it is maybe the most significant/impactful optimization in ML of the last few years. Tl;dr: how do you fuse the entire attention pipeline together to make it much faster and avoid massive tensor materialization? The bottleneck was obvious to anyone who profiled a Transformer-based model, but there was no obvious solution because of how softmax works. Yet the paper that ultimately unblocked this was published back in 2019 [1], but it took 3 years for a team to connect the dots. Most people in pure ML engineering didn't know about the paper and don't have good enough CUDA knowledge / GPU arch understanding, most people with good CUDA knowledge don't understand ML well enough, and even the author of that paper said "[we] hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware" but didn't have the technical skills to test this, or to see how it could be part of a bigger breakthrough, because that requires understanding core concepts of how GPUs work and the compute/memory imbalance.
[1]: https://arxiv.org/pdf/1805.02867
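For anyone curious, here is a tiny NumPy sketch (my paraphrase, not code from the paper) of the online softmax normalizer that paper describes, the single-pass trick FlashAttention later builds on:

```python
# Sketch: online softmax normalizer (single pass, numerically stable).
# Instead of one pass for max(x) and a second pass for sum(exp(x - max)),
# the running sum is rescaled whenever a new maximum is encountered.
import numpy as np

def online_softmax(x):
    m = -np.inf          # running maximum
    s = 0.0              # running sum of exp(x_i - m)
    for xi in x:
        m_new = max(m, xi)
        s = s * np.exp(m - m_new) + np.exp(xi - m_new)  # rescale old sum, add new term
        m = m_new
    return np.exp(np.asarray(x) - m) / s

x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()   # two-pass reference softmax
assert np.allclose(online_softmax(x), ref)
```

The point isn't the arithmetic itself but that the single pass avoids re-reading the scores, which is exactly the kind of memory-access saving the comment above is describing.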
> I get the impression is that they set their hiring bar for engineers too high.
whenever anyone says this they should be required to disclose whether they've actually 1) been employed to do this work 2) how many LC rounds they've failed during their last job search ..... lol
> they set their hiring bar for engineers too high
You chase away your top engineers when you glom up the system with dumbfucks.
I thought I was going to see something crazy like using RT cores in parallel with tensor cores. Like compiling matmul into triangle intersections.
I don’t think datacenter GPUs have many of those.
It would be nice if the article header would actually be clear that they are optimizing a CUDA chip. There is a difference between a GPU and a CUDA chip.
You could do similar stuff on AMD chips.
I bought a car with side impact airbags, so we’re damn well going to use the side impact airbags.
Maybe… you don’t actually want or need to use all the features of something you bought. Particularly given that GPUs previously used for cryptocurrency mining may have damaged themselves while being run full out for a year straight.
Figure 1: Zoooommmm
Accept!
[flagged]
Kind of. We like to call them grad students.
We've finally achieved AGI* !
* Actually Graduate Individuals
Science is just Stochastic Graduate Descent, as we used to say.
AI = Academic Interns
They're cheaper!
Time to first token is 20 odd years and 100k of education debt
Yeah, but the previous "investors" ate up that cost and left pure profit for us "founders" (professors).
As long as they damn well use their whole brain.