I find the piece, much like a lot of other documentation, "imprecise". Like most such efforts, it likely caters to a group of people expected to benefit from having a GPU explained to them, but it fumbles its terms, e.g. (the first image with burned-in text):
> The "Warp Scheduler" is a SIMD vector unit like the TPU VPU with 32 lanes, called "CUDA Cores"
It's not clear from the above what a "CUDA core" (singular) _is_ -- this is the archetypical "let me explain things to you" error most people make, in good faith usually -- if I don't know the material, and I am out to understand, then you have gotten me to read all of it but without making clear the very objects of your explanation.
And so, because of these kinds of "compounding errors", the people the piece was likely targeted at are none the wiser really, while those who already have a good grasp of the concepts being explained, like what a CUDA core actually is, already know most of what the piece is trying to explain anyway.
My advice to everyone who starts out with a back-of-the-envelope cheatsheet and then decides to publish it "for the good of mankind", e.g. on GitHub: please be surgically precise with your terms -- the terms are your trading cards, then come the verbs, etc. I mean, this is all writing 101, but it's a rare thing, evidently. Don't mix and match terms, don't conflate them (the reader will do it for you many times over for free if you're sloppy), and be diligent with analogies.
Evidently, the piece may have been written to help those already familiar with TPU terminology -- it mentions "MXU" but there's no telling what that is.
I understand this is a tall order, but the piece is long, and all the effort that was put in could have been complemented with minimal extra hypertext, such as annotated abbreviations like "MXU".
I can always ask $AI to do the equivalent for me, which is a tragedy according to some.
Shamelessly responding as the author. I (mostly) agree with you here.
> please be surgically precise with your terms
There's always a tension between precision in every explanation and the "moral" truth. I can say "a SIMD (Single Instruction Multiple Data) vector unit like the TPU VPU with 32 ALUs (SIMD lanes) which NVIDIA calls CUDA Cores", which starts to get unwieldy and even then leaves terms like vector units undefined. I try to use footnotes liberally, but you have to believe the reader will click on them. Sidenotes are great, but hard to make work in HTML.
For terms like MXU, I was intending this to be a continuation of the previous several chapters which do define the term, but I agree it's maybe not reasonable to assume people will read each chapter.
There are other imprecisions here, like the term "Warp Scheduler" is itself overloaded to mean the scheduler, dispatch unit, and SIMD ALUs, which is kind of wrong but also morally true, since NVIDIA doesn't have a name for the combined unit. :shrug:
I agree with your points and will try to improve this more. It's just a hard set of compromises.
I appreciate your response. I made a point of not revising my comment after posting it, and then found in a subsequent paragraph the following, quoting:
> Each SM is broken up into 4 identical quadrants, which NVIDIA calls SM subpartitions, each containing a Tensor Core, 16k 32-bit registers, and a SIMD/SIMT vector arithmetic unit called a Warp Scheduler, whose lanes (ALUs) NVIDIA calls CUDA Cores.
And right after:
> CUDA Cores: each subpartition contains a set of ALUs called CUDA Cores that do SIMD/SIMT vector arithmetic.
So, in your defense and to my shame -- you *did* do better than I was able to infer at first glance. And I can take absolutely no issue with a piece elaborating on an originally "vague" sentence later on -- we need to read top to bottom, after all.
Much of the difficulty with laying out knowledge in the written word comes from inherent constraints, like having to defer detail to "further down" at the expense of giving the "bird's eye view" up front. I mean, there is a reason writing is hard, technical writing perhaps more so, in a way. You're doing much better than a lot of other stuff I've had to learn with, so I can only thank you for having done as much as you already have.
To be more constructive still, I agree the border between clarity and utility isn't always clearly drawn. But I think you can think of it as a service to your readers -- go with precision I say -- if you really presuppose the reader should know SIMD, chances are they are able to grok a new definition like "SIMD lane" if you define it _once_ and _well_. You don't need to be "unwieldy" in repetition -- the first time may be hard but you only need to do it once.
I am rambling. I do believe there are worse and better ways to impart knowledge of the kind in writing, but I too obviously don't have the answers, so my criticism was in part inconstructive, just a sheer outcry of mild frustration once I started conflating things from the get go but before I decided to give it a more thorough read.
One last thing though: I always like it when a follow-up article starts with a preamble along the lines of "In the previous part of the series...", so new visitors can simultaneously become aware there's prior knowledge that may be assumed _and_ navigate their way to the desired point in the series, all the way to the start perhaps. That frees you from e.g. wanting to annotate abbreviations in every part, if you want to avoid doing that.
Thank you for taking the time to write this reply. Agree with "in the previous part of this series" comment. I'll try to find a way to highlight this more.
What I'd like to add to this page is some sort of highly clear glossary that defines all the terms at the top (but in some kind of collapsible fashion) so I can define everything with full clarity without disrupting the flow. I'll play with the HTML and see what I can do.
1) Thank you for writing this.
2) What are your thoughts on links to the wiki articles under things such as "SIMD" or "ALUs" for the precise meaning while using the metaphors in your prose?
Most novices tend to Google and end up on Wikipedia for the trees. It's harder to find the forest.
I feel you handle this balance quite gracefully, to the point where I was impressed by your handling of the issue while reading, before I even checked the comments section. I don't know why it isn't clearer to the grandparent poster that something can be called one thing by marketing or documentation (names one must strategically accept and internalize) while being better described, fundamentally and functionally, in other language (which is more useful, so also needed). You want people to be aware of both and to explain both without dwelling or getting caught on it; it struck me as an artful choice.
I often put requirements at the top of the article:
> This article assumes you've read [this] and [this] and understand [this topic] and [this topic too]
I'm not sure that's helpful, and I don't put everything. Those links might also have further links saying you need X, Y, and Z. But at least there is a trail on where to start.
Nvidia’s marketing team uses confusing terminology to make their product sound cooler than it is.
An Intel “core” can perform AVX512 SIMD instructions that involve 16 lanes of 32-bit data. Intel cores are packaged in groups of up to 16. And, they use hyperthreading, speculative execution and shadow registers to cover latency.
An Nvidia “Streaming Multiprocessor” can perform SIMD instructions on 32 lanes of 32-bits each. Nvidia calls these lanes “cores” to make it feel like one GPU can compete with thousands of Intel CPUs.
Simpler terminology would be: an Nvidia H100 has 114 SM Cores, each with four 32-wide SIMD execution units (where basic instructions have a latency of 4 cycles) and each with four Tensor Cores. That's a lot more capability than a high-end Intel CPU, but not 14,592 times more.
The CUDA API presents a “CUDA Core” (single SIMD lane) as if it was a thread. But, for most purposes it is actually a single SIMD lane in the 32-wide “Warp”. Lots of caveats apply in the details though.
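To make the lane-vs-thread point concrete, here is a minimal CUDA sketch (my own illustration, not anything from the article): each API-level "thread" is just a lane position inside a 32-wide warp, and the headline 14,592 figure is nothing more than 114 x 4 x 32.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void whoAmI() {
        int lane = threadIdx.x % 32;  // position inside the 32-wide warp: one "CUDA core" lane
        int warp = threadIdx.x / 32;  // which 32-wide group this "thread" belongs to
        if (lane == 0)
            printf("block %d, warp %d speaking for its 32 lanes\n", blockIdx.x, warp);
    }

    int main() {
        printf("%d \"CUDA cores\"\n", 114 * 4 * 32);  // 114 SMs x 4 subpartitions x 32 lanes = 14,592
        whoAmI<<<2, 64>>>();                          // 2 blocks x 64 threads = 4 warps total
        cudaDeviceSynchronize();
        return 0;
    }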
It's all very circular, if you try to avoid the architecture-specific details of individual hardware designs. A SIMD "lane" is roughly equivalent to an ALU (arithmetic logic unit) in a conventional CPU design. Conceptually, it processes one primitive operation such as add, multiply, or FMA (fused multiply-add) at a time on scalar values.
Each such scalar operation is on a fixed-width primitive number, which is where we get into the questions of what numeric types the hardware supports. E.g. we used to worry about 32 vs 64 bit support in GPUs, and now the worry is about smaller widths. Some image processing tasks benefit from 8- or 16-bit values. Lately, people are dipping into heavily quantized models that can benefit from even narrower values. The narrower values mean a smaller memory footprint, but also generally mean that you can do more parallel operations with "similar" amounts of logic, since each ALU processes fewer bits.
Where this lane==ALU analogy stumbles is when you get into all the details about how these ALUs are ganged together or in fact repartitioned on the fly. E.g. a SIMD group of lanes share some control signals and are not truly independent computation streams. Different memory architectures and superscalar designs also blur the ability to count computational throughput, as the number of operations that can retire per cycle becomes very task-dependent due to memory or port contention inside these beasts.
And if a system can reconfigure the lane width, it may effectively change a wide ALU into N logically smaller ALUs that reuse most of the same gates. Or, it might redirect some tasks to a completely different set of narrower hardware lanes that are otherwise idle. The dynamic ALU splitting was the conventional story around desktop SIMD, but I think is less true in modern designs. AFAICT, modern designs seem more likely to have some dedicated chip regions that go idle when they are not processing specific widths.
> that can itself perform actual SIMD instructions?
Mostly, no; it can't really perform actual SIMD instructions itself. If you look at the SASS (the assembly language used on NVIDIA GPUs) I don't believe you'll see anything like that.
In high-level code, you do have expressions involving "vectorized types", which look like they would translate into SIMD instructions, but they 'serialize' at the single-thread level.
There are exceptions to this though, like FP16 operations which might work on 2xFP16 32-bit registers, and other cases. But that is not the rule.
The "video instructions" are indeed another exception: Operations on sub-lanes of 32-bit values: 2x16 or 4x8. This is relevant for graphics/video work, where you often have Red, Green, Blue, Alpha channels of 8 bits each. Their use is uncommon (AFAICT) in CUDA compute work.
Interestingly, I find LLMs are really good for this problem: when looking up one term just leads to more unknown terms and you struggle to find a starting point from which to understand the rest, they can tell you where to start.
I’m being earnest: what is an appropriate level of computer architecture knowledge? SIMD is 50 years old.
From the resource intro:
> Expected background: We’re going to assume you have a basic understanding of LLMs and the Transformer architecture but not necessarily how they operate at scale.
I suppose this doesn’t require any knowledge about how computers work, but core CPU functionality seems…reasonable?
SIMD is quite old, but the changes Nvidia made in order to call it SIMT, and which they used as an excuse to call their vector lanes "cores", are quite a bit newer.
My recursive brain got a chuckle out of wondering about "imprecise" being in quotes. I found the quotes made the meaning a touch...imprecise.
While I can understand the imprecise point, I found myself very impressed by the quality of the writing. I don't envy making digestible prose about the differences between GPUs and TPUs.
I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC historically this kind of thing has not been a great move.
The principles of parallel computing, and how they work at the hardware and driver levels, are broader. Some parts are provincial (a strong province, though...), and others are more general.
It's hard to find skills that don't have a degree of provincialism. It's not a great feeling, but you move on. IMO, don't over-idealize the concept of general knowledge to your detriment.
I think we can also untangle the open-source part from the general/provincial. There is more to the world worth exploring.
It really isn't that hard to pivot. It's worth saying that if you were already writing OpenMP and MPI code then learning CUDA wasn't particularly difficult to get started, and learning to write more performant CUDA code would also help you write faster CPU bound code. It's an evolution of existing models of compute, not a revolution.
I agree that “learning CUDA wasn’t particularly difficult to get started,” but there are Grand Canyon-sized chasms between CUDA and its alternatives when attempting to crank performance.
Well, I think to a degree that depends what you're targeting.
Single socket 8 core CPU? Yes.
If you spent some time playing with trying to eke out performance on Xeon Phi and have done NUMA-aware code for multi socket boards and optimising for the memory hierarchy of L1/L2/L3 then it really isn't that different.
Sure, but you can make money in the field and retire faster than it becomes irrelevant. FWIW none of the ideas here are novel or nontransferable–it's just the specific design that is proprietary. Understanding how to do an AllReduce has been of theoretical interest for decades and will probably remain worth doing far into the future.
I grew up learning programming on a genuine IBM PC running MS-DOS, neither of which was FOSS but taught me plenty that I routinely rely on today in one form or another.
There's more in common with other GPU architectures than there are differences, so a CUDA consultant should be able to pivot if/when the other players are a going concern. It's more about the mindset than the specifics.
I've been hearing that for over a decade. I can't even name, offhand, any CUDA competitors; none of them are likely to gain enough traction to upset CUDA in the coming decade.
ROCm is getting some adoption, especially as some of the world's largest public supercomputers have AMD GPUs.
Some of this is also being solved by working at a different abstraction layer; you can sometimes be ignorant to the hardware you're running on with PyTorch. It's still leaky, but it's something.
Look at the state of PyTorch’s CI pipelines and you’ll immediately see that ROCm is a nightmare. Especially nowadays when TPU and MPS, while missing features, rarely create cascading failures throughout the stack.
I still don't see ROCm as that serious a threat, they're still a long way behind in library support.
I used to use ROCFFT as an example, it was missing core functionality that cuFFT has had since like 2008. It looks like they've finally caught up now, but that's one library among many.
Talking about hardware rather than software, you have AMD and Intel. And - if your platform is not x86_64, NVIDIA is probably not even one of the competitors; and you have ARM, Qualcomm, Apple, Samsung and probably some others.
It's a valid point of view, but I don't see the value in sharing it.
There are enough people for whom it's worth it, even if just for tinkering, and I'm sure you are aware of that.
It reads a bit like "You shouldn't use it because..."
Learning about Nvidia GPUs will teach you a lot about other GPUs as well, and there are a lot of tutorials about the former, so why not use it if it interests you?
There are tons of ML compilers right now, FlashAttention brought back the cache-aware model to parallel programming, Moore's law hit its limit, and heterogeneous hardware is taking off.
Just some fundamentals I can think of off the top of my head. I'm surprised people are saying that the lower-level systems/hardware stuff is untransferable. These things are used everywhere. If anything, it's the AI itself that's potentially a bubble, but the fundamental need for understanding the performance of systems & design is always there.
I'm actually doing a ton of research in the area myself; the caution was against becoming an Nvidia expert narrowly, rather than a general low-level programmer with Nvidia skills included.
I mean, I'm in Toronto Canada, a fairly big city and market, and have an open seat for a couple of good senior Oracle DBAs pretty much constantly. The market may have reduced over decades but there's still more demand than supply. And the core DBA skills are transferable to other RDBMS as well. While I agree that some niche technologies are fleeting, it's perhaps not the best example :-)
That's actually interesting! My experience is different: especially compared to the late 90s and early 00s, most people avoid Oracle if they can. But yes, it's always worth having someone whose job is to think about the database if it's your linchpin.
Well, there's the difference. Maybe demand has collapsed for the kind of people who knew how to tune the Oracle SGA and get their laughable CLI client to behave, but the market for people who structurally understood the best ways to organize, insert and pull data back out is still solid.
Re Oracle and "big 90s names" specifically, there is a lot of it out there. Maybe it never shows up in the code interfaces HNers have to exercise in their day jobs, but the tech, for better or worse, is massively prevalent in the everyday world of transit systems and payroll and payment...ie all the unsexy parts of modern life.
I think it's famously said that 5% of IT is in the exciting new stuff that's on Hacker News front page, and 95% is in boring line-of-business, back office "enterprise" software that's as unglamorous as it is unavoidable :-). Even seemingly modern giants like Google or Amazon etc - check what their payroll and financial system is in the background.
And wait until I tell you about my Cobol open seats - on modern Linux on cloud VMs too! :-)
There are two CUDAs – a hardware architecture, and a software stack for it.
The software is proprietary, and easy to ignore if you don't plan to write low-level optimizations for NVIDIA.
However, the hardware architecture is worth knowing. All GPUs work roughly the same way (especially on the compute side), and the CUDA architecture is still fundamentally the same as it was in 2007 (just with more of everything).
It dictates how shader languages and GPU abstractions work, regardless of whether you're using proprietary or open implementations. It's very helpful to understand peculiarities of thread scheduling, warps, different levels of private/shared memory, etc. There's a ridiculous amount of computing power available if you can make your algorithms fit the execution model.
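As a rough sketch of that execution model (my example, not the article's), here is the classic block-level reduction: a block of threads cooperating through a __shared__ scratchpad private to the block, with barriers keeping its warps in step. It assumes a launch with 256 threads per block.

    __global__ void blockSum(const float* in, float* out, int n) {
        __shared__ float scratch[256];                 // on-chip memory shared by this block only
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        scratch[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                               // all warps in the block reach this point

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                scratch[threadIdx.x] += scratch[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = scratch[0];              // one partial sum per block
    }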
This is a JAX article, a parallel computation library that's meant to abstract away vendor specific details. Obviously if you want the most performance you need to know specifics of your hardware, but learning the high level of how a GPU vs TPU works seems like useful knowledge regardless.
Sounds good on paper but unfortunately I've had numerous issues with these "abstractors". For example, PyTorch had serious problems on Apple Silicon even though technically it should "just work" by hiding the implementation details.
In reality, what ends up happening is that some features in JAX, PyTorch, etc. are designed with CUDA in mind, and Apple Silicon is an afterthought.
I think I’d rather get familiar with CuPy or JAX or something. BLAS/LAPACK wrappers will never go out of style. It is a subset of the sort of stuff you can do on a GPU, but it seems like a nice effort-to-functionality ratio.
You can write software for the hardware in a cross-compiled language like Triton. The hardware reality stays the same, a company like Cerebras might have the superior architecture, but you have server rooms filled with H100, A100, and MI300s whether you believe in the hardware or not.
The calculation under “Quiz 2: GPU nodes“ is incorrect, to the best of my knowledge. There aren’t enough ports for each GPU and/or for each switch (less the crossbar connections) to fully realize the 450GB/s that’s theoretically possible, which is why 3.2TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it was 3.6TB/s, this would produce internode bottlenecks in any distributed ring workload.
Shamelessly: I’m open to work if anyone is hiring.
It's been a while since I thought about this, but isn't the reason providers advertise only 3.2Tbps that it's the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a ConnectX-7 NIC, and those cap out at 400Gbps. 8 GPUs * 400Gbps/GPU = 3.2Tbps.
Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
I believe this is correct. For an H100, the 4 NVLink switches each have 64 ports supporting 25GB/s each, and each GPU uses a total of 18 ports. This gives us 450GB/s bandwidth within the node. But once you start trying to leave the node, you're limited by the per-node InfiniBand cabling, which only gives you 400GB/s out of the entire node (50GB / GPU).
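Written out, the arithmetic from this sub-thread (using the figures quoted above; treat them as the thread's numbers rather than an authoritative spec):

    constexpr double nvlinkPerGpuGBs = 18 * 25.0;        // 18 NVLink ports x 25 GB/s = 450 GB/s inside the node
    constexpr double ibPerGpuGBs     = 400.0 / 8.0;      // one 400 Gbit/s ConnectX-7 NIC per GPU = 50 GB/s
    constexpr double ibPerNodeGBs    = 8 * ibPerGpuGBs;  // 8 GPUs x 50 GB/s = 400 GB/s (3.2 Tbit/s) leaving the node
    static_assert(nvlinkPerGpuGBs == 450.0 && ibPerNodeGBs == 400.0, "matches the figures quoted above");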
We should remember that these structural diagrams are _not_ necessarily what NVIDIA actually has as hardware. They carefully avoid guaranteeing that any of the entities or blocks you see in the diagrams actually _exist_. It is still just a mental model NVIDIA offers for us to think about their GPUs, and more specifically the SMs, rather than a simplified circuit layout.
For example, we don't know how many actual functional units an SM has; we don't know if the "tensor core" even _exists_ as a piece of hardware, or whether there's just some kind of orchestration of other functional units; and IIRC we don't know what exactly happens at the sub-warp level w.r.t. issuing and such.
Interesting perspective. Aren't SMs basically blocked while running tensor core operations, which might hint that it's the same FPUs doing the work after all?
I doubt that can fully be the case, because there are other functional units on SMs, like Load/Store, ALU / Integer ops, and Special Function Units. But you may be right, we would need to consult the academic "investigatory" papers or blog posts and see whether this has been checked.
This whole series is fantastic! Does an excellent job of explaining the theoretical limits to running modern AI workloads and explains the architecture and techniques (in particular methods of parallelism) you can use.
Yes it's all TPU focussed (other than this most recent part) but a lot of what it discusses are generally principles you can apply elsewhere (or easy enough to see how you could generalise them).
This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity), but all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if TPUs were available, it would open the door to compute-in-network capabilities far beyond what's currently available, by combining non-homogeneous topologies involving various FPGA solutions, e.g. the Alveo V80 exposing 4x800G NICs.
Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.
Do TPUs allow having a variable array dimension at somewhat inner nesting level of the loop structure yet?
Like, where you load expensive (bandwidth-heavy) data in from HBM, process a variable-length array with this, then stow away/accumulate into a fixed-size vector?
Last I looked they would require the host to synthesize a suitable instruction stream for this on-the-fly with no existing tooling to do so efficiently.
An example where this would be relevant would be the LLM inference prefill stage with an (activated) MoE expert count on the order of -- down to a small integer smaller than -- the prompt length, where you'd want to load only the needed experts and load each one at most once per layer.
If you have optimized your math-heavy code, it is already in a typed language, and you need it to be faster, then you think about the GPU options.
In my experience you can roughly get an 8x speed improvement.
Turning a 4-second web response into half a second can be game-changing. But it is a lot easier to use a web socket and put up a spinner, or cache the result in the background.
This is part 12 … the title seems to hint at how one should think about GPUs today, e.g. why LLMs come about. Instead it is a comparison with TPUs? And then I noticed the "part 12" … not sure what one should expect jumping into the middle of a whole series … well, I may stop and move on.
It’s mind-boggling that NVIDIA hasn’t provided this resource itself. It has reached the point where third parties reverse-engineer and summarize NV hardware until it becomes an actually useful mental model.
What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.
As a real time rendering engineer, this is how it’s always been. NV obfuscates much of the info to prevent competitors from understanding changes between generations. Other vendors aren’t great at this either.
In games, you can get NDA disclosures about architectural details that are closer to those docs. But I’ve never really seen any vendor (besides Intel) disclose this stuff publicly
With mediocre documentation, NVIDIAs closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse engineer.
Plenty of circumstantial evidence points to NVIDIA preferring to hand out semi-tailored documentation resources to signatories and other "VIPs", not least to exert control over who uses their products and how. I wouldn't put it past them to routinely neglect their _public_ documentation, for one reason or another that makes commercial sense to them but not the public. As for incentives, go figure indeed -- you'd think that by walling off API documentation they're shooting themselves in the foot every day, but in these days of betting it all on AI, which means selling GPUs, software and those same NDA-gated VIP documentation articles to "partners", maybe they're all set anyway and care even less for the odd developer who wants to know how their flagship GPU works.
What makes you think that? It appears most of this material came straight out of NVIDIA documentation. What do you think is missing? I just checked and found the H100 diagram for example is copied (without being correctly attributed) from the H100 whitepaper: https://resources.nvidia.com/en-us-hopper-architecture/nvidi...
Much of the info on compute and bandwidth is from that and other architecture whitepapers, as well as the CUDA C++ programming guide, which covers a lot of what this article shares, in particular chapters 5, 6, and 7. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
There’s plenty of value in third parties distilling this into short-form versions and writing their own takes on it, but this article wouldn’t have been possible without NVIDIA’s docs, so the speculation, FUD and shade are perhaps unjustified.
Meaning what? Something less flexible? Less CUDA cores and more Tensor Cores?
The majority of NVidia's profits (almost 90%) do come from data center, most of which is going to be neural net acceleration, and I'd have to assume that they have optimized their data center products to maximize performance for typical customer workloads.
I'm sure that Microsoft would provide feedback to Nvidia if they felt changes were needed to better compete with Google in the cloud compute market.
I've got to assume so, since data center revenue seems to have grown in sync with recent growth in AI adoption. CUDA has been around for a long time, so it would seem highly coincidental if non-AI CUDA usage were only just now surging at the same time as AI usage is taking off, and new data center build announcements seem to invariably be linked to AI.
That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you have to always be in sync.
Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1].
This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.
    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
    Z;
The diagram shows how this executes in the following order:
IIUC Volta brought the ability to run a tail-call state machine (let's presume identically expensive states, and a state count less than threads-per-warp) at an average goodput of more than one thread actually active.
Before, it would lose all parallelism, as it couldn't handle different threads having truly different/separate control flow, emulating a dumb mode via predicated execution/lane-masking.
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.
"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").
SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.
What seems to have been added since Volta is some mechanism for fast saving and restoring separate program counters for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order, but those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot execute simultaneously instructions from different CUDA "threads", unless they perform the same operation, which is the same constraint that exists in any SIMD processor.
Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".
What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However dependencies between the CUDA "threads" of a "warp" shall be avoided whenever possible, because they reduce the achievable performance.
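A small sketch of that masking behaviour (an illustration only; the mask values in the comments are what you would typically observe, not something NVIDIA guarantees, since post-Volta scheduling may split the warp further): __activemask() reports which lanes are active on the current path, and __syncwarp() is the explicit reconvergence point.

    __global__ void divergenceDemo(unsigned int* out) {
        int lane = threadIdx.x % 32;
        unsigned int active;
        if (lane < 4) {
            active = __activemask();   // typically 0x0000000F: only 4 lanes on this path
        } else {
            active = __activemask();   // typically 0xFFFFFFF0: the other 28 lanes
        }
        __syncwarp();                  // explicit reconvergence point (Volta and later)
        out[threadIdx.x] = active;     // the statement after the branch runs on all 32 lanes again
    }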
> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.
For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
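As a rough picture of what such a compiler does with the if/else example from earlier in the thread -- plain scalar C++ loops standing in for masked vector instructions, an illustration of the idea rather than ispc's actual output:

    constexpr int LANES = 32;   // one "warp" worth of SIMT threads mapped onto SIMD lanes

    void warpIfElse(float x[LANES], float y[LANES]) {
        bool mask[LANES];
        for (int l = 0; l < LANES; ++l) mask[l] = (l < 4);              // if (threadIdx.x < 4)

        for (int l = 0; l < LANES; ++l) if (mask[l])  y[l] = x[l] + 1;  // A; B; (taken lanes only)
        for (int l = 0; l < LANES; ++l) if (!mask[l]) y[l] = x[l] * 2;  // X; Y; (the other lanes)
        for (int l = 0; l < LANES; ++l)               y[l] += 3;        // Z; (all lanes, "reconverged")
    }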
> It's not clear from the above what a "CUDA core" (singular) _is_
A CUDA core is basically a SIMD lane on an actual core on an NVIDIA GPU.
For a longer version of this answer: https://stackoverflow.com/a/48130362/1593077
So it's a "SIMD lane" that can itself perform actual SIMD instructions?
I think you want a metaphor that doesn't also depend on its literal meaning.
I guess “GPUs for people who are already CPU experts” is a blog post that already exists out there. But if it doesn’t, you should go write it, haha.
This is not true. GPUs are SIMT, but any given thread in those 32 in a warp can also issue SIMD instructions; see vector loads.
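For example, a minimal sketch of such a per-thread vector load (float4 here), where each thread moves 128 bits with a single load and store:

    __global__ void copy4(const float4* __restrict__ in, float4* __restrict__ out, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            out[i] = in[i];   // typically compiles to one 128-bit load/store per thread
    }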
Please see https://docs.nvidia.com/cuda/parallel-thread-execution/index....
The "video instructions" are indeed another exception: Operations on sub-lanes of 32-bit values: 2x16 or 4x8. This is relevant for graphics/video work, where you often have Red, Green, Blue, Alpha channels of 8 bits each. Their use is uncommon (AFAICT) in CUDA compute work.
not true; there are a lot of simd instructions on GPUs
Such as?
dp4a, ldg. just Google it. there's a whole page of them
Nvidia calls their SIMD lanes “CUDA cores” for marketing reasons.
https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
should have most of it
This is a chapter in a book targeting people working in the machine learning domain.
Yeah that was what I told myself a decade ago when I skipped CUDA class during college time.
Tech is always like this.
You move from one thing to the next.
With your transferable skills, experience and thinking that is beyond one programming language.
Even Apple is simply exporting to CUDA now.
> Even Apple is simply exporting to CUDA now.
Really!!! Any resources you can share?
It’s one way but still something.
https://9to5mac.com/2025/07/15/apples-machine-learning-frame...
> Even Apple is simply exporting to CUDA now.
This is like when journalists write clickbait article titles by omitting all qualifiers (eg "states banning fluoride" when it's only some states).
One framework added a CUDA backend. You think all of Apple uses only one framework? Further what makes you think this even gets internal use?
I didn't say any of those things at all.
Only that Apple not only might use CUDA internally but made a public release available too.
CUDA seems to be a trigger word in this thread for some.
https://9to5mac.com/2025/07/15/apples-machine-learning-frame...
> I didn't say any of those things at all.
what does this sentence mean?
> Apple is simply exporting to CUDA now.
My takeaway was definitely not that all of Apple is using only one framework.
then please enlighten me: what does the sentence mean?
Can you please stop posting in the cross-examining, flamewar style? It's not what this site is for. We want people to learn from each other here.
https://news.ycombinator.com/newsguidelines.html
what exactly is the acceptable style of discourse here then? purely exultant tropes and clickbait i guess.
Does https://news.ycombinator.com/newsguidelines.html not answer that?
There are plenty of acceptable styles. The guidelines don't insist on only one style.
Only in Silicon Valley. But if you can, definitely do.
Very true.
Stories of exploring DOS often ended up at hex editing and assembly.
Best to learn with whatever options are accessible, plenty is transferable.
While there is an embarrassment of options to learn from or with today, the greatest gaffe would be to overlook learning altogether.
Hence the "if" :-)
...Well, the article compares GPUs to tpus, made by a competitor you probably know the name of...
It's a useful bit of caution to remember transferable fundamentals; I remember when Oracle wizards were in high demand.
It's money. You would do it for money.
Generally, earning an honest living seems to be a requirement of this world that individual words and beliefs won’t change.
Work keeps us humble enough to be open to learn.
What's in this article would apply to most other hardware, just with slightly different constants
Nvidia also trotted along with a low share price for a long time financing and supporting what they believed in.
When CUDA rose to prominence, were there any viable alternatives?
> I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors.
Better not learn CUDA then.
I mean it is similar to investing time in learning assembly language.
For most IT folks it doesn't make much sense.
The calculation under “Quiz 2: GPU nodes“ is incorrect, to the best of my knowledge. There aren’t enough ports for each GPU and/or for each switch (less the crossbar connections) to fully realize the 450GB/s that’s theoretically possible, which is why 3.2TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it was 3.6TB/s, this would produce internode bottlenecks in any distributed ring workload.
Shamelessly: I’m open to work if anyone is hiring.
It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps that that's the limit of a single node's connection to the IB network? DGX is spec'd to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs * 400 Gbps per GPU = 3.2 Tbps.
Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
Yes, 450 GB/s is the per-GPU bandwidth in the NVLink domain. 3.2 Tbps is the per-host bandwidth in the scale-out IB/Ethernet domain.
I believe this is correct. For an H100 node, the 4 NVLink switches each have 64 ports supporting 25 GB/s each, and each GPU uses a total of 18 ports. This gives us 450 GB/s of bandwidth within the node. But once you start trying to leave the node, you're limited by the per-node InfiniBand cabling, which only gives you 400 GB/s out of the entire node (50 GB/s per GPU).
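Since the units keep getting mixed up in this subthread, here's a quick back-of-envelope check (plain host-side C++, compiles fine under nvcc as well; the 18 x 25 GB/s and 8 x 400 Gbit/s figures are the ones quoted above, not anything I've measured):

    #include <cstdio>

    int main() {
        // Intra-node: NVLink 4 on an H100
        const double nvlink_gbytes_per_port = 25.0;   // GB/s per NVLink port
        const int    nvlink_ports_per_gpu   = 18;
        printf("NVLink per GPU: %.0f GB/s\n",
               nvlink_gbytes_per_port * nvlink_ports_per_gpu);          // 450 GB/s

        // Inter-node: one 400 Gbit/s ConnectX-7 NIC per GPU, 8 GPUs per node
        const double nic_gbits     = 400.0;           // Gbit/s per NIC
        const int    gpus_per_node = 8;
        printf("Scale-out per GPU:  %.0f GB/s\n", nic_gbits / 8.0);     // 50 GB/s
        printf("Scale-out per node: %.1f Tbit/s = %.0f GB/s\n",
               nic_gbits * gpus_per_node / 1000.0,                      // 3.2 Tbit/s
               nic_gbits * gpus_per_node / 8.0);                        // 400 GB/s
        return 0;
    }

So 450 GB/s is a per-GPU, intra-node (bytes) number, while 3.2 Tbps is a per-node, inter-node (bits) number; they're not in conflict.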
Is it GBps (gigabytes per second) or Gbps (gigabits per second)? I see mixed usage in this comment thread, so I'm left wondering what it actually is.
The article is consistent and uses Gigabytes.
GBps
We should remember that these structural diagrams are _not_ necessarily what NVIDIA actually has as hardware. They carefully avoid guaranteeing that any of the entities or blocks you see in the diagrams actually _exist_. It is still just a mental model NVIDIA offers for us to think about their GPUs, and more specifically the SMs, rather than a simplified circuit layout.
For example, we don't know how many actual functional units an SM has; we don't know if the "tensor core" even _exists_ as a piece of hardware, or whether there's just some kind of orchestration of other functional units; and IIRC we don't know what exactly happens at the sub-warp level w.r.t. issuing and such.
Interesting perspective. Aren't SMs basically blocked while running tensor core operations, which might hint that it's the same FPUs doing the work after all?
I doubt that can fully be the case, because there are other functional units on SMs, like load/store units, ALU/integer ops, and Special Function Units. But you may be right; we would need to consult the academic "investigatory" papers or blog posts and see whether this has been checked.
This whole series is fantastic! It does an excellent job of explaining the theoretical limits on running modern AI workloads, and it explains the architecture and techniques (in particular, the methods of parallelism) you can use.
Yes, it's all TPU-focused (other than this most recent part), but a lot of what it discusses are general principles you can apply elsewhere (or it's easy enough to see how you could generalise them).
This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity), but all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if TPUs were available, they would open the door to compute-in-network capabilities far beyond what's currently possible, by combining non-homogeneous topologies involving various FPGA solutions, e.g. the Alveo V80 exposing 4x800G NICs.
Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.
Do TPUs allow having a variable array dimension at a somewhat inner nesting level of the loop structure yet? Like, where you load expensive (bandwidth-heavy) data in from HBM, process a variable-length array with it, then stow away/accumulate into a fixed-size vector?
Last I looked, they would require the host to synthesize a suitable instruction stream for this on the fly, with no existing tooling to do so efficiently.
An example where this would be relevant is the LLM inference prefill stage with an (activated) MoE expert count on the order of, or a small integer factor smaller than, the prompt length, where you'd want to load only the needed experts and load each one at most once per layer.
If you have optimized your math-heavy code, it's already in a typed language, and you need it to be faster, then you think about the GPU options.
In my experience you can roughly get an 8x speed improvement.
Turning a 4-second web response into half a second can be game-changing. But it's a lot easier to use a web socket and put up a spinner, or cache the result in the background.
Running a GPU in the cloud is expensive.
It’s interesting that NVSHMEM has taken off in ML, because the MPI equivalents were never that satisfactory in the simulation world.
Mind you, I did all long-range force stuff, which is difficult to work with over multiple nodes at the best of times.
This is part 12 … the title seems to hint at how one should think about GPUs today, e.g. how LLMs came about. Instead it's a comparison with TPUs? And then I noticed the "part 12" … I'm not sure what to expect when jumping into the middle of a whole series, so I may well stop and move on.
It’s mind-boggling that this resource has not been provided by NVIDIA yet. It has reached the point where third parties reverse engineer and summarize NV hardware until it becomes an actually useful mental model.
What are the actual incentives at NVIDIA? If it's all about marketing, they're doing great, but I have some doubts about the engineering culture.
As a real time rendering engineer, this is how it’s always been. NV obfuscates much of the info to prevent competitors from understanding changes between generations. Other vendors aren’t great at this either.
In games, you can get NDA disclosures about architectural details that are closer to those docs. But I've never really seen any vendor (besides Intel) disclose this stuff publicly.
With mediocre documentation, NVIDIA's closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse engineer.
There's plenty of circumstantial evidence pointing to the fact that NVIDIA prefers to hand out semi-tailored documentation resources to signatories and other "VIPs", not least to exert control over who uses their products and how. I wouldn't put it past them to routinely neglect their _public_ documentation, for one reason or another that makes commercial sense to them but not to the public. As for incentives, go figure indeed -- you'd think that by walling off API documentation they're shooting themselves in the foot every day, but in these days of betting it all on AI, which means selling GPUs, software, and those same NDA'd VIP documentation articles to "partners", maybe they're all set anyway and care even less for the odd developer who wants to know how their flagship GPU works.
Nvidia has ridiculously good documentation for all of this compared to its competitors.
What makes you think that? It appears most of this material came straight out of NVIDIA's documentation. What do you think is missing? I just checked and found that the H100 diagram, for example, is copied (without being correctly attributed) from the H100 whitepaper: https://resources.nvidia.com/en-us-hopper-architecture/nvidi...
Much of the info on compute and bandwidth is from that and other architecture whitepapers, as well as the CUDA C++ programming guide, which covers a lot of what this article shares, in particular chapters 5, 6, and 7. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
There’s plenty of value in third parties distilling this into short-form versions and writing their own takes on it, but this article wouldn't have been possible without NVIDIA's docs, so the speculation, FUD, and shade are perhaps unjustified.
Why hasn't Nvidia developed a TPU yet?
This article suggests they sort of did: 90% of the FLOPs are in the matrix multiplication units.
They leave some performance on the table, but they gain flexible compilers.
They don't need to. Their hardware and programming model are already dominant, and TPUs are harder to program for.
Meaning what? Something less flexible? Fewer CUDA cores and more Tensor Cores?
The majority of NVIDIA's profits (almost 90%) do come from the data center segment, most of which is going to neural-net acceleration, and I'd have to assume that they have optimized their data center products to maximize performance for typical customer workloads.
I'm sure that Microsoft would provide feedback to Nvidia if they felt changes were needed to better compete with Google in the cloud compute market.
> most of which is going to be neural net acceleration
is it?
I've got to assume so, since data center revenue seems to have grown in sync with the recent growth in AI adoption. CUDA has been around for a long time, so it would seem highly coincidental if non-AI CUDA usage were only just now surging at the same time as AI usage is taking off, and new data center build announcements seem to invariably be linked to AI.
Fantastic resource! Thanks for posting it here.
A short addition: pre-Volta NVIDIA GPUs were SIMD, like TPUs are, and not SIMT, which post-Volta NVIDIA GPUs are.
SIMT is just a programming model for SIMD.
Modern GPUs are still just SIMD with good predication support at the ISA level.
That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you have to always be in sync.
I'm not aware of any GPU that implements this.
Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once. The diagram in [1] shows the order in which this executes under the Volta model compared to pre-Volta, where pre-Volta is the SIMD equivalent of predicated execution.
[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...
[2] https://stackoverflow.com/questions/70987051/independent-thr...
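To make the distinction concrete, here's a tiny divergent kernel (my own illustration, not from the article or the linked posts). As discussed above, the two branch paths are executed with complementary lane masks; post-Volta they may be interleaved at finer granularity, but a single warp still never issues both operations simultaneously:

    // Launch with at least one warp (32 threads). Lanes 0-15 take the 'if'
    // path, lanes 16-31 take the 'else' path, so the warp diverges.
    __global__ void divergent(float* x) {
        int lane = threadIdx.x % 32;      // lane index within the warp
        if (lane < 16)
            x[threadIdx.x] *= 2.0f;       // executed with lanes 0-15 active
        else
            x[threadIdx.x] += 1.0f;       // executed with lanes 16-31 active
    }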
IIUC Volta brought the ability to run a tail-call state machine (let's presume identically expensive states and a state count less than threads-per-warp) at an average goodput of more than one active thread.
Before, it would lose all parallelism, as it couldn't handle different threads having truly different/separate control flow, instead emulating it via predicated execution/lane masking.
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.
"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").
SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.
What seems to have been added since Volta is some mechanism for fast saving and restoring separate program counters for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order, but those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot execute simultaneously instructions from different CUDA "threads", unless they perform the same operation, which is the same constraint that exists in any SIMD processor.
Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".
What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However dependencies between the CUDA "threads" of a "warp" shall be avoided whenever possible, because they reduce the achievable performance.
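To make that concrete, here's a minimal sketch of the kind of intra-warp dependency being described (my own example, loosely modeled on the starvation-free patterns NVIDIA describes for Volta; whether it actually hangs pre-Volta depends on which side of the branch the warp happens to execute first):

    // Launch with a single warp. Lane 0 spins until lane 1 publishes a flag.
    // Pre-Volta, the warp has one shared program counter: if the lane-0 path is
    // scheduled first, its spin loop can starve lane 1 and the kernel may hang.
    // With Volta's independent thread scheduling, the scheduler can interleave
    // the two lanes, so the handoff eventually completes -- correct, but still
    // slower than avoiding the intra-warp dependency altogether.
    __global__ void warp_handoff(volatile int* flag, int* out) {
        int lane = threadIdx.x % 32;
        if (lane == 0) {
            while (*flag == 0) { /* spin: wait for lane 1 */ }
            *out = 42;
        } else if (lane == 1) {
            *flag = 1;            // value published by another lane of the same warp
        }
    }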
This paper ventures some guesses at how Nvidia does this, and runs experiments to confirm them: https://arxiv.org/abs/2407.02944
"threads"
I was referring to this portion of TFA:
> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.
For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
Thanks, I will look into that.
However, I'm still confused about the original statement. What I had thought was that pre-Volta, each thread in a warp has to execute in lockstep, while post-Volta they can all execute different instructions.
Obviously this is a surface-level understanding. How do I reconcile it with what you wrote in the other comment and this one?
Thanks for the really thorough research on that. Just what I wanted for my morning coffee.
"How to Think About NVIDIA GPUS" is a better title
Discussion of original series:
How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)
A comment from there:
> There are plans to release a PDF version; need to fix some formatting issues + convert the animated diagrams into static images.
I don't see anything on the page about it, has there been an update on this? I'd love to put this on my e-reader.
So, why hasn't Nvidia developed a TPU yet?
Probably proprietary. Go GOOG. I like how bad your comment is.