Blackwell: Nvidia's GPU

108 points by pella 3 days ago

ggreg84 2 days ago

Chips and Cheese's GPU analyses are pretty detailed, but they need to be taken with a huge grain of salt because the results only really apply to OpenCL, and nobody buying NVIDIA or AMD GPUs for compute runs OpenCL on them; it's either CUDA or HIP, which differ widely in parts of their compilation stack.

After reading the entire analysis, I'm left wondering, what observations in this analysis - if any - actually apply to CUDA?

nromiun 2 days ago

For benchmarking code like this CUDA, HIP and OpenCL are almost the same. You will only see the difference in big codebases, where you launch multiple kernels and move data between them.
Otherwise OpenCL is very good as well, with the added benefit of running on all GPUs.
almostgotcaught 2 days ago

> it's either CUDA or HIP, which differ widely in parts of their compilation stack.
This is an ironic comment - OpenCL uses the same compiler as CUDA on NVIDIA and HIP on AMD.
- JonChesterfield 2 days ago
  
  Sort of. Same compiler backend, mostly, but the set of intrinsics and semantic rules are different.
  
  almostgotcaught 2 days ago
  
  i have no idea what your point is - same compiler, different frontend, yes that's literally what i said.

CalChris 2 days ago

The Nvidia technical brief says 208 billion transistors.

https://resources.nvidia.com/en-us-blackwell-architecture

Blackwell uses the TSMC 4NP process. It has two layers. A very back of the envelope estimate:

  750 mm^2 / ((208/2) * 10^9) = 7211 nm^2 per transistor
  sqrt(7211 nm^2) ~ 85 nm, i.e. about 85 nm x 85 nm

NB: process feature size does not equal transistor size. Process feature size doesn't even equal process feature size.
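
For anyone who wants to check the unit conversion, here is a minimal Python sketch of the same back-of-the-envelope estimate. The die area and the halved transistor count are taken straight from the figures above; the function and constant names are just illustrative, and the result is an average footprint, not a real transistor dimension.

```python
import math

# Inputs taken from the comment above; treat them as rough figures.
DIE_AREA_MM2 = 750.0          # die area in mm^2
TRANSISTORS = 208e9 / 2       # 208 billion, halved as in the estimate above

NM2_PER_MM2 = 1e12            # 1 mm = 1e6 nm, so 1 mm^2 = 1e12 nm^2


def area_per_transistor_nm2(die_area_mm2: float, transistors: float) -> float:
    """Average die area available per transistor, in nm^2."""
    return die_area_mm2 * NM2_PER_MM2 / transistors


area = area_per_transistor_nm2(DIE_AREA_MM2, TRANSISTORS)
side = math.sqrt(area)  # side of an equivalent square footprint
print(f"{area:.1f} nm^2 per transistor, about {side:.0f} nm x {side:.0f} nm")
```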

gchadwick 2 days ago

> It has two layers
Where did you get that from? Pretty sure it's a single planar set of transistors. Those transistors are manufactured using multiple layers of mask.
FinFET transistors are described as 3D or non-planar, but crucially this doesn't allow transistor-on-transistor stacking; you've just got the gate structure of the FinFET poking out above the plane of the rest of the transistors.
Silicon on silicon die stacking is a possibility but limits your power and GPUs run very hot so it's not an option for them.
- murderfs 2 days ago
  
  GPUs are not particularly hot for compute silicon; they just have ridiculously huge dies. Comparing the 5090 to a Core Ultra 285K, the GPU has a 750 mm^2 die compared to the CPU's 243 mm^2, but has a peak power of 575W compared to 250W. That works out to roughly 0.77 W/mm^2 for the GPU and 1.03 W/mm^2 for the CPU, so the CPU uses about 34% more power per area, and that's before considering the fact that consumer CPUs are packaged for user installation, so there's an extra heatspreader on top of the die, whereas GPUs are sold as integrated units, so the heatsink sits directly on top of the die.
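
A quick way to sanity-check the power-per-area comparison is to compute W/mm^2 directly. This is only a sketch using the peak power and die area figures quoted in the comment above; the dictionary layout and function name are illustrative, and peak package power is a crude proxy for what the die actually dissipates.

```python
# Figures quoted in the comment above (peak power in W, die area in mm^2).
chips = {
    "RTX 5090": {"power_w": 575.0, "area_mm2": 750.0},
    "Core Ultra 285K": {"power_w": 250.0, "area_mm2": 243.0},
}


def power_density(power_w: float, area_mm2: float) -> float:
    """Peak power per unit die area, in W/mm^2."""
    return power_w / area_mm2


density = {name: power_density(**spec) for name, spec in chips.items()}
for name, value in density.items():
    print(f"{name}: {value:.2f} W/mm^2")

# How much denser the CPU's peak power is relative to the GPU's.
ratio = density["Core Ultra 285K"] / density["RTX 5090"]
print(f"CPU/GPU power density ratio: {ratio:.2f}")
```

With these inputs the CPU comes out at roughly 1.03 W/mm^2 against about 0.77 W/mm^2 for the GPU.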
  
  kvemkon 2 days ago
  
  > consumer CPUs are packaged for user installation
  I'd say advanced users or skilled staff.
  20+ years ago, e.g., the Athlon XP had a small CPU die in the middle and 4 round spacers in the corners for a proper heatspreader installation. And that was despite the fact that the CPU wouldn't clock down and would go up in flames if the cooler was removed during operation.
  Nowadays, with safer CPUs that monitor their own temperature, one has to take a risk to remove the heatspreader and replace it with "special" direct-die cooling, gaining either a bit more performance, 15-20 degrees lower temperatures, or a smaller or quieter cooler. One is free to choose.
  Sure, even an advanced user must take more care working around the naked die. But the technology to make this safer than before could have matured by now.
  
  adrian_b a day ago
  
  20 years ago, neither the Pentium 4 nor the Athlon would go up in flames without a cooler, because they included thermal protection circuits that stop the clock above a certain temperature threshold, which cannot be modified by users. The halting of the clock is invisible to software, except that frequent and/or long clock halts lengthen the execution time of a long task that spans the intervals when the clock was stopped.
  The latest CPUs still include such a thermal protection, which is independent of the turbo clock frequency management, which is based on the power and current limits, which can be modified by the user, e.g. for overclocking.
  Older CPUs, until about the middle of the nineties, could be destroyed by inadequate cooling. This has forced both Intel and AMD to implement thermal protections, to avoid requests for CPU replacements.
  
  kvemkon a day ago
  
  Perhaps I wasn't precise enough. I was remembering what was demonstrated in the famous video [1] for the Athlon 1400 and 1200. Yes, they went up not in flames but just in smoke. Back then the user was allowed to install a cooler directly onto the die even though the AMD CPU had no thermal protection (just a thermal diode for an emergency power-off of the whole system by the BIOS?).
  I have no confirmation for the following, but I've assumed that the pre-installed, pretty massive heatspreader plate, which is hard and risky to remove, works not only as physical protection for the die but also as a minimal buffer for heat dissipation in case of rapid cooler removal.
  [1] "What happens when the CPU cooler is removed?" (Tom's Hardware Guide, 2001)
  https://www.youtube.com/watch?v=y39D4529FM4
  https://www.youtube.com/watch?v=UoXRHexGIok
  P.S. Another, rather philosophical question I've had ever since a fan (not just some heatspreader) was first required for a CPU to operate as advertised: can one consider CPUs to have been factory overclocked since then? Even if not to the absolute limit, e.g. the Celeron 333A (@66) could be pushed manually up to 500 (@100).
dist-epoch 2 days ago

You also need space for wires, ..., etc, right? It's not just transistors.
- gchadwick 2 days ago
  
  The wires sit on top of the transistors. Many layers of them in a modern process.
  However you can't always pack the transistors as dense as you would like because you can't fit the wiring for them in above at the same density.
  Plus there are various 'design rules' that constrain how things get placed. These are needed to ensure manufacturing is successful and achieves good yield. An important set of rules is the 'antenna rules', which require the insertion of antenna diodes (which use silicon area and so reduce transistor density) to prevent circuitry being destroyed during manufacturing: https://www.zerotoasiccourse.com/terminology/antenna-report/
- CalChris 2 days ago
  
  The wires didn't fit on the back of the envelope.
  
  a_wild_dandan 2 days ago
  
  I love this retort and I'm stealing it.
- amelius 2 days ago
  
  The wires run over the transistors.
bgnn 2 days ago

This assumes 100% utilization. Realistically, the utilization (active device area relative to total die area) is 70-75% at best.
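
To see how much this changes the estimate upthread, here is a small Python sketch that scales the die area by a utilization factor before dividing by the transistor count. The 750 mm^2 and 208/2 billion inputs are the ones used upthread, the 70-75% range is the one given here, and the function name is just illustrative.

```python
import math

DIE_AREA_MM2 = 750.0      # die area used in the estimate upthread
TRANSISTORS = 208e9 / 2   # transistor count used in the estimate upthread
NM2_PER_MM2 = 1e12        # 1 mm^2 = 1e12 nm^2


def footprint_side_nm(utilization: float) -> float:
    """Side of the square footprint per transistor at a given utilization."""
    active_area_nm2 = DIE_AREA_MM2 * NM2_PER_MM2 * utilization
    return math.sqrt(active_area_nm2 / TRANSISTORS)


for utilization in (1.00, 0.75, 0.70):
    side = footprint_side_nm(utilization)
    print(f"{utilization:.0%} utilization: ~{side:.1f} nm per side")
```

Because the side length scales with the square root of the area, a 25-30% cut in usable area only shrinks the per-transistor footprint from about 85 nm to roughly 71-74 nm per side.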

ksec 2 days ago

I heard there is still trouble buying consumer-grade Nvidia GPUs. At this point I am wondering if it is gaming market demand, AI, or simply a supply issue.

On another note, I am waiting for Nvidia's entry into CPUs. At some point down the line I expect the CPU will be less important (relatively speaking), and Nvidia could afford to throw a CPU into the system as a bonus. Especially when we are expecting ARM X930 to rival Apple's M4 in terms of IPC. CPU design has become somewhat of a commodity.

Incipient 2 days ago

My understanding is it's the AI demand and willingness to pay crazy money for wafer that makes consumer GPUs a significantly less attractive product to produce.
I don't have really solid evidence, just semi-anecdotal/semi-reliable internet posts:
Eg. https://www.tomshardware.com/tech-industry/more-than-251-mil...
Nvidia as a whole has been fairly anti-consumer recently with pricing, so I wouldn't be banking on them for a great cpu option. Weirdly Intel is in the position where they have to prove themselves, so hopefully they'll give us some great products in the next 2-5 years - if they survive (think the old lead-up-to-ryzen era for amd)
- KronisLV 2 days ago
  
  > Nvidia as a whole has been fairly anti-consumer recently with pricing, so I wouldn't be banking on them for a great cpu option.
  If they’re swimming in the AI cash and the consumer GPU segment isn’t that important (https://www.visualcapitalist.com/nvidia-revenue-by-product-l...) then why on earth couldn’t they do less price gouging?
  It feels a bit like the Intel Core Ultra desktop CPU launch where the prices were the critical factor that doomed an otherwise pretty okay product. At least Intel's excuse is that they’re closer to going under than before, even if their GPUs were pretty fairly priced anyways.
  It’s almost like everyone complains about their prices and the fact that they’re releasing 8 GB cards… and then still go and give them money anyways.
- p_l 2 days ago
  
  The same chip as the "proper"[1] 5090 is also used for workstation and some server cards, which easily go for higher prices. So it's just an allocation of chips to different products, taking into account that, with the power demands and design issues in the 5090's power supply, there isn't all that much demand for the 5090 either.
  [1] there are now 5090 branded cards that use same chip as 5080
jonas21 2 days ago

> I am waiting for Nvidia's entry to CPU.
Haven't they already started doing this with Grace and GB10?
- https://www.nvidia.com/en-us/data-center/grace-cpu/
- https://nvidianews.nvidia.com/news/nvidia-puts-grace-blackwe...
- wtallis 2 days ago
  
  Their Grace datacenter CPU is basically a chip where they put down all the LPDDR5 memory controllers (albeit curiously slow), NVLINK and PCIe IOs they needed around the perimeter, and then filled in the interior with boring off the shelf ARM cores. It's basically an IO and memory expander that happens to run Linux.
  GB10 when it ships might be more interesting, since it'll go into systems that need to support use cases other than merely feeding a big GPU ML workloads. But it sounds like the CPU chiplet at least was more or less outsourced to Mediatek.
magicalhippo 2 days ago

The whole missing ROPs saga[1][2] didn't help. I bought a 5070 Ti and had to return it due to missing ROPs. Had to get another brand as replacement, as they had so little stock.
[1]: https://gamersnexus.net/gpus/investigating-nvidias-defective...
[2]: https://nvidia.custhelp.com/app/answers/detail/a_id/5628/~/h...
xl-brain 2 days ago

The Micro Center in my neighborhood has hundreds of 5090s in stock. I'm not sure it's as hard as it used to be.
enqk 2 days ago

I keep wondering if the yields have gone all bad with the newer processes

Aissen 2 days ago

Does the comparison even make sense, considering there's (more than) an order of magnitude difference in price between AMD's desktop GPU and NVIDIA's workstation accelerator?

dist-epoch 2 days ago

Why doesn't NVIDIA also build something like Google TPU, a systolic array processor? Less programmable, but more throughput/power efficiency?

It seems there is a huge market for inference.

AlotOfReading 2 days ago

Nvidia tensor cores are small systolic arrays. They'd have to throw out a lot of their ecosystem investments and backwards compatibility to make effective use of them as the main GPU compute, and there's really no need given how competitive their chips are right now.
aurareturn 2 days ago
> Less programmable, but more throughput/power efficiency?
I also wonder the same. It'd make sense to sell two categories of chips:
- Traditional GPUs like Blackwell that can do anything and have backwards compatibility.
- Less programmable and more ASIC-like inference chips like Google's TPUs.
The inference market is going to be multiple times bigger than training soon.
StochasticLi 18 hours ago

The CUDA ecosystem is their moat.