ykl 2 years ago

Controversial opinion: I wish every GPU programming language and API did memory management like CUDA. In CUDA, you allocate and manage GPU memory using things like cudaMalloc and cudaMemcpy, which behave exactly the way their names suggest, and handles to GPU memory allocations are just plain old pointers, except that they point into the GPU’s address space. Basically, if you know how to deal with memory in C, you already know how to deal with GPU memory in CUDA.
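
To make that concrete, host-side memory management in CUDA really does read like plain C. A minimal sketch (error checking omitted):

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        size_t n = 1 << 20;
        std::vector<float> host(n, 1.0f);

        float* dev = nullptr;                        // a plain pointer, into the GPU's address space
        cudaMalloc((void**)&dev, n * sizeof(float)); // like malloc, but device memory
        cudaMemcpy(dev, host.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);          // like memcpy, plus a direction flag

        // ... launch kernels that take `dev` as an ordinary pointer argument ...

        cudaMemcpy(host.data(), dev, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(dev);                               // like free
        return 0;
    }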

Contrast with Vulkan/GL/Metal/DirectX, where in order to manage GPU memory, you need to know about a whole diverse zoo of different buffer types and different rules for different buffers and there’s a billion different API calls to handle different cases… I just want a chunk of GPU memory that I can do whatever I want with!

  • kllrnohj 2 years ago

    I don't think it's controversial at all to wish everything was simple. That said...

    > Basically if you know how to deal with memory in C, you already know how to deal with GPU memory in CUDA.

    is not even remotely true in the slightest. Using memory in CUDA is massively more complex than in C. The actual allocation & deallocation API isn't meaningfully different, no, but that's not where the story ends. The story ends with ensuring memory accesses are friendly for the GPU, meaning coalesced or strided. It means adjusting the behavior of L2 cache for the workload. It means optimizing how you actually get data to and from the GPU in the first place. And, last but certainly not least, cudaMalloc and cudaFree are much more expensive than malloc & free are. Like, orders of magnitude more expensive.
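
    To make the coalescing point concrete, a minimal sketch (identifiers are mine, not from any real codebase): two kernels that move the same amount of data but perform very differently, because of how each warp's accesses map onto memory transactions.

        // Adjacent threads touch adjacent elements: each warp's 32 loads collapse
        // into a few wide memory transactions (coalesced).
        __global__ void copy_coalesced(const float* in, float* out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = in[i];
        }

        // Adjacent threads touch elements far apart: each warp scatters across many
        // cache lines, and effective bandwidth can drop by an order of magnitude.
        __global__ void copy_scattered(const float* in, float* out, int n, int stride) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = in[(int)(((long long)i * stride) % n)];
        }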

    Yes you can do all that micro-optimization for CPUs, of course. But for GPGPU it's actually super critical or your performance is just dreadfully bad. Like, don't even bother using the GPU at all bad.

    • jpgvm 2 years ago

      Not to mention how insanely difficult heterogeneous memory management can get if you start needing coherence and/or relaxed atomics.

      I have taken multiple swings at getting good at mixed computing and decided in most cases it's better to just pick CPU or GPU, trying to use both at the same time increases the difficulty by orders of magnitude.

      Hats off to people that are good at that shit.

  • boulos 2 years ago

    Partly it was second system syndrome. OpenCL in particular thought it was going to be "better", particularly for hybrid programming and portability between "cpu only, GPU only and mixed". I personally find it was a failure, and not just because NVIDIA never cared to really push it.

    DirectX and GL predated CUDA and already had opaque buffer allocation things (e.g., vertex buffers). Partly this was a function of limited fixed-function units, maximum sizes of frame buffers and texture dimensions, and so on.

    But yes, CUDA had a memory model that wasn't necessarily "magic" but just like regular malloc and free, it's pretty obvious what it does. (And you just live with pinned memory, host <=> device copies, and so on).

    • bl0b 2 years ago

      I think CUDA even lets you allocate pinned host memory too now - cuHostMalloc or something like that - so you can skip the double-copy
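
      Right - the runtime-API spellings are cudaHostAlloc / cudaMallocHost. Roughly how it gets used (a sketch; the size and stream are placeholders):

          void upload_with_pinned_memory(size_t n, cudaStream_t stream) {
              // Pinned (page-locked) host memory: the GPU can DMA straight from it,
              // so the driver skips its internal staging copy, and copies can be async.
              float* h_buf = nullptr;
              cudaHostAlloc((void**)&h_buf, n * sizeof(float), cudaHostAllocDefault);

              float* d_buf = nullptr;
              cudaMalloc((void**)&d_buf, n * sizeof(float));

              // Truly asynchronous only because h_buf is pinned; with pageable memory
              // cudaMemcpyAsync silently falls back to a synchronous copy.
              cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                              cudaMemcpyHostToDevice, stream);

              cudaStreamSynchronize(stream);
              cudaFreeHost(h_buf);
              cudaFree(d_buf);
          }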

      • my123 2 years ago

        CUDA provides a tier significantly above that: unified memory.

        See: https://on-demand.gputechconf.com/gtc/2017/presentation/s728...

        And: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

        However, the Windows driver infrastructure's unified memory support lags much further behind, offering only the pre-Pascal feature set.

        For the newer features you'll have to use Linux. Note that WSL2 counts as Windows here; it's a driver infrastructure limitation on Windows.

        • andoma 2 years ago

          I've switched to using cudaMallocManaged() exclusively. From what I can tell there isn't much of a performance difference. A few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems. I also really love that you can just break with gdb and look around in that memory.
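
          For reference, the pattern looks roughly like this (a sketch; myKernel and the launch configuration are placeholders):

              __global__ void myKernel(float* data, size_t n);       // whatever your kernel is

              void run_managed(size_t n, int blocks, int threads) {
                  float* data = nullptr;
                  cudaMallocManaged((void**)&data, n * sizeof(float)); // one pointer, valid on CPU and GPU

                  for (size_t i = 0; i < n; i++) data[i] = 1.0f;       // CPU touches it directly

                  int device = 0;
                  cudaGetDevice(&device);
                  // Hint the migration so the kernel doesn't start with a storm of page faults.
                  cudaMemPrefetchAsync(data, n * sizeof(float), device);
                  myKernel<<<blocks, threads>>>(data, n);

                  // Prefetch back before heavy CPU access (cudaCpuDeviceId means "the host").
                  cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);
                  cudaDeviceSynchronize();
                  cudaFree(data);
              }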

        • einpoklum 2 years ago

          Unified memory is just _different_, not above or below. It offers on-demand paging. But that comes at a cost (at times) in terms of memory I/O speed.

      • boulos 2 years ago

        Yeah, sorry if I was unclear: some folks thought that cuHostMalloc et al. and pinned memory were "impure". That you should instead have a unified sense of "allocate" and that it could sometimes be host, sometimes device, sometimes migrate.

        The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).

        So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".

      • 01100011 2 years ago

        Sure, but pinned memory is often a limited resource and requires the GPU to issue PCI transactions. Depending on your needs, it's generally better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
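
        The usual shape of that is a chunked, double-buffered pipeline, roughly like this (a sketch; process, the chunk sizes and the h_/d_ buffers are placeholders, and the host buffers need to be pinned for the copies to actually overlap):

            __global__ void process(float* chunk, int count);   // placeholder kernel

            void pipelined_transfer(float* h_in, float* h_out, float* d_buf,
                                    size_t chunk, int numChunks, dim3 grid, dim3 block) {
                cudaStream_t streams[2];
                for (int i = 0; i < 2; i++) cudaStreamCreate(&streams[i]);

                // Double-buffer: while chunk k is being processed, chunk k+1 is in flight.
                for (int k = 0; k < numChunks; k++) {
                    cudaStream_t s = streams[k % 2];
                    size_t off = (size_t)k * chunk;
                    cudaMemcpyAsync(d_buf + off, h_in + off, chunk * sizeof(float),
                                    cudaMemcpyHostToDevice, s);
                    process<<<grid, block, 0, s>>>(d_buf + off, (int)chunk);
                    cudaMemcpyAsync(h_out + off, d_buf + off, chunk * sizeof(float),
                                    cudaMemcpyDeviceToHost, s);
                }
                for (int i = 0; i < 2; i++) cudaStreamSynchronize(streams[i]);
            }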

    • wrnr 2 years ago

      The third system, WebGPU, solves the memory management problem. Another thing CUDA gives you is a convenient way to describe how to share data between CPU and GPU. There's no good solution for this yet; I'm hoping for some procedural Rust macro.

  • adwn 2 years ago

    That's primarily because Vulkan is a very low-level, but still portable, API. Different GPUs with different architectures provide memory heaps with different functional and performance characteristics (device local yes/no, host visible yes/no, host cached yes/no, host coherent yes/no), often even within the same device. For example, my Radeon RX 550 has access to 3 heaps: 1.75 GiB device local, 256 MiB device local + host visible + host coherent, 3 GiB host visible + host coherent + [optional] host cached. The last heap can be allocated either host cached or uncached; access from the GPU is presumably faster if it's treated as uncached.
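
    For the curious, this is roughly what that zoo looks like from the application's side: you enumerate the memory types and then pass an explicit index to vkAllocateMemory (a sketch; assumes an already-created physicalDevice):

        #include <cstdio>
        #include <vulkan/vulkan.h>

        void print_memory_zoo(VkPhysicalDevice physicalDevice) {
            VkPhysicalDeviceMemoryProperties mp;
            vkGetPhysicalDeviceMemoryProperties(physicalDevice, &mp);

            for (uint32_t i = 0; i < mp.memoryTypeCount; i++) {
                VkMemoryPropertyFlags f = mp.memoryTypes[i].propertyFlags;
                printf("type %u -> heap %u: device_local=%d host_visible=%d host_coherent=%d host_cached=%d\n",
                       i, mp.memoryTypes[i].heapIndex,
                       !!(f & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT),
                       !!(f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT),
                       !!(f & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT),
                       !!(f & VK_MEMORY_PROPERTY_HOST_CACHED_BIT));
            }
            // vkAllocateMemory then takes one of these memory type indices explicitly,
            // which is exactly the choice CUDA's cudaMalloc hides from you.
        }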

    This gives the maximum amount of control over the performance/convenience tradeoff. If you want more convenience, then you can use a library or engine that handles the details for you. It's the same for CUDA: you get the convenience of malloc(), but you potentially lose a bit of performance.

    • pjmlp 2 years ago

      And thus Vulkan has become the same extension spaghetti as OpenGL, in fewer years than it took OpenGL to reach that point.

      The pain is so great that Khronos has caved and is now trying to clean up the mess with Vulkan Profiles.

  • darzu 2 years ago

    To the contrary, this is my biggest complaint with CUDA. It's a nice-seeming abstraction from a programmer's perspective; unfortunately it's not a good abstraction, because it doesn't match the reality of the hardware. The truth is there is a heterogeneous arena of memory kinds, access patterns, and sizes that gives you drastically different performance trade-offs for different tasks. (This isn't even about diverse hardware; every modern GPU has this complexity.) CUDA oversimplifies this, which leads to opaque performance cliffs, and you end up having to understand a lot of the underlying complexity anyway and then awkwardly back-propagate it up into CUDA's ill-fitting APIs to get decent performance. It's a false sense of simplicity that ends up causing more work and complexity.

    Contrast that with something like WebGPU, where the notions of GPU buffers, textures, pipelines and command queues map well onto what actually happens in hardware, and it's much simpler to get predictable performance.

    Now I totally agree there needs to be more work done to provide simpler abstractions on top of WebGPU/Vulkan/Metal/DirectX for certain common patterns of work. But pretending you have a pointer to a blob of GPU memory isn't the way.

    This talk gives a great overview of the GPU compute landscape: https://www.youtube.com/watch?v=DZRn_jNZjbw

  • LtWorf 2 years ago

    I'm no CUDA expert but from what I recall you do need to use different types of memory to write efficient code, or your code might end up being very slow.

    And then you have different card models, some share RAM, some don't. Some can process AND transfer data, some can't. Some can transfer only one way while processing. All of this must be accounted for while programming.

  • dragontamer 2 years ago

    C++Amp has the easiest GPU allocation.

        array<int, 128> myGPUArray;
    
    That's it. Thanks to RAII, this will be malloc() upon the constructor being called, and free() when the destructor is called.

    A proper C++ approach, with proper RAII and language integration, would be best. C++Amp played around a little bit with some concepts, but it seems to have died off in 2014.

    There's also "array_view", which serves as an automagic pointer, that copies data "when the system determines it needs to be copied". (Using a C++ future-like system to block when the CPU-side is accessing before the GPU-side is done writing, and vice versa).

    "Array" in C++Amp is similar to cudaMalloc. array_view is pretty nifty overall, and I like the idea of transfers being abstracted away by blocking / futures (besides: CUDA itself has blocking calls to kernels, so its not a completely alien concept).

    High speed CPU/GPU interaction would be hampered by array_view (too much automagic), but having a simpler interface and less boilerplate code is a good idea IMO.
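
    FWIW, you can get a decent chunk of that ergonomics in plain CUDA with a thin RAII wrapper; a hypothetical sketch (names are made up, not a real library type):

        template <typename T>
        class DeviceBuffer {                       // RAII over cudaMalloc / cudaFree
            T* ptr_ = nullptr;
            size_t count_ = 0;
        public:
            explicit DeviceBuffer(size_t count) : count_(count) {
                cudaMalloc((void**)&ptr_, count * sizeof(T));
            }
            ~DeviceBuffer() { cudaFree(ptr_); }
            DeviceBuffer(const DeviceBuffer&) = delete;            // avoid double-free
            DeviceBuffer& operator=(const DeviceBuffer&) = delete;
            T* get() const { return ptr_; }
            size_t size() const { return count_; }
        };

        // Usage: DeviceBuffer<int> gpuArray(128);  someKernel<<<1, 128>>>(gpuArray.get());

    What it doesn't give you, of course, is array_view's automatic host/device synchronization.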

    • brandmeyer 2 years ago

      Is this under another namespace or something? std::array<T, Size> doesn't call malloc or free at all. Its entire reason for being is that it provides a fixed-sized type which models a RandomAccessContainer while being allocated inline to the parent object.

      • dragontamer 2 years ago

        concurrency::array from C++ AMP.

        EDIT: It doesn't really call malloc/free, but is instead the equivalent of cudaMalloc / cudaFree.

  • nmfisher 2 years ago

    I've only touched OpenGL very briefly, but isn't this mostly explained by the fact that CUDA is for general purpose computing and not hardware rendering? If you're mostly dealing with vertex/texture buffers, then presumably you'll need to dress up CUDA with a bit of custom boilerplate, at which point it starts looking like the OpenGL approach anyway.

    • dotnet00 2 years ago

      On older GPUs, that was the case. But ever since the unified shader model became a thing, all the various buffers except textures are just generic chunks of memory, similar to what CUDA deals with. In fact the 'modern' OpenGL approach is basically to tie together various pieces to get close to CUDA's model, where you deal only with generic chunks of memory and raw pointers to them (except for textures). The main issue is that it's a lot more boilerplate than CUDA's straightforward malloc calls.

      • kllrnohj 2 years ago

        The GP included Metal & Vulkan in that list, though. But at least in the case of Vulkan you do just have a single vkCreateBuffer for non-texture data. There's a usage bitmask, but that's more about ensuring you are told the right alignment and padding requirements.

        CUDA, meanwhile, makes you just sort of "know" that from reading the docs. CUDA's model isn't simpler or friendlier here. It's the same complexity, "just" moved from API to documentation instead. At least, assuming you want to get anywhere close to the performance the hardware is capable of, that is.

  • tehsauce 2 years ago

    I agree. Maybe there’s a reason for all the extra complexity in vulkan, perhaps because such a wide variety of devices are targeted?

    • my123 2 years ago

      Explicit GPU pointers, instead of opaque structures from the programmer's view, only became part of core in Vulkan 1.2 (VK_KHR_buffer_device_address).

      And for Metal, getting the GPU address of a buffer only became exposed in Metal 3.0, which is going to be released this fall in iOS/iPadOS 16 and macOS Ventura.

      • dagmx 2 years ago

        I'm excited about the addition to Metal, especially in conjunction with the shared memory of the apple silicon chips

      • ykl 2 years ago

        Yup, I’m excited by APIs finally moving in this direction. IMO CUDA got it right from the start.

  • curiousgal 2 years ago

    I mean you do still have to fiddle around with global vs shared memory.
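
    Right - the classic pattern being to stage a block's worth of global memory in on-chip shared memory before working on it. A sketch (assumes blockDim.x == 256):

        __global__ void block_sums(const float* in, float* out, int n) {
            __shared__ float tile[256];                  // fast on-chip memory, per block
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one coalesced global load
            __syncthreads();

            // Tree reduction entirely in shared memory: no further global traffic.
            for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
                if (threadIdx.x < stride)
                    tile[threadIdx.x] += tile[threadIdx.x + stride];
                __syncthreads();
            }
            if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
        }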

halotrope 2 years ago

NVIDIA gets a lot of flak (and rightfully so) but CUDA is a really approachable API. Considering how much complexity it encapsulates, it could be a lot worse. The concurrency primitives, C(++) integration and memory management are quite good. The profiling and debugging toolkits that come with the CUDA stack are also world-class. I really wish they could become a better citizen in Linux and OSS at some point.

  • pjmlp 2 years ago

    NVidia saw what Khronos has decided to ignore for so long.

    Programming models beyond plain C, good IDE and graphical debugging experience, library ecosystem for GPGPU code.

    Proprietary APIs from Apple, Sony, Nintendo and Microsoft are similar in this regard.

    Khronos has woken up with SPIR, SPIR-V and C++ for OpenCL; how much it will matter remains to be seen. It is no wonder that SYCL is adopting a backend-agnostic approach instead.

  • einpoklum 2 years ago

    For writing kernels and profiling, yes, it's pretty nice. But the host-side API is rather unapproachable. There are, like, 20 API calls just for different variants of memory allocation, all inconsistently named. Also, the APIs are almost entirely (but not 100%) C'ish.

    In the API wrappers I've written: https://github.com/eyalroz/cuda-api-wrappers

    I try to address these and some other issues.

    We should also remember that NVIDIA artificially prevents its profiling tools from supporting OpenCL kernels - with no good reason.

m_mueller 2 years ago

This right here is why Nvidia has such a massive lead in computational GPU sales. Investments into this ecosystem were made 15-17 years ago, i.e. there is a long way to go for AMD, Intel and others (Apple?).

  • sennight 2 years ago

    They have a massive lead due to vendor locking any software written to depend on CUDA instead of an open standard... and that is a good thing? Okeydokey.

    • dotnet00 2 years ago

      Developers weren't forced into using CUDA, that was entirely because of their ecosystem being much better than anyone else's.

      Facebook and Google obviously wouldn't want to lock themselves into CUDA for PyTorch and Tensorflow, but there genuinely wasn't any other realistic option. OpenCL existed but the implementation on AMD was just as bad as the one on NVIDIA.

      Consider that Blender's Cycles render engine had only gotten OpenCL support when AMD assigned some devs specifically to help work through driver bugs and even then they had to resort to a 'split kernel' hack, which recently led to OpenCL support once again being entirely dropped as the situation hadn't really improved over the decade. Instead the CUDA version was ported to HIP and a HIP supporting Windows driver was released.

      Even now, if you need to do GPGPU on PCs, CUDA is essentially the easiest option. Every NVIDIA card supports it pretty much right from launch on both Linux and Windows, while with AMD you currently only get support for a few distros (Windows support is probably not too far off now), slow support for new hardware and a system of phasing out support for older cards even if they're still very common. On top of that, NVIDIA offers amazing profiling and debugging tools that the competition hasn't caught up to.

      • sennight 2 years ago

        > Developers weren't forced into using CUDA...

        ... no, just as any other consumer isn't necessarily "forced" by companies employing anticompetitive practices.

        > Facebook and Google...

        lol, they have such a high churn rate on hardware that I seriously doubt they'd give it much thought at all. Their use case is unique to a tiny number of companies - high churn, low capital constraint, no tolerance for supplier delay. In such a scenario CUDA vendor lock in wouldn't even register as a potential point of pain.

        > OpenCL existed but the implementation on AMD was just as bad as the one on NVIDIA.

        For those unaware of how opencl works: an API is provided by the standard, to which software can be written by people - even those without signed NDAs. The API can interface to a hardware endpoint that has further open code and generous documentation... like an open source DSP, CPU, etc - or it can hit an opaque pointer. If your hardware vendor is absurdly secretive and insists on treating microcode and binary blobs as competitive advantages, then your opencl experience is wholly dependent on that vendor implementation. Unfortunately for GPUs that means either NVIDIA or AMD (maybe Intel, we'll see)... so yeah - not good. AMD has improved things open sourcing a great deal of their code, but that is a relatively recent development. While I'm familiar with some aspects of their codebase (had to fix an endian bug, guess what ISA I use), I dunno how much GPGPU functionality they're still hiding behind their encrypted firmware binary blobs. Also, to the point on NVIDIA's opencl sucking: anybody else remember that time that Intel intentionally crippled performance for non-Intel hardware running code generated by their compiler or linked to their high performance scientific libraries? Surely NVIDIA would never sandbag opencl...

        Anyway, this is kind of a goofy thing to even discuss given two facts:

        * There are basically two GPU vendors - so vendor lock is practically assured already.

        * CUDA is designed to run parallel code on NVIDIA GPUs - full stop. Opencl is designed for heterogeneous computing, and GPUs are just one of many computing units possible. So not apples to apples.

        • kllrnohj 2 years ago

          > CUDA is designed to run parallel code on NVIDIA GPUs - full stop. Opencl is designed for heterogeneous computing, and GPUs are just one of many computing units possible. So not apples to apples.

          This is really why OpenCL failed. You really can't write code that works just as well on CPUs as it does on GPUs. GPGPU isn't really all that general purpose, it's still quite specialized in terms of what it's actually good at doing & the hoops you need to jump through to ensure it performs well.

          This is really CUDA's strength. Not the API or ecosystem or lock-in, but rather because CUDA is all about a specific category of compute and isn't afraid to tell you all the nitty gritty details you need to know in order to make effective use of it. And you actually know where to go to get complete documentation.

          > There are basically two GPU vendors - so vendor lock is practically assured already.

          Depends on how you scope your definition of "GPU vendor." If you only include cloud compute then sure, for now. If you include consumer devices then very definitely no, not at all. You also have Intel (Intel's integrated being the most widely used GPU on laptops, after all), Qualcomm's Adreno, ARM's Mali, IMG's PowerVR, and Apple's PowerVR fork. Also Broadcom's VideoCore that's still in use by the very low end like the Raspberry Pi and TVs.

        • pjmlp 2 years ago

          CUDA is designed to support C, C++ and Fortran as first-class languages, with PTX bindings for anyone else that wants to join the party, including .NET, Java, Julia and Haskell, among others.

          OpenCL was born as a C-only API that requires compilation at run time. The later additions of SPIR and C++ were an afterthought, after they started to take a heavy beating. There is still no IDE or GPGPU debugging experience that compares to CUDA, and OpenCL 3.0 is basically 1.2.

          Really not apples to apples.

        • dotnet00 2 years ago

          >lol, they have such a high churn rate on hardware that I seriously doubt they'd give it much thought at all. Their use case is unique to a tiny number of companies - high churn, low capital constraint, no tolerance for supplier delay. In such a scenario CUDA vendor lock in wouldn't even register as a potential point of pain

          Considering that PyTorch and Tensorflow are the two most popular deep learning frameworks used in the industry, this argument doesn't make sense. Of course they care about CUDA lock-in, it makes them dependent on a competitor and limits the range of hardware they support and thus potentially limits the adoption of their framework. The fact that they chose CUDA anyway is essentially confirmation that they didn't see any other viable option.

          >Also, to the point on NVIDIA's opencl sucking: anybody else remember that time that Intel intentionally crippled performance for non-Intel hardware running code generated by their compiler or linked to their high performance scientific libraries? Surely NVIDIA would never sandbag opencl...

          If NVIDIA were somehow intentionally crippling OpenCL performance on non-NVIDIA hardware, it would be pretty obvious since they don't control all the OpenCL compilers/runtimes out there. They very likely were crippling OpenCL on their own hardware, but that obviously wouldn't matter if the competitors (as you mentioned, OpenCL was designed for heterogeneous compute in general, so there would have been competition from more than just AMD) had a better ecosystem than CUDA's.

        • ChrisLomont 2 years ago

          >For those unaware of how opencl works: an API is provided by the standard, to which software can be written by people - even those without signed NDAs

          And no one has made it work as well as CUDA - developers that want performance will choose CUDA. If OpenCL worked as well people would choose it, but it simply doesn't.

          >I seriously doubt they'd give it much thought at all.

          Having talked to people at both companies about exactly this, they have put serious thought into it - it amounts to powering their multi-billion dollar cloud AI infrastructure. The alternatives are simply so bad that they choose CUDA/NVidia stuff, as do their clients. They (and AWS and MS) choose NVidia for their cloud offerings not because they are all stupid or cannot make new APIs if needed - they choose it because it works.

          >Surely NVIDIA would never sandbag opencl...

          So fix it. There are enough people that can and do reverse engineer such things that someone would likely have found such conspiracies. Or publish the proof. Reverse engineering is not so hard that, if this mythical problem existed, you could not find it, prove it and write it up, or even fix it. There are enough companies besides NVidia that could fix OpenCL, or make a better API for NVidia and sell that, yet neither of those has happened. If you really believe it is possible, you are sitting on a huge opportunity.

          Or, alternatively, NVidia has made really compelling hardware and the best software API so far, and people use that because it works.

          Open source fails at many real world tasks. Choose the tool best suited to solve the problem you want solved, regardless of religious beliefs.

          • badsectoracula 2 years ago

            Sorry but this part...

            > Choose the tool best suited to solve the problem you want solved, *regardless of religious beliefs*.

            ...is nonsense. Open source isn't about "religion", it's about actually being able to do something like...

            > So fix it.

            ...without needing to do stuff like...

            > do reverse engineer such things

            ...which is a pointless waste of time regardless of how "not that hard" it might be (it is certainly not easy, and it is certainly much easier to have the source code around).

            This association of open source / free software with religion doesn't have any place here, people didn't come up with open source / free software because of some mystical experience with otherworldly entities, they came up with it because they were faced with actual practical issues.

            • ChrisLomont 2 years ago

              OP complains people use CUDA instead of a non-existent open source solution.

              That's religion.

              And a significant amount of open source solutions are the result of reverse engineering. It's a perfectly reasonable and time tested method to replace proprietary solutions.

              > they came up with it because they were faced with actual practical issues

              People use CUDA for actual practical issues. If someone makes a cross platform open source solution that solves those issues people will try it.

              So far it has not been done.

              • badsectoracula 2 years ago

                First of all, i replied to the generalization "Open source fails at many real world tasks. Choose the tool best suited to solve the problem you want solved, regardless of religious beliefs" not just about CUDA. Open source might fail at tasks but it isn't pushed or chosen because of religion. It has nothing to do with religion. In fact...

                > OP complains people use CUDA instead of a non-existent open source solution. That's religion.

                ...that isn't religion either. The person you replied to complains because CUDA is not only closed source but also vendor-locked to Nvidia, both of which carry a ton of inherent issues, largely around control - the complaint comes from those issues. For many people these issues can either be showstoppers or just make them look for, wish for and push for alternatives, and they come from practical concerns, not religious ones.

                > And a significant amount of open source solutions are the result of reverse engineering. It's a perfectly reasonable and time tested method to replace proprietary solutions.

                It is not reasonable at all; it is the last-ditch effort when nothing else will do, and it can be a tremendous waste of time. Telling people "So fix it" when doing that would require reverse engineering is practically the same as telling them to shut up, and IMO can't be taken seriously as anything other than that.

                The proper way to fix something is to have access to the source code.

                And again to be clear:

                > People use CUDA for actual practical issues. If someone makes a cross platform open source solution that solves those issues people will try it.

                The "actual practical issues" i mentioned have nothing to do with CUDA or any issues people might have with CUDA or any other closed source (or not) technology. The "actual practical issues" i mentioned are the issues inherent to closed source technologies in general - like not being able to fix any potential problems one might have, and being under the control of the vendor of those technologies.

                These are all widely known and talked about issues, it might be a good idea to not dismiss them.

          • Const-me 2 years ago

            > they choose it because it works

            MS DirectCompute also works. Yet last time I checked, MS Azure didn't support DirectCompute with their fast GPUs. These virtual machines come with the TCC (Tesla Compute Cluster) driver, which only supports CUDA; DirectCompute requires a WDDM (Windows Display Driver Model) driver. https://social.msdn.microsoft.com/forums/en-US/2c1784a3-5e09...

            • my123 2 years ago

              You can flip the model from TCC to WDDM via nvidia-smi.

              But AFAIK, C++ AMP is deprecated and going away.

              https://docs.microsoft.com/en-us/cpp/parallel/amp/cpp-amp-ov...

              > C++ AMP headers are deprecated, starting with Visual Studio 2022 version 17.0. Including any AMP headers will generate build errors. Define _SILENCE_AMP_DEPRECATION_WARNINGS before including any AMP headers to silence the warnings.

              So please don't rely on DirectCompute. It's firmly in legacy territory. Microsoft didn't invest the effort necessary to make it thrive.

              • Const-me 2 years ago

                > flip the model from TCC to WDDM via nvidia-smi

                I’m not sure that’s legal. I think NV wants extra money, details there https://www.nvidia.com/content/dam/en-zz/Solutions/design-vi...

                > C++ AMP is deprecated and going away.

                Indeed, but I never used that thing.

                > don't rely on DirectCompute

                DirectCompute is a low-level tech, a subset of D3D11 and 12. It’s not deprecated, used by lots of software, most notably videogames. For instance, in UE5 they’re even rasterizing triangles with compute shaders, that’s DirectCompute technology.

                Some things are worse than CUDA. Different programming language HLSL, manually managed GPU buffers, compatibility issues related to FP64 math support.

                Some things are better than CUDA. No need to install huge third-party libraries, integrated with other GPU-related things (D2D, DirectWrite, desktop duplication, media foundation). And vendor agnostic, works on AMD and Intel too.

                • my123 2 years ago

                  > I’m not sure that’s legal. I think NV wants extra money, details there https://www.nvidia.com/content/dam/en-zz/Solutions/design-vi...

                  Use: nvidia-smi -g {GPU_ID} -dm 0

                  Cloud providers do pay for an extensive vGPU license, don't worry about that part.

                  > DirectCompute is a low-level tech, a subset of D3D11 and 12.

                  D3D11. In D3D12, compute got rolled directly into the API itself rather than being a separately named subset. Also, you have CLon12 today, which does support SPIR-V.

                  • Const-me 2 years ago

                    > Use: nvidia-smi -g {GPU_ID} -dm 0

                    I think I tried that a year ago, and it didn’t work. Documentation agrees, it says “GRID drivers redistributed by Azure do not work on non-NV series VMs like NCv2, NCv3” https://docs.microsoft.com/en-us/azure/virtual-machines/wind... Microsoft support told me the same. I wanted NCv3 because on paper, V100 GPU is good at FP64 arithmetic which we use a lot in our compute shaders.

                    > The compute subset of D3D12 is named D3D12

                    Interesting, didn’t know about the rebranding.

        • jjoonathan 2 years ago

          > Surely NVIDIA would never sandbag opencl...

          In my experience the AMD OpenCL implementation was worse than NVIDIA's OpenCL implementation, and not a little worse, but a lot worse. NVIDIA beat AMD at AMD's own game -- even though NVIDIA had every incentive to sandbag. It was shameful.

    • jacquesm 2 years ago

      One such developer: I love CUDA, even if I don't like Nvidia. CUDA is the most direct and transparent way to work with the GPU for the stuff I do and it is on account of it not being an open standard: it doesn't have four vendors trying to pull it their way ending somewhere in the middle and it has been very stable for a long time so I don't need to update my code every time I get new hardware, though sometimes some tweaks are required to get it close to theoretical maximum speed. That alone stops me from going with an 'open standard' even though I'm a huge fan of those. But in this case the hardware is so tightly coupled to the software I see no point: there isn't anything out there that would tempt me.

      So, locked in? Yes. But voluntarily so; I could switch if I wanted to, but I see absolutely no incentive, performance-wise or software-architecture-wise. And the way things are today that is unlikely to change unless some party is willing to invest a massive amount of money into incentivizing people to switch. And I'm not talking about miners here but people that do useful computations and modeling on their hardware, and all this on Linux to boot, a platform that most vendors could not care less about.

      • einpoklum 2 years ago

        > CUDA is the most direct and transparent way to work with the GPU

        Yes, but it's still not direct and transparent enough. The libraries and drivers are closed.

        > it doesn't have four vendors trying to pull it their way ending somewhere in the middle

        Well, no, but it does have "marketing-ware", i.e. features introduced mostly to be able to say: "Oh, we have feature X" - even if the feature does not help performance.

        • jacquesm 2 years ago

          > The libraries and drivers are closed.

          Yes, but that does not bother me all that much, since they are tied to that specific piece of hardware. I'm more concerned with whether they work or not and unless I'm planning to audit them or improve on them what's in them does not normally bother me, I see the combination Card+Firmware as a single unit.

          > Well, no, but it does have "marketing-ware", i.e. features introduced mostly to be able to say: "Oh, we have feature X" - even if the feature does not help performance.

          I'm not aware of any such features other than a couple of 'shortcuts' which you could have basically provided yourself. Beyond that NVidia goes out of its way to ship highly performant libraries with their cards for all kinds of ML purposes and that alone offsets any kind of bad feeling I have towards them for not open sourcing all of their software, which I personally believe they should do but which is their right to do or not to do. I treat them the same way I treat Apple: as a hardware manufacturer. If their software is useful (NVidia: yes, Apple: no) to me then I'll take it, if not I'll discard it.

        • TomVDB 2 years ago

          I don’t know which features you’re talking about, but over the years, CUDA has received quite a few features where Nvidia was quite explicit that they were not for performance, but for ease of use. “If you want code to work with 90% of the performance, use this; if you want 100%, use the old way, but with significantly more developer pain.”

          Seems fair to me.

        • dahart 2 years ago

          > The libraries and drivers are closed.

          Are you including the new open source kernel modules? https://developer.nvidia.com/blog/nvidia-releases-open-sourc...

          > it’s still not direct and transparent enough.

          Out of curiosity, not direct enough for what? What do you need access to that you don’t have at the moment?

          > features introduced mostly to be able to say: “Oh we have feature X” - even if the feature does not help performance.

          Which features are you referring to? Are you suggesting that features that make programming easier and features that users request must not be added? Does your opinion extend to all computing platforms and all vendors equally? Do you have any examples of a widely used platform/language/compiler/hardware that has no features outside of performance?

          • einpoklum 2 years ago

            > Are you including the new open source kernel modules?

            Apparently, they simply moved almost all of their driver into firmware (the "GPU system processor") - and that firmware is closed. Stated [here](https://github.com/NVIDIA/open-gpu-kernel-modules/issues/19#...).

            And what about the host-side library for interacting with the driver? And the Runtime API library? And the JIT compiler library? This seems more like a gimmick than actual adoption of a FOSS strategy.

            Just to give an example of why open sourcing those things can be critical: Currently, if you compile a CUDA kernel dynamically, the NVRTC library prepends a boilerplate header. Now, I wouldn't mind much if it were a few lines, but - it's ~150K _lines_ of header! So you write a 4-line kernel, but compile 150K+4 lines... and I can't do anything about it. And note this is not a bug; if you want to remove that header, you may need to re-introduce some parts of it which are CUDA "intrinsics" but which the modified LLVM C++ frontend (which NVIDIA uses) does not know about. With a FOSS library, I _could_ do something about it.

            > Out of curiosity, not direct enough for what? What do you need access to that you don’t have at the moment?

            I can't even tell how many slots I have left in my CUDA stream (i.e. how many more items I can enqueue).

            I can't access the module(s) in the primary context of a CUDA device.

            Until CUDA 11.x, I couldn't get the driver handle of an apriori-compiled kernel.

            etc.

            > Which features are you referring to?

            One example: Launching kernels from within other kernels.

            > Are you suggesting that features that make programming easier and features that users request must not be added?

            If you add a feature which, when used, causes a 10x drop in performance of your kernel, then it's usually simply not worth using, even if it's easy and convenient. We use GPUs for performance first and foremost, after all.

            • dahart 2 years ago

              > Launching kernels from within other kernels.

              This feature exists? It’s news to me if so and I would be interested. Is it brand new? Can you link to the relevant documentation?

              I’m pretty lost as to why this would represent something bad in your mind, even if it does exist. Is this what you’re saying causes a 10x drop in perf? CUDA has lots of high level scheduling control that is convenient and doesn’t overall affect perf by much but does reduce developer time. This is true of C++ generally and pretty much all computing platforms I can think of for CPU work. There are always features that are convenient but trade developer time for non-optimal performance. Squeezing every last cycle always requires loads more effort. I don’t see anything wrong with acknowledging that and offering optional faster-to-develop solutions alongside the harder full throttle options, like all platforms do. Framing this as a negative and a CUDA specific thing just doesn’t seem at all accurate.

              Anyway I’d generally agree a 10x drop in perf is bad and reason to question convenience. What feature does that? I still don’t know what you’re referring to.

              • einpoklum 2 years ago

                > Is it brand new?

                No, it's 5 years old or so.

                > Can you link to the relevant documentation?

                https://developer.nvidia.com/blog/cuda-dynamic-parallelism-a...
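
                For anyone who hasn't seen it, a kernel-side launch looks roughly like this (a sketch; it needs relocatable device code, i.e. nvcc -rdc=true, and a supported GPU):

                    __global__ void child(float* data, int n) {
                        int i = blockIdx.x * blockDim.x + threadIdx.x;
                        if (i < n) data[i] *= 2.0f;
                    }

                    __global__ void parent(float* data, int n) {
                        if (threadIdx.x == 0 && blockIdx.x == 0) {
                            // A kernel launching another kernel, no host round-trip.
                            child<<<(n + 255) / 256, 256>>>(data, n);
                        }
                    }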

                > and doesn’t overall affect perf by much

                Oh, but it very easily does. Slow I/O to the device, no parallelization of I/O and compute, etc - are all extremely easy pitfalls to walk into.

    • physicsguy 2 years ago

      It's not vendor locking when the functionality doesn't exist on the other platforms.

      Just take cuFFT as an example. That's a core library that's been there pretty much since the beginning of CUDA. It has a compatibility interface for FFTW, which everybody knows how to use, so porting from CPU code that used FFTW was trivial.
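
      For example, a 1D complex transform with cuFFT's native API is roughly this (a sketch; d_signal is assumed to already be a device pointer), and IIRC the cufftw.h compatibility header mirrors a subset of the fftwf_* planner calls on top of it:

          #include <cufft.h>

          void forward_fft(cufftComplex* d_signal, int n) {
              cufftHandle plan;
              cufftPlan1d(&plan, n, CUFFT_C2C, 1);                    // n-point C2C, batch of 1
              cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
              cufftDestroy(plan);
          }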

      rocFFT is not as mature, the documentation is poor, and the performance is worse. And that's where there is an equivalent library that exists. In other cases, there isn't one.

    • snovv_crash 2 years ago

      CUDA is easy to use and the open standard has an extremely high barrier to entry. This is what enabled Nvidia to even lock people in in the first place - their tech was just so much better.

      • sennight 2 years ago

        That "barrier to entry" line works for things that saturate broad markets... and that is definitely not the case with GPGPU. So when you try to use that line of thinking, given the incredibly well funded and hyper niche use cases it sees, it sounds as if you're saying that opencl is too hard for those dummies writing nuclear weapon simulators at Oak Ridge National Laboratory. And before anybody swings in on a chandelier with "but the scientists are science-ing, they can't be expected to learn to code - something something python!": take a look at the software and documentation those labs make publicly available - they are definitely expected to be comfortable with things weirder than opencl.

        • snovv_crash 2 years ago

          If you have a hard task to accomplish, and there's a way of making it substantially easier, the smart engineer is, frankly, going to take the easier option. They're only going to go with the hard one if that's the only hardware they have access to.

          Back in university we had to specify to the IT department that we needed Nvidia GPUs because people had done all sorts of cool things with CUDA that we could build on, and if we'd had to write it on AMD GPUs back in 2013 we would have burnt through all of our time just getting the frameworks compiling.

          • jjoonathan 2 years ago

            Yep.

            Source: burnt through all my time just getting frameworks compiling on AMD GPUs back in 2014.

        • oytis 2 years ago

          Maybe they can work with complex libraries, but if there is a better one available I would totally understand them preferring it. You need to be in a certain software bubble - not to understand how to work with OpenCL, but to care enough about whether something is an open standard, whether something is open source, etc.

          • sennight 2 years ago

            > You need to be in a certain software bubble...

            Or have a fundamental understanding of the way mainframes have been built since forever... massive job schedulers feeding various ASICs purpose built for different tasks. IBM is really something else when it comes to this sort of thing, the left hand really doesn't know what the right hand is doing over there. Summit at ORNL... a homogeneous supercomputer cluster made of power9 machines effectively acting as pci backplanes for GPGPUs. You'd think they'd know better... the choice for the ISA made sense given the absolute gulf between them and x86 I/O at the time, but to then not take full advantage of their CPUs by going with CUDA... wow. Oh well, this is the same company that fully opened their CPU - and then immediately announce how their next CPU was going to depend on binary blobbed memory controllers... aaand they also sold the IP for the controllers to some scumbag softcore silicon IP outfit. So despite all their talk with regard to open source, no - they don't seem to actually understand how to fully take advantage of it.

            • snovv_crash 2 years ago

              Mainframes died for a reason, and that reason wasn't the capabilities of their specialized hardware blocks, but

              a. the difficulty of using them in unintended ways, such as wiring together a bunch of them to get better performance

              b. the specialized knowledge required to get them to do anything at all

              c. the limited use cases of the hardware (see a.) which made it expensive

              Applauding OpenCL because it is more like mainframes than CUDA... seems nonsensical to me.

    • pjmlp 2 years ago

      Competition has had 15 years to provide competitive graphical development experience, GPGPU libraries and polygon development experience.

      They don't have anyone to blame but themselves.

      • sennight 2 years ago

        Uh, and everyone else who isn't NVIDIA or the "Competition" aka AMD? Do we get the good old "Don't like the way Private Company does X, build your own X then!"? Everyone had years to provide a competitive petroleum company - apparently Standard Oil did nothing wrong after all.

        • pjmlp 2 years ago

          Definitely, they are the ones to blame for not upping their game, and this includes Intel as well.

          They have had 15 years to improve OpenCL tooling to outmatch CUDA, they haven't done so.

          • sennight 2 years ago

            Lazy end users, not writing drivers to hardware they don't have engineering specs to. Oh well, NVIDIA might get more than they bargained for with their scheme to dominate GPGPU with their proprietary API. It would be a real shame if people started using GPUs for rendering graphics, and just computed the prior work loads on DSPs and FPGAs sitting behind APIs already wired into LLVM, and silicon fab at process levels suited to the task only gets cheaper as new facilities are created to meet the demands of the next node level. That would be just awful, CUDA being so great and all - so beloved due to "Investments ... made 15-17 years ago" and no other reason. Huh, I wonder if that is why they unsuccessfully tried to buy Arm - because they knew the GPU carriage is at risk of turning back into a pumpkin, and they want to continue dominating the market of uncanny cat image generators.

            • pjmlp 2 years ago

              Playing victim usually doesn't work.

              Again, AMD and Intel could have done their job to appeal to developers, they fully failed at it .

              ARM you say?

              Ironically, Google never supported OpenCL on mobile, instead pushing their RenderScript dialect that no one cares about, and is now pushing Vulkan Compute, which is even worse than OpenCL in regard to tooling.

              • salawat 2 years ago

                >Playing victim usually doesn't work.

                Neither does gaslighting or dismissively waving off someone else's suffering, but golly gee do responses like yours seem to make up a disproportionate amount of programmers' attitudes toward end users nowadays.

                Time was you didn't need to have a multi-billion dollar tech company behind you to write drivers or low level API's because you could actually get access to accurate datasheets, specs, etc.

                Now good luck if you want to facecheck some weird little piece of hardware to learn without signing a million NDA's or being held hostage by signed firmware blobs.

                That's a bit beside the main point though, as my gripes with Nvidia stem from their user-hostile approach to their drivers rather than the APIs that use the drivers.

                So apologies, shouting at clouds.

                • pjmlp 2 years ago

                  End users aren't writing GPGPU code.

                  • salawat 2 years ago

                    All authors of code are end users.

                    Not all end users are authors of code.

                    Stop seeing devs as seperate from end users. That's how you get perverse ecosystems and worse, get perverted code.

                    Furthermore, you shouldn't tout that end users aren't writing GPGPU code as either an excuse, or point of pride. If we were actually doing our jobs half as well as we should be (as programmers/computer scientists/teachers), they damn well would be.

            • my123 2 years ago

              > and just computed the prior work loads on DSPs and FPGAs sitting behind APIs already wired into LLVM

              Ha, ha, good one. Upstream LLVM supports PTX just fine. (and GCC too by the way)

              FPGAs have a _much more_ closed-down toolchain. They're really not a good example to take. Compute toolchains for FPGAs are really brittle and don't perform that well. They're _not_ competitive with GPUs perf-wise _and_ are much more expensive.

              More seriously, CUDA maps to the hardware well. ROCm is a CUDA API clone (albeit a botched one).

              > the market of uncanny cat image generators.

              GPUs are used for far more things than that. Btw, Intel's Habana AI accelerators have a driver stack that is also closed down in practice.

            • TomVDB 2 years ago

              I honestly have no idea what message you’re trying to convey here…

torginus 2 years ago

It's so strange. When I went to school almost a decade ago, I took a GPGPU course, and I assumed that with GPU compute becoming widely available, it would become ubiquitous - why sort an array on the CPU in O(n log n) time when you can do it on the GPU in O(log^2 n) parallel steps with bitonic sort?
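
For reference, the GPU version is a sequence of compare-exchange passes over the whole array; a sketch for power-of-two n, with placeholder launch parameters:

    __global__ void bitonic_step(float* data, int j, int k) {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int partner = i ^ j;                     // compare-exchange partner
        if (partner > i) {
            bool ascending = ((i & k) == 0);
            if ((ascending && data[i] > data[partner]) ||
                (!ascending && data[i] < data[partner])) {
                float t = data[i]; data[i] = data[partner]; data[partner] = t;
            }
        }
    }

    // Host side: O(log^2 n) passes in total.
    // for (int k = 2; k <= n; k <<= 1)
    //     for (int j = k >> 1; j > 0; j >>= 1)
    //         bitonic_step<<<n / 256, 256>>>(d_data, j, k);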

For some reason this future has never come to pass, GPGPU remained a computational curiosity, with it largely relegated to computer graphics and machine learning (although in the latter case, people rarely write CUDA themselves).

I wonder why that is the case, and if it will ever become mainstream.

  • jahewson 2 years ago

    For most day-to-day (non-big) data, the delay from copying the data to and from the GPU will outweigh the time saved. For large data, there's usually an existing library for handling it - images, videos, ML weights, etc. It may take advantage of SIMD instructions, perhaps, or the GPU; the end user doesn't need to know.

    Another reason is that specialised devices beat general devices. Once everyone settles on H.264, then adding specialised decoding silicon to the die is a big win, especially where battery life and efficiency matter. Same for ray-tracing, neural/tensor cores.

    • torginus 2 years ago

      Yeah, but we got integrated GPUs which share memory with the CPU - consoles and mobiles have done this forever, and it seems that sharing memory between the CPU and GPU is the way to go, considering stuff like the M1.

      • rocqua 2 years ago

        The M1 can pull off shared memory because it has ridiculous RAM speed; an iGPU has no choice but to share memory.

        Thing is, a CPU is very sensitive to memory latency, whilst a GPU is much more dependent on memory throughput. Hence they get memory with different tradeoffs.

  • majke 2 years ago

    In my humble opinion there just isn't much point in using GPGPU for normal computing. I'll give you an example.

    https://blog.cloudflare.com/computing-euclidean-distance-on-...

    during this work I was able to achieve:

    - 7.5kqps on 40 Xeon CPU's (optimized avx code)

    - 35kqps on V100 gpu

    A V100 is say $10k, a Xeon 6162 is $3k. That works out to roughly $0.40 per query/second of throughput on Intel, and $0.28 per query/second on the GPU.

    This really is not much of a difference. I'd prefer to own more general-purpose servers than hard-to-program proprietary GPUs.

    Obviously, getting GPUs sometimes makes sense. There are tasks that are a better fit for CUDA/GPGPU: stuff that requires high memory bandwidth and doesn't branch much. Furthermore, while the V100 costs a lot, a 3080 is supposed to be only say $700, which changes the equation in favor of the GPU.

    But overall, modern x86 AVX is surprisingly good, and apart from very specialized tasks, porting generic algorithms to the GPU is not worth the effort.

    • dahart 2 years ago

      > 3080 is supposed to be only say $700, which changes the equation in favor of gpu.

      Yes, more than an order of magnitude cheaper, and also 2x higher peak teraflops. This makes a 20x-30x difference in your dollars per request calculation. So it’s somewhat confusing why at the end you ignore this and conclude GPUs aren’t worth it. V100 is a couple generations old, and comes in a half price model too, plus there are some other server grade options. The reasons to use the most expensive V100 are the mem capacity, or double precision, or to argue that it’s too expensive… ;)

      • wmf 2 years ago

        GeForce GPUs aren't "allowed" to be used in servers so you "have to" buy the 10x more expensive ones.

        • dahart 2 years ago

          Yeah, true, some configs are certified for servers and GeForce cards aren't. Xeon's in the same boat. So it does depend on your goals and requirements then. Using the V100 is still cherry-picking in a price/perf argument, since you don't have to spend 10x; there are other, cheaper options that are server qualified, right?

          • wmf 2 years ago

            Yeah, but I assume all the server GPUs are 10x worse than consumer.

            • dahart 2 years ago

              Huh. Okay. Why? What do you mean by ‘worse’?

              • wmf 2 years ago

                If a $10K server GPU is equivalent to a $1K consumer GPU, I assume the $2K server GPU is equivalent to a $200 consumer one. If the price/performance sucks, picking a different model won't help.

                • dahart 2 years ago

                  I see. Well picking a different model actually does help, a lot, so the main thing to consider when asking whether your assumptions are valid is whether the $10k GPU and $1k GPU are equivalent (they're not), and what you're paying for, because it's not primarily for flops. Take the two models of GV100, for example, which have exactly the same perf, and yet one is half the price of what @majke picked as the example. In this case, picking a different model helps price by 2x. The difference is memory size. Other non-perf differences that affect price include thermal properties, support level, and generation of GPU. These things come down to your goals and requirements. Maybe @majke didn't check recently, but you can buy newer GPUs than a GV100 that have even more memory, higher perf, are server certified, and cost about half, so even using the half-price smaller GV100 would be cherry-picking in my book. And if we're talking about consumer hobbyist needs and not server farm needs, that's when you can get a lot of compute for your money.

                  • majke 2 years ago

                    Thanks @wmf @dahart for the discussion.

                    You are both right:

                    - I can't just buy 3080 and stuff it into my servers due to legal.

                    - I can't just buy 3080 and stuff it into my servers due to availability.

                    - Often (as in the example I gave) the price-to-performance of the GPU is not worth the cost of porting the software.

                    - Often (as in the example I gave) the price-to-performance of the GPU is not super competitive with the CPU.

                    - Sometimes, you can make the math work. Either by picking a problem which GPU excels at (memory speed, single precision, etc), or by picking consumer grade GPU or by having access to cheap/used datacenter grade GPU's.

                    - In the example I gave, even with a cheap 3080 and, say, a 20-30x better perf/dollar ratio for GPUs... is it still worth it? It's not like my servers spend 100% of their CPU calculating euclidean distance. The workload is diverse: nginx, dns, database, javascript. There is a bit of heavy computation, but it's not 100% of the workload. In order for GPGPU to pay for itself it would need to take over a large portion of the load, which in the general case is not possible. So, I would take the GPU into consideration if it were 200x-1000x better per dollar than the CPU; then I could make a strong financial argument.

                    The point I was trying to make is that GPUs are a good fit for only a small fraction of computing workloads. For them to make sense:

                    - more problems would need to fit on them

                    - or the performance/dollar would need to improve further by orders of magnitude

      • adrian_b 2 years ago

        Consumer GPUs can be used only for computations with single-precision floating-point numbers or lower precision.

        Their native double-precision computation speed is typically lower than that of the CPUs with which they are paired.

        They are great for machine learning or graphics, but bad for CAD or physical simulations.

        As a workaround for the non-support of double precision, it is possible to rewrite many algorithms to use a mixed precision or to implement higher precision operations with double-single, triple-single or quadruple-single numbers.
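
        For instance, a "double-single" add, where a value is carried as an unevaluated sum of two floats, looks roughly like this (a sketch built from the classic TwoSum/FastTwoSum steps; it relies on strict IEEE float arithmetic, so unsafe-math compiler flags would break it):

            struct dsfloat { float hi, lo; };        // value = hi + lo, |lo| small relative to hi

            __device__ dsfloat ds_add(dsfloat a, dsfloat b) {
                float s = a.hi + b.hi;               // TwoSum: s + e == a.hi + b.hi exactly
                float v = s - a.hi;
                float e = (a.hi - (s - v)) + (b.hi - v);
                e += a.lo + b.lo;                    // fold in the low-order parts
                float hi = s + e;                    // FastTwoSum renormalization
                float lo = e - (hi - s);
                return dsfloat{hi, lo};
            }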

        However, this greatly increases the complexity of porting most computations to a consumer GPU, so it is not normally worthwhile when compared to the alternative of just using a fast CPU.

        More than a decade ago, NVIDIA has produced a huge amount of propaganda that presented CUDA as the future of all computing, which will soon replace all CPUs for any task requiring high throughput.

        Nevertheless, a few years later NVIDIA has killed all that future lavishly presented by them, with their pricing and market segmentation strategies.

        Datacenter GPUs have 2 to 4 times higher throughput than CPUs at equal power consumption, so they are the best choice for supercomputers or large datacenters.

        On the other hand, the throughput per dollar of datacenter GPUs is now lower than that of CPUs, even though 5 years ago the relationship was reversed. Meanwhile, the price of CPU cores has been lowered by competition, while the prices of datacenter GPUs have been raised by both NVIDIA and AMD.

        So datacenter GPUs are completely useless for small companies or individuals, for whom the acquisition price may be more important than the electricity expenses.

        Consumer GPUs require too much effort to use in any application for which single precision is not good enough, so they remain great for the applications suited to them, but not good for many others.

        What is annoying is that the crippling of consumer GPUs started by NVIDIA cannot be justified as a cost-saving measure; it is just a pure market segmentation method.

        Reducing the double-precision throughput from 1/2 of the single-precision throughput to 1/8 of it would already have provided most of the area reduction that can be obtained by removing the support for double precision completely. A DP throughput of 1/8 would still have been competitive with CPUs and usable for some problems.

        Reducing the DP throughput much further, to 1/32 of the SP throughput, like NVIDIA did, was just intended to make DP useless, except for algorithm testing purposes, in order to push NVIDIA's customers towards the overpriced datacenter products.

        AMD has also followed the NVIDIA strategy, replacing GCN, which had provided the last consumer GPUs with good DP capabilities, with the market-segmented RDNA and CDNA GPUs.

        • dahart 2 years ago

          I can’t think of a chip maker that doesn’t segment their products; I think you’re giving Nvidia too much credit by making it sound like they invented the practice. Same goes for suggesting AMD isn’t thinking for themselves, right? If doubles were a competitive advantage, certainly they’d jump on it, no? It seems like they do have a bit higher average double-to-single ratio. But maybe the real problem is that double precision just isn’t in that much demand, especially for GPUs with less memory? Maybe the market is doing the segmenting and not the manufacturers?

          BTW, didn’t prices come down in the last gen? Consumer did, and I don’t know about data center products, but you certainly can get newer GPUs with more mem and more single & double perf for less money than what @majke quoted for GV100. If you want a consumer GPU with DP perf, maybe the Titan V is what you want?

          Here’s a naive question: what is the expected ratio of double to single flops if you build the double precision hardware as much as possible out of single precision math units? So I mean given a certain single precision perf level, what double precision perf level do you get if you devote the absolute minimum amount of chip area to DP specific logic? I’m no hardware architecture expert, so maybe the question is weird, but I would speculate wildly that multipliers might be the limiting factor and that it would take 4 single precision multiplies to get 1 double precision result. So I’m wondering if a 1/4 ratio is the most ‘natural’ in some sense? Adds alone I could see being 1/2, but multiplies and other math, especially transcendentals seem like they’d bring down the average.

          • adrian_b 2 years ago

            If you design a CPU or GPU with only double-precision arithmetic units, those units can also be used for single-precision operations, providing double the throughput for them.

            This is what is done both in CPUs and in GPUs like the AMD CDNA GPUs and the top NVIDIA datacenter GPUs, e.g. the A100 or V100.

            However, a double-precision multiplier has 4 times the area of a single-precision multiplier, not 2 times. So replacing a part of the DP ALUs with SP ALUs will save area, due to the smaller multipliers.

            Exactly how much area is saved depends on what percentage of the total GPU area is occupied by the multipliers.

            Let's suppose that the multipliers occupy 32% of the area, even though this number must be a great exaggeration: besides the multipliers there are many other circuits in an ALU, and there are many control circuits and a large number of registers associated with each ALU. Therefore the real percentage of area occupied by multipliers must be many times less than this.

            Under this assumption, removing all support for DP would save about 16% of the area, reducing the DP throughput to 1/4 would save about 8%, to 1/8 about 12%, to 1/16 about 14% and to 1/32 about 15%.

            As I have said, the real savings must be much smaller, but in approximately the same ratios.
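
            The same numbers, spelled out as a throwaway calculation under exactly the assumptions above (32% of the die in multipliers, one DP multiplier having the area of 4 SP multipliers and being usable as 2 SP multipliers):

                #include <cstdio>

                int main() {
                    const double mult_frac = 0.32;   // assumed fraction of the die spent on multipliers
                    // DP:SP throughput ratios, from full DP support (1/2) down to none (0)
                    const double ratios[] = {0.5, 0.25, 0.125, 0.0625, 0.03125, 0.0};
                    for (double r : ratios) {
                        // keep (r / 0.5) of the DP multipliers; each replaced one saves half of its area
                        double saved = mult_frac * 0.5 * (1.0 - r / 0.5);
                        printf("DP = %-7g x SP  ->  ~%4.1f%% of the die area saved\n", r, saved * 100);
                    }
                    return 0;
                }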

            I agree that everybody does market segmentation, but that does not make it less annoying, especially because I always fall into the smaller segment of customers who are forced to pay excessive prices for features which would in fact be useful to many more customers, who are simply not aware of what they should demand when buying something.

            In the case of NVIDIA, their market segmentation strategy is much more annoying than in the case of other companies, because I remember very clearly a large number of NVIDIA presentations from approximately 2006 to 2010, in which they described a very different company future, about how they would provide great computing power for the masses.

            Then they realized that there are more efficient ways to make money, so, starting with their Fermi GPUs, they implemented rigid market segmentation policies.

            While I understand their reasons, I strongly despise company liars.

            • dahart 2 years ago

              > there are many control circuits and a large number of registers associated with each ALU. Therefore the real percentage of area occupied by multipliers must be many times less than this.

              Might be true, but I don’t see how that follows necessarily, if every DP multiplier costs as much as 64 logic or add operations, right? Even in single precision, multipliers are much larger than other units. Multiplying by 4 only amplifies that and makes the ratio bigger.

              > Under this assumption, removing all support for DP would save about 16% of the area

              You lost me there- if we’re talking about a fixed number of multipliers reduced from double to single precision, and the double multipliers are 32% of the area, then because double is 4x the area, as you pointed out, the savings would be 24% not 16%, right? In other words, a 32% area for N double precision multipliers would take only 8% area for N single precision multipliers, no?

              > the real savings must be much less

              Could be true, but I’m not convinced yet, I don’t find this compelling. DP registers are double the size. If you want the same cache hit rates, the cache needs to be double the size. The buses are 2x, all of the math is at least 2x, and multiplies are 4x. GPUs are devoted to math, and double precision takes twice the memory and twice the bandwidth to maintain the same number of ops throughput, so it’s not clear why the area assumption shouldn’t start north of 2x - double as a baseline and a bit higher due to multipliers.

              > I remember very clearly […] a very different company […] they will provide great computing power for the masses.

              I’m curious now what you imagined happening back then. Is double precision your only or primary hang up? What does providing great computing power to the masses mean to you? The AMD 6800xt and Nvidia 3080 both are an insane amount of compute for an insanely cheap price compared to what was available in 2010.

              BTW how certain are you about needing double precision? Are you deep in error analysis and certain of your requirements, do you know where single precision breaks down in your work? Some people do, I don’t doubt you could be one of them, just curious.

              • adrian_b 2 years ago

                > "You lost me there- if we’re talking about a fixed number of multipliers reduced from double to single precision, and the double multipliers are 32% of the area, then because double is 4x the area, as you pointed out, the savings would be 24% not 16%, right?"

                No.

                The design with only DP multipliers uses them as N DP multipliers or as 2N SP multipliers. If the DP support is removed completely, an otherwise unchanged GPU will remain with 2N SP multipliers, which have half of the original area, not a quarter.

                Therefore if the DP multipliers occupy P% of the area, removing the DP support completely saves (P/2)% of the area, while reducing the DP throughput to 1/4 of SP throughput saves (P/4)% of the area as half of the DP multipliers are replaced by twice as many SP multipliers, to keep the SP throughput unchanged.

                Reducing the DP throughput to less than 1/4 of SP throughput produces various savings intermediate between (P/4)% and (P/2)%.

                Also, a 64-bit multiplier (actually the DP multiplier is only a 53-bit multiplier) is significantly less than 64 times larger than an adder, because the adders that compose the multiplier are much simpler than a complete adder (the chain of operations is organized in such a way that there are far fewer modulo-2 sums and carry propagations than when naively adding 64 partial products with complete adders).

                I have already said that there are ways to use the single-precision consumer GPUs, either by rewriting the algorithms to use a carefully chosen mix of single-precision operations and double precision operations, or by representing numbers by multiple single-precision values (which already reduces the speed at least 10 times, making only the most expensive consumer GPUs faster than typical CPUs, but which is still faster than the native 1/32 speed).

                However using such methods may require 10 times or 100 times more effort for writing a program than simply writing it in double-precision for CPUs, so this is seldom worthwhile.

                For almost any problem in engineering design or physical-systems modeling and simulation, double-precision is mandatory.

                Single-precision numbers are perfectly adequate for representing all input and output values, because their precision and range matches those available in digital-analog and analog-digital converters.

                On the other hand, most intermediate values in all computations must be in double precision. Not only is the loss of precision a problem; the range of representable values is also a problem. With single precision, there are many problems where overflows or underflows are guaranteed to happen, while no such things happen in double precision.

                In theory, it is possible to avoid overflows and underflows by using various scale factors, adjusted to prevent the appearance of out-of-range results.

                However, this is an idiotic method, because floating-point numbers were invented precisely to avoid the tedious operations with scale factors that are needed when using fixed-point numbers. If you have to manage scale factors in software, then you might as well use only integer operations, as floating-point numbers bring no simplification in such a case.

                There are many other such pieces of advice for how to use SP instead of DP, which are frequently inapplicable.

                For example, there is the theory that one should solve a system of equations first approximately in SP, and then refine the approximate solution iteratively in DP, to get the right solution.
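
                For reference, the basic recipe looks something like this (just a sketch; solve_sp and residual_dp are stand-ins for whatever single-precision solver and double-precision residual computation you have, not real library calls):

                    #include <vector>

                    // Hypothetical helpers (placeholders, not a real API):
                    void solve_sp(const double* A, const double* b, double* x, int n);   // approximate solve, done in SP (e.g. on the GPU)
                    void residual_dp(const double* A, const double* x, const double* b,
                                     double* r, int n);                                  // r = b - A*x, computed in DP

                    // mixed-precision iterative refinement for A*x = b (sketch only)
                    void solve_refined(const double* A, const double* b, double* x, int n, int iters) {
                        solve_sp(A, b, x, n);                        // cheap SP first guess
                        std::vector<double> r(n), dx(n);
                        for (int k = 0; k < iters; ++k) {
                            residual_dp(A, x, b, r.data(), n);       // DP residual
                            solve_sp(A, r.data(), dx.data(), n);     // SP solve for the correction
                            for (int i = 0; i < n; ++i)
                                x[i] += dx[i];                       // DP update
                        }
                    }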

                There are some very simple, mostly linear problems where this method works. However, many interesting engineering problems, e.g. all simulations of electronic circuits, have systems of equations obtained by the discretization of stiff non-linear differential equations. Trying to solve such systems approximately in SP usually either fails to converge or produces solutions which, when refined in DP, converge towards different solutions than those that would have been obtained if the system had been solved in DP from the beginning.

                In conclusion, even if single-precision may be used successfully, very rarely can that be done by just changing the variable types in a program. In most cases, a lot of work is necessary, to ensure an acceptable precision of the results.

                In most cases, I do not see any advantage in doing extra work and paying for GPUs, just because the GPU maker is not willing to sell me a better GPU at a price difference proportional to the difference in manufacturing cost.

                Instead of that, I prefer to pay more for a faster CPU and skip the unnecessary work required for using GPUs.

                I still have a few GPUs from the old days, when DP computation on GPUs was cheap (i.e. around $500 per double-precision teraflop/s), but they have become increasingly obsolete in comparison with modern CPUs and GPUs; no replacement for them has appeared in recent years, and no similar GPU models are expected in the future.

                • dahart 2 years ago

                  > The design with only DP multipliers uses them as N DP multipliers or as 2N SP multipliers.

                  Right. You’re answering a different question than what I asked.

                  Why is it 2N single multipliers when the area for DP is 4x? Your premise seems to be missing a factor of 2 somewhere.

                  • adrian_b 2 years ago

                    One DP multiplier has approximately the area of 4 SP multipliers, therefore twice the area of 2 SP multipliers.

                    One DP multiplier, by reconfiguring its internal and external connections, can function as either 1 DP multiplier or as 2 SP multipliers. Therefore a GPU using only DP multipliers, which does N DP multiplications per clock cycle, will also do 2N SP multiplications per clock cycle, like all modern CPUs.

                    For example, a Ryzen 9 5900X CPU does either 192 SP multiplications per cycle or 96 DP multiplications per cycle, and an old AMD Hawaii GPU does either 2560 SP multiplications per clock cycle or 1280 DP multiplications per clock cycle.

                    When you do not want DP multiplications, the dual-function DP/SP multiplier must be replaced by two SP multipliers, to keep the same SP throughput, so that the only difference between the 2 designs is the possibility or impossibility of doing DP operations. In that case the 2 SP multipliers together have half of the area needed by a DP multiplier with the same SP throughput.

                    If you would compare 2 designs having different SP throughputs, then there would be other differences between the 2 designs than the support for DP operations, so the comparison would be meaningless.

                    When all DP multipliers are replaced by 2 SP multipliers each, you save half of the area previously occupied by multipliers, and the DP throughput becomes 0.

                    When only a part of the DP multipliers are replaced by 2 SP multipliers each, the SP throughput remains the same, but the DP throughput is reduced. In that case the area saved is less than half of the original area and proportional with the number of DP multipliers that are replaced with 2 SP multipliers each.

                    • dahart 2 years ago

                      I understand your assumptions now. You’re saying SP mult is by convention twice the flops for half the area, and I was talking about same flops for one fourth the area. It’s a choice. There might be a current convention, but regardless, the sum total is a factor of 4 cost for each double precision mult op compared to single. Frame it how you like, divvy the cost up different ways, the DP mult cost is still 4x SP. Aaaanyway... that does answer my question & confirm what I thought, thank you for explaining and clarifying the convention.

  • nemothekid 2 years ago

    The number one issue is hardware, particularly memory latency. Unless you have a lot of compute that excels on the GPU (like a gorillion matrix multiplications) relative to the amount of memory you have to move, the execution time is dominated by memory transfer.

    Recently I had a task where I wanted to compute just cosine similarities between two vectors. For a couple hundred thousand floats my code spent something like ~1ms on CPU and ~25ms on the GPU. The GPU didn't start winning until I got to the millions of floats. For my use case a better solution was just taking advantage of a SIMD library.
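
    For a sense of scale, the GPU version of that kind of task is shaped roughly like this (an illustrative sketch, not the actual code from that project). The two device_vector constructions are host-to-device copies, and at a few hundred thousand floats the allocations, copies, and launch/sync overhead around a tiny amount of arithmetic are what dominate:

        #include <thrust/device_vector.h>
        #include <thrust/inner_product.h>
        #include <cmath>
        #include <vector>

        // cosine similarity of two host-resident vectors, computed on the GPU via Thrust
        float cosine_sim_gpu(const std::vector<float>& a, const std::vector<float>& b) {
            thrust::device_vector<float> da(a.begin(), a.end());   // host -> device copy
            thrust::device_vector<float> db(b.begin(), b.end());   // host -> device copy
            float ab = thrust::inner_product(da.begin(), da.end(), db.begin(), 0.0f);
            float aa = thrust::inner_product(da.begin(), da.end(), da.begin(), 0.0f);
            float bb = thrust::inner_product(db.begin(), db.end(), db.begin(), 0.0f);
            return ab / std::sqrt(aa * bb);                        // trivial math compared to the transfers
        }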

  • oriolid 2 years ago

    In my experience a big obstacle is that GPU programming is considered really difficult, and normal programmers shouldn't touch it on the employer's time because it will just waste time and maybe produce difficult bugs if they get something done. The other obstacle is that any optimization is considered evil anyway, and since using GPGPU requires some architecture decisions, it's premature optimization, which is extra evil. It's much cheaper to tell the users to buy a faster CPU.

    For machine learning and graphics GPGPU is allowed, because those would be more or less infeasible, not just slower, with just a CPU.

    • torginus 2 years ago

      I think GPGPU is considered difficult because people aren't taught how to do it, and there's very little software support for it outside of esolangs, research projects, single-person GitHub crusades, or vendor-specific stuff like CUDA.

      There's other stuff that's difficult too, like farming out a compute workload to a bunch of unreliable servers through unreliable networks, but there's just so much tooling and expertise going around for it that people do it regularly.

      • oriolid 2 years ago

        If CUDA feels bad, there's this cross-platform API called OpenCL. It's even possible to generate OpenCL from C++ with Boost.Compute, without directly writing kernels, and I wouldn't call C++ an esolang. And if you're fine with nVidia, there's stuff like Thrust and cuBLAS. It's true that it's not taught, but again, optimization isn't interesting to computer scientists, software engineers are taught it's evil, and physicists are supposed to just read a book and get to work.

        I think distributed computing is OK because it again enables things that would be impossible with a single computer, no matter how huge.

        • pjmlp 2 years ago

          With OpenCL you get a great printf debugging experience in C and you get to play compiler by shipping source code and compiling it on the fly; meanwhile, on the CUDA side, you get graphical debugging of kernel code, support for standard C++17, and the ability to ship bytecode.

          • oriolid 2 years ago

            It took a while to figure out how the comment was related to the discussion. But I never said that CUDA would actually be difficult, just that I've never met a manager who would not turn down a programmer's suggestion to do calculations on the GPU.

    • dahart 2 years ago

      > The other obstacle is that any optimization is considered evil anyway

      What? By whom? With all due respect to whoever told you that, no, no, and definitely no. Are you referring to Knuth’s famous quote, or something else? Knuth was advocating for optimization, not against it. He did tangentially and jokingly warn against premature optimization when writing software, but either way, neither using a GPU nor making architectural decisions counts as premature optimization unless you do it without thinking about your problem at all (and in that case, using a CPU is also premature optimization). Using a GPU is no different from using a faster CPU as a hardware solution to a compute problem, if you are doing SIMD-compatible work, and the GPU will be lots less expensive in terms of flops. Plus using the GPU they already have is loads cheaper for the users than buying a faster CPU.

      > because they would be more or less infeasible, not just slower, with just CPU.

      That’s a strange way to frame it. It’s infeasible because it’s slower, so much slower people can’t tolerate it. This is precisely why people should and do look to GPGPU for math and compute intensive applications.

      • oriolid 2 years ago

        I'm not expecting anyone to know what Knuth actually wrote, but the comment about premature optimization is quoted a lot out of context, usually with "premature" left out. Whether it was a joke or not, it's now taken as gospel.

        I find it really odd that you have never run into this attitude. I've experienced it in almost every job I've ever had. I'm fairly sure that Scrum also considers thinking about your problem at all wrong, because it's not a user story.

        > Plus using the GPU they already have is loads cheaper for the users than buying a faster CPU.

        Actual quote from a former CTO: "The client likes that our software has so high hardware requirements, because it makes them think it does something really difficult"

        > It’s infeasible because it’s slower, so much slower people can’t tolerate it.

        Yes, this is exactly what I meant.

        • dahart 2 years ago

          > I find it really odd that you have never run into this attitude.

          I didn’t say that, don’t make assumptions. I’ve run into people misquoting Knuth a lot; above, it seemed like you were one of them. If you know it’s a misquote, please refrain from forwarding misinformation. The argument that a lot of people misuse the quote or misunderstand it is in no way compelling as a reason to believe or to spread the wrong version.

          I haven’t heard a lot of people saying any and all optimization is bad, even after profiling. That’s just completely and utterly silly, and now I know you agree and know it too. Our job as programmers is partly to help other programmers and managers and customers see their choices more clearly, not to just amplify their misconceptions, right?

          Your CTO quote about how to trick customers isn’t relevant to what we were discussing, and on top of that it effectively supports the idea of using GPUs, albeit for the wrong reasons.

          • oriolid 2 years ago

            > I didn’t say that, don’t make assumptions.

            In that case I don't really understand why you had to ask who says that.

            > I haven’t heard a lot of people saying any and all optimization is bad, even after profiling. That’s just completely and utterly silly and now I know you agree and know it too.

            After profiling is too soon. The right time is when you're in deep trouble and can't explain your way out any more. Again, not really my opinion but I've encountered this a bit too many times.

            • dahart 2 years ago

              > The right time is when you’re in deep trouble

              I mean, I don’t agree with that and neither does Knuth, and it sounds like neither do you, but hey, it’s not up to me to tell anyone other than my own team how to write code or run a business. There are always going to be dev jobs that just don’t care about performance until they’re in trouble and wasting money and time. If what you’re saying is that some dev shops don’t want to consider GPU programming because they don’t care about performance at all, and it would be hard to propose CUDA as a solution there due to the prevailing attitudes, then yeah I agree that’s not the place to learn CUDA. I’d have to agree that’s an obstacle to learning CUDA, but that’s not really CUDA’s fault or a general obstacle to using GPUs, it’s just the wrong place & time. Better to find a job at a company that cares about performance, right? There are lots of them.

              FWIW, your phrasing makes it sound like you do hold this opinion, which is why I asked about who believes this. You’re stating it both here and above first as though it’s a general fact before later qualifying that it’s someone else’s belief and slightly distancing yourself. I still can’t tell where you really land, but hopefully we’re more violently agreeing than disagreeing. All I’m saying is it would be doing your peers and HN both a service to challenge misinterpretations and misunderstandings of what Knuth was trying to get across: that performance matters (and also that efficient use of your time matters too).

              • oriolid 2 years ago

                > There are lots of them.

                I need names. Especially those that don't require a long history of GPU programming or PhD in a related field to even get an interview. Bonus points if they're not a startup that is about to fail and is desperate to hire anyone who wants work on cool stuff for free while it lasts. Even better if they have a business model that is not cryptocurrency, HFT or just hoping to get acquired.

                Yes, I'm more than a bit disillusioned with the field. We could do much better if there hadn't been some people who made a lot of money on the "move fast and break things" or "nobody got fired for buying X" attitudes. I was trying to communicate those things as if they were commonly accepted attitudes but not really true. I think I failed. Sarcasm never works on the Internet.

                • dahart 2 years ago

                  I see, I hear you. Well, all the FAANG companies have high performance dev groups that dabble with GPU. Games companies all do GPU, plus any team doing graphics. Neural network jobs at AI companies are around but more likely to require the history & PhD, and land in the startup category. First place I did any GPU work for my job was WebGL at LucidChart (a web app competitor to Visio). AMD, Apple, Sony, Nvidia and anyone else making GPU or SIMD units have tons of jobs and you can likely start with something close to what you currently do and transition toward more high performance programming. I dunno if that’s at all helpful, but sorry about the disillusionment, I know that can be difficult to deal with.

Const-me 2 years ago

Nice presentation.

None of what was said is specific to CUDA. It’s the same thing in DirectCompute HLSL. Only the terminology is different: CUDA thread blocks are called thread groups in DirectCompute, warps are called waves, shared memory is called “group shared”, etc.

Probably applies to others too (Vulkan, Metal, etc.), though I never used those for HPC-like stuff, so I’m not 100% sure.
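
For anyone mapping between the two, the correspondence is nearly mechanical. Here’s a trivial CUDA kernel annotated with rough DirectCompute/HLSL equivalents (the HLSL side would declare [numthreads(256,1,1)] and a groupshared array; this sketch assumes 256-thread blocks and skips tail handling, purely for illustration):

    __global__ void scale(float* data, float s) {        // ~ a compute shader entry point
        __shared__ float tile[256];                       // ~ groupshared float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // ~ SV_DispatchThreadID.x
        tile[threadIdx.x] = data[i] * s;                  // threadIdx ~ SV_GroupThreadID
        __syncthreads();                                  // ~ GroupMemoryBarrierWithGroupSync()
        data[i] = tile[threadIdx.x];                      // blockIdx ~ SV_GroupID
    }
    // launched as scale<<<numBlocks, 256>>>(d_data, s)   // ~ ID3D11DeviceContext::Dispatch(numGroups, 1, 1)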

flakiness 2 years ago

It's fun to see how they use semi-classic supercomputers like Cray and Hitachi machines as the reference scale, although it doesn't help with grasping the speed at all! To be useful (at the price of being less cool), the comparison could probably be replaced with today's cheap computers, like lower-end phones or Chromebooks.

Well, I know it isn't worth it, as the deck is more about encouragement and excitement than about detailed understanding :-)

I love how NVIDIA folks love computers.

mike_hearn 2 years ago

This is really a very good talk, kudos (cudos?) to the speaker. It really drives home the core point about the high level resource management issues you'll face when using a GPU and how you have to plan your algorithms so carefully to ensure they'll fit.

upbeat_general 2 years ago

Does anyone know of good tutorials for CUDA programming incl debugging? Especially one that also covers the recommended way to bind w/PyTorch?

I've always found writing CUDA kernels to be a bit unapproachable.

  • synergy20 2 years ago

    I kind of feel there are two different levels of sw development for CUDA.

    1. CUDA-level programming: graphics processing etc., or writing a C/C++ library for the PyTorch/TensorFlow frameworks.

    2. PyTorch/TensorFlow-level coding (e.g. training a model): you just use their CUDA-specific APIs and the two frameworks handle the rest under the hood; no CUDA-specific coding details (point 1 above) are required from the user, as far as I can tell.

    If you're interested in 1, Nvidia has a C++ guide you can download; if you're interested in 2, the focus is on the AI framework instead of CUDA.

  • serial_dev 2 years ago

    CUDA by Example helped me a lot learning about general purpose GPU programming. It doesn't cover Python, PyTorch at all though.

cbrogrammer 2 years ago

Writing CUDA by hand is like writing x86 assembly by hand; why would anyone ever do that willingly and not for fun?

  • pjmlp 2 years ago

    So now standard C++17 or modern Fortran is like writing Assembly by hand?

    • cbrogrammer 2 years ago

      > is **like** writing x86 assembly by hand

      yes absolutely, way too granular and low level

      • w1nk 2 years ago

        What? CUDA is intentionally granular and low level; why do you feel that is a negative thing at this level of abstraction? Are you suggesting the tools should be better so that doesn't have to be the case? I can't figure out what you're actually trying to express here.

        • cbrogrammer 2 years ago

          CUDA is absolutely fine, just like x86 assembly is absolutely fine. They are well thought out. What I am saying is that, from my perspective, they are comparably granular and difficult, and programming directly in them for even small projects is not a good idea. GPU programming is, at the moment, stuck in a 1960s way of thinking, as there are multiple people writing CUDA code by hand.

          • w1nk 2 years ago

            There's nothing 1960s about it, that's just not well reasoned. These computation constructs/tools just don't currently have better abstractions while maintaining the desired computational performance.

            It's a strange intersection of needs where one wants or needs something like CUDA, but doesn't care to ensure their computation is actually running optimally. If you don't want to be bothered with control and granularity, why are you trying to write high performance CUDA code in the first place?

            Would you mind elaborating as to what your hobby project was?

      • pjmlp 2 years ago

        Now C++17 with NVidia's parallel libraries is too granular and low level; what a joke.

        Yes, one can make use of the Assembly-like low-level APIs, which, unlike with OpenCL, are opt-in, not forced upon everyone.

        Clever devs use the high level C++ and Fortran NVidia libraries.