Raytracing on AMD’s RDNA 2/3, and Nvidia’s Turing and Pascal

100 points by treesciencebot 2 years ago

I suspect the reason the author is seeing very shallow trees for Nvidia might be because the lower levels are done fully behind the scenes:

https://forums.developer.nvidia.com/t/extracting-bvh-from-op...

As someone who deals with BVHs a lot for ray intersection, I find it pretty difficult to believe that leaf nodes with that number of primitives will be anywhere near performant, even with fast dedicated hardware like the RT cores.

It's true that the Nvidia cards have better intersection performance than ray/box tests, but I don't believe it's in the 100x ratio range which I suspect would be needed if the BVHs were that shallow and leaf nodes that large.

frogblast 2 years ago

I strongly suspect the reason Nvidia trees are so shallow is that NSight simply isn't showing the actual tree structure, probably because Nvidia considers that proprietary. It appears to just list all the leafs of a tree in a big flat list. But there definitely is a tree in there.
- graffix 2 years ago
  
  At work I inherited a raytracer codebase with a severe memory bloat problem on terrains. The size of terrain BLASes is precisely what one would expect from a bog-standard BVH with branch factor 2, so I'm sure you're right.
  This is on Turing. Nvidia would've been motivated to de-risk the introduction of RTX by making boring choices. You may well see different results on later archs.
- kevingadd 2 years ago
  
  Perhaps the rest of it isn't a tree and is some other optimized data structure? Like some sort of spatial hash or sort
- sounds 2 years ago
  
  Since cache hit ratios are so central to fast GPU code, the tree structure doesn't have to be exotic, likely the secret sauce is how it performs in the caches.
- Arrath 2 years ago
  
  I'm very curious to see it unrolled down to its actual structure.
TinkersW 2 years ago

Isn't wide BVH how embree works, 1 ray vs SIMD width boxes.. maybe Nvidia is simply doing the same thing but with the wider GPU SIMD(32 I believe).
- berkut 2 years ago
  
  Yes, but normally 4- or 8-wide is the norm: the wider you go the more sorting you have to do to traverse things in order or find the nearest hit which has an overhead (hardware may help with this, but it's still an overhead).
  Previous indications from Nvidia about their BVHs don't seem to show anything about very shallow trees for any of the BVH algorithms that OptiX supports (scroll to bottom for reverse visualisation of a BVH hierarchy on top of the Stanford Bunny model): https://drive.google.com/file/d/1B5fNRFwv2LsGlCBJ8oKYRiiDUtL...

matthewfcarlson 2 years ago

I've often wondered why Nvidia cards are generally so much better at rendering scenes in Blender's cycles renderer (a raytracing engine). The benchmarks on Blender's website are really telling (https://opendata.blender.org/benchmarks/query/?group_by=devi...) by the fact that the only non Nvidia entry on the first page is the AMD 2X EPYC 9654 96-Core.

This really lays out the decisions that Nvidia made compared to AMD and how their approach tends to hide some of the shortcoming of GPUs (latency and utilization).

zokier 2 years ago

That is more of a software (ecosystem) thing. Nvidias CUDA and OptiX are well beyond anything AMD has to offer. In Cycles case, I believe that on Nvidia it is taking good advantage of RT cores while on AMD they are completely unused which has predictable effect on performance. Even ignoring the RT cores I suspect the Nvidia code path is likely far more optimized than AMD one
https://www.phoronix.com/news/AMD-HIP-RT-Blender-3.5-Plans
- dotnet00 2 years ago
  
  Plus on AMD's side we have their inability to commit to fully supporting any specific system long term, limiting open source interest in doing things for them.
  
  Melatonic 2 years ago
  
  I thought AMD was all in on OpenCL?
  
  dotnet00 2 years ago
  
  Nope, their OpenCL support has been kind of stagnant for a while, especially on Windows. On top of that, part of why Blender dropped its OpenCL supporting renderer was that AMD's OpenCL was still pretty buggy, making the renderer a pain to maintain.
  Lately their focus is ROCm and their CUDA equivalent language, but it also has limited official hardware support and AFAIK the Windows SDK for it is still not public.
  Similar commitment issues have plagued their custom renderers.
  
  my123 2 years ago
  
  AMD's OpenCL driver is significantly worse than NVIDIA's since ages afaik...
  
  jjoonathan 2 years ago
  
  Yep, this has been true for many years.
  Hopefully now that AMD has money they can turn things around. As much as it pains me to admit, giving up on OpenCL and focusing on the CUDA clone is probably the correct strategic move. OpenCL is just far too different to facilitate porting, and there's a lot of porting to do before AMD can start to take compute marketshare.
  
  imtringued 2 years ago
  
  It isn't because it further fragments the market.
  Now you have Nvidia CUDA, AMD "CUDA", Intel oneAPI. At least Intel is building their oneAPI on top of OpenCL via SYCL which means all AMD has to do is implement a bunch of Intel's OpenCL extensions, which are far from unreasonable (for example, unified shared memory which should have been in OpenCL 2.x instead of shared virtual memory).

ladberg 2 years ago

Would love to see a more in-depth article on BVH construction itself! I'm decently familiar with the main concepts but have no clue what the current SOTA looks like (is that even public info?).

BVH construction is my favorite question to ask in interviews because there's no single best solution and it mostly relies on mathy heuristics to get a decent tree. You can also always devote more time to making a more optimal tree but there's a tradeoff where it'll eventually take more time than it saves in raytracing.

frogblast 2 years ago

This is the best I've found that covers recent developments:
https://meistdan.github.io/publications/bvh_star/paper.pdf

shmerl 2 years ago

Ray tracing on Linux for CP2077 with 7900 XTX is still barely usable, but it's getting better.

I'd say RDNA 3 is not really giving useful ray tracing on for example 2560x1440 unless you use upscaling to speed it up. May be in a few GPU generations ray tracing will become usable with native resolutions.

newZWhoDis 2 years ago

It’s more than usable today on Nvidia
- shmerl 2 years ago
  
  I doubt it, unless you use upscaling there as well. I'd take higher framerate without RT at native resolution over RT with upscaling.
  
  dvtkrlbs 2 years ago
  
  If you have 40 series card frame generation is really nice. I still have DLSS quality enabled but with DLSS 3, everyting max at 4K I am able to get above 90 fps on a 4090
  
  shmerl 2 years ago
  
  Well, that's the point. How much are you getting without DLSS? Upscaling is kind of counter intuitive if you want to have better visuals from ray racing at the same time.
  When tray tracing will become fast enough without upscaling, then it will be a much better feature.
  
  izacus 2 years ago
  
  No it absolutely isn't counter intuitive - the visual difference is obvious. You should actually try it.
  
  shmerl 2 years ago
  
  I doubt it's obvious in the sense that you can tell in general case that it's not losing image quality due to upscaling. So no, thanks. Upscaling by definition implies image quality loss.
  
  WithinReason 2 years ago
  
  What's wrong with upscaling?
  
  shmerl 2 years ago
  
  Nothing wrong, except it's combining ideas going in opposite directions. Upscaling - reducing image quality as a price for improving performance. Ray tracing - increasing image quality (in regards to lighting and etc.).
  
  WithinReason 2 years ago
  
  I would say they are orthogonal, not in opposite directions.

br1 2 years ago

Interesting that card/drivers customize so much of ray tracing, like rasterization in pre vulkan/metal/d3d12 or even fixed function gpu days.

joanne123 2 years ago

[dead]

sylware 2 years ago

I did not get into the real details yet, but mesa radv pulls that horrible glslang due to some shaders related to acceleration structures.

Personnaly, I am a dev, then I patch to compile out all that (and all the tracers at the same time) since ray tracing has currently a ridiculous ratio benefits/technical costs.

This defeats the very purpose of vulkan spirv: getting rid of those horrible high level shader compilers from the driver stack and keep them contained at the application level.

It seems beyond clumsy, but as I said, I need to get into the details of "why" those shaders in the first place, and then why they are not written directly in RDNA assembly or SPIR-V assembly (that would require an "assembler" coded in simple and plain C).

TazeTSchnitzel 2 years ago

Generating a ray tracing acceleration structure is very complex, who'd want to implement that in assembly language?