janwas a year ago

Great in-depth article. One especially interesting data point relevant to ongoing discussions about AVX-512 area cost:

40% of 2.93 mm2 per core is AVX-512, so 1.14 mm2. This is a large fraction, but as the article says the core is basically a minimum wrapper around the vector unit, with rather weak L1i/branch predictor/store buffers.

Let's put that in the context of modern chips. 14nm density was 44.67 MTr/mm2 so that's 50.9 MTr for AVX-512. To compare with 5nm, let's use density of TSMC N5 (138.2 MTr/mm2) to get 0.37 mm2.

So that's about 10% of an Apple M1 Firestorm core to enable 5-10x speedups vs scalar code. Sounds worthwhile to me. We can now stop saying that "AVX-512 is a huge fraction of modern cores", and "give us more cores instead". Let's instead use the hardware we have :)
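
The back-of-envelope scaling above can be sketched in a few lines (a rough sketch using the figures quoted in the comment; published logic-density numbers, and real designs do not port 1:1 across nodes):

```python
# Rough area scaling for KNL's AVX-512 units, using the figures quoted above.
avx512_area_14nm = 1.14   # mm^2: ~39% ("nearly 40%") of the 2.93 mm^2 core
density_14nm = 44.67      # MTr/mm^2, Intel 14nm logic density
density_n5 = 138.2        # MTr/mm^2, TSMC N5 logic density

transistors = avx512_area_14nm * density_14nm   # ~50.9 MTr
avx512_area_n5 = transistors / density_n5       # ~0.37 mm^2 on N5
print(round(transistors, 1), round(avx512_area_n5, 2))
```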

  • celrod a year ago

    Good analysis. It's also worth pointing out that this is for 2x 512-bit FMA, which is more than client Ice Lake/Tiger Lake/Rocket Lake or Zen 4 have.

    Personally, I bought HEDT (Skylake-X and Cascadelake) because I wanted 2x 512 bit AVX512. Glad it's cheap in terms of area, and I'm hoping we'll get more options with great vector performance in the future.

    • janwas a year ago

      Good point about the second FMA. I'm not certain it's the best tradeoff, Genoa only has two half-width FMA.

      I share your hope for more focus on vectors. It's also up to us software devs, CPUs will not invest as heavily if we don't use it.
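
      As a hypothetical back-of-envelope (the clock numbers are mine for illustration, not from the article), peak FLOP/s per core is just FMA units x lanes x 2 x frequency:

```python
# Peak double-precision GFLOP/s per core: an FMA counts as 2 FLOPs (mul + add).
def peak_gflops(fma_units, vector_bits, ghz, elem_bits=64):
    lanes = vector_bits // elem_bits
    return fma_units * lanes * 2 * ghz

knl = peak_gflops(2, 512, 1.5)    # 2x full-width 512-bit FMA at a low clock
genoa = peak_gflops(2, 256, 3.0)  # 2x half-width pipes, double-pumped 512-bit ops
```

      Under these assumed clocks, the double-pumped half-width design lands in the same ballpark per core, which is part of why the tradeoff is debatable.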

      • celrod a year ago

        I do think that Genoa's approach is a reasonable one. I'd like to see one of Gracemont's successors doing the same.

        Maybe we'll even see quadruple pumping for AVX-512 some day? I'll be impressed if/when an Atom line CPU gets 4x 128-bit FMA units to match ARM's Cortex-X line or Apple's Firestorm.

        I think these are good options, and can allow AVX-512 to sort of act like SVE, but with the benefits of a fixed size architecture (i.e., shuffles); compile one set of code and you're able to run it anywhere, with performance dictated by how much the vendor decided was worth investing into the vector units. And AVX512 can still help (like it does Genoa) by taking a lot of pressure off of the front end.

        I'm also trying to do my part as a software dev! I wrote/maintain the JuliaSIMD ecosystem, and working on good loop vectorizers to let people take advantage of their vector units is my passion; LoopVectorization.jl has gotten great results on many benchmarks[0], and I'm rewriting it as an LLVM pass to try and address as many of its flaws and limitations as I can.

        [0] For example, in a simple self-dot-product benchmark, LLVM's 256 bit code is actually faster than its 512 bit code when testing random sizes from 1-256: https://github.com/JuliaSIMD/LoopVectorization.jl/issues/446... However, LoopVectorization.jl's 512 bit code is close to twice as fast as either LLVM's 256 bit or 512 bit code. This is a trivial example; the difference can be much larger for more complicated code.
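
        The effect can be mimicked even in scalar code: independent partial sums break the floating-point add latency chain, which is the same dependency that 512-bit vectorization (8 doubles per register, several accumulators) attacks. A pure-Python stand-in for the idea, not SIMD:

```python
# Dot product with several independent accumulators ("lanes"), plus a scalar
# tail loop for the remainder -- the same shape a vectorized loop has.
def dot_unrolled(a, b, lanes=4):
    acc = [0.0] * lanes
    n = len(a) - len(a) % lanes
    for i in range(0, n, lanes):
        for l in range(lanes):
            acc[l] += a[i + l] * b[i + l]
    total = sum(acc)               # horizontal reduction at the end
    for i in range(n, len(a)):     # tail: elements that don't fill a "vector"
        total += a[i] * b[i]
    return total
```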

        • janwas a year ago

          > Maybe we'll even see quadruple pumping for AVX-512 some day?

          > can allow AVX-512 to sort of act like SVE, but with the benefits of a fixed size architecture (i.e., shuffles); compile one set of code and you're able to run it anywhere, with performance dictated by how much the vendor decided was worth investing into the vector units

          That makes a lot of sense. It's basically the equivalent of RISC-V's LMUL=4, with the big advantage of reducing instruction count as you say. That seems a better route than 4x128, which might actually yield less throughput in practice if there are resource conflicts.

          > I'm also trying to do my part as a software dev! I wrote/maintain the JuliaSIMD ecosystem

          That's awesome, congrats on the good result. Looks like your preference is to allow people to write high-level code without much worry about the arch details. Any thoughts on how we can spread awareness of the basics such as data-oriented programming (avoiding branches, optimizing for cache and contiguous memory accesses)?
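
          (For readers wondering what that means in practice, here is a toy illustration, with made-up data: keep each field in its own contiguous array, and replace per-element branches with arithmetic selects.)

```python
from array import array

# Structure-of-arrays: each field is contiguous, cache- and vectorizer-friendly,
# instead of a list of (x, y) objects scattered through memory.
xs = array('d', [1.0, -2.0, 3.0])
ys = array('d', [0.5, 4.0, -1.0])

# Branchless select: (v > 0.0) evaluates to 0 or 1, so there is no
# per-element 'if' for a branch predictor to mispredict.
def relu(vals):
    return [v * (v > 0.0) for v in vals]
```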

  • phonon a year ago

    Knight’s Landing AVX-512 units run at about 1.5 GHz at peak (even when only a few are in use.)

    The ones in Sapphire Rapids run 2x+ faster. Likely there is more pipelining and many more transistors used to reach those clock speeds.

    • janwas a year ago

      Physical design is outside my area of expertise, but I understand that a large part of the increase may be due to improved fab. 15-20% speed improvement per node would account for the majority of the clock speed gain. Also, with smaller feature sizes comes the ability to trade away some of the density for further increases in the frequency, without actually requiring more transistors.

  • rbanffy a year ago

    > "AVX-512 is a huge fraction of modern cores", and "give us more cores instead". Let's instead use the hardware we have

    If what you do can use a wide SIMD pipeline, then, by all means, get AVX-512.

    If what you want is to have lots of branchy processes running serving different things, then more cores is a better idea.

    I'd suggest the best of both worlds - asymmetric cores. Put a couple that have wide SIMD pipelines, alongside a couple others that don't, but may be smaller and more numerous.

    AVX-512 instructions executed on a non-AVX-512 core trap, and the task gets rescheduled onto an AVX-512 core. Tasks that don't trap continue running wherever the kernel thinks is best.
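
    The mechanism could look something like this toy simulation (entirely hypothetical; a real kernel would hook the invalid-opcode trap, and the task names here are invented):

```python
# Tasks start on small cores; any task that touches AVX-512 "traps" and is
# migrated to a big core. Tasks that never trap stay where they are.
def schedule(tasks):
    placement = {name: "small" for name, _ in tasks}   # kernel's first guess
    for name, uses_avx512 in tasks:
        if uses_avx512:
            placement[name] = "big"    # emulated trap-and-migrate
    return placement
```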

  • ksec a year ago

    Yes. That was part of the plan for Intel. Remember that Intel was supposed to have 7nm by 2019, which is roughly equal to TSMC N5.

  • moonchild a year ago

    > give us more cores instead

    That should be a pretty obvious conclusion--higher-clocked cores are better than wider cores, and wider cores are better than more cores.

    Communication cost between concurrency domains for superscalar/supervector: min <1ns

    Communication cost between concurrency domains for multiprocessors: min 10-100ns

    • janwas a year ago

      :) Obvious or not, one does still occasionally hear "give us more cores instead".

      I agree about the communication cost. Higher clocks are harder - quadratic increase in power. For shared-nothing problems, an array of wimpy (low frequency) cores is pretty good. It seems to me that wider, lower-frequency cores are a good compromise: beefy enough (thanks to vectors) to reduce the number of cores required, while still power-efficient due to both vectors and lower clocks.
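
      A crude model makes the compromise concrete (the assumptions are mine: dynamic power ~ C * V^2 * f with voltage roughly tracking frequency near the top of the curve, and vector width scaling both throughput and power linearly):

```python
def perf(width, ghz):
    return width * ghz            # throughput ~ lanes * clock

def power(width, ghz):
    return width * ghz ** 3       # P ~ C * V^2 * f with V ~ f

# 4-wide core at 2 GHz vs 1-wide core at 4 GHz, per unit of power:
wide_eff = perf(4, 2.0) / power(4, 2.0)   # 0.25
fast_eff = perf(1, 4.0) / power(1, 4.0)   # 0.0625
```

      Under this model the wide, slower core delivers 4x the throughput per watt, which is the compromise described above.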

rektide a year ago

Finally! One of the excellent excellent Chips & Cheese coverages of a really weird chip! Nice!

> Intel spends nearly 40% of the core’s die area to implement wide AVX-512 execution, so Knight’s Landing gets some incredibly high throughput for a low power architecture. In fact, it almost feels like a small OoO core built around feeding giant vector units

Well said!

There's some interesting details here. I love how the article tests SMT1 (aka no SMT), SMT2, and SMT4, and shows how things change. The re-order resources getting subdivided as SMT scales up is obvious but a neat hack, a real neat hack, for repartitioning core resources, on a core (Atom) that wasn't doing much reordering at all at that point.

Also a good reminder that this almost 10 year old chip was using DDR4. I'd forgotten that DDR4 has been around for so so long!

The chip has some really monstrous capabilities. Modern flagship consumer GPUs are hitting the 1 TB/s mark in memory bandwidth. AMD's upcoming Navi 3 has 96MB of L3 cache good for 5.3TB/s. Meanwhile, here's a chip that's 10 years older that has 16GB of onboard DRAM that C&C gets up to 4.5TB/s. Damn y'all. And with huge AVX-512x2 per core, as mentioned in the top quote... this thing was such a wonderful & fascinating beast.

I didn't realize the previous incarnation- which I bought used/hella-cheap a long time ago & wanted to toy with but never did- was P54C based, aka an original-ish Pentium. That's wild.

This article is just so good. On and on, with every little characteristic and quirk. This is like a perfect "Speaker for the Dead", out of Ender's Game. Alas that the world just wasn't awesome enough to put this shit to use at real volume. It's like an ultra-flexible on-the-fly reconfiguring DSP?

  • klauspost a year ago

    Reading it, it seems very similar to the PlayStation 3's Cell, and its fate mirrored the Cell's very closely.

    A highly specialized processor that has very high computational throughput for specialized operations, but a quite limited scalar unit.

    In both cases you really have to write software specifically for it to get it to perform with any reasonable speed. If you do that, you get great value, but any existing code will need significant modifications to perform.

    Granted AVX512 is (now) more common than SPE code ever became. It is slightly better than the Itanium approach, but scalar performance (especially single threaded) will have limited the value you get from this CPU, unless you write your own software and it can utilize AVX-512.

    • rbanffy a year ago

      It was much easier to program, though. The Cell had two different ISAs, one for the PPU and one for the SPUs. The SPUs also didn't have direct access to memory, and the PPU had to manage task assignment and completion, as well as setting up DMA transfers between SPU memory and main memory.

      The Phi was closer to the Sun Niagara family - lots and lots of simple, slow, cores, with the note that, in the case of the Phi, the weakling x86's had mighty SIMD abilities while the Niagara had more or less standard SPARC stuff.

      Neither will have amazing performance unless you have at least as many threads running as you have cores, and most of the time, at least twice as many. For single-threaded code, they were on the slow side.

      Still, I always suggested people use Phis to develop because they'd get a taste of future computers. Nowadays a decent laptop will have half a dozen cores and, unless your mail client has that many threads, it'll not feel as fast as it could be.

Moissanite a year ago

I have a book ("Intel Xeon Phi Processor High Performance Programming") which has sat un-loved on my shelf since 2018 - alas I never got chance to play with one of these things. The last batch was put to good use by an oil and gas company though: https://www.hpcwire.com/2019/03/13/oil-and-gas-supercloud-ta...

  • rbanffy a year ago

    It's a shame mere mortals never got to see the Knights Mill family. Since KNL they could be the main CPU of your workstation and Supermicro even had motherboards that could use it. And Knights Mill added virtualization, which made them a lot more useful for a developer.

    Unfortunately, I never saw used ones on eBay.

mk_stjames a year ago

The Xeon Phi coprocessor pci-e cards with similar CPU can be had on ebay for like $60 these days. Sadly I haven't seen any good writeups on setting up an environment to run such cards in a homebuilt system. I think it would be neat to build a little mini-supercomputer with 8 of them stacked in one of those mining motherboards with 8 pci-e slots. But I'd have no idea where to start with the toolchain needed to compile and run code across them.

  • cpgxiii a year ago

    The only Xeon Phi products to be available in PCIe form are the first-gen KNC cards. These are not Atom-based, they do not share the x86_64 ABI, and the only effective toolchain for them is older versions of the proprietary Intel compiler.

    While they were quirky at the time (and I got some neat simulations out of them), they were a massive pain in the ass in every other way:

    - Requires a motherboard that supports large PCIe BAR addressing; much more common now, not at release time (and good luck getting this working with multiple Xeon Phi coprocessors on a non-server motherboard)

    - You likely need enough host memory to cover each of the cards (so 8GB * number of cards)

    - You'll need to patch the Intel MPSS kernel driver to work with newer kernels

    - You'll need to patch the Intel MPSS tools to work on anything that isn't ~2015 RHEL

    - You'll want to buy the water cooling kit that alphacool made for them; it's the only way to keep temperatures and noise to a reasonable level

    Solve all of those, and you get a (fairly cheap) coprocessor reasonably good at parallel yet branchy tasks. It's much easier to use the second generation of KNL (or third-generation KNM), which share the standard x86_64 ABI, can boot a modern Linux distro, and work with a normal toolchain.

    • rbanffy a year ago

      > or third-generation KNM

      Never found Knights Mill available anywhere. My understanding is that they were only available to integrators in the HPC space.

  • thechao a year ago

    The original cards couldn't thermally throttle safely. Whenever the union guys came to our floor they were required to wear PPE (helmet, goggles, earplugs, earmuffs, and a vest). It was as loud as a high-powered, low-quality vacuum cleaner. Staying within 5' of the towers, when running, without ear protection was an OSHA violation.

  • Y_Y a year ago

    I remember playing with them at the local supercomputer facility. I thought it was so funny that you were expected to ssh into a pcie device.

chazeon a year ago

My colleagues are still using KNL CPUs on Stampede2 to run molecular dynamics simulations. They are just "cheap" in terms of job accounting: unloved because they are old and cannot keep up with today's state-of-the-art AMD CPUs. But they also have a shorter queue.

But it is very interesting to see how these CPUs optimized for high throughput by adopting near-die MCDRAM. This strategy is very similar to what we see in Apple's M1 series chips today.

muziq a year ago

Great read, and always loved these processors.. My first introduction to AVX-512 and the beginnings of a love affair :)

convolvatron a year ago

didn't this whole processor line get cancelled?

  • wmf a year ago

    Yes, they were canceled several years ago.

sorenjan a year ago

I wonder how well this would run a raytracer, aren't those pretty branchy?

  • berkut a year ago

    Fairly well, Intel's Embree raytracing kernels have support for Xeon Phi, although going more than 8-wide with incoherent rays (i.e. pathtracing) starts to require more and more sorting to batch things up for decent utilisation...

    • sorenjan a year ago

      Would you mind giving an easy explanation of the difference between raytracing and pathtracing? I find it a bit confusing.

      • namibj a year ago

        Ray tracing is typically incapable of diffuse interreflections. It goes from the camera and traverses geometry until hitting something, possibly experiencing a few mirror/shiny reflections.

        Path tracing is usually bi-directional (from both light and camera) and uses something like Metropolis-Hastings to sample more efficiently than brute force. It still requires many samples per pixel, but the sampling nature means it can naturally support the random bouncing of diffuse reflections. Between bounces it is typically still ray tracing, though, unless it's some special volumetric material like jade (the green jewel rock), fog, or (the contents of) a glass of milk.

        Compare e.g. POV-Ray to LuxRenderer.
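
        A toy numerical contrast (purely illustrative, no geometry at all): classic ray tracing follows a short deterministic chain of mirror bounces, while path tracing averages many randomly-terminated diffuse bounces per pixel.

```python
import random

def ray_trace(bounces=2, reflectivity=0.9):
    # Deterministic: each mirror bounce attenuates, then we hit the light.
    energy = 1.0
    for _ in range(bounces):
        energy *= reflectivity
    return energy

def path_trace(samples=20000, albedo=0.5, seed=0):
    # Stochastic: each sample bounces diffusely until randomly absorbed,
    # and the pixel value is the average over many samples.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        energy = 1.0
        while rng.random() < albedo:    # survive the bounce...
            energy *= albedo            # ...losing energy to the surface
        total += energy
    return total / samples              # expected value here is 1/(1+albedo)
```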