The elephant-in-the-room contrast: Apple has been shipping this for years, whereas Intel _might_ ship theirs in a product later this year.
> Note that these instructions are neither documented nor supported by Apple.
What does that mean? Someone pointed out below that Apple ships the Accelerate framework as a higher-level, supported mechanism to use these instructions?
Intel/AMD have always been good at documenting most of their stuff, so perhaps we will see properly supported ones whenever Intel ships it.
I think they mean that using the raw instructions is unsupported and undocumented; officially you have to use the libraries as the interface to them.
Though in reality nobody is stopping you from using those undocumented instructions. If Apple’s Accelerate framework can use them, so can you. (Or is it against the EULA?…) There is a slim chance that the behavior of these instructions might change with a software update (if Apple is doing any kind of microcode update), but I kinda doubt it.
They've had something similar off-core with GNA since 2019 10th gen Ice Lake for low-power always-on inference use-cases.
Hasn't Intel been shipping similar instructions for more than a decade?
You’re probably thinking of AVX, Advanced Vector Extensions. That’s a family of vector extensions that has been out for just over a decade (2011). This is about AMX, Advanced Matrix Extensions. AMX is, as the name implies, built around 2D INT8/BF16 matrices (not “1D” FP16/32/64 vectors). It’s not on silicon available to the public.
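If it helps to picture the difference: below is a rough, purely conceptual C sketch of what one vector FMA instruction does versus what one tile/matrix instruction does. No real intrinsics here, and the dimensions (16 lanes, 16x16 int8 tiles accumulating into int32) are illustrative assumptions rather than the actual register or tile limits.

    /* Conceptual sketch only -- plain C, no real intrinsics -- of the shape
     * difference described above: an AVX-style FMA operates on 1D vectors,
     * while an AMX-style tile op accumulates a 2D matrix product.
     * Dimensions are illustrative assumptions, not hardware limits. */
    #include <stdint.h>

    #define LANES 16
    #define TILE  16

    /* Roughly what one vector FMA instruction does: c[i] += a[i] * b[i]. */
    static void vector_fma(float c[LANES], const float a[LANES], const float b[LANES]) {
        for (int i = 0; i < LANES; i++)
            c[i] += a[i] * b[i];
    }

    /* Roughly what one tile/matrix instruction does: C += A * B over whole
     * tiles, with int8 inputs widened into int32 accumulators. */
    static void tile_matmul_acc(int32_t C[TILE][TILE],
                                const int8_t A[TILE][TILE],
                                const int8_t B[TILE][TILE]) {
        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < TILE; j++)
                for (int k = 0; k < TILE; k++)
                    C[i][j] += (int32_t)A[i][k] * (int32_t)B[k][j];
    }

The point is just the shape of the work: the vector op touches one element per lane per instruction, while the tile op does a whole matrix product's worth of multiply-accumulates in one go.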
Also not to be confused with VMX (AltiVec/Velocity Engine), which predates AVX by another decade or so.
To keep the record straight, we have: 1. AMX, 2. AVX, and 3. VMX.
There’s also Intel VMX, the Virtual Machine Extensions ;) Sometimes it’s referred to as VT-x.
Intel's AMX specifically hasn't yet shipped, but will be coming with their new Sapphire Rapids Xeon chips this year.
*Next year (potentially).
Sapphire Rapids might well end up only shipping in Aurora and then being replaced immediately by its successor.
The article would have been much better for me if it drew conclusions about the usefulness of the two (partial) instruction sets.
What can you do easily and what’s hard?
I also expected to read something about relative performance of the two.
The main obvious thing is that Intel simply doesn’t have fp32 or fp64 support. If you primarily care about ML inference, that mostly doesn’t matter to you; if you care about other types of dense computation as well, it matters a lot.
Performance comparison is hard given that the Intel one hasn’t shipped yet.
Yeah, I think Intel's only number so far is "2048 int8 operations/cycle/core" (as opposed to VNNI's 256): https://www.servethehome.com/wp-content/uploads/2021/09/Inte...
Which (assuming 1 multiply-add = 2 operations) is the same int8 operations/cycle as the Apple M1's float16 operations/cycle. Intel's 16-bit operations might be the same rate, but I'd guess half? That'll almost certainly be at a higher clock-speed, and one-per-core rather than one-per-four-P-cores. (And I think Apple might have doubled their throughput in M2. As you said, performance comparison is hard.)
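To make that concrete, here is the same back-of-envelope arithmetic as a tiny C program. The per-cycle figures are the ones from this thread (2048 int8 ops/cycle/core for Intel AMX, 256 for VNNI); the clock speed is a made-up placeholder, since real clocks under AMX load aren't public.

    /* Back-of-envelope throughput math from the comment above.
     * Per-cycle figures come from the thread; the clock speed is an
     * assumption purely to show how "higher clock, one unit per core"
     * plays out numerically -- not a spec. */
    #include <stdio.h>

    int main(void) {
        const double intel_ops_per_cycle = 2048.0;  /* int8 ops/cycle/core (slide) */
        const double vnni_ops_per_cycle  = 256.0;   /* AVX-512 VNNI, for contrast */
        const double assumed_clock_hz    = 3.0e9;   /* assumption, not a spec */

        /* 1 multiply-add counted as 2 operations, as in the comment. */
        const double macs_per_cycle = intel_ops_per_cycle / 2.0;

        printf("Intel AMX: %.0f int8 MACs/cycle/core, ~%.1f TOPS/core at %.1f GHz\n",
               macs_per_cycle,
               intel_ops_per_cycle * assumed_clock_hz / 1e12,
               assumed_clock_hz / 1e9);
        printf("VNNI for comparison: ~%.2f TOPS/core at the same clock\n",
               vnni_ops_per_cycle * assumed_clock_hz / 1e12);
        return 0;
    }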
Seems to be related to https://news.ycombinator.com/item?id=32722510
Is the Apple AMX genlut op approximately the same as =HLOOKUP()?
Can AMX be used in User mode?
Apple provides the Accelerate framework [0] to access these instructions through high-level APIs. It is available in user mode, and I doubt it switches to kernel mode, although I didn't check.
[0]: https://developer.apple.com/documentation/accelerate
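For what it's worth, here is a minimal sketch of the supported path: calling the BLAS interface that Accelerate exposes, from plain user-mode C. Whether any particular call actually gets dispatched onto the AMX units is an undocumented internal detail, so treat that part as an assumption.

    /* Minimal user-mode example of going through Accelerate instead of the
     * raw instructions: an SGEMM via the BLAS interface Accelerate ships.
     * No special privileges or kernel involvement are needed on the
     * caller's side; whether this call is routed onto the AMX units is an
     * internal detail Apple does not document.
     * Build on macOS: clang sgemm_demo.c -framework Accelerate -o sgemm_demo */
    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 4 };
        float A[N * N], B[N * N], C[N * N] = {0};

        for (int i = 0; i < N * N; i++) {
            A[i] = (float)(i % N);
            B[i] = (float)(i / N);
        }

        /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes. */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0f, A, N, B, N, 0.0f, C, N);

        printf("C[0][0] = %f\n", C[0]);
        return 0;
    }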