That pattern of doing something for the tail (and possibly also the head, if you're on a platform where aligned loads are preferred or required) of your data is very common in SIMD linear algebra. It's so common that my default way of writing those things is:
1. Write a function implementing the happy-path SIMD
2. (eventually) Write a function implementing the cleanup code
3. (eventually) Implement the final result by appropriately calling into (1) and (2)
That gives you a few benefits. Notably, every problem in TFA (other than the inlining limit) magically goes away. Also, in the vast majority of problems, your cleanup code is itself a valid solution to the problem and also easy to understand, so for free you have a correct implementation to fuzz your fast implementation against (a testing pattern I highly recommend with all optimized code).
Most importantly though, 99 times out of 100 I actually don't need (2) or (3). You can push the padding and alignment constraints all the way to the top level of your program, and the rest of your code can work on slices of vectors instead of slices of floats (or whatever other safe representation you want to choose). The resulting code is much easier to maintain, and the compiled binary is usually smaller and faster.
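For concreteness, here is a minimal sketch of that three-step structure, assuming AVX and using a trivial scale-in-place kernel as a stand-in for whatever the real work is (the function names are made up for illustration):

```cpp
#include <immintrin.h>
#include <cstddef>

// (1) Happy path: assumes len is a multiple of 8 floats.
void scale_simd(float* data, std::size_t len, float k) {
    const __m256 vk = _mm256_set1_ps(k);
    for (std::size_t i = 0; i < len; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, vk));
    }
}

// (2) Cleanup: a plain scalar loop. It is also a correct standalone
// implementation, which is what makes it a convenient reference to fuzz
// the fast path against.
void scale_scalar(float* data, std::size_t len, float k) {
    for (std::size_t i = 0; i < len; ++i)
        data[i] *= k;
}

// (3) Driver: hand the vector-width-aligned prefix to (1) and the tail to (2).
void scale(float* data, std::size_t len, float k) {
    const std::size_t main_len = len - (len % 8);
    scale_simd(data, main_len, k);
    scale_scalar(data + main_len, len - main_len, k);
}
```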
This is actually a great way to write code in general. Write the happy function. Then write various wrappers to handle all the ugly checks and transforms to get it on the happy path.
So (2) in this case would be the scalar version of the function that takes a starting pointer, row stride, channel stride, and a row_length and channel_length. By setting the strides to 1 and lengths to max you effectively tell that function the whole matrix is leftovers and it just does the whole computation. Correct?
That's not the only option, but it's very reasonable and almost certainly the thing I'd start with, maybe the thing I'd finish with. You then pass an appropriate sub-matrix to this thing which does a whole-matrix computation, and that's the entire "add another loop" trick from TFA.
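To illustrate the shape of that scalar fallback, a hypothetical sketch (the strides and lengths describe an arbitrary sub-matrix, so describing the full matrix makes it do the whole computation on its own; the per-element scale is just a placeholder for the real kernel):

```cpp
#include <cstddef>

// Hypothetical scalar fallback over a strided sub-matrix.
void process_scalar(float* base,
                    std::size_t row_stride, std::size_t channel_stride,
                    std::size_t row_length, std::size_t channel_length,
                    float k) {
    for (std::size_t c = 0; c < channel_length; ++c)
        for (std::size_t r = 0; r < row_length; ++r)
            base[c * channel_stride + r * row_stride] *= k;
}
```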
Yeah, got it. Thanks for the explanation
In C++, I would write a single template covering both paths and let the compiler generate a separate function for each case, so the ifs and boundary checks are removed at compile time in the fast one. The template would take a parameter like "bool boundary", and the caller picks the instantiation it needs. It's more maintainable.
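A rough sketch of what I mean, assuming C++17 `if constexpr` (the boundary check disappears entirely from the `boundary = false` instantiation; names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>

// One template, two generated functions: boundary = false is the happy path
// with the edge check compiled out, boundary = true keeps it.
template <bool boundary>
void process_tile(float* data, [[maybe_unused]] std::size_t remaining, float k) {
    constexpr std::size_t TILE = 8;
    std::size_t n = TILE;
    if constexpr (boundary)
        n = std::min(TILE, remaining);  // only the boundary version pays for this
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= k;
}

// Callers choose explicitly:
//   process_tile<false>(p, remaining, k);  // interior tiles
//   process_tile<true>(p, remaining, k);   // last, possibly partial, tile
```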
> and adding a single line of code can completely change the way your program is compiled without you even noticing
You cannot work on SIMD micro-optimizations without assessing the final compiler codegen. Compilers can easily undo your optimization effort once you move your code from an isolated test case into the actual binary that gets released.
Otherwise, profilers don't really help much if you don't know what you are looking for. And to know what to look for, you really need to understand how the CPU internals work. There's no way around it. Switch to another CPU and you may find that your optimization scales differently.
The phrase used by the author "because the CPU can't predict more than one branch per cycle" is misleading.
Most existing CPUs can execute at least 2 conditional jumps a.k.a. branches per clock cycle, if they are predicted as not taken.
Such CPUs can execute only one conditional jump per clock cycle when it is predicted as taken, which also means that within a single clock cycle they can speculatively execute only instructions beyond a single conditional jump predicted as taken.
So in most CPUs at least 2 conditional jumps can be predicted per clock cycle, but whether they can also be executed depends on the result of the prediction.
As mentioned in an endnote of the article, the latest generation of CPUs from multiple companies, including AMD Zen 5, has gained the ability to execute 2 branches predicted as taken in a single clock cycle (and to speculatively execute instructions past the second taken branch).
Because of this difference between taken and not taken branches, it is always good for a compiler or for a human programmer to generate code such that all the conditional jumps (except those that terminate loops) are statically predicted as not taken. Both "if ... then ..." and "if ... then ... else ..." conditional statements have alternative implementations that can ensure that all conditional jumps are statically predicted as not taken. For "switch" a.k.a. "select" a.k.a. "case" instructions it is always possible to reorder the tested cases such that all the conditional jumps are statically predicted as not taken.
Unfortunately, in most programming languages it is not possible to provide a hint for the compiler on whether a tested condition is expected to be true or false. Such a hint could be easily provided in a programming language that has both "if" and "unless" conditional statements or expressions, where "if" could mean that the conditional statement is expected to be not executed and "unless" could mean that the conditional statement is expected to be executed (for the "else" or "otherwise" branch the expectation would be opposite).
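For what it's worth, some compilers do accept such a hint as an extension or attribute even where the language itself has no if/unless pair; a small sketch of the kind of hint meant above, using GCC/Clang's __builtin_expect and the C++20 [[unlikely]] attribute:

```cpp
#include <cstdio>

// Hint that the error path is rare, so the compiler can lay out the common
// path as the fall-through (statically predicted not-taken) case.
void check(int err) {
    if (__builtin_expect(err != 0, 0)) {   // "expect false": error is rare
        std::fprintf(stderr, "error %d\n", err);
        return;
    }
    // common path continues straight through, with no taken branch
}

// C++20 spelling of the same intent:
void check20(int err) {
    if (err != 0) [[unlikely]] {
        std::fprintf(stderr, "error %d\n", err);
        return;
    }
}
```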
Does a variance of 5 microseconds, with a min of 38 milliseconds and a max of 47 milliseconds over 40 samples, seem like a valid statistic?
In general, 40 samples would certainly not be enough for the experiment to have statistical significance, but perhaps in this case, where the code being tested is very tight (e.g. just several tens of instructions), it may be enough. The code in this example is testing the CPU's compute capabilities, and the only source of variance is the CPU internals themselves.
> This blog post is mostly written from memory, since I didn't keep around every version of the code being discussed.
How do you track these optimization attempts, given that you don't commit many dead-end changes? CS programs often don't teach how to experiment, which is a bummer, as there are supposed to be established ways to do it in other scientific disciplines. Maybe you could use something like a Jupyter notebook? But what if the code is in, say, Rust and lives outside the Jupyter kernel?
You integrate the benchmarking process into the commit action (or vice versa) and simply commit everything that successfully completes the benchmark.
Might be an hour of additional effort at the beginning of the project, but will prove to be insightful at the end.
How about changes that don't make the numbers better? Just create a dead-end branch?
I will just plug https://www.computerenhance.com/ which goes into great depth on all of this stuff. I highly recommend it, I learned a lot.
In other words, READ THE ASSEMBLY!
The assembly can mislead you. I have seen complex, long assembly that even does memory reads deliver 2.5x the performance of "simple" assembly that would have been 4-5 lines.
Benchmarking is the only thing you can really do in some cases.
> Benchmarking is the only thing you can really do in some cases.
I think the complete opposite! Benchmarking is very difficult to get right, and in some situations you can't really make a benchmark that tests what you're interested in improving. Reading assembly can be objective if you know or look up the latency/throughput of each instruction (or have a tool provide it), and you can also use loop throughput analyzers (as the other commenter mentioned) that try to predict the typical throughput of a given loop.
IMO if you get used to looking at assembly it becomes obvious in the majority of cases whether there's performance left on the table.
And "simplicity" does not equal "faster". Although for cold code, reducing its impact on i$ and d$ of the rest of the system is probably smart, so sometimes speed is not the only factor.
> the assembly can mislead you
If the only metric you look at is the number of instructions, then yes, but that is the wrong way of looking at this type of problem. One can write 50 lines of heavily optimized SIMD code that outperforms the compiler-generated 10 lines by 5x.
This is an edge case. Disassembly may not tell the whole story, but it is the definitive answer to what the compiler thinks about your code. Given enough knowledge, it is not difficult to account for cache/memory accesses, see where store-to-load forwarding may or may not kick in, consider optimal scheduling of independent operations, check whether the code is too branch-heavy, etc.
And once you're done, you can do another round with tools which can do more detailed analysis like https://uica.uops.info
I do this when writing performance-sensitive C# all the time (it's very easy to quickly get compiler disassembly) and it saves me a lot of time.
I'm not too sure what the author thinks they did wrong! Perhaps implementing the first attempt before profiling, but... would it have made a difference in this scenario?
Not using uProf (or VTune for Intel users) seems like a misstep; it can give you the branch prediction success rate, though that is arguably more a lack of knowledge/experience than a mistake.
they're cool tools, I encourage anyone to mess with them
Branch prediction rates wouldn't have helped here. The problem is that even though the branch is predicted correctly ~99% of the time, it drops you to an IPC of 1 because you can't speculate past a 2nd branch.
I think that would be reported as some kind of stall? I can't say I'm an expert either
I think vtune and toplev will report that as a frontend stall in their topdown metrics.