Optimizing code on an MMU-less processor versus an MMU- or even NUMA-capable processor is vastly different.
The fact that the author achieves only a 3 to 6 times speedup on a processor running at a frequency 857 times higher should have led to the conclusion that old optimization tricks are awfully slow on modern architectures.
To be fair, execution pipeline optimization still works the same, but not taking into account the different layers of cache, the way memory management works, and even how and when actual RAM is queried will only lead to suboptimal code.
Are we intentionally ignoring that ABAP is byte code interpreted?
Seems like you've got it backwards — and that makes it so much worse. ^_^
I ported from ABAP to Z80. Modern enterprise SAP system → 1976 processor.
The Z80 version is almost as fast as the "enterprise-grade" ABAP original. On my 7MHz ZX Spectrum clone, it's neck-and-neck. On the Agon Light 2, it'll probably win.
Think about that: 45-year-old hardware competing with modern SAP infrastructure on computational tasks.
This isn't "old tricks don't work on new hardware." This is "new software is so bloated that Paleolithic hardware can keep up." (but even this is nonsense - ABAP is not designed for this task =)
The story has no moral, it is just for fun.
But if you ported the ABAP to a static language it would be significantly faster than both
That Z80 code is not the equivalent of the modern code though, is it?
for example your modern code mentions 64KB lookup table.. no way you can port this to Z80 which has 64KB of address space total, shared for input, output, cache and code.
So what do those timings mean? Are those just made-up numbers for the sake of the narrative?
Input and output are in a separate address space on the Z80. It's on the 6502 where they share space with code and data.
Oh, that makes a lot more sense! I was puzzled as to how the new hardware could be so slow, but an inefficient interpreter easily explains it. I've seen over 1000× slowdowns from assembly to bash, so it sounds like ABAP is close to bash.
It's interesting how this stuff works. I started learning "tech" as a millennial in the late 90s when the internet and web were relatively new.
As such I picked up an awful lot of networking/windows/linux "fundamentals" simply because you had to know it to fix anything (trying - and failing - to get my crappy 28k winmodem working on Linux probably taught me many "months" worth of fundamentals alone!).
The other thing is when you are younger learning this stuff, most of us had pretty hard financial constraints on what we were doing. I couldn't persuade my parents to replace that winmodem with a proper modem (which I would do without thinking now), so you really had to make do with what you had.
One of the best lessons I learned was during a hard financial constraint.
Roughly around 1994 I had a new computer - a 486/66MHz with 4MB of RAM. I got Linux and installed it, and was able to run X windows, g++, emacs, and xterm - but if I compiled while emacs was running, the system would page like crazy (especially obvious in those days when hard drives were very noisy).
I had to work really hard to convince myself to pay the $200 (as an undergraduate, I had many other things I would have preferred to spend money on) to double the RAM to 8MB, and then another $200 to 16MB a year later, and finally a last $200 to max out the RAM at 32MB.
Once the system had 32MB of RAM, it performed quite well, with minimal paging, and it greatly increased my productivity. I learned that while RAM can be expensive, making sure your processor is not waiting for disk is worth it.
I probably also spent $1,000s of dollars on modem upgrades (1200->2400, 2400->9600, 9600->19200, 19200->48000, 48000->56K and then switching to DSL and later fiber). Each time was "worth it" but it was expensive, and so I really thought hard about the upgrade and the value it brought me (a high level of job opportunities in areas I find interesting).
>... if I compiled while emacs was running, the system would page like crazy
The good old days when "eight megs and constantly swapping" was a real issue. I kind of miss them. (But not the modem speeds. Don't miss those at all.)
Now we have EGACS, which is called Electron, and soon ETACS, which will be called something ending with GPT.
That Electron is an acceptable way to write desktop applications is a big reason I miss those old days.
Imagine an era where a software-defined whatever was deemed "improper". Today we have software-defined radios that are clearly superior; you had one and wanted it in hardware.
From the article: "Lookup tables are always faster than calculation" - is that true? I'd think that while that may have been the case in the distant past, today, with memory being much slower than the CPU, the picture is different. If you're calculating a very expensive function over a small domain, so the lookup fits in L1 cache, then I can see it would be faster, but you can do a lot of calculating in the time needed for a single main memory access.
You will need to first sit and ballpark, and then sit and benchmark, and discover your ballpark was probably wrong anyhow:-)
Some (for me) useful pointers to that regard for both:
1. https://www.agner.org/optimize/instruction_tables.pdf - an extremely nice resource on micro architectural impacts of instructions
2. https://llvm.org/docs/CommandGuide/llvm-mca.html - tooling from the LLVM project that lets you see some of these effects on real machine code
3. https://www.intel.com/content/www/us/en/developer/articles/t... - shows you whether the above is matching reality (besides the CPU alone, more often than not your bottleneck is actually memory accesses; at least on the first access which wasn’t triggered by a hardware prefetcher or a hint to it). On Linux it would be staring at “perf top” results.
So, the answer is as is very often - “it depends”.
A few more links for low level CPU benchmarking
1 - https://www.uops.info/index.html similar content to Agner's tables
2 - https://reflexive.space/zen2-ibs/ how to capture per micro op data on AMD >= Zen 1 CPUs
I agree on "it depends". And usually not only on your actual code and data, but also how you arrange it over cache lines, what other code on the same core/complex/system is doing to your view of the cache and some other internal CPU features like prefetchers or branch predictors.
...and we always circle back to "premature optimization is the root of all evil", since processors are a wee bit more intelligent with our instructions than we thought. :)
Not that intelligent. If you have two loads and one store per cycle, then that’s it. (Recall that we have SSDs with 14 GB/s sequential reads now, yet CPU clocks are below 6 GHz.) Most of the computational power of a high-performance CPU is in the vector part; still the CPU won’t try to exploit it if you don’t, and the compiler will try but outside of the simplest cases won’t succeed. (Most of the computational power of a high-performance computer is in the GPU, but I haven’t gotten that deep yet.)
I don’t mean to say that inefficient solutions are unacceptable; they have their place. I do mean to say that, for example, for software running on end-users’ computers (including phones), the programmer is hardly entitled to judge the efficiency on the scale of the entire machine—the entire machine does not belong to them.
> We should forget about small inefficiences, say 97% of the time; premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
D. E. Knuth (1974), “Structured Programming with go to Statements”, ACM Comput. Surv. 6(4).
You are right, but with a good optimizing compiler and out of order execution, your code will not work the way you guess most of the time, even though it accomplishes what you want.
On the other hand, while doing high performance compute, the processor will try to act smart to keep everything saturated. As a result, you still need to look at cache thrash ratio, IPC, retirement ratio, etc. to see whether you are using the system at its peak performance; the CPU is doing its part to keep the numbers high, but that's not enough of course. You have to do your own part and write good code.
In these cases where you share the machine (which can be a cluster node or a mobile phone), maximizing this performance is again beneficial since it allows smoother operation both for your and other users' code in general. Trying to saturate the system with your process is a completely different thing, but you don't have to do that to have nice and performant code.
GPU computation is nice, and you can do big things fast, but it's not suitable for optimizing and offloading every kind of task, and even if the task is suitable for the GPU, the scale of the computation still matters, because a competent programmer can fit in billions of computations on the CPU before a GPU even starts running your kernel. The overhead is just too big.
Knuth's full quote doesn't actually invalidate me, because that's how I operate while writing code, designed for high performance or not.
And so the line continues
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Wait until you learn the Tomasulo algorithm is not a magic wizard box for fast code, but that you can also write code that meshes well with it and get even more performance. Your CPU can run up to 5ish instructions every cycle - are you giving it instructions in suitably matched groups to maximize parallelism?
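To make that concrete, here is a minimal C sketch (an illustration, not code from the article) of those "suitably matched groups": a reduction with one accumulator versus four independent accumulators, so the out-of-order core has parallel chains to retire each cycle.

    #include <stddef.h>

    /* One accumulator: every add depends on the previous one,
       so the loop runs at roughly the latency of a single add. */
    double sum_single(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent accumulators: four dependency chains the
       scheduler can interleave, typically noticeably faster. */
    double sum_unrolled(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++) s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }

(Floating-point reassociation changes the rounding slightly, which is exactly why compilers won't do this transformation for you without something like -ffast-math.)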
Premature optimization may be the root of all evil, but timely optimization can give you insane benchmark numbers, at least. Not enough people pay attention to the word "premature" and they think it's about all optimization. Sadly, I've never worked on anything where it would have been timely. Haven't even used SIMD. Just watched other people's talks about it.
Some general principles are widely applicable: modern processors love to work on big linear SIMD-aligned arrays of data. Hence you have the structure-of-arrays pattern and so on. While a CPU is still the best tool you have for complex branching, you should avoid complex branching as much as possible. In Z80-era architectures it was the reverse - following a pointer or a branch was free, which is why they invented things like linked lists and virtual functions.
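A tiny illustration of the structure-of-arrays point, with made-up field names (a sketch, not code from the article):

    #include <stddef.h>

    #define N 1024

    /* Array of structures: x, y, z of one particle sit together,
       so touching only x strides through memory. */
    struct ParticleAoS { float x, y, z; };

    /* Structure of arrays: each field is one linear, unit-stride
       array - what SIMD units and prefetchers like best. */
    struct ParticlesSoA { float x[N], y[N], z[N]; };

    void advance_x(struct ParticlesSoA *p, const float *vx, float dt) {
        for (size_t i = 0; i < N; i++)
            p->x[i] += vx[i] * dt;   /* contiguous, trivially vectorizable */
    }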
I have written a boundary element method based material simulation code in my Ph.D., and had time and opportunity to apply both good design practices and some informed coding such as minimum and biased branching (so the code runs unimpeded for most of the time), lockless algorithms and atomics in parallel code where applicable. Moreover I looked for hotspots with callgrind, and checked for leaks with valgrind.
As a result, I was handling two 3000x3000 (dense, double-precision floating point) matrices and some vectors under ~250MB of RAM consumption, and getting in and out of the whole integration routine in under a minute. The whole evaluation took a bit more than a minute on last decade's hardware.
Perf numbers came back great: ~5 IPC, ~20% cache thrash, and a completely and efficiently saturated FPU and memory bandwidth.
Result? The PoC ran in 30 minutes; my code was done in under 2 minutes for most problems. The throughput was 1.7 million complete Gaussian integrations per second, per core. We have an adaptive approach which takes many partial integrations for one complete integration [0], so it amounted to a higher number of integrations (IIRC it was ~5 mil/sec/core, but my memory is hazy).
The code I have written doesn't call for SIMD explicitly, but Eigen [1] most probably does for its matrix routines. Nevertheless, the core "secret sauce" was not the formulae but how they're implemented in a way that minds the processor's taste. Adding "-march=native -mtune=native -O3 -G0" gave a 3x performance boost in some steps of the code.
Currently slowly reimplementing this in a tighter form, so let's see how it goes.
So, also considering what I do for a job, I know how nuanced Knuth's quote is, but people are people, and I'm not the one who likes to toot his own horn 24/7/365.
I just wanted to share this to confirm, validate and support your claim, and to continue the conversation if you are interested in more of this.
[0]: https://journals.tubitak.gov.tr/elektrik/vol29/iss2/45/
[1]: https://eigen.tuxfamily.org/
> Lookup tables are always faster than calculation - is that true?
Maybe on the Z80. Contemporary RAM was quite fast compared to it, by our sad standards.
A table lookup per byte will see you hit a pretty hard limit of about 1 cycle per byte on all x86 CPUs of the last decade. If you’re doing a state machine or a multistage table[1] where the next table index depends on both the next byte and the previous table value, you’ll be lucky to see half that. Outracing your SSD[2] you’re not, with this approach.
If instead you can load a 64-bit chunk (or several!) at a time, you’ll have quite a bit of leeway to do some computation to it before you’re losing to the lookup table, especially considering you’ve got fast shifts and even multiplies (another difference from the Z80). And if you’re doing 128- or 256-bit vectors, you’ve got even more compute budget—but you’re also going to spend a good portion of it just shuffling the right bytes into the right positions. Ultimately, though, one of your best tools there is going to be ... an instruction that does 16 resp. 32 lookups in a 16-entry table at a time[3].
So no, if you want to be fast on longer pieces of data, in-memory tables are not your friend. On smaller data, with just a couple of lookups, they could be[4]. In any case, you need to be thinking about your code’s performance in detail for these things to matter—I can’t think of a situation where “prefer a lookup table” is a useful heuristic. “Consider a lookup table” (then measure), maybe.
[1] https://www.unicode.org/versions/latest/ch05.pdf
[2] https://lemire.me/en/talk/perfsummit2020/
[3] http://0x80.pl/notesen/2008-05-24-sse-popcount.html
[4] https://tia.mat.br/posts/2014/06/23/integer_to_string_conver...
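As a concrete (hypothetical) illustration of the byte-table-versus-wide-computation tradeoff described above: a per-byte table walk next to a 64-bit SWAR popcount. On recent x86 the second typically wins comfortably, and a real program would simply use the POPCNT instruction via __builtin_popcountll.

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t popcnt_table[256];   /* filled once with the popcount of each byte */

    /* Byte-at-a-time table walk: one load per input byte. */
    unsigned popcount_table_walk(const uint8_t *p, size_t n) {
        unsigned c = 0;
        for (size_t i = 0; i < n; i++)
            c += popcnt_table[p[i]];
        return c;
    }

    /* 64 bits at a time with shifts, masks and one multiply - no lookups. */
    unsigned popcount64_swar(uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ULL);
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
        return (unsigned)((x * 0x0101010101010101ULL) >> 56);
    }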
On the Z80 any memory access had a fixed cost of 3 clock cycles (in reality the memory system could inject wait cycles, but that was an esoteric case). Together with the instruction fetch of 4 clock cycles the fastest instruction to load an 8-bit value from an address that's already in a 16-bit register (like LD A,(HL)) takes 7 clock cycles.
The fastest instructions that didn't access memory (like adding two 8-bit registers) were 4 clock cycles, so there's really not much room to beat a memory access with computation.
Today "it depends", I still use lookup tables in some places in my home computer emulators, but only after benchmarking showed that the table is actually slightly faster.
Depends on the hardware and what you are making with that hardware. Some processors can do complicated things stupidly fast (e.g. when SIMD done right), and for some hardware platforms, a mundane operation can be very costly since they are designed for other things primarily.
My favorite story is an embedded processor which I forgot its ISA. The gist was, there was a time budget, and doing a normal memory access would consume 90% of that budget alone. The trick was to use the obscure DMA engine to pump data into the processor caches asynchronously. This way, moving data was only ~4% of the same budget, and they have beaten their performance targets by a large margin.
I've run into this on PowerQUICC network processors. It was so handy having the packets (or at least the headers) dropped straight into the cache.
> Lookup tables are always faster than calculation - is that true
No. A simple counter example: a single ADD will be faster than a lookup table on nearly anything.
However I doubt that is what is meant. For complex calculations there are a lot of it-depends and trade-offs. A lookup table will often force you to think about trade-offs, because the table takes up a lot more memory and so you need to decide which values are important. A lookup table is also prone to bugs - back in the 1990s someone noticed that the Intel Pentium processor didn't give the right results for division; it turned out they hadn't entered a few values into the table correctly - if you write a table you could have the same bug.
Calculating sin() to as many decimal places as your highest-precision floating point register allows will be slow, but that is likely what the sin built into your standard library does, since you might be building a bridge that the person who wrote that sin function crosses later. If you only need sin rounded to the nearest whole number, a lookup table is probably faster. If you need sin as precise as the computer can calculate, that is a lot of RAM (x86 uses 80 bits internally for floating point numbers).
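For instance, a coarse sine table of the kind described above might look like this (a sketch with an arbitrary 256-entry resolution; accuracy is limited by the table, not the FPU):

    #include <math.h>

    #define STEPS 256                    /* one full period in 256 slots */
    static float sin_table[STEPS];

    void init_sin_table(void) {
        for (int i = 0; i < STEPS; i++)
            sin_table[i] = (float)sin(2.0 * M_PI * i / STEPS);
    }

    /* Fast but coarse: one wrap, one scale, one load. */
    float sin_lookup(float radians) {
        float turns = radians / (2.0f * (float)M_PI);
        turns -= floorf(turns);                      /* wrap into [0, 1) */
        return sin_table[(int)(turns * STEPS) & (STEPS - 1)];
    }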
> No. A simple counter example: a single ADD will be faster than a lookup table on nearly anything.
Note that a round of AES is now one aesenc instruction on modern systems.
You might be surprised how much better computation is than memory lookups. Modern AMD Zen5 cores have 8 instruction pipelines but only 3 load/store pipelines.
You have more AVX512 throughput on modern Zen5 cores (4x Vector pipelines) than L1 throughput.
I'd go as far as to say that table lookups are the worst they've ever been in terms of compute speed. The reason modern encryption/hashing got so fast is that XChaCha and SHA3 are add/xor/rotate based rather than lookup-based (S-box based, like AES or DES).
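For reference, this is what "add/xor/rotate" looks like in practice - the standard ChaCha quarter-round (my transcription, as a sketch): no tables, no data-dependent loads, just ALU operations the core can schedule freely.

    #include <stdint.h>

    static inline uint32_t rotl32(uint32_t x, int r) {
        return (x << r) | (x >> (32 - r));
    }

    /* ChaCha quarter-round: the whole cipher is built from this. */
    static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
        *a += *b; *d ^= *a; *d = rotl32(*d, 16);
        *c += *d; *b ^= *c; *b = rotl32(*b, 12);
        *a += *b; *d ^= *a; *d = rotl32(*d,  8);
        *c += *d; *b ^= *c; *b = rotl32(*b,  7);
    }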
Tables are still appropriate for some operations, but really prefer calculations if at all possible. Doubly so if you are entering GPU code where you get another magnitude more compute without much memory bandwidth improvements.
Oh, if you need the best of both worlds, consider pshufb (4-bit lookup table), or if you have access to AVX512 you could use vpermi2b as an effective 7-bit lookup table.
It's not quite a full memory lookup table but these instructions get a lookup-like behavior but using the vector units (128-bit or 512-bit registers).
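A minimal sketch of that pshufb trick - the classic SSSE3 per-byte popcount (the intrinsic names are real; the surrounding function is made up for illustration):

    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

    /* Per-byte popcount of 16 bytes at once, using pshufb as a
       16-entry lookup table held in a register. */
    __m128i popcount_bytes(__m128i v) {
        const __m128i nibble_counts = _mm_setr_epi8(  /* popcount of 0..15 */
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
        const __m128i low_mask = _mm_set1_epi8(0x0F);
        __m128i lo = _mm_and_si128(v, low_mask);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
        return _mm_add_epi8(_mm_shuffle_epi8(nibble_counts, lo),
                            _mm_shuffle_epi8(nibble_counts, hi));
    }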
In this case, the lookup table is used for popcount, and there's a comment in the Z80 assembly that says "One lookup vs eight bit tests." If the code could make use of a hardware popcount instruction, the lookup table would lose, but if that isn't available, a 256-byte lookup table could be faster. So it's less "lookup tables are always faster" and more "lookup tables can be faster, this is one such case."
Probably the fastest popcount on the Z80 that does not use an aligned table would be to shift A through the carry flag, then conditionally INC C, unrolled - still slower than ld l,a : ld b,(hl).
The article does mention cache friendly access patterns in the same context.
But yes, you're right. Back when I started with optimizations in the mid 90s, memory _latencies_ were fairly minor compared to complex instructions, so most things that weren't additions (and multiplications on the Pentium) would be faster from a lookup table. Over time, memory latencies grew and grew as clock speeds and other improvements made the distance to physical memory an actual factor, and lookup tables became less useful compared to recomputing things.
Still today there are things expensive enough that they fit in a lookup table small enough not to get evicted from cache during computation, but they're few.
> Lookup tables are always faster than calculation - is that true?
I know it's not always true on the Nintendo 64, because it shared a single bus between the RAM and "GPU": https://youtu.be/t_rzYnXEQlE?t=94, https://youtu.be/Ca1hHC2EctY?t=827
Basically, if the thing you are doing can be precomputed, then you can use a lookup table. At that point: do you have the space for it? Is the time to get it out of memory less than the cost of the operation you are caching in the lookup table? Then it could be a good candidate for a lookup table. Many times that can be true, but not always. Even if that is all true, you can end up hurting something else because your lookup table evicted something else important from the L1/L2 cache. So you also have to test it in context.
You are correct, and I've even run into a situation where build-time evaluation was slower than runtime calculation, thanks to code size.
I'm a sometime CPU architect and came here to argue just this - modern CPUs have far, far slower memory access (in clocks) than Z80 memory access. To be fair, you can probably fit any Z80 table you're using into modern L1 cache, but even so you're looking at multiple clocks rather than 1.
It's not true
Writing style and length of paragraphs strongly suggest that this is AI generated in full.
Not yet in full, it is ~45% generated.
Two reasons: 1) English is not my native tongue, 2) I hate LinkedIn article style -> let LLM convert my hardcore-oldschool style into something like "You won't believe what my grandmother's cat taught me about ..."
What’s wrong with hardcore-oldschool style? Why on earth would you want LinkedIn style?
You can use LLMs to check grammar and style and manually go through each suggested correction; it will take more time, but your readers will appreciate the effort and your own voice and style will be there. People get upset by LLM-generated content not because they are against LLMs, but because of the awkward style that is very recognizable and leaves you wondering "If the author couldn't be bothered to write it, why should I bother to read it?"
Some folks get very annoyed by the whiff of LLMs, and will complain loudly. I guess people who aren’t annoyed have no reason to post about it, though.
Perhaps if you post your oldschool notes and their LLMified version and see which rises to the top, haha.
The AI-written text is junk. It's not doing anybody any good. The reader of your article wants to know who you are and what you think. They can't judge this when your ideas are covered in a layer of AI-generated slime.
I would rather read a non-native speaker write something hardcore-oldschool than read LLM-generated "You won't believe what my grandmother's cat taught me about..." You may get corrections about your English. You may even get complaints (not everyone on HN is nice all the time). But my impression is that you will get fewer complaints than you will from something that "feels" LLM-generated. (Or maybe that's just my personal taste.)
It is considered socially acceptable to just straightforwardly complain about an LLM-written post, while complaints about somebody’s English are usually phrased as constructive feedback. The latter takes more effort and is not as effective as a launching point for a rant, so the former might just be easier to detect based on lines-of-text.
Absolutely right. Kinda sad that this format is frowned upon now. I like data-rich articles without SEO fluff.
It was fun to read though, it flows oddly but it works. Great case study of appropriate AI use!
The Z80 example in https://github.com/oisee/zvdb-z80/blob/master/ZVDB-Z80-ABAP.... doesn't look correct.
That:
Is bugging me because 1. ix is not used in the function and 2. operations on ix are limited (and slow); it can't be used as an "accumulator". It is a 16-bit register for accessing memory as an index (and as two extra, limited 8-bit registers if you use undocumented Z80 opcodes).
Besides, why use the b register as a counter in the loop and use dec and jr when there's djnz for that?
I haven't checked anything else, but that has a bad smell.
Using IX instead of HL or B even isn't going to break the bank. This code will still beat e.g. the output of SDCC on some normal C code.
It's amusing that the writing style is akin to a LinkedIn "what XYZ taught me about B2B sales" post.
Guilty as charged—and totally intentional.
Turns out "I want a burger" beats "we have equivalent burger at home" every time—even when the home solution is objectively better.
So yes, this reads like "What my goldfish taught me about microservices." But unlike those posts, this story has no moral—just nerdy fun with enterprise software roasting.
Sometimes you gotta speak the language they actually read there +)).
I'm sure this is interesting stuff but the obvious ChatGPT voice only distracts from your own thoughts.
Please don't forget to put a license in your repo. Based on the zvdb linked to in the readme I'm guessing you prefer MIT but explicit is always better than leaving the audience wondering
Saying you ported SAP makes no sense. You ported a program you wrote for the SAP platform.
Porting all of SAP would be equivalent to porting all known human cancers from the human genome to a different newly discovered alien genome, both in terms of scale as well as sheer evil.
It really makes me want to flag this submission because of how grossly misleading such a title is, despite it being correctly copied from the upstream source
Do I note a hint of LLM writing?
I have only one question: does the author know anything about coding ABAP like it's a Z80? I wish that they'd addressed this.
yes, it is addressed in another repo https://github.com/oisee/zvdb
and another article about zvdb-abap from ~1.5 years ago.
My comment was a reaction to the fact that you called out your experience/wisdom on the topic no less than five distinct times in a short (otherwise great!) post.
I would argue that by doing so, you made it the unintended central theme of the post.
I'm not saying this to hurt your feelings. That you didn't perceive the obvious sarcasm in my late-night comment suggests that you might not want to be perceived as pompous in your writing style.
My suggestion is that one explicit "I'm an expert" reminder per post is a perfectly good number.
oh!, now I got it =) will increase "Shy" parameter and decrease "Braggart" parameters in my "Normal speech" to "LinkedInese" translator +)
Thank you for feedback =)
(I was thinking that it is not MY wisdom I am talking about, but "Z80" wisdom - all the code I have seen in the demoscene and e-zines.)
Yes, this reinforces that the article suffered from the LLM rewrite. Your comments here prove your natural voice is more compelling for this audience than the LLM "cat post" style.
That's a new one for me! What is a cat post?
Earlier, @oisee said 'I hate LinkedIn article style -> let LLM convert my hardcore-oldschool style into something like "You won't believe what my grandmother's cat taught me about ..."'
I am (definitely) not telling you to act shy or not be proud of your accomplishments.
Ultimately, it's up to you if you want to redeclare your bona fides in almost every paragraph. It doesn't make for great reading, but it's your call either way.
If anything, I'm encouraging you to let your narrative and actions speak for your skills.
oh, I was trying to understand what you were talking about and where this impression was coming from, and then realised that the "article" md on GitHub was out of sync with the other versions. tnx!
Nice recursive joke!
Haha, actually interesting post, but absolutely top-grade clickbait action here.
Subtitle is better as title
> Or: I built a vector database for SAP, then ported it to a 1976 processor—for science fun
But probably the clickbait is why the post made it to the top.
Entertaining person: https://www.youtube.com/watch?v=3ULSvfYAmq0