marginalia_nu 2 days ago

This is very tangential. I did some work in trinary computer emulation a lifetime ago; there's a cute trick for finding a closed-form translation between division by a power of 3 and a series of bit shifts and additions.

Start by observing that

1/3 - 1/2 = 2/6 - 3/6

or

1/3 = 1/2 - 1/2 (1/3)

Substitute the equation above into its RHS an infinite number of times and find

1/3 = Σ -(-1/2)^N for N in 1..inf

You can do this with arbitrary pairs of powers of 2 and 3 (also other bases).

The implication is that you can fairly easily build a fixed-time divide-by-constant circuit out of nothing but adders and subtractors, for divisors that are close to a power of two.
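
Roughly, as a software sketch of the same idea (a real circuit would evaluate all the terms in parallel, and the final odd term is just one way of dealing with the truncation error):

    #include <assert.h>
    #include <stdint.h>

    /* Sketch only: x/3 as the truncated series x/2 - x/4 + x/8 - ...,
       done in 32.32 fixed point so the shifts don't throw bits away.
       The 33rd (odd) term acts as a rounding correction; without it,
       exact multiples of 3 come out one too low. */
    uint32_t div3_by_shifts(uint32_t x) {
        uint64_t term = (uint64_t)x << 32;  /* x in 32.32 fixed point */
        uint64_t acc = 0;
        for (int n = 1; n <= 33; n++) {
            term >>= 1;                     /* x * (1/2)^n */
            acc += (n & 1) ? term : -term;  /* alternate add / subtract */
        }
        return (uint32_t)(acc >> 32);       /* back to an integer */
    }

    int main(void) {
        for (uint32_t x = 0; x < 100000; x++)
            assert(div3_by_shifts(x) == x / 3);
        return 0;
    }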

  • ngneer a day ago

    Incredible, thanks for sharing. My understanding is that ternary computers would have been based on tri-state logic, which was less reliable than, say, transistors or even vacuum tubes encoding binary state. Is that understanding correct?

phkahler 2 days ago

The Cinematronics arcade game processor has two 12-bit accumulators. The multiply instruction shifts them right as a single 24-bit value and adds the contents of memory if a 1 came out the LSB. So you clear the upper half, load one value in the lower, I forget how you set the memory address for the other operand... and execute several 1-bit multiplies in a row. You can get a 24-bit product this way, but most code I saw had runs of 8 multiplies. The most common thing was a 2x2 matrix multiply to do coordinate rotations for game objects.

This was in the mid 1970s using off the shelf 7400 series parts and had peak throughput of 5MIPS.
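
In case it helps, the technique in software looks roughly like this (a sketch only: 16-bit operands and one bit per loop iteration, where the real hardware held the pair in its two 12-bit accumulators and did the shift and conditional add in a single instruction):

    #include <assert.h>
    #include <stdint.h>

    /* One multiplier bit per step: look at the bit about to fall out of the
       low half, conditionally add the other operand into the high half,
       then shift the whole accumulator pair right by one. */
    uint32_t mul_one_bit_at_a_time(uint16_t a, uint16_t b) {
        uint64_t acc = a;                    /* low 16 bits hold the multiplier */
        for (int i = 0; i < 16; i++) {
            if (acc & 1)
                acc += (uint64_t)b << 16;    /* add into the upper half */
            acc >>= 1;                       /* shift the pair right */
        }
        return (uint32_t)acc;                /* 32-bit product */
    }

    int main(void) {
        assert(mul_one_bit_at_a_time(1234, 5678) == 1234u * 5678u);
        return 0;
    }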

  • greesil 2 days ago

    Not exactly one cycle for a multiply it sounds like. That 5 mips would be eaten up quickly. I have sometimes had to do fixed point math in the past 20 years and have had much respect for those programmers who came before me.

    • phkahler a day ago

      >> Not exactly one cycle for a multiply it sounds like. That 5 mips would be eaten up quickly.

      Sure taking 8 instructions for a multiply is a lot, but microprocessors at that time didn't even have multiply and ran below 1MIPS. I just wanted to bring it up as yet another way someone implemented multiply, and it was pretty effective for its time.

      EDIT: too late to edit my original comment, but it added the memory value to the upper half of the 24-bit result.

      • greesil 16 hours ago

        Aside from a really weird 8051 project in the aughts I've only ever had single cycle multiplies.

philsnow 2 days ago

> you may have heard of techniques such as carry lookahead, Kogge-Stone addition

Just an aside, that is Peter Kogge, who did his PhD work at Stanford, worked on the space shuttle, is an IBM fellow, and invented the first multi-core cpu.

  • atq2119 2 days ago

    > invented the first multi-core cpu

    The man clearly has many deserved achievements to his name, but that remains true without this one, and I'm confident the world would be better without this kind of statement.

    "A multi-core CPU" isn't much of an invention per se. It's an idea -- one that is fairly obvious and trivial at a certain point of semiconductor history. Getting a multi-core CPU to run is not trivial, but it's not a single invention either, and by the time we got there, development teams were so large that it would be downright insulting to claim that one person solved all these problems by himself. Perhaps Kogge led the development of the first multi-core CPU, perhaps he was even visionary in the sense that he pushed for it before others thought it was feasible (I do not know if that is the case). But either way, he didn't invent it.

    • philsnow 2 days ago

      Thank you for keeping me honest, I concede the point; I was quoting from his Wikipedia entry and wasn’t particularly critical because I took an architecture class from him in grad school and liked him as a professor.

    • reaperman a day ago

    This makes me generally curious: among the set of things that could be said to have been truly invented by a single person (or pair/trio/tiny team) ... which inventions are the most complex ... and which are the most technologically advanced?

  • mturmon a day ago

    Peter used to consult/collaborate with my lab. He was a proponent of moving remote sensing computations closer to the sensor ("edge computing" these days).

    You could definitely make the intellectual case for this approach. In cases where there's latency or costs associated with moving data to central computing, this makes sense. (In our case, it was space-based sensors so you could make the case that way.)

    But AFAIK this style of processing was never systematically adopted in any space-based processing system, although many such systems (like radars) have ad hoc data reductions performed in hardware close to the sensor.

    Thanks for providing that connection!

  • nxobject a day ago

    As a double-aside, Peter Kogge wrote a very good early textbook on pipelined microarchitectures that's worth reading if you want to learn how early supercomputer vector processors were designed: The Architecture of Pipelined Computers (1981).

  • oidar 2 days ago

    I thought Kunle Olukotun led the team for the first multi-core CPU.

    • philsnow 2 days ago

      You may absolutely be right, I don’t know who did it “first”.

      I read your comment in exactly this voice https://www.getyarn.io/yarn-clip/6b70f8d0-5706-4e10-a6e9-e61...

      (In this scene, Steve Martin’s character Vinnie is trying to get away from Rick Moranis’s character Barney, and he gets a bunch of actors to pretend to be his family (none of whom speak English) to play on the latter’s sympathy and ask to have his handcuffs removed. Vinnie introduces Barney as, among other things, the inventor of the rotary engine, causing one of the actresses to break character and say “I thought Wankel invented the rotary engine”.)

lewdwig a day ago

Multiplication by 3 is actually a common operation, particularly in address calculations, where a shift and an add means multiplying the index by 3. Implementing this naively would add significant latency. But using this circuitry the LEA (Load Effective Address) instruction can do it in a single cycle, so spending this much transistor budget on it was totally a good call.
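
As a rough illustration of the kind of address arithmetic meant here (assuming an x86-64 target; the exact instruction choice is up to the compiler):

    #include <stdint.h>

    /* Compilers typically fold this into a single LEA such as
       "lea rax, [rdi + rdi*2]" (base + index*2 addressing), computing 3*i
       in one address-generation step; the exact output depends on the
       compiler and target, so check with e.g. gcc -O2 -S. */
    uint64_t scale_index_by_3(uint64_t i) {
        return i + i * 2;
    }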

  • quadhome a day ago

    Is it used for that, though? As I understood the article, this circuit is used as a part of floating-point multiplication.

kens 2 days ago

Author here if anyone has questions...

  • vanderZwan 2 days ago

    What happened to the dedicated "times three" multiplier in later machines? Did some form of it stick around? Did they change tactics to something making it obsolete?

    • phire 2 days ago

      Obsolete.

      You can observe the evolution of multipliers in the MIPS line, which I've been studying, and happen to know.

      The R3000A had the same Radix-8 (aka 3-bit per cycle) integer multiplier in 1989.

      The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.

      The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8-bits per cycle).

      In parallel, MIPS spent a lot of time creating a low power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power saving optimisations [1], [2]. Instead of separate integer and floating point data paths, they created a unified data path that did both, allowing them to use the same Radix-8 multiplier for everything. They even unified the multiplier unit into the main adder, rather than using a separate adder like earlier designs.

      In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath. But it gained the Radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.

      As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.

      In 1996, MIPS released the R10000. Transistors are now cheap enough that they could implement a full 52-bit adder for their floating point data path, allowing them to issue one double precision multiplication every cycle (though it's pipelined with a 2 cycle latency). I assume it's just 52 stacked adders, though it seems like they probably need to be doing something fancier with carries by the time it's that big.

      Most modern CPUs have ended up at the same point. Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

      [1] https://www.youtube.com/watch?v=nll5MWlG7q4 [2] https://ieeexplore.ieee.org/document/282948/
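
      To illustrate the carry-save adders mentioned above: each 3:2 reduction step takes three numbers and produces two with the same total, with nothing rippling across bit positions; only the final sum needs a real carry-propagate adder. A minimal software sketch, not the actual MIPS circuit:

          #include <assert.h>
          #include <stdint.h>

          /* One carry-save (3:2) reduction: three operands in, two out,
             each bit position handled independently, no carry ripple.
             Only the final sum + carry needs a carry-propagate adder. */
          static void csa(uint64_t a, uint64_t b, uint64_t c,
                          uint64_t *sum, uint64_t *carry) {
              *sum   = a ^ b ^ c;                           /* bitwise sum */
              *carry = ((a & b) | (a & c) | (b & c)) << 1;  /* carries, shifted up */
          }

          int main(void) {
              uint64_t s, c;
              csa(123456789, 987654321, 555555555, &s, &c);
              assert(s + c == 123456789ULL + 987654321ULL + 555555555ULL);
              return 0;
          }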

      • atq2119 2 days ago

        It seems you misunderstood the design of the multiplier described in the article.

        The Pentium multiplier isn't 3 bits per cycle, it's one full multiplication per cycle (though pipelined over multiple cycles).

        The part that's 3 bits is the Booth multiplier trick of multiplying 3-bit digits instead of 1-bit digits, but they're all multiplied in parallel.

        As such, I suspect (but don't know) that this trick or some variant of it is used even today.

        • phire a day ago

          Yeah, you are right... I've misunderstood radix-8 multiplication, missed that this post was only talking about a small part of the Pentium's multiplier, and jumped to conclusions... And annoyingly, Hacker News doesn't allow you to edit comments after a few hours.

          On the R3000/R4000/R4200, the 3-bits-per-cycle multipliers do use radix-8 multiplication, but they don't have a dedicated 3x multiplier. Instead the 3x result is latched during the first cycle (by adding (x << 1) + x). For the remaining cycles it can do a 3-bit multiplication with nothing more than a bit of Booth recoding logic, and a single 64-bit wide adder.

          Then MIPS entirely abandoned this radix-8 encoding for the 8-bit-per-cycle multiplier in the R4400 and R4300, replacing it with a simple array of binary CSA adders. Probably because an array of base-2 adders is just much simpler. (Or at least that's what I think I can see on the R4300's die shot, I'm going to need to go back and take a much closer look at the multiplier)

          Anything I say about radix-256 in my first comment is probably nonsense, it's not radix-256 simply because it can do 8-bits in one cycle.

          What I missed is there is nothing limiting you to one radix-8 addition per cycle (like the early MIPS designs), you can combine the radix-8 encoding with an array of adders. And you only need 1/3rd the adders that a base-2 multiplier would need. The entire point of using the radix-8 encoding is that there is only one carry every 3 bits.

          You are probably right. This trick with the dedicated 3x multiplier is probably still used today.

          • vanderZwan a day ago

            On the other hand, the misunderstanding of the question still resulted in a reply that added valuable context to the discussion, so thank you :)

      • Aardwolf a day ago

        > Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

        How do these manage to achieve the 3 cycle latency? If it has to go through a pipeline of 8 (or 64?) adders in a row (I don't know if it's actually that many! but I assume more than 3), how do you still manage to get just a 3 cycle latency, assuming a single adder has a latency of 1?

        Also, given the trickery needed for the radix-8 multiply, where you get lucky with most values from 1-7 except 3 and 5 which use this specialized circuit, how many such specialized circuits do you need for a radix-256 multiply?

        • addaon a day ago

          > assuming a single adder has a latency of 1?

          Bad assumption. A single adder is only forced to take a full clock cycle if you register its output. Otherwise, data can flow through several stages in a single cycle, as (worst-case) propagation delay allows, before that set of stages hits a register that forces clock alignment.

      • ForOldHack a day ago

        This is brilliant enough to be included in schtiffs books/printouts. I knew the R4400 was a big leap in FP, but had never heard of the R4200. I would have liked to see code compile on it.

        Was there a difference in node size?

        Like the M603e that became the G3, it would have been great to see 4 of these R4200s as a quad processor server.

        As speed is paramount, there are many more things we will not learn about until the architecture is vintage.

        It's sad that speculative execution became a vulnerability.

        • phire 18 hours ago

          It's an interesting performance comparison, the R4200 is well under half the size and actually has a higher IPC than the R4400 on integer code... when it hits the L1 cache (and both have 16KB L1i caches).

          The early R4400s and R4200 were on the same node (I don't have 0.6 μm die area for the 4400, but the 0.3 μm R4400 was 132 mm² and the 0.35 μm R4300 was 45 mm²).

          But the R4400 could hit much higher clock speeds: it had an eight stage pipeline, took three cycles to take any branch (the delay slot, then two flushed instructions), and the 64-bit shifter took 2 cycles. The R4200 was designed for much lower clock speeds; it's more or less a 5 stage "classic RISC" pipeline. Branches take 1 cycle (the delay slot) and 64 bit shifts happen in 1 cycle.

          The other problem is when it doesn't hit cache. The R4400 had an on-chip controller and on-chip tags for an external 1MB secondary cache (aka, L2), while the R4200 only had half the data cache (8KB vs 16KB) and no support at all for a secondary cache (not enough space for the IO pins). AFAIK, it doesn't even broadcast cache flushes on the bus, so attempting to bolt an external cache controller would be problematic.

          (Cache misses are especially painful on the N64, because the R4300 uses a cut-down 32-bit bus, and there is a lot of latency and bus contention going to main memory)

          And to kill your dreams of putting four of them in a single server, they removed all multi-processor support.

          That last one feels semi-deliberate to me, like they didn't want their new low-power, low-cost laptop chip to compete with their more expensive chips. I don't think it would have taken that much effort to implement the full cache coherency protocol.

          Because I do think the R4200 could have competed on certain memory bound workloads (at least in cost-per-performance metrics).

      • ForOldHack a day ago

        Also, FP adders need prior normalisation.

    • kens 2 days ago

      There were a lot of different algorithms in use back then. I don't know if there is a winner used in modern chips.

  • dsvf 2 days ago

    No question, but just in general appreciation for the wonderful content you put out there that we get to enjoy!

  • kristianp 2 days ago

    This is probably a basic question, but is this for floating point multiplies? Isn't the part that's being multiplied less than 64 bits because you also add the exponents?

    • kens 2 days ago

      Yes, this is for floating point multiplies. But internally, the FPU uses x86 extended-precision numbers, 80-bit floats, with 64 bits of significand and 15 exponent bits. So the multiplier is dealing with 64-bit numbers, plus a few more bits to support rounding correctly.

      https://en.wikipedia.org/wiki/Extended_precision#x86_extende...
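
      If you want to poke at those parameters from C, something like this prints them on toolchains where long double maps to the x87 80-bit format (typical on x86 Linux; other platforms map long double differently):

          #include <float.h>
          #include <stdio.h>

          int main(void) {
              /* On x87 extended precision these print 64 and 16384,
                 i.e. a 64-bit significand and a 15-bit exponent field. */
              printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);
              printf("LDBL_MAX_EXP  = %d\n", LDBL_MAX_EXP);
              return 0;
          }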

  • java-man 2 days ago

    Ken, maybe it's time to publish a book?

    • kens 2 days ago

      That would require me to focus on one subject :-)

    • ForOldHack a day ago

      I have to respect his choice to focus on the 8086 and the Pentium. I would think he considers the 80286 a brain dead footnote, interesting as to what went wrong and what went terribly wrong.

      I used to teach binary math to kids, and at a county fair I was upstaged by an 11 year old girl who demonstrated both multiplication by powers of two and division of powers of two.

      "What does your father do for a living?'

      "He is the director of the laser fusion project!"

      "Oh."

      • kristianp a day ago

        The 286 could process twice as many instructions per clock as an 8086. According to Wikipedia, an uplift similar to the 80486 and the Pentium over their predecessors.

        • ForOldHack a day ago

          So that is Wikipedia's "opinion", which anyone could edit. I should edit that to reflect where the performance came from. Always check sources. (The reference comes from intel386.com)

          • ForOldHack 21 hours ago

            The cited website went offline this morning, and it does not discuss the execution rate of the 286, NOR does it discuss the reserved bytes in the GDT, and LDTs.

            Not a lot of confirmation....

            https://web.archive.org/web/20160204052256/http://intel80386...

            " For those descriptors that are common to both the 80286 and the 80386, the presence of zeros in the final word causes the 80386 to interpret these descriptors exactly as 80286 does; for example: "

            This is patently and documented as untrue. The final word is reserved for use as a 386 GDT, and breaks 286 code, specifically Xenix 286, on a 386.

            https://www.os2museum.com/wp/theres-more-to-the-286-xenix-st...

  • dwedge 2 days ago

    I only tenuously understand this so feel free to ignore the question if it's too asinine but:

    > (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

    Why can't you do the same to subtract x4 from x7 to get x3?

    • thombat 2 days ago

      Because x8 is simply shifting 3 bits left, so it's "free" (fixed-size shifts are very simple to implement), whereas generating the x7 would require additional circuitry, which would also be more expensive than the x3 was (since it's another shift+add beyond what x3 required).

    • mikequinlan 2 days ago

      x7 is computed by x8 - x, so you get x8 - x - x4 which is 2 subtractions (and 2 delays). You only want 1 delay.

    • russdill 2 days ago

      The question would be why isn't it 4x - 1x?

      • kens 2 days ago

        The trick is that Booth's algorithm gives you the ×8 term for free; you multiply by one more in the next base-8 digit. You can put either ×4 or -×1 into a term, but you can't put both of them into one term. So you can do ×8 - ×1 but you can't do ×4 - ×1.
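
        Spelled out as a software sketch (just the bookkeeping; the hardware evaluates all the digits in parallel rather than looping): digits 5, 6 and 7 become -3, -2 and -1 plus a carry of one into the next base-8 digit, so the only multiples ever needed are 0, ±1, ±2, ±3 and 4 times the multiplicand, and ×3 is the one that gets its own circuit.

            #include <assert.h>
            #include <stdint.h>

            /* Sketch of radix-8 recoding with the "borrow from the next digit"
               trick: every partial product is 0, +-1x, +-2x, +-3x or 4x the
               multiplicand, so x3 is the only multiple that needs its own adder. */
            uint64_t radix8_mul(uint64_t m, uint64_t q) {
                uint64_t x3 = (m << 1) + m;      /* the precomputed x3 value */
                uint64_t product = 0;
                int carry = 0;
                for (int shift = 0; shift < 64; shift += 3) {
                    int d = (int)((q >> shift) & 7) + carry;
                    carry = 0;
                    if (d >= 5) {                /* 5,6,7 (or 8) -> d-8 plus a carry */
                        d -= 8;
                        carry = 1;
                    }
                    uint64_t t = 0;
                    switch (d < 0 ? -d : d) {
                        case 1: t = m;      break;
                        case 2: t = m << 1; break;
                        case 3: t = x3;     break;
                        case 4: t = m << 2; break;
                    }
                    product += (d < 0) ? -(t << shift) : (t << shift);
                }
                return product;                  /* low 64 bits of m*q */
            }

            int main(void) {
                assert(radix8_mul(123456789, 987654321) == 123456789ULL * 987654321ULL);
                return 0;
            }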

        • eru a day ago

          I guess the decimal equivalent would be multiplying by 10 vs multiplying by 5.

          Or for hexadecimal: multiplying by 0x10 (16) or by 2, 4 or 8.

      • mjevans 2 days ago

        I suspect it's logically equivalent, but one bit wider in hardware.

        Binary 4x is left shift by 2, rather than 2x which is by one.

        Adding or subtracting in 2's complement space is the same operation for the bits of the word.

  • rob74 a day ago

    Not really a question, just one phrase that left me scratching my head a bit:

    > The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial.

    I think you can leave out the "almost" there - multiplying by 0 is 0 and multiplying by 1 is the number itself, as shown in the example, so I would definitely call it trivial.

mjevans 2 days ago

"This ×3 multiplier contains roughly 9000 transistors, a bit more than an entire Z80 microprocessor (1976). Keep in mind that the ×3 multiplier is a small part of the floating-point multiplier, which is part of the floating-point unit in the Pentium. Thus, this small piece of a feature is more complicated than an entire microprocessor from 17 years earlier, illustrating the incredible growth in processor complexity."

That's the pace of performance growth that led software to become bloated today; next year's performance improvements would cover up most of the sins of failing to think critically about algorithms and data flow context / locality.

Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics. The pendulum needs to swing the other direction; it's time for computers to work smarter not harder.

  • JumpCrisscross 2 days ago

    > Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics

    We’ve been at that limit for decades.

    • GrumpyYoungMan 2 days ago

    The fabs propped up the corpse of Moore's Law by throwing mountains of cash at expanding transistors into the third dimension: finFET, GAA, CFET, etc. That has kept the party going a little while longer than it would have lasted, but it's a one-time deal since there are no more dimensions to expand into.

      • brookst 2 days ago

      …but that's how it's always worked. Moore's law is dead, we're at the limit of everything, oh hey, Moore's lawn limps by again because someone did something clever.

        • ForOldHack a day ago

          This is the kind of comment that will keep me laughing for weeks. Moore's lawn is in fact dead. We need to go back to calling what Moore said as his observational insight.

        • light_hue_1 2 days ago

          That's never how it worked. Moore's law was never dead. People are just endlessly confused about what Moore's law is.

          What ended was Dennard scaling around 2006. Roughly that frequency would keep going up as feature size went down. But because so many people are confused about what is what, you see a crappy muddled message.

          Moore's law has been going strong. It must end eventually, current predictions are that it will be in a decade or two.

          • perching_aix 2 days ago

            It's starting to get a bit old that whenever I see Moore's law mentioned, I'll usually also run into a spiel about how people have the wrong idea about what it actually refers to, and that it's holding up just fine. This is despite the gen-on-gen and year-on-year performance improvements of computer hardware very clearly tapering off in recent memory.

            Maybe debating what always-somehow-wrong law to cite should not be the focus? Like it's very clear to me that being technically correct about what Moore's law or the Dennard scaling refers to is leaps and bounds less important than the actual, practical computing performance trends that have been observable in the market.

            • Kon5ole 2 days ago

              What we see in the market is caused by software bloat. Chips are gaining performance faster than ever in absolute terms.

              I think Moore’s law should be avoided altogether when discussing progress in this area, because it’s hard to understand the effects of doubling intuitively. Rice grains on chessboards and all that.

              One might think "Moore's law is slowing down" means progress was faster before and slower now, when it is in fact the complete opposite.

              If you consider the 20 years between the intel 286 and the pentium 3, transistor count went from about 150 thousand to 10 million.

              Today (using the ryzen 5950 and 7950 as examples), we got 5 Billion more transistors in just 2 years.

              So in 2 years we added 500 times more transistors to our cpus than the first 20 years of “prime Moore’s law” did.

              This enormous acceleration of progress is increasingly unnoticed due to even faster increases in software bloat, and the fact that most users aren’t doing things with their computers where they can notice any improvements in performance.

              • perching_aix 2 days ago

                > Chips are gaining performance faster than ever in absolute terms.

                But this is not what I as a consumer end up seeing at all. Consider the RTX 5090. Gen-on-gen (so, compared to the 4090), for 20-30% more money, using 20-30% more power, you get 20-30% more raster performance. Meaning the generational improvement is 0, software nonwithstanding.

                > If you consider the 20 years between the intel 286 and the pentium 3, transistor count went from about 150 thousand to 10 million. Today (using the ryzen 5950 and 7950 as examples), we got 5 Billion more transistors in just 2 years.

                Why would you bring absolute values into comparison with a relative value? Why compare the 286 and the P3 and span 20 years when you can match the 2 year timespan of your Ryzen comparison, and pit the P2 ('97) against the P3 ('99) instead? Mind you, that would reveal a generational improvement of 7.5M -> 28M transistors, a relative difference of +273%! Those Ryzens went from 8.3B to 13.2B, a +59% difference. But even this is misleading, because we're not considering die area or any other parameter.

                • Kon5ole a day ago

                  >But this is not what I as a consumer end up seeing at all. Consider the RTX 5090. Gen-on-gen (so, compared to the 4090), for 20-30% more money, using 20-30% more power, you get 20-30% more raster performance. Meaning the generational improvement is 0, software nonwithstanding.

                  The 4090 and 5090 are the same generation in reality, using the same process node. The 5090 is a bit larger but only has about 20% more transistors of the same type compared to the 4090. Which of course explains the modest performance boosts.

                  Nvidia could have made the 5090 on a more advanced node but they are in a market position where they can keep making the best products on an older (cheaper) node this time.

                  >Why would you bring absolute values into comparison with a relative value? Why compare the 286 and the P3 and span 20 years when you can match the 2 year timespan of your Ryzen comparison, and pit the P2 ('97) against the P3 ('99) instead? Mind you, that would reveal a generational improvement of 7.5M -> 28M transistors, a relative difference of +273%!

                  That was my point though, to highlight how relative differences in percentages represent vastly different actual performance jumps. It quickly becomes meaningless since it's not the percentages that matter, it is the actual number of transistors.

                  To put it another way - If you take the first pentium with about 3 million transistors as a baseline, you can express performance increases in "how many pentiums are we adding" instead of using percentages, and note that we are adding orders of magnitude more "pentiums of performance" per generation now than we did 10 years ago.

                  • perching_aix a day ago

                    > That was my point though, to highlight how relative differences in percentages represent vastly different actual performance jumps. It quickly becomes meaningless since it's not the percentages that matter, it is the actual number of transistors.

                    But it is the percentage that matters? If I have 10B transistors and I add 1B to it, the speedup I can expect is 10%. If I have 1B transistors and add 1B to it, the speedup I can expect is 100%. Tremendous difference. Why would I ever care about the absolute number of transistors added?

                    • Kon5ole a day ago

                      >But it is the percentage that matters? If I have 10B transistors and I add 1B to it, the speedup I can expect is 10%. If I have 1B transistors and add 1B to it, the speedup I can expect is 100%. Tremendous difference. Why would I ever care about the absolute number of transistors added?

                      Because it's the transistors that actually matter for performance.

                      If a unit of work (number of files compiled, number of triangles generated or whatever else) takes 1 billion transistors to complete in 1 second, you have gained the same amount of work per second by adding 10% to the former as you gained by adding 100% to the latter.

                      How much performance you need to feel a difference in a given workload is a separate point, and note that usually the workload changes with the hardware. Compiling the linux kernel in 2025 is a different workload than it was in 2005, for example, and running quake 3 is a different workload than cyberpunk 2077.

                      If you play a modern game, you don't notice a difference between a 10 year old GPU and a 12 year old GPU - even though one might be twice as fast, they might both be in the single digits of FPS which feels equally useless.

                      So we gain more from the hardware than we used to, but since the software is doing more work we're not noticing it as much.

                      • perching_aix a day ago

                        I legitimately just do not see the utility of framing the topic this way.

                        Benchmarking efforts usually involve taking the same amount or type of work, and comparing the runtime durations or throughputs in turn for different hardware.

                        Rasterization performance benchmarks for the 5090 revealed exactly the same +20% difference we see in transistor count. This is why I do not see the utility in remarking that in absolute terms we're adding more transistors than ever, because this is basically never what matters in practice. I have a set workload and I want it to go some amount faster.

                        Software sprawl is an issue no doubt, but that on its own is a separate discussion. It bears light relation with the absolute vs. relative differences discussion we're having here.

                        > How much performance you need to feel a difference in a given workload is a separate point

                        It was exactly the point I said at the start should be the point of focus. Maybe we're talking past one another, I don't know.

                        • Kon5ole a day ago

                          >It was exactly the point I said at the start should be the point of focus. Maybe we're talking past one another, I don't know.

                          I think we do - we agree that the symptom is that we don't experience the same gains now as we used to, and that is a problem.

                          My issue is the notion that this is caused by a slowdown in performance gains from the hardware side, when this is clearly not the case. A common complaint is along the lines of "we only got 30% when last time we got 50%", which completely ignores that the latter 30% is way more actual new performance than the previous 50%.

                          >I legitimately just do not see the utility of framing the topic this way.

                          IMO it's always useful to identify the actual reason for a problem and think about the fundamentals.

                          If the problem is that we're not experiencing the performance gains, we should be asking ourselves "Why does software feel slower today despite the hardware having 10x more performance".

                          Instead we complain about the hardware for not managing to add the equivalent of all previous performance gains every 2 years, because Moore's law observed that it did so in the beginning of the chessboard (so to speak).

                          Instead of wondering whether Moore's law is dying or not, we should question why Wirth's law seems to be immortal! ;)

                  • short_sells_poo a day ago

                    You seem to be completely confused about why absolute vs relative matters. Moore's law literally states: "[Moore's law] is the observation that the number of transistors in an integrated circuit (IC) doubles about every two years".

                    This is literally a relative measurement. You cannot reason about Moore's Law in absolute changes.

                    The other poster has laid it out for you in the simplest terms: a 1 billion transistor increase could mean anything. It could be a 1000% improvement - which is absolutely humongous - or a 10% improvement, which is basically irrelevant. If you want to measure how impactful an increase is, you have to look at relative change. 1 billion transistors on its own means nothing. It is only interesting with respect to the number of transistors in the previous generation - which is a relative change measurement.

                    Say we are at Generation 1 with 100 billion transistors. By your reasoning, if we add 1 billion more transistors to this, that's big. 1 billion transistors are a lot. But this is absolutely incorrect! Because we started out with 100 billion transistors, the change is actually irrelevant.

                    • Kon5ole a day ago

                      >You seem to be completely confused about why absolute vs relative matters.

                      [..]

                      >This is literally a relative measurement. You cannot reason about Moore's Law in absolute changes.

                      This is exactly my point. I said exactly the same thing as you: "I think Moore’s law should be avoided altogether when discussing progress in this area"!

                      Performance is an absolute thing, not a relative thing. Amount of work per second, not percentage-over-previous!

                      Doubling means that each step encompasses the sum of all previous steps. Every step gives vastly more performance than any preceding step.

                      If a generation gives 60% more performance and all previous generations gave 100%, the latest jump will still be the one that added the most performance.

                      I think this is often overlooked and worth mentioning. People are actually complaining about performance jumps that are the biggest performance jumps ever seen because they are thinking in percentages. It makes no sense.

                      • short_sells_poo a day ago

                        I disagree on this: "Performance is an absolute thing"

                        I think the core misunderstanding is what is being discussed. You say "Performance is an absolute thing, not a relative thing. Amount of work per second, not percentage-over-previous!"

                        But nobody here is concerned with the absolute performance number. Absolute performance is only about what we can do right now with what we have. The discussion is about the performance improvements and how they evolve over time.

                        Performance improvements are inherently a relative metric. The entire industry, society and economy - everything is built on geometric growth expectations. You cannot then go back to the root metric and say "ah but what really matters is that we can now do X TFlops more", when that number is a mere 10% improvement, nobody will care.

                        Built into everything around computing is the potential for geometric growth. Startups, AI, video games - it's all predicated on the expectation that our capabilities will grow in geometric fashion. The only way to benchmark geometric growth is by using % improvements. Again, it is wholly irrelevant to everyone that the 5090 can do 20 tflops more than the 4090. 20 TFlops is some number, but whether it is a big improvement or not - ie is it expanding our capabilities at the expected geometric rate or not - is not possible to divine from this number.

                        And when we look it up, it turns out that we went from 80 tflops to 100 tflops, which is a paltry 25% increase, well short of the expectations of performance growth from Nvidia, which were in the 50% region.

                        20 tflops might or might not be a lot. The only way to judge is to compare how much of an improvement it is relative to what we could have before.

                        • Kon5ole a day ago

                          >I think the core misunderstanding is what is being discussed.

                          I see your point and it's valid for sure, and certainly the prevailing one, but I think it's worth thinking about the absolute values from time to time even in this context.

                          In other contexts it's more obvious, let's use video. Going from full HD to 4K is roughly a doubling of resolution in each direction. It means going from about 2 megapixels to 8, a difference of 6.

                          The next step up is 8k, which is going from 8 megapixels to 33. A difference of 25.

                          This is a huge jump, and crosses a relevant threshold - many cameras can't even capture 33mpx, and many projectors can't display them.

                          If you want to capture 8k video, it's not relevant if your camera has twice the megapixel count of the previous one - it needs to have 33mpx. If it has 16 instead of 8 it doesn't matter (also it doesn't really matter if it has 100 instead of 40).

                          On the other hand, if you want to capture 4k, you need only 8 megapixels. Whether you have 12, 16, 24 or 40 doesn't really matter.

                          If we return to CPUs, absolute performance matters there too - if you want to play back video without frame drops, you need a certain absolute performance. If your computer is twice as fast and still can't do it, the doubling didn't matter. Similarly if your current computer can do it, doubling it again won't be noticeable.

                • Ekaros a day ago

                  Take density per mm^2 going from 122.9 to 125.3: that is a 2% increase in a little over 2 years. Which does not bode well for needing to double in the same period.

                • relaxing 2 days ago

                  Moore’s law never mentioned die area, and not because Gordon thought die sizes would never grow.

              • _0ffh 2 days ago

                > This enormous acceleration of progress is increasingly unnoticed due to even faster increases in software bloat

                Can't resist throwing in this quote (from one of Wirth's students I think):

                "Software gets slower faster than hardware gets faster."

            • brookst a day ago

              The spirit of Moore’s law is alive and well. It’s just that it doesn’t cover the whole story; it ignores efficiency, because that wasn’t a huge concern back then.

              Sure, it’s not literally true of literal transistor density, but it feels arrogant to invent new names for the overall phenomenon that Moore observed. Like the way we still talk about Newton’s laws despite very valid “actually technically” incompleteness.

            • sitkack 2 days ago

              It is ultimately a market effect, the technical specifics are not really important and are even conflated by industry insiders. See my sibling comment.

          • wtallis 2 days ago

            Also, chip fabs keep getting more expensive and taping out a chip design for those new fabs keeps getting more expensive. That makes today's semiconductor industry work quite differently from how things worked 30 years ago and means some segments see reduced or delayed benefits from the continued progression of Moore's Law. See eg. the surprisingly long-lasting commercial relevance of 28nm, 14nm, and 7nm nodes, all of which were used in leading products like desktop GPUs and CPUs for more years than Moore's Law would lead you to expect.

          • sitkack 2 days ago

            Moore's law ends when the whole universe is a computer (which it already is).

            https://hasler.ece.gatech.edu/Published_papers/Technology_ov...

            Some view it as a doubling every 18 months, or a cost per transistor (this has gone up with the smallest nodes).

            It is roughly an exponential curve in the number of transistors we can use to make a "thing" with.

            It is both a capability (can we make things of a certain number of transistors?) and a question of whether it is economically viable to build things of that size.

            You could stay at the current node size and halve the cost of that wafer every 18 months and you would still be on the same curve. But it is easier in our economic system to decrease the node size, keeping the rest of the fixed wafer costs the same, and get 2x or 4x the density on the same lines.

            If I get nerd sniped, I'd find the two video presentations, one by Krste and another by Jim Keller, where they unambiguously explain Dennard Scaling and Moore's Law in a way that is congruent with what I just said.

            • eru a day ago

              > Moore's law ends when the whole universe is a computer (which it already is).

              I find "Moore's Second Law" interesting. At least the version I'm familiar with says that the cost of a semiconductor chip fabrication plant doubles every four years. See https://en.wikipedia.org/wiki/Moore%27s_second_law

              It's interesting to contrast that trajectory with global GDP. At some point, either global economic growth has to accelerate dramatically to even produce one fab; or we have to leave the 'globe', ie we go into space (but that's still universal GDP exploding), or that law has to break down.

              It would be exceedingly funny (to me) if one of the first two possibilities held true, and would accurately predict either an AI singularity or some Golden space age.

          • DrNosferatu a day ago

            Moore's law will only end when we have practical optical computing, as it fills a critical technological niche.

            It's not impossible, but I think quantum computing will only become practical later than optical.

      • dehrmann 2 days ago

        > into the third dimension

        Does this actually work? At some point, and this has been the case for a while, you're limited by thermals. You can't stack more layers without adding more cooling.

        • magicalhippo 2 days ago

          He's talking about how they've moved from planar transistors, where layers are just deposited on top of each other, to transistors with increasingly complex 3D structures[1] such as FinFET and Gate-All-Around, the latter having multiple nanowires passing through the gate like an ordered marble cake[2].

          [1]: https://www.asml.com/en/news/stories/2022/what-is-a-gate-all...

          [2]: https://anysilicon.com/the-ultimate-guide-to-gate-all-around...

          • ForOldHack a day ago

            There are also cooling and conduction paths taken into account. It was discussed in the design of the xeon version of the i9. Which had me consider clocking down the performance core communication while throttling up the performance cores.

            Your sources are excellent. ( Thank you so much for the links. )

        • eru a day ago

          Moore's law doesn't say anything about you having to power all your transistors for them to count.

          I'm only half-joking: the brain gets a lot of its energy efficiency out of most of its parts not working all that hard most of the time; and we are seeing some glimpses of that in mobile processors, too.

      • WXLCKNO 2 days ago

        Depends, there could be 11 dimensions to expand into.

        • skissane 2 days ago

          Assuming those extra dimensions really exist (it is unproven), I think we are centuries or even millennia away from being able to make technological use of them, if we ever will be at all.

        • relaxing 2 days ago

          Expand into the time dimension, evaluate infinite execution paths by reversing the pipeline, rerunning, and storing the results in a future accumulator.

          • ForOldHack a day ago

            You give a whole new meaning to branch prediction. I knew you were going there, and I still did not avoid the brain twang.

      • acchow 2 days ago

        > since there are no more dimensions to expand into.

        Quantum computing is next, right?

        • adastra22 2 days ago

          That’s not how quantum computing works.

          • eru a day ago

            To elaborate:

            Apart from the very niche application of factoring integers into prime numbers, there's scarcely any application known where quantum computers would even theoretically outperform classical computers. And even integer factoring is only remotely useful until people completely switch away from cryptography that relies on it.

            The one useful application of quantum computing that I know of is: simulating quantum systems. That's less useless than it sounds: a quantum computer can simulate not just itself (trivially), but also other quantum systems. In any case, the real-world use case for that is accelerating progress in material science, not something you or I would use every day.

            • adastra22 a day ago

              This isn’t an elaboration on anything I said. Quantum computers are immensely useful across a whole slew of domains. Not just cryptanalysis, but also secure encryption links, chemistry simulations, weather predictions, machine learning, search, finance, logistics, classical simulations (e.g. fluid flow) and basically anywhere you have linear algebra or NP problems.

              • eru a day ago

                Do you have any sources that give good evidence that quantum computers are useful for 'weather predictions, machine learning, search, finance, logistics, classical simulations (e.g. fluid flow) and basically anywhere you have linear algebra or NP problems'? I'm basing my skepticism mostly on the likes of Scott Aaronson.

                I can believe that quantum computers might be useful for chemistry simulations. (Quantum computers aren't really useful for encryption. You could theoretically use them, but they just don't give you any advantage over running a quantum-resistant algorithm on a classical computer.)

                I'm especially doubtful that quantum computer would be useful for arbitrary NP problems or even arbitrary linear algebra problems.

                • adastra22 a day ago

                  Both answers here are a good start:

                  https://cs.stackexchange.com/questions/76525/could-a-quantum...

                  There are specialized algorithms for any part of it, especially search. But demonstrating that quantum computers are good for linear algebra should be enough to show that they are generally useful, I hope.

                    The encryption I was referring to was quantum link encryption (technically not a quantum computer, but we are splitting hairs here; it uses the same set of underlying mechanisms). Quantum link encryption permits you to have a communications channel where, if someone tries to man-in-the-middle it, all they do is break the link. Both you and the attacker only see gibberish. It's like a one time pad that doesn't require first exchanging pads.

                  • eru a day ago

                    Thanks, but you can follow the comments on the answers to get eg https://scottaaronson.blog/?p=2196

                    See https://scottaaronson.blog/?p=8329 for something more recent.

                    • adastra22 a day ago

                        Scott tends to be a bit of a Debbie Downer on quantum timelines, and you need to be able to separate that from actual criticisms of the underlying technology. If you look past the way he phrased it, his criticisms of quantum machine learning basically boil down to: there are still some things to be worked out. Not that we have no expectation of how to work those things out, just that there are still unsolved challenges to be tackled.

                      That’s not really a takedown of the idea.

                        The more critical challenge is that there is a massive, massive, constant factor difference between classical and quantum computing using foreseeable technology. Even in the best case (like factoring) where a quantum computer gives you a logarithmic algorithm for a classically exponential problem, it does happen to throw in a slowdown factor of a trillion. Oops.

                        But still, even with large constant factors, algorithmic improvements eventually win out asymptotically. It's just a matter of building a quantum computer big enough and reliable enough to handle problems of that size.

                      We are already hitting the limits of what we can train on GPUs with reasonable cost. I expect that there will be many advances in the years to come from improved training algorithms. But at some point that will run dry, and further advances will come from quantum computing. Scott is right to point out that this is far, far beyond any existing companies planning horizon. But that doesn’t mean quantum technologies won’t work, eventually.

    • Joker_vD 2 days ago

      Depends on what limit exactly you are thinking about: the size of a transistor is still shrinking.

  • dlcarrier 2 days ago

    The speed of software bloat matching hardware improvement is known as Wirth's law: https://en.wikipedia.org/wiki/Wirth%27s_law

    I think software bloat is growing faster, though.

    • _carbyau_ 2 days ago

      I think the software bloat is being more affected by the speed of light. If only every damn thing didn't need internet access with its associated lag - often variable in timing...

      • klempner 2 days ago

        Speed of light is very relevant for actual performance sensitive software, not just shitty overly chatty trash.

        While in practice the latency is over an order of magnitude larger, the speed-of-light round trip between two sockets on the same motherboard is likely on the scale of 1-2 nanoseconds, which is already several cycles. Once you're talking about warehouse sized datacenters the speed of light distance can get into the microseconds even in a straight line through vacuum/air.
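
        Back-of-the-envelope, assuming roughly 15 cm between sockets and a 3 GHz clock (both made-up round numbers):

            #include <stdio.h>

            int main(void) {
                double c = 3.0e8;               /* speed of light, m/s */
                double dist = 0.15;             /* assumed ~15 cm between sockets */
                double rtt = 2.0 * dist / c;    /* round trip, in seconds */
                printf("round trip: %.2f ns, ~%.1f cycles at 3 GHz\n",
                       rtt * 1e9, rtt * 3.0e9); /* ~1 ns, ~3 cycles */
                return 0;
            }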

        • ForOldHack a day ago

          Two of the brightest minds, Rear Admiral Grace Hopper and Seymour Cray, considered this, as has almost every silicon designer in the last 15 years. Feel free to point it out often.

      • dragontamer 2 days ago

        The internet has largely replaced CD Roms, Floppies and DVDs.

        The stuff you use a computer hard drive or flash drive for has remained consistent. Some items may have moved to the internet but most games, OS utilities and whatnot are stored on local HDDs or SSDs. Always have been and likely will remain so.

        I remember when data interchange meant burning a disc and walking a disc over to my buddy. XML will standardize that and simplify my workflow. /s

        • Someone a day ago

          > Some items may have moved to the internet but most games, OS Utilities and whatnot are stored on local HDDs or SDDs. Always have and likely will remain so.

          With iCloud, your primary data lives in the cloud; your SSD caches recently used data, and the OS will delete it if it runs out of space. That’s millions of users for whom “whatnot” isn’t stored locally. (That’s how a 2TB iCloud subscription can make sense even if only used with a 500 GB SSD)

        • eru a day ago

          > Some items may have moved to the internet but most games, OS Utilities and whatnot are stored on local HDDs or SDDs. Always have and likely will remain so.

          Lots of Facebook and flash games are stored 'in the cloud' (but probably cached locally).

          For many games, a big part of the game logic runs on some server somewhere, especially multiplayer games.

  • userbinator 2 days ago

    On the other hand, the multiplier is far more regular in structure than a Z80. The datapath on the Pentium is several times wider too.

    • kens 2 days ago

      A bit surprisingly, the Z80's datapath is 4 bits wide. The Pentium's FPU is more than 64 bits wide (extra bits for rounding and stuff). So, yes, there is a big difference in the datapath widths.

  • acchow 2 days ago

    History of function calls:

    - goto/jmp to instr

    - vtable lookup

    - hash and lookup in a dictionary

    - run a Large Language Model

    • CamperBob2 a day ago

      Or, rather:

      - goto/jmp to instr

      - vtable lookup

      - hash and lookup in a dictionary

      - run everything through a universal approximator

  • HPsquared 2 days ago

    Luckily there is plenty of room for improvement in most applications.

    • ajsnigrutin 2 days ago

      But why would we waste thousands of hours of programmers time to optimize stuff, if we can instead waste millions if not billions of hours of users' time?

      /s

      • dijit 2 days ago

        you know it’s funny, the first time I ever read a book about compiled languages- I think it might have even been K&R C… it talked about the fact that code is run significantly more often than it’s compiled, and read more often than it’s written. Thus we should optimise for runtime speed and afterwards readability.

        Yet on hacker news, the exact opposite sentiment is true.

        Employee hours are expensive, so optimise for them as much as you possibly can and push as much time onto the user as they will tolerate. Focus primarily on readability to the detriment of all other factors. Heck, don't even bother compiling your program. Use an interpreter on the server.

        Keep your local cost down, your remote cost up… after all remote cost are not your cost, especially if every company is doing the same thing.

        It’s ironic that I could read a book and it be so wrong, yet that book is the foundation Bible for many people’s programming journey.

        • jjmarr 2 days ago

          I've been told it depends on management's goals.

          If you are a startup (many HN readers are), employee hours are relatively expensive. The goal is to get a product out as fast as possible, and most of the time you fail at making a working product anyways. Meanwhile, you only have to spend money on servers once you have a successful product. Deferring the problem makes sense here.

          On the other hand, I personally write code that'll be near guaranteed to run millions of times in some factory somewhere, and there's little benefit if I ship my code before the factory starts making stuff. I'll take the time to write high-performance and high-quality software because there's now a financial benefit to doing so.

          But ultimately, it's my boss who gives me insight into what I should be doing. Not a book, because the book doesn't have context that [company] will lose tens of thousands of dollars every day that I don't fix [issue].

        • cipheredStones 2 days ago

          Well, programming is a tool, right? There are clearly incorrect ways to program, but that doesn't mean there's one correct way to program - because it depends on what you're trying to make.

          It's incorrect to make a bookshelf that can't hold the weight of books, but IKEA isn't incorrect to make a crappy-but-adequate bookshelf that will get given away when you move, and a master carpenter isn't incorrect to make an expensive-but-beautiful bookshelf that will be used for many decades.

          A seed-stage startup trying to see if anyone cares about their product at all should probably accept it being 100x slower than it theoretically could be. And the Linux maintainers should be working hard to shave milliseconds.

        • joquarky 2 days ago

          It feels like companies care a lot less about delivering quality.

          • pixl97 a day ago

            Because quite often users buy products based on features and not quality of those features.

        • tehjoker 2 days ago

          This is just the difference between rational design and capitalist design. Capital must externalize costs to achieve profits, but it's not a rational approach to system design. This appears everywhere in the economy as capitalist pressures override rationality.

      • Someone 2 days ago

        Because it saves lives. https://folklore.org/Saving_Lives.html:

        "Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?”

        • ab71e5 2 days ago

          That assumes booting is a synchronous action rather than asynchronous, giving you time to fetch a cup of coffee.

        • speed_spread 2 days ago

          By that logic, Electron apps kill people. Makes sense.

          • HPsquared a day ago

            Somewhere along the causal chain of "what went wrong leading to this death", there will almost certainly be cases where a slow app indirectly (maybe even directly) caused the death. It's a question of what we mean by "cause".

            That's not even counting loss of quality-adjusted life years from time spent staring at a screen waiting, like being stuck in traffic.

      • HPsquared a day ago

        Billions of person-hours wasted... Windows Update.

  • Salgat 2 days ago

    Which is how it should be. There's a limited amount of resources for developing a chip; best to take the least costly route given those constraints.

  • EncomLab 2 days ago

    Jonathan Blow has entered the chat...

phkahler a day ago

>> Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

Does this mean there's an "adder" to add 1 to the "next digit" prior to feeding the digits to the main part of the multiplier? That itself seems like it'd be similar to the carry-lookahead circuit. Also thinking about when that needs to be done (a rough software sketch of the recoding I have in mind follows the list below):

7 = 8-1

6 = 8-2

5 = 8-3

4 = 8-4 <- he didn't say they do this, but it would mean the MSB of the 3-bit value could be used to determine if we need to add 1 to the next digit, saving a couple gates.
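
Here's a rough software model of that recoding in C, just to sanity-check the idea. It's my reading of the article's description plus the guess above that 4 is also recoded as 8-4 (so the MSB of the 3-bit digit is exactly the "add 1 to the next digit" signal); it is not the actual Pentium circuitry, and it uses 32-bit operands rather than the 22 digits of the real 64-bit datapath:

  #include <stdint.h>
  #include <stdio.h>

  /* Map one octal digit (plus incoming carry) to a small signed multiple:
     0..3 pass through, 4..7 become d-8 with a carry into the next digit. */
  static void recode_digit(unsigned d, int *multiple, int *carry_out) {
      if (d <= 3) { *multiple = (int)d;     *carry_out = 0; }
      else        { *multiple = (int)d - 8; *carry_out = 1; }
  }

  /* Sanity check: multiply using the recoded digits. The only multiples
     ever needed are 0, +/-1, +/-2, +/-3, +/-4, so +/-3 is the lone one
     that needs a precomputed 3x in hardware. */
  static uint64_t mul_radix8(uint32_t x, uint32_t y) {
      uint64_t product = 0;
      int carry = 0;
      for (int shift = 0; shift < 36; shift += 3) {    /* 12 octal digits */
          unsigned d = (unsigned)(((uint64_t)y >> shift) & 7) + (unsigned)carry;
          int multiple;
          recode_digit(d & 7, &multiple, &carry);
          carry += (int)(d >> 3);     /* digit 7 + carry-in = 8 -> 0, carry 1 */
          product += (uint64_t)((int64_t)multiple * x) << shift;
      }
      return product;
  }

  int main(void) {
      uint32_t a = 123456789u, b = 987654321u;
      printf("%d\n", mul_radix8(a, b) == (uint64_t)a * b);   /* prints 1 */
      return 0;
  }

In this model the "+1 to the next digit" is just a carry bit folded into the next digit before it is recoded, which is exactly the part I'm wondering about in hardware.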

sifar 2 days ago

Interesting choice of a radix-8 Booth multiplier that needs the x3 circuit. Seems like an area/perf tradeoff made to push the fmax, which means they had a latency (cycle count) constraint, since the same frequency could have been achieved via more pipelining.

  • kens 2 days ago

    Yes, it's a tradeoff. Many other FPUs at the time used radix-4 instead because it avoids the extra ×3 circuit. Pipelining is tricky because there isn't a nice place to break the multiplication array into two pieces.

    • sifar 2 days ago

      Sure it takes a few iterations, but there is nothing inherently complex about it.

      You feed the partial products into a carry-save adder tree and choose a level which minimizes the delay and the extra flop area/power. I am not sure about the Pentium's technology, but whatever that area is, it might be comparable to the x3 circuit's.

      There is an implicit add in the x3 multiplier circuit that increases the delay, which was evidently deemed more acceptable than the radix-4 delay. Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).

thaumasiotes 2 days ago

I'm missing something:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial. Fortunately, there are some shortcuts. Note that multiplying by 2 is the same as shifting the number to the left by 1 bit position, which is very easy in hardware—you wire each bit one position to the left. Similarly, to multiply by 4, shift the multiplicand two bit positions to the left.

> Multiplying by 7 seems inconvenient, but there is a trick, known as Booth's multiplication algorithm. Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

> Thus, you can get the ×7 by subtracting. Similarly, for a ×6 term, you can subtract a ×2 multiple and add ×8 in the next digit.

> Thus, the only difficult multiple is ×3. (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

If we can easily compute ×2 for the purposes of exploiting 6x = 8x - 2x, and we can easily compute ×4 for the purposes of exploiting 4x = 4x...

why is it more difficult than that to compute 3x as the sum of 2x and 1x, or the difference of 4x and 1x?

Also, if we can easily compute ×6 by any means, why not compute ×3 as a right shift of that value? That is admittedly an extra step, but the extra step is a shift.

  • kens 2 days ago

    For the 64-bit multiplication, you're adding 22 terms, one for each base-8 digit. (Think of grade-school multiplication.) Each term needs to be trivial to compute, so you can do a shift and/or a negation to get the term, but you can't do another addition.

    The point is that ×3 is precomputed once, so you can then simply feed it into any of the 22 terms as needed. You can't feed ×2 and ×1 into a term to get ×3, because then every term would need another adder. In other words, you want one circuit to compute ×3 rather than 22 circuits.

    For your ×6 question, this value is computed by putting negative ×2 into the term and conceptually adding one more to the next digit to get ×8. You can't right-shift this ×8 value since it's part of a completely different term.

    Hopefully this makes sense. There are a lot of digits and sums flying around.
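
    To make "trivial to compute" concrete, here is a hypothetical C model of what each term's selector is allowed to do, assuming a recoding into small multiples in the range -4…+4: only wiring shifts of the multiplicand, the single shared precomputed ×3, and bitwise inversion, with no per-term adder. (Illustrative only, not the actual Pentium selector.)

      #include <stdint.h>

      /* Select one partial-product term from multiple in {-4..+4}.
         Shifts are just wiring; x3 is the one shared precomputed value.
         Negation is inversion only: the +1 that completes the two's
         complement is injected elsewhere, as that term's carry-in. */
      static uint64_t select_term(uint64_t x, uint64_t x3, int multiple) {
          uint64_t mag;
          switch (multiple < 0 ? -multiple : multiple) {
              case 0:  mag = 0;      break;
              case 1:  mag = x;      break;
              case 2:  mag = x << 1; break;   /* wiring */
              case 3:  mag = x3;     break;   /* precomputed once, reused by every term */
              case 4:  mag = x << 2; break;   /* wiring */
              default: mag = 0;      break;   /* unreachable for this recoding */
          }
          return (multiple < 0) ? ~mag : mag;
      }

    Summing these terms (shifted into place, with a 1 injected at each negative term's position) reproduces the product modulo the word size; the sign-extension question below is what's left to handle in the real, wider datapath.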

    • thaumasiotes 2 days ago

      How do you do a negation without doing an addition? Can you do better than -x = ~x + 1?

      • sifar 2 days ago

        If you are negating a single term you can't.

        If you are adding n terms using an adder tree, the 1 feeds into the carry-in of the next level (it is simply a concatenation, since the carry term is shifted left by 1). Only the final carry-propagate adder will have to add the 1.

      • kens 2 days ago

        I think that the +1 is accomplished simply by setting carry-in on the adder for that term, so you don't need a separate adder. (Disclaimer: I haven't looked at that part of the circuit yet.)
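
        In C terms, the identity being relied on is just the usual one (tiny illustrative snippet, not the circuit):

          #include <stdint.h>

          /* a - b as a single addition: invert b, force the adder's carry-in to 1 */
          static uint64_t sub_via_carry_in(uint64_t a, uint64_t b) {
              return a + ~b + 1;   /* the "+ 1" is the carry-in, not an extra adder */
          }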

        Another interesting question is how sign extension is implemented for a negative term. The problem is that you're adding 64-bit shifted terms to create a 128-bit result. When negating a 64-bit term, all the unused bits to the left need to be set to 1. But the chip doesn't do this. (Among other things, the adders would need to be much wider than 64 bits.) So it must be using a trick for the sign extension.

artemonster 2 days ago

TLDR: 3a = (a << 1) + a

  • vanderZwan 2 days ago

    From the article:

    > Multiplying a number by three is straightforward in binary: add the number to itself, shifted to the left one position. (As mentioned above, shifting to the left is the same as multiplying by two and is easy in hardware.) Unfortunately, using a simple adder is too slow.

    So no, not really. In fact the entire point of the article is arguably explaining why it is not that simple.
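
    For a flavour of what "faster than a simple adder" can look like, below is a word-level C toy of the textbook parallel-prefix (Kogge-Stone style) way to compute a + (a << 1): the carries are resolved in log2(n) levels of generate/propagate merging instead of rippling across all the bits. This is the generic textbook construction, not necessarily the exact circuit the article describes.

      #include <stdint.h>
      #include <stdio.h>

      /* Toy fast x3: add a and a<<1 with carries computed by a
         Kogge-Stone style prefix (5 levels for 32 bits) rather than
         a ripple chain. Not the Pentium's actual x3 circuit. */
      static uint32_t times3_lookahead(uint32_t a) {
          uint32_t b = a << 1;
          uint32_t g = a & b;   /* generate: bit i makes a carry on its own  */
          uint32_t p = a ^ b;   /* propagate: bit i passes an incoming carry */

          /* After this loop, bit i of g means "a carry comes out of bit i". */
          for (int span = 1; span < 32; span <<= 1) {
              g |= p & (g << span);
              p &= p << span;
          }
          return a ^ b ^ (g << 1);   /* sum = XOR of inputs and incoming carries */
      }

      int main(void) {
          for (uint32_t a = 0; a < 1000000; a++)
              if (times3_lookahead(a) != 3 * a) { printf("mismatch at %u\n", a); return 1; }
          printf("ok\n");
          return 0;
      }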

brcmthrowaway 2 days ago

Can AI come up with this circuit?

  • goldfishgold a day ago

    Almost certainly, insofar as it’s already known and described in a number of places, including Wikipedia.

    • eru a day ago

      Though it would be interesting to run an experiment where you somehow remove all of that from the training data.

      Perhaps in 10 years training runs will be cheap enough that, on a whim, you could just train a new AI model only on data from before the Pentium was released, and then see if it could come up with it?

efitz 2 days ago

A long time ago on the 6502 I was trying to figure out how to quickly multiply by 3 in assembly and I came up with:

1. Shift-left

2. Add original value

For multibyte values, use rotate-left (which goes through the carry) and do the rotates from LSB to MSB. Then clear carry and do the adds from LSB to MSB with carry. Should work for any word size.
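
Roughly, in C terms (a sketch of the idea above with little-endian bytes, not my original 6502 code):

  #include <stdint.h>
  #include <stdio.h>

  #define NBYTES 4   /* any word size works */

  /* x3 in place: multibyte shift left (carrying between bytes),
     then clear carry and do a multibyte add of the original value. */
  static void times3(uint8_t v[NBYTES]) {
      uint8_t orig[NBYTES];
      unsigned carry, i;

      for (i = 0; i < NBYTES; i++) orig[i] = v[i];

      carry = 0;                                   /* shift the whole value left by 1 */
      for (i = 0; i < NBYTES; i++) {
          unsigned t = ((unsigned)v[i] << 1) | carry;
          v[i]  = (uint8_t)t;
          carry = t >> 8;
      }

      carry = 0;                                   /* add the original back in */
      for (i = 0; i < NBYTES; i++) {
          unsigned t = (unsigned)v[i] + orig[i] + carry;
          v[i]  = (uint8_t)t;
          carry = t >> 8;
      }
  }

  int main(void) {
      uint8_t v[NBYTES] = { 0x39, 0x30, 0x00, 0x00 };   /* 12345, little-endian */
      times3(v);
      printf("%u\n", (unsigned)v[0] | (unsigned)v[1] << 8 |
                     (unsigned)v[2] << 16 | (unsigned)v[3] << 24);   /* prints 37035 */
      return 0;
  }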

Ofc I was a teenager but this seemed clever and simple to me.

Was (am) I just missing something?

  • bean-weevil 2 days ago

    This is covered in the article. See the section "Implementing a fast ×3 circuit with carry lookahead"

    • efitz 2 days ago

      Thank you. I stopped reading after the octal discussion.

  • russdill 2 days ago

    For most assembly, I'm not sure how this has an advantage over two adds.

    • dashtiarian 2 days ago

      Shift-and-add used to execute much faster than multiplication in the 8088 days, and people would use it when they had to multiply an int by a known scalar (shift took 4 clocks and add took 12).

      • funny_falcon a day ago

        Compilers still prefer LEA over a multiply instruction for ×3, ×5, and ×9; it performs a shift+add, I believe.

    • jonsen 2 days ago

      A shift is faster than an add.

      • russdill 2 days ago

          On the 6502 they are both 2 cycles. On the vast majority of processors I'm aware of, "simple" ALU operations don't vary in execution time.

        • kragen 2 days ago

          On 8086 or ARM you can get a shift "for free" with your add, although on the 8086 you have to use the LEA instruction.

        • ksherlock 2 days ago

          ADC #immediate is 2 cycles, as is ASL. Are you really multiplying a constant by 3 at runtime? Probably not. ADC zp is 3 cycles. Plus 2 cycles for the CLC.

      • ForOldHack a day ago

        The M68020 added a dedicated barrel shifter, which could execute shifts much faster than the M68000 could.