kabdib 2 years ago

I was taking a VLSI design course in 1981, and the professor teaching it proudly showed off some 432 chips (embedded in plastic) that he'd been given. He waxed lyrical about them and how the big boys were doing silicon in Silly Valley. (We, with our colored pencils, were learning how to do NAND gates and full adders in NMOS, on graph paper.)

Later, I read Organick's book on the 432. It was kind of a mess; I had no idea how they expected the thing to perform.

This was also back when Ada was the up-and-coming language, which the 432 was going to run really well (if you believed the marketing). Ada was pretty intimidating, as it was complicated for the time and generics seemed to scare everyone. (Little did we know that C++ was going to be a thing in a decade or so, and that it would make Ada seem simple in comparison.)

  • chasil 2 years ago

    Ada evolved into the procedural scripting syntax of many SQL databases.

    "SQL/PSM is derived, seemingly directly, from Oracle's PL/SQL. Oracle developed PL/SQL and released it in 1991, basing the language on the US Department of Defense's Ada programming language... IBM's SQL PL (used in DB2) and Mimer SQL's PSM were the first two products officially implementing SQL/PSM. It is commonly thought that these two languages, and perhaps also MySQL/MariaDB's procedural language, are closest to the SQL/PSM standard. However, a PostgreSQL addon implements SQL/PSM (alongside its other procedural languages like the PL/SQL-derived plpgsql), although it is not part of the core product."

    https://en.wikipedia.org/wiki/SQL/PSM

PAPPPmAc 2 years ago

The first of Intel's many expensive lessons about the problems with extremely complicated ISAs dependent on even more sophisticated compilers making good static decisions for performance. Then they did it again with the i860. Then they did it again with Itanium.

  • bri3d 2 years ago

    iAPX 432 was sort of a different failure from i860 and Itanium, no? My understanding is that the issue with iAPX 432 was that the architecture provided object-oriented instructions, but they turned out to be slow in practice, and the compiler didn't know how slow they were, so it abused them in situations where it should have used scalar ops instead; in tandem, the ABI relied too heavily on pass-by-value. Basically, the iAPX was explained to compiler authors as an object-oriented CPU when it should have been treated as a CPU with object-oriented extensions.

    Whereas i860 and Itanium were just trying to shoehorn VLIW into general-purpose computing, which is generally incredibly challenging. VLIW is great for places like DSP, where you have a defined real-time stream of data and limited context switching. In that setting you can spend the die area you saved on dispatch, prediction, and retirement on more MACs or ALUs or vector units, and the compiler can accurately predict the latency of a given operation because the data source is well defined. Fundamentally, compiler scheduling is intractable in a multiuser or task-switching environment, because you have _no idea_ what will be in cache ahead of runtime and always end up with the i860/Itanium problem, where you stall the entire execution pipeline every time you miss cache unexpectedly.
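
    For a toy picture of the kind of kernel where that trade pays off, think of a fixed-length FIR/dot product (illustrative C, not tied to any particular DSP): every operand streams from a known buffer, every iteration is one multiply-accumulate, and there are no data-dependent branches, so the compiler can schedule it statically and keep the extra MAC units busy.

      /* The kind of loop a VLIW DSP compiler can schedule statically:
         fixed trip count, streaming loads with predictable latency, and
         one multiply-accumulate per tap. */
      float fir(const float *samples, const float *coeffs, int taps) {
          float acc = 0.0f;
          for (int i = 0; i < taps; i++)
              acc += samples[i] * coeffs[i];   /* maps onto the spare MAC units */
          return acc;
      }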

  • bombcar 2 years ago

    Have we (finally) realized the dream by basically putting the "smart" part of the compiler in the chip itself, or do we still run relatively simple ISAs?

    • PAPPPmAc 2 years ago

      I argue about this a lot. Some reasonably substantiated opinions:

      1. Highly sophisticated large-scale static analysis keeps getting beaten by relatively stupid tricks built into overgrown instruction decoders, working on relatively narrow windows of instructions.

      2. The primary reason for (1) is that performance is now almost completely dominated by memory behavior, and making good static predictions about the dynamic behavior of fancy memory systems, in the face of multitasking, DRAM refresh cycles, multiple independent devices competing for the memory bus, layers of caches, timing variations, etc., is essentially impossible. (The pointer-chasing sketch after this list is a tiny illustration.)

      3. You can give up on a bunch of your dynamic tricks and build much simpler, more predictable systems that can be statically optimized effectively. You could probably find a good local maximum in that style. The dynamic tricks are, however, unreasonably effective for performance, and have the advantage that they let you have good performance with the same binaries on multiple different implementations of an ISA. That's not insurmountable (e.g. the AOT compilation for ART objects on Android), but the ecosystem isn't fully set up to support that kind of thing.
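
      As a toy illustration of (2), take a pointer-chasing loop (plain C, not tied to any particular ISA): the latency of each load depends on whatever happens to be resident in the cache hierarchy at that instant, so a compile-time schedule has nothing dependable to plan around, while an out-of-order core just waits on that one load and keeps the rest of the machine busy.

        /* Each iteration's load latency depends entirely on runtime cache
           state (and on every other tenant of the memory system). */
        struct node { struct node *next; long value; };

        long sum_list(const struct node *n) {
            long total = 0;
            while (n) {
                total += n->value;  /* may hit L1, or stall for hundreds of cycles */
                n = n->next;        /* next address unknown until this load completes */
            }
            return total;
        }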

      • pjmlp 2 years ago

        Note that AOT compilation on Android is mixed with JIT and PGO metadata: the generated AOT binary covers only a subset of the application.

        Changes in the execution flow, or app updates, render the generated binary invalid, and another cycle begins: the assembly-based interpreter, then the JIT gathering PGO metadata, and finally a new AOT compilation when the device is idle.

    • AnimalMuppet 2 years ago

      By putting it on the chip, the scheduling can be dynamic rather than static. The microcode can know a lot more about what's going on than the compiler can.

  • sytse 2 years ago
    • speps 2 years ago

      Why would you do that without giving credit?

      • bobloblaw724449 2 years ago

        It's fine, it's Sid (he's a good guy).

        • generalizations 2 years ago

          Except when he palms off other people's ideas as his own.

          • bobloblaw724449 2 years ago

            It's only an HN comment and I don't see why it honestly matters. At the end of the day, more people will see his tweet and learn about these failed architectures than some random comment on some random HN post. Significantly more people read Twitter than HN.

            The way you're reacting to this is like it's 2007 and he stole the blueprints to the iPhone.

chasil 2 years ago

It is amazing how many failures Intel has survived, and that their core competence really emerged from the Datapoint 2200.

nullc 2 years ago

The iAPX 432's security features would be welcome in the computing world we have today. I wonder to what extent its failures doomed similar functionality at Intel?

At least there is CHERI now, but we still seem nowhere close to having hardware-enforced, capability-grade security in high-performance server kit.

  • kps 2 years ago

    The i960 MX (née BiiN) had a similar tagged-memory capability system along with a fairly pleasant RISC instruction set.

  • pjmlp 2 years ago

    Not capabilities, but Solaris on SPARC has tamed C for quite some time now, thanks to ADI (Application Data Integrity).

mattst88 2 years ago

I remember reading an article about the iAPX 432 that went into extensive detail about the compounding effects of the design. I recall it describing how an operation with a small constant operand would be slow because the ISA didn't support immediates, so you'd have to load the constant from memory, and there wasn't even a cache to help with that.

Does anyone know this article? I've searched and haven't been able to find it, and it was definitely worth a read.

  • kps 2 years ago

    > the ISA didn't support immediates

    I don't know the article, but have a related story. In the '90s I worked for a custom compiler shop, and a company you've heard of (not Intel) came to us with a system they wanted tools for. They had gone all-in on RISC — operations were all register-to-register, and the only memory addressing was register indirect (i.e. through an address in a register). We had to point out that it would be rather difficult to get an address into a register in the first place.

    • andrewf 2 years ago

      Could you do it with shifts and increments? Constant loads would look just like multiplies, a glorious RISC apotheosis...

      • kps 2 years ago

        Yes, you could get 0 by subtracting (or XORing) a register with itself, then -1 by complementing, then 1 by negating, then any single bit by repeatedly adding it to itself (doubling). Then synthesize any constant by adding those together. The code would be impractically slow and large, though.
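
        Spelled out in C for concreteness, with each statement standing in for one register-to-register instruction (the target value 41 is just an arbitrary example):

          /* Build the constant 41 (0b101001) with no immediates and no loads,
             using only the register-to-register tricks described above. */
          unsigned synthesize_41(unsigned r) {
              unsigned zero = r ^ r;        /* XOR a register with itself -> 0  */
              unsigned minus1 = ~zero;      /* complement                 -> -1 */
              unsigned one = -minus1;       /* negate                     ->  1 */
              unsigned bit3 = one + one;    /* keep doubling: 2, 4, 8           */
              bit3 = bit3 + bit3;
              bit3 = bit3 + bit3;
              unsigned bit5 = bit3 + bit3;  /* 16, then 32                      */
              bit5 = bit5 + bit5;
              return one + bit3 + bit5;     /* 1 + 8 + 32 = 41                  */
          }

        Ten instructions to materialize one small constant, which is the "impractically slow and large" part.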

  • linksnapzz 2 years ago

    I think I've read the same article, and also wish I had the reference. I do remember that there were few or no registers, and reads were from memory almost all the time.

    Also, does anyone know of an actual system that shipped with a 432? Like, manufacturer and model #?

    • auroralimon 2 years ago

      No. I seem to recall only Intel had a board in an S100(?) chassis? I think that’s the one I had access to.

  • mattst88 2 years ago

    bcantrill linked to it separately in the comments: http://dtrace.org/blogs/bmc/2008/07/18/revisiting-the-intel-...

    The particular parts I was recalling are:

    > The upshot of these decisions is that you have more code (because you have no immediates) accessing more memory (because you have no registers) that is dog-slow (because you have no data cache) that itself is not cached (because you have no instruction cache). Yee haw!

    Awesome.

  • auroralimon 2 years ago

    I have its programmer's guide around here somewhere.

  • Lammy 2 years ago

    Could it be https://homes.cs.washington.edu/~levy/capabook/Chapter9.pdf ?

    Sorry for the huge quote, but it's from a huge article:

    =======================================================

    From section 9.2, Segments and Objects:

    > All objects are addressed through capabilities which, on the Intel 432, are called access descriptors (ADs). (The vendor’s terminology is used in this chapter for compatibility with Intel literature. The notation “AD” is used throughout for “capability.”)

    > At the lowest level, objects are composed of memory segments, and a memory segment is the most fundamental object (called a generic object on the Intel 432). Each Intel 432 segment has two parts: a data part for scalars and an access part for ADs, as shown in Figure 9-2. Objects requiring both data and access descriptors can be stored in a single segment. Segments are addressed through ADs, as the figure illustrates. The data part grows upward (in the positive direction) from the boundary between the two parts, while the access part grows downward (in the negative direction) from the dividing line. The hardware ensures that only data operations are performed on the data part and that AD operations are performed on the access part.

    =======================================================

    From section 9.4.3, Instruction Operand Addressing:

    > At any moment during a procedure’s execution, ADs specified by instructions must be located in one of four environment objects. Environment object 0 is the context object itself. Instructions can specify any of the ADs within the context object’s access part; for example, to refer to the domain or the constants data segment. The three remaining environments, environments 1 through 3, are defined dynamically by the procedure.

    > Instruction objects contain only a data part. Because Intel 432 instructions are bit-addressable and can start on arbitrary bit boundaries, instructions are addressed as bit offsets into instruction objects. The first instruction in each instruction object begins at bit displacement 64, following the header of four 16-bit predefined fields. The maximum size of an instruction segment is 64K bits, or 8K bytes, due to the bit addressing. Although there is generally one instruction object for each procedure in the domain, procedures larger than 8K bytes require additional instruction objects. The BRANCH INTERSEGMENT instruction can be used to transfer control to another instruction object within the same domain.

    > The four environment segments thus provide efficient addressing of ADs. An instruction can specify an immediate 4- or 8-bit access selector describing the location of an AD for an operand. Or, it can specify the location of a 16-bit access selector located in memory or on the stack. The short direct format efficiently addresses any of the first four ADs in any of the four environments. This includes the ADs for the global constants, context message (calling parameters), and current domain within the current context. All of the processor-defined ADs within the context object’s access part can be addressed using an 8-bit access selector.

    =======================================================

    Unrelated, but I love how they went for the "As Above, So Below" approach to growing the data vs. access parts of instruction object memory ^
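
    To check my own understanding of the section 9.2 layout, here is a rough C model of the two-part segment. Every name, type, and size below is made up; nothing matches the real 432 encoding, it only mirrors the structure the quote describes (data part growing up from the boundary, access part growing down, with the hardware keeping the two kinds of operations apart).

      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Toy model of a 432-style segment: one block of storage with a boundary.
         The data part grows upward from the boundary; the access part (the ADs,
         i.e. the capabilities) grows downward from it. */
      typedef struct { uint32_t object_index; uint32_t rights; } AccessDescriptor;

      typedef struct {
          uint8_t *base;      /* start of the segment's storage               */
          size_t   boundary;  /* byte offset of the data/access dividing line */
          size_t   data_len;  /* bytes in use above the boundary              */
          size_t   ad_count;  /* ADs in use below the boundary                */
      } Segment;

      /* Only data operations are allowed on the data part...                 */
      uint8_t read_data(const Segment *s, size_t i) {
          assert(i < s->data_len);        /* the 432 enforces this in hardware */
          return s->base[s->boundary + i];
      }

      /* ...and only AD operations on the access part, which grows downward.  */
      AccessDescriptor read_ad(const Segment *s, size_t j) {
          assert(j < s->ad_count);
          const AccessDescriptor *top = (const AccessDescriptor *)(s->base + s->boundary);
          return top[-(ptrdiff_t)(j + 1)];
      }

    The asserts stand in for the checks the quote says the hardware performs; on the real part the ADs are protected objects, not plain struct fields.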

jecel 2 years ago

An interesting experiment would be to re-implement the iAPX 432 with the same resources as today's high-end processors (caches, branch prediction, out-of-order execution, etc.). A simulation would be good enough to run some benchmarks (though the old Ada compilers might be worth improving a bit as well).

My guess is that the performance would be very close to an ARM or x86 with the same silicon area.

twoodfin 2 years ago

I keep hoping @bcantrill & the Oxide crew will do a Twitter space on the i432, or perhaps failed architectures generally.

  • bcantrill 2 years ago

    We would love to! Maybe we could convince Robert Colwell to join us, as his paper on the 432 is one of my favorite systems papers of all time![0]

    [0] http://dtrace.org/blogs/bmc/2008/07/18/revisiting-the-intel-...

    • jaykru 2 years ago

      Rob was happy to chat with me about his 432 paper over LinkedIn (cold DM'd him) for a semester project I did on it a few months back. He might go for a podcast episode :) I'd love to listen to it!

auroralimon 2 years ago

I remember trying to use it with an Ada compiler at school. I was the only person on campus who had Ada exposure, and it was a remarkably slow machine; it could not get out of its own way. Very ambitious in architecture, but too slow and too far ahead of the software of the time. I do not recall actually -running- any programs on it. There might have been a monitor running (not a real OS, just a BIOS-like shell thing).