froh 2 months ago

Jacobi is one of 70 IBM Fellows (think IBM-internal professors with free rein over a research budget; you earn the title through technical prowess plus business acumen)

at the heart of the Mainframe success is this:

> I’d say high-availability and resiliency means many things, but in particular, two things. It means you have to catch any error that happens in the system - either because a transistor breaks down due to wear over the lifetime, or you get particle injections, or whatever can happen. You detect the stuff and then you have mechanisms to recover. You can't just add this on top after the design is done, you have to be really thinking about it from the get-go.

and then he goes into detail about how that is achieved, and the article covers it nicely.

oh and combine the "nine nines" 99.9999999% availability with insane throughput. as in real-time phone wiretapping throughput, or real-time mass financial transactions, of course.

or a web server for an online image service.

or "your personal web server in a mouse click", sharing 10,000 such virtual machines on a single physical machine, with a shared read-only /ist partition mounted into all guests. not containers, no, virtual machines, circa 2006...

"don't trust a computer you can lift"

  • wolf550e 2 months ago

    The amount of throughput you can get out of AMD EPYC Zen 5 servers for the price of a basic mainframe is insane. Even if IBM wins on single-core perf with an absurd amount of cache and an absurd cooling solution, total rack throughput is definitely won by "commodity" hardware.

    • neverartful 2 months ago

      These comments always come up with every mainframe post. It's not only about performance. If it were, it would be x86 or pSystems (AIX/POWER). The reason customers buy mainframes is RAS (reliability, availability, serviceability). Notice that performance is not part of RAS.

      • jiggawatts 2 months ago

        You and the parent are both "missing the point", which is sadly not talked about by the manufacturer either (IBM).

        I used to work for Citrix, which is "software that turns Windows into a mainframe OS". Basically, you get remote thin terminals the same as you would with an IBM mainframe, but instead of showing you green text you get a Windows desktop.

        Citrix used to sell this as a "cost saving" solution that inevitably cost 2-3x as much as traditional desktops.

        The real benefit for both IBM mainframes and Citrix is: latency.

        You can't avoid the speed of light, but centralising data and compute into "one box" or as close as you can get it (one rack, one data centre, etc...) provides enormous benefits to most kinds of applications.

        If you have some complex business workflow that needs to talk to dozens of tables in multiple logical databases, then having all of that unfold in a single mainframe will be faster than if it has to bounce around a network in a "modern" architecture.

        In real enterprise environments (i.e.: not a FAANG) any traffic that has to traverse between servers will typically use 10 Gbps NICs at best (not 100 Gbps!), have no topology optimisation of any kind, and flow through at a minimum one load balancer, one firewall, one router, and multiple switches.

        Within a mainframe you might have low double-digit microsecond latencies between processes or LPARs; across an enterprise network, between services on independent servers, it's not unusual to see well over one millisecond -- one hundred times slower.
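        That latency gap compounds with chatty workflows. A back-of-envelope sketch (the query count is my own assumed illustration; the round-trip figures are the ones quoted above):

```python
# Model a business transaction issuing sequential queries, comparing
# intra-box round trips (~20 us between LPARs) with enterprise-network
# round trips (~1 ms). QUERIES is an assumed number for illustration.
QUERIES = 50                 # sequential DB calls in one workflow (assumed)
RTT_MAINFRAME_US = 20        # low double-digit microseconds within the box
RTT_NETWORK_US = 1000        # well over one millisecond across the network

mainframe_ms = QUERIES * RTT_MAINFRAME_US / 1000
network_ms = QUERIES * RTT_NETWORK_US / 1000
print(f"in-box: {mainframe_ms} ms, networked: {network_ms} ms")
```

        Fifty serial hops turn 1 ms of total waiting into 50 ms before any actual compute happens.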

        This is why mainframes are still king for many orgs: They're the ultimate solution for dealing with speed-of-light delays.

        PS: I've seen multiple attempts to convert mainframe solutions to modern "racks of boxes" and it was hilarious to watch the architects be totally mystified as to why everything was running like slow treacle when on paper the total compute throughput was an order of magnitude higher than the original mainframe had. They neglected latency in their performance modelling, that's why!

        • neverartful 2 months ago

          The mainframe itself (or any other platform for that matter) is not magical with regards to latency. It's all about proper architecture for the workload. Mainframes do provide a nice environment for being able to push huge volumes of IO though.

          • throw0101c 2 months ago

            > The mainframe itself (or any other platform for that matter) is not magical with regards to latency.

            Traveling at c, if a signal travels 300 mm (30 cm; 12") that is one nanosecond. And data signals do not travel over fibre or copper at c, but slower. Plus add network device processing latency. Now double all of that to get the response back to you.

            When everything is within the distance of one rack, you save a whole lot of nanoseconds just by not having to go as far.
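            The arithmetic above as a small sketch (the 2/3-of-c velocity factor is a typical assumed value for fibre and copper; propagation only, device processing latency comes on top):

```python
C_MM_PER_NS = 300        # light travels ~300 mm per nanosecond in vacuum
VELOCITY_FACTOR = 2 / 3  # signals in fibre/copper travel slower than c

def round_trip_ns(distance_m: float) -> float:
    """One-way distance in metres -> round-trip propagation delay in ns."""
    one_way_ns = distance_m * 1000 / (C_MM_PER_NS * VELOCITY_FACTOR)
    return 2 * one_way_ns

print(round_trip_ns(1))    # within a rack: ~10 ns round trip
print(round_trip_ns(100))  # across a data centre: ~1000 ns round trip
```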

            • jiggawatts 2 months ago

              More to the point, transmitting a 1500 byte packet at a given network data rate takes time. At 10 Gbps that's roughly 2.5 microseconds for the round-trip even for a hypothetical "zero length" cable.
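              A quick sketch of that serialization delay (assuming standard Ethernet framing overhead of 38 bytes of headers, preamble, and inter-frame gap per 1500-byte payload):

```python
WIRE_BYTES = 1500 + 38   # payload + header/preamble/inter-frame-gap overhead

def serialization_us(link_gbps: float) -> float:
    """Time to clock one full-size frame onto the wire, in microseconds."""
    return WIRE_BYTES * 8 / (link_gbps * 1000)  # 1 Gbps = 1000 bits per us

print(2 * serialization_us(10))   # round trip at 10 Gbps: ~2.5 us
print(2 * serialization_us(100))  # round trip at 100 Gbps: ~0.25 us
```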

              Then add in the switching, routing, firewall, and load balancer overheads. Don't forget the buffering, kernel-to-user-mode transitions, "work" such as packet inspection, etc...

              The net result is at least 50 microseconds in the best networks I've ever seen, such as what AWS has between modern VM SKUs in the same VPC in the same zone. Typical numbers are more like 150-300 microseconds within a data centre.[1]

              If anything ping-pongs between data centres, add about one millisecond per hop.

              Don't forget the occasional 3-way TCP handshake plus the TLS handshake plus the HTTP overheads!

              I've seen PaaS services talking to each other with ~15 millisecond (not micro!) latencies.

              [1] It's possible to get down to single digit microseconds with Infiniband, but only with software written specifically for this using a specialised SDK.

          • jiggawatts 2 months ago

            Again, missing the point. Just look at the numbers.

            Mainframe manufacturers talk about "huge IO throughput", but a rack of x86 kit with ordinary SSD SAN storage will have extra zeroes on the aggregate throughput. Similarly, on a bandwidth-per-dollar basis, Intel-compatible generic server boxes are vastly cheaper than any mainframe. Unless you're buying the very largest mainframes, a single Intel box will practically always win for the same budget. E.g.: just pack it full of NVMe SSDs and enjoy ~100 GB/s cached read throughput on top of ~20 GB/s writes to remote "persistent" storage.

            The "architecture" here is all about the latency. Sure, you can "scale" a data centre full of thousands of boxes far past the maximums of any single mainframe, but then the latency necessarily goes up because of physics, not to mention the practicalities of large-scale Ethernet networking.

            The closest you can get to the properties of a mainframe is to put everything into one rack and use RDMA with Infiniband.

            • Spooky23 2 months ago

              You have to think of the mainframe as a platform like AWS or Kubernetes or VMWare. Saying “AWS has huge throughput” is meaningless.

              The features of the platform are the real technical edge. You need to use those features to get the benefits.

              I’ve moved big mainframe apps to Unix or windows systems. There’s no magic… you just need to refactor around the constraints of the target system, which are different than the mainframe.

              • froh 2 months ago

                what you hint at is that most workloads today don't need most of the mainframe features any more, and you can move them to commodity hardware.

                There is much less need for most business functions to sit on a mainframe.

                However the mainframe offers some availability features in hardware and z/VM, which you need to compensate for in software and system architecture, if failure is not an option, business-wise.

                and if your organisation can build such a fail-operational system and software solution, then there is no reason today to stay on the mainframe. it's indeed more a convenience these days than anything else.

                • neverartful 2 months ago

                  I agree with most of this. I believe that mainframes have an advantage when you look at environmental factors (power consumption and cooling).

            • throw4950sh06 2 months ago

              > The closest you can get to the properties of a mainframe is to put everything into one rack and use RDMA with Infiniband.

              Or PCIe... I really would like to try building that.

              • jiggawatts 2 months ago

                I'm fairly certain you can't create a "mesh" with PCIe between multiple hosts. It's more like USB instead of Ethernet.

                • throw4950sh06 2 months ago

                  Don't treat the CPU board as the host but a peripheral. ;-)

                  But I'd say it's much closer to Ethernet than USB. You have controllers (routers), switches and nodes... USB doesn't, not like this.

                • pezezin 2 months ago

                  You can't with standard PCIe, but you will be able to do it with CXL, although I don't know of any server platform that uses it yet.

        • mmooss 2 months ago

          > The real benefit for ... Citrix is: latency.

          I understand your point about big iron, but where does something like Citrix reduce latency?

          My estimation would be: Compared to a desktop Citrix adds distance and layers between the user and the compute/storage/etc., and the competition for resources on the Citrix server would tend to increase latency compared to the mostly idle desktop.

          • Spooky23 2 months ago

            Think WAN latency. Your shitty VB6 app using some awful circa 2002 middleware dies when you have 80ms of latency between you and the server.

            Insert Citrix, and the turd app is 3ms away in the data center and works.

            I used to run a 100k user VDI environment. The cost was easily 4x from a hardware POV, but I had 6 guys running it, and it was always consistent.

            • mmooss 2 months ago

              Good points, thanks. But please tell me you didn't have 100k people using that VB6 app with the middleware!

        • le-mark 2 months ago

          I’d love to read more about these projects. In particular, were they rewrites, or “rehosting”? What domain and what was the avg transaction count? Real-time or batch?

          • jiggawatts 2 months ago

            Citrix is almost always used to re-host existing applications. I've only ever seen very small utility apps that were purpose designed for Citrix and always as a part of a larger solution that mostly revolved around existing applications.

            Note that Citrix or any similar Windows "terminal services" or "virtual desktop" product fills the same niche as ordinary web applications, except that Win32 GUI apps are supported instead of requiring a rewrite to HTML. The entire point is that existing apps can be hosted with the same kind of properties as a web app, minus the rewrite.

            • le-mark 2 months ago

              I was referring to mainframe migrations, sorry that wasn’t clear.

              • jiggawatts 2 months ago

                I watched two such mainframe to “modern” architecture transitions recently, one at a telco and one at a medical insurance company. Both replaced what were billing and ERP systems. Both used Java on Linux virtual machines using an n-tier service oriented architecture. Both had a mix of batch and interactive modules.

                Both suffered from the same issue, which is actually very common but almost nobody seems to know about: power-efficiency throttling of CPU speeds.

                The irony was that the new compute platform had such a huge capacity compared to the old mainframe (20x or more) that the CPUs were only about 1% utilised. The default setting on all such servers is to turn cores off or put them into low-power modes as slow as 400 MHz. This murders performance and especially slows down the network because of the added latency of cores having to wake up from deep sleep when a packet arrives.

                It was one of those situations where running a busy-loop script on each server would speed up the application because it keeps everything “awake”.
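                The busy-loop hack has a more direct equivalent, sketched here assuming a Linux host with the standard cpufreq sysfs interface (the paths and kernel parameters below are the stock ones, not anything site-specific):

```shell
# Show the current frequency-scaling governor for core 0
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Pin every core to the "performance" governor so cores stay at full
# clock instead of dropping to low-power states between packets
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$gov" >/dev/null
done

# To also limit deep C-states (the wake-up latency source), boot with:
#   intel_idle.max_cstate=1 processor.max_cstate=1
```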

                The telco doubled their capacity as an attempt to fix the issues but this took them to 0.5% utilisation and things got worse.

                The health insurer also overcomplicated their network, building a ~100 server cluster as if it was the public cloud. They had nested VLANs, address translation, software defined networking, firewalls between everything, etc… Latency was predictably atrocious and the whole thing ran like it was in slow motion. They too had the CPU throttling issue until I told them about it but the network was so bad it didn’t fix the overall issue.

    • rbanffy 2 months ago

      Maybe, but then you need to engineer the 99.99999% uptime yourself.

      If it were actually cheaper, IBM wouldn’t be selling these machines so well.

      • Muromec 2 months ago

        They are mostly selling to the captive audience who is 40 years deep into COBOL and can't pull out until it falls on top of them.

        • feurio 2 months ago

          Well, you know what to do - just sell a system that runs COBOL on an x86 cluster or Kubernetes or whatever, and you'll make billions!

          Be sure to allocate me a bunch of shares for giving you the idea.

        • SideQuark 2 months ago

          That's demonstrably untrue. The "mostly" audience you made up is a shrinking set, and mainframe sales have a higher CAGR than PC sales. If your claim were true, the sales numbers for these machines would not be what they are.

          This is an expanding market, due to the needs for more and larger and faster data processing uses for new tech, new markets, new transactions, new capabilities.

        • pjmlp 2 months ago

          Many of those machines are running Java workloads, also COBOL isn't the only mainframe language, and the best thing in terms of security is that they don't use C as systems language, rather saner stuff with proper arrays, strings and bounds checking.

          That is why Unisys ClearPath MCP is still a thing, tracing back to its Burroughs 1961 heritage, security above all.

          • rbanffy 2 months ago

            Java workloads might be easier to port away from, but aren't MCP machines x86-based nowadays? Unisys did a ton of work for Microsoft, making itself obsolete by contributing a lot of its secret multiprocessor stuff to Windows.

            It’s amazing how Microsoft convinced so many companies to shoot their feet.

            • pjmlp 2 months ago

              They still use NEWP as the systems language; there is no C stuff to exploit.

              • rbanffy 2 months ago

                I had a lot of trouble with the idiosyncratic ways of MCP, and I have no doubt Tron's villain was named after it.

        • rbanffy 2 months ago

          Not just COBOL, but also CICS, IMS, Java, and DB2 (although the last two are easier to migrate away from).

      • wolf550e 2 months ago

        I don't think they are actually selling those machines so well. They have a captive legacy customer base, who else is buying those?

        • neverartful 2 months ago

          From TFA: "Overall, Z is growing very healthily. LinuxONE is the fastest area of growth for us right now."

          However, he didn't elaborate or give any examples. If I were the interviewer, I would have followed it with: "Oh?! Can you provide some examples for the readers who believe that you only sell to captive audiences?"

          • rbanffy 2 months ago

            If I had the need to ridiculously scale up a Linux workload, I’d immediately think LinuxONE or POWER.

            If I had the R&D budget for an on-prem VM hosting box, I’d seriously consider their smaller Express 4, which is not much pricier than a similarly capable x86 machine.

  • bitsandboots 2 months ago

    The heart of the mainframe's success was being first with a real OS and real forward compatibility.

    The heart of the mainframe's continued existence is, as you say, that, combined with risk aversion to changing software that currently works and a lack of people who understand how it even works well enough to change it.

    But the heart of the mainframe's failure to expand is IBM's reluctance to join the modern world in pricing and availability.

    There is no simple way to become a mainframe dev because there is no simple way to get access to modern z/OS. There is an entry fee. And it's hard to be excited about anything they do to the silicon when you know that you don't just buy the system, you're charged for consumption as if it were a cloud device. So nobody with a wallet wants to run anything other than the essentials there, and hence the platform never grows.

  • graycat 2 months ago

    Virtual machine changes since CP67/CMS?

    • froh 2 months ago

      ah, a connoisseur...

      alas the last bottle of that was served in 1972

      the current incarnation is z/VM, which started in 2000. it is rumored to have one or two new features.

DonHopkins 2 months ago

Here's another great historic oral history panel with a live demo of Xerox PARC's Cedar, including Eric Bier, Nick Briggs, Chris Jacobi, and Paul McJones:

Eric Bier Demonstrates Cedar

This interpretive production was created from archival footage of Eric Bier, PARC research scientist, demonstrating the Cedar integrated environment and programming language on January 24, 2019. Cedar was an evolution of the Mesa environment/language, developed at PARC’s Computer Science Laboratory originally for the Xerox Alto. Mesa was modular and strongly-typed, and influenced the later Modula family of languages. Cedar/Mesa ran on the D-machine successors to the Alto (such as the Dorado) and added features including garbage collection, and was later ported to Sun workstations. Cedar/Mesa’s integrated environment featured a graphical window system and a text editor, Tioga, which could be used for both programming and document preparation, allowing for fonts, styles, and graphics to be embedded in code files. The editor and all its commands were also available everywhere, including on the command console and in text fields. The demo itself is running through a Mac laptop remotely logged into Bier’s Sun workstation at PARC using X Windows. Bier demonstrates the Cedar development environment, Tioga editor, editing commands using three mouse buttons, sophisticated text search features, the command line, and the Gargoyle graphics editor, which was developed as part of Bier’s UC Berkeley Ph.D. dissertation. Bier is joined by Nick Briggs, Chris Jacobi, and Paul McJones.

https://www.youtube.com/watch?v=z_dt7NG38V4

jmclnx 2 months ago

I wish IBM would spin off their mainframe business like they did with Kyndryl.

IBM seems to be milking it in order to support their other businesses. I think if spun off, prices could be lowered and mainframes would start growing again.

I think with this AI bubble, if mainframes were not as expensive, they would have seen nice growth.

RcouF1uZ4gsC 2 months ago

Honestly mainframes sound like what on-premise aims for. You get uptime and proactive maintenance and stuff just runs. Yet the machine is on your premise and the data belongs to you.

  • Spooky23 2 months ago

    The magic were the old minicomputers.

    My wife ran a billing system on a Motorola-based AS/400. It was purchased in 1989/90. An IBM CE would drop by and swap parts every now and again when the computer phoned home.

    They retired it in 2013, when a key part for the printer was no longer available.

    • tonyedgecombe 2 months ago

      There used to be a number of vendors selling boxes to connect mass market printers to IBM mid-range and mainframe systems. Axis was the one I worked with but there were many others.

      I think they had all moved on or disappeared by 2013 though.

    • RaftPeople 2 months ago

      > My wife ran a billing system that was a Motorola based AS/400

      The AS/400 has always been based on an IBM CPU, but it did use Motorola CPUs in the IO cards.

mindcrime 2 months ago

                           CEREAL
                    Oh yeah, you want a seriously righteous hack,
                    you score one of those Gibsons man. You know,
                    supercomputers they use to like, do physics,
                    and look for oil and stuff?

                                PHREAK
                    Ain't no way, man, security's too tight. The
                    big iron?

                                DADE
                    Maybe. But, if I were gonna hack some heavy
                    metal, I'd, uh, work my way back through some
                    low security, and try the back door.