kristopolous a year ago

They also have a turnkey product with 256 of these things.

1 exaflop + 144TB memory

https://nvidianews.nvidia.com/news/nvidia-announces-dgx-gh20...

  • jwr a year ago

    As someone who lived through the first wave of supercomputers (I worked with Cray Y-MP models), it makes me very happy to see the second wave. For a while I thought supercomputing was dead and we would just be connecting lots of PCs with a network and calling that "supercomputers".

    I still remember how my mind was blown when I first learned that all of the memory in a Cray Y-MP was static RAM. Transistor-based flip-flops: extremely power hungry, but also very fast. Another way of looking at it is that all of its RAM was what we call "cache".

    This, finally, looks like a supercomputer.

    • com2kid a year ago

      SRAM is so stupid fun to play with.

      All of a sudden you don't care so much about the inefficiencies of walking linked lists or trees. When everything is "already in cache", you can worry less about cache efficient algorithms!

      1-cycle memory access latency is one of the reasons why tiny embedded MCUs can do things with a fraction of the MHz of their larger counterparts.

      Nowadays of course it is all about tons of memory, tons of bandwidth, craptons of compute, and planning the flow of data ahead of time.
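
      A rough back-of-the-envelope of that latency effect (all numbers are illustrative assumptions, not measurements):

          MCU_CLOCK_HZ = 100e6      # assumed 100 MHz MCU running straight out of SRAM
          MCU_LOAD_CYCLES = 1       # 1-cycle SRAM access
          DESKTOP_MISS_NS = 100     # assumed DRAM miss latency on a big out-of-order core

          hops = 1_000_000          # dependent loads, e.g. walking a linked list
          mcu_ms = hops * MCU_LOAD_CYCLES / MCU_CLOCK_HZ * 1e3
          desktop_ms = hops * DESKTOP_MISS_NS * 1e-9 * 1e3

          print(f"MCU walk (all SRAM hits):       {mcu_ms:.0f} ms")      # ~10 ms
          print(f"Desktop walk (all DRAM misses): {desktop_ms:.0f} ms")  # ~100 ms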

      • boringuser2 a year ago

        Only within a very small domain was this ever reasonable.

        • jwr a year ago

          But that's why there's a "super" in "supercomputers"!

          Cray loaded a ton of static memory into their computers, then liquid-cooled the whole thing. Sure, the power requirements were through the roof, and you had a whole huge chiller system which you had to run and hope didn't fail. If it did fail, you really wanted to shut the machine down fast. From what I recall there was also an emergency propeller inside the back case of the Y-MP 2E, and yes, "propeller" is a much better name for that thing than "fan". It would delay the inevitable, although dumping those tens of kW of heat into your server room was not something you ever wanted to do.

          The whole point of all this was that you could do things that you couldn't with "normal" computers. That's why those were called "supercomputers". And I'm so glad that after a hiatus of about 30 years we're getting another wave of exceptional machines, which aren't just bigger PCs.

      • Dylan16807 a year ago

        L3 cache is all SRAM but you can have pretty significant delays accessing it. Even the fastest memory cells will build up significant addressing delays as you increase in scale.

      • creato a year ago

        If you have a “large” SRAM that can be accessed in one cycle, that just means your processor is slow and/or consumes more energy than it should.

        • com2kid a year ago

          > If you have a “large” SRAM that can be accessed in one cycle, that just means your processor is slow and/or consumes more energy than it should.

          MCUs are in this category, lots of embedded stuff, including the two areas I'm familiar with: game controllers and lower spec'd wearables.

          Very low power usage, CPU speed around 100 MHz, so not too slow.

          You can do plenty with 100 MHz and SRAM!

    • bigbillheck a year ago

      > the first wave of supercomputers (I worked with Cray Y-MP models)

      The Y-MP came out in 1988, sixteen years after CRI was founded, which itself was several years after the CDC6600.

      • Retric a year ago

        They were presumably born 20+ years before they started working professionally on the Y-MP, so they could easily have been alive or even a teen in 1964 when the CDC 6600 was released.

        • jwr a year ago

          I worked with the Y-MP models in the 1990s, and no I was not alive in 1964, although I'm not sure how we got there :-)

  • bigyikes a year ago

    I watched Jensen’s announcement for this.

    He calls it the world's largest GPU. It's just one giant compute unit.

    Unlike supercomputers, which are highly distributed, Nvidia says this is 144 TERABYTES of UNIFIED MEMORY.

    My mind still gets blown just thinking about it. My poor desktop GPU has 4 gigabytes of memory. Heck, my whole desktop only has 2 terabytes of storage!

    • coolspot a year ago

      It may be presented as seamless unified memory, but it isn't. The underlying framework still has to figure out how to allocate your data to minimize cross-unit talk. Each unit has an independent CPU, GPU and (V)RAM, but the units are interconnected via a very fast network.
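
      A minimal PyTorch sketch of what that placement concern looks like in practice, assuming a machine with at least two visible GPUs; the layer sizes and device indices are illustrative:

          import torch
          import torch.nn as nn

          dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

          # Keep each block's weights local to one unit; only the (much smaller)
          # activations cross the interconnect.
          block0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to(dev0)
          block1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to(dev1)

          x = torch.randn(8, 4096, device=dev0)
          h = block0(x)       # runs entirely on unit 0
          h = h.to(dev1)      # explicit cross-unit hop: the traffic you want to minimize
          y = block1(h)       # runs entirely on unit 1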

    • XCSme 10 months ago

      I think distributed computing will go away soon. As computers become more powerful, the cost of "distributing" and transferring the data will be more than simply executing everything locally. Yes, you can still split the task amongst different nodes, or give different problems to different nodes, but the use case would mostly be solving distinct problems on each node, rather than splitting the same task across multiple computers.

      Also, with quantum computers, the parallelization/"distribution" of tasks will be done within the same machine, as it can try all solutions at the same time without having to do divide-and-conquer algorithms.

      Also, in the future, the algorithms will be a lot simpler, and we'll just have FPGA-like AI chips where there is no software: the model is implemented directly in the hardware, so each computation is near-instant (just the time it takes to propagate the electrons or light through the circuit).

    • sliken a year ago

      What's old is new again. This is basically an updated Arm version of the Itanium-based SGI Altix.

      Keep in mind unified does not mean uniform, the ram is distributed across all the GPUs.

    • bushbaba a year ago

      There are use cases beyond just ML. SAP HANA could theoretically run on this with greater performance. Same goes for a database. Scaling vertically solves a lot of challenges with distributed ledgers.

    • markus_zhang a year ago

      Is it similar to the mainframe in concept?

      • tiberious726 a year ago

        No, kinda the opposite: mainframes use a host of sophisticated, special-purpose hardware to achieve their tasks (e.g. hardware I/O _channels_), while this is a massively overgrown instance of a single kind of hardware (a vector processor).

        • bitwize a year ago

          The thing that made the lightbulb go off in my head w.r.t. mainframes was understanding that mainframe I/O channels are computers. The mainframe had several dedicated computers that each specifically handled I/O to a terminal, printer, disk or tape drive, punchcard reader, etc. Made I/O programming a breeze, as you just had to tell the channel to read or write, specifying a block of memory to use as a buffer, and the channel would DMA out the data to be written, or DMA in read-in data.

          It also explains why, despite having middling CPU power, mainframes had a reputation for stinkloads of I/O bandwidth so they could process everyone's credit card transactions, airline bookings, and so on: the mainframe's CPU was involved very little in I/O; that was all handled by the channel processors!
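
          A toy Python model of that pattern, just to show the shape of "hand the channel a buffer, keep computing, get notified"; the Channel class and file name are made up for illustration and bear no resemblance to real channel programs:

              import threading, queue

              class Channel:
                  """Stand-in for a channel processor that services I/O on its own."""
                  def __init__(self):
                      self.requests = queue.Queue()
                      threading.Thread(target=self._run, daemon=True).start()

                  def _run(self):
                      while True:
                          op, buffer, done = self.requests.get()
                          if op == "write":
                              with open("spool.dat", "ab") as f:  # stands in for disk/tape/printer
                                  f.write(buffer)
                          done.set()                              # like an I/O-complete interrupt

                  def start_io(self, op, buffer):
                      done = threading.Event()
                      self.requests.put((op, buffer, done))
                      return done

              channel = Channel()
              pending = channel.start_io("write", b"one transaction record\n")
              # ... the "CPU" keeps doing useful work here while the channel moves the data ...
              pending.wait()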

  • jasonjayr a year ago

    I had to look twice at that image. I thought it was a 2-rack-unit device, but no, it's 24 full 42U racks!!

    • kristopolous a year ago

      It's just a rendering. I presume Nvidia wouldn't be announcing something that they haven't made and confirmed, so I wonder why they chose that image.

      Is it just that they haven't assembled a production installation to photograph? Is it possible that their internal instances aren't that presentable?

  • coherentpony a year ago

    > 1 exaflop

    To be clear, this is floating point quarter-precision operations when using the FP8 tensor core arithmetic unit [1].

    [1] https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-su...

    • pezezin a year ago

      Came here to post exactly the same link. Not just that, but 1 exaflops of sparse FP8.

      In comparison, Frontier is 1 exaflops of dense FP64. Try to run this Nvidia system as dense FP64, and its performance will degrade by two orders of magnitude.

      Don't get me wrong, the machine is really impressive, but the advertisement is quite misleading.

  • mirekrusin a year ago

    Oh my, time to upgrade my Pi 4 Model B.

    • weinzierl a year ago

      If you upgrade to a Jetson you get GPU power and you can keep the form factor, win - win.

  • RosanaAnaDana a year ago

    I want to see Linus play Doom Eternal on it.

    • ChuckNorris89 a year ago

      Is Crysis no longer a thing?

      • all2 a year ago

        Crysis is a problem because it is single threaded. That's why it was so hard on computers back in the day.

        • skhr0680 a year ago

          Being single-threaded is why the original Crysis is hard on computers now(!!). When Crysis was being developed people still thought 5-10Ghz was coming Any Day Now.

          • dtech a year ago

            2007 was well into Intel Core territory, Intel had given up on clockspeed-at-all-costs Pentium 4 Netburst, so it was generally accepted that clocks weren't going to keep up as fast as before.

            • skhr0680 a year ago

              Even if Crytek took a year off developing new games after the release of Far Cry, that still puts the start of Crysis development before the announcement of Core (early 2005 IIRC).

          • all2 a year ago

            You might be able to hit 5 GHz by overclocking the appropriate CPU. I remember 10 years ago looking at the liquid cooling setups needed to hit 5 GHz on a core.

            • thomastjeffery a year ago

              Stock turbo on a 13th gen i5 (performance cores) is 5.1 GHz. All you need is good cooling and a stable power supply.

            • LoganDark a year ago

              My 12400F boosts to 5GHz for the majority of each day. It's literally normal for me. I've gotten it to almost 5.3GHz before, but it's not really stable above 5.2GHz.

  • MichaelZuo a year ago

    That could probably all fit in a single semi-trailer. It's amazing how dense computation is getting.

  • fennecfoxy a year ago

    Doesn't this basically shoot up the list of the TOP500 then? I wonder if they offered a >256 configuration; they could be top of the list, easy.

    • mk_stjames a year ago

      TOP500 uses FP64 performance in its ranking. Nvidia's 1 exaflop claim is the ~4 petaflops of FP8/INT8 * 256. FP64 performance of modern Nvidia GPUs is actually far, far less. The ratio to FP32 isn't even 2:1 anymore (not since Pascal, I think) since they realize most machine learning is done with FP32 or less.

      64-bit (or 'double precision') is still king in the HPC world, though, as it is what you will find in large numerical solvers in fields like computational fluid dynamics, nuclear physics, etc.
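
      Back-of-the-envelope using the ~4 PF-per-chip figure above plus an assumed dense FP64 figure in the tens of TFLOPS per chip (illustrative, not a spec):

          chips = 256
          fp8_pflops_per_chip = 4.0      # sparse FP8 tensor-core figure cited above
          fp64_tflops_per_chip = 34.0    # assumed dense FP64 per chip

          print(f"FP8 total:  ~{chips * fp8_pflops_per_chip / 1000:.2f} EFLOPS")   # ~1 EFLOPS
          print(f"FP64 total: ~{chips * fp64_tflops_per_chip / 1000:.1f} PFLOPS")  # ~100x lower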

      • fennecfoxy a year ago

        Ah fair, I should've known. I suppose the precision is still required for scientific purposes. Thankfully ML stuff now gets more appropriate precision for a speed increase.

NickHoff a year ago

My question here is about underlying fab capacity. This chip is made on TSMC 4N, along with the H100 and 40xx series consumer GPUs. I assume Nvidia has purchased their entire production capacity. I also assume that Nvidia is using that capacity to produce the products with the highest margins, which probably means the H100 and this new GH200. So when they release this new chip, does it mean effectively fewer H100s and 4090s? Or is that not how fabrication capacity works?

I'm asking because whenever I look at ML training in the cloud, I never see any availability - either for this architecture or the A100s. AWS and GCP have quotas set to 0, lambda labs is usually sold out, paperspace has no capacity, etc. What we need isn't faster or bigger GPUs, it's _more_ GPUs.

  • ac29 a year ago

    > This chip is made on TSMC 4N, along with the H100 and 40xx series consumer GPUs. I assume Nvidia has purchased their entire production capacity.

    I don't know why you would assume that. Qualcomm has been using TSMC N4 since last year [1]. I'm sure there are other customers as well.

    [1] https://www.anandtech.com/show/17395/qualcomm-announces-snap...

  • huijzer a year ago

    It sounds to me like the GH200 achieves more FLOPS per transistor. So, compute demand will be satisfied more quickly via the GH200 than via "smaller" chips such as the H100.

    Having said that, I don't think we're anywhere near some kind of equilibrium for AI compute. If chip supply magically doubled tomorrow, the large companies would buy it all for their datacenters and have 100% utilization in a few weeks. They all want to train larger models and scale inference to more users.

    • rcme a year ago

      In addition to training larger models, I'm sure there are many use cases that AI could serve that are currently cost prohibitive due to the cost of running inference.

  • danielmarkbruce a year ago

    I'd like bigger GPUs. A trillion-parameter model at 16 bits needs 2,000GB+ for inference, more for training. All kinds of things can be done to spread it across multiple GPUs, downsize to fewer bits, etc., but it's a lot easier to just shove a model onto one GPU.

    We'll likely see more efficiency from bigger GPUs and hopefully more availability as a result.
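
    The arithmetic behind that 2,000GB+ figure, for the weights alone (KV cache, activations, gradients and optimizer state come on top of this):

        params = 1e12                        # a trillion-parameter model
        for bits in (16, 8, 4):
            print(f"{bits:>2}-bit weights: {params * bits / 8 / 1e9:,.0f} GB")
        # 16-bit: 2,000 GB   8-bit: 1,000 GB   4-bit: 500 GB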

    • mcbuilder a year ago

      TBH this is what all ML researcher / engineers have wanted for the past 10 years.

      • 0cf8612b2e1e a year ago

        My question on the very slow growth of available memory: are there technical reasons they cannot trivially build a card with 100GB of RAM (even with lower performance) or has it been a business decision to milk the market for every penny?

        • Dylan16807 a year ago

          High speed I/O pins cost a lot, and GDDR generally has 32 data pins per chip and no way to attach multiple chips to the same pins. So 256 bits and 16GB is hard to exceed by much on that tech. The high end is 384 bits and 24GB.

          There is a mode to attach 16 data pins to each GDDR chip, so with some extra effort you could probably double that to 48GB. Or at least 32GB. Maybe this is a valid niche, or maybe there isn't enough demand.

          The alternative to this is HBM, which can stack up big amounts, but it's a lot more expensive.
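
          A quick sketch of the GDDR capacity arithmetic above, assuming 2 GB (16 Gb) per GDDR package, which is a common density:

              def capacity_gb(bus_bits, pins_per_chip=32, gb_per_chip=2):
                  # one chip per group of data pins, each contributing gb_per_chip
                  return (bus_bits // pins_per_chip) * gb_per_chip

              for bus in (256, 384):
                  print(f"{bus}-bit bus: {capacity_gb(bus)} GB normal, "
                        f"{capacity_gb(bus, pins_per_chip=16)} GB in 16-bit (clamshell) mode")
              # 256-bit: 16 GB / 32 GB    384-bit: 24 GB / 48 GB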

        • Baeocystin a year ago

          I don't disagree with Dylan, but I'm more than willing to bet that the only reason Nvidia's cards (and that's who we're talking about. CUDA is a hell of a moat.) are RAM-starved is that they haven't felt the pressure to do otherwise. AMD has an institutional aversion towards good software. Intel isn't even an also-ran, yet.

          Apple and their unified memory architecture may be the prod needed to get larger amounts of RAM available in single-card solutions. We'll see.

          • shaklee3 a year ago

            Nvidia has had unified memory for more than 6 years. This chip is just a faster interconnect for it.

  • bob1029 a year ago

    > Or is that not how fabrication capacity works?

    Fabs can run multiple complex designs on the same line simultaneously by sharing common tools. For example, photolithography tools can have their reticles swapped out automatically. Obviously, there is a cost to the context switching and most designs cannot be run on the same line as others.

    Ultimately, the smallest unit of fabrication capacity is probably best measured along the grain of the lot/FOUP (<100 wafers).

  • ksec a year ago

    >Or is that not how fabrication capacity works?

    The basics of supply chains and supply and demand, as you all witnessed during COVID with toilet rolls, are the same here.

    Fab capacity is not that different from any other manufacturing. You just need to book that capacity way ahead of time (6-9 months). And that is also why I said 99% of news or rumours about TSMC capacity are pure BS.

    So to answer your question: yes, Nvidia will likely go for the higher-margin products. That is one of the reasons you see Nvidia working with Samsung and Intel.

  • theincredulousk a year ago

    It's my understanding from friends in the business that the actual chips do not represent any capacity issue or bottleneck; it's actually manufacturing the devices that the chips go into (e.g. the finished graphics card).

    • NotSuspicious a year ago

      Why would this be the case? I would naively think that since the chips can only be made in a fab and the rest can be made basically anywhere, it would be the other way around.

      • archi42 a year ago

        They cannot be made "anywhere"; when you can't get that PMIC from the original manufacturer, good luck getting it from someone else. And replacing an IC in a QA-tested, EMC-verified, FCC- and CE-certified device will often mean redoing all of that, possibly requiring additional iterations. That's if a similar part is available at all.

        Take a look at a recent GPU and count the auxiliary components. All of them can cause supply chain difficulties.

    • refulgentis a year ago

      That's...fascinating. There's enough space on TSMC but the PCB is the hard part?

      • Yizahi a year ago

        For example, my company hit manufacturing (production capacity) issues with flash memory, with clock oscillators, and with an auxiliary FPGA. But main chip production was fine the whole time during the chip crisis, as far as I know. So yeah, small critical components totally can be a blocker. Some specific voltage controller is unavailable and suddenly your whole design is paralyzed.

      • idiotsecant a year ago

        PCBs are also full of a bunch of other components, many of which are hard to get hold of right now.

        • davrosthedalek a year ago

          I think that's it. The PCB itself is rather trivial; it's the RAM, but also things like switching regulators (there are others, but then it's a redesign), maybe even stuff like connectors (which don't burn...).

          For a science project, we need to manufacture magnets. It's not easy to find a company that has the right iron right now, and it's hard to get, with long lead times. The supply crisis is real.

  • Tepix a year ago

    I see A100 80GB cloud capacity available on both runpod.io and vast.ai currently.

  • RosanaAnaDana a year ago

    You know, I was wondering this the other day when NVDA's insane run-up happened. I went down the road of trying to figure out if there were even enough silicon wafers, or if there even would be enough wafers in the next five years, to justify that price.

    Unless all the planet does is make silicon wafers: no.

    • xadhominemx a year ago

      Well, you figured wrong. NVDA AI GPUs are a very small % of global foundry supply, and even if volume tripled, they would still be a small % of global foundry supply. NVDA's revenue is high because their gross margins are extreme, not because their volume is high.

    • austinwade a year ago

      Can you go into more detail? So you're saying that at a 200 P/E ratio there isn't even enough wafer supply for NVDA to grow into that valuation, even over 5 years?

      • RosanaAnaDana a year ago

        I mean, you've got the gist of it. I pulled some reports on silicon production, silicon wafer prices and price trends, current fab capacity, etc.

        My back-of-the-napkin math basically suggested that silicon production would need to 4x and fab capacity would need to 4x (neither of which is happening), and NVDA would have to capture all of that to justify their current price. I didn't bother writing it up, just looked at it mostly because I was on the wrong side of that play. It's something worth considering for sure.

        • austinwade a year ago

          Wouldn't NVDA just focus more on high-margin datacenter products in order to grow into those higher earnings despite the wafer limitation? Datacenter-focused products are already starting to surpass gaming, which is their second largest revenue source: http://www.nextplatform.com/wp-content/uploads/2022/05/nvidi...

          It seems to me that yes, while a 200 P/E may be high, they certainly could keep increasing the prices on the already high-margin datacenter products, which get quickly gobbled up by companies no matter what price they are because of the immense demand.

        • refulgentis a year ago

          We're probably ~3 years out from all of those fabs gov'ts funded coming online, right?

          (n.b. that's really good work on your end and I agree with your conclusion, just idly musing about the thing that bugs me, what the heck all these non-leading edge fabs are going to do)

  • fxtentacle a year ago

    I believe availability is low because the GPUs are too expensive so those that need to scale up use the older and much more affordable models.

  • tomschwiha a year ago

    I'm using Runpod and Datacrunch regularly and they seem to always have some available.

samwillis a year ago

This may be a naive question, in "crypto" we saw a shift from GPUs to ASICs as it was more efficient to design and run chips specifically for hashing. Will we see the same in ML, will there be a shift to ASICs for training and inference of models?

Apple already has the "neural" cores; are those more or less what they are?

Could there be a theoretical LLM chip for inference that is significantly cheaper to run?

  • anonylizard a year ago

    Inference is mostly just matrix multiplications, so there's plenty of competitors.

    Problem is, inference costs do not dominate training costs. Models have a very limited lifespan, they are constantly retrained or obsoleted by new generations, so training is always going on.

    Training is not just matrix multiplications, and given the hundreds of experiments in model architecture, it's not even obvious what operations will dominate future training. So a more general-purpose GPU is just a way safer bet.

    Also, LLM talent is in extreme short supply, and you don't want to piss them off by telling them they have to spend their time debugging some crappy FPGA because you wanted to save some hardware bucks.

    • conjecTech a year ago

      The more general the model, the longer the lifetime. And the most impactful models today are incredibly general. For things like Whisper, I wouldn't be surprised if we're already at 100:1 ratio for compute spent on inference vs training. BERT and related models are probably an order of magnitude or two above that. Training infra may be a bottleneck now, but it's unclear how long it will be until improvements slow and inference becomes even more dominant.

      Capital outlays are tied to the derivative of compute capacity, so even if training just flatlines, hardware spend will drop significantly.

      • flangola7 a year ago

        Isn't Whisper self-hosted?

        • conjecTech a year ago

          That's part of my point. There are 100s of organizations using it at scale, but it only needed to be trained once.

    • samvher a year ago

      What would be the set of skills that would put you in the category of LLM talent that is in extreme short supply?

      Just curious what the current bar is here and which of the LLM-related skills might be worth building.

      • anonylizard a year ago

        Being able to train base LLMs. This is currently an alchemical skill since you can't learn it at school. This can be further split into infrastructure engineering (managing GPU clusters ain't easy), data gathering and cleaning (at terabyte scale), the training itself, etc.

        Being very good at fine-tuning for a particular goal. It's much easier to learn fine-tuning, so the standards are higher to stand out.

        Being able to come up with architectural improvements for LLMs, aka the researcher path.

        Wages start at $250k for grads at the big AI companies.

        • bwv848 a year ago

          Funny, you sort of describe me.

          1. For a BERT-scale model, all you need is a good codebase from GitHub (I had some luck with this one [0]) and a few weeks of trial and error. I want to try training T5 or LLaMA, but don't have the resources needed. Of course, training models with more than 100B parameters is another level of labyrinth.

          2. Fine-tuning is mostly about how well you understand the task and the data you are dealing with. Since the BERT paper focuses on the GLUE benchmark, I've become very proficient in fine-tuning on GLUE and eventually got sick of it.

          3. Made some architectural improvements to BERT, got decent results, so I wrote a paper and got rejected because the reviewers wanted a head-on evaluation against some well-funded papers from Google.

          4. Not in my country. Damn, I am envious.

          [0] https://github.com/IntelLabs/academic-budget-bert.

  • JonChesterfield a year ago

    Lots of companies are doing ASICs for machine learning. Off the top of my head, Graphcore, Cerebras, Tenstorrent, Wave. This site claims there are 187 of them (which seems unlikely) https://tracxn.com/d/trending-themes/Startups-in-AI-Processo.... Google's TPU counts and there are periodic rumours about amazon and meta building their own (might be reality now, I haven't been watching closely).

    As far as I can tell that gamble isn't working out particularly well for any of the startups, but that might be money drying up before they've hit commercial viability. I know the hardware is pretty good for Graphcore and Cerebras, with the software proving difficult.

    • bippingchip a year ago

      A lot of companies are indeed trying to build AI accelerator cards, but I would not necessarily call them ASICs in the narrow sense of the word; they are by necessity always quite programmable and flexible: NN workload characteristics change much, much faster than you can design and manufacture chips.

      I would say they are more like GPUs or DSPs: programmable but optimised for a specific application domain, ML/AI workloads in this case. Sometimes people call these ASIPs: application-specific instruction-set processors. While maybe not a very commonly used term, it is technically more correct.

    • foobiekr a year ago

      I have experience with companies doing their own chips. As often as not, what seems like a good idea turns out not to be, because your volume is low, your ability to get to high yield dominates, and that takes both years and talent.

      As a rule, companies should only do their own chips if they are certain they can solve and overcome the COGS problems that low-yield and low-volume penalties entail. If not, you are almost certainly better off just eating the vendor margin. It is very, very unlikely that you will do better.

  • huijzer a year ago

    According to Ilya Sutskever in a podcast that I heard, GPUs are already pretty close to ASIC performance for AI workloads. Nvidia can highly optimize the full stack due to their economies of scale.

    • josephg a year ago

      Right. As I understand it, training on GPUs isn't limited by the speed of matrix multiplications. It's limited by memory bandwidth. So a faster ASIC for matrix operations won't help. It'll just sit idle while the system stalls waiting for data to become available.

      That's why having 96GB (with another 480GB or whatever) available via a high-speed interconnect is a big deal. It means we can train bigger models faster.
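
      A crude roofline-style illustration of the bandwidth point: if each step has to stream the full set of weights, bandwidth caps the rate no matter how fast the matrix units are. All numbers here are rough assumptions for illustration:

          weights_gb = 140        # e.g. a 70B-parameter model at 16 bits
          hbm_gb_per_s = 2000     # rough HBM-class bandwidth on one big GPU
          pcie_gb_per_s = 64      # rough PCIe Gen5 x16 bandwidth

          print(f"Fed from local HBM: ~{hbm_gb_per_s / weights_gb:.0f} passes over the weights per second")
          print(f"Fed over PCIe:      ~{pcie_gb_per_s / weights_gb:.2f} passes over the weights per second")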

  • BobbyJo a year ago

    That shift won't happen until research slows down. Nobody wants to invest significant amounts of money in hardware that will be obsolete in a year.

    • voxadam a year ago

      > Nobody wants to invest significant amounts of money in hardware that will be obsolete in a year.

      When talking about the current ML industry it's more like nobody wants to invest significant amounts of money in hardware that will be obsolete before it's even taped out.

      • kramerger a year ago

        Counterpoint: whoever is the first to make advanced ML affordable, maybe even mobile, has a good chance of dominating multiple markets in the near future.

        • fxtentacle a year ago

          There's open-source tooling to export fully trained AI models for efficient execution on arbitrary CPUs. Most likely, you won't be able to build up any moat in AI inference.
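
          One common path for that export is ONNX; a minimal sketch (the toy model here is just a placeholder, and the resulting file can then be run on CPU with something like ONNX Runtime):

              import torch
              import torch.nn as nn

              # Placeholder network standing in for a fully trained model.
              model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
              example_input = torch.randn(1, 128)

              # Export to ONNX so a CPU-only runtime can execute it without CUDA.
              torch.onnx.export(model, example_input, "model.onnx",
                                input_names=["x"], output_names=["y"])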

    • mechagodzilla a year ago

      What would be obsolete? The underlying operations are almost always just lots of matrix multiplies on lots of memory. Releasing a new set of weights doesn’t somehow change the math being done.

      • BobbyJo a year ago

        To the extent that AI architectures are just matrix multiplications, there is already an ASIC for that: GPUs.

        If you want more efficiency gain than general matrix multiplication hardware, you need to start getting specific about the NN architectures the hardware will support.

      • AstralStorm a year ago

        The amount of memory available is the obsolescence.

  • nynx a year ago

    These are “GPUs” in a pretty stretched sense. They have a lot of custom logic specifically for doing tensor operations.

    • cubefox a year ago

      Which isn't very informative when it isn't clear how close they are to TPUs, which are ASICs.

  • fxtentacle a year ago

    The main benefit of using modern GPUs is that they have large high-bandwidth memory. You need to put a lot of work into optimizations to reach >10% of the peak compute capability in practical use. That means an ASIC won't eliminate the performance bottleneck.

  • xxs a year ago

    ASICs did take on Bitcoin, but not Ethereum. The ASICs provide an optimized compute unit, but they do suck when it comes to memory addressing.

    Effectively it'd require the entire memory controller, the cache, and the scheduling. At that point you've got most of a GPU with a fixed, non-programmable interface for a designated computation. Likely you'd now have to compete for advanced nodes as well.

    • 58x14 a year ago

      I worked with an electrical engineer who disclosed to me a mid-8-figure investment made by a PE firm to develop Ethereum FPGA and ASIC hardware. While he told me they were underwater on that deal, they did achieve (IIRC) 30%+ better performance per watt relative to (at the time) top-of-the-line GPUs.

      I wonder what they’re doing with that hardware now.

      • shiftpgdn a year ago

        Still useful for other coins, though the returns may not be very great.

      • xxs a year ago

        Interesting, what memory did they manage to use - I suppose not HBM.

    • paulmd a year ago

      Ethereum ASICs existed - at least a few were publicly known (Antminer and other brands had a few) but others likely existed on the down-low as well. They were never about optimized compute but rather a cost-optimized way to deploy a shitload of DDR/GDDR channels reliably at minimum cost.

      Ethereum is designed to bottleneck on memory bandwidth (while being uncacheable) so at the end of the day the name of the game is how many memory channels can you slap onto a minimum-cost board. You won't drastically win on perf/w - but as mentioned by a sibling, 30-100% over a fully general-purpose gaming GPU is likely possible, because you don't have to have a whole general-purpose GPU sitting there idling (and it's not a coincidence that gaming GPUs were undervolted/etc to try and bring that power down - but you can't turn everything off). "ASIC-resistance" just means an ASIC is only 1-10x more efficient than a general-purpose device, so general-purpose hardware can still stay in the game. It doesn't mean ASIC-proof, you can still make ASICs and they still have at least some perf/w advantage.

      However, if your ASIC costs $100 to get the same performance as a 3060 Ti, that's a huge win even if you only beat the perf/w by 50%. Particularly since your ASIC is likely way easier and more stable to deploy at scale, and doesn't require a host rig with at least a couple hundred bucks of computer gear to even turn on.

      Only plebs were buying up GPUs from retailers or sniping websites, buying from ebay was for the chumpiest of chumps. Gangsters were buying them from the board partners a truckload at a time, true elites just pay someone to engineer an ASIC and do a small run of them. Eight-figures (as mentioned by a sibling) is plenty, a $50-75m run of ASICs is quite a lot of silicon even on a fairly modern node (and some mining companies were publicly known to be using TSMC 7nm and other very modern nodes). And when you invest that kind of money, you don't flash it around and scare the marks.

  • saynay a year ago

    We are already seeing chips for inference, really. It's how these models are getting into the consumer market. A lot of the big phones have an inference chip (tensor, neural core, etc.), TVs are getting them, and most GPUs have some hardware dedicated to inference (DLSS and super-resolution).

  • lumb63 a year ago

    There are some industries that take advantage of FPGAs alongside CPU(s) to move some computations into hardware for speed gains while maintaining flexibility. Maybe something like that is possible. For an example, look at the Versal chip.

    • synthos a year ago

      Xilinx has a dedicated AI accelerator in the Versal, so calling it "taking advantage of an FPGA" isn't quite accurate. It's really another chip that happens to be co-packaged with the FPGA.

bingdig a year ago

> Grace™ Hopper™

Can anyone with more legal knowledge share how they trademarked the name of Grace Hopper?

  • adsfgiodsnrio a year ago

    Leaving aside the legality, I find it tacky to use the names of dead people in advertisements. Grace Hopper did not endorse this product. We have no idea what she would have thought of Nvidia. Yet the lawyers are now fighting over the right to use her name and legacy to "create shareholder value".

    The worst offender is Tesla, because I'm pretty sure he would have hated that company.

    • gruturo a year ago

      It's tacky but I don't think they are in any way implying an endorsement. Tesla, Ampere, Pascal, Volta, Kelvin, Turing (and quite a few more I can't remember) are all Nvidia architecture names, and are all named after historically important scientists (well, I have my reservations about Kelvin, but that's more personal opinion)

      • mk_stjames a year ago

        For Nvidia architectures it wasn't 'Kelvin', it was 'Kepler', as in Johannes Kepler the astronomer.

        • johntb86 a year ago

          Kelvin (https://en.m.wikipedia.org/wiki/Kelvin_(microarchitecture) ) was an architecture of theirs that was released in 2001.

          • jsheard a year ago

            I think they only started using the scientist codenames publicly starting from Tesla, but yeah they've used them internally almost since the founding of the company. Early on there was Fahrenheit, Celsius, Kelvin, Rankine and Curie, then the well-known Tesla, Fermi, Kepler, etc.

          • mk_stjames a year ago

            Oh wow; I did not realize their architecture naming scheme even went back that far. That was their first one. The earliest I remember hearing them call out the name was Tesla.

            For others' reference, Nvidia's architectures are listed on Wikipedia as follows, in order:

            Kelvin, Rankine, Curie, Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Lovelace + Hopper

      • toth a year ago

        Off topic, but I have to ask. What do you have against Kelvin?

        • mk_stjames a year ago

          Maybe they are a descendant of William Rankine. Maybe there is an epic 150-year Scottish Baron family feud still going.

        • HWR_14 a year ago

          My guess is that Gruturo is Irish or is of Irish descent. Kelvin took public stances against the independence (or "home rule") of Ireland.

    • anaganisk a year ago

      It's just flattery, the same as naming a road: Martin Luther King Drive, Washington Boulevard, etc. They may or may not have approved of all this, but it just signifies that you want them to be remembered in your own way.

      • bradlys a year ago

        I think naming a public road after someone is a lot different than naming a private company's product after someone.

        • geodel a year ago

          Yeah, we need a law against it. A lot of people would support such a law. Could be the defining issue of our times.

          • jrflowers a year ago

            It would be hilarious if Tesla had to change its name to Electricity Cars LLC or whatever

            • birdyrooster a year ago

              If anything, the Tesla bots have trained me to think of Tesla as a power company and not an automobile manufacturer.

        • jytechdevops a year ago

          they're both intended to honor said person, no?

          • anotherman554 a year ago

            If it's an internal name, maybe, but external names are generally intended to manipulate the public via psychology, to make unconscious connections to the product that will make them want it. In other words, "marketing 101".

  • Symmetry a year ago

    Trademarks are always about use in a particular context. Apple has a trademark on computers using the name "Apple" even though that's been a word for a food for centuries. And if you want to produce a line of bulldozers and brand them "Apple Bulldozers", you can do that and get your own trademark on the use of the word "Apple" in that context.

  • dathinab a year ago

    EDIT: To be clear I'm not a legal expert.

    Trademarks are context-specific, and you can trademark "common terms" IF (and at least theoretically only if) they're used in a very narrow use case which by itself isn't confusable with the generic term.

    The best example here is Apple, which is a generic term but trademarked in the context of phone/computer/music manufacturing (and by now a bunch of other things).

    There was, though, an Apple music label, with a bit of back and forth of legal cases (and some IMHO very questionable court rulings), which in the end was settled by Apple buying the trademark rights.

    So theoretically it's not too bad.

    Practically, big companies like Apple, Nvidia and similar can just swamp smaller companies with absurd legal fees to force their win (AFAIK this is Meta's strategy, because I honestly have no idea how they think the term "Meta" for data processing is trademarkable). To make it worse, local courts have often been shown to not properly apply the law in such conflicts if the other party is from another country (one or two US states are infamous for very biased legal decisions in these kinds of cases).

    So yeah, at its core this aspect of the trademark system is not a terrible idea, but the execution is sadly often fairly lacking. And even high-profile cases of trademark abuse often have no consequences if it's a "favorite big company". (For balance, negative EU examples include Lego and its 3D trademark and absurdly biased court rulings, or Ferrero and its Kinder (German for "children") trademark on chocolate.)

    EDIT: also note the two TMs: Grace™ Hopper™. Both "Grace" and "Hopper" are generic terms you can, under some circumstances, trademark and then use together, but while probably legal, you would likely want to avoid trademarking (Grace Hopper)™.

  • mk_stjames a year ago

    Something to do with Grace Hopper being an actual person (although she is deceased) and thus them not being able to trademark the entire name?

    • jabl a year ago

      Hopper is the name of the GPU architecture, and Grace is the name of the CPU. Combining them in a device gets you a "Grace Hopper" superchip.

      (And yes, I'd guess the codenames were chosen back in the day with an eye towards combining them in the same device.)

      • jsheard a year ago

        There's one funny exception to the rule of Nvidia naming their GPU architectures after just the surname of a famous scientist - they always refer to the "Ada Lovelace" architecture using her full name, presumably to avoid association with the other famous Lovelace.

      • ConceptJunkie a year ago

        I think it would be interesting to compare this with the Apple BHA. Although in this case, Admiral Hopper's family would have to take up the fight.

  • crazypython a year ago

    "™" has no legal meaning. "(R)" means a registered trademark.

    • HWR_14 a year ago

      That is not true. (TM) has a legal meaning. It's weaker than an (R), but it is still an enforceable trademark.

      It's similar to creating a work covered by copyright vs. registering it with the copyright office.

      • bdowling a year ago

        Affixing (TM) does not make something an enforceable trademark. Trademark rights are established by use of the mark in commerce. A (TM) is used to put others on notice that a word or design is being used as a trademark, which could potentially be evidence of the intent of an alleged infringer in a subsequent lawsuit. Intent is relevant because it affects damages and other remedies.

  • balls187 a year ago

    They didn’t.

    Grace is a trademark. Hopper is a trademark.

    Hence each term having its own TM.

Aardwolf a year ago

What kind of CPU is the CPU part? The link doesn't say. Is it something like ARM or RISC-V, and can it run a general-purpose OS like Linux?

Do you plug in DDR memory somewhere for the 480GB, or is it already on the board?

EDIT: found answer to my own question in the datasheet: "The NVIDIA Grace CPU combines 72 Neoverse V2 Armv9 cores with up to 480GB of server-class LPDDR5X memory with ECC."

  • tromp a year ago

    Quoting from https://www.nvidia.com/en-us/data-center/grace-cpu-superchip...

    > The NVIDIA Grace CPU Superchip uses the NVIDIA® NVLink®-C2C technology to deliver 144 Arm® Neoverse V2 cores and 1 terabyte per second (TB/s) of memory bandwidth.

    > High-performance CPU for HPC and cloud computing Superchip design with up to 144 Arm Neoverse V2 CPU cores with Scalable Vector Extensions (SVE2)

    > World’s first LPDDR5X with error-correcting code (ECC) memory, 1TB/s total bandwidth

    > 900 gigabyte per second (GB/s) coherent interface, 7X faster than PCIe Gen 5

    > NVIDIA Scalable Coherency Fabric with 3.2TB/s of aggregate bisectional bandwidth

    > 2X the packaging density of DIMM-based solutions

    > 2X the performance per watt of today’s leading CPU

  • greggsy a year ago

    72 Neoverse V2 Armv9 cores.

    Not sure how one interfaces with it, but it presumably runs an approved Linux distro, with a web server at best.

    • whatisyour a year ago

      It's a normal chip, like your x64 chip. You install the ARM variant of your Linux distribution on it and run it natively.

      source: I have one

      • Aardwolf a year ago

        Could one in theory then game on it by installing Steam on Linux? And run all the LLMs with the largest models you want straight from GitHub repositories?

        • whatisyour a year ago

          Yes, it's no different from any other server. If it were a desktop computer, you would connect HDMI from it to a monitor, put GPUs in its PCIe slots, and attach drives via NVMe.

          • Aardwolf a year ago

            Where does one buy one anyway? What does it cost? (I know it'll be some huge amount, just curious)

            • reaperman a year ago

              These types of items are generally only sold to large buyers with established partnerships. Eventually some stock gets allocated to wholesalers like ShopBLT[0] and CDW[1], which break tradition and sell directly to consumers rather than only selling to retailers/system integrators.

              These are for the 80GB versions, currently priced at $30,000 per GPU. It will likely be many months before this 96GB version is available to prosumers, if it ever is at all.

              0: http://www.shopblt.com/cgi-bin/shop/shop.cgi?action=thispage...

              1: https://www.cdw.com/product/nvidia-h100-gpu-computing-proces...

            • horsawlarway a year ago

              Some of the older generations run roughly the cost of a 3-bed/2-bath house in a medium-cost-of-living city.

              E.g., the HGX A100 platforms sold as single servers usually ran around $150k, but could get above $200k depending on loadout.

              Just getting an H100 (just the GPU) right now is ~$40k new.

              There is a reason nvidia's stock is doing so well...

            • geerlingguy a year ago

              "If you have to ask..."

              It is many, many thousands.

              • pixl97 a year ago

                I think more like many tens of thousands.

  • chakintosh a year ago

    LTT posted a video a few days ago from Computex talking a bit in depth about it

tiffanyh a year ago

Dumb questions ...

- Am I wrong in understanding this is a general purpose computer (with massive graphic capabilities)?

- And if so, what CPU is it using (an NVIDIA ARM CPU)?

- And what OS does it run?

  • zucker42 a year ago

    Correct. It's called the Grace Hopper superchip because it uses the Nvidia Grace CPU (which is ARM-based) and the Nvidia Hopper GPU.

    For OS, it will run some form of Linux. I'm not sure if the particular recommended build has been (or will be) publicly released.

ulrikhansen54 a year ago

More powerful chips are great, but NVIDIA really ought to focus some of their best folks on ironing out some of the quirks of using their CUDA software and actually getting stuff to run on their hardware in a simpler manner. Anyone who's ever fiddled with various CUDA device drivers and lining up PyTorch & Python versions will understand the pain.

  • IceWreck a year ago

    The solution is not to install CUDA on your base system, because you need multiple versions of CUDA and some of them are often incompatible with your distro-provided GCC.

    Here is what works for me:

    - Nvidia drivers on base linux system (rpmfusion/fedora in my case)

    - Install nvidia container toolkit

    - Use a cuda base container image and run all your code inside podman or docker

    • codethief a year ago

      I admit it's been a while (2 years) since I last played with Nvidia/CUDA (on Jetson) and back then running CUDA inside Docker was still somewhat arcane, but in my experience, whatever the Nvidia documentation lays out works well until you want to 1) cut down on container image size (important for caches and build pipelines) and, to this end, understand what individual deb packages and libraries do, 2) run the container on a system different from the official Nvidia Ubuntu image.

      Back then the docs were just awful. Has this really changed that much in recent times?

      • shaklee3 a year ago

        Containers have always come in different flavors that represent their sizes and capabilities. For example, runtime containers have the bare minimum to get the application running but none of the debug tools.

      • ulrikhansen54 10 months ago

        The docs are still terrible, coupled with AWS / GCP docs around these things it makes it near impossible to get this stuff to work without investing a significant amount of time.

  • thangngoc89 a year ago

    PyTorch is the most painless one because everything is bundled in the wheel. The latest stable CUDA supported by PyTorch is 11.8, and I have been running it on a CUDA 12.0 machine because CUDA is backward compatible. TensorFlow, on the other hand, requires compilation against the installed CUDA library, and it's truly a pain since I can't change the machine's CUDA version.
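
    A quick way to see what you actually have at runtime when debugging this kind of mismatch (standard PyTorch calls, nothing exotic):

        import torch

        print("CUDA bundled with this PyTorch wheel:", torch.version.cuda)   # e.g. "11.8"
        print("cuDNN version:", torch.backends.cudnn.version())
        print("CUDA available:", torch.cuda.is_available())
        if torch.cuda.is_available():
            print("Device:", torch.cuda.get_device_name(0))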

  • omgJustTest a year ago

    Hardware before software!

    • paulryanrogers a year ago

      ATI/AMD GPUs supposedly have great hardware, hamstrung by less-than-great software. In fact it's the lack of some software features making me hesitate to switch despite major cost savings.

      • mrguyorama a year ago

        AMD drivers are fine if you only care about gaming. There's the occasional idiocy, like the default fan curve for my graphics card refusing to run higher than 70%, so that the card cooks itself if you actually use it and hard-crashes your system or the driver, but eh.

        The real problem is that ROCm is a fucking joke of a pathetic, half-assed, pretend project. Nobody with power at AMD seems to care that nobody can learn machine learning on their hardware and push it in other places, or that the higher VRAM they have recently spent all this time boasting about is literally useless unless you want to play poorly optimized AAA titles ported from the PS5.

        People say it works, but you basically have to be one of the engineers who wrote it to prove that. Good luck getting it to work with Windows, or any hardware that wasn't purpose-built for a cluster partner. It's so stupid. Maybe they genuinely intended to make a real CUDA competitor, but noticed the ways that Nvidia has to artificially segment their market through dumb decisions (the VRAM) and BIOS hacks that didn't work, and just gave up on that path.

      • kapperchino a year ago

        In fact, the AMD "fine wine" effect is just them fixing their drivers after launch.

indymike a year ago

Is there some connection between Nvidia and Admiral Hopper's family that makes it ok to appropriate her identity for their product?

  • vinay427 a year ago

    As far as I can tell they have frequently used the names of famous scientists, such as Kepler, Fermi, Maxwell, Pascal, Turing, Ampere, and Ada Lovelace. This has existed long before Hopper.

  • justinclift a year ago

    > appropriate her identity

    Hmmm, what's the difference between homage and appropriation for things like this?

    • indymike 10 months ago

      Claiming a trademark.

      • justinclift 10 months ago

        That's a good point. If they're trying to trademark the name, that sounds like it'd raise some ethical issues. :)

dauertewigkeit a year ago

I would like to know how fast this is compared to a model sharded across 3 A100s with 32GB of VRAM each.

What I am also interested in is why model sharding has to be done manually. It seems like one should be able to write a framework that takes your forward pass and distributes the layers across the available GPUs automatically. But I haven't come across such a framework yet.
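
The kind of interface I have in mind would look roughly like the device_map convention in Hugging Face transformers; a sketch (it assumes the accelerate package is installed, and the model name is a placeholder):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "some-large-model"   # placeholder; substitute a real checkpoint

    tokenizer = AutoTokenizer.from_pretrained(name)
    # device_map="auto" asks the library to split layers across the visible GPUs
    # (spilling to CPU RAM if needed) instead of you sharding by hand.
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", torch_dtype="auto")

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))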

  • josephg a year ago

    Probably much faster. PCIe is the limiting factor in a 3x A100 setup. Having all that memory directly accessible by the GPU will make a massive difference. Even the massive 480 gigs of CPU memory in this architecture can be accessed many times faster than PCIe allows.

    • touisteur a year ago

      Hopefully if you have 3x A100 you have bridged them with NVLink and that's your bottleneck. If you want more bandwidth you'll need an NVSwitch. Or an H100 GPU (Lovelace didn't get NVLink, not even the A40's little NVLink port - or PCIe 5 for that matter) with next-gen NVLink.

  • simon_acca a year ago

    > take your forward step and distribute the amount of layers on the available GPUs, automatically

    This is part of the value proposition of Mojo, Chris Lattner’s latest project. The compiler infrastructure is still in its infancy, but looks promising: https://www.modular.com/mojo

tppiotrowski a year ago

Kind of curious what WebGL report [1] would be for one of these devices. Does the extra RAM make maximum texture sizes massive and allow for thousands of textures or is that a software limitation?

[1] https://webglreport.com/

fulafel a year ago

What's a superchip? Is it their marketing term for MCM (multi-chip module)?

  • wmf a year ago

    More or less yes.

tmikaeld a year ago

Hm, that's a hefty price, $200K per card.

The Mac Studio M2 Ultra has 192GB of RAM, potentially 188GB available to the GPU, for $5K.

Wouldn't Apple be able to compete with that if they scaled it up?

  • PragmaticPulp a year ago

    The compute capacity of these nVidia parts is significantly higher than that of an M2 Ultra. It’s not just about memory capacity.

    The memory bandwidth of the nVidia GPUs is also significantly higher than the M2 parts.

    The Apple silicone parts are impressive for what they are, but they don’t have a huge efficiency advantage for GPU compute. The full-size GPUs with huge memory buses and a large number of cores are still significantly more powerful.

    There’s also a matter of getting data into and out of the GPUs and across the network, which takes a lot more than 10Gbe

    The Apple silicon is great for running development workloads locally, but it’s significantly slower than full size GPUs.

    I’m kind of surprised at how quickly everyone forgot that Apple’s marketing material greatly exaggerates their GPU performance.

    • anaganisk a year ago

      *Apple Silicon, I'm no grammar Nazi, it made me chuckle.

      • mrguyorama a year ago

        Can you imagine how much Apple would charge for branded boobs?

        • anaganisk 10 months ago

          But for sure they would be impressive for what they are:p

  • ChuckNorris89 a year ago

    Go ahead and ask datacenters why they all use overpriced Nvidia chips for AI training instead of shoving cheaper Mac Studios in there. Their answer might blow your mind.

    Spoiler alert: the CUDA ecosystem, Linux support, and most importantly for data centers, Mellanox high-speed interconnects with virtually infinite scalability and great virtualization support, so they can rent out slices of their HW to customers in exchange for money.

    • kkielhofner a year ago

      A 15 year head start in a category they essentially defined plus an entire generation of executives, developers, and users doesn’t hurt either.

      People complain about the “Nvidia tax” but the hardware is superior (untouchable at datacenter scale) and the “tax” turns into a dividend as soon as your (very expensive) team spends hour after hour (week after week) dealing with issues on other platforms compared to anything based on CUDA often being a Docker pull away with absolutely first class support on any ML framework.

      Nvidia gets a lot of shade on HN and elsewhere but if you’ve spent any time in this field you completely understand why they have 80-90% market share of GPGPU. With Willow[0] and the Willow Inference Server[1] I'm often asked by users with no experience in the space why we don't target AMD, Coral TPUs (don't even get me started), etc. It's almost impossible to understand "why CUDA" unless you've fought these battles and spent time with "alternatives".

      I’ve been active in the space for roughly half a decade and when I look back to my early days I’m amazed what a beginner like me was able to do because of CUDA. I still routinely am. What you’re able to actually accomplish with a $1000 Nvidia card and a few lines with transformers and/or a Docker container is incredible.

      That said I am really looking forward to Apple stepping it up here - I’ve given up on AMD ever getting it together on GPGPU and Intel (with Arc) is even further behind. The space needs some real competition somewhere.

      [0] - https://github.com/toverainc/willow

      [1] - https://github.com/toverainc/willow-inference-server

      • mk_stjames a year ago

        I recently sat and thought about it, and I made the bold claim to a friend that the development of CUDA over the last 15 years, just that activity as a whole, is one of the largest applications of human intelligence ever focused on any one thing in history. It's fewer people than worked on getting to the Moon, but if you measure the project in PhD-person-years it is probably off the charts. And if you include all the engineers and scientists that the program affects, in the same way you would have included all the subcontractors for the Apollo program, it is way, way bigger.

        I think the only reason CUDA isn't talked about like the monumentally important human milestone in technological development that it is, is that it is a pretty abstract thing that is difficult for laypeople to visualize.

        • touisteur a year ago

          The sheer volume of work in 'just' creating and maintaining a C++-like compiler, plus highly performant cuBLAS, cuFFT, cuSOLVER and cuDNN, all by their lonesome in closed source, shows how far one would have to go to even eat parts of their lunch.

          We (who want a real alternative to NVIDIA) either find a way to pool global resources for this kind of effort across all new accelerator architectures, or we wait for them to stumble, Intel-like. Intel and AMD not pooling their resources on this is self-defeating.

        • kkielhofner a year ago

          This is an excellent point.

          I've been accused of being an Nvidia "fanboy" when I touch on this. I attempt to explain:

          "Nvidia made the very hard, very expensive commitment to developing and supporting CUDA 15 years ago when this space was in it's infancy (non-existent). They SUNK incredible resources into this gamble/vision to universally support CUDA on every chip and every platform - for 15 years. They didn't achieve their position through shady dealings or luck, they earned it with a decade and a half of investment, focus, and execution."

          Granted, they do somewhat abuse the position they have now (as often noted on HN and elsewhere). At the risk of whataboutism I ask: "Show me a corporation that wouldn't. Do you think if AMD had their market share they'd be nice, cuddly good guys?" Microsoft, Intel, Nvidia, Standard Oil, the Phoebus cartel, etc - it goes on and on. Always has and always will. Of course I'm not saying it's a good thing, it's just a fact of the real world.

          • mrtranscendence a year ago

            So what if other companies would similarly abuse their position? If it doesn't excuse Nvidia's behavior in some way then it hardly seems relevant to point out. We still need more competition in this space, or Nvidia parts will become increasingly bad deals (as we've seen for consumer GPUs).

            • kkielhofner a year ago

              I'm not saying it's a good thing, I'm simply pointing out that it's more-or-less universal behavior for people and corporations to abuse positions of power unless checked externally - antitrust action, checks and balances within government, other regulation, etc. As noted, the recurring tendency on HN and elsewhere to paint Nvidia as some kind of uniquely diabolical actor is very strange.

              How are Nvidia consumer GPUs a bad deal? Comparing top-of-the-line cards (I'm not going to bother looking elsewhere in the product lines), for 60% more cost (yes, significant) an RTX 4090 gets you:

              - Performance that walks all over the 7900 XTX[0].

              - The ability to self-host, experiment with, and learn from a never-ending range of ML applications that (as discussed) "just work" and are a Docker pull away.

              With an RTX 4090 you could have stunning gaming performance one minute and seconds later be running a local LLM, etc. That is tremendously more value than the 60% price difference. If all I wanted to do was game and I was price sensitive, I'd save myself the $600 and be happy with an AMD GPU. But looking at market share[1] (at least 80% across desktop gaming and GPGPU), either the value of AMD GPUs is a little-known secret (it isn't) or consumers, free to choose, overwhelmingly see the value in Nvidia GPUs.

              [0] - https://techguided.com/7900-xtx-vs-rtx-4090/

              [1] - https://wccftech.com/q3-2022-discrete-gpu-market-share-repor...

      • A4ET8a8uTh0 a year ago

        << I'm often asked by users with no experience in the space why we don't target AMD, Coral TPUs (don't even get me started), etc. It's almost impossible to understand "why CUDA" unless you've fought these battles and spent time with "alternatives".

        Would you be willing to elaborate (i.e., I would love to hear you get started)? I absolutely agree that some competition is needed in this space. I am absolutely not an expert, so it is hard for me to understand why there is no real alternative to CUDA. Are they just too hard to set up? Not popular enough to have any support?

        • kkielhofner a year ago

          Probably the best reference is HN itself. Look at the dozens of GPGPU projects, articles, papers, etc that hit the front page in any given week. Then look to see how many of them support AMD/ROCm. Spoiler alert: virtually zero.

          That's compelling enough to justify the position here, but you can do further research to explore the challenges with other platforms (like ROCm). Just glance at the issue trackers for PyTorch, TensorFlow, and higher-level projects that (rarely) support ROCm - you will notice a clear trend. Even though CUDA is more capable and outnumbers AMD use 10:1, the issues currently open are:

          PyTorch:

          - Search for ROCm: 2,557 open issues (~10% market share)

          - Search for CUDA: 5,430 open issues (at least 80% market share)

          Even in the limited cases where it's attempted, the capability and experience with AMD/ROCm are significantly worse, to the point of "almost no one even bothers anymore" (see my top paragraph).

    • adsfgiodsnrio a year ago

      A "prosumer" desktop is just plain not up to the job. The Mac Studio is comparable to a high-end gaming PC. It is not really a workstation and definitely isn't a server. Its 16 performance cores and 192 GiB of non-ECC memory cannot compete with nearly 200 cores and multiple terabytes. Its GPU compute and memory bandwidth are many times less than what Nvidia can build at the high end.

      The Studio would run large workloads many times slower than a high-end server, if it could run them at all.

    • edf13 a year ago

      What is the answer?

      • sterlind a year ago

        Compute. The NVIDIA cards have way more tflops than the Mac SoC. A simple RTX 4090 has 16,384 CUDA cores. The M2's GPU has... 38 cores, apparently? The M2 GPU clocks in at 3.6tflops, compared to ~100tflops for the RTX 4090.

        And this is just for a consumer GPU, I haven't even touched on the datacenter-grade stuff.

        tl;dr the M2 is an underpowered GPU with a lot of RAM close by, while NVIDIA cards are multiple orders of magnitude more powerful but most of the RAM's a bit farther away.

        seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.

        • joakleaf a year ago

          • The GPU in the base M2 has 10 GPU cores and 3.6 tflops

          • The GPU in the M2 Ultra has 76 GPU cores and corresponds to 2x M2 Max, which has 13.6 tflops; so 27.2 tflops [1,2]

          • RTX 4090 has 82.58 tflops [3] (overclocked can reach 100 tflops [4])

          While more powerful, NVidia cards are not "multiple orders of magnitude more" powerful. Rather, it seems a 4090 is around 3-4x faster than an M2 Ultra GPU.

          Keep in mind that the Apple Silicon chips also have low-precision Neural Engine circuits for inference of neural nets. For the M2 Ultra they claim 31.6 TOPS [1].

          [1] https://www.apple.com/newsroom/2023/06/apple-introduces-m2-u...

          [2] https://www.notebookcheck.net/Apple-unveils-M2-Pro-and-M2-Ma...

          [3] https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

          [4] https://videocardz.com/newz/overclocked-nvidia-rtx-4090-gpu-...

        • senko a year ago

          > seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.

          Yes, but, if you need 48GB to run inference on a model, and you only have 24GB available, you don't get to enjoy the tflops difference.

          If nVidia released somewhat lower-performance GPUs but with more VRAM, we could talk. But they're not stupid :)
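
          For context on the 48GB-vs-24GB point above: weight memory alone is roughly parameter count times bytes per weight, before you even count activations or the KV cache. A back-of-the-envelope sketch:

              # Back-of-the-envelope VRAM needed just to hold model weights.
              # Ignores activations, KV cache and framework overhead, so real
              # requirements are higher.
              def weight_gib(params_billion: float, bytes_per_param: float) -> float:
                  return params_billion * 1e9 * bytes_per_param / 1024**3

              for params in (13, 30, 65):
                  print(f"{params}B params: "
                        f"fp16 ~{weight_gib(params, 2.0):.0f} GiB, "
                        f"int4 ~{weight_gib(params, 0.5):.0f} GiB")

              # 13B in fp16 (~24 GiB) is already borderline on a 24 GiB card;
              # 65B in fp16 (~121 GiB) needs multiple GPUs, offloading or quantization.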

          • nightski a year ago

            You can get an A6000 for cheaper than an M2 Ultra. That said it's still only 48GB.

            • senko a year ago

              I'm not sure if it's cheaper when you compare the whole machine price, not just GPU, since you can't buy piecewise with the Mac.

              • nightski a year ago

                The GPU is $4,500. That gives you $2,000 to build the rest of the machine (and you can build a top-of-the-line consumer PC for less than that; I just did, with 128GB of DDR5 and a Zen 4 7950X3D).

                • senko a year ago

                  Cool if you can find it at that price. On my side of the pond, it's around €6,000.

          • sterlind 10 months ago

            yes you do. you can use DeepSpeed. it's only like 2x slower.
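
            Roughly, that means letting DeepSpeed's ZeRO stage-3 offload park the weights in CPU RAM and stream them to the GPU as needed. A very rough sketch, assuming recent deepspeed and transformers (exact config keys and import paths vary by version; the model name is illustrative):

                # Rough sketch of ZeRO-Inference-style CPU offload with DeepSpeed.
                # Typically launched with: deepspeed --num_gpus 1 this_script.py
                import deepspeed
                import torch
                from transformers import AutoModelForCausalLM, AutoTokenizer
                from transformers.deepspeed import HfDeepSpeedConfig

                ds_config = {
                    "fp16": {"enabled": True},
                    "zero_optimization": {
                        "stage": 3,
                        "offload_param": {"device": "cpu", "pin_memory": True},
                    },
                    "train_micro_batch_size_per_gpu": 1,
                }

                name = "facebook/opt-30b"                # illustrative model choice
                dschf = HfDeepSpeedConfig(ds_config)     # must exist before from_pretrained
                model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
                engine = deepspeed.initialize(model=model, config=ds_config)[0]
                engine.module.eval()

                tok = AutoTokenizer.from_pretrained(name)
                inputs = tok("The GH200 is", return_tensors="pt").to("cuda")
                print(tok.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))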

        • argsnd a year ago

          You're not wrong that the RTX 4090 will be much more powerful than an M2 Ultra, but you don't have any idea what you're talking about when you're comparing specs.

          You might be able to compare the number of CUDA cores to the ALU count of the Apple GPUs. I don't know what that is for M2 Ultra yet, but for the 64 core M1 Ultra each core had 16 execution units and each of those had 8 ALUs, for a total of 8,192 ALUs. The M1 Ultra's FP32 performance was in the ballpark of 21tflops - assuming a ~30% improvement in the M2 Ultra that takes us to ~27tflops. Google suggests that for the RTX 4090 it's 83tflops.

        • cypress66 a year ago

          The cuda cores and m2 cores aren't comparable.

          A more appropriate comparison is the fp16 performance.

          It seems to be 27 tflops for the 38 core M2, and 330 for the 4090.

          The more useful figure for training, fp16 with fp32 accumulate, is 165 for the 4090; I don't know the number for the Apple one.

        • threeseed a year ago

          There is a video [1] benchmarking last year's M1 Ultra against a 3080 Ti for one ML use case, and it was 3x slower. Apple's M-series chips do have Neural Engines, which are used, but I'm not sure how they compare to CUDA cores.

          Either way quite a bit better than 30x slower.

          [1] https://www.youtube.com/watch?v=k_rmHRKc0JM

          • jorgemf a year ago

            In that video they run the Linux experiments under Windows in a virtual machine. And I didn't see the model, but I bet I could train a model on a 4090 only 2x faster than on an old 1050 (because I can choose a model whose bottleneck is data transfer rather than the actual computation).

      • throwaway485 a year ago

        I am speculating the answer is that "Nvidia just works", whereas Apple may be more niche and more of a hassle to get working with your preferred frameworks/stacks/tools.

        • less_less a year ago

          Maybe that, but also the Nvidia chips have *vastly* higher performance (see https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-su...). They claim 4 TB/s memory bandwidth and up to 989 single-precision TFlops in tensor mode (67 TFlops for non-tensor ops).

          By contrast, M2 Ultra has 800 GB/s memory bandwidth, 31.6 half-precision TFlops in the Neural Engine, and (extrapolating from https://en.wikipedia.org/wiki/Apple_M2), about 27 single-precision TFlops on the GPU.

          So 5x memory bandwidth, more than double generic throughput, and at least 32x peak tensor throughput. Sure, the Mac Studio uses much less power, but depending on the application that usually doesn't make up for the speed difference.
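
          For anyone checking the arithmetic, the ratios fall straight out of the quoted spec-sheet numbers (peak figures, so take them with salt):

              # Ratios computed from the peak figures quoted above.
              gh200_bw, m2u_bw = 4000, 800            # memory bandwidth, GB/s
              gh200_fp32, m2u_gpu = 67, 27            # non-tensor single-precision TFLOPS
              gh200_tensor, m2u_ane = 989, 31.6       # tensor-mode TFLOPS vs Neural Engine TOPS

              print(f"bandwidth:     {gh200_bw / m2u_bw:.1f}x")       # 5.0x
              print(f"generic FP32:  {gh200_fp32 / m2u_gpu:.1f}x")    # ~2.5x
              print(f"tensor vs ANE: {gh200_tensor / m2u_ane:.0f}x")  # ~31x
              print(f"tensor vs GPU: {gh200_tensor / m2u_gpu:.0f}x")  # ~37x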

  • delfinom a year ago

    Memory is the cheap part at the end of the day. It's about the rest of the chip and why GPUs are fundamentally and radically different from a CPU.

    • vlovich123 a year ago

      Memory is rapidly becoming the expensive part compared with compute, and memory doesn't get cheaper as quickly as compute does. Notice how main memory price/TiB has basically flatlined for the past 10 years:

      https://ourworldindata.org/grapher/historical-cost-of-comput...

      Sure, they’ve gotten a bit faster but it’s still fairly expensive to outfit more and more RAM.

      GPUs though are indeed much more than memory. However, Apple has a unique unified memory model that no one else has matched yet, where the memory is the heart of the machine. That means you don't even need as much bandwidth, because any chip can transparently access the same data at the same speed. That's a pretty powerful design. I doubt Apple will really go into the training side of things because that's not germane to their use cases; training happens in the cloud, where they don't have a presence yet. Inference is, so expect more LLM / Stable Diffusion acceleration. Now if fine-tuning models with additional training becomes a thing, then you'll see acceleration of that modality on Apple's machines. But training won't be a focus, because Apple doesn't care about fickle nerd points that aren't relevant to their business.
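
      For what it's worth, that unified memory is already reachable from mainstream tooling: PyTorch's MPS backend runs ops on the Apple-silicon GPU against the same physical memory pool the CPU uses. A minimal sketch, assuming a recent PyTorch build with MPS support:

          # Sketch: a tensor op on the Apple-silicon GPU via PyTorch's MPS backend.
          # Physically the data sits in the same unified memory the CPU uses.
          import torch

          device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
          x = torch.randn(4096, 4096, device=device)
          y = x @ x
          print(y.device, y.shape)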

  • jakjak123 a year ago

    In datacenters, spare rack space is rather expensive. Plus, the compute per watt of these Nvidia chips is off the scale compared to anyone else's.

  • gmm1990 a year ago

    Apple is probably the only one on the newer TSMC process, too.

  • wahahah a year ago

    192GB (1,536Gb)

  • Synaesthesia a year ago

    Apple could also market their fast efficient chip to datacentres and servers if they were so inclined. They choose not to compete in certain areas, focusing on consumer products. I don't know why but it's their choice.

throwaway290 a year ago

> Global hyperscalers and supercomputing centers in Europe and the U.S. are among several customers that will have access to GH200-powered systems.

So, not for everyone.

  • keyme a year ago

    If you had a machine that prints money, you wouldn't sell it, would you? You would run it yourself.

    It's the same situation as with high-end electronic components over the last 10-15 years. No one can/could produce their own smartphone because the high-end components are sold only to the largest incumbents. A medium-size startup "has the money" to buy these, even in volume. But Qualcomm won't sell.

    The same fate awaits us here in general-purpose computing. Hard to believe, but so was what happened to the electronics supply chain at the time.

  • fennecfoxy a year ago

    Well there are restrictions on certain countries buying systems this powerful with the idea that they might be used for bad things.

    Granted, we can use them for bad things, too, but nobody wants to get stabbed with their own knife.

bilsbie a year ago

Why does it have cpu ram?

  • Synaesthesia a year ago

    Because it still has separate RAM for the CPU and GPU, which is how most PCs work. It's not an integrated circuit which uses shared memory.

    • bilsbie a year ago

      Isn’t this a gpu card though? Should the cpu and system ram be separate?

washadjeffmad a year ago

What's the form factor, just two fused SXM5s?

schlupfknoten a year ago

What would something like that cost?

  • laweijfmvo a year ago

    $200,000

    • drtgh a year ago

      I was going to talk about the power consumption ranging from 450W to 1000W (expensive), but I guess at that price level it won't matter at all.

tmaly a year ago

Can I get one of these for my desktop so I can run the 65B models?

seydor a year ago

Laptops should have this in a year

  • pixl97 a year ago

    Only if you want your lap to catch on fire.

  • Hamuko a year ago

    >Module thermal design power (TDP): Programmable from 450W to 1000W (CPU + GPU + memory)

    I don't think so Tim.

    • ed25519FUUU a year ago

      Nothing like programming with a 1 kW power plant on your crotch.

      • krylon a year ago

        They could sell it as a contraceptive for men. (SCNR)

      • z29LiTp5qUC30n a year ago

        Yeah, that was like trying to program when my ex-wife wanted attention.

        Just wish I could have typed half as well as Hugh Jackman in Swordfish; then I wouldn't feel so sorry for all of the bugs I introduced into production because of her.

    • seydor a year ago

      Unfoldable solar panel. Apple said yesterday that sitting in the sun is good for your eyes

sylware a year ago

... and still so small and slow compared to a human brain.

  • KeplerBoy a year ago

    I don't know about you guys but computers have always been faster at math than my brain.

    • tstrimple a year ago

      For conscious math yeah that's true. We also do a lot of unconscious math. Catching an arcing ball flying through the air for example. We don't think of that as math because it feels so natural to us, but our brain is tracking trajectory and speed and you're coordinating your body to intercept based on those details. Similar story with throwing a pass to where you know someone will be. Those sorts of skills become ingrained to the point where they become a reflex, but it's math computation behind the scenes.

      • casey2 a year ago

        What math exactly does the brain do while you are trying to catch a ball? Just because you can mathematically model the trajectory of a ball doesn't mean your brain is using that model. The standard method humans use to catch a ball is stabilization at a few updates per second, while the classic method used to track the trajectory of a ball is Newton's method of fluxions. Humans suck at both compared to computers, though the former isn't as shameful a loss as the latter.

    • sylware a year ago

      You totally missed the point: the connectome of a human brain is still orders of magnitude bigger and faster than that.

      Power efficiency does not matter for those.

      Those are interesting only for the implementation of specialized cognitive functions.

      Additionally, the connectomes of those chips are 2D and very localized. Human brains are 3D and much less localized. Simulating a 3D connectome with those 2D chips slows everything down by a lot.

    • hardware2win a year ago

      How about learning, e.g., car driving?

      • echelon a year ago

        It took humans 300,000 years to learn to drive the car. Much longer than that, if you consider our lineage.

        • hardware2win a year ago

          >It took humans 300,000 years to learn to drive the car.

          False; that's how long it took to invent the car.

          People manage to learn to operate cars in less than 50 hours

          • detrites a year ago

            You're comparing training a model from scratch in ML, to the equivalent of model fine-tuning in humans. It's unfair and incorrect. Eg, no human can drive without first learning to operate their limbs, recognise shapes etc.

            The GP is pointing out that training in fine muscle motor skills, self-awareness, and the ability to project self-awareness onto other objects under one's control, etc., all took many thousands of years to develop. AI is faster.

            However, it's again unfair, as AI only knows what it knows from us, so in that sense any comparison is built on shaky ground.

            But for the purposes of comparing a stock human brain as hardware versus a current high-end GPU, specifically in terms of ingesting information and then performing tasks, the GPU beats the human brain "hands-down" in any category.

            The only categories where it doesn't are simply ones that no one has trained it for yet - so the argument on a pure hardware-capability basis stands.

            • incrudible a year ago

              Still, the 300,000 years figure is way off. That's an anatomically modern human and they most likely could be trained to operate a car much like a human of today can. Getting to that point took billions of years, but producing the driver of a car was never the goal of that process.

              It's just the wrong way to look at the problem. You're not trying to develop the generic system that can learn how to drive a car, you're trying to develop the specific system that can safely drive a car occupied by humans, naturally employing machine learning.

              I would argue that we're 95% there, but solving that last 5% is exponentially more expensive yet not commensurately more valuable. There's a "profit ceiling" imposed by the cost of a human driver, which appears to make solving the problem economically intractable.

      • gzer0 a year ago

        https://www.youtube.com/watch?v=DjZSZTKYEU4

        Here's a video of Tesla FSD driving through the complicated streets of Los Angeles for an hour straight, with 0 human intervention.

        • hardware2win a year ago

          And how many learning-years did it need to achieve that?

          Compared to a human's tens of hours?

          • TN1ck a year ago

            Tens of hours? I wasn't aware that newborns are allowed to drive :D. Jokes aside, if you want to compare the training time, you should also include the time it takes us to learn our surroundings and navigate the world.

            • qwytw a year ago

              Even if we assume 10-15 years, a human brain is still astronomically more efficient at this. Unfortunately we can't yet copy-paste our brains, so everyone has to learn it from scratch...

              • danielbln a year ago

                More like a few million years. Humans come pre-trained via DNA, the 10-15 years are mainly fine-tuning.

                • fennecfoxy a year ago

                  Just came to leave the same comment aha.

                  We have to include our evolutionary process because a lot of our brain is pretrained, especially visual/motor neurons.

                  We seem to be pretrained to pick up language, for example, and the language(s) we hear after being born fill that space, our brains are plastic for a reason.

          • TeMPOraL a year ago

            > In compare to human's tens of hours?

            On top of typically around 18 years of learning to process and fuse vision, sound, proprioception and other inputs, to navigate the world and reason about it.

            • hardware2win a year ago

              The 18 years is more due to cultural/legal requirements than brain capability.

              • TeMPOraL a year ago

                And those cultural/legal requirements are not arbitrary - they've been set around the point where brain capability is sufficient for the task. There's both science and a ton of experimentation behind them.

          • anonylizard a year ago

            Humans have 1. Pre-trained fine-motor control and visual perception encoded in genetics (think base-model training). Who knows how many millennia this took.

            2. 16 years of fine-tuning to adapt to the current modern world.

            3. 50-100 hours of specific task-based fine-tuning for driving; think LoRA training.

        • neatze a year ago

          This type of self-driving has existed since the '90s; as far as I understand, the hard part is driving in traffic and around pedestrians.

      • firecall a year ago

        It's just a matter of knowing how to teach it...

        • ed25519FUUU a year ago

          Why not have it teach itself if it’s so smart!

    • MrBuddyCasino a year ago

      call me back when they can iron my shirts

      • TeMPOraL a year ago

        They can, you just won't like the price tag.

        • gtirloni a year ago

          The original comment was about the human brain, not a single specialized task.

          • TeMPOraL a year ago

            I'm not going to bet against GPT-4, or its multi-modal successor that's supposed to be out soon, being up for this task. The hard bit is manipulator hardware, which is out of scope for this topic, and the tricky bit is teaching a generative model how to use it, which I think is entirely within the capabilities of current state-of-the-art models.

  • mschuster91 a year ago

    Give development a few years and it'll be faster than a human.

    • jacquesm a year ago

      Depending on the task it may well already be faster than a human at equivalent performance.

    • MrBuddyCasino a year ago

      Neurons are very complex and not just on/off switches. We can't even fully simulate the ~300 neurons of C. elegans, not even close.

      • TeMPOraL a year ago

        A lot of that complexity comes from them being living cells, optimized for and functioning in a different environment than our silicon-based machines. We don't need to model it all.

        (Though we do need to pay attention to evolution cheating by overfitting relative to what we'd consider a clean design. Some of the complexity may be doing double duty.)

        • mrguyorama a year ago

          Slime molds and single-celled creatures can learn things despite having ZERO neurons. Neurons are built on top of an already incredibly complex machine evaluating hundreds of thousands of chemical and physical interactions per second that ALL affect how the cell works.

          We aren't likely ever going to reduce that to a model as simple as the one used in machine learning, because it probably isn't that simple period.

          Neurons are not "just" electrical signalling devices. They are complicated processors and systems in their own right.

        • MrBuddyCasino a year ago

          > Some of the complexity may be doing double duty.

          Since we have not succeeded in imitating even the most primitive brains, even though computationally we should have enough juice by now, it would seem that complexity can't be discarded at all, no?

          • TeMPOraL a year ago

            > Since we have not succeeded in imitating even the most primitive brains, even though computationally we should have enough juice by now

            looks at the browser tab with GPT-4 in it

            looks back again here

            ... we didn't?

            • AstralStorm a year ago

              We didn't. Try to get ChatGPT through a maze a C. elegans can solve in minutes. (It's that slow mostly because pond scum moves slowly.)

              Seriously, get it to successfully play through a text adventure maze game.

              • TeMPOraL a year ago

                > Seriously, get it to successfully play through a text adventure maze game.

                I did exactly that the other day in response to a different objection on a HN thread. Or at least similar enough.

                https://cloud.typingmind.com/share/c0a68cb2-5f59-4e83-b383-b...

                Now, the goal there wasn't to get it to solve a maze, but rather to see how it can come up with a plan of action and adjust it on the fly. But I see no reason a variant of that wouldn't work with a traditional maze game - provided you remember this is a stateless model without volatile memory, so it needs to be fed its memory with every request.
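
                Concretely, "fed its memory with every request" just means the caller keeps the transcript and replays it on every call. A minimal sketch against the OpenAI chat API (v1-style Python SDK; the model name and details are illustrative and vary by SDK version):

                    # Sketch: the model is stateless, so the caller replays the whole
                    # transcript on each request.
                    from openai import OpenAI

                    client = OpenAI()  # reads OPENAI_API_KEY from the environment
                    history = [{"role": "system",
                                "content": "You are playing a text-adventure maze game."}]

                    def step(user_msg: str) -> str:
                        history.append({"role": "user", "content": user_msg})
                        resp = client.chat.completions.create(model="gpt-4", messages=history)
                        reply = resp.choices[0].message.content
                        history.append({"role": "assistant", "content": reply})  # its "memory"
                        return reply

                    print(step("You are in a dark room. Exits: north, east."))
                    print(step("You went north. A locked door blocks the way. Exits: south."))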

              • mschuster91 a year ago

                The problem with ChatGPT is that it lacks the context depth for more complex tasks - but that is a computational-resource limit, not a fundamental technical one.

            • MrBuddyCasino a year ago

              No we haven't. ChatGPT is basically a sophisticated Markov chain. It is very good at pattern matching, but it has no understanding of anything, nor a will of its own. People who think it is even close to AGI are deluded, fooled by an elaborate Mechanical Turk.

              This is also the reason why its output sounds convincing, but is very often factually wrong.

              • TeMPOraL a year ago

                I disagree, but that's beside the point here. You yourself narrowed the scope to:

                "imitating even the most primitive brains, even though computationally we should have enough juice by now"

                Which is kind of weird to claim today. GPT-4 may be the strongest counterexample to date, but it's far from the only one.

                Of course, you need to remember not to confuse the brain with attached peripherals. Just because we can't replicate a perfect worm or fly body, complete with bioelectrical and biomechanical components, doesn't mean we can't do better than their brains in silico.

                • fennecfoxy a year ago

                  I'd also say that some of it might be a matter of more computing power, but much of it is us cracking the "puzzle": we haven't figured out the exact right architecture/structure for creating, say, an AGI.

                  Just as transformers revolutionised text generation, and things like LoRA and other fine-tuning methods are now helping us find a better solution to that puzzle, the same will happen for the development of AGIs.

                  We will do it, one day.

                • mrtranscendence a year ago

                  GPT-4 does not "imitate" a "brain"; it does not function like a brain, nor is it even really analogous to a brain in any useful sense. What it imitates is human speech.

                  • TeMPOraL a year ago

                    "Imitating human speech" is not a trivial thing. You can't do it by a lookup table, or by a Markov chain. Not properly, not in open-ended, unscripted situations. It requires capabilities and structures that, if they aren't a world model and basic abstract reasoning skill, then they at least start to look strikingly similar in practice. This is where we are with GPT-4. It doesn't imitate speech. It imitates reasoning.

                    And if it walks like a duck, and quacks like a duck, ...

                    GPT-4 is a good example because it's pretty clear that the model isn't merely a stochastic parrot (or, if it is in some sense, then in that sense so are we). But it's not the only game in town. Not all generative transformers deal with language. All seem to be powerful association machines, drawing their capabilities from simple algorithms in absurdly high-dimensional spaces. There are many parallels you can draw to brains here, not the least of which is that the overall architecture is simple enough and scalable, that it's exactly the kind of thing evolution could reach and then get railroaded into building on.

                  • MrBuddyCasino a year ago

                    Thanks, I gave up halfway through writing this.