Nvidia releases new AI chip with 480GB CPU RAM, 96GB GPU RAM

381 points by logicchains 2 years ago

They also have a turnkey product with 256 of these things.

1 exaflop + 144TB memory

https://nvidianews.nvidia.com/news/nvidia-announces-dgx-gh20...

jwr 2 years ago

As someone who lived through the first wave of supercomputers (I worked with Cray Y-MP models), it makes me very happy to see the second wave. For a while I thought supercomputing was dead and we would just be connecting lots of PCs with a network and calling that "supercomputers".
I still remember how my mind was blown when I first learned that all of the memory in a Cray Y-MP was static RAM. Transistor-based flip-flops: extremely power hungry, but also very fast. Another way of looking at it is that all of its RAM was what we call "cache".
This, finally, looks like a supercomputer.
- com2kid 2 years ago
  
  SRAM is so stupid fun to play with.
  All of a sudden you don't care so much about the inefficiencies of walking linked lists or trees. When everything is "already in cache", you can worry less about cache efficient algorithms!
  1 cycle memory access latency is one of the reasons why tiny embedded MCUs can do things with a fraction of the MHZ of their larger counterparts.
  Now days of course it is all about tons of memory, tons of bandwidth, craptons of compute, and planning the flow of data ahead of time.
  
  boringuser2 2 years ago
  
  Only within a very small domain was this ever reasonable.
  
  jwr 2 years ago
  
  But that's why there's a "super" in "supercomputers"!
  Cray loaded a ton of static memory into their computers, then liquid-cooled the whole thing. Sure, the power requirements were through the roof, and you had a whole huge chiller system which you had to run and hope it doesn't fail. If it did fail, you really wanted to shut the machine down fast. From what I recall there was also an emergency propeller inside the back case of the Y-MP 2E, and yes propeller is a much better name for this thing than a "fan". It would delay the inevitable, although dumping those tens of kW of heat into your server room was not something you ever wanted to do.
  The whole point of all this was that you could do things that you couldn't with "normal" computers. That's why those were called "supercomputers". And I'm so glad that after a hiatus of about 30 years we're getting another wave of exceptional machines, which aren't just bigger PCs.
  
  Dylan16807 2 years ago
  
  L3 cache is all SRAM but you can have pretty significant delays accessing it. Even the fastest memory cells will build up significant addressing delays as you increase in scale.
  
  creato 2 years ago
  
  If you have a “large” SRAM that can be accessed in one cycle, that just means your processor is slow and/or consumes more energy than it should.
  
  com2kid 2 years ago
  
  > If you have a “large” SRAM that can be accessed in one cycle, that just means your processor is slow and/or consumes more energy than it should.
  MCUs are in this category, lots of embedded stuff, including the two areas I'm familiar with: game controllers and lower spec'd wearables.
  Very low power usage, CPU speed around 100mhz, so not too slow.
  You can do plenty with 100mhz and SRAM!
  
  SmoothBrain12 2 years ago
  
  [dead]
- bigbillheck 2 years ago
  
  > the first wave of supercomputers (I worked with Cray Y-MP models)
  The Y-MP came out in 1988, sixteen years after CRI was founded, which itself was several years after the CDC6600.
  
  Retric 2 years ago
  
  They presumably where born 20+ years before they started working professionally on the Y-MP. So they could easily have been alive or even a teen in 1964 when the CDC 6600 was released.
  
  jwr 2 years ago
  
  I worked with the Y-MP models in the 1990s, and no I was not alive in 1964, although I'm not sure how we got there :-)
bigyikes 2 years ago

I watched Jensen’s announcement for this.
He calls it the worlds largest GPU. It’s just one, giant compute unit.
Unlike super computers, which are highly distributed, Nvidia says this is 140 TERABYTES of UNIFIED MEMORY.
My mind still gets blown just thinking about it. My poor desktop GPU has 4 gigabytes of memory. Heck, it only has 2 terabytes of storage!
- coolspot 2 years ago
  
  It may be presented as seamless unified memory, but it isn’t. Underlying framework still has to figure out how to allocate your data to minimize cross-unit talk. Each unit has independent CPU, GPU and (V)RAM, but units are interconnected via very fast network.
  
  westurner 2 years ago
  
  How does this compare to HBM High Bandwidth Memory (and GDDR5)? https://en.wikipedia.org/wiki/High_Bandwidth_Memory
  
  westurner 2 years ago
  
  From https://news.ycombinator.com/item?id=36211785 :
  > EDIT: found answer to my own question in the datasheet: "The NVIDIA Grace CPU combines 72 Neoverse V2 Armv9 cores with up to 480GB of server-class LPDDR5X memory with ECC."
  So, this is not stacked RAM like HBM, it's LPDDR5X which a quick search says is 8.5Gbps.
- XCSme 2 years ago
  
  I think distributed computing will go away soon. As computers become more powerful, the cost of "distributing" and transferring the data would be more than simply executing everything locally. Yes, you can still split the task amongst different nodes, or give different problems to different nodes, but the use-case would mostly be solving distinct problems on each note, that splitting the same task across multiple computers.
  Also, with quantum computers, the parallelization/"distribution" of tasks will be done within the same machine, as it can try all solutions and the same time without having to do divide-et-impera algorithms.
  Also, in the future, the algorithms will be a lot simpler, and just have FPGA-s like AI chips, where there is no software, the model is directly modelled in the hardware, so each computation is instant (just the time it takes to propagate the electrons or light through the circuit).
- sliken 2 years ago
  
  What's old is new again. This is basically an updated arm version of the itanium based SGI Altix.
  Keep in mind unified does not mean uniform, the ram is distributed across all the GPUs.
- bushbaba 2 years ago
  
  There’s usecases beyond just ML. Sap Hana could theoretically run on this with greater performance. Same goes for a database. Scaling vertically solves a lot of challenges with distributed ledgers.
- markus_zhang 2 years ago
  
  Is it similar to the mainframe in concept?
  
  tiberious726 2 years ago
  
  No, kinda the opposite: mainframes use a litany of sophisticated and special purpose hardware to achieve their tasks (eg hardware io _channels_), while this is a massively overgrown instance of a single kind of hardware (vector processor).
  
  bitwize 2 years ago
  
  The thing that made the lightbulb go off in my head w.r.t. mainframes was understanding that mainframe I/O channels are computers. The mainframe had several dedicated computers that each specifically handled I/O to a terminal, printer, disk or tape drive, punchcard reader, etc. Made I/O programming a breeze, as you just had to tell the channel to read or write, specifying a block of memory to use as a buffer, and the channel would DMA out the data to be written, or DMA in read-in data.
  It also explains the reason why despite having middling CPU power, mainframes had a reputation for stinkloads of I/O bandwidth so they could process everyone's credit card transactions, airline bookings, and that: the mainframe's CPU was involved very little in I/O, that was all handled by the channel processors!
jasonjayr 2 years ago

I had to look twice at that image, I thought it was a 2 rack-unit device, But, no, it's 24 full 42U racks!!
- kristopolous 2 years ago
  
  it's just a rendering. I presume nvidia wouldn't be announcing something that they haven't made and confirmed, I wonder why they chose that image.
  Is it just they haven't done the molding of a production installation? Is it possible that their internal instances might not be that presentable?
coherentpony 2 years ago

> 1 exaflop
To be clear, this is floating point quarter-precision operations when using the FP8 tensor core arithmetic unit [1].
[1] https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-su...
- pezezin 2 years ago
  
  Came here to post exactly the same link. Not just that, but 1 exaflops of sparse FP8.
  In comparison, Frontier is 1 exaflops of dense FP64. Try to run this Nvidia system as dense FP64, and it performance will degrade two orders of magnitude.
  Don't get me wrong, the machine is really impressive, but the advertisement is quite misleading.
mirekrusin 2 years ago

Oh my, time to upgrade my Pi 4 Model B.
- weinzierl 2 years ago
  
  If you upgrade to a Jetson you get GPU power and you can keep the form factor, win - win.
RosanaAnaDana 2 years ago

I want to see Linus play Doom Eternal on it.
- ChuckNorris89 2 years ago
  
  Is Crysis no longer a thing?
  
  all2 2 years ago
  
  Crysis is a problem because it is single threaded. That's why it was so hard on computers back in the day.
  
  skhr0680 2 years ago
  
  Being single-threaded is why the original Crysis is hard on computers now(!!). When Crysis was being developed people still thought 5-10Ghz was coming Any Day Now.
  
  dtech 2 years ago
  
  2007 was well into Intel Core territory, Intel had given up on clockspeed-at-all-costs Pentium 4 Netburst, so it was generally accepted that clocks weren't going to keep up as fast as before.
  
  skhr0680 2 years ago
  
  Even if Crytek took a year off developing new games after the release of Far Cry, that still puts the start of Crysis development before the announcement of Core (early 2005 IIRC).
  
  all2 2 years ago
  
  You might could hit 5Ghz by overclocking the appropriate CPU. I remember 10 years ago looking at the liquid cooling setups needed to hit 5Ghz on a core.
  
  thomastjeffery 2 years ago
  
  Stock turbo on a 13th gen i5 (performance cores) is 5.1ghz. All you need is good cooling and a stable power supply.
  
  LoganDark 2 years ago
  
  My 12400F boosts to 5GHz for the majority of each day. It's literally normal for me. I've gotten it to almost 5.3GHz before, but it's not really stable above 5.2GHz.
MichaelZuo 2 years ago

That could probably all fit in a single semi-trailer. It's amazing how dense computation is getting.
fennecfoxy 2 years ago

Doesn't this basically shoot up the list of the TOP500 then? Wonder if they offer a >256 they could be top of the list, easy.
- mk_stjames 2 years ago
  
  TOP500 uses FP64 performance in ranking. nVidia's 1 exaflop claim is the ~4 petaflops of FP8/INT8 * 256. FP64 performance of modern nvidia gpu's is actually far, far less. The ratio to FP32 isn't even 2:1 anymore (not since Pascal I think) since they realize most machine learning is done with FP32 or less.
  64 bit (or 'double precision') is still king in the HPC world though, as it is what you will find in large numerical solutions in fields like computation fluid dynamics, nuclear physics, etc.
  
  fennecfoxy 2 years ago
  
  Ah fair, I should've known. I suppose the precision is still required for scientific purposes. Thankfully ML stuff now gets more appropriate precision for a speed increase.

NickHoff 2 years ago

My question here is about underlying fab capacity. This chip is made on TSMC 4N, along with the H100 and 40xx series consumer GPUs. I assume Nvidia has purchased their entire production capacity. I also assume that Nvidia is using that capacity to produce the products with the highest margins, which probably means the H100 and this new GH200. So when they release this new chip, does it mean effectively fewer H100s and 4090s? Or is that not how fabrication capacity works?

I'm asking because whenever I look at ML training in the cloud, I never see any availability - either for this architecture or the A100s. AWS and GCP have quotas set to 0, lambda labs is usually sold out, paperspace has no capacity, etc. What we need isn't faster or bigger GPUs, it's _more_ GPUs.

ac29 2 years ago

> This chip is made on TSMC 4N, along with the H100 and 40xx series consumer GPUs. I assume Nvidia has purchased their entire production capacity.
I dont know why you would assume that. Qualcomm has been using TSMC N4 since last year [1]. I'm sure there are other customers as well.
[1] https://www.anandtech.com/show/17395/qualcomm-announces-snap...
huijzer 2 years ago

It sounds to me like the GH200 achieves more FLOPS per transistor. So, compute demand will be quicker satisfied via the GH200 than via "smaller" chips such as the H100.
Having said that, I don’t think we’re anywhere near some kind of equilibrium for AI compute. If chip supply would magically double tomorrow, then the large companies would buy it for their datacenters and have 100% utilization in a few weeks. They all want to train larger models and scale inference to more users.
- rcme 2 years ago
  
  In addition to training larger models, I'm sure there are many use cases that AI could serve that are currently cost prohibitive due to the cost of running inference.
danielmarkbruce 2 years ago

I'd like bigger GPUs. A trillion parameter model at 16 bits needs 2000gb+ for inference, more for training. All kinds of things can be done to spread it across multiple GPUs, downsize to less bits etc, but it's a lot easier to just shove a model on one GPU.
We'll likely see more efficiency from bigger GPUs and hopefully more availability as a result.
- mcbuilder 2 years ago
  
  TBH this is what all ML researcher / engineers have wanted for the past 10 years.
  
  0cf8612b2e1e 2 years ago
  
  My question on the very slow growth of available memory: are there technical reasons they cannot trivially build a card with 100GB of RAM (even with lower performance) or has it been a business decision to milk the market for every penny?
  
  Dylan16807 2 years ago
  
  High speed I/O pins cost a lot, and GDDR generally has 32 data pins per chip and no way to attach multiple chips to the same pins. So 256 bits and 16GB is hard to exceed by much on that tech. The high end is 384 bits and 24GB.
  There is a mode to attach 16 data pins to each GDDR chip, so with some extra effort you could probably double that to 48GB. Or at least 32GB. Maybe this is a valid niche, or maybe there isn't enough demand.
  The alternative to this is HBM, which can stack up big amounts, but it's a lot more expensive.
  
  Baeocystin 2 years ago
  
  I don't disagree with Dylan, but I'm more than willing to bet that the only reason Nvidia's cards (and that's who we're talking about. CUDA is a hell of a moat.) are RAM-starved is that they haven't felt the pressure to do otherwise. AMD has an institutional aversion towards good software. Intel isn't even an also-ran, yet.
  Apple and their unified memory architecture may be the prod needed to get larger levels of RAM available to single cards solutions. We'll see.
  
  shaklee3 2 years ago
  
  Nvidia has had unified memory for more than 6 years. This chip is just a faster interconnect for it.
bob1029 2 years ago

> Or is that not how fabrication capacity works?
Fabs can run multiple complex designs on the same line simultaneously by sharing common tools. For example, photolithography tools can have their reticles swapped out automatically. Obviously, there is a cost to the context switching and most designs cannot be run on the same line as others.
Ultimately, the smallest unit of fabrication capacity is probably best measured along grain of the lot/FOUP (<100 wafers).
ksec 2 years ago

>Or is that not how fabrication capacity works?
The basic of Supply Chain and Supply and Demand, as you should have all witness during COVID for toilet rolls are the same.
Fab capacity is not that different to any other manufacturing. You just need to book those capacity way ahead of time. ( 6 - 9 months ) And that is also why I said 99% of news, or rumours about TSMC Capacity are pure BS.
So to answer your question. Yes, Nvidia will likely go for the higher margin products. One of the reason why you see Nvidia working with Samsung and Intel.
theincredulousk 2 years ago

It's my understanding from friends in the business that the actual chips do not represent any capacity issue or bottleneck, it's actually manufacturing the devices that the chips are in (e.g. the finished graphics card).
- NotSuspicious 2 years ago
  
  Why would this be the case? I would naively think that since the chips can only be made in a fab and the rest can be made basically anywhere that that wouldn't be true.
  
  archi42 2 years ago
  
  They can not be made "anywhere"; when you can't get that PMIC from the original manufacturer, good luck getting it from someone else. And replacing an IC in a QA tested, EMV verified, FCC and CE etc. certified device will often trigger you redoing all that, possibly requiring additional iterations. If there is a similar part available at all.
  Take a look at a recent GPU and count the auxiliary components. All of them can cause supply chain difficulties.
- refulgentis 2 years ago
  
  That's...fascinating. There's enough space on TSMC but the PCB is the hard part?
  
  Yizahi 2 years ago
  
  For example my corpo hit manufacturing issues (production capacity) with flash memory, with clock oscillators, with auxiliary fpga. But main chips production was fine all the time during chip crisis as far as I know. So yeah, small critical components totally can be a blocker. Some specific voltage controller is unavailable and suddenly your whole design is paralyzed.
  
  idiotsecant 2 years ago
  
  pcbs are also full of a bunch of other components, many of which are hard to get ahold of right now.
  
  davrosthedalek 2 years ago
  
  I think that's it. PCB itself is rather trivial, it's the RAM, but also things like switching regulators (there are others, but then it's a redesign), maybe even stuff like connectors (which don't burn....).
  For a science project, we need to manufacture magnets. It's not easy to find a company who has the right iron right now, and it's hard to get, with long lead times. The supply crisis is real.
Tepix 2 years ago

I see A100 80GB cloud capacity available on both runpod.io and vast.ai currently.
RosanaAnaDana 2 years ago

You know I was wondering this the other day when NVDA's insane run up happened. I went down the road of trying to figure out if there was even enough silicon wafers, or if there even would be enough wafers in the next five years, to justify that price.
Unless all the planet does is make silicon wafers; no.
- xadhominemx 2 years ago
  
  Well you figured wrong - NVDA AI GPUs are a very small % of global foundry supply, if even if volume tripled, they will still be a small % of global foundry supply. NVDA’s revenue is high because their gross margins are extreme, not because their volume is high.
- austinwade 2 years ago
  
  Can you go into more detail? So you're saying that at a 200 P/E ratio NVDA there isn't even enough wafer supply for NVDA to grow into that valuation even over 5 years?
  
  RosanaAnaDana 2 years ago
  
  I mean, you've got the gist of it. I pulled some reports on silicon production, silicon waver prices and price trends, current fab capacity etc..
  My back of the napkin basically suggested that silicon production would need to 4x and fab capacity 4x (neither of which are happening) and NVDA with would have to capture all of that to justify their current price. I didn't bother writing it up, just looked at it mostly because I was on the wrong side of that play. It's something worth considering for sure.
  
  austinwade 2 years ago
  
  Wouldn't NVDA just focus more on high margin datacenter products in order to grow into those higher earnings but with the the wafer limitation? Datacenter focused products are already starting to surpass gaming which is their second largest revenue source: http://www.nextplatform.com/wp-content/uploads/2022/05/nvidi...
  It seems to me that yes while a 200 P/E may be high, they certainly could keep increasing the prices on the already high margin datacenter products, of which get quickly gobbled up by companies no matter what price they are because of the immense demand.
  
  refulgentis 2 years ago
  
  We're probably ~3 years out from all of those fabs gov'ts funded coming online, right?
  (n.b. that's really good work on your end and I agree with your conclusion, just idly musing about the thing that bugs me, what the heck all these non-leading edge fabs are going to do)
fxtentacle 2 years ago

I believe availability is low because the GPUs are too expensive so those that need to scale up use the older and much more affordable models.
tomschwiha 2 years ago

I'm using Runpod and Datacrunch regulary and they seem to always have some available.

samwillis 2 years ago

This may be a naive question, in "crypto" we saw a shift from GPUs to ASICs as it was more efficient to design and run chips specifically for hashing. Will we see the same in ML, will there be a shift to ASICs for training and inference of models?

Apple already have the "neural" cores, is that more or less what they are?

Could there be a theoretical LLM chip for inference that is significantly cheaper to run?

anonylizard 2 years ago

Inference is mostly just matrix multiplications, so there's plenty of competitors.
Problem is, inference costs do not dominate training costs. Models have a very limited lifespan, they are constantly retrained or obsoleted by new generations, so training is always going on.
Training is not just matrix multiplications, given hundreds of experiments in model architecture, its not even obvious what operations will dominate future training. So a more general purpose GPU is just a way safer bet.
Also, LLM talent is in extreme short supply, and you don't want to piss them off by telling them they have to spend their time debugging some crappy FPGA because you wanted to save some hardware bucks.
- conjecTech 2 years ago
  
  The more general the model, the longer the lifetime. And the most impactful models today are incredibly general. For things like Whisper, I wouldn't be surprised if we're already at 100:1 ratio for compute spent on inference vs training. BERT and related models are probably an order of magnitude or two above that. Training infra may be a bottleneck now, but it's unclear how long it will be until improvements slow and inference becomes even more dominant.
  Capital outlays are tied to the derivative of compute capacity, so even if training just flatlines, hardware spend will drop significantly.
  
  flangola7 2 years ago
  
  Isn't whisper self hosted
  
  conjecTech 2 years ago
  
  That's part of my point. There are 100s of organizations using it at scale, but it only needed to be trained once.
- samvher 2 years ago
  
  What would be the set of skills that would put you in the category of LLM talent that is in extreme short supply?
  Just curious what the current bar is here and which of the LLM-related skills might be worth building.
  
  anonylizard 2 years ago
  
  Being able to train base LLMs. This is currently an alchemical skill since you can't learn it at school. This can be further split into infrastructure engineering (managing GPU clusters aint easy), data gathering and cleaning (at terabyte scale), the training itself, etc etc.
  Being very good at fine tuning for a particular goal. Its much easier to learn fine-tuning, so standards are higher to stand out.
  Being able to come up with architectural improvements for LLMs, aka the researcher path.
  Wages start at $250k for grads at the big AI companies.
  
  bwv848 2 years ago
  
  Funny you sort of describe me
  1. For BERT scale model, all you need is a good codebase from GitHub (I had some luck with this one [0]) and a few weeks of trial and error. Want to try training T5 or LLaMA, but don't have the resources needed. Of course training models with more than 100B parameters is another level of labyrinth.
  2. Finetuning is mostly related to how well you understand the task and the data you are dealing with. Since the BERT paper focuses on the GLUE benchmark, I've become very proficient in fine-tuning GLUE and eventually got sick of it.
  3. Made some architectural improvements to BERT, got decent results so I wrote a paper, and got rejected because the reviewers want a head-on evaluation against some well funded papers from Google.
  4. Not in my country. Damn, I am envious.
  [0] https://github.com/IntelLabs/academic-budget-bert.
JonChesterfield 2 years ago

Lots of companies are doing ASICs for machine learning. Off the top of my head, Graphcore, Cerebras, Tenstorrent, Wave. This site claims there are 187 of them (which seems unlikely) https://tracxn.com/d/trending-themes/Startups-in-AI-Processo.... Google's TPU counts and there are periodic rumours about amazon and meta building their own (might be reality now, I haven't been watching closely).
As far as I can tell that gamble isn't work out particularly well for any of the startups but that might be money drying up before they've hit commercial viability. I know the hardware is pretty good for Graphcore, Cerebras and the software proving difficult.
- SoapSeller 2 years ago
  
  Amazon have Inferentia[0] and Trainium[1]. You can use them today on AWS.
  [0] https://aws.amazon.com/machine-learning/inferentia/
  [1] https://aws.amazon.com/machine-learning/trainium/
- bippingchip 2 years ago
  
  A lot of companies are indeed trying to build AI accelerator cards, but I would not necessarily call them ASICs in the narrow sense of the word, they are by necessity always quite programmable and flexible: NN workloads characteristics change much much faster than you can design and manufacture chips.
  I would say they are more like GPUs or DSPs: programmable but optimised for a specific application domain, ML/AI workloads in this case. Sometimes people call this ASIPs: application specific instruction set processors. While maybe not a very commonly used term, it is technically more correct.
- foobiekr 2 years ago
  
  I have experience with companies doing their own chips. Often as not what seems like a good idea turns out not to be because your volume is low and your ability to get to high yield dominates and that both takes years and talent.
  As a rule companies should only do their own chips if they are certain they can solve and overcome the cogs problems that low yield and low volume penalties entail. If not you are almost certainly better off just eating the vendor margin. It is very very unlikely that you will do better.
huijzer 2 years ago

According to Ilya Sutskever in a podcast that I heard, GPUs are already pretty close to ASIC performance for AI workloads. Nvidia can highly optimize the full stack due to their economies of scale.
- josephg 2 years ago
  
  Right. As I understand it, training on gpus isn’t limited by the speed of matrix multiplications. It’s limited by memory bandwidth. So a faster ASIC for matrix operations won’t help. It’ll just sit idle while the system stalls waiting for data to become available.
  That’s why having 96gb (with another 480gb or whatever) available via high speed interconnect is a big deal. It means we can train bigger models faster.
BobbyJo 2 years ago

That shift won't happen until research slows down. Nobody wants to invest significant amounts of money in hardware that will be obsolete in a year.
- voxadam 2 years ago
  
  > Nobody wants to invest significant amounts of money in hardware that will be obsolete in a year.
  When talking about the current ML industry it's more like nobody wants to invest significant amounts of money in hardware that will be obsolete before it's even taped out.
  
  kramerger 2 years ago
  
  Counterpoint: whoever is the first to make advanced ML affordable, maybe even mobile has a good chance of dominating multiple markets in near future.
  
  fxtentacle 2 years ago
  
  There's open source to export fully trained AI models for efficient execution on arbitrary CPUs. Most likely, you won't be able to build up any moat in AI inference.
- mechagodzilla 2 years ago
  
  What would be obsolete? The underlying operations are almost always just lots of matrix multiplies on lots of memory. Releasing a new set of weights doesn’t somehow change the math being done.
  
  BobbyJo 2 years ago
  
  To the extent that AI architectures are just matrix multiplications, there is already an asic: GPUs.
  If you want more efficiency gain than general matrix multiplication hardware, you need to start getting specific about the NN architectures the hardware will support.
  
  AstralStorm 2 years ago
  
  The amount of memory available is the obsolescence.
PUSH_AX 2 years ago

I think this is already a thing.
https://en.wikipedia.org/wiki/Tensor_Processing_Unit
nynx 2 years ago

These are “GPUs” in a pretty stretched sense. They have a lot of custom logic specifically for doing tensor operations.
- cubefox 2 years ago
  
  Which isn't very informative when it isn't clear how close they are to TPUs, which are ASICs.
fxtentacle 2 years ago

The main benefit of using modern GPUs is that they have large high-bandwidth memory. You need to put a lot of work into optimizations to reach >10% of the peak compute capability in practical use. That means an ASIC won't eliminate the performance bottleneck.
xxs 2 years ago

ASICs did take on bitcoin, but not on Ethereum. The ASICs provide an optimized compute unit but they do suck when it comes to memory addressing.
Effectively it'd require the entirely memory controller and the cache, and scheduling. At point point you got most of the GPU w/ a stuck, non-programmable interface of a designated compute. Likely you'd never have to compete for advanced nodes as well.
- 58x14 2 years ago
  
  I worked with an electrical engineer who disclosed to me a mid 8 figures investment made by a PE firm to develop Ethereum FPGA and ASIC hardware, and while he told me they were underwater on that deal, they did achieve (IIRC) 30%+ performance (watt to flop) relative to (at the time) top-of-the-line GPUs.
  I wonder what they’re doing with that hardware now.
  
  shiftpgdn 2 years ago
  
  Still useful for other coins, though the returns may not be very great.
  
  xxs 2 years ago
  
  Interesting, what memory did they manage to use - I suppose not HBM.
- paulmd 2 years ago
  
  ethereum ASICs existed - at least a few were publicly known (antminer and other brands had a few) but others likely existed on the down-low as well. They were never about optimized compute but rather a cost-optimized way to deploy a shitload of DDR/GDDR channels reliably at minimum cost.
  Ethereum is designed to bottleneck on memory bandwidth (while being uncacheable) so at the end of the day the name of the game is how many memory channels can you slap onto a minimum-cost board. You won't drastically win on perf/w - but as mentioned by a sibling, 30-100% over a fully general-purpose gaming GPU is likely possible, because you don't have to have a whole general-purpose GPU sitting there idling (and it's not a coincidence that gaming GPUs were undervolted/etc to try and bring that power down - but you can't turn everything off). "ASIC-resistance" just means an ASIC is only 1-10x more efficient than a general-purpose device, so general-purpose hardware can still stay in the game. It doesn't mean ASIC-proof, you can still make ASICs and they still have at least some perf/w advantage.
  However, if your ASIC costs $100 to get the same performance as a 3060 Ti, that's a huge win even if you only beat the perf/w by 50%. Particularly since your ASIC is likely way easier and more stable to deploy at scale, and doesn't require a host rig with at least a couple hundred bucks of computer gear to even turn on.
  Only plebs were buying up GPUs from retailers or sniping websites, buying from ebay was for the chumpiest of chumps. Gangsters were buying them from the board partners a truckload at a time, true elites just pay someone to engineer an ASIC and do a small run of them. Eight-figures (as mentioned by a sibling) is plenty, a $50-75m run of ASICs is quite a lot of silicon even on a fairly modern node (and some mining companies were publicly known to be using TSMC 7nm and other very modern nodes). And when you invest that kind of money, you don't flash it around and scare the marks.
saynay 2 years ago

We are already seeing chips for inference, really. It's how these models are getting into the consumer market. A lot of the big phones have an inference chip (tensor, neural core, etc), TV are getting them, most GPUs have some stuff dedicated for inference (DSS and superres).
lumb63 2 years ago

There are some industries that take advantage of FPGAs alongside CPU(s) to move some computations into hardware for speed gains while maintaining flexibility. Maybe something like that is possible. For examples, look at the Versal chip.
- synthos 2 years ago
  
  Xilinx's AI has a dedicated AI accelerator in the versal so calling it taking advantage of FPGA isn't quite accurate. It's really another chip that happens to be copackaged with the FPGA

bingdig 2 years ago

> Grace™ Hopper™

Can anyone with more legal knowledge share how they trademarked the name of Grace Hopper?

adsfgiodsnrio 2 years ago

Leaving aside the legality, I find it tacky to use the names of dead people in advertisements. Grace Hopper did not endorse this product. We have no idea what she would have thought of Nvidia. Yet the lawyers are now fighting over the right to use her name and legacy to "create shareholder value".
The worst offender is Tesla, because I'm pretty sure he would have hated that company.
- gruturo 2 years ago
  
  It's tacky but I don't think they are in any way implying an endorsement. Tesla, Ampere, Pascal, Volta, Kelvin, Turing (and quite a few more I can't remember) are all Nvidia architecture names, and are all named after historically important scientists (well, I have my reservations about Kelvin, but that's more personal opinion)
  
  mk_stjames 2 years ago
  
  For Nvidia architectures it wasn't 'Kelvin', it was 'Kepler', as in Johannes Kepler the astronomer.
  
  johntb86 2 years ago
  
  Kelvin (https://en.m.wikipedia.org/wiki/Kelvin_(microarchitecture) ) was an architecture of theirs that was released in 2001.
  
  jsheard 2 years ago
  
  I think they only started using the scientist codenames publicly starting from Tesla, but yeah they've used them internally almost since the founding of the company. Early on there was Fahrenheit, Celsius, Kelvin, Rankine and Curie, then the well-known Tesla, Fermi, Kepler, etc.
  
  mk_stjames 2 years ago
  
  Oh wow; I did not realize their architecture naming scheme even went back that far. That was their first one. The earliest I remembered hearing them call out the name was Tesla.
  For other's reference, nvidia arch's listed on wikipedia as follows in order:
  Kelvin, Rankine, Curie, Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Lovelace + Hopper
  
  toth 2 years ago
  
  Off topic, but I have to ask. What do you have against Kelvin?
  
  mk_stjames 2 years ago
  
  Maybe they are a descendant of William Rankine. Maybe there is an epic 150-year Scottish Baron family feud still going.
  
  HWR_14 2 years ago
  
  My guess is that Gruturo is Irish or is of Irish descent. Kelvin took public stances against the independence (or "home rule") of Ireland.
- anaganisk 2 years ago
  
  it's just flattery, same as naming a road, Martin Luther King drive, Washington boulevard, etc. They may or may not have approved all this, but this just signifies that you want them to be remembered in your own way.
  
  bradlys 2 years ago
  
  I think naming a public road after someone is a lot different than naming a private company's product after someone.
  
  geodel 2 years ago
  
  Yeah we need a law against it. A lot of people would support such law. Could be defining issue of our times.
  
  jrflowers 2 years ago
  
  It would be hilarious if Tesla had to change its name to Electricity Cars LLC or whatever
  
  birdyrooster 2 years ago
  
  if anything the teslabots have trained me to think of Tesla as a power company and not a automobile manufacturer.
  
  jytechdevops 2 years ago
  
  they're both intended to honor said person, no?
  
  anotherman554 2 years ago
  
  If it's an internal name, maybe, but external names are generally intended to manipulate the public via psychology to make unconscious connections to the product that will make them want the product, in other words "marketing 101".
  
  CamperBob2 2 years ago
  
  I wonder if it's safe to name something "Sagan" now [1].
  1: https://en.wikipedia.org/wiki/Power_Macintosh_7100#Codename_...
- jrockway 2 years ago
  
  It's more fun when the people are alive to complain: https://en.wikipedia.org/wiki/Litigation_involving_Apple_Inc...
  Who knew that one of the most profitable companies on Earth would get there by calling Carl Sagan a "butt head astronomer"!
Symmetry 2 years ago

Trademark are always about use in a particular context. Apple has a a trademark on computers using the name "Apple" even though that's been a word for a food for centuries. And if you want produce a line of bulldozers and brand them "Apple Bulldozers" you can do that and get your own trademark on the use of the word "Apple" in that context.
- voakbasda 2 years ago
  
  I really would like to see someone try this, but I do not think it will end well. Apple would find a way to squash the mark. They have sued many businesses that dared to use "Apple" in their clearly unrelated businesses, and I have no reason to believe that this case would be any different.
  Practically speaking, trademarks cover whatever can be litigated successfully.
  
  mousetree 2 years ago
  
  See the disputes between Apple Corps (Beatles) and Apple: https://en.wikipedia.org/wiki/Apple_Corps_v_Apple_Computer
  
  fennecfoxy 2 years ago
  
  And of course the famous agreement from the Beatles of "okay fine, but you can only use the name if you don't get into music"
  Cue Apple Music, lmao. It's a suit-and-tie'd cult, I swear.
  
  speg 2 years ago
  
  Perhaps, but there are indeed some existing that use Apple in their name:
  https://www.applebank.com/
  http://www.applecorps.com/
  https://www.appleleisuregroup.com/
  
  internetter 2 years ago
  
  https://www.theguardian.com/world/2013/oct/07/apfelkind-cafe... :)
- erk__ 2 years ago
  
  Interesting that you used Apple as an example since they have fought with Apple Corp (Beatles owned company) over it for years https://en.wikipedia.org/wiki/Apple_Corps_v_Apple_Computer
  
  bioemerl 2 years ago
  
  Less "fought with" and more "apple corp continually sued apple for money every time apple tried to release a product associated to music."
dathinab 2 years ago

EDIT: To be clear I'm not a legal expert.
Trademarks are context specific and you can trademark "common terms" IF (and at lest theoretically only if) it's used in a very narrow use-case which by itself isn't confusable with the generic term.
The best example here is Apple which is a generic term but trademarked in context of phones/computer/music manufacturing (and by now a bunch of other things).
Through there had been an Apple music label with a bit of back and force of legal cases (and some IMHO very questionable court rulings) which in the end Ended by Apple buying that Label.
So theoretically it's not too bad.
Practically big companies like Apple, Nvidia and similar can just swamp smaller companies with absurd legal fees to force their win (AFIK this is Metas strategie because I honestly have no idea how they think the term Meta for data processing is trademarkable), to make it worse local curt have often shown to not properly apply the law in such conflicts if the other party is from an other country (one or two US states are infamous for very biased legal decision in this kind of cases).
So yeah at the core this aspect of the trademark system is not a terrible idea, but the execution is sadly often fairly lacking. And even high profile cases of trademark abuse often have no consequences if it's a "favorite big company". (For balance negative EU example do include Lego and it's 3d trademark and absurdly biased curt rulings, or Ferrero and it's Kinder (german. Children) trademark on Chocolate).
EDIT: also not the two TM: Grace™ Hopper™ both Grace and Hopper are generic terms you can under some circumstances trademark and then use together, but while probably legal you would likely want to avoid trademarking (Grace Hopper)™
adolph 2 years ago

It looks like two separate trademarks.
https://en.wikipedia.org/wiki/Salami_slicing_tactics
- jabl 2 years ago
  
  I think it's more due to this device combines a "Grace" CPU and a "Hopper" GPU, thus creating a "Grace Hopper" superchip.
mk_stjames 2 years ago

Something to do with Grace Hopper being an actual person (although she is deceased) and thus not being to trademark the entire name?
- jabl 2 years ago
  
  Hopper is the name of the GPU architecture, and Grace is the name of the CPU. Combining them in a device gets you a "Grace Hopper" superchip.
  (And yes, I'd guess the codenames where chosen back in the day with an eye towards combining them in the same device.)
  
  jsheard 2 years ago
  
  There's one funny exception to the rule of Nvidia naming their GPU architectures after just the surname of a famous scientist - they always refer to the "Ada Lovelace" architecture using her full name, presumably to avoid association with the other famous Lovelace.
  
  ConceptJunkie 2 years ago
  
  I think it would be interesting to compare this with the Apple BHA. Although in this case, Admiral Hopper's family would have to take up the fight.
crazypython 2 years ago

"™" has no legal meaning. "(R)" means a registered trademark.
- HWR_14 2 years ago
  
  That is not true. (TM) has a legal meaning. It's weaker than an (R), but it is still an enforceable trademark.
  It's similar to creating a work covered by copyright vs. registering it with the copyright office.
  
  bdowling 2 years ago
  
  Affixing (TM) does not make something an enforceable trademark. Trademark rights are established by use of the mark in commerce. A (TM) is used to put others on notice that a word or design is being used as a trademark, which could potentially be evidence of the intent of an alleged infringer in a subsequent lawsuit. Intent is relevant because it affects damages and other remedies.
balls187 2 years ago

They didn’t.
Grace is a trade mark. Hopper is a trade mark.
Hence each term having it’s own TM.

Aardwolf 2 years ago

What kind of CPU is the CPU part? The link doesn't tell. Is it like ARM or RISC V and can it run a general purpose OS like Linux?

Do you plugin in DDR memory somewhere for the 480GB, or is this already on the board?

EDIT: found answer to my own question in the datasheet: "The NVIDIA Grace CPU combines 72 Neoverse V2 Armv9 cores with up to 480GB of server-class LPDDR5X memory with ECC."

tromp 2 years ago

Quoting from https://www.nvidia.com/en-us/data-center/grace-cpu-superchip...
> The NVIDIA Grace CPU Superchip uses the NVIDIA® NVLink®-C2C technology to deliver 144 Arm® Neoverse V2 cores and 1 terabyte per second (TB/s) of memory bandwidth.
> High-performance CPU for HPC and cloud computing Superchip design with up to 144 Arm Neoverse V2 CPU cores with Scalable Vector Extensions (SVE2)
> World’s first LPDDR5X with error-correcting code (ECC) memory, 1TB/s total bandwidth
> 900 gigabyte per second (GB/s) coherent interface, 7X faster than PCIe Gen 5
> NVIDIA Scalable Coherency Fabric with 3.2TB/s of aggregate bisectional bandwidth
> 2X the packaging density of DIMM-based solutions
> 2X the performance per watt of today’s leading CPU
- jabl 2 years ago
  
  This is about the "Grace Hopper Superchip", which has one of the Grace CPU's replaced with a GPU. Thus, 72 Neoverse V2 cores.
  https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-su...
greggsy 2 years ago

72 Neoverse V2 Armv9 cores.
Not sure how one interfaces with it, but it presumably runs an approved Linux distro, with a web server at best.
- whatisyour 2 years ago
  
  it's a normal chip like your x64 chip.you install ARM variant of your linux distribution on it and run it natively.
  source: I have one
  
  Aardwolf 2 years ago
  
  Could one in theory then game on it by installing Steam on linux? And run all the LLM's with the largest models you want straight from github repositories?
  
  whatisyour 2 years ago
  
  Yes, its no different from any other server. If it was a desktop computer, you will connect your HDMI to it to a monitor and put GPUs on its PCIe slots and attach hard drives to NVMe.
  
  Aardwolf 2 years ago
  
  Where does one buy one anyway? What does it cost? (I know it'll be some huge amount, just curious)
  
  reaperman 2 years ago
  
  These types of items are generally only sold to large buyers with established partnerships. Eventually some stock gets allocated to wholesalers like ShopBLT[0] and CDW[1], which break tradition and sell directly to consumers rather than only selling to retailers/system integrators.
  These are for the 80GB versions, currently priced at $30,000 per GPU. It will likely be many months before this 96GB version is available to prosumers, if it ever is at all.
  0: http://www.shopblt.com/cgi-bin/shop/shop.cgi?action=thispage...
  1: https://www.cdw.com/product/nvidia-h100-gpu-computing-proces...
  
  horsawlarway 2 years ago
  
  Some of the older generations run roughly the cost of 3/2 house in a medium COL city.
  Ex - the HGX A100 platforms sold as single servers usually ran around 150k, but could get up above 200k depending on loadout.
  Just getting an H100 (just the gpu) right now is ~40k new.
  There is a reason nvidia's stock is doing so well...
  
  geerlingguy 2 years ago
  
  "If you have to ask..."
  It is many, many thousands.
  
  pixl97 2 years ago
  
  I think more like many tens of thousands.
chakintosh 2 years ago

LTT posted a video a few days ago from Computex talking a bit in depth about it

tiffanyh 2 years ago

Dumb questions ...

- Am I wrong in understanding this is a general purpose computer (with massive graphic capabilities)?

- And if so, what CPU is it using (an NVIDIA ARM CPU)?

- And what OS does it run?

zucker42 2 years ago

Correct, it's called the Grace Hopper superchip because it using Nvidia Grace CPU (which are ARM) and Nvidia Hopper GPU.
For OS, it will run some form of Linux. I'm not sure if the particular recommended build has been (or will be) publicly released.

ulrikhansen54 2 years ago

More powerful chips are great, but NVIDIA really ought to focus some of their best folks on ironing out some of the quirks of using their CUDA software and actually getting stuff to run on their hardware in a simpler manner. Anyone who's ever fiddled with various CUDA device drivers and lining up PyTorch & Python versions will understand the pain.

IceWreck 2 years ago

The solution is to not install CUDA on your base system because you need multiple versions of CUDA and some of them are often incompatible with your distro provided GCC.
Here is what works for me:
- Nvidia drivers on base linux system (rpmfusion/fedora in my case)
- Install nvidia container toolkit
- Use a cuda base container image and run all your code inside podman or docker
- codethief 2 years ago
  
  I admit it's been a while (2 years) since I last played with Nvidia/CUDA (on Jetson) and back then running CUDA inside Docker was still somewhat arcane, but in my experience, whatever the Nvidia documentation lays out works well until you want to 1) cut down on container image size (important for caches and build pipelines) and, to this end, understand what individual deb packages and libraries do, 2) run the container on a system different from the official Nvidia Ubuntu image.
  Back then the docs were just awful. Has this really changed that much in recent times?
  
  shaklee3 2 years ago
  
  Containers have always come in different flavors that represent their sizes and capabilities. For example, runtime containers have the bare minimum to get the application running but none of the debug tools.
  
  ulrikhansen54 2 years ago
  
  The docs are still terrible, coupled with AWS / GCP docs around these things it makes it near impossible to get this stuff to work without investing a significant amount of time.
thangngoc89 2 years ago

Pytorch is the most painless one because everything is bundled in the wheel. Latest stable CUDA supported by PyTorch is 11.8 and I have been running it on a CUDA 12.0 machine because CUDA is backward compatible. Tensorflow on the other hands, requires compilation with the installed CUDA library and it’s truly a pain since I can’t change the machine’s CUDA version.
omgJustTest 2 years ago

Hardware before software!
- paulryanrogers 2 years ago
  
  ATI/AMD GPUs supposedly have great hardware, hamstrung by less-than-great software. In fact it's the lack of some software features making me hesitate to switch despite major cost savings.
  
  mrguyorama 2 years ago
  
  AMD drivers are fine if you only care about gaming. There's the occasional idiocy like the default fan curve for my graphics card refusing to run higher than 70% so that the card will cook itself if you actually use it and hard crash your system or the driver, but eh.
  The real problem is that ROCm is a fucking joke, pathetic, half assed, pretend project. Nobody with power in AMD seems to care that nobody can learn machine learning on their hardware to push it in other places, or that their GPUs that they have recently spent all this time boasting about their higher VRAM which is literally useless unless you want to play poorly optimized AAA titles ported from the PS5.
  People say it works but you basically have to be one of the engineers who wrote it to prove that. Good luck getting it to work with Windows, or any hardware that wasn't purpose built for a cluster partner. It's so stupid. Maybe they genuinely intended to make a real CUDA competitor but noticed the ways that nVidia then had to artificially segment their market through dumb decisions (the VRAM) and bios hacks that didn't work and just gave up on that path.
  
  kapperchino 2 years ago
  
  In fact the amd fine wine is just them fixing their drivers from launch

indymike 2 years ago

Is there some connection between Nvidia and Admiral Hopper's family that makes it ok to appropriate her identity for their product?

vinay427 2 years ago

As far as I can tell they have frequently used the names of famous scientists, such as Kepler, Fermi, Maxwell, Pascal, Turing, Ampere, and Ada Lovelace. This has existed long before Hopper.
justinclift 2 years ago

> appropriate her identity
Hmmm, what's the difference between homage and appropriation for things like this?
- indymike 2 years ago
  
  Claiming a trademark.
  
  justinclift 2 years ago
  
  That's a good point. If they're trying to trademark the name, that sounds like it'd raise some ethical issues. :)

dauertewigkeit 2 years ago

I would like to know how fast this is compared to a model shared on 3 A100s with 32 GBs of VRAM each.

What I am also interested in, is why model sharding has to be done manually. It seems like, one should be able to write a framework that will take your forward step and distribute the amount of layers on the available GPUs, automatically. But I haven't come across such a framework yet.

ttul 2 years ago

Jax has auto-sharding: https://jax.readthedocs.io/en/latest/notebooks/Distributed_a...
josephg 2 years ago

Probably much faster. PCIe is the limiting factor in a 3x A100 setup. Having all that memory directly accessible by the GPU will make a massive difference. Even the massive 480 gigs of cpu memory in this architecture can be accessed many times faster than pcie allows.
- touisteur 2 years ago
  
  Hopefully if you have 3x A100 you have bridged them with NVLink and that's your bottleneck. If you want more bandwith you'll need a NVSwitch. Or a H100 GPU (Lovelace didn't get NVLink, not even the A40-little-NVLink-port - or PCIe5 for that matter) with next gen NVLink.
simon_acca 2 years ago

> take your forward step and distribute the amount of layers on the available GPUs, automatically
This is part of the value proposition of Mojo, Chris Lattner’s latest project. The compiler infrastructure is still in its infancy, but looks promising: https://www.modular.com/mojo
yeahwhatever10 2 years ago

DeepSpeed?

tppiotrowski 2 years ago

Kind of curious what WebGL report [1] would be for one of these devices. Does the extra RAM make maximum texture sizes massive and allow for thousands of textures or is that a software limitation?

[1] https://webglreport.com/

fulafel 2 years ago

What's a superchip? Is it their marketing term for MCM (multi-chip module)?

wmf 2 years ago

More or less yes.

tmikaeld 2 years ago

Hm, that's a heft price, 200K per card.

Mac Studio M2 ULTRA has 192Gb of RAM, potentially 188Gb available for GPU, for 5K.

Wouldn't apple be able to compete with that if they scaled it up?

PragmaticPulp 2 years ago

The compute capacity of these nVidia parts is significantly higher than that of an M2 Ultra. It’s not just about memory capacity.
The memory bandwidth of the nVidia GPUs is also significantly higher than the M2 parts.
The Apple silicone parts are impressive for what they are, but they don’t have a huge efficiency advantage for GPU compute. The full-size GPUs with huge memory buses and a large number of cores are still significantly more powerful.
There’s also a matter of getting data into and out of the GPUs and across the network, which takes a lot more than 10Gbe
The Apple silicon is great for running development workloads locally, but it’s significantly slower than full size GPUs.
I’m kind of surprised at how quickly everyone forgot that Apple’s marketing material greatly exaggerates their GPU performance.
- anaganisk 2 years ago
  
  *Apple Silicon, I'm no grammar Nazi, it made me chuckle.
  
  mrguyorama 2 years ago
  
  Can you imagine how much Apple would charge for branded boobs?
  
  anaganisk 2 years ago
  
  But for sure they would be impressive for what they are:p
ChuckNorris89 2 years ago

Go ahead and ask datacenters why they all use overpriced Nvidia chips for AI training instead of shoving cheaper Mac Studios in there. Their answer might blow your mind.
Spoiler alert: CUDA ecosystem, Linux suport, and most importantly for data centers, Mellanox high speed interfacing with virtually infinite scalability and great virtualization support so they can rent out slices of their HW to customers in exchange for money.
- kkielhofner 2 years ago
  
  A 15 year head start in a category they essentially defined plus an entire generation of executives, developers, and users doesn’t hurt either.
  People complain about the “Nvidia tax” but the hardware is superior (untouchable at datacenter scale) and the “tax” turns into a dividend as soon as your (very expensive) team spends hour after hour (week after week) dealing with issues on other platforms compared to anything based on CUDA often being a Docker pull away with absolutely first class support on any ML framework.
  Nvidia gets a lot of shade on HN and elsewhere but if you’ve spent any time in this field you completely understand why they have 80-90% market share of GPGPU. With Willow[0] and the Willow Inference Server[1] I'm often asked by users with no experience in the space why we don't target AMD, Coral TPUs (don't even get me started), etc. It's almost impossible to understand "why CUDA" unless you've fought these battles and spent time with "alternatives".
  I’ve been active in the space for roughly half a decade and when I look back to my early days I’m amazed what a beginner like me was able to do because of CUDA. I still routinely am. What you’re able to actually accomplish with a $1000 Nvidia card and a few lines with transformers and/or a Docker container is incredible.
  That said I am really looking forward to Apple stepping it up here - I’ve given up on AMD ever getting it together on GPGPU and Intel (with Arc) is even further behind. The space needs some real competition somewhere.
  [0] - https://github.com/toverainc/willow
  [1] - https://github.com/toverainc/willow-inference-server
  
  mk_stjames 2 years ago
  
  I recently sat and thought about it, and I made the bold claim to a friend that the development of CUDA over the last 15 years.. just that activity, as a whole, is one of the largest use of human intelligence to ever be focused on any one thing in history. It's less people than worked on getting to the Moon, but if you measure the project in PhD-person-years it is probably off the charts. And if you include all the engineers and scientists that the program affects in the same way you would have included all the subcontractors for the Apollo program, it is way, way bigger.
  I think the only reason CUDA isn't talked about like the monumentally important human milestone in technological development that it is, is that it is a pretty abstract thing that is difficult for laypeople to visualize.
  
  touisteur 2 years ago
  
  The sheer volume of work of 'just' creating and maintaining a C++-like compiler, highly performant cuBLAS, cufft, cuSOLVER, cuDNN with very high perf, all by their lonesome in closed source shows how far one would have to go to even eat parts of their lunch.
  We (who want a real alternative to NVIDIA) either find a way to pool global resources applicable to this level kind of effort on all new accelerator architectures, or we wait for them to stumble, Intel-like. Intel and AMD not pooling their resources on this is self-defeating.
  
  kkielhofner 2 years ago
  
  This is an excellent point.
  I've been accused of being an Nvidia "fanboy" when I touch on this. I attempt to explain:
  "Nvidia made the very hard, very expensive commitment to developing and supporting CUDA 15 years ago when this space was in it's infancy (non-existent). They SUNK incredible resources into this gamble/vision to universally support CUDA on every chip and every platform - for 15 years. They didn't achieve their position through shady dealings or luck, they earned it with a decade and a half of investment, focus, and execution."
  Granted, they do somewhat abuse the position they have now (as often noted on HN and elsewhere). At the risk of whataboutism I ask: "Show me a corporation that wouldn't. Do you think if AMD had their market share they'd be nice, cuddly good guys?" Microsoft, Intel, Nvidia, Standard Oil, the Phoebus cartel, etc - it goes on and on. Always has and always will. Of course I'm not saying it's a good thing, it's just a fact of the real world.
  
  mrtranscendence 2 years ago
  
  So what if other companies would similarly abuse their position? If it doesn't excuse Nvidia's behavior in some way then it hardly seems relevant to point out. We still need more competition in this space, or Nvidia parts will become increasingly bad deals (as we've seen for consumer GPUs).
  
  kkielhofner 2 years ago
  
  I'm not saying it's a good thing, I'm simply pointing out it's more-or-less universal standard behavior for people and corporations to abuse positions of power unless checked externally - anti-trust action, checks and balances within government, other regulation, etc. As noted the recurring tendency on HN and elsewhere to paint Nvidia as some kind of unique diabolical actor is very strange.
  How are Nvidia consumer GPUs a bad deal? When comparing top of the line cards (I'm not going to bother looking elsewhere in the product lines) for 60% more cost (yes, significant) with an RTX 4090 you get:
  - Performance that walks all over the 7900 XTX[0].
  - The ability to use it to self-host, experiment, learn, etc a never-ending range of ML applications that (as discussed) "just work" as a docker pull away.
  With an RTX 4090 you could have stunning gaming performance one minute and then seconds later be running a local LLM, etc. That is tremendously more value than the 60% price difference. If all I wanted to do was gaming and I was price sensitive I'd save myself the $600 and be happy with an AMD GPU. But, looking at market share[1] (at least 80% across desktop gaming and GPGPU) either the value of AMD GPUs is a little known secret (it isn't) or consumers, with the ability for choice, overwhelmingly see the value in Nvidia GPUs.
  [0] - https://techguided.com/7900-xtx-vs-rtx-4090/
  [1] - https://wccftech.com/q3-2022-discrete-gpu-market-share-repor...
  
  A4ET8a8uTh0 2 years ago
  
  << I'm often asked by users with no experience in the space why we don't target AMD, Coral TPUs (don't even get me started), etc. It's almost impossible to understand "why CUDA" unless you've fought these battles and spent time with "alternatives".
  Would you be willing to elaborate ( ie. I would love to hear you get started )? I absolutely agree that some competition is needed in this space. I am absolutely not an expert so it is hard for me to understand why there is no real alternative to CUDA. Are they just too hard to set up? Not popular enough to have any support?
  
  kkielhofner 2 years ago
  
  Probably the best reference is HN itself. Look at the dozens of GPGPU projects, articles, papers, etc that hit the front page in any given week. Then look to see how many of them support AMD/ROCm. Spoiler alert: virtually zero.
  That's compelling enough to justify the position here but you can do further research to explore the challenges with other platforms (like ROCm). Just glance at issue trackers for Pytorch, Tensorflow, and higher level projects that (rarely) support ROCm - you will notice a clear trend. Even though CUDA is more capable and outnumbers AMD use 10:1 the issues reported currently are:
  Pytorch
  Search for ROCm - 2,557 issues open (~10% market share)
  Search for CUDA - 5,430 issues open (at least 80% market share)
  Even in the limited cases where it's attempted the capability and experience with AMD/ROCm is significantly worse, to the point of "almost no one even bothers anymore" (see my top paragraph).
- adsfgiodsnrio 2 years ago
  
  A "prosumer" desktop is just plain not up to the job. The Mac Studio is comparable to a high-end gaming PC. It is not really a workstation and definitely isn't a server. Its 16 performance cores and 192 GiB of non-ECC memory cannot compete with nearly 200 cores and multiple terabytes. Its GPU compute and memory bandwidth are many times less than what Nvidia can build at the high end.
  The Studio would run large workload many times slower than a high-end server, if it could run them at all.
- edf13 2 years ago
  
  What is the answer?
  
  sterlind 2 years ago
  
  Compute. The NVIDIA cards have way more tflops than the Mac SoC. A simple RTX 4090 has 16,384 CUDA cores. The M2's GPU has... 38 cores, apparently? The M2 GPU clocks in at 3.6tflops, compared to ~100tflops for the RTX 4090.
  And this is just for a consumer GPU, I haven't even touched on the datacenter-grade stuff.
  tl;dr the M2 is an underpowered GPU with a lot of RAM close by, while NVIDIA cards are multiple orders of magnitude more powerful but most of the RAM's a bit farther away.
  seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.
  
  joakleaf 2 years ago
  
  • The GPU in the base M2 has 10 GPU cores and 3.6 tflops
  • The GPU in the M2 Ultra has 76 GPU cores and corresponds to 2x M2 Max, which has 13.6 tflops; so 27.2 tflops [1,2]
  • RTX 4090 has 82.58 tflops [3] (overclocked can reach 100 tflops [4])
  While more powerful NVidia cards are not "multiple orders of magnitude more". Rather it seems a 4090 is around 3-4 faster than an M2 Ultra GPU.
  Keep in mind that the Apple Silicon chips also have low-precission Neural Engine circuits for inference of neural nets. For the M2 Ultra they claim 31.6 tops [1].
  [1] https://www.apple.com/newsroom/2023/06/apple-introduces-m2-u...
  [2] https://www.notebookcheck.net/Apple-unveils-M2-Pro-and-M2-Ma...
  [3] https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889
  [4] https://videocardz.com/newz/overclocked-nvidia-rtx-4090-gpu-...
  
  lostmsu 2 years ago
  
  Since we are talking about AI, you should use Tensor Core Tops instead of regular flops. That number for 4090 is 330. So more than 10x of M2 Ultra
  
  mosshammer 2 years ago
  
  The RTX 4090 has 330.3 TOPS with INT8 precision, so for inference workloads it is still a magnitude faster
  
  fyzix 2 years ago
  
  Max to ultra doesn't scale linearly. Max tech on YouTube covered this extensively
  
  joakleaf 2 years ago
  
  Probably true that performance doesn't scale linearly (and I do remember the Max Tech analysis of M1), but I don't have the tflops number for the M2 ultra.
  For the M1 the stated values are:
  M1 Ultra [1]: 21 TFLOPs M1 Max [2]: 10.6 TFLOPS
  Actual performance obviously depends on the type of work-load etc. E.g. Geekbench Ultra is only 50% faster [3]:
  M1 Ultra Geekbench 6: 150260 M1 Max Geekbench 6: 108198
  [1] https://videocardz.com/newz/despite-apples-claims-m1-ultra-g... [2] https://wccftech.com/m1-max-teraflops-performance-higher-tha... [3] https://browser.geekbench.com/metal-benchmarks
  
  senko 2 years ago
  
  > seriously, measuring graphics cards in gigabytes is like measuring battery life in volts. it's the wrong unit.
  Yes, but, if you need 48GB to run inference on a model, and you only have 24GB available, you don't get to enjoy the tflops difference.
  If nVidia released somewhat lower-performance GPUs but with more VRAM, we could talk. But they're not stupid :)
  
  nightski 2 years ago
  
  You can get an A6000 for cheaper than an M2 Ultra. That said it's still only 48GB.
  
  senko 2 years ago
  
  I'm not sure if it's cheaper when you compare the whole machine price, not just GPU, since you can't buy piecewise with the Mac.
  
  nightski 2 years ago
  
  The GPU is $4500. That gives you $2000 to build a machine (which you can build a top of the line consumer PC for cheaper than that, I just did recently w/128GB DDR5 and Zen 4 7950X3D).
  
  senko 2 years ago
  
  Cool if you can find it at that price. On my side of the pond, it's around €6,000.
  
  sterlind 2 years ago
  
  yes you do. you can use DeepSpeed. it's only like 2x slower.
  
  argsnd 2 years ago
  
  You're not wrong that the RTX 4090 will be much more powerful than an M2 Ultra, but you don't have any idea what you're talking about when you're comparing specs.
  You might be able to compare the number of CUDA cores to the ALU count of the Apple GPUs. I don't know what that is for M2 Ultra yet, but for the 64 core M1 Ultra each core had 16 execution units and each of those had 8 ALUs, for a total of 8,192 ALUs. The M1 Ultra's FP32 performance was in the ballpark of 21tflops - assuming a ~30% improvement in the M2 Ultra that takes us to ~27tflops. Google suggests that for the RTX 4090 it's 83tflops.
  
  cypress66 2 years ago
  
  The cuda cores and m2 cores aren't comparable.
  A more appropriate comparison is the fp16 performance.
  It seems to be 27 tflops for the 38 core M2, and 330 for the 4090.
  The more useful for training fp16 with fp32 accumulate is 165 for the 4090, I don't know about the apple one.
  
  threeseed 2 years ago
  
  There is a video [1] benchmarking last year's M1 Ultra against 3080ti for one ML use case and it was 3x slower. Apple MX does have neural engines which are used but not sure how they compare to CUDA cores.
  Either way quite a bit better than 30x slower.
  [1] https://www.youtube.com/watch?v=k_rmHRKc0JM
  
  jorgemf 2 years ago
  
  In that video they run the Linux experiments over a windows with a virtual machine. And I didn't see the model, but I bet you I can trains a model in a 4090 one 2x faster than in a old 1050 (because i can chose a model which bottle neck could be the data transfering not the actual computation).
  
  throwaway485 2 years ago
  
  I am speculating the answer is that "Nvidia just works", where Apple may be more niche & hassle to get working with their preferred frameworks/stacks/tools.
  
  less_less 2 years ago
  
  Maybe that, but also the Nvidia chips have *vastly* higher performance (see https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-su...). They claim 4 TB/s memory bandwidth and up to 989 single-precision TFlops in tensor mode (67 TFlops for non-tensor ops).
  By contrast, M2 Ultra has 800 GB/s memory bandwidth, 31.6 half-precision TFlops in the Neural Engine, and (extrapolating from https://en.wikipedia.org/wiki/Apple_M2), about 27 single-precision TFlops on the GPU.
  So 5x memory bandwidth, more than double generic throughput, and at least 32x peak tensor throughput. Sure, the Mac Studio uses much less power, but depending on the application that usually doesn't make up for the speed difference.
  
  zirgs 2 years ago
  
  CUDA.
delfinom 2 years ago

Memory is the cheap part at the end of the day. It's about the rest of the chip and why GPUs are fundamentally and radically different from a CPU.
- vlovich123 2 years ago
  
  Memory is rapidly becoming the expensive part compared with compute and memory doesn’t get as cheap as quick as CPU. Notice how main memory price/TiB has basically flat lined for the past 10 years:
  https://ourworldindata.org/grapher/historical-cost-of-comput...
  Sure, they’ve gotten a bit faster but it’s still fairly expensive to outfit more and more RAM.
  GPUs though are indeed much more than memory. However, Apple has a unique unified memory model that no one else has matched yet where the memory is the heart of the machine. That means you don’t even need as much bandwidth because you can transparently access the same data by any chip at the same speed. That’s a pretty powerful design. I doubt Apple will really go into the training side of things because that’s not germane to their use cases because that happens in the cloud where they don’t have a presence yet. Inference is so expect more LLM / stablediffusion acceleration. Now if fine tuning models with additional training becomes a thing, then you’ll see acceleration of that modality on Apple’s machines. But training won’t be a focus because Apple doesn’t care about fickle nerd points that aren’t relevant to their business.
jakjak123 2 years ago

In datacenters, sparse rack space is rather expensive. Plus the compute per watt of these nvidia chips are off the scale compared to anyone else.
gmm1990 2 years ago

Probably apple are the only ones doing the newer tsmc process too
- giobox 2 years ago
  
  While Apple likely gets the first production batches, there are several major vendors lined up as customers for TSMC's latest N3 (used on M2 etc) process:
  > https://www.guru3d.com/news-story/tsmc-has-secured-seven-maj...
  There will very likely be products from Intel, Qualcomm, Nvidia and AMD stuff using it too this year.
wahahah 2 years ago

192GB (1,536Gb)
Synaesthesia 2 years ago

Apple could also market their fast efficient chip to datacentres and servers if they were so inclined. They choose not to compete in certain areas, focusing on consumer products. I don't know why but it's their choice.

throwaway290 2 years ago

> Global hyperscalers and supercomputing centers in Europe and the U.S. are among several customers that will have access to GH200-powered systems.

So, not for everyone.

keyme 2 years ago

If you had a machine that prints money, you wouldn't sell it, would you? You would run it yourself.
It's the same case as with high-end electronic components since 10-15 years ago. No one can/could produce their own smartphone because the high end components are sold only to the largest incumbents. A medium-size startup "has the money" to buy these, even in volume. But qualcomm won't sell.
Same fate expects us here in general purpose computing. Hard to believe, but so was what happened to the electronics supply chain, at the time.
fennecfoxy 2 years ago

Well there are restrictions on certain countries buying systems this powerful with the idea that they might be used for bad things.
Granted, we can use them for bad things, too, but nobody wants to get stabbed with their own knife.

bilsbie 2 years ago

Why does it have cpu ram?

Synaesthesia 2 years ago

Because it's still got separate Ram for CPU and GPU which is how most PC's wor. It's not an integrated circuit which uses shared memory.
- bilsbie 2 years ago
  
  Isn’t this a gpu card though? Should the cpu and system ram be separate?

washadjeffmad 2 years ago

What's the form factor, just two fused SXM5s?

schlupfknoten 2 years ago

What would something like that cost?

laweijfmvo 2 years ago

$200,000
- drtgh 2 years ago
  
  I was going to talk about the power consumption ranging from 450W to 1000W (expensive), but I guess with such price level it will not mind at all.

tmaly 2 years ago

Can I get one of these for my desktop so I can run the 65B models?

seydor 2 years ago

Laptops should have this in a year

pixl97 2 years ago

Only if you want your lap to catch on fire.
Hamuko 2 years ago

>Module thermal design power (TDP): Programmable from 450W to 1000W (CPU + GPU + memory)
I don't think so Tim.
- ed25519FUUU 2 years ago
  
  Nothing like programming with a 1 kW power plant on your crotch.
  
  krylon 2 years ago
  
  They could sell it as a contraceptive for men. (SCNR)
  
  z29LiTp5qUC30n 2 years ago
  
  Yeah, that was like trying to program when my ex-wife wanted attention.
  Just wish I could have typed half as well as Hugh Jackman in Swordfish; then I wouldn't feel so sorry for all of the bugs I introduced into production because of her.
- seydor 2 years ago
  
  Unfoldable solar panel. Apple said yesterday that sitting in the sun is good for your eyes
doubleorseven 2 years ago

Yes, but will it blend?

gitcommitco 2 years ago

[flagged]

sylware 2 years ago

... and still so small and slow compared to a human brain.

KeplerBoy 2 years ago

I don't know about you guys but computers have always been faster at math than my brain.
- tstrimple 2 years ago
  
  For conscious math yeah that's true. We also do a lot of unconscious math. Catching an arcing ball flying through the air for example. We don't think of that as math because it feels so natural to us, but our brain is tracking trajectory and speed and you're coordinating your body to intercept based on those details. Similar story with throwing a pass to where you know someone will be. Those sorts of skills become ingrained to the point where they become a reflex, but it's math computation behind the scenes.
  
  casey2 2 years ago
  
  What math exactly does the brain do while you are trying to catch a ball? Just because you can mathematically model the trajectory of a ball doesn't mean your brain is using that model. The standard method humans use to catch a ball is stabilization at a few updates every second, while the classic method used to tract the trajectory of a ball is newton's method of fluxions. Humans suck at both compared to computers; though the former isn't as much a shameful a loss as the latter.
  
  tstrimple 2 years ago
  
  Because that's what studies into this suggest.
  https://news.mit.edu/2018/study-reveals-how-brain-tracks-obj...
  
  boringuser2 2 years ago
  
  This is simply ridiculous.
- sylware 2 years ago
  
  You totally missed the point: the connectome of a human brain is still orders of magnitude bigger and faster than that.
  Power efficiency does not matter for those.
  Those are interesting only for the implementation of specialized cognitive functions.
  Additionnaly, the connectomes of those chips are 2D and very localized. Human brains are 3D and much less localized. Simulating 3D connectome with those 2D chips slows everything down by a lot.
- hardware2win 2 years ago
  
  How about learning e.g car driving?
  
  echelon 2 years ago
  
  It took humans 300,000 years to learn to drive the car. Much longer than that, if you consider our lineage.
  
  hardware2win 2 years ago
  
  >It took humans 300,000 years to learn to drive the car.
  False, to invent car.
  People manage to learn to operate cars in less than 50 hours
  
  detrites 2 years ago
  
  You're comparing training a model from scratch in ML, to the equivalent of model fine-tuning in humans. It's unfair and incorrect. Eg, no human can drive without first learning to operate their limbs, recognise shapes etc.
  The GP is pointing out that training in fine muscle motor skills, self-awareness and ability to project self-awareness to other objects under ones control etc, all took many thousands of years to develop. AI is faster.
  However, it's again unfair, as AI only knows what it knows from us, so in that sense any comparison is built on shaky ground.
  But for the purposes of comparing a stock human brain as hardware, versus a current high-end GPU specifically in terms of ingesting information and then perform tasks, the GPU beats the human brain "hands-down" in any category.
  The only categories it doesn't are simply ones that no one has trained it to yet - so the argument on a pure hardware capability basis stands.
  
  incrudible 2 years ago
  
  Still, the 300,000 years figure is way off. That's an anatomically modern human and they most likely could be trained to operate a car much like a human of today can. Getting to that point took billions of years, but producing the driver of a car was never the goal of that process.
  It's just the wrong way to look at the problem. You're not trying to develop the generic system that can learn how to drive a car, you're trying to develop the specific system that can safely drive a car occupied by humans, naturally employing machine learning.
  I would argue that we're 95% there, but solving those last 5% is exponentially more expensive, but not commensurately more valuable. There's a "profit ceiling" imposed by the cost of a human driver, which appeats to make solving the problem economically intractable.
  
  gzer0 2 years ago
  
  https://www.youtube.com/watch?v=DjZSZTKYEU4
  Here's a video of Tesla FSD driving through the complicated streets of Los Angeles for an hour straight, with 0 human intervention.
  
  hardware2win 2 years ago
  
  And how many learning-years did it need to achieve that?
  In compare to human's tens of hours?
  
  TN1ck 2 years ago
  
  Tens of hours? I wasn't aware that newborns are allowed to drive :D. Jokes aside, if you want to compare the training time, you should also include the time it takes us to learn our surroundings and navigate the world.
  
  qwytw 2 years ago
  
  Even if we assume 10-15 years a human brain is still astronomically more efficient at this. Unfortunately we can't yet copy-paste of our brain so everyone has to learn it from scratch...
  
  danielbln 2 years ago
  
  More like a few million years. Humans come pre-trained via DNA, the 10-15 years are mainly fine-tuning.
  
  fennecfoxy 2 years ago
  
  Just came to leave the same comment aha.
  We have to include our evolutionary process because a lot of our brain is pretrained, especially visual/motor neurons.
  We seem to be pretrained to pick up language, for example, and the language(s) we hear after being born fill that space, our brains are plastic for a reason.
  
  TeMPOraL 2 years ago
  
  > In compare to human's tens of hours?
  On top of typically around 18 years of learning to process and fuse vision, sound, proprioception and other inputs, to navigate the world and reason about it.
  
  hardware2win 2 years ago
  
  18 years because this is more due to cultural/law requirements, not brain capability
  
  TeMPOraL 2 years ago
  
  And those cultural/law requirements are not arbitrary - they've been set at around the brain capability is sufficient for the task. There's both science and a ton of experimentation behind them.
  
  anonylizard 2 years ago
  
  Humans have 1. Pre-trained fine-motor control and visual perception that is encoded in genetics (think base model training). Who knows how many millenium this took.
  2. 16 years of fine-tuning to adapt to the current modern world.
  3. 50-100 hours of specific task based fine-tuning for driving, think LORA training.
  
  neatze 2 years ago
  
  This type of self driving existed since 90's, as far I understand hard part is driving in traffic and pedestrians.
  
  firecall 2 years ago
  
  It's just a matter of knowing how to teach it...
  
  ed25519FUUU 2 years ago
  
  Why not have it teach itself if it’s so smart!
- MrBuddyCasino 2 years ago
  
  call me back when they can iron my shirts
  
  TeMPOraL 2 years ago
  
  They can, you just won't like the price tag.
  
  gtirloni 2 years ago
  
  The original comment was about the human brain, not a single specialized task.
  
  TeMPOraL 2 years ago
  
  I'm not going to bet against GPT-4, or its multi-modal successor that's supposed to be out soon, being up for this task. The hard bit is manipulator hardware, which is out of scope for this topic, and the tricky bit is teaching a generative model how to use it, which I think is entirely within capabilities of current state-of-the-art models.
mschuster91 2 years ago

Give development a few years and it'll be faster than a human.
- jacquesm 2 years ago
  
  Depending on the task it may well already be faster than a human at equivalent performance.
- MrBuddyCasino 2 years ago
  
  Neurons are very complex and not just on/off switches. We can't even fully simulate the ~300 neurons of c. elegans, not even close.
  
  TeMPOraL 2 years ago
  
  A lot of that complexity comes from them being living cells, optimized for and functioning in a different environment than our silicon-based machines. We don't need to model it all.
  (Though we do need to pay attention to evolution cheating by overfitting relative to what we'd consider a clean design. Some of the complexity may be doing double duty.)
  
  mrguyorama 2 years ago
  
  Slime molds and single celled creatures can learn things despite having ZERO neurons. Neurons are built on top of an already incredibly complex machine evaluating hundreds of thousands of chemical and physical interactions per second that ALL effect how the cell works.
  We aren't likely ever going to reduce that to a model as simple as the one used in machine learning, because it probably isn't that simple period.
  Neurons are not "just" electrical signalling devices. They are complicated processors and systems in their own right.
  
  MrBuddyCasino 2 years ago
  
  > Some of the complexity may be doing double duty.
  Since we have not succeeded in imitating even the most primitive brains, even though computationally we should have enough juice by now, it would seem that complexity can't be discarded at all, no?
  
  TeMPOraL 2 years ago
  
  > Since we have not succeeded in imitating even the most primitive brains, even though computationally we should have enough juice by now
  looks at the browser tab with GPT-4 in it
  looks back again here
  ... we didn't?
  
  AstralStorm 2 years ago
  
  We didn't. Try to get ChatGPT through a maze a C. Elegans can solve in minutes. (That slow mostly because pond scum moves slowly.)
  Seriously, get it to successfully play through a text adventure maze game.
  
  TeMPOraL 2 years ago
  
  > Seriously, get it to successfully play through a text adventure maze game.
  I did exactly that the other day in response to a different objection on a HN thread. Or at least similar enough.
  https://cloud.typingmind.com/share/c0a68cb2-5f59-4e83-b383-b...
  Now, the goal there wasn't to get it to solve a maze, but rather to see how it can come up with a plan of action and adjust it on the fly. But I see no reason a variant of that wouldn't work with a traditional maze game - provided you remember this is a stateless model without volatile memory, so it needs to be fed its memory with every request.
  
  mschuster91 2 years ago
  
  The problem with ChatGPT is it lacks the context depth for more complex tasks - but that is a computation resource limit, not a technical one.
  
  MrBuddyCasino 2 years ago
  
  No we haven't. ChatGPT is basically a sophisticated Markov chain. It is very good at pattern matching, but it has no understanding of anything, or its own will. People who think is even close to AGI are deluded, fooled by an elaborate Mechanical Turk.
  This is also the reason why its output sounds convincing, but is very often factually wrong.
  
  TeMPOraL 2 years ago
  
  I disagree, but that's beside the point here. You yourself narrowed the scope to:
  "imitating even the most primitive brains, even though computationally we should have enough juice by now"
  Which is kind of weird to claim today. GPT-4 may be the strongest counterexample to date, but it's far from the only one.
  Of course, you need to remember not to confuse the brain with attached peripherals. Just because we can't replicate a perfect worm or fly body, complete with bioelectrical and biomechanical components, doesn't mean we can't do better than their brains in silico.
  
  fennecfoxy 2 years ago
  
  I'd also say that some of it might be more computing power required, but much of it is us cracking the "puzzle" to it, we haven't figured out the exact right architecture/structure for creating say an AGI.
  Just like with transformers revolutionising text generation and now things like LoRa and other fine tuning methods are helping us find a better solution to that puzzle, the same will happen for the development of AGIs.
  We will do it, one day.
  
  mrtranscendence 2 years ago
  
  GPT-4 does not "imitate" a "brain"; it does not function like a brain, nor is it even really analogous to a brain in any useful sense. What it imitates is human speech.
  
  TeMPOraL 2 years ago
  
  "Imitating human speech" is not a trivial thing. You can't do it by a lookup table, or by a Markov chain. Not properly, not in open-ended, unscripted situations. It requires capabilities and structures that, if they aren't a world model and basic abstract reasoning skill, then they at least start to look strikingly similar in practice. This is where we are with GPT-4. It doesn't imitate speech. It imitates reasoning.
  And if it walks like a duck, and quacks like a duck, ...
  GPT-4 is a good example because it's pretty clear that the model isn't merely a stochastic parrot (or, if it is in some sense, then in that sense so are we). But it's not the only game in town. Not all generative transformers deal with language. All seem to be powerful association machines, drawing their capabilities from simple algorithms in absurdly high-dimensional spaces. There are many parallels you can draw to brains here, not the least of which is that the overall architecture is simple enough and scalable, that it's exactly the kind of thing evolution could reach and then get railroaded into building on.
  
  MrBuddyCasino 2 years ago
  
  Thanks, I gave up halfway writing this.