I'm trying to understand the implications of this thing. Please forgive my ignorance. So, a DPU (data processing unit) is basically a daughter board that has multiple CPUs, a big network interface, modest RAM, but also NVMe capabilities? There's a bit of architectural stuff I didn't understand in the article.
Is this the world going full circle and reintroducing the mini-PCs of the '80s and '90s era?
I understand, I think, the usefulness of having these in a data centre, each customer having their own DPU which would present to them as a bare-metal device.
I understand, I think, the crypto guys loving this for compute power, easily expandable.
I also understand this is not for average consumers... but this is HN... what other uses can we put through this / what advantages does this physical architecture give us?
If anyone could elucidate...? It looks exciting, but I'm not sure of the scope.
DPUs/SmartNICs are a partial industry response to AWS Nitro (based on the Annapurna acquisition), which performs I/O (network/storage) virtualization.
There was a 2018 talk on AWS Nitro: https://www.youtube.com/watch?v=e8DVmwj3OEs
Cadence summary: https://community.cadence.com/cadence_blogs_8/b/breakfast-by...
James Hamilton summary: https://perspectives.mvdirona.com/2019/02/aws-nitro-system/
The founder of Annapurna-Nitro is now at https://www.lightbitslabs.com/, which created the NVMe-oF (NVMe over Fabrics, i.e. Ethernet or Fibre Channel) standard. This is implemented via a software driver or hardware accelerator. Both DPUs and NVMe-oF can be viewed as attempts to standardize the "composable data center" architecture pioneered by AWS Nitro.
> but this is HN... what other uses can we put through this / what advantages does this physical architecture give us?
A good place to start is the decade-old NetFPGA SmartNIC research project from the University of Cambridge, now in its 5th generation of hardware, with earlier boards sometimes available on eBay. https://netfpga.org/

    A line-rate, flexible, and open platform for research,
    and classroom experimentation. More than 3,500 NetFPGA
    systems have been deployed at over 300 institutions in
    over 60 countries around the world.

Several hundred papers have been published: https://netfpga.org/Publications.html
Also just to add a tiny bit to this wealth of information:
I just recently did an experiment on the difference between non-Nitro and Nitro-enabled (m4.xlarge vs. m5.xlarge) instances, in a production-ish trendy setup -- Ceph running on Kubernetes (managed by Rook), leveraged to run Postgres and pgbench.
The increase in performance was around +45% TPS, just from switching to the Nitro-enabled instance. The absolute TPS wasn't high because the setup was untuned, but simply switching made quite a difference.
On top of all that, the Nitro instance was actually just slightly cheaper than the non-Nitro instance.
AWS hit it out of the park with this one -- they've supposedly been working on it since 2014, and they certainly built something worth replicating.
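For anyone who wants to reproduce a comparison like this, here is a minimal sketch of the methodology, assuming pgbench is installed and both Postgres endpoints are reachable; the hostnames, scale, and client counts below are placeholders, not the exact setup from the experiment:

    import re
    import subprocess

    def run_pgbench(host, scale=50, clients=16, seconds=60):
        # Initialize the pgbench tables, run the standard TPC-B-like workload,
        # and pull the "tps = ..." figure out of the output.
        subprocess.run(["pgbench", "-h", host, "-i", "-s", str(scale)], check=True)
        result = subprocess.run(
            ["pgbench", "-h", host, "-c", str(clients), "-j", "4", "-T", str(seconds)],
            check=True, capture_output=True, text=True)
        return float(re.search(r"tps = ([\d.]+)", result.stdout).group(1))

    if __name__ == "__main__":
        m4 = run_pgbench("postgres-on-m4.example.internal")  # placeholder host
        m5 = run_pgbench("postgres-on-m5.example.internal")  # placeholder host
        print(f"m4: {m4:.0f} tps, m5: {m5:.0f} tps, delta: {(m5 / m4 - 1) * 100:.1f}%")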
just playing devil's advocate, that 45% is a little skewed though... the M4 instance was an "Intel Xeon E5-2676 v3 Haswell processor" (https://aws.amazon.com/blogs/aws/the-new-m4-instance-type-bo...) and the M5 is a "Custom Intel® Xeon® Platinum 8175M series processors running at 2.5 GHz" (https://aws.amazon.com/blogs/aws/m5-the-next-generation-of-g...). So you are jumping two generations, from E5 v3 to Gen 1 SP processors, plus going to custom processors for AWS (higher and more sustained turbo). I do agree that Nitro is a big change (less overhead for the hypervisor, faster I/O and networking, etc.), but I wonder what the perf diff is from just the CPU upgrades...
Hey thanks for the note -- that's a good confounding factor I didn't take into account.
> Based on Custom Intel® Xeon® Platinum 8175M series processors running at 2.5 GHz, the M5 instances are designed for highly demanding workloads and will deliver 14% better price/performance than the M4 instances on a per-core basis.
If we take AWS at their word here, attributing roughly 14 points to the newer CPUs still leaves something on the order of 30%, which is a pretty decent result with absolutely no tuning -- I'll update the post with this caveat!
Altogether though, this does still support the point that Nitro is well worth the money.
"DPUs/SmartNICs are a partial industry response to AWS Nitro"
This comment (as well as the article) is quite heavy on storage(?)-specific terminology AND products, so to someone who isn't into that area it all appears very opaque, since there isn't any context on what particularly this solves and why it's needed. I know I've made this error a bunch of times myself when writing, and it's OK when you have a very specific audience, but for a wider audience like here you lose a lot of potential interest.
The article includes an explainer: https://www.servethehome.com/what-is-a-dpu-a-data-processing...
Line-rate (10G, 25G, 100G, 400Gbps) packet processing enables software control ("virtualization") of networking and storage. Instead of needing a human to make a physical connection between server and storage or network device, it can be automated via software. This saves money for data centers and allows new products to be launched quickly. It allows customer self-service for purchase and management of storage/network capacity.
Line-rate packet processing also enables automated (authorized) intercept of storage and network traffic.
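To make the "software instead of cables" point concrete, here is a purely hypothetical sketch of what customer self-service provisioning could look like against a DPU/fabric control plane. The endpoint, URL path, and fields are invented for illustration and don't correspond to any real product's API:

    import json
    import urllib.request

    CONTROL_PLANE = "http://dpu-control.example:8080"  # hypothetical endpoint

    def attach_volume(host_id, size_gb):
        # Ask the (imaginary) fabric control plane to carve out a volume and
        # present it to host_id as a remote NVMe namespace -- no cables, no human.
        payload = json.dumps({"host": host_id, "size_gb": size_gb}).encode()
        req = urllib.request.Request(
            CONTROL_PLANE + "/v1/volumes", data=payload,
            headers={"Content-Type": "application/json"}, method="POST")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        print(attach_volume("host-42", 512))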
The DPU can act as a PCIe host or as a PCIe device. In this case, it's placed into a chassis where there is no host (the backplane is only for power), acts as its own PCIe host, and talks to the NVMe SSDs on the bus.
This is a demo and is otherwise pretty absurd, but the main use case of these devices can be to do some pre-processing on the network traffic before handing it off to the host. In addition to the ARM cores, these cards have hardware capable of packet filtering etc., so they can be useful in scenarios such as DDoS mitigation, where the card can easily filter out malicious traffic even at rates of millions of packets per second and only pass legitimate traffic through to the host server.
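As a rough mental model of that filtering -- not how a real card is programmed, since DPUs express this as match-action/flow rules executed in hardware -- a toy per-source rate limiter might look like the sketch below; the window and threshold are arbitrary:

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 1.0        # look at the last second of traffic per source
    MAX_PPS_PER_SOURCE = 10000  # arbitrary per-source packet budget

    recent = defaultdict(deque)  # src_ip -> timestamps of recent packets

    def allow(src_ip, now=None):
        # Return True if a packet from src_ip should be passed to the host.
        now = time.monotonic() if now is None else now
        q = recent[src_ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:  # expire old timestamps
            q.popleft()
        return len(q) <= MAX_PPS_PER_SOURCE

    if __name__ == "__main__":
        print(allow("10.0.0.1"))       # a quiet source is passed through: True
        for _ in range(20000):
            ok = allow("203.0.113.7")  # a flooding source blows its budget
        print(ok)                      # False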
> what advantages does this physical architecture give us?
It also significantly decreases load on the main CPU. It makes a huge difference when doing file/network operations that would otherwise saturate the CPU just handling I/O.
More - https://www.redhat.com/en/blog/optimizing-server-utilization...
Your summary is about right, but these can also be used to present (virtualized) NVMe devices to the host computer. Imagine presenting arbitrary hardware to a host, and being able to control all of that entirely out of band from the host.
Another decent comparison would be older SBCs. Going that route you could build a big storage JBOD presenting block devices or blob storage over a network. That's much cheaper than building a big x86 box with a big chassis, x86 CPU, RAM, etc. And now you have a cheaper, fast NAS for your DC full of virtual hosts.
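As a toy illustration of "presenting block storage over a network" from a small board -- real deployments would use an iSCSI or NVMe-oF target rather than a hand-rolled protocol, and the port, file path, and wire format here are made up -- something like this sketch serves (offset, length) reads from a backing file:

    import socket
    import struct

    BACKING_FILE = "/tmp/blockstore.img"  # stand-in for the local SSD
    REQUEST = struct.Struct("!QI")        # 8-byte offset, 4-byte length

    def serve(port=9000):
        # Answer (offset, length) read requests against the backing file.
        with socket.create_server(("0.0.0.0", port)) as srv, \
                open(BACKING_FILE, "rb") as disk:
            while True:
                conn, _ = srv.accept()
                with conn:
                    header = conn.recv(REQUEST.size)
                    if len(header) < REQUEST.size:
                        continue
                    offset, length = REQUEST.unpack(header)
                    disk.seek(offset)
                    conn.sendall(disk.read(length))

    if __name__ == "__main__":
        serve()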
I am guessing here, but these boards seem to be geared towards accelerated data transfers to feed GPUs. Some GPU cluster workloads are limited by node-to-node transfers that, with traditional computer architectures, have to go through several slow bus hops between GPUs in different nodes. These DPUs are spec'ed to act as a kind of programmable, beefy DMA controller bridging directly between the GPUs, a fast interconnect, and storage.
What is old is new again. IBM mainframes had a concept called a "channel controller": everything connected to the mainframe was basically a computer in itself that offloaded the main system. Every DASD (disk) and communication link was its own system.
Random aside - the Commodore 1541 drive (all of their drives, really) had its own 65xx CPU and some RAM (2K, IIRC, for the 1541). There was copy/backup software where you could hook up two drives, load the program, and then disconnect the drives from the computer. You could put the master in the first drive and a blank disk in the second, and every time you swapped in a new blank it would make another copy.
AFAIK, the oldest prior art would be the 1964 CDC 6600 and its peripheral processors, designed around the idea that memory speed >> CPU speed (true at the time).
https://en.m.wikipedia.org/wiki/CDC_6600
An electrical engineer looks at a computer design and says "If only we had faster switching transistors!" A computer scientist looks at a computer design and says "If only my compiler could optimize code better!"
A computer engineer looks at state of the art across the board and says "This architecture will allow us to maximize performance with currently available components."
If networking or I/O is being held back by the CPU (or vice versa)... there's performance to be gained by doing things differently.
Does anyone know the price of these cards? Trying this out in a home setup would be extremely awesome, but most likely prohibitively expensive.
Mikrotik has a budget equivalent: https://mikrotik.com/product/ccr2004_1g_2xs_pcie
It's nowhere near the Nvidia one, but it does have hardware acceleration for network-related tasks so if you're looking to offload networking/packet filtering/etc, it will do the job perfectly.
Thanks a lot!
A retail Bluefield-2 you can get for anywhere from $1.5k to $2k USD from a variety of online sellers; just look it up, there are a couple of SKUs. You can pick one up second-hand off eBay for about $1k USD right now. It's pretty pricey no matter how you slice it. The real question with a second-hand card is whether you get access to the software or not; AFAIK neither Nvidia nor Mellanox has been too stingy about this stuff in the past (outside of some EULAs), but who knows...
Also JFYI, all of the Bluefield SKUs are passively cooled, so they normally aren't a great fit for a homelab due to the airflow requirements...
You have to create an account to get at the OS images, but they don't require any proof of purchase to download things. The boards I have came with an Ubuntu image, but I've downloaded updated images from their site and flashed them. They also have RHEL, and some containers that are supposed to help you build your own images.
They're expensive. I think the high-end BF2 cards I use at work (2x100 Gbps Eth/IB) were about $2.2k, and the 25 Gb/s ones go for $1.1k. Nvidia's store says the MSRP is more like $3.5k. Make sure you read the power/cooling requirements. I wouldn't use these in anything that isn't rack-mounted and able to push a lot of air over them.
Are there any other old-farts in here who are thinking "S-100 Bus"???
Everything was run from the backplane. Want a new CPU? Plug one in!
Effectively what Turing Pi, Jetson and similar boards are, but I don't understand why computers, as a class, are always designed the way they are, and why there are no serious long-term-supported module-and-backplane options for SOHO. Everything should scale up by exposing all of its resources when physically plugged into something else. I feel like if you just happened to have 15K iPhones lying around, you should be able to readily pool and abstract all their resources into an effective supercomputer cluster... with a bunch of Apple iPhone backplanes, but I don't see those for sale anywhere.
I’ve found ZFS performance on NVMe to be somewhat disappointing; it's a shame there are no benchmarks.
In this particular case, the setup they have is absurd and is only there as a tech demo. They are actually using two DPUs: one talking to the NVMe SSDs and exposing them as iSCSI targets, and the second acting as an iSCSI initiator and then running ZFS on top of that. That's an insane amount of overhead and moving parts that can go wrong (the recommendation is to connect your storage devices to ZFS directly so it becomes aware of storage-device errors as early as possible), so it's no surprise they didn't benchmark it.
Even with no overhead, I also found ZFS to be much slower than other filesystems (not just XFS, but also ext4) on RAID arrays of NVMe drives.
DPUs are cheap blade servers you can mount in a tower? Is that a somewhat good analogy?
DPUs are more expensive than blade servers. They are like application-specific middleboxes ("transforms, inspects, filters, or manipulates traffic for purposes other than packet forwarding") on a PCI card, with gradual standardization for high-value data flows.
If that's so, then:
AWS needs Nitro/DPUs for isolation and security, while self-hosted setups should go with more blades rather than beefy boxes plus DPUs?
Because "DPU" is a thing...