> For extreme performance, we discover and use an out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better.
Practically speaking, is it possible for NVIDIA to "pull the rug" later, intentionally or otherwise, by subtly changing the behaviour of this out-of-doc instruction on new architectures?
They could. That's why there is a switch to disable it.
> If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue.
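For concreteness, here's a minimal sketch of what gating an aggressive load behind that flag can look like in CUDA. The macro and function names are illustrative, not DeepEP's actual code, and it assumes setup.py forwards the flag as a preprocessor define; the .v4.s32 variant matches the 4x32-bit vector loads discussed downthread.

```cuda
#include <cuda_runtime.h>

// Sketch only: gate the undocumented, Hopper-tested load behind a build flag,
// falling back to a documented read-only load everywhere else.
__device__ __forceinline__ int4 ld_nc_aggressive(const int4 *ptr) {
    int4 v;
#ifndef DISABLE_AGGRESSIVE_PTX_INSTRS
    // Out-of-doc modifier combination: non-coherent read, no L1 allocation,
    // 256-byte L2 prefetch hint. Correctness only tested on Hopper (sm_90).
    asm volatile(
        "ld.global.nc.L1::no_allocate.L2::256B.v4.s32 {%0, %1, %2, %3}, [%4];"
        : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w)
        : "l"(ptr));
#else
    v = __ldg(ptr);  // documented read-only (non-coherent) cached load
#endif
    return v;
}
```

If NVIDIA ever changes the behaviour on a new architecture, only the #ifndef branch has to go; callers never see the difference.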
I feel like a kid in a candy shop. Some of these tricks would take way too long to reverse engineer correctly based on the papers. I hope that the releases this week start a renaissance of the use of MoE as baseline academic models.
From this point of view, I don't understand the gap between actual SOTA practice and academic models. The former are at this point all MoEs, starting with GPT-4. But the open models, apart from DeepSeek V3 and Mixtral, are almost always dense models.
MoEs require less computation but more memory, so they're harder to set up in small labs.
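For intuition, a back-of-the-envelope sketch (host-side C++, with a purely hypothetical expert count and size) of why the memory footprint scales with all experts while per-token compute scales only with the routed ones:

```cuda
#include <cstdio>

int main() {
    // Hypothetical MoE layer: every expert lives in GPU memory,
    // but only k of them run for any given token.
    const long long n_experts = 64;
    const long long k_active = 2;
    const long long params_per_expert = 50'000'000;

    const long long stored = n_experts * params_per_expert;  // drives memory
    const long long active = k_active * params_per_expert;   // drives FLOPs/token

    std::printf("stored: %lld params, active per token: %lld (%.1f%%)\n",
                stored, active, 100.0 * active / stored);
    return 0;
}
```

A dense model with the same per-token FLOPs would only need the `active` parameters in memory, which is why small labs with few GPUs tend to prefer dense baselines.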
I assumed GPT-4o wasn't MoE, being a smaller version of GPT-4, but I've never heard either way.
Zuckerberg should stop claiming Meta is open sourcing AI (they are even running TV ads) when they are only releasing the weights, and not the code. Only DeepSeek is real OSS AI.
Well, technically even DeepSeek is not as OSS as OLMo or Open Euro, because they didn't open the data.
We're 2/3rds of the way there.
We need:
1. Open datasets for pretraining, including the tooling used to label and maintain them
2. Open model, training, and inference code. Ideally with the research paper that guides the understanding of the approach and results. (Typically we have the latter, but I've seen some cases where that's omitted.)
3. Open pretrained foundation model weights, fine tunes, etc.
Open AI = Data + Code + Paper + Weights
Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant. It is also why people train a model and generate data with it, rather than sharing the original datasets.
These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.
Sounds like a job for AI.
I understand the reasoning, and I hope there is legislation in the future that basically says "if you can't produce the data, you can't charge more than this for it". Basically, LLM producers would have to treat their product as a commodity, priced only on compute resources plus some overhead.
For understandable reasons
It is pirated material / material that breaks various terms of service. As I understand it, it's the stuff you can see on Anna's Archive, plus a bunch of "artificial" training data from queries to OpenAI's ChatGPT and other LLMs.
DeepSeek is definitely not real OSS. To be open source, you need to use a real open source license (like the ones OSI lists), and you need to share all pre- and post-training code, any code related to tuning, any evaluation code, everything related to safety/censorship/etc., and probably the full training data as well. Otherwise you can't reproduce their weights. Sharing weights is like sharing a compiled program.
As far as I know the only true open source model that is competitive is the OLMo 2 model from AI2:
https://allenai.org/blog/olmo2
They even released an app recently, which is also open source, that does on-device inference:
https://allenai.org/blog/olmoe-app
They also have this other model called Tülu 3, which outperforms DeepSeek V3:
https://allenai.org/blog/tulu-3-405B
Yes, releasing training source code is like releasing the source code of the compiler used to compile and link a binary.
Let's say you took GCC, modified its sources, compiled your code with it, and released your binaries along with the modified GCC source code, while claiming that your software is open source. Well, it wouldn't be.
Releasing training data is extremely hard, as licensing and redistribution rights for that data are difficult to tackle. And it is not clear what exactly the benefits of releasing it would be.
Come on... Meta has been refining PyTorch for more than a decade. It basically contains everything you need to train LLMs, including the latest techniques. What more do you need? The part of the code that is specific to Meta's infrastructure?
Does PyTorch count?
PyTorch had the "first thing that didn't suck" advantage, and now it has such a dominant market share that better alternatives struggle to emerge. Where it sucks (e.g. on macOS), there are popular alternatives. But it's hard to be enthusiastic about a DL framework in 2025 that has no native high-performance quantization support, for example, or one where FSDP is crudely bolted onto the side. They say "usability above all else", but I consider such things major usability deficiencies that need to be addressed. And because PyTorch does not have to fight for market share, it'll be years before we see anything usable there.
I wonder how Meta trains its models. On vanilla PyTorch, or do they actually have closed tools and frameworks?
Open Weights = Binary Blob
It's a return to the FREEWARE / SHAREWARE model.
This is the language we need to use for "open" weights.
You gotta love these guys; they're really pushing the open source frontier for all of us. Thanks for sharing.
Open AI™ (with a space)
There's hilariously nothing open about OpenAI, and that was the plan from the start. Going by the email from Ilya Sutskever, OpenAI was always going to keep all its research and code as proprietary information; "open" supposedly meant the benefits would be shared. So they basically just became a SaaS with a free tier, like most of them. Musk was right when he called them out for fishing for money as if they were a non-profit while always planning to become a company.
> Musk was right when he called them out for fishing for money as if they were a non-profit while always planning to become a company
I believe that he was right, because he of all people should recognize when someone is working from his own playbook of lies and misrepresentation.
Musk is pretty obviously upset because he got outfoxed and cut out of OpenAI, not because of some supposed ideal he holds about safe use of gen AI models.
Kind of ironic that DeepSeek is more Open than ChatGPT
They do it for their own reasons, but OpenAI are straight-up liars: they are neither open, nor do they give a fuck about humanity.
It would be hilarious if this scenario played out.
OpenAI starts as a nonprofit, aiming to benefit all humanity. Eventually, they discover a path to AGI and engage in intense internal debates: Should they abandon their original mission and chase profit, knowing it could bring generational wealth? They ultimately decide, "To hell with humanity—let’s go for the money."
As they pivot to prioritizing profit, DeepSeek emerges. Staying true to OpenAI’s original vision, DeepSeek open-sources everything, benefiting humanity and earning global admiration. Unintentionally, this move tanks OpenAI’s valuation. In the end, OpenAI fails to become the hero or secure the massive profits they chased. Instead, they leave behind a legacy rebranded as "ClosedAI".
Admittedly I'm a sideline observer, but it feels like the first half of your scenario is already happening (sans the AGI).
"I don't want to live in a world where someone else is making the world a better place better than we are"
- Silicon Valley Season 2
OpenAyyyyI swear babe, I'm gonna open it up any day. Yeah, for that greater good or whatever it is you keep yappin' about.
Well, they do give us a great free tool to use, but that's where it ends and probably has some agenda behind it.
> Kind of ironic that DeepSeek is more Open than ChatGPT
Not ironic at all.
You've simply been lied to by OpenAI.
Nothing ironic about being naive.
Now. It's amazing to me that everyone is like "fuck OpenAI, DeepSeek is the savior", when OpenAI's papers and code jump-started an AI revolution just a few years ago. Let's wait the same number of years and see what DeepSeek does.
I thought the papers that jump-started the revolution came from Google?
Indeed. And the papers were about doing better translation of character sequences; essentially, the tech emerged as a linguistics improvement for language. Then someone realised the parrot learns enough ZIP and JPEG alongside, and can spit back hazy memories of it all.
The one still super useful thing OpenAI ever released must've been Whisper. But they could've been much more open, for sure.
Hinton. And if you asked him, he'd probably say Schmidhuber.
I hope you're reading this Sam Altman:
Make Open AI open.
Or else you'll lose to the ecosystem.
Too late. There is no more innovation from OpenAI; all the people who were the drivers left for Anthropic and the others. They had some of the biggest funding, had the head start... and yet they lost it.
> I hope you're reading this Sam Altman
I hope he's not.
All he deserves at this point is to go down as hard as possible.
That's an impossible ask. Sam is the pinnacle of the capitalist ruling class; he's a pure businessman. He has no interest in giving anything away for free unless there's a business plan. He doesn't care about humanity. He'll pretend to change the world and tell you they're inventing AGI, Q*, Strawberry, or whatever they're branding it, but the reality is he knows it's all over, and unless there's a major breakthrough this company will be in major financial trouble. Sorry for the rant, but he doesn't deserve much respect for turning all this science into grift. He's actually the person the old OpenAI board warned everyone about.
Their state-of-the-art speech-to-text model, Whisper, is available as open weights for free.
Strategically, they know speech-to-text needs to run at the edge, and they want users to keep sending requests to their API without incurring latency or a bad user experience.
That is still a fair point, though, and it should be commended. And that hasn't been their only contribution, either.
They could've made it a trusted-computing-only model, distributed with proprietary encryption and unlocked by an expensive licence key, if they had wanted to.
Sam is busy with his new kiddo
Not really open source. For a truly open source model, check out OLMo 2 from AI2:
https://allenai.org/blog/olmo2
They literally share everything you need to recreate their model, including the data itself. This is what they say on that link above:
> Because fully open science requires more than just open weights, we are excited to share a new round of OLMo updates–including weights, data, code, recipes, intermediate checkpoints, and instruction–tuned models—with the broader language modeling community!
- Efficient and optimized all-to-all communication
- Both intranode and internode support with NVLink and RDMA
- High-throughput kernels for training and inference prefilling
- Low-latency kernels for inference decoding
- Native FP8 dispatch support
- Flexible GPU resource control for computation-communication overlapping

X: https://x.com/deepseek_ai/status/1894211757604049133
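For intuition about the all-to-all part: below is a naive MoE token dispatch written with stock NCCL point-to-point calls. This is not DeepEP's code (DeepEP ships its own fused NVLink/RDMA kernels precisely because this pattern is too slow); it's just a sketch of the communication pattern being optimized, with buffer and communicator setup assumed.

```cuda
#include <nccl.h>

// Naive all-to-all: every rank exchanges one token buffer with every peer.
// send_bufs[p] / recv_bufs[p] are preallocated device buffers for peer p.
void naive_all_to_all(void **send_bufs, void **recv_bufs, size_t count,
                      int n_ranks, ncclComm_t comm, cudaStream_t stream) {
    ncclGroupStart();  // batch the sends/recvs so they can't deadlock
    for (int peer = 0; peer < n_ranks; ++peer) {
        ncclSend(send_bufs[peer], count, ncclBfloat16, peer, comm, stream);
        ncclRecv(recv_bufs[peer], count, ncclBfloat16, peer, comm, stream);
    }
    ncclGroupEnd();
}
```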
The incentive behind the work of DeepSeek might very well be wrong (something along the lines of a state-sponsored attempt at shrinking the US first mover advantage in AI to nil) but the net result for everyone on the planet is simply fantastic.
So even in the worst case (doing this for the wrong reasons): thank you DeepSeek, you are actually doing what OpenAI lied through their teeth to the whole world about doing for years.
You rock.
> The incentive behind the work of DeepSeek might very well be wrong (something along the lines of a state-sponsored attempt at shrinking the US first mover advantage in AI to nil)
In the space of international relations, right and wrong don't apply nearly as much. Is open sourcing this any more "wrong" than the export ban on high end Nvidia GPUs?
The open sourcing by DeepSeek (presumably with CCP consent) just happens to be good for both the CCP and the broader open source AI community at the same time, but don't take it as some kind of principled stance by them.
Finding ways to take away other countries' competitive advantages is a major activity of all governments, large and small.
It seems the CCP is less hate-worthy than it was two months ago. Comparing a fake democracy with real authoritarianism is kinda funny.
Round 2 of open source releases from an actual "Open AI™" company and licensed under MIT.
Once again, DeepSeek is more open than the $157B+ one that is claiming to be "Open".
Almost no one is talking about Meta's Llama, and everyone should expect them to release Llama 4 with reasoning.
The objective is to not be squeezed in the middle of the race to zero.
https://www.llama.com/events/llamacon/signup/
So while the US is chasing GPU receipts in Singapore just to ensure DeepSeek was using only H800s, the rest of the world can run these optimizations on full H100s?
While we also pretend that H100s were difficult to get or access because of the US sanctions, and the hubris of believing their edicts blanket the globe?
Am I understanding this correctly?
This feels like the 80s/90s, when people hacked assembly or hunted for undocumented instructions to squeeze performance out of CPUs. One day either the compilers will be optimized enough, or the GPUs so powerful, that such tricks won't make much difference anymore, like on CPUs nowadays :D
Is the PTX that everyone was looking forward to included this time?
Yes, there's some in the csrc/kernels directory. Search for 'asm' to find uses of it.
> the PTX that everyone was looking forward to
explanation for the rest of us why this is so important?
Parallel Thread Execution. Think of them as opcodes for Nvidia GPUs. They are a bit more complex than your traditional opcodes (the lowest level of abstraction accessible to users) on CPUs, as you can specify cache parameters, memory barriers, etc.
There are documented combinations of parameters for those instructions, but if you fuzz (search for new combinations in a random or organized way, hoping some will work the way you want), you can find undocumented ones with unexpected effects or with advantages (in various ways: not polluting caches, speed...).
Which is the case, for example, for the ld.global.nc.L1::no_allocate.L2::256B that they use in DeepSeek's code: it provides significant acceleration while being reliable (although it doesn't work on all architectures, so they have a way to disable it).
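A toy version of that fuzzing loop, assuming ptxas is on PATH (the candidate hint list and the PTX stub are made up for illustration; acceptance by the assembler only proves an encoding exists, the behaviour still has to be validated on real hardware):

```cuda
#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
    const char *hints[] = {"", ".L1::no_allocate", ".L2::128B",
                           ".L2::256B", ".L1::no_allocate.L2::256B"};
    for (const char *h : hints) {
        std::string hint(h);
        // Minimal PTX stub wrapping one candidate load instruction.
        std::string ptx =
            ".version 8.0\n.target sm_90a\n"
            ".visible .entry probe(.param .u64 p) {\n"
            "  .reg .b32 r;\n  .reg .b64 a;\n"
            "  ld.param.u64 a, [p];\n"
            "  ld.global.nc" + hint + ".b32 r, [a];\n"
            "  ret;\n}\n";
        FILE *f = std::fopen("probe.ptx", "w");
        std::fputs(ptx.c_str(), f);
        std::fclose(f);
        // Ask ptxas whether this modifier combination assembles for Hopper.
        int rc = std::system(
            "ptxas -arch=sm_90a probe.ptx -o /dev/null 2>/dev/null");
        std::printf("ld.global.nc%-28s %s\n", h,
                    rc == 0 ? "assembles" : "rejected");
    }
    return 0;
}
```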
Gonna check what SASS it gets translated to and whether it makes any sense.
I wonder if they had a SASS assembler for Hopper (either by reverse engineering nvdisasm, or by fuzzing instructions + nvdisasm + staring hard) and don't want to say it out loud :p
You'd be looking at ptxas here. FWIW, it looks like it generates LDG.E.NA.LTC256B.U8.CONSTANT on my machine.
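If anyone wants to reproduce that check, a minimal probe might look like this (commands in the comments; sm_90a and a Hopper-capable toolchain assumed):

```cuda
// Compile and disassemble:
//   nvcc -arch=sm_90a -cubin probe.cu -o probe.cubin
//   cuobjdump -sass probe.cubin | grep LDG
__global__ void probe(const int *in, int *out) {
    int v;
    // The out-of-doc load under discussion, scalar .s32 variant.
    asm volatile("ld.global.nc.L1::no_allocate.L2::256B.s32 %0, [%1];"
                 : "=r"(v)
                 : "l"(in));
    *out = v;
}
```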
CPUs have instructions with similar semantics.
Much of the hype around DeepSeek is due to their extraordinarily low training and inference costs. They achieved this by optimizing their training code, apparently using PTX in addition to CUDA. PTX is kind of an intermediate assembly language for NVIDIA GPUs and people are eager to see how it was used.
The PTX instructions they talked about in the tech report should be pointing to the code here?
"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."
So non-coherent refers to bypassing cache coherency, ie don't care about what other units might have written to that address? And the L1/L2 modifiers are to avoid L1 thrashing, keeping the value in L2 only?
Or did I get that wrong?
My understanding of the L2 part is that it asks for a 256-byte prefetch (only available on some platforms, it seems), but they use vectors of at most 4 32-bit signed ints (128 bits), so I'm not sure why only the 256B variant would work, or whether fetching the next 128 bytes happens to help.
Yeah that's about right
this might help: https://x.com/main_horse/status/1894215779521794058/photo/1
Spring showers bring May flowers!
Now it includes the highly anticipated PTX! Of course, I don't understand it, but I've already clicked the star and even the fork button, which basically means I've mastered it, right? I feel incredibly powerful right now...