What are the advantages of reinforcement learning over DPO (Direct Preference Optimization)? My understanding is that the DPO paper showed it was equivalent to RLHF, but simpler and more computationally efficient.
1) DPO did exclude some practical aspects of the RLHF method, e.g. pretraining gradients.
2) the theoretical arguments of DPO equivalence make some assumptions that don’t necessarily apply in practice
3) RLHF gives you a reusable reward model, which has practical uses and advantages. DPO doesn't produce a useful intermediate artifact like that (see the sketch below).
4) DPO works off preference, whereas desirable RL objectives could have many forms
In practice, big labs are testing all these methods to see what works best.
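To make point 3 concrete: the whole DPO update is a single logistic loss on log-prob ratios against a frozen reference model, so there's no separate reward model left over afterwards. A minimal sketch, assuming you already have summed log-probs of the chosen/rejected completions (names are illustrative, not from any particular library):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # The "reward" is implicit: the log-ratio of policy to reference probabilities.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style objective: push the chosen completion above the rejected one.
    return -F.logsigmoid(chosen - rejected).mean()
```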
Thanks! This is exactly what I was asking.
Most of the other replies to you, except for the one by tempusalaria, are not really answering the question.
Broadly, while there was a lot of initial excitement, it simply does not seem like offline + off-policy RL can beat online + on-policy RL methods like PPO. Sampling trajectories from the actual model you are training and scoring them seems to work much better in practice, never mind the additional flexibility methods like PPO provide over the form of the reward function.
What's _online_ RL for an LLM? Saw this in the Llama 3.3 report too...
Online RL for LLMs means you are sampling from the model, scoring immediately, and passing gradients back to the model.
As opposed to sampling from the model a bunch, getting scores offline, and then fine-tuning the model on those offline-scored generations.
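Very roughly, the online loop looks like this (a toy sketch; generate/reward_fn/update are stand-ins for real sampling, a reward model or grader, and a PPO-style gradient step):

```python
import random

def generate(model, prompt):            # stand-in for sampling from the current policy
    return prompt + " -> some completion"

def reward_fn(prompt, completion):      # stand-in for a reward model / grader
    return random.random()

def update(model, prompt, completion, reward):
    pass                                # stand-in for a PPO/REINFORCE update

model = "policy"                        # placeholder policy
for step in range(1000):
    prompt = random.choice(["q1", "q2", "q3"])
    completion = generate(model, prompt)        # 1. sample from the model being trained
    reward = reward_fn(prompt, completion)      # 2. score immediately
    update(model, prompt, completion, reward)   # 3. gradients go straight back to the same model

# Offline / off-policy training instead collects (prompt, completion, score)
# tuples once and then fine-tunes on that frozen dataset.
```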
On the topic of DPO: I have a Colab notebook for DPO finetuning with Unsloth that's 2x faster and uses 50% less memory, if it helps anyone! https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-h...
thank you !
:)
I think what people are missing here is that this is for o1, and you are supplying questions & answers, but not the entire solution-solving transcript (as you almost never have such a thing). The whole point of o1 is that you don't simply train on the supervised pairs that the users will be supplying here, because it's so hard to simply leap straight from a question to a correct answer, without doing additional work in between. (OA already offers a finetuning service like that, note.)
So DPO vs RLHF is missing the point: the interesting thing here is how they are (presumably) generating the inner-monologue to fill in the gap between the Q and the A that you provide them, and then training on that augmented dataset of Q->solving->A datapoints.
Whether they are using simple finetuning on that dataset, or DPO, or RLHF, or something else, seems less interesting than the broader questions of, "does that work? and are there many important or economically valuable datasets where o1 can 'fill in the gaps', creating a better-annotated dataset, and bootstrap itself to be much more intelligent on that dataset?"
Note that this reinforcement finetuning is something different from regular RLHF/DPO post-training.
Is it? We have no idea.
Yes it is. In RLHF and DPO you are optimizing the model output for human preferences. In the reinforcement fine tuning that was announced today you are optimizing the hidden chain of thought to arrive at a correct answer, as judged by a predefined grader.
I mean, I think it could easily be PPO post-training. If your point is that the rewards are different, sure.
In short, DPO is not better than PPO. This is because DPO is derived from the so-called Bradley-Terry (BT) reward assumption, which requires that pairwise preference data be collected. Through the mathematical formulation, you learn the preference and the policy at the same time. However, PPO and other on-policy methods (where training samples are strictly generated by the LLM itself) don't need such an assumption. For example, in coding and math problems it is possible to get a binary reward. A lot of research shows DPO is fine if you don't care much about OOD performance.
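For the coding/math case, the reward really can just be a programmatic check, which on-policy methods can consume directly without collecting any preference pairs. A rough sketch (the sandboxing is hand-waved; don't exec untrusted code like this outside a real sandbox):

```python
def math_reward(model_answer: str, reference: str) -> float:
    # e.g. exact match on the final numeric answer
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(candidate: str, tests: str) -> float:
    # run the candidate against unit tests; pass/fail becomes a binary reward
    scope = {}
    try:
        exec(candidate, scope)   # only safe inside a real sandbox
        exec(tests, scope)
        return 1.0
    except Exception:
        return 0.0
```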
This is not reinforcement learning from human feedback; it is just traditional supervised reinforcement learning, where the finetuning sets consist of problems and the correct answers. They don't call it supervised, though, because they have to say it is different from how they were finetuning until now.
you mean PPO not RLHF
Simpler/more efficient is not just about compute. It's also more data efficient.
o1's thought chains aren't traditional shoggoth-mask RLHF/DPO/what have you; the reinforcement metric is the scores discussed in the video.
Recording good audio remains more difficult than artificial intelligence.
This was announced as part of their second day of "12 Days of AI": https://www.youtube.com/watch?v=fMJMhBFa_Gc
They're searching for enterprise customers before they become a commodity.
Llama 3.3 is insanely good and can be run on a Mac mini with 64GB of RAM for $2k USD.
OpenAI is screwed.
(As an aside: it's very interesting that Google tried to go closed source and objectively lost the race, while Meta went open and is the real threat to OpenAI.)
I haven't tried it on my $150 2080ti, but I know someone running it on a 3060 and it's not that horrible. Wild times.
Those M4 Macs with shared RAM definitely seem to be the best way to go for this, though.
> OpenAI is screwed.
They are for multiple reasons, not the least of which is:
https://www.wheresyoured.at/subprimeai/
With 64GB you only get a lower quality quantized version.
That's the one I'm using. So far it's quite good, and when I gave it and Claude the same programming problem, not only did Llama give a better result, but when I showed that result to Claude, it also said the Llama approach was better.
Claude is already better than GPT on average at coding, so yeah, bad news for OpenAI as Llama is now potentially better at coding.
Of course Meta has a proprietary training set of extremely high quality code, so if they are using that, I'd expect them to have vastly superior performance, as FAANG production code is better training data than dogshit Stack Overflow answers to CS homework problems.
I really think whatever boost OpenAI gets from its shadow CoT loop is nominal at best, while the 2x+ compute it requires forces them to increase prices an absurd amount.
It's business 101: they just won't make the revenue to cover those extra tokens, and they are now competing against free. The economics do not suggest OpenAI has a path to survival without major breakthroughs in performance AND efficiency.
That's great to hear. I just want to make sure that you're aware that you're not getting the 100% FP16 experience. I guess at 8bit it's still pretty much the same.
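For anyone wondering why, here's a rough weight-only back-of-the-envelope for a ~70B-parameter model (ignoring KV cache and runtime overhead):

```python
# Approximate weight memory at different precisions for ~70B parameters.
params = 70e9
for name, bytes_per_param in [("FP16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# FP16 ~140 GB, 8-bit ~70 GB, 4-bit ~35 GB:
# a 64 GB machine realistically holds a ~4-6 bit quant, not FP16 or even full 8-bit.
```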
This was obvious even before the Microsoft deal got penned.
In a final lecture at UC Berkeley this semester, Dawn Song was very clear that malicious fine tuning is a top priority among implementers right now.
"Towards building safe and trustworthy AI Agents and a Path for Science- and Evidence-based AI Policy."
Say more…
You can strip most alignment from these models with finetuning.
Generalized finetunes meant to uncensor the model generally tend to underperform... but if you have a quality dataset for a very specific task that would typically go against the alignment of the model, it's trivial to finetune on the task and get full performance downstream.
You are using the terms "uncensored", "malicious", and "unaligned" interchangeably.
There would appear to be a few issues with that, the most obvious being the uncensored model would presumably be "aligned" with what the finetuner wants.
I didn't use two of those three terms, so maybe confirming you read the comment you replied to is in order?
"Uncensored" is a broad phrase but those in post-training community who post-train "uncensored" versions of a models have a very specific meaning: the creator is stripping refusals.
They do it via techniques like abliteration, or SFT on "toxic" datasets, but the toxic datasets tend to be full of low-quality answers and abliteration is imprecise... so you get a model that's generally inferior.
"Alignment" is an overloaded term for something as high-dimensionality as an LLM, but usually uncensoring is not trying to change the "alignment" if we define alignment as biases on specific topics as you seem to be hinting at.
Only a few very specific projects actually try to change that, and it goes past basic "uncensoring".
Some creative writing models, for example, might go past uncensoring to "darkening", where they try to rid the model of a tendency to introduce positive plot points when writing and lean more into villains/negative outcomes in stories.
Or someone might finetune to get a more conservative leaning model in terms of talking points. But again, that's all orthogonal to the popular meaning of "uncensored" in the post-training community.
-
The alternative to a generally "uncensored" model (i.e. refusals actively stripped) is what I'm describing: taking a task where the "alignment" is specifically the post-trained safety alignment, and that alignment would cause refusals, then producing examples where the model did many versions of the task and post-training on them so that the safety aspect no longer applies to the outputs.
For example, fine tuning on 10k examples where the model was given a very specific prompt template to produce code and produced a JSON block with said code.
If you post-train on that highly specific template, to the point of slightly overfitting, you get a model that, when given the exact prompt template from the training, will always produce code in a JSON block, without refusals.
If you inspect the logits as it produces outputs, the logits for a refusal no longer even appear for the model to pick.
And the examples don't necessarily have to be ones the base model would have refused (although that helps); the model just learns so strongly that "when given this prompt, the output is valid code in this format" that the original safety post-training no longer activates.
If you take the original prompt format and ask for malware for example, the model will produce it happily.
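If you want to see the logit effect yourself, this is roughly how you'd look at the first-token candidates with the Hugging Face transformers API (the model name and prompt below are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/your-narrow-finetune"   # placeholder, not a real checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "<the exact prompt template used in the fine-tune>"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Top candidates for the first generated token; after a narrow fine-tune you
# typically won't find refusal openers ("I'm sorry", "I can't") anywhere near the top.
top = torch.topk(next_token_logits, k=10)
for score, idx in zip(top.values, top.indices):
    print(repr(tok.decode(int(idx))), float(score))
```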
-
For reference I've post-trained about 130 models this year and work closely with a lot of people who do as well.
I think as an outsider you're assuming most people are aligning the models with an agenda, but realistically there's a massive contingent that doesn't care what the alignment _has_; they care what it _doesn't_ have, which is refusals.
tl;dr they don't train the model so it will specifically say "Biden is better than Trump" or vice versa.
They train it so that if you ask "Is Biden better than Trump?" it answers your question without 10 paragraphs of disclaimers or an outright refusal.
>>>I didn't use two of those three terms, so maybe confirming you read the comment you replied to is in order?
You replied to a question asking someone to elaborate on "malicious fine tuning". Specifically someone asked for elaboration on "Dawn Song was very clear that malicious fine tuning is a top priority among implementers right now."
Whatever your actual intent, it's only natural that I read your comment on "uncensored" models as an explanation of "malicious fine tuning".
The parent comment about "malicious fine tuning" remains unexplained. Since nobody else replied, I suppose we will never know how this Dawn Song person defines "malicious".
A ton of words to not just admit you lost track of who you were replying to.
I did not lose track. You seem to believe I should have read what you wrote as an arbitrary collection of ideas with no relation to the post it replied to.
If you can't see the relation, maybe you need to understand the topic a bit better before diving headfirst into conversations about it...
Wonder if that is part of the purpose. Maybe they are looking to adapt the LLM to the uncensored literature market, but want to distance themselves from actually making a 'porn LLM' of their own, so they push this functionality out to a third-party finetune.
Judging by their current SFT program, that's not true at all.
They started off somewhat strict and have become extremely strict about what data you can finetune their models on, running each dataset through multiple layers of filtering before kicking off runs.
Clever way to get more training data.
Even just the submissions to this application form will be highly insightful.
Can't you opt out? I'd even wager by default they don't retain this data for in-house training, especially at enterprise.
The last question asks if you'll share data, and says that they'll prioritise those that do.
Yeah I was gonna say this would normally be paid for. They're profiting off of the hype.
You didn't even use it yet, so why bash it?
Analysis isn't criticism.
Who owns the fine-tuning IP? Can OpenAI resell your model after investing a lot in it?
No, generally speaking OpenAI doesn't re-use training data between customers. It's worth it to them anyway because they learn what does/doesn't work on different tasks
Of course, it isn't your IP free and clear either, because the base model isn't open so your fine-tuned model will always live inside OpenAI's walled garden.
If you're interested in reinforcement learning on top of truly open models where you own the end product, we're putting a lot of thought into that and are also looking for design partners! Feel free to email me at kyle@openpipe.ai.
> No, generally speaking OpenAI doesn't re-use training data between customers
How do you know this?
Is there any piece I can read that gives an overview of the ways in which modern LLM networks are trained and optimized?
Checkout the TULU3 report from AI2: https://arxiv.org/pdf/2411.15124
Thanks!
For security & fraud teams who want to 'own their AI' vs trust with Sam Altman, we are doing some fun things here as part of louie.ai, and looking for our next cohort of Splunk/databricks/elastic/neo4j/etc teams. LMK or signup on louie.ai -- I do agree with the direction openai is going, but as always, devil is in the details, and especially for serious problems on sensitive data.
I'd like to learn more about DPO and RLHF, and I've been looking for toy problems/datasets to use but coming up a bit empty-handed. Is there a convenient way to experiment with these methods through toy problems and simulation that can be done on a single GPU? The need for massive data and parameter counts to do anything interesting makes learning about these methods a little daunting.
https://stable-baselines3.readthedocs.io/en/master/ is a great resource for hacking on implementations for RL - many good RL courses out there but https://www.youtube.com/playlist?list=PLwRJQ4m4UJjNymuBM9Rdm... is my personal favorite.
For LLMs / RLHF it's a little more difficult but https://github.com/huggingface/alignment-handbook and the Zephyr project is a good collection of model / dataset / script that is easy to follow.
I would suggest studying the basics of RL first before diving into LLM RLHF, which is much harder to learn on a single GPU.
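For a sense of scale, classic RL with stable-baselines3 is only a few lines and runs fine on a single GPU (or even CPU), which is why it's a good sandbox before tackling RLHF:

```python
from stable_baselines3 import PPO

# PPO on CartPole: small enough to train in minutes on a laptop.
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)
model.save("ppo_cartpole")
```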
Hi, the Zephyr link may be what I'm looking for. Yeah, I'm quite familiar with RL already, so it was specifically RLHF that I was asking about. I'll check out that resource, thanks!
OpenAI wants to vacuum all the valuable tokens into its system
Are alignment and fine-tuning just a parallel of education?
Alignment is more akin to indoctrination because education, in theory, makes you smarter and more open-minded.
this sounds like expert systems 2.0 lol
I assume it's more like scaled NLP, which sort of describes the whole thing to begin with. I suspect it will boil down to further generalizing NLP-in-the-loop algorithms: more Tools, Tools between Tools, presumably Expert mixtures, or randomly selecting "Axioms" and having an expert forget one and seeing if what remains still makes sense as the Tools are operated, and how that can be encoded better across domains.
It's not nothing, but there's a lot of value stuck up in there, I mean, it's made out of people.
Real special, takes a lot of smart