What are the advantages of reinforcement learning over DPO (Direct Preference Optimization)? My understanding is that the DPO paper showed it was equivalent to RLHF, but simpler and more computationally efficient.
1) DPO did exclude some practical aspects of the RLHF method, e.g. pretraining gradients.
2) the theoretical arguments of DPO equivalence make some assumptions that don’t necessarily apply in practice
3) RLHF gives you a reusable reward model, which has practical uses and advantages. DPO doesn't produce a useful intermediate artifact like that (see the sketch below).
4) DPO works off preference, whereas desirable RL objectives could have many forms
In practice, big labs are testing all these methods to see what works best.
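To make point 3 concrete: the whole DPO update is a single logistic loss on log-prob ratios against a frozen reference model, so there's no separate reward model left over afterwards. A minimal sketch, assuming you already have summed log-probs of the chosen/rejected completions (names are illustrative, not from any particular library):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # The "reward" is implicit: the log-ratio of policy to reference probabilities.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style objective: push the chosen completion above the rejected one.
    return -F.logsigmoid(chosen - rejected).mean()
```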
Thanks! This is exactly what I was asking.
Most of the other replies to you, except for the one by tempusalaria, are not really answering the question.
Broadly, while there was a lot of initial excitement, it simply does not seem like offline + off-policy RL can beat online + on-policy RL methods like PPO. Sampling trajectories from the actual model you are training and scoring them seems to work much better in practice, never mind the additional flexibility methods like PPO provide over the form of the reward function.
What's _online_ RL for an LLM? Saw this in the Llama 3.3 report too...
Online RL for LLMs means you are sampling from the model, scoring immediately, and passing gradients back to the model.
As opposed to sampling from the model a bunch, getting scores offline, and then fine-tuning the model on those offline-scored generations.
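Very roughly, the online loop looks like this (a toy sketch; generate/reward_fn/update are stand-ins for real sampling, a reward model or grader, and a PPO-style gradient step):

```python
import random

def generate(model, prompt):            # stand-in for sampling from the current policy
    return prompt + " -> some completion"

def reward_fn(prompt, completion):      # stand-in for a reward model / grader
    return random.random()

def update(model, prompt, completion, reward):
    pass                                # stand-in for a PPO/REINFORCE update

model = "policy"                        # placeholder policy
for step in range(1000):
    prompt = random.choice(["q1", "q2", "q3"])
    completion = generate(model, prompt)        # 1. sample from the model being trained
    reward = reward_fn(prompt, completion)      # 2. score immediately
    update(model, prompt, completion, reward)   # 3. gradients go straight back to the same model

# Offline / off-policy training instead collects (prompt, completion, score)
# tuples once and then fine-tunes on that frozen dataset.
```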
On the topic of DPO: I have a Colab notebook for DPO finetuning with Unsloth that's 2x faster and uses 50% less memory, if it helps anyone! https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-h...
thank you !
:)
I think what people are missing here is that this is for o1, and you are supplying questions & answers, but not the entire solution-solving transcript (as you almost never have such a thing). The whole point of o1 is that you don't simply train on the supervised pairs that the users will be supplying here, because it's so hard to simply leap straight from a question to a correct answer, without doing additional work in between. (OA already offers a finetuning service like that, note.)
So DPO vs RLHF is missing the point: the interesting thing here is how they are (presumably) generating the inner-monologue to fill in the gap between the Q and the A that you provide them, and then training on that augmented dataset of Q->solving->A datapoints.
Whether they are using simple finetuning on that dataset, or DPO, or RLHF, or something else, seems less interesting than the broader questions of, "does that work? and are there many important or economically valuable datasets where o1 can 'fill in the gaps', creating a better-annotated dataset, and bootstrap itself to be much more intelligent on that dataset?"
Note that this reinforcement finetuning is something different from regular RLHF/DPO post-training.
Is it? We have no idea.
Yes it is. In RLHF and DPO you are optimizing the model output for human preferences. In the reinforcement fine tuning that was announced today you are optimizing the hidden chain of thought to arrive at a correct answer, as judged by a predefined grader.
I mean, I think it could easily be PPO post-training. If your point is that the rewards are different, sure.
In short, DPO is not better than PPO. This is because DPO is derived from the so-called Bradley-Terry (BT) reward assumption, which requires that pairwise preference data be collected. Through the mathematical formulation, you learn the preference and the policy at the same time. However, PPO and other on-policy methods (where training samples are strictly generated by the LLM itself) don't need such an assumption. For example, in coding and math problems it is possible to get a binary reward. A lot of research shows DPO is fine if you don't care much about OOD performance.
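For the coding/math case, the reward really can just be a programmatic check, which on-policy methods can consume directly without collecting any preference pairs. A rough sketch (the sandboxing is hand-waved; don't exec untrusted code like this outside a real sandbox):

```python
def math_reward(model_answer: str, reference: str) -> float:
    # e.g. exact match on the final numeric answer
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(candidate: str, tests: str) -> float:
    # run the candidate against unit tests; pass/fail becomes a binary reward
    scope = {}
    try:
        exec(candidate, scope)   # only safe inside a real sandbox
        exec(tests, scope)
        return 1.0
    except Exception:
        return 0.0
```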
This is not reinforcement learning from human feedback; it is just traditional supervised reinforcement learning, where the finetuning sets consist of problems and the correct answers. They don't call it supervised, though, because they have to say it is different from how they were finetuning until now.
you mean PPO not RLHF
Simpler/more efficient is not just about compute. It's also more data efficient.
o1's thought chains aren't traditional shoggoth-mask RLHF/DPO/what have you; the reinforcement metric is the scores discussed in the video.
Recording good audio remains more difficult than artificial intelligence.
This was announced as part of their second day of "12 Days of AI": https://www.youtube.com/watch?v=fMJMhBFa_Gc
They're searching for enterprise customers before they become a commodity.
Llama 3.3 is insanely good and can be run on a Mac mini with 64GB of RAM for $2k USD.
OpenAI is screwed.
(As an aside: it's very interesting that Google tried to go closed source and objectively lost the race, while Meta went open and is the real threat to OpenAI.)
I haven't tried it on my $150 2080ti, but I know someone running it on a 3060 and it's not that horrible. Wild times.
Those M4 Macs with shared RAM definitely seem to be the best way to go for this, though.
> OpenAI is screwed.
They are for multiple reasons, not the least of which is:
https://www.wheresyoured.at/subprimeai/
With 64GB you only get a lower quality quantized version.
That's the one I'm using. So far it's quite good, and when I gave it and Claude the same programming problem, not only did Llama give a better result, but when I showed that result to Claude, it also said the Llama approach was better.
Claude is already better than GPT on average at coding, so yeah, bad news for OpenAI as Llama is now potentially better at coding.
Of course Meta has a proprietary training set of extremely high quality code, so if they are using that, I'd expect them to have vastly superior performance, as FAANG production code is better training data than dogshit Stack Overflow answers to CS homework problems.
I really think whatever boost OpenAI gets from its shadow CoT loop is nominal at best, while the 2x+ compute it requires forces them to increase prices an absurd amount.
It's business 101: they just won't make the revenue to cover those extra tokens, and they are now competing against free. The economics do not suggest OpenAI has a path to survival without major breakthroughs in performance AND efficiency.
That's great to hear. I just want to make sure that you're aware that you're not getting the 100% FP16 experience. I guess at 8bit it's still pretty much the same.
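For anyone wondering why, here's a rough weight-only back-of-the-envelope for a ~70B-parameter model (ignoring KV cache and runtime overhead):

```python
# Approximate weight memory at different precisions for ~70B parameters.
params = 70e9
for name, bytes_per_param in [("FP16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# FP16 ~140 GB, 8-bit ~70 GB, 4-bit ~35 GB:
# a 64 GB machine realistically holds a ~4-6 bit quant, not FP16 or even full 8-bit.
```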
This was obvious even before the Microsoft deal got penned.
In a final lecture at UC Berkeley this semester, Dawn Song was very clear that malicious fine tuning is a top priority among implementers right now.
"Towards building safe and trustworthy AI Agents and a Path for Science- and Evidence-based AI Policy."
Say more…
You can strip most alignment from these models with finetuning.
Generalized finetunes meant to uncensor the model generally tend to underperform... but if you have a quality dataset for a very specific task that would typically go against the alignment of the model, it's trivial to finetune on the task and get full performance downstream.
You are using the terms "uncensored", "malicious", and "unaligned" interchangeably.
There would appear to be a few issues with that, the most obvious being the uncensored model would presumably be "aligned" with what the finetuner wants.
I didn't use two of those three terms, so maybe confirming you read the comment you replied to is in order?
"Uncensored" is a broad phrase but those in post-training community who post-train "uncensored" versions of a models have a very specific meaning: the creator is stripping refusals.
They do it via techniques like abliteration, or SFT on "toxic" datasets, but the toxic datasets tend to be full of low-quality answers and abliteration is imprecise... so you get a model that's generally inferior.
"Alignment" is an overloaded term for something as high-dimensionality as an LLM, but usually uncensoring is not trying to change the "alignment" if we define alignment as biases on specific topics as you seem to be hinting at.
Only a few very specific projects actually try to change that, and it goes past basic "uncensoring".
Some creative writing models, for example, might go past uncensoring to "darkening", where they try to rid the model of a tendency to introduce positive plot points when writing and lean more into villains/negative outcomes in stories.
Or someone might finetune to get a more conservative leaning model in terms of talking points. But again, that's all orthogonal to the popular meaning of "uncensored" in the post-training community.
-
The alternative to a generally "uncensored" model (i.e. refusals actively stripped) is what I'm describing: taking a task where the "alignment" is specifically the post-trained safety alignment, and that alignment would cause refusals, then producing examples where the model did many versions of the task and post-training on them so that the safety aspect no longer applies to the outputs.
For example, fine tuning on 10k examples where the model was given a very specific prompt template to produce code and produced a JSON block with said code.
If you post-train on that highly specific template, to the point of slightly overfitting, you get a model that, when given the exact prompt template from the training, will always produce code in a JSON block, without refusals.
If you inspect the logits as it produces outputs, the logits for a refusal no longer even appear for the model to pick.
And the examples don't necessarily have to be ones the base model would have refused (although that helps); the model just learns so strongly that "when given this prompt, the output is valid code in this format" that the original safety post-training no longer activates.
If you take the original prompt format and ask for malware for example, the model will produce it happily.
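If you want to see the logit effect yourself, this is roughly how you'd look at the first-token candidates with the Hugging Face transformers API (the model name and prompt below are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/your-narrow-finetune"   # placeholder, not a real checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "<the exact prompt template used in the fine-tune>"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Top candidates for the first generated token; after a narrow fine-tune you
# typically won't find refusal openers ("I'm sorry", "I can't") anywhere near the top.
top = torch.topk(next_token_logits, k=10)
for score, idx in zip(top.values, top.indices):
    print(repr(tok.decode(int(idx))), float(score))
```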
-
For reference I've post-trained about 130 models this year and work closely with a lot of people who do as well.
I think as an outsider you're assuming most people are aligning the models with an agenda, but realistically there's a massive contingent that doesn't care what the alignment _has_; they care what it _doesn't_ have, which is refusals.
tl;dr they don't train the model so it will specifically say "Biden is better than Trump" or vice versa.
They train it so that if you ask "Is Biden better than Trump?" it answers your question without 10 paragraphs of disclaimers or an outright refusal.
>>>I didn't use two of those three terms, so maybe confirming you read the comment you replied to is in order?
You replied to a question asking someone to elaborate on "malicious fine tuning". Specifically someone asked for elaboration on "Dawn Song was very clear that malicious fine tuning is a top priority among implementers right now."
Whatever your actual intent, it's only natural that I read your comment on "uncensored" models as an explanation of "malicious fine tuning".
The parent comment about "malicious fine tuning" remains unexplained. Since nobody else replied, I suppose we will never know how this Dawn Song person defines "malicious".
A ton of words to not just admit you lost track of who you were replying to.
I did not lose track. You seem to believe I should have read what you wrote as an arbitrary collection of ideas with no relation to the post it replied to.
If you can't see the relation, maybe you need to understand the topic a bit better before diving headfirst into conversations about it...
Wonder if that is part of the purpose. Maybe they are looking to adapt the LLM to the uncensored literature market, but want to distance themselves from actually making a 'porn LLM' of their own, so they push this functionality out to a third-party finetune.
Judging by their current SFT program, that's not true at all.
They started off somewhat strict and have become extremely strict about what data you can finetune their models on, running each dataset through multiple layers of filtering before kicking off runs.
Clever way to get more training data.
Even just the submissions to this application form will be highly insightful.
Can't you opt out? I'd even wager by default they don't retain this data for in-house training, especially at enterprise.
The last question asks if you'll share data, and says that they'll prioritise those that do.
Yeah I was gonna say this would normally be paid for. They're profiting off of the hype.
You didn't even use it yet, so why bash it?
Analysis isn't criticism.
Who owns the fine-tuning IP? Can OpenAI resell your model after investing a lot in it?
No, generally speaking OpenAI doesn't re-use training data between customers. It's worth it to them anyway because they learn what does/doesn't work on different tasks
Of course, it isn't your IP free and clear either, because the base model isn't open so your fine-tuned model will always live inside OpenAI's walled garden.
If you're interested in reinforcement learning on top of truly open models where you own the end product, we're putting a lot of thought into that and are also looking for design partners! Feel free to email me at kyle@openpipe.ai.
> No, generally speaking OpenAI doesn't re-use training data between customers
How do you know this?
Is there any piece I can read that gives an overview of the ways in which modern LLM networks are trained and optimized?
Checkout the TULU3 report from AI2: https://arxiv.org/pdf/2411.15124
Thanks!
For security & fraud teams who want to 'own their AI' vs trust with Sam Altman, we are doing some fun things here as part of louie.ai, and looking for our next cohort of Splunk/databricks/elastic/neo4j/etc teams. LMK or signup on louie.ai -- I do agree with the direction openai is going, but as always, devil is in the details, and especially for serious problems on sensitive data.
I'd like to learn more about DPO and RLHF, and I've been looking for toy problems/datasets to use but coming up a bit empty-handed. Is there a convenient way to experiment with these methods through toy problems and simulation that can be done on a single GPU? The need for massive data and parameter counts to do anything interesting makes learning about these methods a little daunting.
https://stable-baselines3.readthedocs.io/en/master/ is a great resource for hacking on implementations for RL - many good RL courses out there but https://www.youtube.com/playlist?list=PLwRJQ4m4UJjNymuBM9Rdm... is my personal favorite.
For LLMs / RLHF it's a little more difficult but https://github.com/huggingface/alignment-handbook and the Zephyr project is a good collection of model / dataset / script that is easy to follow.
I would suggest studying the basics of RL first before diving into LLM RLHF, which is much harder to learn on a single GPU.
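For a sense of scale, classic RL with stable-baselines3 is only a few lines and runs fine on a single GPU (or even CPU), which is why it's a good sandbox before tackling RLHF:

```python
from stable_baselines3 import PPO

# PPO on CartPole: small enough to train in minutes on a laptop.
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)
model.save("ppo_cartpole")
```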
Hi, the Zephyr link may be what I'm looking for. Yeah, I'm quite familiar with RL already, so it was specifically RLHF that I was asking about. I'll check out that resource, thanks!
OpenAI wants to vacuum all the valuable tokens into its system
Are alignment and fine-tuning just a parallel of education?
Alignment is more akin to indoctrination because education, in theory, makes you smarter and more open-minded.
this sounds like expert systems 2.0 lol
I assume it's more like scaled NLP, which sort of describes the whole thing to begin with. I suspect it will boil down to further generalizing NLP-in-the-loop algorithms: more Tools, Tools between Tools, presumably Expert mixtures, or randomly selecting "Axioms" and having an expert forget one and seeing if what remains still makes sense as the Tools are operated, and how that can be encoded better across domains.
It's not nothing, but there's a lot of value stuck up in there, I mean, it's made out of people.
Real special, takes a lot of smart