Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”

199 points by kcorbitt 8 months ago

Imnimo 7 months ago

>To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.

I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make reasoning illegible?)

>the 32B model’s response lengths collapsing, especially after reaching peak performance.

I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you do SFT to get a model of the same size to solve these puzzles with no reasoning (just output answers directly), how good can it do?

bradhilton 7 months ago

Yeah, it may help. In this paper[1], the author used a KL penalty of 0.01 for general tasks and 0.001 for mathematical. I tend to think it's probably not very important unless you're trying to optimize for human preferences.
As for response length, I think the model internalizes the logic and doesn't deliberate its answers through context creation. I don't think this is necessarily good for general reasoning, but for a specific task it would cut down inference costs. Just depends on what you're optimizing for. To encourage more general reasoning, I think a broader train and validation set would be helpful.
[1] https://arxiv.org/html/2501.03262v1
jstanley 7 months ago

I keep seeing people mention "illegible reasoning" but I'd be fascinated to see an example of what it actually looks like. Do you have any examples?
Apparently DeepSeek-R1 can switch between English, Chinese, and gibberish, and even the gibberish helps it think! That's fascinating, but all I can find is people saying it, nobody showing it.
- Imnimo 7 months ago
  
  Here's an example of language switching:
  https://gr.inc/question/although-a-few-years-ago-the-fundame...
  In the dropdown set to DeepSeek-R1, switch to the LIMO model (which apparently has a high frequency of language switching).
  I'm not sure about examples of gibberish or totally illegible reasoning. My guess is that since R1-Zero still had the KL penalty, it should all be somewhat legible - the KL penalty encourages the model to not move too far from what the base model would say in any given context.
  
  jstanley 7 months ago
  
  Thanks, that's cool to see. I hadn't seen this site before but browsing around I also found this example: https://gr.inc/question/why-does-the-professor-say-this-good... - also with LIMO.
  
  pizza 7 months ago
  
  Seems like if you want to stay in the same language, you could just add a verifiable rewards term for that w/o having to fully load up on the baggage of a base model KL penalty.
  
  kcorbitt 7 months ago
  
  Yep. And tbh you probably don't even have to do this; the R1 paper found that just running SFT the base model with a relatively small number of monolingual reasoning traces was enough for it to get the idea and iirc they didn't even bother selecting for language specifically in the RL training looop itself.
- NitpickLawyer 7 months ago
  
  Don't have examples handy, but I did a round of grpo on a 7b model and it did indeed start to switch between english, coreean and chinese, but the reward was steadily increasing. RL doesn't care what the middle tokens are, as long as the end result gets the carrot.
  I think there's still a lot to learn about reward functions (saw a team work w/ just correct output, and nothing else), if you should reward partial success (i.e. code compiles / math outputs a result) or just the final thing (i.e. test cases pass / correct answer) and so on.
  Not to mention how to get downstream signals from e2e tasks (i.e. if an "agent" navigates to the correct webpage and finds a "cookie" or something, figure out how to reward all the intermediary steps out of that single binary signal).
  And there's a lot to learn in using grammars & stuff w/ RL as well. The problem there is that the libraries are pretty wonky atm, some things work, some things need work, and RL in itself is pretty slow due to having to generate, update the model and generate again.

jmmcd 7 months ago

These puzzles probably have more in common with "Zebra puzzles" (eg https://www.zebrapuzzles.com/) than Cluedo (USA Clue) itself. I've been doing some one-off experiments with Zebra puzzles recently. All the reasoning models generate an enormous batch of text, trying out possibilities, backtracking, and sometimes getting confused.

From what I can see (not rigorous): Claude 3.7 fails, ChatGPT with reasoning succeeds, DeepSeek with reasoning succeeds.

But of course the best way for a model to solve a problem like this is to translate it into a constraint satisfaction problem, and write out Python code to call a CSP solver.

mdp2021 7 months ago

> But of course the best way for a model to solve a problem like this is to translate it
Which means that when you asked it (e.g.) whether A is better than B (as a Decision Support System), it should write a program to decide it instead of "guessing it" from the network.
You are stating that, since the issue is general, LLMs should write programs to produce their own outputs, instead of their standard output.
- jmmcd 7 months ago
  
  > since the issue is general
  I'm not sure what that means specifically. I don't agree overall. Only certain types of problems encountered by LLMs map cleanly to well-understood problems where existing solvers are perfect.
  
  mdp2021 7 months ago
  
  I am stating that since the ability to solve those puzzles is critical in an intelligence, and the general questions I can think of require an intelligence as processor, if to solve those problems the LLMs "should write code" then in general they should.
  All problems require proficient reasoning to get a proper solution - not only puzzles. Without proper reasoning you can get some "heuristic", which can only be useful if you only needed an unreliable result based on "grosso modo" criteria.
  
  jmmcd 7 months ago
  
  > Without proper reasoning you can get some "heuristic"
  Right, but the question is whether this is good enough. And what counts as "proper". A lot of what we call proper reasoning is still quite informal, and even mathematics is usually not formal enough to be converted directly into a formal language like Coq.
  So this is a deep question: is talking reasoning? Humans talk (out loud, or in their heads). Are they then reasoning? Sure, some of what happens internally is not just self-talk, but the thought experiment goes: if the problem is not completely ineffable, then (a bit like Borges' library) there is some 1000-word text which is the best possible reasoned, witty, English-language 1000-word solution to the problem. In principle, an LLM can generate that.
  If your goal is a reductio, ie my statement must be false since it implies models should write code for every problem - then I disagree, because while the ability to solve these problems might be a requirement to be deemed "an intelligence", nonetheless many other problems which require an intelligence don't require the ability to solve these problems.
  
  mdp2021 7 months ago
  
  > Are they then reasoning
  Reasoning properly is at least operating through processes that output correct results.
  > Borges' library
  Which in fact is exactly made of "non-texts" (the process that produces them is `String s = numToString(n++);` - they are encoded numbers, not "weaved ideas").
  > many other problems which require an intelligence don't require the ability to solve these problems
  Which ones? Which problems that demand producing correct solutions could be solved by a general processor which could not solve a "detective game"?
  
  jmmcd 7 months ago
  
  I can't reply to your new post below, I guess the thread is too deep. But you've bit the bullet and stated that what humans do is not reasoning, I think.
  You didn't like "what colour is the sky" (without looking), ok. "Given the following [unseen during training] page of text, can you guess what emotion the main character is feeling at the end?" This is a problem that a human can solve, and many LLMs can solve, even if they can't solve the detective puzzle. In case it doesn't sound important, this can be reframed as a customer-service sentiment-recognition problem.
  
  mdp2021 7 months ago
  
  > I can't reply to your new post below, I guess the thread is too deep
  (I'd instead guess that you tried to reply before the timer - which allows HN members to reply after a delay proportional to a function of the depth of the discussion tree - allowed you.)
  > do is not reasoning
  What some people do is «not reasoning», for lack of training, or for lack of resources (e.g. time - Herbert Simon's "satisficing"), or for lack of ability. I had to write since the late 2022 boom that "if the cousin you write about is consistently not using the faculty of intelligence you can't call her "intelligent" for the purpose of this discussion". I have just written in another parallel discussion that «There is a difference between John who has a keen ethical sense, Ron who does not exercise it, and Don who is a clinical psychopath with missing cerebral modules making it completely Values-blind» - of course if we had to implement ethics we would "backward engineer" John and use Don as a counter-model.
  > can you guess what emotion
  Let me remind you my words: «Without proper reasoning you can get some "heuristic", which can only be useful if you only needed an unreliable result based on "grosso modo" criteria». Is that problem one that has "true solutions" or one that has "good enough solutions"?
  Let me give another example. Bare LLMs can be "good" (good enough) e.g. in setting capitalization and punctuation in "[a-z0-9 ]" texts, such as raw subtitles. That is because they operate without explicitly pondering the special cases in which it is subtle to unequivocally decide whether the punctuation there "should have been a colon or a dash", and such cases are generally rare, so heuristic seems to suffice.
  Similar engines are useless and/or dangerous in all cases in which correct responses are critical. Important problems are those which require correct responses.
  
  jmmcd 7 months ago
  
  > What some people do is «not reasoning», for lack of training, or for lack of resources (e.g. time - Herbert Simon's "satisficing"), or for lack of ability.
  According to your definition of reasoning, which involves surely getting the right answer, no human does reasoning. Probably less than 1% of published mathematics meets the definition.
  > Important problems are those which require correct responses.
  There are many important problems where formal reasoning is not possible, and yet a decision is required, and both humans and LLMs can provide answers. "Should I accept this proposed business deal / should I declare war / what diagnostic test should I order?" We would like to have correct responses for these problems, but it is not possible, even in principle, to guarantee correctness. So yes, we use heuristics and approximate reasoning for such problems. Is an LLM "unreliable" or "dangerous" in such problems? Maybe yes, and maybe more so than humans, but maybe not, it depends on the case. To try to keep the point of the thread in focus, an LLM should probably not try to solve such problems by writing code.
  
  mdp2021 7 months ago
  
  > According to your definition of reasoning, which involves surely getting the right answer
  No. Let me reiterate: «"proper reasoning" is that process which given sufficient input will surely bring to a correct output owing to the effectiveness of its inner workings», given that enough resources are spent. I.e.: it is a matter of method.
  And a processor that cannot solve the "detective games" shows lacking that method. (I.e.: the general capabilities that can be instanced in solving a "detective game" are required, though not exhaustive, for the reasoner.)
  > we use heuristics and approximate reasoning for such problems
  But we are expected to still use decent reasoning, even when bounded.
  So: there may be no need to try and solve problems through writing code when the reasoning machine has the procedural modules that allow to reason similarly to running code, when such form of "diligence" is needed. When the decision is not that impactful (e.g. "best colour for the car"), let the decisor "feel"; when the decision will be impactful, I want that the decisor be able to reason.
  
  jmmcd 7 months ago
  
  I've said several times that according to your definition, humans do not reason. You haven't really responded to that and I guess you're not going to. I can't quite parse your overall position, ie specifically, as I already said, whether you genuinely think LLMs should output code for most problems, or whether you were using that as a reductio against my initial statement. So, I will stop here and thank you for the discussion.
  
  mdp2021 7 months ago
  
  > according to your definition, humans do not reason
  Some do.
  > whether you genuinely think LLMs should output code for most problems, or whether you were using that as a reductio against my initial statement
  No, they should not. But proper reasoning is related to procedural operations like code.
  
  jmmcd 7 months ago
  
  > Reasoning properly is at least operating through processes that output correct results.
  Human "reasoning" (ie speech or self-talk that sounds a bit like reasoning) often outputs correct results. Does "often" fit the definition?
  > Which problems that demand producing correct solutions could be solved by a processor which could not solve a "detective game"?
  For example, "what colour is the sky right now?". A lot of people could solve this (even if they haven't looked outside), and so could a lot of language models, which can't solve this detective game.
  
  mdp2021 7 months ago
  
  > Does "often" fit the definition?
  No: "proper reasoning" is that process which given sufficient input will surely bring to a correct output owing to the effectiveness of its inner workings.
  > what colour is the sky right now
  That is not a general problem solver, and "output the most common recorded reply to a question" is certainly not a general problem solver, and the responses from the box indicated will easily be worthless for all special cases in which the question will make sense.

layer8 7 months ago

GRPO = Group Relative Policy Optimization

https://arxiv.org/abs/2402.03300

Tostino 7 months ago

I couldn't quickly find it by searching your github, but what layers did you end up targeting for training? Would be interesting to see an ablation on targeting different sets of layers (train only attention layers, freeze the first 30% of the layers and train the remaining 70%, etc).

bradhilton 7 months ago

We trained all the parameters. Those would definitely be interesting ablations. I would also like to see how much of a performance hit we would take with PEFT methods like LoRA.

kcorbitt 8 months ago

One of the authors here. Happy to answer any questions about our methods/results!

malcolmgreaves 7 months ago

Please define an acronym the first time you use it in the body text. I had to scroll about 20% the way through your article just to understand the title.
- bradhilton 7 months ago
  
  We updated the first paragraph to define the acronym. Thanks again for the feedback!
- bradhilton 7 months ago
  
  Great point! Thanks for the feedback.
pama 7 months ago

Can you elaborate on this point:
“ We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples.”
In particular, did you need to change the hyperparameters much, and did this limited recipe show different improvements for the larger vs smaller models? Also, how did you select these 16 examples?
- bradhilton 7 months ago
  
  No meaningful changes to the hyperparameters, just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.
  We only tested this with the 14B model. You can see the run here:
  https://wandb.ai/bradhilton/rl-experiments/runs/062
  Performance peaked after 21 iterations at 45% accuracy instead of the final 59%, but still a significant increase on very few samples.
  
  pama 7 months ago
  
  Thanks.
bydgjohc 8 months ago

Any hypotheses on why the performance dropped suddenly while training?
- bradhilton 7 months ago
  
  Hi, other author here. I think the models converged on shallow/greedy strategies that improved performance up to a point, but are ultimately shortsighted, especially for harder puzzles.
  Something interesting I noticed in the responses was that for shorter puzzles it would make deductions, building up a set additional "clues" for itself, before answering the question. However, for harder puzzles with more clues it would often merely repeat all the given clues and then try to directly answer the questions.
  Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.
  Other ideas to explore include:
  - Distilling responses from stronger models - Encouraging exploration with entropy regularization or reward shaping - Training from base models instead of instruct models, like DeepSeek-R1-Zero
  
  kiratp 7 months ago
  
  Is my understanding here correct? Could this be the reason?
  https://news.ycombinator.com/item?id=43287312
- bradhilton 7 months ago
  
  As for why they dropped suddenly, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance stills drops. It's as if its fate has already been sealed many iterations ago.
- a_wild_dandan 7 months ago
  
  [flagged]
mdp2021 7 months ago

Can I just wholeheartedly congratulate you for having found a critical benchmark to evaluate LLMs. Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.
snovv_crash 7 months ago

Do you have any other logic puzzles you could use to see if the performance generalises?
- kcorbitt 7 months ago
  
  To be honest, I don't expect the performance to generalize to other task types with this specific training regime. If we had a panel of like 30 logic puzzles and cross-trained against all of them simultaneously it might though.
  I think there's a lot of benefit to discovering a training regime that allows small specialized models to do extremely well in one narrow task; if we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks (but is more expensive to run for each of them).
  
  shinryuu 7 months ago
  
  The question to me if you can call that deduction in that case. Isn't it just a type of pattern matching that fits this particular task?
  
  ekidd 7 months ago
  
  Once the problem gets narrow enough, do you risk training a model that reinvents a straightforward classic algorithm at far higher cost?
  
  bradhilton 7 months ago
  
  Well, in this case there is a much more straightforward method with the same CP-SAT solver used to create the puzzles. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.

kiratp 7 months ago

Unless I’m missing something this isn’t online RL. They are collecting outputs in one pass and then doing a separate offline GRPO training run on those.

The results of this paper would indicate doing what they did, but online could return better results

https://arxiv.org/abs/2402.04792

bradhilton 7 months ago

Technically yes, only if you do a gradient step with data sampled from the exact same weights is it an online step.
With our training recipe this can be easily done by accumulating the gradients across the entire batch and only doing one step with optimizer before sampling more responses.
In our experiments, however, we found the advantages of doing multiple gradient steps outweighed any potential drift in policy.
Ultimately the online-ness of data is on a spectrum and while more online data is better, other factors may be more important.
- fc417fc802 7 months ago
  
  > only if you do a gradient step with data sampled from the exact same weights is it an online step.
  Bit pedantic, but amusing thought; wouldn't that imply that asynchronous actor critic is an offline training methodology?
  
  bradhilton 7 months ago
  
  Yes, pedantically, it is! But as I said, everything's on a spectrum. Online-ish data can still work just fine.

bionhoward 7 months ago

This looks impressive but I’m concerned, is it fair to “teach to the test” by fine tuning the Qwen model with RL on the test task, while the other models in the comparison are not fine tuned on the test task?

bradhilton 7 months ago

Yeah, the takeaway shouldn't be "our model is smarter," but that we were able to train weak models to as good or better than the best for this specific task. Depends on what you're doing, but sometimes that is enough.

Liwink 7 months ago

Can you please share the training cost?

bradhilton 7 months ago

We used about 58 hours on 4xH100s and about 19 hours on 8xH100s to get the very best result with the 32B model. We trained for about another 16 hours before finishing the run, but we could have stopped earlier after it was apparent the model was regressing. Actual dollar costs are provider dependent.

behnamoh 7 months ago

this is the same team that a few months ago here on hacker news talked about how to do fine-tuning on large language models, and then made it close source.

machiaweliczny 7 months ago

Would be great if some details given about how exactly model is penalized for staying off-track.

bradhilton 7 months ago

The model is rewarded for accuracy. For each puzzle there are a few multiple choice questions. If it got 1 out of 4 correct, for example, its reward would be 0.25.
Then group relative advantages are calculated. If you have 16 different responses and the average accuracy is 0.5, then you subtract that from each reward and divide by the standard deviation. Say it's also 0.25. Then the advantage for our example would be (0.25 - 0.5) / 0.25 = -1.
The advantages are then used to increase (or decrease) the probability of sampling those tokens again. Since our example was negative, we penalize the model for underperforming with that response.

randomcatuser 7 months ago

Wait, what's the difference between using GRPO and traditional fine-tuning of Qwen using your provided dataset?

Would be super interesting to see which one is more data-efficient!

bradhilton 7 months ago

Great question! So the dataset includes prompts and solutions, but no "gold" answer per se to use for SFT. You could sample responses from larger models and then train the smaller model on their answers, but as outlined in the benchmarks there is still a lot of headroom on this task and I wouldn't expect that to get the same results. At the very least you would probably want to do rejection sampling to discard bad results. It would definitely be a good experiment!

dsffsad 7 months ago

[flagged]