srush a day ago

Full blog is here: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...

Happy to answer any questions about these methods.

  • dinp a day ago

    Great work! When I use models like o1, they work better than sonnet and 4o for tasks that require some thinking, but the output is often very verbose. Is it possible to get the best of both worlds: the thinking still happens, giving better performance, but the output is as straightforward to work with as sonnet's and 4o's? Did you observe similar behaviour with the 1B and 3B models? How does the model's behaviour change when it is used for normal tasks that don't require thinking?

    Also, how well do these models work for extracting structured output? E.g. perform OCR on some handwritten text with math, convert it to HTML, format the formulas correctly, etc. Single-shot prompting doesn't work well on such problems, but splitting the steps into consecutive API calls works well.

    • amitness 10 hours ago

      OpenAI recommends using o1 to generate the verbose plan and then chaining the verbose output into a cheaper model (e.g. gpt-4o-mini) to convert it into structured data, function calls, a summary, etc. They call it the planner-executor pattern. [1]

      [1] https://vimeo.com/showcase/11333741/video/1018737829
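
      Roughly, the chaining looks like the sketch below (the model ids and prompts here are placeholders of my own, not taken from the talk; the calls are the standard OpenAI chat-completions client):

          from openai import OpenAI

          client = OpenAI()

          # Step 1: the "planner" (an o1-style model) produces a verbose, reasoned plan.
          plan = client.chat.completions.create(
              model="o1-mini",  # illustrative model id
              messages=[{"role": "user", "content": "Plan the steps to solve: <task>"}],
          ).choices[0].message.content

          # Step 2: the "executor" (a cheaper model) only reformats the verbose plan
          # into the structure you actually want to work with.
          steps_json = client.chat.completions.create(
              model="gpt-4o-mini",  # illustrative model id
              messages=[{
                  "role": "user",
                  "content": f"Convert this plan into a JSON list of steps, nothing else:\n\n{plan}",
              }],
          ).choices[0].message.content

          print(steps_json)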

    • srush a day ago

      That's a good point. We don't see that in our experiments because they're all in the math domain. However, for OAI it's plausible that training for o1 might conflict with standard instruction tuning, leading to a less human-preferred output style.

    • dimitry12 a day ago

      In this paper and HF's replication, the model used to produce solutions to MATH problems is off-the-shelf. It is induced to produce step-by-step CoT-style solutions by few-shot ICL prompts or by instructions.

      Yes, the search process (beam search or best-of-N) does produce verbose traces, because there is branching involved when sampling "thoughts" from the base model. These branched traces (including incomplete, "abandoned" branches) can be shown to the user or hidden if the approach is deployed as-is.

  • mccoyb a day ago

    In the blog post, learned verifiers are mentioned. Are these learned offline using data, and is the intent to learn a scoring heuristic to help the search?

    • dimitry12 a day ago

      The verifier is trained on soft values of the reward-to-go for each solution prefix, obtained from Monte Carlo rollouts of step-by-step solutions sampled from the "base" model.

      In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you can get multiple continuations from each solution prefix; 3) use the MATH labels to decide whether a full solution (a leaf/terminal node in the MC rollout) gets reward `1` or `0`; 4) roll up these rewards to compute the reward-to-go for each intermediate step.
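
      A minimal sketch of step 4, assuming you already have labelled rollouts (the helper and data layout below are hypothetical, just the bookkeeping): the soft value of a prefix is the fraction of rollouts through it that end with reward `1`.

          from collections import defaultdict

          def prefix_values(traces):
              """traces: list of (steps, reward), where steps is the list of
              "thought" strings sampled from the base model and reward is 1
              if the final answer matches the MATH label, else 0."""
              totals = defaultdict(lambda: [0, 0])  # prefix -> [reward sum, rollout count]
              for steps, reward in traces:
                  for i in range(1, len(steps) + 1):
                      prefix = tuple(steps[:i])
                      totals[prefix][0] += reward
                      totals[prefix][1] += 1
              # soft value = empirical success probability of rollouts from this prefix
              return {prefix: s / n for prefix, (s, n) in totals.items()}

          values = prefix_values([
              (["step A", "step B1", "answer X"], 1),
              (["step A", "step B2", "answer Y"], 0),
          ])
          # values[("step A",)] == 0.5 -> regression target for the verifier at that prefix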

      Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or a full solution (as an outcome verifier).

      In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935

  • OakNinja a day ago

    Excellent and interesting post!

    Minor gripe - the best-of-n | beam search illustration is not compatible with red-green color blindness. I literally cannot see the difference between the Rejected and the Selected dots, even if I zoom in.

    • srush a day ago

      Thanks for the feedback, and not minor. Sorry about that.

mentalgear 2 days ago

Happy to see the term "inference-time compute" being used primarily nowadays - it's a much more precise and appropriate term than the unwieldy "test-time compute" that OpenAI used when they thought they had "invented" scaling inference-time compute.

boroboro4 a day ago

What's the point of such inference-time compute if the verifier is an 8B model itself? Am I missing something?

  • dimitry12 a day ago

    I believe this is a valid point: HF's replication indeed uses a larger off-the-shelf model as the verifier.

    In contrast, in the original paper, the verifier is a fine-tune of the exact same base model that is used to sample step-by-step solutions (the "solver").

    • boroboro4 a day ago

      Using a different 1B model as the verifier makes sense, yes. Using a Llama 8B fine-tune as the verifier, and then comparing the inference-time-scaled 1B against 8B, makes little sense to me.

      Using a 3B model with an 8B verifier against a 70B model would make sense too. That being said, their performance barely crossed the 70B line with 256 samples. This is 256*(8+3)/70 ≈ 40 times more computationally expensive than running the 70B model as is.

      • dimitry12 a day ago

        "1B solver + 8B verifier + search" beating 0-shot 70B is nice, agree.

        "1B solver + 8B verifier + search" beating 1B-0-shot or 1B-majority as baselines isn't illustrative imo. In other words, by using larger verifier, HF's replication fails to establish a "fair" baseline. Still an awesome blog and release/repository from HF's group - I love it!

    • zackangelo a day ago

      Where did you see that? I thought they used an 8B model as their reward model?

      > To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision

bilsbie a day ago

ELI5?

  • srush a day ago

    For problems that require multi-step reasoning, standard LLMs seem to be stuck. The field is increasingly interested in models like o1 that output many "guesses" in order to find the right one. Currently, open source does not know how to do this, but we are reimplementing several possible directions to try. This replicates one important path using search and a verifier model.
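
    In the simplest variant (best-of-N), that looks roughly like the sketch below; `generate` and `verifier_score` are placeholders for whatever sampling and reward-model-scoring calls you have, not the blog's actual code.

        def best_of_n(prompt, generate, verifier_score, n=16):
            # Sample n full solutions from the small model at non-zero temperature,
            # score each with the verifier, and keep the highest-scoring one.
            candidates = [generate(prompt, temperature=0.8) for _ in range(n)]
            return max(candidates, key=verifier_score)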

  • dimitry12 a day ago

    To spend more compute at inference time, at least two simple approaches are readily available:

    1) Make the model output a full solution, step by step, then induce it to revise the solution - repeat this as many times as your token budget allows. You can do this via prompting alone (see Reflexion, for example), or you can fine-tune the model to do it. The paper explores fine-tuning the base model to turn it into a self-revision model.

    2) Sample step-by-step solutions (one "thought" sentence per line) from the model, and do it at non-zero temperature so you can sample multiple candidate next steps. Then use a verifier model to choose between next-step candidates and prefer to continue the rollouts of the more promising branches of "thoughts". There are many, many methods for exploring such a tree once you can score intermediate nodes (beam search is an almost 50-year-old algorithm!).
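
    A rough sketch of option 2 as verifier-guided beam search; `sample_next_steps(prefix, k)` and `score(prefix)` are hypothetical stand-ins for sampling k candidate next steps from the base model and scoring a partial solution with the process verifier.

        def beam_search(prompt, sample_next_steps, score, is_complete,
                        beam_width=4, expand=4, max_steps=10):
            beams = [[prompt]]  # each beam is the prompt plus the thoughts sampled so far
            for _ in range(max_steps):
                candidates = []
                for prefix in beams:
                    if is_complete(prefix):
                        candidates.append(prefix)  # keep finished solutions as they are
                        continue
                    for step in sample_next_steps(prefix, expand):
                        candidates.append(prefix + [step])  # branch on each sampled thought
                # keep only the beam_width most promising (partial) solutions
                beams = sorted(candidates, key=score, reverse=True)[:beam_width]
                if all(is_complete(b) for b in beams):
                    break
            return max(beams, key=score)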

  • loudmax a day ago

    If I understand correctly, Hugging Face is exploring approaches to improving the output quality of a given model by controlling how long you let it run.

    Normally, when you run an LLM, you set your prompt and whatever tunable parameters, and the LLM software (e.g. llama.cpp) spits out tokens at whatever rate it can. If you want higher quality, you run a bigger model (though you're limited by the amount of memory you have available). If you want higher speed, you run a smaller model. Hugging Face seems to be looking at ways to make this trade-off without switching between different models.

    • yeahwhatever10 a day ago

      So I can get LLM results from an SLM if I run it long enough?

      • vlovich123 a day ago

        They show Llama 3.2 1B with chain-of-thought outperforming Llama 3.1 8B, and 3.2 3B outperforming 3.1 70B. It's less clear whether inference is actually faster for CoT 3B using 256x generations vs 70B if you have enough RAM. Basically a classic RAM/compute trade-off.

        • dimitry12 a day ago

          From a practical standpoint, scaling test-time compute does enable datacenter-scale performance on the edge. I cannot feasibly run 70B on my iPhone, but I can run 3B, even if it takes a lot of time to produce a solution comparable to 70B's 0-shot.

          I think it *is* an unlock.

      • beardedwizard a day ago

        I struggle with this idea of "run it long enough", or another description I have heard, "give the model time to think" - it's not a thing; it takes as long as it takes. What I'm taking away from this is two things:

        1. generalizations like 'long enough' and 'think more' apparently exist because the methods are somewhat obscure
        2. those methods are being explored by Hugging Face to make them less obscure

        Am I getting that right? I have been struggling to see past the metaphors and understand exactly what additional computation is being done - and here I read it's something like multiple guesses being fed back in and chosen among, which means it's just multiple inferences in series that are all related to solving one problem.