Note that this is the English-only leaderboard, which you have to manually select when you visit the site. Also the uncertainty on the score is still high, so it may change. But this is undoubtedly the strongest open weights model of any size so far, by far.
People seem to be saying that Llama 3's safety tuning is much less severe than before so my speculation is that this is due to reduced refusal of prompts rather than superior knowledge or reasoning, given the eval scores. But still, a real and useful improvement! At this rate, the 400B is practically guaranteed to dominate.
People love this benchmark. And it has a solid idea behind it, but it also seems like it would be trivial to game, since it relies on public “blind” rankings. All a model needs for an advantage is to sound more personable and confident, and that’ll heavily bias most users asking generic questions or having a chat. It’ll be a terrible measure for any kind of intelligence aside from sounding human-ish.
It’s also good to keep in mind that Elo scores map to a probability of winning, so a 5-point Elo difference still means essentially a 49-51 chance of user preference. Not much better than a coin flip between the first 4 models, especially considering the confidence intervals.
I thought that surely the Elo ratings were on a different scale than the usual chess ones, but nope, it uses the same factor of 400: https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc....
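For reference, here's a quick sketch of the standard Elo expected-score formula with that factor of 400 (ratings here are made up for illustration):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model (scale factor 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 5-point gap really is barely better than a coin flip:
print(round(elo_win_probability(1205, 1200), 3))  # 0.507
```

So a 5-point gap works out to roughly 50.7% vs 49.3%, consistent with the "49-51" framing above.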
That means we truly do have a long way to go. However, it's also possible that user questions don't have enough 'difficulty' to distinguish -- eg, if the proportion of relatively easy user questions is high, then there are fewer opportunities for a better model to distinguish itself.
Totally. And it’s interesting to consider what Elo a human would have, if there were some way to account for the extra latency without unblinding the test. I suspect a human control would not rank a lot higher. For the most part, top LLMs already sound human if you prompt or fine-tune them to emulate conversational style. A determined Turing tester could tell the difference most of the time, I believe, but most people submitting arena ratings are not determined.
It would be interesting to have a couple of "time-limited human" entrants. 10 second human, 1 minute human, 10 minute human. See where they fall on the leaderboard. You could even have the people doing the voting randomly assigned to be the human sometimes.
You'd know that one of the entries was a human because of the latency, but you wouldn't know which one (obviously the chatbot response would be delayed to match).
I guess the problem is people would try to game the system by leaving tells in their response so you'd know that a human wrote it. You could block obvious ones but that would create a meta game of circumventing the block.
I think anyone who's spent any amount of time interacting with LLMs would easily tell the human response apart because LLMs have a distinguishable style of writing that's mostly consistent across interactions. Gemini, for example, loves bullet points and splits its responses into multiple distinct sections with subheadings.
Furthermore, you'd have to limit the scope of questioning to topics that the human on the other side is familiar with, otherwise they won't be able to come up with an answer to questions that aren't common knowledge, while an LLM will always happily generate pages of plausible-sounding (even if incorrect) text.
I think a more appropriate benchmark would be subject matter experts interviewing an LLM and another subject matter expert, with clear guidelines regarding the style of writing they're expected to match.
this.
Llama 3 is tuned very nicely for English answers. What's most surprising to me is that the 8B model performs similarly to Mistral's large model and the original GPT-4 model (in English answers). Easily the most efficient model currently available.
Parameter count seems to only matter for range of skills, but these smaller models can be tuned to be more than competitive with far larger models.
I suspect the future is going to be owned by lots of smaller more specific models, possibly trained by much larger models.
These smaller models have the advantage of faster and cheaper inference.
Probably why MoE models are so competitive now. Basically that idea within a single model.
I don't think MoE is the way forward. The bottleneck is memory, and MoE trades MORE memory consumption for lower inference times at a given performance level.
Before too long we're going to see architectures where a model decomposes a prompt into a DAG of LLM calls based on expertise, fans out sub-prompts then reconstitutes the answer from the embeddings they return.
Please, what is an MoE model?
https://huggingface.co/blog/moe
Mixture of Experts. A popular example is Mixtral.
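For anyone curious, here's a toy numpy sketch of the core idea, top-k gating over expert sub-networks. Real MoE transformers like Mixtral route per token inside each layer with learned gates, but the gist is the same; all names and shapes here are made up for illustration:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy mixture-of-experts forward pass: score all experts with a gate,
    run only the top-k, and mix their outputs by normalized gate weights."""
    scores = x @ gate_w                    # one gate logit per expert
    top = np.argsort(scores)[-top_k:]      # indices of the top-k experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over just the selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
num_experts, dim = 8, 4
gate_w = rng.normal(size=(dim, num_experts))
# each "expert" is just a random linear map for illustration
experts = [lambda v, m=rng.normal(size=(dim, dim)): v @ m for _ in range(num_experts)]

x = rng.normal(size=dim)
print(moe_forward(x, gate_w, experts).shape)  # (4,)
```

The memory/compute trade-off mentioned elsewhere in the thread falls out of this: all experts' weights must be resident, but only top_k of them do work per input.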
I'm very surprised to see this considering the models aren't chat-tuned and shouldn't be used for chat, at least, by the disclaimer Meta provided.
I wish they hadn't included the disclaimer just for the sake of having one. They provided a chat format, so they must have done _some_ chat training, and we can see here it's good enough at chat not to act like a pure base LLM, or even close.
Labelling it with a disclaimer it doesn't need causes general confusion, as in scenarios like Phi-2, which had no chat tuning but great scores, so people couldn't understand "why it didn't work".
Only in English. At least in Brazilian Portuguese, Llama 3 is awful. You literally ask in Portuguese and it answers in English, etc.
It's very much an English-first model, which is totally different from even GPT-3.5, which works amazingly well in Brazilian Portuguese.
The lmsys leaderboard has 700k votes.
Who did all those votes? I did like 5 of them...
I'm slightly suspicious someone might be gaming the votes, because billions of dollars of company valuations ride on who has the best AI...
How would they game them? As I understood it, this leaderboard is one where you vote on which output is best without knowing which LLM prepared it.
LLMs are not Facebook's core business at all, and it actually shows in their valuation: if you look at Meta's valuation over the past few days, markets don't care about Llama 3's release or even Meta AI.
It gives access to paid models for free...
I bet a bunch of people use it for actual life tasks.
But not when they aren't sure the result will be of quality, i.e. in the arena, which is what's used for ranking.
I throw about a dozen canned prompts at it each day if I have time, mainly puzzles, knowledge and coding tests. I rewrite those from first principles occasionally.
It's mostly gamed by open source zealots, who have plenty of time on their hands to post "OSS, BEST" all over the internet.
This is incorrect.
This is the first time there's been a non-private model at the top, and I'd wager top 3.
This isn't OSS either.
Not sure where the jaundiced view comes from. I generally believe the free models tend to be overrated, but I don't think it's a good idea to pretend there's widespread gaming of this. There's a strong self-peasantization streak in LLM discussions, and truth is, no one thing may be "correct", but we can't exclude all data without making the very act of discussion meaningless.
You have to be seriously delusional to think that a 70B model will beat Claude Opus. But go ahead and use it to prove my point.
I know the power of zealotry and cults in the age of the internet, and this is just one example of that.
I'm at the front of the line, I maintain an app with integrations with all 4 major providers and llama.cpp with 4 built-in models. Yesterday, I was one of the, if not the, first to get it integrated and flag to model makers they had the wrong chat template embedded.
That's a lengthy way of saying, contra what you imply, I have practical experience & understand this stuff intimately.
I agree with your opinion re: it's unlikely LLaMA 3 70B beats all private models, but think your way of relaying it, claiming there's widespread gaming of it by advocates for open models, is obviously incorrect, and adds more confusion rather than reducing it.
Why do you take it personally? Just because you are genuinely interested doesn't mean there isn't an army of people sitting there with the sole purpose of gaming leaderboards.
If you think that doesn't happen, I have a bridge to sell you
Then explain to us how they game it. The test is blind. You don't know which model gives the answer. If you trick them into disclosing their name, it is discarded right away and doesn't count towards the score.
Genuine question, because as far as I know there is no feasible way to game this score. But if there is, I want to know.
> If you trick them into disclosing their name, it is discarded right away and doesn't count towards the score.
However that process works would be the thing to circumvent, e.g. by getting the model to disclose a few letters of its name, or any value that differs between models but stays consistent within each one. I assume it doesn't expose the numerical token values in the output, or it would be trivial.
I'm not taking it personally: frankly, I'm not sure what that would mean in this context. You asked me to try it, I explained that I did, and provided context.
Thank you for clarifying. To confirm: yes, I do understand that your claim is that a vast cabal of OSS zealots rigs LMSYS's blind A/B testing.
I don't think there's anything more of value I can contribute to a discussion on that topic.
Have a good weekend!
I'm a big fan of Elon/Tesla, but I also know the Elon/Tesla fanbase (the vast majority of it) is a cult. See? It's easy to hold those seemingly 'conflicting' thoughts.
It's not beating Claude Opus. It's rank #5, tied with Bard and Claude Sonnet.
It was at the time of posting, which kind of proves my point. It is gamed. A statistically random sample of users would not have yielded those results.
https://twitter.com/Teknium1/status/1781328542367883765/phot...
How does that "prove" it's gamed? If anything it proves that the system does exactly what it's supposed to do: correct itself over time using more and more samples from users.
Confidence intervals too wide to draw that conclusion.
Not really. Right now it's within +14/-15 Elo with 95% probability, which barely changes the ranking. It's well ahead of Gemini Pro and squarely within the top 5 with Claude 3 Opus and a few versions of GPT-4-Turbo.
Even more impressively, Llama 3 8B is approximately tied with GPT-4 (depending on the version), as well as Mistral-Large and Mixtral 8x22B, which is mad for the size.
The ranking has changed significantly after 6 hours: Llama 3 70B Instruct (Elo 1198) is now behind even Claude 3 Sonnet (Elo 1202) and well behind the latest GPT-4-Turbo (Elo 1259), but still above the last GPT-4 (Elo 1189).
EDIT: My bad, I was looking at the Overall leaderboard instead of the English one as other commenters. Still quite impressive.
Title ignores the low statistical significance. It lost 23 points since the screenshot was taken.
No, you're just looking at the multi-language results instead of the English results. Llama 3 isn't as good in other languages.
I asked the console
"How can I visualize a decision tree trained using PySpark?"
and it very confidently gave me 4 detailed, step-by-step answers (including code!) that are utterly false and useless. :)
Comments on the X post point out that the 8B model is pretty good.
What type of VRAM would you need to run 8B locally?
My understanding is that as a general rule:
(Q / 8) * B = GB of RAM, where Q is the quantization bit-width and B is the parameter count in billions.
So a Q4 7B model is (4 / 8) * 7 = 3.5GB of RAM (or VRAM).
A non-quantized model is 16-bit, so 2 * B.
I believe this is before context, which adds a bit more.
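The rule of thumb above as a quick sketch (weights only; KV-cache/context overhead comes on top):

```python
def estimate_weight_ram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: (bits / 8) * params-in-billions ≈ GB."""
    return (bits_per_weight / 8) * params_billions

print(estimate_weight_ram_gb(7, 4))    # Q4 7B   -> 3.5
print(estimate_weight_ram_gb(8, 16))   # fp16 8B -> 16.0
print(estimate_weight_ram_gb(70, 4))   # Q4 70B  -> 35.0
```

Real q4 files tend to run somewhat larger than this estimate (e.g. ~40GB for the 70B), since "Q4" quant formats keep some tensors at higher precision.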
Definitely does not seem too far off:
https://ollama.com/library/llama3/tags
Llama3 8B fp16 is 16GB, q8 is 8.5GB and q4 is between 4.3 and 4.9GB.
Llama3 70B fp16 is 141GB, q8 75GB and q4 is between 40 and 43GB.
Based on your formulas, I am definitely going to need a larger card.
I run the 8B on my MacBook Air M2 with 16GB of unified memory at a decent speed of 12 t/s (MLX can probably get 15 t/s).
I use the Q5_K_M GGUF, which is >99% the same as the original.
I've seen tests showing close to no divergence at these quantisation levels, but it rises steeply going lower:
https://www.reddit.com/r/LocalLLaMA/comments/1816h1x/how_muc...
I can run the 8B with my M2 Pro Mac mini with 32GB via Ollama. It takes around one second to get a response from the API.
I have enough VRAM on my 6900XT (16GB) for 13B quantized models alongside desktop/hyprland usage.
~8GB RAM for a quantized version of 8B
Once again, I must ask everyone not to place too much emphasis on this benchmark. Another post I see right now on the HN homepage is this: https://www.tbray.org/ongoing/When/202x/2024/04/18/Meta-AI-o.... As it points out, Llama 3 gave a plausible, smart-sounding answer and people would rate it highly on the LMSYS leaderboard, yet it might be totally incorrect. It's best to think of the LMSYS ranking as something akin to the Turing Test, with all its flaws.
That said, all other benchmarks so far (including my NYT Connections benchmark) show that both Llama 3 models are exceptionally strong for their sizes.