modeless 14 days ago

Note that this is the English-only leaderboard, which you have to manually select when you visit the site. Also the uncertainty on the score is still high, so it may change. But this is undoubtedly the strongest open weights model of any size so far, by far.

People seem to be saying that Llama 3's safety tuning is much less severe than before, so my speculation is that this is due to reduced refusal of prompts rather than superior knowledge or reasoning, given the eval scores. But still, a real and useful improvement! At this rate, the 400B is practically guaranteed to dominate.

futureshock 14 days ago

People love this benchmark. And it has a solid idea behind it, but it also seems like it would be trivial to game, since it relies on public “blind” rankings. All a model needs for an advantage is to sound more personal and confident, and that will heavily bias most users who are asking generic questions or just having a chat. It'll be a terrible measure of any kind of intelligence beyond sounding human-ish.

It's also good to keep in mind that Elo scores encode the probability of winning, so a 5-point Elo difference essentially means a 49-51 chance of user preference. Not much better than a coin flip between the first 4 models, especially considering the confidence intervals.
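
For reference, the standard Elo expectation formula (the same 400 factor as chess) makes that easy to check; a minimal Python sketch:

  # P(A beats B) given the Elo rating difference, on the standard 400 scale
  def win_probability(elo_diff):
      return 1 / (1 + 10 ** (-elo_diff / 400))

  print(win_probability(5))    # ~0.507, i.e. roughly a 49-51 split
  print(win_probability(100))  # ~0.64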

  • tibbar 14 days ago

    I thought that surely the Elo ratings were on a different scale than the usual chess ones, but nope, they use the same factor of 400: https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc....

    That means we truly do have a long way to go. However, it's also possible that user questions aren't 'difficult' enough to separate the models -- e.g., if the proportion of relatively easy questions is high, a better model has fewer opportunities to distinguish itself.

    • futureshock 13 days ago

      Totally. And it's interesting to consider what Elo a human would have, if there were some way to account for the extra latency without unblinding the test. I suspect a human control would not rank much higher. For the most part, top LLMs already sound human if you prompt or fine-tune them to emulate a conversational style. A determined Turing tester could tell the difference most of the time, I believe, but most people submitting arena ratings are not determined.

      • modeless 13 days ago

        It would be interesting to have a couple of "time-limited human" entrants. 10 second human, 1 minute human, 10 minute human. See where they fall on the leaderboard. You could even have the people doing the voting randomly assigned to be the human sometimes.

        You'd know that one of the entries was a human because of the latency, but you wouldn't know which one (obviously the chatbot response would be delayed to match).

        I guess the problem is people would try to game the system by leaving tells in their responses so you'd know a human wrote them. You could block the obvious ones, but that would create a meta-game of circumventing the block.

        • dns_snek 13 days ago

          I think anyone who's spent any amount of time interacting with LLMs would easily tell the human response apart because LLMs have a distinguishable style of writing that's mostly consistent across interactions. Gemini, for example, loves bullet points and splits its responses into multiple distinct sections with subheadings.

          Furthermore, you'd have to limit the scope of questioning to topics that the human on the other side is familiar with, otherwise they won't be able to come up with an answer to questions that aren't common knowledge, while an LLM will always happily generate pages of plausible-sounding (even if incorrect) text.

          I think a more appropriate benchmark would be subject matter experts interviewing an LLM and another subject matter expert, with clear guidelines regarding the style of writing they're expected to match.

    • tosh 13 days ago

      this.

davej 14 days ago

Llama 3 is tuned very nicely for English answers. What is most surprising to me is that the 8B model is performing similarly to Mistral's large model and the original GPT-4 model (in English answers). Easily the most efficient model currently available.

  • swalsh 14 days ago

    Parameter count seems to matter mainly for breadth of skills; these smaller models can be tuned to be more than competitive with far larger ones.

    I suspect the future is going to be owned by lots of smaller more specific models, possibly trained by much larger models.

    These smaller models have the advantage of faster and cheaper inference.

    • theLiminator 14 days ago

      Probably why MoE models are so competitive now. Basically that idea within a single model.

      • CuriouslyC 13 days ago

        I don't think MoE is the way forward. The bottleneck is memory, and MoE trades MORE memory consumption for lower inference times at a given performance level.

        Before too long we're going to see architectures where a model decomposes a prompt into a DAG of LLM calls based on expertise, fans out sub-prompts, then reconstitutes the answer from the embeddings they return.
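
        Purely as illustration, a toy sketch of that fan-out/fan-in orchestration (every name here is hypothetical, and plain strings stand in for the real LLM calls and embeddings):

          from concurrent.futures import ThreadPoolExecutor

          def decompose(prompt):
              # In a real system this would itself be an LLM call that plans
              # a DAG of sub-tasks routed to different "experts".
              return [f"[math expert] {prompt}", f"[code expert] {prompt}"]

          def call_expert(sub_prompt):
              # Placeholder for a call to a smaller, specialized model.
              return f"partial answer for: {sub_prompt}"

          def answer(prompt):
              with ThreadPoolExecutor() as pool:
                  partials = list(pool.map(call_expert, decompose(prompt)))
              # A final pass would reconstitute one answer from the partial results.
              return " / ".join(partials)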

refulgentis 14 days ago

I'm very surprised to see this, considering the models aren't chat-tuned and shouldn't be used for chat, at least according to the disclaimer Meta provided.

I wish they didn't include the disclaimer just for the sake of having one. They provided a chat format, so they must have done _some_ chat training, and we can see here it's good enough at chat that it doesn't act anything like a pure base LLM.
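
If I remember right, the chat format in question looks roughly like this (shown as a Python prompt string; check Meta's model card for the exact template):

  prompt = (
      "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
      "You are a helpful assistant.<|eot_id|>"
      "<|start_header_id|>user<|end_header_id|>\n\n"
      "Hello!<|eot_id|>"
      "<|start_header_id|>assistant<|end_header_id|>\n\n"
  )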

Labelling it with a disclaimer it doesn't need causes general confusion, as in the Phi-2 case: that model had no chat tuning but great scores, so people couldn't understand "why it didn't work".

vitorgrs 13 days ago

Only in English. At least in Brazilian Portuguese, Llama 3 is awful. You literally ask in Portuguese and it answers in English, etc.

It's very much an English-first model, which is totally different from even GPT-3.5, which works amazingly well in Brazilian Portuguese.

londons_explore 14 days ago

The lmsys leaderboard has 700k votes.

Who did all those votes? I did like 5 of them...

I'm slightly suspicious someone might be gaming the votes, because billions of dollars of company valuations ride on who has the best AI...

  • cjbprime 13 days ago

    How would they game them? As I understood it, this leaderboard is one where you vote on which output is best without knowing which LLM prepared it.

  • littlestymaar 14 days ago

    LLMs are not Facebook's core business at all, and it actually shows in their valuation: if you look at Meta's valuation over the past few days, markets don't care about Llama 3's release or even Meta AI.

  • londons_explore 13 days ago

    It gives access to paid models for free...

    I bet a bunch of people use it for actual life tasks.

    • jlpom 13 days ago

      But not in the arena, which is what's used for the ranking, when they can't be sure the result will be of good quality.

  • orbital-decay 14 days ago

    I throw about a dozen canned prompts at it each day if I have time, mainly puzzles, knowledge and coding tests. I rewrite those from first principles occasionally.

  • zooq_ai 14 days ago

    It's mostly gamed by open source zealots, who have plenty of time on their hands to spam "OSS, BEST" all over the internet

    • refulgentis 14 days ago

      This is incorrect.

      This is the first time there's been a non-private model at the top, and I'd wager the first time in the top 3.

      This isn't OSS either.

      Not sure where the jaundiced view comes from. I generally believe the free models tend to be overrated, but I don't think it's a good idea to pretend there's widespread gaming of this. There's a strong self-peasantization streak in LLM discussions, and truth is, no one thing may be "correct", but we can't exclude all data without making the very act of discussion meaningless.

      • zooq_ai 14 days ago

        You have to be seriously delusional to think that a 70B model will beat Claude Opus. But go ahead and use it to prove my point.

        I know the power of zealotry and cults in the age of the internet, and this is just one example of that.

        • refulgentis 14 days ago

          I'm at the front of the line: I maintain an app with integrations for all 4 major providers and llama.cpp with 4 built-in models. Yesterday I was one of the first, if not the first, to get it integrated and to flag to the model makers that they had the wrong chat template embedded.

          That's a lengthy way of saying, contra what you imply, I have practical experience & understand this stuff intimately.

          I agree with your opinion that it's unlikely Llama 3 70B beats all the private models, but your way of relaying it, by claiming there's widespread gaming of the leaderboard by advocates for open models, is obviously incorrect and adds confusion rather than reducing it.

          • zooq_ai 14 days ago

            Why do you take it personally? Just because you are genuinely interested doesn't mean there isn't an army of people sitting there with the sole purpose of gaming leaderboards.

            If you think that doesn't happen, I have a bridge to sell you

            • jantissler 13 days ago

              Then explain to us how they game it. The test is blind. You don't know which model gives the answer. If you trick a model into disclosing its name, the result is discarded right away and doesn't count towards the score.

              Genuine question, because as far as I know there is no feasible way to game this score. But if there is, I want to know.

              • extraduder_ire 13 days ago

                > If you trick them into disclosing their name, it is discarded right away and doesn't count towards the score.

                However that filtering process works, it would be the thing to circumvent, e.g. by getting the model to disclose some letters of its name, or any value that differs between models but stays consistent within each one. I assume it doesn't provide the numerical token values in the output, or that would be trivial.

            • refulgentis 14 days ago

              I'm not taking it personally: frankly, I'm not sure what that would mean in this context. You asked me to try it, I explained that I did, and provided context.

              Thank you for clarifying: to confirm, yes, I do understand your claim is that a vast cabal of OSS zealots rigs LMSys's blind A/B testing.

              I don't think there's anything more of value I can contribute to a discussion on that topic.

              Have a good weekend!

              • zooq_ai 14 days ago

                I'm a big fan of Elon/Tesla, but I also know the Elon/Tesla fanbase (the vast majority of it) is a cult. See? It's easy to hold those seemingly 'conflicting' thoughts.

        • free_bip 13 days ago

          It's not beating Claude Opus. It's rank #5, tied with Bard and Claude Sonnet.

          • zooq_ai 13 days ago

            It was beating it at the time of posting, which kind of proves my point: it is gamed. A statistically random sample of users would have yielded the same results from the start.

            https://twitter.com/Teknium1/status/1781328542367883765/phot...

            • free_bip 13 days ago

              How does that "prove" it's gamed? If anything it proves that the system does exactly what it's supposed to do: correct itself over time using more and more samples from users.

hackerlight 14 days ago

Confidence intervals too wide to draw that conclusion.

  • oersted 13 days ago

    Not really; right now it's within +14/-15 Elo with 95% confidence, which barely changes the ranking. It's well ahead of Gemini Pro and squarely within the top 5 alongside Claude 3 Opus and a few versions of GPT-4-Turbo.

    Even more impressively, Llama 3 8B is approximately tied with GPT-4 (depending on the version), as well as Mistral-Large and Mixtral 8x22B, which is mad for the size.

oersted 13 days ago

The ranking has changed significantly after 6 hours: Llama 3 70B Instruct (Elo 1198) is now behind even Claude 3 Sonnet (Elo 1202) and well behind the latest GPT-4-Turbo (Elo 1259), but still above the previous GPT-4 (Elo 1189).

EDIT: My bad, I was looking at the Overall leaderboard instead of the English one as other commenters. Still quite impressive.

arnaudsm 14 days ago

Title ignores the low statistical significance. It lost 23 points since the screenshot was taken.

  • modeless 13 days ago

    No, you're just looking at the multi-language results instead of the English results. Llama 3 isn't as good in other languages.

random_i 13 days ago

I asked the console

"How can I visualize a decision tree trained using PySpark?"

and it very confidently gave me 4 detailed, step-by-step answers (including code!) that are utterly false and useless. :)

tmaly 14 days ago

Comments on the X post point out that the 8B model is pretty good.

How much VRAM would you need to run 8B locally?

  • CobaltFire 13 days ago

    My understanding is that as a general rule:

    ( Q / 8 ) * B = GB RAM, where Q is the Quant level and B is the model size.

    So a Q4 7B model is ( 4 / 8 ) * 7 = 3.5GB RAM (or VRAM).

    An unquantized (fp16) model is Q = 16, so 2 * B.

    I believe this is before context, which adds a bit.
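
    As a quick sanity check, here's that rule of thumb in code form (rough numbers only, before the context/KV-cache overhead mentioned above):

      def estimate_ram_gb(params_billions, quant_bits=4):
          # bytes per weight (bits / 8) times billions of parameters ≈ GB
          return (quant_bits / 8) * params_billions

      print(estimate_ram_gb(8, 4))    # Q4 8B   -> ~4 GB
      print(estimate_ram_gb(8, 16))   # fp16 8B -> ~16 GB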

    • tmaly 11 days ago

      based on your formulas, I am definitely going to need a larger card.

  • mertbio 14 days ago

    I can run the 8B with my M2 Pro Mac mini with 32GB via Ollama. It takes around one second to get a response from the API.

  • zamalek 14 days ago

    I have enough VRAM on my 6900XT (16GB) for 13B quantized models alongside desktop/hyprland usage.

  • tosh 13 days ago

    ~8GB RAM for a quantized version of 8B

zone411 14 days ago

Once again, I must ask everyone not to place too much emphasis on this benchmark. Another post I see right now on the HN homepage is this: https://www.tbray.org/ongoing/When/202x/2024/04/18/Meta-AI-o.... As it points out, Llama 3 gave a plausible, smart-sounding answer and people would rate it highly on the LMSYS leaderboard, yet it might be totally incorrect. It's best to think of the LMSYS ranking as something akin to the Turing Test, with all its flaws.

That said, all other benchmarks so far (including my NYT Connections benchmark) show that both Llama 3 models are exceptionally strong for their sizes.