Fergusonb 6 hours ago

These benchmarks have even the small models absolutely demolishing Sonnet-3.5, which doesn't reflect my subjective experience.

It still seems to me that these models are 'dumb' and often don't understand what I'm asking, whereas Claude's intuition is much stronger.

To me, R1 14B even feels weaker than Qwen 2.5 14B.

Primary use-case is web technology / coding. Maybe I'm prompting it incorrectly?

  • Workaccount2 6 hours ago

    There is a frustrating gap between benchmarks and real world ability.

    O1 or even O3 might be able to crack academic-level math problems, but I still wouldn't trust it to correctly fill out a McDonald's application using a PDF of my resume and a calendar of my availability.

    • pclmulqdq 5 hours ago

      A lot of that has to do with certainty. The GPTs and Claudes will be replacing graduate-level research assistant jobs and other jobs that are very high skill but have soft success criteria long before they replace travel agents, which are low skill but have very hard criteria for success.

  • Havoc 5 hours ago

    The reasoning models are much better suited to questions that have definite answers and a conclusion to arrive at, i.e. exactly what benchmarks ask, rather than "make me a todo list app" or whatever.

    It’s a bit like how you get instruct-tuned models and you get chat-tuned ones. It’s not really that one is worse than the other; they're just aimed at different uses.

buyucu 7 hours ago

OpenAI was caught gaming benchmarks recently with FrontierMath. Just (yet another) sign that benchmarks are very flawed and everyone is training on them.

So I would not put too much weight on how the models are doing on benchmarks.

  • brookst 6 hours ago

    Was OpenAI demonstrated to have cheated, or just that they had access to the benchmarks and it can’t be proven they didn’t cheat? (Which is hard to do in any case).

    Last I saw, FrontierMath said they had a holdout set of problems specifically to ensure investors with access couldn’t cheat[1]. Or did that turn out to be a lie?

    1. https://www.reddit.com/r/singularity/comments/1i4n0r5/this_i...

  • tedsanders 4 hours ago

    Please don’t spread misinformation. OpenAI didn’t cheat on FrontierMath. They have access to the questions, same as with MMLU, MATH, GPQA, ARC-AGI, and pretty much every eval. Sure, we could all be lying, but it would be pretty self-defeating, horrible for employee morale, and quickly discovered.

amelius 7 hours ago

Where can we read some genuine non-cherrypicked conversations with this model?

  • colonial 7 hours ago

    You should be able to run it locally w/ something like Ollama. It's been a while since I tinkered with the local LLM tools, but 1.5B is tiny, so it should run at a decent clip even on just your CPU.

    `ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF:F16` and you're off to the races.
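
    Once the model is pulled, you can also query it programmatically: Ollama exposes a local REST API (on port 11434 by default) while `ollama serve` is running. A minimal sketch in Python, assuming the server is up and the model tag matches the `ollama run` command above:

    ```python
    import json
    import urllib.request

    # Model tag from the `ollama run` command above; assumes it has been pulled.
    MODEL = "hf.co/bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF:F16"

    def build_payload(model: str, prompt: str) -> dict:
        # stream=False asks Ollama for a single JSON response
        # instead of a stream of token-by-token JSON objects.
        return {"model": model, "prompt": prompt, "stream": False}

    def generate(prompt: str) -> str:
        # Assumes the Ollama server is running locally on the default port.
        data = json.dumps(build_payload(MODEL, prompt)).encode("utf-8")
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=data,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    if __name__ == "__main__":
        print(generate("Why is the sky blue?"))
    ```

    On a 1.5B model even CPU-only generation should be tolerable, so this is an easy way to collect your own non-cherrypicked transcripts.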