bhouston 1 year ago

A bit better at coding than GPT-4o, but not better than o3-mini. There is a chart near the bottom of the page that is easy to overlook:

- GPT-4.5 on SWE-bench Verified: 38.0%

- GPT-4o on SWE-bench Verified: 30.7%

- OpenAI o3-mini on SWE-bench Verified: 61.0%

BTW, Anthropic's Claude 3.7 is better than o3-mini at coding, scoring around 62-70% on the same benchmark [1]. So I'll stick with Claude 3.7 for the time being for my open-source alternative to Claude Code: https://github.com/drivecore/mycoder

[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...

pawelduda 1 year ago

Does the benchmark reflect your opinion on 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard the standalone model works fine, though I haven't had a chance to try it yet.

  • jasonjmcghee 1 year ago

    Personal anecdote: Claude Code is the best LLM devx I've had.

_cs2017_ 1 year ago

I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.

If Claude 3.7 achieves 70.3%, that's quite impressive; it's not far from the 71.7% claimed by o3, at (presumably) much, much lower cost.

  • aoeusnth1 1 year ago

    I doubt o3's costs will be lower for that performance. They juice their benchmark results by letting it spend $100k in thinking tokens.

logicchains 1 year ago

> BTW, Anthropic's Claude 3.7 is better than o3-mini at coding, scoring around 62-70% on the same benchmark [1]. So I'll stick with Claude 3.7 for the time being for my open-source alternative to Claude Code

That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.
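
For a rough sense of the gap, here is a back-of-the-envelope sketch. The per-million-token prices are assumed list prices around the time of this thread (roughly $3/$15 for Claude 3.7 Sonnet and $1.10/$4.40 for o3-mini) and may be out of date, and the session size is made up:

```python
# Back-of-the-envelope API cost comparison for one heavy coding session.
# Prices are assumed list prices in USD per million tokens; check current pricing.
PRICES = {
    "claude-3.7-sonnet": {"input": 3.00, "output": 15.00},
    "o3-mini":           {"input": 1.10, "output": 4.40},
}

# A hypothetical agentic session: lots of context re-sent on every tool call.
input_tokens, output_tokens = 2_000_000, 150_000

for model, p in PRICES.items():
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    print(f"{model}: ${cost:.2f}")
# claude-3.7-sonnet: $8.25
# o3-mini: $2.86
```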

  • cheema33 1 year ago

    > That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...

    I use it via the Cursor editor's built-in support for Claude 3.7, which caps the monthly expense at $20. There is probably a limit on these Claude queries, but I haven't run into it yet, and I am a heavy user.

    • bhouston 1 year ago

      Agentic coders (e.g. aider, Claude Code, mycoder, codebuff) use a lot more tokens, but they write whole features for you and debug your code.
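
      For anyone curious why the token counts balloon, here is a minimal sketch of the kind of loop these tools run: every tool call round-trips the ever-growing conversation. It assumes the Anthropic Python SDK and its tool-use API; the single run_shell tool, the prompt, and the model ID are illustrative, not how mycoder or Claude Code actually work.

      ```python
      # Minimal agentic coding loop sketch (assumptions noted above).
      import subprocess
      import anthropic

      client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

      tools = [{
          "name": "run_shell",
          "description": "Run a shell command in the repo and return its output.",
          "input_schema": {
              "type": "object",
              "properties": {"command": {"type": "string"}},
              "required": ["command"],
          },
      }]

      messages = [{"role": "user",
                   "content": "Add a --verbose flag to cli.py and make the tests pass."}]

      while True:
          resp = client.messages.create(
              model="claude-3-7-sonnet-20250219",  # illustrative model ID
              max_tokens=4096,
              tools=tools,
              messages=messages,  # the whole history is re-sent every turn
          )
          messages.append({"role": "assistant", "content": resp.content})
          if resp.stop_reason != "tool_use":
              break  # the model is done; no more tool calls requested
          # Execute each requested tool call and feed the output back in.
          results = []
          for block in resp.content:
              if block.type == "tool_use":
                  out = subprocess.run(block.input["command"], shell=True,
                                       capture_output=True, text=True)
                  results.append({"type": "tool_result", "tool_use_id": block.id,
                                  "content": (out.stdout + out.stderr)[:10000]})
          messages.append({"role": "user", "content": results})
      ```

      Each iteration re-sends the full transcript plus tool output, which is why a single feature can consume millions of input tokens even when the final diff is small.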

  • QuadmasterXLII 1 year ago

    If OpenAI offers a more expensive model (4.5) and a cheaper model (o3-mini) and both are worse, it starts to be a fair comparison.

ehsanu1 1 year ago

It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.

  • Topfi 1 year ago

    To put that in context, Claude 3.5 Sonnet (new), a model we have had for months and which by all accounts was cheaper to train and is cheaper to use, is still ahead of GPT-4.5 on SWE-Lancer Diamond at 36.1% vs 32.6% [0]. The more I look into this release, the more confused I get.

    [0] https://arxiv.org/pdf/2502.12115