Opus 4.7 vs. 4.6: a side-by-side after 3 days of real coding, from my own session data

1 point by agentseal 5 minutes ago

I spent some time today comparing Opus 4.6 and 4.7 using my own usage data to see how they actually behave side by side.

Still pretty early for 4.7, but a few things surprised me.

In my sessions, 4.7 gets things right on the first try less often than 4.6. One-shot rate sits around 74.5% vs 83.8%, and I am seeing roughly double the retries per edit (0.46 vs 0.22).

It also produces a lot more output per call, about 800 tokens vs 372 on 4.6, which makes it noticeably more expensive. Cost per call is $0.185 vs $0.112.
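For anyone who wants to sanity-check the relative differences, here is a minimal sketch that recomputes them from the figures quoted above. The dictionary keys and the `ratio` helper are names made up for illustration; they are not part of any tool's output.

```python
# Figures quoted in the post, stored as (4.7, 4.6) pairs.
# Key names are hypothetical labels chosen for this sketch.
metrics = {
    "one_shot_rate_pct": (74.5, 83.8),
    "retries_per_edit": (0.46, 0.22),
    "output_tokens_per_call": (800, 372),  # the 4.7 figure is "about 800"
    "cost_per_call_usd": (0.185, 0.112),
}


def ratio(new: float, old: float) -> float:
    """Return how many times larger the 4.7 figure is than the 4.6 figure."""
    return new / old


ratios = {name: ratio(new, old) for name, (new, old) in metrics.items()}

# Rates are easier to compare as a difference in percentage points.
one_shot_gap_points = (
    metrics["one_shot_rate_pct"][0] - metrics["one_shot_rate_pct"][1]
)

for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
print(f"one-shot gap: {one_shot_gap_points:.1f} percentage points")
```

On these numbers, output per call is a bit more than 2x while cost per call is roughly 1.65x, so output length is probably not the only thing driving cost.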

When I broke it down by task type, coding and debugging both looked weaker on 4.7. Coding one-shot dropped from 84.7% to 75.4%, debugging from 85.3% to 76.5%. Feature work was slightly better on 4.7 (75% vs 71.4%), but the sample is small. Delegation showed a big gap (100% vs 33.3%), though that one only has 3 samples on the 4.7 side, so I wouldn't read much into it yet.
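To put a number on how little 3 samples can say, here is a rough sketch using a Wilson score interval, which is one common way to get a confidence range for a success rate on a small sample. It is a swapped-in technique for illustration, not something the comparison tool is known to compute. The function name `wilson_interval` and the choice of a 95% level (z = 1.96) are assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a success rate (Wilson score).

    z = 1.96 corresponds to roughly 95% confidence (an assumption here).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# With n = 3, a 33.3% rate means 1 of 3 and a 100% rate means 3 of 3.
low_1, high_1 = wilson_interval(1, 3)
low_3, high_3 = wilson_interval(3, 3)

print(f"1 of 3: {low_1:.2f} to {high_1:.2f}")
print(f"3 of 3: {low_3:.2f} to {high_3:.2f}")
```

Whichever of the two delegation rates belongs to the 3-sample side, the interval covers a large part of the 0 to 1 range, which matches the caution about not reading much into it.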

4.7 also uses fewer tools per turn (1.83 vs 2.77) and barely delegates to subagents (0.6% vs 3.1%). Not sure yet if that's a style difference or just the smaller sample.

A couple of caveats. This is about 3 days of 4.7 data (3,592 calls) vs 8 days of 4.6 (8,020 calls). Some categories only have a handful of examples. These numbers will shift with more usage, and your results will probably look different depending on what kind of work you do.

npx codeburn compare