My main goal with that benchmark is to see if it can produce HTML and JavaScript code that runs without errors for a moderately complex challenge.
It's not a comprehensive benchmark - there are plenty of ways you could run it that would be far more informative and robust.
It's great as a quick, single-sentence prompt for getting a feel for whether or not the model can produce working JavaScript.
Not really - the other commenters are correct, I feel, and this isn't really proving anything about the fundamental capability of the model. It's just a hello-world benchmark that adds no real value, just driving blog traffic for you.
The space invaders benchmark proves that the model can implement a working HTML and JavaScript game from a single prompt. That's a pretty fundamental capability for a model.
Comparing them between models is also kind of interesting, even if it's not a flawlessly robust comparison: https://simonwillison.net/tags/space-invaders/
Implement or retrieve? That's an important distinction. When evaluating models you run a variety of tests, and the benchmarks that aren't publicly disclosed are the most reliable. Your Space Invaders game isn't really a benchmark of anything - just Google it and you'll find plenty of implementations.
I see that criticism a lot - that benchmarks like space invaders don't make sense because they're inevitably in the training data - and I don't buy that at all.
Firstly, a 12GB model file isn't big enough to hold a copy of everything it was trained on and just regurgitate it back out again.
You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...
Illustrative extract:
> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.
> When direction changes, move all aliens down by step (e.g., 10 px).
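That's the model talking its way towards the standard marching-aliens loop. For anyone who hasn't written one of these, here's a minimal sketch of the logic that trace is describing - the variable names, the 800px canvas and the 10px drop are my own illustrative choices, not the model's actual output:

```javascript
// Sketch of the alien movement logic described in the reasoning trace above.
// All names and constants are illustrative, not taken from the model's code.
const CANVAS_WIDTH = 800;
const ALIEN_WIDTH = 32;
const DROP_STEP = 10;      // "move all aliens down by step (e.g., 10 px)"

let direction = 1;         // 1 = moving right, -1 = moving left
let speed = 2;             // horizontal pixels per frame

function updateAliens(aliens) {
  // Edge detection: leftmost/rightmost alien position vs. canvas width minus alien width.
  const leftmost = Math.min(...aliens.map(a => a.x));
  const rightmost = Math.max(...aliens.map(a => a.x));
  const hitEdge =
    (direction === 1 && rightmost + speed > CANVAS_WIDTH - ALIEN_WIDTH) ||
    (direction === -1 && leftmost - speed < 0);

  if (hitEdge) {
    // When direction changes, move the whole grid down one step.
    direction = -direction;
    aliens.forEach(a => { a.y += DROP_STEP; });
  } else {
    aliens.forEach(a => { a.x += direction * speed; });
  }
}
```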
The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge).
12.8 GB is around 110 Gbits. Even at 4.25 bits/weight the network stores ~26 billion "micro weights". A 1.4k-token Space Invaders snippet occupies ~1.1 KB compressed, so the model could parametrize thousands of such snippets and still have more than 99% of its capacity left. This paper on LLM memorization is interesting if you'd like to know more: https://arxiv.org/abs/2312.11658 - and another recent paper, the SWE-bench illusion one, shows SOTA code LLM results collapsing once memorised GitHub issues are filtered out: https://arxiv.org/pdf/2506.12286v1
Add to this that the Common Crawl slices used for the Pile/C4 mirror much of what you can find on GitHub. So when the training data contains dozens of near-duplicate solutions, the network only needs to interpolate between them.
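To put that capacity arithmetic in perspective, here's a rough back-of-envelope sketch taking the figures above (12.8 GB, 4.25 bits/weight, ~1.1 KB per compressed snippet) at face value; the ten-thousand-snippet count is a hypothetical, not a measured number:

```javascript
// Back-of-envelope capacity estimate using the figures quoted above.
const modelBytes = 12.8 * 1024 ** 3;          // ~12.8 GiB on disk
const modelBits = modelBytes * 8;             // ~110 Gbit
const bitsPerWeight = 4.25;                   // quantised weights
const weights = modelBits / bitsPerWeight;    // ~26 billion "micro weights"

const snippetBits = 1.1 * 1024 * 8;           // ~1.1 KB compressed Space Invaders snippet
const snippets = 10_000;                      // hypothetical: ten thousand stored variants
const fractionUsed = (snippets * snippetBits) / modelBits;

console.log(weights.toExponential(2));        // ≈ 2.6e10
console.log((fractionUsed * 100).toFixed(2)); // ≈ 0.08% of capacity, even at 10k snippets
```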
As to the CoT-style dumps you've shown, they are easy to misinterpret. Apple's "illusion of thinking" paper shows that models will happily backfill plausible-sounding rationales that don't correspond to the computation that actually produced the answer, and other evaluation work shows that when you systematically rewrite multiple-choice distractors so that memorisation can't help, accuracy drops by 50-90%, even on "reasoning" models: https://arxiv.org/abs/2502.12896 So a cool-looking bullet list about "edge detection" could just be narrative overspray - not really evidence of algorithmic planning.
If you actually want to know whether a model can plan an arcade game or whatever, rather than recall it, then you need a real benchmark (metamorphic rewrites, adversarial "none of the others" options, etc.). Until a benchmark controls for leakage in these ways, a perfect Space Invaders score mostly shows that the model has good pattern matching for code it has already seen.
If the models are memorizing and regurgitating from their training data, how come every model I've tried this with produces entirely different code?
Presumably this is because "the network only needs to interpolate between them". That's what I want it to do!
I tried the space invaders thing on a 4GB Qwen model today and it managed to produce a grid of aliens that advanced one step... and then dropped off the page entirely.
A transformer doesn't need to emit a byte-for-byte clone of a training example to benefit from having seen it. It can store a distributed representation of many near-duplicate implementations and then sample a novel linear combination. That still short-circuits algorithm design: the burden of discovering the game loop, collision logic, sprite sheet etc. was ALREADY SOLVED during pre-training.
When you temperature-sample the same model twice you also get "different" code; diversity alone is not evidence of new reasoning. What matters is functional novelty under controlled transformations (renamed variables, resized canvas, obfuscated asset file names, etc.). On such metamorphic rewrites, models that appear brilliant on canonical prompts suddenly collapse, a hallmark of shallow pattern matching.
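To make "metamorphic rewrite" concrete, here's a hypothetical sketch of what such a harness does: hold the task semantics fixed and perturb the surface details a memorised solution would latch onto. The spec shape, variant list and prompt wording are illustrative, not taken from any existing benchmark:

```javascript
// Hypothetical metamorphic-rewrite harness: same task, perturbed surface details.
// A model that genuinely plans the game should handle every variant; a model
// leaning on memorised exemplars tends to break once the surface changes.
const baseSpec = {
  game: "Space Invaders",
  canvas: { width: 800, height: 600 },
  entities: { player: "ship", enemy: "alien", shot: "bullet" },
};

function metamorphicVariants(spec) {
  return [
    // Rename the entities so memorised identifiers stop matching.
    { ...spec, entities: { player: "turret", enemy: "drone", shot: "bolt" } },
    // Resize the canvas so hard-coded edge constants break.
    { ...spec, canvas: { width: 1024, height: 480 } },
    // Obfuscate the theme entirely while keeping the mechanics identical.
    { ...spec, game: "grid-of-descending-targets shooter (don't call it Space Invaders)" },
  ];
}

function specToPrompt(spec) {
  return `Write an HTML+JavaScript ${spec.game} on a ${spec.canvas.width}x${spec.canvas.height} canvas. ` +
    `The ${spec.entities.player} fires ${spec.entities.shot}s at rows of ${spec.entities.enemy}s ` +
    `that march sideways and step down when they reach an edge.`;
}

// A real harness would send each prompt to the model and run the same automated
// checks (page loads, entities move, collisions register) against every variant.
metamorphicVariants(baseSpec).forEach(v => console.log(specToPrompt(v)));
```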
The paper I mentioned in my previous comment shows SOTA coding LLMs scoring 70%+ on SWE-bench Verified yet dropping 10-47% when the very same issues are paraphrased or drawn from unseen repos, even though the task semantics are identical. That is classic memorisation, just fuzzier than a CRC match.
As to Qwen: even at 4 bits per weight, a 4B model retains ≈2.1 GB of entropy - enough to memorise tens of thousands of full game loops. The reason it garbled the alien movement logic is probably that its limited capacity forced lossy compression, so the behaviour you saw is typical of partially recalled code patterns whose edge cases were truncated during training. That's still interpolation over memorised fragments, just with fewer fragments to blend. And this has actually been demonstrated (https://arxiv.org/abs/2406.15720v1): controlled fact-memorisation studies and extraction attacks up through 70B params show a monotone curve, where each extra order of magnitude adds noticeably more verbatim or near-verbatim recall. So a 20B model succeeds where a 4B one fails because the former crossed the "capacity per training token" threshold for that exemplar. Nothing magical there.
Don't get me wrong, I'm not arguing against interpolation per se - generalising between held-out exemplars is precisely what we want. The problem is that most public "just write Space Invaders" demos never verify that the endpoints were truly unseen. Until they do, a perfect clone is compatible with nothing deeper than glorified fuzzy lookup.
This is a great explanation, thanks for putting it together.
It more or less fits my fuzzy mental model of how this stuff works.
I'm completely fine with my test prompt taking advantage of this - the point of "implement space invaders" is to explore how well it can construct a game of that shape based on the examples that it has seen in its training data, especially in comparison to other models.
I'm not trying for a test of ability to produce a unique new game - I want a short prompt that gets it to output some HTML and JavaScript that I can then interact with.