points by reissbaker a year ago

I think it's an interesting tech demo. You're right that as-is it's not useful. Here are some long-term things I could imagine:

1. Scale it up so that it has a longer context length than a single frame. If it could observe the last million frames, for example, that would allow significantly more temporal consistency.
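A rolling window over recent frames could be sketched like this. Everything here is hypothetical (the class name, the idea that the model simply conditions on a flat list of frames); it's just a minimal illustration of "observe the last N frames":

```python
from collections import deque

class FrameContext:
    """Hypothetical rolling buffer of the most recent generated frames.

    A real model would be bounded by compute (attention over a million
    frames is the hard part), not just by memory like this deque is.
    """

    def __init__(self, max_frames=1_000_000):  # "last million frames"
        self.frames = deque(maxlen=max_frames)

    def push(self, frame):
        # Oldest frames fall off the left end once the window is full.
        self.frames.append(frame)

    def window(self):
        # The frames the model would condition on for the next generation.
        return list(self.frames)
```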

2. RAG-style approaches. Generate a simple room-by-room level map (basically just empty bounding boxes), and let the model read the map as part of its input rather than only looking at frames. Then, when your character is inside a bounding box the model has generated before, give N frames of that earlier generation as context for the current frame generation (perhaps specifically the frames with the same camera direction, or the closest to it). That would probably result in near-perfect temporal consistency even over very long generations and timeframes, assuming the frame context length was long enough.
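The retrieval step in point 2 could look something like the sketch below. All of the names (`Box`, `LevelMap`, `retrieve_context`) and the choice of yaw as the camera-direction key are my assumptions, not anything the demo actually does:

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    """One empty bounding box in the hypothetical room-by-room level map."""
    min_xyz: tuple
    max_xyz: tuple
    # Previously generated frames, stored as (camera_yaw_degrees, frame) pairs.
    frames: list = field(default_factory=list)

    def contains(self, pos):
        return all(lo <= p <= hi
                   for p, lo, hi in zip(pos, self.min_xyz, self.max_xyz))

class LevelMap:
    def __init__(self):
        self.boxes = []

    def find(self, pos):
        # Linear scan is fine for a sketch; a real map would index spatially.
        for box in self.boxes:
            if box.contains(pos):
                return box
        return None

    def record(self, pos, yaw, frame):
        # Remember what was generated here, keyed by camera direction.
        box = self.find(pos)
        if box is not None:
            box.frames.append((yaw, frame))

    def retrieve_context(self, pos, yaw, n=8):
        """Return up to n earlier frames from this box, closest camera
        direction first, for re-use as context in the next generation."""
        box = self.find(pos)
        if box is None:
            return []

        def yaw_dist(a, b):
            d = abs(a - b) % 360.0
            return min(d, 360.0 - d)  # wrap around, e.g. 350° is near 0°

        ranked = sorted(box.frames, key=lambda pair: yaw_dist(pair[0], yaw))
        return [frame for _, frame in ranked[:n]]
```

The retrieved frames would then be prepended to the model's frame context before generating, which is what should pin down the room's appearance across revisits.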

3. Train on a larger number of games, plus text, so that you can describe a desired game and get something similar to your description (instead of needing to train on a zillion Minecraft runs just to get... Minecraft).

That being said, I think in the near term it'll be much more fruitful to generate game assets and put them in a traditional game engine — or generate assets, and have an LLM generate code to place them in an engine — rather than trying to go end-to-end from keyboard+mouse input to video frames with nothing structured in between.

Eventually the end-to-end model will probably win unless scaling limits get hit, as per the Bitter Lesson [1], but that's a long eventually, and TBH at that scale there really may just be fundamental scaling issues compared to assets+code approaches.

It's still pretty cool though! And it seems useful from a research perspective to show what can already be done at the current scale. There's also no guarantee the scaling limits will ever materialize; betting against scaling LLMs during the GPT-2 era would've been a bad bet.

Games in particular are very nice to train on, since you have easy access to near-infinite synthetic ground-truth data: you can just run the game on a computer. You could probably also do some very clever things by training on real-life video with faked keyboard/mouse inputs, such that eventually you'd be better than a game engine could ever hope to be, both in graphics and in physics.

1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html