Every month, a new local LLM claims it can generate working games.
This is the arena where we put those claims to the test.
One pixelated bird at a time.
What is this?
Each entry here is a Flappy Bird clone generated entirely by a different local LLM. Same prompt, same constraints, completely different results. Some work flawlessly. Some crash on launch. Some are... art.
Why does it matter?
Flappy Bird is the perfect stress test for code generation. It's simple enough that a model should be able to produce it, but complex enough that bugs, edge cases, and subtle logic errors reveal exactly where each model stands. How does it handle collision detection? Game loops? State management? The answers tell you more about a model's capabilities than any benchmark.
How to use
Pick a model from the sidebar to see its game in action. If the model produced multiple iterations, toggle between versions using the brick tabs above the game. Read the curator notes on the right for analysis of what worked, what didn't, and what surprised me.