Arash Moradi Karkaj, Mark J. Nelson, Ioannis Koutis, Amy K. Hoover (2024). Prompt Wrangling: On replication and generalization in large language models for PCG levels. In Proceedings of the Workshop on Procedural Content Generation. DOI: 10.1145/3649921.3659853
The ChatGPT4PCG competition calls for participants to submit prompts that guide ChatGPT toward generating levels as sequences of Tetris-like block drops. Submitted prompts are used to query ChatGPT to generate levels that resemble letters of the English alphabet. Levels are evaluated based on their similarity to the target letter and their physical stability in the game engine. This provides a quantitative evaluation setting for prompt-based procedural content generation (PCG), an approach that has been gaining popularity in PCG, as in other areas of generative AI. This paper focuses on replicating and generalizing the competition results. The replication experiments first test whether the number of responses gathered from ChatGPT is sufficient to account for its stochasticity. We requery the original prompt submissions and rerun the original competition scripts, on different machines, about six months after the competition. We find that the results largely replicate, except that two of the 15 submissions do much better in our replication, for reasons we can only partly determine. As for generalization, we note that the top-performing prompt hardcodes instructions for all 26 target levels, which is at odds with the PCGML goal of generating new, previously unseen content from examples. We perform experiments in more restricted zero-shot and few-shot prompting scenarios, and find that generalization remains a challenge for current approaches.