OpenAI Releases Minecraft-Playing AI VPT

Researchers from OpenAI have open-sourced Video PreTraining (VPT), a semi-supervised learning technique for training game-playing agents. In a zero-shot setting, VPT performs tasks that agents cannot learn through reinforcement learning (RL) alone, and with fine-tuning it is the first AI to craft a diamond pickaxe in Minecraft.

The model and several experiments were described in a paper published on arXiv. To train VPT, the team first contracted players to perform specific actions in the game, creating a labelled dataset of around 2,000 hours of video. Using this dataset, the researchers trained an Inverse Dynamics Model (IDM) that can infer which keystrokes and mouse actions produced the behaviour seen in the video. The team used the IDM to label around 70k hours of internet videos showing Minecraft play; this larger dataset was then used to pretrain a VPT foundation model. Without fine-tuning, this model could perform complex game behaviours that had previously proven impossible for RL models to learn, including multi-step crafting activities. When fine-tuned on additional contractor data, VPT learned to craft a diamond pickaxe, which can require over 24k in-game actions.

According to the OpenAI team, “VPT paves the path toward allowing agents to learn to act by watching the vast numbers of videos on the internet…While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.”

Recent research in natural language processing (NLP) and computer vision (CV) has shown that pretraining models on large, noisy datasets scraped from the web can produce state-of-the-art results on various downstream tasks. These large pretrained models, called foundation models, are typically fine-tuned on a relatively small task-specific dataset. In contrast, most game-playing agents are trained using RL, which requires many thousands of episodes of the agent playing the game. This is time-consuming, and the agent may still fail to explore a large part of the game’s possibilities, especially in “open-world” games such as Minecraft.

While internet video-sharing sites like YouTube have hundreds of thousands of gameplay videos for an agent to learn from, these videos show only the game screen, not the control inputs, which are crucial for learning. The OpenAI solution was to train an IDM to infer the control inputs given a series of video frames. To do this, the team first hired contractors to play Minecraft; during play, their screens were recorded along with their keystroke and mouse inputs. This produced a labelled dataset which was used to train the IDM.
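The idea of inverse dynamics can be illustrated with a deliberately tiny sketch. It is not OpenAI's implementation: the real IDM is a neural network operating on video frames, whereas here each "frame" is assumed to be a single integer position on a 1-D track, and the function names `train_idm` and `label_video`, the action names, and all data are hypothetical. The sketch only shows the two steps: learn from recordings that include the inputs, then use what was learned to label recordings that lack them.

```python
from collections import Counter, defaultdict


def train_idm(labelled_clips):
    """Learn a lookup from an observed frame change to the most likely action."""
    counts = defaultdict(Counter)
    for frames, actions in labelled_clips:
        # actions[t] is the recorded input that turned frames[t] into frames[t + 1].
        for t, action in enumerate(actions):
            change = frames[t + 1] - frames[t]
            counts[change][action] += 1
    # Keep the most frequently recorded action for each kind of change.
    return {change: c.most_common(1)[0][0] for change, c in counts.items()}


def label_video(idm, frames, default="noop"):
    """Pseudo-label a clip that has frames but no recorded inputs."""
    return [
        idm.get(frames[t + 1] - frames[t], default)
        for t in range(len(frames) - 1)
    ]


# Hypothetical "contractor" clips: frames plus the inputs recorded during play.
contractor_clips = [
    ([0, 1, 2, 2, 1], ["right", "right", "noop", "left"]),
    ([5, 4, 4, 5], ["left", "noop", "right"]),
]
idm = train_idm(contractor_clips)

# Hypothetical "internet" clip: frames only, no inputs.
web_frames = [3, 4, 5, 5, 4, 3]
pseudo_labels = label_video(idm, web_frames)
print(pseudo_labels)  # ['right', 'right', 'noop', 'left', 'left']
```

Because the model sees the frames on both sides of each step, inferring the action is a much easier problem than deciding what to do next, which is why a comparatively small labelled dataset can be enough to train it.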

Next, the team collected and cleaned Minecraft gameplay videos from the internet and then used the IDM to label this dataset with the inferred control inputs driving the game. This larger dataset was used to train the VPT foundation model using “standard behavioural cloning.” Behavioural cloning is a form of imitation learning in which an agent is trained on the observed states and actions of another agent (usually a human teacher) and learns to estimate the teacher’s policy. In contrast with RL, behavioural cloning does not require the learning agent to interact with the environment directly.
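As a rough illustration of behavioural cloning, the sketch below fits a tabular policy by choosing, for each state, the action the demonstrator took most often. This is a toy stand-in, not the VPT training code: the actual model is a large neural network trained on video, and the 1-D track, the function names `behavioural_cloning` and `rollout`, and the demonstration data are all assumptions made for the example. Note that training only reads the recorded pairs; the environment is stepped only afterwards, to check the learned policy.

```python
from collections import Counter, defaultdict


def behavioural_cloning(demonstrations):
    """Fit a tabular policy from (state, action) pairs by majority vote per state."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def rollout(policy, state, steps=10):
    """Run the learned policy on a toy 1-D track (used for evaluation only)."""
    moves = {"left": -1, "right": 1, "noop": 0}
    for _ in range(steps):
        # States never seen in the demonstrations fall back to doing nothing.
        state += moves[policy.get(state, "noop")]
    return state


# Hypothetical demonstrations: a teacher walking towards position 3.
demonstrations = [
    (0, "right"), (1, "right"), (2, "right"), (3, "noop"),
    (5, "left"), (4, "left"), (3, "noop"),
    (1, "right"), (2, "left"),  # one noisy label, as pseudo-labelled data may contain
    (2, "right"),
]
policy = behavioural_cloning(demonstrations)

print(policy[2])           # right (the noisy "left" label is outvoted)
print(rollout(policy, 0))  # 3
print(rollout(policy, 5))  # 3
```

The majority vote is the tabular analogue of maximising the likelihood of the teacher's actions, and it shows why a large dataset helps: occasional wrong labels are outweighed by correct ones.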

In addition to releasing the VPT code and model weights, OpenAI has partnered with this year’s MineRL NeurIPS competition. This competition offers prizes to teams who train agents that can perform tasks in the MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT). Besides OpenAI, several other large tech companies are supporting AI research efforts using Minecraft as a platform. In 2019, InfoQ covered Meta’s open-source CraftAssist framework for building bots to assist players in the game. More recently, NVIDIA open-sourced MineDojo, a framework for embodied agent research in Minecraft. The VPT code and models are available on GitHub.