Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap

Preprint


Abstract

General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales. We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems.

Minecraft

DreamerV3 is the first algorithm that collects diamonds in Minecraft without human demonstrations or manually crafted curricula, a task that poses a significant exploration challenge. The video shows the first diamond it collects, which happens at 30M environment steps, corresponding to 17 days of playtime.

Below, we show uncut videos of runs during which DreamerV3 collected diamonds. We find that it succeeds across many starting conditions, which requires searching the world for trees, swimming across lakes, and traversing mountains. Note that a reward is provided only for the first diamond per episode, so the agent is not incentivized to pick up additional diamonds.


Benchmarks

DreamerV3 masters a wide range of domains with a fixed set of hyperparameters, outperforming specialized methods. Removing the need for tuning reduces the amount of expert knowledge and computational resources needed to apply reinforcement learning.

Scaling

Due to its robustness, DreamerV3 shows favorable scaling properties. Notably, using larger models consistently increases not only its final performance but also its data-efficiency. Increasing the number of gradient steps further increases data efficiency.
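The interplay between environment steps and gradient steps can be illustrated with a train-ratio loop, where each collected experience is reused for multiple updates. The sketch below is illustrative only; the function and variable names are hypothetical and do not reflect DreamerV3's actual implementation.

```python
# Hypothetical sketch of a train ratio: the number of gradient updates
# taken per environment step. A higher ratio reuses each collected
# experience more often, improving data-efficiency at extra compute cost.
# Names here are illustrative, not DreamerV3's actual API.

def run_training(total_env_steps: int, train_ratio: float) -> int:
    """Interleave environment steps with gradient updates and return
    the total number of gradient updates performed."""
    gradient_updates = 0
    pending = 0.0  # fractional updates accumulated so far
    for _ in range(total_env_steps):
        # (collect one environment step into the replay buffer here)
        pending += train_ratio
        while pending >= 1.0:
            # (sample a replay batch and take one gradient step here)
            gradient_updates += 1
            pending -= 1.0
    return gradient_updates
```

For example, a ratio of 4 turns 1000 environment steps into 4000 gradient updates, while a ratio of 0.5 takes one update every other step; tuning this single knob trades wall-clock compute for data-efficiency.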

Data Efficiency

On a set of DMLab tasks, DreamerV3 exceeds IMPALA while using over 130 times fewer environment steps. This demonstrates that the peak performance of DreamerV3 exceeds that of model-free algorithms while reducing data requirements by two orders of magnitude.

Media Coverage