
Learning to model the world could in principle enable agents to generalize in environments with many different tasks. However, learning latent dynamics models suitable for planning has been a long-standing challenge. We present the deterministic belief state model (DBSM), a probabilistic dynamics model for latent planning in high-dimensional environments. The model propagates deterministic beliefs as activation vectors forward in time, providing context for long-horizon predictions. We further introduce variational overshooting, a generalization of the variational free energy bound for sequence models that encourages consistency between closed-loop and open-loop predictions. Experiments on pixel-based locomotion tasks show that our model recovers information about the true simulator state from purely unsupervised experience, without access to rewards. Leveraging the latent space, we learn reward functions from a few example episodes and obtain locomotion gaits through planning alone, without a separate policy. Collecting data online, our model reaches reasonable scores in 100 to 1000 times fewer episodes than model-free algorithms.
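
To make the belief propagation concrete, here is a minimal sketch of a deterministic belief state model in PyTorch. It is an illustration, not the paper's exact architecture: the layer types, dimensions, and the GRU cell are assumptions, and the names (`DBSMSketch`, `filter_step`, `predict_step`) are ours. The key property it shows is that the belief is a plain activation vector carried forward by a recurrent cell, updated with the observation when one is available (closed-loop) and without it during prediction (open-loop).

```python
import torch
import torch.nn as nn

class DBSMSketch(nn.Module):
    """Illustrative deterministic belief state model: the belief is an
    activation vector updated by a recurrent cell, so context survives
    long open-loop rollouts. All sizes and layer choices are assumptions."""

    def __init__(self, obs_dim=1024, act_dim=6, belief_dim=256, embed_dim=128):
        super().__init__()
        self.encode = nn.Linear(obs_dim, embed_dim)              # observation embedding
        self.step = nn.GRUCell(act_dim + embed_dim, belief_dim)  # belief update
        self.decode_obs = nn.Linear(belief_dim, obs_dim)         # mean of p(o_t | h_t)
        self.decode_reward = nn.Linear(belief_dim, 1)            # mean of p(r_t | h_t)

    def filter_step(self, belief, action, obs):
        """Closed-loop: update the belief using the current observation."""
        inp = torch.cat([action, self.encode(obs)], dim=-1)
        return self.step(inp, belief)

    def predict_step(self, belief, action):
        """Open-loop: update the belief from the action alone, standing in
        for the missing observation with the model's own prediction."""
        pred_obs = self.decode_obs(belief)
        inp = torch.cat([action, self.encode(pred_obs)], dim=-1)
        return self.step(inp, belief)
```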
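Variational overshooting trains those same beliefs under multi-step prediction. The following is a rough sketch of the idea only; the paper's exact objective and weighting may differ, and in this deterministic setting we stand in for the variational divergence terms with a squared error between beliefs. The model is rolled out open-loop from every time step, and each open-loop belief is pulled toward the corresponding closed-loop belief:

```python
import torch.nn.functional as F

def overshooting_loss(model, beliefs, actions, horizon=3):
    """Pull open-loop beliefs toward the closed-loop (filtered) ones.
    beliefs: [T, B, D] filtered beliefs from filter_step; actions: [T, B, A],
    aligned so that actions[t] leads from step t to step t + 1."""
    T = beliefs.shape[0]
    total, count = beliefs.new_zeros(()), 0
    for t in range(T - 1):
        belief = beliefs[t]
        for d in range(1, min(horizon, T - 1 - t) + 1):
            belief = model.predict_step(belief, actions[t + d - 1])  # open-loop step
            # Match the open-loop belief to the closed-loop belief at the same
            # step; the target is detached so only the predictions move.
            total = total + F.mse_loss(belief, beliefs[t + d].detach())
            count += 1
    return total / max(count, 1)
```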
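Because the latent space already captures the simulator state, a reward function can be learned on top of it from a handful of labeled episodes. A hypothetical sketch, reusing the reward head from `DBSMSketch` above (the paper does not specify this procedure, and the frozen-dynamics regression here is an assumption):

```python
import torch
import torch.nn.functional as F

def fit_reward_head(model, beliefs, rewards, steps=200, lr=1e-3):
    """Regress only the reward head on beliefs from a few labeled episodes.
    beliefs: [N, belief_dim], precomputed and detached; rewards: [N]."""
    opt = torch.optim.Adam(model.decode_reward.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(model.decode_reward(beliefs).squeeze(-1), rewards)
        opt.zero_grad()
        loss.backward()
        opt.step()
```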
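Given the learned dynamics and reward head, actions are then selected by planning in latent space rather than by a policy network. The text above does not name a planner, so the sketch below uses the cross-entropy method, a common choice for such models, purely as an illustration:

```python
import torch

@torch.no_grad()
def plan_cem(model, belief, act_dim, horizon=12, iters=5, candidates=500, top_k=50):
    """Sample action sequences, roll the belief forward open-loop, score
    sequences by predicted reward, and refit a Gaussian to the best ones."""
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        seqs = mean + std * torch.randn(candidates, horizon, act_dim)
        b = belief.unsqueeze(0).expand(candidates, -1)  # belief: [belief_dim]
        ret = torch.zeros(candidates)
        for t in range(horizon):
            b = model.predict_step(b, seqs[:, t])
            ret = ret + model.decode_reward(b).squeeze(-1)
        elite = seqs[ret.topk(top_k).indices]          # keep best sequences
        mean, std = elite.mean(0), elite.std(0)        # refit the proposal
    return mean[0]
```

At execution time, only the first action of the refined sequence is applied; the agent then updates its belief with the new observation and replans, in the style of model-predictive control.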