Deep Hierarchical Planning from Pixels
NeurIPS 2022
Abstract
Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals, and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels.
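As a rough sketch of the control loop described above, the snippet below shows a manager that proposes a latent goal at a fixed interval and a goal-conditioned worker that acts at every step. All names and sizes (encode, manager_policy, worker_policy, the feature and action dimensions, and the goal interval) are illustrative stand-ins rather than the released implementation.

```python
import numpy as np

# Minimal sketch of the manager/worker control loop.
# A real agent would use the recurrent world-model state and learned
# neural policies; these stubs only illustrate the interaction pattern.

FEAT_DIM = 32   # size of the world-model feature vector (assumed)
ACT_DIM = 4     # size of the action space (assumed)
GOAL_EVERY = 8  # the manager acts at a coarser timescale than the worker

rng = np.random.default_rng(0)

def encode(obs):
    """Stand-in for the world model: map pixels to a latent feature vector."""
    return rng.standard_normal(FEAT_DIM)

def manager_policy(feat):
    """Stand-in manager: propose a goal in the same latent feature space."""
    return feat + rng.standard_normal(FEAT_DIM)

def worker_policy(feat, goal):
    """Stand-in worker: act conditioned on the current features and the goal."""
    return np.tanh(rng.standard_normal(ACT_DIM) + (goal - feat).mean())

def env_step(action):
    """Stand-in environment returning a fake image observation."""
    return rng.integers(0, 255, size=(64, 64, 3), dtype=np.uint8)

obs, goal = env_step(np.zeros(ACT_DIM)), None
for t in range(32):
    feat = encode(obs)
    if t % GOAL_EVERY == 0:          # manager picks a new latent goal
        goal = manager_policy(feat)
    action = worker_policy(feat, goal)  # worker chases the current goal
    obs = env_step(action)
```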
Strategies
While Director uses latent feature vectors as goals, the world model allows us to decode them into images for human inspection. Each of the videos below shows:
- Left: The environment inputs of an episode as seen by the agent
- Right: The visualized goals that Director chooses internally
We find that Director discovers diverse strategies for breaking down tasks into internal subgoals, such as leveraging robot poses, salient landmarks in the environment, and inventory and score displays on the screen. The goals generally stay ahead of the worker, efficiently directing it onward, often without giving it enough time to fully reach the previous goal.
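To make the visualization concrete, here is a minimal sketch of how a decoder could map latent goal vectors back to images for the side-by-side views described above. The decoder is an untrained stand-in; the feature size, image shape, and function names are assumptions for illustration, not the released code.

```python
import numpy as np

# Sketch of goal visualization: the same kind of decoder that reconstructs
# observations is applied to the latent goal vector, producing the images
# shown on the right side of each video.

FEAT_DIM = 32
IMG_SHAPE = (64, 64, 3)

rng = np.random.default_rng(0)
W = rng.standard_normal((FEAT_DIM, int(np.prod(IMG_SHAPE)))) * 0.01  # untrained stand-in

def decode(latent):
    """Map a latent feature vector to an image with values in [0, 255]."""
    flat = 1.0 / (1.0 + np.exp(-latent @ W))       # squash to [0, 1]
    return (flat.reshape(IMG_SHAPE) * 255).astype(np.uint8)

def side_by_side(obs, goal_latent):
    """Concatenate the environment input and the decoded goal for inspection."""
    return np.concatenate([obs, decode(goal_latent)], axis=1)

obs = rng.integers(0, 255, size=IMG_SHAPE, dtype=np.uint8)
goal = rng.standard_normal(FEAT_DIM)
frame = side_by_side(obs, goal)
print(frame.shape)  # (64, 128, 3): input on the left, decoded goal on the right
```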
Visual Pin Pad Six
There is only one sparse reward after activating all pads in the right order. The manager directs the worker via the history display at the bottom of the image and sometimes via the black player square.
Atari Pong
Because the game is reactive and requires no long-term reasoning, the manager learns to communicate the task to the worker by requesting a higher score via the score display at the top of the screen.
Egocentric Ant Maze M
The quadruped robot has to explore the maze through low-level locomotion and find the target zone to receive a sparse reward. The colored walls allow the agent to localize itself in the maze and can serve as meaningful subgoals.
Egocentric Ant Maze XL
This is the most challenging task in our benchmark, requiring the quadruped to find the sparse goal in the largest maze. The manager proposes multiple intermediate subgoals for this task, including red, pink, yellow, and gray walls.
Cartpole Swingup
Early during the episode, the manager requests a sideways angle to help the worker swing up the pole, after which the goal remains an upright position with small left and right movements to correct imbalances.
Acrobot Swingup
The upright reward is difficult to discover in this task, but once found, the manager can easily request the rewarding upright pose. From then on, the worker can follow its dense feature-space reward to learn to swing up the two-link pole.
Walker Walk
The manager abstracts away the details of leg movement, directing the worker through a forward-leaning pose with both feet above the ground and a shifting floor pattern. The latent goal also likely contains velocity information. The worker fills in the leg movement needed to pass through the goals.
Humanoid Walk
For the humanoid task, the manager uses an upright pose to direct the worker to stand up, followed by a shifting floor pattern for carefully walking forward without falling. Director is the first hierarchical agent to solve this challenging control task end-to-end without demonstrations.
Crafter
The manager directs the worker via the item display to collect wood and craft a pickaxe. It then sends the worker to a cave to collect stone and iron. As it gets dark, the manager tells the worker to find a small cave or island to hide from monsters.
DMLab Goals Small
The manager requests the teleport animation that occurs when collecting the reward object. Because there is no locomotion challenge in this environment, the worker can navigate to the goal object on its own, without fine-grained goals.
No Goal Autoencoder
As an ablation, here we show the selected goals when we remove the goal autoencoder, so that the manager chooses goal vectors directly in the continuous feature space of the world model. Each video shows:
- Left: The environment inputs of an episode as seen by the agent
- Right: The visualized goals that Director chooses internally
We observe that the goals are completely uninterpretable and cause the agent to fail in many, but not all, of the tested environments.
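For contrast, here is a minimal sketch of the two manager action spaces: with a goal autoencoder, the manager picks a small set of discrete codes that a learned decoder maps into the feature space, while in this ablation it emits a continuous feature vector directly. The number of codes and classes, the feature size, and the decoder weights below are assumptions for illustration, not the released implementation.

```python
import numpy as np

# Sketch contrasting the two manager action spaces described above.

FEAT_DIM = 32
NUM_CODES = 8      # assumed number of categorical latents
NUM_CLASSES = 8    # assumed classes per latent

rng = np.random.default_rng(0)
decoder_W = rng.standard_normal((NUM_CODES * NUM_CLASSES, FEAT_DIM)) * 0.1  # untrained stand-in

def goal_from_codes(codes):
    """Decode a small set of discrete codes into a feature-space goal."""
    one_hot = np.zeros((NUM_CODES, NUM_CLASSES))
    one_hot[np.arange(NUM_CODES), codes] = 1.0
    return one_hot.reshape(-1) @ decoder_W

# With the goal autoencoder: the manager's action is a handful of integers,
# which restricts goals to the autoencoder's learned code book.
codes = rng.integers(0, NUM_CLASSES, size=NUM_CODES)
goal_discrete = goal_from_codes(codes)

# Ablation: the manager outputs an unconstrained continuous goal vector,
# which tends to be uninterpretable when decoded into an image.
goal_continuous = rng.standard_normal(FEAT_DIM)

print(goal_discrete.shape, goal_continuous.shape)  # both (32,)
```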