One Laptop, One Policy, Many Terrains: Efficient Training of Generalisable Quadruped Locomotion Policies via Adaptive Rewards

Sándor Felber1,2, Thomas Corbères1, Jiayi Wang1, Antonin Bretagne1, Steve Tonneau1
1University of Edinburgh, 2Massachusetts Institute of Technology

Adaptive rewards help your mobile robot's locomotion policy generalise to new environments.

Abstract

Navigating unstructured terrains with quadruped robots requires sophisticated locomotion capable of adapting to diverse environmental conditions. Existing RL-based single-policy methods work well when fine-tuned for a subdomain of terrains and obstacles, but do not generalise well beyond it. Hierarchical multi-policy approaches, which combine several locomotion policies trained individually for each terrain type with a higher-level neural network that switches between them, exhibit greater adaptability, but at the cost of heavier resource requirements for retraining and onboard deployment, and of a computationally expensive, labour-intensive hand-tuning and distillation process. We propose a dynamic reward shaping method for training a single locomotion policy that enables robust adaptability by eliminating competing reward landscapes. By leveraging real-time inference over the elevation map of the robot's surrounding terrain to classify the terrain type, adaptive reward shaping during training improves the generalisability of single-policy locomotion to challenging terrains without the computational overhead and complexity of hierarchical multi-policy approaches. We validate the generalisability and robustness of our method on existing and newly generated terrain types, both in simulation and in real-world experiments.
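To make the idea concrete, the Python snippet below is a minimal, hypothetical sketch of terrain-conditioned reward shaping: the terrain classes, reward terms, weight values, and the threshold-based classifier are illustrative assumptions standing in for the elevation-map-based terrain inference and reward schedules actually used in our framework.

```python
import numpy as np

# Hypothetical per-terrain reward weights (names and values are illustrative
# assumptions, not the configuration used in the paper).
TERRAIN_REWARD_WEIGHTS = {
    "flat":   {"collision": -1.0, "foot_height": 0.5, "velocity_tracking": 1.0},
    "stairs": {"collision": -1.0, "foot_height": 1.0, "velocity_tracking": 1.0},
    "tunnel": {"collision": -0.2, "foot_height": 0.1, "velocity_tracking": 1.0},
    "trench": {"collision": -0.2, "foot_height": 0.1, "velocity_tracking": 1.0},
}


def classify_terrain(elevation_map: np.ndarray) -> str:
    """Toy terrain classifier over a local elevation map (H x W, metres).

    Stands in for the real-time terrain inference used in the paper; the
    thresholds below are rough assumptions for illustration only.
    """
    rel = elevation_map - elevation_map.mean()
    if rel.max() - rel.min() < 0.05:
        return "flat"
    if rel.min() < -0.15:                 # deep depression ahead: trench-like
        return "trench"
    if np.percentile(rel, 90) > 0.15:     # raised steps ahead: stairs-like
        return "stairs"
    return "tunnel"


def shaped_reward(reward_terms: dict, elevation_map: np.ndarray) -> float:
    """Weight each raw reward term by the terrain-dependent schedule."""
    weights = TERRAIN_REWARD_WEIGHTS[classify_terrain(elevation_map)]
    return sum(weights[name] * value for name, value in reward_terms.items())


# Example: on flat ground the full collision penalty and foot-height reward apply.
terms = {"collision": 0.0, "foot_height": 0.3, "velocity_tracking": 0.8}
print(shaped_reward(terms, np.zeros((32, 32))))
```

Because the weights are a function of the classified terrain rather than fixed constants, a single policy never has to trade off the reward landscape of one terrain against another.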

Video

Without Adaptive Rewards

Without adaptive rewards, it is difficult to efficiently train a generalised locomotion policy that performs robustly across terrains.

With Adaptive Rewards

Adaptive rewards enable efficient training and robust, generalisable motion with a single locomotion policy.

Adaptability to New Environments (Bridge)

With adaptive rewards, a single policy can generalise better to specific environments. Walking through a tunnel would be impossible without our adaptive rewards framework: the rewards it requires conflict with those needed for the other terrains in the environment (e.g. walking on stairs), which would prevent the policy from learning generalisable behaviour.

Adaptability to New Environments (Trench)

We can also easily adapt training to a radically different environment, such as a trench, by adaptively scaling reward weights: for example, the collision penalty weight and the foot target height reward weight in the reward shaping function are reduced to help the agent learn to enter and traverse the trench.
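As a minimal sketch of the adjustment described above (the weight names and scaling factors are illustrative assumptions, not the values used in our reward function), the trench case can be expressed as a multiplicative update on a base set of weights:

```python
# Base reward weights shared across standard terrains (illustrative values only).
BASE_WEIGHTS = {
    "collision": -1.0,          # penalty for body/leg collisions
    "foot_target_height": 0.5,  # reward for lifting feet towards a target height
    "velocity_tracking": 1.0,   # reward for tracking the commanded velocity
}

# Terrain-specific multiplicative scales. For the trench, the collision penalty
# and foot target height reward are scaled down so the agent is not discouraged
# from crouching into and traversing the narrow passage.
TERRAIN_SCALES = {
    "trench": {"collision": 0.2, "foot_target_height": 0.1},
}


def adapted_weights(terrain: str) -> dict:
    """Return the reward weights after applying terrain-specific scaling."""
    weights = dict(BASE_WEIGHTS)
    for term, scale in TERRAIN_SCALES.get(terrain, {}).items():
        weights[term] *= scale
    return weights


print(adapted_weights("trench"))
# {'collision': -0.2, 'foot_target_height': 0.05, 'velocity_tracking': 1.0}
```

Only the scales need to change to target a new environment; the rest of the training setup is left untouched.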

Hardware Experiments

Hardware experiments without perception, demonstrating the generalisability of the policy. The robot tackles a wide range of obstacles with no prior knowledge of the environment and without any stops or resets.

BibTeX

BibTeX will be available if accepted.