Without Adaptive Rewards
Without adaptive rewards, it is challenging to train a generalised locomotion policy efficiently and to achieve robust performance with it.
Navigating unstructured terrains with quadruped robots requires sophisticated locomotion capable of adapting to diverse environmental conditions. Existing RL-based single-policy methods work well when fine-tuned for a subdomain of terrains and obstacles but do not generalise well beyond it. Hierarchical multi-policy approaches, which train an individual locomotion policy for each type of terrain and use a higher-level neural network to switch between the low-level policies, exhibit greater adaptability, but at the cost of more intensive resource requirements for retraining and onboard deployment, and of a computationally expensive, labour-intensive hand-tuning and distillation process. We propose a dynamic reward shaping method for training a single locomotion policy that enables robust adaptability by eliminating competing reward landscapes. By leveraging real-time inference of the terrain type from the elevation map of the robot's surrounding terrain, adaptive reward shaping during training improves the generalisability of single-policy locomotion to challenging terrains without the computational overhead and complexity of hierarchical multi-policy approaches. We validate the generalisability and robustness of our method on existing and newly generated terrain types in simulation and in real-world experiments.
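As a rough illustration of the idea, the sketch below shows how per-terrain reward weights might be selected from a terrain class inferred from the robot's elevation map during training. All names and values here (TerrainType, REWARD_WEIGHTS, classify_terrain, shaped_reward, the thresholds and weights) are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of adaptive reward shaping driven by terrain classification.
# Everything below is illustrative: the real classifier, reward terms, and
# weight values are not specified in this sketch.
from enum import Enum, auto
import numpy as np

class TerrainType(Enum):
    FLAT = auto()
    STAIRS = auto()
    TUNNEL = auto()
    TRENCH = auto()

# Per-terrain weights for each reward term (values are made up for illustration).
REWARD_WEIGHTS = {
    TerrainType.FLAT:   {"velocity_tracking": 1.0, "collision": -1.0, "foot_height": 0.5},
    TerrainType.STAIRS: {"velocity_tracking": 1.0, "collision": -1.0, "foot_height": 1.0},
    TerrainType.TUNNEL: {"velocity_tracking": 1.0, "collision": -0.2, "foot_height": 0.1},
    TerrainType.TRENCH: {"velocity_tracking": 1.0, "collision": -0.2, "foot_height": 0.1},
}

def classify_terrain(elevation_map: np.ndarray) -> TerrainType:
    """Crude stand-in for the terrain classifier operating on the
    robot-centric elevation map (here just thresholding roughness)."""
    roughness = float(np.std(elevation_map))
    if roughness < 0.02:
        return TerrainType.FLAT
    if roughness < 0.10:
        return TerrainType.STAIRS
    return TerrainType.TRENCH

def shaped_reward(reward_terms: dict, elevation_map: np.ndarray) -> float:
    """Combine raw reward terms with weights chosen for the current terrain,
    so no single fixed weighting has to fit every terrain at once."""
    weights = REWARD_WEIGHTS[classify_terrain(elevation_map)]
    return sum(weights.get(name, 0.0) * value for name, value in reward_terms.items())
```

The key design point is that the reward terms themselves stay fixed; only their weights change with the inferred terrain, so one policy is trained against a single, terrain-consistent reward landscape at every step.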
Adaptive rewards enable efficient training and robust, generalisable motion with a single locomotion policy.
Using adaptive rewards, a single policy can generalise to environments that would otherwise demand conflicting reward configurations. For example, walking through a tunnel would be impossible without our adaptive rewards framework, since it requires rewards that conflict with those suited to the other terrains in the environment (e.g. walking on stairs), preventing a fixed reward from yielding a generalisable policy.
We can also easily adjust training for a radically different environment, such as a trench, by adaptively scaling the reward weights: for example, the collision penalty weight and the foot target height reward weight in the reward shaping function are reduced to help the agent learn to enter and traverse the trench.
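Continuing the hypothetical weight table from the earlier sketch, such an adjustment could look like the snippet below. The term names and scale factor are assumptions for illustration, not the values used in our experiments.

```python
# Illustrative only: when switching training to a trench terrain, scale down the
# collision penalty and foot target height reward so the agent is not discouraged
# from crouching into and moving through the confined space.
def adapt_weights_for_trench(weights: dict, scale: float = 0.2) -> dict:
    adapted = dict(weights)
    adapted["collision"] *= scale      # softer penalty for body/leg contacts
    adapted["foot_height"] *= scale    # lower foot-lift target inside the trench
    return adapted

# e.g. derive trench-specific weights from a generic starting set
trench_weights = adapt_weights_for_trench(
    {"velocity_tracking": 1.0, "collision": -1.0, "foot_height": 1.0}
)
```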
Hardware experiments without perception demonstrating the generalisability of the policy. The robot is able to tackle a wide range of obstacles without any prior knowledge of the environment and without any stops or resets.
BibTeX will be available if accepted.