Introduction

Legged systems can execute agile motions by leveraging their ability to reach appropriate, disjoint support contacts, giving them outstanding mobility in complex and unstructured environments1. This ability makes them a popular choice for tasks such as inspection, monitoring, search and rescue, and transporting goods in such environments2.

However, legged robots have so far been unable to match animals in traversing real-world terrain3. In nature, legged animals such as cheetahs and hunting dogs can make small-radius turns while running at high speed, or recover quickly from a fall with little or no slowing down while chasing prey. Animals naturally learn to process a wide range of sensory information and respond adaptively in unexpected situations, even when exteroception is unavailable. This ability requires motion controllers capable of recovering from unexpected perturbations, adapting to changes in system and environmental dynamics, and executing safe and reliable motion relying solely on proprioception.

Endowing legged robots with this ability is a grand challenge in robotics. Conventional control approaches often struggle to handle these problems effectively. As a promising alternative, model-free deep reinforcement learning (RL) algorithms can autonomously solve complex, high-dimensional legged locomotion problems without assuming prior knowledge of environmental dynamics.

Under the privileged learning framework, previous work has solved locomotion tasks such as fast running4, locomotion over uneven terrain5, and mountain climbing3. However, existing motion controllers still cannot respond appropriately and adaptively when accidents interrupt motion, such as collisions with obstacles or falls caused by stepping into potholes.

In this paper, our goal is to construct a quadrupedal system that can traverse terrain with agility and adaptivity over an extensive range of commands. One straightforward idea is to have the agent learn directly in a simulation environment in which commands, external disturbance forces, and other random variables are generated uniformly5,6. However, previous works4,7 have shown that this naive approach only succeeds in learning control policies for a small range of commands or disturbances. Curriculum learning allows the training process to address progressively more complex tasks, but manual curriculum design can also fail because learning progress and task difficulty are hard to evaluate4,8. This paper presents Curricular Hindsight Reinforcement Learning (CHRL), which addresses this problem with a novel automatic curriculum strategy that assesses the learning progress of the policy and controls task difficulty. CHRL periodically evaluates the policy’s performance and automatically adjusts the curriculum parameters, including command ranges, reward coefficients, and environment difficulty, to ramp up task difficulty.

However, adjusting the environment parameters and command ranges destroys the consistency between the state distribution of the environment and that of the replay buffer, which is disastrous for off-policy RL. We solve this problem by adapting Hindsight Experience Replay (HER)9 to quadrupedal locomotion tasks so that the existing replay buffer matches the current curriculum. The presented HER variant modifies the commands of past experiences and recalculates rewards, reusing past experience and thereby increasing learning efficiency. The proposed policy yields significant performance improvements in learning agile and adaptive high-speed locomotion.

When deployed zero-shot in the real world on uneven outdoor terrain covered with grass, our learned policy sustained a top forward velocity of 3.45 m/s and a spinning angular velocity of 3.2 rad/s. The policy also shows strong robustness and adaptability in unstructured environments. When the robot hits an obstacle or steps into an unseen pit and falls, the learned policy demonstrates failure-resilient running and recovery within one second. In indoor tests, the robot spontaneously resisted unpredictable external disturbances and demonstrated unique motor skills. These results are reported qualitatively, and the corresponding videos highlight the adaptive responses that emerge from end-to-end learning.

The main contributions of this paper include:

  • A novel automatic curriculum strategy that enables the discovery of behaviors that are challenging to learn directly with reinforcement learning.

  • We adapt Hindsight Experience Replay to legged locomotion tasks, enabling sample-efficient learning and improving performance.

  • The learned controller can be deployed directly in the real world and performs agilely and adaptively in various environments.

Related works

Dynamic locomotion over unknown and challenging terrain requires careful motion planning and precise motion and force control10. The primary approach in the legged locomotion community uses model-based mathematical optimization to solve these problems, such as Model-Predictive Control (MPC)11, Quadratic Programming (QP)12, and Trajectory Optimization (TO)13. MPC enables a system to make current decisions while accounting for their impact on the future through predictive modeling14, which has shown promise in recent research on legged locomotion. Recently, more complete leg models, such as single rigid-body models15,16, centroidal dynamics models17,18, and whole-body models19, have been used to improve the locomotor skills of legged robots, especially quadruped robots. These MPC-based strategies treat legged motion in a unified way and can find reliable motion trajectories in real time that are robust to external disturbances. However, this paradigm relies on explicitly specifying contact points, either manually or automatically, and requires advanced optimization schemes whose computation time grows exponentially with complexity; they become too slow for real-time solution, making closed-loop control impractical for robotic fall recovery.

End-to-end reinforcement learning for quadrupedal locomotion

Instead of tedious manual controller engineering, reinforcement learning (RL) automatically synthesizes a controller for the desired task by optimizing the controller’s objective function20. Using policies trained in simulation, ANYmal21,22 can precisely and energy-efficiently follow high-level body velocity commands, run faster than ever, and recover from falls even in complex configurations. Extending this approach, Miki et al.3 developed a controller that integrates exteroceptive and proprioceptive perception for legged locomotion and completed an hour-long hike in the Alps in the time recommended for human hikers. However, the mechanical design of the ANYmal robot is thought to limit it from running at higher speeds. Margolis et al.4 present an end-to-end learned controller that achieves record agility for the MIT Mini Cheetah, sustaining speeds up to 3.9 m/s on flat ground and 3.4 m/s on grassland. Choi et al.23 demonstrated high-speed locomotion on deformable terrain: the robot could run on soft beach sand at 3.03 m/s even though its feet were completely buried in the sand during the stance phase. Yang et al.24 demonstrated multi-skill locomotion on a real quadruped robot that autonomously performed coherent trotting, steering, and fall recovery. A design guideline25 was proposed for selecting key states for initialization and showed that the learned fall recovery policies are hardware-feasible and can be implemented on real robots.

Fall recovery control

Previously, fall recovery controllers for legged robots were heuristically handcrafted to produce trajectories similar to human or animal fall recovery maneuvers, which required extensive manual design26,27. Offline planning methods automate the process by predicting falls28 or computing trajectories offline for specific fall postures29. Such offline planning is not event-based and therefore lacks the real-time responsiveness needed to react to external disturbances. Optimization-based methods can compute feasible fall recovery solutions without handcrafting trajectories directly30,31. DRL has been used to learn fall recovery for humanoid character animation in physics simulation32. Compared to previous work, our proposed controller not only realizes high-speed, agile motion on uneven terrain, but also adaptively handles unexpected events and external disturbances.

CPG-based methods

Inspired by neuroscience, central pattern generators (CPGs) are another promising approach for improving legged robot locomotion33,34. AI-CPG35 trains a CPG-like controller through imitation learning to generate rhythmic feedforward activity patterns, and RL is then applied to form a reflex neural network that adjusts the feedforward patterns based on sensory feedback, enabling stable body balancing that adapts to changes in the environment or target velocity. Using CPGs, robots can achieve more natural and stable movements, similar to those of living organisms35,36. However, CPGs limit the diversity of the controller’s motion patterns, which is essential for agile and adaptive locomotion. As this work demonstrates, when gait or joint motion patterns are not explicitly specified, end-to-end RL can autonomously learn unique movement patterns to cope with unexpected situations.

Curriculum learning

Prior works have shown that a curriculum over environments can significantly improve learning performance and efficiency in reinforcement learning37. Despite these advantages, previous curriculum learning methods had to be designed manually38: the curriculum must be frequently modified according to training results, and diverse curricula must be tested to find a suitable one. This trial-and-error approach does not guarantee robust performance, and finding an effective curriculum is difficult. Thus, many researchers have proposed automatic curriculum learning (ACL) methods that design curricula without human intervention39. Ji et al.40 used a fixed-schedule curriculum on forward linear velocity only. A grid adaptive curriculum update rule was proposed4 to track a more extensive range of velocities. A curriculum on terrains3,41 was applied to learn highly robust walking controllers on non-flat ground. Nahrendra et al.42 utilized a game-inspired curriculum to ensure progressive locomotion policy learning over rugged terrains. Compared to previous work, our approach adjusts commands, reward coefficients, and environment difficulty in stages, allowing finer control of curriculum difficulty. The combination of curriculum learning and HER further enhances learning efficiency.

Hindsight experience replay

Hindsight Experience Replay (HER) has paved a promising path toward increasing the efficiency of goal-conditioned tasks with sparse rewards9. Several variants have been proposed to enhance HER, such as curriculum learning to maximize the diversity of the achieved goals43, providing demonstrations44, generating more valuable desired goals45, and curiosity-driven exploration46. Legged locomotion tasks can be viewed as a particular variant of goal-conditioned tasks due to the presence of control commands. This work adapts HER to legged locomotion tasks and proves its effectiveness.

Methods

Our goal is to learn a policy that takes sensory data and velocity commands as input and outputs desired joint positions. Symbols used in the paper are listed in Table S1. The command \(\textbf{c}^{cmd}_t\) includes the linear velocity \(\textbf{v}^{cmd}_t\) and its yaw direction \(\textbf{d}^{cmd}_t\). The policy is trained in simulation and then deployed zero-shot to the real robot. The controller is trained via privileged learning, which consists of two stages:

First, a teacher policy \(\pi _{\theta }^{teacher}\) with parameters \(\theta\) is trained via reinforcement learning with full access to privileged information, including the ground-truth state of the environment. The proposed method, Curricular Hindsight Reinforcement Learning (CHRL), operates in this stage and improves the performance of the teacher policy.

Second, a student policy \(\pi _{\phi }^{student}\) is trained via imitation learning to predict the teacher’s optimal action given only partial and noisy observations of the environment. Then, the student policy is deployed on the robot without any fine-tuning.

Hardware and training environment

The robot used in this work stands 35 cm tall and weighs 15 kg. It has 18 degrees of freedom (DoFs): 12 actuated joints, each capable of delivering a maximum torque of 33.5 Nm, and six generalized coordinates for the floating base. The robot is equipped with an Inertial Measurement Unit (IMU) and joint encoders.

The robot is equipped with an ARM-based STM32 microcontroller that sends commands to the joint drives, receives sensor readings, and performs simple computations. However, this microcontroller cannot run the deep neural network controller at a sufficient frequency. The robot was therefore additionally fitted with an NVIDIA Jetson Orin development kit to provide additional computational power. The Orin development kit is equipped with a 2048-core GPU that uses CUDA to accelerate neural network inference. Our neural network controller runs at 66 Hz on the Orin development kit.

We use PyBullet47 as the simulator to build the training environment. A procedural terrain generation system is developed to generate diverse sets of trajectories. Four parallel agents collect five million simulated time steps for policy training, which takes less than 6 hours of wall-clock time on a single NVIDIA RTX 3090 GPU.

Control architecture

Action space

The action \(\textbf{a}_t\) is a 12-dimensional desired joint position vector. A PD controller is used to compute the torque \(\varvec{\tau } = Kp(\hat{\textbf{q}} - \textbf{q})+Kd(\hat{\dot{\textbf{q}}} - \dot{\textbf{q}})\), where Kp and Kd are manually specified gains set to 27.5 and 0.5, respectively. The target joint velocities \(\hat{\dot{\textbf{q}}}\) are set to 0. The PD controller outperforms a torque controller in training speed and final control performance48. Although a bi-directional mapping always exists between the two, the PD controller has an advantage in training because it starts as a stable controller and the individual joints do not easily swing to their limit positions, whereas a torque controller can easily drive the joints into their limits early in training.
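For reference, a minimal sketch of this joint-level PD law, assuming NumPy arrays for the joint states; the gains and the zero target velocity follow the values above:

```python
import numpy as np

KP, KD = 27.5, 0.5  # position and velocity gains quoted above

def pd_torque(q_des, q, dq, dq_des=None):
    """Joint-level PD law: tau = Kp*(q_des - q) + Kd*(dq_des - dq).

    q_des : desired joint positions output by the policy, shape (12,)
    q, dq : measured joint positions and velocities, shape (12,)
    dq_des: target joint velocities; zero by default, as in the text.
    """
    if dq_des is None:
        dq_des = np.zeros_like(q)
    return KP * (q_des - q) + KD * (dq_des - dq)
```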

Student observation space

The robot’s joint encoders provide joint angles \(\textbf{q}_{t} \in \mathbb {R}^{12}\) and velocities \(\dot{\textbf{q}}_{t} \in \mathbb {R}^{12}\). \(\textbf{g}_{t}^{\text{ ori }} \in \mathbb {R}^{3}\) and \(\omega _{t}^{\text{ ori }} \in \mathbb {R}^{3}\) denote the orientation and angular velocities measured using the IMU. \(\pi _{\phi }^{student}\) takes as input a history of previous observations and actions denoted by \(\textbf{o}_{t-H:t}\) where \(\textbf{o}_{t}=\left[ \textbf{q}_{t}, \dot{\textbf{q}}_{t}, \textbf{g}_{t}^{\text{ ori }}, \omega _{t}^{\text{ ori }}, \textbf{a}_{t-1}\right]\). The input to \(\pi _{\phi }^{student}\) is \(\textbf{x}_{t}= (\textbf{o}_{t-H:t}, \textbf{c}_{t})\), where \(\textbf{c}_{t}\) is specified by a human operator through remote control during deployment, \(H=100\).
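As an illustration, the per-step observation and its history stack might be assembled as in the following sketch; the array dimensions and the buffer class are assumptions, while the history length H = 100 matches the text:

```python
from collections import deque
import numpy as np

H = 100  # history length for the student policy

class ObservationBuffer:
    """Stacks the last H per-step observations o_t = [q, dq, g_ori, w_ori, a_prev]."""

    def __init__(self, obs_dim=12 + 12 + 3 + 3 + 12):
        self.obs_dim = obs_dim
        self.history = deque([np.zeros(obs_dim)] * H, maxlen=H)

    def append(self, q, dq, g_ori, w_ori, a_prev):
        o_t = np.concatenate([q, dq, g_ori, w_ori, a_prev])
        self.history.append(o_t)

    def student_input(self, command):
        # x_t = (o_{t-H:t}, c_t): flattened history plus the operator command.
        return np.concatenate([np.concatenate(list(self.history)), command])
```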

Teacher observation space

The teacher observation is defined as \(\textbf{s}_{t} = (\textbf{o}_{t-H:t}, \textbf{c}_{t}, \textbf{p}_{t-H:t})\), where \(H=4\). \(\textbf{p}_t\) is the privileged state which contains the body velocity \(\textbf{v}_{t} \in \mathbb {R}^{3}\), the binary foot contact indicator vector \(\textbf{f}^{t} \in \mathbb {R}^{4}\), the relative position in the world frame \(\textbf{p}_{t} \in \mathbb {R}^{3}\), friction coefficient of feet, and payload mass.

Reward function

The reward function encourages the agent to move forward while tracking the desired commands and penalizes jerky and inefficient motions. We denote joint torques as \(\varvec{\tau }_{t} \in \mathbb {R}^{12}\), the acceleration of the base expressed in the base frame as \(\ddot{a}_{t} \in \mathbb {R}^{3}\), the velocity of the feet as \(\textbf{v}_{t}^{f} \in \mathbb {R}^{4}\), and the total mechanical power of the robot as \(\mathbb {W}_{t}\). Building on previous work49, the reward contains task terms for linear velocity and orientation tracking, as well as auxiliary terms for stability (a height constraint and angular velocity penalties on body roll and pitch), safety (penalties for self-collision), smoothness (foot-slip, joint torque, and body-acceleration penalties), and energy (mechanical power at the current time step). The smoothness, safety, and energy terms encourage the agent to learn a stable and natural gait. The reward at time t is defined as the sum of the following quantities:

  • Linear Velocity: \(\exp \left\{ -0.5\left( \textbf{v}_{t}^{cmd}-\textbf{v}^{x}_{t}\right) ^{2}\right\}\)

  • Linear Velocity Penalties: \(- 0.4\textbf{v}^{y}_{t} - 0.4\textbf{v}^{z}_{t}\)

  • Orientation Tracking: \(\exp \left\{ -0.5\left( \textbf{d}_{t}^{cmd}-\textbf{g}_{t}^{\textrm{ori}}\right) ^{2}\right\}\)

  • Height Constraint: \(-|\textbf{p}^{z}_{t}-\textbf{p}^{\textrm{target}}_{t} |\)

  • Angle Velocity Penalties: \(-\left\| \omega _{t}^{\text{ ori }} \right\| ^{2}\)

  • Self-collision Penalties: \(-\textbf{1}_{\textrm{selfcollision}}\)

  • Joint Torque Penalties: \(-\left\| \varvec{\tau }_{t} \right\| ^{2}\)

  • Base Acceleration Penalties: \(-\left\| \ddot{a}_{t} \right\| ^{2}\)

  • Energy: \(-\mathbb {W}_{t}\)

  • Foot Slip: \(-\left\| \operatorname {diag}\left( \textbf{f}^{t}\right) \cdot \textbf{v}_{t}^{f}\right\| ^{2}\)

The scaling factors of the terms, in the order listed above, are 3.0, 1.0, 3.0, 10, 0.21, 2.0, 0.018, 0.1, 0.012, and 0.3.
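A minimal sketch of the resulting weighted sum is given below. The term names and the robot-state container are assumptions, the orientation term is reduced to yaw for brevity, and the lateral/vertical velocity penalty is interpreted as penalizing magnitudes; the weights follow the order of the list above:

```python
import numpy as np

# Scaling factors in the order the reward terms are listed above.
WEIGHTS = dict(lin_vel=3.0, lin_vel_pen=1.0, orientation=3.0, height=10.0,
               ang_vel=0.21, self_collision=2.0, torque=0.018,
               base_acc=0.1, energy=0.012, foot_slip=0.3)

def total_reward(s, cmd):
    """s: assumed robot-state object; cmd: (v_cmd, d_cmd)."""
    v_cmd, d_cmd = cmd
    terms = dict(
        lin_vel=np.exp(-0.5 * (v_cmd - s.v[0]) ** 2),
        lin_vel_pen=-0.4 * abs(s.v[1]) - 0.4 * abs(s.v[2]),   # magnitudes assumed
        orientation=np.exp(-0.5 * (d_cmd - s.yaw) ** 2),
        height=-abs(s.base_height - s.target_height),
        ang_vel=-np.sum(s.w_ori ** 2),
        self_collision=-float(s.self_collision),
        torque=-np.sum(s.tau ** 2),
        base_acc=-np.sum(s.base_acc ** 2),
        energy=-s.mech_power,
        foot_slip=-np.sum((s.foot_contact * s.foot_vel) ** 2),
    )
    return sum(WEIGHTS[k] * terms[k] for k in terms)
```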

Policy optimization

Teacher policy

The teacher policy \(\pi _{\theta }^{teacher}\) consists of two MLP components: a state encoder \(g_{\theta _{e}}\) and the main network \(\pi _{\theta _{m}}\), such that \(\textbf{a}_{t}=\pi _{\theta _{m}}\left( \textbf{z}_{t}\right)\), where \(\textbf{z}_{t}=g_{\theta _{e}}\left( \textbf{s}_{t}\right)\) is a latent representation. Each module is parameterized as a neural network with 256 hidden units and rectified linear units (ReLU)50 between layers. We optimize the teacher parameters jointly using REDQ51.
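A sketch of this two-part architecture in PyTorch; the layer count per module and the latent dimension are assumptions, while the 256-unit hidden size and ReLU activations follow the text:

```python
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    """State encoder g_theta_e followed by the main network pi_theta_m."""

    def __init__(self, state_dim, latent_dim=64, action_dim=12, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim), nn.ReLU(),
        )
        self.main = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s_t):
        z_t = self.encoder(s_t)       # latent representation z_t = g(s_t)
        return self.main(z_t), z_t    # action a_t = pi(z_t); z_t is reused for distillation
```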

Student policy

We use the same training environment as for the teacher policy, but add noise to the student observation, \(\textbf{o}_{t}^{noise} = n(\textbf{o}_{t})\), where \(n(\cdot)\) is a Gaussian noise model applied to the observation. The student policy uses a temporal convolutional network (TCN)6 encoder \(g_{\phi _{e}}\) to handle the Partially Observable Markov Decision Process (POMDP). The student action is \(\hat{\textbf{a}}_{t} = \pi _{\phi _{m}}(\hat{\textbf{z}}_{t})\), where \(\hat{\textbf{z}}_{t}=g_{\phi _{e}}\left( \textbf{o}_{t}^{noise}\right)\). The student policy is trained via supervised learning. The loss function is defined as

$$\begin{aligned}\mathscr {L}=(\hat{\textbf{a}}_{t} - \textbf{a}_{t})^{2} + (\hat{\textbf{z}}_{t} - \textbf{z}_{t})^{2}. \end{aligned}$$
(1)

We employ the dataset aggregation strategy (DAgger)52. Training data are generated by rolling out trajectories with the student policy. The weights of all networks are initialized with Kaiming uniform initialization53, and the biases are zero-initialized. All parameters are updated by the Adam optimizer54 with a fixed learning rate of \(3 \times 10^{-4}\).
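A condensed sketch of one DAgger-style supervised step implementing Eq. (1), assuming a student module that mirrors the teacher sketch above (returning action and latent), a buffer of student-rollout states, and a placeholder noise scale; an optimizer such as `torch.optim.Adam(student.parameters(), lr=3e-4)` matches the learning rate in the text:

```python
import torch

def student_update(student, teacher, batch_obs_hist, batch_priv_state,
                   optimizer, noise_std=0.01):
    """One supervised step of Eq. (1): match the teacher's action and latent."""
    with torch.no_grad():
        a_teacher, z_teacher = teacher(batch_priv_state)       # distillation targets
    noisy_obs = batch_obs_hist + noise_std * torch.randn_like(batch_obs_hist)
    a_student, z_student = student(noisy_obs)                  # TCN encoder inside
    loss = ((a_student - a_teacher) ** 2).mean() + \
           ((z_student - z_teacher) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```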

Domain randomization

Domain randomization encourages the policy to learn a single behavior that works across all the randomized parameters, helping to cross the sim-to-real gap. We apply external forces and torques to the robot’s body, introduce slippage by setting the friction coefficients of the feet to a low value, and randomize the robot’s dynamics parameters3,55. Before each training episode, we randomly sample a set of physical parameters (Table 2) to initialize the simulation.

Since ___domain randomization trades optimality for robustness55, the parameters and their ranges in Table 2 must be chosen carefully to prevent learning overly conservative gaits. Robot mass and joint friction were measured during the design of the robot, giving tight ranges, whereas the rotational inertia is less certain because it was estimated with CAD software. In addition, some of the dynamic variables change over time, such as motor friction due to wear, control steps and delays that fluctuate because of the non-real-time nature of the system, and battery voltage that varies with the state of charge. For these reasons, we randomize these parameters within ranges based on actual measurements plus a small safety factor. The noise level of the sensors (e.g., the IMU and joint encoders) is obtained from statistics of real sensor data.
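As an illustration of the episode-level sampling described above, a sketch is given below; the parameter names and ranges are placeholders standing in for Table 2, not the values used in the paper:

```python
import numpy as np

# Placeholder names and ranges standing in for Table 2.
RANDOMIZATION_RANGES = {
    "foot_friction":     (0.4, 1.0),
    "payload_mass_kg":   (0.0, 3.0),
    "motor_friction":    (0.0, 0.05),
    "control_latency_s": (0.0, 0.02),
    "battery_voltage_v": (22.0, 25.2),
}

def sample_episode_params(rng):
    """Draw one set of physical parameters before each training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

# Usage: params = sample_episode_params(np.random.default_rng()); env.reset(**params)
```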

Automatic curriculum strategy

Similar to previous work, our approach implements a training curriculum that progressively modifies the distribution of environmental parameters, thus enabling policies to improve motor skills and continuously adapt to new environments.

Some works use a curriculum in which the commands are updated on a fixed schedule \(\mathbb {C}^{k}\), a function of the timing variable k. The schedule \(\mathbb {C}^{k}\) consists of two parts: the distribution \(p_{x}^{k}\) of the random variable x (such as ___domain randomization parameters and commands) and the value \(r_{c}^{k}\) of the curriculum coefficient c (such as reward factors).

The update rule f takes the form \(\mathbb {C}^{k+1} \longleftarrow f(\mathbb {C}^{k})\). However, a fixed schedule requires manual tuning: if the environment, rewards, or learning settings are modified, the schedule will likely need to be re-tuned, which is costly in time. Rather than advancing the curriculum on a fixed schedule, we update it automatically based on command-tracking performance.

Unlike previous works, our approach is not limited to commands or terrain; it can exploit more environmental parameters, such as reward coefficients and ___domain randomization parameters, to control task difficulty at a finer granularity. Table 1 lists the parameters used in our experiments.

Table 1 Range of curriculum parameters.
Table 2 Ranges of the randomized parameters.

In this work, we apply a tabular curriculum update rule. First, we manually set the number of curriculum stages N and uniformly split the curriculum parameters (Table 1) from their start to end values into \(p_{x}^{1} \dots p_{x}^{N}\) and \(r_{c}^{1} \dots r_{c}^{N}\). At episode k, the curriculum parameters for the agent and environment are sampled from the distribution \(p_{x}^{n}\), and the reward factors are set to \(r_{c}^{n}\). If the agent succeeds in this region of the curriculum space, the difficulty is increased by advancing the tabular curriculum from \(p_{x}^{n}\) and \(r_{c}^{n}\) to \(p_{x}^{n+1}\) and \(r_{c}^{n+1}\):

$$\begin{aligned}p_{x}^{n}, r_{c}^{n} \leftarrow \left\{ \begin{array}{ll} p_{x}^{n+1}, r_{c}^{n+1} & \epsilon _{k}\left[ \textbf{v}^{\textrm{cmd}}\right]< \epsilon _{0} \text { and } \epsilon _{k}\left[ \textbf{d}^{\textrm{cmd}}\right] < \epsilon _{1} \\ p_{x}^{n}, r_{c}^{n} & \text{ otherwise } \end{array}\right. \end{aligned}$$
(2)

where

$$\begin{aligned}\epsilon _{k}\left[ \textbf{v}^{\textrm{cmd}}\right]&=\mathbb {E}_{\textbf{v}_{t}^{\textrm{cmd}} , \varvec{d}_{t}^{\textrm{cmd}}} \sqrt{\mathbb {E}_{t}\left( \textbf{v}_{t}^{\textrm{cmd}}-\textbf{v}_{t}^{x}\right) ^{2}}, \\ \epsilon _{k}\left[ \textbf{d}^{\textrm{cmd}}\right]&=\mathbb {E}_{\textbf{v}_{t}^{\textrm{cmd}} , \varvec{d}_{t}^{\textrm{cmd}}} \sqrt{\mathbb {E}_{t}\left( \textbf{d}_{t}^{\textrm{cmd}}-\textbf{g}_{t}^{\textrm{ori}}\right) ^{2}}, \end{aligned}$$
(3)

\(\textbf{v}_{t}^{\textrm{cmd}}\) and \(\textbf{d}_{t}^{\textrm{cmd}}\) are the commanded linear velocity and yaw direction at time t. The average tracking errors \(\epsilon _{k}\left[ \textbf{v}^{\textrm{cmd}}\right]\) and \(\epsilon _{k}\left[ \textbf{d}^{\textrm{cmd}}\right]\) over trials with the current policy are the main evaluation metrics.

Measuring these metrics individually does not fully reflect the performance of the controller, so we need a criterion that accounts for both commands. We therefore use a composite criterion that captures the range of commands the controller can execute within a maximum error tolerance:

$$\begin{aligned}\epsilon _{k}\left[ \textbf{v}^{\textrm{cmd}}\right]< \epsilon _{0} \text { and } \epsilon _{k}\left[ \textbf{d}^{\textrm{cmd}}\right] < \epsilon _{1} \end{aligned}$$
(4)

The proposed automatic curriculum strategy cyclically evaluates the learned policy and automatically adjusts curriculum difficulty based on the evaluation results. The evaluation procedure is initiated for every 100,000 samples collected by the agent. Ten evaluation processes run in parallel, each lasting 10 seconds and randomly sampling new commands every 2 seconds. In all experiments, we set \(\epsilon _{0} = 0.15\), \(\epsilon _{1} = 0.25\), and the number of curriculum stages \(N=10\).
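A sketch of this tabular update rule (Eqs. 2-4) is shown below, assuming the per-stage parameter sets have already been pre-computed by linearly interpolating the Table 1 ranges from start to end; the stage representation is an assumption:

```python
import numpy as np

class TabularCurriculum:
    """Holds N pre-computed stages and advances when both tracking errors pass."""

    def __init__(self, stages, eps0=0.15, eps1=0.25):
        self.stages = stages          # list of dicts: command ranges + reward factors
        self.n = 0                    # current stage index
        self.eps0, self.eps1 = eps0, eps1

    def current(self):
        return self.stages[self.n]

    def maybe_advance(self, vel_errors, yaw_errors):
        """vel_errors / yaw_errors: per-trial RMS tracking errors collected by the
        periodic evaluation (run every 100,000 collected samples)."""
        eps_v = float(np.mean(vel_errors))
        eps_d = float(np.mean(yaw_errors))
        if eps_v < self.eps0 and eps_d < self.eps1 and self.n < len(self.stages) - 1:
            self.n += 1               # increase difficulty: p^n, r^n -> p^{n+1}, r^{n+1}
        return self.n
```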

Hindsight experience replay for legged locomotion

Algorithm 1. Curricular Hindsight Reinforcement Learning (pseudocode).

Hindsight Experience Replay (HER)9 allows an algorithm to reuse existing samples and can be combined with any off-policy RL algorithm. However, the original HER is only suited to goal-conditioned tasks with sparse, binary rewards. We make several modifications to HER to accommodate legged locomotion tasks.

First, we sample a new goal for each transition during the training phase instead of choosing a different goal per trajectory during the storing phase. Second, because rewards must be recalculated, more information about the state of the robot (\(\mathbb {W}, \ddot{a}, \omega ^{\text{ ori }}, \tau , \textbf{f}, \textbf{v}, \textbf{g}^{\textrm{ori}}\)) is added to each transition. The transition \(\textbf{T}_t\) then becomes

$$\begin{aligned}\textbf{T}_t&= \left( \textbf{s}_t, \textbf{a}_t, r_{t}, \textbf{s}_{t+1}, \textbf{c}_{t}^{cmd}, \textbf{s}^{robot}_{t+1} \right) , \\ \textbf{s}^{robot}_{t+1}&\doteq \{\mathbb {W}_{t+1}, \ddot{a}_{t+1}, \omega _{t+1}^{\text{ ori }}, \tau _{t+1}, \textbf{f}^{t+1}, \textbf{v}_{t+1}, \textbf{g}_{t+1}^{\textrm{ori}}\} \end{aligned}$$
(5)

Third, the original HER samples the goal achieved in the episode’s final state. This is inefficient because, in locomotion tasks, policies for similar commands tend to be similar. For a transition with command \(\textbf{c}_{t}^{cmd}\), consisting of the linear velocity \(\textbf{v}^{cmd}_t\) and its yaw direction \(\textbf{d}^{cmd}_t\), we sample a new command \(\textbf{c}_{new}^{cmd}\) from a neighborhood of the original command:

$$\begin{aligned}\textbf{c}_{new}^{cmd}&\doteq \{\textbf{v}^{cmd}_{new}, \textbf{d}^{cmd}_{new}\}, \\ \textbf{v}^{cmd}_{new}&\sim \mathscr {U}[\textbf{v}^{cmd}_t - 0.3, \textbf{v}^{cmd}_t + 0.3], \\ \textbf{d}^{cmd}_{new}&\sim \mathscr {U}[\textbf{d}^{cmd}_t - 0.5, \textbf{d}^{cmd}_t + 0.5]. \end{aligned}$$
(6)

All newly sampled commands are clipped to the range of the current curriculum. Then \(r_t\) is recalculated from \(\textbf{s}^{robot}_{t+1}\) according to Section “Control architecture”.
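A sketch of this relabeling step applied to one sampled transition is given below; the transition layout, the reward function, and the curriculum-range interface are assumptions that stand in for the components described above:

```python
import numpy as np

def relabel_transition(transition, reward_fn, cmd_range, rng):
    """Resample the command in a neighborhood of the stored one and recompute r_t.

    transition: dict with keys 'cmd' = (v_cmd, d_cmd) and 'robot_state' = s^robot_{t+1}
    reward_fn : recomputes the reward from the stored robot state and a command
    cmd_range : ((v_min, v_max), (d_min, d_max)) of the current curriculum stage
    """
    v_cmd, d_cmd = transition["cmd"]
    v_new = np.clip(rng.uniform(v_cmd - 0.3, v_cmd + 0.3), *cmd_range[0])
    d_new = np.clip(rng.uniform(d_cmd - 0.5, d_cmd + 0.5), *cmd_range[1])
    new_cmd = (v_new, d_new)
    return {**transition, "cmd": new_cmd,
            "reward": reward_fn(transition["robot_state"], new_cmd)}
```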

This approach may be seen as a form of data augmentation, as samples generated by the policy for a specific command are shared with neighboring commands. Since commands are usually given manually or by a high-level planner, they are not strongly correlated with the current state of the robot, so a small random resampling of the current command does not bias the training samples. Because commands are closely tied to the rewards the agent receives, resampling commands and recomputing the rewards lets the agent learn whether the current action is harmful or beneficial for neighboring commands, which alleviates the problem of sparse rewards in large-scale command tracking tasks4.

Combined with the automatic curriculum strategy of Section “Automatic curriculum strategy”, we obtain Curricular Hindsight Reinforcement Learning (CHRL). The pseudocode for CHRL is shown in Algorithm 1. While the agent interacts with the environment, the command tracking accuracy is periodically evaluated to obtain \(\epsilon _{k}\left[ \textbf{v}^{\textrm{cmd}}\right]\) and \(\epsilon _{k}\left[ \textbf{d}^{\textrm{cmd}}\right]\) according to (3); the curriculum schedule \(\mathbb {C}\) and the environment \(\mathbb {E}\) are then updated according to the update rule (2). In the policy training phase, commands are resampled, \(\textbf{c}_{j}^{cmd} \longleftarrow \textbf{c}_{new}^{cmd}\), according to (6), and the reward \(r_t\) is recalculated. Finally, the reinforcement learning algorithm \(\mathbb {A}\) uses the updated rewards and commands to compute the gradient and optimize the policy and critics.
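Putting the pieces together, the overall loop of Algorithm 1 might look like the sketch below. It builds on the curriculum and relabeling sketches above; the environment, agent, buffer, and `evaluate_tracking` interfaces are assumptions, and the off-policy update stands in for REDQ as used in the paper:

```python
def chrl_training_loop(env, agent, buffer, curriculum, reward_fn,
                       total_steps=5_000_000, eval_every=100_000, rng=None):
    """High-level CHRL loop: collect data, periodically evaluate and advance the
    curriculum, and relabel commands (HER) before each gradient step."""
    obs = env.reset(**curriculum.current())
    for step in range(1, total_steps + 1):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, info["cmd"], info["robot_state"])
        obs = env.reset(**curriculum.current()) if done else next_obs

        if step % eval_every == 0:
            # Eq. (3) tracking errors from a separate evaluation run (assumed helper).
            vel_err, yaw_err = evaluate_tracking(agent, env)
            curriculum.maybe_advance(vel_err, yaw_err)          # update rule, Eq. (2)

        # Hindsight relabeling of the sampled batch, Eq. (6), then an off-policy update.
        batch = [relabel_transition(t, reward_fn, curriculum.current()["cmd_range"], rng)
                 for t in buffer.sample()]
        agent.update(batch)
```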

Results

Command tracking task test

First, the control performance was evaluated in forward running under random commands in simulation. In our experiments, we resampled commands and sent them to the robot with a probability of 1/150 per step (every 2.25 seconds on average) and resampled environmental variables with the same probability. Figure 1 shows the command tracking accuracy of the policy in simulation.

The robot can move steadily and track the desired direction well on rough terrain. Even if the velocity command changes during direction correction, the controller tracks both commands well. Note that the observed velocity oscillation around the commanded velocity is a well-known phenomenon in legged systems, including humans56. In terms of average speed, the learned policy has an error of less than 5% on the simulated robot.

We also performed experiments with the real robot, including speed tracking tests at 0.8 m/s indoors and up to a maximum of 3.5 m/s outdoors. The outdoor terrain presented multiple challenges not present indoors, including variations in ground height, friction, and terrain deformation. With these variations, the robot must actuate its joints differently to achieve high velocities than on flat, rigid, high-friction terrain such as a treadmill or paved road. We estimated the robot’s locomotion speed by measuring the time it took to pass through a 5-meter-long section of road and performed multiple sets of repeated tests. In the indoor 0.8 m/s test, we recorded a 5 m walking time of 6.17 s, i.e., an average speed of 0.81 m/s. Outdoors, we recorded a 5-meter sprint time of 1.45 s, i.e., an average speed of 3.45 m/s. The results show that the robot tracks speed commands consistently and accurately in unknown scenarios, both indoors and outdoors.

As shown in Fig. 2, we evaluated the yaw tracking of the controller in an outdoor grass environment. In the test, the robot’s desired direction and velocity commands were sent at random by the human operator. We observed a maximum yaw rate of 3.2 rad/s, followed by a safe stop. Even during continuous turning at larger velocity commands (3 m/s), the robot remained stable, demonstrating the compliant interaction and robustness learned by the agent.

The robot could track commands robustly in different ground conditions with different hardness, friction, and obstacles (Fig. 2). The learned motor skills were stable across the different ground conditions, and the robot continued to trot steadily in all three conditions. The trained policy exhibited compliant interaction behaviors to handle physical interactions and impacts. In testing, when a large foot slip occurred, the robot could recover quickly, even running near maximum speed. If the command was suddenly set to zero during operation, the robot assumed a stable posture and quickly stopped moving.

Fig. 1. Command tracking task test results in simulation.

Fig. 2. Real robot deployment experiments. (a) Running on grassland. (b) Walking on wet dirt ground. (c) Indoor walking through different ground conditions.

Fig. 3. Indoor walking with an unexpected disturbance. The robot was suddenly kicked while walking indoors, causing it to lose its balance. The controller moved the legs to support the body and restored balance within one second.

Fig. 4. Outdoor running accident. The robot was running on grassland at a commanded 3 m/s when it stepped into an unseen deep pit, causing it to trip and fall. The controller used the forward inertia to roll the body and placed the legs in an impact-resistant stance to protect the robot, followed by a quick return to running.

Fig. 5. Outdoor fall recovery test. In the test, we deliberately drove the robot into a tree on dirt ground; blocked by the tree, the robot could not move forward and fell. The robot actively contacted the ground with its legs to support its body and hold it upright.

Fall recovery and response to unexpected disturbances

Due to the uncertainties of unforeseen situations, locomotion failures are likely to occur. We illustrate these challenges with field tests (Figs. 4 and 5) and with adaptive behaviors that are robust to uncertainty (Fig. 3).

During the outdoor tests, the robot experienced many unexpected accidents that resulted in unintended contact between its body and the environment. In the indoor tests, we actively applied external disturbances to the robot while it was walking to destabilize its movement and observe its reaction after losing balance. Typically, robots fall within a second of losing their balance, and the window of time to prevent a fall is about 0.2–0.5 s. The proposed controller exhibited different adaptive behaviors in these unexpected scenarios, generating dynamic locomotion and complex leg coordination for immediate recovery from failures.

In these unexpected scenarios, our robot autonomously coordinates different locomotion patterns to mitigate disturbances and to prevent or recover from failures without human assistance. These behaviors closely resemble those of biological systems (such as cats, dogs, and humans), demonstrating greater versatility and agency: the ability to deal with constantly changing and complex situations.

We classified learned response behaviors into three strategies:

  • Natural rolling using semi-passive motion (Fig. 4). Natural rolling is the behavior of a robot that uses its inertia and gravity to tumble.

  • Active righting and tumbling (Fig. 5). Active righting is a policy in which the robot actively uses its legs and elbows to propel itself and generate momentum to flip into a prone position.

  • Stepping. Fig. 3 shows an example of stepping, where coordination and switching of the support legs are required to regain balance when an external disturbance destabilizes the current motion. This multi-contact switching emerges naturally with the learning-based policy.

Compared to a manually designed fall recovery controller with a fixed pattern, our learned controller can recover from various fall situations by responding to dynamic changes using online feedback. In contrast, a manual controller can only cope with a narrow range of situations.

Fig. 6. Two-dimensional t-SNE embedding of the representations in the last hidden layer of the student networks in simulation. (a) Embedding under different commands. (b) Embedding under different robot payloads (kg) and ground friction.

Analysis of skill adaptation

We analyzed the features learned by the policies using t-distributed Stochastic Neighbour Embedding (t-SNE) in simulation, to investigate how skills are adapted and distributed in the network. t-SNE is a dimensionality reduction technique used to embed high-dimensional data into a low-dimensional space for visualization. Similar features in the output of the student network appear with high probability in the same neighborhood of the clustered points (Fig. 6), and vice versa.

A two-dimensional t-SNE projection of the student network features visualizes the neighborhoods and clusters of the samples. In Fig. 6a, the t-SNE analysis of the features under different commands shows that the agent learns distinct skills and patterns for different commands, revealing a diversity of skills after curriculum training. If the commands are very similar, the network reuses certain patterns and features to some extent, but also fine-tunes for these minor differences.

We also use t-SNE to compare policy features under different payloads and friction coefficients. As shown in Fig. 6b, the features under different environmental variables are well separated, meaning that the student policy can clearly recognize subtle environmental changes and respond appropriately.

Fig. 7. (a) Learning curves for different curriculum strategies. The horizontal axis indicates the number of time steps; the vertical axis shows the average reward per time step within an episode. The shaded areas denote one standard deviation over four runs. (b–d) The measured torques of the right leg during forward running (around 3.0 m/s) for different curriculum strategies.

Fig. 8. (a) Velocity tracking test results averaged over four runs. The vertical axis shows the velocity tracking error as a percentage of the commanded velocity; the horizontal axis shows the velocity command received by the robot. (b) CoT test results averaged over four runs. The vertical axis uses a logarithmic scale to more clearly show the variation in CoT.

Comparative evaluation

We compare the performance with the following baselines in simulation:

  • The grid adaptive curriculum from Margolis et al.4, which increases the difficulty by adding neighboring regions to the command sampling distribution.

  • An adaptive curriculum from Miki et al.3, which adjusts terrain difficulty using an adaptive method and changes elements such as reward or applied disturbances using a logistic function.

CHRL consists of curriculum learning and HER, but their individual contributions to controller performance remain unclear. We therefore also ablate these components and compare their performance.

Figure 7a shows the learning curves. Compared to a policy trained without a curriculum, every curriculum learning strategy significantly improves learning efficiency and performance. CHRL consistently performs better and learns significantly faster than the other baselines. The measured torques of each joint while the robot ran at an average speed closest to 3 m/s are shown in Fig. 7b–d. CHRL produces smoother torque variations, which contributes to its higher reward.

In Fig. 8a, we find that the velocity tracking error of the policy increases dramatically for velocity commands above 3 m/s without CHRL. This suggests that the curriculum is crucial for learning high-speed locomotion. The other baselines can track high-speed commands stably, but have larger tracking errors than CHRL for low-speed commands between 1.0 and 2.0 m/s.

Figure 8b presents the CoT versus average velocity for the baselines. The dimensionless cost of transport (CoT)6 is computed to compare the efficiency of the controllers. The mechanical CoT is defined as \(\sum _{12 \text{ actuators } }[\varvec{\tau } \dot{\textbf{q}}]^{+} /(\textbf{m} g \textbf{v})\), where \(\textbf{m} g\) is the total weight. CHRL recorded slightly lower CoTs than the controllers trained with the baselines. The presented controller is more energy efficient than the RL controller for ANYmal6, with a log mechanical CoT of about \(-0.4\).
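For completeness, the mechanical CoT above can be computed as in this sketch (sum of positive joint power divided by weight times forward speed); the array shapes are assumptions:

```python
import numpy as np

def mechanical_cot(tau, dq, mass, speed, g=9.81):
    """Mechanical cost of transport: sum of positive joint power / (m * g * v).

    tau, dq: (T, 12) joint torques and velocities recorded over a run
    mass   : total robot mass in kg; speed: average forward speed in m/s
    """
    positive_power = np.clip(tau * dq, 0.0, None).sum(axis=1)  # per-step, 12 actuators
    return positive_power.mean() / (mass * g * speed)
```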

Conclusions

We propose Curricular Hindsight Reinforcement Learning (CHRL), a framework for end-to-end controller training. The neural network controller is trained entirely in simulation with this framework. Since our controller uses only basic sensing, it can be implemented on a low-cost robot, which also makes it relatively easy for others to test and improve our methods. Experimental and simulation results outline the main contributions of CHRL: learning various adaptive behaviors from experts, adapting to changing environments, and robustness to uncertainty. The experimental results show that CHRL achieves multi-modal locomotion with agility and fast responses to different situations and perturbations, with smooth transitions between standing balance, trotting, turning, and recovery from a fall. CHRL also achieves high-performance omnidirectional locomotion at high speeds. As a learning-based approach, CHRL leverages computational agency and shows advantages in generating adaptive behaviors over traditional approaches that rely purely on explicit manual programming. However, training in physical simulation may introduce limitations as task complexity increases: differences between the simulation and the real world may accumulate and become problematic. Building on these results, future work will investigate learning algorithms that can safely refine motor skills on real hardware for more complex multi-modal tasks.