Abstract
Agile and adaptive maneuvers such as fall recovery, high-speed turning, and sprinting in the wild are challenging for legged systems. We propose Curricular Hindsight Reinforcement Learning (CHRL), which learns an end-to-end tracking controller that achieves powerful agility and adaptation on a legged robot. The two key components are (i) a novel automatic curriculum strategy over task difficulty and (ii) a Hindsight Experience Replay strategy adapted to legged locomotion tasks. We demonstrate successful agile and adaptive locomotion on a real quadruped robot that performed autonomous fall recovery, coherent trotting, sustained outdoor running at speeds up to 3.45 m/s, and a maximum yaw rate of 3.2 rad/s. This system produces adaptive behaviors in response to changing situations and unexpected disturbances on natural terrains such as grass and dirt.
Introduction
Legged systems can execute agile motions by leveraging their ability to reach appropriate and disjoint support contacts, enabling outstanding mobility in complex and unstructured environments1. This ability makes them a popular choice for tasks such as inspection, monitoring, search and rescue, and transporting goods in complex, unstructured environments2.
However, legged robots have yet to match the performance of animals in traversing real-world terrain3. In nature, legged animals such as cheetahs and hunting dogs can make small-radius turns at high running speeds or recover quickly from a fall with little or no loss of speed while chasing prey. Animals naturally learn to process a wide range of sensory information and respond adaptively to unexpected situations, even when exteroception is severely limited. This ability requires motion controllers capable of recovering from unexpected perturbations, adapting to changes in system and environmental dynamics, and executing safe and reliable motion relying solely on proprioception.
Endowing legged robots with this ability is a grand challenge in robotics. Conventional control methods must often be extensively redesigned to deal with these problems effectively. As a promising alternative, model-free deep reinforcement learning (RL) algorithms can solve complex, high-dimensional legged locomotion problems autonomously and do not assume prior knowledge of the environmental dynamics.
Under the privileged learning framework, previous work has solved motion tasks such as fast running4, locomotion over uneven terrain5, and mountain climbing3. However, existing motion controllers still cannot respond appropriately and adaptively when accidents interrupt motion, such as collisions with obstacles or falls caused by stepping into potholes.
In this paper, our goal is to construct a quadrupedal system that can traverse terrain with agility and adaptability over an extensive range of commands. One straightforward idea is to have the agent learn directly in a simulation environment in which commands, external disturbance forces, and other random variables are generated uniformly5,6. However, previous works4,7 have shown that this naive approach succeeds only in learning control policies over a small range of commands or disturbances. Curriculum learning allows the training process to address progressively more complex tasks, but manual curriculum design may still fail because learning progress and task difficulty are hard to evaluate4,8. This paper presents Curricular Hindsight Reinforcement Learning (CHRL) to solve this problem by introducing a novel automatic curriculum strategy that automatically assesses the policy's learning progress and controls task difficulty. CHRL periodically evaluates the policy's performance and adjusts the curriculum parameters, including command ranges, reward coefficients, and environment difficulty, to ramp up task difficulty.
However, adjusting the environment parameters and command ranges destroys the consistency between the state distribution of the environment and that of the replay buffer, which is disastrous for off-policy RL. We solve this problem by adapting Hindsight Experience Replay (HER)9 to quadrupedal locomotion tasks so that the existing replay buffer matches the current curriculum. The presented HER modifies the commands of past experiences and recalculates rewards, making additional use of past experience and thus increasing learning efficiency. The proposed policy yields significant performance improvements in learning agile and adaptive high-speed locomotion.
When deployed zero-shot in the real world on uneven outdoor terrain covered with grass, our learned policy sustained a top forward velocity of 3.45 m/s and a yaw rate of 3.2 rad/s. The policy also shows strong robustness and adaptability in unstructured environments. When the robot hits an obstacle or steps into an unseen pit and falls, the learned policy exhibits failure-resilient running and recovers from critical states within one second. The robot spontaneously resisted unpredictable external disturbances in indoor tests and demonstrated unique motor skills. These results are reported qualitatively, and the corresponding videos highlight the adaptive responses that emerge from end-to-end learning.
The main contributions of this paper include:
-
A novel automatic curriculum strategy that enables the discovery of behaviors that are challenging to learn directly with reinforcement learning.
-
We adapt Hindsight Experience Replay to legged locomotion tasks, enabling sample-efficient learning and improving performance.
-
The learned controller can be deployed directly in the real world and performs agilely and adaptively in various environments.
Related works
Dynamic locomotion over unknown and challenging terrain requires careful motion planning and precise motion and force control10. The primary approach in the legged locomotion community uses model-based mathematical optimization to solve these problems, such as Model-Predictive Control (MPC)11, Quadratic Programming (QP)12, and Trajectory Optimization (TO)13. MPC enables a system to make current decisions while considering their impact on the future through predictive modeling14, which has shown promise in recent research on legged locomotion. Recently, complex leg models, such as single rigid-body models15,16, centroidal dynamics models17,18, and whole-body models19, have been used to improve the locomotor skills of legged robots, especially quadruped robots. These MPC-based strategies treat legged motion in a unified way and can find reliable motion trajectories in real time that are robust to external disturbances. However, this paradigm relies on explicitly specifying contact points, either manually or automatically, and requires advanced optimization schemes whose computational cost grows steeply with problem complexity; such schemes are too slow for real-time solutions, making closed-loop control impractical for robotic fall recovery.
End-to-end reinforcement learning for quadrupedal locomotion
Instead of tedious manual controller engineering, reinforcement learning (RL) automatically synthesizes a controller for the desired task by optimizing the controller's objective function20. Using policies trained in simulation, ANYmal21,22 can precisely and energy-efficiently follow high-level body velocity commands, run faster than ever, and recover from falls even in complex configurations. Extending this approach, Miki et al.3 developed a controller that integrates exteroceptive and proprioceptive perception for legged locomotion and completed an hour-long hike in the Alps in the time recommended for human hikers. However, the mechanical design of the ANYmal robot is thought to limit it from running at higher speeds. Margolis et al.4 presented an end-to-end learned controller that achieves record agility for the MIT Mini Cheetah, sustaining speeds up to 3.9 m/s on flat ground and 3.4 m/s on grassland. Choi et al.23 demonstrated high-speed locomotion on deformable terrain: the robot ran on soft beach sand at 3.03 m/s even though its feet were completely buried in the sand during the stance phase. Yang et al.24 demonstrated multi-skill locomotion on a real quadruped robot that autonomously performed coherent trotting, steering, and fall recovery. A design guideline25 was proposed for selecting key states for initialization, showing that the learned fall recovery policies are hardware-feasible and can be implemented on real robots.
Fall recovery control
Previously, fall recovery controllers for legged robots were heuristically handcrafted to produce trajectories similar to human or animal recovery maneuvers, which required extensive manual design26,27. Offline planning methods automate the process by predicting falls28 or calculating trajectories offline for specific fall postures29. Such offline planning is not event-based and therefore lacks the real-time responsiveness needed to react to external disturbances. Optimization-based methods can compute feasible fall recovery solutions without handcrafting trajectories directly30,31. DRL has been used to learn fall recovery for humanoid character animation in physics simulation32. Compared to previous work, our proposed controller not only realizes high-speed, agile motion on uneven terrain but also adaptively handles unexpected events and external disturbances.
CPG-based methods
Inspired by neuroscience, central pattern generators (CPGs) are another promising approach for improving legged robot locomotion33,34. AI-CPG35 trains a CPG-like controller through imitation learning to generate rhythmic feedforward activity patterns, and RL is then applied to form a reflex neural network that adjusts the feedforward patterns based on sensory feedback, enabling stable body balancing that adapts to environmental or target-velocity changes. Using CPGs, robots can achieve more natural and stable movements, similar to those of living organisms35,36. However, CPGs limit the diversity of the controller's motion patterns, which is essential for agile and adaptive locomotion. As this work demonstrates, since gait or joint motion patterns are not explicitly specified, end-to-end RL can autonomously learn unique movement patterns to cope with unexpected situations.
Curriculum learning
Prior works have shown that a curriculum on environments can significantly improve learning performance and efficiency in reinforcement learning37. Despite these advantages, previous curriculum learning methods must be manually designed by a user38; the curriculum has to be frequently modified according to the training results, and diverse curricula must be tested to find a suitable one. This trial-and-error approach does not guarantee robust performance, and finding an effective curriculum is difficult. Many researchers have therefore proposed automatic curriculum learning (ACL) methods that design curricula without human intervention39. Ji et al.40 used a fixed-schedule curriculum on forward linear velocity only. A grid adaptive curriculum update rule was proposed4 to track a more extensive range of velocities. A curriculum on terrains3,41 was applied to learn highly robust walking controllers on non-flat ground. Nahrendra et al.42 utilized a game-inspired curriculum to ensure progressive locomotion policy learning over rugged terrain. Compared to previous work, our approach adjusts commands, reward coefficients, and environment difficulty in stages, allowing finer control of curriculum difficulty. The combination of curriculum learning and HER further enhances learning efficiency.
Hindsight experience replay
Hindsight Experience Replay (HER) has paved a promising path toward increasing the efficiency of goal-conditioned tasks with sparse rewards9. Several variants have been proposed to enhance HER, such as curriculum learning to maximize the diversity of the achieved goals43, providing demonstrations44, generating more valuable desired goals45, and curiosity-driven exploration46. Legged locomotion tasks can be viewed as a particular variant of goal-conditioned tasks due to the presence of control commands. This work adapts HER to legged locomotion tasks and proves its effectiveness.
Methods
Our goal is to learn a policy that takes sensory data and velocity commands as input and outputs desired joint positions. The symbols used in the paper are listed in Table S1. The command \(\textbf{c}^{cmd}_t\) includes the linear velocity \(\textbf{v}^{cmd}_t\) and the yaw direction \(\textbf{d}^{cmd}_t\). The policy is trained in simulation and then deployed zero-shot on the real robot. The controller is trained via privileged learning, which consists of two stages:
First, a teacher policy \(\pi _{\theta }^{teacher}\) with parameters \(\theta\) is trained via reinforcement learning with full access to privileged information, including the ground-truth state of the environment. The proposed Curricular Hindsight Reinforcement Learning (CHRL) operates in this stage and improves the performance of the teacher policy.
Second, a student policy \(\pi _{\phi }^{student}\) is trained via imitation learning to predict the teacher’s optimal action given only partial and noisy observations of the environment. Then, the student policy is deployed on the robot without any fine-tuning.
Hardware and training environment
The robot used in this work stands 35 cm tall and weighs 15 kg. It has 18 degrees of freedom (DoFs): 12 actuated joints, each capable of delivering a maximum torque of 33.5 Nm, and six generalized coordinates for the floating base. The robot is equipped with an Inertial Measurement Unit (IMU) and joint encoders.
The robot is equipped with an ARM-based STM32 microcontroller that sends commands to the joint drives, receives sensor readings, and performs simple computations. However, this microcontroller cannot run the deep reinforcement learning neural network controller at a sufficient frequency. Therefore, the robot was additionally integrated with an NVIDIA Jetson Orin development kit to provide additional computational power. The Orin development kit is equipped with a 2048-core GPU that uses CUDA to accelerate neural network inference. Our neural network controller runs at 66 Hz on the Orin development kit.
We use pybullet47 as the simulator to build the training environment. A procedural terrain generation system is developed to generate diverse sets of trajectories. Four parallel agents collect five million simulated time steps for policy training, which takes less than 6 hours of wall-clock time on a single NVIDIA RTX 3090 GPU.
Control architecture
Action space
The action \(\textbf{a}_t\) is a 12-dimensional desired joint position vector. A PD controller converts it to torques, \(\varvec{\tau } = K_p(\hat{\textbf{q}} - \textbf{q})+K_d(\hat{\dot{\textbf{q}}} - \dot{\textbf{q}})\), where \(K_p\) and \(K_d\) are manually specified gains set to 27.5 and 0.5, respectively. The target joint velocities \(\hat{\dot{\textbf{q}}}\) are set to 0. The PD controller outperforms a torque controller in both training speed and final control performance48. Although a bi-directional mapping always exists between the two, the PD controller has an advantage in training because it starts as a stable controller whose joints do not easily swing to their limits, whereas a torque controller easily drives the joints into their limit positions at the start of training.
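A minimal sketch of this PD law with the stated gains, assuming 12-element numpy arrays for the joint quantities:

```python
import numpy as np

KP, KD = 27.5, 0.5  # gains reported in the text

def pd_torque(q_des, q, dq, kp=KP, kd=KD):
    """Joint-space PD law: tau = Kp*(q_des - q) + Kd*(dq_des - dq).

    q_des : desired joint positions output by the policy (12,)
    q, dq : measured joint positions and velocities (12,)
    """
    dq_des = np.zeros_like(dq)          # target joint velocities are zero
    return kp * (q_des - q) + kd * (dq_des - dq)

# example: hold the current posture -> zero torque
q = np.zeros(12)
dq = np.zeros(12)
tau = pd_torque(q_des=q, q=q, dq=dq)
```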
Student observation space
The robot’s joint encoders provide joint angles \(\textbf{q}_{t} \in \mathbb {R}^{12}\) and velocities \(\dot{\textbf{q}}_{t} \in \mathbb {R}^{12}\). \(\textbf{g}_{t}^{\text{ ori }} \in \mathbb {R}^{3}\) and \(\omega _{t}^{\text{ ori }} \in \mathbb {R}^{3}\) denote the orientation and angular velocity measured by the IMU. \(\pi _{\phi }^{student}\) takes as input a history of previous observations and actions, denoted \(\textbf{o}_{t-H:t}\), where \(\textbf{o}_{t}=\left[ \textbf{q}_{t}, \dot{\textbf{q}}_{t}, \textbf{g}_{t}^{\text{ ori }}, \omega _{t}^{\text{ ori }}, \textbf{a}_{t-1}\right]\). The input to \(\pi _{\phi }^{student}\) is \(\textbf{x}_{t}= (\textbf{o}_{t-H:t}, \textbf{c}_{t})\) with \(H=100\), where \(\textbf{c}_{t}\) is specified by a human operator through remote control during deployment.
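A sketch of how the student input \(\textbf{x}_{t}\) can be assembled, assuming a 42-dimensional per-step observation (12 + 12 + 3 + 3 + 12) and a two-dimensional command; the class and field names are illustrative, not part of the paper's implementation.

```python
from collections import deque
import numpy as np

OBS_DIM, H = 42, 100   # 12 + 12 + 3 + 3 + 12 per step; history length from the text

class ObservationHistory:
    """Keeps the last H proprioceptive observations o_t for the student policy."""
    def __init__(self):
        self.buf = deque([np.zeros(OBS_DIM)] * H, maxlen=H)

    def push(self, q, dq, g_ori, w_ori, last_action):
        o_t = np.concatenate([q, dq, g_ori, w_ori, last_action])
        self.buf.append(o_t)

    def student_input(self, command):
        # x_t = (o_{t-H:t}, c_t): flattened history followed by the command
        return np.concatenate([np.concatenate(list(self.buf)), command])

hist = ObservationHistory()
hist.push(np.zeros(12), np.zeros(12), np.zeros(3), np.zeros(3), np.zeros(12))
x_t = hist.student_input(command=np.array([1.0, 0.0]))   # e.g. 1 m/s forward, zero yaw
print(x_t.shape)   # (H * OBS_DIM + 2,)
```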
Teacher observation space
The teacher observation is defined as \(\textbf{s}_{t} = (\textbf{o}_{t-H:t}, \textbf{c}_{t}, \textbf{p}_{t-H:t})\), where \(H=4\). The privileged state \(\textbf{p}_t\) contains the body velocity \(\textbf{v}_{t} \in \mathbb {R}^{3}\), the binary foot contact indicator vector \(\textbf{f}^{t} \in \mathbb {R}^{4}\), the three-dimensional relative position of the body in the world frame, the friction coefficient of the feet, and the payload mass.
Reward function
The reward function encourages the agent to move forward and penalizes jerky and inefficient motions. We denote the joint torques as \(\varvec{\tau }_{t} \in \mathbb {R}^{12}\), the acceleration of the base in the robot's base frame as \(\ddot{a}_{t} \in \mathbb {R}^{3}\), the velocity of the feet as \(\textbf{v}_{t}^{f} \in \mathbb {R}^{4}\), and the total mechanical power of the robot as \(\mathbb {W}_{t}\). Building on previous work49, the reward function is designed to encourage the robot to track desired commands and to penalize unnatural and inefficient movements. It contains task reward terms for linear velocity and orientation tracking, as well as a set of auxiliary terms for stability (a height constraint and angular velocity penalties on body roll and pitch), safety (penalties for self-collision), smoothness (foot-slip, joint-torque, and body-acceleration penalties), and energy (power at the current time step). The smoothness, safety, and energy rewards encourage the agent to learn a stable and natural gait. The reward at time t is defined as the sum of the following quantities:
-
Linear Velocity: \(\exp \left\{ -0.5\left( \textbf{v}_{t}^{cmd}-\textbf{v}^{x}_{t}\right) ^{2}\right\}\)
-
Linear Velocity Penalties: \(- 0.4\textbf{v}^{y}_{t} - 0.4\textbf{v}^{z}_{t}\)
-
Orientation Tracking: \(\exp \left\{ -0.5\left( \textbf{d}_{t}^{cmd}-\textbf{g}_{t}^{\textrm{ori}}\right) ^{2}\right\}\)
-
Height Constraint: \(-|\textbf{p}^{z}_{t}-\textbf{p}^{\textrm{target}}_{t} |\)
-
Angular Velocity Penalties: \(-\left\| \omega _{t}^{\text{ ori }} \right\| ^{2}\)
-
Self-collision Penalties: \(-\textbf{1}_{\textrm{selfcollision}}\)
-
Joint Torque Penalties: \(-\left\| \varvec{\tau }_{t} \right\| ^{2}\)
-
Base Acceleration Penalties: \(-\left\| \ddot{a}_{t} \right\| ^{2}\)
-
Energy: \(-\mathbb {W}_{t}\)
-
Foot Slip: \(-\left\| \operatorname {diag}\left( \textbf{f}^{t}\right) \cdot \textbf{v}_{t}^{f}\right\| ^{2}\)
Each term is weighted by a scaling factor of 3.0, 1.0, 3.0, 10, 0.21, 2.0, 0.018, 0.1, 0.012, and 0.3, respectively, as sketched below.
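A minimal sketch of this weighted sum, assuming the simulator readings are available in a dictionary; the field names and the absolute values on the lateral and vertical velocity penalty are assumptions, not the paper's exact implementation.

```python
import numpy as np

# weights in the order the terms are listed above
W = dict(lin_vel=3.0, lin_vel_pen=1.0, ori=3.0, height=10.0, ang_vel=0.21,
         self_col=2.0, torque=0.018, base_acc=0.1, energy=0.012, foot_slip=0.3)

def reward(s):
    """s is a dict of simulator readings (field names are illustrative)."""
    r  = W['lin_vel']     * np.exp(-0.5 * (s['v_cmd'] - s['v_x'])**2)
    r += W['lin_vel_pen'] * (-0.4 * abs(s['v_y']) - 0.4 * abs(s['v_z']))
    r += W['ori']         * np.exp(-0.5 * np.sum((s['d_cmd'] - s['g_ori'])**2))
    r += W['height']      * (-abs(s['p_z'] - s['p_z_target']))
    r += W['ang_vel']     * (-np.sum(s['w_ori']**2))
    r += W['self_col']    * (-float(s['self_collision']))
    r += W['torque']      * (-np.sum(s['tau']**2))
    r += W['base_acc']    * (-np.sum(s['base_acc']**2))
    r += W['energy']      * (-s['power'])
    r += W['foot_slip']   * (-np.sum((s['foot_contact'] * s['foot_vel'])**2))
    return r
```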
Policy optimization
Teacher policy
The teacher policy \(\pi _{\theta }^{teacher}\) consists of two MLP components: a state encoder \(g_{\theta _{e}}\) and a main network \(\pi _{\theta _{m}}\), such that \(\textbf{a}_{t}=\pi _{\theta _{m}}\left( \textbf{z}_{t}\right)\), where \(\textbf{z}_{t}=g_{\theta _{e}}\left( \textbf{s}_{t}\right)\) is a latent representation. Each module is parameterized as a neural network with 256 hidden nodes and rectified linear units (ReLU)50 between layers. We optimize the teacher parameters jointly using REDQ51.
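A sketch of this two-module architecture in PyTorch; the latent dimension, the number of layers per module, the tanh output squashing, and the input dimension are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    """State encoder g_e followed by main network pi_m (layer count is an assumption)."""
    def __init__(self, state_dim, latent_dim=64, action_dim=12, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim), nn.ReLU(),
        )
        self.main = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # desired joint positions (scaled)
        )

    def forward(self, s):
        z = self.encoder(s)     # latent representation z_t
        return self.main(z)     # action a_t

# state_dim depends on the teacher history length H=4 and the privileged-state size
policy = TeacherPolicy(state_dim=256)
a = policy(torch.randn(1, 256))
```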
Student policy
We use the same training environment as for the teacher policy but add noise to the student observation, \(\textbf{o}_{t}^{noise} = n(\textbf{o}_{t})\), where \(n(\cdot)\) is a Gaussian noise model applied to the observation. The student policy uses a temporal convolutional network (TCN)6 encoder \(g_{\phi _{e}}\) to handle the resulting Partially Observable Markov Decision Process (POMDP). The student action is \(\hat{\textbf{a}}_{t} = \pi _{\phi _{m}}(\hat{\textbf{z}}_{t})\), where \(\hat{\textbf{z}}_{t}=g_{\phi _{e}}\left( \textbf{o}_{t}^{noise}\right)\). The student policy is trained via supervised learning, with a loss that penalizes the discrepancy between the student and teacher actions.
We employ the dataset aggregation strategy (DAgger)52: training data are generated by rolling out trajectories according to the student policy. The weights of all networks are initialized with Kaiming uniform initialization53, and the biases are zero-initialized. All parameters are updated with the Adam optimizer54 using a fixed learning rate of \(3 \times 10^{-4}\).
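A condensed sketch of one DAgger iteration under these settings; the environment wrapper returning both the student input \(\textbf{x}_{t}\) (as a tensor) and the teacher state \(\textbf{s}_{t}\), the additive Gaussian noise model, and the mean-squared-error imitation loss are assumptions.

```python
import torch
import torch.nn as nn

# student: nn.Module mapping noisy observation histories to actions
# teacher: frozen policy from the first stage; env: placeholder wrapper
def dagger_update(student, teacher, env, optimizer, steps=1000, noise_std=0.01):
    """One DAgger iteration: roll out the *student*, label each state with the teacher."""
    losses = []
    x, s = env.reset()                      # student input x_t and teacher state s_t
    for _ in range(steps):
        with torch.no_grad():
            a_teacher = teacher(s)          # supervision target
        a_student = student(x + noise_std * torch.randn_like(x))
        loss = nn.functional.mse_loss(a_student, a_teacher)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        x, s = env.step(a_student.detach()) # data is collected under the student policy
        losses.append(loss.item())
    return sum(losses) / len(losses)

# optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)
```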
Domain randomization
Domain randomization encourages the policy to learn a single behavior that works across all the randomized parameters, helping to cross the sim-to-real gap. We apply external forces and torques to the robot's body, introduce slippage by setting the friction coefficients of the feet to low values, and randomize the robot's dynamics parameters3,55. Before each training episode, we randomly select a set of physical parameters (Table 2) to initialize the simulation.
Since ___domain randomization trades optimality for robustness55, the parameters and their ranges in Table 2 must be chosen carefully to prevent learning overly conservative gaits. The robot's mass and joint friction were measured during its design, giving narrow ranges, whereas the rotational inertia is less certain because it was estimated with CAD software. In addition, some of the dynamic variables change over time, such as motor friction due to wear, control steps and delays that fluctuate because of the non-real-time nature of the system, and battery voltage that varies with the state of charge. For these reasons, we randomize these parameters, with ranges based on actual measurements plus a small safety factor. The noise levels of the sensors (e.g., the IMU and joint encoders) are obtained from statistics of real sensor data.
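A sketch of per-episode parameter sampling; the parameter names and ranges below are placeholders standing in for the entries of Table 2.

```python
import numpy as np

# Ranges are placeholders; the actual parameters and values are listed in Table 2.
RANDOMIZATION_RANGES = {
    "payload_mass_kg":     (0.0, 3.0),
    "foot_friction":       (0.4, 1.25),
    "motor_friction":      (0.0, 0.05),
    "control_latency_ms":  (0.0, 20.0),
    "battery_voltage_v":   (22.0, 25.2),
}

def sample_episode_params(rng=np.random.default_rng()):
    """Draw one set of physical parameters before each training episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMIZATION_RANGES.items()}

params = sample_episode_params()
# env.set_dynamics(**params)   # applied to the simulator before the rollout
```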
Automatic curriculum strategy
Similar to previous work, our approach implements a training curriculum that progressively modifies the distribution of environmental parameters, thus enabling policies to improve motor skills and continuously adapt to new environments.
Some works use a curriculum in which the commands are updated on a fixed schedule \(\mathbb {C}^{k}\), as a function of a timing variable k. The schedule \(\mathbb {C}^{k}\) consists of two parts: the distribution \(p_{x}^{k}\) of the random variables x (such as ___domain randomization parameters and commands) and the value \(r_{c}^{k}\) of each curriculum coefficient c (such as reward factors).
The update rule f takes the form \(\mathbb {C}^{k+1} \longleftarrow f(\mathbb {C}^{k})\). However, a fixed schedule requires manual tuning: if the environment, rewards, or learning settings are modified, the schedule will likely need to be re-tuned, which is costly in terms of time. Rather than advancing the curriculum on a fixed schedule, we update it automatically using a rule based on command-tracking performance.
Unlike previous works, our approach is not limited to commands or terrain but can exploit more environmental parameters, such as reward coefficients and ___domain randomization parameters, to control the task's difficulty at a finer granularity. Table 1 shows the parameters used in our experiments.
In this work, we apply a tabular curriculum update rule. First, we manually set the number of curriculum stages N and uniformly split each curriculum parameter (Table 1) from its start to its end value into \(p_{x}^{1} \dots p_{x}^{N}\) and \(r_{c}^{1} \dots r_{c}^{N}\). At episode k, the curriculum parameters for the agent and environment are sampled from the joint distribution \(p_{x}^{n}\), and the reward factors are set to \(r_{c}^{n}\). If the agent succeeds in this region of the curriculum space, we increase the difficulty by updating the tabular curriculum from \(p_{x}^{n}\) and \(r_{c}^{n}\) to \(p_{x}^{n+1}\) and \(r_{c}^{n+1}\). The average tracking errors \(\epsilon _{k}\left[ \textbf{v}^{\textrm{cmd}}\right]\) and \(\epsilon _{k}\left[ \textbf{d}^{\textrm{cmd}}\right]\) of the commanded linear velocity and yaw direction, measured over trials with the current policy, are used as the main evaluation metrics.
Measuring each of these metrics individually does not effectively reflect the performance of the controller, so we need a performance measure that covers both commands. We therefore construct a composite metric that captures the range of commands the controller can execute within a given maximum error tolerance.
The proposed automatic curriculum strategy cyclically evaluates the learned policy and automatically adjusts the curriculum difficulty based on the evaluation results. The evaluation procedure is initiated every 100,000 samples collected by the agent. Ten evaluation processes are run in parallel, each lasting 10 seconds and randomly sampling new commands every 2 seconds. In all experiments, we set \(\epsilon _{0} = 0.15\), \(\epsilon _{1} = 0.25\), and the number of curriculum stages \(N=10\).
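The following is a minimal sketch of this evaluation-and-update loop; the acceptance rule (advancing a stage when both mean tracking errors fall within the tolerances), the choice of curriculum parameters, and their start and end values are assumptions, and the class and method names are illustrative.

```python
import numpy as np

N_STAGES = 10                      # number of curriculum stages N
EPS_V, EPS_D = 0.15, 0.25          # error tolerances from the text
EVAL_EVERY = 100_000               # environment samples between evaluations

class AutoCurriculum:
    """Tabular curriculum over command ranges, reward factors, and environment difficulty."""
    def __init__(self):
        self.stage = 0
        # each parameter is split uniformly from its start to its end value;
        # the values below are placeholders for the entries of Table 1
        self.max_speed = np.linspace(1.0, 3.5, N_STAGES)
        self.push_force = np.linspace(0.0, 50.0, N_STAGES)

    def current(self):
        return {"max_speed": self.max_speed[self.stage],
                "push_force": self.push_force[self.stage]}

    def maybe_advance(self, vel_errors, yaw_errors):
        """Advance one stage when the mean tracking errors are inside the tolerances
        (the exact acceptance rule combining both errors is an assumption)."""
        if (np.mean(vel_errors) < EPS_V and np.mean(yaw_errors) < EPS_D
                and self.stage < N_STAGES - 1):
            self.stage += 1
        return self.current()
```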
Hindsight experience replay for legged locomotion
Hindsight Experience Replay (HER)9 allows the algorithm to reuse existing samples and can be combined with any off-policy RL algorithm. However, the original HER algorithm is only suitable for goal-conditioned tasks with sparse, binary rewards. We make several modifications to HER to accommodate legged locomotion tasks.
First, we sample a new goal for each transition during the training phase instead of choosing a different goal for the whole trajectory during the storing phase. Second, because rewards must be recalculated, additional information about the robot's state (\(\mathbb {W}, \ddot{a}, \omega ^{\text{ ori }}, \tau , \textbf{f}, \textbf{v}, \textbf{g}^{\textrm{ori}}\)) is added to each transition; each stored transition \(\textbf{T}_t\) is thus augmented with the resulting robot state \(\textbf{s}^{robot}_{t+1}\).
Third, the original HER samples the goal achieved in the episode's final state. This is inefficient because, in locomotion tasks, policies for similar commands tend to be similar. For a transition with command \(\textbf{c}_{t}^{cmd}\), consisting of the linear velocity \(\textbf{v}^{cmd}_t\) and the yaw direction \(\textbf{d}^{cmd}_t\), we sample a new command \(\textbf{c}_{new}^{cmd}\) by adding neighboring regions of the original command to the sampling distribution.
All newly sampled commands are limited to the range of the current curriculum. The reward \(r_t\) is then recalculated with \(\textbf{s}^{robot}_{t+1}\) according to Section “Control architecture”.
This approach can be seen as a form of data augmentation in which samples generated by the policy for a specific command are shared with neighboring commands. Since commands are usually given manually or by a high-level planner, command generation is not strongly correlated with the robot's current state, so lightly resampling the current command does not bias the training distribution. Because commands are closely tied to the rewards the agent receives, resampling commands and recomputing the rewards lets the agent learn whether the current action is harmful or beneficial for neighboring commands, which alleviates the problem of sparse rewards in large-scale command tracking tasks4.
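As a concrete illustration, the hindsight relabeling step can be sketched as follows; the Gaussian perturbation width, the dictionary layout of a stored transition, and the reward-function interface are assumptions rather than the paper's exact sampling rule.

```python
import numpy as np

def relabel_transition(transition, reward_fn, cmd_range, sigma=0.2,
                       rng=np.random.default_rng()):
    """Hindsight relabeling for locomotion: perturb the stored command and
    recompute the reward from the logged robot state (sigma is an assumption)."""
    c_old = transition["command"]                       # (v_cmd, d_cmd)
    c_new = c_old + rng.normal(0.0, sigma, size=c_old.shape)
    c_new = np.clip(c_new, cmd_range[0], cmd_range[1])  # stay inside current curriculum
    r_new = reward_fn(transition["robot_state_next"], c_new)
    return {**transition, "command": c_new, "reward": r_new}
```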
Combined with the automatic curriculum strategy in Section “Automatic curriculum strategy”, this yields Curricular Hindsight Reinforcement Learning (CHRL); its pseudocode is shown in Algorithm 1. While the agent interacts with the environment, command tracking accuracy is periodically evaluated to obtain \(\epsilon _{k}\left[ \textbf{v}^{\textrm{cmd}}\right]\) and \(\epsilon _{k}\left[ \textbf{d}^{\textrm{cmd}}\right]\) according to (3), and the curriculum schedule \(\mathbb {C}\) and environment \(\mathbb {E}\) are then updated according to the update rule (2). In the policy training phase, commands are resampled, \(\textbf{c}_{j}^{cmd} \longleftarrow \textbf{c}_{new}^{cmd}\), according to (6), and the reward \(r_t\) is recalculated. Finally, the reinforcement learning algorithm \(\mathbb {A}\) uses the updated rewards and commands to compute gradients and optimize the policy and critics.
Results
Command tracking task test
First, the control performance was evaluated for forward running under random commands in simulation. In our experiments, we resampled commands and sent them to the robot with a probability of 1/150 at each control step (every 2.25 seconds on average) and resampled environmental variables with the same probability. Figure 1 shows the command tracking accuracy of the policy in simulation.
The robot moves steadily in the desired direction on rough terrain. Even if the velocity command changes during direction correction, the controller tracks both commands well. Note that the observed oscillation of the velocity around the commanded value is a well-known phenomenon in legged systems, including humans56. In terms of average speed, the learned policy has an error of less than 5% on the simulated robot.
We also performed experiments with the real robot, including speed tracking tests at 0.8 m/s indoors and up to 3.5 m/s outdoors. The outdoor terrain presented multiple challenges absent indoors, including variations in ground height, friction, and terrain deformation. Under these variations, the robot must actuate its joints differently to reach high velocities than it would on flat, rigid, high-friction terrain such as a treadmill or paved road. We estimated the robot's locomotion speed by measuring the time it took to pass through a 5-meter section of the course and performed multiple sets of repeated tests. In the indoor 0.8 m/s test, we recorded a 5 m walking time of 6.17 s, an average speed of 0.81 m/s. Outdoors, we recorded a 5-meter sprint time of 1.45 s, an average speed of 3.45 m/s. The results show that the robot can track speed commands consistently and accurately in unknown scenarios, both indoors and outdoors.
As shown in Fig. 2, we evaluated the controller's yaw tracking in an outdoor grass environment. In the test, the desired direction and velocity commands were sent at random by the human operator. We observed a maximum yaw rate of 3.2 rad/s, followed by a safe stop. Even during continuous slewing at larger velocity commands (3 m/s), the robot remained stable while turning, demonstrating the compliant interaction and robustness learned by the agent.
The robot could track commands robustly in different ground conditions with different hardness, friction, and obstacles (Fig. 2). The learned motor skills were stable across the different ground conditions, and the robot continued to trot steadily in all three conditions. The trained policy exhibited compliant interaction behaviors to handle physical interactions and impacts. In testing, when a large foot slip occurred, the robot could recover quickly, even running near maximum speed. If the command was suddenly set to zero during operation, the robot assumed a stable posture and quickly stopped moving.
Outdoor running accident. The robot was running on grassland with a command of 3 m/s when it stepped into an unseen deep pit, causing it to trip and fall. The controller used the forward inertia to roll the body and moved the legs into an impact-resistant stance to protect the robot, followed by a quick return to running.
Fall recovery and response to unexpected disturbances
Due to the uncertainties of unforeseen situations, locomotion failures are likely to occur. We illustrate these challenges in robotic locomotion using field tests (Figs. 4 and 5) and show adaptive behaviors that are robust to uncertainty (Fig. 3).
During the outdoor tests, the robot experienced many unexpected accidents that resulted in unintended contact between its body and the environment. In the indoor tests, we actively applied external disturbances to the robot while it was walking to destabilize its movement and observe its reaction after losing balance. Typically, robots fall within a second of losing their balance, and the window of time to prevent a fall is about 0.2–0.5 s. The proposed controller shows different adaptive behaviors in these unexpected scenarios, generating dynamic locomotion and complex leg coordination for immediate recovery from failures.
In these unexpected scenarios, our robot autonomously coordinates different locomotion patterns to mitigate disturbances and prevent or recover from failures without human assistance. These behaviors closely resemble those of biological systems (such as cats, dogs, and humans), showing greater versatility and agency: the ability to deal with constantly changing and complex situations.
We classified learned response behaviors into three strategies:
-
Natural rolling using semi-passive motion (Fig. 4). Natural rolling is the behavior of a robot that uses its inertia and gravity to tumble.
-
Active righting and tumbling (Fig. 5). Active righting is a policy in which the robot actively uses its legs and elbows to propel itself and generate momentum to flip into a prone position.
-
Stepping. Figure 3 shows an example of stepping, where coordination and switching of the support legs are required to regain balance when the current motion is destabilized by an external disturbance. This multi-contact switching emerges naturally from the learning-based policy.
Compared to a manually designed fall recovery controller with a fixed pattern, our learned controller can recover from various fall situations by responding to dynamic changes using online feedback. In contrast, a manual controller can only cope with a narrow range of situations.
Analysis of skill adaptation
We analyzed the features learned by the policies using t-distributed Stochastic Neighbor Embedding (t-SNE) in simulation to investigate how skills are adapted and distributed in the network. t-SNE is a dimensionality reduction technique that embeds high-dimensional data in a low-dimensional space for visualization. Similar features in the output of the student network appear with high probability in the same neighborhood of the clustered points (Fig. 6), and vice versa.
A two-dimensional t-SNE projection of the student network features visualizes the neighborhoods and clusters of the samples. In Fig. 6a, t-SNE analysis of the student network features under different commands shows that the agent learns distinct skills and patterns for different commands, revealing a diversity of skills after curriculum training. If commands are very similar, the network reuses certain patterns and features to some extent, but also fine-tunes for the minor differences.
We also use t-SNE to compare policy features under different payloads and friction coefficients. As shown in Fig. 6b, the features under different environmental variables are far apart, meaning that the student policy clearly recognizes subtle environmental changes and responds appropriately.
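A minimal sketch of this analysis using scikit-learn, assuming the student encoder activations and their command or environment-parameter labels have already been collected; the hyperparameters (e.g., perplexity) are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (n_samples, latent_dim) activations of the student encoder g_phi_e
# labels:   command bin or environment-parameter bin for each sample (assumed inputs)
def plot_feature_embedding(features, labels):
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.xlabel("t-SNE dim 1"); plt.ylabel("t-SNE dim 2")
    plt.show()
```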
(a) Learning curves for different curriculum strategies. The horizontal axis indicates the number of time steps. The vertical axis shows the average reward over time steps within an episode. The shaded areas denote one standard deviation over four runs. (c–d) The measured torques of the right leg during forward running (around 3.0 m/s) for different curriculum strategies.
(a) Velocity tracking test results averaged over four runs. The vertical axis represents the percentage of velocity tracking error relative to the velocity command, and the horizontal axis represents the velocity command received by the robot. (b) CoT test results averaged over four runs. The vertical axis uses logarithmic coordinates to more clearly reflect the intensity of change in CoT.
Comparative evaluation
We compare the performance with the following baselines in simulation:
-
The grid adaptive curriculum from Margolis et al.4, which increases the difficulty by adding neighboring regions to the command sampling distribution.
-
An adaptive curriculum from Miki et al.3, which adjusts terrain difficulty using an adaptive method and changes elements such as reward or applied disturbances using a logistic function.
CHRL consists of curriculum learning and HER, but their individual contributions to controller performance still need to be clarified. We therefore also ablate these components and compare the resulting performance.
Figure 7a shows the learning curves. Compared with a policy trained without a curriculum, learning efficiency and performance improve significantly with any of the curriculum learning strategies. CHRL consistently performs better and learns significantly faster than the other baselines. The torques measured at each joint while the robot ran at an average speed closest to 3 m/s are shown in Fig. 7b–d. CHRL produces smoother torque variations, which contributes to its high reward.
In Fig. 8a, we find that without CHRL the velocity tracking error increases dramatically when the velocity command exceeds 3 m/s. This suggests that the curriculum is crucial for learning high-speed locomotion. The other baselines can track high-speed commands stably but have larger tracking errors than CHRL when tracking low-speed commands from 1.0 to 2.0 m/s.
Figure 8b presents the CoT versus average velocity for the baselines. The dimensionless cost of transport (CoT)6 is calculated to compare the efficiency of the controllers. The mechanical CoT is defined as \(\sum _{12 \text{ actuators } }[\varvec{\tau } \dot{\textbf{q}}]^{+} /(\textbf{m} g \textbf{v})\), where \(\textbf{m} g\) is the total weight. CHRL recorded slightly lower CoTs than controllers trained with the baselines. The presented controller is more energy efficient than the RL controller for ANYmal6, with a log mechanical CoT of about \(-0.4\).
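For reference, the mechanical CoT defined above can be computed as follows; the example torque and joint-velocity values are illustrative only.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def mechanical_cot(tau, dq, mass, speed):
    """Dimensionless mechanical cost of transport:
    sum over the 12 actuators of [tau * dq]^+ divided by m*g*v."""
    power = np.maximum(tau * dq, 0.0).sum()   # only positive mechanical work counts
    return power / (mass * G * speed)

# example: 15 kg robot running at 3.0 m/s with illustrative joint torques/velocities
cot = mechanical_cot(tau=np.full(12, 8.0), dq=np.full(12, 6.0), mass=15.0, speed=3.0)
```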
Conclusions
We propose a framework for training end-to-end controllers: Curricular Hindsight Reinforcement Learning (CHRL). The neural network controller is trained entirely end-to-end in simulation with this framework. Since our controller uses only the most basic sensing, it can be implemented on a low-cost robot, and it is relatively easy for others to test and improve our methods. Experimental and simulation results outline the main contributions of CHRL: learning various adaptive behaviors from experts, adapting to changing environments, and robustness to uncertainty. The experimental results show that CHRL achieves multi-modal locomotion with agility and fast responses to different situations and perturbations, with smooth transitions between standing balance, trotting, turning, and recovery from a fall. CHRL also enables high-performance omnidirectional locomotion at high speeds. As a learning-based approach, CHRL relies on computational agency and shows advantages in generating adaptive behaviors over traditional approaches that rely purely on explicit manual programming. However, training in physical simulation may introduce limitations as task complexity increases: differences between simulation and the real world may accumulate and become problematic. Building on these results, future work will investigate learning algorithms that can safely refine motor skills on real hardware for more complex multi-modal tasks.
Data availability
The datasets generated and/or analyzed during the current study are not publicly available because data sharing is not required by the funding institution, but they are available from the corresponding author upon reasonable request.
Code availability
The code used for data analysis is publicly shared on the Zenodo repository (https://zenodo.org/records/13924712)57. The algorithm code is a proprietary intellectual property and thus cannot be made publicly available.
References
Gangapurwala, S., Campanaro, L. & Havoutis, I. Learning low-frequency motion control for robust and dynamic robot locomotion. In 2023 IEEE International Conference on Robotics and Automation (ICRA) 5085–5091 (IEEE, 2023).
Mitchell, A. L. et al. Next steps: Learning a disentangled gait representation for versatile quadruped locomotion. In 2022 International Conference on Robotics and Automation (ICRA) 10564–10570 (IEEE, 2022).
Miki, T. et al. Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 7, eabk2822 (2022).
Margolis, G. B., Yang, G., Paigwar, K., Chen, T. & Agrawal, P. Rapid locomotion via reinforcement learning. arXiv:2205.02824 (2022).
Kumar, A., Fu, Z., Pathak, D. & Malik, J. Rma: Rapid motor adaptation for legged robots. arXiv:2107.04034 (2021).
Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V. & Hutter, M. Learning quadrupedal locomotion over challenging terrain. Sci. Robot. 5, eabc5986 (2020).
Xie, Z., Ling, H. Y., Kim, N. H. & van de Panne, M. Allsteps: Curriculum-driven learning of stepping stone skills. In Computer Graphics Forum Vol. 39 213–224 (Wiley Online Library, 2020).
Narvekar, S. & Stone, P. Learning curriculum policies for reinforcement learning. arXiv:1812.00285 (2018).
Andrychowicz, M. et al. Hindsight experience replay. In Advances in Neural Information Processing Systems Vol. 30 (2017).
Humphreys, J., Li, J., Wan, Y., Gao, H. & Zhou, C. Bio-inspired gait transitions for quadruped locomotion. In IEEE Robotics and Automation Letters (2023).
Farshidian, F., Jelavic, E., Satapathy, A., Giftthaler, M. & Buchli, J. Real-time motion planning of legged robots: A model predictive control approach. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids) 577–584 (IEEE, 2017).
Kamidi, V. R., Kim, J., Fawcett, R. T., Ames, A. D. & Hamed, K. A. Distributed quadratic programming-based nonlinear controllers for periodic gaits on legged robots. IEEE Control Syst. Lett. 6, 2509–2514 (2022).
Buchanan, R. et al. Walking posture adaptation for legged robot navigation in confined spaces. IEEE Robot. Autom. Lett. 4, 2148–2155 (2019).
Kerrigan, E. C. Predictive control for linear and hybrid systems [bookshelf]. IEEE Control Syst. Mag. 38, 94–96 (2018).
Di Carlo, J., Wensing, P. M., Katz, B., Bledt, G. & Kim, S. Dynamic locomotion in the MIT cheetah 3 through convex model-predictive control. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 1–9 (IEEE, 2018).
Bledt, G. & Kim, S. Implementing regularized predictive control for simultaneous real-time footstep and ground reaction force optimization. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 6316–6323 (IEEE, 2019).
Mastalli, C. et al. Agile maneuvers in legged robots: A predictive control approach. arXiv:2203.07554 (2022).
Meduri, A. et al. Biconmp: A nonlinear model predictive control framework for whole body motion planning. IEEE Trans. Robot. 39, 905–922 (2023).
Carius, J., Ranftl, R., Koltun, V. & Hutter, M. Trajectory optimization for legged robots with slipping motions. IEEE Robot. Autom. Lett. 4, 3013–3020 (2019).
Wu, J., Xin, G., Qi, C. & Xue, Y. Learning robust and agile legged locomotion using adversarial motion priors. In IEEE Robotics and Automation Letters (2023).
Hwangbo, J. et al. Learning agile and dynamic motor skills for legged robots. Sci. Robot. 4, eaau5872 (2019).
Hoeller, D., Rudin, N., Sako, D. & Hutter, M. Anymal parkour: Learning agile navigation for quadrupedal robots. arXiv:2306.14874 (2023).
Choi, S. et al. Learning quadrupedal locomotion on deformable terrain. Sci. Robot. 8, eade2256 (2023).
Yang, C., Yuan, K., Zhu, Q., Yu, W. & Li, Z. Multi-expert learning of adaptive legged locomotion. Sci. Robot. 5, eabb2174 (2020).
Yang, C., Pu, C., Xin, G., Zhang, J. & Li, Z. Learning complex motor skills for legged robot fall recovery. In IEEE Robotics and Automation Letters (2023).
Semini, C. et al. Design overview of the hydraulic quadruped robots. In The fourteenth Scandinavian International Conference on Fluid Power 20–22 (sn, 2015).
Stückler, J., Schwenk, J. & Behnke, S. Getting back on two feet: Reliable standing-up routines for a humanoid robot. In IAS 676–685 (Citeseer, 2006).
Li, Z. et al. Fall prediction of legged robots based on energy state and its implication of balance augmentation: A study on the humanoid. In 2015 IEEE International Conference on Robotics and Automation (ICRA) 5094–5100 (IEEE, 2015).
Araki, K., Miwa, T., Shigemune, H., Hashimoto, S. & Sawada, H. Standing-up control of a fallen humanoid robot based on the ground-contacting state of the body. In IECON 2018-44th Annual Conference of the IEEE Industrial Electronics Society 3292–3297 (IEEE, 2018).
Radulescu, A., Havoutis, I., Caldwell, D. G. & Semini, C. Whole-body trajectory optimization for non-periodic dynamic motions on quadrupedal systems. In 2017 IEEE International Conference on Robotics and Automation (ICRA) 5302–5307 (IEEE, 2017).
Mordatch, I., Todorov, E. & Popović, Z. Discovery of complex behaviors through contact-invariant optimization. ACM Trans. Graph. ToG 31, 1–8 (2012).
Peng, X. B., Guo, Y., Halper, L., Levine, S. & Fidler, S. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Trans. Graph. TOG 41, 1–17 (2022).
Zhang, X., Wu, Y., Wang, H., Iida, F. & Wang, L. Adaptive locomotion learning for quadruped robots by combining DRL with a cosine oscillator based rhythm controller. Appl. Sci. 13, 11045 (2023).
Nassour, J., Hoa, T. D., Atoofi, P. & Hamker, F. Concrete action representation model: From neuroscience to robotics. IEEE Trans. Cognit. Dev. Syst. 12, 272–284 (2019).
Li, G., Ijspeert, A. & Hayashibe, M. Ai-cpg: Adaptive imitated central pattern generators for bipedal locomotion learned through reinforced reflex neural networks. In IEEE Robotics and Automation Letters (2024).
Ijspeert, A. J. & Daley, M. A. Integration of feedforward and feedback control in the neuromechanics of vertebrate locomotion: A review of experimental, simulation and robotic studies. J. Exp. Biol. 226, jeb245784 (2023).
Bengio, Y., Louradour, J., Collobert, R. & Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning 41–48 (2009).
Florensa, C., Held, D., Geng, X. & Abbeel, P. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning 1515–1528 (PMLR, 2018).
Graves, A., Bellemare, M. G., Menick, J., Munos, R. & Kavukcuoglu, K. Automated curriculum learning for neural networks. In International Conference on Machine Learning 1311–1320 (PMLR, 2017).
Ji, G., Mun, J., Kim, H. & Hwangbo, J. Concurrent training of a control policy and a state estimator for dynamic and robust legged locomotion. IEEE Robot. Autom. Lett. 7, 4630–4637 (2022).
Rudin, N., Hoeller, D., Reist, P. & Hutter, M. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning 91–100 (PMLR, 2022).
Nahrendra, I. M. A., Yu, B. & Myung, H. Dreamwaq: Learning robust quadrupedal locomotion with implicit terrain imagination via deep reinforcement learning. In 2023 IEEE International Conference on Robotics and Automation (ICRA) 5078–5084 (IEEE, 2023).
Fang, M., Zhou, T., Du, Y., Han, L. & Zhang, Z. Curriculum-guided hindsight experience replay. In Advances in Neural Information Processing Systems Vol. 32 (2019).
Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W. & Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA) 6292–6299. https://doi.org/10.1109/ICRA.2018.8463162 (2018).
Han, C. et al. Overfitting-avoiding goal-guided exploration for hard-exploration multi-goal reinforcement learning. Neurocomputing 525, 76–87 (2023).
Li, B. et al. Acder: Augmented curiosity-driven experience replay. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 4218–4224. https://doi.org/10.1109/ICRA40945.2020.9197421 (2020).
Coumans, E. & Bai, Y. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org (2016–2021).
Haarnoja, T. et al. Learning to walk via deep reinforcement learning. arXiv:1812.11103 (2018).
Chen, S., Zhang, B., Mueller, M. W., Rai, A. & Sreenath, K. Learning torque control for quadrupedal locomotion. arXiv:2203.05194 (2022).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Icml (2010).
Chen, X., Wang, C., Zhou, Z. & Ross, K. W. Randomized ensembled double q-learning: Learning fast without a model. In International Conference on Learning Representations (2020).
Ross, S., Gordon, G. & Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics 627–635 (JMLR Workshop and Conference Proceedings, 2011).
He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision 1026–1034 (2015).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980 (2014).
Tan, J. et al. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv:1804.10332 (2018).
Winter, D. A. Biomechanics and Motor Control of Human Gait: Normal, Elderly and Pathological (1991).
ihuhuhu/chrl: v1.0.0. https://doi.org/10.5281/zenodo.13924712 (2024).
Acknowledgements
This work was supported in part by the National Natural Science Foundation of Heilongjiang Province (Grant No.YQ2020E028).
Author information
Authors and Affiliations
Contributions
S.L. implemented the code and drafted the manuscript. Y.P. assisted in implementing the code and discussed the manuscript. P.B. assisted in implementing the code and discussed the manuscript. Z.L. guided the research and discussed the results. S.H. guided the research and discussed the results. G.W., Li.W., and J.L. guided the research, implemented parts of the code, and revised the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Informed consent
The authors affirm that human research participants provided informed consent for publication of identifying information/images in an online open-access publication.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, S., Wang, G., Pang, Y. et al. Learning agility and adaptive legged locomotion via curricular hindsight reinforcement learning. Sci Rep 14, 28089 (2024). https://doi.org/10.1038/s41598-024-79292-4