Model-Based Reinforcement Learning under Sparse Rewards
Thesis Report Bibtex


What are the key components of Model-Based Reinforcement Learning algorithms to address the sparse reward problem?

Abstract

Reinforcement Learning (RL) has seen significant advances over the last decade in simulated and controlled environments, with impressive results in difficult decision-making problems such as playing video games or controlling robot arms. However, most methods require many interactions with the system to achieve good performance, which can be costly and time-consuming, especially in industrial applications. Model-Based Reinforcement Learning (MBRL) promises to close this gap by leveraging learned environment models for data generation and/or planning while remaining sample efficient. Nevertheless, learning with sparse rewards remains a significant challenge in RL, and the sparsity of rewards must be addressed to promote efficient learning. This thesis studies individual components of MBRL algorithms under sparse reward settings and investigates different design choices to measure their impact on learning efficiency. Suitable Integral Probability Metrics (IPMs) are introduced to understand the reward and observation distributions of the predictive model during training. These design combinations are evaluated on continuous control tasks with established benchmarks.


Introduction

The major motivation for studying sparse reward problems is that they require close to no domain knowledge to design a reward function for a given task or environment. For example, in a continuous control task where a robotic arm is required to stack an object on top of other objects, we can assign a sparse reward of 1 if the arm successfully stacks the object and 0 otherwise. Designing intermediate rewards in this scenario can be challenging and in most cases impractical. The sparse reward problem has been addressed in previous state-of-the-art literature in a plethora of ways. However, research on MBRL under sparse reward settings has not answered some key questions, such as how accurate and diverse the model should be, how much off-policy data must be stored in the model buffer, and how long the rollout trajectory should be. While authors regularly experiment with and propose various algorithms for solving sparse reward tasks, it remains an open question which key ingredients and design choices are essential for solving them. The study carried out in this thesis helps us understand why some hyperparameter design choices work better than others, viewed through the lens of the dataset used to train the agent.
The primary focus of this study is understanding the essential elements within Model-Based Reinforcement Learning algorithms that effectively tackle the challenge of sparse rewards. The objective of this thesis is to thoroughly examine a current state-of-the-art MBRL algorithm, namely Model-Based Policy Optimization (MBPO), and investigate the impact of various design decisions when dealing with sparse reward scenarios. Modifications suggested in the literature are used as a starting point for benchmarking performance on a customized sparse inverted pendulum and a sparse continuous mountain car, both of which have stochastic initial states and are well suited for testing continuous control tasks.


Model-Based Policy Optimization

MBPO is a natural choice of model-based algorithm to investigate further because it is a direct extension of its model-free counterpart, SAC. In this approach, the learned model is used to gather imaginary data to train a policy. To gather this imaginary data, a bootstrap ensemble of dynamics models is utilized, where each member of the ensemble is a probabilistic neural network. The probabilistic neural networks capture aleatoric uncertainty (noise in the outputs relative to the inputs), while bootstrapping captures epistemic uncertainty (uncertainty in the model parameters). To generate a prediction, a model is sampled uniformly at random from the ensemble, so that different transitions along a single model rollout can come from different ensemble members.
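To make this concrete, below is a minimal PyTorch sketch of such a bootstrap ensemble of probabilistic dynamics models. The class and method names are illustrative rather than taken from the MBPO reference implementation, and reward prediction is omitted for brevity.

# Minimal sketch of a bootstrap ensemble of probabilistic dynamics models.
# Each member outputs the mean and log-variance of a Gaussian over the next state;
# a member is sampled uniformly at random for every predicted transition.
import torch
import torch.nn as nn


class GaussianDynamicsModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mean_head = nn.Linear(hidden, obs_dim)
        self.logvar_head = nn.Linear(hidden, obs_dim)

    def forward(self, obs, act):
        h = self.body(torch.cat([obs, act], dim=-1))
        return self.mean_head(h), self.logvar_head(h)


class BootstrapEnsemble:
    def __init__(self, obs_dim: int, act_dim: int, ensemble_size: int = 5):
        self.members = [GaussianDynamicsModel(obs_dim, act_dim)
                        for _ in range(ensemble_size)]

    @torch.no_grad()
    def predict(self, obs, act):
        # Epistemic uncertainty: pick a random ensemble member per prediction.
        member = self.members[torch.randint(len(self.members), (1,)).item()]
        mean, logvar = member(obs, act)
        # Aleatoric uncertainty: sample from the predicted Gaussian.
        std = (0.5 * logvar).exp()
        return mean + std * torch.randn_like(std)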

MBPO architecture
Figure 1: Detailed architecture depicting the individual components of an MBPO agent, which utilizes SAC as its policy optimization algorithm. The figure depicts the interaction of the SAC agent with the stochastic environment to collect real experience, which is used to train a predictive model that generates simulated experience. This simulated experience is in turn used as input to the SAC algorithm for policy optimization.

Policy optimization is performed using SAC (Soft Actor-Critic), which alternates between a policy evaluation step, which estimates the state-action value function, and a policy improvement step. For all experiments, we use the automatic entropy tuning flag, which adaptively modifies the entropy gain based on the stochasticity of the policy. Performing extended model rollouts leads to compounding model errors. To mitigate this problem, the MBPO algorithm propagates short model rollouts branched from the state distribution of a previous policy under the true environment dynamics. As a result, a few long trajectories with large errors can be traded for many short trajectories with smaller errors. Figure 1 shows the architecture of an MBPO agent.
MBPO algorithm
Figure 2: Model-Based Policy Optimization Algorithm
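The following is a high-level Python sketch of the training loop summarized in Figure 2. The helper objects and methods (env_buffer, model_buffer, policy, ensemble and their calls) are hypothetical placeholders, a Gymnasium-style environment API is assumed, and G (the number of SAC gradient steps per environment step) is not among the hyperparameters listed later.

def mbpo_sketch(env, policy, ensemble, env_buffer, model_buffer,
                total_steps, k, M, F, G):
    """High-level sketch of the MBPO loop (Gymnasium-style env assumed)."""
    obs, _ = env.reset()
    for step in range(total_steps):
        # Retrain the dynamics ensemble on real data every F environment steps.
        if step % F == 0:
            ensemble.train_on(env_buffer)

        # Collect one real transition with the current SAC policy.
        act = policy.sample(obs)
        next_obs, rew, terminated, truncated, _ = env.step(act)
        env_buffer.add(obs, act, rew, next_obs, terminated)
        obs = env.reset()[0] if terminated or truncated else next_obs

        # Branch M short model rollouts of length k from states in the real buffer.
        for s in env_buffer.sample_states(M):
            for _ in range(k):
                a = policy.sample(s)
                s_next, r = ensemble.predict(s, a)  # random ensemble member per step
                model_buffer.add(s, a, r, s_next)
                s = s_next

        # G gradient steps of SAC on model-generated (optionally mixed) data.
        for _ in range(G):
            policy.sac_update(model_buffer.sample_batch())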


Hyperparameters

We now discuss the main model-based hyperparameters used in the ablation study, each of which plays an important role in the design choices of the MBRL algorithm.


rollout length
Figure 3: Rollout Length

Model rollout length (k) refers to the number of steps taken by the agent during the rollout phase. A rollout is a simulated sequence of actions taken by the agent, starting from an initial state and following a particular policy until a terminal state or a predefined time horizon is reached. The rollout length determines the length of the prediction horizon the model generates to estimate the expected returns of various policies.

number of rollouts per step
Figure 4: Number of rollouts per step

Number of rollouts per step (M) refers to the number of rollouts generated by the model per training step. Because this hyperparameter controls how many rollouts are executed under the model, it directly determines the amount of data the model generates.

updates to retain buffer
Figure 5: Number of updates to retain buffer

Updates to retain buffer (R) refers to the number of model updates whose generated data is retained in the model replay buffer. The replay buffer contains the agent's past experience, typically stored as (state, action, reward, next state, done) tuples.

Ensemble size
Figure 6: Ensemble size

Ensemble size (N) refers to the number of dynamics models used in the ensemble. Each model in the ensemble is a probabilistic neural network whose outputs parameterize a Gaussian distribution. The ensemble is responsible for capturing both aleatoric and epistemic uncertainty.

Frequency of model retrain (F) refers to how frequently the model is retrained during the training process, i.e., how often the model parameters are updated to improve its predictions.

Trainer patience (P) refers to the parameter used for early stopping during model training. If the validation performance does not improve for P consecutive epochs, training is halted to prevent overfitting the model.
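A minimal sketch of this patience-based early stopping is given below; the training and evaluation helpers are passed in as hypothetical placeholders and are not part of the thesis code.

def train_with_patience(model, train_data, val_data, train_one_epoch, evaluate,
                        max_epochs: int, patience: int) -> float:
    """Stop model training once validation loss fails to improve for `patience` epochs."""
    best_val, epochs_since_improvement = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch(model, train_data)      # hypothetical training helper
        val_loss = evaluate(model, val_data)    # hypothetical validation helper
        if val_loss < best_val:
            best_val, epochs_since_improvement = val_loss, 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:  # patience P exhausted
            break
    return best_val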

Model-generated buffers are created by using the predictive model to generate additional synthetic data for the MBRL algorithm. They augment the environment dataset and improve the agent's learning performance by simulating imaginary actions and outcomes, expanding the range of collected experience. The capacity of the model-generated buffer is computed as k × M × F × R, so that it holds R iterations of data generation.
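For illustration, the capacity formula can be written out directly; the numbers below are placeholder values, not the settings used in the thesis.

# Illustrative computation of the model buffer capacity k * M * F * R described above.
rollout_length = 5        # k
rollouts_per_step = 400   # M
freq_model_retrain = 250  # F: environment steps between model retrains
retain_updates = 1        # R: data-generation iterations kept in the buffer

transitions_per_retrain = rollout_length * rollouts_per_step * freq_model_retrain
model_buffer_capacity = transitions_per_retrain * retain_updates
print(model_buffer_capacity)  # 500000 with these placeholder values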


Methodology and Experimental Setup

Integral probability metrics are employed to measure the impact of design choices by evaluating the discrepancy between the distributions of observations and rewards seen by the model and those of the real environment. Specifically, the Wasserstein distance is used to quantify the discrepancy between the rewards observed by the model and the reward distribution of the environment, while the maximum mean discrepancy (MMD) is used to assess the dissimilarity between the state distribution observed by the model and the state distribution of the actual environment. These metrics help illuminate how the model perceives rewards and states under different components of the MBRL algorithm.
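To make the metric computations concrete, here is a minimal sketch of how the two distances could be computed from samples, using SciPy's one-dimensional Wasserstein distance and a simple RBF-kernel MMD estimator; the variable names and kernel bandwidth are illustrative.

# Sketch of the two distance metrics used in the analysis.
import numpy as np
from scipy.stats import wasserstein_distance


def mmd_rbf(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Squared maximum mean discrepancy between samples x, y with a Gaussian kernel."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


# model_rewards / env_rewards: 1-D arrays of rewards from the model and environment buffers
# model_states / env_states: 2-D arrays (n_samples, obs_dim) of visited states
# w_dist = wasserstein_distance(model_rewards, env_rewards)
# mmd = mmd_rbf(model_states, env_states)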
The study was carried out in two continuous control sparse reward environments, namely the Sparse Inverted Pendulum and the Continuous Mountain Car, shown in Figures 7 and 8 respectively.


sparse pendulum
Figure 7: Sparse Pendulum
mountain car
Figure 8: Continuous Mountain Car

Evaluation and Analysis

We initially investigate the performance of model-free SAC and model-based MBPO agents in environments with sparse rewards. Two scenarios are considered: a sparse pendulum environment and a continuous mountain car environment. The introduction of an action cost, representing a penalty for certain actions, aims to make solving the sparse reward problem more challenging. The results show that in the absence of an action cost, both model-free and model-based agents successfully find sparse rewards. The model-based agent tends to converge faster. However, when an action cost is applied, the model-free SAC struggles due to its reliance on unstructured exploration. In contrast, the model-based MBPO agent performs better, utilizing its learned model to explore effectively and find solutions for a significant portion of trials. Similar conclusions are drawn from a study on the continuous mountain car problem. The SAC algorithm's unstructured exploration is found insufficient for these challenging tasks, while MBPO achieves better results.

mbpo vs sac
Figure 9: (a) Learning curves of SAC and MBPO agents with and without action cost in the sparse pendulum environment. (b) Learning curves of SAC and MBPO agents in the sparse continuous mountain car environment. Results show average returns over 20 random seeds, smoothed by a moving average filter; we report the mean (solid lines) and standard error (shaded regions).

Figure 10 illustrates an experiment using an MBPO agent in the sparse pendulum environment, depicting how the reward distribution changes during training. The comparison involves the model predictions, the actual environment, and the true reward distribution; the latter is generated by passing model observations and actions through the environment's reward function. Initially, the agent explores widely due to an inaccurate model, leading to a broad reward distribution. With experience and model updates, predictions become more accurate, especially once actual rewards begin to appear. At first the distribution is wide around the sparse reward regions due to limited data; as experience accumulates around the reward areas, the distribution narrows and concentrates around the reward. This analysis enhances understanding of how the model perceives rewards. The study further explores how various hyperparameter combinations impact the model's perception and reward distribution, aiding in solving sparse reward problems.
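A small sketch of how such a ground-truth comparison can be produced is shown below, assuming access to the environment's reward function; reward_fn is a hypothetical handle to that function, not a standard Gym attribute.

# Model-generated (observation, action) pairs are pushed through the environment's
# reward function to obtain the "true" rewards compared against in Figure 10.
import numpy as np


def true_reward_distribution(model_obs, model_act, reward_fn):
    """Return ground-truth rewards for model-generated observation/action pairs."""
    return np.array([reward_fn(o, a) for o, a in zip(model_obs, model_act)])

# These rewards can then be compared (e.g. via histograms or the Wasserstein distance)
# against both the model-predicted rewards and the rewards observed in the environment buffer.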

reward distribution
reward distribution
reward distribution
Figure 10: Visualization of the reward distribution and its pdf during training in the sparse pendulum swing-up task using the MBPO agent. (Left column) Reward distribution predicted by the model. (Center column) Reward distribution as original rewards start appearing in the environment. (Right column) True reward distribution, where model observations and actions are passed through the reward function, serving as ground truth for comparison with the environment rewards.

Rollout Length Ablation: Figure 11 compares rollout lengths in the sparse pendulum and mountain car environments. Longer rollouts yield higher returns, highlighting the need for exploration; short rollouts limit exploration and risk getting stuck. The hypothesis is supported in both environments, with better results for longer rollouts. In the Wasserstein metric plots, the difference between model and environment reward distributions decreases over training in the sparse pendulum, while longer rollout lengths increase the disparity in the mountain car, reflecting early exploration of unfamiliar states that converges as training progresses. In the MMD distance plots, higher rollout lengths initially cause greater disparity between model-generated and environment state distributions; as training advances, the MMD distance decreases and the distributions align. A rollout length of 1 shows unique behavior, with the MMD distance increasing as training progresses, indicating on-policy operation that closely resembles the environment states.

rollout length ablation
Figure 11: Ablation study over the rollout length hyperparameter conducted using the MBPO agent in the sparse pendulum (Top row) and mountain car continuous (Bottom row) environments. (Left column) We report the learning curves, (Center column) corresponds to the Wasserstein distance metric, and (Right column) corresponds to the MMD distance metric of different rollout lengths.

Rollouts per Step Ablation: Figure 12 compares the number of rollouts per step in the sparse pendulum and mountain car environments. In the sparse pendulum, M = 1000 resulted in lower returns, possibly due to noise from excessive rollouts or model complexity leading to memorization. In the mountain car, M = 1000 initially sped up training but eventually showed a similar pattern to the sparse pendulum. The Wasserstein and MMD distance plots show consistent model behavior across different numbers of initial states drawn from the environment buffer: the model visits the same states and receives identical rewards during training.

rollouts per step ablation
Figure 12: Ablation study over the number of rollouts per step hyperparameter conducted using the MBPO agent in the Sparse Pendulum (Top row) and Mountain Car Continuous (Bottom row) environments. (Left column) We report the learning curves, (Center column) corresponds to the Wasserstein distance metric, and (Right column) corresponds to the MMD distance metric of different rollouts per step.

Updates to Retain Buffer Ablation: Figure 13 compares the number of updates to retain buffer in the sparse pendulum and mountain car environments. The evaluation return plots show that in the sparse pendulum, a retain value of 1 yields on-policy behavior and contributes to better learning performance. In the more complex mountain car, having more off-policy data helps solve the environment, indicating that retaining past information assists the model. In the Wasserstein distance plots, both sparse reward environments initially show consistent model-environment discrepancies for any amount of off-policy data retention, and smaller retain values lead to a faster reduction in the discrepancy. The MMD distance plots exhibit similar trends, with higher retain values requiring more environment steps to converge.

updates to retain buffer
Figure 13: Ablation study over the number of updates to retain buffer hyperparameter conducted using the MBPO agent in the Sparse Pendulum (Top row) and Mountain Car Continuous (Bottom row) environments. (Left column) We report the learning curves, (Center column) corresponds to the Wasserstein distance metric, and (Right column) corresponds to the MMD distance metric of different updates to retain buffer.

Reduced MBPO Performance

To identify the key hyperparameters in the MBRL algorithm, the model-based hyperparameters of MBPO were adjusted so that its performance approximately matches that of model-free SAC in the sparse reward settings. This is possible because MBPO employs SAC as its policy optimization component, making it effectively an extension of SAC in a model-based context. The reduction was achieved over 20 seeds with an ensemble size of 1. Starting from this reduced configuration, individual hyperparameters were then scaled up to analyze their influence on MBPO's performance under sparse rewards.

reduced mbpo
Figure 14: Learning curves representing the reduced performance of an MBPO agent operating close to a model-free SAC agent in both the sparse pendulum and mountain car continuous environments. The results are the average (solid lines) and standard error (shaded regions) over 20 random seeds.

Final Ablation Study

After reducing the performance of model-based MBPO to be close to that of model-free SAC, we performed an extensive study on the hyperparameters. We discuss here the most important findings; for more detailed insights, please refer to the full thesis report linked at the beginning of the page.


Figure 15 shows the impact of longer rollout lengths in the sparse reward environments. In the sparse pendulum, a rollout length of 20 was most effective, while in the mountain car, different rollout lengths had similarly positive impacts, all exceeding the baseline performance. The Wasserstein and MMD distance metrics confirmed these trends.

rollout length final ablation
Figure 15: Ablation study over longer rollout lengths using an MBPO agent operating close to a model-free SAC agent in both the sparse pendulum and mountain car continuous environments. The results are the average (solid lines) and standard error (shaded regions) over 20 random seeds.

Figure 16 investigates the effect of buffer update retention in the sparse reward environments. In the mountain car, more off-policy data improves learning significantly, while in the sparse pendulum, off-policy data has little impact and on-policy updates are favored. This observation aligns with the evaluation return plots and distance metric results, reinforcing the role of off-policy data in more complex learning scenarios such as the mountain car.

updates to retain buffer final ablation
Figure 16: Ablation study over updates to retain buffer using an MBPO agent operating close to a model-free SAC agent in the sparse pendulum and mountain car continuous environments. The results are the average (solid lines) and standard error (shaded regions) over 20 random seeds.

Conclusion

In this study, we examined the influence of different model-based hyperparameters on model learning and observed which combinations had the most substantial effect on the policy optimization process. We found that longer rollout lengths contributed substantially to the evaluation returns in both sparse reward environments. We also observed that an ensemble size of one failed to effectively capture the environment dynamics in either environment; an ensemble size greater than one was essential to capture the dynamics more accurately. Maintaining a moderate number of rollouts per step, combined with more frequent updates to the model parameters, proved sufficient to notably enhance evaluation returns, a pattern evident in both sparse reward environments. Regarding the retention of off-policy data, retaining more off-policy data in the mountain car environment led to a significant increase in returns, whereas retaining off-policy data in the sparse pendulum environment was not necessary.

Model-Based Reinforcement Learning under Sparse Rewards
Ravi Akash, Masters Student at University of Stuttgart,
Carlos Luis, PhD Student at Bosch Center for Artificial Intelligence,
Felix Berkenkamp, Research Scientist at Bosch Center for Artificial Intelligence, and
Mathias Niepert, Full professor (W3) at University of Stuttgart

Acknowledgments

I extend my sincere appreciation to my advisors, colleagues, family, and friends who have guided and supported me throughout this journey as a master's thesis student. Your expertise, encouragement, and valuable insights have been pivotal in shaping the outcomes of my research. This accomplishment would not have been possible without your unwavering support, and I am truly grateful for the opportunity to learn and grow under your guidance.

References
Janner, Michael, et al. "When to trust your model: Model-based policy optimization." Advances in neural information processing systems 32 (2019).