RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$

In NeurIPS Workshop on Generalization in Planning (NeurIPS GenPlan), 2023

Cite: Bhatia, A., Nashed, S. B., & Zilberstein, S. (2023). RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$. In NeurIPS Workshop on Generalization in Planning. https://openreview.net/pdf?id=ozqaF9YBce

TL;DR: Incorporating task-specific Q-value estimates as inputs to a meta-RL policy can lead to improved generalization and better performance over longer adaptation periods.


Meta reinforcement learning (meta-RL) methods such as RL$^2$ have emerged as promising approaches for learning data-efficient RL algorithms tailored to a given task distribution. However, they show poor asymptotic performance and struggle with out-of-distribution tasks because they rely on sequence models, such as recurrent neural networks or transformers, to process experiences rather than summarize them using general-purpose RL components such as value functions. In contrast, traditional RL algorithms are data-inefficient as they do not use domain knowledge, but they do converge to an optimal policy in the limit. We propose RL$^3$, a principled hybrid approach that incorporates action-values, learned per task through traditional RL, in the inputs to meta-RL. We show that RL$^3$ earns greater cumulative reward in the long term, compared to RL$^2$, while maintaining data-efficiency in the short term, and generalizes better to out-of-distribution tasks. Experiments are conducted on both custom and benchmark discrete domains from the meta-RL literature that exhibit a range of short-term, long-term, and complex dependencies.
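To make the core idea concrete, the following is a minimal sketch of how task-specific action-values might be appended to a meta-RL policy's input. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which traditional RL method estimates the Q-values, so tabular Q-learning replayed over the transitions seen so far is assumed here, and the names `q_learning_update` and `augmented_input`, the toy three-state task, and all hyperparameters are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, transitions, alpha=0.5, gamma=0.9, sweeps=20):
    """Replay the transitions observed so far in ONE task with tabular Q-learning.

    Assumption: the abstract only says action-values are "learned per task
    through traditional RL"; tabular Q-learning is one possible choice.
    """
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # Bootstrapped target; terminal transitions use the reward alone.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
    return q


def augmented_input(state_features, q, state):
    """Concatenate raw state features with the Q-value estimates for that state.

    In an RL^2-style agent, a vector like this would be fed to the sequence
    model at each step (hypothetical interface).
    """
    return list(state_features) + list(q[state])


n_actions = 2
q = defaultdict(lambda: [0.0] * n_actions)

# Hypothetical toy task: from state 0, action 1 leads to state 1, where
# action 1 ends the episode with reward 1; action 0 in state 0 loops back.
# Each transition is (state, action, reward, next_state, done).
transitions = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
]

q = q_learning_update(q, transitions)
x = augmented_input([1.0, 0.0], q, state=0)
print(x)  # two state features followed by Q(0, 0) and Q(0, 1)
```

In this toy task the estimate for action 1 in state 0 ends up higher than the estimate for action 0, so the augmented input already carries a compact summary of which action looks better, independent of how long the interaction history is.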

PDF (latest arXiv draft) | Slides | Code

RL$^3$ vs RL$^2$ Demo