OmniRL: Large-Scale Meta-Training in Randomized Worlds
Give AI a fish, and you feed it for a day. Teach AI to fish, and it feeds itself for a lifetime.
[Paper]
[Code]

OmniRL Introduction

OmniRL: Real-Time Contextual Reinforcement Learning Base Model

Pre-Training and Meta-Training: Giving Models Fish vs. Teaching Models to Fish


Pre-training originally aimed to teach models to perform a variety of tasks directly. In large-scale pre-training, however, the most important emergent capability beyond memorization of the training tasks themselves is in-context learning, which lets large models tackle new tasks through prompts.
OmniRL proposes large-scale meta-training, which differs from pre-training in that its goal is not to memorize the skills required by the training tasks, but to learn the process of reinforcement learning itself. Meta-learning, also known as learning to learn, was proposed as early as the 1980s. However, the OmniRL paper argues that meta-learning without large-scale tasks and long-sequence support tends to fall into a "task recognition" mode: the model merely memorizes the training environments and, at inference time, activates the corresponding skill by recognizing which environment it is in. This mode does not generalize to unseen or out-of-distribution tasks.

Randomized Worlds: AnyMDP


Example of a Randomized World Generated by AnyMDP

Example of a randomized world generated by AnyMDP. The color of each point indicates the average reward of the corresponding state, and the darkness of each line indicates the average transition probability between states.

AnyMDP builds randomized transition probabilities and reward functions on top of Markov Decision Processes (MDPs), enabling the rapid, low-cost generation of a vast and scalable set of environments for meta-reinforcement learning. We generated over 500,000 different tasks and synthesized over 10 billion time steps of data for meta-training, with the longest single sequence exceeding 1 million time steps.
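To make the construction concrete, here is a minimal sketch (our own illustration, not the actual AnyMDP generator, which adds more structure such as controlling connectivity and reward sparsity) of sampling a random tabular MDP with NumPy:

```python
import numpy as np

def sample_random_mdp(num_states=16, num_actions=4, seed=0):
    """Sample a random tabular MDP: transitions P[s, a, s'] and rewards R[s, a].

    Illustrative sketch only; not the actual AnyMDP generator.
    """
    rng = np.random.default_rng(seed)
    # Each (state, action) pair gets a random categorical distribution over next states.
    P = rng.dirichlet(alpha=np.ones(num_states) * 0.1, size=(num_states, num_actions))
    # Rewards are drawn independently per (state, action) pair.
    R = rng.normal(loc=0.0, scale=1.0, size=(num_states, num_actions))
    return P, R

P, R = sample_random_mdp()
assert np.allclose(P.sum(axis=-1), 1.0)  # every row is a valid transition distribution
```

Because each environment is just a freshly sampled pair of tensors, generating hundreds of thousands of distinct tasks, and data on the scale of 10 billion time steps, stays cheap.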

Unifying Multiple Reinforcement Learning Paradigms and Imitation Learning Through In-Context Learning for the First Time


OmniRL proposes leveraging both prior information and posterior feedback for in-context learning, allowing the model to autonomously switch between different learning modes as needed. The figure below shows that the OmniRL model, trained in randomized worlds, can achieve strong performance purely through in-context learning, without relying on any gradient updates. Whether starting from scratch or given a demonstration trajectory (expert or suboptimal), it can autonomously switch between online reinforcement learning (Online-RL), offline reinforcement learning (Offline-RL), and imitation learning (IL), demonstrating the great flexibility of in-context learning. Furthermore, starting from a demonstration, it can continue to improve through autonomous exploration.
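Conceptually, the unification comes from putting everything into one context sequence: an optional demonstration trajectory serves as the prior, and the agent's own interactions with reward feedback serve as the posterior signal. The sketch below is a simplified illustration of this idea (not OmniRL's actual interface; `model.act` and the `env` API are hypothetical placeholders):

```python
# Illustrative sketch of how a single context can cover IL, Offline-RL and Online-RL.
# `model.act(...)` and the `env` API are hypothetical placeholders.

def build_context(demonstration=None):
    """Optionally seed the context with a demonstration (expert or suboptimal)."""
    context = []
    if demonstration is not None:
        # Imitation / Offline-RL style prior: prompt with a pre-collected trajectory.
        for (state, action, reward) in demonstration:
            context.append(("demo", state, action, reward))
    return context

def run_episode(env, model, context):
    """Online-RL: the agent's own feedback is appended to the same context."""
    state = env.reset()
    done = False
    while not done:
        action = model.act(context + [("query", state)])
        state_next, reward, done = env.step(action)
        context.append(("self", state, action, reward))  # posterior feedback
        state = state_next
    return context
```

With an empty context the same loop behaves as from-scratch Online-RL; with a demonstration and no further interaction it behaves as IL or Offline-RL; with both, the agent refines the demonstrated policy through its own exploration.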
OmniRL's performance in completely unseen Gymnasium environments (panels: Cliff, Lake, Pendulum, Switch).


OmniRL-trained agents can even accomplish simple multi-agent collaboration tasks. By incorporating the other agent's state into each agent's observation, they can complete tasks such as Switch, which require the agents to adopt different behavioral patterns in order to collaborate. Thanks to the model's in-context learning and adaptation capabilities, two OmniRL-controlled agents can solve such tasks effectively.
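A minimal sketch of this observation augmentation (our own illustration with a hypothetical encoding; the concrete state layout depends on the environment) is:

```python
import numpy as np

def joint_observation(own_state, other_state):
    """Each OmniRL-controlled agent observes its own state plus the other agent's state.

    Illustrative sketch only; the concrete encoding depends on the task (e.g. Switch).
    """
    return np.concatenate([np.atleast_1d(own_state), np.atleast_1d(other_state)])

# Example: two agents in a corridor, each sees both positions.
obs_agent_0 = joint_observation(own_state=2, other_state=5)   # -> array([2, 5])
obs_agent_1 = joint_observation(own_state=5, other_state=2)   # -> array([5, 2])
```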

Revealing for the First Time Why Data Diversity and Sequence Length Matter


Figure: the relationship between the model's positional loss, meta-training steps, and context length (panels: Seen_Training, Unseen_Training).

OmniRL uses a Transformer with tens of millions of parameters and an efficient linear attention structure for modeling; the number of training tasks exceeds 500,000, and sequence lengths reach over 1 million time steps. In its experiments, OmniRL compares the effect of training on the same amount of data drawn from different numbers of tasks. It finds that when the number of tasks is insufficient, the model falls into a memorization-plus-recognition mode: all training environments are stored in parametric memory and quickly identified from the context. In this mode, the agent adapts to environments seen during training with fewer samples, but cannot generalize to unseen environments. Only when the number of tasks is sufficiently large is a general in-context learning capability activated; this capability generalizes effectively to unseen tasks, but requires a longer in-context learning period on all tasks. This explains why both data diversity and sequence length matter: task diversity is what forces the model to learn a general learning procedure rather than memorize environments, and that procedure only pays off when the context is long enough for in-context learning to run its course.
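As a rough sketch of the ablation setup described above (not the exact experimental code; the task counts mirror the figure labels), "the same amount of data from different numbers of tasks" means holding the total step budget fixed while varying how many tasks it is spread over:

```python
def data_plan(total_steps, num_tasks):
    """Fixed data budget, variable task diversity (illustrative sketch only)."""
    return {"num_tasks": num_tasks, "steps_per_task": total_steps // num_tasks}

budget = 10_000_000_000  # ~10 billion time steps of meta-training data
print(data_plan(budget, num_tasks=16))       # few tasks: memorization + task recognition
print(data_plan(budget, num_tasks=500_000))  # many tasks: general in-context learning
```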

Linear Self-Attention Mechanism Demonstrates Clear Advantages in Efficiency and Long-Sequence Performance


Figure: the relationship between the model's positional loss, meta-training steps, and context length (panels: gsa, task_16, task_64).

OmniRL also demonstrates the advantages of the linear attention mechanism for the first time. As the problem scale grows, the required context length grows with it, and the efficiency bottleneck of standard Transformers becomes more pronounced. In contrast, linear attention shows clear advantages in efficiency and long-sequence modeling, and it also significantly outperforms sliding-window attention on long stretches of the sequence. This also shows that AnyMDP provides an excellent evaluation environment for long-sequence modeling.
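The underlying reason is that linear attention maintains a constant-size recurrent state instead of an ever-growing key-value cache, so per-step cost does not increase with context length. Below is a minimal NumPy sketch of the recurrence (a generic kernelized linear attention step, not OmniRL's actual layer):

```python
import numpy as np

def linear_attention_step(S, z, q, k, v):
    """One step of kernelized linear attention with a running state.

    S: running sum of outer(phi(k), v), shape (d_k, d_v)
    z: running sum of phi(k), shape (d_k,)
    The state has fixed size, so each step costs O(d_k * d_v) no matter how many
    tokens came before; softmax attention's per-step cost grows with context length.
    Illustrative sketch only, not OmniRL's actual attention layer.
    """
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)  # simple positive feature map
    S = S + np.outer(phi_k, v)
    z = z + phi_k
    out = phi_q @ S / (phi_q @ z + 1e-8)
    return S, z, out

d = 8
S, z = np.zeros((d, d)), np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(10_000):  # the context grows, the state does not
    q, k, v = rng.normal(size=(3, d))
    S, z, out = linear_attention_step(S, z, q, k, v)
```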

Technological Exploration for Next-Generation General Embodied Agents


Our ultimate goal is to create agents capable of fully autonomous exploration and learning in any environment, which is particularly significant for embodied intelligence. Large language models capture a vast amount of commonsense knowledge, encyclopedic information, and mathematical logic through parametric memory, which forms the basis of their zero-shot capabilities. Embodied intelligence, however, faces diverse environments and tasks and highly heterogeneous embodiments, so commonsense knowledge alone is an insufficient foundation for solving embodied problems. We believe that autonomous learning capabilities and long-term memory will be key to general embodied agents.

Similarities and Differences with Long-Sequence Reasoning and Chain-of-Thought in Current Large Language Models

Currently, OmniRL focuses more on the learning capabilities of System 1 (intuitive thinking), whereas long-sequence reasoning and chain-of-thought focus on System 2 (logical thinking and planning) itself. For both System 1 and System 2 capabilities, current mainstream large models remain under-explored, and OmniRL fills many gaps in this area.