Understanding the World Through Action: RL as a Foundation for Scalable Self-Supervised Learning
Machine learning systems have mastered a breadth of challenging problems in domains ranging from computer vision to speech recognition and natural language processing, and yet the question of how to design learning-enabled systems that match the flexibility and generality of human reasoning remains out of reach. This has prompted considerable discussion about what the “missing ingredient” in modern machine learning might be, with a number of hypotheses put forward as candidates for the big question that the field must resolve. Is the missing ingredient causal reasoning, inductive bias, better algorithms for self-supervised or unsupervised learning, or something else entirely?
This question is a difficult one, and any answer must necessarily involve a great deal of guesswork, but the lessons we can glean from recent progress in artificial intelligence can provide us with several guiding principles.
The first lesson is in the “unreasonable” effectiveness of large, generic models supplied with large amounts of training data. As eloquently articulated by Richard Sutton in his essay on the “bitter lesson,” as well as by a number of other researchers in machine learning, a persistent theme in the recent history of ML research has been that methods that effectively leverage large amounts of computation and large amounts of data tend to outperform methods that rely on manually engineered priors and heuristics. While a full discussion of the reasons for this trend is outside the scope of this article, in short it could be summarized (or perhaps caricatured) as follows: when we engineer biases or priors for our models, we are injecting our own imperfect knowledge of how the world works, which biases the model toward some solutions over others. When the model instead gleans this knowledge from data, it arrives at conclusions that are more accurate than those we engineered ourselves, and therefore works better. Indeed, a similar pattern has been observed in how people acquire proficiency. As discussed by Dreyfus, “rule-based” reasoning that follows rules we can articulate clearly tends to provide people with only “novice-level” performance at various skills, while “expert-level” performance is associated with a mess of special cases, exceptions, and patterns that people struggle to articulate clearly, and yet can leverage seamlessly in the moment when the situation demands it. As Dreyfus points out, real human experts are rarely able to articulate the rules they actually follow when exhibiting their expertise, so it should come as no surprise that, just as we must acquire expertise from experience, so must our machines. And to do that, they will need powerful, high-capacity models that impose comparatively few biases and can handle the large amounts of experience that will be needed.
The second, more recent lesson is that manual labeling and supervision do not scale nearly as well as unsupervised or self-supervised learning. We’ve already seen that unsupervised pre-training has become standard in natural language processing, and perhaps will soon become standard across other fields. In a sense, this lesson is a corollary to the first one: if large models and large datasets are the most effective, then anything that limits how large the models and datasets can be will eventually end up as the bottleneck. Human supervision can be one such bottleneck: if all data must be labeled manually by a person, then less data will be available for the system to learn from. However, here we reach a conundrum: current methods for learning without human labels often violate the principles outlined in the first lesson, requiring considerable human insight (which is often domain-specific!) to engineer self-supervised learning objectives that allow large models to acquire meaningful knowledge from unlabeled datasets. These include relatively natural tasks such as language modeling, as well as comparatively more esoteric tasks, such as predicting whether two transformed images were produced by the same original image or by two different ones. The latter is a widely used and successful approach in modern self-supervised learning in computer vision. While such approaches can be effective up to a point, it may well be that the next bottleneck we face is in deciding how to train large models without requiring manual labeling or manual design of self-supervised objectives, so as to acquire models that distill a deep and meaningful understanding of the world and can perform downstream tasks with robustness, generalization, and even a degree of common sense.
I will argue that such methodology could be developed out of current algorithms for learning-based control (reinforcement learning), though it will require a number of substantial algorithmic innovations that will allow such methods to go significantly beyond the kinds of problems they’ve been able to tackle so far. Central to this idea is the notion that, in order to control the environment in diverse and goal-directed ways, autonomous agents will necessarily need to develop an understanding of their environment that is causal and generalizable, and hence will address many of the shortcomings of current supervised models. At the same time, this will require going beyond the current paradigm in reinforcement learning in two important ways. First, reinforcement learning algorithms require a task goal (i.e., a reward function) to be specified by hand by the user, and then learn the behaviors necessary to accomplish that task goal. This of course greatly limits their ability to learn without human supervision. Second, reinforcement learning algorithms in common use today are not inherently data-driven, but rather learn from online experience, and although such methods can be deployed directly in real-world environments, online active data collection limits their generalization in such settings, and many uses of reinforcement learning instead take place in simulation, where there are few opportunities to learn about how the real world works.
Learning Through Action
Insofar as artificial intelligence systems are useful, it is because they provide inferences that can be used to make decisions, which in turn affect something in the world. Therefore, it is reasonable to conclude that a general learning objective should be one that provides impetus to learn those things that are most useful for affecting the world in meaningful ways. Making decisions that create desired outcomes is the purview of reinforcement learning and control. Therefore, we should consider how reinforcement learning can provide the sort of automated and principled objectives for training high-capacity models that can endow them with the ability to understand, reason, and generalize.
However, this will require addressing the two limitations noted above: reinforcement learning requires manually defined reward functions, and it requires an active learning paradigm that is difficult to reconcile with the need to train on large and diverse datasets. To address the issue with objectives, we can develop algorithms that, rather than aiming to perform a single user-specified task, aim to accomplish whatever outcomes they infer are possible in the world. Potential objectives for such methods could include learning to reach any feasible state, learning to maximize mutual information between latent goals and outcomes, or learning through principled intrinsic motivation objectives that lead to broad coverage of possible outcomes. To address the issue with data, we must develop reinforcement learning algorithms that can effectively utilize previously collected datasets. These are offline reinforcement learning algorithms, and they can provide a path toward training RL systems on broad and diverse datasets in much the same manner as in supervised learning, followed by some amount of active online finetuning to attain the best performance.
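To make the mutual information objective above a bit more concrete, here is a minimal sketch (in PyTorch, with illustrative network sizes; none of the names or numbers below come from this article) of a DIAYN-style intrinsic reward. A learned discriminator tries to infer which latent intention z produced a given state, and the agent is rewarded for visiting states from which its intention is easy to infer, which approximately maximizes the mutual information between latent intentions and outcomes:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SkillDiscriminator(nn.Module):
    """Hypothetical discriminator q(z | s): infer which latent skill z produced state s."""

    def __init__(self, state_dim: int, num_skills: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_skills),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # unnormalized logits over skills


def intrinsic_reward(disc: SkillDiscriminator, state: torch.Tensor,
                     skill: int, num_skills: int) -> torch.Tensor:
    """r(s, z) = log q(z | s) - log p(z), with p(z) uniform over skills."""
    log_q = F.log_softmax(disc(state), dim=-1)[..., skill]
    log_p = -math.log(num_skills)
    return (log_q - log_p).detach()  # used as a reward signal, not backpropagated
```

The discriminator itself would be trained with a standard cross-entropy loss to predict the skill that was active in each visited state, and any off-the-shelf RL algorithm can then maximize the resulting reward.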
To provide a hypothetical example of a system that instantiates these ideas, imagine a robot that performs a variety of manipulation tasks (e.g., as in the example figure above). When given a user-specified goal, the robot accomplishes it. However, in its “spare time,” the robot imagines potential outcomes that it can produce, and then “practices” taking actions to produce them. Each such practice session deepens its understanding of the causal structure of the world. By utilizing offline RL, such a system would learn not only from the experience that it gathers actively online, but from all prior logged experience in all of the varied situations that it has encountered.
Of course, the notion of a real-world commercially deployed robotic system that “plays” with its environment in this way might seem far-fetched (it is also of course not a new idea). This is precisely why offline RL is important: since an offline algorithm would be comparatively indifferent to the source of the experience, the fraction of time that the robot spends accomplishing user-specified objectives versus “playing” could be adjusted to either extreme, and even a system that spends all of its time performing user-specified tasks can still use all of its collected experience as offline training data for learning to achieve any outcome. Such a system would still “play” with its environment, but only virtually, in its “memories.”
While robotic systems might be the most obvious domain in which to instantiate this design, it is not restricted to robotics, nor to systems that are embodied in the world in an analogous way to people. Any system that has a well-defined notion of actions can be trained in this way: recommender systems, autonomous vehicles, systems for inventory management and logistics, dialogue systems, and so forth. Online exploration may not be feasible in many of these settings, but learning with unsupervised outcome-driven objectives via offline RL is still possible. As mentioned previously, ML systems are useful insofar as they enable making intelligent decisions. It therefore stands to reason that any useful ML system is situated in a sequential process where decision-making is possible, and therefore such a self-supervised learning procedure should be applicable.
Unsupervised and Self-Supervised Reinforcement Learning
An unsupervised or self-supervised reinforcement learning method should fulfill two criteria: it should learn behaviors that control the world in meaningful ways, and it should provide some mechanism to learn to control it in as many ways as possible. This problem should not be confused with the closely related problem of exploration, which has also often been formulated as a problem of attaining broad coverage, but which is not generally concerned with learning to control the world in meaningful ways in the absence of a task objective. That is, exploration methods provide an objective for collecting data, rather than utilizing it.
Perhaps the most direct way to formulate a self-supervised RL objective is to frame it as the problem of reaching a goal state. The problem then corresponds to training a goal-conditioned policy. This problem formulation provides considerable depth, with connections to density estimation, variational inference, and model-based reinforcement learning.
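As a deliberately simplified illustration of this formulation, the sketch below conditions a Q-function on a goal state and uses a sparse “did we reach the goal?” reward; the architecture and the distance threshold are illustrative assumptions, not details taken from any particular method discussed here:

```python
import torch
import torch.nn as nn


class GoalConditionedQ(nn.Module):
    """Q(s, a, g): value of taking action a in state s for reaching goal g."""

    def __init__(self, state_dim: int, action_dim: int, goal_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action, goal):
        return self.net(torch.cat([state, action, goal], dim=-1)).squeeze(-1)


def goal_reward(next_state: torch.Tensor, goal: torch.Tensor,
                threshold: float = 0.05) -> torch.Tensor:
    """Sparse reward: 1 when the next state is (approximately) the goal."""
    return ((next_state - goal).norm(dim=-1) < threshold).float()
```

A complete method would also specify how goals are sampled during training, for example by relabeling states that the agent actually reached in its own experience, as sketched later for the offline setting.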
What does a policy trained to reach all possible goals learn about the world? Solving such goal-conditioned RL problems corresponds to learning a kind of dynamics model. Intuitively, being able to bring about any potential desired outcome requires a deep understanding of how actions affect the environment over a long horizon. However, unlike model-based RL, where the model objective is largely disconnected from actually bringing about desired outcomes, the goal-conditioned RL objective is connected to long-horizon outcomes very directly. Therefore, insofar as the end goal of an ML system is to bring about desired outcomes, we would expect that the objective of goal-conditioned RL would be well-aligned.
However, current methods are not without limitations. Even standard goal-conditioned RL methods can be difficult to use and unstable. But even more importantly, goal-reaching does not span the full set of possible tasks that could be specified in RL. Even if an agent learns to accomplish every outcome that is possible in a given environment, there may not exist a single desired outcome that would maximize an arbitrary user-specified reward function. It may still be that such a goal-conditioned policy would have learned powerful and broadly applicable features, and could be readily finetuned to the downstream task, but an interesting problem for future work is to better understand whether more universal self-supervised objectives could lift this limitation, perhaps building on methods for general unsupervised skill learning.
Offline Reinforcement Learning
As discussed previously, offline RL can make it possible to apply self-supervised or unsupervised RL methods even in settings where online collection is infeasible, and such methods can serve as one of the most powerful tools for incorporating large and diverse datasets into self-supervised RL. This is likely to be essential to make this a truly viable and general tool for large-scale representation learning. However, offline RL presents a number of challenges. Foremost among these is that offline RL requires answering counterfactual questions: given data that shows one outcome, can we predict what would have happened if we had taken a different action? This is of course very challenging in general.
Nonetheless, our understanding of offline RL has advanced significantly over the past few years, with substantial improvements in performance (see, e.g., IQL).
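To give a flavor of what such an algorithm looks like, the snippet below sketches the expectile regression loss at the heart of IQL: the state-value function is regressed toward an upper expectile of the Q-values of actions that actually appear in the dataset, which approximates a maximum without ever querying the Q-function at out-of-distribution actions. Treat this as a sketch of the core idea rather than a faithful reimplementation:

```python
import torch


def expectile_value_loss(q_values: torch.Tensor, v_values: torch.Tensor,
                         expectile: float = 0.7) -> torch.Tensor:
    """IQL-style value loss: L = E[ |tau - 1(u < 0)| * u^2 ], with u = Q(s, a) - V(s).

    With tau > 0.5 this behaves like an implicit maximum over actions that are
    actually present in the dataset. tau = 0.7 is an illustrative choice.
    """
    u = q_values - v_values
    weight = torch.abs(expectile - (u < 0).float())  # tau above V, (1 - tau) below
    return (weight * u.pow(2)).mean()
```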
Advances in offline RL have the potential to significantly increase the applicability of self-supervised RL methods. Using the tools of offline RL, it is possible to construct self-supervised RL methods that do not require any exploration of their own. Much like the “virtual play” mentioned before, we can utilize offline RL in combination with goal-conditioned policies to learn entirely from previously collected data. A couple of examples are shown in the figures above, illustrating applications of goal-conditioned policies to complex real-world robotic learning problems, where robots learn to navigate diverse environments or perform a wide range of manipulation tasks entirely from data collected previously for other applications. Such methods can even provide powerful self-supervised auxiliary objectives or pretraining for downstream user-specified tasks, in a manner analogous to unsupervised pretraining methods in other fields (e.g., BERT). However, offline RL algorithms inherit many of the difficulties of standard (deep) RL methods, including sensitivity to hyperparameters. These difficulties are further exacerbated by the fact that we cannot perform multiple online trials to determine the best hyperparameters. In supervised learning we can deal with such issues by using a validation set, but a corresponding equivalent in offline RL is lacking. We need algorithms that are more stable and reliable, as well as effective methods for evaluation, in order to make such approaches truly broadly applicable.
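To connect this back to the “virtual play” idea, the sketch below shows one simple way (a hypothetical construction, not the code behind the systems referenced above) to turn logged, reward-free trajectories into goal-conditioned training data via hindsight relabeling; an offline RL method such as the one sketched earlier can then consume these transitions without any further data collection:

```python
import random


def hindsight_relabel(trajectory):
    """trajectory: list of (state, action, next_state) tuples from a log.

    Returns (state, action, goal, reward, next_state) tuples in which each goal
    is a state actually reached later in the same trajectory, so every logged
    trajectory becomes a successful example of reaching *some* outcome.
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        future = random.randint(t, len(trajectory) - 1)  # sample a future step
        goal = trajectory[future][2]                      # the state reached there
        reward = 1.0 if future == t else 0.0              # goal reached at this step?
        relabeled.append((s, a, goal, reward, s_next))
    return relabeled
```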
Concluding Remarks
I discussed how self-supervised reinforcement learning combined with offline RL could enable scalable representation learning. Insofar as learned models are useful, it is because they allow us to make decisions that bring about desired outcomes in the world. Therefore, self-supervised training with the goal of bringing about any possible outcome should provide such models with the requisite understanding of how the world works. Self-supervised RL objectives, such as those in goal-conditioned RL, have a close relationship with model learning, and fulfilling such objectives is likely to require policies to gain a functional and causal understanding of the environment in which they are situated. However, for such techniques to be useful, it must be possible to apply them at scale to real-world datasets. Offline RL can play this role, because it enables using large, diverse, previously collected datasets. Putting these pieces together may lead to a new class of algorithms that can understand the world through action, leading to methods that are truly scalable and automated.
This article is a modified and slightly condensed version of the paper “Understanding the World Through Action,” which will appear in CoRL 2021 (Blue Sky Track), presented in November 2021 in London, UK.