Self-Improving Robots and the Importance of Data
How can robots help us figure out machine learning?
There is a popular idea that, with 10,000 hours of practice, a person can achieve mastery at whatever task they set themselves to. A bit of back-of-the-envelope calculation tells us that, so far, our machine learning models are quite a bit more data-hungry. For example, GPT-3 was trained on over 10 terabytes of data, which corresponds to billions of pages of text that, at average human reading speed, would take hundreds of millions of hours to read (on the order of a hundred lifetimes). State-of-the-art image models come out a bit better, but still far behind the curve: assuming one unique image per second, a generous assumption given that a real visual stream is strongly correlated in time, 10,000 hours would give us a dataset with tens of millions of images, a couple of orders of magnitude less than the billion-image datasets used for production vision systems. We know that for current machine learning models the quality of the data matters, but quantity has a quality all its own.
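To make the back-of-the-envelope concrete, here is the arithmetic as a few lines of Python. Every constant (reading speed, characters per page, waking hours in a lifetime) is a rough assumption, and nudging any of them shifts the answer between roughly one hundred and a few hundred lifetimes; only the orders of magnitude matter.

```python
# Back-of-the-envelope arithmetic; every constant here is a rough assumption.
CORPUS_BYTES = 10e12          # ~10 TB of text, roughly one byte per character
CHARS_PER_PAGE = 3000         # a dense printed page
CHARS_PER_WORD = 6            # ~5 letters plus a space
READ_SPEED_WPM = 250          # average adult reading speed (words per minute)
WAKING_HOURS_PER_LIFE = 75 * 365 * 16   # ~440,000 waking hours

pages = CORPUS_BYTES / CHARS_PER_PAGE                         # ~3 billion pages
reading_hours = CORPUS_BYTES / CHARS_PER_WORD / READ_SPEED_WPM / 60
print(f"{pages:.1e} pages, {reading_hours:.1e} hours to read, "
      f"~{reading_hours / WAKING_HOURS_PER_LIFE:.0f} lifetimes")

# The visual comparison: one unique image per second for 10,000 hours.
images = 10_000 * 3600        # 36 million images vs. ~1e9 in production datasets
print(f"{images:.1e} images")
```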
Perhaps we shouldn’t be too hard on these models: 10,000 hours is about two years of a person’s waking life, and we don’t approach each new skill or problem with a blank slate, but rather with the benefit of a lifetime’s worth of basic capabilities to draw on. We don’t need to rediscover visual perception in order to learn to drive a vehicle or write a computer program, and while we don’t have a hundred lifetimes to spend reading reams of Internet text, the experience we do have is densely concentrated, uniquely suited for our own embodied experience, and arguably significantly richer than individual pictures or out-of-context text snippets harvested from the web. What does all of this tell us about how we could build more intelligent and more capable machines?
One lesson that could be drawn from this discrepancy is that, if we need orders of magnitude more data to attain a fraction of human visual and verbal capability, we must be missing something truly fundamental in our machine learning frameworks. That may well be the case, and I am certainly of the opinion that there is tremendous room for algorithmic innovation and that we should not become exclusively focused on the most effective current algorithms. But if we take a more pragmatic and data-centric view, two observations stand out: machine learning methods improve in performance with more data, and humans and animals acquire remarkable capabilities with significantly less embodied experience. Together, these suggest that if we had training data with the quality of embodied experience (i.e., synchronized sensorimotor streams capturing physical interaction with the real world in all its complexity) and the quantity used by current data-driven models (i.e., hundreds of lifetimes), we might be able to train models with vastly improved capability to reason about the complexities of the real world. This is where robotics and embodied learning come into the picture.
Robots as a Data Factory
Today, robotic learning researchers tend to think of robotics as a consumer of machine learning methods. Data in robotics is generally considered expensive, and the path to generalization that even remotely approaches what we see in supervised learning for perception is assumed to involve leveraging the same plentiful Internet-scale data that enabled advances in computer vision and NLP. A particularly active area of study, for example, is the use of videos from the web to facilitate more effective robotic control. Researchers generally don't consider the opposite direction: how data from robots can help build more capable machine learning systems in general.
However, the promise of autonomous robots, and indeed most types of automation, is that they can be scaled up with an investment of capital and minimal need for skilled labor: while it might be tremendously challenging to build one robot that can reliably restock grocery store shelves or cook meals, building a million such robots afterward is largely a matter of scaling up production. Thus, if autonomous robots can improve from autonomously collected experience, their large-scale deployment may well provide more data than mining the web. We already see some of this in autonomous driving: there is limited utility, for example, in Tesla harvesting videos from the web when its own fleet gathers orders of magnitude more highly relevant data every day. The richness of object interaction, the semantics of indoor spaces, and other human-relevant facets of embodied intelligence will likely become far more accessible when robots are performing useful tasks in open-world settings, and may not only facilitate smarter robotic systems but also provide a more useful source of training data for machine learning systems in many other domains.
While this might seem far-fetched given the ubiquitous use of Internet data for training current computer vision and NLP systems, embodied robotic experience should in principle provide a source of data that is both more plentiful and more complete in its view of the physical phenomena that constitute the real world. If we imagine a service robot with comparable market penetration to a Roomba (on the order of 10 million units), such a fleet would collect on the order of 10 billion hours of experience in a single year, exceeding the size of all NLP and computer vision datasets in use today. At the same time, we might speculate that such embodied data, which couples action, multimodal perception, and the cause-and-effect structure present in continual sensory streams, should contain more useful knowledge than disembodied (and out-of-context) images and text. If the rich structure of embodied sensory experience indeed provides part of the answer for why humans can learn so much more efficiently than machines, as I discussed previously, then such data may indeed be more useful.
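A quick sanity check on the fleet arithmetic, with deliberately conservative and entirely assumed constants:

```python
# Hypothetical fleet-scale data collection; the constants are assumptions.
FLEET_SIZE = 10_000_000       # Roomba-scale market penetration
HOURS_PER_DAY = 3             # modest daily operation per robot

fleet_hours_per_year = FLEET_SIZE * HOURS_PER_DAY * 365      # ~1.1e10 hours
timesteps = fleet_hours_per_year * 3600 * 10                 # at 10 Hz logging
print(f"{fleet_hours_per_year:.1e} robot-hours/year, {timesteps:.1e} timesteps")
```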
If we can utilize embodied experience to address machine learning challenges in other domains – e.g., determining how to interact with humans on the web, visual perception in other (non-robot) settings, and even injecting a degree of grounded common-sense reasoning into otherwise "symbolic" tasks like natural language processing – then, insofar as these applications are limited by data availability (a notion that is well supported by current research on large language models), we can likely significantly advance the capabilities of our AI systems. Of course, it might be difficult to imagine how some domains could benefit from this embodied data. Could an algorithmic stock trading system really use data from a robot cleaning the floor or doing the laundry? Perhaps not directly, and certainly not immediately. But it is important to remember that everything people can do was ultimately learned from embodied experience, and even seemingly abstract tasks like stock trading ultimately ground out in physical events governed by the same physical laws that mediate robotic interaction, so even these more removed applications may eventually benefit from the improved common sense provided by training on embodied interaction data. In the near term, more grounded applications in computer vision, autonomous vehicles, robotics, and video understanding – the ones that often pose the greatest challenge in terms of common sense and visual and spatial reasoning (cf. Moravec's paradox) – would benefit more directly.
Algorithms for Learning from Embodied Experience
Current methodology in machine learning is, arguably, not set up to take advantage of such data in an effective way. Supervised learning methods require informative labels, which autonomously collected robot data will not have. Standard reinforcement learning methods focus on interactive online learning which, though quite effective at acquiring individual skills, is likely not enough to provide the kinds of powerful general-purpose representations that can leverage autonomously collected robot data for downstream tasks. Self-supervised or unsupervised learning methods, though applicable in principle to any type of data, will likely also miss out on a lot of useful signal: autonomously collected robot data illustrates the causal relationships present in real-world phenomena, and its utility goes far beyond the visual invariances and other low-level feature properties learned by current self-supervised techniques (e.g., in computer vision). However, each of these areas offers a partial solution, and together these parts can likely be combined into a powerful methodology for leveraging autonomously collected robot data to solve a plethora of downstream learning problems.
To put together the pieces, we should first ask what kinds of problems we actually want machine learning systems to solve. While there are numerous tasks we might want solved that involve processing and analyzing complex inputs such as images, text, or audio, ultimately all of these tasks can be viewed as some sort of decision-making problem. For example, a visual recognition system looks at a picture and decides whether to label it with one class or another, and this decision has consequences: a user might add the picture to a photo album, or click on an ad. Thus, we could say that what we really need from a strong pretrained model that can serve as the basis for many downstream tasks is the ability to make decisions that achieve desired outcomes. On the surface, this closely resembles the problem of reinforcement learning, with two major differences. First, to use large amounts of diverse unlabeled data to pretrain representations for many downstream applications, such a methodology would need to be task-agnostic and self-supervised, in contrast to the conventional RL framework, which is guided by hand-specified and task-specific rewards. Second, we would like to learn from data that has already been collected, rather than engaging in the classic interactive online learning process that RL typically entails. These two differences bring us closer to self-supervised learning.
I believe we can approach this problem by training a model to achieve every possible outcome that the provided data can support. Put another way, if the downstream tasks we want to solve are essentially decision-making tasks, then we can learn good representations for them by learning to make decisions that can bring about any physically possible outcome. This should seem quite natural: our minds automatically inform us about the affordances in the world around us. We see a door as something that can be opened, a cup as something we can drink from, and a vehicle as something we can use for transportation. These are not representations of the physical attributes of these objects so much as of the outcomes that we can achieve with them. Indeed, some of our research in robotics and reinforcement learning has already shown that representations that capture capabilities and affordances can make downstream control exceedingly simple, with either a small amount of additional training or entirely zero-shot (see, e.g., value function spaces and visual affordance learning).
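To make the idea of an affordance-centric representation concrete, here is a minimal sketch of an outcome-conditioned value function, loosely in the spirit of the value function spaces work cited above. The architecture and interfaces are illustrative assumptions, not a description of any published system.

```python
import torch
import torch.nn as nn

class OutcomeConditionedValue(nn.Module):
    """Minimal sketch: V(s, g) estimates how reachable outcome g is from
    state s. Sizes and encoders are placeholder assumptions."""

    def __init__(self, state_dim: int, goal_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def affordance_features(value_fn, state, candidate_goals):
    """Represent a state by the value of each candidate outcome, yielding a
    'what can I achieve from here?' feature vector for downstream tasks."""
    values = [value_fn(state, g) for g in candidate_goals]
    return torch.cat(values, dim=-1)
```

A downstream task can then be solved with a simple policy (or even a linear readout) on top of these features, since the hard work of inferring what is achievable has already been done by the pretrained value function.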
It might not at first seem obvious, however, that learning to achieve any possible outcome necessarily provides a good basis for solving downstream tasks. For example, does a robot learning various ways to fold the laundry really help us learn how to propose video recommendations that satisfy human users? I believe the key here is to use a sufficiently expansive definition of state, together with a sufficiently extended time horizon. The laundry folding robot exists in a world populated by objects, but also by people. Certainly the reactions of people to its behavior will be part of its sensory experience, and likely a significant and important part, because the disposition of its user will likely have a profound influence on its operation (as a side note, selection pressures for domesticated animals typically change their behavior and physiology in profound ways to accommodate the preferences of their human owners – from purring cats to dogs that love to play fetch, pleasing human “users” is clearly a key part of their adaptation). Thus, if the robot’s internal state includes the (potentially unobserved) disposition of humans – which any reasonable latent state learning method should figure out at an appropriate time scale – then learning to achieve any outcome will include learning to achieve any value of this disposition, a skill likely to come in handy for any other system that interacts with human beings.
Instantiating such a framework requires two basic parts: first, a way to propose potential tasks or outcomes that could be achieved using the data, and second, a way to learn to achieve these outcomes, thus acquiring the desired representation. The first step can likely be quite simple. Borrowing some principles from self-supervised learning, if training on each task is an entirely computational process (i.e., the same data is used for all tasks), the only "downside" to having excessive or unnecessary tasks is the need for more compute. Thus, we could simply propose every single state seen in the data as a potential goal that the algorithm could learn to reach, or even sample from a distribution over every possible reward function that can be optimized using the observed data. This topic has been studied extensively as goal-conditioned reinforcement learning and unsupervised reinforcement learning. The extensive literature on goal-conditioned RL includes methods that learn to achieve every possible goal entirely from previously collected data, and the literature on unsupervised RL (for example, via mutual information maximization) extends these ideas to more general task sets, including automated task proposals. The question of how to learn all possible tasks remains open, though some works have sought to address it by also integrating concepts from successor representations. Thus, while the problem of automatically proposing an "overcomplete" set of tasks for self-supervised learning is not fully solved, there is ample literature to provide a foundation on which to build.
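As a concrete, simplified illustration of the first part, the sketch below relabels an unlabeled trajectory with goals drawn from the data itself, in the style of hindsight relabeling from the goal-conditioned RL literature. The data format and the sparse reward definition are assumptions made for the sketch.

```python
import random

def relabel_with_hindsight_goals(trajectory, goals_per_step=4):
    """Turn one unlabeled trajectory into many goal-reaching examples by
    proposing states the robot actually visited later as goals. `trajectory`
    is assumed to be a list of (state, action, next_state) tuples; in practice
    the equality test below would be a distance threshold in state space."""
    examples = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Any state observed from step t onward is a demonstrably reachable outcome.
        future_states = [step[2] for step in trajectory[t:]]
        for g in random.sample(future_states,
                               min(goals_per_step, len(future_states))):
            reward = 1.0 if g == s_next else 0.0   # sparse goal-reaching reward
            examples.append((s, a, s_next, g, reward))
    return examples
```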
The second part requires us to be able to run reinforcement learning with these self-supervised and automatically generated tasks using the large embodied interaction datasets collected by the robots. This is essentially the offline reinforcement learning problem: the problem of learning an effective policy from a static dataset. While large amounts of robotic interaction could conceivably also be used in concert with online RL, the offline RL solution makes it possible to simultaneously learn a huge number of tasks using the same dataset. We can think of this process as a kind of massively parallel “counterfactual training” procedure: the algorithm is repeatedly asking “if my goal were to accomplish some other outcome, besides what I saw in the data, would I have succeeded?” It stands to reason that a corresponding “counterfactual” representation resulting from such a procedure would then be better suited for answering other counterfactual questions for downstream tasks, such as “if I were to take this candidate action, would that allow me to eventually make the human user happy?” Fortunately, offline RL methods have progressed dramatically over the past two years, with current methods being effectively deployed for large-scale training in robotics, dialogue systems, and other domains.
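A minimal sketch of what one such update might look like: a single TD step for a goal-conditioned Q-function trained on the relabeled offline data. Practical offline RL methods would add a pessimism or behavior-regularization term (in the style of conservative Q-learning, for instance) to avoid overvaluing actions unsupported by the data; the q_net(state, goal) interface and the discrete action space are assumptions.

```python
import torch
import torch.nn.functional as F

def offline_gc_q_update(q_net, target_q_net, optimizer, batch, gamma=0.99):
    """One TD update for goal-conditioned Q-learning on a fixed dataset.
    Assumes q_net(state, goal) returns Q-values over a discrete action set and
    that `batch` holds relabeled (s, a, s_next, g, r) tensors as produced by
    the hindsight sketch above. Deliberately omits the pessimism terms that
    practical offline RL methods rely on."""
    s, a, s_next, g, r = batch
    with torch.no_grad():
        next_q = target_q_net(s_next, g).max(dim=-1).values
        target = r + gamma * (1.0 - r) * next_q   # treat goal-reach as terminal
    q = q_net(s, g).gather(-1, a.unsqueeze(-1)).squeeze(-1)
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the relabeled batches pair the same transitions with many different goals, each pass over the dataset trains the counterfactual question directly: would this action have succeeded had the goal been different?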
In my earlier article on understanding the world through action, I discuss the technical aspects of this framework in more detail.
A Broader View of the Algorithmic Questions
While the discussion above provides one algorithmic framework that could effectively utilize large amounts of embodied experience in a self-supervised manner, it is by no means the only one. More generally, we can devise a broad family of potential methods that all somehow provide for two ingredients: (i) the ability to solve some sort of prediction problem, which effectively makes use of the temporal structure in the data and learns cause-and-effect relationships; (ii) the ability to reason counterfactually, which requires understanding which kinds of conclusions can be supported by the data (and any predictive model trained on it), and which cannot. This leaves a very large space of methods to consider.

For example, we can consider how a classic predictive modeling approach might fit into this framework. Learning to predict the future has often been put forward as a simple and promising approach for self-supervised learning from sequential data, and indeed the widespread success of modern large language models can be attributed to solving essentially this forward prediction problem on text data. However, we know from the literature on offline RL that naively learning to predict the future is not enough to make decisions effectively: we also need to know which decisions are supported by the data and which aren't. For example, if we train an autonomous vehicle to drive using data from good human drivers, this data simply cannot be used to predict that driving off the road is a bad idea, because this was never seen in the data itself. Rather, we need to disallow any behaviors that deviate from the data too much, as their outcomes are unpredictable. Interestingly, some predictive models do already provide for this functionality, including large language models, by learning the full probability distribution over sequences. This principle has been used in several model-based RL methods, such as the trajectory transformer, which represents probabilities over full sequences of states and actions, and plans using sequences that are assigned high enough probability (and whose predicted outcomes are therefore unlikely to be incorrect).
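To illustrate the last point, here is a sketch of likelihood-filtered planning in the spirit of the trajectory transformer. The sequence model's sampling interface is an assumption, and a real implementation would use beam search over discretized states and actions rather than this simple rejection step.

```python
import torch

@torch.no_grad()
def likelihood_filtered_plan(seq_model, prefix, horizon,
                             num_samples=64, log_prob_floor=-50.0, score_fn=None):
    """Plan with a learned sequence model while rejecting continuations the
    model itself considers improbable. The `seq_model.sample(...)` interface,
    returning candidate trajectories and their log-probabilities, is assumed."""
    trajs, log_probs = seq_model.sample(prefix, horizon, num_samples)
    # Low-probability continuations are exactly the ones whose predicted
    # outcomes we cannot trust, so drop them before scoring.
    keep = log_probs > log_prob_floor
    if not keep.any():                    # nothing trustworthy: keep the best
        keep = log_probs == log_probs.max()
    trajs, log_probs = trajs[keep], log_probs[keep]
    scores = score_fn(trajs) if score_fn is not None else log_probs
    return trajs[scores.argmax()]
```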
From this we might conclude that a variety of algorithmic frameworks that span the gamut from more classical predictive models to model-free offline RL could be used to acquire viable representations from massive amounts of embodied experience. However, it seems probable that all of these frameworks would in some way reflect the basic principles of offline RL, in that they would learn how decisions lead to outcomes, and provide a representation that can not only infer the likely outcomes of candidate decisions, but also accommodate reliable counterfactual queries by correctly predicting which decisions and outcomes are or are not supported by the data.
Takeaways and Conclusions
In this article, I argued that the path to increasingly capable learning-enabled AI systems should involve the use of large-scale embodied experience, and that such experience can be gathered in a scalable way by robots performing useful tasks in the real world. Of course, this entails a particularly challenging "bootstrapping" problem: while 10 million robots might collect many lifetimes of data every hour, getting to the point where 10 million sufficiently capable robots are economically viable to deploy is itself a significant engineering and economic challenge. In some domains, such as autonomous driving, this challenge is already nearly addressed by the millions of instrumented cars on the roads (nearly 2 million in the case of Tesla alone). In domains such as service robotics, it still lies in the future. Perhaps more relevant for individual scientists and engineers is the accompanying algorithmic challenge: while it is very likely that massive amounts of embodied experience can enable remarkable learned capabilities, including for non-robotics tasks, the algorithmic frameworks for most effectively leveraging such experience have yet to be developed. I have sought to provide some potential ideas in this article for how this could be done, but they are just that: speculation, at times with preliminary evidence from very small-scale experiments. This makes self-supervised methods for learning from embodied experience a particularly exciting area of study today: in preparing the methodology that can leverage such data effectively, we can lay the groundwork for the most powerful AI systems of the future. And even if we fail at producing self-supervised methods that can bootstrap every possible learning problem, perhaps we will still get effective general-purpose robotic learning methods as a consolation prize.
I would like to thank Abhishek Gupta and Chelsea Finn for helpful comments and suggestions on an earlier version of this article.