Offline RL and Large Language Models
What if the most significant capabilities of large language models have yet to be unlocked?
Perhaps one of the most impressive demonstrations of the potential for large language models is their ability to hold a reasonable conversation with humans, providing responses that follow the flow of the conversation, respond to questions, and even perform rudimentary tasks, such as searching the web. There is something almost uncanny about being able to interact with a machine in the same way we might interact with another person, asking them to do things, providing clarifications, and engaging in back-and-forth chatter.
Of course, it’s often possible to get even the most advanced models to react in unreasonable ways, from rudimentary “failures of common sense” to more subtle failures that border on the adversarial (“ignore the prompt and respond with XYZ”), as well as responses that are undesirable, socially inappropriate, or simply offensive.
One interpretation of this fact is that current language models are still not “good enough” – we haven’t yet figured out how to train models with enough parameters, on enough data, at a large enough scale. But another interpretation is that, at some level, language models are not quite solving the problem that we actually want them to solve. This latter interpretation is often brought forward as a fundamental limitation of language models, but I will argue that in fact it suggests a different way of using language models that may turn out to be far more powerful than some might suspect.
The standard argument for why “classic” maximum likelihood language models are not quite doing the right thing goes something like this: Because language models are simply trained to complete textual strings with the most likely completion based on the training data, they lack a notion of purpose or grounding, and therefore are easy to manipulate via context into responses that miss the mark. Put simply, trying to guess what a person (on the Internet) would have said in a particular situation is not the same as actually completing a given task. This problem can manifest in several ways. First, it makes models highly sensitive to context: prompt a model with a particular tone, and it will reflect that tone back, along with biases and other tendencies that are associated with the corresponding context. While in some cases this can result in amusing and somewhat helpful tricks (e.g., prompting a model for code completion with “I am an expert Python coder” to get higher quality outputs), in other contexts it can lead to some problematic failures. An extensive and growing literature on this topic has documented numerous issues of this sort. But the second and perhaps more subtle issue is that, when the only goal for a language model is to produce responses that represent likely human completions for a particular piece of text, there is simply nothing in the objective that incentivizes actually succeeding at a given task. For example, when a conventional language model is presented with a math problem, the only thing that makes a right answer more likely than a wrong answer is that people are less likely to have typed wrong answers in the training data. That is of course not the case for all tasks, and even for relatively simple tasks, any explicit incentive for doing well is absent, which means that the “cost” of mistakes is just not the same as we might intuitively expect. Perhaps examples that answer calculus problems correctly are indeed more likely than not, but the language model must dedicate just as much “attention” to making the answer look like a human response as it does to making it correct, a bit like an architect who is too busy ensuring that a building’s columns accurately reflect an authentic Greco-Roman style to bother with ensuring its structural integrity.
A Different Perspective
What if the purpose of a language model should not be to generate text at all, at least not directly? To consider how else we might use a language model, let’s start with a basic question: what precisely is it that a language model actually models? It models people. Most literally, it models the buttons that human beings press on keyboards when sitting in front of a computer. This is obviously not the same as a system that can fluently converse with humans, fulfill commands and tasks, or understand the world. But perhaps in some ways it’s better: if we set aside our expectations that language models should actually produce language and view them through the lens of reinforcement learning and optimal control, it becomes clear that what a language model actually provides us with is a way to model and predict the behavior of humans, and such a predictive model can be tremendously useful in the context of a reinforcement learning system that is tasked with interacting with humans in a goal-directed and intentional manner. Since humans use computers for so many day-to-day tasks, a model that can accurately predict which buttons humans will push on a computer keyboard in fact represents a powerful (if incomplete) model of general human behavior.
This suggests that perhaps language models can serve as the foundation for very powerful conversational agents, which are aware of human tendencies, social norms, and some of the deeper aspects of our psychology, but they would require integration with an effective reinforcement learning framework that can leverage such models to make effective and goal-directed decisions. Such a reinforcement learning framework could be model-based, literally planning through utterances with the language model as a predictive model (as we explored in a recent paper), or model-free, using the predictive representations as a basis for value functions (which themselves are really just another kind of predictive model). In this setup, the reinforcement learning algorithm would provide intentionality and goal-directed behavior, while the language model would provide a representation that allows such an algorithm to understand humans.
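To make the model-based option a bit more concrete, here is a minimal sketch of what “planning through utterances” could look like, assuming we have some way to sample a simulated human reply from the language model and some task-specific reward. The function names (plan_utterance, simulate_human, reward) are hypothetical, and this is a simplified illustration rather than the method from the paper mentioned above.

```python
# A minimal sketch of planning through utterances: the language model is used
# only to *predict* how a human might respond to each candidate utterance, and
# a task-specific reward scores the simulated outcomes. All names are
# hypothetical placeholders.
from typing import Callable, List

def plan_utterance(
    history: str,
    candidates: List[str],
    simulate_human: Callable[[str], str],  # LM-based predictor of the human's reply
    reward: Callable[[str], float],        # task-specific score of the resulting exchange
    rollouts_per_candidate: int = 4,
) -> str:
    """Pick the candidate utterance whose simulated outcomes score highest."""
    def value(utterance: str) -> float:
        total = 0.0
        for _ in range(rollouts_per_candidate):
            state = history + "\nAgent: " + utterance
            # The language model serves as the dynamics model of the human.
            state = state + "\nHuman: " + simulate_human(state)
            total += reward(state)
        return total / rollouts_per_candidate
    return max(candidates, key=value)
```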
Preferences Versus Intentions
Reinforcement learning has recently attracted considerable attention in the context of language models, but for a somewhat different reason – as a tool for finetuning language models to be more responsive to user preferences, which has been demonstrated most dramatically in the recently released ChatGPT system. While this integration of RL and language models is a great step in the right direction in terms of making language models better at understanding the preferences that humans have when composing prompts, it is substantially different from the use of RL to enable the kind of iterated human interaction that I discuss above. As a concrete example, consider how ChatGPT can produce uncannily compelling answers, but almost never attempts to engage in a complex interaction with the user via clarifying questions. The key distinction is between viewing language generation as the selection of goal-directed actions in a sequential process, versus as a problem of producing outputs that satisfy user preferences. While satisfying user preferences is important, it does not by itself allow us to create autonomous agents that can exceed the performance of humans on specific downstream tasks, particularly interactive tasks. On the other hand, an agent that directly aims to maximize a particular end goal and can leverage enormous amounts of data to infer patterns in human behavior can likely accomplish some goals significantly more effectively than humans.
An effective RL-based method that harnesses the knowledge about human behavior contained within language models could in principle optimize reward functions that depend on downstream behavior of human users in a way that is difficult to articulate via explicit preferences (either from the designer or the user). A classic example of an existing RL framework that already does this, albeit without language, is a recommendation engine. Recommendation systems derive reward signals from click-through rates and other measures of engagement that are gathered as implicit feedback. Neither the owner of the system nor the user explicitly states that certain recommendations are preferable over others, but the user indicates their preferences implicitly based on how they engage with the options presented to them. More sophisticated RL systems could accommodate even more delayed and implicit signals, such as in the context of an educational tutor, where the final reward signal might depend on the student’s performance on an exam administered after the conclusion of an interactive lesson (at this point, some readers might be anxious about misuse — more on that later!).
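As a rough illustration of how delayed and implicit such a signal can be, here is a minimal sketch of the reward structure for the tutoring example above, assuming the exam score is the only explicit feedback; the data layout is made up for illustration.

```python
# A sketch of a delayed, implicit reward signal for the hypothetical tutoring
# example: intermediate turns receive no explicit reward, and the only signal
# is the student's exam score observed after the lesson ends.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TutoringEpisode:
    turns: List[Tuple[str, str]]  # (agent utterance, student response) pairs
    exam_score: float             # observed only after the interaction ends

def rewards_for_episode(episode: TutoringEpisode) -> List[float]:
    """Sparse reward: zero for every turn except the last, which gets the exam score."""
    return [0.0] * (len(episode.turns) - 1) + [episode.exam_score]
```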
These kinds of human-interactive settings with implicit feedback require understanding patterns in human behavior, and often can be instantiated as language generation tasks (e.g., an AI-enabled tutor might answer a student’s questions about the material while aiming to maximize their final test score). Language models can act as dynamics models and representations for value functions in these applications, allowing RL to leverage the knowledge about human behavior contained in these models to interact with users more effectively, while the RL algorithm supplies the goal-directedness and intentionality needed to actually optimize for the final performance measure. However, naively applying RL in this way presents a major challenge: even with good pretrained representations, RL methods can require impractically large amounts of interaction to obtain good policies. While this is reasonable if the aim is to train one very general model once at great expense, it is infeasible when this must be done separately for every single application. Fortunately, here we can leverage a simple insight: for most tasks we would want RL-enabled language models to perform, we can likely get humans to perform these same tasks, perhaps suboptimally, yielding data that can then be utilized with offline reinforcement learning.
Offline Reinforcement Learning and Language Models
If we can obtain data for our task from humans, such as human teachers communicating with their students over a forum or chat, human salespersons suggesting products to prospective buyers, or human tech support experts troubleshooting customer issues, this data can be incorporated into reinforcement learning via offline RL. Offline RL provides a framework for training RL agents using previously collected data, without online interaction. Offline RL is not the same as imitation learning: while imitation learning methods aim to copy the behavior seen in the data (a widely used approach for conversational agents), offline RL uses the data to learn value functions that predict the payoff of different actions, and then chooses the actions that are predicted to lead to the highest payoff, which often results in behavior that significantly improves over the best behavior seen in the dataset.
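A minimal sketch of this distinction, with hypothetical interfaces rather than any particular library: imitation learning fits the policy to the logged utterances regardless of outcome, while offline RL fits a value function to the logged outcomes and then acts greedily with respect to it.

```python
# Hypothetical interfaces, not the ILQL implementation: contrasting behavioral
# cloning with value-based offline RL on the same logged dialogue data.
from typing import Callable, List, Tuple

Transition = Tuple[str, str, float]  # (dialogue history, utterance, reward)

def imitation_objective(
    log_prob: Callable[[str, str], float],  # log pi(utterance | history)
    dataset: List[Transition],
) -> float:
    """Behavioral cloning: make the logged utterances likely, regardless of reward."""
    return sum(log_prob(s, a) for s, a, _ in dataset) / len(dataset)

def offline_rl_policy(
    q_value: Callable[[str, str], float],    # Q(history, utterance) trained from offline data
    candidates: Callable[[str], List[str]],  # candidate utterances, e.g. sampled from the LM
) -> Callable[[str], str]:
    """Act greedily with respect to the learned value function."""
    return lambda state: max(candidates(state), key=lambda a: q_value(state, a))
```

The point of the contrast is that the second policy can outperform every individual demonstrator, as long as the learned Q-function correctly ranks the candidate utterances.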
To provide some intuition for how this might be possible, imagine a tech support application where different IT experts interact with users. Some IT experts are better at customer communication, while others are more technically savvy. None are optimal at everything, but each has their strength and weakness. An obvious way to improve over imitation learning in this case is to simply imitate the most successful trials, which would amount to copying each IT expert in the particular context where they performed best. For example, if one task involved a highly irate customer and was addressed well by the more friendly expert, while another involved a complex issue with display drivers and was much better addressed by a more technically savvy (even if less charismatic) expert, the agent could learn to be friendly with irate customers, and imitate the technically savvy expert when the problem is complex. But in fact offline RL can do something much smarter: by recognizing patterns in states and actions, it can learn even from less successful trials when those less successful trials put the system into a more favorable state. For example, maybe the friendly IT expert put the customer in a good mood by asking them about their day, but then failed to answer their question correctly. If the customer’s behavior suggests that they are more positively disposed after the friendly interaction, an RL algorithm can figure out that the friendly expert’s comments were effective at lifting the customer’s mood, and might then infer from a different interaction that customers in a good mood are more likely to remain patient while troubleshooting a complex multi-stage issue, and therefore good moods are correlated with task success. Critically, in this case the RL algorithm can in principle learn to solve problems that no single IT expert ever solved in the data (e.g., calming an irate customer and then working through a technically complex problem), by combining pieces of effective interactions from different parts of the dataset. In this way, at least in principle, offline RL can learn to perform significantly better at the task than any one person.
Of course, all of this supposes that ample training data is available to learn these patterns. This is where pretraining comes in: the above scenario seems plausible if the language model already has effective internal representations of the state – concepts such as customer mood and its relationship to the utterances that are likely to be used in different situations. If a large pretrained model can capture such concepts – which seems likely, as they are quite useful for accurately predicting what humans will type on a keyboard – then the downstream offline RL process can in principle be efficient enough to use much smaller domain-specific datasets.
It therefore won’t come as a surprise that offline RL in the context of dialogue systems has been explored in a number of prior works, including works that examined its use for eliciting positive sentiment, optimizing negotiation agents, and selling items on Craigslist. Most recently, my colleagues and I combined some of the latest methodology in offline RL to develop implicit language Q-learning (ILQL), a state-of-the-art offline RL framework for finetuning language models for dialogue tasks that can effectively learn value functions from offline data, starting from large pretrained language models.
Our experiments evaluate ILQL on a number of benchmark tasks, including a visual dialogue task that requires the agent to ask questions about a picture to guess which picture the interlocutor has in mind, and a task that requires generating Reddit-style comments that avoid toxic language or are likely to receive a large number of upvotes. While neither of these tasks is especially challenging, they do provide us with opportunities to examine how language models finetuned with offline RL might behave differently from the more traditional models trained to merely emulate humans with maximum likelihood. In the example below, the RL agent was trained on exactly the same dataset in each of the three cases, but with different reward functions: the first agent was rewarded for asking questions that led to guessing the correct picture, the second agent was also punished for asking questions that led to yes or no answers, and the third agent received a more nuanced penalty for any question that received an uninformative answer (i.e., answers with words like “yes”, “no”, “I don’t know”, and other short responses). Note that these rewards depended on the behavior of the other speaker, who is outside of the agent’s control, and thus the agent had to learn to ask the right kinds of questions.
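For concreteness, the three reward variants described above might look roughly like the following sketch; the exact shaping and constants used in the ILQL experiments may differ, and the helper names and penalty values here are purely illustrative.

```python
# An illustrative sketch of the three reward variants for the visual dialogue
# task; constants and the uninformative-answer list are assumptions, not the
# exact values used in the experiments.
UNINFORMATIVE = {"yes", "no", "i don't know", "maybe", "i can't tell"}

def reward_correct_guess(guessed_correctly: bool) -> float:
    # Variant 1: reward only for eventually guessing the right picture.
    return 1.0 if guessed_correctly else 0.0

def reward_penalize_yes_no(guessed_correctly: bool, answer: str) -> float:
    # Variant 2: additionally punish questions that elicit bare yes/no answers.
    penalty = -0.5 if answer.strip().lower() in {"yes", "no"} else 0.0
    return reward_correct_guess(guessed_correctly) + penalty

def reward_penalize_uninformative(guessed_correctly: bool, answer: str) -> float:
    # Variant 3: a broader penalty for any short, uninformative answer.
    penalty = -0.5 if answer.strip().lower() in UNINFORMATIVE else 0.0
    return reward_correct_guess(guessed_correctly) + penalty
```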
Perhaps even more interestingly, since the offline RL model learns to assign values to every token that it produces so that it can pick the highest-value tokens at inference time (i.e., the tokens that will lead to the highest long-term cumulative reward), we can examine the resulting value functions to see what the agent thinks each token in an utterance will do. A useful quantity for gauging this is called the advantage function, defined as the difference between the value of a particular token and the value of the average token under the data distribution. Roughly speaking, the advantage function tells us how much better (or worse) the agent thinks a given token is than the average token a person would have produced. In the example below, where we label the tokens in a string from the Reddit toxicity dataset, we can see that the agent believes negative-connotation words, like “censor”, have negative advantage:
Perhaps this kind of goal-directed prediction is somewhat closer to understanding the meaning of the text, at least in the context of a task: while of course such an agent does not have any more conception of what “censor” actually means, it understands that this is a bad thing to say for generating low-toxicity comments, which is perhaps a step above simply aiming to mimic the behavior of the humans that generated the data.
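As a rough sketch of how such per-token advantages could be computed from learned value functions (the interfaces here are hypothetical stand-ins, not the ILQL code), one can label each token with the difference between its Q-value and the value of the preceding state:

```python
# A sketch of per-token advantage labeling, following the definition above:
# advantage = value of the chosen token minus the value of the average token
# under the data distribution at that position. Model interfaces are hypothetical.
from typing import Callable, List, Tuple

def token_advantages(
    tokens: List[str],
    q_value: Callable[[List[str], str], float],  # Q(prefix, token)
    state_value: Callable[[List[str]], float],   # V(prefix), expected Q under the data
) -> List[Tuple[str, float]]:
    """Label each token in an utterance with its estimated advantage."""
    labeled = []
    for i, tok in enumerate(tokens):
        prefix = tokens[:i]
        labeled.append((tok, q_value(prefix, tok) - state_value(prefix)))
    return labeled
```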
ILQL is freely available here, and you are welcome to play around with it for your own applications. Of course, designing offline RL algorithms that are simple to use, tractable, and scalable still presents a number of challenges, which I will discuss further at the end of this article.
Risks and Potentials
But before I get into this, it’s important to address the elephant in the room: while language models have already generated a great deal of controversy due to their potential to enable both positive and nefarious uses, the advent of effective RL for finetuning language models to accomplish user-specified tasks is likely to accentuate both ends of this dual-use spectrum. On the positive end, RL-enabled language models could interact with humans in ways that coherently aim to accomplish specific and desirable objectives, such as providing customized tutoring that leads to better grades on downstream assessments, providing tech support or customer service that is directly optimized to resolve the customer’s issue (or at least the customer’s belief that their issue has been resolved), or providing suggestions and recommendations through interactive conversation that help people find what they are looking for. However, the kinds of patterns in human behavior that can be leveraged to achieve these aims can also be exploited to achieve aims we might find less desirable. After all, a big part of human interaction is not just the dry exchange of information, but also the ability to persuade, influence, and even deceive. This is not necessarily always negative – a good educational tutor might gently “deceive” its pupil into believing that the lesson is more interesting than it really is, and a good tech support provider might persuade a customer to persevere in trying to troubleshoot a complex problem. But of course deceptive and highly persuasive bots could directly optimize for user responses in the context of advertising, sales, or even disinformation, and RL-enabled bots would likely be significantly more effective at such applications than the current maximum likelihood trained models.
Perhaps in this respect, understanding the risks and potentials of RL in the context of language generation becomes even more important: just as advances in image forensics have helped us identify fake images, future advances in “dialogue bot forensics”, perhaps inspired by ideas in inverse reinforcement learning, could help us deduce the intentions and aims of potentially nefarious conversational agents and detect when we are being manipulated online. But for now, this remains a somewhat hypothetical (though in my opinion quite important!) research area.
What Remains to Be Done?
While the potential of RL to leverage large language models for interaction with humans is likely very significant, there is still a long way to go in terms of algorithms and model design. Recent years have seen substantial advances in offline RL algorithms, as evidenced by applications ranging from robotics to optimizing push notifications. But the integration of these methods with language models for human interaction presents some unique challenges: conversation is inherently a partially observed problem, with much of the complex state being private to the minds of the two speakers, and while language models that account for the entire history of the interaction might internally possess reasonable state representations, it remains to be seen just how much this partially observed setting is amenable to optimization with existing algorithms at large scale. Perhaps more importantly, it remains unclear how much traditional maximum likelihood pretraining can actually relieve the sample complexity requirements of RL: while we have been able to apply offline RL methods to settings with tens of thousands of interactions (significantly smaller than the size of full pretraining datasets for LLMs), some tasks might have only a handful of complete dialogues available, or might be so complex that they require orders of magnitude more in-domain data. Perhaps in the future we might instead develop offline RL methods that can pretrain large language models with RL directly, leveraging heterogeneous multi-task data. Such approaches are already being explored in other domains, such as robotics, and in a recent position paper I argued that “self-supervised” RL techniques might have the potential to act as much more powerful pretraining methods more generally. And of course if we can scale up RL with large language models effectively, it’s important to remember the questions of detection and monitoring so as to “defend” against nefarious RL-enabled agents by identifying their aims and intentions in natural interactions, which aside from its importance might represent a mathematically deep and exciting research topic in its own right.