Researchers at Stanford University have open-sourced Diffusion-LM, a non-autoregressive generative language model that allows for fine-grained control of the model’s output text. When evaluated on controlled text generation tasks, Diffusion-LM outperforms existing methods.
The model and experiments were described in a paper published on arXiv. Diffusion-LM is a generative language model that uses a plug-and-play control scheme: the language model itself is fixed, and its generation is steered by an external classifier that scores how well the generated text matches the desired attributes. Users can specify several features of the desired output, such as required parts of speech, a syntax tree, or sentence length. During generation, Diffusion-LM iteratively denoises a set of latent vectors, with the external classifier providing gradient updates that steer the latent vectors toward the desired output. When evaluated on a set of control tasks, Diffusion-LM "significantly" outperformed baseline methods.
According to the research team, “We find the complex controls enabled by Diffusion-LM to be compelling, and we are excited by how Diffusion-LM is a substantial departure from the current paradigm of discrete autoregressive generation.”
Many generative language models (LMs), such as GPT-3, are autoregressive; that is, they recursively generate text by predicting the next word in a sequence, adding that word to the existing sequence, and using the updated sequence as input for further prediction. These models can generate text that is indistinguishable from that written by humans, and they can be applied to a wide range of problems, from question-answering to interactive chat. However, it is difficult to give users control over the generated output, such as the desired sentence length, structure, or sentiment.
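To illustrate this generate-and-append loop, the following sketch shows greedy autoregressive decoding with the open-source GPT-2 model via the Hugging Face Transformers library; it is a simplified illustration of autoregressive generation in general, not code from the Diffusion-LM project.

```python
# Minimal sketch of greedy autoregressive decoding (illustrative only).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The weather today is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits                                 # distribution over the next token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # greedily pick the most likely token
        ids = torch.cat([ids, next_id], dim=-1)                    # append it and repeat

print(tokenizer.decode(ids[0]))
```

Because each token is chosen only from what has already been generated, there is no direct handle for steering global properties of the output such as its length or syntactic structure.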
Instead of trying to steer an autoregressive LM, the Stanford researchers chose a different technique for language generation: a diffusion model. These models have shown good results in computer vision and other continuous domains; however, they had not previously been applied to text generation, which is a discrete domain. According to the team, Diffusion-LM is the first diffusion model for text generation.
To make Diffusion-LM work, the team modified the standard diffusion model in two ways. First, they defined an embedding function that maps words into vectors in the continuous latent space of the diffusion model. Second, they defined a "rounding" method that maps these vectors back to discrete words. To generate text, the model begins with a random vector in the latent space, which is treated as a noisy version of the output sentence embedding. The model then iteratively denoises it; at each step, the embedding is passed to the external classifier, which produces a gradient update that steers the embedding toward the desired attributes before the next step. Once the iterations are complete, the rounding method maps the final embedding to output text.
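This procedure can be sketched roughly as follows; the object names, method names, and hyperparameters below are hypothetical placeholders intended only to mirror the description above, not the authors' released implementation.

```python
# Schematic sketch of classifier-guided denoising in a continuous latent space.
# `diffusion_model` and `classifier` are hypothetical stand-ins, not real APIs.
import torch

def generate(diffusion_model, classifier, control_target,
             seq_len=64, embed_dim=128, num_steps=200, guidance_lr=0.1):
    # Start from pure noise in the continuous embedding (latent) space.
    x = torch.randn(1, seq_len, embed_dim)
    for t in reversed(range(num_steps)):
        # One reverse-diffusion step: predict a less noisy embedding.
        x = diffusion_model.denoise_step(x, t)
        # Plug-and-play control: nudge the embedding toward the target
        # attribute using the gradient of an external classifier
        # (assumed here to return a scalar log-probability).
        x = x.detach().requires_grad_(True)
        score = classifier.log_prob(x, control_target)
        grad = torch.autograd.grad(score, x)[0]
        x = (x + guidance_lr * grad).detach()
    # "Rounding": map each continuous vector back to its nearest word.
    return diffusion_model.round_to_tokens(x)
```

The key point is that the classifier only supplies gradients on the latent vectors; the diffusion language model itself is never retrained for a new control task.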
The Stanford team evaluated Diffusion-LM on five classifier-guided text generation control tasks and compared its performance to baseline methods built on an autoregressive GPT-2 LM, covering both plug-and-play control and fine-tuning. On all five tasks, Diffusion-LM outperformed the other plug-and-play methods; it also outperformed fine-tuning on two tasks, with "similar" performance on the other three. The team also evaluated Diffusion-LM on an unguided text-infilling task against three baseline models; it outperformed two of them and achieved "comparable" performance to an autoregressive model specifically trained for infilling.