My Take on the Popularity of Decoder-Only Language Models Versus Encoder-Decoder Models.
Many popular models like the GPT family use decoder-only architectures, even though encoder-decoder architectures sound more complex and capable of doing more. So are decoder-only models actually better, and if so, why?
What is the difference between decoder and encoder structures? The key difference is whether the attention over the input is (causally) masked: the decoder uses masked attention.
Imagine we have an input sentence, “I like machine learning.” In a causally masked attention layer, each word can only see the words before it (and itself). For “machine,” the decoder can only attend to “I like machine,” while the encoder can attend to the entire sentence.
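To make the masking concrete, here is a minimal sketch (assuming PyTorch) that builds the lower-triangular causal mask for this exact sentence and prints which words each position is allowed to attend to. The tokenization by whitespace is a simplification for illustration only.

```python
import torch

tokens = ["I", "like", "machine", "learning"]
n = len(tokens)

# Causal (decoder-style) mask: position i may attend only to positions <= i.
# torch.tril keeps the lower triangle and zeroes out everything "in the future".
causal_mask = torch.tril(torch.ones(n, n, dtype=torch.bool))

# Encoder-style attention has no such mask: every position sees every position.
bidirectional_mask = torch.ones(n, n, dtype=torch.bool)

for i, tok in enumerate(tokens):
    visible = [t for t, allowed in zip(tokens, causal_mask[i]) if allowed]
    print(f"{tok!r} attends to: {visible}")
# 'machine' attends to: ['I', 'like', 'machine']  -- "learning" is masked out
```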
This key difference makes training encoder-based and decoder-only models very different: any model with an encoder can see the input bi-directionally, and would therefore be “cheating” when predicting the next word, since it is supposed to see only the current word and the words that precede it.
For decoder-only models, the training task is called the causal language modeling objective: simply predict the next word/token given the partial sentence seen so far. And this is the only objective the model needs, because almost any task, translation included, can be rephrased as a single sentence. For example, “The translation from English to French for apple is pomme” turns a translation task into a causal language modeling task. The nice thing is, most of the time, we don’t need to do much…
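Here is a minimal sketch of that objective, again assuming PyTorch. The logits are random stand-ins for the output of a real decoder-only transformer, and the whitespace “tokenizer” is hypothetical; the point is only the shift-by-one structure of next-token prediction.

```python
import torch
import torch.nn.functional as F

# The translation task rephrased as one ordinary sentence.
sentence = "The translation from English to French for apple is pomme".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
token_ids = torch.tensor([vocab[w] for w in sentence])  # shape: (seq_len,)

# Stand-in model output: logits over the vocabulary at every position.
# In practice these come from a decoder-only transformer; here they are random.
seq_len, vocab_size = len(token_ids), len(vocab)
logits = torch.randn(seq_len, vocab_size)

# Causal LM objective: position t predicts token t+1, so shift by one.
pred_logits = logits[:-1]   # predictions for positions 0 .. T-2
targets = token_ids[1:]     # the "next word" at each of those positions
loss = F.cross_entropy(pred_logits, targets)
print(f"causal LM loss: {loss.item():.3f}")
```

The same shift-and-predict loss applies whether the sentence is a translation, a question-answer pair, or plain text, which is what lets a single objective cover so many tasks.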