This section summarizes the mechanisms that drive the development of large language models, which ultimately pave the way for applications like ChatGPT. It is based on the literature review and the lecture below.
Transformer
The Transformer architecture, such as that used in OpenAI’s Generative Pre-trained Transformer (GPT) and Google’s Bidirectional Encoder Representations from Transformers (BERT), changed the landscape of natural language processing.
The Transformer was introduced in the seminal paper Attention Is All You Need.
Transformers build on the self-attention mechanism. "Attention" in machine learning is inspired by the biological concept of attention: humans do not process all environmental information at once but concentrate on the data relevant to the current task. This selective focus applies to various senses such as vision, hearing, and smell. Likewise, "attention" is employed in different areas of machine learning, including natural language processing and computer vision.
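As a concrete illustration, below is a minimal NumPy sketch of the scaled dot-product self-attention defined in Attention Is All You Need; the toy dimensions and randomly initialized weight matrices are illustrative, not taken from the paper's experiments.

```python
# Scaled dot-product self-attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings; W_*: (d_model, d_k) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scores every other token
    weights = softmax(scores, axis=-1)       # each row is a distribution over tokens
    return weights @ V                       # context-aware representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))                    # toy embeddings
W_q, W_k, W_v = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
print(self_attention(X, W_q, W_k, W_v).shape)              # (4, 8)
```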
Transformers have several advantages over prior architectures such as recurrent neural networks. First, they allow models to learn the contextual connections between words: to understand a given word, the model looks at the other words in the input sequence, and it retains that context even when related words are far apart in the input.
Second, transformers allow the model to attend to different parts of an input sequence in parallel, enabling more efficient use of hardware advancements such as GPUs. This computational efficiency often leads to improved accuracy, because it becomes feasible to build larger models and train them more robustly.
The third advantage of transformers is their general-purpose nature. They are not strongly tied to a particular task or domain and can be fine-tuned for a large number of downstream natural language processing tasks and applications. These include sentiment analysis, text classification, translation, summarization, natural language inference, semantic textual similarity, word sense disambiguation, and question answering.
The major components of a transformer are a set of encoders and decoders. Encoders encode input into a representation, which is then decoded by the decoders. The main difference between the encoder and decoder is that the former is bidirectional, whereas the latter is unidirectional. GPT models use only the decoder portion of the architecture, thereby inheriting its unidirectional nature; BERT uses the encoder, with its bidirectional characteristics.
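The unidirectional/bidirectional distinction comes down to the attention mask. The toy sketch below (illustrative, not any model's actual implementation) shows the two mask shapes:

```python
# Encoder (BERT-style) vs. decoder (GPT-style) attention masks for 4 tokens.
import numpy as np

seq_len = 4
encoder_mask = np.ones((seq_len, seq_len), dtype=bool)           # bidirectional
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal

print(encoder_mask.astype(int))  # all ones: every token attends to every token
print(decoder_mask.astype(int))  # lower triangular: token i attends only to tokens <= i
# In attention, masked-out scores are set to -inf before the softmax,
# which drives their attention weights to zero.
```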
Mechanisms that preceded transformers include Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. RNNs process data sequentially and handle long sequences poorly. Neither RNNs nor LSTMs can be trained in parallel across the positions of a sequence.
Self-supervision
Self-supervised learning derives prediction problems from unlabeled data, without the need for human annotators, which makes self-supervised tasks far more scalable. Additionally, these tasks are designed to force the model to predict parts of its own input, yielding representations that are richer and potentially more useful than those of models trained on a more limited label space.
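As a toy sketch of how self-supervision manufactures labels from raw text, next-token prediction pairs (the GPT-style objective) can be carved directly out of a corpus with no human annotation:

```python
# Build (context, target) training pairs from unlabeled text.
text = "the quick brown fox jumps over the lazy dog".split()

pairs = [(text[:i], text[i]) for i in range(1, len(text))]
for context, target in pairs[:3]:
    print(context, "->", target)
# ['the'] -> quick
# ['the', 'quick'] -> brown
# ['the', 'quick', 'brown'] -> fox
```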
Large-scale pre-training
Pre-training establishes the basis of the abilities of LLMs. By pre-training on large-scale corpora, LLMs acquire essential language understanding and generation skills. The advent of pre-trained language models has been called NLP’s “ImageNet moment”.
Early work on pre-training started with static word representations:
word2vec
GloVe
ELMo was the first model proposed to capture contextual word representations.
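The toy contrast below illustrates the difference (the vectors and the mixing function are made up purely for illustration): a static table in the style of word2vec/GloVe assigns "bank" one vector everywhere, while a contextual encoder in the style of ELMo computes the vector from the surrounding sentence, so different senses of "bank" come out different.

```python
import numpy as np

# Static embeddings: one fixed vector per word type.
static = {"bank": np.array([0.1, 0.9]),
          "river": np.array([0.8, 0.2]),
          "money": np.array([0.2, 0.7])}

def toy_contextual(sentence, i, table):
    """Stand-in for a contextual encoder: mix a word's vector with its neighbors'."""
    window = sentence[max(0, i - 1):i + 2]
    return np.mean([table.get(w, np.zeros(2)) for w in window], axis=0)

print(static["bank"])                                # identical in every sentence
print(toy_contextual(["river", "bank"], 1, static))  # geography sense
print(toy_contextual(["money", "bank"], 1, static))  # finance sense (differs)
```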
BERT established the “pre-training and fine-tuning” learning paradigm. Pre-training refers to the initial training phase, in which the model is trained on a large corpus with one task before being fine-tuned on a downstream task with less training data. Fine-tuning adapts the pre-trained model by training it on data specific to the desired task.
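A hedged sketch of the paradigm using the Hugging Face transformers library (assumed installed; the checkpoint is downloaded on first use, and the two-label sentiment setup is illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Step 1: load weights produced by large-scale pre-training.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # fresh classification head on top

# Step 2: fine-tune on a small task-specific dataset (training loop omitted;
# in practice one would use transformers.Trainer or a plain PyTorch loop).
batch = tokenizer(["a delightful film", "a tedious mess"],
                  padding=True, return_tensors="pt")
outputs = model(**batch)      # logits over the 2 sentiment labels
print(outputs.logits.shape)   # torch.Size([2, 2])
```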
Finally came the task-agnostic generative pre-training model, GPT. GPT differs from BERT in that GPT is an autoregressive model (it predicts the next word given the previous words) and only considers the left context when making predictions. In contrast, BERT is bidirectional and considers both the left and right context. This distinction makes BERT more appropriate for tasks like sentiment analysis or natural language understanding, where understanding the full context of a sentence is essential.
GPT-3
GPT
Reinforcement Learning from Human Feedback
To align LLMs with human values, Reinforcement Learning from Human Feedback (RLHF) has been proposed: it fine-tunes LLMs on collected human feedback data to better capture human preferences, which automatic metrics such as BLEU or ROUGE fail to measure.
Also see the Methods section on OpenAI's blog.
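At the heart of RLHF is a reward model trained on human preference comparisons. A minimal sketch of the pairwise (Bradley-Terry style) loss follows; the scalar rewards are placeholders for a learned model's outputs:

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when the human-preferred
    response scores higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

print(pairwise_loss(2.0, 0.5))  # ~0.20: ranking agrees with the human label
print(pairwise_loss(0.5, 2.0))  # ~1.70: ranking disagrees
# The fitted reward model then supplies the reward signal for RL fine-tuning
# of the LLM (e.g., with PPO).
```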
The initial GPT models exhibited three abilities.
The GPT-3.5 models enhanced code generation and understanding abilities, but they traded some in-context learning ability for alignment with humans through RLHF.
Despite the advances and influence of LLMs, the underlying principles of LLMs remain insufficiently understood. One fundamental question is why LLMs exhibit certain emergent abilities, such as in-context learning.
The emergent abilities of LLMs are formally defined as follows: “An ability is emergent if it is not present in smaller models but is present in larger models”.
Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed. It is both a source of scientific excitement and of anxiety about unanticipated consequences.
Three typical emergent abilities for LLMs are in-context learning, instruction following, and step-by-step reasoning.
In-context learning refers to the ability of a language model to adapt to a downstream task simply by being provided with a prompt (a natural language description of the task). It is an emergent property that was neither specifically trained for nor anticipated to arise.
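A sketch of what such a prompt looks like (the format and examples are illustrative): the task is specified entirely in the input, and the model's weights are never updated.

```python
# Build a few-shot prompt for sentiment classification.
examples = [("the movie was wonderful", "positive"),
            ("a complete waste of time", "negative")]
query = "an instant classic"

prompt = "Classify the sentiment of each review.\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n"
prompt += f"Review: {query}\nSentiment:"
print(prompt)
# A sufficiently large LLM typically continues this prompt with "positive",
# with no task-specific fine-tuning.
```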