Machines and Society

A growing guide on the latest in data-driven research and emerging technologies at the intersection of society, information and technology.

Mechanisms behind GPT Models

This section summarizes the mechanisms that drive the development of large language models, which ultimately pave the way for applications like ChatGPT. It is based on the literature review and the lecture below.

Potts, C. (Feb 1, 2023). GPT3 & Beyond: Key concepts and open questions in a golden age for natural language understanding. Video, Slides.


Zhao, W. X. et al. (March 31, 2023). A Survey of Large Language Models. GitHub.


Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the Opportunities and Risks of Foundation Models.

The Transformer architecture, used in models such as OpenAI’s Generative Pre-trained Transformer (GPT) and Google’s Bidirectional Encoder Representations from Transformers (BERT), changed the landscape of natural language processing.

The seminal paper Attention Is All You Need introduced the Transformer.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30.


Transformers build on the self-attention mechanism. "Attention" in machine learning is inspired by the biological concept of attention. Humans don’t process all the environmental information at once, but concentrate on relevant data for the current task. This selective focus applies to various senses like vision, hearing, smell, etc. Likewise, "attention" is employed in different areas of machine learning, including natural language processing and computer vision.
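
As a rough illustration of what self-attention computes, the sketch below implements the scaled dot-product attention of Vaswani et al. (2017) in plain Python. It is a simplification made for this guide: the queries, keys, and values are all the raw input vectors (a real Transformer first applies learned projections), and the three toy vectors are invented for the example.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors standing in for three token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one position attends to every position in the sequence. The rows sum to 1, and the first token gives less weight to the second token, which points in a different direction, than to itself.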

Transformers have several advantages over prior architectures such as recurrent neural networks. First, they let a model learn the contextual connections between words: to interpret a given word, the model looks at the other words in the input sequence, and it can keep track of that context even when the relevant words are far apart.

Second, transformers allow the model to attend to different parts of an input sequence in parallel, which makes more efficient use of hardware such as GPUs. This computational efficiency often leads to improved accuracy, because it makes it feasible to build larger models and train them more robustly.

The third advantage of transformers is their general-purpose nature. They are not strongly tied to a particular task or domain and can be fine-tuned for a large number of downstream natural language processing tasks and applications. These include sentiment analysis, text classification, translation, summarization, natural language inference, semantic textual similarity, word sense disambiguation, and question answering.

The major components of a transformer are a set of encoders and decoders. Encoders encode input into a representation, which is then decoded by the decoders. The main difference between the encoder and the decoder is that the former is bidirectional, whereas the latter is unidirectional. GPT models use only the decoder portion of the architecture and thereby inherit its unidirectional nature; BERT uses the encoder, with its bidirectional characteristics.
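
One common way to express the bidirectional versus unidirectional distinction is as an attention mask that states which positions each word may look at. The sketch below is illustrative only; the function names `attention_mask` and `visible_context` are made up for this example and do not come from any library.

```python
def attention_mask(length, causal):
    """Return mask[i][j] = True when position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(length)]
        for i in range(length)
    ]

def visible_context(tokens, position, causal):
    """List the tokens that the given position is allowed to see."""
    mask = attention_mask(len(tokens), causal)
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]

tokens = ["the", "cat", "sat", "down"]

# Decoder-style (unidirectional): "cat" sees only itself and what came before.
print(visible_context(tokens, 1, causal=True))   # ['the', 'cat']

# Encoder-style (bidirectional): "cat" sees the whole sentence.
print(visible_context(tokens, 1, causal=False))  # ['the', 'cat', 'sat', 'down']
```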

Architectures that preceded transformers include recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. An RNN processes data sequentially and handles long sequences inefficiently. Because each step depends on the result of the previous one, neither RNNs nor LSTMs can process the positions of a sequence in parallel during training.


Aggarwal, C.C. (2022). Attention Mechanisms and Transformers. In Machine Learning for Text (pp. 369–391). Springer, Cham.
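
To make the sequential bottleneck concrete, here is a minimal sketch of a single-unit recurrent network. The weights `w_in` and `w_hidden` are arbitrary values chosen for illustration, not trained parameters.

```python
import math

def rnn_forward(inputs, w_in=0.5, w_hidden=0.8):
    """Run a one-unit RNN over a sequence and return every hidden state."""
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state needs the previous state, so this loop
        # cannot be split across time steps and run in parallel.
        hidden = math.tanh(w_in * x + w_hidden * hidden)
        states.append(hidden)
    return states

states = rnn_forward([1.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

The later inputs are zero, yet the later states are not: information from the first step reaches them only by passing through every intermediate step, which is also why distant context tends to fade.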

Self-supervised learning derives prediction problems from unlabeled data without the need for human annotators, so self-supervised tasks are more scalable than tasks that rely on labeled data. Additionally, because these tasks force the model to predict parts of its own inputs, the resulting models are richer and potentially more useful than models trained on a more limited label space.
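
The following sketch shows how training examples can be manufactured from raw text alone. It covers two simplified variants: next-word prediction (the style used by GPT) and masked-word prediction (the style used by BERT). Both functions are hypothetical helpers written for this guide, and hiding every word exactly once is a simplification of how masking is applied in practice.

```python
def next_word_examples(tokens):
    """Each prefix becomes an input; the word that follows is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_word_examples(tokens, mask_token="[MASK]"):
    """Hide one word at a time; the hidden word is the label."""
    examples = []
    for i, word in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((masked, word))
    return examples

sentence = "language models learn from text".split()

for context, label in next_word_examples(sentence):
    print(context, "->", label)

for masked, label in masked_word_examples(sentence):
    print(masked, "->", label)
```

No human labeled anything here: every label is a word that was already present in the text.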

Large-scale pre-training
Pre-training establishes the basis of the abilities of LLMs. By pre-training on large-scale corpora, LLMs acquire essential language understanding and generation skills. Pre-trained language models are considered NLP’s ImageNet moment.

Early work on pre-training started with static word representations.


Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26.



Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).


ELMo was an early model proposed to capture contextual word representations.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 2227–2237).


BERT established the “pre-training and fine-tuning” learning paradigm. In pre-training, the model is first trained on a large corpus with a general task. In fine-tuning, the pre-trained model is then adapted to a downstream task by training on a smaller amount of data specific to that task.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).


Finally came GPT, the task-agnostic generative pre-training model. GPT differs from BERT in that it is an autoregressive model (it predicts the next word given the previous words) and considers only the left context when making predictions. In contrast, BERT is bidirectional and considers both left and right context. This distinction makes BERT more appropriate for tasks such as sentiment analysis or natural language understanding, where the full context of a sentence is essential.


Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language Models are Few-shot Learners. Advances in Neural Information Processing Systems, 33, 1877–1901. GitHub.


Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-training. [Link]
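
Autoregressive generation can be illustrated with a far simpler model than GPT. The sketch below counts which word follows which in a tiny invented corpus and then generates text one word at a time, always conditioning on what has already been produced. It is a bigram toy that looks only at the single previous word, not the GPT method, and all names in it are made up for the example.

```python
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens):
    """Greedy left-to-right decoding: pick the most frequent next word."""
    output = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(output[-1])
        if not followers:
            break  # no known continuation for this word
        output.append(followers.most_common(1)[0][0])
    return output

corpus = "the cat sat on the mat the cat sat on the sofa".split()
counts = train_bigram_counts(corpus)
print(generate(counts, "the", 3))  # ['the', 'cat', 'sat', 'on']
```

The generation loop never looks to the right of the current position, which is the property that separates an autoregressive model from a bidirectional one.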

Reinforcement Learning from Human Feedback

To align LLMs with human values, Reinforcement Learning from Human Feedback (RLHF) fine-tunes LLMs on collected human feedback data, so that they capture human preferences better than automatic metrics such as BLEU or ROUGE do.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
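
Ouyang et al. (2022) describe training a reward model on pairs of responses that human labelers have ranked, and then optimizing the language model against that reward. As a hedged sketch of the reward-model step only, the function below computes the pairwise loss for one comparison. The reward values passed in are invented numbers; in a real system they would come from a neural network.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is algebraically equal to log(1 + exp(-m)).
    return math.log(1.0 + math.exp(-margin))

# The reward model agrees with the human ranking: small loss.
print(round(preference_loss(2.0, 0.0), 4))

# The reward model disagrees with the human ranking: large loss.
print(round(preference_loss(0.0, 2.0), 4))
```

Minimizing this loss pushes the reward of the human-preferred response above the reward of the rejected one, which is how human judgments are turned into a training signal.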




Also see the Methods section on OpenAI's blog.

Emergent Abilities of LLMs

The initial GPT models exhibited three abilities. 

  • Language generation to follow a prompt and then generate a completion of the given prompt. This ability comes from the language modeling training objective.
  • World knowledge including factual knowledge and commonsense, which comes from the large training corpora.
  • In-context learning to follow a few examples of a given task and then generate the solution for a new test case. The source of this ability is still elusive.

The GPT-3.5 models enhanced the code generation and code understanding abilities. However, they traded in-context learning ability for alignment with humans through RLHF.


Despite the advances and influence of LLMs, their underlying principles remain insufficiently understood. One fundamental question is why LLMs exhibit certain emergent abilities, such as in-context learning.

Emergent abilities of LLMs are formally defined as follows: “An ability is emergent if it is not present in smaller models but is present in larger models”.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., ... & Fedus, W. (2022). Emergent Abilities of Large Language Models.

Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed. It is the source of both scientific excitement and anxiety about unanticipated consequences.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the Opportunities and Risks of Foundation Models.

Three typical emergent abilities for LLMs are in-context learning, instruction following, and step-by-step reasoning. 

In-context learning refers to the ability of a language model to be adapted to a downstream task simply by being given a prompt (a natural language description of the task). It is an emergent property that was neither specifically trained for nor anticipated to arise.

Zhao, W. X. et al. (2023). A Survey of Large Language Models.


Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the Opportunities and Risks of Foundation Models.
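
In practice, in-context learning often takes the form of a few-shot prompt: a task description, a handful of solved examples, and a new case left for the model to complete. The sketch below only assembles such a prompt as text and does not call any model. The `Review:` / `Sentiment:` layout and the example reviews are assumptions made for illustration.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a task description, solved examples, and one open case."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final case is left unanswered for the model to complete.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("A delightful film from start to finish.", "positive"),
        ("Two hours I will never get back.", "negative"),
    ],
    "The plot dragged, but the acting was superb.",
)
print(prompt)
```

No model weights change when such a prompt is used: the examples live only in the input, which is why the ability is described as learning in context.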