
Machines and Society

A growing guide on the latest in data-driven research and emerging technologies at the intersection of society, information and technology.

Mechanisms behind LLMs like GPT Models

This section summarizes the mechanisms that drive the development of large language models, which ultimately pave the way for applications like ChatGPT. It is based on the literature and lecture below.

Potts, C. (Feb 1, 2023). GPT3 & Beyond: Key concepts and open questions in a golden age for natural language understanding. Video, Slides.

 

Zhao, W. X. et al. (March 31, 2023). A Survey of Large Language Models. https://doi.org/10.48550/arXiv.2303.18223. GitHub.

 

Jurafsky, D., & Martin, J. H. (January 12, 2025). Ch 8 RNNs and LSTMs; Ch 9 Transformers; Ch 10 Large Language Models; Ch 11 Masked Language Models. Speech and Language Processing. Chapters and slides.

 

Dell, M. (2025). Deep Learning for Economists. Journal of Economic Literature, 63(1), 5-58. https://doi.org/10.1257/jel.20241733

 

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the Opportunities and Risks of Foundation Models. https://doi.org/10.48550/arXiv.2108.07258

Transformer 
The Transformer architecture, used in models such as OpenAI’s Generative Pre-trained Transformer (GPT) and Google’s Bidirectional Encoder Representations from Transformers (BERT), changed the landscape of natural language processing.

The seminal paper Attention Is All You Need introduced the Transformer.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30. https://doi.org/10.48550/arXiv.1706.03762

 

Transformers build on the self-attention mechanism. Attention is a mechanism that computes the embedding for a token by selectively attending to and integrating information from surrounding tokens. In this sense, we say that a word "attends to" certain neighboring words more than others.

"Attention" in machine learning is inspired by the biological concept of attention. Humans do not process all environmental information at once; they concentrate on the data relevant to the current task. This selective focus applies across senses such as vision, hearing, and smell. Likewise, attention is employed in different areas of machine learning, including natural language processing and computer vision.

Transformers have several advantages over prior architectures such as recurrent neural networks. First, they let the model learn the contextual connections of words: to represent a given word, the model looks at the other words in the input sequence, and it can carry that context across long distances in the input.

Second, transformers allow the model to attend to different parts of an input sequence in parallel, enabling more efficient use of hardware advancements such as GPUs. This computational efficiency often leads to improved accuracy, because larger models can be built and trained more robustly.

The third advantage of transformers is their general-purpose nature. They are not strongly tied to a particular task or domain and can be fine-tuned for a large number of downstream natural language processing tasks and applications. These include sentiment analysis, text classification, translation, summarization, natural language inference, semantic textual similarity, word sense disambiguation, and question answering.
 
The major components of a transformer are a set of encoders and decoders. Encoders encode the input into a representation, which is then decoded by the decoders. The main difference between the encoder and the decoder is that the former is bidirectional, whereas the latter is unidirectional. GPT models use only the decoder portion of the architecture and thereby inherit its unidirectional nature; BERT uses the encoder with its bidirectional characteristics.
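The unidirectional/bidirectional distinction boils down to an attention mask. A small sketch with toy NumPy arrays (not a full implementation):

    import numpy as np

    seq_len = 4

    # Encoder (BERT-style): no mask, so every token attends to every
    # other token, both left and right -- bidirectional.
    encoder_mask = np.ones((seq_len, seq_len))

    # Decoder (GPT-style): a lower-triangular causal mask hides future
    # positions, so token i attends only to tokens 0..i -- unidirectional.
    decoder_mask = np.tril(np.ones((seq_len, seq_len)))
    print(decoder_mask)
    # [[1. 0. 0. 0.]
    #  [1. 1. 0. 0.]
    #  [1. 1. 1. 0.]
    #  [1. 1. 1. 1.]]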

Architectures before transformers include Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. RNNs process data sequentially and handle long sequences poorly. Neither RNNs nor LSTMs can be trained in parallel.

Aggarwal, C.C. (2022). Attention Mechanisms and Transformers. In Machine Learning for Text (pp. 369–391). Springer, Cham. https://doi.org/10.1007/978-3-030-96623-2_11

Self-supervision 
Self-supervised learning produces prediction problems from unlabeled data without the need for human annotators, so self-supervised tasks are more scalable. Additionally, because these tasks force the model to predict parts of its inputs, the resulting representations are richer and potentially more useful than those of models trained on a more limited label space.
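A toy sketch of how unlabeled text supplies its own labels under the next-word-prediction objective (real pipelines operate on subword tokens rather than whole words):

    # Raw, unlabeled text supplies its own (context, target) training pairs
    # for next-word prediction -- no human annotation required.
    text = "the cat sat on the mat".split()
    pairs = [(text[:i], text[i]) for i in range(1, len(text))]
    for context, target in pairs:
        print(context, "->", target)
    # ['the'] -> cat
    # ['the', 'cat'] -> sat
    # ... and so on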


Large-scale pre-training
Pre-training establishes the basis of the abilities of LLMs. By pre-training on large-scale corpora, LLMs acquire essential language understanding and generation skills. 

A pre-trained model can be fine-tuned to a particular domain, dataset, or task using a relatively small amount of relevant data from the new domain or language. While the pre-training stage focuses on a large quantity of data, the fine-tuning stage prioritizes quality over quantity.

The ability of LLMs to take knowledge learned from one task and apply it to another is known as transfer learning. Pre-training is the dominant approach to transfer learning in deep learning.
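A minimal PyTorch sketch of this idea, assuming a toy stand-in for the pre-trained backbone (in practice its weights would come from a large-scale pre-training checkpoint): the backbone is frozen and only a small task head is trained on the new data.

    import torch
    import torch.nn as nn

    # Toy stand-in for a pre-trained backbone; real weights would be
    # loaded from a checkpoint produced by large-scale pre-training.
    backbone = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 256))

    for p in backbone.parameters():
        p.requires_grad = False   # freeze the knowledge learned in pre-training

    head = nn.Linear(256, 2)      # small task head, e.g. two sentiment classes
    model = nn.Sequential(backbone, head)
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

    x = torch.randn(8, 768)                    # toy batch of input features
    y = torch.randint(0, 2, (8,))              # toy labels from the new task
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                            # gradients flow only into the head
    optimizer.step()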

Early work on pre-training started with static word representations.

word2vec

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26. https://doi.org/10.48550/arXiv.1310.4546

 

GloVe

Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543). http://dx.doi.org/10.3115/v1/D14-1162
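To show what "static" means in practice, here is a small sketch using the gensim library (assumed to be installed): training assigns each word type one fixed vector, used in every context.

    from gensim.models import Word2Vec

    # Train toy word2vec vectors on a tiny corpus; real models are trained
    # on billions of tokens, but the interface is the same.
    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "cats and dogs are pets".split(),
    ]
    model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, seed=0)

    vector = model.wv["cat"]   # one fixed vector per word type, in every context
    print(model.wv.most_similar("cat", topn=2))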

 

ELMo was the first model proposed to capture contextual word representations. In contextual embedding, each occurrence of a word has a different vector that expresses a different meaning depending on the surrounding words.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 2227–2237). http://dx.doi.org/10.18653/v1/N18-1202

 

BERT established the “pre-training and fine-tuning” learning paradigm.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). http://dx.doi.org/10.18653/v1/N19-1423

 

Pre-training is learning knowledge about language and the world from vast amounts of data; the model compresses this knowledge into its parameters. The result of this stage is a base model. Pre-training is a computationally intensive process.

Fine-tuning is the process of taking a fully pre-trained model and further adapting some or all of its parameters to new data. The result of this stage is an assistant model. Fine-tuning is less computationally expensive.

Karpathy, A. (Nov 23, 2023). [1hr Talk] Intro to Large Language Models (YouTube video).

 

Finally came the task-agnostic generative pre-training model. GPT models differ from BERT in that GPTs are autoregressive decoders, designed to predict the next word given the previous words, from left to right.
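A toy sketch of this left-to-right, autoregressive loop; toy_model below is a random stand-in for a trained Transformer decoder:

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
    rng = np.random.default_rng(0)

    def toy_model(context):
        # Random stand-in for a trained decoder: returns a probability
        # distribution over the vocabulary given the context so far.
        logits = rng.standard_normal(len(vocab))
        return np.exp(logits) / np.exp(logits).sum()

    tokens = ["the"]
    for _ in range(5):                                # one token at a time, left to right
        probs = toy_model(tokens)
        tokens.append(vocab[int(np.argmax(probs))])   # greedy choice of the next word
    print(" ".join(tokens))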

GPT-3

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877-1901. https://doi.org/10.48550/arXiv.2005.14165, GitHub.


GPT

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-training. [Link]

 

In contrast, BERT and its descendants (e.g., RoBERTa) are bidirectional encoders, which compute contextualized representations of the input tokens from both the left and the right. This makes them more appropriate for interpretive tasks like sentiment analysis or natural language understanding, where the full context of a sentence is essential; they are not used for generation.


Reinforcement Learning from Human Feedback

To align LLMs with human values, Reinforcement Learning from Human Feedback (RLHF) has been proposed: it fine-tunes LLMs on collected human feedback data to better capture human preferences, which automatic metrics such as BLEU or ROUGE fail to measure.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, 27730-27744. https://doi.org/10.48550/arXiv.2203.02155

Also see the Methods section on OpenAI's blog.
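One concrete piece of an RLHF pipeline is the pairwise loss used to train the reward model, as in InstructGPT: the model is pushed to score the human-preferred response above the rejected one. Below is a minimal PyTorch sketch, with a toy linear reward model over fixed-size features standing in for a full language-model-based scorer.

    import torch
    import torch.nn as nn

    reward_model = nn.Linear(768, 1)   # toy stand-in for an LLM-based scorer

    chosen = torch.randn(8, 768)       # features of human-preferred responses
    rejected = torch.randn(8, 768)     # features of dispreferred responses

    # Pairwise preference loss: -log sigmoid(r_chosen - r_rejected),
    # averaged over the batch of human comparisons.
    loss = -nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    loss.backward()
    print(float(loss))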

Emergent Abilities of LLMs

The initial GPT models exhibited three abilities.

  • Language generation: following a prompt and generating a completion of it. This ability comes from the language modeling training objective.
  • World knowledge: factual knowledge and commonsense, which come from the large training corpora.
  • In-context learning: following a few examples of a given task and then generating the solution for a new test case. The source of this ability is still elusive.

The GPT-3.5 models enhanced code generation and understanding abilities, but they traded some in-context learning ability for alignment with humans through RLHF.

 

Despite their advances and influence, the underlying principles of LLMs remain insufficiently understood. One fundamental question is why LLMs exhibit certain emergent abilities, such as in-context learning.

Emergent abilities of LLMs are formally defined as follows: “An ability is emergent if it is not present in smaller models but is present in larger models.”

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., ... & Fedus, W. (2022). Emergent Abilities of Large Language Models. https://arxiv.org/abs/2206.07682


Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed. It is both the source of scientific excitement and anxiety about unanticipated consequences.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the Opportunities and Risks of Foundation Models. https://doi.org/10.48550/arXiv.2108.07258


Three typical emergent abilities for LLMs are in-context learning, instruction following, and step-by-step reasoning. 

In-context learning refers to the ability of a language model to adapt to a downstream task simply by being given a prompt (a natural language description of the task). It is an emergent property that was neither specifically trained for nor anticipated to arise.

Zhao, W. X. et al. (2023). A Survey of Large Language Models. https://doi.org/10.48550/arXiv.2303.18223

 

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the Opportunities and Risks of Foundation Models. https://doi.org/10.48550/arXiv.2108.07258
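To make this concrete, here is a toy few-shot prompt in the style of the GPT-3 paper; the task is conveyed entirely through the input text, with no gradient updates:

    # The task (English-to-French translation) is specified only through
    # examples inside the prompt itself; the model infers the pattern and
    # is expected to complete the last line with " bonjour".
    prompt = "\n".join([
        "Translate English to French:",
        "sea otter => loutre de mer",
        "cheese => fromage",
        "hello =>",
    ])
    print(prompt)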

Capabilities, Limitations, and Future Directions

A growing reading list.
 

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922

 

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the Opportunities and Risks of Foundation Models. https://doi.org/10.48550/arXiv.2108.07258

 

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., ... & Zhang, Y. (2023). Sparks of Artificial General Intelligence: Early Experiments with GPT-4. https://doi.org/10.48550/arXiv.2303.12712. [YouTube]

 

Kaddour, J., Harris, J., Mozes, M., Bradley, H., Raileanu, R., & McHardy, R. (2023). Challenges and Applications of Large Language Models. https://doi.org/10.48550/arXiv.2307.10169

 

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., … Wu, Z. (2023). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. https://doi.org/10.48550/arXiv.2206.04615

 

Tamkin, A., Brundage, M., Clark, J., & Ganguli, D. (2021). Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. https://doi.org/10.48550/ARXIV.2102.02503