GPT-1

Generative Pre-trained Transformer 1 (GPT-1)
Original author(s)	OpenAI
Initial release	June 2018
Repository	github.com/openai/finetune-transformer-lm
Successor	GPT-2
Type	Large language model; Generative pre-trained transformer
License	MIT
Website	openai.com/blog/language-unsupervised/

Generative Pre-trained Transformer 1 (GPT-1) was the first of OpenAI's large language models following Google's invention of the transformer architecture in 2017.^[2] In June 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training",^[3] in which they introduced that initial model along with the general concept of a generative pre-trained transformer.^[4]

Up to that point, the best-performing neural NLP models primarily employed supervised learning from large amounts of manually labeled data. This reliance on supervised learning limited their use of datasets that were not well-annotated, in addition to making it prohibitively expensive and time-consuming to train extremely large models;^[3]^[5] many languages (such as Swahili or Haitian Creole) are difficult to translate and interpret using such models due to a lack of available text for corpus-building.^[5] In contrast, a GPT's "semi-supervised" approach involved two stages: an unsupervised generative "pre-training" stage in which a language modeling objective was used to set initial parameters, and a supervised discriminative "fine-tuning" stage in which these parameters were adapted to a target task.^[3]
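The two stages can be pictured through the losses they minimize. The following is a minimal sketch, not the paper's implementation: the function names are hypothetical, the probabilities are made-up inputs standing in for model outputs, and the losses are averaged negative log-likelihoods rather than the summed log-likelihoods a paper would typically state.

```python
import math

def pretraining_loss(next_token_probs):
    """Stage 1 (unsupervised): language modeling objective.

    next_token_probs[i] is the probability the model assigned to the token
    that actually occurred at position i, given the tokens before it.
    Returns the average negative log-likelihood over the sequence.
    """
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)

def fine_tuning_loss(label_probs, gold_label):
    """Stage 2 (supervised): discriminative objective on a labeled example.

    label_probs maps each candidate label to its predicted probability.
    Returns the negative log-likelihood of the correct label.
    """
    return -math.log(label_probs[gold_label])

# Hypothetical model outputs, for illustration only.
lm_loss = pretraining_loss([0.5, 0.25])
task_loss = fine_tuning_loss(
    {"entailment": 0.7, "contradiction": 0.2, "neutral": 0.1}, "entailment"
)
```

Pre-training needs only raw text, because the "label" at each position is simply the next token; fine-tuning is the only stage that needs annotated examples.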

The use of a transformer architecture, as opposed to previous techniques involving attention-augmented RNNs, provided GPT models with a more structured memory than could be achieved through recurrent mechanisms; this resulted in "robust transfer performance across diverse tasks".^[3]

Reason for choosing BookCorpus

BookCorpus was chosen as a training dataset partly because the long passages of continuous text helped the model learn to handle long-range information.^[6] It contained over 7,000 unpublished fiction books from various genres. The rest of the datasets available at the time, while being larger, lacked this long-range structure (being "shuffled" at a sentence level).^[3]

The BookCorpus text was cleaned by the ftfy library to standardize punctuation and whitespace, and then tokenized by spaCy.^[3]

Architecture

The GPT-1 architecture was a twelve-layer decoder-only transformer, using twelve masked self-attention heads, with 64-dimensional states each (for a total of 768). Rather than simple stochastic gradient descent, the Adam optimization algorithm was used; the learning rate was increased linearly from zero over the first 2,000 updates to a maximum of 2.5×10⁻⁴, and annealed to 0 using a cosine schedule.^[3] GPT-1 has 117 million parameters.^[4]
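The learning-rate schedule described above can be sketched as a function of the update step. This is an illustrative reading of the description, not the original training code: the function name and the `total_steps` parameter are hypothetical (the text gives no total training length), and the cosine phase is assumed to run from the end of warmup to the final update.

```python
import math

MAX_LR = 2.5e-4       # peak learning rate stated in the text
WARMUP_STEPS = 2000   # length of the linear warmup stated in the text

def learning_rate(step, total_steps):
    """Linear warmup from 0 to MAX_LR, then cosine annealing down to 0.

    total_steps is an assumed training length, not a figure from the text.
    """
    if step < WARMUP_STEPS:
        # Warmup phase: scale linearly with the number of updates so far.
        return MAX_LR * step / WARMUP_STEPS
    # Annealing phase: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points of a hypothetical 100,000-step run.
samples = [learning_rate(s, 100_000) for s in (0, 1000, 2000, 51_000, 100_000)]
```

Under these assumptions the rate is zero at the first update, reaches the maximum at update 2,000, falls to half the maximum midway through the annealing phase, and returns to zero at the final update.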

While the fine-tuning was adapted to specific tasks, the pre-training was not; to perform the various tasks, only minimal changes were made to the underlying task-agnostic model architecture.^[3] Despite this, GPT-1 still improved on previous benchmarks in several language processing tasks, outperforming discriminatively trained models with task-oriented architectures.^[3]
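One way to keep the architecture task-agnostic, following the input transformations described in the GPT-1 paper, is to flatten each task's structured input into a single token sequence that the same model can read. The sketch below illustrates that idea only; the function names and the literal token strings are hypothetical placeholders, and the inputs are assumed to be already tokenized.

```python
# Hypothetical strings standing in for the special start, delimiter,
# and end-of-sequence tokens.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise_tokens, hypothesis_tokens):
    """Premise and hypothesis joined by a delimiter in one sequence."""
    return [START, *premise_tokens, DELIM, *hypothesis_tokens, EXTRACT]

def similarity_inputs(a_tokens, b_tokens):
    """Both orderings, since sentence similarity has no inherent order."""
    return [entailment_input(a_tokens, b_tokens),
            entailment_input(b_tokens, a_tokens)]

def multiple_choice_inputs(context_tokens, candidate_answers):
    """One sequence per candidate answer, each paired with the same context."""
    return [[START, *context_tokens, DELIM, *answer, EXTRACT]
            for answer in candidate_answers]

example = entailment_input(["it", "rains"], ["it", "is", "wet"])
```

Because every task is reduced to one or more plain token sequences, only a small task-specific output layer on top of the pre-trained model needs to change between tasks.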

Performance and evaluation

GPT-1 achieved improvements of 5.8% and 1.5% over previous best results^[3] on natural language inference (also known as textual entailment) tasks, evaluating the ability to interpret pairs of sentences from various datasets and classify the relationship between them as "entailment", "contradiction" or "neutral".^[3] Examples of such datasets include QNLI (Wikipedia articles) and MultiNLI (transcribed speech, popular fiction, and government reports, among other sources).^[7] It similarly outperformed previous models on two tasks related to question answering and commonsense reasoning: by 5.7% on RACE,^[8] a dataset of written question-answer pairs from middle and high school exams, and by 8.9% on the Story Cloze Test.^[9]

GPT-1 improved on previous best-performing models by 4.2% on semantic similarity (or paraphrase detection), evaluating the ability to predict whether two sentences are paraphrases of one another, using the Quora Question Pairs (QQP) dataset.^[3]

GPT-1 achieved a score of 45.4, versus a previous best of 35.0^[3] in a text classification task using the Corpus of Linguistic Acceptability (CoLA). Finally, GPT-1 achieved an overall score of 72.8 (compared to a previous record of 68.9) on GLUE, a multi-task test.^[10]

References

[1] "gpt-2". GitHub. Archived from the original on 11 March 2023. Retrieved 13 March 2023.
[2] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
[3] Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. p. 12. Archived (PDF) from the original on 26 January 2021. Retrieved 23 January 2021.
[4] "GPT-1 to GPT-4: Each of OpenAI's GPT Models Explained and Compared". 11 April 2023. Archived from the original on 2023-04-15. Retrieved 2023-04-29.
[5] Tsvetkov, Yulia (22 June 2017). "Opportunities and Challenges in Working with Low-Resource Languages" (PDF). Carnegie Mellon University. Archived (PDF) from the original on 31 March 2020. Retrieved 23 January 2021.
[6] arXiv:1506.06724 [cs.CV]. # of books: 11,038 / # of sentences: 74,004,228 / # of words: 984,846,357 / mean # of words per sentence: 13 / median # of words per sentence: 11
[7] Williams, Adina; Nangia, Nikita; Bowman, Samuel (1 June 2018). "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference" (PDF). Association for Computational Linguistics. Archived (PDF) from the original on 11 February 2020. Retrieved 23 January 2021. At 433k examples, this resource is one of the largest corpora available for natural language inference (a.k.a. recognizing textual entailment), [...] offering data from ten distinct genres of written and spoken English [...] while supplying an explicit setting for evaluating cross-genre domain adaptation.
[8] arXiv:1704.04683 [cs.CL].
[9] Mostafazadeh, Nasrin; Roth, Michael; Louis, Annie; Chambers, Nathanael; Allen, James F. (3 April 2017). "LSDSem 2017 Shared Task: The Story Cloze Test" (PDF). Association for Computational Linguistics. Archived (PDF) from the original on 22 November 2020. Retrieved 23 January 2021. The LSDSem'17 shared task is the Story Cloze Test, a new evaluation for story understanding and script learning. This test provides a system with a four-sentence story and two possible endings, and the system must choose the correct ending. Successful narrative understanding (getting closer to human performance of 100%) requires systems to link various levels of semantics to commonsense knowledge.
[10] arXiv:1804.07461 [cs.CL].

