BERT (language model)
Bidirectional Encoder Representations from Transformers (BERT) is a
BERT was originally implemented in the English language at two model sizes:[1] (1) BERTBASE: 12 encoders with 12 bidirectional self-attention heads totaling 110 million parameters, and (2) BERTLARGE: 24 encoders with 16 bidirectional self-attention heads totaling 340 million parameters. Both models were pre-trained on the Toronto BookCorpus[4] (800M words) and English Wikipedia (2,500M words).
Design
BERT is an "encoder-only"
On a high level, BERT consists of three modules:
- embedding. This module converts an array of one-hot encoded tokens into an array of vectors representing the tokens.
- a stack of encoders. These encoders are the Transformer encoders. They perform transformations over the array of representation vectors.
- un-embedding. This module converts the final representation vectors back into predicted tokens, by producing a probability distribution over the vocabulary for each position.
The un-embedding module is necessary for pretraining, but it is often unnecessary for downstream tasks. Instead, one takes the representation vectors output by the last encoder in the stack, uses them as a vector representation of the text input, and trains a smaller model on top of them.
BERT uses WordPiece to split English text into subword tokens, each of which is mapped to an integer code. Its vocabulary has about 30,000 entries. Any token that cannot be matched against the vocabulary is replaced by [UNK] for "unknown".
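The lookup step can be illustrated with a minimal sketch of greedy longest-match-first splitting, the strategy commonly used for WordPiece at inference time. The five-entry vocabulary and the function names here are hypothetical and exist only for illustration; the "##" prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (the real one has about 30,000 entries).
TOY_VOCAB = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "dog": 4}


def encode(word, vocab=TOY_VOCAB):
    """Map a word to the integer codes of its pieces."""
    return [vocab[piece] for piece in wordpiece_tokenize(word, vocab)]
```

With this toy vocabulary, "playing" is split into "play" and "##ing", while a word with no matching pieces, such as "cat", becomes [UNK].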
Pretraining
BERT was pre-trained simultaneously on two tasks:[5]
masked language modeling: 15% of tokens were selected for prediction, and the training objective was to predict each selected token given its context. Each selected token was
- replaced with a [MASK] token with probability 80%,
- replaced with a random word token with probability 10%,
- not replaced with probability 10%.
For example, the sentence "my dog is cute" may have the 4th token selected for prediction. The model would then receive the input text
- "my dog is [MASK]" with probability 80%,
- "my dog is happy" with probability 10%,
- "my dog is cute" with probability 10%.
After processing the input text, the model's 4th output vector is passed to a separate neural network, which outputs a probability distribution over the 30,000-token vocabulary.
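The selection and replacement rule can be sketched as follows. This is a simplified illustration, not the released implementation: it selects each token independently with probability 15% and does not treat special tokens differently, and the function and parameter names are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (model_input, targets) for the masked language modeling task.

    targets maps each selected position to the original token to predict.
    """
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        targets[i] = token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, targets
```

Only the positions recorded in targets contribute to the training objective; all other positions serve as context.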
next sentence prediction: Given two spans of text, the model predicts whether these two spans appeared sequentially in the training corpus, outputting either [IsNext] or [NotNext]. The first span starts with a special token [CLS] (for "classify"). The two spans are separated by a special token [SEP] (for "separate"). After processing the two spans, the first output vector (the vector coding for [CLS]) is passed to a separate neural network for the binary classification into [IsNext] and [NotNext].
- For example, given "[CLS] my dog is cute [SEP] he likes playing" the model should output token [IsNext].
- Given "[CLS] my dog is cute [SEP] how do magnets work" the model should output token [NotNext].
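The input layout for a pair of spans can be sketched as below. The function name is hypothetical, and the sketch follows the format of the two examples above; the released implementation additionally appends a final [SEP] after the second span. The second returned list marks which span each token belongs to.

```python
def build_pair_input(tokens_a, tokens_b):
    """Join two spans into one token sequence plus per-token span labels."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b
    # [CLS], the first span, and the separating [SEP] are type 0;
    # every token after the [SEP] is type 1.
    type_ids = [0] * (len(tokens_a) + 2) + [1] * len(tokens_b)
    return tokens, type_ids


tokens, type_ids = build_pair_input(
    "my dog is cute".split(), "he likes playing".split()
)
```

For this pair, tokens reads "[CLS] my dog is cute [SEP] he likes playing", the first six entries of type_ids are 0, and the last three are 1.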
As a result of this training process, BERT learns
Architecture details
This section describes BERTBASE. BERTLARGE follows the same design at a larger scale, with 24 encoders and 16 self-attention heads.
The lowest layer is the embedding layer, which contains three components: word_embeddings, position_embeddings, token_type_embeddings.
- word_embeddings takes in a one-hot vector of the input token. The one-hot vector input has dimension 30,000, because BERT has a vocabulary size that large.
- position_embeddings performs absolute position embedding. It is like word_embeddings, but on a vocabulary consisting of just the positions 0 to 511, since BERT has a context window of 512 tokens.
- token_type_embeddings is like word_embeddings, but on a vocabulary consisting of just 0 and 1. The only type-1 tokens are those that appear after the [SEP]. All other tokens are type-0.
The three outputs are added, then pushed through a LayerNorm (layer normalization), obtaining an array of representation vectors, each having 768 dimensions.
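The embedding layer described above can be sketched in plain Python. The sketch uses toy sizes (a vocabulary of 10, 8 positions, 4 dimensions) instead of 30,000, 512 and 768, initializes the tables randomly rather than loading trained weights, and omits the learned scale and shift of LayerNorm; all function names are hypothetical.

```python
import math
import random


def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def make_table(rows, dim, rng):
    """Randomly initialized lookup table: one vector of size dim per row."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, type_ids, word_emb, pos_emb, type_emb):
    """Add word, position and token-type vectors, then apply LayerNorm."""
    output = []
    for position, (tok, typ) in enumerate(zip(token_ids, type_ids)):
        summed = [
            w + p + t
            for w, p, t in zip(word_emb[tok], pos_emb[position], type_emb[typ])
        ]
        output.append(layer_norm(summed))
    return output
```

Indexing a table by an integer id is equivalent to multiplying a one-hot vector by the embedding matrix, which is why the lookup can stand in for the one-hot formulation used in the text.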
After this, the representation vectors pass through 12 Transformer encoders and are then un-embedded by a small stack of layers: an affine layer, an Add & LayerNorm layer, and a linear layer.
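As a rough check on the reported size of about 110 million parameters, the weights of the embedding layer and the 12 encoders can be tallied. Two details in this sketch are assumptions not stated in the text above: a feed-forward width of 3,072 in each encoder, and one extra 768-by-768 "pooler" layer applied to the [CLS] vector. The pretraining output heads are not counted, and the function name is hypothetical.

```python
def count_parameters(vocab=30_000, hidden=768, layers=12,
                     ffn=3_072, max_positions=512, token_types=2):
    """Approximate parameter count of a BERT-style encoder.

    ffn=3072 and the pooler layer are assumptions about the released model.
    The number of attention heads does not change the count, because the
    heads split the same hidden-by-hidden projection matrices.
    """
    layer_norm = 2 * hidden  # scale and shift vectors

    embeddings = (vocab + max_positions + token_types) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    encoder = attention + layer_norm + feed_forward + layer_norm

    pooler = hidden * hidden + hidden

    return embeddings + layers * encoder + pooler


total = count_parameters()
print(f"{total:,}")  # 109,081,344
```

Under these assumptions the total is about 109 million, in line with the figure of 110 million quoted for BERTBASE.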
Performance
When BERT was published, it achieved state-of-the-art performance on a number of natural language understanding tasks:[1]
- GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
- SQuAD (Stanford Question Answering Dataset[7]) v1.1 and v2.0
- SWAG (Situations With Adversarial Generations[8])
Analysis
The reasons for BERT's
The high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained. Because BERT is based on the Transformer architecture, its self-attention mechanism can draw on the text to both the left and the right of each token during training, and it consequently gains a deep understanding of the context. For example, the word fine can have two different meanings depending on the context (I feel fine today, She has fine blond hair). BERT considers the words surrounding the target word fine on both the left and the right side. However, this comes at a cost: because the encoder-only architecture lacks a decoder, BERT cannot be prompted and cannot generate text. Bidirectional models in general do not work effectively without the right-side context, which makes them difficult to prompt, and even short text generation requires sophisticated, computationally expensive techniques.[15]
In contrast to deep learning neural networks that must be trained from scratch on very large amounts of data, BERT has already been pre-trained, which means that it has learned representations of words and sentences as well as the semantic relations between them. BERT can then be
History
BERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The design has its origins in pre-training contextual representations, including
On October 25, 2019,
A later paper proposes RoBERTa, which preserves BERT's architecture, but improves its training, changing key hyperparameters, removing the next-sentence prediction task, and using much larger mini-batch sizes.[23]
Recognition
The research paper describing BERT won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).[24]
References
- arXiv:1810.04805v2 [cs.CL].
- "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. November 2, 2018. Retrieved November 27, 2019.
- S2CID 211532403.
- arXiv:1506.06724 [cs.CV].
- "Summary of the models — transformers 3.4.0 documentation". huggingface.co. Retrieved February 16, 2023.
- Horev, Rani (2018). "BERT Explained: State of the art language model for NLP". Towards Data Science. Retrieved September 27, 2021.
- arXiv:1606.05250 [cs.CL].
- arXiv:1808.05326 [cs.CL].
- S2CID 201645145.
- .
- S2CID 21700944.
- S2CID 4460159.
- S2CID 52090220.
- .
- arXiv:2209.14500 [cs.LG].
- "BERT". GitHub. Retrieved March 28, 2023.
- arXiv:1511.01432 [cs.LG].
- arXiv:1802.05365v2 [cs.CL].
- arXiv:1801.06146v5 [cs.CL].
- Nayak, Pandu (October 25, 2019). "Understanding searches better than ever before". Google Blog. Retrieved December 10, 2019.
- Montti, Roger (December 10, 2019). "Google's BERT Rolls Out Worldwide". Search Engine Journal. Retrieved December 10, 2019.
- "Google: BERT now used on almost every English query". Search Engine Land. October 15, 2020. Retrieved November 24, 2020.
- arXiv:1907.11692 [cs.CL].
- "Best Paper Awards". NAACL. 2019. Retrieved March 28, 2020.
Further reading
- Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What we know about how BERT works". arXiv:2002.12327 [cs.CL].