BERT
[!tip] BERT is the product of a training technique applied to the Transformer architecture (see Why Transformer?).
Masked LM
Of the tokens selected for prediction (15% of each sequence), the following corruption rule is applied (see the sketch after the list):
- 80% of the time, the token is replaced with the [MASK] token.
- 10% of the time, the token is replaced with a random token.
- 10% of the time, the token is left unchanged.
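A minimal Python sketch of this corruption rule, assuming token ids are plain integers and that `mask_token_id` and `vocab_size` come from the tokenizer (both names are illustrative, not prescribed by this note):

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, select_prob=0.15):
    """Apply BERT-style masking: pick ~15% of positions, then corrupt them 80/10/10."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)          # -100 marks positions ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if random.random() < select_prob:  # token selected for prediction
            labels[i] = tok                # the model must recover the original token
            roll = random.random()
            if roll < 0.8:                 # 80%: replace with [MASK]
                inputs[i] = mask_token_id
            elif roll < 0.9:               # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: leave the token unchanged
    return inputs, labels
```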
Next Sentence Prediction
To predict if the second sentence is indeed connected to the first, the following steps are performed:
- The entire input sequence goes through the Transformer model.
- The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
- The probability of IsNextSequence is calculated with softmax (see the sketch after the list).
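A minimal PyTorch sketch of this classification head, assuming `cls_output` stands in for the final hidden state of the [CLS] token and using BERT-base's hidden size of 768 (both assumptions for illustration):

```python
import torch
import torch.nn as nn

hidden_size = 768                          # BERT-base hidden size
nsp_head = nn.Linear(hidden_size, 2)       # learned weights and biases -> 2 logits

cls_output = torch.randn(1, hidden_size)   # stand-in for the [CLS] output of the Transformer
logits = nsp_head(cls_output)              # shape (1, 2)
probs = torch.softmax(logits, dim=-1)      # P(IsNextSequence), P(NotNext)
```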
[!quote] When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.
[!todo] BERT combined loss function detail
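For reference, the original paper's pre-training loss is simply the sum of the two objectives' cross-entropy terms:

$$
\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}
$$

where $\mathcal{L}_{\text{MLM}}$ is the cross-entropy over the masked positions and $\mathcal{L}_{\text{NSP}}$ is the cross-entropy of the IsNextSequence classification.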
Different Tasks
[!success] BERT can be used for a wide variety of language tasks, while only adding a small layer to the core model.
Sentiment Analysis
Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
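A minimal sketch using the Hugging Face transformers library (the library choice is an assumption here, not something the note prescribes); `BertForSequenceClassification` adds exactly this kind of classification layer on top of the [CLS] output:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The classification head is randomly initialized and must be fine-tuned before use.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # classification layer over the [CLS] output
probs = torch.softmax(logits, dim=-1)      # e.g. P(negative), P(positive)
```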
Question Answering
In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
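A minimal PyTorch sketch of those two extra vectors, here packed into a single linear layer that produces a start score and an end score per token (`sequence_output` is a stand-in for BERT's token-level outputs):

```python
import torch
import torch.nn as nn

hidden_size, seq_len = 768, 128
sequence_output = torch.randn(1, seq_len, hidden_size)   # stand-in for BERT's token outputs

# Two learned vectors: one scoring each token as the answer start, one as the answer end.
qa_outputs = nn.Linear(hidden_size, 2)
start_logits, end_logits = qa_outputs(sequence_output).split(1, dim=-1)

start_idx = start_logits.squeeze(-1).argmax(dim=-1)       # predicted start position
end_idx = end_logits.squeeze(-1).argmax(dim=-1)           # predicted end position
```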
Named Entity Recognition
In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
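A minimal PyTorch sketch of the per-token classification layer, assuming an illustrative label set of 9 NER tags (`sequence_output` again stands in for BERT's token-level outputs):

```python
import torch
import torch.nn as nn

hidden_size, seq_len, num_labels = 768, 128, 9
sequence_output = torch.randn(1, seq_len, hidden_size)   # stand-in for BERT's token outputs

# The same classification layer is applied to every token's output vector.
ner_head = nn.Linear(hidden_size, num_labels)
logits = ner_head(sequence_output)                       # shape (1, seq_len, num_labels)
predicted_labels = logits.argmax(dim=-1)                 # one NER label id per token
```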