BERT

[!tip] BERT is the product of training techniques applied to the Transformer architecture (see Why Transformer?).

Masked LM

Of the tokens in each input sequence, 15% are selected for prediction; the selected tokens are corrupted as follows (see the sketch after this list):

  • 80% are replaced with the [MASK] token.
  • 10% are replaced with a random token.
  • 10% are left unchanged.
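
As a minimal sketch of this corruption scheme (the helper `corrupt_tokens`, the toy vocabulary, and the per-position 15% selection are illustrative assumptions, not code from the source):

```python
import random

MASK_TOKEN = "[MASK]"

def corrupt_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply the 80/10/10 Masked LM corruption to a list of tokens."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = {}  # position -> original token, used as the prediction targets
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:      # select roughly 15% of positions
            labels[i] = tok
            r = rng.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                corrupted[i] = MASK_TOKEN
            elif r < 0.9:                 # 10%: replace with a random token
                corrupted[i] = rng.choice(vocab)
            # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = "my dog is hairy".split()
print(corrupt_tokens(tokens, vocab=tokens, seed=0))
```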

Next Sentence Prediction

To predict if the second sentence is indeed connected to the first, the following steps are performed:

  1. The entire input sequence goes through the Transformer model.
  2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
  3. The probability of IsNextSequence is calculated with softmax.
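
A minimal PyTorch sketch of steps 2 and 3, assuming `cls_output` stands in for the final hidden state of the [CLS] token and a hidden size of 768 (BERT-base):

```python
import torch
import torch.nn as nn

hidden_size = 768                              # BERT-base hidden size
cls_output = torch.randn(1, hidden_size)       # stand-in for the [CLS] output vector

# Simple classification layer: learned weights and biases projecting to 2 classes
nsp_head = nn.Linear(hidden_size, 2)

logits = nsp_head(cls_output)                  # shape (1, 2)
probs = torch.softmax(logits, dim=-1)          # [P(IsNextSequence), P(NotNextSequence)]
print(probs)
```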

[!quote] When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.

[!todo] BERT combined loss function detail
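
Per the BERT paper, the pre-training loss is the sum of the masked-LM and next-sentence-prediction losses; a sketch in cross-entropy form (the notation below is illustrative, not taken from the source):

```latex
\mathcal{L}
  = \underbrace{-\sum_{i \in \mathcal{M}} \log P\!\left(x_i \mid \tilde{x}\right)}_{\mathcal{L}_{\text{MLM}}}
  \; + \;
  \underbrace{-\log P\!\left(y_{\text{IsNext}} \mid \text{[CLS]}\right)}_{\mathcal{L}_{\text{NSP}}}
```

where $\mathcal{M}$ is the set of masked positions and $\tilde{x}$ is the corrupted input sequence.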

Different Tasks

[!success] BERT can be used for a wide variety of language tasks, while only adding a small layer to the core model.

sentiment analysis

Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
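
A hedged fine-tuning sketch using the Hugging Face transformers library (the library, the `bert-base-uncased` checkpoint, and the two-label setup are tooling assumptions, not part of the source):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# BERT plus a small classification layer on top of the [CLS] output
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape (1, 2): e.g. negative / positive
print(torch.softmax(logits, dim=-1))           # the head is randomly initialized here;
                                               # it would be fine-tuned on labeled sentiment data
```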

question answering

In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
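
A minimal sketch of those two extra vectors, assuming `token_outputs` stands in for BERT's per-token output vectors over the passage:

```python
import torch

hidden_size, seq_len = 768, 20
token_outputs = torch.randn(seq_len, hidden_size)   # stand-in for BERT's per-token outputs

# Two extra learned vectors: one scoring answer starts, one scoring answer ends
start_vector = torch.nn.Parameter(torch.randn(hidden_size))
end_vector = torch.nn.Parameter(torch.randn(hidden_size))

start_probs = torch.softmax(token_outputs @ start_vector, dim=0)   # one score per token
end_probs = torch.softmax(token_outputs @ end_vector, dim=0)

start_idx, end_idx = start_probs.argmax().item(), end_probs.argmax().item()
print(start_idx, end_idx)                     # predicted answer span boundaries
```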

named entity recognition

In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
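
A minimal per-token classification sketch, assuming `token_outputs` stands in for BERT's per-token output vectors and a 9-label BIO tag set (as in CoNLL-2003):

```python
import torch
import torch.nn as nn

hidden_size, seq_len, num_labels = 768, 12, 9        # 9 BIO labels as in CoNLL-2003
token_outputs = torch.randn(seq_len, hidden_size)    # stand-in for BERT's per-token outputs

ner_head = nn.Linear(hidden_size, num_labels)        # classification layer shared across tokens
logits = ner_head(token_outputs)                     # shape (seq_len, num_labels)
predicted_label_ids = logits.argmax(dim=-1)          # one NER label id per token
print(predicted_label_ids)
```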
