Samee Arif
Hugging Face

To BERT, with RoBERTa: Part I

August 27, 2024


Figure 1: Illustration generated using GPT-4o

Introduction

Both BERT[1] and RoBERTa[2] are encoder-only transformer models that transform an input sequence into a set of continuous representations (vectors) that capture the information and context of the input. These models can be applied to the tasks listed below by adding a classification layer on top.

  1. Text Classification:

    Example: Sentiment Analysis - Determining whether a review is positive, negative, or neutral.

  2. Token Classification:

    Example: Named Entity Recognition (NER) - Identifying and classifying each token in a sentence, such as names of people, organizations, or locations.

  3. Masked Word Prediction:

    Example: Text Completion - Predicting missing words in a sentence like “Pair the [MASK] with spicy ramen and chilli oil.”

Model Architecture


Figure 2: The architecture of an encoder-only model

The first image in Figure 2 presents the encoder-only architecture with a single encoder block. The second image presents the BERT and RoBERTa architecture, with multiple encoder blocks stacked on top of each other.

1. Input Embeddings

Input embeddings are composed of three embeddings in BERT or two embeddings in RoBERTa:

Input Embeddings = Token + Positional + Segment Embeddings

In RoBERTa, Segment Embeddings are omitted, as RoBERTa does not use the Next Sentence Prediction task and treats the entire input as a single sequence.

a. Token Embeddings

Each word in the input sequence is first converted into tokens. BERT uses a WordPiece[3] tokenizer, which splits words into subwords or characters; RoBERTa uses a byte-level BPE tokenizer that follows the same subword principle. Figure 3 shows the conversion of the word “Dumplings” into tokens.

Dumplings → Dump, ##lings

Figure 3: Conversion of a word into tokens. ## indicates that the token is a continuation of the previous token

In addition to word and subword tokens, there are special tokens like:

[CLS]: Indicates the start of the sequence.
[SEP]: Used to separate different segments in a sequence.
[MASK]: Represents masked tokens that the model attempts to predict.
[PAD]: Used for padding multiple sequences to the same length.
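
As a concrete illustration, the whole tokenization step can be reproduced with the Hugging Face tokenizers. This is a minimal sketch, assuming the bert-base-cased checkpoint; the exact subword split depends on that checkpoint's learned vocabulary.

from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (bert-base-cased is an assumed, arbitrary choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Split the raw text into subword tokens.
tokens = tokenizer.tokenize("Pair the dumplings with spicy ramen.")
print(tokens)  # the subword split depends on the checkpoint's vocabulary

# encode() additionally inserts the special tokens [CLS] and [SEP].
ids = tokenizer.encode("Pair the dumplings with spicy ramen.")
print(tokenizer.convert_ids_to_tokens(ids))  # starts with '[CLS]' and ends with '[SEP]'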

The vocabulary for the model is created by tokenizing the training dataset, where tokens are ranked by their frequency in the corpus: more common tokens are assigned lower integer IDs, while less common tokens receive higher IDs. Figure 4 shows the vocabulary.json file for the model.

{
  ...
  "##lings": 11227,
  ...
  "Dump": 15653,
  ...
}

Figure 4: vocabulary.json file. It contains the mapping of tokens to IDs

After tokenizing the input text, each token is given the ID that was assigned to it when the vocabulary was created. Figure 5 shows the conversion of tokens to their corresponding IDs.

Dump → 15653
##lings → 11227

Figure 5: Conversion of tokens to IDs
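
The same lookup can be done programmatically. A minimal sketch, again assuming bert-base-cased; the printed IDs come from that checkpoint's vocabulary, so they will not match the example values above.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

tokens = tokenizer.tokenize("Dumplings")       # split into subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # look each token up in the vocabulary
print(dict(zip(tokens, ids)))                  # IDs depend on the checkpoint's vocabulary

# The full token-to-ID mapping (the content of vocabulary.json) is also accessible:
vocab = tokenizer.get_vocab()
print(len(vocab))                              # vocabulary size n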

These token IDs are then used to look up corresponding vectors in a large matrix called the embedding matrix. The embedding matrix is a learned parameter of the model and is initialized randomly before training. If the model’s dimensionality is h (the size of the embedding vector) and the vocabulary size is n, the embedding matrix will have dimensions (n × h). Each row in this matrix corresponds to the embedding vector of a specific token ID, and for each token ID the corresponding row from the embedding matrix is retrieved. Matrix 1 is an example of the embedding matrix.

Token ID        | Dimension 1 | Dimension 2 | … | Dimension h
11227 (##lings) | 0.025       | -0.034      | … | 0.112
15653 (Dump)    | -0.045      | 0.081       | … | 0.067
…               | …           | …           | … | …
Token ID n      | …           | …           | … | …

Matrix 1: Embedding matrix. The values shown are illustrative, not the actual trained values
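
A minimal PyTorch sketch of this lookup, using BERT-base-like dimensions (vocabulary size 30522 and hidden size 768) and an untrained, randomly initialized embedding matrix:

import torch
import torch.nn as nn

n, h = 30522, 768  # vocabulary size and hidden size (BERT-base-uncased values)

# The embedding matrix: one learned h-dimensional row per token ID.
token_embedding = nn.Embedding(num_embeddings=n, embedding_dim=h)

# Example token IDs for "Dump" and "##lings" (15653 and 11227 in the article's example vocabulary).
token_ids = torch.tensor([[15653, 11227]])  # shape (batch=1, s=2)
token_vectors = token_embedding(token_ids)  # shape (1, 2, h): one row retrieved per token ID
print(token_vectors.shape)                  # torch.Size([1, 2, 768])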

b. Positional Embeddings

The positional embeddings are learnable parameters, just like the token embeddings. During model initialization, the positional embedding vectors are initialized with random values. For example, if the model is designed to handle sequences of up to 512 tokens, there will be 512 positional embedding vectors, one for each possible position in the input sequence. Since Transformers do not have a built-in notion of sequence order, positional embeddings are added to the token embeddings. These embeddings encode the position of each token in the sequence, allowing the model to understand the order of words. In the case of “Dumplings”, the 1st position vector is added to the token embedding of “Dump” and the 2nd position vector is added to the token embedding of “##lings”. This ensures the model understands the sequential nature of the data, which is critical for capturing context.

c. Segment Embeddings

BERT is designed to handle pairs of sentences (as in tasks such as Question Answering or Next Sentence Prediction). To distinguish between the two sentences, BERT adds segment embeddings (a learned parameter): each token is assigned an embedding indicating whether it belongs to the first or the second sentence. RoBERTa does not use segment embeddings because it is not trained with the Next Sentence Prediction task.
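
Putting the three pieces together, here is a simplified sketch of how the input embeddings are formed. The real implementation also applies layer normalization and dropout to the sum, which is omitted here.

import torch
import torch.nn as nn

n, h, max_len = 30522, 768, 512  # vocabulary size, hidden size, maximum sequence length

token_embedding    = nn.Embedding(n, h)        # one row per token ID
position_embedding = nn.Embedding(max_len, h)  # one learned row per position
segment_embedding  = nn.Embedding(2, h)        # sentence A vs. sentence B (BERT only)

token_ids   = torch.tensor([[15653, 11227]])                # "Dump", "##lings" (example IDs)
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1]]
segment_ids = torch.zeros_like(token_ids)                   # both tokens belong to sentence A

# BERT: sum of all three embeddings; RoBERTa would skip the segment term.
x_embed = token_embedding(token_ids) + position_embedding(positions) + segment_embedding(segment_ids)
print(x_embed.shape)  # (1, s=2, h=768) -- this is X_embed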

2. Encoder Block

a. Multi-Head Attention

Multi-head attention enables the model to attend to different parts of the sequence simultaneously. This mechanism helps the model capture complex relationships between words, regardless of their distance in the sequence.

After adding the token embeddings, positional embeddings, and segment embeddings (in the case of BERT), we get the input embeddings X_embed of shape (s, h), where s is the input sequence length (number of input tokens) and h is the model’s dimensionality (embedding vector size). Therefore, for the input “Dumplings are tasty”, the matrix X_embed would be as shown in Matrix 2.

Token   | Dimension 1 | Dimension 2 | … | Dimension h
Dump    | e_11        | e_12        | … | e_1h
##lings | e_21        | e_22        | … | e_2h
are     | e_31        | e_32        | … | e_3h
ta      | e_41        | e_42        | … | e_4h
##sty   | e_51        | e_52        | … | e_5h

Matrix 2: Embedding matrix X_embed

BERT and RoBERTa have n attention heads, and for each i-th head the input embedding X_embed is linearly transformed into three different matrices: Query Q_i, Key K_i, and Value V_i. This is done by multiplying X_embed by three different weight matrices: W_i^Q, W_i^K, and W_i^V. The shape of these weight matrices is (h, h_head), where h_head = h / n. The resulting Q_i, K_i, and V_i matrices have the shape (s, h_head).

Q_i = X_embed × W_i^Q
K_i = X_embed × W_i^K
V_i = X_embed × W_i^V
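
A sketch of these per-head projections in PyTorch, assuming BERT-base dimensions (h = 768, n = 12 heads, so h_head = 64) and random stand-in embeddings:

import torch
import torch.nn as nn

s, h, n_heads = 5, 768, 12
h_head = h // n_heads  # 64

x_embed = torch.randn(s, h)  # stand-in for the embeddings of "Dump ##lings are ta ##sty"

# Per-head projection matrices W_i^Q, W_i^K, W_i^V of shape (h, h_head); bias omitted for brevity.
w_q = nn.Linear(h, h_head, bias=False)
w_k = nn.Linear(h, h_head, bias=False)
w_v = nn.Linear(h, h_head, bias=False)

q_i, k_i, v_i = w_q(x_embed), w_k(x_embed), w_v(x_embed)
print(q_i.shape, k_i.shape, v_i.shape)  # each (s, h_head) = (5, 64)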

After this, the attention scores are calculated by multiplying the Query matrix Q_i with the transpose of the Key matrix K_i^T. The result of these dot products is an attention score (similarity) matrix S_i of shape (s, s) that indicates the similarity between each pair of tokens. This shows how much each token (represented by a query) “attends” to every other token (represented by a key). The attention scores are divided by the square root of h_head (to prevent the scores from becoming too large as the dimension increases), and then the scaled attention scores are passed through a softmax function to obtain the attention weights. The attention weight matrix S_i is shown in Matrix 3, where s_21 gives the importance of “Dump” while processing “##lings”.

        | Dump | ##lings | are  | ta   | ##sty
Dump    | s_11 | s_12    | s_13 | s_14 | s_15
##lings | s_21 | s_22    | s_23 | s_24 | s_25
are     | s_31 | s_32    | s_33 | s_34 | s_35
ta      | s_41 | s_42    | s_43 | s_44 | s_45
##sty   | s_51 | s_52    | s_53 | s_54 | s_55

Matrix 3: Attention weight matrix S_i with shape (s, s): S_i = Softmax(Q_i × K_i^T / √h_head)

After we have the attention weight matrix S_i, we calculate the output of the attention head using the equation A_i = S_i × V_i for the i-th head. We get A_i of shape (s, h_head); it is a context-aware representation of each token in the sequence, allowing the model to focus on the most important parts of the input when generating its final output.
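
Continuing with random stand-ins for Q_i, K_i, and V_i (again assuming h_head = 64), the attention weights and the head output can be computed as:

import math
import torch

s, h_head = 5, 64
q_i, k_i, v_i = (torch.randn(s, h_head) for _ in range(3))  # stand-ins for the projections above

scores = q_i @ k_i.transpose(-2, -1) / math.sqrt(h_head)  # (s, s) scaled dot products
s_i = torch.softmax(scores, dim=-1)                       # attention weights: each row sums to 1
a_i = s_i @ v_i                                           # head output A_i, shape (s, h_head)
print(s_i.shape, a_i.shape)                               # torch.Size([5, 5]) torch.Size([5, 64])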

In multi-head attention, this process is repeated for several heads (each with different weight matrices). This allows the model to focus on different parts of the sequence. The output matrix A_i from each head is concatenated to get the final output A_atten = [A_1; A_2; …; A_n] of multi-head attention. After concatenation, the shape of A_atten is (s, n × h_head) = (s, h).
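
A compact sketch of the full multi-head computation; the additional output projection that real implementations apply after concatenation is omitted, since it is not covered above.

import math
import torch
import torch.nn as nn

s, h, n_heads = 5, 768, 12
h_head = h // n_heads
x_embed = torch.randn(s, h)  # stand-in input embeddings

heads = []
for _ in range(n_heads):
    # Each head has its own projection matrices (bias omitted for brevity).
    w_q, w_k, w_v = (nn.Linear(h, h_head, bias=False) for _ in range(3))
    q, k, v = w_q(x_embed), w_k(x_embed), w_v(x_embed)
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(h_head), dim=-1)
    heads.append(weights @ v)           # A_i of shape (s, h_head)

a_atten = torch.cat(heads, dim=-1)      # [A_1; ...; A_n], shape (s, n_heads * h_head) = (s, h)
print(a_atten.shape)                    # torch.Size([5, 768])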

b. Add & Norm

The output of the multi-head attention A_atten is added to the original input X_embed. This residual connection helps mitigate the vanishing gradient problem by allowing gradients to flow directly through the network, making it easier to train deep networks. The residual connection also helps preserve the original information from the input. After adding the residual connection, the result is passed through a layer normalization step. Layer normalization stabilizes training by normalizing the output of the Add step to have a mean of 0 and a standard deviation of 1 across the hidden dimension. The output of this layer is denoted by X_norm1 and its shape is (s, h).

X_norm1 = LayerNorm(X_embed + A_atten)
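
A minimal sketch of the Add & Norm step, using random stand-ins for X_embed and A_atten:

import torch
import torch.nn as nn

s, h = 5, 768
x_embed = torch.randn(s, h)   # input to the encoder block
a_atten = torch.randn(s, h)   # stand-in for the multi-head attention output

layer_norm = nn.LayerNorm(h)             # normalizes across the hidden dimension
x_norm1 = layer_norm(x_embed + a_atten)  # residual connection followed by LayerNorm
print(x_norm1.shape)                     # torch.Size([5, 768])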

c. Feed Forward Layer

Both BERT and RoBERTa have two feed-forward layers. The first feed-forward layer projects X_norm1 from size (s, h) to (s, h_ff), where h_ff is greater than h. The second feed-forward layer projects the output of the first layer back to (s, h). The output of the feed-forward layers is denoted as O_ff and is passed into the second Add & Norm part of the encoder block. The output of the second Add & Norm is denoted by X_norm2 and its shape is (s, h).

X_norm2 = LayerNorm(X_norm1 + O_ff)
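
A sketch of the feed-forward layers and the second Add & Norm, using the BERT-base sizes h = 768 and h_ff = 3072 and the GELU activation that BERT and RoBERTa use between the two layers:

import torch
import torch.nn as nn

s, h, h_ff = 5, 768, 3072   # BERT-base uses h_ff = 4 * h
x_norm1 = torch.randn(s, h)  # stand-in for the first Add & Norm output

feed_forward = nn.Sequential(
    nn.Linear(h, h_ff),   # project up to the larger intermediate size
    nn.GELU(),            # GELU activation
    nn.Linear(h_ff, h),   # project back down to the model dimension
)

o_ff = feed_forward(x_norm1)
x_norm2 = nn.LayerNorm(h)(x_norm1 + o_ff)  # second Add & Norm
print(x_norm2.shape)                       # torch.Size([5, 768])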

The output of the encoder block, X_norm2, is passed to the next encoder block, and the X_norm2 from the final encoder block can be passed to a feed-forward classifier layer for prediction.
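
In practice, these final hidden states can be obtained directly from a pretrained checkpoint and fed into a task-specific head. A minimal sketch, assuming bert-base-uncased and a hypothetical, untrained 3-way sentiment classifier on the [CLS] representation:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any BERT/RoBERTa encoder works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pair the dumplings with spicy ramen.", return_tensors="pt")
hidden_states = encoder(**inputs).last_hidden_state  # (batch, s, h): output of the final encoder block

# Hypothetical classification head on top of the [CLS] token (position 0).
classifier = nn.Linear(encoder.config.hidden_size, 3)
logits = classifier(hidden_states[:, 0])
print(logits.shape)  # torch.Size([1, 3])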

BERT vs. RoBERTa

BERT is trained on BookCorpus[4] and English Wikipedia[5], while RoBERTa is trained on a larger dataset, including BookCorpus, English Wikipedia, CC-News[6], OpenWebText[7], and Stories. BERT is trained using the Masked Language Model and Next Sentence Prediction tasks, but RoBERTa removed the Next Sentence Prediction task because not using it did not degrade the model’s performance. In BERT training, static masking is used (the same tokens are masked in every epoch), while RoBERTa uses dynamic masking (different tokens are masked across epochs). BERT is trained for fewer steps with smaller batch sizes, whereas RoBERTa is trained for longer with larger batch sizes and higher learning rates.
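
As a side note on masking, dynamic masking is what the transformers data collator for masked language modeling provides: it re-samples the masked positions every time a batch is built. A minimal sketch, assuming the roberta-base tokenizer:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # assumed checkpoint

# Masks 15% of tokens, re-sampled on every call -- i.e. dynamic masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("Pair the dumplings with spicy ramen.", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])
print(batch["input_ids"])  # different tokens are replaced by <mask> on different calls
print(batch["labels"])     # -100 everywhere except at the masked positions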

References

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[2] RoBERTa: A Robustly Optimized BERT Pretraining Approach

[3] Fast WordPiece Tokenization

[4] Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

[5] Wikipedia

[6] Common Crawl News Dataset (CC-News)

[7] OpenWebText Dataset