Samee Arif
Hugging Face

To BERT, with RoBERTa: Part I

August 27, 2024


Figure 1: Illustration generated using GPT-4o

Introduction

Both BERT[1] and RoBERTa[2] are encoder-only transformer models that transform an input sequence into a set of continuous representations (vectors) that capture the information and context of the input. These models can be applied to the tasks listed below by adding a classification layer on top.

  1. Text Classification:

    Example: Sentiment Analysis - Determining whether a review is positive, negative, or neutral.

  2. Token Classification:

    Example: Named Entity Recognition (NER) - Identifying and classifying each token in a sentence, such as names of people, organizations, or locations.

  3. Masked Word Prediction:

    Example: Text Completion - Predicting missing words in a sentence like “Pair the [MASK] with spicy ramen and chilli oil.”

Model Architecture


Figure 2: The architecture of an encoder-only model

The first image in Figure 2 presents the encoder-only architecture with a single encoder block. The second image presents the BERT and RoBERTa architecture, with multiple encoder blocks stacked on top of each other.

1. Input Embeddings

Input embeddings are composed of three embeddings in BERT or two embeddings in RoBERTa:

Input Embeddings = Token + Positional + Segment Embeddings

In RoBERTa, Segment Embeddings are omitted, as RoBERTa does not use the Next Sentence Prediction task and treats the entire input as a single sequence.

a. Token Embeddings

Each word in the input sequence is first converted into tokens. BERT uses a WordPiece[3] tokenizer, which splits words into subwords or characters; RoBERTa uses a byte-level BPE tokenizer that follows the same subword principle. Figure 3 shows the conversion of the word “Dumplings” into tokens.

Dumplings → Dump, ##lings

Figure 3: Conversion of a word into tokens. ## indicates that the token is a continuation of the previous token

In addition to word and subword tokens, there are special tokens like:

[CLS]: Indicates the start of the sequence.
[SEP]: Used to separate different segments in a sequence.
[MASK]: Represents masked tokens that the model attempts to predict.
[PAD]: Used for padding multiple sequences to the same length.
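
As a concrete illustration, the whole tokenization step can be reproduced with the Hugging Face tokenizers. This is a minimal sketch, assuming the bert-base-cased checkpoint; the exact subword split depends on that checkpoint's learned vocabulary.

from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (bert-base-cased is an assumed, arbitrary choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Split the raw text into subword tokens.
tokens = tokenizer.tokenize("Pair the dumplings with spicy ramen.")
print(tokens)  # the subword split depends on the checkpoint's vocabulary

# encode() additionally inserts the special tokens [CLS] and [SEP].
ids = tokenizer.encode("Pair the dumplings with spicy ramen.")
print(tokenizer.convert_ids_to_tokens(ids))  # starts with '[CLS]' and ends with '[SEP]'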

The vocabulary for the model is created by tokenizing the training dataset, where tokens are ranked by their frequency in the corpus: more common tokens are assigned lower integer IDs, while less common tokens receive higher IDs. Figure 4 shows the vocabulary.json file for the model.

{
  ...
  "##lings": 11227,
  ...
  "Dump": 15653,
  ...
}

Figure 4: vocabulary.json file. It contains the mapping of tokens to IDs

After tokenizing the input text, each token is given the ID that was assigned to it when the vocabulary was created. Figure 5 shows the conversion of tokens to their corresponding IDs.

Dump → 15653
##lings → 11227

Figure 5: Conversion of tokens to IDs
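
The same lookup can be done programmatically. A minimal sketch, again assuming bert-base-cased; the printed IDs come from that checkpoint's vocabulary, so they will not match the example values above.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

tokens = tokenizer.tokenize("Dumplings")       # split into subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # look each token up in the vocabulary
print(dict(zip(tokens, ids)))                  # IDs depend on the checkpoint's vocabulary

# The full token-to-ID mapping (the content of vocabulary.json) is also accessible:
vocab = tokenizer.get_vocab()
print(len(vocab))                              # vocabulary size n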

These token IDs are then used to look up corresponding vectors in a large matrix called the embedding matrix. The embedding matrix is a learned parameter of the model and is initialized randomly before training. If the model’s dimensionality is h (the size of the embedding vector) and the vocabulary size is n, the embedding matrix will have dimensions (n × h). Each row in this matrix corresponds to the embedding vector of a specific token ID, and for each token ID the corresponding row from the embedding matrix is retrieved. Matrix 1 is an example of the embedding matrix.

Token ID        | Dimension 1 | Dimension 2 | … | Dimension h
11227 (##lings) | 0.025       | -0.034      | … | 0.112
15653 (Dump)    | -0.045      | 0.081       | … | 0.067
…               | …           | …           | … | …
Token ID n      | …           | …           | … | …

Matrix 1: Embedding matrix. The values shown are illustrative, not the actual trained values
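
A minimal PyTorch sketch of this lookup, using BERT-base-like dimensions (vocabulary size 30522 and hidden size 768) and an untrained, randomly initialized embedding matrix:

import torch
import torch.nn as nn

n, h = 30522, 768  # vocabulary size and hidden size (BERT-base-uncased values)

# The embedding matrix: one learned h-dimensional row per token ID.
token_embedding = nn.Embedding(num_embeddings=n, embedding_dim=h)

# Example token IDs for "Dump" and "##lings" (15653 and 11227 in the article's example vocabulary).
token_ids = torch.tensor([[15653, 11227]])  # shape (batch=1, s=2)
token_vectors = token_embedding(token_ids)  # shape (1, 2, h): one row retrieved per token ID
print(token_vectors.shape)                  # torch.Size([1, 2, 768])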

b. Positional Embeddings

The positional embeddings are learnable parameters, just like the token embeddings. During model initialization, the positional embedding vectors are initialized with random values. For example, if the model is designed to handle sequences of up to 512 tokens, there will be 512 positional embedding vectors, one for each possible position in the input sequence. Since Transformers do not have a built-in notion of sequence order, positional embeddings are added to the token embeddings. These embeddings encode the position of each token in the sequence, allowing the model to understand the order of words. In the case of “Dumplings”, the 1st position vector is added to the token embedding of “Dump” and the 2nd position vector is added to the token embedding of “##lings”. This ensures the model understands the sequential nature of the data, which is critical for capturing context.

c. Segment Embeddings

BERT is designed to handle pairs of sentences (as in tasks such as Question Answering or Next Sentence Prediction). To distinguish between the two sentences, BERT adds segment embeddings (a learned parameter): each token is assigned an embedding indicating whether it belongs to the first or the second sentence. RoBERTa does not use segment embeddings because it is not trained with the Next Sentence Prediction task.
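
Putting the three pieces together, here is a simplified sketch of how the input embeddings are formed. The real implementation also applies layer normalization and dropout to the sum, which is omitted here.

import torch
import torch.nn as nn

n, h, max_len = 30522, 768, 512  # vocabulary size, hidden size, maximum sequence length

token_embedding    = nn.Embedding(n, h)        # one row per token ID
position_embedding = nn.Embedding(max_len, h)  # one learned row per position
segment_embedding  = nn.Embedding(2, h)        # sentence A vs. sentence B (BERT only)

token_ids   = torch.tensor([[15653, 11227]])                # "Dump", "##lings" (example IDs)
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1]]
segment_ids = torch.zeros_like(token_ids)                   # both tokens belong to sentence A

# BERT: sum of all three embeddings; RoBERTa would skip the segment term.
x_embed = token_embedding(token_ids) + position_embedding(positions) + segment_embedding(segment_ids)
print(x_embed.shape)  # (1, s=2, h=768) -- this is X_embed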

2. Encoder Block

a. Multi-Head Attention

Multi-head attention enables the model to attend to different parts of the sequence simultaneously. This mechanism helps the model capture complex relationships between words, regardless of their distance in the sequence.

After adding the token embeddings, positional embeddings, and segment embeddings (in the case of BERT), we get the input embeddings X_embed of shape (s, h), where s is the input sequence length (number of input tokens) and h is the model’s dimensionality (embedding vector size). Therefore, for the input “Dumplings are tasty”, the matrix X_embed would be as shown in Matrix 2.

Token   | Dimension 1 | Dimension 2 | … | Dimension h
Dump    | e_11        | e_12        | … | e_1h
##lings | e_21        | e_22        | … | e_2h
are     | e_31        | e_32        | … | e_3h
ta      | e_41        | e_42        | … | e_4h
##sty   | e_51        | e_52        | … | e_5h

Matrix 2: Embedding matrix X_embed

BERT and RoBERTa have n attention heads, and for each i-th head the input embedding X_embed is linearly transformed into three different matrices: Query Q_i, Key K_i, and Value V_i. This is done by multiplying X_embed by three different weight matrices: W_i^Q, W_i^K, and W_i^V. The shape of these weight matrices is (h, h_head), where h_head = h / n. The resulting Q_i, K_i, and V_i matrices have the shape (s, h_head).

Q_i = X_embed × W_i^Q
K_i = X_embed × W_i^K
V_i = X_embed × W_i^V
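
A sketch of these per-head projections in PyTorch, assuming BERT-base dimensions (h = 768, n = 12 heads, so h_head = 64) and random stand-in embeddings:

import torch
import torch.nn as nn

s, h, n_heads = 5, 768, 12
h_head = h // n_heads  # 64

x_embed = torch.randn(s, h)  # stand-in for the embeddings of "Dump ##lings are ta ##sty"

# Per-head projection matrices W_i^Q, W_i^K, W_i^V of shape (h, h_head); bias omitted for brevity.
w_q = nn.Linear(h, h_head, bias=False)
w_k = nn.Linear(h, h_head, bias=False)
w_v = nn.Linear(h, h_head, bias=False)

q_i, k_i, v_i = w_q(x_embed), w_k(x_embed), w_v(x_embed)
print(q_i.shape, k_i.shape, v_i.shape)  # each (s, h_head) = (5, 64)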

After this, the attention scores are calculated by multiplying the Query matrix Q_i with the transpose of the Key matrix K_i^T. The result of these dot products is an attention score (similarity) matrix S_i of shape (s, s) that indicates the similarity between each pair of tokens. This shows how much each token (represented by a query) “attends” to every other token (represented by a key). The attention scores are divided by the square root of h_head (to prevent the scores from becoming too large as the dimension increases), and then the scaled attention scores are passed through a softmax function to obtain the attention weights. The attention weight matrix S_i is shown in Matrix 3, where s_21 gives the importance of “Dump” while processing “##lings”.

        | Dump | ##lings | are  | ta   | ##sty
Dump    | s_11 | s_12    | s_13 | s_14 | s_15
##lings | s_21 | s_22    | s_23 | s_24 | s_25
are     | s_31 | s_32    | s_33 | s_34 | s_35
ta      | s_41 | s_42    | s_43 | s_44 | s_45
##sty   | s_51 | s_52    | s_53 | s_54 | s_55

Matrix 3: Attention weight matrix S_i with shape (s, s): S_i = Softmax(Q_i × K_i^T / √h_head)

After we have the attention weight matrix S_i, we calculate the output of the attention head using the equation A_i = S_i × V_i for the i-th head. We get A_i of shape (s, h_head); it is a context-aware representation of each token in the sequence, allowing the model to focus on the most important parts of the input when generating its final output.
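
Continuing with random stand-ins for Q_i, K_i, and V_i (again assuming h_head = 64), the attention weights and the head output can be computed as:

import math
import torch

s, h_head = 5, 64
q_i, k_i, v_i = (torch.randn(s, h_head) for _ in range(3))  # stand-ins for the projections above

scores = q_i @ k_i.transpose(-2, -1) / math.sqrt(h_head)  # (s, s) scaled dot products
s_i = torch.softmax(scores, dim=-1)                       # attention weights: each row sums to 1
a_i = s_i @ v_i                                           # head output A_i, shape (s, h_head)
print(s_i.shape, a_i.shape)                               # torch.Size([5, 5]) torch.Size([5, 64])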

In multi-head attention, this process is repeated for several heads (each with different weight matrices). This allows the model to focus on different parts of the sequence. The output matrix A_i from each head is concatenated to get the final output A_atten = [A_1; A_2; …; A_n] of multi-head attention. After concatenation, the shape of A_atten is (s, n × h_head) = (s, h).
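
A compact sketch of the full multi-head computation; the additional output projection that real implementations apply after concatenation is omitted, since it is not covered above.

import math
import torch
import torch.nn as nn

s, h, n_heads = 5, 768, 12
h_head = h // n_heads
x_embed = torch.randn(s, h)  # stand-in input embeddings

heads = []
for _ in range(n_heads):
    # Each head has its own projection matrices (bias omitted for brevity).
    w_q, w_k, w_v = (nn.Linear(h, h_head, bias=False) for _ in range(3))
    q, k, v = w_q(x_embed), w_k(x_embed), w_v(x_embed)
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(h_head), dim=-1)
    heads.append(weights @ v)           # A_i of shape (s, h_head)

a_atten = torch.cat(heads, dim=-1)      # [A_1; ...; A_n], shape (s, n_heads * h_head) = (s, h)
print(a_atten.shape)                    # torch.Size([5, 768])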

b. Add & Norm

The output of the multi-head attention A_atten is added to the original input X_embed. This residual connection helps mitigate the vanishing gradient problem by allowing gradients to flow directly through the network, making it easier to train deep networks. The residual connection also helps preserve the original information from the input. After adding the residual connection, the result is passed through a layer normalization step. Layer normalization stabilizes training by normalizing the output of the Add step to have a mean of 0 and a standard deviation of 1 across the hidden dimension. The output of this layer is denoted by X_norm1 and its shape is (s, h).

X_norm1 = LayerNorm(X_embed + A_atten)
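
A minimal sketch of the Add & Norm step, using random stand-ins for X_embed and A_atten:

import torch
import torch.nn as nn

s, h = 5, 768
x_embed = torch.randn(s, h)   # input to the encoder block
a_atten = torch.randn(s, h)   # stand-in for the multi-head attention output

layer_norm = nn.LayerNorm(h)             # normalizes across the hidden dimension
x_norm1 = layer_norm(x_embed + a_atten)  # residual connection followed by LayerNorm
print(x_norm1.shape)                     # torch.Size([5, 768])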

c. Feed Forward Layer

Both BERT and RoBERTa have two feed-forward layers. The first feed-forward layer projects X_norm1 from size (s, h) to (s, h_ff), where h_ff is greater than h. The second feed-forward layer projects the output of the first layer back to (s, h). The output of the feed-forward layers is denoted as O_ff and is passed into the second Add & Norm part of the encoder block. The output of the second Add & Norm is denoted by X_norm2 and its shape is (s, h).

X_norm2 = LayerNorm(X_norm1 + O_ff)
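
A sketch of the feed-forward layers and the second Add & Norm, using the BERT-base sizes h = 768 and h_ff = 3072 and the GELU activation that BERT and RoBERTa use between the two layers:

import torch
import torch.nn as nn

s, h, h_ff = 5, 768, 3072   # BERT-base uses h_ff = 4 * h
x_norm1 = torch.randn(s, h)  # stand-in for the first Add & Norm output

feed_forward = nn.Sequential(
    nn.Linear(h, h_ff),   # project up to the larger intermediate size
    nn.GELU(),            # GELU activation
    nn.Linear(h_ff, h),   # project back down to the model dimension
)

o_ff = feed_forward(x_norm1)
x_norm2 = nn.LayerNorm(h)(x_norm1 + o_ff)  # second Add & Norm
print(x_norm2.shape)                       # torch.Size([5, 768])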

The output of the encoder block, X_norm2, is passed to the next encoder block, and the X_norm2 from the final encoder block can be passed to a feed-forward classifier layer for prediction.
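
In practice, these final hidden states can be obtained directly from a pretrained checkpoint and fed into a task-specific head. A minimal sketch, assuming bert-base-uncased and a hypothetical, untrained 3-way sentiment classifier on the [CLS] representation:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any BERT/RoBERTa encoder works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pair the dumplings with spicy ramen.", return_tensors="pt")
hidden_states = encoder(**inputs).last_hidden_state  # (batch, s, h): output of the final encoder block

# Hypothetical classification head on top of the [CLS] token (position 0).
classifier = nn.Linear(encoder.config.hidden_size, 3)
logits = classifier(hidden_states[:, 0])
print(logits.shape)  # torch.Size([1, 3])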

BERT vs. RoBERTa

BERT is trained on BookCorpus[4] and English Wikipedia[5], while RoBERTa is trained on a larger dataset, including BookCorpus, English Wikipedia, CC-News[6], OpenWebText[7], and Stories. BERT is trained using the Masked Language Model and Next Sentence Prediction tasks, but RoBERTa removed the Next Sentence Prediction task because not using it did not degrade the model’s performance. In BERT training, static masking is used (the same tokens are masked in every epoch), while RoBERTa uses dynamic masking (different tokens are masked across epochs). BERT is trained for fewer steps with smaller batch sizes, whereas RoBERTa is trained for longer with larger batch sizes and higher learning rates.
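
As a side note on masking, dynamic masking is what the transformers data collator for masked language modeling provides: it re-samples the masked positions every time a batch is built. A minimal sketch, assuming the roberta-base tokenizer:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # assumed checkpoint

# Masks 15% of tokens, re-sampled on every call -- i.e. dynamic masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("Pair the dumplings with spicy ramen.", return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])
print(batch["input_ids"])  # different tokens are replaced by <mask> on different calls
print(batch["labels"])     # -100 everywhere except at the masked positions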

References

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[2] RoBERTa: A Robustly Optimized BERT Pretraining Approach

[3] Fast WordPiece Tokenization

[4] Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

[5] Wikipedia

[6] Common Crawl News Dataset (CC-News)

[7] OpenWebText Dataset