To BERT, with RoBERTa: Part II
August 29, 2024

Illustration generated using GPT-4o
Introduction
In this blog, we will look at fine-tuning BERT [1] and RoBERTa [2] for question-answering on the UQA Corpus [3], introduced in my paper UQA: A Corpus for Urdu Question Answering [4]. Closed-ended question-answering can be treated as a classification task, where the model predicts the start token and the end token of the answer in the given context. Using the tokenized context in Figure 1, the answer to Q: "What are we about to do with the model?" is A: "Fine-tune". The task of the encoder-only model is therefore to predict start_token = 2 and end_token = 4.
Figure 1: Tokenized context for closed-ended question-answering
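As a toy sketch of what the model has to predict (the tokens below are illustrative, not the output of a real subword tokenizer), picking start_token = 2 and end_token = 4 and joining that slice recovers the answer span:
# Toy illustration of extractive QA: the model predicts a start and an end
# index over the tokenized context, and the answer is the span between them.
# (Hypothetical tokens; a real tokenizer produces model-specific subwords.)
tokens = ["we", "will", "fine", "-", "tune", "the", "model", "soon"]

start_token, end_token = 2, 4  # what the encoder-only model is trained to predict
answer = " ".join(tokens[start_token:end_token + 1])
print(answer)  # fine - tune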
Fine-tuning
1. Setup Environment
We will use the transformers [5] and accelerate [6] libraries for fine-tuning, and the datasets [7] library to load the UQA dataset from Hugging Face [8].
pip install transformers
pip install accelerate
pip install datasets
2. Setup Parameters
Since we are training for Urdu, choose a multilingual model from the BERT or RoBERTa family; within each family, you can also choose the size of the model.
MODEL = "xlm-roberta-base" # or "bert-base-multilingual-cased"
SAVE_DIR = "xlm-roberta-base-uqa"
LEARNING_RATE = 2e-5
EPOCHS = 6
TRAIN_BATCH_SIZE = 8
EVAL_BATCH_SIZE = 8
3. Load and Filter the Dataset
The UQA corpus has two types of questions: (1) answerable questions, for which a clear, definite answer can be extracted directly from the provided context, and (2) unanswerable questions, for which the answer cannot be found in the provided context even though they look similar to answerable ones. We will fine-tune our model only on the answerable questions, so we need to filter out the unanswerable ones.
def filter_function(example):
    return not example['is_impossible']

from datasets import load_dataset
raw_datasets = load_dataset("UQA")
raw_datasets["train"] = raw_datasets["train"].filter(filter_function)
raw_datasets["validation"] = raw_datasets["validation"].filter(filter_function)
4. Load the Model and Tokenizer
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)
5. Data Pre-processing
Both BERT and RoBERTa can process inputs of at most 512 tokens. However, as the number of input tokens increases, the VRAM (GPU memory) required to train the model also increases. Therefore, depending on the dataset, we can check whether the required maximum sequence length is less than 512 using the code below.
max_seq_length = 0
for item in raw_datasets["train"]:
    tokens = tokenizer.encode(item["question"] + "\n\n" + item["context"])
    if len(tokens) > max_seq_length:
        max_seq_length = len(tokens)
for item in raw_datasets["validation"]:
    tokens = tokenizer.encode(item["question"] + "\n\n" + item["context"])
    if len(tokens) > max_seq_length:
        max_seq_length = len(tokens)
if max_seq_length > 512:
    max_seq_length = 512
stride = 128
The preprocessing function tokenizes the inputs with specific parameters. First, it applies tokenization to the question and context pairs, with the max_length parameter defining the maximum sequence length for the tokenized output. If the combined length of the question and context exceeds this limit, the truncation="only_second" option ensures that only the context is truncated, not the question. The stride parameter allows for the creation of overlapping chunks when the context is too long to fit within the max_length, which is particularly useful for long passages. Additionally, the function is configured with return_overflowing_tokens=True, which ensures that when the context is split into multiple chunks due to its length, these extra chunks (overflowing tokens) are returned instead of being discarded. The return_offsets_mapping=True option provides a mapping between token positions and their corresponding characters in the original context. Finally, the padding="max_length" parameter ensures that all sequences in a batch are padded to the same length.
After tokenization, we extract the offset_mapping, which links token positions to their corresponding characters in the original context, and the overflow_to_sample_mapping, which connects each tokenized chunk back to its original sample. We then align the answer's character positions with the corresponding token positions by identifying the boundaries of the context within each tokenized sequence. For each chunk, the loop checks whether the answer lies within it and, if so, calculates the start and end token positions from the character positions of the answer; if the answer is not fully contained in the chunk, both positions are set to 0. These start and end positions are then added to the inputs dictionary, along with the tokenized sequences.
def preprocess_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_seq_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answer"]
    answer_starts = examples["answer_start"]
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer_starts[sample_idx]
        end_char = answer_starts[sample_idx] + len(answer)
        sequence_ids = inputs.sequence_ids(i)
        # Find the start and end of the context within the tokenized sequence.
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # If the answer is not fully inside this chunk, label it (0, 0).
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise, map the answer's character span to token positions.
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

train_dataset = raw_datasets["train"].map(
    preprocess_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)validation_dataset = raw_datasets["validation"].map(
preprocess_examples,
batched=True,
remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)6. Training Arguments
Gradient accumulation is a technique that simulates a larger batch size by accumulating gradients from multiple small batches before performing a weight update. It is helpful when the available memory is limited and the batch size that fits in memory is small, so you can adjust gradient_accumulation_steps accordingly (see the sketch below). The data collator handles the task of preparing batches of data before they are fed into the model during training or evaluation.
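To make that concrete: with per_device_train_batch_size = 8 and gradient_accumulation_steps = 8, the weights are updated once per 64 examples. The Trainer handles this internally, but a minimal hand-rolled PyTorch sketch of the same idea looks like this (model, optimizer, and train_dataloader are placeholders, not objects defined in this post):
# Manual gradient accumulation: gradients from accum_steps small batches are
# accumulated before a single optimizer step, emulating a larger batch size.
accum_steps = 8
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):  # placeholder dataloader
    loss = model(**batch).loss / accum_steps     # scale so accumulated gradients average out
    loss.backward()                              # gradients add up in the .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one weight update every accum_steps batches
        optimizer.zero_grad()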
from transformers import default_data_collator, TrainingArguments, Trainer
args = TrainingArguments(
    output_dir=SAVE_DIR,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    gradient_accumulation_steps=8,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=True
)
data_collator = default_data_collator

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer
)
7. Start Training
trainer.train()
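Per-epoch checkpoints are written under SAVE_DIR, and push_to_hub=True uploads them to the Hub. If you also want the final weights saved explicitly in one place for the inference section below, an optional extra step (using the standard Trainer and tokenizer saving methods) is:
# Optionally persist the final model and tokenizer for local inference.
trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)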
Inference
1. Setup Environment
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
2. Load the Model and Tokenizer
tokenizer = AutoTokenizer.from_pretrained("SAVE_DIR\checkpoint-XYZ") # Replace XYZ with the best checkpoint from training
model = AutoModelForQuestionAnswering.from_pretrained("SAVE_DIR\checkpoint-XYZ")3. Run Inference
context = "ہم اپنے فائن ٹیونڈ سوال و جواب ماڈل پر جلد ہی انفریئنس چلانے والے ہیں!"
question = "ہم کیا کرنے والے ہیں؟"
inputs = tokenizer(question, context, truncation="only_second", padding="max_length", max_length=512, return_tensors="pt").to(DEVICE)
with torch.no_grad():
    output = model(**inputs)
    start_logits, end_logits = output.start_logits, output.end_logits
start_idx = torch.argmax(start_logits[0])
end_idx = torch.argmax(end_logits[0])
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1])

print(f"Context: {context}")
print(f"Question: {question}")
print(f"Answer: {answer}")
Context: ہم اپنے فائن ٹیونڈ سوال و جواب ماڈل پر جلد ہی انفریئنس چلانے والے ہیں!
Question: ہم کیا کرنے والے ہیں؟
Answer: انفریئنس
(In English: the context says "We are about to run inference on our fine-tuned question-answering model soon!", the question asks "What are we about to do?", and the predicted answer is "inference".)
References
[1] BERT
[2] RoBERTa
[3] UQA Corpus
[4] UQA: A Corpus for Urdu Question Answering
[5] Transformers Library
[6] Accelerate Library
[7] Datasets Library
[8] Hugging Face

