Fine-Tuning BERT for 90%+ Accuracy in Text Classification

What are Pretrained Language Models?

Pretrained Language Models (PLMs) are deep learning models trained on large corpora of text to learn the structure and nuances of natural language. These models serve as a foundation for a wide range of Natural Language Processing (NLP) tasks, including fine-tuning BERT for text classification, and deliver significantly better performance than training from scratch.

Evolution of Language Models (From Word2Vec to Transformers)

The journey of NLP models started with word embedding techniques like Word2Vec and GloVe, which captured semantic relationships between words. Later, RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) introduced the ability to process sequences. However, they suffered from long-range dependency issues. The breakthrough came with Transformers, introduced in the paper Attention Is All You Need, enabling parallel processing and better context understanding.

Why BERT?

BERT (Bidirectional Encoder Representations from Transformers) marked a paradigm shift by using a bidirectional approach rather than traditional left-to-right or right-to-left models. It allowed the model to understand the full context of a word by looking at both preceding and succeeding words in a sentence.

2. Understanding BERT

BERT is based on the Transformer architecture, which relies on self-attention mechanisms and positional encoding instead of sequential processing. It consists of multiple layers of self-attention and feedforward neural networks.

How BERT Differs from Traditional Models

Unlike earlier NLP models, which used unidirectional approaches, BERT is deeply bidirectional. It enables better understanding of word meanings based on complete sentence context.

Pretraining Objectives: MLM & NSP

BERT is pretrained using two tasks:

  1. Masked Language Modeling (MLM): Some words in a sentence are masked, and the model learns to predict them (a short demonstration follows this list).
  2. Next Sentence Prediction (NSP): The model learns sentence relationships by predicting if one sentence follows another in a given text.
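
To get a feel for MLM, you can ask a pretrained BERT to fill in a masked token through the transformers fill-mask pipeline once the library is installed (see the setup section below). This is a minimal sketch, and the example sentence is purely illustrative:

from transformers import pipeline

# The fill-mask pipeline exercises the MLM objective: BERT predicts the [MASK] token
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))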

3. Variants of BERT

1. DistilBERT

  • A distilled version of BERT with roughly 40% fewer parameters that retains most of BERT's performance while running significantly faster.

2. RoBERTa

  • Enhances BERT by removing NSP and training on larger batches and longer sequences.

3. ALBERT

  • Reduces memory consumption and increases efficiency using parameter-sharing and sentence-order prediction (SOP) instead of NSP.

4. Other Variants

  • Models like ELECTRA and DeBERTa refine BERT’s pretraining objectives and attention mechanisms, while encoder-decoder and decoder-only models such as T5 and the GPT series extend Transformer pretraining in other directions.

4. Fine-tuning BERT for NLP Tasks

Why Fine-tuning is Necessary

Pretrained BERT models capture general linguistic knowledge, but they require fine-tuning on specific tasks like text classification, named entity recognition, and question answering to optimize performance.

Common NLP Tasks

  • Sentiment Analysis: Predicts sentiment (positive/negative/neutral).
  • Text Classification: Assigns categories to text.
  • Question Answering (QA): Extracts relevant answers from text.
  • Named Entity Recognition (NER): Identifies entities like names, locations, and dates.

Understanding Input & Output Tokens

BERT requires input in the form of tokenized sequences. Each input is split into WordPiece subword tokens, with special tokens such as the following (a short demonstration appears after this list):

  • [CLS]: a classification token added at the start of every sequence; its final hidden state is often used to represent the whole input
  • [SEP]: a separator token placed between sentence pairs and at the end of the sequence
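
A quick way to see these special tokens is to tokenize a sentence pair and convert the resulting ids back into tokens. This standalone snippet assumes the transformers library is installed (see the setup section below) and uses bert-base-uncased purely for illustration:

from transformers import AutoTokenizer

# Tokenize a sentence pair and inspect the WordPiece tokens, including [CLS] and [SEP]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("Hello world", "How are you?")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'hello', 'world', '[SEP]', 'how', 'are', 'you', '?', '[SEP]']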

5. Step-by-Step Guide to Fine-tuning BERT

Setting up the Environment

Before diving into code, you’ll need to install the necessary libraries:

# Install required libraries
!pip install transformers tensorflow datasets

Loading BERT Model and Tokenizer

First, we need to import the necessary classes from the Transformers library to load BERT:

from transformers import TFAutoModel, AutoTokenizer
import tensorflow as tf

# Load the pre-trained BERT model
model = TFAutoModel.from_pretrained("bert-base-uncased")

# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Understanding BERT Model Types

BERT is available in two main sizes:

  • BERT Base: 12 Transformer encoder layers, roughly 110M parameters (used in this tutorial)
  • BERT Large: 24 Transformer encoder layers, roughly 340M parameters

The “uncased” suffix indicates that this model doesn’t distinguish between uppercase and lowercase letters.
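
As a quick check of that behaviour, the tokenizer loaded above lowercases text before splitting it, so differently cased inputs yield identical tokens:

# The uncased tokenizer lowercases text, so both calls produce the same tokens
print(tokenizer.tokenize("Hello World"))  # ['hello', 'world']
print(tokenizer.tokenize("hello world"))  # ['hello', 'world']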

Tokenizing Text for BERT

Before feeding text to BERT, we need to tokenize it:

# Example of tokenizing text
inputs = tokenizer("Hello world", return_tensors="tf")
print(inputs)

The tokenizer output includes:

  • input_ids: Token IDs corresponding to words
  • token_type_ids: Used for sentence pair tasks
  • attention_mask: Indicates which tokens are actual words vs. padding

When processing multiple sentences of different lengths, we need to include padding:

# Tokenizing multiple sentences with padding
inputs = tokenizer(["Hello world", "How are you?"], padding=True, return_tensors="tf")
print(inputs)

For BERT, we should also add truncation to limit text to its maximum context size of 512 tokens:

# Complete tokenization with all necessary parameters
inputs = tokenizer(
    ["Hello world", "How are you?"],
    padding=True,
    truncation=True,
    return_tensors="tf"
)

Understanding BERT Outputs

Let’s see what BERT returns when we pass the tokenized input:

outputs = model(inputs)
print(outputs)

BERT outputs two tensors:

  1. Last hidden state: Contextualized representations for every token in the input (shape: [batch_size, sequence_length, hidden_size])
  2. Pooler output: The [CLS] token’s representation passed through a dense layer with a tanh activation, giving one vector per sequence (shape: [batch_size, hidden_size])

For text classification, we’ll use the pooler output since it provides a condensed representation of the entire sentence.
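
A quick way to confirm these shapes, using the two-sentence batch tokenized above (the hidden size of bert-base-uncased is 768):

# Inspect the shapes of the two outputs for the batch of two sentences
print(outputs.last_hidden_state.shape)  # (2, sequence_length, 768)
print(outputs.pooler_output.shape)      # (2, 768)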

Loading the Dataset

For this tutorial, we’ll use the “emotion” dataset from Hugging Face:

from datasets import load_dataset

# Load the emotions dataset
emotions = load_dataset("emotion")  # hosted on the Hub as dair-ai/emotion
print(emotions)

The emotion dataset contains:

  • 16,000 training samples
  • 2,000 validation samples
  • 2,000 test samples

Each sample is a tweet labeled with one of six emotions: joy, sadness, anger, fear, love, and surprise.
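
The labels themselves are stored as integer ids. Assuming the Hub version of the dataset (which exposes them through a ClassLabel feature), you can recover the human-readable names as follows; label_names is reused in the inference sketch near the end of this guide:

# Map integer label ids back to emotion names
label_names = emotions["train"].features["label"].names
print(label_names)  # ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']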

Tokenizing the Dataset

We need to tokenize all text in the dataset:

def tokenize_function(examples):
    # Pad every example to the model's maximum length so all sequences have the same shape;
    # datasets stores plain lists here, so we convert to tensors later
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
    )

# Map the tokenize function to all examples in the dataset
emotions_encoded = emotions.map(tokenize_function, batched=True)

Converting to TensorFlow Dataset Format

Next, we need to convert our dataset to TensorFlow’s dataset format:

# Convert to TensorFlow dataset
def convert_to_tf_dataset(data_split):
    return tf.data.Dataset.from_tensor_slices((
        {
            "input_ids": data_split["input_ids"],
            "attention_mask": data_split["attention_mask"],
            "token_type_ids": data_split["token_type_ids"],
        },
        data_split["label"]
    )).batch(16)

# Create TensorFlow datasets for training, validation, and testing
train_dataset = convert_to_tf_dataset(emotions_encoded["train"])
val_dataset = convert_to_tf_dataset(emotions_encoded["validation"])
test_dataset = convert_to_tf_dataset(emotions_encoded["test"])
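
As an aside, recent versions of the datasets library can build an equivalent batched tf.data.Dataset in one call with to_tf_dataset; a minimal sketch under that assumption:

# Alternative: let the datasets library build the tf.data pipeline directly
train_dataset_alt = emotions_encoded["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "token_type_ids"],
    label_cols=["label"],
    shuffle=True,
    batch_size=16,
)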

Building the Classification Model

Now, we’ll create a classification model by extending BERT with a classification layer:

class BertForClassification(tf.keras.Model):
    def __init__(self, bert_model, num_classes):
        super().__init__()
        self.bert = bert_model
        self.classifier = tf.keras.layers.Dense(num_classes, activation='softmax')
        
    def call(self, inputs):
        # Run BERT and take the pooler output (the pooled [CLS] representation)
        outputs = self.bert(inputs)
        pooled_output = outputs.pooler_output
        
        # Pass the pooled output through the classifier
        return self.classifier(pooled_output)

# Create an instance of our classification model
classification_model = BertForClassification(model, num_classes=6)
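
Before training, a quick sanity check on a single example confirms that the model is wired correctly and returns one probability per class; a small sketch using the tokenizer loaded earlier (the sentence is just illustrative):

# Run one untrained forward pass: expect a (1, 6) tensor of class probabilities
sample = tokenizer(["I am so happy today"], padding=True, truncation=True, return_tensors="tf")
print(classification_model(dict(sample)).shape)  # (1, 6)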

Compiling and Training the Model

Now we’ll compile and train our model:

# Compile the model
classification_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train the model
history = classification_model.fit(
    train_dataset,
    epochs=3,
    validation_data=val_dataset
)

Important note: When fine-tuning pre-trained models, it’s crucial to use a low learning rate (like 3e-5) to avoid overwriting the valuable pre-trained weights.
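
If you prefer a warmup-and-decay schedule over a fixed rate, the transformers library provides a create_optimizer helper for TensorFlow; a hedged sketch, with step counts that are only illustrative (16,000 samples / batch size 16 ≈ 1,000 steps per epoch):

from transformers import create_optimizer

# Linear warmup over the first 10% of steps, then linear decay to zero
num_train_steps = 1000 * 3  # steps per epoch * number of epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=3e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=int(0.1 * num_train_steps),
)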

After training, we can evaluate our model on the test dataset:

# Evaluate the model on the test dataset
test_results = classification_model.evaluate(test_dataset)
print(f"Test accuracy: {test_results[1]:.4f}")

The fine-tuned BERT model achieves approximately 93% accuracy on the test dataset after just three epochs of training!
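
To run the fine-tuned model on new text, tokenize the input exactly as during training and take the argmax over the six class probabilities; a minimal sketch that reuses the label_names list from the dataset section:

import numpy as np

def predict_emotion(text):
    # Tokenize the input, run the classifier, and map the top class id to its name
    encoded = tokenizer([text], padding=True, truncation=True, return_tensors="tf")
    probs = classification_model(dict(encoded))
    return label_names[int(np.argmax(probs, axis=-1)[0])]

print(predict_emotion("I can't believe how wonderful this day turned out!"))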

Practical Applications

Fine-tuned BERT models can be used for multiple NLP tasks apart from text classification:

  • Question answering
  • Named entity recognition
  • Sentiment analysis
  • Text summarization and machine translation (typically via encoder-decoder models that build on BERT-style encoders)

Conclusion

Fine-tuning BERT for text classification offers a powerful way to achieve high-accuracy results with minimal effort. By leveraging the pre-trained knowledge in BERT, we can create highly effective NLP models without training from scratch.

The process involves:

  1. Loading the pre-trained BERT model and tokenizer
  2. Preparing and tokenizing your dataset
  3. Creating a classification model by adding a dense layer on top of BERT
  4. Fine-tuning with a low learning rate
  5. Evaluating on a test dataset

With just three epochs of training, our model achieved roughly 93% accuracy on the emotion classification task, demonstrating the power of transfer learning in NLP.

Author

  • Naveen Pandey, Data Scientist and Machine Learning Engineer

    Naveen Pandey has more than 2 years of experience in data science and machine learning. He is an experienced Machine Learning Engineer with a strong background in data analysis, natural language processing, and machine learning. Holding a Bachelor of Science in Information Technology from Sikkim Manipal University, he excels in leveraging cutting-edge technologies such as Large Language Models (LLMs), TensorFlow, PyTorch, and Hugging Face to develop innovative solutions.
