
Paper Reading #1 – BERT

by | Machine Learning, Research

Part of my blog will cover the latest and most significant developments in machine learning research. My aim in these paper readings is to extract the key points, so you can easily digest what the research contributes. I will go into some detail on the methods so that you get familiar with some advanced machine learning concepts. But these posts will be summaries, not paper rewrites. The best way to get the most out of a paper is to read it and read the referenced papers. Hopefully, this post will spark your curiosity about state-of-the-art machine learning and get you reading if you have not done so already. If you want to improve your knowledge of machine learning and linear algebra concepts to get more out of paper readings, you can visit my blog post which provides the best books for machine learning. Now, on to BERT!

BERT stands as an essential milestone in the progress of NLP and machine learning in general. The research paper written by Google AI Language remains one of the most cited and used pieces of research in recent times. Due to the success of the model, different flavors of BERT have been developed to add further strides to the innovation and state-of-the-art results observed.


  • BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art, pre-trained, and unsupervised model that outperforms existing high-performance models on various NLP tasks.
  • Introduces the use of bidirectional training of the Transformer model, which uses attention for language modeling.
  • Bidirectionality extracts more in-depth contextual information from natural language sequences compared to unidirectionality.
  • Introduces a novel training method, which uses masking tokens. This method allows bidirectional unsupervised training of sequential problems to be possible.
  • The bigger the model is, the better it can perform, even on small-scale tasks. However, there is a trade-off with time to convergence.

Why Is BERT Important?

NLP tasks such as Question-Answering and common sense reasoning are crucial for automated applications such as online support, troubleshooting, and machine translation. Neural language modeling is used to capture information on the token and sentence level.

Using a pre-trained language model with minimal task-specific parameters means that a model can be adapted to diverse downstream tasks easily by fine-tuning all of the parameters. This has been demonstrated by the OpenAI GPT model.

However, unidirectionality limits fine-tuning approaches. The BERT authors argue that unidirectional models have restricted context extraction because the contextual relationships between tokens in a sequence are captured in only one direction. This training approach leads to sub-optimal performance on downstream tasks.

What Does BERT Do?

BERT introduces bidirectional learning by using the masked language model learning objective for pretraining. The pretraining architecture inherits from the Transformer. A Transformer is a sequence-to-sequence architecture consisting of an Encoder and a Decoder. The Encoder is responsible for mapping an input sequence to an n-dimensional abstract vector. The Decoder takes the vector and transforms it into an output sequence, where the output can be, for example, in another language or a copy of the input.

The critical aspect of a Transformer is attention. In simple terms, attention is a measure of the importance and interdependence of words in a sequence, used to predict the next step in the sequence. Self-attention uses the attention scores of every word in the sequence: those before the current word, those after it, and the current word itself. With self-attention, attention modules receive a segment of words and learn the dependencies between all words at once using three trainable weight matrices (Query, Key, and Value) that form an attention head. Multi-headed attention is a concatenation of the outputs of different attention heads with differently initialized weight matrices. The multi-head approach captures a range of different relationships between words and latent structures across word sequences. An encoder block consists of a multi-head attention stack and a feed-forward layer. BERT’s goal is to generate a language model, therefore only the encoder mechanism from the Transformer is necessary. BERT is effectively a stack of Transformer encoder blocks. There are two model configurations considered in the paper:

  • BERT base – 12 layers, 12 attention heads, and 110 million parameters.
  • BERT large – 24 layers, 16 attention heads, and 340 million parameters.
The attention computation. Source.
Multi-head attention. The attention score from each head is concatenated and put through a final dense layer. Source.
Encoder block of the Transformer architecture. Source.

For a full description of the Transformer and attention, see the “Attention Is All You Need” paper here.
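To make the attention computation a little more concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration, not the paper's implementation: in a real Transformer the queries, keys, and values come from multiplying the input by the trainable Query, Key, and Value matrices, whereas here I pass the same toy vectors in directly. The function names and the numbers are my own.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    queries, keys: lists of vectors with d_k entries each.
    values: one vector per key.
    Returns the attended vectors and the attention weights.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(attended)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: three tokens with 2-dimensional representations (made-up numbers).
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = scaled_dot_product_attention(toy_tokens, toy_tokens, toy_tokens)
```

Each row of `weights` says how much one token attends to every token in the sequence, including itself and the tokens to its right, which is what makes the encoder bidirectional. Multi-head attention would run several such heads and concatenate their outputs.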

BERT provides pre-trained representations that can be fine-tuned for a wide range of sentence-level and token-level tasks. This means BERT can be used as an input to a simpler model that is handling a significantly smaller dataset than what was used to build the original BERT model, and achieve state-of-the-art performance. This is similar to the transfer learning research in computer vision with deep convolutional neural networks and ImageNet.

How Do They Do It?


BERT uses a summation of three embeddings as an input representation. This input allows single sentences or a pair of sentences to be represented as one token sequence. A token embedding is used to map the natural language sequence to a numerical representation. A segment embedding is used to indicate whether a token belongs to sentence A or B. A positional embedding is used to tell the model where the token is in the entire sequence. The sequences are decorated with special tokens:

  • The Classification token [CLS] is placed at the start of every sequence. Its final hidden state is used as the representation of the whole sequence for classification tasks.
  • The Separation token [SEP] marks the boundary between sentence A and sentence B. Which sentence a token belongs to is indicated by the segment embedding.
BERT Input representation
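As a rough illustration of the input representation, the sketch below builds the token sequence for a sentence pair and sums the three embeddings for each position. Everything concrete here is an assumption for the sake of the example: the tiny whole-word vocabulary, the hidden size of 4, and the random tables standing in for learned embeddings.

```python
import random

random.seed(0)

HIDDEN = 4    # toy hidden size (hypothetical, far smaller than real BERT)
MAX_LEN = 16  # toy maximum sequence length (hypothetical)
VOCAB = ["[PAD]", "[CLS]", "[SEP]", "my", "dog", "is", "cute", "he", "likes", "playing"]


def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(len(VOCAB), HIDDEN)
segment_table = random_table(2, HIDDEN)  # row 0 = sentence A, row 1 = sentence B
position_table = random_table(MAX_LEN, HIDDEN)


def build_input(sentence_a, sentence_b):
    """Return tokens, segment ids, and the summed input vectors for a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # [CLS], sentence A, and the first [SEP] are segment 0; the rest is segment 1.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        tok = token_table[VOCAB.index(token)]
        seg = segment_table[segment]
        pos = position_table[position]
        # The input representation is the element-wise sum of the three embeddings.
        vectors.append([t + s + p for t, s, p in zip(tok, seg, pos)])
    return tokens, segment_ids, vectors


tokens, segment_ids, vectors = build_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "playing"]
)
```

Printing `tokens` and `segment_ids` side by side shows how [CLS] and [SEP] decorate the pair and where the segment switches from A to B.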

The Issue of Sequential Training

The use of a Transformer allows an input sequence to be encoded at once, removing the sequential dependencies that occur with recurrent neural networks. Attention is used to extract global dependencies from a sequence. However, when training language models using Transformers, the prediction goal is sequential, e.g., predicting the next word in a sequence. There is still an inherent directionality in the learning process. BERT overcomes this by implementing two strategies: Masked LM and Next Sentence Prediction.

Masked LM

Prior to training, 15% of the tokens in each sequence are selected for prediction. The model has to predict the original token at each selected position based on the surrounding tokens in the sequence. Because the mask token does not appear in the fine-tuning tasks, the selected tokens are not always masked: each one is (a) replaced with the mask token 80% of the time, (b) replaced with a random token 10% of the time, or (c) left as the original token 10% of the time. This technique makes the model less likely to produce poor representations for non-masked words. However, no ablation was performed to determine the optimal ratio of masked to random to non-masked tokens.
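The 80/10/10 rule is easier to follow in code. Below is a minimal sketch under a simplifying assumption: each token is selected independently with probability 15%, rather than choosing a fixed number of positions per sequence. The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Apply the 80/10/10 rule; return the corrupted tokens and the prediction targets."""
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # this token is not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets


# Usage with a seeded generator so the run is repeatable.
example_tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]"]
example_vocab = ["my", "dog", "is", "cute", "he", "likes", "playing"]
corrupted, targets = mask_tokens(example_tokens, example_vocab, rng=random.Random(0))
```

The loss is only computed at the positions stored in `targets`, which is why the model gets a prediction signal from only a fraction of each sequence.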

The prediction of the masked word requires:

  • Adding a classification layer on top of the output from the encoder.
  • Transforming the output vectors to the vocabulary dimension by multiplying them by the embedding matrix.
  • Assigning a probability to each word in the vocabulary using softmax.
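The last two steps of that list can be sketched in a few lines. This is a simplified illustration: it skips the extra classification layer and goes straight from an encoder output vector to vocabulary probabilities by multiplying with the embedding matrix and applying softmax. The three-word vocabulary, the embedding values, and the hidden vector are made up.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_masked_token(hidden_vector, embedding_matrix, vocab):
    """Score every vocabulary word for one masked position."""
    # One logit per word: dot product of the hidden vector with that word's embedding.
    logits = [
        sum(h * e for h, e in zip(hidden_vector, row)) for row in embedding_matrix
    ]
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs


# Toy example with a 2-dimensional hidden state (made-up numbers).
toy_vocab = ["dog", "cat", "car"]
toy_embeddings = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.5]]
word, probs = predict_masked_token([0.9, 0.1], toy_embeddings, toy_vocab)
```

Reusing the input embedding matrix for the output projection means the model does not need a separate vocabulary-sized weight matrix for prediction.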

Next Sentence Prediction

The model observes a pair of sentences and learns to predict whether the second sentence follows the first in the original text or is a random sentence from the corpus. 50% of the inputs consist of pairs where the second sentence is the actual following sentence. The other 50% have a random sentence as the second sentence. The output for the previously mentioned [CLS] token is fed into a classification layer with softmax to calculate the probability that the second sentence is the subsequent sentence. The sum of the Masked LM and Next Sentence Prediction loss functions is minimized during training.
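Here is a minimal sketch of how such training pairs could be generated. It assumes the corpus is already split into documents and sentences, and it draws the random sentence from anywhere in the corpus except the true next sentence. The function name `make_nsp_pairs` and the toy corpus are hypothetical.

```python
import random


def make_nsp_pairs(documents, rng=random):
    """Build (sentence_a, sentence_b, is_next) training examples.

    documents: list of documents, each a list of sentences in order.
    """
    all_sentences = [s for doc in documents for s in doc]
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that actually follows.
                examples.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: any sentence except the true next one.
                candidates = [s for s in all_sentences if s != doc[i + 1]]
                examples.append((doc[i], rng.choice(candidates), False))
    return examples


# Usage on a toy corpus of two short documents.
toy_corpus = [
    ["I adopted a dog.", "He is very playful.", "We walk every morning."],
    ["The market opened higher.", "Tech stocks led the gains."],
]
pairs = make_nsp_pairs(toy_corpus, rng=random.Random(0))
```

Each pair would then be packed into a single input as [CLS] A [SEP] B [SEP], with the boolean label used as the target for the [CLS] classifier.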

Fine-tuning BERT

The adaptability of BERT is one of its strengths. By making small changes to the model, it can be used for a wide range of NLP tasks. The model can be fine-tuned for classification, using the output of the [CLS] token; for question-answering, by learning two extra vectors that mark the beginning and end of an answer; and for Named Entity Recognition, where the output of the BERT model is fed into a classification layer to predict entity labels in a sequence of text. Fine-tuning with BERT requires a very small number of additional parameters. Therefore, task-specific models can benefit from the large and expressive pre-trained representations even when the task data is very small.
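To show how little is added for question-answering, here is a sketch of the span selection step. A start vector and an end vector are the only new parameters: each token's output is scored against both, and the answer is the span with the highest combined score whose end does not come before its start. The token vectors and the two learned vectors below are made-up numbers.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_answer_span(token_vectors, start_vector, end_vector):
    """Return (start_index, end_index) of the highest-scoring answer span."""
    start_scores = [dot(t, start_vector) for t in token_vectors]
    end_scores = [dot(t, end_vector) for t in token_vectors]
    best = None
    for i, start_score in enumerate(start_scores):
        # Only consider spans that end at or after their start.
        for j in range(i, len(end_scores)):
            score = start_score + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]


# Toy example: four passage tokens with 2-dimensional outputs (made-up numbers).
toy_outputs = [[0.1, 0.0], [0.9, 0.1], [0.2, 0.3], [0.1, 0.8]]
span = best_answer_span(toy_outputs, start_vector=[1.0, 0.0], end_vector=[0.0, 1.0])
```

In this toy case the answer runs from token 1 to token 3. During fine-tuning, the start and end scores would go through a softmax over all positions and be trained against the true answer boundaries.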

Extracting Features from BERT

One of the exciting aspects of BERT is the ability to extract fixed features from the pre-trained model, similar to generating word embeddings from Word2Vec or ELMo. The added context from introducing bidirectionality can lead to richer feature sets for downstream tasks that are not easily represented by a Transformer architecture. The expensive pre-training process only needs to be done once and then can be plugged into cheaper models to inform other tasks. The paper demonstrated feature-based BERT outperforms other state-of-the-art models on Named Entity Recognition.

If I Remove A Part of the Model, Is It Just as Good?

Feature ablation is a practice often used in machine learning research. It is a useful validation study for models with several distinct features that are claimed to have some performance benefit. By systematically removing each element and testing the model without it, the researchers can rank the impact of each feature and determine whether it has any impact at all. Typical metrics used in feature ablation studies are F1 score, accuracy, and loss.

In this paper, the researchers remove the Next Sentence Prediction training and the Masked LM in turn and test the model variants on five downstream tasks.

Feature Ablation Study Results
  • No NSP maps to using the masked LM without “next sentence prediction.” There is a significant impact on the natural language inference tasks and on Question-Answering.
  • LTR & No NSP maps to using neither the masked LM nor “next sentence prediction”; the model is instead trained as a standard left-to-right (LTR) language model. This configuration is directly comparable to the OpenAI GPT model. Removing bidirectionality worsens the model performance on all tasks.
  • + BiLSTM maps to adding a randomly initialized bidirectional LSTM layer on top of the LTR & No NSP system. The results improve for the Question-Answering task (SQuAD) because a right-side context is reintroduced. However, this configuration performs much worse than the pre-trained bidirectional models.
  • The BERT base model, which implements all the discussed features, attains consistently high performance across the five NLP tasks.

What is the Trade-off Using BERT?

The paper demonstrates that increasing the size of the model, i.e., going from 110M parameters to 340M parameters, improves performance. However, large pretrained models are prone to degradation in performance when fine-tuned on small training datasets. This behavior is observed as binary, i.e., the model works very well or not at all. The fine-tuning stability with small training datasets can be improved by using multiple random restarts. This problem and the ongoing research question are highlighted in this paper.

The bidirectional approach (Masked LM) takes significantly longer to train than a standard language model, because only a fraction of the tokens in each sequence are predicted, so each batch provides a smaller training signal. Combine this with the large number of parameters to train from scratch, and the model becomes even slower to train. However, in the case of feature extraction and fine-tuning, the model only needs to be pre-trained once.


BERT is a momentous achievement in machine learning and its application to NLP. This research serves as a stepping stone to building models that can learn long-term context, one of the aspects of human language interpretation that we often take for granted. BERT further demonstrated the applicability of transfer learning for NLP tasks, alongside similar milestones like ULMFiT, ELMo, and the OpenAI Transformer. In this post, I sought to share the main concepts of the paper, without all of the technicalities. As stated earlier, if you want to learn more, please read through the full paper, the links provided on this page, and the articles referenced in the paper.

How Can I Implement BERT?

You can quickly get started using BERT either via the source code or using the PyTorch library AllenNLP, which includes reference implementations of other state-of-the-art deep neural models for NLP problems. You can also implement BERT End to End (fine-tuning and predicting) with a Cloud TPU on Google Colab here.

I hope you enjoyed the first post of the Paper Reading series. Share this post and sign up to the mailing list for more posts in the future!