The Intuition Behind Transformers — Attention is All You Need

Traditionally, recurrent neural networks (RNNs) and their variants have been used extensively for natural language processing (NLP) problems. In recent years, transformers have outperformed RNN models on most NLP tasks. Before looking at transformers, let’s revisit recurrent neural networks, how they work, and where they fall short.

Recurrent Neural Networks (RNNs) work with sequential data such as text and time series. There are different types of recurrent neural networks.

Vector to Sequence Models: These models take in a vector and return a sequence of any length.
Sequence to Vector Models: These models take in a sequence as input and return a vector as output. For instance, they are commonly used in sentiment analysis problems.
Sequence to Sequence Models: As you may have guessed, these models take a sequence as input and output another sequence. They are commonly seen in language translation applications.

Natural Language Processing and RNNs

In natural language processing, RNNs are typically arranged in an encoder-decoder architecture.

The encoder summarizes all the information from the input sentence, and the decoder uses the encoder’s output to generate the output sentence. The final state of the encoder conveys the information needed to start decoding. At each step, the decoder uses its previous hidden state and its previous output word to compute a new hidden state and the next word. Multiple RNN layers are used in both the encoder and the decoder.

Recurrent Neural Networks, however, have their limitations.

First, they are slow to train, so slow that we often have to truncate training using techniques like Truncated Backpropagation Through Time (TBPTT).
Second, and more commonly, RNNs suffer from vanishing and exploding gradients. When applied to NLP problems, this means that information from the beginning of a long sentence tends to get lost.
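
To get a feel for why gradients vanish or explode, here is a toy calculation. It assumes that backpropagating through each time step multiplies the gradient by one constant factor; that constant is a simplification for illustration only, since in a real RNN the factor depends on the weights and activations at every step.

```python
def gradient_scale(per_step_factor: float, num_steps: int) -> float:
    """Scale applied to a gradient after flowing back through num_steps steps.

    per_step_factor is a hypothetical constant used only for illustration.
    """
    return per_step_factor ** num_steps

# A factor slightly below 1 shrinks the gradient towards zero over 100 steps.
vanishing = gradient_scale(0.9, 100)
# A factor slightly above 1 blows the gradient up over the same 100 steps.
exploding = gradient_scale(1.1, 100)

print(vanishing, exploding)
```

Even small deviations from 1 compound quickly, which is why the earliest words of a long sentence contribute almost nothing to the weight updates.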

Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) networks were introduced to address these problems with recurrent neural networks.

LSTMs work by adding a state called the memory cell, which lets information from the previous time step flow to the current one while skipping most of the current step’s processing. This allowed the previously dumb neurons to have a memory that they could use to retain information as needed. This, in turn, allowed the model to retain information over longer sequences.

However, LSTMs are even slower to train than regular recurrent neural networks. In addition, since each word of the sequence is passed to the network individually and the processing still happens sequentially within the network, this architecture does not take advantage of the parallel processing offered by today’s GPUs.

Attention Mechanism and RNNs

An attention mechanism was added to these encoder-decoder models to address some of the limitations of traditional RNNs and LSTMs. The attention mechanism works by using a context vector: a weighted sum of all the encoder’s hidden states, recomputed at each decoding step.

The context vector says how the current state of the decoder is related to the global input sequence. While the attention mechanism solved some of the inherent flaws with RNNs, we were still feeding words individually and processing them sequentially, which meant these architectures still did not enable us to take advantage of parallel processing offered by today’s hardware.
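
The weighted sum behind the context vector can be sketched in a few lines of plain Python. This is a minimal illustration, not the exact formulation of any particular paper: the function name `context_vector` and the toy numbers are made up, and scoring each encoder state with a plain dot product is an assumption (other scoring functions are also used in practice).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state, encoder_states):
    """Return (context, weights) for one decoding step.

    Assumption: relevance is scored with a dot product between the decoder
    state and each encoder hidden state.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)  # one weight per input position, summing to 1
    dim = len(encoder_states[0])
    # Weighted sum of all encoder hidden states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy example: three encoder hidden states of size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = context_vector(decoder_state, encoder_states)
print(weights, context)
```

The weights show which input positions the decoder is focusing on at this step; the second encoder state gets the smallest weight because it is least aligned with the decoder state.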

Attention Is All You Need — Transformers

Can we do away with RNNs altogether? Enter transformers. In 2017, the transformer architecture was introduced in the paper aptly titled Attention Is All You Need. As it turns out, attention is all you need to tackle many complex natural language processing tasks. Let’s take a look.

The transformer architecture uses an encoder and a decoder but only uses attention, no RNNs.

The key difference from the previous architectures is that the input to the encoder is the whole sentence, rather than one word at a time as with RNNs. Similarly, during training, the input to the decoder is the entire target sentence (shifted right). We pass all the words of the sentence simultaneously and determine the word embeddings simultaneously.

Let’s break up this architecture and dive deep into the individual components, beginning with the encoder block.

We start by feeding the network the word embeddings of all the words in the sequence. Word embeddings are vector representations of words in which words with similar or related meanings lie closer to each other within the embedding space.

However, in language, the position of a word in a sentence can change its meaning. For example, between the sentences “The cat is an animal.” and “You eat like an animal.” the word animal’s position changes its meaning. When we were working with CNNs and RNNs, the position of the word was retained by the architecture itself. However, the transformer model processes all the words at once, so we need an explicit positional encoding layer to retain each word’s position in the sequence after the embeddings are computed.

The paper uses sine and cosine functions of different frequencies for this encoding.
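
As a rough sketch, the sinusoidal encoding from the paper assigns sin(pos / 10000^(2i/d_model)) to even dimensions and the matching cosine to odd dimensions. The function name `positional_encoding` and the small sizes below are chosen only for illustration.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings, one row of size d_model per position."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # i walks over the even dimensions; i + 1 is the paired odd dimension.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(num_positions=5, d_model=4)
for row in pe:
    print([round(v, 3) for v in row])
```

Each dimension pair oscillates at a different frequency, so every position ends up with a distinct pattern that the model can use to tell positions apart.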

Once we have the word embeddings and the positional encodings, we add them together and pass the result to the multi-headed attention block.

The multi-headed attention block focuses on self-attention; that is, how each word in a sequence is related to other words within the same sequence. The self-attention is represented by an attention vector that is generated within the attention block. The idea is to capture the contextual relationships between the words in the sentence.

How does this work? We measure the relationship between two vectors by taking their scaled dot product.

Mathematically, the dot product measures the similarity between two vectors. For vectors normalized to unit length, two vectors are closely related if the dot product is 1 (or -1 in the case of negative correlation) and have no correlation if the dot product is 0.
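
The range from -1 to 1 applies when the vectors are normalized to unit length, which is the same as computing cosine similarity. Here is a small sketch with made-up vectors; the helper names `dot` and `cosine_similarity` are illustrative.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # close to 1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])         # close to -1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 3.0])          # exactly 0
print(same_direction, opposite, unrelated)
```

Without normalization, the raw dot product also grows with the length of the vectors, which is one reason the transformer scales it before applying the softmax.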

Our transformer model uses a scaled dot-product function to calculate the attention.

The attention function used by the transformer takes three inputs: Q (query), K (key), and V (value). The following equation computes its output; the softmax term gives the attention weights, which are then applied to the values.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
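
The equation can be traced step by step in plain Python. This is a minimal sketch on toy data: it uses lists of row vectors instead of a tensor library, and it leaves out the learned projections that produce Q, K, and V in the real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors; returns (outputs, attention weights)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: queries, keys, and values all come from the same 3 "words".
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1 and tells us how much one word attends to every word in the sequence, including itself.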

The attention block is called a multi-headed attention block because we compute multiple attention vectors (heads) for each word; the heads are then concatenated and combined through a learned linear projection.

The attention heads are computed independently of each other, which allows us to use parallelization. Remember GPUs?
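
To make the idea of independent heads concrete, here is a simplified sketch. It splits each word vector into equal slices, runs attention on each slice separately, and concatenates the results. The learned projection matrices used in the paper (for the queries, keys, values, and the final output) are deliberately omitted, so treat this as an illustration of the data flow rather than a full implementation. The names `split_heads` and `multi_head_self_attention` are made up.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        weights = [e / sum(exps) for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

def split_heads(X, num_heads):
    """Slice every row of X into num_heads equal chunks, one matrix per head."""
    head_dim = len(X[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in X]
            for h in range(num_heads)]

def multi_head_self_attention(X, num_heads):
    """Simplified multi-head self-attention without learned projections."""
    # Each head only sees its own slice, so heads could run in parallel.
    heads = [attention(Xh, Xh, Xh) for Xh in split_heads(X, num_heads)]
    # Concatenate the head outputs back into one vector per position.
    return [[value for head in heads for value in head[t]]
            for t in range(len(X))]

X = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 1.0]]
out = multi_head_self_attention(X, num_heads=2)
print(out)
```

Because no head depends on another head's result, the loop over heads is exactly the kind of work a GPU can execute side by side.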

Next, we have a Feed-Forward Network (FFN). This is a regular feed-forward network that is applied to every attention vector. The FFN is applied so that the output can be consumed by the next encoder block or the decoder block.

Each FFN is made up of two dense linear layers with a ReLU activation in between.

FFN(x) = max(0, xW1 + b1)W2 + b2

A couple of key points regarding the FFN layer:

The FFN is applied to each position separately and identically.
The FFN uses different parameters from layer to layer.
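
The FFN equation and the "separately and identically" point can be sketched as follows. The weights below are tiny hand-picked numbers for illustration; in a real model W1, b1, W2, and b2 are learned, and the helper names are made up.

```python
def linear(x, W, b):
    """Compute xW + b for a row vector x, with W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in x]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2 for a single position."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same FFN, with the same weights, to every position in X."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

# Hand-picked toy weights: model size 2, hidden size 3.
W1 = [[1.0, -1.0, 0.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [-1.0, 0.0], [1.0, 2.0]]
result = position_wise_ffn(X, W1, b1, W2, b2)
print(result)
```

The first and third positions hold the same vector and therefore get the same output, since no information is shared between positions inside the FFN.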

Finally, we have an Add & Normalization layer applied after each attention block and after each FFN block. This layer adds a residual connection around the block and normalizes the result, which aids learning during backpropagation.
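
In the paper, the output of each sublayer is LayerNorm(x + Sublayer(x)). Here is a minimal sketch of that step. It assumes a plain layer normalization without the learned scale and shift parameters that real implementations include, and the `eps` value is an arbitrary small constant for numerical stability.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance.

    Simplification: no learned scale or shift parameters.
    """
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_output)]
    return layer_norm(residual)

x = [1.0, 2.0, 3.0, 4.0]                 # input to the sublayer
sublayer_output = [0.5, -0.5, 1.0, 0.0]  # e.g. what attention or the FFN returned
normalized = add_and_norm(x, sublayer_output)
print(normalized)
```

The residual path gives gradients a direct route around each sublayer, while the normalization keeps the values in a stable range from layer to layer.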

The decoder block works similarly. We pass in the target sequence of words encoded in an embedding along with the positional encodings.

The decoder’s self-attention block generates the attention vectors for the target sequence to find out how much each word in the target sequence is related to the other words in the sequence. This first attention block of the decoder is called the masked attention block because we apply a mask to it. This ensures that while generating the attention vectors for the target sequence, we can use all the words from the input sequence, but only the previous words of the target sequence.
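
One common way to implement the mask is to set the scores of future positions to negative infinity before the softmax, so their attention weights become zero. The sketch below assumes that approach and uses uniform toy scores; the helper names are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax that assigns zero weight to masked-out positions."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    peak = max(masked)  # finite, because each position can always see itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 3
mask = causal_mask(n)
scores = [[1.0] * n for _ in range(n)]  # toy scores: every pair equally similar
weights = [masked_softmax(scores[i], mask[i]) for i in range(n)]
for row in weights:
    print(row)
```

The first target position can only attend to itself, the second splits its attention over the first two positions, and so on, so no position can peek at words that come after it.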

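The decoder has an additional attention block that combines the encoder’s output for the input sequence with the decoder’s representation of the target sequence to determine how each word in the target sequence is related to each word in the input sequence. The queries come from the decoder, while the keys and values come from the encoder.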

The second attention layer’s output is sent to an FFN layer, which works like the FFN layer of the encoder block.

Finally, we have a linear layer (a single dense layer) followed by a softmax function, which produces a probability distribution over all possible next words. The word with the highest probability is the next predicted word.

This process is repeated until the end-of-sentence token is generated for the sequence.
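
The generation loop can be sketched as follows. Picking the single most probable word at every step is known as greedy decoding, which matches the description above. The function `toy_logits` is a hypothetical stand-in for the whole decoder plus linear layer, and the token ids are made up; a real model would compute the scores from the encoder output and the words generated so far.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(next_token_logits, start_token, end_token, max_len=20):
    """Repeatedly pick the most probable next token until end_token appears."""
    tokens = [start_token]
    while len(tokens) < max_len:
        probs = softmax(next_token_logits(tokens))
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens

def toy_logits(tokens):
    """Hypothetical stand-in for the decoder: favors the token after the last one."""
    vocab_size = 4
    target = min(tokens[-1] + 1, vocab_size - 1)
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]

# Token 0 plays the role of the start token and token 3 the end-of-sentence token.
decoded = greedy_decode(toy_logits, start_token=0, end_token=3)
print(decoded)
```

The `max_len` guard is a safety net for the case where the model never produces the end token.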

This post gives you the basic intuition behind the Transformer architecture for NLP. You can also read the original paper, Attention Is All You Need.

TensorFlow has an excellent step-by-step tutorial if you would like to get your hands dirty by implementing transformers.