Explanation of “Attention Is All You Need” with Code by Abhishek Thakur

ABHISHEK GUPTA
5 min read · Dec 19, 2021

Hello Everyone!

This is a very special blog post on a paper named Attention Is All You Need, which introduces the Transformer architecture. Recently, the world’s first 4x Kaggle Grandmaster, Abhishek Thakur, shared a Twitter thread explaining the implementation of Attention Is All You Need from scratch. A couple of months ago, I came across this paper and wanted to understand it in depth, but somehow I couldn’t get a clear and broad picture. After finding the thread, for my personal convenience (and hopefully that of many others) in reading and understanding, I decided to convert it into a blog post. And here it is. However, before you begin going through this post, I want to make it very clear: full credit for this post goes to the one and only @abhi1thakur. All I have done is collect the Twitter thread and add it here in sequence for better readability.

Hope it helps.

Here is the detailed explanation as per @abhi1thakur:

There are two parts: encoder and decoder. Encoder takes source embeddings and source mask as inputs and decoder takes target embeddings and target mask. Decoder inputs are shifted right. What does shifted right mean? Keep reading the thread.
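Since the thread's code screenshots are not reproduced here, below is a minimal sketch of one common reading of "shifted right": the decoder input is the target sequence with a start token prepended and the last token dropped, so the prediction for position i only depends on earlier target tokens. The token ids (BOS_ID, the end-of-sequence id 2) and the helper name shift_right are assumptions made for illustration.

```python
import torch

BOS_ID = 1  # hypothetical start-of-sequence token id


def shift_right(targets: torch.Tensor, bos_id: int = BOS_ID) -> torch.Tensor:
    """Prepend a start token and drop the last token of each target sequence."""
    bos = torch.full(
        (targets.size(0), 1), bos_id, dtype=targets.dtype, device=targets.device
    )
    return torch.cat([bos, targets[:, :-1]], dim=1)


# One target sentence of 4 tokens; 2 is a hypothetical end-of-sequence id.
targets = torch.tensor([[5, 6, 7, 2]])
decoder_inputs = shift_right(targets)
print(decoder_inputs)  # tensor([[1, 5, 6, 7]])
```

At training time the decoder would read decoder_inputs and be asked to predict targets, one position at a time.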

The encoder is composed of N encoder layers. Let’s implement this as a black box too. The output of one encoder layer goes as input to the next one, and so on. The source mask remains the same till the end.
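The stacking described above can be sketched roughly as follows. The real encoder layer is treated as a black box here, so PlaceholderEncoderLayer is a made-up stand-in (a single linear layer that ignores the mask) and not the layer from the thread. The class names and sizes are assumptions too.

```python
import torch
import torch.nn as nn


class PlaceholderEncoderLayer(nn.Module):
    """Hypothetical black box: maps (x, src_mask) to a tensor shaped like x."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x, src_mask):
        return self.proj(x)  # a real layer would use src_mask inside attention


class Encoder(nn.Module):
    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [PlaceholderEncoderLayer(d_model) for _ in range(num_layers)]
        )

    def forward(self, src_emb, src_mask):
        x = src_emb
        for layer in self.layers:
            # Output of one layer feeds the next; the mask never changes.
            x = layer(x, src_mask)
        return x


encoder = Encoder(d_model=16, num_layers=6)
src_emb = torch.randn(2, 10, 16)                     # (batch, seq_len, d_model)
src_mask = torch.ones(2, 10, 10, dtype=torch.bool)   # dummy mask
enc_out = encoder(src_emb, src_mask)
print(enc_out.shape)  # torch.Size([2, 10, 16])
```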

Similarly, we have the decoder, composed of decoder layers. The decoder takes the output of the last encoder layer along with the target embeddings and target mask. enc_mask is the same as src_mask, as explained previously.

Let’s take a look at the encoder layer. It consists of multi-headed attention, a feed-forward network and two layer normalization layers. See the forward(…) function to understand how the skip connection works. It’s just adding the original inputs to the outputs.
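Here is a rough sketch of such an encoder layer. To keep it short, it swaps in PyTorch's built-in nn.MultiheadAttention for the from-scratch attention of the thread, and the sizes (d_model, num_heads, d_ff) are assumptions. Note that the built-in module expects a padding mask where True marks padding.

```python
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        # Stand-in: built-in attention instead of a hand-written one.
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, pad_mask=None):
        # pad_mask: (batch, seq_len) bool, True where the token is padding.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + attn_out)    # skip connection around attention
        x = self.norm2(x + self.ff(x))  # skip connection around feed-forward
        return x


x = torch.randn(2, 10, 64)
pad_mask = torch.zeros(2, 10, dtype=torch.bool)
pad_mask[:, 8:] = True  # pretend the last two tokens are padding
out = EncoderLayer()(x, pad_mask)
print(out.shape)  # torch.Size([2, 10, 64])
```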

Now comes the fun part: multi-head attention. We see it 3 times in the architecture. Multi-headed attention is nothing but many different self-attention layers. The outputs from these self-attention layers are concatenated to form an output with the same shape as the input.
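A minimal sketch of this idea, assuming each head projects to d_model / num_heads dimensions and that a final linear layer is applied after concatenation (as in the paper). The class names are my own, and each head is a small attention layer of the kind described in the following paragraphs. Masking is left out to keep the concatenation step visible.

```python
import math

import torch
import torch.nn as nn


class AttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)

    def forward(self, query, key, value):
        q, k, v = self.q(query), self.k(key), self.v(value)
        scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(q.size(-1))
        return torch.bmm(torch.softmax(scores, dim=-1), v)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        d_head = d_model // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(d_model, d_head) for _ in range(num_heads)]
        )
        self.out = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        # Each head returns (batch, seq_len, d_head); concatenating along the
        # last dimension brings the width back to d_model.
        concat = torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        return self.out(concat)


x = torch.randn(2, 10, 64)
y = MultiHeadAttention()(x, x, x)
print(y.shape)  # torch.Size([2, 10, 64])
```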

Self-attention, in simple words, is attention on the same sequence. I like to define it as a layer that tells you which token loves another token in the same sequence. For self-attention, the input is passed through 3 linear layers: query, key, value.

In the forward function, we apply the formula for self-attention: softmax(Q·Kᵀ / sqrt(d_k))·V, where d_k is the dimension of the key vectors, so the scores are divided by the square root of d_k. torch.bmm does batched matrix multiplication. Please note: q, k, v (inputs) are the same in the case of self-attention.

Let’s look at the forward function and the formula for self-attention (scaled). Ignoring the mask part, everything is pretty easy to implement.
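As a small sketch of the scaled formula on its own (no mask, no linear layers), the function below applies it to raw tensors with torch.bmm. The function name and tensor sizes are assumptions made for this example.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for batched inputs of shape (batch, len, dim)."""
    d_k = k.size(-1)
    scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d_k)  # (batch, len_q, len_k)
    weights = torch.softmax(scores, dim=-1)                    # each row sums to 1
    return torch.bmm(weights, v), weights


x = torch.randn(4, 6, 8)  # self-attention: q, k and v are the same tensor
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)  # torch.Size([4, 6, 8]) torch.Size([4, 6, 6])
```

Each row of weights says how much one token attends to every token in the sequence, which is the "who loves whom" picture from above.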

The mask just tells the attention where not to look (e.g. padding tokens).
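One common way to apply such a mask, sketched here under the assumption that padding uses token id 0 and that True means "allowed to look": masked positions get a score of minus infinity before the softmax, so their attention weight becomes exactly zero.

```python
import torch

PAD_ID = 0  # hypothetical padding token id

token_ids = torch.tensor([[7, 3, 9, PAD_ID, PAD_ID]])
pad_mask = (token_ids != PAD_ID).unsqueeze(1)  # (batch, 1, seq_len), True = real token

scores = torch.zeros(1, 5, 5)  # dummy attention scores, all equal
scores = scores.masked_fill(~pad_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)

print(weights[0, 0])  # first three entries are 1/3 each, the last two are 0
```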

Let’s take a look at the decoder now. The implementation is similar to that of the encoder, except that each decoder layer also takes the final encoder’s output as input.

The decoder layer consists of two different types of attention. The masked version has an extra mask in addition to the padding mask. We will come to that. The normal multi-head attention takes its key and value from the final encoder output. Key and value here are the same.

The query comes from the output of the masked multi-head attention (after layernorm). Check out the forward function and things are very easy to understand.
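A rough sketch of a decoder layer along these lines. As a shortcut, it uses PyTorch's built-in nn.MultiheadAttention in place of the hand-written attention from the thread, and the sizes are assumptions. The built-in module expects attn_mask with True meaning "blocked", so an "allowed" lower-triangular mask is inverted before use.

```python
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        # Stand-ins: built-in attention modules instead of hand-written ones.
        self.masked_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, enc_out, blocked_mask=None):
        # 1) Masked self-attention over the target sequence.
        a, _ = self.masked_attn(tgt, tgt, tgt, attn_mask=blocked_mask)
        x = self.norm1(tgt + a)
        # 2) Query from the decoder; key and value are both the encoder output.
        c, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + c)
        # 3) Feed-forward with its own skip connection.
        return self.norm3(x + self.ff(x))


tgt = torch.randn(2, 7, 64)       # (batch, tgt_len, d_model)
enc_out = torch.randn(2, 10, 64)  # (batch, src_len, d_model)
allowed = torch.tril(torch.ones(7, 7)).bool()  # True = may attend
out = DecoderLayer()(tgt, enc_out, blocked_mask=~allowed)
print(out.shape)  # torch.Size([2, 7, 64])
```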

Now we come to the special mask for targets, aka the subsequent mask. The subsequent mask just tells the decoder not to look at tokens in the future. It is used in addition to the padding mask and matters mainly during training, when the whole target sequence is fed to the decoder at once.
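A small sketch of how such a mask can be built and combined with the padding mask. It assumes the convention True = "allowed to look" and a padding id of 0. The helper name subsequent_mask is my own choice.

```python
import torch

PAD_ID = 0  # hypothetical padding token id


def subsequent_mask(size: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return torch.tril(torch.ones(size, size)).bool()


mask = subsequent_mask(4)
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

tgt_ids = torch.tensor([[4, 8, 6, PAD_ID]])
pad_mask = (tgt_ids != PAD_ID).unsqueeze(1)  # (batch, 1, tgt_len)
tgt_mask = pad_mask & mask                   # broadcasts to (batch, tgt_len, tgt_len)
print(tgt_mask[0, 3])  # tensor([ True,  True,  True, False])
```

The last row shows both masks at work: the final position may look at all earlier tokens, but never at the padding token.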

Now we have all the building blocks except positional encoding. Positional encoding gives the model an idea of where the tokens are located relative to each other. To implement positional encoding, we can simply use an embedding layer!
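A minimal sketch of the embedding-layer approach: one learned vector per position, added to the token embeddings. The original paper uses fixed sinusoidal encodings and reports that learned positional embeddings gave nearly identical results. The class name and the max_len limit are assumptions.

```python
import torch
import torch.nn as nn


class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len=512, d_model=64):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # one vector per position

    def forward(self, token_emb):
        # token_emb: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)  # broadcast over the batch


pos_enc = LearnedPositionalEncoding(max_len=128, d_model=16)
token_emb = torch.zeros(2, 5, 16)  # zeros make the added position vectors visible
encoded = pos_enc(token_emb)
print(encoded.shape)  # torch.Size([2, 5, 16])
```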

And this is what the inputs and outputs look like. Here, batch size = 32, length of input sequence = 128, length of output sequence = 64. We add a linear + softmax layer on top of the decoder output. This gives us a token prediction for each position (a classification problem).
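To make the shapes concrete, here is a small sketch of the final linear + softmax step. The decoder output is replaced by a random tensor of the right shape, and d_model and vocab_size are assumed values, not numbers from the thread.

```python
import torch
import torch.nn as nn

batch_size, src_len, tgt_len = 32, 128, 64
d_model, vocab_size = 512, 1000  # assumed sizes for illustration

src_ids = torch.randint(0, vocab_size, (batch_size, src_len))  # encoder input ids
tgt_ids = torch.randint(0, vocab_size, (batch_size, tgt_len))  # decoder input ids

# Stand-in for the real decoder output: one d_model vector per target position.
decoder_out = torch.randn(batch_size, tgt_len, d_model)

generator = nn.Linear(d_model, vocab_size)
probs = torch.softmax(generator(decoder_out), dim=-1)  # distribution over the vocabulary
pred = probs.argmax(dim=-1)                            # predicted token id per position

print(src_ids.shape, tgt_ids.shape)  # torch.Size([32, 128]) torch.Size([32, 64])
print(probs.shape, pred.shape)       # torch.Size([32, 64, 1000]) torch.Size([32, 64])
```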

That’s all for now.

If you have any comments, feel free to share.

Alternate Resource: Transformers from Scratch

References:

Original Article: https://arxiv.org/abs/1706.03762

Twitter Thread: https://twitter.com/abhi1thakur/status/1470406419786698761


Written by ABHISHEK GUPTA

Data Scientist in Banking Industry.
