{"id":1159,"date":"2021-11-07T08:38:17","date_gmt":"2021-11-07T08:38:17","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/07\/the-transformer-attention-mechanism\/"},"modified":"2021-11-07T08:38:17","modified_gmt":"2021-11-07T08:38:17","slug":"the-transformer-attention-mechanism","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/07\/the-transformer-attention-mechanism\/","title":{"rendered":"The Transformer Attention Mechanism"},"content":{"rendered":"<div id=\"\">\n<p>Before the introduction of the Transformer model, attention for neural machine translation was implemented with RNN-based encoder-decoder architectures. The Transformer model revolutionized the implementation of attention by dispensing with recurrence and convolutions and relying, instead, solely on a self-attention mechanism.<\/p>\n<p>This tutorial focuses on the Transformer attention mechanism; the Transformer model itself is reviewed in a separate tutorial.<\/p>\n<p>In this tutorial, you will discover the Transformer attention mechanism for neural machine translation.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How the Transformer attention differs from its predecessors.<\/li>\n<li>How the Transformer computes scaled dot-product attention.<\/li>\n<li>How the Transformer computes multi-head attention.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_13017\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_cover.jpg\"><img 
decoding=\"async\" aria-describedby=\"caption-attachment-13017\" loading=\"lazy\" class=\"wp-image-13017 size-large\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/transformer_cover-1024x576.jpg\" alt=\"\" width=\"1024\" height=\"576\"><\/a><\/p>\n<p id=\"caption-attachment-13017\" class=\"wp-caption-text\">The Transformer Attention Mechanism<br \/>Photo by <a class=\"N2odk RZQOk Vk1a0 AsGGe pgmwB KHq0c\" href=\"https:\/\/unsplash.com\/photos\/mawU2PoJWfU\">Andreas G\u00fccklhorn<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2><b>Tutorial Overview<\/b><\/h2>\n<p>This tutorial is divided into two parts; they are:<\/p>\n<ul>\n<li>Introduction to the Transformer Attention<\/li>\n<li>The Transformer Attention\n<ul>\n<li>Scaled Dot-Product Attention<\/li>\n<li>Multi-Head Attention<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2><b>Prerequisites<\/b><\/h2>\n<p>For this tutorial, we assume that you are already familiar with:<\/p>\n<h2><b>Introduction to the Transformer Attention<\/b><\/h2>\n<p>We have, thus far, familiarised ourselves with the use of an attention mechanism in conjunction with an RNN-based encoder-decoder architecture. Two of the most popular models that implement attention in this manner are those proposed by <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Bahdanau et al. (2014)<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Luong et al. 
(2015)<\/a>.<\/p>\n<p>The Transformer architecture revolutionized the use of attention by dispensing with recurrence and convolutions, on which these earlier models had relied extensively.<\/p>\n<blockquote>\n<p><i>\u2026 the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, 2017.<\/p>\n<\/blockquote>\n<p>In their paper, Attention Is All You Need, <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Vaswani et al. (2017)<\/a> explain that the Transformer model relies, instead, solely on self-attention, where the representation of a sequence (or sentence) is computed by relating different words in the same sequence.<\/p>\n<blockquote>\n<p><i>Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, 2017.<\/p>\n<\/blockquote>\n<h2><b>The Transformer Attention<\/b><\/h2>\n<p>The main components used by the Transformer attention are the following:<\/p>\n<ul>\n<li>$\\mathbf{q}$ and $\\mathbf{k}$ denoting vectors of dimension $d_k$, containing the queries and keys, respectively.<\/li>\n<li>$\\mathbf{v}$ denoting a vector of dimension $d_v$, containing the values.<\/li>\n<li>$\\mathbf{Q}$, $\\mathbf{K}$ and $\\mathbf{V}$ denoting matrices packing together sets of queries, keys and values, respectively.<\/li>\n<li>$\\mathbf{W}^Q$, $\\mathbf{W}^K$ and $\\mathbf{W}^V$ denoting projection matrices that are used in 
generating different subspace representations of the query, key and value matrices.<\/li>\n<li>$\\mathbf{W}^O$ denoting a projection matrix for the multi-head output.<\/li>\n<\/ul>\n<p>In essence, the attention function can be considered as a mapping from a query and a set of key-value pairs to an output.<\/p>\n<blockquote>\n<p><i>The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, 2017.<\/p>\n<\/blockquote>\n<p>Vaswani et al. propose a <i>scaled dot-product attention<\/i>, and then build on it to propose <i>multi-head attention<\/i>. Within the context of neural machine translation, the queries, keys and values that are used as inputs to these attention mechanisms are different projections of the same input sentence.<\/p>\n<p>Intuitively, therefore, the proposed attention mechanisms implement self-attention by capturing the relationships between the different elements (in this case, the words) of the same sentence.<\/p>\n<h2><b>Scaled Dot-Product Attention<\/b><\/h2>\n<p>The Transformer implements a scaled dot-product attention, which follows the procedure of the <a href=\"https:\/\/machinelearningmastery.com\/the-attention-mechanism-from-scratch\/\">general attention mechanism<\/a> that we have seen previously.<\/p>\n<p>As the name suggests, the scaled dot-product attention first computes a <i>dot product<\/i> for each query, $\\mathbf{q}$, with all of the keys, $\\mathbf{k}$. It subsequently divides each result by $\\sqrt{d_k}$ and applies a softmax function. 
In doing so, it obtains the weights that are used to <i>scale<\/i> the values, $\\mathbf{v}$.<\/p>\n<div id=\"attachment_12893\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/tour_3.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12893\" loading=\"lazy\" class=\"wp-image-12893\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/tour_3-609x1024.png\" alt=\"\" width=\"177\" height=\"298\"><\/a><\/p>\n<p id=\"caption-attachment-12893\" class=\"wp-caption-text\">Scaled Dot-Product Attention <br \/>Taken from \u201c<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>\u201d<\/p>\n<\/div>\n<p>In practice, the computations performed by the scaled dot-product attention can be applied efficiently to the entire set of queries simultaneously. To do so, the matrices, $\\mathbf{Q}$, $\\mathbf{K}$ and $\\mathbf{V}$, are supplied as inputs to the attention function:<\/p>\n<p>$$\\text{attention}(\\mathbf{Q}, \\mathbf{K}, \\mathbf{V}) = \\text{softmax} \\left( \\frac{\\mathbf{QK}^T}{\\sqrt{d_k}} \\right) \\mathbf{V}$$<\/p>\n<p>Vaswani et al. explain that their scaled dot-product attention is identical to the multiplicative attention of <a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Luong et al. (2015)<\/a>, except for the added scaling factor of $\\tfrac{1}{\\sqrt{d_k}}$.<\/p>\n<p>This scaling factor was introduced to counteract the effect of the dot products growing large in magnitude for large values of $d_k$, which pushes the softmax function into regions where its gradients are extremely small and leads to the infamous vanishing gradients problem. 
The scaling factor, therefore, serves to pull the results generated by the dot product multiplication down, thereby mitigating this problem.<\/p>\n<p>Vaswani et al. further explain that their choice of multiplicative attention over the additive attention of <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Bahdanau et al. (2014)<\/a> was based on the computational efficiency of the former.<\/p>\n<blockquote>\n<p><i>\u2026 dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, 2017.<\/p>\n<\/blockquote>\n<p>The step-by-step procedure for computing the scaled dot-product attention is, therefore, the following:<\/p>\n<ol>\n<li>Compute the alignment scores by multiplying the set of queries packed in the matrix, $\\mathbf{Q}$, with the transpose of the keys matrix, $\\mathbf{K}$. 
If the matrix, $\\mathbf{Q}$, is of size $m \\times d_k$ and the matrix, $\\mathbf{K}$, is of size $n \\times d_k$, then the resulting matrix will be of size $m \\times n$:<\/li>\n<\/ol>\n<p>$$<br \/>\\mathbf{QK}^T =<br \/>\\begin{bmatrix}<br \/>e_{11} &amp; e_{12} &amp; \\dots &amp; e_{1n} \\\\<br \/>e_{21} &amp; e_{22} &amp; \\dots &amp; e_{2n} \\\\<br \/>\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\<br \/>e_{m1} &amp; e_{m2} &amp; \\dots &amp; e_{mn} \\\\<br \/>\\end{bmatrix}<br \/>$$<\/p>\n<ol start=\"2\">\n<li>Scale each of the alignment scores by $\\tfrac{1}{\\sqrt{d_k}}$:<\/li>\n<\/ol>\n<p>$$<br \/>\\frac{\\mathbf{QK}^T}{\\sqrt{d_k}} =<br \/>\\begin{bmatrix}<br \/>\\tfrac{e_{11}}{\\sqrt{d_k}} &amp; \\tfrac{e_{12}}{\\sqrt{d_k}} &amp; \\dots &amp; \\tfrac{e_{1n}}{\\sqrt{d_k}} \\\\<br \/>\\tfrac{e_{21}}{\\sqrt{d_k}} &amp; \\tfrac{e_{22}}{\\sqrt{d_k}} &amp; \\dots &amp; \\tfrac{e_{2n}}{\\sqrt{d_k}} \\\\<br \/>\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\<br \/>\\tfrac{e_{m1}}{\\sqrt{d_k}} &amp; \\tfrac{e_{m2}}{\\sqrt{d_k}} &amp; \\dots &amp; \\tfrac{e_{mn}}{\\sqrt{d_k}} \\\\<br \/>\\end{bmatrix}<br \/>$$<\/p>\n<ol start=\"3\">\n<li>Follow the scaling by applying a softmax operation along each row, so that the weights for each query sum to one:<\/li>\n<\/ol>\n<p>$$<br \/>\\text{softmax} \\left( \\frac{\\mathbf{QK}^T}{\\sqrt{d_k}} \\right) =<br \/>\\begin{bmatrix}<br \/>\\text{softmax} \\left( \\tfrac{e_{11}}{\\sqrt{d_k}} \\right) &amp; \\text{softmax} \\left( \\tfrac{e_{12}}{\\sqrt{d_k}} \\right) &amp; \\dots &amp; \\text{softmax} \\left( \\tfrac{e_{1n}}{\\sqrt{d_k}} \\right) \\\\<br \/>\\text{softmax} \\left( \\tfrac{e_{21}}{\\sqrt{d_k}} \\right) &amp; \\text{softmax} \\left( \\tfrac{e_{22}}{\\sqrt{d_k}} \\right) &amp; \\dots &amp; \\text{softmax} \\left( \\tfrac{e_{2n}}{\\sqrt{d_k}} \\right) \\\\<br \/>\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\<br \/>\\text{softmax} \\left( \\tfrac{e_{m1}}{\\sqrt{d_k}} \\right) &amp; \\text{softmax} \\left( \\tfrac{e_{m2}}{\\sqrt{d_k}} \\right) &amp; \\dots &amp; \\text{softmax} \\left( \\tfrac{e_{mn}}{\\sqrt{d_k}} \\right) \\\\<br \/>\\end{bmatrix}<br \/>$$<\/p>\n<ol start=\"4\">\n<li>Finally, apply the resulting weights to the values in the matrix, 
$\\mathbf{V}$, of size $n \\times d_v$, producing an output of size $m \\times d_v$:<\/li>\n<\/ol>\n<p>$$<br \/>\\begin{aligned}<br \/>&amp; \\text{softmax} \\left( \\frac{\\mathbf{QK}^T}{\\sqrt{d_k}} \\right) \\cdot \\mathbf{V} \\\\<br \/>=&amp;<br \/>\\begin{bmatrix}<br \/>\\text{softmax} \\left( \\tfrac{e_{11}}{\\sqrt{d_k}} \\right) &amp; \\text{softmax} \\left( \\tfrac{e_{12}}{\\sqrt{d_k}} \\right) &amp; \\dots &amp; \\text{softmax} \\left( \\tfrac{e_{1n}}{\\sqrt{d_k}} \\right) \\\\<br \/>\\text{softmax} \\left( \\tfrac{e_{21}}{\\sqrt{d_k}} \\right) &amp; \\text{softmax} \\left( \\tfrac{e_{22}}{\\sqrt{d_k}} \\right) &amp; \\dots &amp; \\text{softmax} \\left( \\tfrac{e_{2n}}{\\sqrt{d_k}} \\right) \\\\<br \/>\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\<br \/>\\text{softmax} \\left( \\tfrac{e_{m1}}{\\sqrt{d_k}} \\right) &amp; \\text{softmax} \\left( \\tfrac{e_{m2}}{\\sqrt{d_k}} \\right) &amp; \\dots &amp; \\text{softmax} \\left( \\tfrac{e_{mn}}{\\sqrt{d_k}} \\right) \\\\<br \/>\\end{bmatrix}<br \/>\\cdot<br \/>\\begin{bmatrix}<br \/>v_{11} &amp; v_{12} &amp; \\dots &amp; v_{1d_v} \\\\<br \/>v_{21} &amp; v_{22} &amp; \\dots &amp; v_{2d_v} \\\\<br \/>\\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\<br \/>v_{n1} &amp; v_{n2} &amp; \\dots &amp; v_{nd_v} \\\\<br \/>\\end{bmatrix}<br \/>\\end{aligned}<br \/>$$<\/p>\n<h2><b>Multi-Head Attention<\/b><\/h2>\n<p>Building on the single attention function just reviewed, which takes the matrices, $\\mathbf{Q}$, $\\mathbf{K}$ and $\\mathbf{V}$, as input, Vaswani et al. also propose a multi-head attention mechanism.<\/p>\n<p>Their multi-head attention mechanism linearly projects the queries, keys and values $h$ times, each time using a different learned projection. 
The single attention mechanism is then applied to each of these $h$ projections in parallel to produce $h$ outputs, which, in turn, are concatenated and projected again to produce a final result.<\/p>\n<div id=\"attachment_12894\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/tour_4.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12894\" loading=\"lazy\" class=\"wp-image-12894\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/tour_4-823x1024.png\" alt=\"\" width=\"254\" height=\"316\"><\/a><\/p>\n<p id=\"caption-attachment-12894\" class=\"wp-caption-text\">Multi-Head Attention <br \/>Taken from \u201c<a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>\u201d<\/p>\n<\/div>\n<p>The idea behind multi-head attention is to allow the attention function to extract information from different representation subspaces, which would not otherwise be possible with a single attention head.<\/p>\n<p>The multi-head attention function can be represented as follows:<\/p>\n<p>$$\\text{multihead}(\\mathbf{Q}, \\mathbf{K}, \\mathbf{V}) = \\text{concat}(\\text{head}_1, \\dots, \\text{head}_h) \\mathbf{W}^O$$<\/p>\n<p>Here, each $\\text{head}_i$, $i = 1, \\dots, h$, implements a single attention function characterized by its own learned projection matrices:<\/p>\n<p>$$\\text{head}_i = \\text{attention}(\\mathbf{QW}^Q_i, \\mathbf{KW}^K_i, \\mathbf{VW}^V_i)$$<\/p>\n<p>The step-by-step procedure for computing multi-head attention is, therefore, the following:<\/p>\n<ol>\n<li>Compute the linearly projected versions of the queries, keys and values through a multiplication with the respective weight 
matrices, $\\mathbf{W}^Q_i$, $\\mathbf{W}^K_i$ and $\\mathbf{W}^V_i$, one for each $\\text{head}_i$.<\/li>\n<\/ol>\n<ol start=\"2\">\n<li>Apply the single attention function for each head by (1) multiplying the queries and keys matrices, (2) applying the scaling and softmax operations, and (3) weighting the values matrix, to generate an output for each head.<\/li>\n<\/ol>\n<ol start=\"3\">\n<li>Concatenate the outputs of the heads, $\\text{head}_i$, $i = 1, \\dots, h$.<\/li>\n<\/ol>\n<ol start=\"4\">\n<li>Apply a linear projection to the concatenated output through a multiplication with the weight matrix, $\\mathbf{W}^O$, to generate the final result.<\/li>\n<\/ol>\n<h2><b>Further Reading<\/b><\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3><b>Books<\/b><\/h3>\n<h3><b>Papers<\/b><\/h3>\n<h2><b>Summary<\/b><\/h2>\n<p>In this tutorial, you discovered the Transformer attention mechanism for neural machine translation.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How the Transformer attention differs from its predecessors.<\/li>\n<li>How the Transformer computes scaled dot-product attention.<\/li>\n<li>How the Transformer computes multi-head attention.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>Ask your questions in the comments below and I will do my best to 
answer.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/machinelearningmastery.com\/the-transformer-attention-mechanism\/<\/p>\n","protected":false},"author":0,"featured_media":1160,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1159"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1159"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1159\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1160"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1159"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1159"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1159"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}