{"id":1059,"date":"2021-10-20T08:41:53","date_gmt":"2021-10-20T08:41:53","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/20\/the-bahdanau-attention-mechanism\/"},"modified":"2021-10-20T08:41:53","modified_gmt":"2021-10-20T08:41:53","slug":"the-bahdanau-attention-mechanism","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/20\/the-bahdanau-attention-mechanism\/","title":{"rendered":"The Bahdanau Attention Mechanism"},"content":{"rendered":"<div id=\"\">\n<p id=\"last-modified-info\">Last Updated on October 17, 2021<\/p>\n<p>Conventional encoder-decoder architectures for machine translation encoded every source sentence into a fixed-length vector, irrespective of its length, from which the decoder would then generate a translation. This made it difficult for the neural network to cope with long sentences, essentially resulting in a performance bottleneck.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>The Bahdanau attention was proposed to address this performance bottleneck, and it achieved significant improvements over the conventional approach.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>In this tutorial, you will discover the Bahdanau attention mechanism for neural machine translation.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Where the Bahdanau attention derives its name from, and the challenge it addresses.<\/li>\n<li>The role of the different components that form part of the Bahdanau encoder-decoder architecture.<\/li>\n<li>The operations performed by the Bahdanau attention algorithm.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"attachment_12942\" class=\"wp-caption aligncenter\"><a 
href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/bahdanau_cover-scaled.jpg\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12942\" loading=\"lazy\" class=\"wp-image-12942 size-large\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/bahdanau_cover-1024x681.jpg\" alt=\"\" width=\"1024\" height=\"681\"><\/a>\n<p id=\"caption-attachment-12942\" class=\"wp-caption-text\">The Bahdanau Attention Mechanism<br \/>Photo by <a href=\"https:\/\/unsplash.com\/photos\/KMn4VEeEPR8\">Sean Oulashin<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2><b>Tutorial Overview<\/b><\/h2>\n<p>This tutorial is divided into two parts; they are:<\/p>\n<ul>\n<li>Introduction to the Bahdanau Attention<\/li>\n<li>The Bahdanau Architecture\n<ul>\n<li>The Encoder<\/li>\n<li>The Decoder<\/li>\n<li>The Bahdanau Attention Algorithm<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2><b>Prerequisites<\/b><\/h2>\n<p>For this tutorial, we assume that you are already familiar with:<\/p>\n<h2><b>Introduction to the Bahdanau Attention<\/b><\/h2>\n<p>The Bahdanau attention mechanism has inherited its name from the first author of the paper in which it was published.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>It follows the work of <a href=\"https:\/\/arxiv.org\/abs\/1406.1078\">Cho et al. (2014)<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/1409.3215\">Sutskever et al. (2014)<\/a>, who had also employed an RNN encoder-decoder framework for neural machine translation, specifically by encoding a variable-length source sentence into a fixed-length vector. 
The latter would then be decoded into a variable-length target sentence.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Bahdanau et al. (2014)<\/a> argue that this encoding of a variable-length input into a fixed-length vector <i>squashes<\/i> the information of the source sentence, irrespective of its length, causing the performance of a basic encoder-decoder model to deteriorate rapidly with an increasing length of the input sentence. The approach they propose, on the other hand, replaces the fixed-length vector with a variable-length one, to improve the translation performance of the basic encoder-decoder model.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>The most important distinguishing feature of this approach from the basic encoder\u2013decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Neural Machine Translation by Jointly Learning to Align and Translate<\/a>, 2014.<\/p>\n<\/blockquote>\n<h2><b>The Bahdanau Architecture<\/b><\/h2>\n<p>The main components of the Bahdanau encoder-decoder architecture are the following:<\/p>\n<ul>\n<li>$\\mathbf{s}_{t-1}$ is the <i>hidden decoder state<\/i> at the previous time step, $t-1$.<\/li>\n<li>$\\mathbf{c}_t$ is the <i>context vector<\/i> at time step, $t$. 
It is generated anew at each decoder time step to produce a target word, $y_t$.<\/li>\n<li>$\\mathbf{h}_i$ is an <i>annotation<\/i> that captures the information contained in the words forming the entire input sentence, $\\{ x_1, x_2, \\dots, x_T \\}$, with strong focus around the $i$-th word out of $T$ total words.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>$\\alpha_{t,i}$ is a <i>weight<\/i> value assigned to each annotation, $\\mathbf{h}_i$, at the current time step, $t$.<\/li>\n<li>$e_{t,i}$ is an <i>attention score<\/i> generated by an alignment model, $a(\\cdot)$, that scores how well $\\mathbf{s}_{t-1}$ and $\\mathbf{h}_i$ match.<\/li>\n<\/ul>\n<p>These components find their use at different stages of the Bahdanau architecture, which employs a bidirectional RNN as an encoder and an RNN decoder, with an attention mechanism in between:<\/p>\n<div id=\"attachment_12941\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/bahdanau_1.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12941\" loading=\"lazy\" class=\"wp-image-12941\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/bahdanau_1-780x1024.png\" alt=\"\" width=\"324\" height=\"426\"><\/a>\n<p id=\"caption-attachment-12941\" class=\"wp-caption-text\">The Bahdanau Architecture <br \/>Taken from \u201c<a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Neural Machine Translation by Jointly Learning to Align and Translate<\/a>\u201c<\/p>\n<\/div>\n<h3><b>The Encoder<\/b><\/h3>\n<p>The role of the encoder is to generate an annotation, $\\mathbf{h}_i$, for every word, $x_i$, in an input sentence of length $T$ words.<span 
class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>For this purpose, Bahdanau et al. employ a bidirectional RNN, which reads the input sentence in the forward direction to produce a forward hidden state, $\\overrightarrow{\\mathbf{h}}_i$, and then reads the input sentence in the reverse direction to produce a backward hidden state, $\\overleftarrow{\\mathbf{h}}_i$. The annotation for some particular word, $x_i$, concatenates the two states:<\/p>\n<p>$$\\mathbf{h}_i = \\left[ \\overrightarrow{\\mathbf{h}}_i^T \\; ; \\; \\overleftarrow{\\mathbf{h}}_i^T \\right]^T$$<\/p>\n<p>The idea behind generating each annotation in this manner was to capture a summary of both preceding and succeeding words.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>In this way, the annotation $\\mathbf{h}_i$ contains the summaries of both the preceding words and the following words.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Neural Machine Translation by Jointly Learning to Align and Translate<\/a>, 2014.<\/p>\n<\/blockquote>\n<p>The generated annotations are then passed to the decoder to generate the context vector.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<h3><b>The Decoder<\/b><\/h3>\n<p>The role of the decoder is to produce the target words by focusing on the most relevant information contained in the source sentence. For this purpose, it makes use of an attention mechanism.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. 
The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Neural Machine Translation by Jointly Learning to Align and Translate<\/a>, 2014.<\/p>\n<\/blockquote>\n<p>The decoder takes each annotation and feeds it to an alignment model, $a(\\cdot)$, together with the previous hidden decoder state, $\\mathbf{s}_{t-1}$. This generates an attention score:<\/p>\n<p>$$e_{t,i} = a(\\mathbf{s}_{t-1}, \\mathbf{h}_i)$$<\/p>\n<p>The function implemented by the alignment model, here, combines $\\mathbf{s}_{t-1}$ and $\\mathbf{h}_i$ by means of an addition operation. For this reason, the attention mechanism implemented by Bahdanau et al. is referred to as <i>additive attention<\/i>.<\/p>\n<p>This can be implemented in two ways, either (1) by applying a weight matrix, $\\mathbf{W}$, over the concatenated vectors, $\\mathbf{s}_{t-1}$ and $\\mathbf{h}_i$, or (2) by applying the weight matrices, $\\mathbf{W}_1$ and $\\mathbf{W}_2$, to $\\mathbf{s}_{t-1}$ and $\\mathbf{h}_i$ separately:<\/p>\n<ol>\n<li>$$a(\\mathbf{s}_{t-1}, \\mathbf{h}_i) = \\mathbf{v}^T \\tanh(\\mathbf{W}[\\mathbf{h}_i \\; ; \\; \\mathbf{s}_{t-1}])$$<\/li>\n<li>$$a(\\mathbf{s}_{t-1}, \\mathbf{h}_i) = \\mathbf{v}^T \\tanh(\\mathbf{W}_1 \\mathbf{h}_i + \\mathbf{W}_2 \\mathbf{s}_{t-1})$$<\/li>\n<\/ol>\n<p>Here, $\\mathbf{v}$ is a weight vector.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>The alignment model is parameterized as a feedforward neural network and trained jointly with the remaining system components.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Subsequently, a softmax function is applied across the attention scores to obtain the corresponding weight values:<\/p>\n<p>$$\\alpha_{t,i} = \\text{softmax}(e_{t,i})$$<\/p>\n<p>The application of the softmax function essentially normalizes the attention scores to a range between 0 and 1 and, hence, the resulting weights can be 
considered as probability values. Each probability (or weight) value reflects how important $\\mathbf{h}_i$ and $\\mathbf{s}_{t-1}$ are in generating the next state, $\\mathbf{s}_t$, and the next output, $y_t$.<\/p>\n<blockquote>\n<p><i>Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Neural Machine Translation by Jointly Learning to Align and Translate<\/a>, 2014.<\/p>\n<\/blockquote>\n<p>This is finally followed by the computation of the context vector as a weighted sum of the annotations:<\/p>\n<p>$$\\mathbf{c}_t = \\sum^T_{i=1} \\alpha_{t,i} \\mathbf{h}_i$$<\/p>\n<h3><b>The Bahdanau Attention Algorithm<\/b><\/h3>\n<p>In summary, the attention algorithm proposed by Bahdanau et al. performs the following operations:<\/p>\n<ol>\n<li>The encoder generates a set of annotations, $\\mathbf{h}_i$, from the input sentence.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>These annotations are fed to an alignment model together with the previous hidden decoder state. The alignment model uses this information to generate the attention scores, $e_{t,i}$.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>A softmax function is applied to the attention scores, effectively normalizing them into weight values, $\\alpha_{t,i}$, in a range between 0 and 1. 
<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>These weights, together with the previously computed annotations, are used to generate a context vector, $\\mathbf{c}_t$, through a weighted sum of the annotations.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>The context vector is fed to the decoder together with the previous hidden decoder state and the previous output to compute the final output, $y_t$.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>Steps 2 to 5 are repeated until the end of the sequence.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<\/ol>\n<p>Bahdanau et al. tested their architecture on the task of English-to-French translation and reported that their model significantly outperformed the conventional encoder-decoder model, irrespective of the sentence length.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Several improvements over the Bahdanau attention have since been proposed, such as those of <a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Luong et al. 
(2015)<\/a>, which we shall review in a separate tutorial.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<h2><b>Further Reading<\/b><\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3><b>Books<\/b><\/h3>\n<h3><b>Papers<\/b><\/h3>\n<h2><b>Summary<\/b><\/h2>\n<p>In this tutorial, you discovered the Bahdanau attention mechanism for neural machine translation.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Where the Bahdanau attention derives its name from, and the challenge it addresses.<\/li>\n<li>The role of the different components that form part of the Bahdanau encoder-decoder architecture.<\/li>\n<li>The operations performed by the Bahdanau attention algorithm.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>Ask your questions in the comments below and I will do my best to answer.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/machinelearningmastery.com\/the-bahdanau-attention-mechanism\/<\/p>\n","protected":false},"author":0,"featured_media":1060,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1059"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1059"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1059\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1060"}],"wp:attachment":[{"href":"https:\/\/salarydi
stribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1059"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1059"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1059"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}