{"id":1057,"date":"2021-10-20T08:41:52","date_gmt":"2021-10-20T08:41:52","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/20\/the-luong-attention-mechanism\/"},"modified":"2021-10-20T08:41:52","modified_gmt":"2021-10-20T08:41:52","slug":"the-luong-attention-mechanism","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/20\/the-luong-attention-mechanism\/","title":{"rendered":"The Luong Attention Mechanism"},"content":{"rendered":"<div id=\"\">\n<p>The Luong attention mechanism sought to improve upon the Bahdanau model for neural machine translation by introducing two new classes of attentional mechanisms: a <i>global<\/i> approach that attends to all source words, and a <i>local<\/i> approach that attends to only a selected subset of words when predicting the target sentence.<\/p>\n<p>In this tutorial, you will discover the Luong attention mechanism for neural machine translation.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The operations performed by the Luong attention algorithm.<\/li>\n<li>How the global and local attentional models work.<\/li>\n<li>How the Luong attention compares to the Bahdanau attention.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_12962\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/luong_cover-scaled.jpg\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12962\" loading=\"lazy\" class=\"wp-image-12962 size-large\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/luong_cover-1024x575.jpg\" alt=\"\" width=\"1024\" height=\"575\"><\/a><\/p>\n<p id=\"caption-attachment-12962\" class=\"wp-caption-text\">The Luong Attention Mechanism<br \/>Photo by <a href=\"https:\/\/unsplash.com\/photos\/BskqKfpR4pw\">Mike Nahlii<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2><b>Tutorial Overview<\/b><\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ul>\n<li>Introduction to the Luong Attention<\/li>\n<li>The Luong Attention Algorithm<\/li>\n<li>The Global Attentional Model<\/li>\n<li>The Local Attentional Model<\/li>\n<li>Comparison to the Bahdanau Attention<\/li>\n<\/ul>\n<h2><b>Prerequisites<\/b><\/h2>\n<p>For this tutorial, we assume that you are already familiar with the encoder-decoder architecture for neural machine translation and the Bahdanau attention mechanism.<\/p>\n<h2><b>Introduction to the Luong Attention<\/b><\/h2>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Luong et al. (2015)<\/a> drew inspiration from earlier attention models to propose two attention mechanisms:<\/p>\n<blockquote>\n<p><i>In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Effective Approaches to Attention-based Neural Machine Translation<\/a>, 2015.<\/p>\n<\/blockquote>\n<p>The <i>global<\/i> attentional model resembles the model of <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Bahdanau et al. 
(2014)<\/a> in attending to <i>all<\/i> source words, but aims to simplify it architecturally.<\/p>\n<p>The <i>local<\/i> attentional model is inspired by the hard and soft attention models of <a href=\"https:\/\/arxiv.org\/abs\/1502.03044\">Xu et al. (2015)<\/a>, and attends to <i>only a few<\/i> of the source positions.<\/p>\n<p>The two attentional models share many of the steps in their prediction of the current word, but differ mainly in their computation of the context vector.<\/p>\n<p>Let\u2019s first take a look at the overarching Luong attention algorithm, and then delve into the differences between the global and local attentional models afterwards.<\/p>\n<h2><b>The Luong Attention Algorithm<\/b><\/h2>\n<p>The attention algorithm of Luong et al. performs the following operations:<\/p>\n<ol>\n<li>The encoder generates a set of annotations, $H = \\mathbf{h}_i, i = 1, \\dots, T$, from the input sentence.<\/li>\n<\/ol>\n<ol start=\"2\">\n<li>The current decoder hidden state is computed as: $\\mathbf{s}_t = \\text{RNN}_\\text{decoder}(\\mathbf{s}_{t-1}, y_{t-1})$. Here, $\\mathbf{s}_{t-1}$ denotes the previous decoder hidden state, and $y_{t-1}$ the previous decoder output.<\/li>\n<\/ol>\n<ol start=\"3\">\n<li>An alignment model, $a(\\cdot)$, uses the annotations and the current decoder hidden state to compute the alignment scores: $e_{t,i} = a(\\mathbf{s}_t, \\mathbf{h}_i)$.<\/li>\n<\/ol>\n<ol start=\"4\">\n<li>A softmax function is applied to the alignment scores, effectively normalizing them into weight values between 0 and 1: $\\alpha_{t,i} = \\text{softmax}(e_{t,i})$.<\/li>\n<\/ol>\n<ol start=\"5\">\n<li>These weights, together with the previously computed annotations, are used to generate a context vector through a weighted sum of the annotations: $\\mathbf{c}_t = \\sum^T_{i=1} \\alpha_{t,i} \\mathbf{h}_i$.<\/li>\n<\/ol>\n<ol start=\"6\">\n<li>An attentional hidden state is computed from a concatenation of the context vector and the current decoder hidden state: $\\widetilde{\\mathbf{s}}_t = \\tanh(\\mathbf{W}_c [\\mathbf{c}_t ; \\mathbf{s}_t])$.<\/li>\n<\/ol>\n<ol start=\"7\">\n<li>The decoder produces a final output by feeding it a weighted attentional hidden state: $y_t = \\text{softmax}(\\mathbf{W}_y \\widetilde{\\mathbf{s}}_t)$.<\/li>\n<\/ol>\n<ol start=\"8\">\n<li>Steps 2-7 are repeated until the end of the sequence.<\/li>\n<\/ol>\n<h2><b>The Global Attentional Model<\/b><\/h2>\n<p>The global attentional model considers all of the source words in the input sentence when generating the alignment scores and, eventually, when computing the context vector.<\/p>\n<blockquote>\n<p><i>The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector, 
$\\mathbf{c}_t$.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Effective Approaches to Attention-based Neural Machine Translation<\/a>, 2015.<\/p>\n<\/blockquote>\n<p>In order to do so, Luong et al. propose three alternative approaches for computing the alignment scores. The first approach is similar to Bahdanau\u2019s and is based upon the concatenation of $\\mathbf{s}_t$ and $\\mathbf{h}_i$, while the second and third approaches implement <i>multiplicative<\/i> attention (in contrast to Bahdanau\u2019s <i>additive<\/i> attention):<\/p>\n<ol>\n<li>$$a(\\mathbf{s}_t, \\mathbf{h}_i) = \\mathbf{v}_a^T \\tanh(\\mathbf{W}_a [\\mathbf{s}_t ; \\mathbf{h}_i])$$<\/li>\n<\/ol>\n<ol start=\"2\">\n<li>$$a(\\mathbf{s}_t, \\mathbf{h}_i) = \\mathbf{s}^T_t \\mathbf{h}_i$$<\/li>\n<\/ol>\n<ol start=\"3\">\n<li>$$a(\\mathbf{s}_t, \\mathbf{h}_i) = \\mathbf{s}^T_t \\mathbf{W}_a \\mathbf{h}_i$$<\/li>\n<\/ol>\n<p>Here, $\\mathbf{W}_a$ is a trainable weight matrix and, similarly, $\\mathbf{v}_a$ is a trainable weight vector.<\/p>\n<p>Intuitively, the use of the dot product in <i>multiplicative<\/i> attention can be interpreted as providing a similarity measure between the vectors under consideration, $\\mathbf{s}_t$ and $\\mathbf{h}_i$.<\/p>\n<blockquote>\n<p><i>\u2026 if the vectors are similar (that is, aligned), the result of the multiplication will be a large value and the attention will be focused on the current\u00a0t,i\u00a0relationship.<\/i><\/p>\n<p>\u2013 <a href=\"https:\/\/www.amazon.com\/Advanced-Deep-Learning-Python-next-generation\/dp\/178995617X\">Advanced Deep Learning with Python<\/a>, 2019.<\/p>\n<\/blockquote>\n<p>The resulting alignment vector, $\\mathbf{e}_t$, is of variable length according to the number of source words.<\/p>\n<h2><b>The Local Attentional Model<\/b><\/h2>\n<p>In attending to all source words, the global attentional model is computationally expensive and could 
potentially become impractical for translating longer sentences.<\/p>\n<p>The local attentional model seeks to address these limitations by focusing on a smaller subset of the source words to generate each target word. In order to do so, it takes inspiration from the <i>hard<\/i> and <i>soft<\/i> attention models of the image caption generation work of <a href=\"https:\/\/arxiv.org\/abs\/1502.03044\">Xu et al. (2015)<\/a>:<\/p>\n<ul>\n<li><i>Soft<\/i> attention is equivalent to the global attention approach, where weights are softly placed over all the source image patches. Hence, soft attention considers the source image in its entirety.<\/li>\n<\/ul>\n<ul>\n<li><i>Hard<\/i> attention attends to a single image patch at a time.<\/li>\n<\/ul>\n<p>The local attentional model of Luong et al. generates a context vector by computing a weighted average over the set of annotations, $\\mathbf{h}_i$, within a window centered over an aligned position, $p_t$:<\/p>\n<p>$$[p_t - D, p_t + D]$$<\/p>\n<p>While a value for $D$ is selected empirically, Luong et al. consider two approaches in computing a value for $p_t$:<\/p>\n<ol>\n<li><i>Monotonic<\/i> alignment: the source and target sentences are assumed to be monotonically aligned and, hence, $p_t = t$.<\/li>\n<\/ol>\n<ol start=\"2\">\n<li><i>Predictive<\/i> alignment: a prediction of the aligned position is based upon trainable model parameters, $\\mathbf{W}_p$ and $\\mathbf{v}_p$, and the source sentence length, $S$:<\/li>\n<\/ol>\n<p>$$p_t = S \\cdot \\text{sigmoid}(\\mathbf{v}^T_p \\tanh(\\mathbf{W}_p \\mathbf{s}_t))$$<\/p>\n<p>To favour source words nearer to the window centre, a Gaussian distribution is centered around $p_t$ when computing the alignment weights.<\/p>\n<p>This time, the resulting alignment vector, $\\mathbf{e}_t$, has a fixed length of $2D + 1$.<\/p>\n<h2><b>Comparison to the Bahdanau Attention<\/b><\/h2>\n<p>The Bahdanau model and the global attention approach of Luong et al. are mostly similar, but there are key differences between the two:<\/p>\n<blockquote>\n<p><i>While our global attention approach is similar in spirit to the model proposed by Bahdanau et al. 
(2015), there are several key differences which reflect how we have both simplified and generalized from the original model.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Effective Approaches to Attention-based Neural Machine Translation<\/a>, 2015.<\/p>\n<\/blockquote>\n<ol>\n<li>Most notably, the computation of the alignment scores, $\\mathbf{e}_t$, in the Luong global attentional model depends on the current decoder hidden state, $\\mathbf{s}_t$, rather than on the previous hidden state, $\\mathbf{s}_{t-1}$, as in the Bahdanau attention.<\/li>\n<\/ol>\n<div id=\"attachment_12963\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/luong_1.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12963\" loading=\"lazy\" class=\"wp-image-12963\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/luong_1-1024x426.png\" alt=\"\" width=\"541\" height=\"225\"><\/a><\/p>\n<p id=\"caption-attachment-12963\" class=\"wp-caption-text\">The Bahdanau Architecture (Left) vs. the Luong Architecture (Right) <br \/>Taken from \u201c<a href=\"https:\/\/www.amazon.com\/Advanced-Deep-Learning-Python-next-generation\/dp\/178995617X\">Advanced Deep Learning with Python<\/a>\u201c<\/p>\n<\/div>\n<ol start=\"2\">\n<li>Luong et al. drop the bidirectional encoder used by the Bahdanau model, and instead use the hidden states at the top LSTM layers for both the encoder and the decoder.<\/li>\n<\/ol>\n<ol start=\"3\">\n<li>The global attentional model of Luong et al. 
investigates the use of multiplicative attention as an alternative to the Bahdanau additive attention.<\/li>\n<\/ol>\n<h2><b>Summary<\/b><\/h2>\n<p>In this tutorial, you discovered the Luong attention mechanism for neural machine translation.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The operations performed by the Luong attention algorithm.<\/li>\n<li>How the global and local attentional models work.<\/li>\n<li>How the Luong attention compares to the Bahdanau attention.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>Ask your questions in the comments below and I will do my best to answer.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/machinelearningmastery.com\/the-luong-attention-mechanism\/<\/p>\n","protected":false},"author":0,"featured_media":1058,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1057"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1057"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1057\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1058"}],"wp:attachment":[{"
href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1057"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1057"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1057"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
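The global-attention computations described in the post (alignment scores, softmax weights, context vector, and attentional hidden state) can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions, not code from the original article: the function names, array shapes, and parameter layouts are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alignment_scores(s_t, H, method="dot", W_a=None, v_a=None):
    """Luong alignment scores e_t for one decoding step.

    s_t : (d,)   current decoder hidden state
    H   : (T, d) encoder annotations h_1 .. h_T
    """
    if method == "dot":        # e_{t,i} = s_t^T h_i
        return H @ s_t
    if method == "general":    # e_{t,i} = s_t^T W_a h_i
        return H @ (W_a.T @ s_t)
    if method == "concat":     # e_{t,i} = v_a^T tanh(W_a [s_t ; h_i])
        stacked = np.hstack([np.tile(s_t, (H.shape[0], 1)), H])  # (T, 2d)
        return np.tanh(stacked @ W_a.T) @ v_a
    raise ValueError(f"unknown method: {method}")

def global_attention_step(s_t, H, W_c, method="dot", **kw):
    """Scores -> softmax weights -> context vector -> attentional state."""
    e_t = alignment_scores(s_t, H, method, **kw)
    alpha_t = softmax(e_t)                          # weights in (0, 1), summing to 1
    c_t = alpha_t @ H                               # weighted sum of the annotations
    s_tilde = np.tanh(W_c @ np.hstack([c_t, s_t]))  # tanh(W_c [c_t ; s_t])
    return s_tilde, alpha_t
```

A decoder would then apply the final projection and softmax of step 7 to `s_tilde` to obtain the output distribution over the target vocabulary; a local variant would restrict `H` to the window around the aligned position and rescale the weights with a Gaussian centred there.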