{"id":925,"date":"2021-09-22T19:18:17","date_gmt":"2021-09-22T19:18:17","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/22\/the-attention-mechanism-from-scratch\/"},"modified":"2021-09-22T19:18:17","modified_gmt":"2021-09-22T19:18:17","slug":"the-attention-mechanism-from-scratch","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/22\/the-attention-mechanism-from-scratch\/","title":{"rendered":"The Attention Mechanism from Scratch"},"content":{"rendered":"<div id=\"\">\n<p id=\"last-modified-info\">Last Updated on September 20, 2021<\/p>\n<p>The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind the attention mechanism was to permit the decoder to utilize the most relevant parts of the input sequence in a flexible manner, by a weighted combination of all of the encoded input vectors, with the most relevant vectors being attributed the highest weights.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>In this tutorial, you will discover the attention mechanism and its implementation.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>How the attention mechanism uses a weighted sum of all of the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>How the attention mechanism can be generalized for tasks where the information may not necessarily be related in a sequential fashion.<\/li>\n<li>How to implement the general attention mechanism in Python with NumPy and SciPy.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<\/ul>\n<p>Let\u2019s get started.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"attachment_12857\" class=\"wp-caption aligncenter\"><a 
href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/attention_mechanism_cover-scaled.jpg\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12857\" loading=\"lazy\" class=\"wp-image-12857 size-large\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/09\/attention_mechanism_cover-1024x683.jpg\" alt=\"\" width=\"1024\" height=\"683\"><\/a>\n<p id=\"caption-attachment-12857\" class=\"wp-caption-text\">The Attention Mechanism from Scratch<br \/>Photo by <a href=\"https:\/\/unsplash.com\/photos\/RbbdzZBKRDY\">Nitish Meena<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2><b>Tutorial Overview<\/b><\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ul>\n<li>The Attention Mechanism<\/li>\n<li>The General Attention Mechanism<\/li>\n<li>The General Attention Mechanism with NumPy and SciPy<\/li>\n<\/ul>\n<h2><b>The Attention Mechanism<\/b><\/h2>\n<p>The attention mechanism was introduced by <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Bahdanau et al. (2014)<\/a> to address the bottleneck problem that arises with the use of a fixed-length encoding vector, where the decoder would have limited access to the information provided by the input. 
This is thought to become especially problematic for long and\/or complex sequences, where the dimensionality of their representation would be forced to be the same as for shorter or simpler sequences.<\/p>\n<p><a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\">We had seen<\/a> that Bahdanau et al.\u2019s <i>attention mechanism<\/i> is divided into the step-by-step computations of the <i>alignment scores<\/i>, the <i>weights<\/i>, and the <i>context vector<\/i>:<\/p>\n<ol>\n<li><b>Alignment scores<\/b>: The alignment model takes the encoded hidden states, $\\mathbf{h}_i$, and the previous decoder output, $\\mathbf{s}_{t-1}$, to compute a score, $e_{t,i}$, that indicates how well the elements of the input sequence align with the current output at position $t$. The alignment model is represented by a function, $a(.)$, which can be implemented by a feedforward neural network:<\/li>\n<\/ol>\n<p>$$e_{t,i} = a(\\mathbf{s}_{t-1}, \\mathbf{h}_i)$$<\/p>\n<ol start=\"2\">\n<li><b>Weights<\/b>: The weights, $\\alpha_{t,i}$, are computed by applying a softmax operation to the previously computed alignment scores:<\/li>\n<\/ol>\n<p>$$\\alpha_{t,i} = \\text{softmax}(e_{t,i})$$<\/p>\n<ol start=\"3\">\n<li><b>Context vector<\/b>: A unique context vector, $\\mathbf{c}_t$, is fed into the decoder at each time step. It is computed by a weighted sum of all $T$ encoder hidden states:<\/li>\n<\/ol>\n<p>$$\\mathbf{c}_t = \\sum_{i=1}^T \\alpha_{t,i} \\mathbf{h}_i$$<\/p>\n<p>Bahdanau et al. 
had implemented an RNN for both the encoder and decoder.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>However, the attention mechanism can be reformulated into a general form that can be applied to any sequence-to-sequence (abbreviated to seq2seq) task, where the information may not necessarily be related in a sequential fashion.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>In other words, the database doesn\u2019t have to consist of the hidden RNN states at different steps, but could contain any kind of information instead.<\/i><\/p>\n<p>\u2013 <a href=\"https:\/\/www.amazon.com\/Advanced-Deep-Learning-Python-next-generation\/dp\/178995617X\">Advanced Deep Learning with Python<\/a>, 2019.<\/p>\n<\/blockquote>\n<h2><b>The General Attention Mechanism<\/b><\/h2>\n<p>The general attention mechanism makes use of three main components, namely the <i>queries<\/i>, $\\mathbf{Q}$, the <i>keys<\/i>, $\\mathbf{K}$, and the <i>values<\/i>, $\\mathbf{V}$.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>If we had to compare these three components to the attention mechanism as proposed by Bahdanau et al., then the query would be analogous to the previous decoder output, $\\mathbf{s}_{t-1}$, while the values would be analogous to the encoded inputs, $\\mathbf{h}_i$. 
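<\/p>
<p>The three-step computation of Bahdanau et al.\u2019s attention described above can be made concrete with a short NumPy sketch. Everything here is a randomly initialized stand-in: the encoder hidden states, the previous decoder output, and the additive alignment model (the names <code>W_a<\/code>, <code>U_a<\/code> and <code>v<\/code> are illustrative placeholders, not values from the paper):<\/p>

```python
import numpy as np
from scipy.special import softmax

np.random.seed(42)

T, n = 4, 3                  # number of encoder steps, hidden state size
h = np.random.rand(T, n)     # stand-ins for the encoder hidden states h_i
s_prev = np.random.rand(n)   # stand-in for the previous decoder output s_{t-1}

# hypothetical additive alignment model a(.): e_i = v . tanh(W_a s_prev + U_a h_i)
W_a = np.random.rand(n, n)
U_a = np.random.rand(n, n)
v = np.random.rand(n)

# 1. alignment scores
e = np.array([v @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in h])

# 2. weights via a softmax over the scores
alpha = softmax(e)

# 3. context vector: weighted sum of all T hidden states
c = alpha @ h

print(alpha)  # the weights sum to 1
print(c)      # a vector of the same size as each hidden state
```

<p>Note that each hidden state enters the computation twice: once when it is scored against the previous decoder output, and once in the weighted sum that forms the context vector.<\/p>
<p>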
In the Bahdanau attention mechanism, the keys and values are the same vector.<\/p>\n<blockquote>\n<p><i>In this case,\u00a0we can think of the vector\u00a0$\\mathbf{s}_{t-1}$\u00a0as a\u00a0query\u00a0executed against a database of key-value pairs, where\u00a0the keys are vectors and\u00a0the hidden states\u00a0$\\mathbf{h}_i$\u00a0are the values.<\/i><\/p>\n<p>\u2013 <a href=\"https:\/\/www.amazon.com\/Advanced-Deep-Learning-Python-next-generation\/dp\/178995617X\">Advanced Deep Learning with Python<\/a>, 2019.<\/p>\n<\/blockquote>\n<p>The general attention mechanism then performs the following computations:<\/p>\n<ol>\n<li>Each query vector, $\\mathbf{q} = \\mathbf{s}_{t-1}$, is matched against a database of keys to compute a score value. This matching operation is computed as the dot product of the specific query under consideration with each key vector, $\\mathbf{k}_i$:<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<\/ol>\n<p>$$e_{\\mathbf{q},\\mathbf{k}_i} = \\mathbf{q} \\cdot \\mathbf{k}_i$$<\/p>\n<ol start=\"2\">\n<li>The scores are passed through a softmax operation to generate the weights:<\/li>\n<\/ol>\n<p>$$\\alpha_{\\mathbf{q},\\mathbf{k}_i} = \\text{softmax}(e_{\\mathbf{q},\\mathbf{k}_i})$$<\/p>\n<ol start=\"3\">\n<li>The generalized attention is then computed by a weighted sum of the value vectors, $\\mathbf{v}_{\\mathbf{k}_i}$, where each value vector is paired with a corresponding key:<\/li>\n<\/ol>\n<p>$$\\text{attention}(\\mathbf{q}, \\mathbf{K}, \\mathbf{V}) = \\sum_i \\alpha_{\\mathbf{q},\\mathbf{k}_i} \\mathbf{v}_{\\mathbf{k}_i}$$<\/p>\n<p>Within the context of machine translation, each word in an input sentence would be attributed its own query, key and value vectors. 
These vectors are generated by multiplying the encoder\u2019s representation of the specific word under consideration, with three different weight matrices that would have been generated during training.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>In essence, when the generalized attention mechanism is presented with a sequence of words, it takes the query vector attributed to some specific word in the sequence and scores it against each key in the database. In doing so, it captures how the word under consideration relates to the others in the sequence. Then it scales the values according to the attention weights (computed from the scores), in order to retain focus on those words that are relevant to the query. In doing so, it produces an attention output for the word under consideration.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<h2><b>The General Attention Mechanism with NumPy and SciPy<\/b><\/h2>\n<p>In this section, we will explore how to implement the general attention mechanism using the NumPy and SciPy libraries in Python.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>For simplicity, we will initially calculate the attention for the first word in a sequence of four. We will then generalize the code to calculate an attention output for all four words in matrix form.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>Hence, let\u2019s start by first defining the word embeddings of the four different words for which we will be calculating the attention. 
In actual practice, these word embeddings would have been generated by an encoder; for this particular example, however, we shall define them manually.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"urvanov-syntax-highlighter-614b25b892e78367461896\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\nfrom numpy import array<br \/>\n<br \/>\n# encoder representations of four different words<br \/>\nword_1 = array([1, 0, 0])<br \/>\nword_2 = array([0, 1, 0])<br \/>\nword_3 = array([1, 1, 0])<br \/>\nword_4 = array([0, 0, 1])<\/textarea><\/p>\n<\/div>\n<p>The next step generates the weight matrices, which we will eventually multiply with the word embeddings to generate the queries, keys and values. 
Here, we shall generate these weight matrices randomly; in actual practice, however, they would have been learned during training.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"urvanov-syntax-highlighter-614b25b892e7e437954221\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br \/>\n# generating the weight matrices<br \/>\nrandom.seed(42) # to allow us to reproduce the same attention values<br \/>\nW_Q = random.randint(3, size=(3, 3))<br \/>\nW_K = random.randint(3, size=(3, 3))<br \/>\nW_V = random.randint(3, size=(3, 3))<\/textarea><\/p>\n<\/div>\n<p>Notice how the number of rows of each of these matrices is equal to the dimensionality of the word embeddings (which in 
this case is three) to allow us to perform the matrix multiplication.<\/p>\n<p>Subsequently, the query, key and value vectors for each word are generated by multiplying each word embedding by each of the weight matrices.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"urvanov-syntax-highlighter-614b25b892e80530085590\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br \/>\n# generating the queries, keys and values<br \/>\nquery_1 = word_1 @ W_Q<br \/>\nkey_1 = word_1 @ W_K<br \/>\nvalue_1 = word_1 @ W_V<\/p>\n<p>query_2 = word_2 @ W_Q<br \/>\nkey_2 = word_2 @ W_K<br \/>\nvalue_2 = word_2 @ W_V<\/p>\n<p>query_3 = word_3 @ W_Q<br \/>\nkey_3 = word_3 @ W_K<br \/>\nvalue_3 = word_3 @ W_V<\/p>\n<p>query_4 = word_4 @ W_Q<br \/>\nkey_4 = word_4 @ W_K<br \/>\nvalue_4 = word_4 @ W_V<\/textarea><\/p>\n<\/div>\n<p>Considering only the first word for the time being, the next step scores its query vector against all of the key vectors using a dot product operation.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"urvanov-syntax-highlighter-614b25b892e81733654911\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br \/>\n# scoring the first query vector against all key vectors<br \/>\nscores = array([dot(query_1, key_1), dot(query_1, key_2), dot(query_1, key_3), dot(query_1, key_4)])<\/textarea><\/p>\n<\/div>\n<p>The score values are subsequently passed through a softmax operation to generate the weights. 
Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, three), to keep the gradients stable.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"urvanov-syntax-highlighter-614b25b892e82303552611\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br \/>\n# computing the weights by a softmax operation<br \/>\nweights = softmax(scores \/ key_1.shape[0] ** 0.5)<\/textarea><\/p>\n<\/div>\n<p>Finally, the attention output is calculated by a weighted sum of all four value vectors.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"urvanov-syntax-highlighter-614b25b892e83250089400\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br \/>\n# computing the attention by a weighted sum of the value vectors<br \/>\nattention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)<\/p>\n<p>print(attention)<\/textarea><\/p>\n<\/div>\n<div id=\"urvanov-syntax-highlighter-614b25b892e84613921201\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n[0.98522025 1.74174051 0.75652026]<\/textarea><\/p>\n<\/div>\n<p>For faster processing, the same calculations can be implemented in matrix form to generate an attention output for all four words in one go:<\/p>\n<div id=\"urvanov-syntax-highlighter-614b25b892e85983730654\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\nfrom numpy import array<br \/>\nfrom numpy import random<br \/>\nfrom numpy import dot<br \/>\nfrom scipy.special import softmax<\/p>\n<p># encoder representations of four different words<br \/>\nword_1 = array([1, 0, 0])<br \/>\nword_2 = array([0, 1, 0])<br \/>\nword_3 = array([1, 1, 0])<br \/>\nword_4 = array([0, 0, 1])<\/p>\n<p># stacking the word embeddings into a single array<br \/>\nwords = array([word_1, word_2, word_3, word_4])<\/p>\n<p># generating the weight matrices<br \/>\nrandom.seed(42)<br \/>\nW_Q = random.randint(3, size=(3, 3))<br \/>\nW_K = random.randint(3, size=(3, 3))<br \/>\nW_V = random.randint(3, size=(3, 3))<\/p>\n<p># generating the queries, keys and values<br \/>\nQ = words @ W_Q<br \/>\nK = words @ W_K<br \/>\nV = words @ W_V<\/p>\n<p># scoring the query vectors against all key vectors<br \/>\nscores = Q @ K.transpose()<\/p>\n<p># computing the weights by a softmax operation<br \/>\nweights = softmax(scores \/ K.shape[1] ** 0.5, axis=1)<\/p>\n<p># computing the attention by a weighted sum of the value vectors<br \/>\nattention = weights @ 
V<\/p>\n<p>print(attention)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-t\">array<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">random<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">dot<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">scipy<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">special <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">softmax<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># encoder representations of four different words<\/span><\/p>\n<p><span class=\"crayon-v\">word_1<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">array<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">word_2<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">array<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">word_3<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">array<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">word_4<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">array<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># stacking the word 
embeddings into a single array<\/span><\/p>\n<p><span class=\"crayon-v\">words<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-t\">array<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">word_1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">word_2<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">word_3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">word_4<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># generating the weight matrices<\/span><\/p>\n<p><span class=\"crayon-v\">random<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">seed<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">42<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">W_Q<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">randint<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">size<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">W_K<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-e\">randint<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">size<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">W_V<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">randint<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">size<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># generating the queries, keys and values<\/span><\/p>\n<p><span class=\"crayon-v\">Q<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">words<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">@<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">W<\/span><span class=\"crayon-sy\">_<\/span>Q<\/p>\n<p><span class=\"crayon-v\">K<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">words<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">@<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">W<\/span><span 
class=\"crayon-sy\">_<\/span>K<\/p>\n<p><span class=\"crayon-v\">V<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">words<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">@<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">W<\/span><span class=\"crayon-sy\">_<\/span>V<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># scoring the query vectors against all key vectors<\/span><\/p>\n<p><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">Q<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">@<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">K<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">transpose<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># computing the weights by a softmax operation<\/span><\/p>\n<p><span class=\"crayon-v\">weights<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">softmax<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">K<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0.5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">axis<\/span><span class=\"crayon-o\">=<\/span><span 
class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># computing the attention by a weighted sum of the value vectors<\/span><\/p>\n<p><span class=\"crayon-v\">attention<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">weights<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">@<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">V<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">attention<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<div id=\"urvanov-syntax-highlighter-614b25b892e86234566255\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-mac print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n[[0.98522025 1.74174051 0.75652026]<br \/>\n [0.90965265 1.40965265 0.5       ]<br \/>\n [0.99851226 1.75849334 0.75998108]<br \/>\n [0.99560386 1.90407309 0.90846923]]<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>[[0.98522025 1.74174051 0.75652026]<\/p>\n<p> [0.90965265 1.40965265 0.5\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ]<\/p>\n<p> [0.99851226 1.75849334 0.75998108]<\/p>\n<p> [0.99560386 1.90407309 0.90846923]]<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<h2><b>Further Reading<\/b><\/h2>\n<p>This section provides more resources on the topic if you are looking to go 
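As a sketch of how the steps above could be packaged for reuse, the scoring, scaling, softmax, and weighted-sum steps can be wrapped into a single function. Note that the function name below is my own choice (the tutorial does not define one); under the same seed and toy inputs it reproduces the matrix result above:

```python
import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    # score each query against every key, scaled by the square root of the key dimension
    scores = Q @ K.transpose() / np.sqrt(K.shape[1])
    # softmax turns each row of scores into weights that sum to 1
    weights = softmax(scores, axis=1)
    # the attention output is a weighted sum of the value vectors
    return weights @ V

# same toy setup as in the matrix example above
words = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

attention = scaled_dot_product_attention(words @ W_Q, words @ W_K, words @ W_V)
print(attention)  # matches the matrix output above
```

Keeping the queries, keys, and values as separate arguments means the same function also covers the general (non-self-attention) case, where the queries come from a different sequence than the keys and values.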
## Summary

In this tutorial, you discovered the attention mechanism and its implementation.

Specifically, you learned:

- How the attention mechanism uses a weighted sum of all of the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence.
- How the attention mechanism can be generalized for tasks where the information may not necessarily be related in a sequential fashion.
- How to implement the general attention mechanism with NumPy and SciPy.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.