{"id":881,"date":"2021-09-17T06:57:13","date_gmt":"2021-09-17T06:57:13","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/17\/a-birds-eye-view-of-research-on-attention\/"},"modified":"2021-09-17T06:57:13","modified_gmt":"2021-09-17T06:57:13","slug":"a-birds-eye-view-of-research-on-attention","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/17\/a-birds-eye-view-of-research-on-attention\/","title":{"rendered":"A Bird\u2019s Eye View of Research on Attention"},"content":{"rendered":"<div id=\"\">\n<p id=\"last-modified-info\">Last Updated on September 9, 2021<\/p>\n<p>Attention is a concept that is scientifically studied across multiple disciplines, including psychology, neuroscience and, more recently, machine learning. While all disciplines may have produced their own definitions for attention, there is one core quality they can all agree on: attention is a mechanism for making both biological and artificial neural systems more flexible.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>In this tutorial, you will discover an overview of the research advances on attention.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>The concept of attention that is of significance to different scientific disciplines.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>How attention is revolutionizing machine learning, specifically in the domains of natural language processing and computer vision.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<div id=\"attachment_12826\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_cover-scaled.jpg\"><img aria-describedby=\"caption-attachment-12826\" loading=\"lazy\" class=\"wp-image-12826 size-large\" 
data-cfsrc=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_cover-1024x683.jpg\" alt=\"\" width=\"1024\" height=\"683\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12826\" loading=\"lazy\" class=\"wp-image-12826 size-large\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_cover-1024x683.jpg\" alt=\"\" width=\"1024\" height=\"683\"><\/a><\/p>\n<p id=\"caption-attachment-12826\" class=\"wp-caption-text\">A Bird\u2019s Eye View of Research on Attention<br \/>Photo by <a href=\"https:\/\/unsplash.com\/photos\/6tfO1M8_gas\">Chris Lawton<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2><b>Tutorial Overview<\/b><\/h2>\n<p>This tutorial is divided into two parts; they are:<\/p>\n<ul>\n<li>The Concept of Attention<\/li>\n<li>Attention in Machine Learning\n<ul>\n<li>Attention in Natural Language Processing<\/li>\n<li>Attention in Computer Vision<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2><b>The Concept of Attention<\/b><\/h2>\n<p>Research on attention finds its origin in the field of psychology.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<blockquote>\n<p><i>The scientific study of attention began in psychology, where careful behavioral experimentation can give rise to precise demonstrations of the tendencies and abilities of attention in different circumstances.<span class=\"Apple-converted-space\">\u00a0<\/span><\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine Learning<\/a>, 2020.<\/p>\n<\/blockquote>\n<p>Observations derived from such studies could help researchers infer the mental processes underlying such behavioral patterns.<span class=\"Apple-converted-space\">\u00a0<\/span><\/p>\n<p>While the different fields of psychology, neuroscience and, more recently, machine learning, have all produced their own definitions of attention, there is one core 
quality that is of great significance to all:<\/p>\n<blockquote>\n<p><i>Attention is the flexible control of limited computational resources.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine Learning<\/a>, 2020.<\/p>\n<\/blockquote>\n<p>With this in mind, the following sections review the role of attention in revolutionizing the field of machine learning.<\/p>\n<h2><b>Attention in Machine Learning<\/b><\/h2>\n<p>The concept of attention in machine learning is <i>very<\/i> loosely inspired by the psychological mechanisms of attention in the human brain.<\/p>\n<blockquote>\n<p><i>The use of attention mechanisms in artificial neural networks came about \u2014 much like the apparent need for attention in the brain \u2014 as a means of making neural systems more flexible.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine Learning<\/a>, 2020.<\/p>\n<\/blockquote>\n<p>The idea is to be able to work with an artificial neural network that can perform well on tasks where the input may be of variable length, size or structure, or even handle several different tasks. 
It is in this spirit that attention mechanisms in machine learning are said to draw their inspiration from psychology, rather than to replicate the biology of the human brain.<\/p>\n<blockquote>\n<p><i>In the form of attention originally developed for ANNs, attention mechanisms worked within an encoder-decoder framework and in the context of sequence models \u2026<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine Learning<\/a>, 2020.<\/p>\n<\/blockquote>\n<p>The task of the <a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\">encoder<\/a> is to generate a vector representation of the input, whereas the task of the <a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\">decoder<\/a> is to transform this vector representation into an output. The attention mechanism connects the two.<\/p>\n<p>Various neural network architectures have been proposed to implement attention mechanisms, each tied to the specific applications in which it finds its use. Natural Language Processing (NLP) and computer vision are among the most popular applications.<\/p>\n<h3><b>Attention in Natural Language Processing<\/b><\/h3>\n<p>An early application of attention in NLP was machine translation, where the goal is to translate an input sentence in a source language into an output sentence in a target language. Within this context, the encoder would generate a set of <i>context<\/i> vectors, one for each word in the source sentence. 
The decoder, on the other hand, would read the context vectors to generate an output sentence in the target language, one word at a time.<\/p>\n<blockquote>\n<p><i>In the traditional encoder-decoder framework without attention, the encoder produced a fixed-length vector that was independent of the length or features of the input and static during the course of decoding.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine Learning<\/a>, 2020.<\/p>\n<\/blockquote>\n<p>Representing the input by a fixed-length vector was especially problematic for long sequences or sequences that were complex in structure, since the dimensionality of their representation was forced to be the same as for shorter or simpler sequences.<\/p>\n<blockquote>\n<p><i>For example, in some languages, such as Japanese, the last word might be very important to predict the first word, while translating English to French might be easier as the order of the sentences (how the sentence is organized) is more similar to each other.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine Learning<\/a>, 2020.<\/p>\n<\/blockquote>\n<p>This created a bottleneck, whereby the decoder had limited access to the information provided by the input \u2013 only that which is available within the fixed-length encoding vector. 
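The bottleneck is easy to see in terms of array shapes. The sketch below (illustrative names and sizes, not tied to any particular library) contrasts the single fixed-length vector that a traditional encoder hands over with the full sequence of per-word states that an attention-based decoder can draw on:

```python
import numpy as np

# Schematic shapes only; names and sizes are illustrative.
T_src, d = 12, 8                         # source length, hidden size
annotations = np.random.randn(T_src, d)  # one encoder hidden state per source word

# Without attention: the decoder sees only one fixed-length vector,
# regardless of how long or complex the source sentence is.
fixed_context = annotations[-1]          # shape (8,)

# With attention: the decoder keeps access to all T_src states and can
# re-weight them at every decoding step.
print(fixed_context.shape, annotations.shape)  # (8,) (12, 8)
```

Note that the fixed vector has the same size whether the source sentence has 12 words or 120, which is exactly the problem described above.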
On the other hand, preserving the length of the input sequence during the encoding process could make it possible for the decoder to utilize its most relevant parts in a flexible manner.<\/p>\n<p>The latter is how the attention mechanism operates.<\/p>\n<blockquote>\n<p><i>Attention helps determine which of these vectors should be used to generate the output. Because the output sequence is dynamically generated one element at a time, attention can dynamically highlight different encoded vectors at each time point. This allows the decoder to flexibly utilize the most relevant parts of the input sequence.<\/i><\/p>\n<p>\u2013 Page 186, <a href=\"https:\/\/www.amazon.com\/Deep-Learning-Essentials-hands-fundamentals\/dp\/1785880365\">Deep Learning Essentials<\/a>, 2018.<\/p>\n<\/blockquote>\n<p>One of the earliest works in machine translation that sought to address the bottleneck problem created by fixed-length vectors was by <a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Bahdanau et al. (2014)<\/a>. In their work, Bahdanau et al. employed Recurrent Neural Networks (RNNs) for both the encoding and decoding tasks: the encoder employs a bi-directional RNN to generate a sequence of <i>annotations<\/i>, each containing a summary of both preceding and succeeding words, which can be mapped into a <i>context<\/i> vector through a weighted sum; the decoder then generates an output based on these annotations and the hidden states of another RNN. 
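The weighted sum at the heart of this mechanism can be sketched in a few lines of NumPy. This is a minimal illustration of additive (Bahdanau-style) attention, with random matrices standing in for the learned parameters and toy dimensions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dimensions; W, U, v are random stand-ins for learned parameters.
T_src, d = 6, 8
h = np.random.randn(T_src, d)   # encoder annotations (bidirectional RNN states)
s_prev = np.random.randn(d)     # previous decoder hidden state
W, U = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)

# Additive alignment scores: e_i = v . tanh(W s_prev + U h_i)
scores = np.tanh(s_prev @ W + h @ U) @ v   # shape (T_src,)
alpha = softmax(scores)                    # attention weights, sum to 1
context = alpha @ h                        # weighted sum of annotations, shape (d,)
```

Because `alpha` depends on the previous decoder state, the context vector is recomputed at every decoding step, which is what lets the decoder highlight different parts of the source sentence as the translation progresses.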
Since the context vector is computed by a weighted sum of the annotations, Bahdanau et al.\u2019s attention mechanism is an example of <a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\"><i>soft attention<\/i><\/a>.<\/p>\n<p>Another of the earliest works was by <a href=\"https:\/\/arxiv.org\/abs\/1409.3215\">Sutskever et al. (2014)<\/a>, who instead made use of a multilayered Long Short-Term Memory (LSTM) network to encode a vector representing the input sequence, and another LSTM to decode the vector into a target sequence.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Luong et al. (2015)<\/a> introduced the idea of <i>global<\/i> versus <i>local<\/i> attention. They described a global attention model as one that considers all the hidden states of the encoder when deriving the context vector. The computation of the global context vector is, therefore, based upon a weighted average of <i>all<\/i> the words in the source sequence. Luong et al. note that this is computationally expensive and could make global attention difficult to apply to long sequences. Local attention addresses this problem by focusing on a smaller subset of the words in the source sequence per target word. Luong et al. explain that local attention trades off the <a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\"><i>soft<\/i><\/a> and <a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\"><i>hard<\/i><\/a> attentional models of <a href=\"https:\/\/arxiv.org\/abs\/1502.03044\">Xu et al. 
(2016)<\/a> (we will refer to this paper again in the next section), by being less computationally expensive than soft attention but easier to train than hard attention.<\/p>\n<p>More recently, <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Vaswani et al. (2017)<\/a> proposed an entirely different architecture that has steered the field of machine translation in a new direction. Named the <i>Transformer<\/i>, their architecture dispenses with recurrence and convolutions entirely, implementing a <i>self-attention<\/i> mechanism instead. Words in the source sequence are first encoded in parallel to generate key, query and value representations. The keys and queries are combined to generate attention weightings that capture how each word relates to the others in the sequence. These attention weightings are then used to scale the values, in order to retain focus on the important words and drown out the irrelevant ones.<\/p>\n<blockquote>\n<p><i>The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.<\/i><\/p>\n<p>\u2013 <a href=\"https:\/\/arxiv.org\/pdf\/1706.03762.pdf\">Attention Is All You Need<\/a>, 2017.<\/p>\n<\/blockquote>\n<div id=\"attachment_12821\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12821\" loading=\"lazy\" class=\"wp-image-12821\" 
src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_1-727x1024.png\" alt=\"\" width=\"366\" height=\"516\"><\/a><\/p>\n<p id=\"caption-attachment-12821\" class=\"wp-caption-text\">The Transformer Architecture <br \/>Taken from \u201cAttention Is All You Need\u201d<\/p>\n<\/div>\n<p>At the time, the proposed Transformer architecture established a new state-of-the-art on English-to-German and English-to-French translation tasks, and was reportedly also faster to train than architectures based on recurrent or convolutional layers. Subsequently, BERT, by <a href=\"https:\/\/arxiv.org\/abs\/1810.04805\">Devlin et al. (2019)<\/a>, built on Vaswani et al.\u2019s work by proposing a multi-layer bi-directional architecture.<\/p>\n<p>As we shall see shortly, the uptake of the Transformer architecture was rapid not only in the domain of NLP, but in the computer vision domain too.<\/p>\n<h3><b>Attention in Computer Vision<\/b><\/h3>\n<p>In computer vision, attention has found its way into several applications, such as the domains of image classification, image segmentation and image captioning.<\/p>\n<p>If we reframe the encoder-decoder model for the task of image captioning, as an example, then the encoder can be a Convolutional Neural Network (CNN) that captures the salient visual cues in the image into a vector representation, whereas the decoder can be an RNN or LSTM that transforms the vector representation into an output.<\/p>\n<blockquote>\n<p><i>Also, as in the neuroscience literature, these attentional processes can be divided into spatial and feature-based attention.<\/i><\/p>\n<p><i>\u2013 <\/i><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine 
Learning<\/a>, 2020.<\/p>\n<\/blockquote>\n<p>In <i>spatial<\/i> attention, different spatial locations are assigned different weights; however, the same weights are shared across all feature channels at each spatial location.<\/p>\n<p>One of the fundamental image captioning approaches employing spatial attention was proposed by <a href=\"https:\/\/arxiv.org\/abs\/1502.03044\">Xu et al. (2016)<\/a>. Their model incorporates a CNN as an encoder that extracts a set of feature vectors (or <i>annotation<\/i> vectors), with each vector corresponding to a different part of the image, allowing the decoder to focus selectively on specific image parts. The decoder is an LSTM that generates a caption based on a context vector, the previous hidden state, and the previously generated words. Xu et al. investigate the use of <a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\"><i>hard attention<\/i><\/a> as an alternative to <a href=\"https:\/\/machinelearningmastery.com\/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks\/\">soft attention<\/a> in computing their context vector. 
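The contrast between the two variants of the context vector can be sketched as follows, with toy dimensions and random scores standing in for the model's learned alignment scores over image patches:

```python
import numpy as np

rng = np.random.default_rng(0)

n_patches, d = 14 * 14, 32
a = rng.standard_normal((n_patches, d))  # one annotation vector per image patch
scores = rng.standard_normal(n_patches)  # toy alignment scores

alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                     # attention weights over patches

# Soft attention: deterministic weighted sum over ALL patches (differentiable).
z_soft = alpha @ a

# Hard attention: sample a SINGLE patch index from the weights (stochastic,
# trained with techniques such as REINFORCE rather than plain backpropagation).
idx = rng.choice(n_patches, p=alpha)
z_hard = a[idx]
```

The soft variant is differentiable end to end, while the hard variant's discrete sampling step is what makes it harder to train.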
Here, soft attention places weights <em>softly<\/em> on all patches of the source image, whereas hard attention attends to a single patch alone while disregarding the rest. They report that, in their work, hard attention performs better.<\/p>\n<div id=\"attachment_12822\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_2.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12822\" loading=\"lazy\" class=\"wp-image-12822\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_2-1024x426.png\" alt=\"\" width=\"522\" height=\"217\"><\/a><\/p>\n<p id=\"caption-attachment-12822\" class=\"wp-caption-text\">Model for Image Caption Generation <br \/>Taken from \u201cShow, Attend and Tell: Neural Image Caption Generation with Visual Attention\u201d<\/p>\n<\/div>\n<p><i>Feature<\/i> attention, in comparison, permits individual feature maps to be assigned their own weights. One such example, also applied to image captioning, is the encoder-decoder framework of <a href=\"https:\/\/openaccess.thecvf.com\/content_cvpr_2017\/papers\/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.pdf\">Chen et al. 
(2018)<\/a>, which incorporates both spatial and channel-wise attention in the same CNN.<\/p>\n<p>Just as the Transformer has quickly become the standard architecture for NLP tasks, it has also recently been taken up and adapted by the computer vision community.<\/p>\n<p>The earliest work to do so was proposed by <a href=\"https:\/\/arxiv.org\/abs\/2010.11929\">Dosovitskiy et al. (2020)<\/a>, who applied their <i>Vision Transformer<\/i> (ViT) to an image classification task. They argued that the long-standing reliance on CNNs for image classification was not necessary, and that the same task could be accomplished by a pure transformer. Dosovitskiy et al. reshape an input image into a sequence of flattened 2D image patches, which they subsequently embed with a trainable linear projection to generate the <i>patch embeddings<\/i>. These patch embeddings, together with <i>position embeddings<\/i> that retain positional information, are fed into the encoder part of the Transformer architecture, whose output is subsequently fed into a Multilayer Perceptron (MLP) for classification.<\/p>\n<div id=\"attachment_12823\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_3.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12823\" loading=\"lazy\" class=\"wp-image-12823\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_3-1024x543.png\" alt=\"\" width=\"508\" height=\"269\"><\/a><\/p>\n<p id=\"caption-attachment-12823\" class=\"wp-caption-text\">The Vision Transformer Architecture <br 
\/>Taken from \u201cAn Image is Worth 16\u00d716 Words: Transformers for Image Recognition at Scale\u201d<\/p>\n<\/div>\n<blockquote>\n<p><em>Inspired by ViT, and the fact that attention-based architectures are an intuitive choice for modelling long-range contextual relationships in video, we develop several transformer-based models for video classification.<\/em><\/p>\n<p>\u2013 <a href=\"https:\/\/arxiv.org\/abs\/2103.15691\">ViViT: A Video Vision Transformer<\/a>, 2021.<\/p>\n<\/blockquote>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2103.15691\">Arnab et al. (2021)<\/a> subsequently extended the ViT model to ViViT, which exploits the spatiotemporal information contained within videos for the task of video classification. Their method explores different approaches to extracting the spatiotemporal information, such as sampling and embedding each frame independently, or extracting non-overlapping tubelets (an image patch that spans several frames, creating a <i>tube<\/i>) and embedding each one in turn. 
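The patch-embedding step that ViT introduced, and that ViViT generalizes to tubelets, can be sketched as follows. The image size, patch size, and random projection matrix below are illustrative stand-ins for the learned components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: a 32x32 RGB image cut into 16x16 patches, embedded in 64 dimensions.
H = W = 32
C, P, d = 3, 16, 64
img = rng.standard_normal((H, W, C))

# Reshape the image into a sequence of flattened P x P patches.
n = (H // P) * (W // P)                              # 4 patches
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(n, P * P * C)              # (4, 768)

E = rng.standard_normal((P * P * C, d))              # trainable linear projection
pos = rng.standard_normal((n, d))                    # learned position embeddings
tokens = patches @ E + pos                           # (4, 64): input to the encoder
```

A ViViT tubelet embedding works the same way, except each flattened patch additionally spans a temporal extent of several frames before projection.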
They also investigate different methods of factorising the spatial and temporal dimensions of the input video for increased efficiency and scalability.<\/p>\n<div id=\"attachment_12824\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_4.png\"><img decoding=\"async\" aria-describedby=\"caption-attachment-12824\" loading=\"lazy\" class=\"wp-image-12824\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/08\/attention_research_4-1024x338.png\" alt=\"\" width=\"936\" height=\"309\"><\/a><\/p>\n<p id=\"caption-attachment-12824\" class=\"wp-caption-text\">The Video Vision Transformer Architecture <br \/>Taken from \u201cViViT: A Video Vision Transformer\u201d<\/p>\n<\/div>\n<p>Beyond its first application to image classification, the Vision Transformer is already being applied to several other computer vision domains, such as <a href=\"https:\/\/arxiv.org\/abs\/2106.08061\">action localization<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2105.14424\">gaze estimation<\/a>, and <a href=\"https:\/\/arxiv.org\/abs\/2107.04589\">image generation<\/a>. 
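At the core of every Transformer variant mentioned above is the same scaled dot-product self-attention. A minimal sketch, with random matrices standing in for the learned query, key and value projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 16, 8                 # sequence length, model and key sizes
X = rng.standard_normal((T, d_model))      # token (word or patch) embeddings
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T): each row sums to 1
out = weights @ V                          # weighted sum of values, (T, d_k)
```

Each row of `weights` expresses how strongly one token attends to every other token, which is the "compatibility function" of queries with keys quoted from the paper above.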
This surge of interest among computer vision practitioners suggests an exciting near future, where we\u2019ll be seeing more adaptations and applications of the Transformer architecture.<\/p>\n<h2><b>Further Reading<\/b><\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3><b>Papers<\/b><\/h3>\n<ul>\n<li><a href=\"https:\/\/www.frontiersin.org\/articles\/10.3389\/fncom.2020.00029\/full\">Attention in Psychology, Neuroscience, and Machine Learning<\/a>, 2020.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1409.0473\">Neural Machine Translation by Jointly Learning to Align and Translate<\/a>, 2014.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1409.3215\">Sequence to Sequence Learning with Neural Networks<\/a>, 2014.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1508.04025\">Effective Approaches to Attention-based Neural Machine Translation<\/a>, 2015.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention Is All You Need<\/a>, 2017.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding<\/a>, 2019.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1502.03044\">Show, Attend and Tell: Neural Image Caption Generation with Visual Attention<\/a>, 2016.<\/li>\n<li><a href=\"https:\/\/openaccess.thecvf.com\/content_cvpr_2017\/papers\/Chen_SCA-CNN_Spatial_and_CVPR_2017_paper.pdf\">SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning<\/a>, 2018.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2010.11929\">An Image is Worth 16\u00d716 Words: Transformers for Image Recognition at Scale<\/a>, 2020.<\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2103.15691\">ViViT: A Video Vision Transformer<\/a>, 2021.<\/li>\n<\/ul>\n<h2><b>Summary<\/b><\/h2>\n<p>In this tutorial, you discovered 
an overview of the research advances on attention.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>The concept of attention that is of significance to different scientific disciplines.<span class=\"Apple-converted-space\">\u00a0<\/span><\/li>\n<li>How attention is revolutionizing machine learning, specifically in the domains of natural language processing and computer vision.<\/li>\n<\/ul>\n<p>Do you have any questions?<br \/>Ask your questions in the comments below and I will do my best to answer.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/machinelearningmastery.com\/a-birds-eye-view-of-research-on-attention\/<\/p>\n","protected":false},"author":0,"featured_media":882,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/881"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=881"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/881\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/882"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=881"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=881"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=881"}],"curies":[{"name":"wp","href":"https:\/\/api.w.or
g\/{rel}","templated":true}]}}