{"id":304,"date":"2020-09-29T13:04:25","date_gmt":"2020-09-29T13:04:25","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/09\/29\/bert-inference-on-g4-instances-using-apache-mxnet-and-gluonnlp-1-million-requests-for-20-cents\/"},"modified":"2020-09-29T13:04:25","modified_gmt":"2020-09-29T13:04:25","slug":"bert-inference-on-g4-instances-using-apache-mxnet-and-gluonnlp-1-million-requests-for-20-cents","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/09\/29\/bert-inference-on-g4-instances-using-apache-mxnet-and-gluonnlp-1-million-requests-for-20-cents\/","title":{"rendered":"BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents"},"content":{"rendered":"<div id=\"\">\n<p>Bidirectional Encoder Representations from Transformers (BERT) [1] has become one of the most popular models for natural language processing (NLP) applications. BERT can outperform other models in several NLP tasks, including question answering and sentence classification.<\/p>\n<p>Training the BERT model on large datasets is expensive and time consuming, and achieving low latency when performing inference on this model is challenging. Latency and throughput are key factors when deploying a model in production. In this post, we focus on optimizing these factors for BERT inference tasks. We also compare the cost of deploying BERT on different <a href=\"http:\/\/aws.amazon.com\/ec2\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Compute Cloud<\/a> (Amazon EC2) instances.<\/p>\n<p>When running inference on the BERT-base model, the g4dn.xlarge GPU instance achieves between 2.6\u20135 times lower latency (3.8 on average) than a c5.24xlarge CPU instance. The g4dn.xlarge instance is also the most cost-effective (lowest cost per request) compared to the c5.xlarge, c5.24xlarge, and m5.xlarge CPU instances. 
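<\/p>\n<p>These cost-per-request figures follow directly from each instance\u2019s hourly on-demand price and its sustained throughput: dollars per 1 million requests = hourly price \/ (throughput in sentences per second x 3,600) x 1,000,000. A minimal sketch of this arithmetic (the throughput value below is an illustrative placeholder, not a measurement from this post):<\/p>

```python
# Dollars per 1 million inference requests, from the on-demand hourly
# price and the sustained throughput (sentences/second).
def cost_per_million(price_per_hour, sentences_per_second):
    hours_for_1m = 1_000_000 / sentences_per_second / 3600
    return price_per_hour * hours_for_1m

# $0.526/hour (g4dn.xlarge, US East) at an assumed ~730 sentences/s
print(round(cost_per_million(0.526, 730), 2))  # -> 0.2
```

<p>A sustained throughput of roughly 730 sentences per second at $0.526 per hour is what a 20-cent-per-million result implies. <\/p>\n<p>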
Specifically, the cost of processing 1 million BERT-inference requests with sequence length 128 is $0.20 on g4dn.xlarge, whereas on c5.xlarge (the best of these CPU instances), the cost is $3.31\u2014the GPU instance is 16.5 times more cost-efficient.<\/p>\n<p>We achieved these results after a set of GPU optimizations on MXNet, described in the section <strong>Optimizing BERT model performance on MXNet 1.6 and 1.7<\/strong> of this post.<\/p>\n<h2>Amazon EC2 G4 instances<\/h2>\n<p>G4 instances are optimized for machine learning application deployments. They\u2019re equipped with <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/tesla-t4\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA T4 GPUs<\/a>, powered by <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/tensorcore\" target=\"_blank\" rel=\"noopener noreferrer\">Tensor Cores<\/a>, and deliver groundbreaking AI performance: up to 65 TFLOPS in FP16 precision and up to 130 TOPS in INT8 precision.<\/p>\n<p>Amazon EC2 offers a variety of G4 instances with one or multiple GPUs, and with different amounts of vCPU and memory. You can perform BERT inference in under 5 milliseconds on a single T4 GPU with 16 GB, such as on a g4dn.xlarge instance (the cost of this instance at the time of writing is $0.526 per hour on demand in the US East (N. Virginia) Region).<\/p>\n<p>For more information about G4 instances, see <a href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/g4\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EC2 G4 Instances<\/a>.<\/p>\n<h2>GluonNLP and MXNet<\/h2>\n<p>GluonNLP is a deep learning framework built on top of <a href=\"https:\/\/mxnet.apache.org\" target=\"_blank\" rel=\"noopener noreferrer\">MXNet<\/a> that was specifically designed for NLP applications. 
It extends MXNet, providing NLP models, datasets, and examples.<\/p>\n<p>GluonNLP includes an efficient implementation of the BERT model, scripts for training and performing inference, and several datasets (such as the GLUE benchmark and SQuAD). For more information, see <a href=\"https:\/\/gluon-nlp.mxnet.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">GluonNLP: NLP made easy<\/a>.<\/p>\n<p>For this post, we use the GluonNLP BERT implementation to perform inference on NLP tasks. Specifically, we use MXNet version 1.7 and GluonNLP version 0.10.0.<\/p>\n<h2>BERT-base inference results<\/h2>\n<p>We present results for two different BERT tasks: question answering and classification (sentiment analysis using the Stanford Sentiment Treebank (SST2) dataset). We achieved the results after a <a href=\"#kix.a5c5es65pdhu\" target=\"_blank\" rel=\"noopener noreferrer\">set of GPU optimizations<\/a> on MXNet.<\/p>\n<p>In the following graphs, we compare the latency achieved by a single GPU on a g4dn.xlarge instance with FP16 precision vs. the most efficient CPU instance in terms of latency, c5.24xlarge with INT8 precision, MKL BLAS, and 24 OpenMP threads.<\/p>\n<p>The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a classification inference task (SST2 dataset). Different sequence lengths (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. Latency values for sequence length 128 are included as labels.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16235\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/1-Graph-1.jpg\" alt=\"\" width=\"900\" height=\"563\"><\/p>\n<p>The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a question answering inference task (SQuAD dataset). 
Different sequence lengths (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. Latency values for sequence length 128 are included as labels.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16236\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/2-Grpah.jpg\" alt=\"\" width=\"900\" height=\"563\">\u00a0<\/strong><\/p>\n<p>In the following two graphs, we present a cost comparison between several instances based on the throughput (sentences\/s) and the on-demand cost of each instance (cost per hour) in the US East (N. Virginia) Region.<\/p>\n<p>The following graph shows dollars per 1 million sequence classification requests, for different instances, batch size 128, and several sequence lengths (80, 128, and 384). The on-demand hourly price of each instance was based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16237\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/3-Graph-1.jpg\" alt=\"\" width=\"900\" height=\"619\"><\/p>\n<p>The following graph shows dollars per 1 million question answering requests, for different instances, batch size 128, and several sequence lengths (80, 128, and 384). The on-demand hourly price of each instance was based on the US East (N. 
Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.<\/p>\n<p><strong><u><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16238\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/4-Graph.jpg\" alt=\"\" width=\"900\" height=\"619\"><\/u><\/strong><\/p>\n<h2>Deploying BERT on G4 instances<\/h2>\n<p>You can easily reproduce the results in the preceding section on a g4dn.xlarge instance. You can start from a pretrained model and fine-tune it for a specific task before running inference, or you can download one of the following fine-tuned models:<\/p>\n<p>Then complete the following steps:<\/p>\n<ol>\n<li>To initialize a G4 instance, on the Amazon EC2 console, choose <strong>Deep Learning AMI<\/strong> (Ubuntu 18.04) Version 28.1 (or later) and a G4 instance.<\/li>\n<li>Connect to the instance and install MXNet 1.7 and GluonNLP 0.10.x:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">pip install mxnet-cu102==1.7.0\r\ngit clone --branch v0.10.x https:\/\/github.com\/dmlc\/gluon-nlp.git\r\ncd gluon-nlp; pip install -e .; cd scripts\/bert\r\npython setup.py install\r\n<\/code><\/pre>\n<\/div>\n<p>The command <code>python setup.py install<\/code> generates a custom graph pass (<code>bertpass_lib.so<\/code>) that optimizes the graph and therefore improves performance. 
It can be passed to the inference script as an argument.<\/p>\n<ol start=\"3\">\n<li>If you didn\u2019t download any fine-tuned parameters, you can now fine-tune your model, specifying a sequence length and a GPU.\n<ul>\n<li>For a question answering task, run the following script (approximately 180 minutes):<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">python3 finetune_squad.py --max_seq_length 128 --gpu<\/code><\/pre>\n<\/div>\n<ol>\n<li>\n<ul type=\"a\">\n<li>For a classification task, run the following script:<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">python3 finetune_classifier.py --task_name [task_name] --max_len 128 --gpu 0<\/code><\/pre>\n<\/div>\n<p>In the preceding code, task choices include \u2018MRPC\u2019, \u2018QQP\u2019, \u2018QNLI\u2019, \u2018RTE\u2019, \u2018STS-B\u2019, \u2018CoLA\u2019, \u2018MNLI\u2019, \u2018WNLI\u2019, \u2018SST\u2019 (refers to SST2), \u2018XNLI\u2019, \u2018LCQMC\u2019, and \u2018ChnSentiCorp\u2019. Computation time depends on the specific task. For SST, it should take less than 15 minutes.<\/p>\n<p>By default, these scripts run for 3 epochs (to achieve the published accuracy in [1]).<\/p>\n<p>They generate an output file, <code>output_dir\/net.params<\/code>, where the fine-tuned parameters are stored and from which they can be loaded at the inference step. The scripts also perform a prediction test to check accuracy.<\/p>\n<p>You should get an F1 score of 85 or higher in question answering, and a validation metric higher than 0.92 in the SST classification task.<\/p>\n<p>You can now perform inference using the validation datasets.<\/p>\n<ol start=\"4\">\n<li>Force MXNet to use FP32 precision in the Softmax and LayerNorm layers for better accuracy when using FP16.<\/li>\n<\/ol>\n<p>These two layers are susceptible to overflow, so we recommend always using FP32. 
MXNet takes care of this if you set the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">export MXNET_SAFE_ACCUMULATION=1<\/code><\/pre>\n<\/div>\n<ol start=\"5\">\n<li>Activate True FP16 computation for performance purposes.<\/li>\n<\/ol>\n<p>General matrix multiply operations don\u2019t present accuracy issues in this model. By default, they\u2019re computed using FP32 accumulation (for more information, see the section <strong>Optimizing BERT model performance on MXNet 1.6 and 1.7<\/strong> in this post), but you can activate the FP16 accumulation setting:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">export MXNET_FC_TRUE_FP16=1<\/code><\/pre>\n<\/div>\n<ol start=\"6\">\n<li>Run inference:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">python3 deploy.py --model_parameters [path_to_finetuned_params] --task [_task_] --gpu 0 --dtype float16 --custom_pass=bertpass_lib.so<\/code><\/pre>\n<\/div>\n<p>In the preceding code, the task can be one of \u2018QA\u2019, \u2018embedding\u2019, \u2018MRPC\u2019, \u2018QQP\u2019, \u2018QNLI\u2019, \u2018RTE\u2019, \u2018STS-B\u2019, \u2018CoLA\u2019, \u2018MNLI\u2019, \u2018WNLI\u2019, \u2018SST\u2019, \u2018XNLI\u2019, \u2018LCQMC\u2019, or \u2018ChnSentiCorp\u2019 [1].<\/p>\n<p>This command exports the model (JSON and parameter files) into the output directory (<code>output_dir\/<em>[task_name]<\/em><\/code>), and performs inference using the validation dataset corresponding to each task.<\/p>\n<p>It reports the average latency and throughput.<\/p>\n<p>The second time you run it, you can skip the export step by adding the flag <code>--only_infer<\/code> and specifying the exported model to use by adding <code>--exported_model<\/code> followed by the prefix name of the JSON or parameter files.<\/p>\n<p>Optimal latency is achieved on G4 instances with FP16 precision. 
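<\/p>\n<p>These precision settings are safe to combine because FP16 overflows above its maximum representable value (65504), which is why the Softmax and LayerNorm reductions stay in FP32 under <code>MXNET_SAFE_ACCUMULATION=1<\/code>, while the GEMMs in this model tolerate FP16 accumulation. A quick NumPy illustration of the overflow behavior (the values are illustrative, not taken from the model):<\/p>

```python
import numpy as np

# float16 saturates past its maximum representable value (65504),
# so a large intermediate sum overflows to inf; float32 does not.
half_sum = np.float16(60000) + np.float16(10000)
full_sum = np.float32(60000) + np.float32(10000)
print(half_sum)   # inf
print(full_sum)   # 70000.0
```

<p>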
We recommend adding the flag <code>--dtype float16<\/code> and activating <code>MXNET_FC_TRUE_FP16<\/code> when performing inference. These flags shouldn\u2019t reduce the final accuracy of your results.<\/p>\n<p>By default, all these scripts use BERT-base (12 transformer-encoder layers). If you want to use BERT-large, use the flag <code>--bert_model bert_24_1024_16<\/code> when calling the scripts.<\/p>\n<h2>Optimizing BERT model performance on MXNet 1.6 and 1.7<\/h2>\n<p>Computationally, the BERT model is dominated by general matrix multiply operations (GEMMs). They represent up to 56% of the time consumed when performing inference. The following chart shows the percentage of computational time spent on each operation type when performing BERT-base inference (sequence length 128 and batch size 128).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16239\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/5-Pie-Chart.jpg\" alt=\"\" width=\"900\" height=\"557\"><\/p>\n<p>MXNet uses the <a href=\"https:\/\/developer.nvidia.com\/cublas\" target=\"_blank\" rel=\"noopener noreferrer\">cuBLAS<\/a> library to efficiently compute these GEMMs on the GPU. These GEMMs belong to the multi-head self-attention part of the model (4 GEMMs per transformer layer) and the feed-forward network (2 GEMMs per transformer layer).<\/p>\n<p>In this section, we discuss optimizing the most computationally expensive operations.<\/p>\n<p>The following table shows the improvement from each optimization. The performance improvements were achieved by the different GPU BERT optimizations implemented in MXNet and GluonNLP, performing a question answering inference task (SQuAD dataset) with a sequence length of 128. 
Speedup achieved is shown for different batch sizes.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16240\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/6-Screenshot-3.jpg\" alt=\"\" width=\"900\" height=\"189\"><\/p>\n<h3>LayerNorm, Softmax and AddBias<\/h3>\n<p>Although <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/14935\" target=\"_blank\" rel=\"noopener noreferrer\">LayerNorm<\/a> was already optimized for GPUs on MXNet 1.5, the implementation of <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/15545\" target=\"_blank\" rel=\"noopener noreferrer\">Softmax<\/a> was optimized in MXNet 1.6. The new implementation improves inference performance on GPUs by optimizing the device memory accesses and using the CUDA registers and shared memory during reduction operations more efficiently. Additionally, you have the option to apply a <code>max_length<\/code> <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/15169\" target=\"_blank\" rel=\"noopener noreferrer\">mask within the C++ Softmax<\/a> operator, which removes the need to apply the mask at the Python level.<\/p>\n<p>The addition of <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/16039\" target=\"_blank\" rel=\"noopener noreferrer\">bias terms<\/a> following GEMMs was also optimized. 
Instead of using an <code>mshadow<\/code> broadcast summation, a custom CUDA kernel is now attached to the <code>FullyConnected<\/code> layer, which includes efficient device memory accesses.<\/p>\n<h3>Multi-head self-attention<\/h3>\n<p>The following equation defines the attention mechanism used in the BERT model [2], where <strong>Q<\/strong> represents the query, <strong>K<\/strong> the key, <strong>V<\/strong> the value, and d<sub>k<\/sub> the inner dimension of these three matrices:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16244 aligncenter\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/Formula.jpg\" alt=\"\" width=\"726\" height=\"159\"><\/p>\n<p>Three different linear projections (<code>FullyConnected<\/code>: GEMMs and <code>Bias-Addition<\/code>) are performed to obtain <strong>Q<\/strong>, <strong>K<\/strong>, and <strong>V<\/strong> from the same input (when the same input is employed, the mechanism is called self-attention), but with different weights:<\/p>\n<ul>\n<li>\n<ul>\n<li>\n<strong>Q<\/strong> = <strong>input<\/strong> <strong>W<sub>q<\/sub><sup>t<\/sup><\/strong>\n<\/li>\n<li>\n<strong>K<\/strong> = <strong>input<\/strong> <strong>W<sub>k<\/sub><sup>t<\/sup><\/strong>\n<\/li>\n<li>\n<strong>V<\/strong> = <strong>input<\/strong> <strong>W<sub>v<\/sub><sup>t<\/sup><\/strong>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>The input size is (<code>BatchSize<\/code>, <code>SeqLength<\/code>, <code>EmbeddingDim<\/code>), and each weight tensor <strong>W<\/strong> has size (<code>ProjectionDim<\/code>, <code>EmbeddingDim<\/code>).<\/p>\n<p>In multi-head attention, as many projections and attention functions as heads are applied to the input, augmenting the dimensions of the weights so that each <strong>W<\/strong> has size ((<code>NumHeads<\/code> x <code>ProjectionDim<\/code>), <code>EmbeddingDim<\/code>).<\/p>\n<p>All these projections are independent, 
so we can compute them in parallel within the same operation, producing an output whose size is (<code>BatchSize<\/code>, <code>SeqLength<\/code>, 3 x <code>NumHeads<\/code> x <code>ProjectionDim<\/code>). That is, GluonNLP uses a single <code>FullyConnected<\/code> layer to compute <strong>Q<\/strong>, <strong>K<\/strong>, and <strong>V<\/strong>.<\/p>\n<p>To compute the attention function (the preceding equation), we first need to compute the dot product <strong>QK<sup>T<\/sup><\/strong>. We need to perform this computation independently for each head, with <em>m<\/em>=<code>SeqLength<\/code> rows of <strong>Q<\/strong>, <em>n<\/em>=<code>SeqLength<\/code> columns of <strong>K<\/strong>, and <em>k<\/em>=<code>ProjectionDim<\/code> as the size of the vectors in the dot product. We can use a batched dot product operation, where the number of batches is (<code>BatchSize<\/code> x <code>NumHeads<\/code>), to compute all the dot products within the same operation.<\/p>\n<p>However, to perform such an operation in cuBLAS, the batch and head dimensions need to be contiguous (in order to have a regular pattern to express the distance between batches), but that isn\u2019t the case by default (the <code>SeqLength<\/code> dimension lies between them). 
To avoid rearranging <strong>Q<\/strong>, <strong>K<\/strong>, and <strong>V<\/strong>, GluonNLP transposes the input so that its shape is (<code>SeqLength<\/code>, <code>BatchSize<\/code>, <code>EmbeddingDim<\/code>), and <strong>Q<\/strong>, <strong>K<\/strong>, and <strong>V<\/strong> are directly projected into a tensor with shape (<code>SeqLength<\/code>, <code>BatchSize<\/code>, 3 x <code>NumHeads<\/code> x <code>ProjectionDim<\/code>).<\/p>\n<p>Moreover, to avoid splitting the joint <strong>QKV<\/strong> output, we can compute the projections in an interleaved fashion, laying out the applied weights <strong>W<sub>q<\/sub><\/strong>, <strong>W<sub>k<\/sub><\/strong>, <strong>W<sub>v<\/sub><\/strong> of each individual head contiguously. The following diagram depicts the interleaved projection operation, where P is the projection size, and we end with a joint <strong>QKV<\/strong> output with shape (<code>SeqLength<\/code>, <code>BatchSize<\/code>, <code>NumHeads<\/code> x 3 x <code>ProjectionDim<\/code>).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16241\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/7-Diagram.jpg\" alt=\"\" width=\"900\" height=\"172\"><\/p>\n<p>This strategy allows us to compute <strong>QK<sup>T<\/sup><\/strong> from a single joint input tensor with <a href=\"https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html#cublas-GemmStridedBatchedEx\" target=\"_blank\" rel=\"noopener noreferrer\">cublasGemmStridedBatchedEx<\/a>, setting the number of batches to (<code>BatchSize<\/code> x <code>NumHeads<\/code>) and the stride to (3 x <code>ProjectionDim<\/code>). We also use a strided batched GEMM to compute the dot product of <strong>V<\/strong> (same stride as before) with the output of the Softmax function. 
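<\/p>\n<p>The interleaved layout can be sketched in NumPy with toy dimensions (the variable names are ours, and the <code>einsum<\/code> stands in for the cuBLAS strided batched GEMM rather than reproducing GluonNLP\u2019s implementation):<\/p>

```python
import numpy as np

# SeqLength, BatchSize, EmbeddingDim, NumHeads, ProjectionDim (toy sizes)
S, B, E, H, P = 4, 2, 8, 2, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((S, B, E))   # input, already transposed to (S, B, E)
Wq, Wk, Wv = (rng.standard_normal((H * P, E)) for _ in range(3))

# Reference: three separate projections, each (S, B, H*P)
Q, K, V = x @ Wq.T, x @ Wk.T, x @ Wv.T

# Interleaved: each head's (Wq_h, Wk_h, Wv_h) rows laid out contiguously,
# so one GEMM yields a joint output reshapeable to (S, B, H, 3, P)
W_int = np.concatenate([np.concatenate([W[h * P:(h + 1) * P] for W in (Wq, Wk, Wv)])
                        for h in range(H)])
qkv = (x @ W_int.T).reshape(S, B, H, 3, P)

# Batched QK^T over (BatchSize x NumHeads) batches, a stride of 3*P apart
q, k = qkv[:, :, :, 0, :], qkv[:, :, :, 1, :]
scores = np.einsum('sbhp,tbhp->bhst', q, k)   # (B, H, S, S)
```

<p>Each head\u2019s <strong>Q<\/strong>, <strong>K<\/strong>, and <strong>V<\/strong> slices land exactly where a stride of 3 x <code>ProjectionDim<\/code> expects them, which is what lets cuBLAS batch all (<code>BatchSize<\/code> x <code>NumHeads<\/code>) dot products into one call. <\/p>\n<p>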
We implemented <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/16408\" target=\"_blank\" rel=\"noopener noreferrer\">MXNet operators<\/a> that deal with this cuBLAS configuration.<\/p>\n<h3>True FP16<\/h3>\n<p>Since MXNet 1.7, you can <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/17466\" target=\"_blank\" rel=\"noopener noreferrer\">compute GEMMs entirely in FP16 precision<\/a>. By default, when the data type is FP16, MXNet sets cuBLAS to internally use FP32 accumulation. You can now set the environment variable <code>MXNET_FC_TRUE_FP16<\/code> to 1 to force MXNet to use FP16 as the cuBLAS internal computation type.<\/p>\n<h3>Pointwise fusion and prearrangement of MHA weights and bias using a custom graph pass<\/h3>\n<p>Finally, the feed-forward part of each transformer layer uses the Gaussian Error Linear Unit (GELU) as its activation function. This operation follows a feed-forward <code>FullyConnected<\/code> operation, which includes bias addition. We use the MXNet functionality of <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/17885\" target=\"_blank\" rel=\"noopener noreferrer\">custom graph passes<\/a> to detach the bias addition from the <code>FullyConnected<\/code> operation and fuse it with GELU through the <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/pull\/15167\" target=\"_blank\" rel=\"noopener noreferrer\">pointwise fusion mechanism<\/a>.<\/p>\n<p>In our custom graph pass for BERT, we also prearrange the weights and bias terms for the multi-head self-attention computation so that we avoid any overhead at runtime. As explained earlier, the weights need to be interleaved, and the bias terms need to be joined into a single tensor. We do this before exporting the model. This strategy is especially beneficial for small batch sizes.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, we presented an efficient solution for performing BERT inference tasks on EC2 G4 GPU instances. 
We showed how a set of MXNet optimizations boosts GPU performance, achieving speedups of up to 2 times in both question answering and classification tasks.<\/p>\n<p>We have shown that g4dn.xlarge instances offer lower latency (below 4 milliseconds with batch size 1) than any EC2 CPU instance, with g4dn.xlarge 3.8 times faster than c5.24xlarge on average. Finally, g4dn.xlarge offers the best cost per million requests\u201416 times better than the best CPU instance (c5.xlarge) on average.<\/p>\n<h3>Acknowledgments<\/h3>\n<p>We would like to thank Triston Cao and Murat Guney from NVIDIA, Sandeep Krishnamurthy from Amazon, the Amazon MXNet team, and the NVIDIA MXNet team for their feedback and support.<\/p>\n<h3>Disclaimer<\/h3>\n<p>The content and opinions in this post are those of the third-party authors, and AWS is not responsible for the content or accuracy of this post.<\/p>\n<h2>References<\/h2>\n<ol>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" target=\"_blank\" rel=\"noopener noreferrer\">Devlin, Jacob, et al. \u201cBERT: Pre-training of deep bidirectional transformers for language understanding.\u201d <em>arXiv preprint arXiv:1810.04805<\/em> (2018).<\/a><\/li>\n<li><a href=\"http:\/\/papers.nips.cc\/paper\/7181-attention-is-all-you-need\" target=\"_blank\" rel=\"noopener noreferrer\">Vaswani, Ashish, et al. \u201cAttention is all you need.\u201d Advances in Neural Information Processing Systems. 2017.<\/a><\/li>\n<\/ol>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong>Moises Hernandez Fernandez<\/strong> is an AI DevTech Engineer at NVIDIA. He works on accelerating NLP applications on GPUs. Before joining NVIDIA, he conducted research on brain connectivity, optimizing the analysis of diffusion MRI using GPUs. 
Moises received a PhD in Neurosciences from Oxford University.<\/p>\n<p><strong>Haibin Lin<\/strong> is a former Applied Scientist at Amazon Web Services. He works on distributed systems, deep learning, and NLP. He is a PPMC member and committer of Apache MXNet, and a major contributor to the GluonNLP toolkit. He finished his M.S. in Computer Science at Carnegie Mellon University, advised by Andy Pavlo. Prior to that, he received a B.Eng. in Computer Science jointly from the University of Hong Kong and Shanghai Jiao Tong University.<\/p>\n<p><strong>Przemyslaw Tredak<\/strong> is a senior developer technology engineer on the Deep Learning Frameworks team at NVIDIA. He is a committer of Apache MXNet and leads the MXNet team at NVIDIA.<\/p>\n<p><strong>Anish Mohan<\/strong> is a Machine Learning Architect at NVIDIA and the technical lead for ML\/DL engagements with key NVIDIA customers in the greater Seattle region. Before NVIDIA, he was at Microsoft\u2019s AI Division, working to develop and deploy AI\/ML algorithms and 
solutions.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/bert-inference-on-g4-instances-using-apache-mxnet-and-gluonnlp-1-million-requests-for-20-cents\/<\/p>\n","protected":false},"author":0,"featured_media":305,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/304"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=304"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/304\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/305"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=304"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=304"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=304"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}