{"id":368,"date":"2020-10-07T12:47:19","date_gmt":"2020-10-07T12:47:19","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/10\/07\/achieving-1-85x-higher-performance-for-deep-learning-based-object-detection-with-an-aws-neuron-compiled-yolov4-model-on-aws-inferentia\/"},"modified":"2020-10-07T12:47:19","modified_gmt":"2020-10-07T12:47:19","slug":"achieving-1-85x-higher-performance-for-deep-learning-based-object-detection-with-an-aws-neuron-compiled-yolov4-model-on-aws-inferentia","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/10\/07\/achieving-1-85x-higher-performance-for-deep-learning-based-object-detection-with-an-aws-neuron-compiled-yolov4-model-on-aws-inferentia\/","title":{"rendered":"Achieving 1.85x higher performance for deep learning based object detection with an AWS Neuron compiled YOLOv4 model on AWS Inferentia"},"content":{"rendered":"<div id=\"\">\n<p>In this post, we show you how to deploy a TensorFlow based YOLOv4 model, using Keras optimized for inference on <a href=\"https:\/\/aws.amazon.com\/machine-learning\/inferentia\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Inferentia<\/a> based <a href=\"http:\/\/aws.amazon.com\/ec2\/instance-types\/inf1\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EC2 Inf1<\/a> instances. You will set up a benchmarking environment to evaluate throughput and precision, comparing Inf1 with comparable Amazon EC2 G4 GPU-based instances. Deploying YOLOv4 on AWS Inferentia provides the highest throughput, lowest latency with minimal latency jitter, and the lowest cost per image.<\/p>\n<p>The following charts show a 2-hour run in which Inf1 provides higher throughput and lower latency. 
The Inf1 instances achieved up to 1.85 times higher throughput and 37% lower cost per image when compared to the most optimized Amazon EC2 G4 GPU-based instances.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16817\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/06\/2_Update.jpg\" alt=\"\" width=\"450\" height=\"412\"><\/p>\n<p>In addition, the following graph records the P90 inference latency is 60% lower on Inf1, and with significant lower variance compared to the G4 instances.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16818\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/06\/1_Update.jpg\" alt=\"\" width=\"900\" height=\"538\"><\/p>\n<p>When you use the <a href=\"https:\/\/aws.amazon.com\/machine-learning\/neuron\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Neuron<\/a> data type auto-casting feature, there is no measurable degradation in accuracy. The compiler automatically converts the pipeline to mixed precision with BF16 data types for increased performance. The model reaches 48.7% mean average precision\u2014thanks to the state-of-the-art YOLOv4 model implementation.<\/p>\n<h2>About AWS Inferentia and AWS Neuron SDK<\/h2>\n<p>AWS Inferentia chips are custom built by AWS to provide high-inference performance, with the lowest cost of inference in the cloud, with seamless features such as auto-conversion of trained FP32 models to Bfloat16, and elasticity in its machine learning (ML) models\u2019 compute architecture, which supports a wide range of model types from image recognition, object detection, natural language processing (NLP), and modern recommender models.<\/p>\n<p>AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance of the Inferentia chips. 
Neuron is natively integrated with popular ML frameworks such as TensorFlow and PyTorch, and comes pre-installed in the <a href=\"https:\/\/aws.amazon.com\/machine-learning\/amis\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Deep Learning AMIs<\/a>. Therefore, deploying deep learning models on AWS Inferentia is done in the same familiar environment used in other platforms, and your applications benefit from the boost in performance and lowest cost.<\/p>\n<p>Since its launch, the Neuron SDK has seen dramatic improvement in the breadth of models that deliver high performance at a fraction of the cost. This includes NLP models like the popular BERT, image classification models (ResNet, VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new Neuron tool that allows you to easily scale monitoring of large fleets of Inference applications.<\/p>\n<h2>You Only Look Once<\/h2>\n<p>Object detection stands out as a computer vision (CV) task that has seen large accuracy improvements (<a href=\"https:\/\/paperswithcode.com\/sota\/object-detection-on-coco\" target=\"_blank\" rel=\"noopener noreferrer\">average precision at 50 IoU &gt; 70<\/a>) due to deep learning model architectures. 
An object detection model tries to localize and classify objects in an image, allowing for applications ranging from real-time inspection of <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S2405896318321001\" target=\"_blank\" rel=\"noopener noreferrer\">manufacturing defects<\/a> to <a href=\"https:\/\/arxiv.org\/pdf\/1910.01268.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">medical imaging<\/a> and <a href=\"https:\/\/towardsdatascience.com\/how-to-track-football-players-using-yolo-sort-and-opencv-6c58f71120b8\" target=\"_blank\" rel=\"noopener noreferrer\">tracking your favorite player and ball on a soccer match<\/a>.<\/p>\n<p>Addressing the real-time inference challenges of such computer vision tasks is key for deploying these models at scale.<\/p>\n<p>YOLO is part of the deep learning (DL) single-stage <a href=\"https:\/\/arxiv.org\/pdf\/2004.10934.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">object detection model family<\/a>, which includes models such as Single-Shot Detector (SSD) and RetinaNet. These models are usually built from stacking a backbone, neck, and head neural network that together perform detection and classification tasks. The main predictions are bounding boxes for identified objects and associated classes.<\/p>\n<p>The backbone network takes care of extracting features of the input image, while the head gets trained on the supervised task, to predict the edges of the bounding box and classify its contents. The addition of a neck neural network allows for the head network to process features from intermediate steps of the backbone. The whole pipeline processes the images only once, hence the name You Only Look Once (YOLO).<\/p>\n<p>On the other hand, models with two-stage detectors process further features from the previous convolutional layers to obtain proposals of regions, prior to generating object class prediction. 
In this way, the network focuses on detecting and classifying objects on regions of high object probability.<\/p>\n<p>The following diagram illustrates this architecture (from YOLOv4: Optimal Speed and Accuracy of Object Detection, <a href=\"https:\/\/arxiv.org\/abs\/2004.10934v1\" target=\"_blank\" rel=\"noopener noreferrer\">arXiv:2004.10934v1<\/a>).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16775\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/05\/3-1-1.jpg\" alt=\"\" width=\"900\" height=\"264\"><\/p>\n<p>Single-stage models allow for multiple predictions of the same object in a single image. These predictions get disambiguated later by a process called non-max suppression (NMS), which takes care of leaving only the highest probability bounding box and label for the object. It\u2019s a less computationally costly workflow than the two-stage approach.<\/p>\n<p>Models like YOLO are all about performance. Its latest incarnation, version 4, aims at pushing the prediction accuracy further. The research paper <a href=\"https:\/\/arxiv.org\/pdf\/2004.10934.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">YOLOv4: Optimal Speed and Accuracy of Object Detection<\/a> shows how real-time inference can be achieved above the human perception of around 30 frames per second (FPS). 
In this post, you explore ways to push the performance of this model even further and use AWS Inferentia as a cost-effective hardware accelerator for real-time object detection.<\/p>\n<h2>Prerequisites<\/h2>\n<p>For this walkthrough, you need an AWS account with access to the <a href=\"http:\/\/aws.amazon.com\/console\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Management Console<\/a> and the ability to create <a href=\"http:\/\/aws.amazon.com\/ec2\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Compute Cloud<\/a> (Amazon EC2) instances with a public-facing IP address.<\/p>\n<p>Working knowledge of AWS Deep Learning AMIs and Jupyter notebooks with Conda environments is beneficial, but not required.<\/p>\n<h2>Building a YOLOv4 predictor from a pre-trained model<\/h2>\n<p>To start building the model, set up an inf1.2xlarge EC2 instance in AWS, with 8 vCPU cores and 16 GB of memory. The Inf1 instance allows for optimizing the ratio between CPU and Inferentia devices through the selection of inf1.xlarge or inf1.2xlarge. We found that for YOLOv4, the optimal CPU to accelerator balance is achieved with inf1.2xlarge. Going up to the second size instance improves throughput for a lower cost per image. Use the AWS Deep Learning AMI (Ubuntu 18.04) version 34.0 (<code>ami-06a25ee8966373068<\/code>) in the US East (N. Virginia) Region. This AMI comes pre-packaged with the Neuron SDK and the required Neuron runtime for AWS Inferentia. For more information about running AWS Deep Learning AMIs on EC2 instances, see <a href=\"https:\/\/docs.aws.amazon.com\/dlami\/latest\/devguide\/launch-config.html\" target=\"_blank\" rel=\"noopener noreferrer\">Launching and Configuring a DLAMI<\/a>.<\/p>\n<p>Next you can connect to the instance through SSH, activate the <code>aws_neuron_tensorflow_p36<\/code> Conda environment, and update the Neuron compiler to the latest release. 
The compilation script depends on requirements listed in the YOLOv4 tutorial posted on the Neuron <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/tree\/master\/src\/examples\/tensorflow\/yolo_v4_demo\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>. Install them by running the following code in the terminal:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">pip install neuron-cc tensorflow-neuron requests pillow matplotlib pycocotools==2.0.1 torch~=1.5.0 --force --extra-index-url=https:\/\/pip.repos.neuron.amazonaws.com\r\n<\/code><\/pre>\n<\/div>\n<p>You can also run the following steps directly from the provided <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/blob\/master\/src\/examples\/tensorflow\/yolo_v4_demo\/evaluate.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">Jupyter notebook<\/a>. If doing so, skip to the <strong>Running a performance benchmark on Inferentia<\/strong> section to explore the performance benefits of running YOLOv4 on AWS Inferentia.<\/p>\n<p>The benchmark of the models requires an object detection validation dataset. Start by downloading the COCO 2017 validation dataset. The COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset, with over 300,000 images and 1.5 million object instances. The 2017 version of COCO contains 5,000 images for validation.<\/p>\n<p>To download the dataset, enter the following code on the terminal:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">curl -LO http:\/\/images.cocodataset.org\/zips\/val2017.zip\r\ncurl -LO http:\/\/images.cocodataset.org\/annotations\/annotations_trainval2017.zip\r\nunzip -q val2017.zip\r\nunzip annotations_trainval2017.zip\r\n<\/code><\/pre>\n<\/div>\n<p>When the download is complete, you should see a <code>val2017<\/code> and an <code>annotations<\/code> folder available in your working directory. 
At this stage, you\u2019re ready to build and compile the model.<\/p>\n<p>The <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/tree\/master\/src\/examples\/tensorflow\/yolo_v4_demo\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a> contains the script <code>yolo_v4_coco_saved_model.py<\/code> for downloading the pretrained weights of a PyTorch implementation of YOLOv4, and the model definition for YOLOv4 using TensorFlow 1.15 and Keras. The code was adapted from <a href=\"https:\/\/github.com\/miemie2013\/Keras-YOLOv4\" target=\"_blank\" rel=\"noopener noreferrer\">an earlier implementation<\/a> and converts the PyTorch checkpoint to a Keras h5 saved model. This implementation of YOLOv4 is optimized to run on AWS Inferentia. For more information about optimizations, see <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/blob\/master\/src\/examples\/tensorflow\/yolo_v4_demo\/README.md\" target=\"_blank\" rel=\"noopener noreferrer\">Working with YOLO v4 using AWS Neuron SDK<\/a>.<\/p>\n<p>To download, convert, and save your Keras model to the <code>yolo_v4_coco_saved_model<\/code> folder, enter the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">python3 yolo_v4_coco_saved_model.py .\/yolo_v4_coco_saved_model\r\n<\/code><\/pre>\n<\/div>\n<p>To instantiate a new predictor from the saved model, use <code>tf.contrib.predictor.from_saved_model('.\/yolo_v4_coco_saved_model')<\/code> on your inference script.<\/p>\n<p>The following code implements a single batch predictor and image annotation script, so you can test the saved model:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">import json\r\nimport tensorflow as tf\r\nfrom PIL import Image\r\nimport matplotlib.pyplot as plt\r\nimport matplotlib.patches as patches\r\n\r\nyolo_pred_cpu = tf.contrib.predictor.from_saved_model('.\/yolo_v4_coco_saved_model')\r\nimage_path = 
'.\/val2017\/000000581781.jpg'\r\nwith open(image_path, 'rb') as f:\r\n    feeds = {'image': [f.read()]}\r\n\r\nresults = yolo_pred_cpu(feeds)\r\n\r\n# load annotations to decode classification result\r\nwith open('.\/annotations\/instances_val2017.json') as f:\r\n    annotate_json = json.load(f)\r\nlabel_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])}\r\n\r\n# draw picture and bounding boxes\r\nfig, ax = plt.subplots(figsize=(10, 10))\r\nax.imshow(Image.open(image_path).convert('RGB'))\r\n\r\nwanted = results['scores'][0] &gt; 0.1\r\n\r\nfor xyxy, label_no_bg in zip(results['boxes'][0][wanted], results['classes'][0][wanted]):\r\n    xywh = xyxy[0], xyxy[1], xyxy[2] - xyxy[0], xyxy[3] - xyxy[1]\r\n    rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none')\r\n    ax.add_patch(rect)\r\n    rx, ry = rect.get_xy()\r\n    rx = rx + rect.get_width() \/ 2.0\r\n    ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10,\r\n                ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5))\r\nplt.show()\r\n<\/code><\/pre>\n<\/div>\n<p>The performance in this setup isn\u2019t optimal because you ran YOLO only on CPU. Despite the native parallelization from TensorFlow, the eight cores aren\u2019t enough to bring the inference time close to real time. For that, you use AWS Inferentia.<\/p>\n<h2>Compiling YOLOv4 to run on AWS Inferentia<\/h2>\n<p>The compilation of YOLOv4 uses the TensorFlow-Neuron API <code>tfn.saved_model.compile<\/code>, working directly with the saved model directory created before. 
To further reduce the Neuron runtime overhead, two extra arguments are added to the compiler call: <code>no_fuse_ops<\/code> and <code>minimum_segment_size<\/code>.<\/p>\n<p>The first argument, <code>no_fuse_ops<\/code>, partitions the graph prior to casting the FP16 tensors running in the sub-graph back to FP32, <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/blob\/ea3937e203e19111d636cdbc034194c877bbc100\/src\/examples\/tensorflow\/yolo_v4_demo\/yolo_v4_coco_saved_model.py#L948-L953\" target=\"_blank\" rel=\"noopener noreferrer\">as defined in the model script<\/a>. This allows for operations that run more efficiently on CPU to be skipped while the Neuron compiler runs its automatic smart partitioning. The argument <code>minimum_segment_size<\/code> sets the minimum number of operations in a sub-graph, to enforce trivial compilable sections to run on CPU. For more information, see <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/blob\/master\/docs\/tensorflow-neuron\/api-compilation-python-api.md\" target=\"_blank\" rel=\"noopener noreferrer\">Reference: TensorFlow-Neuron Compilation API<\/a>.<\/p>\n<p>To compile the model, enter the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">import shutil\r\nimport tensorflow as tf\r\nimport tensorflow.neuron as tfn\r\n\r\n\r\ndef no_fuse_condition(op):\r\n    return any(op.name.startswith(pat) for pat in ['reshape', 'lambda_1\/Cast', 'lambda_2\/Cast', 'lambda_3\/Cast'])\r\n\r\nwith tf.Session(graph=tf.Graph()) as sess:\r\n    tf.saved_model.loader.load(sess, ['serve'], '.\/yolo_v4_coco_saved_model')\r\n    no_fuse_ops = [op.name for op in sess.graph.get_operations() if no_fuse_condition(op)]\r\n\r\nshutil.rmtree('.\/yolo_v4_coco_saved_model_neuron', ignore_errors=True)\r\n\r\nresult = tfn.saved_model.compile(\r\n                '.\/yolo_v4_coco_saved_model', '.\/yolo_v4_coco_saved_model_neuron',\r\n                # we partition the graph before 
casting from float16 to float32, to help reduce the output tensor size by 1\/2\r\n                no_fuse_ops=no_fuse_ops,\r\n                # to enforce trivial compilable subgraphs to run on CPU\r\n                minimum_segment_size=100,\r\n                batch_size=1,\r\n                dynamic_batch_size=True,\r\n)\r\n\r\nprint(result)\r\n<\/code><\/pre>\n<\/div>\n<p>On an inf1.2xlarge, the compilation takes only a few minutes and outputs the ratio of the graph operations run on the AWS Inferentia chip. For our model, it\u2019s approximately 79%. As mentioned earlier, to optimize the compiled model for performance, the target of the compilation shouldn\u2019t be to maximize operations on the AWS Inferentia chip, but to balance the use of the available CPUs for efficient combined hardware utilization.<\/p>\n<p>AWS Inferentia is designed to reach peak throughput at small (usually single-digit) batch sizes. When optimizing a specific model for throughput, explore compiling the model with different values of the <code>batch_size<\/code> argument and test what batch size yields the maximum throughput for your model. In the case of our YOLOv4 model, the best batch size is 1.<\/p>\n<p>Replace the model path in the predictor instantiation with <code>tf.contrib.predictor.from_saved_model('.\/yolo_v4_coco_saved_model_neuron')<\/code> for a comparison with the previous CPU-only inference. You get similar detection accuracy at a fraction of the inference time, approximately 40 milliseconds.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16776\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/05\/4-1-1.jpg\" alt=\"\" width=\"900\" height=\"664\"><\/p>\n<h2>Setting up a benchmarking pipeline<\/h2>\n<p>To set up a performance measuring pipeline, create a multi-threaded loop running inference on all the COCO images downloaded. 
The code <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/blob\/master\/src\/examples\/tensorflow\/yolo_v4_demo\/evaluate.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">available in the notebook<\/a> adapts <a href=\"https:\/\/github.com\/miemie2013\/Keras-YOLOv4\/blob\/910c4c6f7265f5828fceed0f784496a0b46516bf\/tools\/cocotools.py#L97\" target=\"_blank\" rel=\"noopener noreferrer\">the original implementation of the eval function<\/a>. The following adapted version implements a <code>ThreadPoolExecutor<\/code> to send four parallel prediction calls at a time:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">import time\r\nfrom concurrent import futures\r\n\r\nimport numpy as np\r\n\r\ndef evaluate(yolo_predictor, images, eval_pre_path, anno_file, eval_batch_size, _clsid2catid):\r\n    batch_im_id_list, batch_im_name_list, batch_img_bytes_list = get_image_as_bytes(images, eval_pre_path)\r\n\r\n    # warm up\r\n    yolo_predictor({'image': np.array(batch_img_bytes_list[0], dtype=object)})\r\n\r\n    with futures.ThreadPoolExecutor(4) as exe:\r\n        fut_im_list = []\r\n        fut_list = []\r\n        start_time = time.time()\r\n        for batch_im_id, batch_im_name, batch_img_bytes in zip(batch_im_id_list, batch_im_name_list, batch_img_bytes_list):\r\n            if len(batch_img_bytes) != eval_batch_size:\r\n                continue\r\n            fut = exe.submit(yolo_predictor, {'image': np.array(batch_img_bytes, dtype=object)})\r\n            fut_im_list.append((batch_im_id, batch_im_name))\r\n            fut_list.append(fut)\r\n        bbox_list = []\r\n        count = 0\r\n        for (batch_im_id, batch_im_name), fut in zip(fut_im_list, fut_list):\r\n            results = fut.result()\r\n            bbox_list.extend(analyze_bbox(results, batch_im_id, _clsid2catid))\r\n            for _ in batch_im_id:\r\n                count += 1\r\n                if count % 100 == 0:\r\n                    print('Test iter {}'.format(count))\r\n    
    \r\n        print('==================== Performance Measurement ====================')\r\n        print('Finished inference on {} images in {} seconds'.format(len(images), time.time() - start_time))\r\n        print('=================================================================')\r\n    \r\n    # start evaluation\r\n    box_ap_stats = bbox_eval(anno_file, bbox_list)\r\n    return box_ap_stats\r\n<\/code><\/pre>\n<\/div>\n<p>Additional helper functions are used to calculate average precision scores of the deployed model.<\/p>\n<h2>Running a performance benchmark on Inferentia<\/h2>\n<p>To run the COCO evaluation and benchmark the time to infer over the 5,000 images, run the evaluate function as shown in the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">val_coco_root = '.\/val2017'\r\nval_annotate = '.\/annotations\/instances_val2017.json'\r\nclsid2catid = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16,\r\n               15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31,\r\n               27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43,\r\n               39: 44, 40: 46, 41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56,\r\n               51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72,\r\n               63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85,\r\n               75: 86, 76: 87, 77: 88, 78: 89, 79: 90}\r\neval_batch_size = 8\r\n\r\nwith open(val_annotate, 'r', encoding='utf-8') as f2:\r\n    for line in f2:\r\n        line = line.strip()\r\n        dataset = json.loads(line)\r\n        images = dataset['images']\r\n\r\nbox_ap = evaluate(yolo_pred, images, val_coco_root, val_annotate, eval_batch_size, 
clsid2catid)\r\n<\/code><\/pre>\n<\/div>\n<p>When the evaluation is complete, you see logs on the screen like the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">\u2026\r\n\r\nTest iter 4500\r\nTest iter 4600\r\nTest iter 4700\r\nTest iter 4800\r\nTest iter 4900\r\n==================== Performance Measurement ====================\r\nFinished inference on 5000 images in 47.50522780418396 seconds\r\n=================================================================\r\n\r\n\u2026\r\n\r\nAccumulating evaluation results...\r\nDONE (t=6.78s).\r\n Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.487\r\n Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.741\r\n Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.531\r\n Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330\r\n Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.546\r\n Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604\r\n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357\r\n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.573\r\n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.601\r\n Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.430\r\n Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.657\r\n Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.744<\/code><\/pre>\n<\/div>\n<p>At 5,000 images processed in 47 seconds, this deployment achieves 106 FPS, 3.5 times faster than the real-time threshold of 30 FPS. 
The research paper <a href=\"https:\/\/arxiv.org\/pdf\/2004.10934.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">YOLOv4: Optimal Speed and Accuracy of Object Detection<\/a> lists the results for batch one performance over the same COCO 2017 dataset running on an NVIDIA Volta GPU, such as the V100. The largest frame rate obtained was 96 FPS, at 41.2% mAP. Our model architecture and deployment achieve a higher mAP, 48.7%, with a higher frame rate.<\/p>\n<p>To have a direct comparison between AWS Inferentia, NVIDIA Volta, and Turing architectures, we replicated the same experiment in two GPU-based instances, g4dn.xlarge and p3.2xlarge, by running the exact same model prior to compilation, with no further GPU optimization. This time we achieved 39 FPS and 111 FPS for the g4dn.xlarge and p3.2xlarge, respectively.<\/p>\n<p>A YOLO model deployed in production usually doesn\u2019t see a defined batch of 5,000 images at a time. To measure production-like performance, we set up a prediction-only multi-threaded pipeline that runs inference for extended periods.<\/p>\n<p>For a total time of 2 hours, we continually ran 8 parallel prediction calls with a batch of 4 images on each, totaling 32 images at a time. To maximize GPU throughput and try to decrease the performance gap between the Inf1 and G4 instances, we use the <a href=\"https:\/\/www.tensorflow.org\/xla\" target=\"_blank\" rel=\"noopener noreferrer\">TensorFlow XLA compiler<\/a>. This setup mimics a live endpoint behavior running at maximum throughput.<\/p>\n<h3>GPU thermal throttling<\/h3>\n<p>In contrast to AWS Inferentia chips, GPU throughput is inversely proportional to GPU temperature. GPU temperature can vary on endpoints running for extended periods at high throughput, which leads to FPS and latency fluctuations. This effect is known as <em>thermal throttling<\/em>. Some production systems cap throughput below the achievable maximum to avoid performance swings over time. 
The following graph shows the average FPS over 30-second increments for the duration of the test. We observed up to 12% variation of the FPS rolling average on the GPU instance. On AWS Inferentia, this variation is below 3% for a substantially larger FPS average.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16778\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/05\/6-.jpg\" alt=\"\" width=\"900\" height=\"600\"><\/p>\n<p>During the 2-hour period, we ran inference on over 856,000 images on the inf1.2xlarge instance. On the g4dn.xlarge, the maximum number of inferences achieved was 486,000. That amounts to 76% more images processed over the same amount of time using AWS Inferentia! Latency averages for batch 4 inference are also 60% lower for AWS Inferentia.<\/p>\n<p>Using the total throughput collected during our 2-hour test, we calculated that the price of running 1 million inferences is $1.362 on an inf1.2xlarge in the us-east-1 Region. For the g4dn.xlarge, the price is $2.163, a 37% price reduction for the YOLOv4 object detection pipeline on AWS Inferentia.<\/p>\n<h2>Safely shutting down and cleaning up<\/h2>\n<p>On the Amazon EC2 console, choose the instances used to perform the benchmark, and choose <strong>Terminate <\/strong>from the <strong>Actions <\/strong>drop-down menu. Terminating the instance discards data stored only in the instance\u2019s home volume. You can persist the compiled model in an <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (S3) bucket, so it can be reused later. If you\u2019ve made changes to the code inside the instances, remember to persist those as well.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, you walked through the steps of optimizing a TensorFlow YOLOv4 model to run on AWS Inferentia. 
You explored AWS Neuron optimizations that yield better model performance with improved average precision, and in a much more cost-effective way. In production, the Neuron compiled model is up to 37% less expensive in the long run, with little throughput and latency fluctuations, when compared to the most optimized GPU instance.<\/p>\n<p>Some of the steps described in this post also apply to other ML model types and frameworks. For more information, see the <a href=\"https:\/\/github.com\/aws\/aws-neuron-sdk\/blob\/master\/README.md\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Neuron SDK GitHub repo<\/a>.<\/p>\n<p>Learn more about the AWS Inferentia chip and the <a href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/inf1\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EC2 Inf1 instances<\/a> to get started with running your own custom ML pipelines on AWS Inferentia using the Neuron SDK.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16779 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/05\/Fabio.jpg\" alt=\"\" width=\"100\" height=\"134\"><strong>Fabio Nonato de Paula<\/strong> is a Principal Solutions Architect for Autonomous Computing in AWS. He works with large-scale deployments of ML and AI for autonomous and intelligent systems. Fabio is passionate about democratizing access to accelerated computing and distributed ML. Outside of work, you can find Fabio riding his motorcycle on the hills of Livermore valley or reading ComiXology.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16780 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/05\/Haichen.jpg\" alt=\"\" width=\"100\" height=\"100\"><strong>Haichen Li<\/strong> is a software development engineer in the AWS Neuron SDK team. 
He works on integrating machine learning frameworks with the AWS Neuron compiler and runtime systems, as well as developing deep learning models that benefit particularly from the Inferentia hardware.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16781 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/05\/Samuel.jpg\" alt=\"\" width=\"100\" height=\"133\"><strong>Samuel Jacob<\/strong> is a senior software engineer in the AWS Neuron team. He works on AWS Neuron runtime to enable high performance inference data paths between AWS Neuron SDK and AWS Inferentia hardware. He also works on tools to analyze and improve AWS Neuron SDK performance. Outside of work, you can catch him playing video games or tinkering with small boards such as RaspberryPi.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/improving-performance-for-deep-learning-based-object-detection-with-an-aws-neuron-compiled-yolov4-model-on-aws-inferentia\/<\/p>\n","protected":false},"author":0,"featured_media":369,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/368"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=368"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/368\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistr
ibution.com\/machine-learning\/wp-json\/wp\/v2\/media\/369"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=368"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=368"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=368"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}