{"id":1177,"date":"2021-11-10T08:35:07","date_gmt":"2021-11-10T08:35:07","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/10\/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker\/"},"modified":"2021-11-10T08:35:07","modified_gmt":"2021-11-10T08:35:07","slug":"deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/10\/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker\/","title":{"rendered":"Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker"},"content":{"rendered":"<div id=\"\">\n<p>Machine learning (ML) and deep learning (DL) are becoming effective tools for solving diverse computing problems, from image classification in medical diagnosis, conversational AI in chatbots, to recommender systems in ecommerce. However, ML models that have specific latency or high throughput requirements can become prohibitively expensive to run at scale on generic computing infrastructure. To achieve performance and deliver inference at the lowest cost, ML models require inference accelerators such as GPUs to meet the stringent throughput, scale, and latency requirements businesses and customers expect.<\/p>\n<p>The deployment of trained models and accompanying code in the data center, public cloud, or at the edge is called <em>inference serving<\/em>. We are proud to announce the integration of NVIDIA Triton Inference Server in <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a>. 
Triton Inference Server containers in SageMaker help deploy models from multiple frameworks on CPUs or GPUs with high performance.<\/p>\n<p>In this post, we give an overview of the NVIDIA Triton Inference Server and SageMaker, discuss the benefits of using Triton Inference Server containers, and showcase how easy it is to deploy your own ML models.<\/p>\n<h2>NVIDIA Triton Inference Server overview<\/h2>\n<p>The <a href=\"https:\/\/github.com\/triton-inference-server\/server\/\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA Triton Inference Server<\/a> was developed specifically to enable scalable, rapid, and easy deployment of models in production. Triton is open-source inference serving software that simplifies the inference serving process and provides high inference performance. Triton is widely deployed across all major industry verticals, including financial services, telecom, retail, manufacturing, the public sector, and healthcare.<\/p>\n<p>The following are some of the key features of Triton:<\/p>\n<ul>\n<li><strong>Support for multiple frameworks<\/strong> \u2013 You can use Triton to deploy models from all major frameworks. Triton supports TensorFlow GraphDef and SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree-based models, OpenVINO, and custom Python\/C++ model formats.<\/li>\n<li><strong>Model pipelines<\/strong> \u2013 The Triton model ensemble represents a pipeline of one or more models, along with pre- and postprocessing logic, and the connection of input and output tensors between them. 
A single inference request to an ensemble triggers the entire pipeline.<\/li>\n<li><strong>Concurrent model runs<\/strong> \u2013 Multiple models (support for concurrent runs of different models will be added soon) or multiple instances of the same model can run simultaneously on the same GPU or on multiple GPUs for different model management needs.<\/li>\n<li><strong>Dynamic batching<\/strong> \u2013 For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.<\/li>\n<li><strong>Diverse CPU and GPU support<\/strong> \u2013 You can run the models on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.<\/li>\n<\/ul>\n<p>The following diagram illustrates the NVIDIA Triton Inference Server architecture.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/05\/ML-6284-image001.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30450\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/05\/ML-6284-image001.png\" alt=\"\" width=\"1440\" height=\"810\"><\/a><\/p>\n<p>SageMaker is a fully managed service for data science and ML workflows. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.<\/p>\n<p>SageMaker now integrates NVIDIA Triton Inference Server to serve models for inference. Thanks to the new Triton Inference Server containers, you can easily serve models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by Triton. 
Triton helps maximize the utilization of GPUs and CPUs, lowering the cost of inference.<\/p>\n<p>This combination of SageMaker and NVIDIA Triton Inference Server enables developers across all industry verticals to rapidly deploy their models into production at scale.<\/p>\n<p>In the following sections, we detail the steps needed to package your model, create a SageMaker endpoint, and benchmark performance. Note that the initial release of Triton Inference Server containers supports only one or more instances of a single model. Future releases will add multi-model support.<\/p>\n<h2>Prepare your model<\/h2>\n<p>To prepare your model for Triton deployment, arrange your Triton serving directory in the following format:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">triton_serve\n \u2514\u2500\u2500 model\n     \u251c\u2500\u2500 config.pbtxt\n     \u2514\u2500\u2500 1\n         \u2514\u2500\u2500 model.plan<\/code><\/pre>\n<\/p><\/div>\n<p>In this format, <code>triton_serve<\/code> is the directory containing all of your models, <code>model<\/code> is the model name, and <code>1<\/code> is the version number.<\/p>\n<p>In addition to the default configuration, such as the input and output definitions, we recommend tuning the <code>config.pbtxt<\/code> file with settings that match your actual workload.<\/p>\n<p>For example, you only need four lines of code to enable the built-in server-side dynamic batching:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">dynamic_batching {\n   preferred_batch_size: 16\n   max_queue_delay_microseconds: 1000\n }<\/code><\/pre>\n<\/p><\/div>\n<p>Here, the <code>preferred_batch_size<\/code> option specifies the batch size that Triton attempts to combine incoming inference requests into. 
The <code>max_queue_delay_microseconds<\/code> option sets the maximum time Triton waits for additional requests when a batch of the preferred size can\u2019t be formed from the available requests; after this delay, Triton runs inference on the batch it has.<\/p>\n<p>For concurrent model runs, you can specify the model concurrency per GPU by changing the <code>count<\/code> value in the <code>instance_group<\/code>, which allows you to easily run multiple copies of the same model and better utilize your compute resources:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">instance_group {\n   count: 1\n   kind: KIND_GPU\n }<\/code><\/pre>\n<\/p><\/div>\n<p>For more information about the configuration files, see <a href=\"https:\/\/github.com\/triton-inference-server\/server\/blob\/master\/docs\/model_configuration.md\" target=\"_blank\" rel=\"noopener noreferrer\">Model Configuration<\/a>.<\/p>\n<p>After you create the model directory, you can use the following command to compress it into a .tar.gz file for later upload to an <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">tar -C triton_serve\/ -czf model.tar.gz model<\/code><\/pre>\n<\/p><\/div>\n<h2>Create a SageMaker endpoint<\/h2>\n<p>To create a SageMaker endpoint with the model repository you just created, you have several options, including the SageMaker endpoint creation UI, the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI), and the SageMaker Python SDK.<\/p>\n<p>In this notebook example, we use the SageMaker Python SDK.<\/p>\n<ol>\n<li>Create the container definition with both the Triton server container and the model artifact uploaded to the S3 bucket:\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">container = {\n    'Image': triton_image_uri,\n    'ModelDataUrl': model_uri,\n    'Environment': {\n        
'SAGEMAKER_TRITON_DEFAULT_MODEL_NAME': 'resnet'\n    }\n}<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Create a SageMaker model definition using the container definition from the previous step:\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">create_model_response = sm.create_model(\n    ModelName         = sm_model_name,\n    ExecutionRoleArn  = role,\n    PrimaryContainer  = container)\nprint(\"Model Arn: \" + create_model_response['ModelArn'])<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Create an endpoint configuration by specifying the instance type and number of instances you want in the endpoint:\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">endpoint_config_name = 'triton-resnet-pt-' + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n\ncreate_endpoint_config_response = sm.create_endpoint_config(\n    EndpointConfigName = endpoint_config_name,\n    ProductionVariants = [{\n        'InstanceType'        : 'ml.g4dn.4xlarge',\n        'InitialVariantWeight': 1,\n        'InitialInstanceCount': 1,\n        'ModelName'           : sm_model_name,\n        'VariantName'         : 'AllTraffic'}])\n\nprint(\"Endpoint Config Arn: \" + create_endpoint_config_response['EndpointConfigArn'])<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Create the endpoint by running the following commands:\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">endpoint_name = 'triton-resnet-pt-' + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n\ncreate_endpoint_response = sm.create_endpoint(\n    EndpointName         = endpoint_name,\n    EndpointConfigName   = endpoint_config_name)<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>SageMaker helps developers and organizations across all industries adopt and deploy AI models in applications by providing an easy-to-use, fully managed development and deployment platform. 
With Triton Inference Server containers, organizations can further streamline their model deployment in SageMaker with a single high-performance inference serving solution for multiple frameworks on GPUs and CPUs.<\/p>\n<p>We invite you to try Triton Inference Server containers in SageMaker, and share your feedback and questions in the comments.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/05\/07\/Santosh-Bhavani.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-24343 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/05\/07\/Santosh-Bhavani.jpg\" alt=\"\" width=\"100\" height=\"130\"><\/a>Santosh Bhavani<\/strong>\u00a0is a Senior Technical Product Manager with the Amazon SageMaker Elastic Inference team. He focuses on helping SageMaker customers accelerate model inference and deployment. In his spare time, he enjoys traveling, playing tennis, and drinking lots of Pu\u2019er tea.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/07\/08\/qingwei-li-100.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-13650 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/07\/08\/qingwei-li-100.jpg\" alt=\"\" width=\"100\" height=\"134\"><\/a>Qingwei Li<\/strong>\u00a0is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor\u2019s research grant account and failed to deliver the Nobel Prize he promised. Currently, he helps customers in the financial services and insurance industries build machine learning solutions on AWS. 
In his spare time, he likes reading and teaching.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/Jiahong.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-30455 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/Jiahong.jpg\" alt=\"\" width=\"100\" height=\"100\"><\/a>Jiahong Liu<\/strong> is a Solution Architect on the NVIDIA Cloud Service Provider team. He assists clients in adopting machine learning and artificial intelligence solutions that leverage the power of NVIDIA\u2019s GPUs to address training and inference challenges in business. In his leisure time, he enjoys origami, DIY projects, and playing basketball.<\/p>\n<p><strong>\u00a0<img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-30533 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/10\/Eliuth.jpg\" alt=\"\" width=\"100\" height=\"125\">Eliuth Triuna Isaza<\/strong>\u00a0is a Developer Relations Manager on the NVIDIA-AWS team. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon ML\/DL workloads, EC2 products, and AWS AI services. In addition, Eliuth is a passionate mountain biker, skier, and poker player.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-30532 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/10\/Aaqib-Ansari.jpg\" alt=\"\" width=\"100\" height=\"134\"><\/strong><strong>Aaqib Ansari<\/strong> is a Software Development Engineer with the Amazon SageMaker Inference team. He focuses on helping SageMaker customers accelerate model inference and deployment. 
In his spare time, he enjoys hiking, running, photography and sketching.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker\/<\/p>\n","protected":false},"author":0,"featured_media":1178,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1177"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1177"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1177\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1178"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1177"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1177"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1177"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}