{"id":1439,"date":"2022-01-04T20:47:43","date_gmt":"2022-01-04T20:47:43","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/01\/04\/take-advantage-of-advanced-deployment-strategies-using-amazon-sagemaker-deployment-guardrails\/"},"modified":"2022-01-04T20:47:43","modified_gmt":"2022-01-04T20:47:43","slug":"take-advantage-of-advanced-deployment-strategies-using-amazon-sagemaker-deployment-guardrails","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/01\/04\/take-advantage-of-advanced-deployment-strategies-using-amazon-sagemaker-deployment-guardrails\/","title":{"rendered":"Take advantage of advanced deployment strategies using Amazon SageMaker deployment guardrails"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2021\/11\/new-deployment-guardrails-amazon-sagemaker-inference-endpoints\/\" target=\"_blank\" rel=\"noopener noreferrer\">Deployment guardrails<\/a> in <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> provide a new set of deployment capabilities allowing you to implement advanced deployment strategies that minimize risk when deploying new model versions on SageMaker hosting. Depending on your use case, you can use a variety of deployment strategies to release new model versions. Each of these strategies relies on a mechanism to shift inference traffic to one or more versions of a deployed model. The chosen strategy depends on your business requirements for your machine learning (ML) use case. However, any strategy should include the ability to monitor the performance of new model versions and automatically roll back to a previous version as needed to minimize potential risk of introducing a new model version with errors. 
Deployment guardrails offer advanced deployment capabilities and, as of this writing, support two new traffic shifting policies, <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/deployment-guardrails-blue-green-canary.html\" target=\"_blank\" rel=\"noopener noreferrer\">canary<\/a> and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/deployment-guardrails-blue-green-linear.html\" target=\"_blank\" rel=\"noopener noreferrer\">linear<\/a>, as well as the ability to automatically roll back when issues are detected.<\/p>\n<p>As part of your MLOps strategy to create repeatable and reliable mechanisms to deploy your models, you should also ensure that the chosen deployment strategy is implemented as part of your automated deployment pipeline. Deployment guardrails use the existing SageMaker <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/APIReference\/API_CreateEndpoint.html\" target=\"_blank\" rel=\"noopener noreferrer\">CreateEndpoint<\/a> and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/APIReference\/API_UpdateEndpoint.html\" target=\"_blank\" rel=\"noopener noreferrer\">UpdateEndpoint<\/a> APIs, so you can modify your existing deployment pipeline configurations to take advantage of the new deployment capabilities.<\/p>\n<p>In this post, we show you how to use the new deployment guardrail capabilities to deploy your model versions using both canary and linear deployment strategies.<\/p>\n<h2>Solution overview<\/h2>\n<p><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/deploy-model.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker inference<\/a> provides managed deployment strategies for testing new versions of your models in production. 
We cover two new traffic shifting policies in this post: <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/deployment-guardrails-blue-green-canary.html\" target=\"_blank\" rel=\"noopener noreferrer\">canary<\/a> and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/deployment-guardrails-blue-green-linear.html\" target=\"_blank\" rel=\"noopener noreferrer\">linear<\/a>. For each of these traffic shifting modes, two HTTPS endpoints are provisioned; using two endpoints reduces deployment risk as traffic is shifted from the original endpoint variant to the new endpoint variant. You configure the endpoints to contain one or more compute instances that host your trained model and serve inference requests. SageMaker manages the routing of traffic between the two endpoints. You define <a href=\"https:\/\/aws.amazon.com\/cloudwatch\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudWatch<\/a> metrics and alarms to monitor the new endpoint for a set baking period while traffic is shifted. If a CloudWatch alarm is triggered, SageMaker performs an auto-rollback and routes all traffic back to the original endpoint variant. If no CloudWatch alarms are triggered, the original endpoint variant is stopped and the new endpoint variant continues to receive all traffic. 
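One practical consequence of this flow is that you can tell whether an update succeeded or was rolled back by polling DescribeEndpoint and comparing the live endpoint config with the one you requested. The following is a minimal sketch (not part of the original walkthrough); `wait_for_endpoint_update` is our own helper name, and `sm` is assumed to be a boto3 SageMaker client:

```python
import time

def wait_for_endpoint_update(sm, endpoint_name, poll_interval_sec=30):
    """Poll DescribeEndpoint until the endpoint leaves a transitional state.

    `sm` is a boto3 SageMaker client, e.g. boto3.client("sagemaker").
    Returns the final status and the endpoint config that is live at the end.
    """
    while True:
        desc = sm.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]  # e.g. Creating, Updating, RollingBack, InService, Failed
        if status not in ("Creating", "Updating", "RollingBack"):
            return status, desc["EndpointConfigName"]
        time.sleep(poll_interval_sec)
```

During an auto-rollback, the status passes through RollingBack, and the config name returned at the end is the original one rather than the config you passed to UpdateEndpoint.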
The following diagrams illustrate shifting traffic to the new endpoint.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image001.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31846\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image001.png\" alt=\"\" width=\"830\" height=\"475\"><\/a><\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image003.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31847\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image003.png\" alt=\"\" width=\"830\" height=\"485\"><\/a> <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image005.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31848\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image005.png\" alt=\"\" width=\"721\" height=\"485\"><\/a><\/p>\n<p>Let\u2019s dive deeper into examples of the canary and linear traffic shifting policies.<\/p>\n<p>We go over the following high-level steps as part of the deployment procedure:<\/p>\n<ol>\n<li>Create the model and endpoint configurations required for the three scenarios: the baseline, the update containing the incompatible model version, and the update with the correct model version.<\/li>\n<li>Invoke the baseline endpoint prior to the update.<\/li>\n<li>Specify the CloudWatch alarms used to trigger the rollbacks.<\/li>\n<li>Update the endpoint to trigger a rollback using either the canary or linear strategy.<\/li>\n<\/ol>\n<p>First, let\u2019s start with canary deployment.<\/p>\n<h2>Canary deployment<\/h2>\n<p>The canary deployment 
option lets you shift one small portion of your traffic (a <em>canary<\/em>) to the green fleet and monitor it for a baking period. If the canary succeeds on the green fleet, the rest of the traffic is shifted from the blue fleet to the green fleet before stopping the blue fleet.<\/p>\n<p>To demonstrate canary deployments and the auto-rollback feature, we update an endpoint with an incompatible model version and deploy it as a canary fleet, taking a small percentage of the traffic. Requests sent to this canary fleet result in errors, which trigger a rollback using preconfigured CloudWatch alarms. We also demonstrate a success scenario where no alarms are tripped and the update succeeds.<\/p>\n<h3>Create and deploy the models<\/h3>\n<p>First, we upload our pre-trained models to <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3). These models were trained using the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/master\/introduction_to_applying_machine_learning\/xgboost_customer_churn\/xgboost_customer_churn.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">XGBoost churn prediction notebook<\/a> in SageMaker. You can also use your own pre-trained models in this step. If you already have a pre-trained model in Amazon S3, you can add it by specifying the <code>s3_key<\/code>.<\/p>\n<p>The models in this example are used to predict the probability of a mobile customer leaving their current mobile operator. The dataset we use is publicly available and was mentioned in the book <a href=\"https:\/\/www.amazon.com\/dp\/0470908742\/\" target=\"_blank\" rel=\"noopener noreferrer\">Discovering Knowledge in Data<\/a> by Daniel T. 
Larose.<\/p>\n<p>Upload the models with the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">model_url = S3Uploader.upload(local_path=\"model\/xgb-churn-prediction-model.tar.gz\",\n                              desired_s3_uri=f\"s3:\/\/{bucket}\/{prefix}\")\nmodel_url2 = S3Uploader.upload(local_path=\"model\/xgb-churn-prediction-model2.tar.gz\",\n                               desired_s3_uri=f\"s3:\/\/{bucket}\/{prefix}\")<\/code><\/pre>\n<\/p><\/div>\n<p>Next, we create our model definitions. We start by deploying the pre-trained churn prediction models. Here, we create the model objects with the image and model data. The three URIs correspond to the baseline version, the update containing the incompatible version, and the update containing the correct model version:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">image_uri = image_uris.retrieve('xgboost', boto3.Session().region_name, '0.90-1')\n\n# image_uri2 uses a newer version of XGBoost that is incompatible with the model artifact,\n# in order to simulate model faults\nimage_uri2 = image_uris.retrieve('xgboost', boto3.Session().region_name, '1.2-1')\nimage_uri3 = image_uris.retrieve('xgboost', boto3.Session().region_name, '0.90-2')\nmodel_name = f\"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}\"\nmodel_name2 = f\"DEMO-xgb-churn-pred2-{datetime.now():%Y-%m-%d-%H-%M-%S}\"\nmodel_name3 = f\"DEMO-xgb-churn-pred3-{datetime.now():%Y-%m-%d-%H-%M-%S}\"\n\nresp = sm.create_model(\n    ModelName=model_name,\n    ExecutionRoleArn=role,\n    Containers=[{\n       'Image': image_uri,\n       'ModelDataUrl': model_url\n     }])\n\nresp = sm.create_model(\n    ModelName=model_name2,\n    ExecutionRoleArn=role,\n    Containers=[{\n       'Image': image_uri2,\n       'ModelDataUrl': model_url2\n     }])\n\nresp = sm.create_model(\n    ModelName=model_name3,\n    ExecutionRoleArn=role,\n    Containers=[{\n       'Image': image_uri3,\n       'ModelDataUrl': model_url2\n     }])<\/code><\/pre>\n<\/p><\/div>\n<p>Now that the three models are created, we create 
the three endpoint configs:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">ep_config_name = f\"DEMO-EpConfig-1-{datetime.now():%Y-%m-%d-%H-%M-%S}\" \nep_config_name2 = f\"DEMO-EpConfig-2-{datetime.now():%Y-%m-%d-%H-%M-%S}\" \nep_config_name3 = f\"DEMO-EpConfig-3-{datetime.now():%Y-%m-%d-%H-%M-%S}\"\n\nresp = sm.create_endpoint_config(\n     EndpointConfigName=ep_config_name,\n     ProductionVariants=[\n        {\n          'VariantName': \"AllTraffic\",\n          'ModelName': model_name,\n          'InstanceType': \"ml.m5.xlarge\",\n          \"InitialInstanceCount\": 3\n        }\n      ])\n\nresp = sm.create_endpoint_config(\n     EndpointConfigName=ep_config_name2,\n     ProductionVariants=[\n        {\n          'VariantName': \"AllTraffic\",\n          'ModelName': model_name2,\n          'InstanceType': \"ml.m5.xlarge\",\n          \"InitialInstanceCount\": 3\n        }\n      ])\n\n\nresp = sm.create_endpoint_config(\n      EndpointConfigName=ep_config_name3,\n      ProductionVariants=[\n         {\n           'VariantName': \"AllTraffic\",\n           'ModelName': model_name3,\n           'InstanceType': \"ml.m5.xlarge\",\n           \"InitialInstanceCount\": 3\n         }\n     ])<\/code><\/pre>\n<\/p><\/div>\n<p>We then deploy the baseline model to a SageMaker endpoint:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">resp = sm.create_endpoint(\n          EndpointName=endpoint_name,\n          EndpointConfigName=ep_config_name\n)<\/code><\/pre>\n<\/p><\/div>\n<h3>Invoke the endpoint<\/h3>\n<p>This step invokes the endpoint with sample data with a maximum invocations count and waiting intervals. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">def invoke_endpoint(endpoint_name, max_invocations=300, wait_interval_sec=1, should_raise_exp=False):\n    print(f\"Sending test traffic to the endpoint {endpoint_name}. 
\\nPlease wait...\")\n \n    count = 0\n    with open('test_data\/test-dataset-input-cols.csv', 'r') as f:\n        for row in f:\n            payload = row.rstrip('\\n')\n            try:\n                response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name,\n                                                      ContentType='text\/csv', \n                                                      Body=payload)\n                response['Body'].read()\n                print(\".\", end=\"\", flush=True)\n            except Exception as e:\n                print(\"E\", end=\"\", flush=True)\n                if should_raise_exp:\n                    raise e\n            count += 1\n            if count &gt; max_invocations:\n                break\n            time.sleep(wait_interval_sec)\n \n    print(\"\\nDone!\")\n \ninvoke_endpoint(endpoint_name, max_invocations=100)<\/code><\/pre>\n<\/p><\/div>\n<p>For a full list of metrics, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/monitoring-cloudwatch.html\" target=\"_blank\" rel=\"noopener noreferrer\">Monitor Amazon SageMaker with Amazon CloudWatch<\/a>.<\/p>\n<p>Then we plot graphs to show the metrics <code>Invocations<\/code>, <code>Invocation4XXErrors<\/code>, <code>Invocation5XXErrors<\/code>, <code>ModelLatency<\/code>, and <code>OverheadLatency<\/code> against the endpoint over time.<\/p>\n<p>You can observe a flat line for <code>Invocation4XXErrors<\/code> and <code>Invocation5XXErrors<\/code> because we\u2019re using the correct model version and configs. 
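The metric series behind these graphs can be pulled from CloudWatch directly. Here is a hedged sketch (not from the original post) of one way to fetch a single endpoint metric; `cw` is assumed to be a boto3 CloudWatch client, and the helper name is our own:

```python
from datetime import datetime, timedelta, timezone

def endpoint_metric(cw, endpoint_name, variant_name, metric_name,
                    statistic="Sum", minutes=60):
    """Fetch one SageMaker endpoint metric as (timestamp, value) pairs.

    `cw` is a boto3 CloudWatch client, e.g. boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric_name,  # Invocations, Invocation5XXErrors, ModelLatency, ...
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=[statistic],
    )
    # Datapoints come back unordered; sort by timestamp before plotting
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p[statistic]) for p in points]
```

Each returned series can then be plotted against time, for example with matplotlib.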
Additionally, <code>ModelLatency<\/code> and <code>OverheadLatency<\/code> start decreasing over time.<\/p>\n<h3>Create CloudWatch alarms to monitor endpoint performance<\/h3>\n<p>We create CloudWatch alarms to monitor endpoint performance with the metrics <code>Invocation5XXErrors<\/code> and <code>ModelLatency<\/code>.<\/p>\n<p>We use metric dimensions <code>EndpointName<\/code> and <code>VariantName<\/code> to select the metric for each endpoint config and variant. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">def create_auto_rollback_alarm(alarm_name, endpoint_name, variant_name, metric_name, statistic, threshold):\n    cw.put_metric_alarm(\n        AlarmName=alarm_name,\n        AlarmDescription='Test SageMaker endpoint deployment auto-rollback alarm',\n        ActionsEnabled=False,\n        Namespace='AWS\/SageMaker',\n        MetricName=metric_name,\n        Statistic=statistic,\n        Dimensions=[\n            {'Name': 'EndpointName', 'Value': endpoint_name},\n            {'Name': 'VariantName', 'Value': variant_name}\n        ],\n        Period=60,\n        EvaluationPeriods=1,\n        Threshold=threshold,\n        ComparisonOperator='GreaterThanOrEqualToThreshold',\n        TreatMissingData='notBreaching'\n    )\n\n# alarm on 1% 5xx error rate for 1 minute\ncreate_auto_rollback_alarm(error_alarm, endpoint_name, 'AllTraffic', 'Invocation5XXErrors', 'Average', 1)\n# alarm on model latency &gt;= 10 ms for 1 minute\ncreate_auto_rollback_alarm(latency_alarm, endpoint_name, 'AllTraffic', 'ModelLatency', 'Average', 10000)<\/code><\/pre>\n<\/p><\/div>\n<h3>Update the endpoint with deployment configurations<\/h3>\n<p>We define the following deployment configuration to perform a blue\/green update strategy with canary traffic shifting from the old to the new stack. The canary traffic shifting option can reduce the blast radius of a regressive update to the endpoint. In contrast, with the all-at-once traffic shifting option, 100% of invocation requests start faulting as soon as the traffic flips. 
In canary mode, invocation requests are shifted to the new version of the model gradually, preventing errors from impacting 100% of the traffic. Additionally, the auto-rollback alarms monitor the metrics during the canary stage.<\/p>\n<p>The following diagram illustrates the workflow of our rollback use case.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image017.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31855\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image017.png\" alt=\"\" width=\"2667\" height=\"492\"><\/a><\/p>\n<p>We update the endpoint with an incompatible model version to simulate errors and trigger a rollback:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">canary_deployment_config = {\n    \"BlueGreenUpdatePolicy\": {\n        \"TrafficRoutingConfiguration\": {\n            \"Type\": \"CANARY\",\n            \"CanarySize\": {\n                \"Type\": \"INSTANCE_COUNT\", # or use \"CAPACITY_PERCENT\" as 30%, 50%\n                \"Value\": 1\n            },\n            \"WaitIntervalInSeconds\": 300, # wait for 5 minutes before enabling traffic on the rest of fleet\n        },\n        \"TerminationWaitInSeconds\": 120, # wait for 2 minutes before terminating the old stack\n        \"MaximumExecutionTimeoutInSeconds\": 1800 # maximum timeout for deployment\n    },\n    \"AutoRollbackConfiguration\": {\n        \"Alarms\": [\n            {\n                \"AlarmName\": error_alarm\n            },\n            {\n                \"AlarmName\": latency_alarm\n            }\n        ],\n    }\n}\n \n# update endpoint request with new DeploymentConfig parameter\nsm.update_endpoint(\n    EndpointName=endpoint_name,\n    EndpointConfigName=ep_config_name2,\n    DeploymentConfig=canary_deployment_config\n)<\/code><\/pre>\n<\/p><\/div>\n<p>When we invoke the 
endpoint, we encounter errors because of the incompatible version of the model (<code>ep_config_name2<\/code>), and this leads to the rollback to a stable version of the model (<code>ep_config_name<\/code>). This is reflected in the following graphs as <code>Invocation5XXErrors<\/code> and <code>ModelLatency<\/code> increase during this rollback phase.<\/p>\n<p>The following diagram shows a success case where we use the same canary deployment configuration but a valid endpoint configuration.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image025.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-31860 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image025.png\" alt=\"\" width=\"1276\" height=\"654\"><\/a><\/p>\n<p>We update the endpoint configuration to a valid version (using the same canary deployment config as the rollback case):<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\"># update endpoint with a valid version of DeploymentConfig\nsm.update_endpoint(\n    EndpointName=endpoint_name,\n    EndpointConfigName=ep_config_name3,\n    RetainDeploymentConfig=True\n)<\/code><\/pre>\n<\/p><\/div>\n<p>We plot graphs to show the <code>Invocations<\/code>, <code>Invocation5XXErrors<\/code>, and <code>ModelLatency<\/code> metrics against the endpoint. When the new <code>endpoint config-3<\/code> (correct model version) starts getting deployed, it takes over from endpoint <code>config-2<\/code> (incompatible due to model version) without any errors. 
We can see this in the graphs as <code>Invocation5XXErrors<\/code> and <code>ModelLatency<\/code> decrease during this transition phase.<\/p>\n<p>Next, let\u2019s see how linear deployments are configured and how they work.<\/p>\n<h2>Linear deployment<\/h2>\n<p>The linear deployment option provides even more customization over how many traffic-shifting steps to make and what percentage of traffic to shift for each step. Whereas canary shifting lets you shift traffic in two steps, linear shifting extends this to <em>n<\/em> linearly spaced steps.<\/p>\n<p>To demonstrate linear deployments and the auto-rollback feature, we update an endpoint with an incompatible model version and deploy it as a linear fleet, taking a small percentage of the traffic. Requests sent to this linear fleet result in errors, which trigger a rollback using preconfigured CloudWatch alarms. We also demonstrate a success scenario where no alarms are tripped and the update succeeds.<\/p>\n<p>The steps to create the models, invoke the endpoint, and create the CloudWatch alarms are the same as with the canary method.<\/p>\n<p>We define the following deployment configuration to perform a blue\/green update strategy with linear traffic shifting from the old to the new stack. The linear traffic shifting option can reduce the blast radius of a regressive update to the endpoint. In contrast, with the all-at-once traffic shifting option, 100% of invocation requests start faulting as soon as the traffic flips. In linear mode, invocation requests are shifted to the new version of the model gradually, with a controlled percentage of traffic shifting for each step. 
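To make the two policies concrete, the following illustrative snippet (a simplified model of the schedules, not a SageMaker API) computes the cumulative share of traffic on the new (green) fleet after each step:

```python
def linear_shift_schedule(step_percent):
    """Cumulative percentage of traffic on the green fleet after each linear
    step, capped at 100%. Illustrative only."""
    schedule, shifted = [], 0
    while shifted < 100:
        shifted = min(shifted + step_percent, 100)
        schedule.append(shifted)
    return schedule

def canary_shift_schedule(canary_percent):
    """Canary shifts a small slice first, then the remainder in one step."""
    return [canary_percent, 100]

print(linear_shift_schedule(33))  # [33, 66, 99, 100]
print(canary_shift_schedule(10))  # [10, 100]
```

Under this simplified model, a 33% linear step size climbs to full traffic in four steps, whereas a canary policy jumps from its small canary slice straight to 100% once the baking period passes.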
You can use the auto-rollback alarms to monitor the metrics during the linear traffic shifting stage.<\/p>\n<p>The following diagram shows the workflow for our linear rollback case.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image033.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31864\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image033.png\" alt=\"\" width=\"1276\" height=\"517\"><\/a><\/p>\n<p>We update the endpoint with an incompatible model version to simulate errors and trigger a rollback:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">linear_deployment_config = {\n    \"BlueGreenUpdatePolicy\": {\n        \"TrafficRoutingConfiguration\": {\n            \"Type\": \"LINEAR\",\n            \"LinearStepSize\": {\n                \"Type\": \"CAPACITY_PERCENT\",\n                \"Value\": 33, # 33% of whole fleet capacity (33% * 3 = 1 instance)\n            },\n            \"WaitIntervalInSeconds\": 180, # wait for 3 minutes before enabling traffic on the rest of fleet\n        },\n        \"TerminationWaitInSeconds\": 120, # wait for 2 minutes before terminating the old stack\n        \"MaximumExecutionTimeoutInSeconds\": 1800 # maximum timeout for deployment\n    },\n    \"AutoRollbackConfiguration\": {\n        \"Alarms\": [\n            {\n                \"AlarmName\": error_alarm\n            },\n            {\n                \"AlarmName\": latency_alarm\n            }\n        ],\n    }\n}\n \n# update endpoint request with new DeploymentConfig parameter\nsm.update_endpoint(\n    EndpointName=endpoint_name,\n    EndpointConfigName=ep_config_name2,\n    DeploymentConfig=linear_deployment_config\n)<\/code><\/pre>\n<\/p><\/div>\n<p>When we invoke the endpoint, we encounter errors because of the incompatible version of the model 
(<code>ep_config_name2<\/code>), which leads to the rollback to a stable version of the model (<code>ep_config_name<\/code>). We can see this in the following graphs as the <code>Invocation5XXErrors<\/code> and <code>ModelLatency<\/code> metrics increase during this rollback phase.<\/p>\n<p>Let\u2019s look at a success case where we use the same linear deployment configuration but a valid endpoint configuration. The following diagram illustrates our workflow.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image041.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31868\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/ML-6319-image041.png\" alt=\"\" width=\"1276\" height=\"516\"><\/a><\/p>\n<p>We update the endpoint to a valid endpoint configuration version with the same linear deployment configuration:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\"># update endpoint with a valid version of DeploymentConfig\nsm.update_endpoint(\n    EndpointName=endpoint_name,\n    EndpointConfigName=ep_config_name3,\n    RetainDeploymentConfig=True\n)<\/code><\/pre>\n<\/p><\/div>\n<p>Then we plot graphs to show the <code>Invocations<\/code>, <code>Invocation5XXErrors<\/code>, and <code>ModelLatency<\/code> metrics against the endpoint.<\/p>\n<p>As the new endpoint <code>config-3<\/code> (correct model version) starts getting deployed, it takes over from endpoint <code>config-2<\/code> (incompatible due to model version) without any errors. 
We can see this in the following graphs as <code>Invocation5XXErrors<\/code> and <code>ModelLatency<\/code> decrease during this transition phase.<\/p>\n<h2>Considerations and best practices<\/h2>\n<p>Now that we\u2019ve walked through a comprehensive example, let\u2019s recap some best practices and considerations:<\/p>\n<ul>\n<li><strong>Pick the right health check<\/strong> \u2013 The CloudWatch alarms determine whether the traffic shift to the new endpoint variant succeeds. In our example, we used <code>Invocation5XXErrors<\/code> (caused by the endpoint failing to return a valid result) and <code>ModelLatency<\/code>, which measures how long the model takes to return a response. You can consider other built-in metrics in some cases, like <code>OverheadLatency<\/code>, which accounts for other causes of latency, such as unusually large response payloads. You can also have your inference code record custom metrics, and you can configure the alarm measurement evaluation interval. For more information about available metrics, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Endpoint Invocation Metrics<\/a>.<\/li>\n<li><strong>Pick the most suitable traffic shifting policy<\/strong> \u2013 The all-at-once policy is a good choice if you just want to make sure that the new endpoint variant is healthy and able to serve traffic. The canary policy is useful if you want to avoid affecting too much traffic if the new endpoint variant has a problem, or if you want to evaluate a custom metric on a small percentage of traffic before shifting over. For example, perhaps you want to emit a custom metric that checks the inference response distribution, and make sure it falls within expected ranges. 
The linear policy is a more conservative and more complex take on the canary pattern.<\/li>\n<li><strong>Monitor the alarms<\/strong> \u2013 The alarms you use to trigger rollback should also cause other actions, like notifying an operations team.<\/li>\n<li><strong>Use the same deployment strategy in multiple environments<\/strong> \u2013 As part of an overall MLOps pipeline, use the same deployment strategy in test as well as production environments, so that you become comfortable with the behavior. This consideration requires that you be able to inject realistic load onto your test endpoints.<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>In this post, we introduced SageMaker inference\u2019s new deployment guardrail options, which let you manage deployment of a new model version in a safe and controlled way. We reviewed the new traffic shifting policies, canary and linear, and showed how to use them in a realistic example. Finally, we discussed some best practices and considerations. Get started today with deployment guardrails on the SageMaker console or, for more information, review <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/deployment-guardrails.html\" target=\"_blank\" rel=\"noopener noreferrer\">Deployment Guardrails<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/Raghu-Ramesha-1.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-31886 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/17\/Raghu-Ramesha-1.jpg\" alt=\"\" width=\"100\" height=\"140\"><\/a>Raghu Ramesha<\/strong> is an ML Solutions Architect with the Amazon SageMaker Services SA team. He focuses on helping customers migrate ML production workloads to SageMaker at scale. 
He specializes in machine learning, AI, and computer vision domains, and holds a master\u2019s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2019\/11\/25\/shelbees-100.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-10409 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2019\/11\/25\/shelbees-100.jpg\" alt=\"\" width=\"100\" height=\"148\"><\/a>Shelbee Eigenbrode<\/strong> is a Principal AI and Machine Learning Specialist Solutions Architect at Amazon Web Services (AWS). She has been in technology for 24 years spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background into the domain of MLOps to help customers deliver and manage ML workloads at scale. With over 35 patents granted across various technology domains, she has a passion for continuous innovation and using data to drive business outcomes. Shelbee is a co-creator and instructor of the Practical Data Science specialization on Coursera. She is also the Co-Director of Women In Big Data (WiBD), Denver chapter. In her spare time, she likes to spend time with her family, friends, and overactive dogs.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/07\/30\/Randy-DeFauw.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-26720 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/07\/30\/Randy-DeFauw.png\" alt=\"\" width=\"100\" height=\"134\"><\/a>Randy DeFauw<\/strong> is a Principal Solutions Architect. He\u2019s an electrical engineer by training who\u2019s been working in technology for 23 years at companies ranging from startups to large defense firms. 
A fascination with distributed consensus systems led him into the big data space, where he discovered a passion for analytics and machine learning. He started using AWS in his Hadoop days, where he saw how easy it was to set up large complex infrastructure, and then realized that the cloud solved some of the challenges he saw with Hadoop. Randy picked up an MBA so he could learn how business leaders think and talk, and found that the soft skill classes were some of the most interesting ones he took. Lately, he\u2019s been dabbling with reinforcement learning as a way to tackle optimization problems, and re-reading Martin Kleppmann\u2019s book on data intensive design.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-32067 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/03\/Lauren-Mullennex.jpg\" alt=\"\" width=\"99\" height=\"127\">Lauren Mullennex<\/strong> is a Solutions Architect based in Denver, CO. She works with customers to help them architect solutions on AWS. 
In her spare time, she enjoys hiking and cooking Hawaiian cuisine.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/take-advantage-of-advanced-deployment-strategies-using-amazon-sagemaker-deployment-guardrails\/<\/p>\n","protected":false},"author":0,"featured_media":1440,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1439"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1439"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1439\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1440"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1439"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1439"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1439"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}