{"id":1141,"date":"2021-11-03T08:40:21","date_gmt":"2021-11-03T08:40:21","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/03\/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected\/"},"modified":"2021-11-03T08:40:21","modified_gmt":"2021-11-03T08:40:21","slug":"automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/03\/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected\/","title":{"rendered":"Automate model retraining with Amazon SageMaker Pipelines when drift is detected"},"content":{"rendered":"<div id=\"\">\n<p>Training your machine learning (ML) model and serving predictions is usually not the end of the ML project. The accuracy of ML models can deteriorate over time, a phenomenon known as <em>model drift<\/em>. Many factors can cause model drift, such as changes in model features. The accuracy of ML models can also be affected by <em>concept drift<\/em>, the difference between data used to train models and data used during inference. Therefore, teams must make sure that their models and solutions remain relevant and continue providing value back to the business. Without proper metrics, alarms, and automation in place, the technical debt from simply maintaining existing ML models in production can become overwhelming.<\/p>\n<p><a href=\"https:\/\/aws.amazon.com\/sagemaker\/pipelines\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Pipelines<\/a> is a native <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/workflows.html\" target=\"_blank\" rel=\"noopener noreferrer\">workflow orchestration tool<\/a> for building ML pipelines that take advantage of direct <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> integration. 
Three components improve the operational resilience and reproducibility of your ML workflows: pipelines, model registry, and projects. These workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.<\/p>\n<p>In this post, we discuss how to automate retraining with pipelines in SageMaker when model drift is detected.<\/p>\n<h2>From static models to continuous training<\/h2>\n<p>Static models are a great place to start when you\u2019re experimenting with ML. However, because real-world data is always changing, static models degrade over time, and your training dataset won\u2019t represent real behavior for long. Having an effective deployment <a href=\"https:\/\/docs.aws.amazon.com\/wellarchitected\/latest\/machine-learning-lens\/ml-lifecycle-phase-monitoring.html\" target=\"_blank\" rel=\"noopener noreferrer\">model monitoring phase<\/a> is an important step when building an MLOps pipeline. It\u2019s also one of the most challenging aspects of MLOps because it requires having an effective feedback loop between the data captured by your production system and the data distribution used during the training phase. For model retraining to be effective, you also must be continuously updating your training dataset with new ground truth labels. You might be able to use implicit or explicit feedback from users based on the predictions you provide, such as in the case of recommendations. Alternatively, you may need to introduce a human in the loop workflow through a service like <a href=\"https:\/\/aws.amazon.com\/augmented-ai\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Augmented AI <\/a>(Amazon A2I) to qualify the accuracy of predictions from your ML system. 
Other considerations are to monitor predictions for bias on a regular basis, which can be supported through <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/clarify-model-monitor-bias-drift.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Clarify<\/a>.<\/p>\n<p>In this post, we propose a solution that focuses on <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-monitor-data-quality.html\" target=\"_blank\" rel=\"noopener noreferrer\">data quality monitoring<\/a> to detect concept drift in the production data and retrain your model automatically.<\/p>\n<h2>Solution overview<\/h2>\n<p>Our solution uses an open-source <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> template to create a model build and deployment pipeline. We use Pipelines and supporting AWS services, including <a href=\"https:\/\/aws.amazon.com\/codepipeline\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CodePipeline<\/a>, <a href=\"https:\/\/aws.amazon.com\/codebuild\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CodeBuild<\/a>, and <a href=\"https:\/\/aws.amazon.com\/eventbridge\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EventBridge<\/a>.<\/p>\n<p>The following diagram illustrates our architecture.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image001.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29667\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image001.jpg\" alt=\"\" width=\"1895\" height=\"1091\"><\/a><\/p>\n<p>The following are the high-level steps for this solution:<\/p>\n<ol>\n<li>Create the new <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a> <a 
href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-projects-whatis.html\" target=\"_blank\" rel=\"noopener noreferrer\">project<\/a> based on the custom template.<\/li>\n<li>Create a SageMaker pipeline to perform data preprocessing, generate baseline statistics, and train a model.<\/li>\n<li>Register the model in the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-registry.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Model Registry<\/a>.<\/li>\n<li>The data scientist verifies the model\u2019s metrics and performance, and approves the model.<\/li>\n<li>The model is deployed to a real-time endpoint in staging and, after approval, to a production endpoint.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-monitor.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Model Monitor<\/a> is configured on the production endpoint to detect a <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-monitor-data-quality.html\" target=\"_blank\" rel=\"noopener noreferrer\">concept drift of the data<\/a> with respect to the training baseline.<\/li>\n<li>Model Monitor is scheduled to run every hour, and publishes metrics to <a href=\"http:\/\/aws.amazon.com\/cloudwatch\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudWatch<\/a>.<\/li>\n<li>A CloudWatch alarm is raised when metrics exceed a model-specific threshold. This results in an EventBridge rule starting the model build pipeline.<\/li>\n<li>An EventBridge rule that runs on a schedule can also start the model build pipeline, so the model is retrained periodically.<\/li>\n<\/ol>\n<h2>Dataset<\/h2>\n<p>This solution uses the <a href=\"https:\/\/registry.opendata.aws\/nyc-tlc-trip-records-pds\/\" target=\"_blank\" rel=\"noopener noreferrer\">New York City Taxi and Limousine Commission (TLC) Trip Record Data<\/a> public dataset to train a model to predict taxi fare based on the information available for that trip. 
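<\/p>
<p>To give a feel for the features involved, the following is a minimal, illustrative sketch of how a single trip record could be reduced to model inputs. This is an assumption for exposition, not the repository\u2019s exact preprocessing code: a haversine great-circle distance plus simple datetime fields, with assumed feature names such as <code>geo_distance<\/code>.<\/p>
<div class=\"hide-language\">
<pre><code class=\"lang-python\">import math
from datetime import datetime

def engineer_features(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat, pickup_at):
    # Haversine great-circle distance in kilometers (illustrative sketch,
    # not the repository's exact preprocessing code)
    radius_km = 6371.0
    lat1, lat2 = math.radians(pickup_lat), math.radians(dropoff_lat)
    dlat = math.radians(dropoff_lat - pickup_lat)
    dlon = math.radians(dropoff_lon - pickup_lon)
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    geo_distance = 2 * radius_km * math.asin(math.sqrt(a))
    # Datetime features derived from the pickup timestamp
    return {
        \"geo_distance\": geo_distance,
        \"hour\": pickup_at.hour,
        \"weekday\": pickup_at.weekday(),
    }<\/code><\/pre>
<\/div>
<p>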
The available information includes the start and end location and travel date and time from which we engineer datetime features and distance traveled.<\/p>\n<p>For example, in the following image, we see an example trip for location 65 (Downtown Brooklyn) to 68 (East Chelsea), which took 21 minutes and cost $20.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image003.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29668\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image003.jpg\" alt=\"\" width=\"589\" height=\"590\"><\/a><\/p>\n<p>If you\u2019re interested in understanding more about the dataset, you can use the <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-drift-detection\/blob\/main\/notebook\/nyctaxi.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">exploratory data analysis<\/a> notebook in the GitHub repository.<\/p>\n<h2>Get started<\/h2>\n<p>You can use the following quick start button to launch a CloudFormation stack to publish the custom SageMaker MLOps project template to the AWS Service Catalog:<\/p>\n<p><a href=\"https:\/\/console.aws.amazon.com\/cloudformation\/home?region=us-east-1#\/stacks\/quickcreate?templateUrl=https%3A%2F%2Faws-ml-blog.s3.amazonaws.com%2Fartifacts%2Famazon-sagemaker-drift-detection%2Fdrift-service-catalog.yml&amp;stackName=drift-pipeline&amp;param_ExecutionRoleArn=&amp;param_PortfolioName=SageMaker%20Organization%20Templates&amp;param_PortfolioOwner=administrator&amp;param_ProductVersion=1.0\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15948 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/16\/2-LaunchStack.jpg\" alt=\"\" width=\"107\" height=\"20\"><\/a><\/p>\n<h2>Create a new project in 
Studio<\/h2>\n<p>After your MLOps project template is published, you can create a new project using your new template via the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio-ui.html\" target=\"_blank\" rel=\"noopener noreferrer\">Studio UI<\/a>.<\/p>\n<ol>\n<li>In the Studio sidebar, choose <strong>SageMaker Components and registries.<\/strong><\/li>\n<li>Choose <strong>Projects<\/strong> on the drop-down menu.<\/li>\n<li>Choose <strong>Create Project<\/strong>.<\/li>\n<\/ol>\n<p>On the <strong>Create project<\/strong> page, SageMaker templates is chosen by default. This option lists the built-in templates. However, you want to use the template you published for the drift detection pipeline.<\/p>\n<ol start=\"4\">\n<li>Choose <strong>Organization templates<\/strong>.<\/li>\n<li>Choose <strong>Amazon SageMaker drift detection template for real-time deployment<\/strong>.<\/li>\n<li>Choose <strong>Select project template<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image007.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29669\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image007.jpg\" alt=\"\" width=\"1364\" height=\"601\"><\/a><\/li>\n<\/ol>\n<p>If you have recently updated your AWS Service Catalog project, you may need to refresh Studio to make sure it finds the latest version of your template.<\/p>\n<ol start=\"7\">\n<li>In the <strong>Project details<\/strong> section, for <strong>Name<\/strong>, enter <code>drift-detection<\/code>.<\/li>\n<\/ol>\n<p>Your project name must have 32 characters or fewer.<\/p>\n<ol start=\"8\">\n<li>Under <strong>Project template parameters<\/strong>, for <strong>RetrainSchedule<\/strong>, keep the default of <code>cron(0 12 1 * ? 
*)<\/code>.<\/li>\n<li>Choose <strong>Create project.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image009.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29670\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image009.jpg\" alt=\"\" width=\"927\" height=\"697\"><\/a><br \/><\/strong><\/li>\n<\/ol>\n<p>When you choose <strong>Create project<\/strong>, a CloudFormation stack is created in your account. This takes a few minutes. If you\u2019re interested in the components that are being deployed, navigate to the CloudFormation console. There you can find the stack that is being created.<\/p>\n<p>When the page reloads, on the main project page you can find a summary of all resources created in the project. For now, we need to clone the <code>sagemaker-drift-detection-build<\/code> repository.<\/p>\n<ol start=\"10\">\n<li>On the <strong>Repository <\/strong>tab, choose <strong>clone repo\u2026<\/strong> and accept the default values in the dialog box.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image011.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29671\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image011.jpg\" alt=\"\" width=\"1690\" height=\"998\"><\/a><\/li>\n<\/ol>\n<p>This clones the repository to the Studio space for your user. 
The notebook <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-drift-detection\/blob\/main\/build_pipeline\/build-pipeline.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">build-pipeline.ipynb<\/a> is provided as an entry point for you to run through the solution and to help you understand how to use it.<\/p>\n<ol start=\"11\">\n<li>Open the notebook.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image013.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29672\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image013.jpg\" alt=\"\" width=\"401\" height=\"392\"><\/a><\/li>\n<li>Choose the <strong>Python 3 (Data Science) <\/strong>kernel for the notebook<em>. <\/em><\/li>\n<\/ol>\n<p>If a Studio instance isn\u2019t already running, an instance is provisioned. This can take a couple of minutes. By default, an ml.t3.medium instance is launched, which is enough to run our notebook.<\/p>\n<ol start=\"13\">\n<li>When the notebook is open, edit the second cell with the actual project name you chose:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">project_name = \"drift-detection\"<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"14\">\n<li>Run the first couple of cells to initialize some variables we need later and to verify we have selected our project:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">import sagemaker\nimport json\n\nsess = sagemaker.session.Session()\nregion_name = sess._region_name\nsm_client = sess.sagemaker_client\nproject_id = sm_client.describe_project(ProjectName=project_name)[\"ProjectId\"]\nprint(f\"Project: {project_name} ({project_id})\")<\/code><\/pre>\n<\/p><\/div>\n<p>Next, we must define the dataset that we\u2019re using. 
In this example, we use data from <a href=\"https:\/\/www1.nyc.gov\/site\/tlc\/about\/tlc-trip-record-data.page\" target=\"_blank\" rel=\"noopener noreferrer\">NYC Taxi and Limousine Commission (TLC)<\/a>.<\/p>\n<ol start=\"15\">\n<li>Download the data from its public <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) location and upload it to the artifact bucket provisioned by the project template:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">from sagemaker.s3 import S3Downloader, S3Uploader\n\n# Download to the data folder, and upload to the pipeline input uri\ndownload_uri = \"s3:\/\/nyc-tlc\/trip data\/green_tripdata_2018-02.csv\"\nS3Downloader().download(download_uri, \"data\")\n\n# Upload the data to the input location\nartifact_bucket = f\"sagemaker-project-{project_id}-{region_name}\"\ninput_data_uri = f\"s3:\/\/{artifact_bucket}\/{project_id}\/input\"\nS3Uploader().upload(\"data\", input_data_uri)\n\nprint(\"Listing input files:\")\nfor s3_uri in S3Downloader.list(input_data_uri):\n    print(s3_uri.split(\"\/\")[-1])<\/code><\/pre>\n<\/p><\/div>\n<p>For your own custom template, you could also upload your data directly to the <code>input_data_uri<\/code> location, because this is where the pipeline expects to find the training data.<\/p>\n<h2>Run the training pipeline<\/h2>\n<p>To run the pipeline, you can continue running the cells in the notebook. 
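<\/p>
<p>Under the hood, starting a run is a single SageMaker API call. As a hedged sketch (the pipeline and display names below are the ones used elsewhere in this post, and may differ in your account), you could build the request like this and pass it to <code>sm_client.start_pipeline_execution<\/code>:<\/p>
<div class=\"hide-language\">
<pre><code class=\"lang-python\">def build_start_request(pipeline_name, display_name):
    # Kwargs for the boto3 SageMaker call start_pipeline_execution;
    # the names used here are illustrative
    return {
        \"PipelineName\": pipeline_name,
        \"PipelineExecutionDisplayName\": display_name,
    }

# With the sm_client created in the earlier notebook cells:
# sm_client.start_pipeline_execution(**build_start_request(
#     \"drift-detection-pipeline\", \"First-Pipeline-execution\"))<\/code><\/pre>
<\/div>
<p>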
You can also start a pipeline run through the Studio interface.<\/p>\n<ol>\n<li>First, go back to the main project view page and choose the <strong>Pipelines<\/strong> tab.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image015.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29673\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image015.jpg\" alt=\"\" width=\"855\" height=\"317\"><\/a><\/li>\n<li>Choose the pipeline <code>drift-detection-pipeline<\/code>, which opens a tab containing a list of past runs.<\/li>\n<\/ol>\n<p>When you first get to this page, you can see a previous failed run of the pipeline. This was started when the project was initialized. It failed because at the time there was no data for the pipeline to use.<\/p>\n<p>You can now start a new pipeline with the data.<\/p>\n<ol start=\"3\">\n<li>Choose <strong>Start an execution<\/strong>.<\/li>\n<li>For <strong>Name<\/strong>, enter <code>First-Pipeline-execution<\/code>.<\/li>\n<li>Choose <strong>Start<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image017.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29674\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image017.jpg\" alt=\"\" width=\"1094\" height=\"900\"><\/a><\/li>\n<\/ol>\n<p>When the pipeline starts, it\u2019s added to the list of pipeline runs with a status of <code>Executing<\/code><em>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image019.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29675\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image019.jpg\" alt=\"\" width=\"815\" height=\"166\"><\/a><br \/><\/em><\/p>\n<ol start=\"6\">\n<li>Choose this pipeline run to open its details.<\/li>\n<\/ol>\n<p>Pipelines automatically constructs a graph showing the data dependencies for each step in the pipeline. Based on this, you can see the order in which the steps are completed.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image021.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29676\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image021.jpg\" alt=\"\" width=\"982\" height=\"1242\"><\/a><\/p>\n<p>If you wait for a few minutes and refresh the graph, you can see that the <code>BaselineJob<\/code> and <code>TrainModel<\/code> steps run at the same time. This happens automatically; SageMaker determines that these steps are safe to run in parallel because there is no data dependency between them. You can explore this view by choosing the different steps and tabs. Let\u2019s look at what the different steps are doing:<\/p>\n<ul>\n<li><strong>PreprocessData<\/strong> \u2013 This step is responsible for preprocessing the data and transforming it into a format that is appropriate for the ML algorithm that follows. This step contains <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-drift-detection\/blob\/main\/build_pipeline\/pipelines\/preprocess.py\" target=\"_blank\" rel=\"noopener noreferrer\">custom code<\/a> developed for the particular use case.<\/li>\n<li><strong>BaselineJob<\/strong> \u2013 This step is responsible for generating a baseline regarding the expected type and distribution of your data. This is essential for the monitoring of the model. 
Model Monitor uses this baseline to compare against the latest collected data from the endpoint. This step doesn\u2019t require custom code because it\u2019s part of the Model Monitor offering.<\/li>\n<li><strong>TrainModel<\/strong> \u2013 This step is responsible for training an XGBoost regressor using the built-in implementation of the algorithm by SageMaker. Because we\u2019re using the built-in model, no custom code is required.<\/li>\n<li><strong>EvaluateModel and CheckEvaluation <\/strong>\u2013 In these steps, we calculate an evaluation metric that is important for us, in this case the root mean square error (rmse) on the test set. If it\u2019s less than the predefined threshold (7), we continue to the next step. If not, the pipeline stops. The <code>EvaluateModel<\/code> step requires <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-drift-detection\/blob\/main\/build_pipeline\/pipelines\/evaluate.py\" target=\"_blank\" rel=\"noopener noreferrer\">custom code<\/a> to compute the metric we\u2019re interested in.<\/li>\n<li><strong>RegisterModel<\/strong> \u2013 During this step, the trained model from the <code>TrainModel<\/code> step is registered in the model registry. From there, we can centrally manage and deploy the trained models. No custom code is required at this step.<\/li>\n<\/ul>\n<h2>Approve and deploy a model<\/h2>\n<p>After the pipeline has finished running, navigate to the main project page. Go to the <strong>Model groups<\/strong> tab, choose the <code>drift-detection<\/code> model group, and then a registered model version. This brings up the specific model version page. You can inspect the outputs of this model including the following metrics. 
You can also get more insights from the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/debugger-training-xgboost-report.html\" target=\"_blank\" rel=\"noopener noreferrer\">XGBoost training report<\/a>.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image023.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29677\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image023.jpg\" alt=\"\" width=\"1252\" height=\"474\"><\/a><\/p>\n<p>The deployment pipeline for a registered model is triggered based on its status. To deploy this model, complete the following steps:<\/p>\n<ol>\n<li>Choose <strong>Update status<\/strong>.<\/li>\n<li>Change the pending status to <strong>Approved.<\/strong><\/li>\n<li>Choose <strong>Update status<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image025.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29678\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image025.jpg\" alt=\"\" width=\"1676\" height=\"848\"><\/a><\/li>\n<\/ol>\n<p>Approving the model generates an event in CloudWatch that gets captured by a rule in EventBridge, which starts the model deployment.<\/p>\n<p>To see the deployment progress, navigate to the CodePipeline console. 
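<\/p>
<p>The console steps above can also be scripted, which is useful when promotion is part of an automated workflow. A minimal sketch using the boto3 SageMaker client from the notebook (the model package group name is assumed to match the model group you chose):<\/p>
<div class=\"hide-language\">
<pre><code class=\"lang-python\">def build_approval_update(model_package_arn):
    # Kwargs for the boto3 SageMaker call update_model_package; setting the
    # status to Approved emits the event that starts the deploy pipeline
    return {
        \"ModelPackageArn\": model_package_arn,
        \"ModelApprovalStatus\": \"Approved\",
    }

# Example, taking the newest version in the group (group name assumed):
# latest = sm_client.list_model_packages(
#     ModelPackageGroupName=\"drift-detection\",
#     SortBy=\"CreationTime\", SortOrder=\"Descending\",
# )[\"ModelPackageSummaryList\"][0]
# sm_client.update_model_package(**build_approval_update(latest[\"ModelPackageArn\"]))<\/code><\/pre>
<\/div>
<p>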
From the pipelines section, choose <code>sagemaker-drift-detection-deploy<\/code> to see the deployment of the approved model in progress.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image027.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29679\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image027.jpg\" alt=\"\" width=\"496\" height=\"1215\"><\/a><\/p>\n<p>This pipeline includes a build stage that gets the latest approved model version from the model registry and generates a CloudFormation template. The model is deployed to a staging SageMaker endpoint using this template.<\/p>\n<p>Now you can return to the notebook to test the staging endpoint by running the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">predictor = wait_for_predictor(\"staging\")\npayload = \"1,-73.986114,40.685634,-73.936794,40.715370,5.318025,7,0,2\"\npredictor.predict(data=payload)<\/code><\/pre>\n<\/p><\/div>\n<h2>Promote the model to production<\/h2>\n<p>If the staging endpoint is performing as expected, the model can be promoted to production. 
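<\/p>
<p>This promotion is a manual approval action in the deploy pipeline, and it too can be completed through the CodePipeline API. A hedged sketch (the stage and action names here are placeholders; read the real ones, and the approval token, from <code>get_pipeline_state<\/code>):<\/p>
<div class=\"hide-language\">
<pre><code class=\"lang-python\">def build_approval_result(pipeline, stage, action, token, summary):
    # Kwargs for the boto3 CodePipeline call put_approval_result;
    # the stage and action names plus the token come from get_pipeline_state
    return {
        \"pipelineName\": pipeline,
        \"stageName\": stage,
        \"actionName\": action,
        \"token\": token,
        \"result\": {\"summary\": summary, \"status\": \"Approved\"},
    }<\/code><\/pre>
<\/div>
<p>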
If this is the first time running this pipeline, you can approve this model by choosing <strong>Review<\/strong> in CodePipeline, entering any comments, and choosing <strong>Approve.<\/strong><\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image029.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29680\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image029.jpg\" alt=\"\" width=\"570\" height=\"369\"><\/a><\/p>\n<p>In our example, the approval of a model to production is a two-step process, reflecting different responsibilities and personas:<\/p>\n<ol>\n<li>First, we approved the model in the model registry to be tested on a staging endpoint. This would typically be performed by a data scientist after evaluating the model training results from a data science perspective.<\/li>\n<li>After the endpoint has been tested in the staging environment, the second approval is to deploy the model to production. This approval could be restricted by <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) roles to be performed only by an operations or application team. This second approval could follow additional tests defined by these teams.<\/li>\n<\/ol>\n<p>The production deployment has a few extra configuration parameters compared to the staging. Although the staging created a single instance to host the endpoint, this stage creates the endpoint with <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/endpoint-auto-scaling.html\" target=\"_blank\" rel=\"noopener noreferrer\">automatic scaling<\/a> with two instances across multiple Availability Zones. Automatic scaling makes sure that if traffic increases, the deployed model scales out in order to meet the user request throughput. 
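<\/p>
<p>SageMaker endpoint automatic scaling is configured through Application Auto Scaling. As a hedged sketch of what the template sets up (the variant name and maximum capacity here are assumptions), the scalable target registration looks like this:<\/p>
<div class=\"hide-language\">
<pre><code class=\"lang-python\">def build_scaling_target(endpoint_name, variant=\"AllTraffic\", min_capacity=2, max_capacity=4):
    # Kwargs for the boto3 application-autoscaling call register_scalable_target;
    # the variant name and max capacity are assumptions, the template configures this
    return {
        \"ServiceNamespace\": \"sagemaker\",
        \"ResourceId\": f\"endpoint/{endpoint_name}/variant/{variant}\",
        \"ScalableDimension\": \"sagemaker:variant:DesiredInstanceCount\",
        \"MinCapacity\": min_capacity,
        \"MaxCapacity\": max_capacity,
    }<\/code><\/pre>
<\/div>
<p>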
Additionally, the production variant of the deployment enables <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-monitor-data-capture.html\" target=\"_blank\" rel=\"noopener noreferrer\">data capture<\/a>, which means all requests and response predictions from the endpoint are logged to Amazon S3. A monitoring schedule, also deployed at this stage, analyzes this data. If data drift is detected, a new run of the build pipeline is started.<\/p>\n<h2>Monitor the model<\/h2>\n<p>For model monitoring, an important step is to define sensible thresholds that are relevant to your business problem. In our case, we want to be alerted if, for example, the underlying distribution of the prices of fares change. The deployment pipeline has a <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-drift-detection\/blob\/main\/deployment_pipeline\/prod-config.json\" target=\"_blank\" rel=\"noopener noreferrer\">prod-config.json<\/a> file that defines a metric and threshold for this drift detection.<\/p>\n<ol>\n<li>Navigate back to the main project page in Studio.<\/li>\n<li>Choose the <strong>Endpoints<\/strong> tab, where you can see both the staging and the prod endpoints are <code>InService<\/code>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image031.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29681\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image031.jpg\" alt=\"\" width=\"951\" height=\"248\"><\/a><\/li>\n<li>Choose the prod endpoint to display additional details.<\/li>\n<\/ol>\n<p>If you choose the prod version, the first thing you see is the monitoring schedule (which hasn\u2019t run yet).<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image033.jpg\"><img decoding=\"async\" loading=\"lazy\" 
class=\"alignnone size-full wp-image-29682\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image033.jpg\" alt=\"\" width=\"858\" height=\"541\"><\/a><\/p>\n<ol start=\"4\">\n<li>Test the production endpoint and the data capture by sending some artificial traffic using the notebook cells under the <strong>Test Production<\/strong> and <strong>Inspect Data Capture<\/strong> sections.<\/li>\n<\/ol>\n<p>This code also modifies the distribution of the input data, which causes a drift to be detected in the predicted fare amount when Model Monitor runs. This in turn raises an alarm and restarts the training pipeline to train a new model.<\/p>\n<p>The monitoring schedule has been set to run hourly. After the hour, you can see that a new monitoring job is now <code>In progress<\/code>. This should take about 10 minutes to complete, at which point you should see its status change to <code>Issue Found<\/code> due to the data drift that we introduced.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image035.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29683\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image035.jpg\" alt=\"\" width=\"763\" height=\"225\"><\/a><\/p>\n<p>To monitor the metrics emitted by the monitoring job, you can add a chart in Studio to inspect the different features over a relevant timeline.<\/p>\n<ol start=\"5\">\n<li>Choose the <strong>Data Quality<\/strong> tab of the <strong>Model Monitoring<\/strong> page.<\/li>\n<li>Choose <strong>Add chart,<\/strong> which reveals the chart properties.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image037.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29684\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image037.jpg\" alt=\"\" width=\"292\" height=\"404\"><\/a><\/li>\n<li>To get more insights into the monitoring job, choose the latest job to inspect the job details.<\/li>\n<\/ol>\n<p>In <strong>Monitor Job Details<\/strong>, you can see a summary of the discovered constraint violations.<\/p>\n<ol start=\"8\">\n<li>To discover more, copy the ARN under <strong>Processing Job ARN<\/strong>.<\/li>\n<li>Choose <strong>View Amazon SageMaker notebook<\/strong>, which opens a pre-populated notebook.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image039.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29685\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image039.jpg\" alt=\"\" width=\"2084\" height=\"1268\"><\/a><\/li>\n<li>In the notebook cell, replace <code>FILL-IN-PROCESSING-JOB-ARN<\/code> with the ARN value you copied.<\/li>\n<li>Run all the notebook cells.<\/li>\n<\/ol>\n<p>This notebook outputs a series of tables and graphs, including a distribution that compares the newly collected data (in blue) to the baseline metrics distribution (in green). For the <code>geo_distance<\/code> and <code>passenger_count<\/code> features for which we introduced artificial noise, you can see the shifts in distributions. 
As a consequence, you can also notice a shift in the distribution of the predicted <code>fare_amount<\/code> value.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image041.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29686\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image041.jpg\" alt=\"\" width=\"1113\" height=\"876\"><\/a><\/p>\n<h2>Retrain the model<\/h2>\n<p>The preceding change in the data raises an alarm when the CloudWatch metrics published by Model Monitor exceed the configured threshold. Thanks to the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/pipeline-eventbridge.html\" target=\"_blank\" rel=\"noopener noreferrer\">Pipelines integration with EventBridge<\/a>, the model build pipeline is started to retrain the model on the latest data.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image043.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29687\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image043.jpg\" alt=\"\" width=\"822\" height=\"615\"><\/a><\/p>\n<p>Navigating to the CloudWatch console and choosing <strong>Alarms<\/strong> should show that the alarm <code>sagemaker-drift-detection-prod-threshold<\/code> is in the status <code>In Alarm<\/code>. When the alarm changes to <code>In Alarm<\/code>, a new run of the pipeline is started. 
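Under the hood, this wiring is an EventBridge rule that matches the CloudWatch alarm state change event and targets the model build pipeline. The sketch below shows roughly what such an event pattern looks like, plus a toy matcher to illustrate the matching semantics; the `matches` helper is purely illustrative and not part of the solution's code.

```python
# Sketch of the EventBridge event pattern a drift-retraining rule could
# use: fire only when the drift alarm enters the ALARM state. The rule's
# target would then be the SageMaker model build pipeline.
event_pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["sagemaker-drift-detection-prod-threshold"],
        "state": {"value": ["ALARM"]},
    },
}

def matches(pattern, event):
    """Toy subset of EventBridge matching: every key in the pattern must
    exist in the event, and leaf values must be one of the candidates."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

# An alarm transition to ALARM matches, which would start the pipeline.
alarm_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "sagemaker-drift-detection-prod-threshold",
        "state": {"value": "ALARM"},
    },
}
assert matches(event_pattern, alarm_event)
```

An alarm returning to the OK state produces an event that does not match this pattern, so the pipeline is only started on the ALARM transition.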
You can see this on the pipeline tab of the main project in the Studio interface.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image045.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29688\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/ML-4030-image045.jpg\" alt=\"\" width=\"758\" height=\"63\"><\/a><\/p>\n<p>At this point, the newly generated model suffers from the same drift if we test the endpoint with that same generated data, because we didn\u2019t change or update the training data. In a real production environment, for this pipeline to be effective, a process should exist to load newly labeled data to the location where the pipeline reads its input data. This last detail is crucial when building your solution.<\/p>\n<h2>Clean up<\/h2>\n<p>The <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-drift-detection\/blob\/main\/build_pipeline\/build-pipeline.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">build-pipeline.ipynb<\/a> notebook includes cells that you can run to clean up the following resources:<\/p>\n<ul>\n<li>SageMaker prod endpoint<\/li>\n<li>SageMaker staging endpoint<\/li>\n<li>SageMaker pipeline workflow and model package group<\/li>\n<li>Amazon S3 artifacts and SageMaker project<\/li>\n<\/ul>\n<p>You can also clean up resources using the <a href=\"https:\/\/aws.amazon.com\/cli\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI):<\/p>\n<ol>\n<li>Delete the CloudFormation stack created to provision the production endpoint:<br \/><code>aws cloudformation delete-stack --stack-name sagemaker-<span>&lt;&lt;project_name&gt;&gt;<\/span>-deploy-prod<\/code><\/li>\n<\/ol>\n<ol start=\"2\">\n<li>Delete the CloudFormation stack created to provision the staging endpoint:<br \/><code>aws cloudformation 
delete-stack --stack-name sagemaker-<span>&lt;&lt;project_name&gt;&gt;<\/span>-deploy-staging<\/code><\/li>\n<\/ol>\n<ol start=\"3\">\n<li>Delete the CloudFormation stack created to provision the SageMaker pipeline and model package group:<br \/><code>aws cloudformation delete-stack --stack-name sagemaker-<span>&lt;&lt;project_name&gt;&gt;<\/span>-deploy-pipeline<\/code><\/li>\n<\/ol>\n<ol start=\"4\">\n<li>Empty the S3 bucket containing the artifacts output from the drift deployment pipeline:<br \/><code>aws s3 rm --recursive s3:\/\/sagemaker-project-<span>&lt;&lt;project_id&gt;&gt;<\/span>-<span>&lt;&lt;region_name&gt;&gt;<\/span><\/code><\/li>\n<\/ol>\n<ol start=\"5\">\n<li>Delete the project, which removes the CloudFormation stack that created the deployment pipeline:<br \/><code class=\"lang-bash\">aws sagemaker delete-project --project-name <span>&lt;&lt;project_name&gt;&gt;<\/span><\/code><\/li>\n<\/ol>\n<ol start=\"6\">\n<li>Delete the AWS Service Catalog project template:<br \/><code>aws cloudformation delete-stack --stack-name <span>&lt;&lt;drift-pipeline&gt;&gt;<\/span><\/code><\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>Model Monitor allows you to capture the incoming data to the deployed model, detect changes, and raise alarms when significant data drift is detected. Additionally, Pipelines allows you to orchestrate building new model versions. With its integration with EventBridge, you can run a pipeline either on a schedule or on demand. 
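Besides the alarm-driven path, running the pipeline on demand is a single `StartPipelineExecution` API call. The sketch below only assembles the request parameters; the pipeline name and the `InputDataUrl` parameter are assumptions for illustration, and the commented-out boto3 call requires AWS credentials and an existing pipeline.

```python
def build_start_request(pipeline_name, input_data_uri):
    """Build the kwargs for sagemaker:StartPipelineExecution. The
    parameter names here are illustrative; use the ones your pipeline
    actually defines."""
    return {
        "PipelineName": pipeline_name,
        "PipelineParameters": [
            {"Name": "InputDataUrl", "Value": input_data_uri},
        ],
    }

request = build_start_request(
    "sagemaker-drift-detection-build",  # hypothetical pipeline name
    "s3://my-bucket/labeled/latest/",   # freshly labeled training data
)
# With credentials configured, the actual call would be:
#   import boto3
#   boto3.client("sagemaker").start_pipeline_execution(**request)
assert request["PipelineName"] == "sagemaker-drift-detection-build"
```

Pointing `InputDataUrl` at freshly labeled data is what makes an on-demand run useful: as noted above, retraining on the same drifted, unlabeled capture would simply reproduce the drift.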
The latter capability enables you to connect model monitoring with automatic retraining whenever drift in the incoming feature data is detected.<\/p>\n<p>You can use the code repository on <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-drift-detection\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<\/a> as a starting point to try out this solution for your own data and use case.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/07\/30\/julian-bright-100.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-14385 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/07\/30\/julian-bright-100.jpg\" alt=\"\" width=\"100\" height=\"138\"><\/a>Julian Bright<\/strong> is a Principal AI\/ML Specialist Solutions Architect based out of Melbourne, Australia. Julian works as part of the global Amazon Machine Learning team and is passionate about helping customers realize their AI and ML journey through MLOps. In his spare time, he loves running around after his kids, playing soccer and getting outdoors.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/schinasg.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-29693 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/schinasg.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a> <strong>Georgios Schinas<\/strong> is a Specialist Solutions Architect for AI\/ML in the EMEA region. He is based in London and works closely with customers in the UK. 
Georgios helps customers design and deploy machine learning applications in production on AWS, with a particular interest in MLOps practices. In his spare time, he enjoys traveling, cooking and spending time with friends and family.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/theisshe.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-29694 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/theisshe.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Theiss Heilker<\/strong> is an AI\/ML Solutions Architect at AWS. He helps customers create AI\/ML solutions and accelerate their machine learning journey. He is passionate about MLOps, and in his spare time you can find him outdoors playing with his dog and son.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/Alessandro-Cer%C3%A8.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-29692 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/22\/Alessandro-Cer%C3%A8.jpg\" alt=\"\" width=\"100\" height=\"137\"><\/a><strong>Alessandro Cer\u00e8<\/strong> is a Senior ML Solutions Architect at AWS based in Singapore, where he helps customers design and deploy machine learning solutions across the ASEAN region. Before becoming a data scientist, Alessandro researched the limits of quantum correlations for secure communication. 
In his spare time, he\u2019s a landscape and underwater photographer.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/automate-model-retraining-with-amazon-sagemaker-pipelines-when-drift-is-detected\/<\/p>\n","protected":false},"author":0,"featured_media":1142,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1141"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1141"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1141\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1142"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1141"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1141"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}