{"id":302,"date":"2020-09-29T13:04:24","date_gmt":"2020-09-29T13:04:24","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/09\/29\/moving-from-notebooks-to-automated-ml-pipelines-using-amazon-sagemaker-and-aws-glue\/"},"modified":"2020-09-29T13:04:24","modified_gmt":"2020-09-29T13:04:24","slug":"moving-from-notebooks-to-automated-ml-pipelines-using-amazon-sagemaker-and-aws-glue","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/09\/29\/moving-from-notebooks-to-automated-ml-pipelines-using-amazon-sagemaker-and-aws-glue\/","title":{"rendered":"Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue"},"content":{"rendered":"<div id=\"\">\n<p>A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model\u2019s ability to autonomously learn and adapt in production as new data is added.<\/p>\n<p>In practice, data scientists often work with Jupyter notebooks for development work and find it hard to translate from <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/nbi.html\" target=\"_blank\" rel=\"noopener noreferrer\">notebooks<\/a> to automated pipelines. 
To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> \u2013 A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly<\/li>\n<li>\n<a href=\"https:\/\/aws.amazon.com\/glue\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue<\/a> \u2013 A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data<\/li>\n<\/ul>\n<p>In this post, we demonstrate how to orchestrate an ML training pipeline using <a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/workflows_overview.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue workflows<\/a> and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints.<\/p>\n<h2>Use case<\/h2>\n<p>For this use case, we use the <a href=\"http:\/\/dbpedia.org\" target=\"_blank\" rel=\"noopener noreferrer\">DBpedia Ontology classification dataset<\/a> to build a model that performs multi-class classification. 
We trained the model using the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/blazingtext.html\" target=\"_blank\" rel=\"noopener noreferrer\">BlazingText algorithm<\/a>, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes.<\/p>\n<p>This post doesn\u2019t go into the details of the model, but demonstrates a way to build an ML pipeline that builds and deploys any ML model.<\/p>\n<h2>Solution overview<\/h2>\n<p>The following diagram summarizes the approach for the retraining pipeline. <img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15082 size-full\" title=\"Retraining Pipeline Workflow\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/24\/1-Retraining-Pipeline.jpg\" alt=\"\" width=\"1000\" height=\"623\"><\/p>\n<p>The workflow contains the following elements:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/add-crawler.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue crawler<\/a> \u2013 You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.<\/li>\n<li>\n<a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/about-triggers.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue triggers<\/a> \u2013 Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. 
You can design a chain of dependent jobs and crawlers by using triggers.<\/li>\n<li>\n<a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/add-job.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue job<\/a> \u2013 An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location.<\/li>\n<li>\n<a href=\"https:\/\/docs.aws.amazon.com\/glue\/latest\/dg\/creating_running_workflows.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue workflow<\/a> \u2013 An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image.<\/li>\n<\/ul>\n<p>The workflow begins by downloading the training data from <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validate sets in AWS Glue jobs. The training job runs on a Python shell running in AWS Glue jobs, which starts a <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/how-it-works-training.html\" target=\"_blank\" rel=\"noopener noreferrer\">training job<\/a> in Amazon SageMaker based on a set of hyperparameters.<\/p>\n<p>When the training job is complete, an <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/how-it-works-deployment.html\" target=\"_blank\" rel=\"noopener noreferrer\">endpoint<\/a> is created, which is hosted on Amazon SageMaker. 
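The training step described above runs as a Glue Python shell job that calls the SageMaker API. The original script isn't reproduced here; the following is a minimal sketch of how such a step might assemble and launch a BlazingText training job. Every concrete value below (bucket, prefix, role ARN, instance type, hyperparameter values) is an illustrative placeholder, not the pipeline's actual configuration:

```python
# Sketch of a Glue Python shell step launching a SageMaker BlazingText
# training job. All concrete values are placeholder assumptions.

def build_training_request(job_name, image_uri, role_arn, bucket, prefix):
    """Assemble the parameters for boto3's create_training_job call."""

    def channel(name):
        # Each channel points the algorithm at a preprocessed S3 prefix.
        return {
            "ChannelName": name,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/{}".format(bucket, prefix, name),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "HyperParameters": {
            "mode": "supervised",  # text classification rather than word vectors
            "epochs": "10",        # placeholder values
            "min_count": "2",
        },
        "InputDataConfig": [channel("train"), channel("validation")],
        "OutputDataConfig": {
            "S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)
        },
        "ResourceConfig": {
            "InstanceType": "ml.c5.4xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def launch_training(sagemaker_client, request):
    """Start the job; pass boto3.client('sagemaker') from the Glue job."""
    return sagemaker_client.create_training_job(**request)
```

Inside the Glue job you would call `launch_training(boto3.client("sagemaker"), build_training_request(...))` and then poll `describe_training_job` until the job status reaches `Completed` before moving on to deployment.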
This job in AWS Glue takes a few minutes to complete because it makes sure that the endpoint is in <code>InService<\/code> status.<\/p>\n<p>At the end of the workflow, a message is sent to an <a href=\"http:\/\/aws.amazon.com\/sqs\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Queue Service<\/a> (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.<\/p>\n<h2>Setting up the environment<\/h2>\n<p>To set up the environment, complete the following steps:<\/p>\n<ol>\n<li>Configure the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI) and a profile to use to run the code. For instructions, see <a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/cli-chap-configure.html\" target=\"_blank\" rel=\"noopener noreferrer\">Configuring the AWS CLI<\/a>.<\/li>\n<li>Make sure you have the Unix utility wget installed on your machine to download the <code>DBpedia<\/code> dataset from the internet.<\/li>\n<li>Download the <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/ml-pipeline-sm-glue\/Glue_workflow_orchestration.zip\" target=\"_blank\" rel=\"noopener noreferrer\">following code<\/a> into your local directory.<\/li>\n<\/ol>\n<h2>Organization of code<\/h2>\n<p>The code to build the pipeline has the following directory structure:<\/p>\n<pre><code class=\"lang-code\">--Glue workflow orchestration\r\n\t--glue_scripts\r\n\t\t--DataExtractionJob.py\r\n\t\t--DataProcessingJob.py\r\n\t\t--MessagingQueueJob.py\r\n\t\t--TrainingJob.py\r\n\t--base_resources.template\r\n\t--deploy.sh\r\n\t--glue_resources.template<\/code><\/pre>\n<p>The code directory is divided into three parts:<\/p>\n<ul>\n<li>\n<strong>AWS CloudFormation templates<\/strong> \u2013 The directory has two <a 
href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> templates: <code>glue_resources.template<\/code> and <code>base_resources.template<\/code>. The <code>glue_resources.template<\/code> template creates the AWS Glue workflow-related resources, and <code>base_resources.template<\/code> creates the Amazon S3, <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to<a href=\"https:\/\/docs.aws.amazon.com\/systems-manager\/latest\/userguide\/systems-manager-parameter-store.html\" target=\"_blank\" rel=\"noopener noreferrer\"> AWS Systems Manager Parameter Store<\/a>, which allows easy and secure access to ARNs further in the workflow.<\/li>\n<li>\n<strong>AWS Glue scripts<\/strong> \u2013 The folder <code>glue_scripts<\/code> holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as model training and deploying scripts. The scripts are copied to the correct S3 bucket when the bash script runs.<\/li>\n<li>\n<strong>Bash script<\/strong> \u2013 A wrapper script <code>deploy.sh<\/code> is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as <code>stage<\/code> in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. 
However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers.<\/li>\n<\/ul>\n<h2>Implementing the solution<\/h2>\n<p>Complete the following steps:<\/p>\n<ol>\n<li>Go to the <code>deploy.sh<\/code> file and replace the <code>algorithm_image<\/code> value with the <em>&lt;ecr_path&gt;<\/em> for your Region.<\/li>\n<\/ol>\n<p>The following code example is a path for Region <code>us-west-2<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">algorithm_image=\"433757028032.dkr.ecr.us-west-2.amazonaws.com\/blazingtext:latest\"<\/code><\/pre>\n<\/div>\n<p>For more information about <code>BlazingText<\/code> parameters, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-algo-docker-registry-paths.html\" target=\"_blank\" rel=\"noopener noreferrer\">Common parameters for built-in algorithms<\/a>.<\/p>\n<ol start=\"2\">\n<li>Enter the following code in your terminal:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">sh deploy.sh -s dev AWS_PROFILE=your_profile_name<\/code><\/pre>\n<\/div>\n<\/li>\n<\/ol>\n<p>This step sets up the infrastructure of the pipeline.<\/p>\n<ol start=\"3\">\n<li>On the AWS CloudFormation console, check that the templates have the status <code>CREATE_COMPLETE<\/code>. <img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15083 size-full\" title=\"Checking Template Status for CREATE_COMPLETE\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/24\/2-CREATE_COMPLETE-Screenshot.jpg\" alt=\"\" width=\"1000\" height=\"269\">\n<\/li>\n<li>On the AWS Glue console, manually start the pipeline.<\/li>\n<\/ol>\n<p>In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. 
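Automating the run is a single call against the Glue API. The following is a minimal sketch, assuming the stage-prefixed workflow naming convention used by deploy.sh (for example, DevMLWorkflow for the dev stage):

```python
# Sketch: start the retraining workflow programmatically (e.g. from a
# scheduler or an application event) instead of from the console.
# The name pattern below assumes the stage-prefix convention of deploy.sh.

def workflow_name(stage):
    """Map a deployment stage ('dev', 'test', 'prod') to its workflow name."""
    return stage.capitalize() + "MLWorkflow"


def start_retraining(glue_client, stage):
    """Start a workflow run; pass boto3.client('glue'). Returns the run ID."""
    response = glue_client.start_workflow_run(Name=workflow_name(stage))
    return response["RunId"]
```

For a scheduled retrain, the same call can be wired to an Amazon EventBridge rule or any cron-style scheduler that can invoke a small Python handler.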
The workflow provides a visual of the chain of operations and the dependencies between the jobs.<\/p>\n<ol start=\"5\">\n<li>To begin the workflow, in the <strong>Workflow <\/strong>section, select <strong>DevMLWorkflow<\/strong>.<\/li>\n<li>From the <strong>Actions<\/strong> drop-down menu, choose <strong>Run<\/strong>. <img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15084 size-full\" title=\"Selecting Run on Actions Dropdown\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/24\/3-Actions-Dropdown-Run.jpg\" alt=\"\" width=\"1000\" height=\"225\">\n<\/li>\n<li>View the progress of your workflow on the <strong>History<\/strong> tab and select the latest RUN ID.<\/li>\n<\/ol>\n<p>The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15085 size-full\" title=\"Post Workflow Completion\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/24\/4-Workflow-Post-Completion.jpg\" alt=\"\" width=\"1000\" height=\"358\"><\/p>\n<ol start=\"8\">\n<li>After the workflow is successful, open the Amazon SageMaker console.<\/li>\n<li>Under <strong>Inference<\/strong>, choose <strong>Endpoint<\/strong>.<\/li>\n<\/ol>\n<p>The following screenshot shows that the endpoint the workflow deployed is ready.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15322\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/27\/5-Endpoint-Screenshot-1.jpg\" alt=\"\" width=\"1000\" height=\"211\"><\/p>\n<p>Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. 
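The deployed endpoint can also be invoked directly to sanity-check predictions on held-out text. The following is a minimal sketch; the endpoint name is a placeholder, and the `{"instances": [...]}` body is the JSON request format accepted by BlazingText text classification endpoints:

```python
import json

# Sketch: send test sentences to the deployed BlazingText endpoint.
# The endpoint name used by the caller is a placeholder assumption.

def build_payload(sentences):
    """Serialize raw sentences into the JSON body the endpoint expects."""
    return json.dumps({"instances": sentences})


def classify(runtime_client, endpoint_name, sentences):
    """Invoke the endpoint; pass boto3.client('sagemaker-runtime')."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(sentences),
    )
    # Each result carries the predicted label(s) and their probabilities.
    return json.loads(response["Body"].read())
```

Running the test set through `classify` and comparing predicted labels against ground truth gives the per-class metrics mentioned below.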
You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.<\/p>\n<h2>Cleaning up<\/h2>\n<p>Make sure to delete the Amazon SageMaker <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/how-it-works-hosting.html\" target=\"_blank\" rel=\"noopener noreferrer\">hosting services<\/a>\u2014endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">def delete_resources(self):\r\n    endpoint_name = self.endpoint\r\n\r\n    try:\r\n        sagemaker.delete_endpoint(EndpointName=endpoint_name)\r\n        print(\"Deleted Test Endpoint \", endpoint_name)\r\n    except Exception:\r\n        print(\"Model endpoint deletion failed\")\r\n\r\n    try:\r\n        sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)\r\n        print(\"Deleted Test Endpoint Configuration \", endpoint_name)\r\n    except Exception:\r\n        print(\"Endpoint config deletion failed\")\r\n\r\n    try:\r\n        sagemaker.delete_model(ModelName=endpoint_name)\r\n        print(\"Deleted Test Endpoint Model \", endpoint_name)\r\n    except Exception:\r\n        print(\"Model deletion failed\")<\/code><\/pre>\n<\/div>\n<h2>Conclusion<\/h2>\n<p>This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.<\/p>\n<hr>\n<h2>About the Authors<\/h2>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-15153 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/26\/Sai-Sharanya-Nalla.jpg\" alt=\"\" width=\"100\" height=\"136\">Sai Sharanya Nalla is a Data Scientist at AWS Professional Services. She works with customers to develop and implement AI and ML solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, long walks, and engaging in outreach activities.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-15152 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/26\/Inchara-Bellavara-Diwakar.jpg\" alt=\"\" width=\"100\" height=\"136\">Inchara B Diwakar is a Data Scientist at AWS Professional Services. She designs and engineers ML solutions at scale, with experience across healthcare, manufacturing and retail verticals. 
Outside of work, she enjoys the outdoors, traveling and a good read.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/moving-from-notebooks-to-automated-ml-pipelines-using-amazon-sagemaker-and-aws-glue\/<\/p>\n","protected":false},"author":0,"featured_media":303,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/302"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=302"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/302\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/303"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=302"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=302"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=302"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}