{"id":1950,"date":"2022-03-10T18:45:52","date_gmt":"2022-03-10T18:45:52","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/10\/make-batch-predictions-with-amazon-sagemaker-autopilot\/"},"modified":"2022-03-10T18:45:52","modified_gmt":"2022-03-10T18:45:52","slug":"make-batch-predictions-with-amazon-sagemaker-autopilot","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/10\/make-batch-predictions-with-amazon-sagemaker-autopilot\/","title":{"rendered":"Make batch predictions with Amazon SageMaker Autopilot"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/autopilot-automate-model-development.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Autopilot<\/a> is an automated machine learning (AutoML) solution that performs all the tasks you need to complete an end-to-end machine learning (ML) workflow. It explores and prepares your data, applies different algorithms to generate a model, and transparently provides model insights and explainability reports to help you interpret the results. Autopilot can also create a real-time endpoint for online inference. You can access Autopilot\u2019s one-click features in <a href=\"https:\/\/aws.amazon.com\/sagemaker\/studio\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a> or by using the <a href=\"https:\/\/aws.amazon.com\/sdk-for-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS SDK for Python (Boto3)<\/a> or the SageMaker Python SDK.<\/p>\n<p>In this post, we show how to make batch predictions on an unlabeled dataset using an Autopilot-trained model. 
We use a synthetically generated dataset that is indicative of the types of features you typically see when predicting customer churn.<\/p>\n<h2>Solution overview<\/h2>\n<p><em>Batch <\/em>inference, or <em>offline <\/em>inference, is the process of generating predictions on a batch of observations. Batch inference assumes you don\u2019t need an immediate response to a model prediction request, as you would when using an online, real-time model endpoint. Offline predictions are suitable for larger datasets and in cases where you can afford to wait several minutes or hours for a response. In contrast, <em>online <\/em>inference generates ML predictions in real time, and is aptly referred to as <em>real-time<\/em> inference or <em>dynamic<\/em> inference. Typically, these predictions are generated on a single observation of data at runtime.<\/p>\n<p>Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. Mobile operators have historical customer data showing those who have churned and those who have maintained service. We can use this historical information to construct an ML model that predicts whether a customer will churn.<\/p>\n<p>After we train an ML model, we can pass the profile information of an arbitrary customer (the same profile information that we used for training) to the model, and have the model predict whether the customer will churn. The dataset used for this post is hosted in the sagemaker-sample-files public bucket on <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3), which you can download. It consists of 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. 
The attributes are as follows:<\/p>\n<ul>\n<li><strong>State<\/strong> \u2013 US state in which the customer resides, indicated by a two-letter abbreviation; for example, TX or CA<\/li>\n<li><strong>Account Length<\/strong> \u2013 Number of days that this account has been active<\/li>\n<li><strong>Area Code<\/strong> \u2013 Three-digit area code of the corresponding customer\u2019s phone number<\/li>\n<li><strong>Phone<\/strong> \u2013 Remaining seven-digit phone number<\/li>\n<li><strong>Int\u2019l Plan<\/strong> \u2013 Has an international calling plan: Yes\/No<\/li>\n<li><strong>VMail Plan<\/strong> \u2013 Has a voice mail feature: Yes\/No<\/li>\n<li><strong>VMail Message<\/strong> \u2013 Average number of voice mail messages per month<\/li>\n<li><strong>Day Mins<\/strong> \u2013 Total number of calling minutes used during the day<\/li>\n<li><strong>Day Calls<\/strong> \u2013 Total number of calls placed during the day<\/li>\n<li><strong>Day Charge<\/strong> \u2013 Billed cost of daytime calls<\/li>\n<li><strong>Eve Mins, Eve Calls, Eve Charge<\/strong> \u2013 Minutes used, number of calls, and billed cost for calls placed during the evening<\/li>\n<li><strong>Night Mins, Night Calls, Night Charge<\/strong> \u2013 Minutes used, number of calls, and billed cost for calls placed during the night<\/li>\n<li><strong>Intl Mins, Intl Calls, Intl Charge<\/strong> \u2013 Minutes used, number of calls, and billed cost for international calls<\/li>\n<li><strong>CustServ Calls<\/strong> \u2013 Number of calls placed to Customer Service<\/li>\n<li><strong>Churn?<\/strong> \u2013 Customer left the service: True\/False<\/li>\n<\/ul>\n<p>The last attribute, Churn?, is the target attribute that we want the ML model to predict. 
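<\/p>
<p>Before handing the file to Autopilot, it can be worth sanity-checking the target column locally. The following is a minimal sketch; the three rows and the column subset are invented stand-ins for churn.txt, which has 5,000 rows and 21 columns:<\/p>

```python
import csv
import io

# Invented sample rows standing in for churn.txt; only three of the
# dataset's 21 columns are shown here.
sample = io.StringIO(
    "State,Account Length,Churn?\n"
    "TX,101,False\n"
    "CA,57,True\n"
    "OH,130,False\n"
)

# Count how often each target value occurs (the class balance).
counts = {}
for row in csv.DictReader(sample):
    counts[row["Churn?"]] = counts.get(row["Churn?"], 0) + 1

print(counts)  # {'False': 2, 'True': 1}
```

<p>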
Because the target attribute is binary, our model performs binary prediction, also known as <em>binary classification<\/em>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-large wp-image-33849\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/07\/ML-8367-dataset-1024x327.png\" alt=\"churn dataset\" width=\"1024\" height=\"327\"><\/p>\n<h2>Prerequisites<\/h2>\n<p>To download the dataset to your local development environment and explore it, run the following S3 copy command with the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI):<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">$ aws s3 cp s3:\/\/sagemaker-sample-files\/datasets\/tabular\/synthetic\/churn.txt .\/<\/code><\/pre>\n<\/p><\/div>\n<p>You can then copy the dataset to an S3 bucket within your own AWS account; this is the input location for Autopilot. Either upload the file to your bucket manually, or run the following AWS CLI command:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">$ aws s3 cp .\/churn.txt s3:\/\/<span>&lt;YOUR S3 BUCKET&gt;<\/span>\/datasets\/tabular\/datasets\/churn.txt<\/code><\/pre>\n<\/p><\/div>\n<h2>Create an Autopilot experiment<\/h2>\n<p>When the dataset is ready, you can initialize an Autopilot experiment in SageMaker Studio. For full instructions, refer to <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/autopilot-automate-model-development-create-experiment.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create an Amazon SageMaker Autopilot experiment<\/a>.<\/p>\n<p>Under <strong>Basic settings<\/strong>, you can easily create an Autopilot experiment by providing an experiment name and the data input and output locations, and specifying the target attribute to predict. 
Optionally, you can specify the type of ML problem that you want to solve. Otherwise, use the <strong>Auto<\/strong> setting, and Autopilot automatically determines the problem type based on the data you provide.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-large wp-image-33848\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/07\/ML-8367-create-experiment-1024x710.png\" alt=\"create autopilot experiment\" width=\"1024\" height=\"710\"><\/p>\n<p>You can also run an Autopilot experiment with code using either the AWS SDK for Python (Boto3) or the SageMaker Python SDK. The following code snippet demonstrates how to initialize an Autopilot experiment with the <a href=\"https:\/\/sagemaker.readthedocs.io\/en\/stable\/api\/training\/automl.html\" target=\"_blank\" rel=\"noopener noreferrer\">AutoML class<\/a> from the SageMaker Python SDK:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from sagemaker import AutoML\n\nautoml = AutoML(role=\"<span>&lt;SAGEMAKER EXECUTION ROLE&gt;<\/span>\",\n                target_attribute_name=\"<span>&lt;NAME OF YOUR TARGET COLUMN&gt;<\/span>\",\n                base_job_name=\"<span>&lt;NAME FOR YOUR AUTOPILOT EXPERIMENT&gt;<\/span>\",\n                sagemaker_session=\"<span>&lt;SAGEMAKER SESSION&gt;<\/span>\",\n                max_candidates=\"<span>&lt;MAX NUMBER OF TRAINING JOBS TO RUN AS PART OF THE EXPERIMENT&gt;<\/span>\")\n\nautoml.fit(\"<span>&lt;PATH TO INPUT DATASET&gt;<\/span>\",\n           job_name=\"<span>&lt;NAME OF YOUR AUTOPILOT EXPERIMENT&gt;<\/span>\",\n           wait=False,\n           logs=False)<\/code><\/pre>\n<\/p><\/div>\n<p>After Autopilot begins an experiment, the service automatically inspects the raw input data, applies feature processors, and picks the best set of algorithms. After it chooses an algorithm, Autopilot optimizes its performance using a hyperparameter optimization search process. This is often referred to as training and tuning the model. 
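<\/p>
<p>Conceptually, this tuning step searches over candidate hyperparameter values and keeps the best-scoring configuration. The following toy sketch illustrates the idea with a random search; the objective function, parameter names, and trial count here are all invented for illustration, and Autopilot\u2019s actual search is far more sophisticated:<\/p>

```python
import random

# Invented stand-in for a model's validation score as a function of
# two hyperparameters; a real search would train and evaluate a model.
def validation_score(learning_rate, max_depth):
    return -((learning_rate - 0.1) ** 2) - 0.01 * (max_depth - 6) ** 2

random.seed(0)
best_score, best_params = None, None
for _ in range(50):  # cap the number of trials, like max_candidates
    params = {
        "learning_rate": random.uniform(0.001, 0.5),
        "max_depth": random.randint(2, 12),
    }
    score = validation_score(**params)
    if best_score is None or score > best_score:
        best_score, best_params = score, params

print(best_score, best_params)
```

<p>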
This ultimately helps produce a model that can accurately make predictions on data it has never seen. Autopilot automatically tracks model performance, and then ranks the final models based on metrics that describe a model\u2019s accuracy and precision.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-large wp-image-33847\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/07\/ML-8367-experiment-results-1024x446.png\" alt=\"autopilot experiment results\" width=\"1024\" height=\"446\"><\/p>\n<p>You can also deploy any of the ranked models: select the model in the ranked list (or right-click it) and choose <strong>Deploy model<\/strong>.<\/p>\n<h2>Make batch predictions using a model from Autopilot<\/h2>\n<p>When your Autopilot experiment is complete, you can use the trained model to run batch predictions on your test or holdout dataset for evaluation. If that dataset is pre-labeled, you can compare the predicted labels against the expected labels; the more of the model\u2019s predictions that match the true labels, the better the model is performing. You can also run batch predictions to label unlabeled data. You can accomplish all of this with a few lines of code using the high-level SageMaker Python SDK.<\/p>\n<h3>Describe a previously run Autopilot experiment<\/h3>\n<p>We first need to extract the information from a previously completed Autopilot experiment. We can use the AutoML class from the SageMaker Python SDK to create an automl object that encapsulates the experiment\u2019s details, using the experiment name you defined when initializing the Autopilot experiment. 
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from sagemaker import AutoML\n\nautopilot_experiment_name = \"<span>&lt;ENTER YOUR AUTOPILOT EXPERIMENT NAME HERE&gt;<\/span>\"\nautoml = AutoML.attach(auto_ml_job_name=autopilot_experiment_name)<\/code><\/pre>\n<\/p><\/div>\n<p>With the automl object, we can easily describe and recreate the best trained model, as shown in the following snippet:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">best_candidate = automl.describe_auto_ml_job()['BestCandidate']\nbest_candidate_name = best_candidate['CandidateName']\n\n# inference_response_keys is defined later, in the section\n# Customize the inference response\nmodel = automl.create_model(name=best_candidate_name,\n                            candidate=best_candidate,\n                            inference_response_keys=inference_response_keys)<\/code><\/pre>\n<\/p><\/div>\n<p>In some cases, you might want to use a model other than the best model as ranked by Autopilot. To find such a candidate model, you can use the automl object to iterate through the list of candidates (all of them, or just the top N) and choose the one you want to recreate. 
For this post, we use a simple Python for loop to iterate through a list of model candidates:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">all_candidates = automl.list_candidates(sort_by='FinalObjectiveMetricValue',\n                                        sort_order='Descending',\n                                        max_results=100)\n\nfor candidate in all_candidates:\n    if candidate['CandidateName'] == \"<span>&lt;ANY CANDIDATE MODEL OTHER THAN BEST MODEL&gt;<\/span>\":\n        candidate_name = candidate['CandidateName']\n        model = automl.create_model(name=candidate_name,\n                                    candidate=candidate,\n                                    inference_response_keys=inference_response_keys)\n        break<\/code><\/pre>\n<\/p><\/div>\n<h3>Customize the inference response<\/h3>\n<p>When recreating either the best model or any other of Autopilot\u2019s trained models, we can customize the inference response by adding the optional parameter <code>inference_response_keys<\/code>, as shown in the preceding example. This parameter accepts the following keys for both binary and multiclass classification problem types:<\/p>\n<ul>\n<li><strong>predicted_label<\/strong> \u2013 The predicted class.<\/li>\n<li><strong>probability<\/strong> \u2013 In binary classification, the probability that the result is predicted as the second or True class in the target column. 
In multiclass classification, the probability of the winning class.<\/li>\n<li><strong>labels<\/strong> \u2013 A list of all possible classes.<\/li>\n<li><strong>probabilities<\/strong> \u2013 A list of all probabilities for all classes (the order corresponds with <strong>labels<\/strong>).<\/li>\n<\/ul>\n<p>Because the problem we\u2019re tackling in this post is binary classification, we set this parameter as follows in the preceding snippets while creating the models:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">inference_response_keys = ['predicted_label', 'probability']<\/code><\/pre>\n<\/p><\/div>\n<h3>Create transformer and run batch predictions<\/h3>\n<p>Finally, after we recreate the candidate models, we can create a transformer to start the batch prediction job, as shown in the following code. When creating the transformer, we define the specifications of the cluster that runs the batch job, such as the instance count and type. The <code>batch_input<\/code> and <code>batch_output<\/code> variables hold the Amazon S3 locations where our input data and predictions are stored. The batch prediction job is powered by <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/batch-transform.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker batch transform<\/a>.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">transformer = model.transformer(instance_count=1,\n                                instance_type='ml.m5.xlarge',\n                                assemble_with='Line',\n                                output_path=batch_output)\n\ntransformer.transform(data=batch_input,\n                      split_type='Line',\n                      content_type='text\/csv',\n                      wait=False)<\/code><\/pre>\n<\/p><\/div>\n<p>When the job is complete, we can read the batch output and perform evaluations and other downstream actions.<\/p>\n<h2>Summary<\/h2>\n<p>In this post, we demonstrated how to quickly and easily make batch predictions using Autopilot-trained models for your post-training evaluations. 
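<\/p>
<p>One such evaluation is computing accuracy by comparing the predicted labels in the batch output against the true labels of a pre-labeled holdout set. The following is a minimal sketch; the rows are invented, and it assumes the transform output has been downloaded locally with one predicted_label,probability pair per input row, in the same order as the input:<\/p>

```python
# Invented stand-ins for downloaded batch transform output lines
# (predicted_label,probability) and the matching true labels.
batch_output_lines = [
    "True,0.83",
    "False,0.12",
    "False,0.40",
    "True,0.91",
]
true_labels = ["True", "False", "True", "True"]

# Keep only the predicted label from each output line, then count
# how many predictions agree with the true labels.
predicted = [line.split(",")[0] for line in batch_output_lines]
matches = sum(p == t for p, t in zip(predicted, true_labels))
accuracy = matches / len(true_labels)

print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.75
```

<p>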
We used SageMaker Studio to initialize an Autopilot experiment to create a model for predicting customer churn. Then we referenced Autopilot\u2019s best model to run batch predictions using the AutoML class from the SageMaker Python SDK. We also used the SDK to perform batch predictions with other model candidates. With Autopilot, we automatically explored and preprocessed our data, then created several ML models with one click, letting SageMaker take care of managing the infrastructure needed to train and tune our models. Lastly, we used batch transform to make predictions with our model using minimal code.<\/p>\n<p>For more information on Autopilot and its advanced functionalities, refer to <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/autopilot-automate-model-development.html\" target=\"_blank\" rel=\"noopener noreferrer\">Automate model development with Amazon SageMaker Autopilot<\/a>. For a detailed walkthrough of the example in the post, take a look at the following <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/main\/autopilot\/autopilot_customer_churn_high_level_with_evaluation.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">example notebook<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Arunprasath-Shankar.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-32544 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Arunprasath-Shankar.jpg\" alt=\"\" width=\"100\" height=\"124\"><\/a><strong>Arunprasath Shankar<\/strong> is an Artificial Intelligence and Machine Learning (AI\/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. 
In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-33852 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/07\/peter-chung-e1646679625860.png\" alt=\"Peter Chung\" width=\"100\" height=\"139\"><strong>Peter Chung<\/strong> is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications. He enjoys coffee, cooking, staying active, and spending time with his family.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/make-batch-predictions-with-amazon-sagemaker-autopilot\/<\/p>\n","protected":false},"author":0,"featured_media":1951,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1950"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1950"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1950\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1951"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1950"}],"wp:term":[{
"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1950"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1950"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}