{"id":1017,"date":"2021-10-12T08:39:31","date_gmt":"2021-10-12T08:39:31","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/12\/build-tune-and-deploy-an-end-to-end-churn-prediction-model-using-amazon-sagemaker-pipelines\/"},"modified":"2021-10-12T08:39:31","modified_gmt":"2021-10-12T08:39:31","slug":"build-tune-and-deploy-an-end-to-end-churn-prediction-model-using-amazon-sagemaker-pipelines","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/12\/build-tune-and-deploy-an-end-to-end-churn-prediction-model-using-amazon-sagemaker-pipelines\/","title":{"rendered":"Build, tune, and deploy an end-to-end churn prediction model using Amazon SageMaker Pipelines"},"content":{"rendered":"<p>The ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it, represents a huge potential revenue source for every online business. Depending on the industry and business objective, the problem statement can be multi-layered. The following are some business objectives based on this strategy:<\/p>\n<p>This post discusses how you can orchestrate an end-to-end churn prediction model across each step: data preparation, experimenting with a baseline model and hyperparameter optimization (HPO), training and tuning, and registering the best model. You can manage your <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> training and inference workflows using <a href=\"https:\/\/aws.amazon.com\/sagemaker\/studio\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a> and the SageMaker Python SDK. 
SageMaker offers all the tools you need to create high-quality data science solutions.<\/p>\n<p>SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.<\/p>\n<p>Studio provides a single, web-based visual interface where you can perform all ML development steps, improving data science team productivity by up to 10 times.<\/p>\n<p><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/pipelines.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Pipelines<\/a> is a tool for building ML pipelines that takes advantage of direct SageMaker integration. With Pipelines, you can easily automate the steps of building an ML model, catalog models in the model registry, and use one of several templates provided in SageMaker Projects to set up continuous integration and continuous delivery (CI\/CD) for the end-to-end ML lifecycle at scale.<\/p>\n<p>After the model is trained, you can use <a href=\"https:\/\/aws.amazon.com\/sagemaker\/clarify\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Clarify<\/a> to identify and limit bias and explain predictions to business stakeholders. You can share these automated reports with business and technical teams for downstream target campaigns or to determine features that are key differentiators for customer lifetime value.<\/p>\n<p>By the end of this post, you should have enough information to successfully use this end-to-end template using Pipelines to train, tune, and deploy your own predictive analytics use case. The full instructions are available on the <a href=\"https:\/\/github.com\/aws-samples\/customer-churn-sagemaker-pipelines-sample\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>.<\/p>\n<p>In this solution, your entry point is the Studio integrated development environment (IDE) for rapid experimentation. 
Studio offers an environment to manage the end-to-end Pipelines experience. With Studio, you can bypass the <a href=\"http:\/\/aws.amazon.com\/console\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Management Console<\/a> for your entire workflow management. For more information on managing Pipelines from Studio, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/pipelines-studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">View, Track, and Execute SageMaker Pipelines in SageMaker Studio<\/a>.<\/p>\n<p>The following diagram illustrates the high-level architecture of the data science workflow.<\/p>\n<p>After you create the Studio domain, select your user name and choose <strong>Open Studio<\/strong>. A web-based IDE opens that allows you to store and collect all the things that you need\u2014whether it\u2019s code, notebooks, datasets, settings, or project folders.<\/p>\n<p>Pipelines is integrated directly with SageMaker, so you don\u2019t need to interact with any other AWS services. You also don\u2019t need to manage any resources because Pipelines is a fully managed service, which means that it creates and manages resources for you. For more information on the various SageMaker components, which are available both as standalone Python APIs and as integrated components of Studio, see the <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker service page<\/a>.<\/p>\n<p>For this use case, you use the following components for the fully automated model development process:<\/p>\n<p>A SageMaker pipeline is a series of interconnected steps that is defined by a JSON pipeline definition. This pipeline definition encodes a pipeline using a directed acyclic graph (DAG). This DAG gives information on the requirements for and relationships between each step of your pipeline. The structure of a pipeline\u2019s DAG is determined by the data dependencies between steps. 
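To make the DAG idea concrete, here is a minimal plain-Python sketch (not the SageMaker API): each step lists the steps whose outputs it consumes, and a topological sort recovers a valid execution order, which is the same kind of ordering Pipelines derives from step property references. The first two step names match steps defined later in this post; the last two are hypothetical.

```python
from graphlib import TopologicalSorter

# Step -> set of upstream steps whose outputs it consumes.
# "ChurnModelProcess" and "ChurnHyperParameterTuning" appear later in this post;
# "ChurnEvalBestModel" and "CheckAUCScoreChurnEvaluation" are hypothetical names.
dependencies = {
    "ChurnModelProcess": set(),                          # no upstream inputs
    "ChurnHyperParameterTuning": {"ChurnModelProcess"},  # consumes processed data
    "ChurnEvalBestModel": {"ChurnHyperParameterTuning"},
    "CheckAUCScoreChurnEvaluation": {"ChurnEvalBestModel"},
}

# A topological sort of this mapping is a valid execution order.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
# → ['ChurnModelProcess', 'ChurnHyperParameterTuning',
#    'ChurnEvalBestModel', 'CheckAUCScoreChurnEvaluation']
```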
These data dependencies are created when the properties of a step\u2019s output are passed as the input to another step.<\/p>\n<p>For this post, our use case is a classic ML problem that aims to understand which marketing strategies, based on consumer behavior, we can adopt to increase customer retention for a given retail store. The following diagram illustrates the complete ML workflow for the churn prediction use case.<\/p>\n<p>Let\u2019s go through the accelerated ML workflow development process in detail.<\/p>\n<p>To follow along with this post, you need to download and save the <a href=\"https:\/\/www.kaggle.com\/uttamp\/store-data\" target=\"_blank\" rel=\"noopener noreferrer\">sample dataset<\/a> in the default <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket associated with your SageMaker session, or in an S3 bucket of your choice. For rapid experimentation or baseline model building, you can save a copy of the dataset under your home directory in <a href=\"https:\/\/aws.amazon.com\/efs\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic File System<\/a> (Amazon EFS) and follow the Jupyter notebook <code>Customer_Churn_Modeling.ipynb<\/code>.<\/p>\n<p>The following screenshot shows the sample set, with the target variable <code>retained<\/code> set to 1 if the customer is assumed to be active, or 0 otherwise.<\/p>\n<p>Run the following code in a Studio notebook to preprocess the dataset and upload it to your own S3 bucket:<\/p>\n<div>\n<pre><code class=\"lang-python\">import boto3\nimport pandas as pd\nimport numpy as np\n\n## Preprocess the dataset\ndef preprocess_data(file_path):\n  df = pd.read_csv(file_path)\n  ## Convert to datetime columns\n  df[\"firstorder\"]=pd.to_datetime(df[\"firstorder\"],errors='coerce')\n  df[\"lastorder\"] = pd.to_datetime(df[\"lastorder\"],errors='coerce')\n  ## Drop Rows with null values\n  df = df.dropna()\n  
## Create Column which gives the days between the last order and the first order\n  df[\"first_last_days_diff\"] = (df['lastorder']-df['firstorder']).dt.days\n  ## Create Column which gives the days between when the customer record was created and the first order\n  df['created'] = pd.to_datetime(df['created'])\n  df['created_first_days_diff']=(df['created']-df['firstorder']).dt.days\n  ## Drop Columns\n  df.drop(['custid','created','firstorder','lastorder'],axis=1,inplace=True)\n  ## Apply one hot encoding on favday and city columns\n  df = pd.get_dummies(df,prefix=['favday','city'],columns=['favday','city'])\n  return df\n\n## Set the required configurations\nmodel_name = \"churn_model\"\nenv = \"dev\"\n## S3 Bucket\ndefault_bucket = \"customer-churn-sm-pipeline\"\n## Preprocess the dataset\nstoredata = preprocess_data(f\"s3:\/\/{default_bucket}\/data\/storedata_total.csv\")<\/code><\/pre>\n<\/p><\/div>\n<p>With Studio notebooks with elastic compute, you can now easily run multiple training and tuning jobs. 
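As a quick local sanity check, you can exercise the same transformations as the preprocessing above on a tiny synthetic frame. This sketch uses a subset of the dataset's columns; the values are made up.

```python
import pandas as pd

# Two synthetic rows using a subset of the dataset's schema (values are made up)
df = pd.DataFrame({
    "custid": ["C1", "C2"],
    "retained": [1, 0],
    "created": ["2019-01-01", "2019-02-01"],
    "firstorder": ["2019-01-05", "2019-02-03"],
    "lastorder": ["2019-03-01", "2019-02-10"],
    "esent": [30, 2],
    "favday": ["Monday", "Friday"],
    "city": ["BLR", "DEL"],
})

# Mirror the preprocessing steps above
df["firstorder"] = pd.to_datetime(df["firstorder"], errors="coerce")
df["lastorder"] = pd.to_datetime(df["lastorder"], errors="coerce")
df = df.dropna()
df["first_last_days_diff"] = (df["lastorder"] - df["firstorder"]).dt.days
df["created"] = pd.to_datetime(df["created"])
df["created_first_days_diff"] = (df["created"] - df["firstorder"]).dt.days
df.drop(["custid", "created", "firstorder", "lastorder"], axis=1, inplace=True)
df = pd.get_dummies(df, prefix=["favday", "city"], columns=["favday", "city"])

print(df["first_last_days_diff"].tolist())  # → [55, 7]
print(sorted(df.columns))
```

The identifier and raw date columns are gone, and each `favday`/`city` value has become its own indicator column, which is the shape the training code expects.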
For this use case, you use the SageMaker built-in XGBoost algorithm and SageMaker HPO with the objective <code>\"binary:logistic\"<\/code> and the evaluation metric <code>\"auc\"<\/code>.<\/p>\n<div>\n<pre><code class=\"lang-python\">def split_datasets(df):\n    y=df.pop(\"retained\")\n    X_pre = df\n    y_pre = y.to_numpy().reshape(len(y),1)\n    feature_names = list(X_pre.columns)\n    X= np.concatenate((y_pre,X_pre),axis=1)\n    np.random.shuffle(X)\n    train,validation,test=np.split(X,[int(.7*len(X)),int(.85*len(X))])\n    return feature_names,train,validation,test\n\n# Split dataset\nfeature_names,train,validation,test = split_datasets(storedata)\n\n# Save datasets in Amazon S3\npd.DataFrame(train).to_csv(f\"s3:\/\/{default_bucket}\/data\/train\/train.csv\",header=False,index=False)\npd.DataFrame(validation).to_csv(f\"s3:\/\/{default_bucket}\/data\/validation\/validation.csv\",header=False,index=False)\npd.DataFrame(test).to_csv(f\"s3:\/\/{default_bucket}\/data\/test\/test.csv\",header=False,index=False)\n<\/code><\/pre>\n<p>Train, tune, and find the best candidate model with the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import sagemaker\nfrom sagemaker.inputs import TrainingInput\nfrom sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter\n\n# Training and Validation Input for SageMaker Training job\ns3_input_train = TrainingInput(\n    s3_data=f\"s3:\/\/{default_bucket}\/data\/train\/\",content_type=\"csv\")\ns3_input_validation = TrainingInput(\n    s3_data=f\"s3:\/\/{default_bucket}\/data\/validation\/\",content_type=\"csv\")\n\n# Hyperparameters used\nfixed_hyperparameters = {\n    \"eval_metric\":\"auc\",\n    \"objective\":\"binary:logistic\",\n    \"num_round\":\"100\",\n    \"rate_drop\":\"0.3\",\n    \"tweedie_variance_power\":\"1.4\"\n}\n\n# Use the built-in SageMaker algorithm\nsagemaker_session = sagemaker.Session()\nregion = sagemaker_session.boto_region_name\nrole = sagemaker.get_execution_role()\ncontainer = sagemaker.image_uris.retrieve(\"xgboost\",region,\"0.90-2\")\n\nestimator = sagemaker.estimator.Estimator(\n    container,\n    role,\n    instance_count=1,\n    hyperparameters=fixed_hyperparameters,\n    
instance_type=\"ml.m4.xlarge\",\n    output_path=\"s3:\/\/{}\/output\".format(default_bucket),\n    sagemaker_session=sagemaker_session\n)\n\nhyperparameter_ranges = {\n    \"eta\": ContinuousParameter(0, 1),\n    \"min_child_weight\": ContinuousParameter(1, 10),\n    \"alpha\": ContinuousParameter(0, 2),\n    \"max_depth\": IntegerParameter(1, 10),\n}\nobjective_metric_name = \"validation:auc\"\ntuner = HyperparameterTuner(\n    estimator,\n    objective_metric_name,\n    hyperparameter_ranges,\n    max_jobs=10,\n    max_parallel_jobs=2,\n)\n\n# Tune\ntuner.fit(\n    {\"train\": s3_input_train, \"validation\": s3_input_validation},\n    include_cls_metadata=False,\n)\n\n## Explore the best model generated\ntuning_job_result = boto3.client(\"sagemaker\").describe_hyper_parameter_tuning_job(\n    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name\n)\n\njob_count = tuning_job_result[\"TrainingJobStatusCounters\"][\"Completed\"]\nprint(\"%d training jobs have completed\" % job_count)\n## 10 training jobs have completed\n\n## Get the best training job\nfrom pprint import pprint\nif tuning_job_result.get(\"BestTrainingJob\",None):\n    print(\"Best Model found so far:\")\n    pprint(tuning_job_result[\"BestTrainingJob\"])\nelse:\n    print(\"No training jobs have reported results yet.\")\n<\/code><\/pre>\n<\/p><\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28978 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-HyperParameterTuning-1024x434.png\" alt=\"ML-4931-HyperParameterTuning\" width=\"1024\" height=\"434\"><\/p>\n<p>After you establish a baseline, you can use <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/train-debugger.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Debugger<\/a> for offline model analysis. 
Debugger is a capability within SageMaker that automatically provides visibility into the model training process for real-time and offline analysis. Debugger saves the internal model state at periodic intervals, which you can analyze in real time during training and offline after the training is complete. For this use case, you use the explainability tool SHAP (SHapley Additive exPlanations) and the native integration of SHAP with Debugger. Refer to the following <a href=\"https:\/\/github.com\/aws-samples\/customer-churn-sagemaker-pipelines-sample\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a> for detailed analysis.<\/p>\n<p>The following summary plot explains the positive and negative relationships of the predictors with the target variable. For example, the top variable here, <code>esent<\/code>, is defined as the number of emails sent. This plot is made from all data points in the training set. Blue points push the final output toward class 0, and pink points toward class 1. 
Key influencing features are ranked in descending order.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28979 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-Shapley-Plots.png\" alt=\"ML-4931-Shapley-Plots\" width=\"972\" height=\"829\"><\/p>\n<p>Now you can proceed with the deploy and manage step of the ML workflow.<\/p>\n<h2>Develop and automate the workflow<\/h2>\n<p>Let\u2019s start with the project structure:<\/p>\n<ul>\n<li><strong>\/customer-churn-model<\/strong> \u2013 Project name<\/li>\n<li><strong>\/data<\/strong> \u2013 Dataset<\/li>\n<li><strong>\/pipelines<\/strong> \u2013 Code for SageMaker pipeline components<\/li>\n<li><strong>SageMaker_Pipelines_project.ipynb<\/strong> \u2013 Allows you to create and run the ML workflow<\/li>\n<li><strong>Customer_Churn_Modeling.ipynb<\/strong> \u2013 Baseline model development notebook<\/li>\n<\/ul>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28980 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-ProjectStructure-1024x523.png\" alt=\"ML-4931-projectstructure\" width=\"1024\" height=\"523\"><\/p>\n<p>Under <code>&lt;project-name&gt;\/pipelines\/customerchurn<\/code>, you can see the following Python scripts:<\/p>\n<ul>\n<li><strong>Preprocess.py<\/strong> \u2013 Integrates with SageMaker Processing for feature engineering<\/li>\n<li><strong>Evaluate.py<\/strong> \u2013 Allows model metrics calculation, in this case auc_score<\/li>\n<li><strong>Generate_config.py<\/strong> \u2013 Allows dynamic configuration needed for the downstream Clarify job for model explainability<\/li>\n<li><strong>Pipeline.py<\/strong> \u2013 Templatized code for the Pipelines ML workflow<\/li>\n<\/ul>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28981 size-large\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-CodeStructure-1024x252.png\" alt=\"ML-4931-CodeStructure\" width=\"1024\" height=\"252\"><\/p>\n<p>Let\u2019s walk through each step in the DAG and how it runs. The steps are similar to the ones we ran when we first prepared the data.<\/p>\n<p>Perform the data readiness step with the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># processing step for feature engineering\n    sklearn_processor = SKLearnProcessor(\n        framework_version=\"0.23-1\",\n        instance_type=processing_instance_type,\n        instance_count=processing_instance_count,\n        sagemaker_session=sagemaker_session,\n        role=role,\n    )\n    step_process = ProcessingStep(\n        name=\"ChurnModelProcess\",\n        processor=sklearn_processor,\n        inputs=[\n          ProcessingInput(source=input_data, destination=\"\/opt\/ml\/processing\/input\"),  \n        ],\n        outputs=[\n            ProcessingOutput(output_name=\"train\", source=\"\/opt\/ml\/processing\/train\",\n                             destination=f\"s3:\/\/{default_bucket}\/output\/train\" ),\n            ProcessingOutput(output_name=\"validation\", source=\"\/opt\/ml\/processing\/validation\",\n                            destination=f\"s3:\/\/{default_bucket}\/output\/validation\"),\n            ProcessingOutput(output_name=\"test\", source=\"\/opt\/ml\/processing\/test\",\n                            destination=f\"s3:\/\/{default_bucket}\/output\/test\")\n        ],\n        code=f\"s3:\/\/{default_bucket}\/input\/code\/preprocess.py\",\n    )<\/code><\/pre>\n<\/p><\/div>\n<p>Train, tune, and find the best candidate model:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># training step for generating model artifacts\n    model_path = f\"s3:\/\/{default_bucket}\/output\"\n    image_uri = sagemaker.image_uris.retrieve(\n        framework=\"xgboost\",\n        
region=region,\n        version=\"1.0-1\",\n        py_version=\"py3\",\n        instance_type=training_instance_type,\n    )\n    fixed_hyperparameters = {\n    \"eval_metric\":\"auc\",\n    \"objective\":\"binary:logistic\",\n    \"num_round\":\"100\",\n    \"rate_drop\":\"0.3\",\n    \"tweedie_variance_power\":\"1.4\"\n    }\n    xgb_train = Estimator(\n        image_uri=image_uri,\n        instance_type=training_instance_type,\n        instance_count=1,\n        hyperparameters=fixed_hyperparameters,\n        output_path=model_path,\n        base_job_name=f\"churn-train\",\n        sagemaker_session=sagemaker_session,\n        role=role,\n    )\n    hyperparameter_ranges = {\n    \"eta\": ContinuousParameter(0, 1),\n    \"min_child_weight\": ContinuousParameter(1, 10),\n    \"alpha\": ContinuousParameter(0, 2),\n    \"max_depth\": IntegerParameter(1, 10),\n    }\n    objective_metric_name = \"validation:auc\"\n<\/code><\/pre>\n<\/p><\/div>\n<p>You can add a <a href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2021\/07\/amazon-sagemaker-pipeline-introduces-a-automatic-hyperparameter-tuning-step\/\" target=\"_blank\" rel=\"noopener noreferrer\">model tuning step (TuningStep)<\/a> in the pipeline, which automatically invokes a hyperparameter tuning job (see the following code). The hyperparameter tuning finds the best version of a model by running many training jobs on the dataset using the algorithm and the ranges of hyperparameters that you specified. 
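Conceptually, the tuner samples configurations from the declared ranges, scores each with the objective metric, and keeps the best one. The following is a simplified random-search stand-in in plain Python, not the SageMaker API (the service's default search strategy is Bayesian optimization, and the objective function here is a made-up surrogate for <code>validation:auc</code>):

```python
import random

random.seed(7)

# Search ranges mirroring the ContinuousParameter/IntegerParameter declarations above
ranges = {
    "eta": (0.0, 1.0),
    "min_child_weight": (1.0, 10.0),
    "alpha": (0.0, 2.0),
    "max_depth": (1, 10),  # integer-valued
}

def sample_config():
    cfg = {k: random.uniform(*v) for k, v in ranges.items() if k != "max_depth"}
    cfg["max_depth"] = random.randint(*ranges["max_depth"])
    return cfg

def objective(cfg):
    # Made-up stand-in for the "validation:auc" a real training job would report
    return 0.7 + 0.2 * cfg["eta"] * (1 - abs(cfg["max_depth"] - 5) / 10)

# Sample 10 trials (analogous to max_jobs=10) and keep the best configuration
trials = [(objective(c), c) for c in (sample_config() for _ in range(10))]
best_auc, best_cfg = max(trials, key=lambda t: t[0])
print(f"best validation:auc {best_auc:.3f} with {best_cfg}")
```

The real tuning job does the same bookkeeping server-side: each trial is a full training job, and `describe_hyper_parameter_tuning_job` surfaces the winner as `BestTrainingJob`.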
You can then register the best version of the model into the model registry using the RegisterModel step.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">## Direct Integration for HPO\n\n    step_tuning = TuningStep(\n    name = \"ChurnHyperParameterTuning\",\n    tuner = HyperparameterTuner(xgb_train, objective_metric_name, hyperparameter_ranges, max_jobs=2, max_parallel_jobs=2),\n    inputs={\n            \"train\": TrainingInput(\n                s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n                    \"train\"\n                ].S3Output.S3Uri,\n                content_type=\"text\/csv\",\n            ),\n            \"validation\": TrainingInput(\n                s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n                    \"validation\"\n                ].S3Output.S3Uri,\n                content_type=\"text\/csv\",\n            ),\n        },\n    )<\/code><\/pre>\n<\/p><\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28983 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-SM_Pipeline-HPO-1024x445.png\" alt=\"ML-4931-SM_PipelineHPO\" width=\"1024\" height=\"445\"><\/p>\n<p>After you tune the model, depending on the tuning job objective metrics, you can use branching logic when orchestrating the workflow. 
For this post, the conditional step for the model quality check is as follows:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># condition step for evaluating model quality and branching execution\n    # the pipeline proceeds only if the validation AUC exceeds 0.75\n    cond_gt = ConditionGreaterThan(\n        left=JsonGet(\n            step=step_eval,\n            property_file=evaluation_report,\n            json_path=\"classification_metrics.auc_score.value\"\n        ),\n        right=0.75,\n    )\n<\/code><\/pre>\n<\/p><\/div>\n<p>The best candidate model is registered for batch scoring using the RegisterModel step:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">step_register = RegisterModel(\n        name=\"RegisterChurnModel\",\n        estimator=xgb_train,\n        model_data=step_tuning.get_top_model_s3_uri(top_k=0,s3_bucket=default_bucket,prefix=\"output\"),\n        content_types=[\"text\/csv\"],\n        response_types=[\"text\/csv\"],\n        inference_instances=[\"ml.t2.medium\", \"ml.m5.large\"],\n        transform_instances=[\"ml.m5.large\"],\n        model_package_group_name=model_package_group_name,\n        model_metrics=model_metrics,\n    )\n<\/code><\/pre>\n<\/p><\/div>\n<p>Now that the model is trained, let\u2019s see how Clarify helps us understand which features the model bases its predictions on. You can create an <code>analysis_config.json<\/code> file dynamically per workflow run using the <code>generate_config.py<\/code> utility. You can version and track the config file per pipeline <code>runId<\/code> and store it in Amazon S3 for future reference. 
Initialize the <code>DataConfig<\/code> and <code>ModelConfig<\/code> objects as follows:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">    data_config = sagemaker.clarify.DataConfig(\n        s3_data_input_path=f's3:\/\/{args.default_bucket}\/output\/train\/train.csv',\n        s3_output_path=args.bias_report_output_path,\n        label=0,\n        headers=['target','esent','eopenrate','eclickrate','avgorder','ordfreq','paperless','refill','doorstep','first_last_days_diff','created_first_days_diff','favday_Friday','favday_Monday','favday_Saturday','favday_Sunday','favday_Thursday','favday_Tuesday','favday_Wednesday','city_BLR','city_BOM','city_DEL','city_MAA'],\n        dataset_type=\"text\/csv\",\n    )\n    model_config = sagemaker.clarify.ModelConfig(\n        model_name=args.modelname,\n        instance_type=args.clarify_instance_type,\n        instance_count=1,\n        accept_type=\"text\/csv\",\n    )\n    model_predicted_label_config = sagemaker.clarify.ModelPredictedLabelConfig(probability_threshold=0.5)\n    bias_config = sagemaker.clarify.BiasConfig(\n        label_values_or_threshold=[1],\n        facet_name=\"doorstep\",\n        facet_values_or_threshold=[0],\n    )<\/code><\/pre>\n<\/p><\/div>\n<p>After you add the Clarify step as a postprocessing job using <code>sagemaker.clarify.SageMakerClarifyProcessor<\/code> in the pipeline, you can see a detailed feature and bias analysis report per pipeline run.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28984 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-ClarifyReport-1024x730.png\" alt=\"ML-4931-ClarifyReport\" width=\"1024\" height=\"730\"><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28985 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-SM-UI-1-1024x550.png\" 
alt=\"ML-4931-SM-UI-1\" width=\"1024\" height=\"550\"><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28986 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-SM-UI-2-1024x244.png\" alt=\"ML-4931-SM-UI-2\" width=\"1024\" height=\"244\"><\/p>\n<p>As the final step of the pipeline workflow, you can use a <code>TransformStep<\/code> for offline scoring. Pass in the <code>Transformer<\/code> instance and a <code>TransformInput<\/code> with the <code>batch_data<\/code> pipeline parameter defined earlier:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># step to perform batch transformation\n    transformer = Transformer(\n        model_name=step_create_model.properties.ModelName,\n        instance_type=\"ml.m5.xlarge\",\n        instance_count=1,\n        output_path=f\"s3:\/\/{default_bucket}\/ChurnTransform\"\n    )\n    step_transform = TransformStep(\n        name=\"ChurnTransform\",\n        transformer=transformer,\n        inputs=TransformInput(data=batch_data,content_type=\"text\/csv\")\n    )\n<\/code><\/pre>\n<\/p><\/div>\n<p>Finally, you can trigger a new pipeline run by choosing <strong>Start an execution<\/strong> on the Studio IDE interface.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28987 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-SMPipeline-Execution-1024x162.png\" alt=\"ML-4931-SMPipeline-Execution\" width=\"1024\" height=\"162\"><\/p>\n<p>You can also describe a pipeline run or start the pipeline using the following <a href=\"https:\/\/github.com\/aws-samples\/customer-churn-sagemaker-pipelines-sample\/blob\/main\/SageMaker_Pipelines_project.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a>. 
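Programmatically, starting a run boils down to a single SageMaker API call (`StartPipelineExecution`). A hedged sketch follows; the pipeline name and parameter names here are assumptions for illustration, and the actual call is left commented out because it requires AWS credentials:

```python
def build_start_request(pipeline_name, parameters):
    """Format pipeline parameters the way StartPipelineExecution expects them."""
    return {
        "PipelineName": pipeline_name,
        "PipelineParameters": [
            {"Name": k, "Value": str(v)} for k, v in parameters.items()
        ],
    }

request = build_start_request(
    "ChurnModelSMPipeline",  # assumed pipeline name
    {
        "ProcessingInstanceType": "ml.m5.xlarge",  # assumed parameter names
        "TrainingInstanceType": "ml.m5.xlarge",
    },
)
print(request)

# To actually start a run (requires AWS credentials and the boto3 SDK):
# import boto3
# sm = boto3.client("sagemaker")
# response = sm.start_pipeline_execution(**request)
# print(response["PipelineExecutionArn"])
```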
The following screenshot shows our output.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28988 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/05\/ML-4931-SMPipeline-DescribeExecution-1024x612.png\" alt=\"ML-4931-SMPipeline-DescribeExecution\" width=\"1024\" height=\"612\"><\/p>\n<p>You can schedule your SageMaker model building pipeline runs using <a href=\"https:\/\/aws.amazon.com\/eventbridge\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EventBridge<\/a>. SageMaker model building pipelines are supported <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/pipeline-eventbridge.html\" target=\"_blank\" rel=\"noopener noreferrer\">as a target in Amazon EventBridge<\/a>. This allows you to trigger your pipeline to run based on any event in your event bus. EventBridge enables you to automate your pipeline runs and respond automatically to events such as training job or endpoint status changes. Events include a new file being uploaded to your S3 bucket, a change in status of your SageMaker endpoint due to drift, and <a href=\"http:\/\/aws.amazon.com\/sns\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Notification Service<\/a> (Amazon SNS) topics.<\/p>\n<h2>Conclusion<\/h2>\n<p>This post explained how to use SageMaker Pipelines with other built-in SageMaker features and the XGBoost algorithm to develop, iterate, and deploy the best candidate model for churn prediction. For instructions on implementing this solution, see the <a href=\"https:\/\/github.com\/aws-samples\/customer-churn-sagemaker-pipelines-sample\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>. You can also clone and extend this solution with additional data sources for model retraining. 
We encourage you to reach out and discuss your ML use cases with your AWS account manager.<\/p>\n<h2>Additional references<\/h2>\n<p>For additional information, see the following resources:<\/p>\n<hr>\n<h2>About the Authors<\/h2>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/06\/ML-4931-Gayatri.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-29030 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/06\/ML-4931-Gayatri.png\" alt=\"\" width=\"100\" height=\"117\"><\/a><strong>Gayatri Ghanakota<\/strong> is a Machine Learning Engineer with AWS Professional Services. She is passionate about developing, deploying, and explaining AI\/ ML solutions across various domains. Prior to this role, she led multiple initiatives as a data scientist and ML engineer with top global firms in the financial and retail space. She holds a master\u2019s degree in Computer Science specialized in Data Science from the University of Colorado, Boulder.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/06\/ML-4931-Sarita-1.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-29032 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/06\/ML-4931-Sarita-1.png\" alt=\"\" width=\"100\" height=\"107\"><\/a><strong>Sarita Joshi<\/strong> is a Senior Data Scientist with AWS Professional Services focused on supporting customers across industries including retail, insurance, manufacturing, travel, life sciences, media and entertainment, and financial services. She has several years of experience as a consultant advising clients across many industries and technical domains, including AI, ML, analytics, and SAP. 
Today, she is passionately working with customers to develop and implement machine learning and AI solutions at scale.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/build-tune-and-deploy-an-end-to-end-churn-prediction-model-using-amazon-sagemaker-pipelines\/<\/p>\n","protected":false},"author":0,"featured_media":1018,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1017"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1017"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1017\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1018"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1017"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1017"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1017"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}