{"id":1217,"date":"2021-11-18T08:32:49","date_gmt":"2021-11-18T08:32:49","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/18\/accelerate-data-preparation-using-amazon-sagemaker-data-wrangler-for-diabetic-patient-readmission-prediction\/"},"modified":"2021-11-18T08:32:49","modified_gmt":"2021-11-18T08:32:49","slug":"accelerate-data-preparation-using-amazon-sagemaker-data-wrangler-for-diabetic-patient-readmission-prediction","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/18\/accelerate-data-preparation-using-amazon-sagemaker-data-wrangler-for-diabetic-patient-readmission-prediction\/","title":{"rendered":"Accelerate data preparation using Amazon SageMaker Data Wrangler for diabetic patient readmission prediction"},"content":{"rendered":"<div id=\"\">\n<p>Patient readmission to hospital after prior visits for the same disease results in an additional burden on healthcare providers, the health system, and patients. Machine learning (ML) models, if built and trained properly, can help understand reasons for readmission, and predict readmission accurately. ML could allow providers to create better treatment plans and care, which would translate to a reduction of both cost and mental stress for patients. However, ML is a complex technique that has been limiting organizations that don\u2019t have the resources to recruit a team of data engineers and scientists to build ML workloads. 
In this post, we show you how to build an ML model based on the XGBoost algorithm to predict diabetic patient readmission easily and quickly with a graphical interface from <a href=\"https:\/\/aws.amazon.com\/sagemaker\/data-wrangler\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Data Wrangler<\/a>.<\/p>\n<p>Data Wrangler is an <a href=\"https:\/\/aws.amazon.com\/sagemaker\/studio\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a> feature designed to allow you to explore and transform tabular data for ML use cases without coding. Data Wrangler is the fastest and easiest way to prepare data for ML. It gives you the ability to use a visual interface to access data and perform exploratory data analysis (EDA) and feature engineering. It also seamlessly operationalizes your data preparation steps by allowing you to export your data flow into <a href=\"https:\/\/aws.amazon.com\/sagemaker\/pipelines\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Pipelines<\/a>, a Data Wrangler job, Python file, or <a href=\"https:\/\/aws.amazon.com\/sagemaker\/feature-store\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Feature Store<\/a>.<\/p>\n<p>Data Wrangler comes with over 300 built-in transforms and custom transformations using either Python, PySpark, or SparkSQL runtime. It also comes with built-in data analysis capabilities for charts (such as scatter plot or histogram) and time-saving model analysis capabilities such as feature importance, target leakage, and model explainability.<\/p>\n<p>In this post, we explore the key capabilities of Data Wrangler using the <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/diabetes+130-us+hospitals+for+years+1999-2008\" target=\"_blank\" rel=\"noopener noreferrer\">UCI diabetic patient readmission dataset<\/a>. 
We showcase how you can build ML data transformation steps without writing sophisticated code, and how to create a reproducible model training job, feature store, or ML pipeline for a diabetic patient readmission prediction use case.<\/p>\n<p>We have also published a related <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-data-wrangler-hospital-readmission-prediction\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub project repo<\/a> that includes the end-to-end ML workflow steps and relevant assets, including Jupyter notebooks.<\/p>\n<p>We walk you through the following high-level steps:<\/p>\n<ul>\n<li>Studio prerequisites and input dataset setup<\/li>\n<li>Design your Data Wrangler flow file<\/li>\n<li>Create processing and training jobs for model building<\/li>\n<li>Host a trained model for real-time inference<\/li>\n<\/ul>\n<h2>Studio prerequisites and input dataset setup<\/h2>\n<p>To use Studio and Studio notebooks, you must complete the Studio onboarding process. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/onboard-quick-start.html\" target=\"_blank\" rel=\"noopener noreferrer\">Quick start<\/a> instructions. The Quick start uses the same default settings as the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/onboard-iam.html\" target=\"_blank\" rel=\"noopener noreferrer\">standard Studio setup<\/a>. 
You can also choose to onboard using <a href=\"https:\/\/aws.amazon.com\/single-sign-on\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Single Sign-On<\/a> (AWS SSO) for authentication (see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/onboard-sso-users.html\" target=\"_blank\" rel=\"noopener noreferrer\">Onboard to Amazon SageMaker Studio Using AWS SSO<\/a>).<\/p>\n<h3>Dataset<\/h3>\n<p>The <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/diabetes+130-us+hospitals+for+years+1999-2008\" target=\"_blank\" rel=\"noopener noreferrer\">patient readmission dataset<\/a> captures 10 years (1999\u20132008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes with about 100,000 observations.<\/p>\n<p>You can start by downloading the public dataset and uploading it to an <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket. For demonstration purposes, we split the dataset into four tables based on feature categories: <code>diabetic_data_hospital_visits.csv<\/code>, <code>diabetic_data_demographic.csv<\/code>, <code>diabetic_data_labs.csv<\/code>, and <code>diabetic_data_medication.csv<\/code>. Review and run the code in <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-data-wrangler-hospital-readmission-prediction\/blob\/main\/datawrangler_workshop_pre_requisite.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">datawrangler_workshop_pre_requisite.ipynb<\/a>. 
If you leave everything at its default inside the notebook, the CSV files will be available in <code>s3:\/\/sagemaker-${region}-${account_number}\/sagemaker\/demo-diabetic-datawrangler\/<\/code>.<\/p>\n<h2>Design your Data Wrangler flow file<\/h2>\n<p>To get started \u2013 on the Studio File menu, choose New, and choose Data Wrangler Flow.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image001.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30468\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image001.png\" alt=\"\" width=\"1357\" height=\"906\"><\/a><\/p>\n<p>This launches a Data Wrangler instance and configures it with the Data Wrangler app. The process takes a few minutes to complete.<\/p>\n<h3>Load the data from Amazon S3 into Data Wrangler<\/h3>\n<p>To load the data into Data Wrangler, complete the following steps:<\/p>\n<ol>\n<li>On the <strong>Import tab<\/strong>, choose <strong>Amazon S3<\/strong> as the data source.<\/li>\n<li>Choose <strong>Add data source<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image003.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30469\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image003.png\" alt=\"\" width=\"1222\" height=\"451\"><\/a><\/li>\n<\/ol>\n<p>You could also import data from <a href=\"http:\/\/aws.amazon.com\/athena\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena<\/a>, <a href=\"http:\/\/aws.amazon.com\/redshift\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Redshift<\/a>, or Snowflake. 
For more information about the currently supported import sources, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-import.html\" target=\"_blank\" rel=\"noopener noreferrer\">Import<\/a>.<\/p>\n<ol start=\"3\">\n<li>Select the CSV files from the bucket <code>s3:\/\/sagemaker-${region}-${account_number}\/sagemaker\/demo-diabetic-datawrangler\/<\/code> one at a time.<\/li>\n<li>Choose <strong>Import<\/strong> for each file.<\/li>\n<\/ol>\n<p>When the import is complete, the data in the S3 bucket is available inside Data Wrangler for preprocessing.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image005.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30470\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image005.png\" alt=\"\" width=\"1385\" height=\"796\"><\/a><\/p>\n<h3>Join the CSV files<\/h3>\n<p>Now that we have imported multiple CSV source datasets, let\u2019s join them into a consolidated dataset.<\/p>\n<ol>\n<li>On the <strong>Data flow<\/strong> tab, for <strong>Data types<\/strong>, choose the plus sign.<\/li>\n<li>On the menu, choose <strong>Join<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image007.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30471\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image007.png\" alt=\"\" width=\"1295\" height=\"840\"><\/a><\/li>\n<li>Choose the <code>diabetic_data_hospital_visits.csv<\/code> dataset as the <strong>Right<\/strong> dataset.<\/li>\n<li>Choose <strong>Configure<\/strong> to set up the join criteria.<br \/><a 
href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image009.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30472\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image009.png\" alt=\"\" width=\"1283\" height=\"802\"><\/a><\/li>\n<li>For <strong>Name<\/strong>, enter a name for the join.<\/li>\n<li>For <strong>Join type<\/strong>, choose a join type (for this post, <strong>Inner<\/strong>).<\/li>\n<li>Choose the columns for <strong>Left<\/strong> and <strong>Right<\/strong>.<\/li>\n<li>Choose <strong>Apply<\/strong> to preview the joined dataset.<\/li>\n<li>Choose <strong>Add<\/strong> to add it to the data flow file.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image011.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30473\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image011.png\" alt=\"\" width=\"1340\" height=\"796\"><\/a><\/li>\n<\/ol>\n<h3>Built-in analysis<\/h3>\n<p>Before we apply any transformations on the input source, let\u2019s perform a quick analysis of the dataset. Data Wrangler provides several built-in analysis types, like histogram, scatter plot, target leakage, bias report, and quick model. For more information about analysis types, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-analyses.html\" target=\"_blank\" rel=\"noopener noreferrer\">Analyze and Visualize<\/a>.<\/p>\n<h4>Target leakage<\/h4>\n<p>Target leakage occurs when information in an ML training dataset is strongly correlated with the target label, but isn\u2019t available when the model is used for prediction. 
You might have a column in your dataset that serves as a proxy for the column you want to predict with your model. For classification tasks, Data Wrangler calculates the prediction quality metric of ROC-AUC, which is computed individually for each feature column via cross-validation to generate a target leakage report.<\/p>\n<ol>\n<li>On the <strong>Data flow<\/strong> tab, for <strong>Join<\/strong>, choose the plus sign.<\/li>\n<li>Choose <strong>Add analysis<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image013.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-30474 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image013.png\" alt=\"\" width=\"1284\" height=\"834\"><\/a><\/li>\n<li>For <strong>Analysis type<\/strong>, choose <strong>Target Leakage<\/strong>.<\/li>\n<li>For <strong>Analysis name<\/strong>, enter a name.<\/li>\n<li>For <strong>Max features<\/strong>, enter <code>50<\/code>.<\/li>\n<li>For <strong>Problem Type<\/strong>, choose <strong>classification<\/strong>.<\/li>\n<li>For <strong>Target<\/strong>, choose <strong>readmitted<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong> to generate the report.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image015.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30475\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image015.png\" alt=\"\" width=\"1432\" height=\"719\"><\/a><\/li>\n<\/ol>\n<p>As shown in the preceding screenshot, there is no indication of target leakage in our input dataset. 
However, a few features like <code>encounter_id_1<\/code>, <code>encounter_id_0<\/code>, <code>weight<\/code>, and <code>payer_code<\/code> are marked as possibly redundant, with a predictive ability (ROC-AUC) of 0.5. This means these features by themselves don\u2019t provide any useful information for predicting the target. Before deciding to drop these uninformative features, you should consider whether they could add value when used in tandem with other features. For our use case, we keep them as is and move to the next step.<\/p>\n<ol start=\"9\">\n<li>Choose <strong>Save<\/strong> to save the analysis into your Data Wrangler data flow file.<\/li>\n<\/ol>\n<h4>Bias report<\/h4>\n<p>AI\/ML systems are only as good as the data we put into them. ML-based systems are more accessible than ever before, and with the growth of adoption throughout various industries, further questions arise surrounding fairness and how it is ensured across these ML systems. Understanding how to detect and avoid bias in ML models is imperative and complex. With the built-in bias report in Data Wrangler, data scientists can quickly detect bias during the data preparation stage of the ML workflow. Bias report analysis uses <a href=\"https:\/\/aws.amazon.com\/sagemaker\/clarify\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Clarify<\/a> to perform bias analysis.<\/p>\n<p>To generate a bias report, you must specify the target column that you want to predict and a facet or column that you want to inspect for potential biases. 
For example, we can generate a bias report on the <code>gender<\/code> feature for <code>Female<\/code> values to see whether there is any class imbalance.<\/p>\n<ol>\n<li>On the <strong>Analysis<\/strong> tab, choose <strong>Create new analysis<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image017.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30476\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image017.png\" alt=\"\" width=\"1429\" height=\"444\"><\/a><\/li>\n<li>For <strong>Analysis type<\/strong>, choose <strong>Bias Report<\/strong>.<\/li>\n<li>For <strong>Analysis name<\/strong>, enter a name.<\/li>\n<li>For <strong>Select the column your model predicts<\/strong>, choose <strong>readmitted<\/strong>.<\/li>\n<li>For <strong>Predicted value<\/strong>, enter <code>NO<\/code>.<\/li>\n<li>For <strong>Column to analyze for bias<\/strong>, choose <strong>gender<\/strong>.<\/li>\n<li>For <strong>Column value to analyze for bias<\/strong>, choose <strong>Female<\/strong>.<\/li>\n<li>Leave the remaining settings at their defaults.<\/li>\n<li>Choose <strong>Check for bias<\/strong> to generate the bias report.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image019.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30477\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image019.png\" alt=\"\" width=\"1618\" height=\"853\"><\/a><\/li>\n<\/ol>\n<p>As shown in the bias report, there is no significant bias in our input dataset, which means the dataset has a fair amount of representation by gender. For our dataset, we can move forward with a hypothesis that there is no inherent bias in our dataset. 
However, based on your use case and dataset, you might want to run similar bias reporting on other features of your dataset to identify any potential bias. If any bias is detected, you can consider applying a suitable transformation to address that bias.<\/p>\n<ol start=\"10\">\n<li>Choose <strong>Save<\/strong> to add this report to the data flow file.<\/li>\n<\/ol>\n<h4>Histogram<\/h4>\n<p>In this section, we use a histogram to gain insights into the target label patterns inside our input dataset.<\/p>\n<ol>\n<li>On the <strong>Analysis<\/strong> tab, choose <strong>Create new analysis<\/strong>.<\/li>\n<li>For <strong>Analysis type<\/strong>, choose <strong>Histogram<\/strong>.<\/li>\n<li>For <strong>Analysis name<\/strong>, enter a name.<\/li>\n<li>For <strong>X axis<\/strong>, choose <strong>readmitted<\/strong>.<\/li>\n<li>For <strong>Color by<\/strong>, choose <strong>race<\/strong>.<\/li>\n<li>For <strong>Facet by<\/strong>, choose <strong>gender<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong> to generate a histogram.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image021.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30478\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image021.png\" alt=\"\" width=\"1432\" height=\"682\"><\/a><\/li>\n<\/ol>\n<p>This ML problem is a multi-class classification problem. However, we can observe a major target class imbalance between patients readmitted <strong>&lt;30<\/strong> days, <strong>&gt;30<\/strong> days, and <strong>NO<\/strong> readmission. We can also see that these two classifications are proportionate across gender and race. To improve our model\u2019s potential predictive performance, we can merge <strong>&lt;30<\/strong> and <strong>&gt;30<\/strong> into a single positive class. 
Merging these target classes turns our ML problem into a binary classification. As we demonstrate in the next section, we can do this easily by adding the respective transformations.<\/p>\n<h3>Transformations<\/h3>\n<p>When it comes to training an ML model for structured or tabular data, decision tree-based algorithms are often considered best in class. This is due to their technique of combining ensembles of trees, boosting weak learners by using gradient descent.<\/p>\n<p>For our medical source dataset, we use the SageMaker built-in <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/xgboost.html\" target=\"_blank\" rel=\"noopener noreferrer\">XGBoost algorithm<\/a> because it\u2019s one of the most popular decision tree-based ensemble ML algorithms. The XGBoost algorithm can only accept numerical values as input, so as a prerequisite we must apply categorical feature transformations on our source dataset.<\/p>\n<p>Data Wrangler comes with over 300 built-in transforms, which require no coding. Let\u2019s use built-in transforms to apply a few key transformations and prepare our training dataset.<\/p>\n<h4>Handle missing values<\/h4>\n<p>To address missing values, complete the following steps:<\/p>\n<ol>\n<li>Switch to the <strong>Data<\/strong> tab to bring up all the built-in transforms.<\/li>\n<li>Expand <strong>Handle missing<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Impute<\/strong>.<\/li>\n<li>For <strong>Column type<\/strong>, choose <strong>Numeric<\/strong>.<\/li>\n<li>For <strong>Input column<\/strong>, choose <strong>diag_1<\/strong>.<\/li>\n<li>For <strong>Imputing strategy<\/strong>, choose <strong>Mean<\/strong>.<\/li>\n<li>By default, the operation is performed in-place, but you can provide an optional <strong>Output column<\/strong> name, which creates a new column with imputed values. 
For this post, we use the default in-place update.<\/li>\n<li>Choose <strong>Preview<\/strong> to preview the results.<\/li>\n<li>Choose <strong>Add<\/strong> to include this transformation step in the data flow file.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image023.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30479\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image023.png\" alt=\"\" width=\"1428\" height=\"705\"><\/a><\/li>\n<li>Repeat these steps for the <code>diag_2<\/code> and <code>diag_3<\/code> features to impute their missing values.<\/li>\n<\/ol>\n<h4>Search and edit features with special characters<\/h4>\n<p>Because our source dataset has features with special characters, we need to clean them before training. Let\u2019s use the search and edit transform.<\/p>\n<ol>\n<li>Expand <strong>Search and edit<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Find and replace substring<\/strong>.<\/li>\n<li>For <strong>Input column<\/strong>, choose <strong>race<\/strong>.<\/li>\n<li>For <strong>Pattern<\/strong>, enter <code><strong>?<\/strong><\/code>.<\/li>\n<li>For <strong>Replacement string<\/strong>, enter <strong>Other<\/strong>.<\/li>\n<li>Leave <strong>Output column<\/strong> blank for in-place replacements.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<li>Choose <strong>Add<\/strong> to add the transform to your data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image025.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30459\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image025.png\" alt=\"\" width=\"1432\" height=\"722\"><\/a><\/li>\n<li>Repeat the 
same steps for the other features, replacing <code>?<\/code> in <code>weight<\/code> and <code>payer_code<\/code> with <code>0<\/code> and in <code>medical_specialty<\/code> with <code>Other<\/code>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image027.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30460\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image027.png\" alt=\"\" width=\"1432\" height=\"721\"><\/a><\/li>\n<\/ol>\n<h4>One-hot encoding for categorical features<\/h4>\n<p>To use one-hot encoding for categorical features, complete the following steps:<\/p>\n<ol>\n<li>Expand <strong>Encode categorical<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>One-hot encode<\/strong>.<\/li>\n<li>For <strong>Input column<\/strong>, choose <strong>race<\/strong>.<\/li>\n<li>For <strong>Output style<\/strong>, choose <strong>Columns<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<li>Choose <strong>Add<\/strong> to add the change to the data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image029.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30461\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image029.png\" alt=\"\" width=\"1672\" height=\"793\"><\/a><\/li>\n<li>Repeat these steps for <code>age<\/code> and <code>medical_specialty_filler<\/code> to one-hot encode those categorical features as well.<\/li>\n<\/ol>\n<h4>Ordinal encoding for categorical features<\/h4>\n<p>To use ordinal encoding for categorical features, complete the following steps:<\/p>\n<ol>\n<li>Expand <strong>Encode categorical<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Ordinal encode<\/strong>.<\/li>\n<li>For 
<strong>Input column<\/strong>, choose <strong>gender<\/strong>.<\/li>\n<li>For <strong>Invalid handling strategy<\/strong>, choose <strong>Keep<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<li>Choose <strong>Add<\/strong> to add the change to the data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image031.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30462\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image031.png\" alt=\"\" width=\"1669\" height=\"677\"><\/a><\/li>\n<\/ol>\n<h4>Custom transformations: Add new features to your dataset<\/h4>\n<p>If we decide to store our transformed features in Feature Store, a prerequisite is to insert the <code>eventTime<\/code> feature into the dataset. We can easily do that using a custom transformation.<\/p>\n<ol>\n<li>Expand <strong>Custom Transform<\/strong> in the list of transforms.<\/li>\n<li>Choose <strong>Python (Pandas)<\/strong> and enter the following code:\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Table is available as variable `df`\nimport time\ndf['eventTime'] = time.time()<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Choose <strong>Preview<\/strong> to view the results.<\/li>\n<li>Choose <strong>Add<\/strong> to add the change to the data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image033.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30465\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image033.png\" alt=\"\" width=\"1615\" height=\"489\"><\/a><\/li>\n<\/ol>\n<h4>Transform the target label<\/h4>\n<p>The target label <code>readmitted<\/code> has three classes: <strong>NO<\/strong> readmission, readmitted 
<strong>&lt;30<\/strong> days, and readmitted <strong>&gt;30<\/strong> days. We saw in our histogram analysis that there is a strong class imbalance because the majority of the patients weren\u2019t readmitted. We can combine the latter two classes into a positive class to denote the patients being readmitted, and turn the classification problem into a binary case instead of multi-class. Let\u2019s use the search and edit transform to convert string values to binary values.<\/p>\n<ol>\n<li>Expand <strong>Search and edit<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Find and replace substring<\/strong>.<\/li>\n<li>For <strong>Input column<\/strong>, choose <strong>readmitted<\/strong>.<\/li>\n<li>For <strong>Pattern<\/strong>, enter <code>&gt;30|&lt;30<\/code>.<\/li>\n<li>For <strong>Replacement string<\/strong>, enter <code><strong>1<\/strong><\/code>.<\/li>\n<\/ol>\n<p>This converts all values that are either <strong>&gt;30<\/strong> or <strong>&lt;30<\/strong> to <strong>1<\/strong>.<\/p>\n<ol start=\"6\">\n<li>Choose <strong>Preview<\/strong> to view the results.<\/li>\n<li>Choose <strong>Add<\/strong> to add this transform to the data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image035.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-30480 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image035.png\" alt=\"\" width=\"1784\" height=\"905\"><\/a><\/li>\n<\/ol>\n<p>Let\u2019s repeat the same steps to convert <strong>NO<\/strong> values to <strong>0<\/strong>.<\/p>\n<ol start=\"8\">\n<li>Expand <strong>Search and edit<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Find and replace substring<\/strong>.<\/li>\n<li>For <strong>Input column<\/strong>, choose 
<strong>readmitted<\/strong>.<\/li>\n<li>For <strong>Pattern<\/strong>, enter <code><strong>NO<\/strong><\/code>.<\/li>\n<li>For <strong>Replacement string<\/strong>, enter <code><strong>0<\/strong><\/code>.<\/li>\n<li>Choose <strong>Preview<\/strong> to review the converted column.<\/li>\n<li>Choose <strong>Add<\/strong> to add the transform to our data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image037.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30466\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/image037.png\" alt=\"\" width=\"1796\" height=\"903\"><\/a><\/li>\n<\/ol>\n<p>Now our target label <strong>readmitted<\/strong> is ready for ML training.<\/p>\n<h4>Position the target label as the first column to use the XGBoost algorithm<\/h4>\n<p>We\u2019re going to use the SageMaker built-in XGBoost algorithm to train the model, and this algorithm assumes that the target label is in the first column. 
Let\u2019s position the target label as such in order to use this algorithm.<\/p>\n<ol>\n<li>Expand <strong>Manage columns<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Move column<\/strong>.<\/li>\n<li>For <strong>Move type<\/strong>, choose <strong>Move to start<\/strong>.<\/li>\n<li>For <strong>Column to move<\/strong>, choose <strong>readmitted<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<li>Choose <strong>Add<\/strong> to add the change to your data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image039.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30481\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image039.png\" alt=\"\" width=\"1759\" height=\"824\"><\/a><\/li>\n<\/ol>\n<h4>Drop redundant columns<\/h4>\n<p>Next, we drop any redundant columns.<\/p>\n<ol>\n<li>Expand <strong>Manage columns<\/strong> in the list of transforms.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Drop column<\/strong>.<\/li>\n<li>For <strong>Column to drop<\/strong>, choose <code><strong>encounter_id_0<\/strong><\/code>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<li>Choose <strong>Add<\/strong> to add the changes to the flow file.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image041.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30482\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image041.png\" alt=\"\" width=\"1668\" height=\"750\"><\/a><\/li>\n<li>Repeat these steps for the other redundant columns: <code>patient_nbr_0<\/code>, <code>encounter_id_1<\/code>, and <code>patient_nbr_1<\/code>.<\/li>\n<\/ol>\n<p>At this stage, we have 
run a few analyses and applied several transformations to our raw input dataset. To preserve the transformed state of the input dataset as a checkpoint, choose <strong>Export data<\/strong>. This option persists the transformed dataset to an S3 bucket.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image043.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30483\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image043.png\" alt=\"\" width=\"1380\" height=\"366\"><\/a><\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image045.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30484\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image045.png\" alt=\"\" width=\"1429\" height=\"242\"><\/a><\/p>\n<h4>Quick Model analysis<\/h4>\n<p>Now that we have applied transformations to our initial dataset, let\u2019s explore the Quick Model analysis feature. Quick Model helps you quickly evaluate the training dataset and produces an importance score for each feature. A feature importance score indicates how useful a feature is at predicting a target label. The score is between 0 and 1; a higher number indicates that the feature is more important to the whole dataset. 
Because our use case is a classification problem, Quick Model also generates an F1 score for the current dataset.<\/p>\n<ol>\n<li>Switch back to the <strong>Analysis<\/strong> tab and choose <strong>Create new analysis<\/strong> to bring up the built-in analyses.<\/li>\n<li>For <strong>Analysis type<\/strong>, choose <strong>Quick Model<\/strong>.<\/li>\n<li>Enter a name for your analysis.<\/li>\n<li>For <strong>Label<\/strong>, choose <strong>readmitted<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image047.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30485\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image047.png\" alt=\"\" width=\"1432\" height=\"686\"><\/a><\/li>\n<li>Choose <strong>Preview<\/strong> and wait for the model to be trained and the results to appear.<\/li>\n<\/ol>\n<p>With the transformed dataset, the resulting Quick Model F1 score is 0.618 (your generated score might be different). Data Wrangler performs several steps to generate the F1 score, including preprocessing, training, evaluating, and finally calculating feature importance. 
For more details about these steps, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-analyses.html#data-wrangler-quick-model\" target=\"_blank\" rel=\"noopener noreferrer\">Quick Model<\/a>.<\/p>\n<p>With the Quick Model analysis feature, data scientists can iterate through applicable transformations until they arrive at a transformed dataset that can potentially yield better accuracy and meet business expectations.<\/p>\n<ol start=\"6\">\n<li>Choose <strong>Save<\/strong> to add the quick model analysis to the data flow.<\/li>\n<\/ol>\n<h4>Export options<\/h4>\n<p>We\u2019re now ready to export our data flow for further processing.<\/p>\n<ol>\n<li>Navigate back to the data flow designer by choosing <strong>Back to data flow<\/strong> at the top left.<\/li>\n<li>On the <strong>Export<\/strong> tab, choose <strong>Steps<\/strong> to reveal the Data Wrangler flow steps.<\/li>\n<li>Choose the last step to mark it with a check.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image049.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30486\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image049.png\" alt=\"\" width=\"1721\" height=\"879\"><\/a><\/li>\n<li>Choose <strong>Export step<\/strong> to reveal the export options.<\/li>\n<\/ol>\n<p>As of this writing, you have four export options:<\/p>\n<ul>\n<li><strong>Save to S3<\/strong> \u2013 Save the data to an S3 bucket using a <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/processing-job.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker processing job<\/a><\/li>\n<li><strong>Pipeline<\/strong> \u2013 Export a Jupyter notebook that creates a <a href=\"https:\/\/aws.amazon.com\/sagemaker\/pipelines\/\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker pipeline <\/a>with your data 
flow<\/li>\n<li><strong>Python Code<\/strong> \u2013 Export your data flow to Python code<\/li>\n<li><strong>Feature Store<\/strong> \u2013 Export a Jupyter notebook that creates a Feature Store feature group and adds features to an offline or online feature store<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image051.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30487\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image051.png\" alt=\"\" width=\"1727\" height=\"818\"><\/a><\/li>\n<\/ul>\n<ol start=\"5\">\n<li>Choose <strong>Save to S3<\/strong> to generate a fully implemented Jupyter notebook that creates a processing job using your data flow file.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image053.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30488\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image053.png\" alt=\"\" width=\"1430\" height=\"714\"><\/a><\/li>\n<\/ol>\n<h2>Run processing and training jobs for model building<\/h2>\n<p>In this section, we show how to run processing and training jobs using the generated Jupyter notebook from Data Wrangler.<\/p>\n<h3>Submit a processing job<\/h3>\n<p>We\u2019re now ready to submit a SageMaker processing job using our data flow file.<\/p>\n<p>Run all the cells up to and including the <strong>Create Processing Job<\/strong> cell inside the exported notebook.<\/p>\n<p>The cell <strong>Create Processing Job<\/strong> triggers a new SageMaker processing job by provisioning managed infrastructure and running the required Data Wrangler Docker container on that infrastructure.<\/p>\n<p><a 
href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image055.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30489\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image055.png\" alt=\"\" width=\"1024\" height=\"519\"><\/a><\/p>\n<p>You can check the status of the submitted processing job by running the next cell <strong>Job Status &amp; S3 Output Location<\/strong>.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image057.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30490\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image057.png\" alt=\"\" width=\"1008\" height=\"205\"><\/a><\/p>\n<p>You can also check the status of the submitted processing job on the SageMaker console.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image059.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30491\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image059.png\" alt=\"\" width=\"1428\" height=\"296\"><\/a><\/p>\n<h3>Train a model with SageMaker<\/h3>\n<p>Now that the data has been processed, let\u2019s train a model using the data. The same notebook has sample steps to train a model using the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/xgboost.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker built-in XGBoost algorithm<\/a>. 
Because our use case is a binary classification ML problem, we need to change the objective to <code>binary:logistic<\/code> inside the sample training steps.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image061.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30492\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image061.png\" alt=\"\" width=\"1156\" height=\"384\"><\/a><\/p>\n<p>Now we\u2019re ready to run our training job using the SageMaker managed infrastructure. Run the cell <strong>Start the Training Job<\/strong>.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image063.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30493\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image063.png\" alt=\"\" width=\"1220\" height=\"287\"><\/a><\/p>\n<p>You can monitor the status of the submitted training job on the SageMaker console, on the <strong>Training jobs<\/strong> page.<\/p>\n<h2>Host a trained model for real-time inference<\/h2>\n<p>We now use another notebook available on GitHub under the project folder <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-data-wrangler-hospital-readmission-prediction\/blob\/main\/hosting\/model_hosting_steps.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">hosting\/Model_deployment_Steps.ipynb<\/a>. This is a simple notebook with two cells: the first cell has code for deploying your model to a persistent endpoint. 
You need to update <code>model_url<\/code> with the S3 location of the model artifact produced by your training job.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image065.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30494\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image065.png\" alt=\"\" width=\"1031\" height=\"560\"><\/a><\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image067.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30495\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image067.png\" alt=\"\" width=\"1354\" height=\"709\"><\/a><\/p>\n<p>The second cell in the notebook runs inference on the sample test file under <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-data-wrangler-hospital-readmission-prediction\/blob\/main\/hosting\/test_data\/test_data_UCI_sample.csv\" target=\"_blank\" rel=\"noopener noreferrer\">test_data\/test_data_UCI_sample.csv<\/a>. As you can see, we can generate predictions for the synthetic observations in the CSV file. 
That concludes the ML workflow.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image069.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30496\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image069.png\" alt=\"\" width=\"859\" height=\"324\"><\/a><\/p>\n<h2>Clean up<\/h2>\n<p>After you have experimented with the steps in this post, perform the following cleanup steps to stop incurring charges:<\/p>\n<ol>\n<li>On the SageMaker console, under <strong>Inference<\/strong> in the navigation pane, choose <strong>Endpoints<\/strong>.<\/li>\n<li>Select your hosted endpoint.<\/li>\n<li>On the <strong>Actions<\/strong> menu, choose <strong>Delete<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image071.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30497\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image071.png\" alt=\"\" width=\"1767\" height=\"575\"><\/a><\/li>\n<li>On the SageMaker Studio Control Panel, navigate to your SageMaker user profile.<\/li>\n<li>Under <strong>Apps<\/strong>, locate your Data Wrangler app and choose <strong>Delete app<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image073.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30498\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/ML-5155-image073.png\" alt=\"\" width=\"1438\" height=\"739\"><\/a><\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>In this post, we explored Data Wrangler capabilities using a public medical dataset related to <a 
href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/diabetes+130-us+hospitals+for+years+1999-2008\" target=\"_blank\" rel=\"noopener noreferrer\">patient readmission<\/a> and demonstrated how to perform feature transformations using built-in transforms and quick analysis. We showed how to generate, without much coding, the steps required to trigger data processing and ML training. This no-code\/low-code capability of Data Wrangler accelerates training data preparation and increases data scientist agility with faster iterative data preparation. In the end, we hosted our trained model and ran inferences against synthetic test data. We encourage you to check out our <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-data-wrangler-hospital-readmission-prediction\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repository<\/a> to get hands-on practice and find new ways to improve model accuracy! To learn more about SageMaker, visit the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/whatis.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Developer Guide<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/Shyam-Namavaram.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-30499 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/08\/Shyam-Namavaram.png\" alt=\"\" width=\"100\" height=\"123\"><\/a>Shyam Namavaram<\/strong> is a Senior Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud-native applications. He passionately works with customers to accelerate their AI\/ML adoption by providing technical guidance and helping them innovate and build secure cloud solutions on AWS. He specializes in AI\/ML, containers, and analytics technologies. 
Outside of work, he loves playing sports and exploring nature by trekking.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/08\/18\/Michael-Hsieh.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-27322 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/08\/18\/Michael-Hsieh.jpg\" alt=\"\" width=\"100\" height=\"111\"><\/a>Michael Hsieh<\/strong> is a Senior AI\/ML Specialist Solutions Architect. He works with customers to advance their ML journey with a combination of Amazon ML offerings and his ML domain knowledge. As a Seattle transplant, he loves exploring the great nature the region has to offer, such as the hiking trails, scenic kayaking in SLU, and the sunset at Shilshole Bay.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/accelerate-data-preparation-using-amazon-sagemaker-data-wrangler-for-diabetic-patient-readmission-prediction\/<\/p>\n","protected":false},"author":0,"featured_media":1218,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1217"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1217"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1217\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1218"
}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1217"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1217"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1217"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}