{"id":684,"date":"2020-12-13T00:13:20","date_gmt":"2020-12-13T00:13:20","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/12\/13\/exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler\/"},"modified":"2020-12-13T00:13:20","modified_gmt":"2020-12-13T00:13:20","slug":"exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/12\/13\/exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler\/","title":{"rendered":"Exploratory data analysis, feature engineering, and operationalizing your data flow into your ML pipeline with Amazon SageMaker Data Wrangler"},"content":{"rendered":"<div id=\"\">\n<p>According to <a href=\"https:\/\/www.anaconda.com\/state-of-data-science-2020\" target=\"_blank\" rel=\"noopener noreferrer\">The State of Data Science 2020<\/a> survey, data management, exploratory data analysis (EDA), feature selection, and feature engineering account for more than 66% of a data scientist\u2019s time (see the following diagram).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19754\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-1.jpg\" alt=\"According to The State of Data Science 2020 survey, data management, exploratory data analysis (EDA), feature selection, and feature engineering account for more than 66% of a data scientist\u2019s time.\" width=\"500\" height=\"509\"><\/p>\n<p>The same survey highlights that the top three biggest roadblocks to deploying a model in production are managing dependencies and environments, security, and skill gaps (see the following diagram).<\/p>\n<p><img 
decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19755\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-2.jpg\" alt=\"The same survey highlights that the top three biggest roadblocks to deploying a model in production are managing dependencies and environments, security, and skill gaps.\" width=\"800\" height=\"326\"><\/p>\n<p>The survey posits that these struggles result in fewer than half (48%) of the respondents feeling able to illustrate the impact data science has on business outcomes.<\/p>\n<p>Enter <a href=\"https:\/\/aws.amazon.com\/sagemaker\/data-wrangler\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Data Wrangler<\/a>, the fastest and easiest way to prepare data for machine learning (ML). SageMaker Data Wrangler gives you the ability to use a visual interface to access data, perform EDA and feature engineering, and seamlessly operationalize your data flow by exporting it into an <a href=\"https:\/\/aws.amazon.com\/sagemaker\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> pipeline, Amazon SageMaker Data Wrangler job, Python file, or SageMaker feature group.<\/p>\n<p>SageMaker Data Wrangler also provides you with over 300 built-in transforms, custom transforms using a Python, PySpark or SparkSQL runtime, built-in data analysis such as common charts (like scatterplot or histogram), custom charts using the <a href=\"https:\/\/altair-viz.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Altair library<\/a>, and useful model analysis capabilities such as feature importance, target leakage, and model explainability. 
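<\/p>
<p>As described later in this post, a custom transform simply reads the DataFrame exposed as <code>df<\/code> and reassigns the result back to <code>df<\/code>. The following is a minimal sketch of the Python (pandas) flavor; the tiny example frame and the derived column name are illustrative only, not part of the actual dataset schema:<\/p>

```python
import pandas as pd

# Stand-in for the DataFrame a Python custom transform receives;
# inside Data Wrangler the variable df already exists.
df = pd.DataFrame({'persona': ['electronics_beauty_outdoors',
                               'footwear_jewelry_furniture']})

# A custom transform is just code that reads df and reassigns df.
# Here we keep the first persona element as an illustrative feature.
df['PRIMARY_INTEREST'] = df['persona'].str.split('_').str[0]
```

<p>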
Finally, SageMaker Data Wrangler creates a data flow file that can be versioned and shared across your teams for reproducibility.<\/p>\n<h2>Solution overview<\/h2>\n<p>In this post, we use the <a href=\"https:\/\/github.com\/aws-samples\/retail-demo-store\" target=\"_blank\" rel=\"noopener noreferrer\">retail demo store<\/a> example and <a href=\"https:\/\/github.com\/aws-samples\/retail-demo-store\/tree\/master\/generators\" target=\"_blank\" rel=\"noopener noreferrer\">generate<\/a> a sample dataset. We use three files: users.csv, items.csv, and interactions.csv. We first prepare the data in order to predict the customer segment based on past interactions. Our target is the field called <code>persona<\/code>, which we later transform and rename to <code>USER_SEGMENT<\/code>.<\/p>\n<p>The following code is a preview of the users dataset:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">id,username,email,first_name,last_name,addresses,age,gender,persona\r\n1,user1,nathan.smith@example.com,Nathan,Smith,\"[{\"\"first_name\"\": \"\"Nathan\"\", \"\"last_name\"\": \"\"Smith\"\", \"\"address1\"\": \"\"049 Isaac Stravenue Apt. 
770\"\", \"\"address2\"\": \"\"\"\", \"\"country\"\": \"\"US\"\", \"\"city\"\": \"\"Johnsonmouth\"\", \"\"state\"\": \"\"NY\"\", \"\"zipcode\"\": \"\"12758\"\", \"\"default\"\": true}]\",28,M,electronics_beauty_outdoors\r\n2,user2,kevin.martinez@example.com,Kevin,Martinez,\"[{\"\"first_name\"\": \"\"Kevin\"\", \"\"last_name\"\": \"\"Martinez\"\", \"\"address1\"\": \"\"074 Jennifer Flats Suite 538\"\", \"\"address2\"\": \"\"\"\", \"\"country\"\": \"\"US\"\", \"\"city\"\": \"\"East Christineview\"\", \"\"state\"\": \"\"MI\"\", \"\"zipcode\"\": \"\"49758\"\", \"\"default\"\": true}]\",19,M,electronics_beauty_outdoors<\/code><\/pre>\n<\/div>\n<p>The following code is a preview of the items dataset:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">ITEM_ID,ITEM_URL,ITEM_SK,ITEM_NAME,ITEM_CATEGORY,ITEM_STYLE,ITEM_DESCRIPTION,ITEM_PRICE,ITEM_IMAGE,ITEM_FEATURED,ITEM_GENDER_AFFINITY\r\n36,http:\/\/dbq4nocqaarhp.cloudfront.net\/#\/product\/36,,Exercise Headphones,electronics,headphones,These stylishly red ear buds wrap securely around your ears making them perfect when exercising or on the go.,19.99,5.jpg,true,\r\n49,http:\/\/dbq4nocqaarhp.cloudfront.net\/#\/product\/49,,Light Brown Leather Lace-Up Boot,footwear,boot,Sturdy enough for the outdoors yet stylish to wear out on the town.,89.95,11.jpg,,<\/code><\/pre>\n<\/div>\n<p>The following code is a preview of the interactions dataset:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">ITEM_ID,USER_ID,EVENT_TYPE,TIMESTAMP\r\n2,2539,ProductViewed,1589580300\r\n29,5575,ProductViewed,1589580305\r\n4,1964,ProductViewed,1589580309\r\n46,5291,ProductViewed,1589580309<\/code><\/pre>\n<\/div>\n<p>This post is not intended to be a step-by-step guide, but rather to describe the process of preparing a training dataset and highlight some of the transforms and data analysis capabilities using SageMaker Data Wrangler. 
You can download the <a href=\"https:\/\/s3.us-east-1.amazonaws.com\/aws-ml-blog\/artifacts\/Exploratory-data-analysis-feature-engineering-Amazon-SageMaker-Data-Wrangler\/Data-Wrangler-Blog-Example.flow\" target=\"_blank\" rel=\"noopener noreferrer\">.flow file<\/a> and upload it to your SageMaker Studio environment if you want to retrace the full example.<\/p>\n<p>At a high level, we perform the following steps:<\/p>\n<ol>\n<li>Connect to <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) and import the data.<\/li>\n<li>Transform the data, including type casting, dropping unneeded columns, imputing missing values, label encoding, one hot encoding, and custom transformations to extract elements from a JSON-formatted column.<\/li>\n<li>Create table summaries and charts for data analysis. We use the quick model option to get a sense of which features are adding predictive power as we progress with our data preparation. We also use the built-in target leakage capability and get a report on any features that are at risk of leaking.<\/li>\n<li>Create a data flow, in which we combine and join the three tables to perform further aggregations and data analysis.<\/li>\n<li>Iterate by performing additional feature engineering or data analysis on the newly added data.<\/li>\n<li>Export our workflow to a SageMaker Data Wrangler job.<\/li>\n<\/ol>\n<h2>Prerequisites<\/h2>\n<p>Make sure you don\u2019t have any quota limits on the m5.4xlarge instance type used by your Studio application before creating a new data flow. For more information about prerequisites, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-getting-started.html\" target=\"_blank\" rel=\"noopener noreferrer\">Getting Started with Data Wrangler<\/a>.<\/p>\n<h2>Importing the data<\/h2>\n<p>We import our three CSV files from Amazon S3. 
SageMaker Data Wrangler supports CSV and Parquet files. It also allows you to sample the data in case it is too large to fit in your Studio application. The following screenshot shows a preview of the users dataset.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19756\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-3.jpg\" alt=\"A preview of the users dataset.\" width=\"800\" height=\"860\"><\/p>\n<p>After importing our CSV files, our datasets look like the following screenshot in SageMaker Data Wrangler.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19757 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-4.jpg\" alt=\"After importing our CSV files, our datasets look like the following screenshot in SageMaker Data Wrangler.\" width=\"800\" height=\"554\"><\/p>\n<p>We can now add some transforms and perform data analysis.<\/p>\n<h2>Transforming the data<\/h2>\n<p>For each table, we check the data types and make sure that they were inferred correctly.<\/p>\n<h3>Items table<\/h3>\n<p>To perform transforms on the items table, complete the following steps:<\/p>\n<ol>\n<li>On the SageMaker Data Wrangler UI, for the items table, choose <strong>+<\/strong>.<\/li>\n<li>Choose <strong>Edit data types<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19758 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-5.jpg\" alt=\"On the SageMaker Data Wrangler UI, for the items table, choose +.\" width=\"800\" height=\"474\"><\/p>\n<p>Most of the columns were inferred properly, except for one. 
The <code>ITEM_FEATURED<\/code> column is missing values and should really be cast as a Boolean.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19759\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-6.jpg\" alt=\"The ITEM_FEATURED column is missing values and should really be cast as a Boolean.\" width=\"800\" height=\"475\"><\/p>\n<p>For the items table, we perform the following transformations:<\/p>\n<ul>\n<li>Fill missing values with <code>false<\/code> for the <code>ITEM_FEATURED<\/code> column<\/li>\n<li>Drop unneeded columns such as <code>URL<\/code>, <code>SK<\/code>, <code>IMAGE<\/code>, <code>NAME<\/code>, <code>STYLE<\/code>, <code>ITEM_FEATURED<\/code>, and <code>DESCRIPTION<\/code>\n<\/li>\n<li>Rename <code>ITEM_FEATURED_IMPUTED<\/code> to <code>ITEM_FEATURED<\/code>\n<\/li>\n<li>Cast the <code>ITEM_FEATURED<\/code> column as Boolean<\/li>\n<li>Encode the <code>ITEM_GENDER_AFFINITY<\/code> column<\/li>\n<\/ul>\n<ol start=\"3\">\n<li>To add a new transform, choose <strong>+<\/strong> and choose <strong>Add transform<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19760\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-7.jpg\" alt=\"To add a new transform, choose + and choose Add transform.\" width=\"800\" height=\"451\"><\/p>\n<ol start=\"4\">\n<li>Fill in missing values using the built-in <strong>Handling missing values<\/strong> transform.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19761 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-8.jpg\" alt=\"Fill in missing values using the built-in Handling missing values transform.\" width=\"800\" height=\"474\"><\/p>\n<ol start=\"5\">\n<li>To drop columns, 
under <strong>Manage columns<\/strong>, for <strong>Input column<\/strong>, choose <strong>ITEM_URL<\/strong>.\n<ol type=\"a\">\n<li>For <strong>Required column operator<\/strong>, choose <strong>Drop column<\/strong>.<img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19762\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-9.jpg\" alt=\"\" width=\"776\" height=\"402\">\n<\/li>\n<li>Repeat this step for <code>SK<\/code>, <code>IMAGE<\/code>, <code>NAME<\/code>, <code>STYLE<\/code>, <code>ITEM_FEATURED<\/code>, and <code>DESCRIPTION<\/code>.\n<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<ol start=\"6\">\n<li>Under <strong>Type Conversion<\/strong>, for <strong>Column<\/strong>, choose <strong>ITEM_FEATURED<\/strong>.<\/li>\n<li>For <strong>To<\/strong>, choose <strong>Boolean<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19763\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-10.jpg\" alt=\"\" width=\"752\" height=\"540\"><\/p>\n<ol start=\"8\">\n<li>Under <strong>Encode categorical<\/strong>, add a one hot encoding transform to the <code>ITEM_GENDER_AFFINITY<\/code> column.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19764\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-11.jpg\" alt=\"\" width=\"800\" height=\"460\"><\/p>\n<ol start=\"9\">\n<li>Rename our column from <code>ITEM_FEATURED_IMPUTED<\/code> to <code>ITEM_FEATURED<\/code>.<\/li>\n<li>Run a table summary.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19765\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-12.jpg\" alt=\"Rename our column from ITEM_FEATURED_IMPUTED to ITEM_FEATURED.\" width=\"800\" 
height=\"208\"><\/p>\n<p>The table summary data analysis doesn\u2019t provide information on all the columns.<\/p>\n<ol start=\"11\">\n<li>Run the <code>df.info()<\/code> function as a custom transform.<\/li>\n<li>Choose <strong>Preview<\/strong> to verify that our <code>ITEM_FEATURED<\/code> column is now a Boolean data type.<\/li>\n<\/ol>\n<p><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.info.html\" target=\"_blank\" rel=\"noopener noreferrer\">DataFrame.info()<\/a> prints information about the DataFrame including the data types, non-null values, and memory usage.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19766\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-13.jpg\" alt=\"\" width=\"512\" height=\"523\"><\/p>\n<ol start=\"13\">\n<li>Check that the <code>ITEM_FEATURED<\/code> column has been cast properly and doesn\u2019t have any null values.<\/li>\n<\/ol>\n<p>Let\u2019s move on to the users table and prepare our dataset for training.<\/p>\n<h3>Users table<\/h3>\n<p>For the users table, we perform the following steps:<\/p>\n<ol>\n<li>Drop unneeded columns such as <code>username<\/code>, <code>email<\/code>, <code>first_name<\/code>, and <code>last_name<\/code>.<\/li>\n<li>Extract elements from a JSON column such as zip code, state, and city.<\/li>\n<\/ol>\n<p>The <code>addresses<\/code> column containing a JSON string looks like the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">[{  \"first_name\": \"Nathan\",\r\n    \"last_name\": \"Smith\", \r\n    \"address1\": \"049 Isaac Stravenue Apt. 
770\", \r\n    \"address2\": \"\", \r\n    \"country\": \"US\", \r\n    \"city\": \"Johnsonmouth\", \r\n    \"state\": \"NY\", \r\n    \"zipcode\": \"12758\", \r\n    \"default\": true\r\n    }]\r\n<\/code><\/pre>\n<\/div>\n<p>To extract relevant location elements for our model, we apply several transforms and save them in their respective columns. The following screenshot shows an example of extracting the user zip code.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19767\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-14.jpg\" alt=\"\" width=\"756\" height=\"350\"><\/p>\n<p>We apply the same transform to extract city and state, respectively.<\/p>\n<ol start=\"3\">\n<li>In the following transform, we split and rearrange the different personas (such as <code>electronics_beauty_outdoors<\/code>) and save it as <code>USER_SEGMENT<\/code>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19768\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-15.jpg\" alt=\"In the following transform, we split and rearrange the different personas (such as electronics_beauty_outdoors) and save it as USER_SEGMENT.\" width=\"480\" height=\"172\"><\/p>\n<ol start=\"4\">\n<li>We also perform a one hot encoding on the <code>USER_GENDER<\/code> column.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19769\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-16.jpg\" alt=\"We also perform a one hot encoding on the USER_GENDER column.\" width=\"493\" height=\"462\"><\/p>\n<h3>Interactions table<\/h3>\n<p>Finally, in the interactions table, we complete the following steps:<\/p>\n<ol>\n<li>Perform a custom transform to extract the event date and time from a 
timestamp.<\/li>\n<\/ol>\n<p>Custom transforms are quite powerful because they allow you to insert a snippet of code and run the transform using different runtime engines such as PySpark, Python, or SparkSQL. All you have to do is start your transform with <code>df<\/code>, which denotes the DataFrame.<\/p>\n<p>The following code is an example using a custom PySpark transform to extract the date and time from the timestamp:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from pyspark.sql.functions import from_unixtime, to_date, date_format\r\ndf = df.withColumn('DATE_TIME', from_unixtime('TIMESTAMP'))\r\ndf = (df.withColumn('EVENT_DATE', to_date('DATE_TIME'))\r\n        .withColumn('EVENT_TIME', date_format('DATE_TIME', 'HH:mm:ss')))\r\n<\/code><\/pre>\n<\/div>\n<ol start=\"2\">\n<li>Perform a one hot encoding on the <code>EVENT_TYPE<\/code> column.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19770\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-17.jpg\" alt=\"Perform a one hot encoding on the EVENT_TYPE column.\" width=\"363\" height=\"456\"><\/p>\n<ol start=\"3\">\n<li>Lastly, drop any columns we don\u2019t need.<\/li>\n<\/ol>\n<h2>Performing data analysis<\/h2>\n<p>In addition to common built-in data analysis such as scatterplots and histograms, SageMaker Data Wrangler gives you the ability to build custom visualizations using the <a href=\"https:\/\/altair-viz.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Altair library<\/a>.<\/p>\n<p>In the following histogram chart, we binned the users by age range on the x axis and plotted the total percentage of users on the y axis.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19771\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-18.jpg\" alt=\"In the following histogram chart, we binned the users by age range on the x axis and plotted the total percentage of users on the y axis.\" width=\"800\" height=\"330\"><\/p>\n<p>We can also use the quick model functionality to show feature importance. The F1 score, which indicates the model\u2019s predictive performance, is also shown in the following visualization. This enables you to iterate by adding new datasets and performing additional feature engineering to incrementally improve model accuracy.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19772\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-19.jpg\" alt=\"\" width=\"800\" height=\"402\"><\/p>\n<p>The following visualization is a box plot by age and state. This is particularly useful to understand the interquartile range and possible outliers.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19773\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-20.jpg\" alt=\"A box plot by age and state.\" width=\"800\" height=\"270\"><\/p>\n<h2>Building a data flow<\/h2>\n<p>SageMaker Data Wrangler builds a data flow and keeps the dependencies of all the transforms, data analysis, and table joins. 
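<\/p>
<p>Each join in the flow is a key join between two tables. A key join like the one in this section can be sketched as a pandas merge; the toy tables and the left-join choice below are illustrative:<\/p>

```python
import pandas as pd

# Toy stand-ins for the interactions and items tables.
interactions = pd.DataFrame({'ITEM_ID': [2, 29, 2],
                             'USER_ID': [2539, 5575, 1964]})
items = pd.DataFrame({'ITEM_ID': [2, 29],
                      'ITEM_CATEGORY': ['electronics', 'footwear']})

# Join on the ITEM_ID key, keeping every interaction row.
joined = interactions.merge(items, on='ITEM_ID', how='left')
print(joined)
```

<p>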
This allows you not only to keep a lineage of your exploratory data analysis, but also to reproduce past experiments consistently.<\/p>\n<p>In this section, we join our interactions and items tables.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19775\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-21.jpg\" alt=\"In this section, we join our interactions and items tables.\" width=\"800\" height=\"520\"><\/p>\n<ol>\n<li>Join our tables using the <code>ITEM_ID<\/code> key.<\/li>\n<li>Use a custom transform to aggregate our dataset by <code>USER_ID<\/code> and generate other features by pivoting the <code>ITEM_CATEGORY<\/code> and <code>EVENT_TYPE<\/code>:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import pyspark.sql.functions as F\r\ndf = (df.groupBy([\"USER_ID\"]).pivot(\"ITEM_CATEGORY\")\r\n    .agg(F.sum(\"EVENT_TYPE_PRODUCTVIEWED\").alias(\"EVENT_TYPE_PRODUCTVIEWED\"),\r\n         F.sum(\"EVENT_TYPE_PRODUCTADDED\").alias(\"EVENT_TYPE_PRODUCTADDED\"),\r\n         F.sum(\"EVENT_TYPE_CARTVIEWED\").alias(\"EVENT_TYPE_CARTVIEWED\"),\r\n         F.sum(\"EVENT_TYPE_CHECKOUTSTARTED\").alias(\"EVENT_TYPE_CHECKOUTSTARTED\"),\r\n         F.sum(\"EVENT_TYPE_ORDERCOMPLETED\").alias(\"EVENT_TYPE_ORDERCOMPLETED\"),\r\n         F.sum(F.col(\"ITEM_PRICE\") * F.col(\"EVENT_TYPE_ORDERCOMPLETED\")).alias(\"TOTAL_REVENUE\"),\r\n         F.avg(F.col(\"ITEM_FEATURED\").cast(\"integer\")).alias(\"FEATURED_ITEM_FRAC\"),\r\n         F.avg(\"GENDER_AFFINITY_F\").alias(\"FEM_AFFINITY_FRAC\"),\r\n         F.avg(\"GENDER_AFFINITY_M\").alias(\"MASC_AFFINITY_FRAC\"))\r\n    .fillna(0))\r\n<\/code><\/pre>\n<\/div>\n<ol start=\"3\">\n<li>Join our dataset with the <code>users<\/code> table.<\/li>\n<\/ol>\n<p>The following screenshot shows what our DAG looks like after joining all the tables together.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19776\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-22.jpg\" alt=\"The following screenshot shows what our DAG looks like after joining all the tables together.\" width=\"800\" height=\"304\"><\/p>\n<ol start=\"4\">\n<li>Now that we have combined all three tables, run data analysis for target leakage.<\/li>\n<\/ol>\n<p>Target leakage, or data leakage, is one of the most common and difficult problems when building a model. Target leakage means that you train your model on features that aren\u2019t available at inference time. For example, if you try to predict a car crash and one of the features is <code>airbag_deployed<\/code>, you don\u2019t know if the airbag was deployed until the crash happens.<\/p>\n<p>The following screenshot shows that we don\u2019t have a strong target leakage candidate after running the data analysis.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19777\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-23.jpg\" alt=\"The following screenshot shows that we don\u2019t have a strong target leakage candidate after running the data analysis.\" width=\"800\" height=\"448\"><\/p>\n<ol start=\"5\">\n<li>Finally, we run a quick model on the joined dataset.<\/li>\n<\/ol>\n<p>The following screenshot shows that our F1 score is 0.89 after joining additional data and performing further feature transformations.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19778\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-24.jpg\" alt=\"The following screenshot shows that our F1 score is 0.89 after joining additional data and performing further feature transformations.\" width=\"800\" height=\"401\"><\/p>\n<h2>Exporting your data flow<\/h2>\n<p>SageMaker Data Wrangler gives you the 
ability to export your data flow into a Jupyter notebook with code pre-populated for the following options:<\/p>\n<ul>\n<li>SageMaker Data Wrangler job<\/li>\n<li>SageMaker Pipelines<\/li>\n<li>SageMaker Feature Store<\/li>\n<\/ul>\n<p>SageMaker Data Wrangler can also output a Python file.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19779\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/ML-1680-25.jpg\" alt=\"SageMaker Data Wrangler can also output a Python file.\" width=\"800\" height=\"465\"><\/p>\n<p>The SageMaker Data Wrangler job option generates a Jupyter notebook that is pre-populated and ready to run.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19780\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1680-26.jpg\" alt=\"\" width=\"800\" height=\"940\"><\/p>\n<h2>Conclusion<\/h2>\n<p>SageMaker Data Wrangler makes it easy to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, feature engineering, and more advanced data analysis such as feature importance, target leakage, and model explainability using an intuitive user interface. 
SageMaker Data Wrangler also makes it easy to convert your data flow into an operational artifact such as a SageMaker Data Wrangler job, SageMaker feature store, or SageMaker pipeline with a single click.<\/p>\n<p>Log in to your Studio environment, download the <a href=\"https:\/\/s3.us-east-1.amazonaws.com\/aws-ml-blog\/artifacts\/Exploratory-data-analysis-feature-engineering-Amazon-SageMaker-Data-Wrangler\/Data-Wrangler-Blog-Example.flow\">.flow file<\/a>, and try SageMaker Data Wrangler today.<\/p>\n<p>\u00a0<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-9250 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2019\/07\/30\/phi-nguyen-100.gif\" alt=\"\" width=\"100\" height=\"132\"><strong>Phi Nguyen<\/strong>\u00a0is a solution architect at AWS helping customers with their cloud journey, with a special focus on data lakes, analytics, semantic technologies, and machine learning. In his spare time, you can find him biking to work, coaching his son\u2019s soccer team, or enjoying nature walks with his family.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-19791 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/Roberto-Bruno-Martins.jpg\" alt=\"\" width=\"100\" height=\"133\"><strong>Roberto Bruno Martins\u00a0<\/strong>is a Machine Learning Specialist Solution Architect, helping customers from several industries create, deploy, and run machine learning solutions. He\u2019s been working with data since 1994, and has no plans to stop any time soon. 
In his spare time he plays games, practices martial arts and likes to try new food.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler\/<\/p>\n","protected":false},"author":0,"featured_media":685,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/684"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=684"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/684\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/685"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=684"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=684"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=684"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}