{"id":2040,"date":"2022-03-31T23:38:38","date_gmt":"2022-03-31T23:38:38","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/31\/prepare-data-from-databricks-for-machine-learning-using-amazon-sagemaker-data-wrangler\/"},"modified":"2022-03-31T23:38:38","modified_gmt":"2022-03-31T23:38:38","slug":"prepare-data-from-databricks-for-machine-learning-using-amazon-sagemaker-data-wrangler","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/31\/prepare-data-from-databricks-for-machine-learning-using-amazon-sagemaker-data-wrangler\/","title":{"rendered":"Prepare data from Databricks for machine learning using Amazon SageMaker Data Wrangler"},"content":{"rendered":"<div id=\"\">\n<p>Data science and data engineering teams spend a significant portion of their time in the data preparation phase of a machine learning (ML) lifecycle performing data selection, cleaning, and transformation steps. It\u2019s a necessary and important step of any ML workflow in order to generate meaningful insights and predictions, because bad or low-quality data greatly reduces the relevance of the insights derived.<\/p>\n<p>Data engineering teams are traditionally responsible for the ingestion, consolidation, and transformation of raw data for downstream consumption. Data scientists often need to do additional processing on data for domain-specific ML use cases such as natural language and time series. For example, certain ML algorithms may be sensitive to missing values, sparse features, or outliers and require special consideration. Even in cases where the dataset is in a good shape, data scientists may want to transform the feature distributions or create new features in order to maximize the insights obtained from the models. To achieve these objectives, data scientists have to rely on data engineering teams to accommodate requested changes, resulting in dependency and delay in the model development process. 
Alternatively, data science teams may choose to perform data preparation and feature engineering internally using various programming paradigms. However, this approach requires an investment of time and effort to install and configure libraries and frameworks, which isn\u2019t ideal because that time can be better spent optimizing model performance.<\/p>\n<p><a href=\"https:\/\/aws.amazon.com\/sagemaker\/data-wrangler\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Data Wrangler<\/a> simplifies the data preparation and feature engineering process, reducing the time it takes to aggregate and prepare data for ML from weeks to minutes by providing a single visual interface for data scientists to select, clean, and explore their datasets. Data Wrangler offers over 300 built-in data transformations to help normalize, transform, and combine features without writing any code. You can import data from multiple data sources, such as <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service (Amazon S3)<\/a>, <a href=\"https:\/\/aws.amazon.com\/athena\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena<\/a>, <a href=\"https:\/\/aws.amazon.com\/redshift\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Redshift<\/a>, and <a href=\"https:\/\/partners.amazonaws.com\/partners\/001E000000d8qQcIAI\/Snowflake%20Computing\" target=\"_blank\" rel=\"noopener noreferrer\">Snowflake<\/a>. You can now also use <a href=\"https:\/\/partners.amazonaws.com\/partners\/001E0000016WxP5IAK\/Databricks\" target=\"_blank\" rel=\"noopener noreferrer\">Databricks<\/a> as a data source in Data Wrangler to easily prepare data for ML.<\/p>\n<p>The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance, and performance of data warehouses with the openness, flexibility, and machine learning support of data lakes. 
With Databricks as a data source for Data Wrangler, you can now quickly and easily connect to Databricks, interactively query data stored in Databricks using SQL, and preview data before importing. Additionally, you can join your data in Databricks with data stored in Amazon S3, and data queried through Amazon Athena, Amazon Redshift, and Snowflake to create the right dataset for your ML use case.<\/p>\n<p>In this post, we transform the Lending Club Loan dataset using Amazon SageMaker Data Wrangler for use in ML model training.<\/p>\n<h2>Solution overview<\/h2>\n<p>The following diagram illustrates our solution architecture.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image001.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34577\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image001.png\" alt=\"\" width=\"1412\" height=\"722\"><\/a><\/p>\n<p>The Lending Club Loan dataset contains complete loan data for all loans issued through 2007\u20132011, including the current loan status and latest payment information. It has 39,717 rows, 22 feature columns, and 3 target labels.<\/p>\n<p>To transform our data using Data Wrangler, we complete the following high-level steps:<\/p>\n<ol>\n<li>Download and split the dataset.<\/li>\n<li>Create a Data Wrangler flow.<\/li>\n<li>Import data from Databricks to Data Wrangler.<\/li>\n<li>Import data from Amazon S3 to Data Wrangler.<\/li>\n<li>Join the data.<\/li>\n<li>Apply transformations.<\/li>\n<li>Export the dataset.<\/li>\n<\/ol>\n<h2><strong>Prerequisites<\/strong><\/h2>\n<p>The post assumes you have a running Databricks cluster. 
If your cluster is running on AWS, verify you have the following configured:<\/p>\n<h3><strong>Databricks setup<\/strong><\/h3>\n<p>Follow <a href=\"https:\/\/docs.databricks.com\/administration-guide\/cloud-configurations\/aws\/instance-profiles.html\" target=\"_blank\" rel=\"noopener noreferrer\">Secure access to S3 buckets using instance profiles<\/a> for the required <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) roles, S3 bucket policy, and Databricks cluster configuration. Ensure the Databricks cluster is configured with the proper <code>Instance Profile<\/code>, selected under the advanced options, to access the desired S3 bucket.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image003.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34579\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image003.jpg\" alt=\"\" width=\"1080\" height=\"621\"><\/a><\/p>\n<p>After the Databricks cluster is up and running with the required access to Amazon S3, you can fetch the <code>JDBC URL<\/code> from your Databricks cluster to be used by Data Wrangler to connect to it.<\/p>\n<h3><strong>Fetch the JDBC URL<\/strong><\/h3>\n<p>To fetch the JDBC URL, complete the following steps:<\/p>\n<ol>\n<li>In Databricks, navigate to the clusters UI.<\/li>\n<li>Choose your cluster.<\/li>\n<li>On the <strong>Configuration<\/strong> tab, choose <strong>Advanced options<\/strong>.<\/li>\n<li>Under <strong>Advanced options<\/strong>, choose the <strong>JDBC\/ODBC<\/strong> tab.<\/li>\n<li>Copy the JDBC URL.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image007.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34581\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image007.jpg\" alt=\"\" width=\"825\" height=\"751\"><\/a><\/li>\n<\/ol>\n<p>Make sure to substitute your personal access <a href=\"https:\/\/docs.databricks.com\/dev-tools\/api\/latest\/authentication.html\" target=\"_blank\" rel=\"noopener noreferrer\">token<\/a> in the URL.<\/p>\n<h3><strong>Data Wrangler setup<\/strong><\/h3>\n<p>This step assumes you have access to Amazon SageMaker, an instance of <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a>, and a Studio user.<\/p>\n<p>To allow access to the Databricks JDBC connection from Data Wrangler, the Studio user requires following permission:<\/p>\n<ul>\n<li><code>secretsmanager:PutResourcePolicy<\/code><\/li>\n<\/ul>\n<p>Follow below steps to update the IAM execution role assigned to the Studio user with above permission, as an IAM administrative user.<\/p>\n<ol>\n<li>On the IAM console, choose <strong>Roles<\/strong> in the navigation pane.<\/li>\n<li>Choose the role assigned to your Studio user.<\/li>\n<li>Choose <strong>Add permissions<\/strong>.<\/li>\n<li>Choose <strong>Create inline policy<\/strong>.<\/li>\n<li>For Service, choose <strong>Secrets Manager<\/strong>.<\/li>\n<li>On <strong>Actions<\/strong>, choose <strong>Access level<\/strong>.<\/li>\n<li>Choose <strong>Permissions management<\/strong>.<\/li>\n<li>Choose <strong>PutResourcePolicy<\/strong>.<\/li>\n<li>For <strong>Resources<\/strong>, choose <strong>Specific<\/strong> and select <strong>Any in this account<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image005.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34580\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image005.png\" alt=\"\" width=\"1087\" height=\"609\"><\/a><\/li>\n<\/ol>\n<h2>Download and split the dataset<\/h2>\n<p>You can start by <a href=\"https:\/\/www.kaggle.com\/datasets\/imsparsh\/lending-club-loan-dataset-2007-2011\" target=\"_blank\" rel=\"noopener noreferrer\">downloading the dataset<\/a>. For demonstration purposes, we split the dataset by copying the feature columns <code>id<\/code>, <code>emp_title<\/code>, <code>emp_length<\/code>, <code>home_owner<\/code>, and <code>annual_inc<\/code> to create a second <strong>loans_2.csv<\/strong> file. We remove the aforementioned columns from the original loans file except the <code>id<\/code> column and rename the original file to <strong>loans_1.csv<\/strong>. Upload the <strong>loans_1.csv<\/strong> file to <a href=\"https:\/\/partners.amazonaws.com\/partners\/001E0000016WxP5IAK\/Databricks\" target=\"_blank\" rel=\"noopener noreferrer\">Databricks<\/a> to create a table <code>loans_1<\/code> and <strong>loans_2.csv<\/strong> in an S3 bucket.<\/p>\n<h2><strong>Create a Data Wrangler flow<\/strong><\/h2>\n<p>For information on Data Wrangler pre-requisites, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-getting-started.html\" target=\"_blank\" rel=\"noopener noreferrer\">Get Started with Data Wrangler<\/a>.<\/p>\n<p>Let\u2019s get started by creating a new data flow.<\/p>\n<ol>\n<li>On the Studio console, on the <strong>File<\/strong> menu, choose <strong>New<\/strong>.<\/li>\n<li>Choose <strong>Data Wrangler flow<\/strong>.<\/li>\n<li>Rename the flow as desired.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image009.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34582\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image009.jpg\" alt=\"\" width=\"1596\" height=\"1230\"><\/a><\/li>\n<\/ol>\n<p>Alternatively, you can create a new data flow from the Launcher.<\/p>\n<ul>\n<li>On the Studio console, choose <strong>Amazon SageMaker Studio<\/strong> in the navigation pane.<\/li>\n<li>Choose <strong>New data flow<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image011.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34583\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image011.jpg\" alt=\"\" width=\"1287\" height=\"924\"><\/a><\/li>\n<\/ul>\n<p>Creating a new flow can take a few minutes to complete. After the flow has been created, you see the <strong>Import data<\/strong> page.<\/p>\n<h2><strong>Import data from Databricks into Data Wrangler<\/strong><\/h2>\n<p>Next, we set up Databricks (JDBC) as a data source in Data Wrangler. 
To import data from Databricks, we first need to add Databricks as a data source.<\/p>\n<ol>\n<li>On the <strong>Import data<\/strong> tab of your Data Wrangler flow, choose <strong>Add data source<\/strong>.<\/li>\n<li>On the drop-down menu, choose <strong>Databricks (JDBC)<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image013.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34584\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image013.jpg\" alt=\"\" width=\"1080\" height=\"407\"><\/a><\/li>\n<\/ol>\n<p>On the <strong>Import data from Databricks<\/strong> page, you enter your cluster details.<\/p>\n<ol start=\"3\">\n<li>For <strong>Dataset name<\/strong>, enter a name you want to use in the flow file.<\/li>\n<li>For <strong>Driver<\/strong>, choose the driver <code>com.simba.spark.jdbc.Driver<\/code>.<\/li>\n<li>For <strong>JDBC URL<\/strong>, enter the URL of your Databricks cluster obtained earlier.<\/li>\n<\/ol>\n<p>The URL should resemble the following format: <code>jdbc:spark:\/\/<strong><span>&lt;server-hostname&gt;<\/span><\/strong>:443\/default;transportMode=http;ssl=1;httpPath=<strong><span>&lt;http-path&gt;<\/span><\/strong>;AuthMech=3;UID=token;PWD=<strong><span>&lt;personal-access-token&gt;<\/span><\/strong><\/code>.<\/p>\n<ol start=\"4\">\n<li>In the SQL query editor, specify the following SQL SELECT statement: <code>select * from loans_1<\/code><\/li>\n<\/ol>\n<p>If you chose a different table name while uploading data to Databricks, replace <code>loans_1<\/code> in the above SQL query accordingly.<\/p>\n<p>In the <strong>SQL query<\/strong> section in Data Wrangler, you can query any table in the connected Databricks database. The preselected <strong>Enable sampling<\/strong> setting retrieves the first 50,000 rows of your dataset by default. 
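For reference, the JDBC URL format shown above can be assembled from the values copied from the cluster\u2019s JDBC\/ODBC tab. A sketch in Python; the hostname, HTTP path, and token below are placeholders, not real values:

```python
# Placeholders standing in for values copied from the JDBC/ODBC tab.
server_hostname = 'dbc-a1b2345c-d6e7.cloud.databricks.com'
http_path = 'sql/protocolv1/o/0/0000-000000-example0'
personal_access_token = 'dapi-EXAMPLE-TOKEN'  # substitute your own token

# Same format as the URL template in the text.
jdbc_url = (
    f'jdbc:spark://{server_hostname}:443/default;'
    'transportMode=http;ssl=1;'
    f'httpPath={http_path};'
    f'AuthMech=3;UID=token;PWD={personal_access_token}'
)
print(jdbc_url)
```

Treat the token like a password; avoid committing a URL containing it to source control.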
Depending on the size of the dataset, unselecting <strong>Enable sampling<\/strong> may result in a longer import time.<\/p>\n<ol start=\"5\">\n<li>Choose <strong>Run<\/strong>.<\/li>\n<\/ol>\n<p>Running the query gives a preview of your Databricks dataset directly in Data Wrangler.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image015.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34585\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image015.jpg\" alt=\"\" width=\"1621\" height=\"777\"><\/a><\/p>\n<ol start=\"6\">\n<li>Choose <strong>Import<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image017.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34586\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image017.jpg\" alt=\"\" width=\"1256\" height=\"787\"><\/a><\/li>\n<\/ol>\n<p>Data Wrangler provides the flexibility to set up multiple concurrent connections to one Databricks cluster, or to multiple clusters if required, enabling analysis and preparation on combined datasets.<\/p>\n<h2><strong>Import the data from Amazon S3 into Data Wrangler<\/strong><\/h2>\n<p>Next, let\u2019s import the <code>loans_2.csv<\/code> file from Amazon S3.<\/p>\n<ol>\n<li>On the Import tab, choose <strong>Amazon S3<\/strong> as the data source.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image019.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34587\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image019.jpg\" alt=\"\" width=\"1192\" 
height=\"310\"><\/a><\/li>\n<li>Navigate to the S3 bucket for the <code>loan_2.csv<\/code> file.<\/li>\n<\/ol>\n<p>When you select the CSV file, you can preview the data.<\/p>\n<ol start=\"3\">\n<li>In the <strong>Details<\/strong> pane, choose <strong>Advanced configuration<\/strong> to make sure <strong>Enable sampling<\/strong> is selected and <strong>COMMA <\/strong>is chosen for <strong>Delimiter<\/strong>.<\/li>\n<li>Choose <strong>Import<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image021.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34588\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image021.jpg\" alt=\"\" width=\"1233\" height=\"829\"><\/a><\/li>\n<\/ol>\n<p>After the <code>loans_2.csv<\/code> dataset is successfully imported, the data flow interface displays both the Databricks JDBC and Amazon S3 data sources.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image023.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34589\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image023.jpg\" alt=\"\" width=\"1075\" height=\"794\"><\/a><\/p>\n<h2><strong>Join the data <\/strong><\/h2>\n<p>Now that we have imported data from Databricks and Amazon S3, let\u2019s join the datasets using a common unique identifier column.<\/p>\n<ol>\n<li>On the <strong>Data flow<\/strong> tab, for <strong>Data types<\/strong>, choose the plus sign for <code>loans_1<\/code>.<\/li>\n<li>Choose <strong>Join<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image025.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full 
wp-image-34590\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image025.jpg\" alt=\"\" width=\"1004\" height=\"785\"><\/a><\/li>\n<li>Choose the <code>loans_2.csv<\/code> file as the <strong>Right<\/strong> dataset.<\/li>\n<li>Choose <strong>Configure <\/strong>to set up the join criteria.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image027.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34591\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image027.jpg\" alt=\"\" width=\"1635\" height=\"811\"><\/a><\/li>\n<li>For <strong>Name<\/strong>, enter a name for the join.<\/li>\n<li>For <strong>Join type<\/strong>, choose <strong>Inner<\/strong> for this post.<\/li>\n<li>Choose the <code>id<\/code> column to join on.<\/li>\n<li>Choose <strong>Apply <\/strong>to preview the joined dataset.<\/li>\n<li>Choose <strong>Add <\/strong>to add it to the data flow.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image029.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34592\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image029.jpg\" alt=\"\" width=\"1328\" height=\"825\"><\/a><\/li>\n<\/ol>\n<h2><strong>Apply transformations <\/strong><\/h2>\n<p>Data Wrangler comes with over 300 built-in transforms, which require no coding. 
Let\u2019s use built-in transforms to prepare the dataset.<\/p>\n<h3><strong>Drop column <\/strong><\/h3>\n<p>First we drop the redundant ID column.<\/p>\n<ol>\n<li>On the joined node, choose the plus sign.<\/li>\n<li>Choose <strong>Add transform<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image031.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34593\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image031.jpg\" alt=\"\" width=\"1325\" height=\"822\"><\/a><\/li>\n<li>Under <strong>Transforms,<\/strong> choose <strong>+ Add step<\/strong>.<\/li>\n<li>Choose <strong>Manage columns<\/strong>.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Drop column<\/strong>.<\/li>\n<li>For <strong>Columns to drop<\/strong>, choose the column <code>id_0<\/code>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image033.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34594\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image033.jpg\" alt=\"\" width=\"1132\" height=\"798\"><\/a><\/li>\n<li>Choose <strong>Add<\/strong>.<\/li>\n<\/ol>\n<h3><strong>Format string<\/strong><\/h3>\n<p>Let\u2019s apply string formatting to remove the percentage symbol from the <code>int_rate<\/code> and <code>revol_util<\/code> columns.<\/p>\n<ol>\n<li>On the <strong>Data<\/strong> tab, under <strong>Transforms<\/strong>, choose <strong>+ Add step<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image035.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34595\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image035.jpg\" alt=\"\" width=\"1135\" height=\"609\"><\/a><\/li>\n<li>Choose <strong>Format string<\/strong>.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Strip characters from right<\/strong>.<\/li>\n<\/ol>\n<p>Data Wrangler allows you to apply your chosen transformation on multiple columns simultaneously.<\/p>\n<ol start=\"4\">\n<li>For <strong>Input columns<\/strong>, choose <code>int_rate<\/code> and <code>revol_util<\/code>.<\/li>\n<li>For <strong>Characters to remove<\/strong>, enter <code>%<\/code>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image037.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34596\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image037.jpg\" alt=\"\" width=\"1266\" height=\"811\"><\/a><\/li>\n<li>Choose <strong>Add<\/strong>.<\/li>\n<\/ol>\n<h3><strong>Featurize text<\/strong><\/h3>\n<p>Let\u2019s now vectorize <code>verification_status<\/code>, a text feature column. We convert the text column into term frequency\u2013inverse document frequency (TF-IDF) vectors by applying the count vectorizer and a standard tokenizer as described below. 
Data Wrangler also provides the option to bring your own tokenizer, if desired.<\/p>\n<ol>\n<li>Under <strong>Transforms<\/strong>, choose <strong>+ Add step<\/strong>.<\/li>\n<li>Choose <strong>Featurize text<\/strong>.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Vectorize<\/strong>.<\/li>\n<li>For <strong>Input columns<\/strong>, choose <code>verification_status<\/code>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image039.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34597\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image039.jpg\" alt=\"\" width=\"1226\" height=\"713\"><\/a><\/li>\n<li>Choose <strong>Add<\/strong>.<\/li>\n<\/ol>\n<h2><strong>Export the dataset<\/strong><\/h2>\n<p>After we apply multiple transformations on different column types, including text, categorical, and numeric, we\u2019re ready to use the transformed dataset for ML model training. The last step is to export the transformed dataset to Amazon S3. 
In Data Wrangler, you have multiple options for downstream consumption of the transformed data.<\/p>\n<p>In this post, we take advantage of the <strong>Export data<\/strong> option in the <strong>Transform<\/strong> view to export the transformed dataset directly to Amazon S3.<\/p>\n<ol>\n<li>Choose <strong>Export data<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image041.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34598\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image041.jpg\" alt=\"\" width=\"1000\" height=\"715\"><\/a><\/li>\n<li>For <strong>S3 location<\/strong>, choose <strong>Browse<\/strong> and choose your S3 bucket.<\/li>\n<li>Choose <strong>Export data<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image043.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34599\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/24\/ML-8541-image043.jpg\" alt=\"\" width=\"1257\" height=\"409\"><\/a><\/li>\n<\/ol>\n<h2><strong>Clean up<\/strong><\/h2>\n<p>If your work with Data Wrangler is complete, <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-shut-down.html\" target=\"_blank\" rel=\"noopener noreferrer\">shut down your Data Wrangler instance<\/a> to avoid incurring additional fees.<\/p>\n<h2><strong>Conclusion<\/strong><\/h2>\n<p>In this post, we covered how you can quickly and easily set up and connect Databricks as a data source in Data Wrangler, interactively query data stored in Databricks using SQL, and preview data before importing. Additionally, we looked at how you can join your data in Databricks with data stored in Amazon S3. 
We then applied data transformations to the combined dataset to create a data preparation pipeline. To explore more of Data Wrangler\u2019s analysis capabilities, including target leakage and bias report generation, refer to the blog post <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/accelerate-data-preparation-using-amazon-sagemaker-data-wrangler-for-diabetic-patient-readmission-prediction\/\" target=\"_blank\" rel=\"noopener noreferrer\">Accelerate data preparation using Amazon SageMaker Data Wrangler for diabetic patient readmission prediction<\/a>.<\/p>\n<p>To get started with Data Wrangler, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler.html\" target=\"_blank\" rel=\"noopener noreferrer\">Prepare ML Data with Amazon SageMaker Data Wrangler<\/a>, and see the latest information on the Data Wrangler <a href=\"https:\/\/aws.amazon.com\/sagemaker\/data-wrangler\/\" target=\"_blank\" rel=\"noopener noreferrer\">product page<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/20\/Roop-Bains.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-31932 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/20\/Roop-Bains.png\" alt=\"\" width=\"100\" height=\"132\"><\/a>Roop Bains<\/strong> is a Solutions Architect at AWS focusing on AI\/ML. He is passionate about helping customers innovate and achieve their business objectives using artificial intelligence and machine learning. 
In his spare time, Roop enjoys reading and hiking.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/17\/ialek1.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-34304 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/17\/ialek1.png\" alt=\"\" width=\"100\" height=\"140\"><\/a><strong>Igor Alekseev<\/strong> is a Partner Solutions Architect at AWS in Data and Analytics. Igor works with strategic partners, helping them build complex, AWS-optimized architectures. Prior to joining AWS, as a Data\/Solution Architect, he implemented many projects in Big Data, including several data lakes in the Hadoop ecosystem. As a Data Engineer, he was involved in applying AI\/ML to fraud detection and office automation. Igor\u2019s projects were in a variety of industries including communications, finance, public safety, manufacturing, and healthcare. Earlier, Igor worked as a full stack engineer and tech lead.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17926 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Huong-Nguyen.jpg\" alt=\"\" width=\"100\" height=\"133\"><strong>Huong Nguyen<\/strong> is a Sr. Product Manager at AWS. She is leading the user experience for SageMaker Studio. She has 13 years\u2019 experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. 
In her spare time, she enjoys reading, being in nature, and spending time with her family.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/31\/Henry-Wang.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-34726 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/31\/Henry-Wang.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Henry Wang<\/strong> is a software development engineer at AWS. He recently joined the Data Wrangler team after graduating from UC Davis. He has an interest in data science and machine learning and does 3D printing as a hobby.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/prepare-data-from-databricks-for-machine-learning-using-amazon-sagemaker-data-wrangler\/<\/p>\n","protected":false},"author":0,"featured_media":2041,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2040"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=2040"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2040\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/2041"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=2040"}],"wp:term":[{"taxonomy":"category","embeddable
":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=2040"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=2040"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}