{"id":887,"date":"2021-09-18T20:01:00","date_gmt":"2021-09-18T20:01:00","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/18\/perform-interactive-data-engineering-and-data-science-workflows-from-amazon-sagemaker-studio-notebooks\/"},"modified":"2021-09-18T20:01:00","modified_gmt":"2021-09-18T20:01:00","slug":"perform-interactive-data-engineering-and-data-science-workflows-from-amazon-sagemaker-studio-notebooks","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/18\/perform-interactive-data-engineering-and-data-science-workflows-from-amazon-sagemaker-studio-notebooks\/","title":{"rendered":"Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a> is the first fully integrated development environment (IDE) for machine learning (ML). With a single click, data scientists and developers can quickly spin up Studio notebooks to explore and prepare datasets to build, train, and deploy ML models in a single pane of glass.<\/p>\n<p>We\u2019re excited to announce a new set of capabilities that enable interactive Spark-based data processing from Studio notebooks. Data scientists and data engineers can now visually browse, discover, and connect to Spark data processing environments running on <a href=\"http:\/\/aws.amazon.com\/emr\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EMR<\/a>, right from your Studio notebooks in a few simple clicks. 
After you\u2019re connected, you can interactively query, explore and visualize data, and run Spark jobs to prepare data using the built-in SparkMagic notebook environments for Python and Scala.<\/p>\n<p>Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow, and businesses are increasingly using Apache Spark for fast data preparation. Studio already offers purpose-built and best-in-class tooling such as <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/experiments.html\" target=\"_blank\" rel=\"noopener noreferrer\">Experiments<\/a>, <a href=\"https:\/\/aws.amazon.com\/sagemaker\/clarify\/\" target=\"_blank\" rel=\"noopener noreferrer\">Clarify<\/a>, and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-monitor.html\" target=\"_blank\" rel=\"noopener noreferrer\">Model Monitor<\/a> for ML. The newly launched capability of easily accessing purpose-built Spark environments from Studio notebooks enables Studio to serve as a unified environment for data science and data engineering workflows. In this post, we present an example of predicting the sentiment of a movie review.<\/p>\n<p>We start with explaining how you can set up connecting Studio securely to an EMR cluster configured with various authentication methods. We provide CloudFormation templates to make it easy for you to deploy resources such as networking, EMR clusters, and Studio with a few simple clicks so that you can follow along with the examples in your own AWS account. We then demonstrate how you can use a Studio notebook to visually discover, authenticate with, and connect to an EMR cluster. After we\u2019re connected, we query a Hive table on Amazon EMR using SparkSQL and PyHive. 
We then locally preprocess and feature engineer the retrieved data, train an ML model, deploy it, and get predictions\u2014all from the Studio notebook.<\/p>\n<h2>Solution overview<\/h2>\n<p>Studio runs on an environment managed by AWS. In this solution, the network access for the new Studio domain is configured as VPC Only. For more details on different connectivity methods, see <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/securing-amazon-sagemaker-studio-connectivity-using-a-private-vpc\/\" target=\"_blank\" rel=\"noopener noreferrer\">Securing Amazon SageMaker Studio connectivity using a private VPC.<\/a> The elastic network interface created in the private subnet connects to required AWS services through VPC endpoints.<\/p>\n<p>The following diagram represents the different components used in this solution.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28258\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/1-5694-Architecture.jpg\" alt=\"\" width=\"800\" height=\"359\"><\/p>\n<p>For connecting to the EMR cluster, we walk through three authentication options. We use a separate <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> template stack for each of these authentication scenarios.<\/p>\n<p>In each of the options, the CloudFormation template also does the following:<\/p>\n<ul>\n<li>Creates and populates a Hive table with a movie reviews dataset. 
We use this dataset to explore and query the data.<\/li>\n<li>Creates a Studio domain, along with a user named <code>studio-user<\/code>.<\/li>\n<li>Creates building blocks, including the VPC, subnet, EMR cluster, and other required resources to successfully run the examples.<\/li>\n<\/ul>\n<h3>Kerberos<\/h3>\n<p>In the Kerberos authentication mode CloudFormation template, we create a Kerberized EMR cluster and configure it with a bootstrap action to create a Linux user and install Python libraries (Pandas, requests, and <a href=\"https:\/\/matplotlib.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Matplotlib<\/a>).<\/p>\n<p>You can set up Kerberos authentication in a few different ways (for more information, see <a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-kerberos-options.html\" target=\"_blank\" rel=\"noopener noreferrer\">Kerberos Architecture Options<\/a>):<\/p>\n<ul>\n<li>Cluster-dedicated key distribution center (KDC)<\/li>\n<li>Cluster-dedicated KDC with Active Directory cross-realm trust<\/li>\n<li>External KDC<\/li>\n<li>External KDC integrated with Active Directory<\/li>\n<\/ul>\n<p>The KDC can have its own user database or it can use <a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-kerberos-cross-realm.html\" target=\"_blank\" rel=\"noopener noreferrer\">cross-realm trust<\/a> with an Active Directory that holds the identity store. For this post, we use a cluster-dedicated KDC that holds its own user database. 
First, the EMR cluster has <a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-security-configurations.html\" target=\"_blank\" rel=\"noopener noreferrer\">security configuration<\/a> enabled to support Kerberos and is launched with a <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/Secure-Data-Analytics-with-SageMaker-Notebook-Instance-and-Kerberized-EMR-Cluster\/createlinuxusers.sh\" target=\"_blank\" rel=\"noopener noreferrer\">bootstrap action<\/a> to create Linux users on all nodes and install the necessary libraries. The CloudFormation template launches the <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/Secure-Data-Analytics-with-SageMaker-Notebook-Instance-and-Kerberized-EMR-Cluster\/configurekdc.sh\" target=\"_blank\" rel=\"noopener noreferrer\">bash step<\/a> after the cluster is ready. This step creates HDFS directories for the Linux users with default credentials.<\/p>\n<h3>LDAP<\/h3>\n<p>In the LDAP authentication mode CloudFormation template, we provision an <a href=\"http:\/\/aws.amazon.com\/ec2\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Compute Cloud<\/a> (Amazon EC2) instance with an LDAP server and configure the EMR cluster to use this server for authentication.<\/p>\n<h3>No-Auth<\/h3>\n<p>In the No-Auth authentication mode CloudFormation template, we use a standard EMR cluster with no authentication enabled.<\/p>\n<h2>Deploy the resources with AWS CloudFormation<\/h2>\n<p>Complete the following steps to deploy the environment:<\/p>\n<ol>\n<li>Sign in to the <a href=\"http:\/\/aws.amazon.com\/console\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Management Console<\/a> as an <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) user, preferably an admin user.<\/li>\n<li>Choose <strong>Launch Stack<\/strong> to launch the CloudFormation template for the appropriate authentication scenario. 
Make sure the Region used to deploy the CloudFormation stack has no existing Studio domain. If you already have a Studio domain in a Region, you may choose a different Region.<\/li>\n<\/ol>\n<ol start=\"3\">\n<li>Choose <strong>Next<\/strong>.<\/li>\n<li>For <strong>Stack name<\/strong>, enter a name for the stack (for example, <code>blog<\/code>).<\/li>\n<li>Leave the other values as default.<\/li>\n<li>Continue to choose <strong>Next<\/strong>. If you are using the Kerberos stack, in the \u201cParameters\u201d section, enter the <code>CrossRealmTrustPrincipalPassword<\/code> and <code>KdcAdminPassword<\/code>. You can enter the example password provided in both the fields: <code>CfnIntegrationTest-1<\/code>.<\/li>\n<li>On the review page, select the check box to confirm that AWS CloudFormation might create resources.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28259\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/2-5694.jpg\" alt=\"\" width=\"800\" height=\"307\"><\/p>\n<ol start=\"8\">\n<li>Choose <strong>Create stack<\/strong>.<\/li>\n<\/ol>\n<p>Wait until the status of the stack changes from <code>CREATE_IN_PROGRESS<\/code> to <code>CREATE_COMPLETE<\/code>. The process usually takes 10\u201315 minutes.<\/p>\n<p>Note: If you would like to try multiple stacks, please follow the steps in the \u201cClean up\u201d section. Remember that you must <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/gs-studio-delete-domain.html#gs-studio-delete-domain-studio\">delete the SageMaker Studio Domain<\/a> before the next stack can be successfully launched.<\/p>\n<h2>Connect a Studio notebook to an EMR cluster<\/h2>\n<p>After we deploy the stack, we can create a connection between our Studio notebook and the EMR cluster. 
Establishing this connection allows us to connect code to our data hosted on Amazon EMR.<\/p>\n<p>Complete the following steps to set up and connect your notebook to the EMR cluster:<\/p>\n<ol>\n<li>On the SageMaker console, choose <strong>Amazon SageMaker Studio<\/strong>.<\/li>\n<li>Choose <strong>Open Studio<\/strong> for <code>studio-user<\/code>\u00a0to open the Studio IDE.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28260\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/3-5694-control-panel.jpg\" alt=\"\" width=\"800\" height=\"483\"><\/p>\n<p>Next, we download the code for this walkthrough from <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3).<\/p>\n<ol start=\"3\">\n<li>Choose <strong>File<\/strong> in the Studio IDE, then choose <strong>New<\/strong> and <strong>Terminal<\/strong>.<\/li>\n<li>Run the following commands in the terminal:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">aws s3 cp s3:\/\/aws-ml-blog\/artifacts\/sma-milestone1\/smstudio-pyspark-hive-sentiment-analysis.ipynb .\n\naws s3 cp s3:\/\/aws-ml-blog\/artifacts\/sma-milestone1\/smstudio-ds-pyhive-sentiment-analysis.ipynb .\n\naws s3 cp s3:\/\/aws-ml-blog\/artifacts\/sma-milestone1\/preprocessing.py .\n<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"5\">\n<li>For <strong>Select Kernel<\/strong>, choose either the PySpark (SparkMagic) or Python 3 (Data Science) kernels depending upon the examples you want to run.<\/li>\n<\/ol>\n<p>The <code>smstudio-pyspark-hive-sentiment-analysis.ipynb<\/code> notebook demonstrates examples that you can run using the PySpark (SparkMagic) kernel. 
The <code>smstudio-ds-pyhive-sentiment-analysis.ipynb<\/code> notebook demonstrates examples that you can run using the IPython-based kernel.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28261\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/4-5694-select-kernel.jpg\" alt=\"\" width=\"380\" height=\"164\"><\/p>\n<ol start=\"6\">\n<li>Choose the <strong>Cluster<\/strong> menu on the top of the notebook.<\/li>\n<li>For <strong>Connect to cluster<\/strong>, choose a cluster to connect to and choose <strong>Connect<\/strong>.<\/li>\n<\/ol>\n<p>This adds a code block to the active cell and runs automatically to establish connection.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28262\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/5-5694-connect.jpg\" alt=\"\" width=\"800\" height=\"510\"><\/p>\n<p>We connect to and run Spark code on a remote EMR cluster through <a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-livy.html\" target=\"_blank\" rel=\"noopener noreferrer\">Livy<\/a>, an open-source REST server for Spark. Depending on the authentication method required by Livy on the chosen EMR cluster, appropriate code is injected into a new cell and is run to connect to the cluster. You can use this code to establish a connection to the EMR cluster if you\u2019re using this notebook at a later time. 
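<\/p>
<p>Under the hood, each of these injected cells starts a Spark session through the Livy REST API. The following sketch shows roughly what the three authentication modes mean at the HTTP level. The endpoint URL, the user names, and the use of the requests and requests_kerberos libraries are assumptions for illustration; in practice, the injected connection commands handle all of this for you.<\/p>

```python
# Sketch: what the three EMR authentication modes look like when talking to Livy
# directly. Endpoint, credentials, and requests_kerberos are illustrative assumptions.

def livy_session_request(kind="pyspark"):
    """Body of the POST /sessions call that starts a Spark session via Livy."""
    return {"kind": kind}

def livy_auth(mode, username=None, password=None):
    """Return a requests-compatible auth object for each authentication mode."""
    if mode == "Kerberos":
        # Needs a valid ticket from kinit; requests_kerberos is an assumed dependency.
        from requests_kerberos import HTTPKerberosAuth
        return HTTPKerberosAuth()
    if mode == "LDAP":
        # Livy validates HTTP basic credentials against the LDAP server.
        return (username, password)
    return None  # No-Auth: no credentials are sent

# Example (hypothetical host; Livy listens on port 8998 by default):
# requests.post("http://<emr-primary-dns>:8998/sessions",
#               json=livy_session_request(),
#               auth=livy_auth("LDAP", "user", "password"))
```

<p>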
Examples of the types of commands injected include the following:<\/p>\n<ul>\n<li>Kerberos-based authentication to Livy.<\/li>\n<\/ul>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28263\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/7-5694-Livy.jpg\" alt=\"\" width=\"800\" height=\"40\"><\/p>\n<ul>\n<li>LDAP-based authentication to Livy.<\/li>\n<\/ul>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28264\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/8-5694-LDAP.jpg\" alt=\"\" width=\"800\" height=\"52\"><\/p>\n<ul>\n<li>No-Auth authentication to Livy. For No-Auth authentication, the following dialog asks you to select the credential type.<\/li>\n<\/ul>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28265\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/9-5694-select-credential.jpg\" alt=\"\" width=\"550\" height=\"187\"><\/p>\n<p>Selecting <strong>HTTP basic authentication<\/strong> injects the following code into a new cell on the Studio notebook:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28266\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/10-5694-http.jpg\" alt=\"\" width=\"800\" height=\"42\"><\/p>\n<p>Selecting <strong>No credential<\/strong> injects the following code into a new cell on the Studio notebook:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28267\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/11-5694-no-credential.jpg\" alt=\"\" width=\"800\" height=\"39\"><\/p>\n<p>This code runs automatically. 
You\u2019re prompted to enter a user name and password for the EMR cluster authentication if authentication is required. After you\u2019re authenticated, a Spark application is started.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28268\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/12-5694-successfully-read.jpg\" alt=\"\" width=\"800\" height=\"198\"><\/p>\n<p>You can also change the EMR cluster that the Studio notebook is connected to by using the method described. Simply browse to find the cluster you want to switch to and connect to it. The Studio notebook can only be connected to one EMR cluster at a time.<\/p>\n<p>If you\u2019re using the PySpark kernel, you can use the PySpark magic <code>%%info<\/code> to display the current session information.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28269\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/13-5694-info.jpg\" alt=\"\" width=\"800\" height=\"207\"><\/p>\n<h3>Monitoring and debugging<\/h3>\n<p>If you want to set up SSH tunneling to access the Spark UI, complete the following steps. 
The link under <strong>Spark UI and Driver log<\/strong> isn\u2019t enabled until the SSH tunneling steps for the Spark UI are followed.<\/p>\n<ul>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-ssh-tunnel-local.html\" target=\"_blank\" rel=\"noopener noreferrer\">Option 1<\/a> \u2013 Set up an SSH tunnel to the primary node using local port forwarding<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-ssh-tunnel.html\" target=\"_blank\" rel=\"noopener noreferrer\">Option 2, part 1<\/a> \u2013 Set up an SSH tunnel to the primary node using dynamic port forwarding<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-connect-master-node-proxy.html\" target=\"_blank\" rel=\"noopener noreferrer\">Option 2, part 2<\/a> \u2013 Configure proxy settings to view websites hosted on the primary node<\/li>\n<\/ul>\n<p>For information on how to view web interfaces on EMR clusters, see <a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ManagementGuide\/emr-web-interfaces.html\" target=\"_blank\" rel=\"noopener noreferrer\">View web interfaces hosted on Amazon EMR clusters<\/a>.<\/p>\n<h2>Explore and query the data<\/h2>\n<p>In this section, we present examples of how to explore and query the data using either the PySpark (SparkMagic) kernel or the Python 3 (Data Science) kernel.<\/p>\n<h3>Query data from the PySpark (SparkMagic) kernel<\/h3>\n<p>In this example, we use the PySpark kernel to connect to a Kerberos-protected EMR cluster, query data from a Hive table, and use it for ML training.<\/p>\n<ol>\n<li>Open the <code>smstudio-pyspark-hive-sentiment-analysis.ipynb<\/code> notebook and choose the PySpark (SparkMagic) kernel.<\/li>\n<li>Choose the <strong>Cluster<\/strong> menu at the top of the notebook.<\/li>\n<li>For <strong>Connect to cluster<\/strong>, choose <strong>Connect<\/strong>.<\/li>\n<\/ol>\n<p>This adds a code block to the active cell and runs automatically to 
establish the connection.<\/p>\n<p>When you use the PySpark kernel, a <code>SparkContext<\/code> and <code>HiveContext<\/code> are created automatically after you connect to an EMR cluster. You can use <code>HiveContext<\/code> to query data in the Hive table and make it available in a Spark DataFrame.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28270\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/14-5694-sqlContext.jpg\" alt=\"\" width=\"700\" height=\"255\"><\/p>\n<ol start=\"4\">\n<li>Next, we query the <code>movie_reviews<\/code> table and get the data in a Spark DataFrame.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28271\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/15-5694-movie_reviews.jpg\" alt=\"\" width=\"800\" height=\"46\"><\/p>\n<p>We can use the DataFrame to look at the shape of the dataset and the size of each class (positive and negative). 
The following screenshots show that we have a balanced dataset.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28272\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/16-5694.jpg\" alt=\"\" width=\"800\" height=\"335\"><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28273\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/17-5694.jpg\" alt=\"\" width=\"800\" height=\"481\"><\/p>\n<p>You can visualize the shape and size of the dataset using Matplotlib.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28274\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/18-5694-Chart.jpg\" alt=\"\" width=\"800\" height=\"806\"><\/p>\n<p>You can use the <code>pyspark.sql.functions<\/code> module as shown in the following screenshot to inspect the length of the reviews.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28275\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/19-5694-from-pyspark.jpg\" alt=\"\" width=\"800\" height=\"573\"><\/p>\n<p>You can run SparkSQL queries using <code>%%sql<\/code> from the notebook and save the results to a local DataFrame. This allows for quick data exploration. By default, a maximum of 2,500 rows is returned. You can change the maximum number of rows by using the <code>-n<\/code> argument.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28276\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/20-5694-sql.jpg\" alt=\"\" width=\"800\" height=\"281\"><\/p>\n<p>As we continue through the notebook, we query the movie reviews table in Hive, storing the results into a DataFrame. 
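<\/p>
<p>For example, after a <code>%%sql -o<\/code> cell lands the query results in a local Pandas DataFrame, you can confirm the class balance locally. The DataFrame name, column name, and values below are assumptions based on the movie reviews schema.<\/p>

```python
import pandas as pd

# Hypothetical local result of a SparkMagic cell such as:
#   %%sql -o reviews_df -n 2500
#   SELECT review, sentiment FROM movie_reviews
reviews_df = pd.DataFrame({
    "review": ["great movie", "terrible plot", "loved it", "boring"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Count rows per class to confirm the dataset is balanced
class_counts = reviews_df["sentiment"].value_counts().to_dict()
print(class_counts)
```

<p>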
The SparkMagic environment allows you to send local data to the remote cluster using <code>%%send_to_spark<\/code>. We send the Amazon S3 location (bucket and key) variables to the remote cluster, then convert the Spark DataFrame to a Pandas DataFrame. Next, we upload it to Amazon S3 and use this data as input to the preprocessing step that creates training and validation data. This data trains a sentiment analysis model using the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/blazingtext.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker BlazingText algorithm<\/a>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28278\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/21-5694-send_to_spark.jpg\" alt=\"\" width=\"800\" height=\"235\"><\/p>\n<h3>Query data using the PyHive library from the Python 3 (Data Science) kernel<\/h3>\n<p>In this example, we use the Python 3 (Data Science) kernel and the PyHive library to connect to the Hive table. 
We then query data from a Hive table and use it for ML training.<\/p>\n<p>Note: Use the <code>LDAP<\/code> or <code>No-Auth<\/code> authentication mechanism to connect to Amazon EMR before running the following sample code.<\/p>\n<ol>\n<li>Open the <code>smstudio-ds-pyhive-sentiment-analysis.ipynb<\/code> notebook and choose the Python 3 (Data Science) kernel.<\/li>\n<li>Choose the <strong>Cluster<\/strong> menu at the top of the notebook.<\/li>\n<li>For <strong>Connect to cluster<\/strong>, choose a cluster to connect to and choose <strong>Connect<\/strong>.<\/li>\n<\/ol>\n<p>This adds a code block to the active cell and runs automatically to establish the connection.<\/p>\n<p>We run each cell in the notebook to walk through the PyHive example.<\/p>\n<ol start=\"4\">\n<li>First, we import the <code>hive<\/code> module from the PyHive library.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28277\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/21-5694-from-pyhive.jpg\" alt=\"\" width=\"800\" height=\"28\"><\/p>\n<ol start=\"5\">\n<li>You can connect to the Hive table using the following code.<\/li>\n<\/ol>\n<p>We use the private DNS name of the EMR primary node in the following code. Replace the host with the correct DNS name. You can find this in the output of the CloudFormation stack under the key <code>EMRMasterDNSName<\/code>. 
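<\/p>
<p>Put together, the connection and query steps look roughly like the following sketch. The HiveServer2 port (10000), the LDAP auth mode, and the column names are assumptions for illustration; replace the host with the <code>EMRMasterDNSName<\/code> value of your cluster.<\/p>

```python
def reviews_query(limit=2500):
    """Build the HiveQL statement used to pull reviews locally (column names assumed)."""
    return "SELECT review, sentiment FROM movie_reviews LIMIT {}".format(limit)

def fetch_reviews(host, username, password, limit=2500):
    """Connect to HiveServer2 with PyHive over LDAP and return reviews as a DataFrame."""
    import pandas as pd
    from pyhive import hive  # assumed available in the kernel image

    conn = hive.Connection(host=host, port=10000,
                           username=username, password=password, auth="LDAP")
    try:
        return pd.read_sql(reviews_query(limit), conn)
    finally:
        conn.close()

# Example (hypothetical host and credentials):
# df = fetch_reviews("<EMRMasterDNSName>", "user", "password")
```

<p>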
You can also find this information on the Amazon EMR console (expand the cluster name and locate <strong>Master public DNS<\/strong> in the summary section).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28279\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/22-5694.jpg\" alt=\"\" width=\"800\" height=\"36\"><\/p>\n<ol start=\"6\">\n<li>You can retrieve the data from the Hive table using the following code.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28280\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/23-5694.jpg\" alt=\"\" width=\"800\" height=\"48\"><\/p>\n<ol start=\"7\">\n<li>Continue running the cells in the notebook to upload the data to Amazon S3, preprocess the data for ML, train a SageMaker model, and deploy the model for prediction, as described later in this post.<\/li>\n<\/ol>\n<h2>Preprocess data and perform feature engineering<\/h2>\n<p>We perform data preprocessing and feature engineering on the data using SageMaker Processing. With Processing, you can use a simplified, managed experience to run data preprocessing or postprocessing and model evaluation workloads on the SageMaker platform. A processing job downloads its input from Amazon S3 and uploads its output to Amazon S3 during or after the job. The <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/ml-1954\/preprocessing.py\" target=\"_blank\" rel=\"noopener noreferrer\">preprocessing.py<\/a> script performs the required text preprocessing on the movie reviews dataset and splits it into training and validation data for model training.<\/p>\n<p>The notebook uses the scikit-learn processor within a Docker image to perform the processing job.<\/p>\n<p>We use the SageMaker instance type ml.m5.xlarge for processing, training, and model hosting. 
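<\/p>
<p>For context, BlazingText in supervised mode expects each input line to begin with a <code>__label__<\/code> prefix followed by space-separated tokens. The following is a minimal sketch of that kind of transformation; the authoritative steps live in preprocessing.py, and the label values and tokenization shown here are simplifying assumptions.<\/p>

```python
import random

def to_blazingtext_line(sentiment, review):
    """Format one example as BlazingText supervised input: '__label__<tag> tokens...'."""
    # Naive tokenization for illustration: lowercase and space out punctuation
    tokens = review.lower().replace(".", " . ").replace(",", " , ").split()
    return "__label__{} ".format(sentiment) + " ".join(tokens)

def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle and split (sentiment, review) pairs into train and validation lines."""
    rng = random.Random(seed)
    lines = [to_blazingtext_line(s, r) for s, r in rows]
    rng.shuffle(lines)
    cut = int(len(lines) * (1 - validation_fraction))
    return lines[:cut], lines[cut:]
```

<p>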
If you don\u2019t have access to this instance type and see a <code>ResourceLimitExceeded<\/code> error, use another instance type that you have access to. You can also request a service limit increase via the <a href=\"https:\/\/console.aws.amazon.com\/support\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Support Center<\/a>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28281\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/24-5694-local.jpg\" alt=\"\" width=\"800\" height=\"645\"><\/p>\n<h2>Train a SageMaker model<\/h2>\n<p>SageMaker Experiments allows us to organize, track, and review ML experiments with Studio notebooks. We can log metrics and information as we progress through the training process and evaluate results as we run the models. We create a SageMaker experiment and trial, a SageMaker estimator, and set the hyperparameters. We then kick off a training job by calling the <code>fit<\/code> method on the estimator. We use Spot Instances to reduce the training cost.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28282\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/25-5694.jpg\" alt=\"\" width=\"800\" height=\"459\"><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28283\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/26-5694.jpg\" alt=\"\" width=\"800\" height=\"177\"><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28284\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/27-5694.jpg\" alt=\"\" width=\"800\" height=\"160\"><\/p>\n<h2>Deploy the model and get predictions<\/h2>\n<p>When the training is complete, we host the model for real-time inference. 
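<\/p>
<p>The deployed endpoint accepts a JSON document with an <code>instances<\/code> list and an optional top-k <code>configuration<\/code>, and returns one set of label\/probability pairs per instance. The following helpers sketch building and parsing that payload; the helper names are illustrative and not part of the SageMaker SDK.<\/p>

```python
import json

def build_request(reviews, top_k=1):
    """JSON body for a BlazingText text-classification endpoint."""
    return json.dumps({"instances": reviews, "configuration": {"k": top_k}})

def parse_response(body):
    """Flatten the endpoint response into (label, probability) pairs."""
    return [
        (pred["label"][0].replace("__label__", ""), pred["prob"][0])
        for pred in json.loads(body)
    ]

# Example with a deployed predictor (hypothetical variable name):
# predictions = parse_response(predictor.predict(build_request(["a great movie"])))
```

<p>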
We use the <code>deploy<\/code> method of the SageMaker estimator to easily deploy the model and create an endpoint.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28285\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/28-5694.jpg\" alt=\"\" width=\"800\" height=\"43\"><\/p>\n<p>After the model is deployed, we test the deployed endpoint with test data and get predictions.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28286\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/29-5694.jpg\" alt=\"\" width=\"800\" height=\"357\"><\/p>\n<p>Finally, we clean up the resources such as the SageMaker endpoint and the S3 bucket created in the notebook.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28287\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/16\/30-5694.jpg\" alt=\"\" width=\"800\" height=\"117\"><\/p>\n<h2>Bring your own image<\/h2>\n<p>If you want to bring your own image for the Studio kernels to perform the tasks we described, you need to install the following dependencies to your kernel. The following code lists the pip command along with the library name:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">pip install sparkmagic\npip install sagemaker-studio-sparkmagic-lib\npip install sagemaker-studio-analytics-extension<\/code><\/pre>\n<\/p><\/div>\n<p>If you want to connect to a Kerberos-protected EMR cluster, you also need to install the kinit client. Depending on your OS, the command to install the kinit client varies. 
The following is the command for an Ubuntu or Debian-based image:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">apt-get install -y -qq krb5-user<\/code><\/pre>\n<\/p><\/div>\n<h2>Clean up<\/h2>\n<p>You can complete the following steps to clean up resources deployed for this solution. This also deletes the S3 bucket, so you should copy the contents of the bucket to a backup location if you want to retain the data for later use.<\/p>\n<ol>\n<li>Delete the Amazon SageMaker Studio apps<\/li>\n<\/ol>\n<p>Navigate to the Amazon SageMaker Studio console. Choose your user name (<code>studio-user<\/code>), then delete all the apps listed under \u201cApps\u201d by choosing the \u201cDelete app\u201d button. Wait until the status shows as \u201ccompleted\u201d.<\/p>\n<ol start=\"2\">\n<li>Delete the EFS volume<\/li>\n<\/ol>\n<p>Navigate to Amazon EFS, locate the file system that was created by SageMaker (you can confirm this by choosing the File System Id and verifying the tag \u201c<code>ManagedByAmazonSageMakerResource<\/code>\u201d on the Tags tab), and delete it.<\/p>\n<ol start=\"3\">\n<li>Finally, delete the CloudFormation stack\n<ol type=\"a\">\n<li>On the CloudFormation console, choose\u00a0Stacks.<\/li>\n<li>Select the stack deployed for this solution.<\/li>\n<li>Choose\u00a0Delete.<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>In this post, we walked through how you can visually browse, discover, and connect to Spark data processing environments running on Amazon EMR, right from Studio notebooks in a few simple clicks. We demonstrated connecting to EMR clusters using various authentication mechanisms\u2014Kerberos, LDAP, and No-Auth. We then explored and queried a sample dataset from a Hive table on Amazon EMR using SparkSQL and PyHive. 
We locally preprocessed and feature-engineered the retrieved data, trained an ML model to predict the sentiment of a movie review, and deployed it to get predictions\u2014all from the Studio notebook. Through this example, we demonstrated how to unify data preparation and ML workflows in Studio notebooks.<\/p>\n<p>For more information, see the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/whatis.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Developer Guide<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-full wp-image-9352\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2019\/08\/06\/praveen-100.jpg\" alt=\"\" width=\"100\" height=\"132\"><strong>Praveen Veerath<\/strong> is a Machine Learning Specialist Solutions Architect for AWS. He leads multiple AI\/ML and cloud-native architecture engagements with strategic AWS customers, helping them design machine learning infrastructure at scale.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16063 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/17\/Sriharsha-M-S.jpg\" alt=\"\" width=\"103\" height=\"103\"><strong>Sriharsha M S<\/strong> is an AI\/ML specialist solutions architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI\/ML to solve complex business problems. He provides technical guidance and design advice to implement AI\/ML applications at scale. 
His expertise spans application architecture, big data, analytics, and machine learning.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-28295 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/17\/Sumedha-Swamy.jpg\" alt=\"\" width=\"100\" height=\"133\">Sumedha Swamy <\/strong>is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team, building it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using machine learning. In his free time, he likes photographing the amazing geology of the American Southwest.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-28294 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/17\/Edward-Sun.jpg\" alt=\"\" width=\"100\" height=\"137\"><\/strong><strong>Edward Sun <\/strong>is a Senior SDE working on SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions and simplifying the customer experience of integrating SageMaker Studio with popular technologies in the data engineering and ML ecosystem. 
In his spare time, Edward is a big fan of camping, hiking, and fishing, and enjoys spending time with his family.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/10\/Rama-Thamman.jpg\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-18205 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/10\/Rama-Thamman.jpg\" alt=\"\" width=\"100\" height=\"127\"><\/a>Rama Thamman<\/strong> is a Software Development Manager with the AI Platforms team, leading the ML Migrations team.<\/p>\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/perform-interactive-data-engineering-and-data-science-workflows-from-amazon-sagemaker-studio-notebooks\/<\/p>\n","protected":false},"author":0,"featured_media":888,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/887"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=887"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/887\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/888"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=887"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=887"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=887"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}