{"id":45,"date":"2020-08-17T07:53:41","date_gmt":"2020-08-17T07:53:41","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/08\/17\/accessing-data-sources-from-amazon-sagemaker-r-kernels\/"},"modified":"2020-08-17T07:53:41","modified_gmt":"2020-08-17T07:53:41","slug":"accessing-data-sources-from-amazon-sagemaker-r-kernels","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/08\/17\/accessing-data-sources-from-amazon-sagemaker-r-kernels\/","title":{"rendered":"Accessing data sources from Amazon SageMaker R kernels"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> notebooks <a href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2019\/08\/amazon-sagemaker-notebooks-available-with-pre-installed-r-kernel\/\" target=\"_blank\" rel=\"noopener noreferrer\">now support R out-of-the-box<\/a>, without needing you to manually install <a href=\"https:\/\/github.com\/IRkernel\/IRkernel\" target=\"_blank\" rel=\"noopener noreferrer\">R kernels<\/a> on the instances. Also, the notebooks come pre-installed with the <a href=\"https:\/\/rstudio.github.io\/reticulate\/\" target=\"_blank\" rel=\"noopener noreferrer\">reticulate<\/a> library, which offers an R interface for the <a href=\"https:\/\/sagemaker.readthedocs.io\/en\/latest\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Python SDK<\/a> and enables you to invoke Python modules from within an R script. You can easily run machine learning (ML) models in R using the Amazon SageMaker R kernel to access the data from multiple data sources. 
The R kernel is available by default in all Regions that Amazon SageMaker is available in.<\/p>\n<p><a href=\"https:\/\/www.r-project.org\/about.html\" target=\"_blank\" rel=\"noopener noreferrer\">R<\/a> is a programming language built for statistical analysis and is very popular in data science communities. In this post, we show you how to connect to the following data sources from the Amazon SageMaker R kernel using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Java_Database_Connectivity\" target=\"_blank\" rel=\"noopener noreferrer\">Java Database Connectivity (JDBC)<\/a>:<\/p>\n<ul>\n<li>Hive and Presto on an Amazon EMR cluster<\/li>\n<li>Amazon Athena<\/li>\n<li>Amazon Redshift<\/li>\n<li>Amazon Aurora MySQL-compatible<\/li>\n<\/ul>\n<p>For more information about using Amazon SageMaker features with R, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/r-guide.html\" target=\"_blank\" rel=\"noopener noreferrer\">R User Guide to Amazon SageMaker<\/a>.<\/p>\n<h2>Solution overview<\/h2>\n<p>To build this solution, we first need to <a href=\"https:\/\/docs.aws.amazon.com\/batch\/latest\/userguide\/create-public-private-vpc.html\" target=\"_blank\" rel=\"noopener noreferrer\">create a VPC with public and private subnets<\/a>. This allows us to communicate securely with different resources and data sources inside an isolated network. Next, we create the data sources in the custom VPC, along with a notebook instance that has all the configuration and access necessary to connect to the various data sources using R.<\/p>\n<p>To make sure that the data sources are not reachable from the Internet, we create them inside a private subnet of the VPC. For this post, we create the following:<\/p>\n<ul>\n<li>An Amazon EMR cluster with Hive and Presto<\/li>\n<li>An Amazon Redshift cluster<\/li>\n<li>An Amazon Aurora MySQL-compatible cluster<\/li>\n<\/ul>\n<p>You connect to the Amazon EMR cluster inside the private subnet using <a href=\"https:\/\/docs.aws.amazon.com\/systems-manager\/latest\/userguide\/session-manager.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Systems Manager Session Manager<\/a> to create Hive tables.<\/p>\n<p>To run the code using the R kernel in Amazon SageMaker, create an Amazon SageMaker notebook. Download the JDBC drivers for the data sources. 
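<\/p>\n<p>In R, fetching a driver JAR is a one-liner with <code>download.file<\/code>; the following sketch is illustrative only (the lifecycle configuration described next downloads all the required drivers for you, and the Maven Central URL and destination path here are assumptions):<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-r\"># Illustrative: fetch the standalone Hive JDBC driver from Maven Central\r\njar_url = paste0('https:\/\/repo1.maven.org\/maven2\/org\/apache\/hive\/hive-jdbc\/',\r\n                 '2.3.6\/hive-jdbc-2.3.6-standalone.jar')\r\ndownload.file(jar_url,\r\n              destfile = '\/home\/ec2-user\/SageMaker\/jdbc\/hive-jdbc-2.3.6-standalone.jar',\r\n              mode = 'wb')<\/code><\/pre>\n<\/div>\n<p>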
<a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/notebook-lifecycle-config.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create a lifecycle configuration for the notebook<\/a> containing the setup script for R packages, and attach the lifecycle configuration to the notebook on create and on start to make sure the setup is complete.<\/p>\n<p>Finally, we can use the <a href=\"http:\/\/aws.amazon.com\/console\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Management Console<\/a> to navigate to the notebook to run code using the R kernel and access the data from various sources. The entire solution is also available in the <a href=\"https:\/\/github.com\/aws-samples\/amazon-sagemaker-r-kernel-access-data-sources\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repository<\/a>.<\/p>\n<h2>Solution architecture<\/h2>\n<p>The following architecture diagram shows how you can use Amazon SageMaker to run code using the R kernel by establishing connectivity to various sources. You can also use the Amazon Redshift query editor or Amazon Athena query editor to create data resources. You need to use the Session Manager in <a href=\"https:\/\/aws.amazon.com\/systems-manager\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Systems Manager<\/a> to SSH to the Amazon EMR cluster to create Hive resources.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14599\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-001.jpg\" alt=\"\" width=\"800\" height=\"501\"><\/p>\n<h2>Launching the AWS CloudFormation template<\/h2>\n<p>To automate resource creation, you run an <a href=\"https:\/\/aws.amazon.com\/cloudformation\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> template. 
The template gives you the option to create an Amazon EMR cluster, Amazon Redshift cluster, or Amazon Aurora MySQL-compatible cluster automatically, as opposed to executing each step manually. It will take a few minutes to create all the resources.<\/p>\n<ol>\n<li>Choose the following link to launch the CloudFormation stack, which creates the required AWS resources to implement this solution:<br \/><a href=\"https:\/\/console.aws.amazon.com\/cloudformation\/home#\/stacks\/create\/template?stackName=Blog-SageMaker-R&amp;templateURL=https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/SageMaker-R-Connect-Data-Sources\/CloudFormation\/cft.yaml\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-47 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2017\/02\/10\/launchstack.png\" alt=\"\" width=\"107\" height=\"20\"><\/a>\n<\/li>\n<li>On the <strong>Create stack<\/strong> page, choose <strong>Next<\/strong>.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14600\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-005.jpg\" alt=\"\" width=\"800\" height=\"427\">\n<\/li>\n<li>Enter a stack name.<\/li>\n<li>You can change the default values for the following stack details:<\/li>\n<\/ol>\n<table border=\"1px\" cellpadding=\"3px\">\n<tbody readability=\"5\">\n<tr>\n<td width=\"159\"><strong><em>Stack Details<\/em><\/strong><\/td>\n<td width=\"284\"><strong><em>Default Values<\/em><\/strong><\/td>\n<\/tr>\n<tr readability=\"2\">\n<td width=\"159\">Choose Second Octet for Class B VPC Address (10.xxx.0.0\/16)<\/td>\n<td width=\"284\">0<\/td>\n<\/tr>\n<tr readability=\"2\">\n<td width=\"159\">SageMaker Jupyter Notebook Instance Type<\/td>\n<td width=\"284\">ml.t2.medium<\/td>\n<\/tr>\n<tr readability=\"2\">\n<td width=\"159\">Create EMR Cluster 
Automatically?<\/td>\n<td width=\"284\">\u201cYes\u201d<\/td>\n<\/tr>\n<tr readability=\"2\">\n<td width=\"159\">Create Redshift Cluster Automatically?<\/td>\n<td width=\"284\">\u201cYes\u201d<\/td>\n<\/tr>\n<tr readability=\"2\">\n<td width=\"159\">Create Aurora MySQL DB Cluster Automatically?<\/td>\n<td width=\"284\">\u201cYes\u201d<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<ol start=\"5\">\n<li>Choose <strong>Next<\/strong>.<\/li>\n<li>On the <strong>Configure stack options<\/strong> page, choose <strong>Next<\/strong>.<\/li>\n<li>Select <strong>I acknowledge that AWS CloudFormation might create IAM resources<\/strong>.<\/li>\n<li>Choose <strong>Create stack<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14601\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-007.jpg\" alt=\"\" width=\"800\" height=\"208\"><\/p>\n<p>You can now see the stack being created, as in the following screenshot.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14602\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-009.jpg\" alt=\"\" width=\"800\" height=\"295\"><\/p>\n<p>When stack creation is complete, the status shows as <code>CREATE_COMPLETE<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14603\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-011.jpg\" alt=\"\" width=\"800\" height=\"363\"><\/p>\n<ol start=\"9\">\n<li>On the <strong>Outputs<\/strong> tab, record the keys and their corresponding values.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14604\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-013.jpg\" alt=\"\" width=\"800\" height=\"364\"><\/p>\n<p>You use the following keys later in this post:<\/p>\n<ul>\n<li>\n<strong>AuroraClusterDBName<\/strong> \u2013 Aurora cluster database name<\/li>\n<li>\n<strong>AuroraClusterEndpointWithPort<\/strong> \u2013 Aurora cluster endpoint address with port number<\/li>\n<li>\n<strong>AuroraClusterSecret<\/strong> \u2013 Aurora cluster credentials secret ARN<\/li>\n<li>\n<strong>EMRClusterDNSAddress<\/strong> \u2013 EMR cluster DNS name<\/li>\n<li>\n<strong>EMRMasterInstanceId<\/strong> \u2013 EMR cluster primary instance ID<\/li>\n<li>\n<strong>PrivateSubnets<\/strong> \u2013 Private subnets<\/li>\n<li>\n<strong>PublicSubnets<\/strong> \u2013 Public subnets<\/li>\n<li>\n<strong>RedshiftClusterDBName<\/strong> \u2013 Amazon Redshift cluster database name<\/li>\n<li>\n<strong>RedshiftClusterEndpointWithPort<\/strong> \u2013 Amazon Redshift cluster endpoint address with port number<\/li>\n<li>\n<strong>RedshiftClusterSecret<\/strong> \u2013 Amazon Redshift cluster credentials secret ARN<\/li>\n<li>\n<strong>SageMakerNotebookName<\/strong> \u2013 Amazon SageMaker notebook instance name<\/li>\n<li>\n<strong>SageMakerRS3BucketName<\/strong> \u2013 Amazon SageMaker S3 data bucket<\/li>\n<li>\n<strong>VPCandCIDR<\/strong> \u2013 VPC ID and CIDR block<\/li>\n<\/ul>\n<h2>Creating your notebook with necessary R packages and JAR files<\/h2>\n<p>JDBC is an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Application_programming_interface\" target=\"_blank\" rel=\"noopener noreferrer\">application programming interface<\/a> (API) for the programming language <a href=\"https:\/\/en.wikipedia.org\/wiki\/Java_(programming_language)\" target=\"_blank\" rel=\"noopener noreferrer\">Java<\/a>, which defines how you can access a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Database\" target=\"_blank\" rel=\"noopener 
noreferrer\">database<\/a>. <a href=\"https:\/\/cran.r-project.org\/web\/packages\/RJDBC\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">RJDBC<\/a> is a package in R that allows you to connect to various data sources using the JDBC interface. The notebook instance that the CloudFormation template created ensures that the necessary JAR files for Hive, Presto, Amazon Athena, Amazon Redshift and MySQL are present in order to establish a JDBC connection.<\/p>\n<ol>\n<li>In the <a href=\"https:\/\/console.aws.amazon.com\/sagemaker\/home\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Console<\/a>, under <strong>Notebook<\/strong>, choose <strong>Notebook instances<\/strong>.<\/li>\n<li>Search for the notebook that matches the <code>SageMakerNotebookName<\/code> key you recorded earlier.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14605\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-015.jpg\" alt=\"\" width=\"800\" height=\"249\">\n<\/li>\n<li>Select the notebook instance.<\/li>\n<li>Click on \u201cOpen Jupyter\u201d under \u201cActions\u201d to locate the \u201cjdbc\u201d directory.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14606\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-017.jpg\" alt=\"\" width=\"800\" height=\"136\">\n<\/li>\n<\/ol>\n<p>The CloudFormation template downloads the JAR files for <a href=\"https:\/\/mvnrepository.com\/artifact\/org.apache.hive\/hive-jdbc\/2.3.6\" target=\"_blank\" rel=\"noopener noreferrer\">Hive<\/a>, <a href=\"https:\/\/prestodb.io\/download.html\" target=\"_blank\" rel=\"noopener noreferrer\">Presto<\/a>, <a href=\"https:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/connect-with-jdbc.html\" target=\"_blank\" rel=\"noopener noreferrer\">Athena<\/a>, <a 
href=\"https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/configure-jdbc-connection.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Redshift<\/a>, and <a href=\"https:\/\/downloads.mariadb.org\/connector-java\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Aurora MySQL<\/a>-compatible inside the \u201cjdbc\u201d directory.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14607\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-019.jpg\" alt=\"\" width=\"800\" height=\"251\"><\/p>\n<ol start=\"5\">\n<li>Locate the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/notebook-lifecycle-config.html\" target=\"_blank\" rel=\"noopener noreferrer\">lifecycle configuration<\/a> attached.<\/li>\n<\/ol>\n<p>A lifecycle configuration allows you to install packages or sample notebooks on your notebook instance, configure networking and security for it, or otherwise use a shell script for customization. 
A lifecycle configuration provides shell scripts that run when you create the notebook instance or when you start the notebook.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14608\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-021.jpg\" alt=\"\" width=\"800\" height=\"315\"><\/p>\n<ol start=\"6\">\n<li>Inside the <strong>Lifecycle configuration<\/strong> section, choose <strong>View script<\/strong> to see the <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/SageMaker-R-Connect-Data-Sources\/LCScript\/SageMaker_LC_Script.sh\" target=\"_blank\" rel=\"noopener noreferrer\">lifecycle configuration script<\/a> that sets up the R kernel in Amazon SageMaker to make JDBC connections to data sources using R.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14609\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-023.jpg\" alt=\"\" width=\"800\" height=\"315\"><\/p>\n<p>It installs the <a href=\"https:\/\/anaconda.org\/r\/r-rjdbc\" target=\"_blank\" rel=\"noopener noreferrer\">RJDBC package and dependencies in the Anaconda environment<\/a> of the Amazon SageMaker notebook.<\/p>\n<h2>Connecting to Hive and Presto<\/h2>\n<p>Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as <a href=\"https:\/\/aws.amazon.com\/emr\/features\/spark\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache Spark<\/a>, Apache Hive, <a href=\"https:\/\/aws.amazon.com\/emr\/features\/hbase\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache HBase<\/a>, <a href=\"https:\/\/aws.amazon.com\/blogs\/big-data\/use-apache-flink-on-amazon-emr\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache Flink<\/a>, <a href=\"https:\/\/aws.amazon.com\/emr\/features\/hudi\/\" 
target=\"_blank\" rel=\"noopener noreferrer\">Apache Hudi<\/a>, and Presto.<\/p>\n<p>You can create a test table in Hive by logging in to the EMR master node from the AWS console using the Session Manager capability in Systems Manager. Systems Manager gives you visibility and control of your infrastructure on AWS. Systems Manager also provides a unified user interface so you can view operational data from multiple AWS services and allows you to automate operational tasks across your AWS resources. Session Manager is a fully managed Systems Manager capability that lets you manage your <a href=\"http:\/\/aws.amazon.com\/ec2\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Compute Cloud<\/a> (Amazon EC2) instances, on-premises instances, and virtual machines (VMs) through an interactive, one-click browser-based shell or through the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI).<\/p>\n<p>You use the following values from the AWS CloudFormation <strong>Outputs<\/strong> tab in this step:<\/p>\n<ul>\n<li>\n<strong>EMRClusterDNSAddress<\/strong> \u2013 EMR cluster DNS name<\/li>\n<li>\n<strong>EMRMasterInstanceId<\/strong> \u2013 EMR cluster primary instance ID<\/li>\n<li>\n<strong>SageMakerNotebookName<\/strong> \u2013 Amazon SageMaker notebook instance name<\/li>\n<\/ul>\n<ol>\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/systems-manager\/home\" target=\"_blank\" rel=\"noopener noreferrer\">Systems Manager Console<\/a>, under <strong>Instances &amp; Nodes<\/strong>, choose <strong>Session Manager<\/strong>.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14610\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-025.jpg\" alt=\"\" width=\"800\" height=\"394\">\n<\/li>\n<li>Choose <strong>Start Session<\/strong>.<img decoding=\"async\" loading=\"lazy\" 
class=\"alignnone size-full wp-image-14611\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-027.jpg\" alt=\"\" width=\"800\" height=\"399\">\n<\/li>\n<li>Start an SSH session with the EMR primary node by locating the instance ID as specified by the value of the key <code>EMRMasterInstanceId<\/code><strong>.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14612\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-029.jpg\" alt=\"\" width=\"800\" height=\"260\"><\/strong>\n<\/li>\n<\/ol>\n<p>This starts the browser-based shell.<\/p>\n<ol start=\"4\">\n<li readability=\"-2\">Run the following SSH commands:\n<div class=\"hide-language\" readability=\"7\">\n<pre><code class=\"lang-bash\"># change user to hadoop \r\nwhoami\r\nsudo su - hadoop<\/code><\/pre>\n<\/div>\n<\/li>\n<li readability=\"3.5\">Create a test table in Hive from the EMR master node as you have already logged in using SSH:\n<div class=\"hide-language\" readability=\"18\">\n<pre><code class=\"lang-bash\"># Run on the EMR master node to create a table called students in Hive\r\nhive -e \"CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));\"\r\n\r\n# Run on the EMR master node to insert data to students created above\r\nhive -e \"INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);\"\r\n\r\n# Verify \r\nhive -e \"SELECT * from students;\"\r\nexit\r\nexit<\/code><\/pre>\n<\/div>\n<\/li>\n<\/ol>\n<p>The following screenshot shows the view in the browser-based shell.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14613\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-031.jpg\" alt=\"\" width=\"800\" height=\"608\"><\/p>\n<ol 
start=\"6\">\n<li>Close the browser after exiting the shell.<\/li>\n<\/ol>\n<p>To query the data from Amazon EMR using the Amazon SageMaker R kernel, you open the notebook the CloudFormation template created.<\/p>\n<ol start=\"7\">\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/sagemaker\/home\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Console<\/a>, under <strong>Notebook<\/strong>, chose <strong>Notebook instances<\/strong>.<\/li>\n<li>Find the notebook as specified by the value of the key <code>SageMakerNotebookName<\/code>.<\/li>\n<li>Choose <strong>Open Jupyter<\/strong>.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14614\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-033.jpg\" alt=\"\" width=\"800\" height=\"274\">\n<\/li>\n<li>To demonstrate connectivity from the Amazon SageMaker R kernel, choose <strong>Upload<\/strong> and upload the <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/SageMaker-R-Connect-Data-Sources\/Notebook\/hive_connect.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">ipynb<\/a> notebook.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14615\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-035.jpg\" alt=\"\" width=\"800\" height=\"164\">\n<ol type=\"a\">\n<li>Alternatively, from the <strong>New <\/strong>drop-down menu, choose <strong>R <\/strong>to open a new notebook.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14616\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-037.jpg\" alt=\"\" width=\"800\" height=\"477\">\n<\/li>\n<li>Enter the code as mentioned in \u201chive_connect.ipynb\u201d, replacing the <code>emr_dns<\/code> value with the 
value from key <code>EMRClusterDNSAddress<\/code>:<\/li>\n<\/ol>\n<\/li>\n<li>Run all the cells in the notebook to connect to Hive on Amazon EMR using the Amazon SageMaker R console.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14617\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-039.jpg\" alt=\"\" width=\"800\" height=\"496\">\n<\/li>\n<\/ol>\n<p>You follow similar steps to connect to Presto:<\/p>\n<ol start=\"12\">\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/sagemaker\/home\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Console<\/a>, open the notebook you created.<\/li>\n<li>Choose <strong>Open Jupyter<\/strong>.<\/li>\n<li>Choose <strong>Upload<\/strong> to upload the <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/SageMaker-R-Connect-Data-Sources\/Notebook\/presto_connect.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">ipynb<\/a> notebook.\n<ol type=\"a\">\n<li>Alternatively, from the <strong>New<\/strong> drop-down menu, choose <strong>R<\/strong> to open a new notebook.<\/li>\n<li>Enter the code as mentioned in \u201cpresto_connect.ipynb\u201d, replacing the <code>emr_dns<\/code> value with the value from key <code>EMRClusterDNSAddress<\/code>:<\/li>\n<\/ol>\n<\/li>\n<li>Run all the cells in the notebook to connect to PrestoDB on Amazon EMR using the Amazon SageMaker R console.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14618\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-041.jpg\" alt=\"\" width=\"800\" height=\"490\">\n<\/li>\n<\/ol>\n<h2>Connecting to Amazon Athena<\/h2>\n<p>Amazon Athena is an interactive query service that makes it easy to analyze data in <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage 
Service<\/a> (Amazon S3) using standard SQL. Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. To connect to Amazon Athena from the Amazon SageMaker R kernel using RJDBC, we <a href=\"https:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/connect-with-jdbc.html\" target=\"_blank\" rel=\"noopener noreferrer\">use the Amazon Athena JDBC driver<\/a>, which is already downloaded to the notebook instance via the lifecycle configuration script.<\/p>\n<p>You also need to set the query result location in Amazon S3. For more information, see <a href=\"https:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/querying.html\" target=\"_blank\" rel=\"noopener noreferrer\">Working with Query Results, Output Files, and Query History<\/a>.<\/p>\n<ol>\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/athena\/home\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena Console<\/a>, choose <strong>Get Started<\/strong>.<\/li>\n<li>Choose <strong>Set up a query result location in Amazon S3<\/strong>.<\/li>\n<li>For <strong>Query result location<\/strong>, enter the Amazon S3 location as specified by the value of the key <code>SageMakerRS3BucketName<\/code>.<\/li>\n<li>Optionally, add a prefix, such as <code>results<\/code>.<\/li>\n<li>Choose <strong>Save<\/strong>.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14619\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-043.jpg\" alt=\"\" width=\"800\" height=\"362\">\n<\/li>\n<li>\n<a href=\"https:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/getting-started.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create a database or schema and table in Athena<\/a> with the example Amazon S3 data.<\/li>\n<li>Similar to connecting to Hive and Presto, to establish a connection to Athena from Amazon SageMaker using the R kernel, you can upload the <a 
href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/SageMaker-R-Connect-Data-Sources\/Notebook\/athena_connect.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">ipynb<\/a> notebook.\n<ol type=\"a\">\n<li>Alternatively, open a new notebook and enter the code in \u201cathena_connect.ipynb\u201d, replacing the <code>s3_bucket<\/code> value with the value from key <code>SageMakerRS3BucketName<\/code>:<\/li>\n<\/ol>\n<\/li>\n<li>Run all the cells in the notebook to connect to Amazon Athena from the Amazon SageMaker R console.<img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14620\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-045.jpg\" alt=\"\" width=\"800\" height=\"499\">\n<\/li>\n<\/ol>\n<h2>Connecting to Amazon Redshift<\/h2>\n<p>Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. 
To connect to Amazon Redshift from the Amazon SageMaker R kernel using RJDBC, we use the <a href=\"https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/configure-jdbc-connection.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Redshift JDBC driver<\/a>, which is already downloaded to the notebook instance via the lifecycle configuration script.<\/p>\n<p>You need the following keys and their values from the AWS CloudFormation <strong>Outputs<\/strong> tab:<\/p>\n<ul>\n<li>\n<strong>RedshiftClusterDBName<\/strong> \u2013 Amazon Redshift cluster database name<\/li>\n<li>\n<strong>RedshiftClusterEndpointWithPort<\/strong> \u2013 Amazon Redshift cluster endpoint address with port number<\/li>\n<li>\n<strong>RedshiftClusterSecret<\/strong> \u2013 Amazon Redshift cluster credentials secret ARN<\/li>\n<\/ul>\n<p>The CloudFormation template creates a secret for the Amazon Redshift cluster in <a href=\"https:\/\/aws.amazon.com\/secrets-manager\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Secrets Manager<\/a>, which is a service that helps you protect secrets needed to access your applications, services, and IT resources. 
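<\/p>\n<p>You can read such a secret programmatically instead of copying values by hand; the following sketch assumes the reticulate and jsonlite packages and calls boto3 through reticulate (the <code>username<\/code> and <code>password<\/code> keys are the ones database secrets typically contain):<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-r\"># Retrieve database credentials from AWS Secrets Manager\r\nlibrary(reticulate)\r\nlibrary(jsonlite)\r\n\r\nboto3 = import('boto3')\r\nclient = boto3$client('secretsmanager')\r\n\r\n# Use the ARN from the RedshiftClusterSecret stack output\r\nresp = client$get_secret_value(SecretId = 'REPLACE_WITH_RedshiftClusterSecret_ARN')\r\ncreds = fromJSON(resp$SecretString)\r\n# creds$username and creds$password can then be passed to dbConnect()<\/code><\/pre>\n<\/div>\n<p>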
Secrets Manager lets you easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle.<\/p>\n<ol>\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/secretsmanager\/home\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Secrets Manager Console<\/a>, choose <strong>Secrets<\/strong>.<\/li>\n<li>Choose the secret denoted by the <code>RedshiftClusterSecret<\/code> key value.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14621\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-047.jpg\" alt=\"\" width=\"800\" height=\"394\">\n<\/li>\n<li>In the <strong>Secret value<\/strong> section, choose <strong>Retrieve secret value<\/strong> to get the user name and password for the Amazon Redshift cluster.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14622\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-049.jpg\" alt=\"\" width=\"800\" height=\"228\">\n<\/li>\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/redshiftv2\/home\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Redshift Console<\/a>, choose <strong>Editor<\/strong> (which is essentially the <a href=\"https:\/\/docs.aws.amazon.com\/redshift\/latest\/mgmt\/query-editor.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Redshift query editor<\/a>).<\/li>\n<li>For <strong>Database name<\/strong>, enter <code>redshiftdb<\/code>.<\/li>\n<li>For <strong>Database password<\/strong>, enter your password.<\/li>\n<li>Choose <strong>Connect to database<\/strong>.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14623\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-051.jpg\" alt=\"\" width=\"800\" 
height=\"527\">\n<\/li>\n<li readability=\"2.5\">Run the following SQL statements to create a table and insert a couple of records:\n<div class=\"hide-language\" readability=\"16\">\n<pre><code class=\"lang-sql\">CREATE TABLE public.students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));\r\nINSERT INTO public.students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);\r\n<\/code><\/pre>\n<\/div>\n<\/li>\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/sagemaker\/home\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Console<\/a>, open your notebook.<\/li>\n<li>Choose <strong>Open Jupyter<\/strong>.<\/li>\n<li>Upload the <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/SageMaker-R-Connect-Data-Sources\/Notebook\/redshift_connect.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">ipynb<\/a> notebook.\n<ol type=\"a\">\n<li>Alternatively, open a new notebook and enter the code as mentioned in \u201credshift_connect.ipynb\u201d, replacing the values for <code>RedshiftClusterEndpointWithPort<\/code>, <code>RedshiftClusterDBName<\/code>, and <code>RedshiftClusterSecret<\/code>:<\/li>\n<\/ol>\n<\/li>\n<li>Run all the cells in the notebook to connect to Amazon Redshift on the Amazon SageMaker R console.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14624\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-053.jpg\" alt=\"\" width=\"800\" height=\"481\">\n<\/li>\n<\/ol>\n<h2>Connecting to Amazon Aurora MySQL-compatible<\/h2>\n<p>Amazon Aurora is a MySQL-compatible <a href=\"https:\/\/aws.amazon.com\/relational-database\/\" target=\"_blank\" rel=\"noopener noreferrer\">relational database<\/a> built for the cloud, which combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases. 
To connect to Amazon Aurora from the Amazon SageMaker R kernel using RJDBC, we use the <a href=\"https:\/\/downloads.mariadb.org\/connector-java\/\" target=\"_blank\" rel=\"noopener noreferrer\">MariaDB JDBC driver<\/a>, which is already downloaded to the notebook instance via the lifecycle configuration script.<\/p>\n<p>You need the following keys and their values from the AWS CloudFormation <strong>Outputs<\/strong> tab:<\/p>\n<ul>\n<li>\n<strong>AuroraClusterDBName<\/strong> \u2013 Aurora cluster database name<\/li>\n<li>\n<strong>AuroraClusterEndpointWithPort<\/strong> \u2013 Aurora cluster endpoint address with port number<\/li>\n<li>\n<strong>AuroraClusterSecret<\/strong> \u2013 Aurora cluster credentials secret ARN<\/li>\n<\/ul>\n<p>The CloudFormation template creates a secret for the Aurora cluster in Secrets Manager.<\/p>\n<ol>\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/secretsmanager\/home\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Secrets Manager Console<\/a>, locate the secret denoted by the <code>AuroraClusterSecret<\/code> key value.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14625\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-055.jpg\" alt=\"\" width=\"800\" height=\"384\">\n<\/li>\n<li>In the <strong>Secret value<\/strong> section, choose <strong>Retrieve secret value<\/strong> to get the user name and password for the Aurora cluster.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14626\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-057.jpg\" alt=\"\" width=\"800\" height=\"235\">\n<\/li>\n<\/ol>\n<p>To connect to the cluster, follow steps similar to those for Amazon Redshift.<\/p>\n<ol start=\"3\">\n<li>On the <a href=\"https:\/\/console.aws.amazon.com\/sagemaker\/home\" 
target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Console<\/a>, open your notebook.<\/li>\n<li>Choose <strong>Open Jupyter<\/strong>.<\/li>\n<li>Upload the <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/SageMaker-R-Connect-Data-Sources\/Notebook\/aurora_connect.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">aurora_connect.ipynb<\/a> notebook.\n<ol type=\"a\">\n<li>Alternatively, open a new notebook and enter the code from \u201caurora_connect.ipynb\u201d, replacing the values for <code>AuroraClusterEndpointWithPort<\/code>, <code>AuroraClusterDBName<\/code>, and <code>AuroraClusterSecret<\/code>.<\/li>\n<\/ol>\n<\/li>\n<li>Run all the cells in the notebook to connect to Amazon Aurora from the Amazon SageMaker R kernel.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-14627\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/sagemaker_r_datasources-059.jpg\" alt=\"\" width=\"800\" height=\"483\">\n<\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>In this post, we demonstrated how to connect to various data sources in your environment, such as Hive and PrestoDB on Amazon EMR, Amazon Athena, Amazon Redshift, and an Amazon Aurora MySQL-compatible cluster, and how to analyze, profile, and run statistical computations on that data using R from Amazon SageMaker. You can extend this method to other data sources via JDBC.<\/p>\n<hr>\n<h2>Author Bio<\/h2>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-full wp-image-14596\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/kunal-ghosh.jpg\" alt=\"\" width=\"100\" height=\"128\">Kunal Ghosh is a Solutions Architect at AWS. His passion is building efficient and effective solutions on the cloud, especially involving analytics, AI, data science, and machine learning. 
Besides family time, he likes reading, swimming, biking, and watching movies, and he is a foodie.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-full wp-image-14597\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/10\/gagan-brahmi.jpg\" alt=\"\" width=\"100\" height=\"132\">Gagan Brahmi is a Specialist Solutions Architect focused on Big Data &amp; Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/accessing-data-sources-from-amazon-sagemaker-r-kernels\/<\/p>\n","protected":false},"author":1,"featured_media":46,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/45"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=45"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/45\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/46"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-lea
rning\/wp-json\/wp\/v2\/media?parent=45"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=45"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=45"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}