{"id":290,"date":"2020-09-25T23:17:22","date_gmt":"2020-09-25T23:17:22","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/09\/25\/data-visualization-and-anomaly-detection-using-amazon-athena-and-pandas-from-amazon-sagemaker\/"},"modified":"2020-09-25T23:17:22","modified_gmt":"2020-09-25T23:17:22","slug":"data-visualization-and-anomaly-detection-using-amazon-athena-and-pandas-from-amazon-sagemaker","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/09\/25\/data-visualization-and-anomaly-detection-using-amazon-athena-and-pandas-from-amazon-sagemaker\/","title":{"rendered":"Data visualization and anomaly detection using Amazon Athena and Pandas from Amazon SageMaker"},"content":{"rendered":"<div id=\"\">\n<p>Many organizations use <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> for their machine learning (ML) requirements and source data from a data lake stored on <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3). The petabyte scale source data on Amazon S3 may not always be clean because data lakes ingest data from several source systems, such as like flat files, external feeds, databases, and Hadoop. It may contain extreme values in source attributes, considered as outliers in the data. Outliers arise due to changes in system behavior, fraudulent behavior, human error, instrument error, missing data, or simply through natural deviations in populations. Outliers in training data can easily impact model accuracy of many ML models, like linear and logistic regression. These anomalies result in ML scientists and analysts facing skewed results. Outliers can dramatically impact ML models and change the model equation completely with bad predictions or estimations.<\/p>\n<p>Data scientists and analysts are looking for a way to remove outliers. 
Analysts often come from a strong data background and are fluent in writing SQL queries and working with programming languages. The following tools are a natural choice for ML scientists to remove outliers and carry out data visualization:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/aws.amazon.com\/athena\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena<\/a> \u2013 An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL<\/li>\n<li>\n<a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Pandas<\/a> \u2013 An open-source, high-performance, easy-to-use library that provides data structures and data analysis tools, and integrates with plotting libraries such as matplotlib, for the <a href=\"https:\/\/www.python.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Python<\/a> programming language<\/li>\n<li>\n<a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> \u2013 A fully managed service that provides you with the ability to build, train, and deploy ML models quickly<\/li>\n<\/ul>\n<p>To illustrate how to use <a href=\"https:\/\/aws.amazon.com\/athena\/\" target=\"_blank\" rel=\"noopener noreferrer\">Athena<\/a> with <a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Pandas<\/a> for anomaly detection and visualization using Amazon SageMaker, we clean a set of <a href=\"https:\/\/registry.opendata.aws\/nyc-tlc-trip-records-pds\/\" target=\"_blank\" rel=\"noopener noreferrer\">New York City Taxi and Limousine Commission (TLC) Trip Record Data<\/a> by removing outlier records. In this dataset, outliers are records in which a taxi trip\u2019s duration spans multiple days, is 0 seconds, or is less than 0 seconds. 
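The outlier rule just described translates directly into a few lines of Pandas; the following is a minimal, self-contained sketch over three synthetic trips (the two timestamp column names follow the TLC schema, but the data is invented for illustration):

```python
import pandas as pd

# Three synthetic trips: one normal, one zero-length, one spanning multiple days
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2019-01-01 09:00:00", "2019-01-01 10:00:00", "2019-01-01 11:00:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2019-01-01 09:15:00", "2019-01-01 10:00:00", "2019-01-03 11:00:00"]),
})

# Trip duration in seconds, mirroring Athena's date_diff('second', ...)
duration = (trips["tpep_dropoff_datetime"]
            - trips["tpep_pickup_datetime"]).dt.total_seconds()

# Flag non-positive durations and trips longer than one day (86,400 seconds)
outlier = (duration <= 0) | (duration > 86_400)
print(outlier.tolist())  # → [False, True, True]
```

The rest of this post applies the same rule at scale with Athena SQL instead of in-memory Pandas.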
Then we use Pandas with the matplotlib library to plot graphs that visualize the trip duration values.<\/p>\n<h2>Solution overview<\/h2>\n<p>To implement this solution, you perform the following high-level steps:<\/p>\n<ol>\n<li>Create an <a href=\"https:\/\/aws.amazon.com\/glue\/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;whats-new-cards.sort-order=desc\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Glue<\/a> Data Catalog and browse the data on the Athena console.<\/li>\n<li>Create an Amazon SageMaker Jupyter notebook and install PyAthena.<\/li>\n<li>Identify anomalies using Athena SQL and Pandas from the Jupyter notebook.<\/li>\n<li>Visualize the data and remove outliers using Athena SQL and Pandas.<\/li>\n<\/ol>\n<p>The following diagram illustrates the architecture of this solution.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15561 size-full\" title=\"Solution Architecture\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/1-Architecture.jpg\" alt=\"\" width=\"900\" height=\"484\"><\/p>\n<h2>Prerequisites<\/h2>\n<p>To follow this post, you should be familiar with the following:<\/p>\n<ul>\n<li>The Amazon S3 file upload process<\/li>\n<li>AWS Glue crawlers and the Data Catalog<\/li>\n<li>Basic SQL queries<\/li>\n<li>Jupyter notebooks<\/li>\n<li>Assigning a basic <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) policy to a role<\/li>\n<\/ul>\n<h2>Preparing the data<\/h2>\n<p>For this post, we use <a href=\"https:\/\/registry.opendata.aws\/nyc-tlc-trip-records-pds\/\" target=\"_blank\" rel=\"noopener noreferrer\">New York City Taxi and Limousine Commission (TLC) Trip Record Data<\/a>, which is a publicly available dataset.<\/p>\n<ol>\n<li>Download the file <code>yellow_tripdata_2019-01.csv<\/code> to your local machine.<\/li>\n<li>Create the S3 bucket <code>s3-yellow-cab-trip-details<\/code> 
(your bucket name will be different).<\/li>\n<li>Upload the file to your bucket using the Amazon S3 console.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone\" title=\"AWS s3 console\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/2-s3-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"326\"><\/p>\n<h2>Creating the Data Catalog and browsing the data<\/h2>\n<p>After you upload the data to Amazon S3, you create the Data Catalog in AWS Glue. This allows you to run SQL queries using Athena.<\/p>\n<ol>\n<li>On the AWS Glue console, create a new database.<\/li>\n<li>For <strong>Database name<\/strong>, enter <code>db_yellow_cab_trip_details<\/code>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15563 size-full\" title=\"Adding database\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/3-Add-database.jpg\" alt=\"\" width=\"600\" height=\"454\"><\/p>\n<ol start=\"3\">\n<li>Create an AWS Glue crawler to gather the metadata in the file and catalog it.<\/li>\n<\/ol>\n<p>For this post, we use the database (<code>db_yellow_cab_trip_details<\/code>) and save tables with the added prefix <code>src_<\/code>.<\/p>\n<ol start=\"4\">\n<li>Run the crawler.<\/li>\n<\/ol>\n<p>The crawler can take 2\u20133 minutes to complete. 
You can check the status on <a href=\"http:\/\/aws.amazon.com\/cloudwatch\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudWatch<\/a>.<\/p>\n<p>The following screenshot shows the crawler details on the AWS Glue console.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15564 size-full\" title=\"AWS Glue console\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/4-AWS-glue-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"546\"><\/p>\n<p>When the crawler is complete, the table is available in the Data Catalog. All the metadata and column-level information is displayed with corresponding data types.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15565 size-full\" title=\"Data Catalog\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/5-Screenshot-1.jpg\" alt=\"\" width=\"900\" height=\"404\"><\/p>\n<p>We can now check the data on the Athena console to make sure we can read the file as a table and run a SQL query.<\/p>\n<p>Run your query with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-sql\">SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10;<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows your output on the Athena console.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15566 size-full\" title=\"Athena Console\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/6-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"386\"><\/p>\n<h2>Creating a Jupyter notebook and installing PyAthena<\/h2>\n<p>You now create a new notebook instance from Amazon SageMaker and install PyAthena using Jupyter.<\/p>\n<p>Amazon SageMaker provides managed Jupyter notebooks that allow you to write code in Python, Julia, R, or Scala to 
explore, analyze, and do modeling with a small set of data.<\/p>\n<p>Make sure the role used for your notebook has access to Athena (use IAM policies to verify, and add <code>AmazonS3FullAccess<\/code> and <code>AmazonAthenaFullAccess<\/code>).<\/p>\n<p>To create your notebook, complete the following steps:<\/p>\n<ol>\n<li>On the Amazon SageMaker console, under <strong>Notebook<\/strong>, choose <strong>Notebook instances<\/strong>.<\/li>\n<li>Choose <strong>Create notebook instance<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15567 size-full\" title=\"Creating a notebook instance\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/7-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"190\"><\/p>\n<ol start=\"3\">\n<li>On the <strong>Create notebook instance<\/strong> page, enter a name and choose an instance type.<\/li>\n<\/ol>\n<p>We recommend using an ml.m4.10xlarge instance, due to the size of the dataset. 
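As a rough sizing aid (an illustration added here, not part of the original walkthrough), you can estimate how much memory a DataFrame of a given row count needs before picking an instance; this sketch uses a synthetic one-million-row frame instead of the actual TLC file:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the taxi data: one million rows of timestamps and fares
n = 1_000_000
df = pd.DataFrame({
    "tpep_pickup_datetime": pd.date_range("2019-01-01", periods=n, freq="s"),
    "fare_amount": np.random.default_rng(0).uniform(2.5, 60.0, n),
})

# memory_usage(deep=True) reports bytes per column, including the index
mem_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"Approximate in-memory size: {mem_mb:.0f} MB")
```

Scaling the per-row footprint up to the full file, and leaving headroom for the copies that sorting and plotting create, gives a rough lower bound on the RAM your instance needs.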
You should choose an appropriate instance depending on your data; costs vary for different instances.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone\" title=\"Creating a notebook instance\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/8-Create-notebook-instance.jpg\" alt=\"\" width=\"900\" height=\"1198\"><\/p>\n<p>Wait until the Notebook instance status shows as <code>InService<\/code> (this step can take up to 5 minutes).<\/p>\n<ol start=\"4\">\n<li>When the instance is ready, choose <strong>Open Jupyter<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15569 size-full\" title=\"Opening Jupyter\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/9-notebook-nyc-taxi-trip.jpg\" alt=\"\" width=\"900\" height=\"370\"><\/p>\n<ol start=\"5\">\n<li>Open the <code>conda_python3<\/code> kernel from the notebook instance.<\/li>\n<li>Enter the following commands to install PyAthena:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">! pip install --upgrade pip\r\n! pip install PyAthena\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the output.<\/p>\n<p><strong><em> <img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15570 size-full\" title=\"Output screenshot\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/10-Jupyter-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"714\"><\/em><\/strong><\/p>\n<p>You can also install PyAthena when you create the notebook instance by using <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/notebook-lifecycle-config.html\" target=\"_blank\" rel=\"noopener noreferrer\">lifecycle configurations<\/a>. 
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">#!\/bin\/bash\r\nsudo -u ec2-user -i &lt;&lt;'EOF'\r\nsource \/home\/ec2-user\/anaconda3\/bin\/activate python3\r\npip install --upgrade pip\r\npip install --upgrade  PyAthena\r\nsource \/home\/ec2-user\/anaconda3\/bin\/deactivate\r\nEOF\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows where you enter the preceding code in the <strong>Scripts<\/strong> section when creating a lifecycle configuration.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15571 size-full\" title=\"Entering code into the scripts section\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/11-Create-lifecycle-configuration.jpg\" alt=\"\" width=\"900\" height=\"864\"><\/p>\n<p>You can run a SQL query from the notebook to validate connectivity to Athena and pull data for visualization.<\/p>\n<p>To import the libraries, enter the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from pyathena import connect\r\nimport pandas as pd\r\nimport matplotlib.pyplot as plt\r\nimport numpy as np\r\n\r\nfrom IPython.core.display import display, HTML\r\ndisplay(HTML(\"&lt;style&gt;.container { width:100% !important; }&lt;\/style&gt;\"))\r\n<\/code><\/pre>\n<\/div>\n<p>To connect to Athena, enter the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">conn = connect(s3_staging_dir='s3:\/\/&lt;your-Query-result-location&gt;',region_name='us-east-1')<\/code><\/pre>\n<\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15583 size-full\" title=\"Importing libraries\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/11A.jpg\" alt=\"\" width=\"900\" height=\"271\"><\/p>\n<p>To check the 
sample data, enter the following query:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">df_sample = pd.read_sql(\"SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10\", conn)\r\ndf_sample\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the output.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15572 size-full\" title=\"Running query on Athena connection\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/12-Check-sample-data-by.jpg\" alt=\"\" width=\"900\" height=\"345\"><\/p>\n<h2>Detecting anomalies with Athena, Pandas, and Amazon SageMaker<\/h2>\n<p>Now that we can connect to Athena, we can run SQL queries to find the records that have unusual <code>trip_duration<\/code> values.<\/p>\n<p>The following Athena query checks for anomalies in the <code>trip_duration<\/code> data to find the top 50 records with the maximum duration:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">df_anomaly_duration = pd.read_sql(\"select tpep_dropoff_datetime,tpep_pickup_datetime, \r\ndate_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second, \r\ndate_diff('minute', cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_minute, \r\ndate_diff('hour',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_hour, \r\ndate_diff('day',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_day \r\nfrom db_yellow_cab_trip_details.src_yellow_cab_trip_details \r\norder by 3 desc limit 50\", conn)\r\n\r\ndf_anomaly_duration\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the output; there are many outliers (trips with a duration greater than 1 day).<\/p>\n<p><img decoding=\"async\" 
loading=\"lazy\" class=\"alignnone wp-image-15579 size-full\" title=\"Athena query output\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/12A.jpg\" alt=\"\" width=\"900\" height=\"497\"><\/p>\n<p>The output shows the duration in seconds, minutes, hours, and days.<\/p>\n<p>The following query checks for anomalies and shows the top 50 records with the minimum duration (a negative value or 0 seconds):<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">df_anomaly_duration = pd.read_sql(\"select tpep_dropoff_datetime,tpep_pickup_datetime, \r\ndate_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second, \r\ndate_diff('minute', cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_minute, \r\ndate_diff('hour',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_hour, \r\ndate_diff('day',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) as duration_day \r\nfrom db_yellow_cab_trip_details.src_yellow_cab_trip_details \r\norder by 3 asc limit 50\", conn)\r\n\r\ndf_anomaly_duration\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the output; multiple trips have a negative value or a duration of 0.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15573 size-full\" title=\"Athena query output\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/13-Sample-athena-queries.jpg\" alt=\"\" width=\"900\" height=\"499\"><\/p>\n<p>Similarly, we can use different SQL queries to analyze the data and find other outliers. 
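One such alternative rule (added here for illustration, not part of the original walkthrough) is to flag durations outside 1.5 times the interquartile range; the following is a minimal Pandas sketch on synthetic duration values:

```python
import pandas as pd

# Synthetic trip durations in seconds; the last value is an obvious outlier
duration = pd.Series([300, 420, 610, 540, 480, 900, 760, 650, 200_000])

q1, q3 = duration.quantile(0.25), duration.quantile(0.75)
iqr = q3 - q1

# Keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
clean = duration[duration.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(len(clean))  # → 8 of the 9 rows survive
```

The same bounds could also be computed server-side in Athena (for example with the `approx_percentile` function) and applied in the query's WHERE clause, so only clean rows reach the notebook.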
We can also clean the data by using SQL queries and, if needed, save the data in Amazon S3 with <a href=\"https:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/ctas-examples.html\" target=\"_blank\" rel=\"noopener noreferrer\">CTAS queries<\/a>.<\/p>\n<h2>Visualizing the data and removing outliers<\/h2>\n<p>Pull the data into a Pandas DataFrame using the following Athena query, and use <code>matplotlib.pyplot<\/code> to create a graph that shows the outliers:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">df_full = pd.read_sql(\"SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second \r\nfrom db_yellow_cab_trip_details.src_yellow_cab_trip_details \", conn)\r\nplt.figure(figsize=(12,12))\r\nplt.scatter(range(len(df_full[\"duration_second\"])), np.sort(df_full[\"duration_second\"]))\r\nplt.xlabel('index')\r\nplt.ylabel('duration_second')\r\n\r\nplt.show()\r\n<\/code><\/pre>\n<\/div>\n<p>The process of plotting the full dataset can take 7\u201310 minutes. To reduce the time, add a limit to the number of records in the query:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-sql\">SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second \r\nfrom db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 100000<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the output.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15574 size-full\" title=\"Visualization output\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/14-Visualization.jpg\" alt=\"\" width=\"900\" height=\"759\"><\/p>\n<p>To ignore outliers, run the following query. 
You replot the graph after removing the outlier records in which the duration is equal to or less than 0 seconds or longer than 1 day:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">df_clean_data = pd.read_sql(\"SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second from db_yellow_cab_trip_details.src_yellow_cab_trip_details where \r\ndate_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp))  &gt; 0 and date_diff('day',cast(tpep_pickup_datetime as timestamp),cast(tpep_dropoff_datetime as timestamp)) &lt; 1 \", conn)\r\nplt.figure(figsize=(12,12))\r\nplt.scatter(range(len(df_clean_data[\"duration_second\"])), np.sort(df_clean_data[\"duration_second\"]))\r\nplt.xlabel('index')\r\nplt.ylabel('duration_second')\r\nplt.show()\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the output.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15575 size-full\" title=\"Visualization output\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/15-Visualization2.jpg\" alt=\"\" width=\"900\" height=\"828\"><\/p>\n<h2>Cleaning up<\/h2>\n<p>When you\u2019re done, delete the notebook instance to avoid incurring recurring costs.<\/p>\n<ol>\n<li>On the Amazon SageMaker console, choose your notebook instance.<\/li>\n<li>Choose <strong>Stop<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15576 size-full\" title=\"Deleting notebook instance\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/16-notebook-nyc-taxi-trip.jpg\" alt=\"\" width=\"900\" height=\"250\"><\/p>\n<ol start=\"3\">\n<li>When the status shows as Stopped, choose <strong>Delete<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone 
wp-image-15577 size-full\" title=\"Deleting notebook instance\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/02\/17.jpg\" alt=\"\" width=\"900\" height=\"173\"><\/p>\n<h2>Conclusion<\/h2>\n<p>This post walked you through finding and removing outliers from your dataset and visualizing the data. We used an Amazon SageMaker notebook to run analytical queries using Athena SQL, and used Athena to read the dataset, which is saved in Amazon S3 with the metadata catalog in AWS Glue. We used queries in Athena to find anomalies in the data and ignore these outliers. We also used the notebook instance to plot graphs using the <code>matplotlib.pyplot<\/code> library.<\/p>\n<p>You can try this solution for your use cases to remove outliers using Athena SQL and an Amazon SageMaker notebook. If you have comments or feedback, please leave them below.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15656 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/04\/Rahul-Sonawane.jpg\" alt=\"\" width=\"101\" height=\"119\"><strong>Rahul Sonawane<\/strong> is a Senior Consultant, Big Data at the Shared Delivery Teams at Amazon Web Services.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15655 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/04\/Behram-Irani.jpg\" alt=\"\" width=\"99\" height=\"124\"><strong>Behram Irani<\/strong> is a Senior Solutions Architect, Data &amp; Analytics at Amazon Web 
Services.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/data-visualization-and-anomaly-detection-using-amazon-athena-and-pandas-from-amazon-sagemaker\/<\/p>\n","protected":false},"author":0,"featured_media":291,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/290"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=290"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/290\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/291"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=290"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=290"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=290"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}