{"id":1523,"date":"2022-02-02T18:57:24","date_gmt":"2022-02-02T18:57:24","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/02\/prepare-and-analyze-json-and-orc-data-with-amazon-sagemaker-data-wrangler\/"},"modified":"2022-02-02T18:57:24","modified_gmt":"2022-02-02T18:57:24","slug":"prepare-and-analyze-json-and-orc-data-with-amazon-sagemaker-data-wrangler","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/02\/prepare-and-analyze-json-and-orc-data-with-amazon-sagemaker-data-wrangler\/","title":{"rendered":"Prepare and analyze JSON and ORC data with Amazon SageMaker Data Wrangler"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/sagemaker\/data-wrangler\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Data Wrangler<\/a> is a new capability of <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications via a visual interface. Data preparation is a crucial step of the ML lifecycle, and Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and analyze data for ML in a seamless, visual, low-code experience. It lets you easily and quickly connect to AWS components like <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3), <a href=\"https:\/\/aws.amazon.com\/athena\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena<\/a>, <a href=\"https:\/\/aws.amazon.com\/redshift\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Redshift<\/a>, and <a href=\"https:\/\/aws.amazon.com\/lake-formation\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lake Formation<\/a>, and external sources like Snowflake. 
Data Wrangler also supports standard file formats such as CSV and Parquet.<\/p>\n<p>Data Wrangler now additionally supports Optimized Row Columnar (<a href=\"https:\/\/orc.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">ORC<\/a>), JavaScript Object Notation (JSON), and JSON Lines (JSONL) file formats:<\/p>\n<ul>\n<li><strong>ORC<\/strong> \u2013 The ORC file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data. ORC is widely used in the Hadoop ecosystem.<\/li>\n<li><strong>JSON <\/strong>\u2013 The JSON file format is a lightweight, commonly used data interchange format.<\/li>\n<li><strong>JSONL <\/strong>\u2013 JSON Lines, also called newline-delimited JSON, is a convenient format for storing structured data that may be processed one record at a time.<\/li>\n<\/ul>\n<p>You can preview ORC, JSON, and JSONL data prior to importing the datasets into Data Wrangler. After you import the data, you can also use one of the newly launched transformers to work with columns that contain JSON strings or arrays that are commonly found in nested JSONs.<\/p>\n<h2>Import and analyze ORC data with Data Wrangler<\/h2>\n<p>Importing ORC data in Data Wrangler is easy and similar to importing files in any other supported format. 
Browse to your ORC file in Amazon S3 and in the <strong>DETAILS <\/strong>pane, choose ORC as the file type during import.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image001.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32568 size-medium\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image001-274x300.png\" alt=\"\" width=\"274\" height=\"300\"><\/a><\/p>\n<p>If you\u2019re new to Data Wrangler, review <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-getting-started.html\" target=\"_blank\" rel=\"noopener noreferrer\">Get Started with Data Wrangler<\/a>. Also, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-import.html\" target=\"_blank\" rel=\"noopener noreferrer\">Import<\/a> to learn about the various import options.<\/p>\n<h2>Import and analyze JSON data with Data Wrangler<\/h2>\n<p>Now let\u2019s import files in JSON format with Data Wrangler and work with columns that contain JSON strings or arrays. We also demonstrate how to deal with nested JSONs. With Data Wrangler, importing JSON files from Amazon S3 is a seamless process. The process is similar to importing files in any other supported format. After you import the files, you can preview the JSON files as shown in the following screenshot. 
Make sure to set the file type to JSON in the <strong>DETAILS<\/strong> pane.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image003.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32569\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image003.png\" alt=\"\" width=\"1276\" height=\"593\"><\/a><\/p>\n<p>Next, let\u2019s work on structured columns in the imported JSON file.<\/p>\n<p>To deal with structured columns in JSON files, Data Wrangler is introducing two new transforms: <strong>Flatten structured column<\/strong> and <strong>Explode array column<\/strong>, which can be found under the <strong>Handle structured column <\/strong>option in the <strong>ADD TRANSFORM<\/strong> pane.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image005.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32570\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image005.png\" alt=\"\" width=\"1276\" height=\"591\"><\/a><\/p>\n<p>Let\u2019s start by applying the <strong>Explode array column<\/strong> transform to one of the columns in our imported data. 
Before applying the transform, we can see the column <code>topping<\/code> is an array of JSON objects with <code>id<\/code> and <code>type<\/code> keys.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image007.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32571\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image007.png\" alt=\"\" width=\"1276\" height=\"590\"><\/a><\/p>\n<p>After we apply the transform, we can observe the new rows added as a result. Each element in the array is now a new row in the resulting DataFrame.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image009.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32572\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image009.png\" alt=\"\" width=\"1276\" height=\"591\"><\/a><\/p>\n<p>Now let\u2019s apply the <strong>Flatten structured column<\/strong> transform on the <code>topping_flattened<\/code> column that was created as a result of the <strong>Explode array column<\/strong> transformation we applied in the previous step.<\/p>\n<p>Before applying the transform, we can see the keys <code>id<\/code> and <code>type<\/code> in the <code>topping_flattened<\/code> column.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image011.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32573\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image011.png\" alt=\"\" width=\"1276\" height=\"593\"><\/a><\/p>\n<p>After applying the transform, we can now observe the keys <code>id<\/code> and 
<code>type<\/code> under the <code>topping_flattened<\/code> column as new columns <code>topping_flattened_id<\/code> and <code>topping_flattened_type<\/code>, which are created as a result of the transformation. You also have the option to flatten only specific keys by entering the comma-separated key names for <strong>Keys to flatten on<\/strong>. If left empty, all the keys inside the JSON string or struct are flattened.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image013.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32575\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7982-image013.png\" alt=\"\" width=\"1276\" height=\"590\"><\/a><\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, we demonstrated how to import files in ORC and JSON formats with Data Wrangler. We also applied the newly launched transformations that allow us to transform any structured columns in JSON data. This makes working with columns that contain JSON strings or arrays a seamless experience.<\/p>\n<p>As next steps, we recommend you replicate the demonstrated examples in your own Data Wrangler visual interface. If you have any questions related to Data Wrangler, feel free to leave them in the comment section.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Balaji-Tummala.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-32545 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Balaji-Tummala.png\" alt=\"\" width=\"100\" height=\"111\"><\/a>Balaji Tummala<\/strong> is a Software Development Engineer at Amazon SageMaker. 
He helps support Amazon SageMaker Data Wrangler and is passionate about building performant and scalable software. Outside of work, he enjoys reading fiction and playing volleyball.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Arunprasath-Shankar.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-32544 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Arunprasath-Shankar.jpg\" alt=\"\" width=\"100\" height=\"124\"><\/a>Arunprasath Shankar<\/strong> is an Artificial Intelligence and Machine Learning (AI\/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/prepare-and-analyze-json-and-orc-data-with-amazon-sagemaker-data-wrangler\/<\/p>\n","protected":false},"author":0,"featured_media":1524,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1523"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1523"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1523\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/
v2\/media\/1524"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1523"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1523"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1523"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}