{"id":1525,"date":"2022-02-02T20:07:54","date_gmt":"2022-02-02T20:07:54","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/02\/balance-your-data-for-machine-learning-with-amazon-sagemaker-data-wrangler\/"},"modified":"2022-02-02T20:07:54","modified_gmt":"2022-02-02T20:07:54","slug":"balance-your-data-for-machine-learning-with-amazon-sagemaker-data-wrangler","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/02\/balance-your-data-for-machine-learning-with-amazon-sagemaker-data-wrangler\/","title":{"rendered":"Balance your data for machine learning with Amazon SageMaker Data Wrangler"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Data Wrangler<\/a> is a new capability of <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface. It contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code.<\/p>\n<p>Today, we\u2019re excited to announce new transformations that allow you to balance your datasets easily and effectively for ML model training. 
We demonstrate how these transformations work in this post.<\/p>\n<h2>New balancing operators<\/h2>\n<p>The newly announced balancing operators are grouped under the <strong>Balance data<\/strong> transform type in the <strong>ADD TRANSFORM<\/strong> pane.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image001-500.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32593 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image001-500.png\" alt=\"\" width=\"500\" height=\"286\"><\/a><\/p>\n<p>Currently, the transform operators support only binary classification problems, in which the classifier is tasked with assigning each sample to one of two classes. When the number of samples in the majority (bigger) class is considerably larger than the number of samples in the minority (smaller) class, the dataset is considered imbalanced. This skew is challenging for ML algorithms and classifiers because the training process tends to be biased towards the majority class.<\/p>\n<p>Balancing schemes, which augment the data to be more balanced before training the classifier, were proposed to address this challenge. The simplest balancing methods either oversample the minority class by duplicating minority samples or undersample the majority class by removing majority samples. The idea of adding synthetic minority samples to tabular data was first proposed in the Synthetic Minority Oversampling Technique (SMOTE), where synthetic minority samples are created by interpolating pairs of the original minority points. 
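The two simplest schemes mentioned above can be sketched with pandas on a toy dataset (an illustration of the general idea only, not Data Wrangler's implementation; the column names are made up):

```python
import pandas as pd

# Toy imbalanced dataset: 8 majority ("no") rows vs. 2 minority ("yes") rows.
df = pd.DataFrame({"x": range(10), "label": ["no"] * 8 + ["yes"] * 2})

minority = df[df["label"] == "yes"]
majority = df[df["label"] == "no"]

# Random oversampling: duplicate minority rows (sampling with replacement)
# until the classes are the same size.
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=0)]
)

# Random undersampling: drop majority rows until the classes are the same size.
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=0), minority]
)

print(len(oversampled), len(undersampled))  # 16 4
```

Oversampling grows the toy dataset to 16 rows, while undersampling shrinks it to 4; SMOTE instead synthesizes new minority rows.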
SMOTE and other balancing schemes were extensively studied empirically and shown to improve prediction performance in various scenarios, as shown in the publication <a href=\"https:\/\/arxiv.org\/pdf\/2201.08528.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">To SMOTE, or not to SMOTE<\/a>.<\/p>\n<p>Data Wrangler now supports the following balancing operators as part of the <strong>Balance data<\/strong> transform:<\/p>\n<ul>\n<li><strong>Random oversampler<\/strong> \u2013 Randomly duplicate minority samples<\/li>\n<li><strong>Random undersampler<\/strong> \u2013 Randomly remove majority samples<\/li>\n<li><strong>SMOTE<\/strong> \u2013 Generate synthetic minority samples by interpolating real minority samples<\/li>\n<\/ul>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image003-500.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32594 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image003-500.png\" alt=\"\" width=\"500\" height=\"615\"><\/a><\/p>\n<p>Let\u2019s now discuss the different balancing operators in detail.<\/p>\n<h2>Random oversample<\/h2>\n<p>Random oversampling involves selecting random examples from the minority class with replacement and supplementing the training data with copies of those examples. Therefore, a single example may be selected multiple times. With the <strong>Random<\/strong> <strong>oversample<\/strong> transform type, Data Wrangler automatically oversamples the minority class for you by duplicating the minority samples in your dataset.<\/p>\n<h2>Random undersample<\/h2>\n<p>Random undersampling is the opposite of random oversampling. This method randomly selects and removes samples from the majority class, consequently reducing the number of examples in the majority class in the transformed data. 
The <strong>Random<\/strong> <strong>undersample<\/strong> transform type lets Data Wrangler automatically undersample the majority class for you by removing majority samples in your dataset.<\/p>\n<h2>SMOTE<\/h2>\n<p>In SMOTE, synthetic minority samples are added to the data to achieve the desired ratio between majority and minority samples. The synthetic samples are generated by interpolation of pairs of the original minority points. The <strong>SMOTE<\/strong> transform supports balancing datasets including numeric and non-numeric features. Numeric features are interpolated by weighted average. However, you can\u2019t apply weighted average interpolation to non-numeric features\u2014it\u2019s impossible to average <code>\u201cdog\u201d<\/code> and <code>\u201ccat\u201d<\/code> for example. Instead, non-numeric features are copied from either original minority sample according to the averaging weight.<\/p>\n<p>For example, consider two samples, A and B:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">A = [1, 2, \"dog\", \"carnivore\"]\nB = [0, 0, \"cow\", \"herbivore\"]<\/code><\/pre>\n<\/p><\/div>\n<p>Assume the samples are interpolated with weights 0.3 for sample A and 0.7 for sample B. Therefore, the numeric fields are averaged with these weights to yield 0.3 and 0.6, respectively. The next field is filled with <code>\u201cdog\u201d<\/code> with probability 0.3 and <code>\u201ccow\u201d<\/code> with probability 0.7. Similarly, the next one equals <code>\u201ccarnivore\u201d<\/code> with probability 0.3 and <code>\u201cherbivore\u201d<\/code> with probability 0.7. The random copying is done independently for each feature, so sample C below is a possible result:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">C = [0.3, 0.6, \"dog\", \"herbivore\"]<\/code><\/pre>\n<\/p><\/div>\n<p>This example demonstrates how the interpolation process could result in unrealistic synthetic samples, such as an herbivore dog. 
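The interpolation rule just described can be sketched as a small Python helper (hypothetical code for illustration; Data Wrangler's actual implementation may differ):

```python
import random

def smote_interpolate(a, b, weight_a, numeric_idx, rng=random):
    """Create one synthetic sample from minority samples a and b.

    Numeric features get a weighted average; each non-numeric feature is
    copied from a with probability weight_a, otherwise from b.
    """
    sample = []
    for i, (fa, fb) in enumerate(zip(a, b)):
        if i in numeric_idx:
            sample.append(weight_a * fa + (1 - weight_a) * fb)
        else:
            # Independent random draw per non-numeric feature.
            sample.append(fa if rng.random() < weight_a else fb)
    return sample

A = [1, 2, "dog", "carnivore"]
B = [0, 0, "cow", "herbivore"]

# Weights 0.3 for A and 0.7 for B: the numeric fields are always 0.3 and 0.6,
# while each categorical field is drawn independently on every call.
C = smote_interpolate(A, B, weight_a=0.3, numeric_idx={0, 1})
print(C)
```

Because each non-numeric feature is drawn independently, combinations that never occur in the original data, such as an herbivore dog, can appear.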
This is more common with categorical features but can occur in numeric features as well. Even though some synthetic samples may be unrealistic, SMOTE could still improve classification performance.<\/p>\n<p>To heuristically generate more realistic samples, SMOTE interpolates only pairs that are close in feature space. Technically, each sample is interpolated only with its k-nearest neighbors, where a common value for k is 5. In our implementation of SMOTE, only the numeric features are used to calculate the distances between points (the distances are used to determine the neighborhood of each sample). It\u2019s common to normalize the numeric features before calculating distances. Note that the numeric features are normalized only for the purpose of calculating the distance; the resulting interpolated features aren\u2019t normalized.<\/p>\n<p>Let\u2019s now balance the <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/adult\" target=\"_blank\" rel=\"noopener noreferrer\">Adult Dataset<\/a> (also known as the Census Income dataset) using the built-in SMOTE transform provided by Data Wrangler. This multivariate dataset includes six numeric features and eight string features. 
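The neighborhood computation described earlier (normalize the numeric features, then find each sample's k-nearest neighbors) can be sketched as follows; this is an illustrative version, not Data Wrangler's internal code:

```python
import numpy as np

def nearest_neighbors(numeric, k=5):
    """Return each sample's k nearest neighbors using only numeric features.

    Features are z-score normalized for the distance computation only,
    mirroring the heuristic described in the text; the normalized values
    are not used anywhere else.
    """
    x = np.asarray(numeric, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # guard against zero-variance columns
    z = (x - x.mean(axis=0)) / std
    # Pairwise squared Euclidean distances between all samples.
    d = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

# Two well-separated pairs of points: each point's nearest neighbor
# is the other member of its pair.
pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
nn = nearest_neighbors(pts, k=1)
print(nn.ravel().tolist())  # [1, 0, 3, 2]
```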
The associated task is binary classification: predict whether an individual\u2019s income exceeds $50,000 per year based on census data.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image005.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32581\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image005.png\" alt=\"\" width=\"1412\" height=\"488\"><\/a><\/p>\n<p>You can also see the distribution of the classes visually by creating a histogram using the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-analyses.html#data-wrangler-visualize-histogram\" target=\"_blank\" rel=\"noopener noreferrer\">histogram analysis type in Data Wrangler<\/a>. The target distribution is imbalanced, and the ratio of records with <code>&gt;50K<\/code> to <code>&lt;=50K<\/code> is about 1:4.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image007-1.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-32588 size-full alignnone\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image007-1.png\" alt=\"\" width=\"500\" height=\"303\"><\/a><\/p>\n<p>We can balance this data using the <strong>SMOTE<\/strong> operator found under the <strong>Balance Data<\/strong> transform in Data Wrangler with the following steps:<\/p>\n<ol>\n<li>Choose <code>income<\/code> as the target column.<\/li>\n<\/ol>\n<p>We want the distribution of this column to be more balanced.<\/p>\n<ol start=\"2\">\n<li>Set the desired ratio to <code>0.66<\/code>.<\/li>\n<\/ol>\n<p>With this setting, the ratio between the number of minority and majority samples becomes 2:3 (instead of the raw ratio of about 1:4).<\/p>\n<ol start=\"3\">\n<li>Choose <strong>SMOTE<\/strong> as the 
transform to use.<\/li>\n<li>Leave the default values for <strong>Number of neighbors<\/strong> and for whether to normalize.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image009-400.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32595 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image009-400.png\" alt=\"\" width=\"400\" height=\"453\"><\/a><\/li>\n<li>Choose <strong>Preview<\/strong> to preview the applied transformation, then choose <strong>Add<\/strong> to add the transform to your data flow.<\/li>\n<\/ol>\n<p>Now we can create a new histogram, as we did before, to see the rebalanced distribution of the classes. The following figure shows the histogram of the <code>income<\/code> column after balancing the dataset. The majority-to-minority ratio is now 3:2, as intended.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image011-1.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32589 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/ML-7992-image011-1.png\" alt=\"\" width=\"500\" height=\"305\"><\/a><\/p>\n<p>We can now export this new balanced data and train a classifier on it, which could yield superior prediction quality.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, we demonstrated how to balance imbalanced binary classification data using Data Wrangler. Data Wrangler offers three balancing operators (random undersampling, random oversampling, and SMOTE) to rebalance imbalanced datasets. 
All three methods offered by Data Wrangler support multi-modal data including numeric and non-numeric features.<\/p>\n<p>As a next step, we recommend replicating the example in this post in your own Data Wrangler data flow to see what we discussed in action. If you\u2019re new to Data Wrangler or <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Studio<\/a>, refer to <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-getting-started.html\" target=\"_blank\" rel=\"noopener noreferrer\">Get Started with Data Wrangler<\/a>. If you have any questions related to this post, please leave them in the comments section.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Yotam-Elor.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-32585 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Yotam-Elor.jpg\" alt=\"\" width=\"100\" height=\"127\"><\/a>Yotam Elor<\/strong> is a Senior Applied Scientist at Amazon SageMaker. His research interests are in machine learning, particularly for tabular data.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Arunprasath-Shankar.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-32544 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/31\/Arunprasath-Shankar.jpg\" alt=\"\" width=\"100\" height=\"124\"><\/a>Arunprasath Shankar<\/strong> is an Artificial Intelligence and Machine Learning (AI\/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. 
In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/balance-your-data-for-machine-learning-with-amazon-sagemaker-data-wrangler\/<\/p>\n","protected":false},"author":0,"featured_media":1526,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1525"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1525"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1525\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1526"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1525"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1525"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1525"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}