{"id":760,"date":"2021-01-20T03:01:54","date_gmt":"2021-01-20T03:01:54","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/01\/20\/labeling-mixed-source-industrial-datasets-with-amazon-sagemaker-ground-truth\/"},"modified":"2021-01-20T03:01:54","modified_gmt":"2021-01-20T03:01:54","slug":"labeling-mixed-source-industrial-datasets-with-amazon-sagemaker-ground-truth","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/01\/20\/labeling-mixed-source-industrial-datasets-with-amazon-sagemaker-ground-truth\/","title":{"rendered":"Labeling mixed-source, industrial datasets with Amazon SageMaker Ground Truth"},"content":{"rendered":"<div id=\"\">\n<p>Prior to using any kind of supervised machine learning (ML) algorithm, data has to be labeled. <a href=\"https:\/\/aws.amazon.com\/sagemaker\/groundtruth\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Ground Truth<\/a> simplifies and accelerates this task. Ground Truth uses pre-defined templates to assign <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-label-images.html\" target=\"_blank\" rel=\"noopener noreferrer\">labels that classify the content of images<\/a> or <a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/new-label-videos-with-amazon-sagemaker-ground-truth\/\" target=\"_blank\" rel=\"noopener noreferrer\">videos<\/a> or <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/verifying-and-adjusting-your-data-labels-to-create-higher-quality-training-datasets-with-amazon-sagemaker-ground-truth\/\" target=\"_blank\" rel=\"noopener noreferrer\">verify existing labels<\/a>. Ground Truth allows you to define workflows for labeling various kinds of data, such as text, video, or images, without writing any code. 
Although these templates are applicable to a wide range of use cases in which the data to be labeled is in a single format or from a single source, industrial workloads often require labeling data from different sources and in different formats. This post explores the use case of industrial welding data consisting of sensor readings and images to show how to implement customized, complex, mixed-source labeling workflows using Ground Truth.<\/p>\n<p>For this post, you deploy an <a href=\"https:\/\/aws.amazon.com\/cloudformation\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> template in your AWS account to provision the foundational resources to get started with implementing this labeling workflow. This provides you with hands-on experience with the following topics:<\/p>\n<ul>\n<li>Creating a private labeling workforce in Ground Truth<\/li>\n<li>Creating a custom labeling job using the Ground Truth framework with the following components:\n<ul>\n<li>Designing a pre-labeling <a href=\"https:\/\/aws.amazon.com\/lambda\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> function that pulls data from different sources and runs a format conversion where necessary<\/li>\n<li>Implementing a customized labeling user interface in Ground Truth using crowd templates that dynamically load the data generated by the pre-labeling Lambda function<\/li>\n<li>Consolidating labels from multiple workers using a customized post-labeling Lambda function<\/li>\n<\/ul>\n<\/li>\n<li>Configuring a custom labeling job using Ground Truth with a customized interface for displaying multiple pieces of data that have to be labeled as a single item<\/li>\n<\/ul>\n<p>Prior to diving deep into the implementation, I provide an introduction to the use case and show how the Ground Truth custom labeling framework eases the implementation of highly complex labeling workflows. 
To make full use of this post, you need an AWS account on which you can deploy CloudFormation templates. The total cost incurred on your account for following this post is under $1.<\/p>\n<h2>Labeling complex datasets for industrial welding quality control<\/h2>\n<p>Although the mechanisms discussed in this post are generally applicable to any labeling workflow with different data formats, I use data from a welding quality control use case. In this use case, the manufacturing company running the welding process wants to predict whether the welding result will be OK or if a number of anomalies have occurred during the process. To implement this using a supervised ML model, you need to obtain labeled data with which to train the ML model, such as datasets representing welding processes that need to be labeled to indicate whether the process was normal or not. We implement this labeling process (not the ML or modeling process) using Ground Truth, which allows welding experts to make assessments about the result of a welding and assign this result to a dataset consisting of images and sensor data.<\/p>\n<p>The CloudFormation template creates an <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket in your AWS account that contains images (prefix <code>images<\/code>) and CSV files (prefix <code>sensor_data<\/code>). 
The images contain pictures taken during an industrial welding process similar to the following, where a welding beam is applied onto a metal surface (for image source, see <a href=\"https:\/\/www.kaggle.com\/danielbacioiu\/tig-stainless-steel-304\" target=\"_blank\" rel=\"noopener noreferrer\">TIG Stainless Steel 304<\/a>):<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17964 aligncenter\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-1.jpg\" alt=\"\" width=\"800\" height=\"439\"><\/p>\n<p>\u00a0<\/p>\n<p>The CSV files contain sensor data representing current, electrode position, and voltage measured by sensors on the welding machine. For the full dataset, see the <a href=\"https:\/\/github.com\/wzx140\/welding_prediction\/\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>. A raw sample of this CSV data is as follows:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-csv\">0|96.19|1023|420|4.5|4.5|1|8\r\n0.1|96.13|894|424|4.5|4.5|1|8\r\n0.2|96.06|884|425|4.5|4.5|1|8\r\n0.3|96.05|884|426|4.5|4.5|1|8\r\n0.4|96.12|887|426|4.5|4.5|1|8\r\n0.5|96.17|902|426|4.5|4.5|2|8\r\n0.6|95.82|974|426|4.5|4.5|2|8\r\n0.7|95.45|1304|426|4.5|4.5|3|8\r\n0.8|95.15|1410|428|4.5|4.5|3|8\r\n0.9|94.96|1446|428|4.5|4.5|3|8\r\n1|94.79|1464|428|4.5|4.5|3|8\r\n...<\/code><\/pre>\n<\/p><\/div>\n<p>The first column of the data is a timestamp in milliseconds normalized to the start of the welding process. Each row consists of various sensor values associated with the timestamp. The second column is the electrode position, the third column is the current, and the fourth column is the voltage (the remaining columns are irrelevant here). 
For instance, the row with timestamp <code>1<\/code>, 100 milliseconds after the start of the welding process, has an electrode position of <code>94.79<\/code>, a current of <code>1464<\/code>, and a voltage of <code>428<\/code>.<\/p>\n<p>Because it\u2019s difficult for humans to make assessments using the raw CSV data, I also show how to preprocess such data on the fly for labeling and turn it into more easily readable plots. This way, the welding experts can view the images and the plots to make their assessment about the welding process.<\/p>\n<h2>Deploying the CloudFormation template<\/h2>\n<p>To simplify the setup and configurations needed in the following, I created a CloudFormation template that deploys several foundations into your AWS account. To start this process, complete the following steps:<\/p>\n<ol>\n<li>Sign in to your AWS account.<\/li>\n<li>Choose one of the following links, depending on which AWS Region you\u2019re using:<\/li>\n<\/ol>\n<ol start=\"3\">\n<li>Keep all the parameters as they are and select <strong>I acknowledge that AWS CloudFormation might create IAM resources with custom names<\/strong> and <strong>I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND<\/strong><strong>.<\/strong><\/li>\n<li>Choose <strong>Create stack<\/strong> to start the deployment.<\/li>\n<\/ol>\n<p>The deployment takes about 3\u20135 minutes, during which time a bucket with data to label, some AWS Lambda functions, and an <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) role are deployed. The process is complete when the status of the deployment switches to <code>CREATE_COMPLETE<\/code>.<\/p>\n<p>The <strong>Outputs<\/strong> tab has additional information, such as the Amazon S3 path to the manifest file, which you use throughout this post. 
Therefore, it\u2019s recommended to keep this browser tab open and follow the rest of the post in another tab.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17965\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-2.jpg\" alt=\"\" width=\"800\" height=\"289\"><\/p>\n<h2>Creating a Ground Truth labeling workforce<\/h2>\n<p>Ground Truth offers three options for defining <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-workforce-management.html\">workforces<\/a> that complete the labeling: <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-workforce-management-public.html\">Amazon Mechanical Turk<\/a>, <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-workforce-management-vendor.html\">vendor-specific workforces<\/a>, and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-workforce-private.html\">private workforces<\/a>. In this section, we configure a private workforce because we want to complete the labeling ourselves. Create a private workforce with the following steps:<\/p>\n<ol>\n<li>On the Amazon SageMaker console, under <strong>Ground Truth<\/strong>, choose <strong>Labeling workforces<\/strong>.<\/li>\n<li>On the <strong>Private <\/strong>tab, choose <strong>Create private team<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17966\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-3.jpg\" alt=\"\" width=\"800\" height=\"323\"><\/p>\n<ol start=\"3\">\n<li>Enter a name for the labeling workforce. 
For our use case, I enter welding-experts.<\/li>\n<li>Select <strong>Invite new workers by email<\/strong>.<\/li>\n<li>Enter your e-mail address, an organization name, and a contact e-mail (which may be the same as the one you just entered).<\/li>\n<li>Choose <strong>Create private team<\/strong>.<\/li>\n<\/ol>\n<p>The console confirms the creation of the labeling workforce at the top of the screen. When you refresh the page, the new workforce shows on the <strong>Private <\/strong>tab, under <strong>Private teams<\/strong>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17968\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-4.jpg\" alt=\"\" width=\"800\" height=\"255\"><\/p>\n<p>You also receive an e-mail with login instructions, including a temporary password and a link to open the login page.<\/p>\n<ol start=\"7\">\n<li>Choose the link and use your e-mail and temporary password to authenticate and change the password for the login.<\/li>\n<\/ol>\n<p>It\u2019s recommended to keep this browser tab open so you don\u2019t have to log in again. 
This concludes all necessary steps to create your workforce.<\/p>\n<h2>Configuring a custom labeling job<\/h2>\n<p>In this section, we create a labeling job and use this job to explain the details and data flow of a custom labeling job.<\/p>\n<ol>\n<li>On the Amazon SageMaker console, under <strong>Ground Truth<\/strong>, choose <strong>Labeling jobs<\/strong>.<\/li>\n<li>Choose <strong>Create labeling job<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17969\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-5.jpg\" alt=\"\" width=\"800\" height=\"248\"><\/p>\n<ol start=\"3\">\n<li>Enter a name for your labeling job, such as WeldingLabelJob1.<\/li>\n<li>Choose <strong>Manual data setup<\/strong>.<\/li>\n<li>For <strong>Input dataset location<\/strong>, enter the ManifestS3Path value from the CloudFormation stack <strong>Outputs <\/strong><\/li>\n<li>For <strong>Output dataset location<\/strong>, enter the <code>ProposedOutputPath<\/code> value from the CloudFormation stack <strong>Outputs<\/strong><\/li>\n<li>For <strong>IAM role<\/strong>, choose <strong>Enter a custom IAM role ARN<\/strong>.<\/li>\n<li>Enter the <code>SagemakerServiceRoleArn<\/code> value from the CloudFormation stack <strong>Outputs <\/strong><\/li>\n<li>For the task type, choose <strong>Custom<\/strong>.<\/li>\n<li>Choose <strong>Next<\/strong>.<\/li>\n<\/ol>\n<p>The IAM role is a customized role created by the CloudFormation template that allows Ground Truth to invoke Lambda functions and access Amazon S3.<\/p>\n<ol start=\"11\">\n<li>Choose to use a private labeling workforce.<\/li>\n<li>From the drop-down menu, choose the workforce <strong>welding-experts<\/strong>.<\/li>\n<li>For task timeout and task expiration time, 1 hour is sufficient.<\/li>\n<li>The number of workers per dataset object is 1.<\/li>\n<li>In the <strong>Lambda functions 
<\/strong>section, for <strong>Pre-labeling task Lambda function<\/strong>, choose the function that starts with <code>PreLabelingLambda-<\/code>.<\/li>\n<li>For <strong>Post-labeling task Lambda function<\/strong>, choose the function that starts with <code>PostLabelingLambda-<\/code>.<\/li>\n<li>Enter the following code into the templates section. This HTML code specifies the interface that the workers in the private label workforce see when labeling items. For our use case, the template displays four images and the categories used to classify the welding result:\n<div class=\"hide-language\">\n<pre><code class=\"lang-html\">&lt;script src=\"https:\/\/assets.crowd.aws\/crowd-html-elements.js\"&gt;&lt;\/script&gt;\r\n&lt;crowd-form&gt;\r\n  &lt;crowd-classifier\r\n    name=\"WeldingClassification\"\r\n    categories=\"['Good Weld', 'Burn Through', 'Contamination', 'Lack of Fusion', 'Lack of Shielding Gas', 'High Travel Speed', 'Not sure']\"\r\n    header=\"Please classify the welding process.\"\r\n  &gt;\r\n      &lt;classification-target&gt;\r\n        &lt;div&gt;\r\n          &lt;h3&gt;Welding Image&lt;\/h3&gt;\r\n\t      \t&lt;p&gt;&lt;strong&gt;Welding Camera Image &lt;\/strong&gt;{{ task.input.image.title }}&lt;\/p&gt;\r\n\t      \t&lt;p&gt;&lt;a href=\"{{ task.input.image.file | grant_read_access }}\" target=\"_blank\"&gt;Download Image&lt;\/a&gt;&lt;\/p&gt;\r\n\t      \t&lt;p&gt;\r\n\t      \t\t&lt;img style=\"height: 30vh; margin-bottom: 10px\" src=\"{{ task.input.image.file | grant_read_access }}\"\/&gt;\r\n\t      \t&lt;\/p&gt;\r\n\t    &lt;\/div&gt;\r\n\t    &lt;hr\/&gt;\r\n        &lt;div&gt;\r\n          &lt;h3&gt;Current Graph&lt;\/h3&gt;\r\n\t      \t&lt;p&gt;&lt;strong&gt;Current Graph &lt;\/strong&gt;{{ task.input.current.title }}&lt;\/p&gt;\r\n\t      \t&lt;p&gt;&lt;a href=\"{{ task.input.current.file | grant_read_access }}\" target=\"_blank\"&gt;Download Current Plot&lt;\/a&gt;&lt;\/p&gt;\r\n\t      \t&lt;p&gt;\r\n\t      \t\t&lt;img 
style=\"height: 30vh; margin-bottom: 10px\" src=\"{{ task.input.current.file | grant_read_access }}\"\/&gt;\r\n\t      \t&lt;\/p&gt;\r\n\t    &lt;\/div&gt;\r\n        &lt;hr\/&gt;\r\n        &lt;div&gt;\r\n          &lt;h3&gt;Electrode Position Graph&lt;\/h3&gt;\r\n\t      \t&lt;p&gt;&lt;strong&gt;Electrode Position Graph &lt;\/strong&gt;{{ task.input.electrode.title }}&lt;\/p&gt;\r\n\t      \t&lt;p&gt;&lt;a href=\"{{ task.input.electrode.file | grant_read_access }}\" target=\"_blank\"&gt;Download Electrode Position Plot&lt;\/a&gt;&lt;\/p&gt;\r\n\t      \t&lt;p&gt;\r\n\t      \t\t&lt;img style=\"height: 30vh; margin-bottom: 10px\" src=\"{{ task.input.electrode.file | grant_read_access }}\"\/&gt;\r\n\t      \t&lt;\/p&gt;\r\n\t    &lt;\/div&gt;\r\n        &lt;hr\/&gt;\r\n        &lt;div&gt;\r\n          &lt;h3&gt;Voltage Graph&lt;\/h3&gt;\r\n\t      \t&lt;p&gt;&lt;strong&gt;Voltage Graph &lt;\/strong&gt;{{ task.input.voltage.title }}&lt;\/p&gt;\r\n\t      \t&lt;p&gt;&lt;a href=\"{{ task.input.voltage.file | grant_read_access }}\" target=\"_blank\"&gt;Download Voltage Plot&lt;\/a&gt;&lt;\/p&gt;\r\n\t      \t&lt;p&gt;\r\n\t      \t\t&lt;img style=\"height: 30vh; margin-bottom: 10px\" src=\"{{ task.input.voltage.file | grant_read_access }}\"\/&gt;\r\n\t      \t&lt;\/p&gt;\r\n\t    &lt;\/div&gt;\r\n      &lt;\/classification-target&gt;\r\n\r\n\r\n\r\n    &lt;full-instructions header=\"Classification Instructions\"&gt;\r\n      &lt;p&gt;Read the task carefully and inspect the image as well as the plots.&lt;\/p&gt;\r\n      &lt;p&gt;\r\n\t\t  The image is a picture taken during the welding process. 
The plots show the corresponding sensor data for\r\n\t\t  the electrode position, the voltage and the current measured during the welding process.\r\n\t  &lt;\/p&gt;\r\n    &lt;\/full-instructions&gt;\r\n\r\n    &lt;short-instructions&gt;\r\n      &lt;p&gt;Read the task carefully and inspect the image as well as the plots&lt;\/p&gt;\r\n    &lt;\/short-instructions&gt;\r\n  &lt;\/crowd-classifier&gt;\r\n&lt;\/crowd-form&gt;\r\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<\/ol>\n<p>The wizard for creating the labeling job has a preview function in the section <strong>Custom labeling task setup<\/strong>, which you can use to check if all configurations work properly.<\/p>\n<ol start=\"18\">\n<li>To preview the interface, choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17970\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-6.jpg\" alt=\"\" width=\"800\" height=\"260\"><\/p>\n<p>This opens a new browser tab and shows a test version of the labeling interface, similar to the following screenshot.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17971\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-7.jpg\" alt=\"\" width=\"800\" height=\"641\"><\/p>\n<ol start=\"19\">\n<li>To create the labeling job, choose <strong>Create<\/strong>.<\/li>\n<\/ol>\n<p>Ground Truth sets up the labeling job as specified, and the dashboard shows its status.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17972\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-8.jpg\" alt=\"\" width=\"800\" height=\"154\"><\/p>\n<h2>Assigning labels<\/h2>\n<p>To finalize the labeling job that you 
configured, you log in to the worker portal and assign labels to different data items consisting of images and data plots. The details on how the different components of the labeling job work together are explained in the next section.<\/p>\n<ol>\n<li>On the Amazon SageMaker console, under <strong>Ground Truth<\/strong>, choose <strong>Labeling workforces<\/strong>.<\/li>\n<li>On the <strong>Private <\/strong>tab, choose the link for <strong>Labeling portal sign-in URL<\/strong>.<\/li>\n<\/ol>\n<p>When Ground Truth is finished preparing the labeling job, you can see it listed in the <strong>Jobs <\/strong>section. If it\u2019s not showing up, wait a few minutes and refresh the tab.<\/p>\n<ol start=\"3\">\n<li>Choose <strong>Start working<\/strong>.<img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17973 alignnone\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-9.jpg\" alt=\"\" width=\"800\" height=\"264\"><\/li>\n<\/ol>\n<p>This launches the labeling UI, which allows you to assign labels to mixed datasets consisting of welding images and plots for current, electrode position, and voltage.<\/p>\n<p>For this use case, you can assign seven different labels to a single dataset. These different classes and labels are defined in the HTML of the UI, but you can also insert them dynamically using the pre-labeling Lambda function (discussed in the next section). Because we don\u2019t actually use the labeled data for ML purposes, you can assign the labels randomly to the five items that are displayed by Ground Truth for this labeling job.<\/p>\n<p>After labeling all the items, the UI switches back to the list with available jobs. This concludes the section about configuring and launching the labeling job. 
In the next section, I explain the mechanics of a custom labeling job in detail and also dive deep into the different elements of the HTML interface.<\/p>\n<h2>Custom labeling deep dive<\/h2>\n<p>A custom labeling job combines the data to be labeled with three components to create a workflow that allows workers from the labeling workforce to assign labels to each item in the dataset:<\/p>\n<ul>\n<li><strong>Pre-labeling Lambda function<\/strong> \u2013 Generates the content to be displayed on the labeling interface using the manifest file specified during the configuration of the labeling job. For this use case, the function also converts the CSV files into human readable plots and stores these plots as images in the S3 bucket under the prefix <code>plots<\/code>.<\/li>\n<li><strong>Labeling interface<\/strong> \u2013 Uses the output of the pre-labeling function to generate a user interface. For this use case, the interface displays four images (the picture taken during the welding process and the three graphs for current, electrode position, and voltage) and a form that allows workers to classify the welding process.<\/li>\n<li><strong>Label consolidation Lambda<\/strong> <strong>function<\/strong> \u2013 Allows you to implement custom strategies to consolidate classifications of one or several workers into a single response. For our workforce, this is very simple because there is only a single worker whose labels are consolidated into a file, which is stored by Ground Truth into Amazon S3.<\/li>\n<\/ul>\n<p>Before we analyze these three components, I provide insights into the structure of the manifest file, which describes the data sources for the labeling job.<\/p>\n<h3>Manifest and dataset files<\/h3>\n<p>The manifest file is a file conforming to the <a href=\"http:\/\/jsonlines.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">JSON lines<\/a> format, in which each line represents one item to label. 
Ground Truth expects either a <code>source<\/code> or a <code>source-ref<\/code> key in each line of the file. For this use case, I use <code>source<\/code>, and its value is a string representing an Amazon S3 path. For this post, we only label five items, and the JSON lines are similar to the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\"source\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/dataset\/dataset-1.json\"}<\/code><\/pre>\n<\/p><\/div>\n<p>For our use case with multiple input formats and files, each line in the manifest points to a dataset file that is also stored on Amazon S3. Our dataset is a JSON document, which contains references to the welding image and the CSV file with the sensor data:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\r\n  \"sensor_data\": {\"s3Path\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/sensor_data\/weld.1.csv\"},\r\n  \"image\": {\"s3Path\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/images\/weld.1.png\"}\r\n}<\/code><\/pre>\n<\/p><\/div>\n<p>For each line of the manifest file, Ground Truth triggers the pre-labeling Lambda function, which we discuss next.<\/p>\n<h3>Pre-labeling Lambda function<\/h3>\n<p>A pre-labeling Lambda function creates a JSON object that is used to populate the item-specific portions of the labeling interface. For more information, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-custom-templates-step3.html\" target=\"_blank\" rel=\"noopener noreferrer\">Processing with AWS Lambda<\/a>.<\/p>\n<p>Before Ground Truth displays an item for labeling to a worker, it runs the pre-labeling function and forwards the information in the manifest\u2019s JSON line to the function. 
For our use case, the event passed to the function is as follows:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\r\n  \"version\": \"2018-10-06\", \r\n  \"labelingJobArn\": \"arn:aws:sagemaker:eu-west-1:XXX:labeling-job\/weldinglabeljob1\",\r\n  \"dataObject\": { \r\n    \"source\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/dataset\/dataset-1.json\" \r\n  }\r\n}<\/code><\/pre>\n<\/p><\/div>\n<p>Although I omit the implementation details here (for those interested, the code is deployed with the CloudFormation template for review), the function for our labeling job uses this input to complete the following steps:<\/p>\n<ol>\n<li>Download the dataset file referenced in the <code>source<\/code> field of the input (see the preceding code).<\/li>\n<li>Download the CSV file containing the sensor data. The dataset file is expected to have a reference to this CSV file.<\/li>\n<li>Generate plots for current, electrode position, and voltage from the contents of the CSV file.<\/li>\n<li>Upload the plot files to Amazon S3.<\/li>\n<li>Generate a JSON object containing the references to the aforementioned plot files and the welding image referenced in the dataset file.<\/li>\n<\/ol>\n<p>When these steps are complete, the function returns a JSON object with two parts:<\/p>\n<ul>\n<li><strong>taskInput<\/strong> \u2013 Fully customizable JSON object that contains information to be displayed on the labeling UI.<\/li>\n<li><strong>isHumanAnnotationRequired<\/strong> \u2013 A string representing a Boolean value (<code>true<\/code> or <code>false<\/code>), which you can use to exclude objects from being labeled by humans. 
I don\u2019t use this flag for this use case because we want to label all the provided data items.<\/li>\n<\/ul>\n<p>For more information, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-custom-templates-step3.html\" target=\"_blank\" rel=\"noopener noreferrer\">Processing with AWS Lambda<\/a>.<\/p>\n<p>Because I want to show the welding images and the three graphs for current, electrode position, and voltage, the result of the Lambda function is as follows for the first dataset:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\r\n  \"taskInput\": { \r\n    \"image\": { \r\n      \"file\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/images\/weld.1.png\", \r\n      \"title\": \" from image at s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/images\/weld.1.png\"\r\n    }, \r\n    \"voltage\": { \r\n      \"file\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/plots\/weld.1.csv-voltage.png\", \r\n      \"title\": \" from file at plots\/weld.1.csv-voltage.png\"\r\n    },\r\n    \"electrode\": { \r\n      \"file\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/plots\/weld.1.csv-electrode_pos.png\", \r\n      \"title\": \" from file at plots\/weld.1.csv-electrode_pos.png\" \r\n    }, \r\n    \"current\": { \r\n      \"file\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/plots\/weld.1.csv-current.png\", \r\n      \"title\": \" from file at plots\/weld.1.csv-current.png\" \r\n    } \r\n  }, \r\n  \"isHumanAnnotationRequired\": \"true\"\r\n}<\/code><\/pre>\n<\/p><\/div>\n<p>In the preceding code, the <code>taskInput<\/code> is fully customizable; the function returns the Amazon S3 paths to the images to display, and also a title, which has some non-functional text. 
Next, I show how to access these different parts of the <code>taskInput<\/code> JSON object when building the customized labeling UI displayed to workers by Ground Truth.<\/p>\n<h3>Labeling UI: Accessing taskInput content<\/h3>\n<p>Ground Truth uses the output of the Lambda function to fill content into the HTML skeleton that is provided at the creation of the labeling job. In general, the contents of the <code>taskInput<\/code> output object are accessed using <code>task.input<\/code> in the HTML code.<\/p>\n<p>For instance, to retrieve the Amazon S3 path where the welding image is stored from the output, you need to access the path <code>taskInput\/image\/file<\/code>. Because the <code>taskInput<\/code> object from the function output is mapped to <code>task.input<\/code> in the HTML, the corresponding reference to the welding image file is <code>task.input.image.file<\/code>. This reference is directly integrated into the HTML code of the labeling UI to display the welding image:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-html\">&lt;img style=\"height: 30vh; margin-bottom: 10px\" src=\"{{ task.input.image.file | grant_read_access }}\"\/&gt;<\/code><\/pre>\n<\/p><\/div>\n<p>The <code>grant_read_access<\/code> <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-custom-templates-step2.html\" target=\"_blank\" rel=\"noopener noreferrer\">filter<\/a> is needed for files in S3 buckets that aren\u2019t publicly accessible. This makes sure that the URL passed to the browser contains a short-lived access token for the image and thereby avoids having to make resources publicly accessible for labeling jobs. This is often mandatory because the data to be labeled, such as machine data, is confidential. 
Because the pre-labeling function has also converted the CSV files into plots and images, their integration into the UI is analogous.<\/p>\n<h3>Label consolidation Lambda function<\/h3>\n<p>The second Lambda function that was configured for the custom labeling job runs when all workers have labeled an item or the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/APIReference\/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-TaskAvailabilityLifetimeInSeconds\" target=\"_blank\" rel=\"noopener noreferrer\">time limit<\/a> of the labeling job is reached. The key task of this function is to derive a single label from the responses of the workers. Additionally, the function can be used for any kind of further processing of the labeled data, such as storing it on Amazon S3 in a format ideally suited for the ML pipeline that you use.<\/p>\n<p>Although there are different possible strategies to consolidate labels, I focus on the cornerstones of the implementation for such a function and show how they translate to our use case. 
The consolidation function is triggered by an event similar to the following JSON code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{ \r\n  \"version\": \"2018-10-06\", \r\n  \"labelingJobArn\": \"arn:aws:sagemaker:eu-west-1:261679111194:labeling-job\/weldinglabeljob1\", \r\n  \"payload\": { \r\n    \"s3Uri\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/output\/WeldingLabelJob1\/annotations\/consolidated-annotation\/consolidation-request\/iteration-1\/2020-09-15_16:16:11.json\" \r\n  }, \r\n  \"labelAttributeName\": \"WeldingLabelJob1\", \r\n  \"roleArn\": \"arn:aws:iam::261679111194:role\/AmazonSageMaker-Service-role-unn4d0l4j0\", \r\n  \"outputConfig\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/output\/WeldingLabelJob1\/annotations\", \r\n  \"maxHumanWorkersPerDataObject\": 1 \r\n}<\/code><\/pre>\n<\/p><\/div>\n<p>The key item in this event is the <code>payload<\/code>, which contains an <code>s3Uri<\/code> pointing to a file stored on Amazon S3. This payload file contains the list of datasets that have been labeled and the labels assigned to them by workers. The following code is an example of such a list entry:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{ \r\n  \"datasetObjectId\": \"4\", \r\n  \"dataObject\": { \r\n    \"s3Uri\": \"s3:\/\/iiot-custom-label-blog-bucket-unn4d0l4j0\/dataset\/dataset-5.json\" \r\n  }, \r\n  \"annotations\": [ \r\n    { \r\n      \"workerId\": \"private.eu-west-1.abd2ec3e354db315\",\r\n      \"annotationData\": { \r\n          \"content\":\"{\\\"WeldingClassification\\\":{\\\"label\\\":\\\"Not sure\\\"}}\"\r\n      } \r\n    } \r\n  ] \r\n}<\/code><\/pre>\n<\/p><\/div>\n<p>Each entry lists the labels assigned to one dataset, along with an identifier that you could use to determine which worker assigned them. If multiple workers label the same item, there are multiple entries in <code>annotations<\/code>. 
Because I created a single worker that labeled all the items for this post, there is only a single entry. The file <code>dataset-5.json<\/code> has been labeled with <code>Not sure<\/code> for the classifier <code>WeldingClassification<\/code>.<\/p>\n<p>The label consolidation function has to iterate over all list entries and determine for each dataset a label to use as the ground truth for supervised ML training. Ground Truth expects the function to return a list containing an entry for each dataset item with the following structure:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{ \r\n  \"datasetObjectId\": \"4\", \r\n  \"consolidatedAnnotation\": { \r\n    \"content\": { \r\n      \"WeldingLabelJob1\": {\r\n         \"WeldingClassification\": \"Not sure\" \r\n      } \r\n    }\r\n  } \r\n}\r\n<\/code><\/pre>\n<p>Each entry of the returned list must contain the <code>datasetObjectId<\/code> for the corresponding entry in the payload file and a JSON object <code>consolidatedAnnotation<\/code>, which contains an object <code>content<\/code>. Ground Truth expects <code>content<\/code> to contain a key that equals the name of the labeling job (for our use case, <code>WeldingLabelJob1<\/code>). 
For more information, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-custom-templates-step3.html\" target=\"_blank\" rel=\"noopener noreferrer\">Processing with AWS Lambda<\/a>.<br \/>You can change this behavior when you create the labeling job by selecting <strong>I want to specify a label attribute name different from the labeling job name <\/strong>and entering a label attribute name.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17975\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-10.jpg\" alt=\"\" width=\"800\" height=\"231\"><\/p>\n<p>The content under this key is freely configurable and can be arbitrarily complex. For our use case, it\u2019s enough to return the assigned label <code>Not sure<\/code>. If any of these formatting requirements are not met, Ground Truth assumes that the labeling job failed.<\/p>\n<p>When the requirements are met, Ground Truth uploads the list of JSON entries to the bucket and prefix specified during the creation of the labeling job. Because I specified <code>output<\/code> as the desired prefix, the consolidated labels are uploaded with the following prefix:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">output\/WeldingLabelJob1\/annotations\/consolidated-annotation\/consolidation-response\/iteration-1\/<\/code><\/pre>\n<\/p><\/div>\n<p>You can use such files for training ML algorithms in Amazon SageMaker or for further processing.<\/p>\n<h3>Cleaning up<\/h3>\n<p>To avoid incurring future charges, delete all resources created for this post.<\/p>\n<ol>\n<li>On the AWS CloudFormation console, choose <strong>Stacks<\/strong>.<\/li>\n<li>Select the stack <strong>iiot-custom-label-blog<\/strong>.<\/li>\n<li>Choose 
<strong>Delete<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-17976\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Amazon-SageMaker-Ground-Truth-11.jpg\" alt=\"\" width=\"800\" height=\"236\"><\/p>\n<p>This step removes all files and the S3 bucket from your account. The process takes about 3\u20135 minutes.<\/p>\n<h2>Conclusion<\/h2>\n<p>Supervised ML requires labeled data, and Ground Truth provides a platform for creating labeling workflows. This post showed how to build a complex industrial IoT labeling workflow, in which data from multiple sources needs to be considered for labeling items. The post explained how to create a custom labeling job and provided details on the mechanisms Ground Truth requires to implement such a workflow. To get started with writing your own custom labeling job, refer to the\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-custom-templates.html\" target=\"_blank\" rel=\"noopener noreferrer\">custom labeling documentation<\/a>\u00a0page for Ground Truth and, if needed, redeploy the CloudFormation template from this post to get a sample of the pre-labeling and consolidation Lambda functions. 
Additionally, the blog post \u201c<a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/creating-custom-labeling-jobs-with-aws-lambda-and-amazon-sagemaker-ground-truth\/\" target=\"_blank\" rel=\"noopener noreferrer\">Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth<\/a>\u201d provides additional insights into building custom labeling jobs.<\/p>\n<hr>\n<h3>About the Author<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17980 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/03\/Markus-Bestehorn.jpg\" alt=\"\" width=\"100\" height=\"134\">As a Principal Prototyping Engagement Manager, <strong>Dr. Markus Bestehorn<\/strong>\u00a0is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His \u201ccareer\u201d started as a 7-year-old when he got his hands on a computer with two 5.25\u201d floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. 
When he\u2019s not on the computer, he runs or climbs mountains.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/labeling-mixed-source-industrial-datasets-with-amazon-sagemaker-ground-truth\/<\/p>\n","protected":false},"author":0,"featured_media":761,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/760"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=760"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/760\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/761"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=760"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=760"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=760"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}