{"id":1481,"date":"2022-01-14T21:40:51","date_gmt":"2022-01-14T21:40:51","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/01\/14\/label-text-for-aspect-based-sentiment-analysis-using-sagemaker-ground-truth\/"},"modified":"2022-01-14T21:40:51","modified_gmt":"2022-01-14T21:40:51","slug":"label-text-for-aspect-based-sentiment-analysis-using-sagemaker-ground-truth","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/01\/14\/label-text-for-aspect-based-sentiment-analysis-using-sagemaker-ground-truth\/","title":{"rendered":"Label text for aspect-based sentiment analysis using SageMaker Ground Truth"},"content":{"rendered":"<div id=\"\">\n<p>The Amazon Machine Learning Solutions Lab (MLSL) recently created a tool for annotating text with named-entity recognition (NER) and relationship labels using <a href=\"https:\/\/aws.amazon.com\/sagemaker\/data-labeling\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Ground Truth<\/a>.\u00a0Annotators use this tool to label text with named entities and link their relationships, thereby building a dataset for training state-of-the-art natural language processing (NLP) machine learning (ML) models. Most importantly, this is now publicly available to all AWS customers.<\/p>\n<h2>Customer Use Case: Booking.com<\/h2>\n<p><a title=\"http:\/\/Booking.com\" href=\"http:\/\/booking.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Booking.com<\/a> is one of the world\u2019s leading online travel platforms. Understanding what customers are saying about the company\u2019s 28 million+ property listings on the platform is essential for maintaining a top-notch customer experience. Previously, Booking.com could only utilize traditional sentiment analysis to interpret customer-generated reviews at scale. 
Looking to upgrade the specificity of these interpretations, Booking.com recently turned to the MLSL for help with building a custom annotated dataset for training an aspect-based sentiment analysis model.<\/p>\n<p>Traditional sentiment analysis is the process of classifying a piece of text as positive, negative, or neutral as a <strong>singular sentiment<\/strong>. This works to broadly understand if users are satisfied or unsatisfied with a particular experience. For example, with traditional sentiment analysis, the following text may be classified as \u201cneutral\u201d:<\/p>\n<blockquote>\n<p><em>Our stay at the hotel was nice. The staff was friendly and the rooms were clean, but our beds were quite uncomfortable.<\/em><\/p>\n<\/blockquote>\n<p>Aspect-based sentiment analysis offers a more nuanced understanding of content. In the case of Booking.com, rather than taking a customer review as a whole and classifying it categorically, it can take sentiment from within a review and assign it to specific aspects. For example, customer reviews of a given hotel might praise the immaculate pool and fitness area, but give critical feedback on the restaurant and lounge.<\/p>\n<p>The statement which would have been classified as \u201cneutral\u201d by traditional sentiment analysis will, with aspect-based sentiment analysis, become:<\/p>\n<blockquote>\n<p><em>Our stay at the hotel was nice. 
The staff was friendly and the rooms were clean, but our beds were quite uncomfortable.<\/em><\/p>\n<\/blockquote>\n<ul>\n<li>Hotel: Positive<\/li>\n<li>Staff: Positive<\/li>\n<li>Room: Positive<\/li>\n<li>Beds: Negative<\/li>\n<\/ul>\n<p>Booking.com sought to build a custom aspect-based sentiment analysis model that would tell them which specific parts of the guest experience (from a list of 50+ aspects) were\u00a0<strong>positive<\/strong>,\u00a0<strong>negative<\/strong>, or <strong>neutral<\/strong>.<\/p>\n<p>Before Booking.com could build a training dataset for this model, they needed a way to annotate it. MLSL\u2019s annotation tool provided the much-needed customized solution. Human review was performed on a large collection of hotel reviews. Then, annotators completed named-entity annotation on sentiment and guest-experience text spans and phrases before linking appropriate spans together.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image001.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-31592 aligncenter\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image001.jpg\" alt=\"\" width=\"493\" height=\"157\"><\/a><\/p>\n<p>The new aspect-based model lets Booking.com personalize both accommodations and reviews to its customers. Highlighting the positive and negative aspects of each accommodation enables the customers to choose their perfect match. In addition, different customers care about different aspects of the accommodation, and the new model opens up the opportunity to show the most relevant reviews to each one.<\/p>\n<h2>Labeling Requirements<\/h2>\n<p>Although Ground Truth provides a built-in NER text annotation capability, it doesn\u2019t provide the ability to link entities together. 
With this in mind, Booking.com and MLSL worked out the following high-level requirements for a new named entity recognition text labeling tool that:<\/p>\n<ul>\n<li>Accepts as input: <strong>text<\/strong>, <strong>entity labels<\/strong>, <strong>relationship labels<\/strong>, and <strong>classification labels<\/strong>.<\/li>\n<li>Optionally accepts as input pre-annotated data with the preceding label and relationship annotations<strong>.<\/strong><\/li>\n<li>Presents the annotator with either unannotated or pre-annotated text.<\/li>\n<li>Allows annotators to highlight and annotate arbitrary text with an entity label.<\/li>\n<li>Allows annotators to create relationships between two entity annotations.<\/li>\n<li>Allows annotators to easily navigate large numbers of entity labels.<\/li>\n<li>Supports grouping entity labels into categories.<\/li>\n<li>Allow overlapping relationships, which means that the same annotated text segment can be related to more than one other annotated text segment.<\/li>\n<li>Allows overlapping entity label annotations, which means that two annotations can overlap the same piece of text. For example, the text \u201cSeattle Space Needle\u201d can have both the annotations \u201cSeattle\u201d \u2192 \u201clocations\u201d, and \u201cSeattle Space Needle\u201d \u2192 \u201cattractions\u201d.<\/li>\n<li>Output format is compatible with input format, and it can be fed back into subsequent labeling tasks.<\/li>\n<li>Supports UTF-8 encoded text containing emoji and other multi-byte characters.<\/li>\n<li>Supports left-to-right languages.<\/li>\n<\/ul>\n<h2>Sample Annotation<\/h2>\n<p>Consider the following document:<\/p>\n<blockquote>\n<p><em>We loved the location of this hotel! The rooftop lounge gave us the perfect view of space needle. 
It is also a short drive away from pike place market and the waterfront.<\/em><br \/><em>Food was only available via room service, which was a little disappointing but makes sense in this post-pandemic world.<\/em><br \/><em>Overall, a reasonably priced experience.<\/em><\/p>\n<\/blockquote>\n<p>Loading this document into the new NER annotation tool presents a worker with the following interface:<\/p>\n<div id=\"attachment_31593\" class=\"wp-caption alignnone\">\n        <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image005.png\"><img decoding=\"async\" loading=\"lazy\" aria-describedby=\"caption-attachment-31593\" class=\"wp-image-31593 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image005.png\" alt=\"Worker presented with an unannotated document\" width=\"864\" height=\"699\"><\/a> <\/p>\n<p id=\"caption-attachment-31593\" class=\"wp-caption-text\">Worker presented with an unannotated document<\/p>\n<\/p><\/div>\n<p>In this case, the worker\u2019s job is to:<\/p>\n<ul>\n<li>Label entities related to the property (location, price, food, etc.)<\/li>\n<li>Label entities related to sentiment (positive, negative, or neutral)<\/li>\n<li>Link property-related named entities to sentiment-related keywords to accurately capture the guest experience<\/li>\n<\/ul>\n<div class=\"wp-caption alignnone\">\n        <img decoding=\"async\" loading=\"lazy\" class=\"size-full\" src=\"https:\/\/s3.amazonaws.com\/aws-ml-blog\/artifacts\/ml-4415-NER-annotation%20\/new3-in-progress.gif\" alt=\"Worker performing annotations\" width=\"1335\" height=\"1080\"> <\/p>\n<p class=\"wp-caption-text\">Worker performing annotations<\/p>\n<\/p><\/div>\n<p>Annotation speed was an important consideration of the tool. 
Using a sequence of intuitive keyboard shortcuts and mouse gestures, annotators can drive the interface and:<\/p>\n<ul>\n<li>Add and remove named entity annotations<\/li>\n<li>Add relationships between named entities<\/li>\n<li>Jump to the beginning and end of the document<\/li>\n<li>Submit the document<\/li>\n<\/ul>\n<p>Additionally, there is support for overlapping labels. For example, <code>Seattle Space Needle<\/code>: in this phrase, <code>Seattle<\/code> is annotated both as\u00a0a location by itself and as a part of the attraction name.<\/p>\n<p>The completed annotation provides a more complete, nuanced analysis of the data:<\/p>\n<div id=\"attachment_31594\" class=\"wp-caption alignnone\">\n        <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image006.png\"><img decoding=\"async\" loading=\"lazy\" aria-describedby=\"caption-attachment-31594\" class=\"wp-image-31594 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image006.png\" alt=\"Completed document\" width=\"1598\" height=\"1294\"><\/a> <\/p>\n<p id=\"caption-attachment-31594\" class=\"wp-caption-text\">Completed document<\/p>\n<\/p><\/div>\n<p>Relationships can be configured in many levels, from entity categories to other entity categories (for example, from \u201cfood\u201d to \u201csentiment\u201d), or between individual entity types. Relationships are directed, so annotators can link an aspect like food to a sentiment, but not vice-versa (unless explicitly enabled). When drawing relationships, the annotation tool will automatically deduce the relationship label and direction.<\/p>\n<h2>Configuring the NER Annotation Tool<\/h2>\n<p>In this section, we cover how to customize the NER annotation tool for customer-specific use cases. 
This includes configuring:<\/p>\n<ul>\n<li>The input text to annotate<\/li>\n<li>Entity labels<\/li>\n<li>Relationship Labels<\/li>\n<li>Classification Labels<\/li>\n<li>Pre-annotated data<\/li>\n<li>Worker instructions<\/li>\n<\/ul>\n<p>We\u2019ll cover the specifics of the input and output document formats, as well as provide some examples of each.<\/p>\n<h3>Input Document Format<\/h3>\n<p>The NER annotation tool expects the following JSON formatted input document (Fields with a question mark next to the name are optional).<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\n  text: string;\n  tokenRows?: string[][];\n  documentId?: string;\n  entityLabels?: {\n    name: string;\n    shortName?: string;\n    category?: string;\n    shortCategory?: string;\n    color?: string;\n  }[];\n  classificationLabels?: string[];\n  relationshipLabels?: {\n    name: string;\n    allowedRelationships?: {\n        sourceEntityLabelCategories?: string[];\n        targetEntityLabelCategories?: string[];\n        sourceEntityLabels?: string[];\n        targetEntityLabels?: string[];\n    }[];\n  }[];\n  entityAnnotations?: {\n    id: string;\n    start: number;\n    end: number;\n    text: string;\n    label: string;\n    labelCategory?: string;\n  }[];\n  relationshipAnnotations?: {\n    sourceEntityAnnotationId: string;\n    targetEntityAnnotationId: string;\n    label: string;\n  }[];\n  classificationAnnotations?: string[];\n  meta?: {\n    instructions?: string;\n    disableSubmitConfirmation?: boolean;\n    multiClassification: boolean;\n  };\n}<\/code><\/pre>\n<\/p><\/div>\n<p>In a nutshell, the input format has these characteristics:<\/p>\n<ul>\n<li>Either <code>entityLabels<\/code> or <code>classificationLabels<\/code> (or both) are required to annotate.<\/li>\n<li>If <code>entityLabels<\/code> are given, then <code>relationshipLabels<\/code> can be added.<\/li>\n<li>Relationships can be allowed between different entity\/category labels or a mix of 
these.<\/li>\n<li>The \u201csource\u201d of a relationship is the entity that the directed arrow starts with, while the \u201ctarget\u201d is where it\u2019s heading.<\/li>\n<\/ul>\n<table border=\"1px\">\n<tbody>\n<tr>\n<td><span><strong>Field<\/strong><\/span><\/td>\n<td><span><strong>Type<\/strong><\/span><\/td>\n<td><span><strong>Description<\/strong><\/span><\/td>\n<\/tr>\n<tr>\n<td>text<\/td>\n<td>string<\/td>\n<td>Required. Input text for annotation.<\/td>\n<\/tr>\n<tr>\n<td>tokenRows<\/td>\n<td>string[][]<\/td>\n<td>Optional. Custom tokenization of input text. Array of arrays of strings. Top level array represents each row of text (line breaks), and second level array represents tokens on each row. All characters\/runes in the input text must be accounted for in tokenRows, including any white space.<\/td>\n<\/tr>\n<tr>\n<td>documentId<\/td>\n<td>string<\/td>\n<td>Optional. Value for customers to keep track of the document being annotated.<\/td>\n<\/tr>\n<tr>\n<td>entityLabels<\/td>\n<td>object[]<\/td>\n<td>Required if classificationLabels is blank. Array of entity labels.<\/td>\n<\/tr>\n<tr>\n<td>entityLabels[].name<\/td>\n<td>string<\/td>\n<td>Required. Entity label display name.<\/td>\n<\/tr>\n<tr>\n<td>entityLabels[].category<\/td>\n<td>string<\/td>\n<td>Optional. Entity label category name.<\/td>\n<\/tr>\n<tr>\n<td>entityLabels[].shortName<\/td>\n<td>string<\/td>\n<td>Optional. Display this text over annotated entities rather than the full name.<\/td>\n<\/tr>\n<tr>\n<td>entityLabels[].shortCategory<\/td>\n<td>string<\/td>\n<td>Optional. Display this text in the entity annotation select dropdown instead of the first four letters of the category name.<\/td>\n<\/tr>\n<tr>\n<td>entityLabels[].color<\/td>\n<td>string<\/td>\n<td>Optional. Hex color code with \u201c#\u201d prefix. If blank, then it will automatically assign a color to the entity label.<\/td>\n<\/tr>\n<tr>\n<td>relationshipLabels<\/td>\n<td>object[]<\/td>\n<td>Optional. 
Array of relationship labels.<\/td>\n<\/tr>\n<tr>\n<td>relationshipLabels[].name<\/td>\n<td>string<\/td>\n<td>Required. Relationship label display name.<\/td>\n<\/tr>\n<tr>\n<td>relationshipLabels[].allowedRelationships<\/td>\n<td>object[]<\/td>\n<td>Optional. Array of values restricting what types of source and destination entity labels this relationship can be assigned to. Each item in array is \u201cOR\u2019ed\u201d together.<\/td>\n<\/tr>\n<tr>\n<td>relationshipLabels[].allowedRelationships[].sourceEntityLabelCategories<\/td>\n<td>string[]<\/td>\n<td>Required to set either sourceEntityLabelCategories or sourceEntityLabels (or both). List of legal source entity label category types for this relationship.<\/td>\n<\/tr>\n<tr>\n<td>relationshipLabels[].allowedRelationships[].targetEntityLabelCategories<\/td>\n<td>string[]<\/td>\n<td>Required to set either targetEntityLabelCategories or targetEntityLabels (or both). List of legal\u00a0target entity label category types for this relationship.<\/td>\n<\/tr>\n<tr>\n<td>relationshipLabels[].allowedRelationships[].sourceEntityLabels<\/td>\n<td>string[]<\/td>\n<td>Required to set either sourceEntityLabelCategories or sourceEntityLabels (or both). List of legal source entity label types for this relationship.<\/td>\n<\/tr>\n<tr>\n<td>relationshipLabels[].allowedRelationships[].targetEntityLabels<\/td>\n<td>string[]<\/td>\n<td>Required to set either targetEntityLabelCategories or targetEntityLabels (or both). List of legal\u00a0target entity label types for this relationship.<\/td>\n<\/tr>\n<tr>\n<td>classificationLabels<\/td>\n<td>string[]<\/td>\n<td>Required if entityLabels is blank. List of document level classification labels.<\/td>\n<\/tr>\n<tr>\n<td>entityAnnotations<\/td>\n<td>object[]<\/td>\n<td>Optional. Array of entity annotations to pre-annotate input text with.<\/td>\n<\/tr>\n<tr>\n<td>entityAnnotations[].id<\/td>\n<td>string<\/td>\n<td>Required. Unique identifier for this entity annotation. 
Used to reference this entity in relationshipAnnotations.<\/td>\n<\/tr>\n<tr>\n<td>entityAnnotations[].start<\/td>\n<td>number<\/td>\n<td>Required. Start rune offset of this entity annotation.<\/td>\n<\/tr>\n<tr>\n<td>entityAnnotations[].end<\/td>\n<td>number<\/td>\n<td>Required. End rune offset of this entity annotation.<\/td>\n<\/tr>\n<tr>\n<td>entityAnnotations[].text<\/td>\n<td>string<\/td>\n<td>Required. Text content between start and end rune offset.<\/td>\n<\/tr>\n<tr>\n<td>entityAnnotations[].label<\/td>\n<td>string<\/td>\n<td>Required. Associated entity label name (from the names in entityLabels).<\/td>\n<\/tr>\n<tr>\n<td>entityAnnotations[].labelCategory<\/td>\n<td>string<\/td>\n<td>Optional. Associated entity label category (from the categories in entityLabels).<\/td>\n<\/tr>\n<tr>\n<td>relationshipAnnotations<\/td>\n<td>object[]<\/td>\n<td>Optional. Array of relationship annotations.<\/td>\n<\/tr>\n<tr>\n<td>relationshipAnnotations[].sourceEntityAnnotationId<\/td>\n<td>string<\/td>\n<td>Required. Source entity annotation ID for this relationship.<\/td>\n<\/tr>\n<tr>\n<td>relationshipAnnotations[].targetEntityAnnotationId<\/td>\n<td>string<\/td>\n<td>Required. Target entity annotation ID for this relationship.<\/td>\n<\/tr>\n<tr>\n<td>relationshipAnnotations[].label<\/td>\n<td>string<\/td>\n<td>Required. Associated relationship label name.<\/td>\n<\/tr>\n<tr>\n<td>classificationAnnotations<\/td>\n<td>string[]<\/td>\n<td>Optional. Array of classifications to pre-annotate the document with.<\/td>\n<\/tr>\n<tr>\n<td>meta<\/td>\n<td>object<\/td>\n<td>Optional. Additional configuration parameters.<\/td>\n<\/tr>\n<tr>\n<td>meta.instructions<\/td>\n<td>string<\/td>\n<td>Optional. Instructions for the labeling annotator in Markdown format.<\/td>\n<\/tr>\n<tr>\n<td>meta.disableSubmitConfirmation<\/td>\n<td>boolean<\/td>\n<td>Optional. 
Set to true to disable submit confirmation modal.<\/td>\n<\/tr>\n<tr>\n<td>meta.multiClassification<\/td>\n<td>boolean<\/td>\n<td>Optional. Set to true to enable multi-label mode for classificationLabels.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here are a few sample documents to get a better sense of this input format.<\/p>\n<p>Documents that adhere to this schema are provided to Ground Truth as individual line items in an input manifest.<\/p>\n<h3>Output Document Format<\/h3>\n<p>The output format is designed to feed back easily into a new annotation task. Optional fields in the output document are set if they are also set in the input document. The only difference between the input and output formats is the\u00a0<code>meta<\/code> object.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\n  text: string;\n  tokenRows?: string[][];\n  documentId?: string;\n  entityLabels?: {\n    name: string;\n    shortName?: string;\n    category?: string;\n    shortCategory?: string;\n    color?: string;\n  }[];\n  relationshipLabels?: {\n    name: string;\n    allowedRelationships?: {\n        sourceEntityLabelCategories?: string[];\n        targetEntityLabelCategories?: string[];\n        sourceEntityLabels?: string[];\n        targetEntityLabels?: string[];\n    }[];\n  }[];\n  classificationLabels?: string[];\n  entityAnnotations?: {\n    id: string;\n    start: number;\n    end: number;\n    text: string;\n    labelCategory?: string;\n    label: string;\n  }[];\n  relationshipAnnotations?: {\n    sourceEntityAnnotationId: string;\n    targetEntityAnnotationId: string;\n    label: string;\n  }[];\n  classificationAnnotations?: string[];\n  meta: {\n    instructions?: string;\n    disableSubmitConfirmation?: boolean;\n    multiClassification: boolean;\n    runes: string[];\n    rejected: boolean;\n    rejectedReason: string;\n  }\n}<\/code><\/pre>\n<\/p><\/div>\n<table 
border=\"1px\">\n<tbody>\n<tr>\n<td><span><strong>Field<\/strong><\/span><\/td>\n<td><span><strong>Type<\/strong><\/span><\/td>\n<td><span><strong>Description<\/strong><\/span><\/td>\n<\/tr>\n<tr>\n<td>meta.rejected<\/td>\n<td>boolean<\/td>\n<td>Is set to true if the annotator rejected this document.<\/td>\n<\/tr>\n<tr>\n<td>meta.rejectedReason<\/td>\n<td>string<\/td>\n<td>Annotator\u2019s reason given for rejecting the document.<\/td>\n<\/tr>\n<tr>\n<td>meta.runes<\/td>\n<td>string[]<\/td>\n<td>Array of runes accounting for all of the characters in the input text. Used to calculate entity annotation start and end offsets.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here is a sample output document that\u2019s been annotated:<\/p>\n<h3>Runes note:<\/h3>\n<p>A \u201crune\u201d in this context is a single highlight-able character in text, including multi-byte characters such as emoji.<\/p>\n<ul>\n<li>Because different programming languages represent multi-byte characters differently, using \u201cRunes\u201d to define every highlight-able character as a single atomic element means that we have an unambiguous way to describe any given text selection.<\/li>\n<li>For example, Python treats the Swedish flag as four characters:<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image008.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31595\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image008.jpg\" alt=\"\" width=\"428\" height=\"62\"><\/a><br \/>But JavaScript treats the same emoji as two characters<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image010.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31596\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/ML-4415-image010.jpg\" alt=\"\" width=\"452\" height=\"56\"><\/a><\/li>\n<\/ul>\n<p>To eliminate any ambiguity, we will treat the Swedish flag (and all other emoji and multi-byte characters) as a single atomic element.<\/p>\n<ul>\n<li>Offset: Rune position relative to Input Text (starting with index 0)<\/li>\n<\/ul>\n<h2>Performing NER Annotations with Ground Truth<\/h2>\n<p>As a fully managed data labeling service, Ground Truth builds training datasets for ML. For this use case, we use Ground Truth to send a collection of text documents to a pool of workers for annotation. Finally, we review for quality.<\/p>\n<p>Ground Truth can be configured to build a data labeling job using the new NER tool as a custom template.<\/p>\n<p>Specifically, we will:<\/p>\n<ol>\n<li>Create a private labeling workforce of workers to perform the annotation task<\/li>\n<li>Create a Ground Truth input manifest with the documents we want to annotate and then upload it to <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service (Amazon S3)<\/a><\/li>\n<li>Create pre-labeling task and post-labeling task Lambda functions<\/li>\n<li>Create a Ground Truth labeling job using the custom NER template<\/li>\n<li>Annotate documents<\/li>\n<li>Review results<\/li>\n<\/ol>\n<h2>NER Tool Resources<\/h2>\n<p>A complete list of referenced resources and sample documents can be found in the following chart:<\/p>\n<h2>Labeling Workforce Creation<\/h2>\n<p>Ground Truth uses SageMaker labeling workforces to manage workers and distribute tasks. 
Create a private workforce, a worker team called ner-worker-team, and assign yourself to the team using the instructions found in <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-workforce-create-private-console.html#create-workforce-sm-console\" target=\"_blank\" rel=\"noopener noreferrer\">Create a Private Workforce (Amazon SageMaker Console)<\/a>.<\/p>\n<p>Once you\u2019ve added yourself to a private workforce and confirmed your email, note the worker portal URL from the AWS Management Console:<\/p>\n<ul>\n<li>Navigate to <code>SageMaker<\/code><\/li>\n<li>Navigate to <code>Ground Truth \u2192 Labeling workforces<\/code><\/li>\n<li>Select the <code>Private<\/code> tab<\/li>\n<li>Note the URL\u00a0<code>Labeling portal sign-in URL<\/code><\/li>\n<\/ul>\n<p>Log in to the worker portal to view and start work on labeling tasks.<\/p>\n<h3>Input Manifest<\/h3>\n<p>The Ground Truth input data manifest is a JSON-lines file where each line contains a single worker task. In our case, each line will contain a single JSON encoded Input Document containing the text that we want to annotate and the NER annotation schema.<\/p>\n<p>Download a sample input manifest <code>reviews.manifest<\/code> from\u00a0<a href=\"https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/sample-data\/reviews.manifest\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/sample-data\/reviews.manifest<\/a><\/p>\n<p><strong>Note<\/strong>: each row in the input manifest needs a top-level key <code>source<\/code> or <code>source-ref<\/code>. 
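<\/p>\n<p>To make the schema concrete, here is a hypothetical manifest line (wrapped for readability) that embeds a minimal Input Document under the <code>source<\/code> key. The review text, entity labels, and relationship label shown are illustrative only and are not taken from the sample data:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\"source\": {\"text\": \"The pool was lovely.\",
  \"entityLabels\": [{\"name\": \"Pool\", \"category\": \"Facilities\"},
                   {\"name\": \"Positive\", \"category\": \"Sentiment\"}],
  \"relationshipLabels\": [{\"name\": \"hasSentiment\",
    \"allowedRelationships\": [{\"sourceEntityLabelCategories\": [\"Facilities\"],
      \"targetEntityLabelCategories\": [\"Sentiment\"]}]}]}}<\/code><\/pre>\n<\/div>\n<p>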
You can learn more in <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-input-data-input-manifest.html\" target=\"_blank\" rel=\"noopener noreferrer\">Use an Input Manifest File<\/a> in the Amazon SageMaker Developer Guide.<\/p>\n<h3>Upload Input Manifest to Amazon S3<\/h3>\n<p>Upload this input manifest to an S3 bucket using the AWS Management Console or from the command line, replacing <code>your-bucket<\/code> with an actual bucket name.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">aws s3 cp reviews.manifest s3:\/\/your-bucket\/ner-input\/reviews.manifest<\/code><\/pre>\n<\/p><\/div>\n<h3>Download custom worker template<\/h3>\n<p>Download the NER tool custom worker template from\u00a0<a href=\"https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/worker-template.liquid.html\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/worker-template.liquid.html<\/a> by viewing the source and saving the contents locally, or from the command line:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">wget https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/worker-template.liquid.html<\/code><\/pre>\n<\/p><\/div>\n<h3>Create pre-labeling task and post-labeling task Lambda functions<\/h3>\n<p>Download the sample pre-labeling task Lambda function:\u00a0<code>smgt-ner-pre-labeling-task-lambda.py<\/code> from\u00a0<a href=\"https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/sample-scripts\/smgt-ner-pre-labeling-task-lambda.py\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/sample-scripts\/smgt-ner-pre-labeling-task-lambda.py<\/a><\/p>\n<p>Download the sample post-labeling task Lambda function:\u00a0<code>smgt-ner-post-labeling-task-lambda.py<\/code> from\u00a0<a href=\"https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/sample-scripts\/smgt-ner-post-labeling-task-lambda.py\" target=\"_blank\" rel=\"noopener 
https:">
noreferrer\">https:\/\/assets.solutions-lab.ml\/NER\/0.2.1\/sample-scripts\/smgt-ner-post-labeling-task-lambda.py<\/a><\/p>\n<ul>\n<li>Create the pre-labeling task Lambda function from the AWS Management Console:\n<ul>\n<li>Navigate to <code>Lambda<\/code><\/li>\n<li>Select <code>Create function<\/code><\/li>\n<li>Specify <code>Function name<\/code> as\u00a0<code>smgt-ner-pre-labeling-task-lambda<\/code><\/li>\n<li>Select <code>Runtime<\/code> \u2192 <code>Python 3.6<\/code><\/li>\n<li>Select <code>Create function<\/code><\/li>\n<li>In <code>Function code<\/code> \u2192 <code>lambda_function.py<\/code>, paste the contents of\u00a0<code>smgt-ner-pre-labeling-task-lambda.py<\/code><\/li>\n<li>Select <code>Deploy<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Create the post-labeling task Lambda function from the AWS Management Console:\n<ul>\n<li>Navigate to <code>Lambda<\/code><\/li>\n<li>Select <code>Create function<\/code><\/li>\n<li>Specify <code>Function name<\/code> as\u00a0<code>smgt-ner-post-labeling-task-lambda<\/code><\/li>\n<li>Select <code>Runtime<\/code> \u2192 <code>Python 3.6<\/code><\/li>\n<li>Expand <code>Change default execution role<\/code><\/li>\n<li>Select\u00a0<code>Create a new role from AWS policy templates<\/code><\/li>\n<li>Enter the <code>Role name<\/code>: <code>smgt-ner-post-labeling-task-lambda-role<\/code><\/li>\n<li>Select <code>Create function<\/code><\/li>\n<li>Select the <code>Permissions<\/code> tab<\/li>\n<li>Select the <code>Role name<\/code>: <code>smgt-ner-post-labeling-task-lambda-role<\/code> to open the IAM console<\/li>\n<li>Add two policies to the role:\n<ul>\n<li>Select <code>Attach policies<\/code><\/li>\n<li>Attach the\u00a0<code>AmazonS3FullAccess<\/code> policy<\/li>\n<li>Select <code>Add inline policy<\/code><\/li>\n<li>Select the <code>JSON<\/code> tab<\/li>\n<li>Paste in the following inline policy:\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\n    \"Version\": \"2012-10-17\",\n    \"Statement\": {\n        
\"Effect\": \"Allow\",\n        \"Action\": \"sts:AssumeRole\",\n        \"Resource\": \"arn:aws:iam::YOUR_ACCOUNT_NUMBER:role\/service-role\/AmazonSageMaker-ExecutionRole-*\"\n    }\n}<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<\/ul>\n<\/li>\n<li>Navigate back to the <code>smgt-ner-post-labeling-task-lambda<\/code> Lambda function configuration page<\/li>\n<li>Select the <code>Configuration<\/code> tab<\/li>\n<li>In <code>Function code<\/code> \u2192 l<code>ambda_hanadler.py<\/code>, paste the contents of\u00a0<code>smgt-ner-post-labeling-task-lambda.py<\/code><\/li>\n<li>Select <code>Deploy<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Create a Ground Truth labeling job<\/h3>\n<p>From the AWS Management Console:<\/p>\n<ul>\n<li>Navigate to the\u00a0<code>Amazon SageMaker<\/code> service<\/li>\n<li>Navigate to\u00a0<code>Ground Truth<\/code> \u2192 <code>Labeling Jobs<\/code>.<\/li>\n<li>Select <code>Create labeling job<\/code><\/li>\n<li>Specify a <code>Job Name<\/code><\/li>\n<li>Select <code>Manual Data Setup<\/code><\/li>\n<li>Specify the Input dataset location where you uploaded the input manifest earlier (e.g., s<code>3:\/\/your-bucket\/ner-input\/sample-smgt-input-manifest.jsonl<\/code>)<\/li>\n<li>Specify the Output dataset location to point to a different folder in the same bucket (e.g., <code>s3:\/\/your-bucket\/ner-output\/<\/code>)<\/li>\n<li>Specify an <code>IAM Role<\/code> by selecting <code>Create new role<\/code>\n<ul>\n<li>Allow this role to access any S3 bucket by selecting <code>S3 buckets you specify<\/code> \u2192 <code>Any S3 bucket<\/code> when creating the policy<\/li>\n<li>In a new AWS Management Console window, open the <code>IAM<\/code> console and select <code>Roles<\/code><\/li>\n<li>Search for the name of the role that you just created (for example, <code>AmazonSageMaker-ExecutionRole-20210301T154158<\/code>)<\/li>\n<li>Select the role name to open the role in the console<\/li>\n<li>Attach the following three policies:\n<ul>\n<li>Select 
Attach policies<\/li>\n<li>Attach the <code>AWSLambda_FullAccess<\/code> policy to the role<\/li>\n<li>Select <code>Trust Relationships<\/code> \u2192 <code>Edit Trust Relationships<\/code><\/li>\n<li>Edit the trust relationship JSON<\/li>\n<li>Replace <code>YOUR_ACCOUNT_NUMBER<\/code> with your numerical AWS Account number, to read:\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Effect\": \"Allow\",\n      \"Principal\": {\n        \"Service\": \"sagemaker.amazonaws.com\"\n      },\n      \"Action\": \"sts:AssumeRole\"\n    },\n    {\n      \"Effect\": \"Allow\",\n      \"Principal\": {\n        \"AWS\": \"arn:aws:iam::YOUR_ACCOUNT_NUMBER:role\/service-role\/smgt-ner-post-labeling-task-lambda-role\"\n      },\n      \"Action\": \"sts:AssumeRole\"\n    }\n  ]\n}<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Save the trust relationship<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Return to the new Ground Truth job in the previous AWS Management Console window: under <code>Task Category<\/code>, select <code>Custom<\/code><\/li>\n<li>Select <code>Next<\/code><\/li>\n<li>Select <code>Worker types<\/code>: <code>Private<\/code><\/li>\n<li>Select the <code>Private team<\/code>: <code>ner-worker-team<\/code> that was created in the preceding section<\/li>\n<li>In the <code>Custom labeling task setup<\/code> text area, clear the default content and paste in the content of the <code>worker-template.liquid.html<\/code> file obtained earlier<\/li>\n<li>Specify the\u00a0<code>Pre-labeling task Lambda function<\/code> with the previously created function: <code>smgt-ner-pre-labeling-task-lambda<\/code><\/li>\n<li>Specify the <code>Post-labeling task Lambda function<\/code> with the function created earlier: <code>smgt-ner-post-labeling-task-lambda<\/code><\/li>\n<li>Select <code>Create<\/code><\/li>\n<\/ul>\n<h3>Annotate documents<\/h3>\n<p>Once the Ground Truth job is created, we can start annotating documents. 
Open the worker portal for the workforce created earlier: in the AWS Management Console, navigate to <code>SageMaker<\/code> \u2192 <code>Ground Truth<\/code> \u2192 <code>Labeling workforces<\/code>, select the <code>Private<\/code> tab, and open the\u00a0<code>Labeling portal sign-in URL<\/code>.<\/p>\n<p>Sign in and select the first labeling task in the table, then select \u201cStart working\u201d to open the annotator.\u00a0Perform your annotations and select <code>Submit<\/code> for each of the three sample documents.<\/p>\n<h3>Review results<\/h3>\n<p>As Ground Truth annotators complete tasks, results will be available in the output S3 bucket:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">s3:\/\/your-bucket\/path-to-your-ner-job\/annotations\/worker-response\/iteration-1\/0\/<\/code><\/pre>\n<\/p><\/div>\n<p>Once all tasks for a labeling job are complete, the consolidated output is available in the <code>output.manifest<\/code> file located here:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">s3:\/\/your-bucket\/path-to-your-ner-job\/manifests\/output\/output.manifest<\/code><\/pre>\n<\/p><\/div>\n<p>This output manifest is a JSON-lines file with one annotated text document per line in the \u201cOutput Document Format\u201d specified previously. This file is also compatible with the \u201cInput Document Format\u201d, so it can be fed directly into a subsequent Ground Truth job for another round of annotation. Alternatively, it can be parsed and sent to an ML training job. 
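<\/p>\n<p>Because the output manifest is plain JSON lines, parsing it requires only a few lines of code. The sketch below shows the general pattern; the record fields (<code>source<\/code> and a job-specific label attribute such as <code>ner-job<\/code>) are illustrative placeholders, since the actual keys depend on the label attribute name chosen for your job.<\/p>

```python
import json


def parse_output_manifest(text):
    """Parse a Ground Truth JSON-lines output manifest into a list of dicts.

    Each non-empty line is one JSON document containing the source text plus
    the consolidated annotations stored under the job's label attribute name.
    """
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Tiny illustrative manifest; real records follow the "Output Document Format"
# described above, and the label attribute name ("ner-job" here) is job-specific.
sample = '{"source": "Our stay was nice.", "ner-job": {"entities": []}}\n'
records = parse_output_manifest(sample)
print(records[0]["source"])  # -> Our stay was nice.
```

<p>From here, each record can be converted into whatever format your training framework expects, or re-uploaded as the input manifest of a follow-up labeling job.<\/p>\n<p>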
Some scenarios where we might employ a second round of annotations are:<\/p>\n<ul>\n<li>Breaking the annotation process into two steps, where the first annotator identifies entity annotations and the second annotator draws relationships<\/li>\n<li>Taking a sample of our <code>output.manifest<\/code> and sending it to a second, more experienced annotator for review as a quality control check<\/li>\n<\/ul>\n<h3>Custom Ground Truth Annotation Templates<\/h3>\n<p>The NER annotation tool described in this document is implemented as a custom Ground Truth annotation template. AWS customers can build their own custom annotation interfaces using the custom labeling workflows feature of Ground Truth.<\/p>\n<h2>Conclusion<\/h2>\n<p>By working together, Booking.com and the Amazon MLSL were able to develop a powerful text annotation tool capable of creating complex named-entity recognition and relationship annotations.<\/p>\n<p>We encourage AWS customers with an NER text annotation use case to try the tool described in this post. If you\u2019d like help accelerating the use of ML in your products and services, please contact the <a href=\"https:\/\/aws.amazon.com\/ml-solutions-lab\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Machine Learning Solutions Lab<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/Daniel-Noble.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-31598 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/Daniel-Noble.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Dan Noble<\/strong> is a Software Development Engineer at Amazon, where he helps build delightful user experiences. 
In his spare time, he enjoys reading, exercising, and having adventures with his family.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/pri-nonis.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-31600 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/pri-nonis.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Pri Nonis<\/strong> is a Deep Learning Architect at the Amazon ML Solutions Lab, where he works with customers across various verticals, helping them accelerate their cloud migration journeys and solve their ML problems using state-of-the-art solutions and technologies.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/Niharika-Jayanthi.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-31599 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/06\/Niharika-Jayanthi.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Niharika Jayanthi<\/strong> is a Front End Engineer at AWS, where she develops custom annotation solutions for Amazon SageMaker customers. Outside of work, she enjoys going to museums and working out.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/14\/beka.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-32341 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/14\/beka.jpg\" alt=\"\" width=\"100\" height=\"108\"><\/a><strong>Amit Beka<\/strong> is a Machine Learning Manager at <a title=\"http:\/\/Booking.com\" href=\"http:\/\/booking.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Booking.com<\/a>, with over 15 years of experience in software development and machine learning. 
He is fascinated with people and languages, and how computers are still puzzled by both.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/labeling-text-for-aspect-based-sentiment-analysis-using-sagemaker-ground-truth\/<\/p>\n","protected":false},"author":0,"featured_media":1482,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1481"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1481"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1481\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1482"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1481"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1481"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1481"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}