{"id":861,"date":"2021-09-17T06:56:44","date_gmt":"2021-09-17T06:56:44","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/17\/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend\/"},"modified":"2021-09-17T06:56:44","modified_gmt":"2021-09-17T06:56:44","slug":"custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/09\/17\/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend\/","title":{"rendered":"Custom document annotation for extracting named entities in documents using Amazon Comprehend"},"content":{"rendered":"<div id=\"\">\n<p>Intelligent document processing (IDP), as defined by IDC, is an approach by which unstructured content and structured data is analyzed and extracted for use in downstream applications. IDP involves document reading, categorization, and data extraction, by using AI\u2019s processes of computer vision (CV), Optical Character Recognition (OCR), and natural language processing (NLP) on provided texts.[1] Companies in financial services, healthcare, and manufacturing process millions of documents each year and find the process to be painstaking. Document processing is manual, slow, expensive, and error-prone, and data is often spread across disparate sources. As a result, creating and managing a document processing pipeline remains a challenge for many companies.<\/p>\n<p><a href=\"https:\/\/aws.amazon.com\/comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Comprehend<\/a> is an NLP service that provides APIs to extract key phrases, contextual entities, events, and sentiment from documents. <em>Entities<\/em> refer to things in your document such as people, places, organizations, credit card numbers, and so on. 
But what if you want to identify entity types unique to your business, like proprietary product codes or industry-specific terms? Custom entity recognition in Amazon Comprehend enables you to train models to extract entities that are unique to your business in just a few easy steps. Historically, you could only use Amazon Comprehend APIs on plain text documents. If you wanted to process Word or PDF documents, you first needed to preprocess or flatten the documents into a plain text format, which often reduces the quality of the context within the document.<\/p>\n<p>Today, Amazon Comprehend launched a feature to help organizations extract insights and automate processing documents of different formats (PDF, Word, plain text) and layouts (bullets, lists). You can now use custom entity recognition in Amazon Comprehend directly on more document types (scanned PDF, machine-readable PDF, and Word documents) without needing to convert your files to plain text. Custom entity recognition can now extract custom entities and process varying document layouts such as dense text and lists or bullets in PDF, Word, and plain text documents\u2014at no additional cost.<\/p>\n<p>To train a custom entity recognition model that you can use on your PDF, Word, and plain text documents, you need to first annotate PDF documents using a custom\u00a0<a class=\"c-link\" href=\"https:\/\/aws.amazon.com\/sagemaker\/groundtruth\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-sk=\"tooltip_parent\">Amazon SageMaker Ground Truth<\/a> annotation template provided by Amazon Comprehend. The custom template renders a PDF and allows you to annotate directly on the document. 
The custom model leverages both the natural language and the structural or positional information (coordinates) of the text, so that it can accurately extract custom entities that might previously have been lost when a document was flattened.<\/p>\n<p>In this post, we walk through the steps of setting up the custom annotation template and show examples of how to annotate finance documents (SEC filings). After you annotate your data, you can use the generated manifest file to train a custom entity recognition model.<\/p>\n<h2>Feature overview<\/h2>\n<p>We cover the following steps:<\/p>\n<ol>\n<li>Install the annotation tool.<\/li>\n<li>Upload training documents.<\/li>\n<li>Create a private workforce to label the training documents.<\/li>\n<li>Set up the Ground Truth labeling job.<\/li>\n<li>Access the annotation login portal to start labeling documents.<\/li>\n<li>Label documents with the custom template.<\/li>\n<\/ol>\n<p>Only PDF files are used for training a custom entity recognition model, but you can use that model on plain text, Word, and PDF documents during inference (for now, asynchronous processing only).<\/p>\n<p>The custom annotation template is available to download on <a href=\"https:\/\/github.com\/aws-samples\/amazon-comprehend-semi-structured-documents-annotation-tools\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<\/a>. You need to Git clone or download this application and install it locally or in an <a href=\"https:\/\/aws.amazon.com\/cloud9\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Cloud9<\/a> integrated development environment (IDE). AWS Cloud9 is a cloud-based IDE that lets you write, run, and debug your code with just a browser. 
This annotation tool needs to be built and deployed using Python and the <a href=\"https:\/\/aws.amazon.com\/serverless\/sam\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Serverless Application Model<\/a> (AWS SAM); the AWS SAM CLI lets you locally build, test, and debug serverless applications that are defined by AWS SAM templates.<\/p>\n<h2>Prerequisites<\/h2>\n<p>You need to install the following to build and deploy this solution:<\/p>\n<ol>\n<li><a href=\"https:\/\/github.com\/pyenv\/pyenv\" target=\"_blank\" rel=\"noopener noreferrer\">Install<\/a> Python 3.8.x.<\/li>\n<li><a href=\"https:\/\/stedolan.github.io\/jq\/download\/\" target=\"_blank\" rel=\"noopener noreferrer\">Install<\/a> jq.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/serverless-application-model\/latest\/developerguide\/serverless-sam-cli-install.html\" target=\"_blank\" rel=\"noopener noreferrer\">Install<\/a> the AWS SAM CLI.<\/li>\n<li>Make sure you have <a href=\"https:\/\/pypi.org\/project\/pipenv\/\" target=\"_blank\" rel=\"noopener noreferrer\">pip installed<\/a>.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/install-cliv2.html\" target=\"_blank\" rel=\"noopener noreferrer\">Install and configure<\/a> the AWS CLI.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/cli-configure-files.html\" target=\"_blank\" rel=\"noopener noreferrer\">Configure<\/a> your AWS credentials.<\/li>\n<\/ol>\n<p>If you\u2019re setting up this tool in AWS Cloud9, you can skip installing the AWS CLI and AWS SAM CLI, because they come preinstalled with the environment. 
You just need to <a href=\"https:\/\/www.webagesolutions.com\/blog\/how-to-create-python-environment-in-cloud-9\" target=\"_blank\" rel=\"noopener noreferrer\">create a virtual environment<\/a> to use Python 3.8 in AWS Cloud9.<\/p>\n<h2>Install the annotation tool<\/h2>\n<p>Download the annotation tool from <a href=\"https:\/\/github.com\/aws-samples\/amazon-comprehend-semi-structured-documents-annotation-tools\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub<\/a>. You can either Git clone this repository or download the setup as a .zip file to your local machine. The .zip file contains all the components required to build out the custom annotation workflow to annotate PDFs in Ground Truth.<\/p>\n<p>First, you run and configure this tool from the AWS CLI. With AWS SAM, we build and deploy this tool to create an <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> template. The CloudFormation template configures resources such as an <a href=\"https:\/\/aws.amazon.com\/lambda\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> function, <a href=\"https:\/\/aws.amazon.com\/iam\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) roles and permissions to create an <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> labeling job, and your <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket.<\/p>\n<p>After you unzip the tool, you need to run the build and deploy commands from within the annotation tool folder. 
In the CLI, run <code>cd ComprehendSSIEAnnotationTool<\/code> to change your directory to the annotation tool folder.<\/p>\n<ol>\n<li>Run the following commands to build the tool:<\/li>\n<\/ol>\n<p><code>make bootstrap<\/code><\/p>\n<p><code>make build<\/code><\/p>\n<ol start=\"2\">\n<li>After the build is successful, run the following command to deploy:<\/li>\n<\/ol>\n<p><code>make deploy-guided<\/code><\/p>\n<ol start=\"3\">\n<li>For <strong>Stack Name<\/strong>, enter <code>comprehend-annotation-tool<\/code>.<\/li>\n<li>For <strong>AWS Region<\/strong>, enter the Region you\u2019re in.<\/li>\n<li>Leave the remaining parameters as default.<\/li>\n<li>For <strong>Confirm changes before deploy<\/strong>, enter <code>Y<\/code>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28106\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/1-5033-setting-default.jpg\" alt=\"\" width=\"800\" height=\"180\"><\/p>\n<ol start=\"7\">\n<li>For <strong>Allow<\/strong> <strong>SAM CLI IAM role creation<\/strong>, enter <code>Y<\/code>.<\/li>\n<\/ol>\n<p>After you complete these steps, the AWS CloudFormation resources are deployed in your AWS account, which you use to set up the annotation jobs. 
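<\/p>\n<p>Once the stack is deployed, you need the name of the S3 bucket it created, which a later step asks you to note. The following is a minimal sketch of pulling bucket-named outputs from the JSON returned by <code>aws cloudformation describe-stacks --stack-name comprehend-annotation-tool<\/code>; the <code>OutputKey<\/code> value shown is an assumption, so check your own stack\u2019s outputs on the AWS CloudFormation console.<\/p>

```python
import json

def bucket_outputs(describe_stacks_json):
    # Filter the stack outputs (from `aws cloudformation describe-stacks
    # --stack-name comprehend-annotation-tool`) down to S3 bucket names.
    # Matching on 'bucket' in the OutputKey is a heuristic assumption;
    # inspect your stack's actual output keys to confirm.
    stack = json.loads(describe_stacks_json)['Stacks'][0]
    return {o['OutputKey']: o['OutputValue']
            for o in stack.get('Outputs', [])
            if 'bucket' in o['OutputKey'].lower()}

# Example payload shaped like the CLI response (key and value are placeholders).
sample = ('{"Stacks": [{"Outputs": ['
          '{"OutputKey": "SemiStructuredDocumentsS3Bucket",'
          ' "OutputValue": "comprehend-semi-structured-documents-us-east-1-123456789012"}]}]}')
```

<p>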
If deployment fails, delete the CloudFormation stack via the AWS CloudFormation console and try <code>make deploy-guided<\/code> again from your local machine or AWS Cloud9 IDE to reinstall this tool.<\/p>\n<ol start=\"8\">\n<li>When the status of the CloudFormation deployment changes from <code>In-Progress<\/code> to <code>Complete<\/code> (which usually takes a few minutes), note the S3 bucket name created by the CloudFormation template.<\/li>\n<\/ol>\n<p>You need this S3 bucket to upload training documents that are used for annotation in the next step.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28107\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/2-5033-Outputs.jpg\" alt=\"\" width=\"800\" height=\"430\">\u00a0<\/strong><\/p>\n<h2>Upload the training documents<\/h2>\n<p>You can run the following command on your local machine to copy local files to the S3 bucket created in the previous step. As a reminder, these training documents should be PDF documents only.<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">aws s3 cp --recursive <em><span>&lt;local-path-to-source-docs&gt;<\/span><\/em> s3:\/\/comprehend-semi-structured-documents-${AWS_REGION}-${AWS_ACCOUNT_ID}\/source-semi-structured-documents\/<\/code><\/pre>\n<\/div>\n<p>Alternatively, you can upload documents directly to the S3 bucket created by the CloudFormation template by creating a folder named <code>source-semi-structured-documents<\/code>. Refer to this <a href=\"https:\/\/docs.aws.amazon.com\/redshift\/latest\/dg\/tutorial-loading-data-upload-files.html\" target=\"_blank\" rel=\"noopener noreferrer\">link<\/a> to learn how to upload files to Amazon S3 and create folders.<\/p>\n<h2>Create a private workforce<\/h2>\n<p>You need to create a private workforce in Ground Truth, which labels the training documents you uploaded in Amazon S3. 
You can view or create a private workforce using the Ground Truth console. For detailed instructions, see the section \u201cCreating a private work team\u201d in <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/developing-ner-models-with-amazon-sagemaker-ground-truth-and-amazon-comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Developing NER models with Amazon SageMaker Ground Truth and Amazon Comprehend<\/a>. To follow along with the next steps, we recommend adding yourself to the private workforce. All new workers added to your private workforce receive an enrollment email with the URL, user name, and a temporary password to log in to the labeling portal. Your workers use this login information after we create the labeling job.<\/p>\n<p>After you create a private workforce, note the private workforce name, which we need when creating a Ground Truth labeling job using the AWS CLI.<\/p>\n<h2>Set up the Ground Truth labeling job<\/h2>\n<p>Ground Truth allows you to create custom labeling workflows, which you can use when the provided templates don\u2019t suit the requirements for your labeling efforts.<\/p>\n<p>To create a Ground Truth labeling job, you need to set up parameters such as the input Amazon S3 path, CloudFormation stack name, work team name, Region, job name prefix, entity types, and annotator metadata.<\/p>\n<ul>\n<li><code>input-s3-path<\/code>: The S3 URI of the source documents you copied earlier in the Upload the training documents step<\/li>\n<li><code>cfn-name<\/code>: The CloudFormation stack name entered in the Install the annotation tool step: <code>comprehend-annotation-tool<\/code><\/li>\n<li><code>work-team-name<\/code>: The private workforce name created in the previous step<\/li>\n<li><code>job-name-prefix<\/code>: The prefix for the SageMaker Ground Truth labeling job (limit: 29 characters). 
Extra text is appended to the job name prefix, for example <code>-labeling-job-task-20210902T232116<\/code><\/li>\n<li><code>entity-types<\/code>: The entities you would like to use during the labeling job (separated by commas)<\/li>\n<\/ul>\n<p>Run the following AWS CLI command to trigger a Ground Truth labeling job:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">python bin\/comprehend-ssie-annotation-tool-cli.py \\
    --input-s3-path s3:\/\/comprehend-semi-structured-documents-${AWS_REGION}-${AWS_ACCOUNT_ID}\/source-semi-structured-documents\/ \\
    --cfn-name comprehend-annotation-tool \\
    --work-team-name <em><span>&lt;enter private-work-team-name&gt;<\/span><\/em> \\
    --region ${AWS_REGION} \\
    --job-name-prefix \"${USER}-job\" \\
    --entity-types \"EntityTypeA,EntityTypeB,EntityTypeC\" \\
    --annotator-metadata \"key=Info,value=Sample information,key=Due Date,value=Sample date value 12\/12\/1212\"
<\/code><\/pre>\n<\/div>\n<p>You can view the labeling job on the SageMaker console. As shown in the following screenshot, our labeling job is created, in progress, and using a custom task type.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28108\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/3-5033-Labeling-jobs.jpg\" alt=\"\" width=\"800\" height=\"253\"><\/p>\n<p>You have now created a custom labeling job for annotating your PDF documents.<\/p>\n<p>Previously, processing PDF documents required converting them to raw text before sending them through Amazon Comprehend custom entity recognition models. However, the custom Ground Truth template you created allows you to interact with the document in its native PDF format. It does this by rendering a text layer on top of the PDF. 
This lets you reference an entity within a table or other structured format more easily and efficiently than in a document that has been converted to plain text. In addition to the user-friendly document rendering, the template also helps pick the right span for the text and provides the ability to wrap lines for entities that span several lines within the document.<\/p>\n<h2>Access the annotation login portal<\/h2>\n<p>You can access the annotation portal either by using the URL in your private workforce enrollment email or via the SageMaker console. To use the SageMaker console, complete the following steps:<\/p>\n<ol>\n<li>Choose <strong>Labeling workforces<\/strong> in the navigation pane.<\/li>\n<li>Choose <strong>Private<\/strong>.<\/li>\n<li>Choose the link under <strong>Labeling portal sign-in URL<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28109\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/4-5033-Private.jpg\" alt=\"\" width=\"800\" height=\"301\"><\/p>\n<ol start=\"4\">\n<li>Enter the email and password you received in your enrollment email.<\/li>\n<li>Choose <strong>Login<\/strong> to start labeling.<\/li>\n<\/ol>\n<h2>Label documents with the custom template<\/h2>\n<p>After you create the labeling job, the labeling workforce can access the documents and begin annotating the required entities. First, we walk you through the annotation UI template, then show examples of how to annotate finance documents (Securities and Exchange Commission (SEC) filings).<\/p>\n<h3>Explanation of the annotation UI template<\/h3>\n<p>On the annotation user interface, you can see one of your PDF documents open (see the following screenshot). On the top left, you can see the custom entity types that you provided during setup. 
To the right of your entity types, you can see arrows to navigate through your training documents. The document in the example below is blurred out to focus on the annotation UI.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28110\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/5-5033.jpg\" alt=\"\" width=\"800\" height=\"476\">\u00a0<\/strong><\/p>\n<p>Additionally, you have options to remove, undo, or auto tag your annotations on each document. To use auto tag, simply annotate a word and associate it with one of your custom entity types. If you choose <strong>Auto Tag<\/strong>, all other instances of that entity type (for example, all addresses) are automatically annotated with that entity type. After you annotate each PDF, choose <strong>Submit<\/strong> to save annotations before moving to the next document.<\/p>\n<p>The following screenshot shows the options within the right panel of the annotation interface. You can edit, copy, remove, and reset your annotations.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28111\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/6-5033.jpg\" alt=\"\" width=\"305\" height=\"351\"><\/p>\n<p>Now with this understanding of the annotation UI, we show end-to-end examples of how to annotate custom entities for finance documents (SEC filings).<\/p>\n<h3>Finance<\/h3>\n<p>Finance customers can now easily process bank statements and SEC filings to extract custom entities such as board of director names, acquisition price, earnings per share, and more. In the following example, we show how to extract the following entities from SEC Form S-3 (registration) and SEC Form 424B5 (prospectus filing): <code>OFFERING_PRICE<\/code> and <code>OFFERED_SHARES<\/code>. 
More specifically, we demonstrate how to annotate these forms so you can extract custom entities that appear in both a table and a paragraph and have multiple occurrences.<\/p>\n<p>The SEC dataset is available for download at <code>s3:\/\/aws-ml-blog\/artifacts\/custom-document-annotation-comprehend\/sources\/<\/code>. You can use this dataset to train a custom entity recognition model after you finish annotating. Copy the data from <code>s3:\/\/aws-ml-blog\/artifacts\/custom-document-annotation-comprehend\/sources\/<\/code> to the bucket created by AWS CloudFormation, <code>s3:\/\/comprehend-semi-structured-documents-${AWS_REGION}-${AWS_ACCOUNT_ID}\/source-semi-structured-documents\/<\/code>, by running the following command:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">aws s3 cp --recursive s3:\/\/aws-ml-blog\/artifacts\/custom-document-annotation-comprehend\/sources\/ s3:\/\/comprehend-semi-structured-documents-${AWS_REGION}-${AWS_ACCOUNT_ID}\/source-semi-structured-documents\/<\/code><\/pre>\n<\/div>\n<p>Go to the command prompt where you set up the annotation tool and run the following command to set up a labeling job for the entities <code>OFFERING_PRICE<\/code> and <code>OFFERED_SHARES<\/code> in the <code>entity-types<\/code> parameter:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">python bin\/comprehend-ssie-annotation-tool-cli.py \\
    --input-s3-path s3:\/\/comprehend-semi-structured-documents-${AWS_REGION}-${AWS_ACCOUNT_ID}\/source-semi-structured-documents\/ \\
    --cfn-name comprehend-annotation-tool \\
    --work-team-name <em><span>&lt;enter private-work-team-name&gt;<\/span><\/em> \\
    --region ${AWS_REGION} \\
    --job-name-prefix \"${USER}-job\" \\
    --entity-types \"OFFERING_PRICE,OFFERED_SHARES\" \\
    --annotator-metadata \"key=Info,value=Sample information,key=Due Date,value=Sample date value 12\/12\/1212\" \\
    --use-textract-only
\n<\/code><\/pre>\n<\/p><\/div>\n<p>Additional customizable options:<\/p>\n<ol>\n<li>Include\u00a0\u2013<code>-annotator-metadata<\/code>\u00a0parameter to reveal key-value information to annotators. Default metadata about the document is already revealed to the annotator within the UI side panel.<\/li>\n<li>Specify\u00a0\u2013<code>-use-textract-only<\/code>\u00a0flag to instruct the annotation tool to only use\u00a0<a href=\"https:\/\/docs.aws.amazon.com\/textract\/latest\/dg\/API_AnalyzeDocument.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract AnalyzeDocument API<\/a>\u00a0to parse the PDF document. By default, the tool tries to auto-detect what types of source PDF document format is, and use either\u00a0<a href=\"https:\/\/github.com\/jsvine\/pdfplumber\" target=\"_blank\" rel=\"noopener noreferrer\">PDFPlumber<\/a> (native PDF) or Amazon Textract (scanned PDF) to parse the PDF Documents. When creating the labeling job, a customer has the option of only using Amazon Textract for both the scenarios, which may be more accurate for text extraction, but comes at an additional cost (see <a href=\"https:\/\/aws.amazon.com\/textract\/pricing\/\" target=\"_blank\" rel=\"noopener noreferrer\">Textract Pricing<\/a>).<\/li>\n<\/ol>\n<p>This command triggers a Ground Truth labeling job for your private workforce. Now log in to your labeling portal. You\u2019re redirected to the annotation UI to annotate entities we specified in the labeling job. In the following screenshot, we show how entities such as <code>OFFERING_PRICE<\/code> and <code>OFFERED_SHARES<\/code> are annotated. Your labeling workforce uses the pointer to draw bounding boxes around the appropriate entities and labels them with the appropriate custom entity label. 
You must annotate the <code>OFFERING_PRICE<\/code> entity at every occurrence; in this case, it appears both within the dense text paragraph and in semi-structured (columnar) format.<\/p>\n<p><strong> <img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28112\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/7-5033-Offering_Price.jpg\" alt=\"\" width=\"800\" height=\"348\"><\/strong><\/p>\n<p>After you label all the pages, you can find annotations in JSON format in the Amazon S3 location <code>s3:\/\/comprehend-semi-structured-documents-us-east-1-<em><span>&lt;AWS Account number&gt;<\/span><\/em>\/output\/<em><span>&lt;your labeling job&gt;<\/span><\/em>\/annotations\/<\/code>.<\/p>\n<p>The user-labeled document information is also saved to an output manifest file in the Amazon S3 location provided during the setup of the custom labeling job. The output manifest file references all the annotations within your training documents. You can find your output manifest file in the Amazon S3 location <code>s3:\/\/comprehend-semi-structured-documents-us-east-1-<em><span>&lt;AWS Account number&gt;<\/span><\/em>\/output\/<em><span>&lt;your labeling job&gt;<\/span><\/em>\/manifests\/<\/code>. You use this manifest file to create an Amazon Comprehend custom entity recognition training job and train your custom model. 
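<\/p>\n<p>The manifest can be supplied to the Amazon Comprehend <code>CreateEntityRecognizer<\/code> API as an augmented manifest. The following is a minimal sketch of building such a request; the recognizer name, role ARN, S3 URIs, and attribute name passed in are placeholders you must replace with your own values, and you should verify the field values against your labeling job output.<\/p>

```python
def build_recognizer_request(name, role_arn, entity_types, manifest_s3_uri,
                             attribute_name, annotation_s3_uri, source_s3_uri):
    # Sketch of a CreateEntityRecognizer request that trains from the
    # augmented manifest produced by the labeling job. All ARNs and S3
    # URIs are placeholders; the attribute name is your labeling job name.
    return {
        'RecognizerName': name,
        'DataAccessRoleArn': role_arn,
        'LanguageCode': 'en',
        'InputDataConfig': {
            'DataFormat': 'AUGMENTED_MANIFEST',
            'EntityTypes': [{'Type': t} for t in entity_types],
            'AugmentedManifests': [{
                'S3Uri': manifest_s3_uri,
                'AttributeNames': [attribute_name],
                'DocumentType': 'SEMI_STRUCTURED_DOCUMENT',
                'AnnotationDataS3Uri': annotation_s3_uri,
                'SourceDocumentsS3Uri': source_s3_uri,
            }],
        },
    }

# To start training (requires AWS credentials and boto3):
# import boto3
# boto3.client('comprehend').create_entity_recognizer(**build_recognizer_request(...))
```

<p>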
For instructions, see <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Extract custom entities from more document types with Amazon Comprehend<\/a>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28113\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/8-5033-Output.jpg\" alt=\"\" width=\"800\" height=\"435\"><\/p>\n<h2>Review the output manifest file<\/h2>\n<p>When examining the output manifest file for the custom labeling job on PDF documents, we can see the contextual information (natural language and positional or structural information) that is gathered and used to train the Amazon Comprehend custom entity recognition model.<\/p>\n<p>First, let\u2019s look at the output generated from the custom entity labeling job of a different dataset. In the following screenshot, the output manifest file for the labeling job contains not only starting and ending offsets but also references to block IDs and child block IDs for each labeled entity, capturing the associated 2-D information of the entity (what the entity is and where it\u2019s located in the document). A block represents the layout coordinate (positional) information of the tokens within the document. A child block ID identifies the word blocks within the referenced block ID. 
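<\/p>\n<p>To make the offsets-plus-block-references idea concrete, the following is a toy walk over a manifest-style annotation record. The field names here are illustrative assumptions chosen to mirror the concepts described above (entity type, begin and end offsets, block IDs, child block IDs); they are not the exact schema the labeling job emits, so inspect your own manifest before relying on any key name.<\/p>

```python
import json

# Illustrative record: key names are assumptions mirroring the concepts
# above, not the labeling job's exact output schema.
record = json.loads('''{
  "Entities": [{
    "Type": "OFFERING_PRICE",
    "BeginOffset": 120, "EndOffset": 126,
    "BlockReferences": [{
      "BlockId": "block-1",
      "ChildBlockReferences": [{"ChildBlockId": "word-7"}]
    }]
  }]
}''')

def summarize(rec):
    # Collect (entity type, begin offset, end offset, block id) tuples,
    # pairing each entity's text span with its positional block reference.
    rows = []
    for ent in rec['Entities']:
        for ref in ent.get('BlockReferences', []):
            rows.append((ent['Type'], ent['BeginOffset'],
                         ent['EndOffset'], ref['BlockId']))
    return rows
```

<p>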
As shown in the highlighted code, the labeled entity contains information about where it is placed within the overall structure of the document.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-28118 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/9-5033-answer_revised.jpg\" alt=\"\" width=\"800\" height=\"482\"><\/p>\n<p>To compare, let\u2019s examine the output from a standard Ground Truth template for named entity recognition (NER). As shown in the following screenshot, the starting and ending offsets are given for each of the three entities labeled. However, no other contextual information, such as the coordinates or position of the text within the document, is provided. This is expected because the Ground Truth template for NER takes input files in plain text format, which removes all spatial information from the PDF document about the positioning of words relative to each other and to the document.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28119\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/10-5033_revised.jpg\" alt=\"\" width=\"800\" height=\"683\"><\/p>\n<h2>Labeling best practices<\/h2>\n<p>The following are best practices when annotating your data:<\/p>\n<ul>\n<li>Annotate your data with care and verify that you annotate every mention of the entity. You can use the Auto Tag feature so that every reference of the entity is annotated. Imprecise annotations can lead to poor results. In general, more annotations lead to better results.<\/li>\n<li>Input data should not contain duplicates, like a duplicate of a PDF you are going to annotate. 
The presence of a duplicate sample might result in test set contamination and could negatively affect the training process, model metrics, and model behavior.<\/li>\n<li>Make sure that all documents in the corpus are annotated, and that any documents without annotations lack legitimate entities rather than having been missed. For example, if you have a document containing \u201cJ Doe has been an engineer for 14 years,\u201d you should provide an annotation for \u201cJ Doe\u201d just as you would for \u201cJohn Doe.\u201d Failing to do so confuses the model and might lead it to not recognize \u201cJ Doe\u201d as an <code>ENGINEER<\/code>. Annotations should be consistent within the same document and across documents.<\/li>\n<li>Provide documents that resemble real use cases as closely as possible. Synthesized data with repetitive patterns should be avoided. The input data should be as diverse as possible to avoid overfitting and help the underlying model better generalize on real examples.<\/li>\n<li>It\u2019s important that documents are diverse in terms of word count. For example, if all documents in the training data are short, the resulting model has difficulty predicting entities in longer documents.<\/li>\n<li>Use the same data distribution for training as you expect to see at inference time, when you\u2019re actually detecting your custom entities. For example, at inference time, if you expect to be sending documents that have no entities in them, these should also be part of your training document set.<\/li>\n<li>We recommend a minimum of 250 documents and 100 annotations per entity to ensure good quality predictions. With more training data, you\u2019re more likely to produce a higher-quality model. If you want higher accuracy, we recommend increasing the volume of annotated data by 10% to further improve the accuracy. 
This improvement might be best visible to you by running inference on a held-out test set that remains unchanged and can be tested by different models. In this way, you can compare successive models.<\/li>\n<\/ul>\n<p>For additional suggestions, see <a href=\"https:\/\/docs.aws.amazon.com\/comprehend\/latest\/dg\/cer-metrics.html#cer-performance\" target=\"_blank\" rel=\"noopener noreferrer\">Improving Custom Entity Recognizer Performance<\/a><\/p>\n<h2>Conclusion<\/h2>\n<p>With this new contextual information included within labeled annotations, you can now train Amazon Comprehend custom entity recognition models with semi-structured information (bullets, lists, dense text). This additional structural context provides more information to the model to help identify the relevant entities within documents.<\/p>\n<p>For more information about how to train your custom model and the impact this additional information can have on custom Amazon Comprehend model performance, see <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Extract custom entities from documents in their native format with Amazon Comprehend<\/a>\u00a0and our <a href=\"https:\/\/docs.aws.amazon.com\/comprehend\/latest\/dg\/training-recognizers.html\" target=\"_blank\" rel=\"noopener noreferrer\">documentation<\/a>.<\/p>\n<h3>References<\/h3>\n<p>[1] IDC Survey Spotlight, What Is the Landscape of the Emerging Document Artificial Intelligence Market?,\u00a0Doc # US47701421, July 2021<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><b data-stringify-type=\"bold\"><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16219 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/Mona.jpg\" alt=\"\" width=\"100\" height=\"134\">Mona Mona <\/strong><\/b>is an AI\/ML Specialist Solutions 
Architect based out of Arlington, VA. She helps customers adopt machine learning on a large scale. She is passionate about NLP and ML Explainability areas in AI\/ML.<\/p>\n<p><b data-stringify-type=\"bold\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-28135 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/14\/Anant-Patel.jpg\" alt=\"\" width=\"101\" height=\"134\">Anant Patel<\/b>\u00a0is a Sr. Product Manager-Tech on the Amazon Comprehend team within AWS AI\/ML.<\/p>\n<p><b data-stringify-type=\"bold\"><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-full wp-image-11982\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/04\/23\/andrea-morton-youmans-100.jpg\" alt=\"\" width=\"100\" height=\"134\"><strong>Andrea Morton-Youmans <\/strong><\/b>is a Product Marketing Manager on the AI Services team at AWS. Over the past 10 years she has worked in the technology and telecommunications industries, focused on developer storytelling and marketing campaigns. 
In her spare time, she enjoys heading to the lake with her husband and Aussie dog Oakley, tasting wine and enjoying a movie from time to time.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend\/<\/p>\n","protected":false},"author":0,"featured_media":862,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/861"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=861"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/861\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/862"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=861"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=861"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=861"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}