{"id":1137,"date":"2021-11-03T08:40:18","date_gmt":"2021-11-03T08:40:18","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/03\/intelligently-split-multi-form-document-packages-with-amazon-textract-and-amazon-comprehend\/"},"modified":"2021-11-03T08:40:18","modified_gmt":"2021-11-03T08:40:18","slug":"intelligently-split-multi-form-document-packages-with-amazon-textract-and-amazon-comprehend","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/03\/intelligently-split-multi-form-document-packages-with-amazon-textract-and-amazon-comprehend\/","title":{"rendered":"Intelligently split multi-form document packages with Amazon Textract and Amazon Comprehend"},"content":{"rendered":"<div id=\"\">\n<p>Many organizations spanning different sizes and industry verticals still rely on large volumes of documents to run their day-to-day operations. To solve this business challenge, customers are using intelligent document processing services from AWS such as <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a> and <a href=\"https:\/\/aws.amazon.com\/comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Comprehend<\/a> to help with <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/extracting-custom-entities-from-documents-with-amazon-textract-and-amazon-comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">extraction and process automation<\/a>. Before you can extract text, key-value pairs, tables, and entities, you need to be able to split multipage PDF documents that often contain heterogeneous form types. 
For example, in mortgage processing, a broker or loan processing individual may need to split a consolidated PDF loan package, containing the mortgage application (Fannie Mae form 1003), W2s, income verification, 1040 tax forms, and more.<\/p>\n<p>To tackle this problem, organizations use rules-based processing: identifying document types via form titles, page numbers, form lengths, and so on. These approaches are error-prone and difficult to scale, especially when the form types may have several variations. Accordingly, these workarounds break down quickly in practice and increase the need for human intervention.<\/p>\n<p>In this post, we show how you can create your own document splitting solution with little code for any set of forms, without building custom rules or processing workflows.<\/p>\n<h2>Solution overview<\/h2>\n<p>For this post, we use a set of common mortgage application forms to demonstrate how you can use Amazon Textract and Amazon Comprehend to create an intelligent document splitter that is more robust than earlier approaches. When processing documents for mortgage applications, the borrower submits a multipage PDF that is made up of heterogeneous document types of varying page lengths; to extract information, the user (for example, a bank) must break down this PDF.<\/p>\n<p>Although we show a specific example for mortgage forms, you can generally scale and apply this approach to just about any set of multi-page PDF documents.<\/p>\n<p>We use Amazon Textract to extract data from the document and build an Amazon Comprehend compatible dataset to train a <a href=\"https:\/\/docs.aws.amazon.com\/comprehend\/latest\/dg\/how-document-classification.html\" target=\"_blank\" rel=\"noopener noreferrer\">document classification model<\/a>. Next, we train the classification model and create a classification endpoint that can perform real-time document analysis. 
Keep in mind that Amazon Textract and Amazon Comprehend classification endpoints incur charges, so refer to <a href=\"https:\/\/aws.amazon.com\/textract\/pricing\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract pricing<\/a> and <a href=\"https:\/\/aws.amazon.com\/comprehend\/pricing\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Comprehend pricing<\/a> for more information. Finally, we show how we can classify documents with this endpoint and split documents based on the classification results.<\/p>\n<p>This solution uses several AWS services, including Amazon Textract, Amazon Comprehend, Amazon S3, and AWS Step Functions.<\/p>\n<h2>Prerequisites<\/h2>\n<p>You need to complete the following prerequisites to build and deploy this solution:<\/p>\n<ol>\n<li><a href=\"https:\/\/github.com\/pyenv\/pyenv\" target=\"_blank\" rel=\"noopener noreferrer\">Install<\/a> Python 3.8.x.<\/li>\n<li><a href=\"https:\/\/stedolan.github.io\/jq\/download\/\" target=\"_blank\" rel=\"noopener noreferrer\">Install<\/a> jq.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/serverless-application-model\/latest\/developerguide\/serverless-sam-cli-install.html\" target=\"_blank\" rel=\"noopener noreferrer\">Install<\/a> the AWS SAM CLI.<\/li>\n<li>Install <a href=\"https:\/\/docs.docker.com\/get-docker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Docker<\/a>.<\/li>\n<li>Make sure you have <a href=\"https:\/\/pypi.org\/project\/pipenv\/\" target=\"_blank\" rel=\"noopener noreferrer\">pip installed<\/a>.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/install-cliv2.html\" target=\"_blank\" rel=\"noopener noreferrer\">Install and configure<\/a> the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI).<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/cli-configure-files.html\" target=\"_blank\" rel=\"noopener noreferrer\">Configure<\/a> your AWS credentials.<\/li>\n<\/ol>\n<p>The solution is designed to 
work optimally in the <code>us-east-1<\/code> and <code>us-west-2<\/code> Regions to take advantage of higher default quotas for Amazon Textract. For specific Regional workloads, refer to <a href=\"https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/textract.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract endpoints and quotas<\/a>. Make sure you use a single Region for the entire solution.<\/p>\n<h2>Clone the repo<\/h2>\n<p>To get started, clone the repository and switch into the working directory by running the following commands:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">git clone https:\/\/github.com\/aws-samples\/aws-document-classifier-and-splitter.git\ncd aws-document-classifier-and-splitter<\/code><\/pre>\n<\/p><\/div>\n<h2>Solution workflows<\/h2>\n<p>The solution consists of three workflows:<\/p>\n<ul>\n<li><strong>workflow1_endpointbuilder<\/strong> \u2013 Takes the training documents and builds a custom classification endpoint on Amazon Comprehend.<\/li>\n<li><strong>workflow2_docsplitter<\/strong> \u2013 Acts as the document splitting service, where documents are split by class. It uses the classification endpoint created in <code>workflow1<\/code>.<\/li>\n<li><strong>workflow3_local<\/strong> \u2013 Is intended for customers who are in highly regulated industries and can\u2019t persist data in Amazon S3. This workflow contains local versions of <code>workflow1<\/code> and <code>workflow2<\/code>.<\/li>\n<\/ul>\n<p>Let\u2019s take a deep dive into each workflow and how it works.<\/p>\n<h3>Workflow 1: Build an Amazon Comprehend classifier from PDF, JPG, or PNG documents<\/h3>\n<p>The first workflow takes documents stored on Amazon S3 and sends them through a series of steps to extract the data from the documents via Amazon Textract. Then, the extracted data is used to create an Amazon Comprehend custom classification endpoint. 
This is demonstrated in the following architecture diagram.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image001-new.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-30289 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image001-new.png\" alt=\"\" width=\"2002\" height=\"1058\"><\/a><\/p>\n<p>To launch <code>workflow1<\/code>, you need the Amazon S3 URI of the folder containing the training dataset files (these can be images, single-page PDFs, or multipage PDFs). The structure of the folder must be as follows:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">root dataset directory\n---- class directory\n-------- files<\/code><\/pre>\n<\/p><\/div>\n<p>Alternatively, the structure can have additional nested subdirectories:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">root dataset directory\n---- class directory\n-------- nested subdirectories\n------------ files<\/code><\/pre>\n<\/p><\/div>\n<p>The names of the class subdirectories (the second directory level) become the names of the classes used in the Amazon Comprehend custom classification model. For example, in the following file structure, the class for <code>form123.pdf<\/code> is <code>tax_forms<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">training_dataset\n---- tax_forms\n-------- page_1\n------------ form123.pdf<\/code><\/pre>\n<\/p><\/div>\n<p>To launch the workflow, complete the following steps:<\/p>\n<ol>\n<li>Upload the dataset to an S3 bucket you own.<\/li>\n<\/ol>\n<p>We recommend more than 50 samples for each class you want to classify. 
The following screenshot shows an example of this document class structure.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image003.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30262\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image003.png\" alt=\"\" width=\"1430\" height=\"529\"><\/a><\/p>\n<ol start=\"2\">\n<li>Build the <code>sam-app<\/code> by running the following commands (modify the provided commands as needed):<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cd workflow1_endpointbuilder\/sam-app\nsam build\nsam deploy --guided\nStack Name [sam-app]: endpointbuilder\nAWS Region []: us-east-1\n#Shows you resources changes to be deployed and require a 'Y' to initiate deploy\nConfirm changes before deploy [y\/N]: n\n#SAM needs permission to be able to create roles to connect to the resources in your template\nAllow SAM CLI IAM role creation [Y\/n]: y\nSave arguments to configuration file [Y\/n]: n\n\nLooking for resources needed for deployment:\nCreating the required resources...\nSuccessfully created!\nManaged S3 bucket: {your_bucket}\n#Managed repositories will be deleted when their functions are removed from the template and deployed\nCreate managed ECR repositories for all functions? 
[Y\/n]: y<\/code><\/pre>\n<\/p><\/div>\n<p>The output of the build is an ARN for a Step Functions state machine.<\/p>\n<ol start=\"3\">\n<li>When the build is complete, navigate to the <strong>State machines<\/strong> page on the Step Functions console.<\/li>\n<li>Choose the state machine you created.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image005.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30263\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image005.png\" alt=\"\" width=\"2672\" height=\"858\"><\/a><\/li>\n<li>Choose <strong>Start execution<\/strong>.<\/li>\n<li>Enter the following required input parameters:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">{\n\"folder_uri\": \"s3:\/\/{your dataset}\"\n}<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"7\">\n<li>Choose <strong>Start execution<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image007.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30264\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image007.png\" alt=\"\" width=\"1553\" height=\"606\"><\/a><\/li>\n<\/ol>\n<p>The state machine starts the workflow. This can take multiple hours depending on the size of the dataset. 
The following screenshot shows our state machine in progress.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image009.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30266\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image009.png\" alt=\"\" width=\"1209\" height=\"823\"><\/a><\/p>\n<p>When the state machine is complete, each step in the graph is green, as shown in the following screenshot.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image011.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30267\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image011.png\" alt=\"\" width=\"1262\" height=\"882\"><\/a><\/p>\n<p>You can navigate to the Amazon Comprehend console to see the endpoint deployed.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image013.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30268\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image013.png\" alt=\"\" width=\"2252\" height=\"1348\"><\/a><\/p>\n<p>You have now built your custom classifier using your documents. This marks the end of <code>workflow1<\/code>.<\/p>\n<h3>Workflow 2: Build a document splitting endpoint<\/h3>\n<p>The second workflow uses the endpoint you created in <code>workflow1<\/code> to split documents based on the classes on which the model was trained. 
This is demonstrated in the following architecture diagram.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image015-new.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-30288 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image015-new.png\" alt=\"\" width=\"2002\" height=\"1012\"><\/a><\/p>\n<p>To launch <code>workflow2<\/code>, we build the <code>sam-app<\/code>. Modify the provided commands as needed:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cd workflow2_docsplitter\/sam-app\nsam-app % sam build\nBuild Succeeded\n\nsam-app % sam deploy --guided\nConfiguring SAM deploy\n=========================================\nStack Name [sam-app]: docsplitter\nAWS Region []: us-east-1\n#Shows you resources changes to be deployed and require a 'Y' to initiate deploy\nConfirm changes before deploy [y\/N]: n\n#SAM needs permission to be able to create roles to connect to the resources in your template\nAllow SAM CLI IAM role creation [Y\/n]: y\nSave arguments to configuration file [Y\/n]: n\n\nLooking for resources needed for deployment:\nManaged S3 bucket: {bucket_name}\n#Managed repositories will be deleted when their functions are removed from the template and deployed\nCreate managed ECR repositories for all functions? [Y\/n]: y<\/code><\/pre>\n<\/p><\/div>\n<p>After the stack is created, you receive a Load Balancer DNS on the <strong>Outputs<\/strong> tab of the CloudFormation stack. 
You can begin to make requests to this endpoint.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image017.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30270\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image017.png\" alt=\"\" width=\"3238\" height=\"802\"><\/a><\/p>\n<p>A sample request is available in the <code>workflow2_docsplitter\/sample_request_folder\/sample_s3_request.py<\/code> file. The API takes three parameters: the S3 bucket name, the document Amazon S3 URI, and the Amazon Comprehend classification endpoint ARN. <b data-stringify-type=\"bold\">Workflow2 only supports PDF input.<\/b><\/p>\n<p>For our test, we use an 11-page mortgage document with five different document types.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image019.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30271\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image019.png\" alt=\"\" width=\"1332\" height=\"862\"><\/a><\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image021.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30272\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image021.png\" alt=\"\" width=\"1334\" height=\"860\"><\/a> <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image023.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img 
decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30273\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image023.png\" alt=\"\" width=\"1336\" height=\"864\"><\/a> <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image025.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30274\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image025.png\" alt=\"\" width=\"1330\" height=\"856\"><\/a> <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image027.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30275\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image027.png\" alt=\"\" width=\"1504\" height=\"1254\"><\/a> <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image029-1.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-30285 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image029-1.png\" alt=\"\" width=\"600\" height=\"781\"><\/a><\/p>\n<p>The response for the API is an Amazon S3 URI for a .zip file with all the split documents. 
You can also find this file in the bucket that you provided in your API call.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image031.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30277\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image031.png\" alt=\"\" width=\"2662\" height=\"1460\"><\/a><\/p>\n<p>Download the object and review the documents split based on the class.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image033-1.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-30284 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image033-1.png\" alt=\"\" width=\"500\" height=\"147\"><\/a><\/p>\n<p>This marks the end of <code>workflow2<\/code>. We have now shown how we can use a custom Amazon Comprehend classification endpoint to classify and split documents.<\/p>\n<h3>Workflow 3: Local document splitting<\/h3>\n<p>Our third workflow serves a similar purpose to <code>workflow1<\/code> and <code>workflow2<\/code>, generating an Amazon Comprehend endpoint and splitting documents; however, all document processing is done on your local machine, which produces an Amazon Comprehend compatible CSV file. This workflow was created for customers in highly regulated industries where persisting PDF documents on Amazon S3 may not be possible. 
The following architecture diagram is a visual representation of the local endpoint builder workflow.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image035.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30279\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image035.png\" alt=\"\" width=\"1322\" height=\"456\"><\/a><\/p>\n<p>The following diagram illustrates the local document splitter architecture.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image037.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-30280\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/ML-4118-image037.png\" alt=\"\" width=\"1322\" height=\"456\"><\/a><\/p>\n<p>All the code for the solution is available in the <code>workflow3_local\/local_endpointbuilder.py<\/code> file to build the Amazon Comprehend classification endpoint and <code>workflow3_local\/local_docsplitter.py<\/code> to send documents for splitting.<\/p>\n<h2>Conclusion<\/h2>\n<p>Document splitting is the key to building a successful and intelligent document processing workflow. It is still a very relevant problem for businesses, especially organizations aggregating multiple document types for their day-to-day operations. Some examples include processing insurance claims documents, insurance policy applications, SEC documents, tax forms, and income verification forms.<\/p>\n<p>In this post, we took a set of common documents used for loan processing, extracted the data using Amazon Textract, and built an Amazon Comprehend custom classification endpoint. 
With that endpoint, we classified incoming documents and split them based on their respective class. You can apply this process to nearly any set of documents with applications across a variety of industries, such as healthcare and financial services. To learn more about Amazon Textract, <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">visit the webpage<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/1613573108074.jpg\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-30282 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/1613573108074.jpg\" alt=\"\" width=\"100\" height=\"100\"><\/a>Aditi Rajnish<\/strong> is a first-year software engineering student at University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found rock climbing, playing the piano, or learning how to bake the perfect scone.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/1622773633578-1.jpg\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-30281 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/01\/1622773633578-1.jpg\" alt=\"\" width=\"100\" height=\"100\"><\/a> Raj Pathak<\/strong> is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. 
Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/intelligently-split-multi-form-document-packages-with-amazon-textract-and-amazon-comprehend\/<\/p>\n","protected":false},"author":0,"featured_media":1138,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1137"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1137"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1137\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1138"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1137"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1137"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1137"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}