{"id":1978,"date":"2022-03-17T18:38:50","date_gmt":"2022-03-17T18:38:50","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/17\/build-a-traceable-custom-multi-format-document-parsing-pipeline-with-amazon-textract\/"},"modified":"2022-03-17T18:38:50","modified_gmt":"2022-03-17T18:38:50","slug":"build-a-traceable-custom-multi-format-document-parsing-pipeline-with-amazon-textract","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/17\/build-a-traceable-custom-multi-format-document-parsing-pipeline-with-amazon-textract\/","title":{"rendered":"Build a traceable, custom, multi-format document parsing pipeline with Amazon Textract"},"content":{"rendered":"<div id=\"\">\n<p>Organizational forms serve as a primary business tool across industries\u2014from financial services, to healthcare, and more. Consider, for example, tax filing forms in the tax management industry, where new forms come out each year with largely the same information. AWS customers across sectors need to process and store information in forms as part of their daily business practice. 
These forms often serve as a primary means for information to flow into an organization where technological means of data capture are impractical.<\/p>\n<p>In addition to using forms to capture information, over the years of offering <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a>, we have observed that AWS customers frequently version their organizational forms based on structural changes made, fields added or changed, or other considerations such as a change of year or version of the form.<\/p>\n<p>When the structure or content of a form changes, this often causes problems for traditional OCR systems or affects downstream tools used to capture information, even when you need to capture the same information year over year and aggregate the data regardless of the format of the document.<\/p>\n<p>To solve this problem, in this post we demonstrate how you can build and deploy an event-driven, serverless, multi-format document parsing pipeline with Amazon Textract.<\/p>\n<h2>Solution overview<\/h2>\n<p>The following diagram illustrates our solution architecture:<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/09\/ML-2764-image001-new.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-34057 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/09\/ML-2764-image001-new.png\" alt=\"\" width=\"1381\" height=\"1381\"><\/a><\/p>\n<p>First, the solution offers pipeline ingest using <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3), Amazon S3 Event Notifications, and an <a href=\"https:\/\/aws.amazon.com\/sqs\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Queue Service<\/a> (Amazon SQS) queue so that processing begins when a form lands in the target Amazon 
S3 partition. An event on <a href=\"https:\/\/aws.amazon.com\/eventbridge\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EventBridge<\/a> is created and sent to an <a href=\"http:\/\/aws.amazon.com\/lambda\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> target that triggers an Amazon Textract job.<\/p>\n<p>You can use serverless AWS services such as Lambda and <a href=\"http:\/\/aws.amazon.com\/step-functions\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Step Functions<\/a> to create asynchronous service integrations between AWS AI services and AWS Analytics and Database services for warehousing, analytics, and AI and machine learning (ML). In this post, we demonstrate how to use Step Functions to asynchronously control and maintain the state of requests to Amazon Textract asynchronous APIs. This is achieved by using a state machine for managing calls and responses. We use Lambda within the state machine to merge the paginated API response data from Amazon Textract into a single JSON object containing semi-structured text data extracted using OCR.<\/p>\n<p>Then we filter across different forms using a standardized approach to aggregate this OCR data into a common structured format using <a href=\"http:\/\/aws.amazon.com\/athena\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Athena<\/a> and a SQL Amazon Textract JSON <a href=\"https:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/serde-about.html\" target=\"_blank\" rel=\"noopener noreferrer\">SerDe<\/a>.<\/p>\n<p>You can trace the steps taken through this pipeline using serverless Step Functions to track the processing state and retain the output of each state. 
This is something that customers in some industries prefer to do when working with data where you must retain the results of all predictions from services such as Amazon Textract for promoting explainability of your pipeline results in the long term.<\/p>\n<p>Finally, you can query the extracted data in Athena tables.<\/p>\n<p>In the following sections, we walk you through setting up the pipeline using <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a>, testing the pipeline, and adding new form versions. This pipeline provides a maintainable solution because every component (ingest, text extraction, text processing) is independent and isolated.<\/p>\n<h2>Define default input parameters for CloudFormation stacks<\/h2>\n<p>To define the input parameters for the CloudFormation stacks, open <code>default.properties<\/code> under the <code>params<\/code> folder and enter the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">- set the default value for parameter 'pInputBucketName' for Input S3 bucket \n- set the default value for parameter 'pOutputBucketName' for Output S3 bucket \n- set the default value for parameter 'pInputQueueName' for Ingest SQS (a.k.a job scheduler)<\/code><\/pre>\n<\/p><\/div>\n<h2>Deploy the solution<\/h2>\n<p>To deploy your pipeline, complete the following steps:<\/p>\n<ol>\n<li>Choose <strong>Launch Stack<\/strong>:<br \/><a href=\"https:\/\/console.aws.amazon.com\/cloudformation\/home?region=us-east-1#\/stacks\/new?stackName=multi-format-doc-stack&amp;templateURL=https:\/\/aws-machine-learning-blog.s3.amazonaws.com\/artifacts\/ML-2764-Build-a-traceable-custom-multi-format-document-parsing-pipeline\/main-template-out.template\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-47 size-full\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2017\/02\/10\/launchstack.png\" alt=\"\" width=\"107\" height=\"20\"><\/a><\/li>\n<li>Choose <strong>Next<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image005.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33910\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image005.png\" alt=\"\" width=\"2158\" height=\"1230\"><\/a><\/li>\n<li>Specify the stack details as shown in the following screenshot and choose <strong>Next<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image007.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33911\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image007.png\" alt=\"\" width=\"1374\" height=\"1128\"><\/a><\/li>\n<li>In the <strong>Configure stack options<\/strong> section, add optional tags, permissions, and other advanced settings.<\/li>\n<li>Choose <strong>Next<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image009.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33912\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image009.png\" alt=\"\" width=\"727\" height=\"731\"><\/a><\/li>\n<li>Review the stack details and select <strong>I acknowledge that AWS CloudFormation might create IAM resources with custom names<\/strong>.<\/li>\n<li>Choose <strong>Create stack<\/strong>.<br \/><a 
href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image011.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33913\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image011.png\" alt=\"\" width=\"1318\" height=\"444\"><\/a><\/li>\n<\/ol>\n<p>This initiates stack deployment in your AWS account.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image013.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33914\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image013.png\" alt=\"\" width=\"2798\" height=\"616\"><\/a><\/p>\n<p>After the stack is deployed successfully, then you can start testing the pipeline as described in the next section.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image015.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33915\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image015.png\" alt=\"\" width=\"2766\" height=\"1256\"><\/a><\/p>\n<h2>Test the pipeline<\/h2>\n<p>After a successful deployment, complete the following steps to test your pipeline:<\/p>\n<ol>\n<li>Download the <a href=\"https:\/\/aws-machine-learning-blog.s3.amazonaws.com\/artifacts\/ML-2764-Build-a-traceable-custom-multi-format-document-parsing-pipeline\/sample+docs\/multi-format-sample-documents.zip\" target=\"_blank\" rel=\"noopener noreferrer\">sample files<\/a> onto your computer.<\/li>\n<li>Create an <code>\/uploads<\/code> folder (partition) under the newly created input S3 bucket.<br \/><a 
href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image017.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33916\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image017.png\" alt=\"\" width=\"2710\" height=\"766\"><\/a><\/li>\n<li>Create the separate folders (partitions) like <code>jobapplications<\/code> under <code>\/uploads<\/code>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image019.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33917\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image019.png\" alt=\"\" width=\"2760\" height=\"730\"><\/a><\/li>\n<li>Upload the first version of the job application from the sample docs folder to the <code>\/uploads\/jobapplications<\/code> partition.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image021.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33918\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image021.png\" alt=\"\" width=\"2732\" height=\"752\"><\/a><\/li>\n<\/ol>\n<p>When the pipeline is complete, you can find the extracted key-value for this version of the document in <code>\/OuputS3\/03-textract-parsed-output\/jobapplications<\/code> on the Amazon S3 console.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image023.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33919\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image023.png\" 
alt=\"\" width=\"2386\" height=\"766\"><\/a><\/p>\n<p>You can also find it in the Athena table (<code>applications_data_table<\/code>) on the <strong>Database<\/strong> menu (<code>jobapplicationsdatabase<\/code>).<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image025.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33920\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image025.png\" alt=\"\" width=\"2696\" height=\"1272\"><\/a><\/p>\n<ol start=\"5\">\n<li>Upload the second version of the job application from the sample docs folder to the <code>\/uploads\/jobapplications<\/code> partition.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image027.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33921\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image027.png\" alt=\"\" width=\"2772\" height=\"1004\"><\/a><\/li>\n<\/ol>\n<p>When the pipeline is complete, you can find the extracted key-value for this version in <code>\/OuputS3\/03-textract-parsed-output\/jobapplications<\/code> on the Amazon S3 console.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image029.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33922\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image029.png\" alt=\"\" width=\"2384\" height=\"810\"><\/a><\/p>\n<p>You can also find it in the Athena table (<code>applications_data_table<\/code>) on the <strong>Database<\/strong> menu (<code>jobapplicationsdatabase<\/code>).<br \/><a 
href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image031.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33923\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image031.png\" alt=\"\" width=\"2700\" height=\"1290\"><\/a><\/p>\n<p>You\u2019re done! You\u2019ve successfully deployed your pipeline.<\/p>\n<h2>Add new form versions<\/h2>\n<p>Updating the solution for a new form version is straightforward: for each new form version, you only need to update and test the queries in the processing stack.<\/p>\n<p>After you make the updates, you can redeploy the updated pipeline using AWS CloudFormation APIs and process new documents, arriving at the same standard data points for your schema with minimal disruption and development effort. This flexibility, which is achieved by decoupling the parsing and extraction behavior and using the JSON SerDe functionality in Athena, makes this pipeline a maintainable solution for any number of form versions that your organization needs to process to gather information.<\/p>\n<p>As you run the ingest solution, data from incoming forms is automatically populated in Athena with information about the files and inputs associated with them. 
When the data in your forms moves from unstructured to structured data, it\u2019s ready to use for downstream applications such as analytics, ML modeling, and more.<\/p>\n<h2>Clean up<\/h2>\n<p>To avoid incurring ongoing charges, delete the resources you created as part of this solution when you\u2019re done.<\/p>\n<ol>\n<li>On the Amazon S3 console, manually delete the buckets you created as part of the CloudFormation stack.<\/li>\n<li>On the AWS CloudFormation console, choose <strong>Stacks<\/strong> in the navigation pane.<\/li>\n<li>Select the main stack and choose <strong>Delete<\/strong>.<\/li>\n<\/ol>\n<p>This automatically deletes the nested stacks.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image033.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33924\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/ML-2764-image033.png\" alt=\"\" width=\"2824\" height=\"918\"><\/a><\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, we demonstrated how customers seeking to trace and customize their document processing can build and deploy an event-driven, serverless, multi-format document parsing pipeline with Amazon Textract. 
This pipeline provides a maintainable solution because every component (ingest, text extraction, text processing) is independent and isolated, allowing organizations to operationalize their solutions to address diverse processing needs.<\/p>\n<p>Try the solution today and leave your feedback in the comments section.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/sowarde-high-res-current-photo.jpeg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-33937 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/sowarde-high-res-current-photo.jpeg\" alt=\"\" width=\"100\" height=\"150\"><\/a><strong>Emily Soward<\/strong> is a Data Scientist with AWS Professional Services. She holds a Master of Science with Distinction in Artificial Intelligence from the University of Edinburgh in Scotland, United Kingdom with emphasis on Natural Language Processing (NLP). Emily has served in applied scientific and engineering roles focused on AI-enabled product research and development, operational excellence, and governance for AI workloads running at organizations in the public and private sector. She contributes to customer guidance as an AWS Senior Speaker and recently, as an author for AWS Well-Architected in the Machine Learning Lens.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/snghigf.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-33929 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/08\/snghigf.png\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Sandeep Singh<\/strong> is a Data Scientist with AWS Professional Services. 
He holds a Master of Science in Information Systems with concentration in AI and Data Science from San Diego State University (SDSU), California. He is a full stack Data Scientist with a strong computer science background and Trusted adviser with specialization in AI Systems and Control design. He is passionate about helping customers to get their high impact projects in the right direction, advising and guiding them in their Cloud journey, and building state-of-the-art AI\/ML enabled solutions.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/build-a-traceable-custom-multi-format-document-parsing-pipeline-with-amazon-textract\/<\/p>\n","protected":false},"author":0,"featured_media":1979,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1978"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1978"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1978\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1979"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1978"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1978"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/w
p-json\/wp\/v2\/tags?post=1978"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}