{"id":286,"date":"2020-09-25T02:35:50","date_gmt":"2020-09-25T02:35:50","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/09\/25\/improved-ocr-and-structured-data-extraction-with-amazon-textract\/"},"modified":"2020-09-25T02:35:50","modified_gmt":"2020-09-25T02:35:50","slug":"improved-ocr-and-structured-data-extraction-with-amazon-textract","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/09\/25\/improved-ocr-and-structured-data-extraction-with-amazon-textract\/","title":{"rendered":"Improved OCR and structured data extraction with Amazon Textract"},"content":{"rendered":"<div id=\"\">\n<p>Optical character recognition (OCR) technology, which enables extracting text from an image, has been around since the mid-20th century, and continues to be a research topic today. OCR and document understanding are still vibrant areas of research because they\u2019re both valuable and hard problems to solve.<\/p>\n<p>AWS has been investing in improving OCR and document understanding technology, and our research scientists continue to publish research papers in these areas. For example, the research paper <a href=\"https:\/\/www.amazon.science\/publications\/can-you-read-me-now-content-aware-rectification-using-angle-supervision\" target=\"_blank\" rel=\"noopener noreferrer\">Can you read me now? Content aware rectification using angle supervision<\/a> describes how to tackle the problem of document rectification which is fundamental to the OCR process on documents. Additionally, the paper <a href=\"https:\/\/www.amazon.science\/publications\/scatter-selective-context-attentional-scene-text-recognizer\" target=\"_blank\" rel=\"noopener noreferrer\">SCATTER: Selective Context Attentional Scene Text Recognizer<\/a> introduces a novel way to perform scene text recognition, which is the task of recognizing text against complex image backgrounds. 
For more recent publications in this area, see <a href=\"https:\/\/www.amazon.science\/computer-vision\" target=\"_blank\" rel=\"noopener noreferrer\">Computer Vision<\/a>.<\/p>\n<p>Amazon scientists also incorporate these research findings into best-of-breed technologies such as <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a>, a fully managed service that uses machine learning (ML) to identify text and data from tables and forms in documents\u2014such as tax information from a W2, or values from a table in a scanned inventory report\u2014and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring customization or human intervention.<\/p>\n<p>One of the advantages of a fully managed service is the automatic and periodic improvement to the underlying ML models to improve accuracy. You may need to extract information from documents that have been scanned or pictured in different lighting conditions, a variety of angles, and numerous document types. 
As the models are trained using data inputs that encompass these different conditions, they become better at detecting and extracting data.<\/p>\n<p>In this post, we discuss a few recent updates to Amazon Textract that improve the overall accuracy of document detection and extraction.<\/p>\n<h2>Currency symbols<\/h2>\n<p>Amazon Textract now detects a set of currency symbols (Chinese yuan, Japanese yen, Indian rupee, British pound, and US dollar) and the degree symbol with more precision without much regression on existing symbol detection.<\/p>\n<p>For example, the following is a sample table in a document from a company\u2019s annual report.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15921 size-full\" title=\"Sample table\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/1-Note9.jpg\" alt=\"\" width=\"900\" height=\"357\"><\/p>\n<p>The following screenshot shows the output on the Amazon Textract console before the latest update.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15922 size-full\" title=\"Amazon Textract output - prior to update\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/2-Screenshot-2.jpg\" alt=\"\" width=\"900\" height=\"470\"><\/p>\n<p>Amazon Textract detects all the text accurately. However, the Indian rupee symbol is recognized as an \u201cR\u201d instead of \u201c\u20b9\u201d. The following screenshot shows the output using the updated model.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15923 size-full\" title=\"Amazon Textract output - updated\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/3-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"474\"><\/p>\n<p>The rupee symbol is detected and extracted accurately. 
Similarly, the degree symbol and the other currency symbols (yuan, yen, pound, and dollar) are now supported in Amazon Textract.<\/p>\n<h2>Detecting rows and columns in large tables<\/h2>\n<p>Amazon Textract released a new table model update that more accurately detects rows and columns of large tables that span an entire page. Overall table detection and extraction of data and text within tables have also been improved.<\/p>\n<p>The following is an example of a table in a personal investment account statement.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15924 size-full\" title=\"Table example\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/4-Screenshot-1.jpg\" alt=\"\" width=\"900\" height=\"255\"><\/p>\n<p>The following screenshot shows the Amazon Textract output prior to the new model update.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15925 size-full\" title=\"Amazon Textract output - prior to update\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/5-Screenshot-2.jpg\" alt=\"\" width=\"900\" height=\"507\"><\/p>\n<p>Even though all the rows, columns, and text are detected properly, the output also contains empty columns. The original table didn\u2019t have a clear separation for columns, so the model included extra columns.<\/p>\n<p>The following screenshot shows the output after the model update.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15926 size-full\" title=\"Amazon Textract output - updated\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/6-Screenshot-1.jpg\" alt=\"\" width=\"900\" height=\"330\"><\/p>\n<p>The output is now much cleaner. Amazon Textract still extracts all the data accurately from this table and now includes the correct number of columns. 
Tables that span an entire page show a similar improvement, with no columns omitted.<\/p>\n<h2>Improved accuracy in forms<\/h2>\n<p>Amazon Textract now has higher accuracy on a variety of forms, especially income verification documents such as pay stubs, bank statements, and tax documents. The following screenshot shows an example of such a form.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15927 size-full\" title=\"Form example\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/7-Form.jpg\" alt=\"\" width=\"900\" height=\"569\"><\/p>\n<p>The preceding form is low resolution. Regardless, you may have to process such documents in your organization. The following screenshot is the Amazon Textract output using one of the previous models.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15928 size-full\" title=\"Amazon Textract output - prior to update\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/8-Form.jpg\" alt=\"\" width=\"900\" height=\"557\"><\/p>\n<p>Although the older model detected many of the check boxes, it didn\u2019t capture all of them. 
The following screenshot shows the output using the new model.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15929 size-full\" title=\"Amazon Textract output - updated\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/9-Form.jpg\" alt=\"\" width=\"900\" height=\"550\"><\/p>\n<p>With this new model, Amazon Textract accurately detected all the check boxes in the document.<\/p>\n<h2><strong>Summary<\/strong><\/h2>\n<p>The improvements to currency symbol and degree symbol detection will be launched in the Asia Pacific (Singapore) Region on September 24, 2020, and over the next few days in the other Regions where Amazon Textract is available. With the latest improvements to Amazon Textract, you can retrieve information from documents with more accuracy. Tables spanning the entire page are detected more accurately, currency symbols (yuan, yen, rupee, pound, and dollar) and the degree symbol are now supported, and key-value pairs and check boxes in financial forms are detected with more precision. 
To start extracting data from your documents and images, try <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a> for yourself.<\/p>\n<hr>\n<h3>About the Author<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15932 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/15\/RajCopparapu.jpg\" alt=\"\" width=\"100\" height=\"135\">Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/improved-ocr-and-structured-data-extraction-with-amazon-textract\/<\/p>\n","protected":false},"author":0,"featured_media":287,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/286"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=286"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/286\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/287"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=286"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=286"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistrib
ution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=286"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}