{"id":2048,"date":"2022-04-04T18:42:37","date_gmt":"2022-04-04T18:42:37","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/04\/04\/enable-amazon-kendra-search-for-a-scanned-or-image-based-text-document\/"},"modified":"2022-04-04T18:42:37","modified_gmt":"2022-04-04T18:42:37","slug":"enable-amazon-kendra-search-for-a-scanned-or-image-based-text-document","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/04\/04\/enable-amazon-kendra-search-for-a-scanned-or-image-based-text-document\/","title":{"rendered":"Enable Amazon Kendra search for a scanned or image-based text document"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra<\/a> is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they\u2019re looking for, even when it\u2019s scattered across multiple locations and content repositories within your organization.<\/p>\n<p>Amazon Kendra supports a variety of document formats, such as Microsoft Word, PDF, and text. While working with a leading Edtech customer, we were asked to build an enterprise search solution that also utilizes images and PPT files. This post focuses on extending the document support in Amazon Kendra so you can preprocess text images and scanned documents (JPEG, PNG, or PDF format) to make them searchable. The solution combines <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a> for document preprocessing and optical character recognition (OCR), and Amazon Kendra for intelligent search.<\/p>\n<p>With the new Custom Document Enrichment feature in Amazon Kendra, you can now preprocess your documents during ingestion and augment your documents with new metadata. 
Custom Document Enrichment allows you to call external services like <a href=\"https:\/\/aws.amazon.com\/comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Comprehend<\/a>, Amazon Textract, and <a href=\"https:\/\/aws.amazon.com\/transcribe\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Transcribe<\/a> to extract text from images, transcribe audio, and analyze video. For more information about using Custom Document Enrichment, refer to <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/enrich-your-content-and-metadata-to-enhance-your-search-experience-with-custom-document-enrichment-in-amazon-kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra<\/a>.<\/p>\n<p>In this post, we propose an alternative method: preprocessing the content before calling the ingestion process in Amazon Kendra.<\/p>\n<h2>Solution overview<\/h2>\n<p>Amazon Textract is an ML service that automatically extracts text, handwriting, and data from scanned documents and goes beyond basic OCR to identify, understand, and extract data from forms and tables. Today, many companies manually extract data from scanned documents like PDFs, images, tables, and forms using basic OCR software that requires manual configuration and often has to be reconfigured when the form changes.<\/p>\n<p>To overcome these manual and expensive processes, Amazon Textract uses machine learning to read and process a wide range of documents, accurately extracting text, handwriting, tables, and other data without any manual effort. 
You can quickly automate document processing and take action on the information extracted, whether it\u2019s automating loan processing or extracting information from invoices and receipts.<\/p>\n<p><a href=\"https:\/\/aws.amazon.com\/kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra<\/a> is an easy-to-use enterprise search service that allows you to add search capabilities to your applications so that end-users can easily find information stored in different data sources within your company. This could include invoices, business documents, technical manuals, sales reports, corporate glossaries, internal websites, and more. You can harvest this information from storage solutions like <a href=\"https:\/\/aws.amazon.com\/s3\/?sc_channel=PS&amp;sc_campaign=acquisition_CA&amp;sc_publisher=google&amp;sc_medium=ACQ-P%7CPS-GO%7CBrand%7CDesktop%7CSU%7CStorage%7CS3%7CCA%7CEN%7CText&amp;sc_content=s3_e&amp;sc_detail=amazon%20s3&amp;sc_category=Storage&amp;sc_segment=293634539894&amp;sc_matchtype=e&amp;sc_country=CA&amp;s_kwcid=AL!4422!3!293634539894!e!!g!!amazon%20s3&amp;ef_id=Cj0KCQjw0Mb3BRCaARIsAPSNGpVjTiCn2vOoesJKGEuwsEUp2rJ1gItQeOllwxV842PV0-rdGAS29cQaAjztEALw_wcB:G:s&amp;s_kwcid=AL!4422!3!293634539894!e!!g!!amazon%20s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) and OneDrive; applications such as Salesforce, SharePoint, and ServiceNow; or relational databases like <a href=\"https:\/\/aws.amazon.com\/rds\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Relational Database Service<\/a> (Amazon RDS).<\/p>\n<p>The proposed solution enables you to unlock the search potential in scanned documents, extending the ability of Amazon Kendra to find accurate answers in a wider range of document types. 
The workflow includes the following steps:<\/p>\n<ol>\n<li>Upload a document (or documents of various types) to Amazon S3.<\/li>\n<li>The upload event triggers an <a href=\"http:\/\/aws.amazon.com\/lambda\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> function that uses the synchronous Amazon Textract API (<code>DetectDocumentText<\/code>).<\/li>\n<li>Amazon Textract reads the document in Amazon S3, extracts the text from it, and returns the extracted text to the Lambda function, which saves it as a text file in Amazon S3.<\/li>\n<li>Reindex the Amazon Kendra data source so that it picks up the new text file.<\/li>\n<li>When reindexing is complete, you can search the new dataset either via the Amazon Kendra console or API.<\/li>\n<\/ol>\n<p>The following diagram illustrates the solution architecture.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image001.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34617\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image001.png\" alt=\"\" width=\"1152\" height=\"681\"><\/a><\/p>\n<p>In the following sections, we demonstrate how to configure the Lambda function, create the event trigger, process a document, and then reindex the data.<\/p>\n<h2>Configure the Lambda function<\/h2>\n<p>To configure your Lambda function, add the following Python code in the function\u2019s code editor:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import urllib.parse\nimport boto3\n\ntextract = boto3.client('textract')\n\ndef handler(event, context):\n\t# Bucket and key of the uploaded object, taken from the S3 event\n\tsource_bucket = event['Records'][0]['s3']['bucket']['name']\n\tobject_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])\n\n\t# Synchronous OCR call on the uploaded document\n\ttextract_result = textract.detect_document_text(\n\t\tDocument={\n\t\t\t'S3Object': {\n\t\t\t\t'Bucket': source_bucket,\n\t\t\t\t'Name': object_key\n\t\t\t}\n\t\t})\n\tpage = \"\"\n\tblocks = [x for x in 
textract_result['Blocks'] if x['BlockType'] == \"LINE\"]\n\tfor block in blocks:\n\t\tpage += \" \" + block['Text']\n\n\tprint(page)\n\t# Save the extracted text to S3 so that Amazon Kendra can index it\n\ts3 = boto3.resource('s3')\n\ttext_object = s3.Object('demo-kendra-test', 'text\/apollo11-summary.txt')\n\ttext_object.put(Body=page)<\/code><\/pre>\n<\/p><\/div>\n<p>We use the <a href=\"https:\/\/docs.aws.amazon.com\/textract\/latest\/dg\/API_DetectDocumentText.html\" target=\"_blank\" rel=\"noopener noreferrer\">DetectDocumentText<\/a> API to extract the text from an image (JPEG or PNG) stored in Amazon S3.<\/p>\n<h2>Create an event trigger at Amazon S3<\/h2>\n<p>In this step, we create an event trigger to start the Lambda function when a new document is uploaded to a specific bucket. The following screenshot shows our new function on the Amazon S3 console.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image002.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34618\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image002.png\" alt=\"\" width=\"984\" height=\"141\"><\/a><\/p>\n<p>You can also verify the event trigger on the Lambda console.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image003.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34619\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image003.png\" alt=\"\" width=\"910\" height=\"490\"><\/a><\/p>\n<h2>Process a document<\/h2>\n<p>To test the process, we upload an image to the S3 folder that we defined for the S3 event trigger. 
We use the following sample image.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image004.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34620\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image004.png\" alt=\"\" width=\"600\" height=\"880\"><\/a><\/p>\n<p>When the Lambda function is complete, we can go to the <a href=\"http:\/\/aws.amazon.com\/cloudwatch\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudWatch<\/a> console to check the output. The following screenshot shows the extracted text, which confirms that the Lambda function ran successfully.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image005.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34621\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image005.png\" alt=\"\" width=\"959\" height=\"462\"><\/a><\/p>\n<h2>Reindex the data with Amazon Kendra<\/h2>\n<p>We can now reindex our data.<\/p>\n<ol>\n<li>On the Amazon Kendra console, under <strong>Data management<\/strong> in the navigation pane, choose <strong>Data sources<\/strong>.<\/li>\n<li>Select the data source <code>demo-s3-datasource<\/code>.<\/li>\n<li>Choose <strong>Sync now<\/strong>.<\/li>\n<\/ol>\n<p>The sync state changes to <code>Syncing - crawling<\/code>.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image006.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34622\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image006.png\" alt=\"\" width=\"959\" height=\"196\"><\/a><\/p>\n<p>When the sync is 
complete, the sync status changes to <code>Succeeded<\/code> and the sync state changes to <code>Idle<\/code>.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image007.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34623\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image007.png\" alt=\"\" width=\"937\" height=\"190\"><\/a><\/p>\n<p>Now we can go back to the search console and see our faceted search in action.<\/p>\n<ol start=\"4\">\n<li>In the navigation pane, choose <strong>Search console<\/strong>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image008.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34624\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image008.png\" alt=\"\" width=\"368\" height=\"476\"><\/a><\/li>\n<\/ol>\n<p>We added metadata for a few items; two of them are the ML algorithms XGBoost and BlazingText.<\/p>\n<ol start=\"5\">\n<li>Let\u2019s try searching for <code>Sagemaker<\/code>.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image009.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34625\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image009.png\" alt=\"\" width=\"959\" height=\"523\"><\/a><\/li>\n<\/ol>\n<p>Our search was successful, and we got a list of results. 
Let\u2019s see what we have for facets.<\/p>\n<ol start=\"6\">\n<li>Expand <strong>Filter search results<\/strong>.<\/li>\n<\/ol>\n<p>We have the <code>category<\/code> and <code>tags<\/code> facets that were part of our item metadata.<\/p>\n<ol start=\"7\">\n<li>Choose <strong>BlazingText<\/strong> to filter results just for that algorithm.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image010.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34626\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image010.png\" alt=\"\" width=\"959\" height=\"485\"><\/a><\/li>\n<li>Now let\u2019s search the newly uploaded image files. The following screenshot shows the search on the new preprocessed documents.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image011.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-34627\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/ML-6030-image011.png\" alt=\"\" width=\"1424\" height=\"986\"><\/a><\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>This post showed how to improve the effectiveness of search results and the overall search experience. You can use Amazon Textract to extract text from scanned images; that text can be added as metadata and later made available as facets for interacting with the search results. This is just one illustration of how you can use AWS native services to create a differentiated search experience for your users. 
This approach also helps unlock the full potential of your knowledge assets.<\/p>\n<p>For a deeper dive into what you can achieve by combining other AWS services with Amazon Kendra, refer to\u00a0<a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/make-your-audio-and-video-files-searchable-using-amazon-transcribe-and-amazon-kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra<\/a>,\u00a0<a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/build-an-intelligent-search-solution-with-automated-content-enrichment\/\" target=\"_blank\" rel=\"noopener noreferrer\">Build an intelligent search solution with automated content enrichment<\/a>, and other posts on the\u00a0<a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/category\/artificial-intelligence\/amazon-kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra blog<\/a>.<\/p>\n<hr>\n<h3>About the Author<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/Sanjay-Tiwari.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-34628 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/03\/25\/Sanjay-Tiwari.png\" alt=\"\" width=\"100\" height=\"127\"><\/a>Sanjay Tiwary<\/strong> is a Specialist Solutions Architect AI\/ML. He spends his time working with strategic customers to define business requirements, provide L300 sessions around specific use cases, and design ML applications and services that are scalable, reliable, and performant. He has helped launch and scale the AI\/ML powered Amazon SageMaker service and has implemented several proofs of concept using Amazon AI services. 
He has also developed the advanced analytics platform as a part of the digital transformation journey.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/enable-amazon-kendra-search-for-a-scanned-or-image-based-text-document\/<\/p>\n","protected":false},"author":0,"featured_media":2049,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2048"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=2048"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2048\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/2049"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=2048"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=2048"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=2048"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}