{"id":1293,"date":"2021-12-02T08:29:37","date_gmt":"2021-12-02T08:29:37","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/12\/02\/enrich-your-content-and-metadata-to-enhance-your-search-experience-with-custom-document-enrichment-in-amazon-kendra\/"},"modified":"2021-12-02T08:29:37","modified_gmt":"2021-12-02T08:29:37","slug":"enrich-your-content-and-metadata-to-enhance-your-search-experience-with-custom-document-enrichment-in-amazon-kendra","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/12\/02\/enrich-your-content-and-metadata-to-enhance-your-search-experience-with-custom-document-enrichment-in-amazon-kendra\/","title":{"rendered":"Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra<\/a> customers can now enrich document metadata and content during the document ingestion process using custom document enrichment (CDE). Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines search for your websites and applications so your employees and customers can easily find the content they\u2019re looking for, even when it\u2019s scattered across multiple locations and content repositories within your organization.<\/p>\n<p>You can further enhance the accuracy and search experience of Amazon Kendra by improving the quality of documents indexed in it. Documents with precise content and rich metadata are more searchable and yield more accurate results. Organizations often have large repositories of raw documents that can be improved for search by modifying content or adding metadata before indexing. So how does CDE help? 
By simplifying the process of creating, modifying, or deleting document metadata and content before they\u2019re ingested into Amazon Kendra. This can include detecting entities from text, extracting text from images, transcribing audio and video, and more by creating custom logic or using services like <a href=\"https:\/\/aws.amazon.com\/comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Comprehend<\/a>, <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a>, <a href=\"https:\/\/aws.amazon.com\/transcribe\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Transcribe<\/a>, <a href=\"https:\/\/aws.amazon.com\/rekognition\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Rekognition<\/a>, and others.<\/p>\n<p>In this post, we show you how to use CDE in Amazon Kendra using custom logic or with AWS services like Amazon Textract, Amazon Transcribe, and Amazon Comprehend. We demonstrate CDE using simple examples and provide a step-by-step guide for you to experience CDE in an Amazon Kendra index in your own AWS account.<\/p>\n<h2>CDE overview<\/h2>\n<p>CDE enables you to create, modify, or delete document metadata and content when you ingest your documents into Amazon Kendra. 
Let\u2019s understand the Amazon Kendra document ingestion workflow in the context of CDE.<\/p>\n<p>The following diagram illustrates the CDE workflow.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31328\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/1-6962-Flowchart.jpg\" alt=\"\" width=\"800\" height=\"247\"><\/p>\n<p>The path a document takes depends on the presence of different CDE components:<\/p>\n<ul>\n<li><strong>Path taken when no CDE is present<\/strong> \u2013 Steps 1 and 2<\/li>\n<li><strong>Path taken with only CDE basic operations<\/strong> \u2013 Steps 3, 4, and 2<\/li>\n<li><strong>Path taken with only CDE advanced operations<\/strong> \u2013 Steps 6, 7, 8, and 9<\/li>\n<li><strong>Path taken when CDE basic operations and advanced operations are present<\/strong> \u2013 Steps 3, 5, 7, 8, and 9<\/li>\n<\/ul>\n<p>The CDE basic operations and advanced operations components are optional. 
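<\/p>
<p>If you prefer to set up these components programmatically, the same configuration can be passed to the data source through the Kendra API. The following Python sketch builds basic operations like the ones used later in this post; the role ARN and IDs are placeholders, and the request shapes should be checked against the current boto3 documentation.<\/p>

```python
# Sketch: building CDE basic (inline) operations programmatically.
# The role ARN and IDs below are placeholders.

def category_operation(subdir: str, category: str) -> dict:
    """Inline operation: if a document's source URI contains the given
    subdirectory, set its _category attribute to the given value."""
    return {
        "Condition": {
            "ConditionDocumentAttributeKey": "_source_uri",
            "Operator": "Contains",
            "ConditionOnValue": {"StringValue": subdir},
        },
        "Target": {
            "TargetDocumentAttributeKey": "_category",
            "TargetDocumentAttributeValue": {"StringValue": category},
        },
    }


# Three of the six subdirectories used in this post, for brevity.
enrichment_config = {
    "InlineConfigurations": [
        category_operation("Data/Best_Practices/", "Best Practices"),
        category_operation("Data/Databases/", "Databases"),
        category_operation("Data/Machine_Learning/", "Machine Learning"),
    ],
    "RoleArn": "arn:aws:iam::111122223333:role/KendraCDERole",  # placeholder
}

# To apply it (requires boto3, a real index, and a real data source):
# import boto3
# boto3.client("kendra").update_data_source(
#     Id="<data-source-id>",
#     IndexId="<index-id>",
#     CustomDocumentEnrichmentConfiguration=enrichment_config,
# )
```

<p>The rest of this post configures the equivalent operations interactively on the console.<\/p>
<p>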
For more information on the CDE basic operations and advanced operations with the <code>preExtraction<\/code> and <code>postExtraction<\/code> <a href=\"http:\/\/aws.amazon.com\/lambda\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> functions, refer to the <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/custom-document-enrichment.html\" target=\"_blank\" rel=\"noopener noreferrer\">Custom Document Enrichment<\/a> section in the <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/what-is-kendra.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra Developer Guide<\/a>.<\/p>\n<p>In this post, we walk you through four use cases:<\/p>\n<ul>\n<li>Automatically assign category attributes based on the subdirectory of the document being ingested<\/li>\n<li>Automatically extract text while ingesting scanned image documents to make them searchable<\/li>\n<li>Automatically create a transcription while ingesting audio and video files to make them searchable<\/li>\n<li>Automatically generate facets based on entities in a document to enhance the search experience<\/li>\n<\/ul>\n<h2>Prerequisites<\/h2>\n<p>You can follow the step-by-step guide in your AWS account to get a first-hand experience of using CDE. 
Before getting started, complete the following prerequisites:<\/p>\n<ol>\n<li>Download the sample data files <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/enrich-your-content-and-metadata-custom-document-enrichment-kendra\/AWS_Whitepapers.zip\" target=\"_blank\" rel=\"noopener noreferrer\">AWS_Whitepapers.zip<\/a>, <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/enrich-your-content-and-metadata-custom-document-enrichment-kendra\/GenMeta.zip\" target=\"_blank\" rel=\"noopener noreferrer\">GenMeta.zip<\/a>, and <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/enrich-your-content-and-metadata-custom-document-enrichment-kendra\/Media.zip\" target=\"_blank\" rel=\"noopener noreferrer\">Media.zip<\/a> to a local drive on your computer.<\/li>\n<li>In your AWS account, create a new Amazon Kendra index, Developer Edition. For more information and instructions, refer to the <strong>Getting Started<\/strong> chapter in the <a href=\"https:\/\/kendra-essentials.workshop.aws\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra Essentials<\/a> workshop and <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/create-index.html\" target=\"_blank\" rel=\"noopener noreferrer\">Creating an index<\/a>.<\/li>\n<li>Open the <a href=\"https:\/\/console.aws.amazon.com\/console\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Management Console<\/a>, and make sure that you\u2019re logged in to your AWS account<\/li>\n<li>Create an <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket to use as a data source. 
Refer to the <a href=\"https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/Welcome.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon S3 User Guide<\/a> for more information.<\/li>\n<li>Choose <a href=\"https:\/\/console.aws.amazon.com\/cloudformation\/home?#\/stacks\/create\/review?templateURL=https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/enrich-your-content-and-metadata-custom-document-enrichment-kendra\/cde-blog-template.yaml\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-20275 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/31\/LaunchStack.jpg\" alt=\"\" width=\"107\" height=\"20\"><\/a> to launch an <a href=\"https:\/\/aws.amazon.com\/cloudformation\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> stack that deploys the <code>preExtraction<\/code> and <code>postExtraction<\/code> Lambda functions and the required AWS Identity and Access Management (IAM) roles. This opens the AWS CloudFormation console.\n<ol type=\"a\">\n<li>Provide a unique name for your CloudFormation stack and the name of the bucket you just created as a parameter.<\/li>\n<li>Choose <strong>Next<\/strong>, select the acknowledgement check boxes, and choose <strong>Create stack<\/strong>.<\/li>\n<li>After the stack creation is complete, note the values on the <strong>Outputs<\/strong> tab. We use these values later.<\/li>\n<\/ol>\n<\/li>\n<li>Configure the S3 bucket as a data source using the S3 data source connector in the Amazon Kendra index you created. When configuring the data source, in the <strong>Additional configurations<\/strong> section, define the <strong>Include pattern<\/strong> to be <code>Data\/<\/code>. 
For more information and instructions, refer to the <strong>Using Amazon Kendra S3 Connector<\/strong> subsection of the <strong>Ingesting Documents<\/strong> section in the <a href=\"https:\/\/kendra-essentials.workshop.aws\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra Essentials<\/a> workshop and <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/getting-started-s3.html\" target=\"_blank\" rel=\"noopener noreferrer\">Getting Started with an Amazon S3 data source (console)<\/a>.<\/li>\n<li>Extract the contents of the data file AWS_Whitepapers.zip to your local machine and upload them to the S3 bucket you created at the path <code>s3:\/\/<span>&lt;YOUR-DATASOURCE-BUCKET&gt;<\/span>\/Data\/<\/code> while preserving the subdirectory structure.<\/li>\n<\/ol>\n<h2>Automatically assign category attributes based on the subdirectory of the document being ingested<\/h2>\n<p>The documents in the sample data are stored in subdirectories <code>Best_Practices<\/code>, <code>Databases<\/code>, <code>General<\/code>, <code>Machine_Learning<\/code>, <code>Security<\/code>, and <code>Well_Architected<\/code>. 
The S3 bucket used as the data source looks like the following screenshot.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31329\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/2-6962-Data.jpg\" alt=\"\" width=\"800\" height=\"408\"><\/p>\n<p>We use CDE basic operations to automatically set the category attribute based on the subdirectory a document belongs to while the document is being ingested.<\/p>\n<ol>\n<li>On the Amazon Kendra console, open the index you created.<\/li>\n<li>Choose <strong>Data sources<\/strong> in the navigation pane.<\/li>\n<li>Choose the data source used in this example.<\/li>\n<li>Copy the data source ID.<\/li>\n<li>Choose <strong>Document enrichment<\/strong> in the navigation pane.<\/li>\n<li>Choose <strong>Add document enrichment<\/strong>.<\/li>\n<li>For <strong>Data Source ID<\/strong>, enter the ID you copied.<\/li>\n<li>Enter six basic operations, one corresponding to each subdirectory.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31330\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/3-6962-Configure.jpg\" alt=\"\" width=\"800\" height=\"1181\"><\/p>\n<ol start=\"9\">\n<li>Choose\u00a0<strong>Next<\/strong>.<\/li>\n<li>Leave the configuration for both Lambda functions blank.<\/li>\n<li>For <strong>Service permissions<\/strong>, choose <strong>Enter custom role ARN<\/strong> and enter the <code>CDERoleARN<\/code> value (available on the stack\u2019s <strong>Outputs<\/strong> tab).<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31331\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/4-6962-Service.jpg\" alt=\"\" width=\"800\" height=\"392\"><\/p>\n<ol start=\"12\">\n<li>Choose <strong>Next<\/strong>.<\/li>\n<\/ol>\n<p><img 
decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31332\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/5-6962-Review.jpg\" alt=\"\" width=\"800\" height=\"995\"><\/p>\n<ol start=\"13\">\n<li>Review all the information and choose <strong>Add document enrichment<\/strong>.<\/li>\n<li>Browse back to the data source we\u2019re using by choosing <strong>Data sources<\/strong> in the navigation pane and choose the data source.<\/li>\n<li>Choose <strong>Sync now<\/strong> to start data source sync.<\/li>\n<\/ol>\n<p>The data source sync can take up to 10\u201315 minutes to complete.<\/p>\n<ol start=\"16\">\n<li>While waiting for the data source sync to complete, choose <strong>Facet definition<\/strong> in the navigation pane.<\/li>\n<li>For the <strong>Index<\/strong> field of <strong>_category<\/strong>, select <strong>Facetable<\/strong>, <strong>Searchable<\/strong>, and <strong>Displayable<\/strong> to enable these properties.<\/li>\n<li>Choose <strong>Save<\/strong>.<\/li>\n<li>Browse back to the data source page and wait for the sync to complete.<\/li>\n<li>When the data source sync is complete, choose <strong>Search indexed content<\/strong> in the navigation pane.<\/li>\n<li>Enter the query <code>Which service provides 11 9s of durability?<\/code>.<\/li>\n<li>After you get the search results, choose <strong>Filter search results<\/strong>.<\/li>\n<\/ol>\n<p>The following screenshot shows the results.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31333\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/6-6962-Which-service.jpg\" alt=\"\" width=\"800\" height=\"721\"><\/p>\n<p>For each of the documents that were ingested, the category attribute values set by the CDE basic operations are seen as selectable facets.<\/p>\n<p>Note <strong>Document fields<\/strong> for each of the results. 
When you click on it, it shows the fields or attributes of the document included in that result as seen in the screenshot below.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31334\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/7-6962-.jpg\" alt=\"\" width=\"800\" height=\"666\"><\/p>\n<p>From the selectable facets, you can select a category, such as <strong>Best Practices<\/strong>, to filter your search results to be only from the <code>Best Practices<\/code> category, as shown in the following screenshot. The search experience improved significantly without requiring additional manual steps during document ingestion.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31335\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/8-6962-.jpg\" alt=\"\" width=\"800\" height=\"620\"><\/p>\n<h2>Automatically extract text while ingesting scanned image documents to make them searchable<\/h2>\n<p>In order for documents that are scanned as images to be searchable, you first need to extract the text from such documents and ingest that text in an Amazon Kendra index. The pre-extraction Lambda function from the CDE advanced operations provides a place to implement text extraction and modification logic. The pre-extraction function we configure has the code to extract the text from images using Amazon Textract. The function code is embedded in the CloudFormation template we used earlier. 
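<\/p>
<p>As an illustration, the core of such a pre-extraction function might look like the following sketch. This is not the exact code from the template: the event and response fields follow the CDE Lambda data contract, the <code>cde_output\/<\/code> prefix is an arbitrary choice, and the Amazon Textract and Amazon S3 clients are passed in as parameters so the logic can be exercised without AWS credentials (in Lambda you would create them with boto3).<\/p>

```python
def extract_lines(textract_response):
    """Join the LINE blocks in a Textract response into plain text."""
    return "\n".join(
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    )


def lambda_handler(event, context, textract=None, s3=None):
    # In Lambda you would create the real clients here, e.g.:
    #   import boto3
    #   textract = textract or boto3.client("textract")
    #   s3 = s3 or boto3.client("s3")
    bucket = event["s3Bucket"]
    key = event["s3ObjectKey"]

    if not key.lower().endswith((".png", ".jpg", ".jpeg")):
        # Not a scanned image: hand the document back unchanged.
        return {"version": "v0", "s3ObjectKey": key, "metadataUpdates": []}

    # Synchronous OCR; suitable for single-page images.
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    text = extract_lines(response)

    # Write the extracted text back to S3 and point Amazon Kendra at it.
    output_key = "cde_output/" + key + ".txt"
    s3.put_object(Bucket=bucket, Key=output_key, Body=text.encode("utf-8"))
    return {"version": "v0", "s3ObjectKey": output_key, "metadataUpdates": []}
```

<p>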
You can choose the <strong>Template<\/strong> tab of the stack on the AWS CloudFormation console and review the code for <code>PreExtractionLambda<\/code>.<\/p>\n<p>We now configure CDE advanced operations to try out this example and the ones that follow.<\/p>\n<ol>\n<li>On the Amazon Kendra console, choose <strong>Document enrichments<\/strong> in the navigation pane.<\/li>\n<li>Select the CDE we configured.<\/li>\n<li>On the <strong>Actions<\/strong> menu, choose <strong>Edit<\/strong>.<\/li>\n<li>Choose <strong>Add basic operations<\/strong>.<\/li>\n<\/ol>\n<p>You can view all the basic operations you added.<\/p>\n<ol start=\"5\">\n<li>Add two more operations: one for Media and one for GEN_META.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31336\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/9-6962-GEN.jpg\" alt=\"\" width=\"800\" height=\"457\"><\/p>\n<ol start=\"6\">\n<li>Choose <strong>Next<\/strong>.<\/li>\n<\/ol>\n<p>In this step, you need the ARNs of the <code>preExtraction<\/code> and <code>postExtraction<\/code> functions (available on the <strong>Outputs<\/strong> tab of the CloudFormation stack). 
We use the same bucket that you\u2019re using as the data source bucket.<\/p>\n<ol start=\"7\">\n<li>Enter the conditions, ARN, and bucket details for the pre-extraction and post-extraction functions.<\/li>\n<li>For <strong>Service permissions<\/strong>, choose <strong>Enter custom role ARN<\/strong> and enter the <code>CDERoleARN<\/code> value (available on the stack\u2019s <strong>Outputs<\/strong> tab).<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31337\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/10-6962-Lambda.jpg\" alt=\"\" width=\"800\" height=\"914\"><\/p>\n<ol start=\"9\">\n<li>Choose\u00a0<strong>Next.\u00a0<\/strong><\/li>\n<li>Choose <strong>Add document enrichment<\/strong>.<\/li>\n<\/ol>\n<p>Now we\u2019re ready to ingest scanned images into our index. The sample data file Media.zip you downloaded earlier contains two image files: Yosemite.png and Yellowstone.png. 
These are scanned pictures of the Wikipedia pages of Yosemite National Park and Yellowstone National Park, respectively.<\/p>\n<ol start=\"11\">\n<li>Upload these to the S3 bucket being used as the data source in the folder <code>s3:\/\/<span>&lt;YOUR-DATASOURCE-BUCKET&gt;<\/span>\/Data\/Media\/<\/code>.<\/li>\n<li>Open the data source on the Amazon Kendra console and start a data source sync.<\/li>\n<li>When the data source sync is complete, browse to <strong>Search indexed content<\/strong>\u00a0and enter the query <code>Where is Yosemite National Park?<\/code>.<\/li>\n<\/ol>\n<p>The following screenshot shows the search results.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31338\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/11-6962-Where-is-Yosemite.jpg\" alt=\"\" width=\"800\" height=\"595\"><\/p>\n<ol start=\"14\">\n<li>Choose the link from the top search result.<\/li>\n<\/ol>\n<p>The scanned image pops up, as in the following screenshot.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31339\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/12-6962-Wikipedia.jpg\" alt=\"\" width=\"800\" height=\"553\"><\/p>\n<p>You can experiment with similar questions related to Yellowstone.<\/p>\n<h2>Automatically create a transcription while ingesting audio or video files to make them searchable<\/h2>\n<p>Similar to images, audio and video content needs to be transcribed in order to be searchable. The pre-extraction Lambda function also contains the code to call Amazon Transcribe for audio and video files to transcribe them and extract a time-marked transcript. 
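<\/p>
<p>The start-and-poll pattern such a function relies on can be sketched as follows. This is an illustration rather than the template\u2019s exact code; the client is a parameter so the function can be tested with a stub, and in Lambda you would pass <code>boto3.client(\"transcribe\")<\/code>.<\/p>

```python
import time


def transcribe_media(transcribe, job_name, media_uri, media_format,
                     poll_seconds=10, max_polls=24):
    """Start an Amazon Transcribe job and poll until it finishes,
    returning the transcript file URI. With the defaults this waits at
    most about 4 minutes, inside the Lambda runtime budget."""
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        MediaFormat=media_format,  # e.g. "mp4", "mp3", "wav"
        LanguageCode="en-US",
    )
    for _ in range(max_polls):
        job = transcribe.get_transcription_job(
            TranscriptionJobName=job_name
        )["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        time.sleep(poll_seconds)
    raise TimeoutError("transcription did not finish within the polling budget")
```

<p>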
Let\u2019s try it out.<\/p>\n<p>The maximum runtime allowed for a CDE pre-extraction Lambda function is 5 minutes (300 seconds), so you can only use it to transcribe audio or video files of short duration, about 10 minutes or less. For longer files, you can use the approach described in <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/make-your-audio-and-video-files-searchable-using-amazon-transcribe-and-amazon-kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra<\/a>.<\/p>\n<p>The sample data file Media.zip contains a video file How_do_I_configure_a_VPN_over_AWS_Direct_Connect_.mp4, a short video tutorial.<\/p>\n<ol>\n<li>Upload this file to the S3 bucket being used as the data source in the folder <code>s3:\/\/<span>&lt;YOUR-DATASOURCE-BUCKET&gt;<\/span>\/Data\/Media\/<\/code>.<\/li>\n<li>On the Amazon Kendra console, open the data source and start a data source sync.<\/li>\n<li>When the data source sync is complete, browse to <strong>Search indexed content <\/strong>and enter the query <code>What is the process to configure VPN over AWS Direct Connect?<\/code>.<\/li>\n<\/ol>\n<p>The following screenshot shows the search results.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31340\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/13-6962.jpg\" alt=\"\" width=\"800\" height=\"592\"><\/p>\n<ol start=\"4\">\n<li>Choose the link in the answer to start the video.<\/li>\n<\/ol>\n<p>If you seek to an offset of 84.44 seconds (1 minute, 24 seconds), you\u2019ll hear exactly what the excerpt shows.<\/p>\n<h2>Automatically generate facets based on entities in a document to enhance the search experience<\/h2>\n<p>Relevant facets, such as the places, people, and events mentioned in documents, when presented as part of search results, provide an interactive way for a user to 
filter search results and find what they\u2019re looking for. Amazon Kendra metadata, when populated correctly, can provide these facets and enhance the user experience.<\/p>\n<p>The post-extraction Lambda function allows you to implement the logic to process the text extracted by Amazon Kendra from the ingested document, and then create or update the metadata. The post-extraction function we configured invokes Amazon Comprehend to detect entities in the text extracted by Amazon Kendra, and uses them to update the document metadata, which is presented as facets in an Amazon Kendra search. The function code is embedded in the CloudFormation template we used earlier. You can choose the <strong>Template<\/strong> tab of the stack on the CloudFormation console and review the code for <code>PostExtractionLambda<\/code>.<\/p>\n<p>The maximum runtime allowed for a CDE post-extraction function is 60 seconds, so you can only use it to implement tasks that can be completed in that time.<\/p>\n<p>Before we can try out this example, we need to define the entity types that we detect using Amazon Comprehend as facets in our Amazon Kendra index.<\/p>\n<ol>\n<li>On the Amazon Kendra console, choose the index we\u2019re working on.<\/li>\n<li>Choose <strong>Facet definition<\/strong> in the navigation pane.<\/li>\n<li>Choose <strong>Add field<\/strong> and add fields for <code>COMMERCIAL_ITEM<\/code>, <code>DATE<\/code>, <code>EVENT<\/code>, <code>LOCATION<\/code>, <code>ORGANIZATION<\/code>, <code>OTHER<\/code>, <code>PERSON<\/code>, <code>QUANTITY<\/code>, and <code>TITLE<\/code> of type <code>StringList<\/code>.<\/li>\n<li>Make <code>LOCATION<\/code>, <code>ORGANIZATION<\/code>, and <code>PERSON<\/code> facetable by selecting <strong>Facetable<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31341\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/14-6962-Index.jpg\" alt=\"\" width=\"800\" height=\"578\"><\/p>\n<ol start=\"5\">\n<li>Extract the contents of the GenMeta.zip data file and upload the files United_Nations_Climate_Change_conference_Wikipedia.pdf, United_Nations_General_Assembly_Wikipedia.pdf, United_Nations_Security_Council_Wikipedia.pdf, and United_Nations_Wikipedia.pdf to the S3 bucket being used as the data source in the folder <code>s3:\/\/<span>&lt;YOUR-DATASOURCE-BUCKET&gt;<\/span>\/Data\/GEN_META\/<\/code>.<\/li>\n<li>Open the data source on the Amazon Kendra console and start a data source sync.<\/li>\n<li>When the data source sync is complete, browse to <strong>Search indexed content<\/strong>\u00a0and enter the query <code>What is Paris agreement?<\/code>.<\/li>\n<li>After you get the results, choose <strong>Filter search results<\/strong> in the navigation pane.<\/li>\n<\/ol>\n<p>The following screenshot shows the faceted search results.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-31342\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/29\/15-6962-What-is-Paris.jpg\" alt=\"\" width=\"800\" height=\"637\"><\/p>\n<p>All the facets of the type <code>ORGANIZATION<\/code>, <code>LOCATION<\/code>, and <code>PERSON<\/code> are automatically generated by the post-extraction Lambda function with the detected entities using Amazon Comprehend. You can use these facets to interactively filter the search results. 
You can also try a few more queries and experiment with the facets.<\/p>\n<h2>Clean up<\/h2>\n<p>After you have experimented with the Amazon Kendra index and the features of CDE, delete the infrastructure you provisioned in your AWS account while working on the examples in this post:<\/p>\n<ul>\n<li>CloudFormation stack<\/li>\n<li>Amazon Kendra index<\/li>\n<li>S3 bucket<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>Enhancing content and metadata can improve the relevance of search results and enhance the search experience. You can use the custom document enrichment (CDE) feature of Amazon Kendra to automate the enrichment process by creating, modifying, or deleting metadata using basic operations. You can also use the advanced operations with pre-extraction and post-extraction Lambda functions to implement custom logic that manipulates the content and metadata.<\/p>\n<p>We demonstrated using subdirectories to assign categories, using Amazon Textract to extract text from scanned images, using Amazon Transcribe to generate a transcript of audio and video files, and using Amazon Comprehend to detect entities that are added as metadata and later available as facets to interact with the search results. 
This is just an illustration of how you can use CDE to create a differentiated search experience for your users.<\/p>\n<p>For a deeper dive into what you can achieve by combining other AWS services with Amazon Kendra, refer to <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/make-your-audio-and-video-files-searchable-using-amazon-transcribe-and-amazon-kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Make your audio and video files searchable using Amazon Transcribe and Amazon Kendra<\/a>, <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/build-an-intelligent-search-solution-with-automated-content-enrichment\/\" target=\"_blank\" rel=\"noopener noreferrer\">Build an intelligent search solution with automated content enrichment<\/a>, and other posts on the <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/category\/artificial-intelligence\/amazon-kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra blog<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-20223 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/24\/Abhinav-Jawadekar.jpg\" alt=\"Abhinav Jawadekar\" width=\"100\" height=\"133\"><strong>Abhinav Jawadekar<\/strong> is a Senior Partner Solutions Architect at Amazon Web Services. 
Abhinav works with AWS Partners to help them in their cloud journey.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/enrich-your-content-and-metadata-to-enhance-your-search-experience-with-custom-document-enrichment-in-amazon-kendra\/<\/p>\n","protected":false},"author":0,"featured_media":1294,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1293"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1293"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1293\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1294"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1293"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1293"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1293"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}