{"id":1171,"date":"2021-11-09T08:34:47","date_gmt":"2021-11-09T08:34:47","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/09\/augment-search-with-metadata-by-chaining-amazon-textract-amazon-comprehend-and-amazon-kendra\/"},"modified":"2021-11-09T08:34:47","modified_gmt":"2021-11-09T08:34:47","slug":"augment-search-with-metadata-by-chaining-amazon-textract-amazon-comprehend-and-amazon-kendra","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/09\/augment-search-with-metadata-by-chaining-amazon-textract-amazon-comprehend-and-amazon-kendra\/","title":{"rendered":"Augment search with metadata by chaining Amazon Textract, Amazon Comprehend, and Amazon Kendra"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra<\/a> is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they\u2019re looking for, even when it\u2019s scattered across multiple locations and content repositories within your organization. With Amazon Kendra, you can stop searching through troves of unstructured data and discover the right answers to your questions, when you need them.<\/p>\n<p>Although Amazon Kendra is a great search tool, it only performs as well as the quality of documents in its index. Like with all things AI\/ML related, the better the quality of data that is input into Amazon Kendra, the more targeted and precise the search results. So how can we improve the documents in our Amazon Kendra index to maximize search result performance? 
To allow Amazon Kendra to return more targeted search results, we enrich the documents with metadata to use attributes such as main language, named entities, key phrases, and more.<\/p>\n<p>In this post, we address the following use case: With large amounts of raw historical documents to search on, how do we connect metadata to the documents to take advantage of Amazon Kendra\u2019s boosting and filtering features? We aim to demonstrate a way in which you can enrich your historical data by adding metadata to searchable documents with Amazon Textract and Amazon Comprehend, to get more targeted and flexible searches with Amazon Kendra.<\/p>\n<p>Note that the following tutorial is written using Amazon SageMaker Notebooks as a code platform. However, these API calls can be done using any IDE of your choice. To save costs, and for the sake of your own familiarity, feel free to use your favorite IDE in place of SageMaker Notebooks to follow along.<\/p>\n<h2>Solution overview<\/h2>\n<p>For this post, we examine a hypothetical use case for a media entertainment company. We have many documents about movies and television shows and want to use Amazon Kendra to query the data. For demonstration purposes, we pull public data on <a href=\"https:\/\/en.wikipedia.org\/wiki\/Main_Page\" target=\"_blank\" rel=\"noopener noreferrer\">Wikipedia<\/a> to create PDF documents that act as our company\u2019s data that we want to query on. We use an <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/nbi.html\" target=\"_blank\" rel=\"noopener noreferrer\">notebook instance<\/a> as our code platform. 
We use Python, along with the <a href=\"https:\/\/boto3.amazonaws.com\/v1\/documentation\/api\/latest\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">Boto3 Python library<\/a>, to connect to and use the <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a>, <a href=\"https:\/\/aws.amazon.com\/comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Comprehend<\/a>, and Amazon Kendra APIs.<\/p>\n<p>We walk you through the following high-level steps:<\/p>\n<ol>\n<li>Create our media PDF documents through Wikipedia.<\/li>\n<li>Create metadata using Amazon Textract and Amazon Comprehend.<\/li>\n<li>Configure the Amazon Kendra index and load the data.<\/li>\n<li>Run a sample query and experiment with boosting query performance.<\/li>\n<\/ol>\n<h2>Prerequisites<\/h2>\n<p>As a prerequisite, we first set up a SageMaker notebook instance and a Python notebook within it.<\/p>\n<h3>Create a SageMaker notebook instance<\/h3>\n<p>To create a SageMaker notebook instance, you can follow the instructions in the documentation <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/howitworks-create-ws.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create a Notebook Instance<\/a>, or follow the configuration that we use in this post.<\/p>\n<ol>\n<li>Create a notebook instance with the following configuration:\n<ol type=\"a\">\n<li><strong>Notebook instance name<\/strong> \u2013 <code>KendraAugmentation<\/code><\/li>\n<li><strong>Notebook instance class <\/strong>\u2013 ml.t2.medium<\/li>\n<li><strong>Elastic inference <\/strong>\u2013 None<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p>Next, we create an <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) role.<\/p>\n<ol start=\"2\">\n<li>Choose <strong>Create a new role<\/strong>.<\/li>\n<li>Choose <strong>Next<\/strong> to create a role.<\/li>\n<\/ol>\n<p>The role 
name starts with <code>AmazonSageMaker-ExecutionRole-xxxxxxxx<\/code>. For this example, we create a role called <code><strong>AmazonSageMaker-ExecutionRole-Kendra-Blog<\/strong><\/code>.<\/p>\n<ol start=\"4\">\n<li>For <strong>Root access<\/strong>, select <strong>Enable<\/strong>.<\/li>\n<li>Leave the remaining options at their default.<\/li>\n<li>Choose <strong>Create notebook instance<\/strong>.<\/li>\n<\/ol>\n<p>You\u2019re redirected to a page that shows that your notebook instance is being created. The process takes a few minutes. When you see a green <code><strong>InService<\/strong><\/code> state, the notebook is ready.<\/p>\n<h3>Create a Python3 notebook in your SageMaker notebook instance<\/h3>\n<p>When your SageMaker instance is ready, choose the version of Jupyter you prefer to use. For this post, we use the original Jupyter notebook as opposed to JupyterLab.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image001.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29597\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image001.jpg\" alt=\"\" width=\"2204\" height=\"402\"><\/a><\/p>\n<p>When inside, create a new <code><strong>conda_python3<\/strong><\/code> notebook.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image003-2.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29610\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image003-2.jpg\" alt=\"\" width=\"300\" height=\"508\"><\/a><\/p>\n<p>With this, we\u2019re ready to start writing and running Python code. 
To run the rest of the code that follows, run the following code in the first Jupyter notebook cell to import the necessary modules we need:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Module Imports\nimport boto3\nimport os<\/code><\/pre>\n<\/p><\/div>\n<p>As we go through each of the sections, we import other modules as necessary.<\/p>\n<h2>Create Media PDF documents through Wikipedia<\/h2>\n<p>Run the following code in one of the Jupyter notebook cells to create an <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket where we store all the media documents that we search on:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Instantiate Amazon S3 client.\ns3_client = boto3.client('s3')\n\n# Create bucket.\nbucket_name = \"kendra-augmentation-documents-jp\"\ns3_client.create_bucket(Bucket=bucket_name)\n\n# List buckets to make sure bucket was created.\nresponse = s3_client.list_buckets()\nprint('Existing buckets:')\nfor bucket in response['Buckets']:\n    print(f'  {bucket[\"Name\"]}')<\/code><\/pre>\n<\/p><\/div>\n<p>For this post, our bucket is <code><strong>kendra-augmentation-documents-jp<\/strong><\/code>. You can update the code with a different name.<\/p>\n<p>As we mentioned earlier, we create mock PDF documents from public Wikipedia content that represent the media data that we augment and perform searches on. 
I\u2019ve pre-selected movies and TV shows from the entertainment industry in the following code, but you can choose different topics in your notebook.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Import the fpdf and wikipedia modules.\n# If needed, install them first with: pip install fpdf wikipedia\nfrom fpdf import FPDF\nimport wikipedia\n\n# Movie and TV show topics for media documents.\ntopics = [\"Dumb &amp; Dumber Movie\", \"Black Panther Movie\",\n          \"Star Wars The Last Jedi Movie\", \"Mary Poppins Movie\",\n          \"Kung Fu Panda Movie\", \"I Love Lucy\", \"The Office TV Show\",\n          \"Star Trek: The Original Series\", \"NCIS TV Show\",\n          \"Game of Thrones TV Show\"]\n\n# Create PDF documents out of each topic and store into Amazon S3 bucket.\nfor topic in topics:\n    \n    # Define text and PDF document names.\n    text_filename = f\"{topic}.txt\".replace(\" \",\"_\")\n    pdf_filename = text_filename.replace(\"txt\",\"pdf\")\n    \n    # Write to text first.\n    with open(text_filename, \"w+\") as f:\n        f.write(wikipedia.summary(topic))\n        \n    # Convert text to pdf.\n    pdf = FPDF()\n    with open(text_filename, 'rb') as text_file:\n        txt = text_file.read().decode('latin-1')\n    pdf.set_font('Times', '', 12)\n    pdf.add_page()\n    pdf.multi_cell(0, 5, txt)\n    pdf.output(pdf_filename, 'F')\n    \n    # Upload to Amazon S3 bucket.\n    s3_client = boto3.client('s3')\n    s3_client.upload_file(pdf_filename,\n                          \"kendra-augmentation-documents-jp\",\n                          pdf_filename)\n    os.remove(text_filename)\n    os.remove(pdf_filename)<\/code><\/pre>\n<\/p><\/div>\n<p>When this code block finishes running, we have 10 media PDF documents that we can augment with metadata using Amazon Textract and Amazon Comprehend, then run queries on with Amazon Kendra.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image005-2.jpg\"><img decoding=\"async\" loading=\"lazy\" 
class=\"alignnone wp-image-29612 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image005-2.jpg\" alt=\"\" width=\"300\" height=\"343\"><\/a><\/p>\n<h2>Create metadata using Amazon Textract and Amazon Comprehend<\/h2>\n<p>To create metadata for each of our PDF files, we must first extract the text portions of each PDF using Amazon Textract. We run the extracted text through Amazon Comprehend to attach attributes (metadata) to the PDFs, such as named entities, dominant language, and key phrases. Note that Amazon Comprehend will be able to read PDF files directly in a future feature release.<\/p>\n<ol>\n<li>Use the following helper function (<code><strong>s3_get_filenames<\/strong><\/code>) to get all the file names in a specific bucket or prefix folder in Amazon S3:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">def s3_get_filenames(bucket, prefix=None):\n    \"\"\"\n    Gets all the filenames in a specific bucket\/prefix folder in Amazon S3.\n    \n    Parameters:\n    ----------\n    bucket : str\n        String representing bucket name you want to get filenames from.\n    prefix : str\n        String representing prefix within the bucket that you want to get\n        filenames from.\n        \n    Returns:\n    -------\n    file_list : list[str]\n        List containing all filenames within the bucket\/prefix location\n    \"\"\"\n    \n    # Set Amazon S3 client and get file objects.\n    s3 = boto3.client('s3')\n    if prefix == None:\n        prefix = ''\n    result = s3.list_objects(Bucket=bucket, Prefix=prefix)\n    \n    # Put all file names into one list.\n    file_list = []\n    for obj in result['Contents']:\n        # Only take objects that are not the folder.\n        if 'metadata\/' not in obj['Key']:\n            file_list += [obj['Key']]\n        \n    return file_list<\/code><\/pre>\n<\/p><\/div>\n<p>We run Amazon Textract on each of our PDF 
files to extract the text of each file and transform the data into a format that we later ingest into Amazon Comprehend.<\/p>\n<p>Next, we create the S3 bucket and service role settings needed to run Amazon Textract through SageMaker notebook instances.<\/p>\n<ol start=\"2\">\n<li>Create a new S3 bucket to store our Amazon Textract output, called <code><strong>kendra-augmentation-textract-output-jp<\/strong><\/code>:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Create a new bucket for Amazon Textract text outputs:\nbucket_name = \"kendra-augmentation-textract-output-jp\"\ns3_client.create_bucket(Bucket=bucket_name)<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"3\">\n<li>Attach the <code><strong>AmazonTextractFullAccess<\/strong><\/code> policy to the same <code><strong>AmazonSageMaker-ExecutionRole-Kendra-Blog <\/strong><\/code>role.<\/li>\n<\/ol>\n<p>This policy allows SageMaker to access Amazon Textract.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image007.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-29600 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image007.jpg\" alt=\"\" width=\"2268\" height=\"1090\"><\/a><\/p>\n<ol start=\"4\">\n<li>Run the following code to run Amazon Textract on our PDF files, create new .txt files for Amazon Comprehend to use, and send these files to the S3 bucket we created:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Amazon Textract specific imports.\n!pip install amazon-textract-caller amazon-textract-prettyprinter\nfrom textractcaller.t_call import call_textract, Textract_Features\nfrom textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_lines_string\nimport textractprettyprinter.t_pretty_print as prettyprint\n\n# Transform and output documents from s3.\ninput_bucket 
= 'kendra-augmentation-documents-jp'\noutput_bucket = 'kendra-augmentation-textract-output-jp'\ns3 = boto3.client('s3')\n\n# Get all the names of the media documents we want to run Textract on.\ninput_documents = s3_get_filenames(input_bucket)\n\n# Loop through input documents, transform textract outputs into LINE\n# transformations, and output to S3 for ingestion into Comprehend.\nfor document_name in input_documents:\n    \n    # Define input document to read.\n    input_document = f's3:\/\/{input_bucket}\/{document_name}'\n    \n    # Get text using Textract call_textract function.\n    textract_json = call_textract(input_document=input_document)\n    \n    # Convert response from Textract using get_lines_string function.\n    line_transformation_text = get_lines_string(textract_json=textract_json)\n\n    # Put text into text file to send back to S3.\n    filename = f'{document_name}_LINE.txt'\n    with open(filename,'w+') as f:\n        f.write(line_transformation_text)\n        \n    # Send text file to S3 to be ingested into Comprehend.\n    with open(filename, 'rb') as data:\n        s3.upload_fileobj(data, output_bucket, filename)<\/code><\/pre>\n<\/p><\/div>\n<p>We now have a .txt Amazon Textract output file for each of our PDFs in the <code><strong>kendra-augmentation-textract-output-jp<\/strong><\/code> bucket.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image009.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29601\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image009.jpg\" alt=\"\" width=\"2140\" height=\"1200\"><\/a><\/p>\n<p>Now that we have our .txt files with the text representation of our PDF files, we can create metadata out of them using Amazon Comprehend.<\/p>\n<ol start=\"5\">\n<li>Attach the <code><strong>ComprehendFullAccess<\/strong><\/code> policy to the 
<code><strong>AmazonSageMaker-ExecutionRole-Kendra-Blog<\/strong><\/code> role.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image011.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29602\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image011.jpg\" alt=\"\" width=\"2238\" height=\"1158\"><\/a><\/li>\n<\/ol>\n<p>We extract and attach the following Amazon Comprehend metadata attributes to each document for Amazon Kendra to index on:<\/p>\n<ul>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/comprehend\/latest\/dg\/how-languages.html\" target=\"_blank\" rel=\"noopener noreferrer\">Dominant language<\/a> \u2013 The language that\u2019s being used the most in the document<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/comprehend\/latest\/dg\/how-entities.html\" target=\"_blank\" rel=\"noopener noreferrer\">Named entities<\/a> \u2013 A textual reference to the unique name of a real-world object, such as people, places, and commercial items, and precise references to measures such as dates and quantities<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/comprehend\/latest\/dg\/how-key-phrases.html\" target=\"_blank\" rel=\"noopener noreferrer\">Key phrases<\/a> \u2013 A string containing a noun phrase that describes a particular thing<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/comprehend\/latest\/dg\/how-sentiment.html\" target=\"_blank\" rel=\"noopener noreferrer\">Sentiment<\/a> \u2013 The positive, negative, neutral, and mixed sentiment score of the entire document<\/li>\n<\/ul>\n<ol start=\"6\">\n<li>Use the following <code><strong>ComprehendAnalyzer<\/strong><\/code> Python class to simplify and unify the Amazon Comprehend API calls. 
Either copy and paste the code into one of the notebook cells and run it, or create a separate .py file and import it in the notebook.<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import boto3\nimport json\n\nclass ComprehendAnalyzer:\n    \"\"\"\n    Class that takes a document in Amazon S3 and uses Amazon Comprehend to define\n    metadata to it as attributes for the purpose of being used by Amazon Kendra\n    downstream.\n    \"\"\"\n    \n    def __init__(self, s3_bucket, document=None, lang=None):\n        \"\"\"\n        Instantiates Amazon S3 and Amazon Comprehend clients plus class attributes.\n        \"\"\"\n        \n        # Instantiate Amazon S3 and Amazon Comprehend.\n        self.s3 = boto3.client('s3')\n        self.comprehend = boto3.client('comprehend')\n        \n        # Instantiate class attributes.\n        self.s3_bucket = s3_bucket\n        self.document = document\n        self.lang = lang if lang is not None else 'en'\n            \n        # Attribute list that will be used by Amazon Kendra downstream.\n        self.attribute_list = []\n        \n    def set_document(self, document):\n        \"\"\"\n        Sets self.document whenever you want to analyze a new document without having\n        to instantiate another comprehend_analyzer object.\n        \n        Parameters:\n        ----------\n        document : str\n            String representing document filepath to process.\n            \n        Returns:\n        -------\n        Void\n        \"\"\"\n        \n        # Set self.document to new document and reset self.attribute_list\n        self.document = document\n        self.attribute_list = []\n        \n        \n    def get_dominant_languages(self, confidence_threshold=.75):\n        \"\"\"\n        Gets the dominant languages in self.document in self.s3_bucket in string format. Only\n        add languages with a confidence score that is greater than or equal to the confidence_threshold input\n        parameter.\n\n        Parameters:\n        ----------\n        confidence_threshold : float \n            Float representing confidence threshold for adding languages to metadata. Defaults\n            to .75.\n\n        Returns:\n        -------\n        languages_text : str\n            String representing all the dominant languages found in the document that are\n            greater than, or equal to, the confidence_threshold input parameter.\n        \"\"\"\n\n        # Grab text from document.\n        test_text = self.s3.get_object(Bucket=self.s3_bucket,\n                                       Key=self.document)['Body'].read()\n\n        # Detect language using Amazon Comprehend.\n        comprehend_response = self.comprehend.detect_dominant_language(Text = test_text.decode('utf-8'))\n\n        # Take languages over confidence_threshold from comprehend_response.\n        languages = []\n        for l in comprehend_response['Languages']:\n            if l['Score'] &gt;= confidence_threshold:\n                languages.append(l['LanguageCode'])\n        languages_text = ', '.join(languages)\n        \n        # Attribute dictionary input.\n        attribute_format = {'Key' : 'Languages',\n                            'Value' : {'StringValue' : languages_text}}\n        \n        # Add languages to self.attribute_list.\n        self.attribute_list.append(attribute_format)\n\n        return languages_text\n        \n    def get_named_entities(self, confidence_threshold=.75):\n        \"\"\"\n        Gets the named entities in self.document in Amazon S3 in string format. 
Only\n        add named entities with a confidence score that is greater than or equal to\n        the confidence_threshold input parameter.\n\n        Parameters:\n        ----------\n        confidence_threshold : float \n            Float representing confidence threshold for adding named entities to metadata. Defaults\n            to .75.\n\n        Returns:\n        -------\n        named_entities : list\n            List representing all the named entities found in the document that are\n            greater than, or equal to, the confidence_threshold input parameter.\n        \"\"\"\n\n        # Grab text from document.\n        test_text = self.s3.get_object(Bucket=self.s3_bucket,\n                                       Key=self.document)['Body'].read()\n\n        # Detect named entities using Amazon Comprehend.\n        comprehend_response = self.comprehend.detect_entities(Text = test_text.decode('utf-8'),\n                                                              LanguageCode=self.lang)\n\n        # Take named entities over confidence_threshold from comprehend_response.\n        named_entities = []\n        for entity in comprehend_response['Entities']:\n            if entity['Score'] &gt;= confidence_threshold:\n                named_entities.append(entity['Text'])\n\n        attribute_format = {'Key' : 'Named_Entities',\n                            'Value' : {'StringListValue' : named_entities[0:10]}}\n        \n        # Add named entities to self.attribute_list.\n        self.attribute_list.append(attribute_format)\n\n        return named_entities\n    \n    def get_key_phrases(self, confidence_threshold=.75):\n        \"\"\"\n        Gets key phrases in self.document in Amazon S3 in string format. 
Only add key phrases\n        with a confidence score that is greater than or equal to the confidence_threshold input\n        parameter.\n\n        Parameters:\n        ----------\n        confidence_threshold : float \n            Float representing confidence threshold for adding key phrases to metadata. Defaults\n            to .75.\n\n        Returns:\n        -------\n        key_phrases : list\n            List representing all the key phrases found in the document that are\n            greater than, or equal to, the confidence_threshold input parameter.\n        \"\"\"\n\n        # Grab text from document.\n        test_text = self.s3.get_object(Bucket=self.s3_bucket,\n                                       Key=self.document)['Body'].read()\n\n        # Detect key phrases using Amazon Comprehend.\n        comprehend_response = self.comprehend.detect_key_phrases(Text = test_text.decode('utf-8'),\n                                                                 LanguageCode=self.lang)\n\n        # Take key phrases over confidence_threshold from comprehend_response.\n        key_phrases = []\n        for phrase in comprehend_response['KeyPhrases']:\n            if phrase['Score'] &gt;= confidence_threshold:\n                key_phrases.append(phrase['Text'])\n                \n        # Attribute dictionary input.\n        attribute_format = {'Key' : 'Key_Phrases',\n                            'Value' : {'StringListValue' : key_phrases[0:10]}}\n        \n        # Add key phrases to self.attribute_list.\n        self.attribute_list.append(attribute_format)\n        \n        return key_phrases\n    \n    def get_sentiment(self):\n        \"\"\"\n        Gets sentiment in self.document in Amazon S3 in string format. Returns the overall\n        sentiment label along with individual scores for the positive, negative, neutral,\n        and mixed sentiment classes.\n\n        Parameters:\n        ----------\n        None\n\n        Returns:\n        -------\n        sentiment_dict : dict\n            Dictionary representing all the sentiment found in the document broken down\n            into overall sentiment and individual scores for positive, negative, neutral,\n            and mixed sentiments.\n        \"\"\"\n\n        # Grab text from document.\n        test_text = self.s3.get_object(Bucket=self.s3_bucket,\n                                       Key=self.document)['Body'].read()\n\n        # Detect sentiment using Amazon Comprehend.\n        comprehend_response = self.comprehend.detect_sentiment(Text = test_text.decode('utf-8'),\n                                                               LanguageCode=self.lang)\n                \n        # Add sentiment scores to self.attribute_list.\n        attribute_format = [{'Key' : 'Sentiment',\n                             'Value' : {'StringValue' : comprehend_response['Sentiment']}},\n                            {'Key' : 'Positive_Score',\n                             'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Positive']*100)}},\n                            {'Key' : 'Negative_Score',\n                             'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Negative']*100)}},\n                            {'Key' : 'Neutral_Score',\n                             'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Neutral']*100)}},\n                            {'Key' : 'Mixed_Score',\n                             'Value' : {'LongValue' : int(comprehend_response['SentimentScore']['Mixed']*100)}}]\n        \n        self.attribute_list += attribute_format\n\n        # Add same information to sentiment_dict.\n        sentiment_dict = {}\n        sentiment_dict['Sentiment'] = 
comprehend_response['Sentiment']\n        sentiment_dict['Positive_Score'] = int(comprehend_response['SentimentScore']['Positive']*100)\n        sentiment_dict['Negative_Score'] = int(comprehend_response['SentimentScore']['Negative']*100)\n        sentiment_dict['Neutral_Score'] = int(comprehend_response['SentimentScore']['Neutral']*100)\n        sentiment_dict['Mixed_Score'] = int(comprehend_response['SentimentScore']['Mixed']*100)\n\n        return sentiment_dict<\/code><\/pre>\n<\/p><\/div>\n<p>We now have everything we need to create an Amazon Kendra index, create and add metadata to the index, and start boosting and filtering our Amazon Kendra searches!<\/p>\n<h2>Configure our Amazon Kendra index<\/h2>\n<p>Now that we\u2019ve got our Amazon Textract outputs and our Amazon Comprehend class in <code>ComprehendAnalyzer<\/code>, we can put everything together with Amazon Kendra.<\/p>\n<h3>Configure Amazon Kendra IAM access<\/h3>\n<p>Like in the previous steps, we need to give SageMaker access to use Amazon Kendra by attaching the <code><strong>AmazonKendraFullAccess <\/strong><\/code>policy to the<code><strong> AmazonSageMaker-ExecutionRole-Kendra-Blog <\/strong><\/code>role. Then we create an IAM policy and service role.<\/p>\n<ol>\n<li>Attach the <code>AmazonKendraFullAccess<\/code> policy to the <code>AmazonSageMaker-ExecutionRole-Kendra-Blog<\/code> role.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image013.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29603\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image013.jpg\" alt=\"\" width=\"2316\" height=\"1154\"><\/a><\/li>\n<\/ol>\n<p>To create an index with Amazon Kendra, we first create an IAM policy that lets Amazon Kendra access our CloudWatch Logs, and then create an Amazon Kendra service role. 
For full instructions, see the <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/gs-prerequisites.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra Developer Guide<\/a>. I outline the exact steps in this section for your convenience.<\/p>\n<ol start=\"2\">\n<li>On the IAM console, choose <strong>Policies<\/strong> in the navigation pane.<\/li>\n<li>Choose <strong>Create policy<\/strong>.<\/li>\n<li>Choose <strong>JSON<\/strong> and replace the default policy with the following:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"cloudwatch:PutMetricData\"\n            ],\n            \"Resource\": \"*\",\n            \"Condition\": {\n                \"StringEquals\": {\n                    \"cloudwatch:namespace\": \"AWS\/Kendra\"\n                }\n            }\n        },\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"logs:DescribeLogGroups\"\n            ],\n            \"Resource\": \"*\"\n        },\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"logs:CreateLogGroup\"\n            ],\n            \"Resource\": [\n                \"arn:aws:logs:region:account ID:log-group:\/aws\/kendra\/*\"\n            ]\n        },\n        {\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"logs:DescribeLogStreams\",\n                \"logs:CreateLogStream\",\n                \"logs:PutLogEvents\"\n            ],\n            \"Resource\": [\n                \"arn:aws:logs:region:account ID:log-group:\/aws\/kendra\/*:log-stream:*\"\n            ]\n        }\n    ]\n}<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"5\">\n<li>Choose <strong>Review policy<\/strong>.<\/li>\n<li>Name the policy <code>KendraPolicyForGettingStartedIndex<\/code> and choose 
<strong>Create policy<\/strong>.<\/li>\n<li>In the navigation pane, choose <strong>Roles<\/strong>.<\/li>\n<li>Choose <strong>Create role<\/strong>.<\/li>\n<li>Choose <strong>Another AWS account<\/strong> and enter your account ID.<\/li>\n<li>Choose <strong>Next: Permissions<\/strong>.<\/li>\n<li>Choose the policy that you just created and choose <strong>Next: Tags<\/strong>.<\/li>\n<li>Don\u2019t add any tags and choose <strong>Next: Review<\/strong>.<\/li>\n<li>Name the role <code>KendraRoleForGettingStartedIndex<\/code> and choose <strong>Create role<\/strong>.<\/li>\n<li>Find the role that you just created and open the role summary.<\/li>\n<li>Choose <strong>Trust relationships<\/strong> and then choose <strong>Edit trust relationship<\/strong>.<\/li>\n<li>Replace the existing trust relationship with the following:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Effect\": \"Allow\",\n      \"Principal\": {\n        \"Service\": \"kendra.amazonaws.com\"\n      },\n      \"Action\": \"sts:AssumeRole\"\n    }\n  ]\n}<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"17\">\n<li>Choose <strong>Update trust policy<\/strong>.<\/li>\n<\/ol>\n<h3>Create your Amazon Kendra index<\/h3>\n<p>Now that we\u2019ve got all the policies and roles that we need, let\u2019s create our Amazon Kendra index using the following code. 
You have to update the role ARN with your AWS account number.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import time\n\n# Instantiate Amazon Kendra client.\nkendra = boto3.client('kendra')\n\n# Set index name and input service role.\nindex_name = \"blog-media-company-index\"\nindex_role_arn = \"arn:aws:iam::{your_account_no}:role\/KendraRoleForGettingStartedIndex\"\n\n# Create Amazon Kendra index.\nindex_response = kendra.create_index(\n    Name = index_name,\n    RoleArn = index_role_arn\n)\n\n# Get index ID for reference.\nindex_id = index_response[\"Id\"]\n\n# Poll the index status until it leaves the CREATING state.\nwhile True:\n    index_description = kendra.describe_index(Id = index_id)\n    status = index_description[\"Status\"]\n    print(\"Creating index. Status: \" + status)\n    if status != \"CREATING\":\n        break\n    time.sleep(60)<\/code><\/pre>\n<\/p><\/div>\n<p>When this code block is done running, you should see the status <code><strong>ACTIVE<\/strong><\/code>, which means your Amazon Kendra index has been created.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image015-1.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-29608 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image015-1.jpg\" alt=\"\" width=\"500\" height=\"431\"><\/a><\/p>\n<h3>Define the Amazon Kendra index metadata configuration<\/h3>\n<p>We now define the metadata configuration for the index <code><strong>blog-media-company-index <\/strong><\/code>we just made. It follows the Amazon Comprehend attributes we defined in our Python class <code><strong>ComprehendAnalyzer<\/strong><\/code>. 
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Since comprehend_analyzer has the ability to create 8 new attributes with Amazon \n# Textract and Amazon Comprehend, we'll update the metadata configuration of Amazon \n# Kendra to reflect those attributes.\n\nmeta_config_dict = {'Id':index_id,\n                    'DocumentMetadataConfigurationUpdates':[\n                         {'Name': 'Languages',\n                          'Type': 'STRING_VALUE',\n                          'Search': {\n                              'Facetable': True,\n                              'Searchable' : True,\n                              'Displayable': True},\n                          'Relevance': {\n                              'Importance': 1},\n                         },\n                         {'Name': 'Key_Phrases',\n                          'Type': 'STRING_LIST_VALUE',\n                          'Search': {\n                              'Facetable': True,\n                              'Searchable' : True,\n                              'Displayable': True},\n                         },\n                         {'Name': 'Named_Entities',\n                          'Type': 'STRING_LIST_VALUE',\n                          'Search': {\n                              'Facetable': True,\n                              'Searchable' : True,\n                              'Displayable': True},\n                         },\n                         {'Name': 'Sentiment',\n                          'Type': 'STRING_VALUE',\n                          'Search': {\n                              'Facetable': True,\n                              'Searchable' : True,\n                              'Displayable': True},\n                          'Relevance': {\n                              'Importance': 1},\n                         },\n                         {'Name': 'Positive_Score',\n                          'Type': 'LONG_VALUE',\n       
                   'Search': {\n                              'Facetable': True,\n                              'Searchable' : False,\n                              'Displayable': True},\n                          'Relevance': {\n                              'Importance': 1,\n                              'RankOrder': 'DESCENDING'},\n                         },\n                         {'Name': 'Negative_Score',\n                          'Type': 'LONG_VALUE',\n                          'Search': {\n                              'Facetable': True,\n                              'Searchable' : False,\n                              'Displayable': True},\n                          'Relevance': {\n                              'Importance': 1,\n                              'RankOrder': 'DESCENDING'},\n                         },\n                         {'Name': 'Neutral_Score',\n                          'Type': 'LONG_VALUE',\n                          'Search': {\n                              'Facetable': True,\n                              'Searchable' : False,\n                              'Displayable': True},\n                          'Relevance': {\n                              'Importance': 1,\n                              'RankOrder': 'DESCENDING'},\n                         },\n                         {'Name': 'Mixed_Score',\n                          'Type': 'LONG_VALUE',\n                          'Search': {\n                              'Facetable': True,\n                              'Searchable' : False,\n                              'Displayable': True},\n                          'Relevance': {\n                              'Importance': 1,\n                              'RankOrder': 'DESCENDING'},\n                         },\n                     ]\n                    }\n\nresponse = kendra.update_index(**meta_config_dict)\nprint(response)<\/code><\/pre>\n<\/p><\/div>\n<h3>Create metadata using ComprehendAnalyzer<\/h3>\n<p>Now that 
we\u2019ve created our index <code><strong>blog-media-company-index <\/strong><\/code>and defined and set our metadata configuration, we use <code><strong>ComprehendAnalyzer<\/strong><\/code> to extract metadata from our media files in Amazon S3:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Define input parameters.\ntextract_output_bucket = 'kendra-augmentation-textract-output-jp'\ntextract_documents = s3_get_filenames(textract_output_bucket)\nanalyzer = ComprehendAnalyzer(s3_bucket=textract_output_bucket)\n\n# Instantiate Amazon S3 client.\ns3 = boto3.client('s3')\n\n# Instantiate document list to be ingested into Amazon Kendra index.\ndocuments = []\n\n# Loop through each document in Amazon S3:\n#    1. Create document by using the Amazon Textract output in addition to\n#       the metadata defined with comprehend_analyzer.\n#    2. Append document to document list to be ingested into Amazon Kendra.\n\nfor d in textract_documents:\n    \n    # Set document.\n    analyzer.set_document(d)\n    \n    # Get metadata.\n    analyzer.get_dominant_languages()\n    analyzer.get_key_phrases()\n    analyzer.get_named_entities()\n    analyzer.get_sentiment()\n    \n    # Remove either \"_LINE.txt\" or \"_WORD.txt\" from the document filename.\n    document_id = d[0:-9]\n    \n    # Grab text from the Amazon Textract output.\n    text = s3.get_object(Bucket=textract_output_bucket,\n                         Key=d)['Body'].read()\n    \n    # Define document with Amazon Textract text and Amazon Comprehend attributes.\n    document = {\n        'Id': document_id,\n        'Title': document_id,\n        'Blob': text,\n        'Attributes':analyzer.attribute_list,\n        'ContentType':'PLAIN_TEXT'\n    }\n    \n    documents.append(document)<\/code><\/pre>\n<\/p><\/div>\n<p>If you want to see what the metadata looks like, look at the first item in the <strong>documents <\/strong>Python list by running the following code:<\/p>\n<div 
class=\"hide-language\">\n<pre><code class=\"lang-python\"># Take a look at the first of the metadata documents you've prepared.\ndocuments[0]['Attributes']<\/code><\/pre>\n<\/p><\/div>\n<h3>Load metadata into the Amazon Kendra index<\/h3>\n<p>The last step is to load the metadata we extracted using <code>ComprehendAnalyzer<\/code> into the <code><strong>blog-media-company-index <\/strong><\/code>index by running the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">kendra.batch_put_document(\n    IndexId = index_id,\n    Documents = documents)<\/code><\/pre>\n<\/p><\/div>\n<p>Now we\u2019re ready to start querying and boosting some of the metadata attributes!<\/p>\n<h2>Query the index and boost metadata attributes<\/h2>\n<p>We now have everything set up to start querying our data. We\u2019re able to weigh attributes differently in terms of significance, make metadata attributes searchable, influence the order of results coming back from the query by boosting the sentiment metadata, and much more.<\/p>\n<h3>Run a sample query<\/h3>\n<p>Before we get into a few examples that demonstrate the power and flexibility this metadata attachment gives us, let\u2019s run the following code to query the <code><strong>blog-media-company-index <\/strong><\/code>index:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Print function to more easily visualize Amazon Kendra query results.\n# Note: uses the global variable `query` set before it's called.\ndef print_results(response, result_number):\n    print('\\nSearch results for query: ' + query + '\\n')\n\n    count = 0\n    for query_result in response['ResultItems']:\n\n        print('-------------------')\n        print('Type: ' + str(query_result['Type']))\n\n        if query_result['Type']=='ANSWER':\n            answer_text = query_result['DocumentExcerpt']['Text']\n            print(answer_text)\n\n        if query_result['Type']=='DOCUMENT':\n            if 'DocumentTitle' in query_result:\n                document_title = query_result['DocumentTitle']['Text']\n                print('Title: ' + document_title)\n            document_text = query_result['DocumentExcerpt']['Text']\n            print(document_text)\n\n        count += 1\n\n        print('------------------\\n\\n')\n\n        if count &gt;= result_number:\n            break\n\ndef print_list(dict_list):\n    for attribute in dict_list:\n        text = attribute['Key'] + \": \"\n        for k,v in attribute['Value'].items():\n            text += str(v) + \"\\n\"\n        print(text)<\/code><\/pre>\n<\/p><\/div>\n<p>We can test the following query to get a sense of how to query our new index:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Search query.\nquery = 'who was star wars produced by'\n\nresponse = kendra.query(\n            QueryText = query,\n            IndexId = index_id)\n\nprint_results(response, 3)<\/code><\/pre>\n<\/p><\/div>\n<p>You should get a response like the following screenshot.<br \/><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image017.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29605\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/21\/ML-1170-image017.jpg\" alt=\"\" width=\"1622\" height=\"706\"><\/a><\/p>\n<p>Now that you know how to query, let\u2019s get into some examples of how we can use our metadata to influence our searches.<\/p>\n<h3>Improve metadata<\/h3>\n<p>This section contains some examples of how we can influence and control our search for more targeted results. 
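The examples that follow boost results through the index configuration, but the same metadata can also narrow a query directly through the <code>AttributeFilter</code> parameter of <code>kendra.query</code>. Here is a minimal sketch; the attribute names match the ones we defined in <code>meta_config_dict</code>, while the helper name and score threshold are illustrative:

```python
def positive_sentiment_filter(min_positive_score=50):
    # Build an Amazon Kendra AttributeFilter that keeps only documents
    # whose Comprehend-derived sentiment is POSITIVE and whose positive
    # score is at or above the given threshold.
    return {
        "AndAllFilters": [
            {"EqualsTo": {
                "Key": "Sentiment",
                "Value": {"StringValue": "POSITIVE"}}},
            {"GreaterThanOrEquals": {
                "Key": "Positive_Score",
                "Value": {"LongValue": min_positive_score}}},
        ]
    }

# Applying it to a query (uses the kendra client and index_id from the
# earlier steps, so it isn't run here):
# response = kendra.query(
#     QueryText='who was star wars produced by',
#     IndexId=index_id,
#     AttributeFilter=positive_sentiment_filter())
```

Filtering like this happens per query, so it needs no index update, unlike the boosting examples below.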
For each of the examples, we update our <code><strong>blog-media-company-index <\/strong><\/code>index by modifying our <code><strong>meta_config_dict <\/strong><\/code>and rerunning the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">kendra.update_index(**meta_config_dict)<\/code><\/pre>\n<\/p><\/div>\n<h4>Example 1: Weighing attributes<\/h4>\n<p>To weigh attributes by significance, update the <code><strong>Importance <\/strong><\/code>value of the attributes. Importance ranges from 1\u201310, with 1 being the lowest and 10 the highest.<\/p>\n<p>For example, let\u2019s say our documents reference entities from different countries and are written in many different languages. We can increase the significance of the <code><strong>Languages <\/strong><\/code>metadata attribute to account for this by updating its <code><strong>Importance <\/strong><\/code>to 10 and making sure <code><strong>Searchable <\/strong><\/code>is set to <strong><code>True<\/code><\/strong>, so that the text in the <code><strong>Languages<\/strong><\/code> field is searchable. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">{'Name': 'Languages',\n 'Type': 'STRING_VALUE',\n 'Search': {\n     'Facetable': True,\n     'Searchable' : True,\n     'Displayable': True},\n 'Relevance': {\n     'Importance': 10},\n}<\/code><\/pre>\n<\/p><\/div>\n<p>Now let\u2019s say that we\u2019re looking for results with more positive context. 
We increase the <code>Importance<\/code> value of the metadata attribute <code><strong>Sentiment<\/strong><\/code> to 10:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">{'Name': 'Sentiment',\n 'Type': 'STRING_VALUE',\n 'Search': {\n     'Facetable': True,\n     'Searchable' : True,\n     'Displayable': True},\n 'Relevance': {\n     'Importance': 10},\n}<\/code><\/pre>\n<\/p><\/div>\n<h4>Example 2: Ranking search results<\/h4>\n<p>Let\u2019s say we want to influence the rank of the search results by a particular sentiment metadata attribute. We can simply configure the <code><strong>Importance <\/strong><\/code>and <code><strong>RankOrder <\/strong><\/code>of the sentiment we want. For example, if we want to increase the significance of the positive results and rank those results higher than the negative, we update the <code><strong>Positive_Score <\/strong><\/code>attribute to have an <code><strong>Importance<\/strong><\/code> of 10 and a <code><strong>RankOrder <\/strong><\/code>of <code><strong>ASCENDING<\/strong><\/code>, which tells Amazon Kendra that higher values are better and puts the most positive results at the top. We leave the <code>Importance<\/code> of <code><strong>Negative_Score <\/strong><\/code>at 1 and update its <code><strong>RankOrder<\/strong><\/code> to <code><strong>DESCENDING<\/strong><\/code>, which treats lower values as better and makes sure the least negative sentiment results show up higher. 
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">{'Name': 'Positive_Score',\n 'Type': 'LONG_VALUE',\n 'Search': {\n     'Facetable': True,\n     'Searchable' : False,\n     'Displayable': True},\n 'Relevance': {\n     'Importance': 10,\n     'RankOrder': 'ASCENDING'},\n },\n {'Name': 'Negative_Score',\n 'Type': 'LONG_VALUE',\n 'Search': {\n     'Facetable': True,\n     'Searchable' : False,\n     'Displayable': True},\n 'Relevance': {\n    'Importance': 1,\n    'RankOrder': 'DESCENDING'},\n}<\/code><\/pre>\n<\/p><\/div>\n<h3>Get creative!<\/h3>\n<p>At this point, you\u2019ve got your Amazon Kendra index and metadata attributes set up. Go ahead and play around with querying, weighing metadata, and ranking results by creating your own combinations!<\/p>\n<h2>Clean up<\/h2>\n<p>To avoid extra charges, shut down the SageMaker and Amazon Kendra resources when you\u2019re done.<\/p>\n<ol>\n<li>On the SageMaker console, choose <strong>Notebook<\/strong> and <strong>Notebook instances<\/strong>.<\/li>\n<li>Select the notebook that you created.<\/li>\n<li>On the <strong>Actions<\/strong> menu, choose <strong>Stop<\/strong>.<\/li>\n<li>Choose <strong>Delete<\/strong>.<\/li>\n<\/ol>\n<p>Alternatively, you can keep the instance stopped; a stopped instance doesn\u2019t incur compute charges, although you still pay for its storage.<\/p>\n<ol start=\"5\">\n<li>On the Amazon Kendra console, choose <strong>Indexes<\/strong>.<\/li>\n<li>Select the index you created.<\/li>\n<li>On the <strong>Actions<\/strong> menu, choose <strong>Delete<\/strong>.<\/li>\n<\/ol>\n<p>Because we used Amazon Textract and Amazon Comprehend via API, there are no shutdown steps necessary for those resources.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, we showed how to do the following:<\/p>\n<ul>\n<li>Use Amazon Textract on PDF files to extract text from documents<\/li>\n<li>Use Amazon Comprehend to extract metadata attributes from Amazon Textract output<\/li>\n<li>Perform targeted searches with Amazon Kendra using 
the metadata attributes extracted by Amazon Comprehend<\/li>\n<\/ul>\n<p>Although this was a mock media company example using public sample data, I hope you had some fun following along and saw the potential and power of chaining Amazon Textract, Amazon Comprehend, and Amazon Kendra together. Use this new knowledge and start augmenting your historical data! To learn more about how Amazon Kendra\u2019s fully managed intelligent search service can help your business, <a href=\"https:\/\/aws.amazon.com\/kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">visit our webpage<\/a> or dive into our <a href=\"https:\/\/aws.amazon.com\/kendra\/resources\/\" target=\"_blank\" rel=\"noopener noreferrer\">documentation and tutorials<\/a>!<\/p>\n<hr>\n<h3>About the Author<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/25\/wiki_james_poquiz.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-29810 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/25\/wiki_james_poquiz.png\" alt=\"\" width=\"100\" height=\"120\"><\/a>James Poquiz<\/strong> is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain, where he has held many different roles. 
Today he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/augment-search-with-metadata-by-chaining-amazon-textract-amazon-comprehend-and-amazon-kendra\/<\/p>\n","protected":false},"author":0,"featured_media":1172,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1171"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1171"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1171\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1172"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1171"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1171"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}