{"id":506,"date":"2020-11-07T16:21:09","date_gmt":"2020-11-07T16:21:09","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/11\/07\/adding-custom-data-sources-to-amazon-kendra\/"},"modified":"2020-11-07T16:21:09","modified_gmt":"2020-11-07T16:21:09","slug":"adding-custom-data-sources-to-amazon-kendra","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/11\/07\/adding-custom-data-sources-to-amazon-kendra\/","title":{"rendered":"Adding custom data sources to Amazon Kendra"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/kendra\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kendra<\/a> is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). Amazon Kendra provides native <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/hiw-data-source.html\" target=\"_blank\" rel=\"noopener noreferrer\">connectors<\/a> for popular data sources like <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3), SharePoint, ServiceNow, OneDrive, Salesforce, and Confluence so you can easily add data from different content repositories and file systems into a centralized location. This enables you to use Kendra\u2019s natural language search capabilities to quickly find the most relevant answers to your questions.<\/p>\n<p>However, many organizations store relevant information in the form of unstructured data on company intranets or within file systems on corporate networks that are inaccessible to Amazon Kendra.<\/p>\n<p>You can now use the custom data source feature in Amazon Kendra to upload content to your Amazon Kendra index from a wider range of data sources. 
When you select a connector type, the custom data source feature gives you complete control over how documents are selected and indexed, and provides visibility and metrics on which content associated with a data source has been added, modified, or deleted.<\/p>\n<p>In this post, we describe how to use a simple web connector to scrape content from unauthenticated webpages, capture attributes, and ingest this content into an Amazon Kendra index using the custom data source feature. This enables you to ingest your <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/index-document-types.html\" target=\"_blank\" rel=\"noopener noreferrer\">content<\/a> directly into the index using the <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/API_BatchPutDocument.html\">BatchPutDocument<\/a> API, and allows you to track the ingestion through <a href=\"http:\/\/aws.amazon.com\/cloudwatch\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudWatch<\/a> <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/cloudwatch-logs.html#data-source-log-stream\" target=\"_blank\" rel=\"noopener noreferrer\">log streams<\/a> and through the metrics from the data sync operation.<\/p>\n<h2>Setting up a web connector<\/h2>\n<p>To use the custom data source connector in Amazon Kendra, you need to create an application that scrapes the documents in your repository and builds a list of documents. You ingest those documents into your Amazon Kendra index by using the <code>BatchPutDocument<\/code> operation. To delete documents, you have to provide a list of the document IDs and use the <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/API_BatchDeleteDocument.html\" target=\"_blank\" rel=\"noopener noreferrer\">BatchDeleteDocument<\/a> operation. 
To modify a document (for example, because it was updated), provide the same document ID; the document with the matching ID is replaced in your index.<\/p>\n<p>For this post, we scrape HTML content from AWS FAQs for 11 AI\/ML services.<\/p>\n<p>We use the <code>BeautifulSoup<\/code> and <code>requests<\/code> libraries to scrape the content from the AWS FAQ website. The script first gets the content of an AWS FAQ page through the <code>get_soup_from_url<\/code> function. Based on the presence of certain CSS classes, it locates question-and-answer pairs, and for each URL it creates a text file to be ingested into Amazon Kendra later.<\/p>\n<p>The solution in this post is for demonstration purposes only. We recommend running similar scripts only on your own websites after consulting with the team that manages them, or be sure to follow the terms of service for the website that you\u2019re trying to scrape.<\/p>\n<p>The following screenshot shows a sample of the script.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17686 size-full\" title=\"Script sample\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-1.jpg\" alt=\"\" width=\"823\" height=\"895\"><\/p>\n<p>The following screenshot shows the results of a sample run.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17687\" title=\"Results of a sample run\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-2.jpg\" alt=\"\" width=\"925\" height=\"249\"><\/p>\n<p>The <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/Adding-custom-data-sources-to-Amazon-Kendra\/ScrapedFAQS.zip\" target=\"_blank\" rel=\"noopener noreferrer\">ScrapedFAQS.zip<\/a> file contains the scraped documents.<\/p>\n<h2>Creating a custom data source<\/h2>\n<p>To ingest documents 
through the custom data source, you need to first create a<a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/hiw-data-source.html\" target=\"_blank\" rel=\"noopener noreferrer\"> data source<\/a>. The assumption is you already have an Amazon Kendra index in your account. If you don\u2019t, you can <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/create-index.html\" target=\"_blank\" rel=\"noopener noreferrer\">create a new index<\/a>.<\/p>\n<p>Amazon Kendra has two provisioning editions: the Amazon Kendra Developer Edition, recommended for building proof of concepts (POCs), and the Amazon Kendra Enterprise Edition, which provides multi-AZ deployment, making it ideal for production. Amazon Kendra connectors work with both editions.<\/p>\n<p>To create your custom data source, complete the following steps:<\/p>\n<ol>\n<li>On your index, choose <strong>Add data sources<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17688 size-full\" title=\"Add data sources\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-3.jpg\" alt=\"\" width=\"800\" height=\"209\"><\/p>\n<ol start=\"2\">\n<li>For <strong>Custom data source connector<\/strong>, choose <strong>Add connector<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17689 size-full\" title=\"Add connector\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-4.jpg\" alt=\"\" width=\"509\" height=\"274\"><\/p>\n<ol start=\"3\">\n<li>For <strong>Data source name<\/strong>, enter a name (for example, <code>MyCustomConnector<\/code>).<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17690 size-full\" title=\"Enter a name\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-5.jpg\" alt=\"\" width=\"830\" height=\"599\"><\/p>\n<ol start=\"4\">\n<li>Review the information in the <strong>Next steps\u00a0<\/strong>section.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17691 size-full\" title=\"Review the information\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-6.jpg\" alt=\"\" width=\"845\" height=\"131\"><\/p>\n<ol start=\"5\">\n<li>Choose <strong>Add data source<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17692 size-full\" title=\"Add data source\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-7.jpg\" alt=\"\" width=\"847\" height=\"517\"><\/p>\n<h2>Syncing documents using the custom data source<\/h2>\n<p>Now that your connector is set up, you can ingest documents in Amazon Kendra using the <code>BatchPutDocument<\/code> API, and get some metrics to track the status of ingestion. For that you need an <a href=\"https:\/\/docs.aws.amazon.com\/kendra\/latest\/dg\/API_StartDataSourceSyncJob.html#API_StartDataSourceSyncJob_ResponseSyntax\" target=\"_blank\" rel=\"noopener noreferrer\">ExecutionID<\/a>, so before running your <code>BatchPutDocument<\/code> operation, you need to <a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/kendra\/start-data-source-sync-job.html\" target=\"_blank\" rel=\"noopener noreferrer\">start a <\/a><a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/kendra\/start-data-source-sync-job.html\">data source sync job<\/a>. 
When the data sync is complete, you <a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/kendra\/stop-data-source-sync-job.html\" target=\"_blank\" rel=\"noopener noreferrer\">stop the data source sync job<\/a>.<\/p>\n<p>For this post, you use the latest version of the <a href=\"https:\/\/aws.amazon.com\/sdk-for-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS SDK for Python (Boto3)<\/a> and ingest 10 documents with the IDs 0\u20139.<\/p>\n<p>Extract the .zip file containing the scraped content by using any standard file decompression utility. You should have 11 files on your local file system. In a real use case, these files are likely on a shared file server in your data center. When you create a custom data source, you have complete control over how the documents for the index are selected. Amazon Kendra only provides metric information that you can use to monitor the performance of your data source.<\/p>\n<p>For demonstration, let\u2019s assume you have extracted the JSON files into a directory called <code>kendra-ingestion<\/code>.<\/p>\n<p>Replace the <code>&lt;YOUR-INDEX-ID&gt;<\/code> and <code>&lt;YOUR-DATASOURCE-ID&gt;<\/code> variables with your index-specific details and save the following sample code as <code>kendra-ingestion.py<\/code> at the same level as the <code>kendra-ingestion<\/code> directory.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-18089\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/05\/Revised-Image-1.jpg\" alt=\"\" width=\"302\" height=\"201\"><\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import boto3\r\nimport pandas as pd\r\nimport glob\r\nimport os\r\n\r\ndef get_docs(dataSourceId, jobExecutionId):\r\n    documents = []\r\n    try:\r\n        json_pattern = os.path.join('kendra-ingestion','*.json')\r\n        file_list = glob.glob(json_pattern)\r\n        df = pd.DataFrame()\r\n        
for file in file_list:\r\n            data = pd.read_json(file)\r\n            #DataFrame.append was removed in pandas 2.0; use pd.concat\r\n            df = pd.concat([df, data], ignore_index=True)\r\n        #Randomize the indexes\r\n        df = df.sample(frac=1).reset_index(drop=True)\r\n        #Slice df to obtain 10 documents\r\n        df = df.head(10)\r\n    except Exception:\r\n        print(\"Documents file not found\")\r\n        #Return the empty list so we don't iterate over an undefined df\r\n        return documents\r\n    for index_label, row_series in df.iterrows():\r\n        Text = df.at[index_label , 'Text']\r\n        Title = df.at[index_label , 'Title']\r\n        Url =  df.at[index_label , 'Url']\r\n        CrawledDate = df.at[index_label , 'CrawledDate']\r\n        docID =  df.at[index_label , 'docID']\r\n        doc = {\r\n            \"Id\": docID,\r\n            \"Blob\": Text,\r\n            \"Title\": Title,\r\n            \"Attributes\": [\r\n                {\r\n                \"Key\": \"_data_source_id\",\r\n                \"Value\": {\r\n                    \"StringValue\": dataSourceId\r\n                    }\r\n                },\r\n                {\r\n                \"Key\": \"_data_source_sync_job_execution_id\",\r\n                \"Value\": {\r\n                    \"StringValue\": jobExecutionId\r\n                    }\r\n                },\r\n                {\r\n                \"Key\": \"_source_uri\",\r\n                \"Value\": {\r\n                    \"StringValue\": Url\r\n                    }    \r\n                },\r\n                {\r\n                \"Key\": \"_created_at\",\r\n                \"Value\": {\r\n                    \"DateValue\": CrawledDate\r\n                    }    \r\n                }\r\n            ]\r\n        }\r\n        documents.append(doc)\r\n    return documents\r\n    \r\n#Index ID\r\nindex_id = '&lt;YOUR-INDEX-ID&gt;'\r\n#Datasource ID\r\ndata_source_id = '&lt;YOUR-DATASOURCE-ID&gt;'\r\n\r\nkendra = boto3.client('kendra')\r\n\r\n#Start a data source sync job\r\nresult = kendra.start_data_source_sync_job(\r\n    Id = data_source_id,\r\n    IndexId = index_id\r\n 
   )\r\n\r\nprint(\"Start data source sync operation: \")\r\nprint(result)\r\n\r\n#Obtain the job execution ID from the result\r\njob_execution_id = result['ExecutionId']\r\nprint(\"Job execution ID: \"+job_execution_id)\r\n\r\n#Start ingesting documents\r\ntry:\r\n    #Part of the workflow will require you to have a list with your documents ready\r\n    #for ingestion\r\n    docs = get_docs(data_source_id, job_execution_id)\r\n    #Batch put the documents\r\n    result = kendra.batch_put_document(\r\n        IndexId = index_id,\r\n        Documents = docs\r\n        )\r\n    print(\"Response from batch_put_document:\")\r\n    print(result)\r\n\r\n#Stop data source sync job\r\nfinally:\r\n    #Stop data source sync\r\n    result = kendra.stop_data_source_sync_job(\r\n        Id = data_source_id,\r\n        IndexId = index_id\r\n        )\r\n    print(\"Stop data source sync operation:\")\r\n    print(result)<\/code><\/pre>\n<\/div>\n<div class=\"hide-language\">\n<pre><span>When you run the Python script, if the sync job is successful, you should see output similar to the following:<\/span><\/pre>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">Start data source sync operation:\r\n{\r\n    'ExecutionId': 'a5ac1ba0-b480-46e3-a718-5fffa5006f1a',\r\n    'ResponseMetadata': {\r\n        'RequestId': 'a24a2600-0570-4520-8956-d58c8b1ef01c',\r\n        'HTTPStatusCode': 200,\r\n        'HTTPHeaders': {\r\n            'x-amzn-requestid': 'a24a2600-0570-4520-8956-d58c8b1ef01c',\r\n            'content-type': 'application\/x-amz-json-1.1',\r\n            'content-length': '54',\r\n            'date': 'Mon, 12 Oct 2020 19:55:11 GMT'\r\n        },\r\n        'RetryAttempts': 0\r\n    }\r\n}\r\n\r\nJob execution ID: a5ac1ba0-b480-46e3-a718-5fffa5006f1a\r\n\r\nResponse from batch_put_document:\r\n{\r\n    'FailedDocuments': [],\r\n    'ResponseMetadata': {\r\n        'RequestId': 'fcda5fed-c55c-490b-9867-b45a3eb6a780',\r\n        'HTTPStatusCode': 200,\r\n        
'HTTPHeaders': {\r\n            'x-amzn-requestid': 'fcda5fed-c55c-490b-9867-b45a3eb6a780',\r\n            'content-type': 'application\/x-amz-json-1.1',\r\n            'content-length': '22',\r\n            'date': 'Mon, 12 Oct 2020 19:55:12 GMT'\r\n        },\r\n        'RetryAttempts': 0\r\n    }\r\n}\r\n\r\nStop data source sync operation:\r\n{\r\n    'ResponseMetadata': {\r\n        'RequestId': '249a382a-7170-49d1-855d-879b5a6f2954',\r\n        'HTTPStatusCode': 200,\r\n        'HTTPHeaders': {\r\n            'x-amzn-requestid': '249a382a-7170-49d1-855d-879b5a6f2954',\r\n            'content-type': 'application\/x-amz-json-1.1',\r\n            'content-length': '0',\r\n            'date': 'Mon, 12 Oct 2020 19:55:12 GMT'\r\n        },\r\n        'RetryAttempts': 0\r\n    }\r\n}<\/code><\/pre>\n<\/div>\n<pre><span>Allow for some time for the sync job to finish, because document ingestion could continue as an asynchronous process after the data source sync process has stopped. The status on the Amazon Kendra console should change from Syncing-indexing to Succeeded when all the documents have been ingested successfully. You can now confirm the count of the documents that were ingested successfully and the metrics of the operation on the Amazon Kendra console.\r\n<\/span><\/pre>\n<\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17693\" title=\"Confirm the count of the documents that were ingested successfully\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-8.jpg\" alt=\"\" width=\"925\" height=\"57\"><\/p>\n<h2>Deleting documents from a custom data source<\/h2>\n<p>In this section, you explore how to remove documents from your index. You can use the same <code>DataSourceSync<\/code> job that you used for ingesting the documents. 
This process could be useful if you have a changelog of the documents you\u2019re syncing with your Amazon Kendra index, and during your sync job you want to delete documents from your index and also ingest new documents. You can do this by starting the sync job, performing the <code>BatchDeleteDocument<\/code> operation, performing the <code>BatchPutDocument<\/code> operation, and stopping the sync job.<\/p>\n<p>For this post, we use a separate data source sync job to remove the documents with IDs 6, 7, and 8. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import boto3\r\n\r\n#Index ID\r\nindex_id = '&lt;YOUR-INDEX-ID&gt;'\r\n#Datasource ID\r\ndata_source_id = '&lt;YOUR-DATASOURCE-ID&gt;'\r\n\r\nkendra = boto3.client('kendra')\r\n\r\n#Start data source sync job\r\nresult = kendra.start_data_source_sync_job(\r\n    Id = data_source_id,\r\n    IndexId = index_id\r\n    )\r\nprint(\"Start data source sync operation: \")\r\nprint(result)\r\n\r\njob_execution_id = result['ExecutionId']\r\nprint(\"Job execution ID: \"+job_execution_id)\r\ntry:\r\n    #Add the document IDs you would like to delete\r\n    delete_docs = [\"6\", \"7\", \"8\"]\r\n    #Start the batch delete operation\r\n    result = kendra.batch_delete_document(\r\n        IndexId = index_id,\r\n        DocumentIdList = delete_docs,\r\n        DataSourceSyncJobMetricTarget = {\r\n            \"DataSourceSyncJobId\": job_execution_id,\r\n            \"DataSourceId\": data_source_id\r\n            }\r\n            )\r\n    print(\"Response from batch_delete_document:\")\r\n    print(result)\r\n\r\nfinally:\r\n    #Stop the data source sync job\r\n    result = kendra.stop_data_source_sync_job(\r\n        Id = data_source_id,\r\n        IndexId = index_id\r\n    )\r\n    print(\"Stop data source sync operation:\")\r\n    print(result)<\/code><\/pre>\n<\/div>\n<div class=\"hide-language\">\n<pre><span>When the process is complete, you see a message similar to the 
following:<\/span><\/pre>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">Start data source sync operation:\r\n\r\n{\r\n    'ExecutionId': '6979977e-0d91-45e9-b69e-19b179cc3bdf',\r\n    'ResponseMetadata': {\r\n        'RequestId': '677c5ab8-b5e0-4b55-8520-6aa838b8696e',\r\n        'HTTPStatusCode': 200,\r\n        'HTTPHeaders': {\r\n            'x-amzn-requestid': '677c5ab8-b5e0-4b55-8520-6aa838b8696e',\r\n            'content-type': 'application\/x-amz-json-1.1',\r\n            'content-length': '54',\r\n            'date': 'Mon, 12 Oct 2020 20:25:42 GMT'\r\n        },\r\n        'RetryAttempts': 0\r\n    }\r\n}\r\n\r\nJob execution ID: 6979977e-0d91-45e9-b69e-19b179cc3bdf\r\n\r\nResponse from batch_delete_document:\r\n\r\n{\r\n    'FailedDocuments': [],\r\n    'ResponseMetadata': {\r\n        'RequestId': 'e647bac8-becd-4e2f-a089-84255a5d715d',\r\n        'HTTPStatusCode': 200,\r\n        'HTTPHeaders': {\r\n            'x-amzn-requestid': 'e647bac8-becd-4e2f-a089-84255a5d715d',\r\n            'content-type': 'application\/x-amz-json-1.1',\r\n            'content-length': '22',\r\n            'date': 'Mon, 12 Oct 2020 20:25:43 GMT'\r\n        },\r\n        'RetryAttempts': 0\r\n    }\r\n}\r\n\r\nStop data source sync operation:\r\n{\r\n    'ResponseMetadata': {\r\n        'RequestId': '58626ede-d535-43dc-abf8-797a5637fc86',\r\n        'HTTPStatusCode': 200,\r\n        'HTTPHeaders': {\r\n            'x-amzn-requestid': '58626ede-d535-43dc-abf8-797a5637fc86',\r\n            'content-type': 'application\/x-amz-json-1.1',\r\n            'content-length': '0',\r\n            'date': 'Mon, 12 Oct 2020 20:25:43 GMT'\r\n        },\r\n        'RetryAttempts': 0\r\n    }\r\n}<\/code><\/pre>\n<\/div>\n<pre><span>On Amazon Kendra console, you can see the operation details.\r\n<\/span><\/pre>\n<\/div>\n<p><img decoding=\"async\" loading=\"lazy\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/missing-image.jpg\" alt=\"\" width=\"925\" height=\"56\"><\/p>\n<h2>Running queries<\/h2>\n<p>In this section, we show results from queries using the documents you ingested into your index.<\/p>\n<p>The following screenshot shows results for the query \u201cwhat is deep learning?\u201d<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17699 size-full\" title='\"what is deep learning?\"' src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-9.jpg\" alt=\"\" width=\"977\" height=\"393\"><\/p>\n<p>The following screenshot shows results for the query \u201chow do I try amazon rekognition?\u201d<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17700 size-full\" title='\"how do I try amazon rekognition?\"' src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-10.jpg\" alt=\"\" width=\"994\" height=\"277\"><\/p>\n<p>The following screenshot shows results for the query \u201cwhat is vga resolution?\u201d<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17701 size-full\" title=\"\u201cwhat is vga resolution?\u201d\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/29\/Adding-custom-data-sources-to-Amazon-Kendra-11.jpg\" alt=\"\" width=\"992\" height=\"396\"><\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, we demonstrated how you can use the custom data source feature in Amazon Kendra to ingest documents from a custom data source into an Amazon Kendra index. We used a sample web connector to scrape content from AWS FAQs and stored it in a local file system. Then we outlined the steps you can follow to ingest those scraped documents into your Kendra index. 
We also detailed how to use CloudWatch metrics to check the status of an ingestion job, and ran a few natural language search queries to get relevant results from the ingested content.<\/p>\n<p>We hope this post helps you take advantage of the intelligent search capabilities of Amazon Kendra to find accurate answers from your enterprise content. For more information about Amazon Kendra, watch <a href=\"https:\/\/www.youtube.com\/watch?v=7-31KgImGgU&amp;t=7904s\" target=\"_blank\" rel=\"noopener noreferrer\">AWS re:Invent 2019 \u2013 Keynote with Andy Jassy<\/a> on YouTube.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17855 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/01\/Tapodipta-Ghosh.jpg\" alt=\"\" width=\"100\" height=\"134\"><\/p>\n<p><strong>Tapodipta Ghosh<\/strong> is a Senior Architect. He leads the Content And Knowledge Engineering Machine Learning team that focuses on building models related to AWS Technical Content. He also helps our customers with AI\/ML strategy and implementation using our AI Language services like Kendra.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17854 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/11\/01\/Juan-Bustos.jpg\" alt=\"\" width=\"100\" height=\"150\"><\/p>\n<p><strong>Juan Pablo Bustos<\/strong> is an AI Services Specialist Solutions Architect at Amazon Web Services, based in Dallas, TX. 
Outside of work, he loves spending time writing and playing music as well as trying random restaurants with his family.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/adding-custom-data-sources-to-amazon-kendra\/<\/p>\n","protected":false},"author":0,"featured_media":507,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/506"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=506"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/506\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/507"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=506"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=506"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=506"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}