{"id":1926,"date":"2022-03-03T18:05:41","date_gmt":"2022-03-03T18:05:41","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/03\/enable-the-visually-impaired-to-hear-documents-using-amazon-textract-and-amazon-polly\/"},"modified":"2022-03-03T18:05:41","modified_gmt":"2022-03-03T18:05:41","slug":"enable-the-visually-impaired-to-hear-documents-using-amazon-textract-and-amazon-polly","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/03\/enable-the-visually-impaired-to-hear-documents-using-amazon-textract-and-amazon-polly\/","title":{"rendered":"Enable the visually impaired to hear documents using Amazon Textract and Amazon Polly"},"content":{"rendered":"<div id=\"\">\n<p>At the 2021 AWS re:Invent conference in Las Vegas, we demoed <a href=\"https:\/\/www.readforme.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Read For Me<\/a> at the AWS Builders Fair\u2014a website that helps the visually impaired hear documents.<\/p>\n<p>For better quality, view the video <a href=\"https:\/\/d36nqpzxe3i32x.cloudfront.net\/artifacts\/ML-5722\/read-for-me-1080p.mp4\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.<\/p>\n<p>Adaptive technology and accessibility features are often expensive, if they\u2019re available at all. Audio books help the visually impaired read. Audio description makes movies accessible. But what do you do when the content isn\u2019t already digitized?<\/p>\n<p>This post focuses on the AWS AI services <a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a> and <a href=\"https:\/\/aws.amazon.com\/polly\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Polly<\/a>, which empower those with impaired vision. 
Read For Me was co-developed by Jack Marchetti, who is visually impaired.<\/p>\n<h2>Solution overview<\/h2>\n<p>Through an event-driven, serverless architecture and a combination of multiple AI services, we can create natural-sounding audio files in multiple languages from a picture of a document, or any image with text: for example, a letter from the IRS, a holiday card from family, or even the opening titles to a film.<\/p>\n<p>The following <a href=\"https:\/\/d1.awsstatic.com\/architecture-diagrams\/ArchitectureDiagrams\/readforme-ra.pdf?did=wp_card&amp;trk=wp_card\" target=\"_blank\" rel=\"noopener noreferrer\">Reference Architecture<\/a>, published in the <a href=\"https:\/\/aws.amazon.com\/architecture\/reference-architecture-diagrams\/?achp_ra8&amp;whitepapers-main.sort-by=item.additionalFields.sortDate&amp;whitepapers-main.sort-order=desc&amp;awsf.whitepapers-tech-category=tech-category%23ai-ml&amp;awsf.whitepapers-industries=*all&amp;solutions-all.sort-by=item.additionalFields.sortDate&amp;solutions-all.sort-order=desc&amp;whitepapers-main.q=ReadForMe&amp;whitepapers-main.q_operator=AND\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Architecture Center<\/a>, shows the workflow of a user taking a picture with their phone and playing an MP3 of the content found within that document.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ReadForMeArchitectureforblogpost.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-33388 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ReadForMeArchitectureforblogpost.png\" alt=\"Read For Me architecture diagram\" width=\"1456\" height=\"838\"><\/a><\/p>\n<p>The workflow includes the following steps:<\/p>\n<ol>\n<li>Static content (HTML, CSS, JavaScript) is hosted on <a href=\"https:\/\/aws.amazon.com\/amplify\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Amplify<\/a>.<\/li>\n<li>Temporary 
access is granted for anonymous users to backend services via an <a href=\"https:\/\/aws.amazon.com\/cognito\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Cognito<\/a> identity pool.<\/li>\n<li>The image files are stored in <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3).<\/li>\n<li>A user makes a POST request through <a href=\"https:\/\/aws.amazon.com\/api-gateway\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon API Gateway<\/a> to the audio service, which proxies to an Express <a href=\"http:\/\/aws.amazon.com\/step-functions\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Step Functions<\/a> workflow.<\/li>\n<li>The Step Functions workflow includes the following steps:\n<ol type=\"a\">\n<li><a href=\"https:\/\/aws.amazon.com\/textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Textract<\/a> extracts text from the image.<\/li>\n<li><a href=\"https:\/\/aws.amazon.com\/comprehend\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Comprehend<\/a> detects the language of the text.<\/li>\n<li>If the target language differs from the detected language, <a href=\"https:\/\/aws.amazon.com\/translate\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Translate<\/a> translates to the target language.<\/li>\n<li><a href=\"https:\/\/aws.amazon.com\/polly\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Polly<\/a> creates an audio file as output using the text.<\/li>\n<\/ol>\n<\/li>\n<li>The Step Functions workflow stores the resulting audio file in Amazon S3 in MP3 format.<\/li>\n<li>A pre-signed URL with the location of the audio file stored in Amazon S3 is sent back to the user\u2019s browser through API Gateway. 
The user\u2019s mobile device plays the audio file using the pre-signed URL.<\/li>\n<\/ol>\n<p>In the following sections, we discuss why we chose the specific services, architecture patterns, and service features for this solution.<\/p>\n<h2>AWS AI services<\/h2>\n<p>Several AI services are wired together to power Read For Me:<\/p>\n<ul>\n<li>Amazon Textract identifies the text in the uploaded picture.<\/li>\n<li>Amazon Comprehend determines the language.<\/li>\n<li>If the user chooses a different spoken language than the language in the picture, we translate it using Amazon Translate.<\/li>\n<li>Amazon Polly creates the MP3 file. We take advantage of the Amazon Polly neural engine, which creates a more natural, lifelike audio recording.<\/li>\n<\/ul>\n<p>One of the main benefits of using these AI services is the ease of adoption, with little or no core machine learning experience required. The services expose APIs that clients can invoke using SDKs made available in multiple programming languages, such as Python and Java.<\/p>\n<p>With Read For Me, we wrote the underlying <a href=\"http:\/\/aws.amazon.com\/lambda\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> functions in Python.<\/p>\n<h2>AWS SDK for Python (Boto3)<\/h2>\n<p>The <a href=\"https:\/\/aws.amazon.com\/sdk-for-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS SDK for Python (Boto3)<\/a> makes interacting with AWS services simple. For example, the following lines of Python code return the text found in the image or document you provide:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import boto3\n\nclient = boto3.client('textract')\nresponse = client.detect_document_text(\n    Document={\n        'S3Object': {\n            'Bucket': 'bucket-name',\n            'Name': 's3-key'\n        }\n    }\n)\n# the detected text is in response['Blocks'] (LINE and WORD blocks)<\/code><\/pre>\n<\/div>\n<p>All Python code is run within individual Lambda functions. 
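The Amazon Polly call at the other end of the workflow is just as compact. The following is a minimal sketch, not taken from the Read For Me source: the helper names, the naive chunking, and the Joanna voice are illustrative assumptions. Note that Polly's SynthesizeSpeech API accepts at most 3,000 billed characters per request, so longer documents must be split first.

```python
def chunk_text(text, limit=3000):
    """Split text into pieces under Polly's 3,000 billed-character
    per-request limit (illustrative, naive fixed-width split)."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def synthesize_mp3(text, voice_id="Joanna", engine="neural"):
    """Return MP3 bytes for `text` using Amazon Polly (hypothetical helper)."""
    import boto3  # imported here so chunk_text stays usable without boto3 installed

    polly = boto3.client("polly")
    audio = b""
    for piece in chunk_text(text):
        response = polly.synthesize_speech(
            Text=piece,
            OutputFormat="mp3",
            VoiceId=voice_id,
            Engine=engine,  # the neural engine gives the more lifelike audio
        )
        audio += response["AudioStream"].read()
    return audio
```

In the deployed workflow, a call like this runs inside a Lambda function and the resulting bytes are written to Amazon S3 for the pre-signed URL step.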
There are no servers to provision and no infrastructure to maintain.<\/p>\n<h2>Architecture patterns<\/h2>\n<p>In this section, we discuss the different architecture patterns used in the solution.<\/p>\n<h3>Serverless<\/h3>\n<p>We implemented a serverless architecture for two main reasons: speed to build and cost. With no underlying hardware to maintain or infrastructure to deploy, we focused entirely on the business logic code and nothing else. This allowed us to get a functioning prototype up and running in a matter of days. If users aren\u2019t actively uploading pictures and listening to recordings, nothing is running, and therefore nothing is incurring costs outside of storage. An S3 lifecycle management rule deletes uploaded images and MP3 files after 1 day, so storage costs are low.<\/p>\n<h3>Synchronous workflow<\/h3>\n<p>When you\u2019re building serverless workflows, it\u2019s important to understand when a synchronous call makes more sense for the architecture and user experience than an asynchronous process. With Read For Me, we initially went down the asynchronous path and planned on using WebSockets to bi-directionally communicate with the front end. Our workflow would include a step to find the connection ID associated with the Step Functions workflow and, upon completion, alert the front end. For more information about this process, refer to <a href=\"https:\/\/aws.amazon.com\/blogs\/compute\/from-poll-to-push-transform-apis-using-amazon-api-gateway-rest-apis-and-websockets\/\" target=\"_blank\" rel=\"noopener noreferrer\">From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets<\/a>.<\/p>\n<p>We ultimately chose not to do this and instead used Express Step Functions workflows, which are synchronous. Users understand that processing an image won\u2019t be instant, but also know it won\u2019t take 30 seconds or a minute. We were in a space where a few seconds was satisfactory to the end user, so we didn\u2019t need the benefit of WebSockets. 
This simplified the workflow overall.<\/p>\n<h3>Express Step Functions workflow<\/h3>\n<p>The ability to break out your code into smaller, isolated functions allows for fine-grained control, easier maintenance, and the ability to scale more accurately. For instance, if we determined that the Lambda function that triggered Amazon Polly to create the audio file was running slower than the function that determined the language, we could vertically scale that function, adding more memory, without having to do so for the others. Similarly, you limit the blast radius of what your Lambda function can do or access when you limit its scope and reach.<\/p>\n<p>One of the benefits of orchestrating your workflow with Step Functions is the ability to introduce decision flow logic without having to write any code.<\/p>\n<p>Our Step Functions workflow isn\u2019t complex. It\u2019s linear until the translation step. If we don\u2019t need to call a translation Lambda function, that\u2019s less cost to us, and a faster experience for the user. We can use the visual designer on the Step Functions console to find the specific key in the input payload and, if it\u2019s present, call one function over the other using JSONPath. 
For example, our payload includes a key called translate:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-json\">{\n  \"extracted_text\": \"hello world\",\n  \"target_language\": \"es\",\n  \"source_language\": \"en\",\n  \"translate\": true\n}<\/code><\/pre>\n<\/div>\n<p>Within the Step Functions visual designer, we find the translate key and set up rules to match.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/Screen-Shot-2022-02-22-at-12.36.24-PM-1.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33396\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/Screen-Shot-2022-02-22-at-12.36.24-PM-1.png\" alt=\"Step Functions visual designer showing the translate choice rule\" width=\"3721\" height=\"1228\"><\/a><\/p>\n<h3>Headless architecture<\/h3>\n<p>Amplify hosts the front-end code. The front end is written in React and the source code is checked into <a href=\"https:\/\/aws.amazon.com\/codecommit\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CodeCommit<\/a>. Amplify solves a few problems for users trying to deploy and manage static websites. If you were doing this manually (using an S3 bucket set up for static website hosting and fronting that with <a href=\"https:\/\/aws.amazon.com\/cloudfront\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudFront<\/a>), you\u2019d have to expire the cache yourself each time you deployed. You\u2019d also have to set up your own CI\/CD pipeline. Amplify handles this for you.<\/p>\n<p>This allows for a headless architecture, where front-end code is decoupled from the backend and each layer can be managed and scaled independently of the other.<\/p>\n<h2>Analyze ID<\/h2>\n<p>In the preceding section, we discussed the architecture patterns for processing the uploaded picture and creating an MP3 file from it. 
Having a document read back to you is a great first step, but what if you only want to know something specific without having the whole thing read back to you? For instance, suppose you need to fill out a form online and provide your state ID or passport number, or perhaps its expiration date. You then have to take a picture of your ID and, while having it read back to you, wait for that specific part. Alternatively, you could use Analyze ID.<\/p>\n<p>Analyze ID is a feature of Amazon Textract that enables you to query documents. Read For Me contains a drop-down menu where you can specifically ask for the expiration date, date of issue, or document number. You can use the same workflow to create an MP3 file that provides an answer to your specific question.<\/p>\n<p>You can demo the Analyze ID feature at <a href=\"http:\/\/readforme.io\/analyze\" target=\"_blank\" rel=\"noopener noreferrer\">readforme.io\/analyze<\/a>.<\/p>\n<h2>Additional Polly features<\/h2>\n<ul>\n<li>Read For Me offers multiple neural voices in different languages and dialects. Note that there are several other <a href=\"https:\/\/docs.aws.amazon.com\/polly\/latest\/dg\/voicelist.html\" target=\"_blank\" rel=\"noopener noreferrer\">voices<\/a> you can choose from, which we did not implement. When a new voice is available, an update to the front-end code and a Lambda function is all that\u2019s needed to take advantage of it.<\/li>\n<li>The Polly service also offers other options that we have yet to include in Read For Me. 
Those include adjusting the <a href=\"https:\/\/docs.aws.amazon.com\/polly\/latest\/dg\/voice-speed-vip.html\" target=\"_blank\" rel=\"noopener noreferrer\">speed of the voices<\/a> and using <a href=\"https:\/\/docs.aws.amazon.com\/polly\/latest\/dg\/using-speechmarks.html\" target=\"_blank\" rel=\"noopener noreferrer\">speech marks<\/a>.<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>In this post, we discussed how to use numerous AWS services, including AI and serverless services, to aid the visually impaired. You can learn more about the Read For Me project and use it by visiting <a href=\"https:\/\/www.readforme.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">readforme.io<\/a>. You can also find Amazon Textract examples on the <a href=\"https:\/\/github.com\/aws-samples\/amazon-textract-code-samples\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>. To learn more about Analyze ID, check out <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/announcing-support-for-extracting-data-from-identity-documents-using-amazon-textract\/\" target=\"_blank\" rel=\"noopener noreferrer\">Announcing support for extracting data from identity documents using Amazon Textract<\/a>.<\/p>\n<p>The source code for this project will be open-sourced and added to AWS\u2019s public GitHub soon.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/Jack-M.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-33380 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/Jack-M.jpg\" alt=\"Jack Marchetti\" width=\"100\" height=\"133\"><\/a>Jack Marchetti<\/strong> is a Senior Solutions Architect at AWS. With a background in software engineering, Jack is primarily focused on helping customers implement serverless, event-driven architectures. 
He built his first distributed, cloud-based application in 2013 after attending the second AWS re:Invent conference and has been hooked ever since. Prior to AWS, Jack spent the bulk of his career in the ad agency space building experiences for some of the largest brands in the world. Jack is legally blind and resides in Chicago with his wife Erin and cat Minou. He is also a screenwriter and director with a primary focus on Christmas movies and horror. View Jack\u2019s filmography at his <a href=\"https:\/\/www.imdb.com\/name\/nm5189782\/\">IMDb<\/a> page.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/AlakEswaradass.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-33379 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/AlakEswaradass.jpg\" alt=\"Alak Eswaradass\" width=\"100\" height=\"137\"><\/a>Alak Eswaradass<\/strong> is a Solutions Architect at AWS based in Chicago, Illinois. She is passionate about helping customers design cloud architectures using AWS services to solve business challenges. She has a Master\u2019s degree in computer science engineering. Before joining AWS, she worked for different healthcare organizations, and she has in-depth experience in architecting complex systems, technology innovation, and research. She hangs out with her daughters and explores the outdoors in her free time.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/Swagat-300x300-1.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-33378 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/Swagat-300x300-1.png\" alt=\"Swagat Kulkarni\" width=\"100\" height=\"100\"><\/a>Swagat Kulkarni<\/strong> is a Senior Solutions Architect at AWS and an AI\/ML enthusiast. 
He is passionate about solving real-world problems for customers with cloud native services and machine learning. Outside of work, Swagat enjoys travel, reading and meditating.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/enable-the-visually-impaired-to-hear-documents-using-amazon-textract-and-amazon-polly\/<\/p>\n","protected":false},"author":0,"featured_media":1927,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1926"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1926"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1926\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1927"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1926"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1926"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1926"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}