{"id":1097,"date":"2021-10-28T08:40:22","date_gmt":"2021-10-28T08:40:22","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/28\/optimize-your-budget-and-time-by-submitting-amazon-polly-voice-synthesis-tasks-in-bulk\/"},"modified":"2021-10-28T08:40:22","modified_gmt":"2021-10-28T08:40:22","slug":"optimize-your-budget-and-time-by-submitting-amazon-polly-voice-synthesis-tasks-in-bulk","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/28\/optimize-your-budget-and-time-by-submitting-amazon-polly-voice-synthesis-tasks-in-bulk\/","title":{"rendered":"Optimize your budget and time by submitting Amazon Polly voice synthesis tasks in bulk"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/polly\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Polly<\/a> is a service that turns text into natural-sounding speech, using dozens of voices in more than 30 languages. You can use it for all sorts of applications, ranging from talking animated avatars, to lifelike virtual agents that answer customer support requests, to automated newscasters reading stories aloud. You can have Amazon Polly return synthesized speech as a live stream, or download it as a standard audio file for playback later. Like many AWS services, you pay only for what you actually use: with Amazon Polly, you pay for <a href=\"https:\/\/aws.amazon.com\/polly\/pricing\" target=\"_blank\" rel=\"noopener noreferrer\">the number of characters in the synthesized phrase<\/a>. Just playing a saved audio file is free, whether you play it a single time or multiple times.<\/p>\n<p>If you know exactly which phrases you need ahead of time, you can optimize your AWS spend. Just take every phrase you need voiced and send it to Amazon Polly at build time, storing the generated audio file until you\u2019re ready to play it back at runtime. 
Common use cases for this approach include public address systems at airports or bus stations, video games, and quick-service restaurant automated order-takers. Just pay once to synthesize your text, and then replay the resulting audio files as needed for free.<\/p>\n<p>In this post, we share a fully automated, event-driven, serverless solution that you can use to turn large numbers of text phrases into lifelike speech asynchronously. You can trigger the jobs by manually uploading a file of phrases to a private <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket, and then be notified by email or instant message when they\u2019re ready. Or, make the process part of your <a href=\"https:\/\/aws.amazon.com\/codebuild\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CodeBuild<\/a> continuous integration system, by automatically triggering the synthesis work whenever your source phrases change.<\/p>\n<h2>Overview of the solution<\/h2>\n<p>The solution is fully serverless, consisting chiefly of a set of <a href=\"https:\/\/aws.amazon.com\/lambda\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> functions. These functions track the items to be synthesized, submit them to Amazon Polly for synthesis, and process the results as they\u2019re completed. The functions use shared <a href=\"https:\/\/aws.amazon.com\/dynamodb\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon DynamoDB<\/a> tables to manage the state of the work over time. 
An <a href=\"https:\/\/aws.amazon.com\/step-functions\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Step Functions<\/a> workflow tracks each submitted set, and notifies interested parties of its completion via an <a href=\"https:\/\/aws.amazon.com\/sns\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Notification Service<\/a> (Amazon SNS) topic.<\/p>\n<p>The solution employs an <a href=\"https:\/\/aws.amazon.com\/event-driven-architecture\/\" target=\"_blank\" rel=\"noopener noreferrer\">event-driven architecture<\/a>: rather than a single process running from beginning to end, the work is distributed across Lambda invocations, each of which runs only when an event triggers it.<\/p>\n<p>The following diagram illustrates the solution architecture.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29482\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/18\/6160-Architecture.jpg\" alt=\"\" width=\"801\" height=\"459\"><\/p>\n<h2>Deploy and configure the solution<\/h2>\n<p>You deploy the solution into your AWS account using the <a href=\"https:\/\/aws.amazon.com\/serverless\/sam\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Serverless Application Model<\/a> (AWS SAM). 
You can do this from any computer with command line access to your account, but for the sake of simplicity, we use <a href=\"https:\/\/aws.amazon.com\/cloudshell\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudShell<\/a>.<\/p>\n<ol>\n<li>Sign in to the CloudShell console.<\/li>\n<li>When your shell has been initialized, make a local copy of the solution source code and prepare the AWS SAM stack by issuing the following commands:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">$ git clone https:\/\/github.com\/aws-samples\/amazon-polly-async-batch.git\n$ cd amazon-polly-async-batch\n$ sam build\n<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"3\">\n<li>Use AWS SAM to deploy the solution with <code>sam deploy --guided<\/code>. Provide a stack name (like <code>amazon-polly-async-batch<\/code>), your preferred Region, an email address for notifications, and a name for a new S3 bucket (one that doesn\u2019t exist yet) to hold the generated audio files. Accept the other defaults.<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">$ sam deploy --guided\n        Setting default arguments for 'sam deploy'\n        =========================================\n        Stack Name [amazon-polly-async-batch]: \n        AWS Region [us-east-1]: \n        Parameter NotificationEmail []: *YOUR EMAIL ADDRESS*\n        Parameter WorkBucket []: *YOUR WORK BUCKET NAME*\n        #Shows you resources changes to be deployed and require a 'Y' to initiate deploy\n        Confirm changes before deploy [y\/N]:  \n        #SAM needs permission to be able to create roles to connect to the resources in your template\n        Allow SAM CLI IAM role creation [Y\/n]:  \n        Save arguments to configuration file [Y\/n]: \n        SAM configuration file [samconfig.toml]: \n        SAM configuration environment [default]: \n<\/code><\/pre>\n<\/p><\/div>\n<p>Deployment of all the components should take only a few minutes. 
If the deployment is successful, you should see a message like the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">Successfully created\/updated stack - amazon-polly-async-batch in us-east-1<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"4\">\n<li>Check your email for a message from Amazon SNS and confirm the subscription.<\/li>\n<\/ol>\n<h2>How the solution works<\/h2>\n<p>In this section, we describe in detail how to use the solution to synthesize your text, and how each major component works.<\/p>\n<h3>The set file: Specifying the text to synthesize<\/h3>\n<p>You define the set of text phrases you want Amazon Polly to voice in a file called a <em>set file<\/em>. This is a <a href=\"https:\/\/yaml.org\/spec\/1.0\/\" target=\"_blank\" rel=\"noopener noreferrer\">YAML file<\/a> consisting of the set details, a collection of defaults, and a list of items to synthesize:<\/p>\n<ul>\n<li><strong>Set details<\/strong> \u2013 In the set stanza, you give the set a name to differentiate it from others, and an optional output prefix to tell the solution where in your S3 bucket you want the audio files stored.<\/li>\n<li><strong>Defaults <\/strong>\u2013 In the optional defaults section, you can set parameter values that apply to every item unless an item overrides them. 
The following attributes are supported, as <a href=\"https:\/\/docs.aws.amazon.com\/polly\/latest\/dg\/API_SynthesizeSpeech.html\" target=\"_blank\" rel=\"noopener noreferrer\">documented in the Amazon Polly API<\/a>:\n<ul>\n<li><strong>engine <\/strong>\u2013 Either <code>standard<\/code> or <code>neural<\/code>; defaults to <code>neural<\/code><\/li>\n<li><strong>language-code <\/strong>\u2013 Any language code that Amazon Polly supports; defaults to <code>en-US<\/code><\/li>\n<li><strong>output-format <\/strong>\u2013 <code>mp3<\/code>, <code>ogg_vorbis<\/code>, or <code>pcm<\/code>; defaults to <code>mp3<\/code><\/li>\n<li><strong>text-type <\/strong>\u2013 Either <code>text<\/code> or <code>ssml<\/code>; defaults to <code>text<\/code><\/li>\n<li><strong>voice-id <\/strong>\u2013 Any of the supported voices; defaults to <code>Matthew<\/code><\/li>\n<\/ul>\n<\/li>\n<li><strong>Items <\/strong>\u2013 The items collection is simply a list of text strings to synthesize. Amazon Polly converts each item\u2019s text to speech, using the set defaults plus any overrides given in the item, and places the resulting files in the S3 bucket in the set\u2019s output prefix folder. 
If you specify an output file, the file is named as specified; otherwise, the solution assigns the file a name based on its contents and its order in the collection.<\/li>\n<\/ul>\n<p>For example, if you want to synthesize six lines from Act 1 Scene 1 of <a href=\"https:\/\/www.gutenberg.org\/files\/1513\/1513-h\/1513-h.htm\" target=\"_blank\" rel=\"noopener noreferrer\">Romeo and Juliet<\/a>, you might use a YAML file that looks like the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">set:\n  name: romeo-juliet\n  output-prefix: act-1-scene-1\ndefaults:\n  engine: neural \n  language-code: en-US\n  output-format: mp3\n  text-type: text\nitems:\n  - text: Do you bite your thumb at us, sir?\n    voice-id: Joey\n  - text: I do bite my thumb, sir.\n    voice-id: Matthew\n  - text: &lt;speak&gt;Do you bite your thumb at &lt;break\/&gt;us&lt;break\/&gt;, sir?&lt;\/speak&gt;\n    voice-id: Joey\n    text-type: ssml\n  - text: &gt;\n      &lt;speak&gt;&lt;amazon:effect name=\"whispered\"&gt;Is the law of our side\n      if I say aye?&lt;\/amazon:effect&gt;&lt;\/speak&gt;\n    voice-id: Matthew\n    text-type: ssml\n  - text: &lt;speak&gt;&lt;amazon:effect name=\"whispered\"&gt;No.&lt;\/amazon:effect&gt;&lt;\/speak&gt;\n    voice-id: Brian\n    text-type: ssml\n  - text: No, sir. I do not bite my thumb at you, sir, but I bite my thumb, sir.\n    voice-id: Matthew \n<\/code><\/pre>\n<\/p><\/div>\n<p>This set specifies that Amazon Polly should synthesize six lines from the play. To represent the characters Abraham, Sampson, and Gregory, we use the voices Joey, Matthew, and Brian. 
With Amazon Polly, you can control how a line is delivered, as when Abraham sets off the word \u201cus\u201d with pauses, or when Sampson and Gregory whisper their asides. For <a href=\"https:\/\/docs.aws.amazon.com\/polly\/latest\/dg\/supportedtags.html\" target=\"_blank\" rel=\"noopener noreferrer\">SSML effects like these<\/a>, we simply specify that the <code>text-type<\/code> is <code>ssml<\/code> and decorate the utterance appropriately.<\/p>\n<p>Because none of the items specify an output file, the file names are generated automatically for you. In this example, the generated MP3 files are <code>act-1-scene-1\/item-000000-do-you-bite-your-thumb-at-us-sir.mp3<\/code> through <code>act-1-scene-1\/item-000005-no-sir-i-do-not-bite-my-thumb-at-you-sir.mp3<\/code>.<\/p>\n<p>This set file, along with others, is in the <code>docs\/samples<\/code> directory of the code. In CloudShell, you can send this file to Amazon Polly simply by uploading it to the S3 bucket you specified earlier:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">$ aws s3 cp docs\/samples\/romeo-juliet.yml s3:\/\/[BUCKET NAME]<\/code><\/pre>\n<\/p><\/div>\n<p>Amazon Polly synthesizes the six lines from the file. When all the lines have been synthesized, you get an email notification:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">Your Amazon Polly batch set romeo-juliet completed with 6 successful tasks and 0 failures. The requested files are in s3:\/\/[BUCKET NAME]\/act-1-scene-1\/.<\/code><\/pre>\n<\/p><\/div>\n<p>YAML can be created in any editor, is easy for humans to read, and is friendly for checking in to source control systems like <a href=\"https:\/\/aws.amazon.com\/codecommit\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CodeCommit<\/a>. 
However, the set file must be a plain text file, must have the <code>.yml<\/code> file extension, and must be valid YAML.<\/p>\n<h3>The Set Processor function<\/h3>\n<p>When a file with a <code>.yml<\/code> extension is uploaded to the S3 bucket, the Set Processor Lambda function kicks off the process. It parses the set file and creates a corresponding record for it in DynamoDB. This set record is used to keep track of how many items there are in the set, how many have yet to be completed, and when the set processing began.<\/p>\n<p>Then, for each item in the collection, the Set Processor function posts a message (a work order, of sorts) to the solution\u2019s <a href=\"https:\/\/aws.amazon.com\/sqs\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Queue Service<\/a> (Amazon SQS) queue. This work order is a JSON document including everything Amazon Polly needs to synthesize the text per the instructions in the uploaded set file.<\/p>\n<p>Each message is entirely independent of the others, so Amazon Polly can synthesize them concurrently, and it doesn\u2019t matter in what order they\u2019re completed. The name of the set is also part of the work order, so multiple sets (or even multiple instances of the same set) can be processed by the solution at the same time.<\/p>\n<h3>The Item Processor function<\/h3>\n<p>The Item Processor Lambda function consumes messages from the SQS queue and posts work to Amazon Polly.<\/p>\n<p>Each message represents a single audio file for Amazon Polly to create. The function calls the API method <a href=\"https:\/\/docs.aws.amazon.com\/polly\/latest\/dg\/API_StartSpeechSynthesisTask.html\" target=\"_blank\" rel=\"noopener noreferrer\">StartSpeechSynthesisTask<\/a>, using the values in the work order as arguments to the method\u2019s parameters. 
This is an asynchronous API call, so we have no guarantees as to when Amazon Polly actually generates the audio file for us, but when the task is complete, Amazon Polly publishes an SNS message for the next Lambda function, the Response Processor, to handle.<\/p>\n<p>The Item Processor function also adds a record to the items table in DynamoDB, so the solution can keep track of which items have been successfully completed and which have not.<\/p>\n<p>As with many AWS APIs, there are <a href=\"https:\/\/docs.aws.amazon.com\/polly\/latest\/dg\/limits.html\" target=\"_blank\" rel=\"noopener noreferrer\">limits to how many API calls you can make to Amazon Polly per second<\/a>. The Item Processor function is throttled to stay within reasonable limits, and it <a href=\"https:\/\/aws.amazon.com\/builders-library\/timeouts-retries-and-backoff-with-jitter\/\" target=\"_blank\" rel=\"noopener noreferrer\">backs off exponentially and retries<\/a> as needed so as to post the work but still stay within your account service limits.<\/p>\n<h3>The Response Processor function<\/h3>\n<p>When Amazon Polly has completed work on a specific request, it posts a notification to the SNS response topic. This is immediately picked up by the final Lambda function in the sequence, the Response Processor. This function is responsible for updating the item and set records in DynamoDB, and for renaming the audio file in Amazon S3 to the requested file name.<\/p>\n<p>If Amazon Polly reported success in synthesizing the audio file, then the Response Processor function simply moves the file to its final location. It updates the item record <code>taskStatus<\/code> to <code>success<\/code> and increments the <code>success<\/code> counter in the set record. 
If Amazon Polly reports failure, the function updates the item record with the reason for failure and increments the <code>failed<\/code> counter in the set record.<\/p>\n<h3>The Set Waiter workflow<\/h3>\n<p>To review, each of these Lambda functions runs only when triggered by an event:<\/p>\n<ul>\n<li>The Set Processor is triggered when a set file is uploaded to the S3 bucket<\/li>\n<li>The Item Processor is triggered when work orders appear in the SQS queue<\/li>\n<li>The Response Processor is triggered when Amazon Polly publishes a message to the SNS topic<\/li>\n<\/ul>\n<p>These functions can run concurrently, processing multiple items from multiple sets at the same time. Without an orchestration process, how do we know when a specific set is complete? How do we know if something went wrong?<\/p>\n<p>The Set Waiter is a Step Functions workflow that\u2019s responsible for watching a specific set to decide when it\u2019s done, or to notify if a technical problem with the solution has left the set abandoned.<\/p>\n<p>In the Step Functions Graph inspector, an in-process Set Waiter workflow looks like the following.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-29481\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/18\/6160-Step.jpg\" alt=\"\" width=\"801\" height=\"645\"><\/p>\n<p>The Set Processor function starts an instance of the Set Waiter for every submitted set, passing it a unique name that identifies the set. The waiter loads the set record from the DynamoDB table in the load phase and checks to see if it\u2019s complete in the check phase. If Amazon Polly still has tasks to process, the workflow waits a few seconds in the wait phase before starting again.<\/p>\n<p>If every task in the set has been processed by Amazon Polly, the Set Waiter moves to the notify phase, which publishes a message to the completion SNS topic. 
If no changes have recently been made to an in-process set, the Set Waiter assumes that something is wrong and posts an abandoned message to the topic.<\/p>\n<h2>Clean up<\/h2>\n<p>You can leave the solution in your account for as long as you like. When it\u2019s not in use, you pay only for the storage of the audio files in Amazon S3 and for the data in the DynamoDB tables. When you have text to synthesize, just upload a set file to the S3 bucket, and the solution takes it from there. You pay for the Lambda function invocations and the <a href=\"https:\/\/aws.amazon.com\/polly\/pricing\/\" target=\"_blank\" rel=\"noopener noreferrer\">characters actually processed by Amazon Polly<\/a>. Synthesizing all 1.1 million characters in <em>Moby Dick<\/em>, for example, costs less than $5 for the standard voices, and well under $20 for the higher-quality neural voices.<\/p>\n<p>If you decide not to use the solution again, you can delete all its resources using <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">$ aws cloudformation delete-stack --stack-name amazon-polly-async-batch<\/code><\/pre>\n<\/p><\/div>\n<h2>Conclusion<\/h2>\n<p>In this post, we described a serverless, event-driven solution for submitting large numbers of text phrases for Amazon Polly to process asynchronously. With this approach, you can keep your costs low by paying only once for synthesis, no matter how many times you play the generated audio files.<\/p>\n<p>You can deploy the solution to your account in minutes as an AWS SAM application. You specify the text to be converted in YAML files called set files. 
When a set file is uploaded to the solution\u2019s S3 bucket (either manually by a human, or automatically by a code pipeline), a series of Lambda functions (the Set Processor, Item Processor, and Response Processor) work together to submit the tasks to Amazon Polly and collect the audio files for you. When all the work has been completed, a notification is published to an SNS topic.<\/p>\n<p>The solution is developed as an open source project on GitHub. We welcome your feature requests, bug reports, or contributions. Try this out on your own and let us know what you think in the comments. To learn more about how Amazon Polly can help you, <a href=\"https:\/\/aws.amazon.com\/polly\/\" target=\"_blank\" rel=\"noopener noreferrer\">visit our webpage<\/a>!<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-29942 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/27\/Jon-Peterson.jpg\" alt=\"\" width=\"100\" height=\"125\"><strong>Jon Peterson<\/strong> is a Senior Solutions Architect with AWS. He lives outside of Chicago with his wife and two children.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-29941 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/27\/Prateek-Jain.jpg\" alt=\"\" width=\"100\" height=\"132\"><strong>Prateek Jain<\/strong> is a Solutions Architect with AWS, based out of Atlanta, Georgia. 
He is passionate about the cloud and about helping customers build amazing solutions on AWS.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/optimize-your-budget-and-time-by-submitting-amazon-polly-voice-synthesis-tasks-in-bulk\/<\/p>\n","protected":false},"author":0,"featured_media":1098,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1097"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1097"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1097\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1098"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1097"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1097"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1097"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}