{"id":362,"date":"2020-10-06T10:24:25","date_gmt":"2020-10-06T10:24:25","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/10\/06\/evaluating-an-automatic-speech-recognition-service\/"},"modified":"2020-10-06T10:24:25","modified_gmt":"2020-10-06T10:24:25","slug":"evaluating-an-automatic-speech-recognition-service","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/10\/06\/evaluating-an-automatic-speech-recognition-service\/","title":{"rendered":"Evaluating an automatic speech recognition service"},"content":{"rendered":"<div id=\"\">\n<p>Over the past few years, many <a href=\"https:\/\/aws.amazon.com\/transcribe\/\" target=\"_blank\" rel=\"noopener noreferrer\">automatic speech recognition (ASR) services<\/a> have entered the market, offering a variety of different features. When deciding whether to use a service, you may want to evaluate its performance and compare it to another service. This evaluation process often analyzes a service along multiple vectors such as feature coverage, customization options, security, performance and latency, and integration with other cloud services.<\/p>\n<p>Depending on your needs, you\u2019ll want to check for features such as speaker labeling, content filtering, and automatic language identification. Basic transcription accuracy is often a key consideration during these service evaluations. 
In this post, we show how to measure the basic transcription accuracy of an ASR service in six easy steps, provide best practices, and discuss common mistakes to avoid.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-16697 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/02\/Evaluating-ASR-accuracy.jpg\" alt=\"Illustration showing a table of contents: The evaluation basics, six steps for performing an evaluation, and best practices and common mistakes to avoid.\" width=\"900\" height=\"450\"><\/p>\n<h2>The evaluation basics<\/h2>\n<h3>Defining your use case and performance metric<\/h3>\n<p>Before starting an ASR performance evaluation, you first need to consider your transcription use case and decide how to measure a good or bad performance. Literal transcription accuracy is often critical. For example, how many word errors are in the transcripts? This question is especially important if you pay annotators to review the transcripts and manually correct the ASR errors, and you want to minimize how much of the transcript needs to be re-typed.<\/p>\n<p>The most common metric for speech recognition accuracy is called <em>word error rate<\/em> (WER), which is recommended by the US National Institute of Standards and Technology for evaluating the performance of ASR systems. WER is the proportion of transcription errors that the ASR system makes relative to the number of words that were actually said. The lower the WER, the more accurate the system. Consider this example:<\/p>\n<p><strong>Reference transcript (what the speaker said)<\/strong>: well they went to the store to get sugar<\/p>\n<p><strong>Hypothesis transcript (what the ASR service transcribed)<\/strong>: they went to this tour kept shook or<\/p>\n<p>In this example, the ASR service doesn\u2019t appear to be accurate, but how many errors did it make? 
To quantify WER, there are three categories of errors:<\/p>\n<ul>\n<li>\n<strong>Substitutions<\/strong> \u2013 When the system transcribes one word in place of another. Transcribing the fifth word as <code>this<\/code> instead of <code>the<\/code> is an example of a substitution error.<\/li>\n<li>\n<strong>Deletions<\/strong> \u2013 When the system misses a word entirely. In the example, the system deleted the first word <code>well<\/code>.<\/li>\n<li>\n<strong>Insertions<\/strong> \u2013 When the system adds a word into the transcript that the speaker didn\u2019t say, such as <code>or<\/code> inserted at the end of the example.<\/li>\n<\/ul>\n<p>Of course, counting errors in terms of substitutions, deletions, and insertions isn\u2019t always straightforward. If the speaker says \u201cto get sugar\u201d and the system transcribes <code>kept shook or<\/code>, one person might count that as a deletion (<code>to<\/code>), two substitutions (<code>kept<\/code> instead of <code>get<\/code> and <code>shook<\/code> instead of <code>sugar<\/code>), and an insertion (<code>or<\/code>). A second person might count that as three substitutions (<code>kept<\/code> instead of <code>to<\/code>, <code>shook<\/code> instead of <code>get<\/code>, and <code>or<\/code> instead of <code>sugar<\/code>). Which is the correct approach?<\/p>\n<p>WER gives the system the benefit of the doubt, and counts the minimum number of possible errors. In this example, the minimum number of errors is six. The following aligned text shows how to count errors to minimize the total number of substitutions, deletions, and insertions:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">REF: WELL they went to THE  STORE TO   GET   SUGAR\r\nHYP: **** they went to THIS TOUR  KEPT SHOOK OR\r\n     D                 S    S     S    S     S\r\n<\/code><\/pre>\n<\/div>\n<p>Many ASR evaluation tools use this format. 
The first line shows the reference transcript, labeled <code>REF<\/code>, and the second line shows the hypothesis transcript, labeled <code>HYP<\/code>. The words in each transcript are aligned, with errors shown in uppercase. If a word was deleted from the reference or inserted into the hypothesis, asterisks are shown in place of the word that was deleted or inserted. The last line shows <code>D<\/code> for the word that was deleted by the ASR service, and <code>S<\/code> for words that were substituted.<\/p>\n<p>Don\u2019t worry if these aren\u2019t the actual errors that the system made. With the standard WER metric, the goal is to find the minimum number of words that you need to correct. For example, the ASR service probably didn\u2019t really confuse \u201cget\u201d and \u201cshook<em>,\u201d<\/em> which sound nothing alike. The system probably misheard \u201csugar\u201d as \u201cshook or,\u201d which do sound very similar. If you take that into account (and there are variants of WER that do), you might end up counting seven or eight word errors. However, for the simple case here, all that matters is counting how many words you need to correct without needing to identify the exact mistakes that the ASR service made.<\/p>\n<p>You might recognize this as the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\" target=\"_blank\" rel=\"noopener noreferrer\">Levenshtein edit distance<\/a> between the reference and the hypothesis. WER is defined as the normalized Levenshtein edit distance:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-16624 size-full\" title=\"Math formula showing the minimum number of errors divided by the actual number of words in the reference transcript. 
The minimum number of errors is the same as the number of substitutions plus deletions plus insertions.\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/01\/F1.jpg\" alt=\"\" width=\"900\" height=\"108\"><\/p>\n<p>In other words, it\u2019s the minimum number of words that need to be corrected to change the hypothesis transcript into the reference transcript, divided by the number of words that the speaker originally said. Our example would have the following WER calculation:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-16625 size-full\" title=\"Math equation showing the quantity five plus one plus zero divided by the quantity nine equals approximately zero point six seven.\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/01\/F2.jpg\" alt=\"\" width=\"900\" height=\"95\"><\/p>\n<p>WER is often multiplied by 100, so the WER in this example might be reported as 0.67, 67%, or 67. This means the service made errors for 67% of the reference words. Not great! The best achievable WER score is 0, which means that every word is transcribed correctly with no inserted words. On the other hand, there is no worst WER score\u2014it can even go above 1 (above 100%) if the system made a lot of insertion errors. In that case, the system is actually making more errors than there are words in the reference\u2014not only does it get all the words wrong, but it also manages to add new wrong words to the transcript.<\/p>\n<p>For other performance metrics besides WER, see the section <strong>Adapting the performance metric to your use case <\/strong>later in this post.<\/p>\n<h3>Normalizing and preprocessing your transcripts<\/h3>\n<p>When calculating WER and many other metrics, keep in mind that the problem of text normalization can drastically affect the calculation. 
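<\/p>
<p>As a quick aside, the standard WER computation described above can be sketched in a few lines of Python. This is a minimal word-level edit distance, not a replacement for the evaluation tools discussed later in this post, and it assumes that both transcripts are already normalized and whitespace-tokenized:<\/p>

```python
def wer(reference, hypothesis):
    # Word-level Levenshtein edit distance divided by the reference length.
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] is the minimum number of edits needed to turn the first i
    # reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

<p>Running this on the earlier example, <code>wer('well they went to the store to get sugar', 'they went to this tour kept shook or')<\/code> returns 6 \/ 9 \u2248 0.67, matching the calculation above.<\/p>
<p>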
Consider this example:<\/p>\n<p><strong>Reference<\/strong>: They will tell you again: our ballpark estimate is $450.<\/p>\n<p><strong>ASR hypothesis<\/strong>: They\u2019ll tell you again our ball park estimate is four hundred fifty dollars.<\/p>\n<p>The following code shows how most tools would count the word errors if you just leave the transcripts as-is:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">REF: THEY WILL    tell you AGAIN: our **** BALLPARK estimate is **** ******* ***** $450.   \r\nHYP: **** THEY'LL tell you AGAIN  our BALL PARK     estimate is FOUR HUNDRED FIFTY DOLLARS.\r\n     D    S                S          I    S                    I    I       I     S\r\n<\/code><\/pre>\n<\/div>\n<p>The word error rate would therefore be:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-16626 size-full\" title=\"Math equation showing one deletion plus four substitutions plus four insertions in the numerator divided by 10 words in the reference in the denominator equals zero point nine.\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/01\/F3.jpg\" alt=\"\" width=\"900\" height=\"117\"><\/p>\n<p>According to this calculation, there were errors for 90% of the reference words. That doesn\u2019t seem right. 
The ASR hypothesis is basically correct, with only small differences:<\/p>\n<ul>\n<li>The words <code>they will<\/code> are contracted to <code>they\u2019ll<\/code>\n<\/li>\n<li>The colon after <code>again<\/code> is omitted<\/li>\n<li>The term <code>ballpark<\/code> is spelled as a single compound word in the reference, but as two words in the hypothesis<\/li>\n<li>\n<code>$450<\/code> is spelled with numerals and a currency symbol in the reference, but the ASR system spells it using the alphabet as <code>four hundred fifty dollars<\/code>\n<\/li>\n<\/ul>\n<p>The problem is that you can write down the original spoken words in more than one way. The reference transcript spells them one way and the ASR service spells them in a different way. Depending on your use case, you may or may not want to count these written differences as errors that are equivalent to missing a word entirely.<\/p>\n<p>If you don\u2019t want to count these kinds of differences as errors, you should normalize both the reference and the hypothesis transcripts before you calculate WER. Normalizing involves changes such as:<\/p>\n<ul>\n<li>Lowercasing all words<\/li>\n<li>Removing punctuation (except apostrophes)<\/li>\n<li>Contracting words that can be contracted<\/li>\n<li>Expanding written abbreviations to their full forms (such as <code>Dr.<\/code> to <code>doctor<\/code>)<\/li>\n<li>Spelling all compound words with spaces (such as <code>blackboard<\/code> to <code>black board<\/code> or <code>part-time<\/code> to <code>part time<\/code>)<\/li>\n<li>Converting numerals to words (or vice-versa)<\/li>\n<\/ul>\n<p>If there are other differences that you don\u2019t want to count as errors, you might consider additional normalizations. For example, some languages have multiple spellings for some words (such as <code>favorite<\/code> and <code>favourite<\/code>) or optional diacritics (such as <code>na\u00efve<\/code> vs. 
<code>naive<\/code>), and you may want to convert these to a single spelling before calculating WER. We also recommend removing filled pauses like <code>uh<\/code> and <code>um<\/code>, which are irrelevant for most uses of ASR, and therefore shouldn\u2019t be included in the WER calculation.<\/p>\n<p>A second, related issue is that WER by definition counts the number of whole word errors. Many tools define words as strings separated by spaces for this calculation, but not all writing systems use spaces to separate words. In this case, you may need to tokenize the text before calculating WER. Alternatively, for writing systems where a single character often represents a word (such as Chinese), you can calculate a <em>character error rate<\/em> instead of a word error rate, using the same procedure.<\/p>\n<h2>Six steps for performing an ASR evaluation<\/h2>\n<p>To evaluate an ASR service using WER, complete the following steps:<\/p>\n<ol>\n<li>Choose a small sample of recorded speech.<\/li>\n<li>Transcribe it carefully by hand to create reference transcripts.<\/li>\n<li>Run the audio sample through the ASR service.<\/li>\n<li>Create normalized ASR hypothesis transcripts.<\/li>\n<li>Calculate WER using an open-source tool.<\/li>\n<li>Make an assessment using the resulting measurement.<\/li>\n<\/ol>\n<h3>Choosing a test sample<\/h3>\n<p>Choosing a good sample of speech to evaluate is critical, and you should do this before you create any ASR transcripts in order to avoid biasing the results. You should think about the sample in terms of <em>utterances<\/em>. An utterance is a short, uninterrupted stretch of speech that one speaker produces without any silent pauses. 
The following are three example utterances:<\/p>\n<p><a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/Evaluating-Automatic-Speech-Recognition\/UtteranceA.wav\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16691\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/02\/UA.jpg\" alt=\"\" width=\"125\" height=\"125\"><\/a> <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/Evaluating-Automatic-Speech-Recognition\/UtteranceB.wav\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16692\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/02\/UB.jpg\" alt=\"\" width=\"125\" height=\"125\"><\/a> <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/Evaluating-Automatic-Speech-Recognition\/UtteranceC.wav\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16693\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/02\/UC.jpg\" alt=\"\" width=\"125\" height=\"125\"><\/a><\/p>\n<p>An utterance is sometimes one complete sentence, but people don\u2019t always talk in complete sentences\u2014they hesitate, start over, or jump between multiple thoughts within the same utterance. Utterances are often only one or two words long and are rarely more than 50 words. For the test sample, we recommend selecting utterances that are 25\u201350 words long. However, this is flexible and can be adjusted if your audio contains mostly short utterances, or if short utterances are especially important for your application.<\/p>\n<p>Your test sample should include at least 800 spoken utterances. 
Ideally, each utterance should be spoken by a different person, unless you plan to transcribe speech from only a few individuals. Choose utterances from representative portions of your audio. For example, if there is typically background traffic noise in half of your audio, then half of the utterances in your test sample should include traffic noise as well. If you need to extract utterances from long audio files, you can use a tool like <a href=\"https:\/\/www.audacityteam.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Audacity<\/a>.<\/p>\n<h3>Creating reference transcripts<\/h3>\n<p>The next step is to create reference transcripts by listening to each utterance in your test sample and writing down what the speaker said word-for-word. Creating these reference transcripts by hand can be time-consuming, but it\u2019s necessary for performing the evaluation. Write the transcript for each utterance on its own line in a plain text file named <code>reference.txt<\/code>, as shown below.<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still under warranty so i wanted to see if someone could come look at it\r\nno i checked everywhere the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet\r\ni tried to update my address on the on your web site but it just says error code 402 disabled account id after i filled out the form\r\n<\/code><\/pre>\n<\/div>\n<p>The reference transcripts are extremely literal, including when the speaker hesitates and restarts in the third utterance (<code>on the on your<\/code>). If the transcripts are in English, write them using all lowercase with no punctuation except for apostrophes, and in general be sure to pay attention to the text normalization issues that we discussed earlier. 
In this example, besides lowercasing and removing punctuation from the text, compound words have been normalized by spelling them as two words (<code>ice maker<\/code>, <code>web site<\/code>), the initialism <code>I.D.<\/code> has been spelled as a single lowercase word <code>id<\/code>, and the number <code>402<\/code> is spelled using numerals rather than the alphabet. By applying these same strategies to both the reference and the hypothesis transcripts, you can ensure that different spelling choices aren\u2019t counted as word errors.<\/p>\n<h3>Running the sample through the ASR service<\/h3>\n<p>Now you\u2019re ready to run the test sample through the ASR service. For instructions on doing this on the <a href=\"https:\/\/aws.amazon.com\/transcribe\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Transcribe<\/a> console, see <a href=\"https:\/\/aws.amazon.com\/getting-started\/hands-on\/create-audio-transcript-transcribe\/\" target=\"_blank\" rel=\"noopener noreferrer\">Create an Audio Transcript<\/a>. If you\u2019re running a large number of individual audio files, you may prefer to use the Amazon Transcribe developer API.<\/p>\n<h3>Creating ASR hypothesis transcripts<\/h3>\n<p>Take the hypothesis transcripts generated by the ASR service and paste them into a plain text file with one utterance per line. 
The order of the utterances must correspond exactly to the order in the reference transcript file that you created: if line 3 of your reference transcripts file has the reference for the utterance <code>pat went to the store<\/code>, then line 3 of your hypothesis transcripts file should have the ASR output for that same utterance.<\/p>\n<p>The following is the ASR output for the three utterances:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">Hi I'm calling about a refrigerator I bought from you The ice maker stopped working and it's still in the warranty so I wanted to see if someone could come look at it\r\nNo I checked everywhere in the mailbox The package room I asked my neighbor who sometimes gets my packages but it hasn't shown up yet\r\nI tried to update my address on the on your website but it just says error code 40 to Disabled Accounts idea after I filled out the form\r\n<\/code><\/pre>\n<\/div>\n<p>These transcripts aren\u2019t ready to use yet\u2014you need to normalize them first using the same normalization conventions that you used for the reference transcripts. First, lowercase the text and remove punctuation except apostrophes, because differences in case or punctuation aren\u2019t considered as errors for this evaluation. The word <code>website<\/code> should be normalized to <code>web site<\/code> to match the reference transcript. 
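<\/p>
<p>The normalization conventions described in this section can be sketched as a small Python helper. The regular expression and the compound-word mapping below are illustrative assumptions, not a complete rule set, so extend them to match the conventions you used in your reference transcripts:<\/p>

```python
import re

# Example spelling variants mapped to the reference convention.
# This mapping is illustrative; add entries for your own data.
COMPOUNDS = {'website': 'web site'}

def normalize(text):
    # Lowercase, strip punctuation except apostrophes, then apply
    # the compound-word mapping to each remaining word.
    text = text.lower()
    text = re.sub(r'[^a-z0-9\' ]+', ' ', text)
    words = [COMPOUNDS.get(w, w) for w in text.split()]
    return ' '.join(words)
```

<p>Applying the same helper to both the reference file and the ASR output keeps the two sides consistent.<\/p>
<p>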
The number is already spelled with numerals, and it looks like the initialism <code>I.D.<\/code> was transcribed incorrectly, so no need to do anything there.<\/p>\n<p>After the ASR outputs have been normalized, the final hypothesis transcripts look like the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still in the warranty so i wanted to see if someone could come look at it\r\nno i checked everywhere in the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet\r\ni tried to update my address on the on your web site but it just says error code 40 to disabled accounts idea after i filled out the form\r\n<\/code><\/pre>\n<\/div>\n<p>Save these transcripts to a plain text file named <code>hypothesis.txt<\/code>.<\/p>\n<h3>Calculating WER<\/h3>\n<p>Now you\u2019re ready to calculate WER by comparing the reference and hypothesis transcripts. This post uses the open-source <a href=\"https:\/\/github.com\/belambert\/asr-evaluation\" target=\"_blank\" rel=\"noopener noreferrer\">asr-evaluation<\/a> tool to calculate WER, but other tools such as <a href=\"https:\/\/github.com\/usnistgov\/SCTK\/\" target=\"_blank\" rel=\"noopener noreferrer\">SCTK<\/a> or <a href=\"https:\/\/github.com\/jitsi\/jiwer\" target=\"_blank\" rel=\"noopener noreferrer\">JiWER<\/a> are also available.<\/p>\n<p>Install the asr-evaluation tool (if you\u2019re using it) with <code>pip install asr-evaluation<\/code>, which makes the <code>wer<\/code> script available on the command line. 
Use the following command to compare the reference and hypothesis text files that you created:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">wer -i reference.txt hypothesis.txt<\/code><\/pre>\n<\/div>\n<p>The script prints something like the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">REF: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still <span>** UNDER<\/span> warranty so i wanted to see if someone could come look at it\r\nHYP: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still <span>IN THE<\/span>   warranty so i wanted to see if someone could come look at it\r\nSENTENCE 1\r\nCorrect          =  96.9%   31   (    32)\r\nErrors           =   6.2%    2   (    32)\r\nREF: no i checked everywhere <span>**<\/span> the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet\r\nHYP: no i checked everywhere <span>IN<\/span> the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet\r\nSENTENCE 2\r\nCorrect          = 100.0%   24   (    24)\r\nErrors           =   4.2%    1   (    24)\r\nREF: i tried to update my address on the on your web site but it just says error code <span>** 402<\/span> disabled ACCOUNT  ID   after i filled out the form\r\nHYP: i tried to update my address on the on your web site but it just says error code <span>40 TO<\/span>  disabled <span>ACCOUNTS IDEA<\/span> after i filled out the form\r\nSENTENCE 3\r\nCorrect          =  89.3%   25   (    28)\r\nErrors           =  14.3%    4   (    28)\r\nSentence count: 3\r\nWER:     8.333% (         7 \/         84)\r\nWRR:    95.238% (        80 \/         84)\r\nSER:   100.000% (         3 \/          3)\r\n<\/code><\/pre>\n<\/div>\n<p>If you want to calculate WER manually instead of using a tool, 
you can do so by calculating the Levenshtein edit distance between the reference and hypothesis transcript pairs divided by the total number of words in the reference transcripts. When you\u2019re calculating the Levenshtein edit distance between the reference and hypothesis, be sure to calculate word-level edits, rather than character-level edits, unless you\u2019re evaluating a written language where every character is a word.<\/p>\n<p>In the evaluation output above, you can see the alignment between each reference transcript <code>REF<\/code> and hypothesis transcript <code>HYP<\/code>. Errors are printed in uppercase, or using asterisks if a word was deleted or inserted. This output is useful if you want to re-count the number of errors and recalculate WER manually to exclude certain types of words and errors from your calculation. It\u2019s also useful to verify that the WER tool is counting errors correctly.<\/p>\n<p>At the end of the output, you can see the overall WER: 8.333%. Before you go further, skim through the transcript alignments that the <code>wer<\/code> script printed out. Check whether the references correspond to the correct hypotheses. Do the error alignments look reasonable? Are there any text normalization differences that are being counted as errors that shouldn\u2019t be?<\/p>\n<h3>Making an assessment<\/h3>\n<p>What should the WER be if you want good transcripts? The lower the WER, the more accurate the system. However, the WER threshold that determines whether an ASR system is suitable for your application ultimately depends on your needs, budget, and resources. You\u2019re now equipped to make an objective assessment using the best practices we shared, but only you can decide what error rate is acceptable.<\/p>\n<p>You may want to compare two ASR services to determine if one is significantly better than the other. If so, you should repeat the previous three steps for each service, using exactly the same test sample. 
Then, count how many utterances have a lower WER for the first service compared to the second service. If you\u2019re using <code>asr-evaluation<\/code>, the WER for each individual utterance is shown as the percentage of <code>Errors<\/code> below each utterance.<\/p>\n<p>If one service has a lower WER than the other for at least 429 of the 800 test utterances, you can conclude that this service provides better transcriptions of your audio. 429 represents a conventional threshold for statistical significance when using a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sign_test\" target=\"_blank\" rel=\"noopener noreferrer\">sign test<\/a> for this particular sample size. If your sample doesn\u2019t have exactly 800 utterances, you can manually calculate the sign test to decide if one service has a significantly lower WER than the other. This test assumes that you followed good practices and chose a representative sample of utterances.<\/p>\n<h2>Adapting the performance metric to your use case<\/h2>\n<p>Although this post uses the standard WER metric, the most important consideration when evaluating ASR services is to choose a performance metric that reflects your use case. WER is a great metric if the hypothesis transcripts will be corrected, and you want to minimize the number of words to correct. 
If this isn\u2019t your goal, you should carefully consider other metrics.<\/p>\n<p>For example, if your use case is keyword extraction and your goal is to see how often a specific set of target keywords occur in your audio, you might prefer to evaluate ASR transcripts using metrics such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\" target=\"_blank\" rel=\"noopener noreferrer\">precision, recall, or F1 score<\/a> for your keyword list, rather than WER.<\/p>\n<p>If you\u2019re creating automatic captions that won\u2019t be corrected, you might prefer to evaluate ASR systems in terms of how useful the captions are to viewers, rather than the minimum number of word errors. With this in mind, you can roughly divide English words into two categories:<\/p>\n<ul>\n<li>\n<strong>Content words<\/strong> \u2013 Verbs like \u201crun\u201d, \u201cwrite\u201d, and \u201cfind\u201d; nouns like \u201ccloud\u201d, \u201cbuilding\u201d, and \u201cidea\u201d; and modifiers like \u201ctall\u201d, \u201ccareful\u201d, and \u201cquickly\u201d<\/li>\n<li>\n<strong>Function words<\/strong> \u2013 Pronouns like \u201cit\u201d and \u201cthey\u201d; determiners like \u201cthe\u201d and \u201cthis\u201d; conjunctions like \u201cand\u201d, \u201cbut\u201d, and \u201cor\u201d; prepositions like \u201cof\u201d, \u201cin\u201d, and \u201cover\u201d; and several other kinds of words<\/li>\n<\/ul>\n<p>For creating uncorrected captions and extracting keywords, it\u2019s more important to transcribe content words correctly than function words. For these use cases, we recommend ignoring function words and any errors that don\u2019t involve content words in your calculation of WER. 
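<\/p>
<p>As a sketch of this idea, you can strip function words from both the reference and hypothesis transcripts before scoring, and then calculate WER on the words that remain. The function-word set below is a tiny illustrative subset, not a complete list:<\/p>

```python
# A small illustrative subset of English function words.
FUNCTION_WORDS = {'a', 'an', 'the', 'it', 'they', 'this',
                  'and', 'but', 'or', 'of', 'in', 'to', 'over'}

def drop_function_words(transcript):
    # Keep only the words that are not in the function-word set.
    return ' '.join(w for w in transcript.split() if w not in FUNCTION_WORDS)

ref = drop_function_words('well they went to the store to get sugar')
hyp = drop_function_words('they went to this tour kept shook or')
# ref is now 'well went store get sugar'
# hyp is now 'went tour kept shook'
# Score the filtered strings with your usual WER tool or script.
```

<p>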
There is no definite list of function words, but <a href=\"https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/Evaluating-Automatic-Speech-Recognition\/en-us-function-words.txt\" target=\"_blank\" rel=\"noopener noreferrer\">this file<\/a> provides one possible list for North American English.<\/p>\n<h2>Common mistakes to avoid<\/h2>\n<p>If you\u2019re comparing two ASR services, it\u2019s important to evaluate the ASR hypothesis transcript produced by each service using a true reference transcript that you create by hand, rather than comparing the two ASR transcripts to each other. Comparing ASR transcripts to each other lets you see how different the systems are, but won\u2019t give you any sense of which service is more accurate.<\/p>\n<p>We emphasized the importance of text normalization for calculating WER. When you\u2019re comparing two different ASR services, the services may offer different features, such as true-casing, punctuation, and number normalization. Therefore, the ASR output for two systems may be different even if both systems correctly recognized exactly the same words. This needs to be accounted for in your WER calculation, so you may need to apply different text normalization rules for each service to compare them fairly.<\/p>\n<p>Avoid informally eyeballing ASR transcripts to evaluate their quality. Your evaluation should be tailored to your needs, such as minimizing the number of corrections, maximizing caption usability, or counting keywords. An informal visual evaluation is sensitive to features that stand out from the text, like capitalization, punctuation, proper names, and numerals. 
However, if these features are less important than word accuracy for your use case\u2014such as if the transcripts will be used for automatic keyword extraction and never seen by actual people\u2014then an informal visual evaluation won\u2019t help you make the best decision.<\/p>\n<h2>Useful resources<\/h2>\n<p>The following are tools and open-source software that you may find useful:<\/p>\n<ul>\n<li>Tools for calculating WER:\n<ul>\n<li><a href=\"https:\/\/github.com\/belambert\/asr-evaluation\" target=\"_blank\" rel=\"noopener noreferrer\">asr-evaluation<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/usnistgov\/SCTK\/\" target=\"_blank\" rel=\"noopener noreferrer\">SCTK<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/jitsi\/jiwer\" target=\"_blank\" rel=\"noopener noreferrer\">JiWER<\/a><\/li>\n<\/ul>\n<\/li>\n<li>Tools for extracting utterances from audio files:\n<ul>\n<li><a href=\"https:\/\/www.audacityteam.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Audacity<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>This post discusses a few of the key elements needed to evaluate the performance aspect of an ASR service in terms of word accuracy. However, word accuracy is only one of the many dimensions that you need to evaluate when choosing a particular ASR service. It\u2019s critical that you include other parameters such as the ASR service\u2019s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16622 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/01\/scottseyfarth.jpg\" alt=\"\" width=\"100\" height=\"100\">Scott Seyfarth<\/strong> is a Data Scientist at AWS AI. He works on improving the Amazon Transcribe and Transcribe Medical services. Scott is also a phonetician and a linguist who has done research on Armenian, Javanese, and American English.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16317 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/24\/PaulZhao.jpg\" alt=\"\" width=\"101\" height=\"140\">Paul Zhao<\/strong> is a Product Manager at AWS AI. 
He manages Amazon Transcribe and Amazon Transcribe Medical. In his past life, Paul was a serial entrepreneur, having launched and operated two startups with successful exits.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/evaluating-an-automatic-speech-recognition-service\/<\/p>\n","protected":false},"author":0,"featured_media":363,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/362"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=362"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/362\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/363"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=362"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=362"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=362"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}