How to recognize speech in a browser using the Web Speech API

Posted by Nunners on Sat, 09 Oct 2021 06:24:16 +0200


This article is based on the Japanese version of an article published by Phil Nash, developer evangelist at Twilio.

The Web Speech API has two functions: speech synthesis (text to speech) and speech recognition (speech to text). The previous article explained speech synthesis. This one explains how to recognize and transcribe speech in the browser using the API.

Recognizing voice commands from users can provide a more immersive interface than usual, and makes it easier to win over users who prefer voice operation. According to a 2018 report by Google, 27% of the world's online population uses voice search on mobile devices. With the browser speech recognition feature explained here, you can offer anything from basic voice search to interactive bots in your web application.

Let's learn how the API works in the next section and see what you can do.

What you need

To build the sample application and try the API for yourself, prepare the following tools.


This time we will create an application using HTML, CSS, and JavaScript. Create a new working directory and save the starter HTML and CSS in it. When you open the saved HTML file in your browser, you will see the same screen as in the screenshot.

In the next section, let's learn how to listen to and recognize speech in the browser.

Speech recognition API

Before adding speech recognition to the sample application, let's check that it is available using the browser's developer tools. Open the Chrome developer tools and enter the following code in the console.

```js
speechRecognition = new webkitSpeechRecognition();
speechRecognition.onresult = console.log;
speechRecognition.start();
```

When you run this code, Chrome requests permission to use the microphone. If the page is hosted on a web server, the browser remembers the permission. Allow microphone access and speak. When you stop speaking, a SpeechRecognitionEvent is logged to the console.

This is just three lines of code, but it does a lot of work. First, it creates an instance of the SpeechRecognition API (with the vendor prefix "webkit" at the beginning), then it tells the instance to log every result it receives from the speech recognition service. Finally, it tells the instance to start listening.

This example also relies on the default settings. For example, once the object receives a result, it stops listening, and you must call the start method again to continue transcribing. You also receive only the final result from the speech recognition service. Alternatively, you can enable continuous recognition and receive interim results while you are still speaking. These settings are explained later.
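The "call start again" behaviour can be automated by listening for the recognizer's end event. The following is a minimal sketch rather than part of the sample application: the function name keepListening and the shouldContinue callback are hypothetical, and the recognition argument is assumed to be any object with addEventListener() and start(), such as a SpeechRecognition instance.

```javascript
// Sketch: restart recognition each time the service ends a session.
// `shouldContinue` is a hypothetical callback that returns false once
// the user has asked to stop, so the restart loop can be broken.
function keepListening(recognition, shouldContinue) {
  recognition.addEventListener("end", () => {
    if (shouldContinue()) {
      recognition.start();
    }
  });
}
```

In the console example above, this could be wired up with keepListening(speechRecognition, () => true) before calling start(), at the cost of never stopping until the page is closed.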

Let's look at the SpeechRecognitionEvent that was logged to the console. The most important property is results, which is a list of SpeechRecognitionResult objects. In my case, I said only one word before recognition stopped, so there is only one result object. Each SpeechRecognitionResult in turn contains a list of SpeechRecognitionAlternative objects. The first alternative in the list holds the transcript and the confidence of the recognition (0 to 1). By default only one alternative is returned, but you can configure the recognizer to return more from the speech recognition service. This is useful if you want to let users choose the option closest to what they actually said.
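To make the nesting concrete, here is a small sketch that flattens an event into plain objects. The function name listAlternatives and the hand-written fakeEvent are illustrative assumptions. A real event comes from the browser, and the number of alternatives per result is controlled by the recognizer's maxAlternatives property.

```javascript
// Sketch: turn event.results (a list of results, each a list of
// alternatives) into nested arrays of { transcript, confidence }.
// Array.from works on array-likes, so a real results list should work too.
function listAlternatives(event) {
  return Array.from(event.results, (result) =>
    Array.from(result, (alternative) => ({
      transcript: alternative.transcript,
      confidence: alternative.confidence,
    }))
  );
}

// Hand-written stand-in for a real SpeechRecognitionEvent (illustration only).
const fakeEvent = {
  results: [
    [
      { transcript: "hello", confidence: 0.92 },
      { transcript: "hallo", confidence: 0.41 },
    ],
  ],
};

const alternatives = listAlternatives(fakeEvent);
console.log(alternatives[0][0].transcript, alternatives[0][0].confidence);
```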

How does this work?

Calling this feature "in-browser" is not quite accurate. Chrome currently captures the audio and sends it to Google's servers to be converted to text. For this reason, only Chrome and some Chromium-based browsers support speech recognition at present.

Mozilla has built speech recognition support into Firefox. The feature is only available in Firefox Nightly with a flag enabled, because Mozilla is negotiating to use the Google Cloud Speech API. Mozilla is developing its own DeepSpeech engine, but because it wants to get support into browsers sooner, it decided to use Google's service in this way.

Speech recognition uses a server-side API, so users must be online to use it. Hopefully local, offline speech recognition will arrive soon, but for now this is a limitation.
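Because recognition depends on a remote service, a lost connection shows up as an error event on the recognizer. Below is a minimal sketch for turning the event's error code into a message. The function name describeRecognitionError and the message wording are invented for this illustration; the codes "network", "not-allowed", and "no-speech" are values the error property can take.

```javascript
// Sketch: map recognition error codes to user-facing text.
// The wording of each message is made up for this example.
const ERROR_MESSAGES = new Map([
  ["network", "Speech recognition needs a network connection."],
  ["not-allowed", "Microphone access was blocked."],
  ["no-speech", "No speech was detected."],
]);

function describeRecognitionError(code) {
  return ERROR_MESSAGES.get(code) ?? `Speech recognition failed (${code}).`;
}

console.log(describeRecognitionError("network"));
```

In a page, this would typically be called from an "error" listener on the recognition object, passing event.error.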

Using the starter code you downloaded earlier and the code from the developer tools, let's build a small application that transcribes the user's speech in real time.

Speech recognition in Web Applications

Open the HTML file you downloaded earlier and work between the <script> tags at the bottom. First, listen for the DOMContentLoaded event and get references to the HTML elements we will use.

```html
<script>
    window.addEventListener("DOMContentLoaded", () => {
        const button = document.getElementById("button");
        const result = document.getElementById("result");
        const main = document.getElementsByTagName("main")[0];
    });
</script>
```

Next, check whether the browser supports SpeechRecognition or webkitSpeechRecognition. If it supports neither, display a message explaining that we cannot continue.

```html
<script>
    window.addEventListener("DOMContentLoaded", () => {
        const button = document.getElementById("button");
        const result = document.getElementById("result");
        const main = document.getElementsByTagName("main")[0];
        const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
        if (typeof SpeechRecognition === "undefined") {
            button.remove();
            const message = document.getElementById("message");
            message.removeAttribute("hidden");
            message.setAttribute("aria-hidden", "false");
        } else {
            // good stuff to come here
        }
    });
</script>
```
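The same check can be pulled into a small helper so that it is easy to reuse and to test outside a browser. This is only a sketch: the name getSpeechRecognition and the scope parameter are assumptions made for illustration, not part of the sample application.

```javascript
// Sketch: return the recognition constructor from a window-like object,
// or null when neither the standard nor the prefixed name is present.
function getSpeechRecognition(scope) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// In a page you would pass `window`; globalThis also exists outside a browser.
const Recognition = getSpeechRecognition(globalThis);
console.log(Recognition ? "Speech recognition is available" : "Speech recognition is not supported");
```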

If SpeechRecognition is available, we can get ready to use it. Define a variable that tracks whether we are currently listening, and create an instance of the speech recognition object. We also define three functions: one to start, one to stop, and one to respond to new results.

```js
} else {
    let listening = false;
    const recognition = new SpeechRecognition();
    const start = () => {};
    const stop = () => {};
    const onResult = event => {};
}
```

The start function starts speech recognition and changes the button text. It also adds a class to the main element, which starts an animation showing that the page is listening. The stop function does the opposite.

```js
const start = () => {
    recognition.start();
    button.textContent = "Stop listening";
    main.classList.add("speaking");
};
const stop = () => {
    recognition.stop();
    button.textContent = "Start listening";
    main.classList.remove("speaking");
};
```

Then, when we receive speech recognition results, we render all of them to the page. In this example we do this with direct DOM manipulation: each SpeechRecognitionResult object described earlier is added as a paragraph to the result <div>. To show the difference between final and interim results, we add a CSS class to any result marked as final.

```js
const onResult = event => {
    result.innerHTML = "";
    for (const res of event.results) {
        const text = document.createTextNode(res[0].transcript);
        const p = document.createElement("p");
        if (res.isFinal) {
            p.classList.add("final");
        }
        p.appendChild(text);
        result.appendChild(p);
    }
};
```
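If you would rather keep the final and interim text as strings instead of writing straight to the DOM, the same loop can be expressed as a pure function. This is a sketch under that assumption: splitTranscripts is a hypothetical name, and fakeResults is a hand-made stand-in for event.results.

```javascript
// Sketch: split results into confirmed (final) and provisional (interim) text.
// Each result is expected to expose isFinal and a first alternative at [0].
function splitTranscripts(results) {
  let finalText = "";
  let interimText = "";
  for (const result of Array.from(results)) {
    const transcript = result[0].transcript;
    if (result.isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}

// Stand-in data: arrays of alternatives with an isFinal flag attached.
const fakeResults = [
  Object.assign([{ transcript: "hello world " }], { isFinal: true }),
  Object.assign([{ transcript: "how are" }], { isFinal: false }),
];

console.log(splitTranscripts(fakeResults));
```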

Before starting speech recognition, apply the settings this application uses. In this version, recognition does not end when the end of speech is detected, and results keep being recorded. That is, the transcription is displayed on the page continuously until the stop button is pressed. We also ask for interim results while you are speaking (much as you can with <Gather> and partialResultCallback in Twilio to recognize speech during a voice call). Finally, add the result listener.

```js
    const onResult = event => {
        // onResult code
    }
    recognition.continuous = true;
    recognition.interimResults = true;
    recognition.addEventListener("result", onResult);
}
```

Finally, add a listener to the button so that you can start and stop speech recognition.

```js
    const onResult = event => {
        // onResult code
    }
    recognition.continuous = true;
    recognition.interimResults = true;
    recognition.addEventListener("result", onResult);
    button.addEventListener("click", () => {
        listening ? stop() : start();
        listening = !listening;
    });
}
```

Reload the browser and try it out.

Say a few sentences and check that they appear on the page. The recognizer is good at picking out words, but not very good at punctuation. If you wanted to use it for dictation, for example, there would be a bit more work to do.
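As one example of that extra work, a final transcript could be given a capital letter and a closing period before it is displayed. This is a deliberately naive sketch: the function name tidySentence is hypothetical, and real dictation would need far more than this (commas, question detection, and so on).

```javascript
// Sketch: capitalize the first character of a transcript and make sure
// it ends with sentence punctuation. Empty input yields an empty string.
function tidySentence(transcript) {
  const text = transcript.trim();
  if (text === "") {
    return "";
  }
  const capitalized = text.charAt(0).toUpperCase() + text.slice(1);
  return /[.!?]$/.test(capitalized) ? capitalized : capitalized + ".";
}

console.log(tidySentence(" hello world ")); // prints "Hello world."
```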

Talk to your browser

In this article, I explained speech recognition in the browser. In the previous article, I covered how to make the browser read text aloud. If you combine the two with an assistant built using Twilio Autopilot, you will be able to build some interesting projects.

If you want to try out the example from this article, you can view it on Glitch. If you need the source code, you can get it from the web assistant repository on GitHub.

 
