Issues in Speech Recognition

See also the Sprex Speech Technology FAQ.

Burned by high expectations

While still widely oversold, as it has been since the 1940s, speech recognition technology is finally capable of supporting certain practical, wide-use applications.

Why has speech technology always been oversold? Because it is so easy to develop unreasonably high expectations of the technology. People think "It'll write down anything I say", when, in fact, a given system, say a digit recognizer, may be unable to deal with any utterance that is not a seven-digit number. In the appropriate conversational context, exactly that class of utterances may be a perfectly fluent and natural part of a given interaction. If so, the person talking to the computer system will have the impression that when they say something, the system gets it right, and will conclude that it will get it right in other contexts as well. But building a different system, for example one which will let you spell your name out loud, is a task which may require a few person-months, perhaps even person-years, of Ph.D.-level speech technology engineering. How many Ph.D.s have to work for how many months to enable a given system to comfortably handle the next random class of utterances? The answer is often "Many".

It is completely natural to have high expectations of a system that lets you speak naturally in one context; one naturally thinks that the system will also handle your speech in any other context. This inference is quite sound if the listener is a human being, since you simply need to know whether your listener understands your language, and that can be pretty well established after one conversational turn. But the same inference does not work with speech recognition software. So be careful. To make a system work you need to limit and clarify the users' expectations, and you need to build a speaker-independent grammar which has reasonably complete coverage of the range of utterances that are possible in each given context. These requirements are difficult, but not impossible, to satisfy.

What is actually possible?

Off-the-shelf technology available today can be used in carefully designed, appropriate applications to build effective and satisfying speech-based user interfaces.

That is, continuously spoken utterances (no pauses necessary) from unknown speakers (no training required) can be recognized with high accuracy when those speakers talk naturally within the limited domain of a specific product or service interaction. Multi-thousand-word-vocabulary systems exist today and can be built using technology you can now buy (from Sprex!), providing grammar coverage of anything users are likely to say within that task.

So appropriately designed products and services can be made to understand what users say to them.

The market for speech recognition

Just as every person talks and listens, there is no intrinsic reason, beyond technology and price limitations, that every product should not also talk and listen, within the limits of its particular product functionality. This incontrovertible logic is not fully appreciated but inexorably leads to one conclusion: there is no limit to the potential range of speech applications.

Companies with market pull for speech-enhancements of their products can and should now start work to leap-frog their competition by putting speech recognition into their products. The opportunities are tremendous for those with the resources and the knowledge of how to apply current cutting edge speech technology to provide highly efficient and satisfying user- and customer-experiences with their products.

Principles of speech recognition based interface development

There are only a few sensible types of speech interfaces which work comfortably for users:
  • Learnable (if the grammar is tiny, users can memorize everything).
  • Visible (if the words you can say are visible on a computer screen, then users can read them out loud.)
  • Natural (you can say anything that makes sense to you).
A system of menus and popup sub-menus on a menu bar (i.e., a menu tree) is learnable and visible. One can easily convert a menu tree into a continuous, speaker-independent speech interface. This would be a rapid development task for Sprex to do for you.
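
As a rough illustration of the menu-tree conversion described above, the sketch below flattens a menu tree into the list of phrases a grammar would need to accept. The menu contents, the nested-dict representation, and the function name `menu_phrases` are assumptions made for this example; it does not show Sprex's actual grammar format.

```python
# Hypothetical menu tree: nested dicts, with None marking a leaf command.
MENU_TREE = {
    "file": {
        "open": None,
        "save": None,
        "print": {"preview": None, "setup": None},
    },
    "edit": {"cut": None, "copy": None, "paste": None},
}


def menu_phrases(tree, prefix=()):
    """Yield every root-to-leaf path in the menu tree as a spoken phrase."""
    for word, subtree in tree.items():
        path = prefix + (word,)
        if subtree is None:
            # Leaf: the full path is one utterance the grammar must accept.
            yield " ".join(path)
        else:
            # Submenu: keep descending, carrying the words spoken so far.
            yield from menu_phrases(subtree, path)


phrases = sorted(menu_phrases(MENU_TREE))
for phrase in phrases:
    print(phrase)
```

Because every phrase is also visible in the menus, users can read it aloud, which is what makes this kind of interface both learnable and visible.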

However, for the purpose of command and control on the desktop, the present methods are already highly efficient. Keyboard accelerators reduce File..Save to a sequence of two keystrokes. Compare the ergonomics of that against saying the words, checking the results, and fixing the occasional recognition error, and the keyboard clearly wins. You might instead want a set of highlighted words on a screen to be your visible active grammar. Sprex can provide a speech recognition module which, if you feed it a set of words or phrases, will make those available within the currently active grammar, feeding recognition results from that grammar back to your program.

Sprex can also develop a natural, task-dependent dialog system for you. Through an iterative custom engineering process, Sprex can develop a grammar covering the whole range of responses that people are likely to give in interacting with your application and mapping the output of the recognizer to the semantics of your application.

Sprex' recognizer has state of the art accuracy, so that coverage of the whole range of responses for a natural speech grammar is attainable in application-restricted tasks.

Sprex can build a custom speech interface for your application, and provide the recognition system as a software module returning the output your system needs.

This development process is a loop. First, we build an alpha stage system, which includes a grammar that covers all the things we expect people to say. Then we put that system in front of a bunch of users, and record three things: the waveform, the word sequence that was recognized, and the function that was carried out by the system in response.

As users use the system in the actual deployment environment, we can retrain all these components. We can improve the acoustic models so that they better represent the statistics of the true incoming audio data. We can refine the grammar so that it covers the things people actually say in addition to the things we expected them to say. And we can correct the response function which may or may not have always done the right thing in response to the words that were spoken. Finally, we loop back to the top, deploy the next level system, and keep refining it.

After a few rounds of this, and a few hundred users, we can be pretty confident that we have covered the statistical range of things people are likely to say, and the system will be ready for more robust deployment.
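
One way to picture the bookkeeping behind this loop is sketched below: each interaction record holds the three recorded items (waveform, recognized word sequence, function carried out), and a coverage check reports which things users actually said fall outside the current grammar. The `Interaction` record, the `transcript` field (a human transcription added during review), and the sample banking phrases are all assumptions for illustration, not a description of Sprex's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    waveform_path: str                # the recorded audio
    recognized: str                   # word sequence the recognizer produced
    action: str                       # function the system carried out
    transcript: Optional[str] = None  # assumption: human transcription added in review


def grammar_coverage(log, grammar):
    """Return (fraction of reviewed utterances in the grammar, out-of-grammar utterances)."""
    reviewed = [r for r in log if r.transcript is not None]
    covered = sum(1 for r in reviewed if r.transcript in grammar)
    missing = sorted({r.transcript for r in reviewed if r.transcript not in grammar})
    fraction = covered / len(reviewed) if reviewed else 0.0
    return fraction, missing


# Hypothetical alpha-stage grammar and a few logged interactions.
grammar = {"check balance", "transfer funds"}
log = [
    Interaction("u1.wav", "check balance", "balance", "check balance"),
    Interaction("u2.wav", "check balance", "balance", "what is my balance"),
    Interaction("u3.wav", "transfer funds", "transfer", "transfer funds"),
    Interaction("u4.wav", "transfer funds", "transfer"),  # not yet reviewed
]

coverage, missing = grammar_coverage(log, grammar)
print(f"coverage: {coverage:.0%}, add to grammar: {missing}")
```

The out-of-grammar list is the input to the next round of grammar refinement; the same records could also be used to check whether the action taken matched what was said.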

Any actual deployed speech recognition based system has been through this cycle of alpha deployment followed by iterative refinement of the system. If you have a product or service that requires this kind of recognition, you should expect that a non-trivial development project will be required.

If you have a requirement of this kind, contact tv at sprex dot com. Sprex can carry out development projects of this kind for you.

Hardware requirements

The embedded systems market has not really understood a key fact that cannot be avoided: continuous recognition requires lots of CPU power. While it is true that old-fashioned, isolated-word recognizers can run on slower machines, this is only because when you pause between words, you're telling the recognizer where the words start and stop. But in continuous speech, any word could potentially start and stop at any time, so the system has to search through and consider every possible start time and every possible end time for every possible word to be recognized, and find the sequence that fits the best. This is a huge task, so getting comfortably fast responses requires a much faster CPU.
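
To give a feel for the size of that search, the sketch below simply counts the word hypotheses implied by the description above: one per vocabulary word per known segment for isolated words, versus one per vocabulary word per possible (start, end) pair for continuous speech. The 100 frames per second rate, the 3-second utterance, and the 1000-word vocabulary are assumptions chosen for illustration; practical recognizers use dynamic programming and pruning to avoid scoring every span separately, so this shows the scale of the search space rather than an actual operation count.

```python
def isolated_word_hypotheses(num_words_spoken, vocab_size):
    """Pauses mark the word boundaries, so each segment is matched against each word once."""
    return num_words_spoken * vocab_size


def continuous_hypotheses(num_frames, vocab_size):
    """Any word may start at any frame and end at that frame or any later one."""
    spans = num_frames * (num_frames + 1) // 2  # number of (start, end) pairs
    return spans * vocab_size


FRAMES_PER_SECOND = 100  # assumed analysis frame rate
num_frames = 3 * FRAMES_PER_SECOND  # a 3-second utterance
vocab_size = 1000

isolated = isolated_word_hypotheses(5, vocab_size)
continuous = continuous_hypotheses(num_frames, vocab_size)
print(isolated, continuous, continuous // isolated)
```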

Sprex' recognizer, being built for continuous speech, prefers a fast computer to run on.

Perplexity and confusability: Constraints on Success

The key number from which state-of-the-art recognition performance values can be derived is this: 80 to 85% raw phoneme accuracy at 30 dB SNR. So if your grammar has a pair of word paths which differ in one phoneme only, then you should expect about one error in five when trying to recognize one of those word sequences. If a pair of paths differ in two phonemes, then you can expect about 1 error in 25, which is 96% accuracy, which isn't bad. But if you can make sure they all differ by three phonemes or more, then you should see 1 such error in 125, which is suddenly very reliable (given 30 dB SNR). In general, the higher the average branching factor (perplexity), the worse the confusability, since there are more words to be similar to one another. But even a very simple, small grammar cannot be expected to give reliable results if it includes things like "turn it on" and "turn it off". So design your grammar using words that differ by multiple phonemes, and you should have good results.
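
The rule of thumb above can be turned into a quick grammar check. The sketch below measures how many phonemes separate each pair of phrases (using Levenshtein edit distance as one reasonable way to count differing phonemes) and applies the one-in-five-per-phoneme estimate from the text. The phoneme transcriptions are illustrative ARPAbet-style spellings chosen for this example, and the estimate treats phoneme errors as independent, which is a simplification.

```python
from itertools import combinations


def phoneme_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # drop a phoneme from a
                           cur[j - 1] + 1,             # insert a phoneme from b
                           prev[j - 1] + (pa != pb)))  # substitute or match
        prev = cur
    return prev[-1]


def confusion_rate(distance, phoneme_error=0.2):
    """Rule of thumb from the text: 1 error in 5, 25, 125 for distance 1, 2, 3."""
    return phoneme_error ** distance


def closest_pairs(grammar):
    """Return (distance, phrase_a, phrase_b) tuples, most confusable pair first."""
    items = sorted(grammar.items())
    return sorted((phoneme_distance(pa, pb), a, b)
                  for (a, pa), (b, pb) in combinations(items, 2))


# Illustrative transcriptions (assumed, not taken from any specific dictionary).
GRAMMAR = {
    "turn it on": "T ER N IH T AO N".split(),
    "turn it off": "T ER N IH T AO F".split(),
    "power up": "P AW ER AH P".split(),
    "shut down": "SH AH T D AW N".split(),
}

for distance, a, b in closest_pairs(GRAMMAR):
    print(f"{a!r} vs {b!r}: {distance} phoneme(s), about {confusion_rate(distance):.1%} confusion")
```

In this toy grammar the "turn it on" / "turn it off" pair is the weak point; replacing it with a pair such as "power up" / "shut down" moves every pair to several phonemes apart.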

Network-based Recognition

The speech recognition system architecture that will win in a client/server, distributed application world (including any telephone- or internet-based speech recognition service) will have this kind of structure:

The client does lightweight DSP tasks only: it detects the presence or absence of speech, then computes an FFT, mel-frequency energy bands, and a cosine transform (resulting in the standard high-performance spectral parameters for speech recognition, known as Mel Frequency Cepstral Coefficients, or MFCCs), followed by 12-bit to 14-bit vector-quantization-based compression, which reduces the data to 1200 to 1400 bits per second to send through the network. This data is derived from clean, high-bandwidth audio, which ensures maximum speech recognition accuracy and is possible because the client-side device can have a high-quality A/D in it.

On the server side, the compressed data vectors are decompressed and fed into the recognizer, which uses those spectral parameters to determine the most likely word sequence.
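
The compression step and the bandwidth figures can be sketched as follows. The code assumes 100 feature vectors per second, which is the frame rate implied by 12 bits per vector giving 1200 bits per second, and it uses a toy 4-entry, 2-dimensional codebook in place of a real 4096-entry (12-bit) codebook over MFCC vectors. The function names and codebook values are hypothetical.

```python
def bitrate_bps(bits_per_vector, vectors_per_second=100):
    """Network bit rate when each feature vector is sent as one codebook index."""
    return bits_per_vector * vectors_per_second


def vq_encode(vector, codebook):
    """Client side: return the index of the nearest codeword (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def vq_decode(index, codebook):
    """Server side: look the codeword back up before feeding the recognizer."""
    return codebook[index]


# Toy codebook (assumption): 4 codewords means 2 bits per vector.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

frame = (0.9, 0.2)                    # stand-in for one MFCC vector
index = vq_encode(frame, CODEBOOK)    # only this small integer crosses the network
restored = vq_decode(index, CODEBOOK)

print(index, restored)
print(bitrate_bps(12), bitrate_bps(14))
```

The client only needs the codebook and a nearest-neighbor search, which is why it can stay lightweight, while the server receives an approximation of each spectral vector that is close enough for recognition.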

This architecture has the advantages of:

  • Lightweight client-side processing:
    • little memory is required on the client
    • integer-only CPUs are workable on the client
    • high-quality 16-bit A/D's can be used on the client
  • Extremely narrow bandwidth requirements: 1200 bps!
  • Server-side processing can be as complex as you want, for large vocabulary dictation tasks, and natural dialog systems conducting business transactions through speech.
This architecture is the one that will win for businesses providing distributed recognition-based services. For further information, for discussion of your business or application, and for licensing discussions, please contact info at sprex dot com.
Copyright © 1996-2005 Sprex, Inc. All rights reserved.
Date: August 27, 2008