Sprex

Introduction
The Problem

Users of speech recognition systems often say things the system doesn't know about. That is, you may say a word that isn't in its dictionary, so it cannot recognize what you say. This is a serious problem when you really need to re-use a word the system doesn't know.

A User-unfriendly Solution

So some solution is needed to let users add new words to the dictionary. Most systems allow users to add a new word by typing in its pronunciation using a phonetic alphabet. However, very few users know how to spell using any given phonetic alphabet, so this imposes an obstacle, a learning curve: you must learn to sound out and spell words using a phoneme alphabet, which has a separate symbol for every distinctive sound in the language.

A User-friendly Solution

It would be nice to have another approach where you don't have to learn this new alphabet, but instead can just say the word and have the system generate the pronunciation automatically from what you said. This technical note outlines just this kind of improved solution.

How it works

Overview

To generate new words for a pronouncing dictionary used by a speech recognizer, a separate, special speech recognition subsystem is required. This subsystem would let you, the user, type in the new word and then say it, maybe more than once. It would then run recognition on what you said against a "phone-star" grammar (namely, a grammar composed of a list of all the phones in the language, arranged in a loop, so that any phone can be followed by any other phone, ad infinitum).

The result of speech recognition using such a grammar is a phone transcription of what you just said, perhaps even a weighted N-Best lattice of phone transcriptions.
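To make the "phone-star" idea concrete, here is a minimal sketch in Python of such a loop grammar as a one-state graph. The phone inventory below is a short, made-up subset chosen for illustration, and the function names are hypothetical; a real recognizer would attach acoustic models and weights to each arc.

```python
# Hypothetical, abbreviated phone inventory (an assumption for illustration);
# a real system would list every phone in the language.
PHONES = ["AA", "AE", "B", "D", "IY", "K", "S", "T"]


def phone_star_grammar(phones):
    """Build a one-state loop: from state 0, every phone leads back to state 0."""
    return {0: {phone: 0 for phone in phones}}


def accepts(grammar, sequence, start=0):
    """Follow one arc per phone; reject as soon as an arc is missing."""
    state = start
    for phone in sequence:
        arcs = grammar[state]
        if phone not in arcs:
            return False
        state = arcs[phone]
    return True


grammar = phone_star_grammar(PHONES)
print(accepts(grammar, ["K", "AE", "T"]))  # any sequence of known phones is allowed
print(accepts(grammar, ["K", "ZZ", "T"]))  # a symbol outside the inventory is rejected
```

Because every phone can follow every other phone, the grammar places no constraint on the transcription at all, which is why the recognizer has to choose among the full inventory at every step.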
A fancy version of the program would then use text-to-speech conversion to play back what it thought you said, so you could accept it as correct or pick another of the N-Best alternatives.

Accuracy problems

A major issue with such systems is that they are not 100% accurate. The current state of the art on this recognition task, in fact, is about 85% phone accuracy. That is, *no-one* can do any better than 85%, no matter how much money or time or speech-Ph.D.-years they throw at the problem. For this reason, fast and convenient error correction methods are an important part of the interface.

Minimizing the accuracy problems

There is another way to reduce the difficulty of the problem, however. In general, speech recognition is much easier when you have fewer alternatives to distinguish, and much harder when the alternatives are more similar or confusable. So an alternative approach is to generate a range of possible pronunciations from the spelling, perhaps using best-matching substrings taken from a pronouncing dictionary, and use that pronunciation lattice instead of the phone-star grammar. This approach gives a phone perplexity of perhaps 3 to 5, instead of more like 20, which makes the task much easier. However, these sounds, representing different possible ways of pronouncing the same letters, are likely to be the most similar sounds possible, which makes them much more confusable, so the improvement is probably not as large as we would have hoped.

In short, using spelling-constrained pronunciation lattices instead of a phone-star grammar will let you be more accurate, but it also requires more engineering effort, since you need an inclusive letter-to-sound system as an additional part of the system.

The API

The form of the Application Programmer's Interface to this subsystem is rather unclear. But at some point the program should know that it needs to add a word to the dictionary.
Then it should have some means of prompting for a text string representing the spelling of the word. Next, it should tell the subsystem to take that spelling and generate a pronunciation lattice for it. Then it should tell the subsystem to start recognizing in whatever way the programmer wants the recognition to work (synchronous, asynchronous, etc.), and eventually to time out with a result. Given the result, the subsystem should export it as a lattice, or as a list or menu of pronunciation possibilities, ordered from most to least likely, down to some confidence threshold level. There should also be a text-to-speech system available which uses the same pronunciation alphabet (or one to which one can translate).

These requirements could be boiled down to
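As an illustration only, the workflow described in this section (take a spelling, generate candidate pronunciations, score them against what was said, and export a ranked list down to a confidence threshold) might be sketched in Python as follows. Every name here is hypothetical and not part of any Sprex interface; the letter-to-sound step is a made-up lookup table, the recognizer is replaced by a stub scoring function, and confidences are assumed to lie between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Phones = Tuple[str, ...]


@dataclass
class Candidate:
    phones: Phones
    confidence: float  # assumed to lie in [0, 1]; higher means more likely


def spelling_to_candidates(spelling: str) -> List[Phones]:
    """Toy stand-in for a letter-to-sound component (made-up table, not real data)."""
    toy_table = {
        "sprex": [
            ("S", "P", "R", "IY", "K", "S"),
            ("S", "P", "R", "EH", "K", "S"),
        ],
    }
    return toy_table.get(spelling.lower(), [])


def add_word(
    spelling: str,
    score: Callable[[Phones], float],
    threshold: float = 0.2,
) -> List[Candidate]:
    """Score each candidate, drop those below the threshold, rank best first."""
    scored = [Candidate(p, score(p)) for p in spelling_to_candidates(spelling)]
    kept = [c for c in scored if c.confidence >= threshold]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)


def fake_score(phones: Phones) -> float:
    """Stub for the recognizer: pretends the speaker said the 'EH' variant."""
    return 0.7 if "EH" in phones else 0.3


ranked = add_word("Sprex", fake_score)
for candidate in ranked:
    print(" ".join(candidate.phones), candidate.confidence)
```

In an interactive program, the ranked list would feed the menu of alternatives, and each entry could be passed to a text-to-speech component so the user can hear it before accepting it.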
Copyright © 1996-2005
Sprex, Inc.
All rights reserved.