Other topics (FAQs)

FAQ

Problem #1: How do you recommend doing phonetic labeling?
Solution: We recommend Phonolyze.

Problem #2: What about Large Vocabulary Recognition (LVR)?
Solution: LVR technology is practical for transcription tasks. It is much less practical for tasks where the speaker's words must be translated into their meaning in terms that the system understands and then into actions that the system should carry out. In large vocabulary tasks, pretending you can do this with the reliability expected of commercial applications is like pretending you have solved the AI problem. I'll be polite and simply say, keep your expectations low.

Problem #3: What kind of applications are practical?
Solution: For realistic applications, particularly commercial applications where there is some expected minimum level of acceptable performance, the translation of word sequences into actions must be done by a human developer. It is a form of software development, in which a grammar is written along with the actions implied by the words in the grammar.
ANSR's gxc tool enables the development of this kind of practical application.

Thus we recommend breaking the vocabulary into distinct (task-oriented) medium-vocabulary grammars of up to 4000 words each and using ANSR.

Problem #4: How do I start developing a grammar?
Solution: We recommend writing a grammar using the ANSR grammar formalism, converting it into a word lattice using gxc, adding a dictionary entry for each of the words in the lattice, and then running a demo recognizer with your grammar. Each grammar is represented as a word lattice, with development done manually based on the needs of the dialog system.
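
As a rough illustration of what "a grammar represented as a word lattice, plus a dictionary entry for every word" means, here is a minimal Python sketch. The lattice layout, the node names, and the phone spellings are all hypothetical; this is not the ANSR grammar formalism or the output format of gxc.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lattice: node -> list of (word, next_node) arcs.
# This layout is an assumption for illustration, not gxc's real output.
LATTICE: Dict[str, List[Tuple[str, str]]] = {
    "start": [("call", "who"), ("dial", "who")],
    "who": [("home", "end"), ("office", "end")],
    "end": [],
}

# Hypothetical pronunciation dictionary: word -> phone sequence.
DICTIONARY: Dict[str, List[str]] = {
    "call": ["k", "ao", "l"],
    "dial": ["d", "ay", "ax", "l"],
    "home": ["hh", "ow", "m"],
    "office": ["ao", "f", "ih", "s"],
}


def sentences(lattice, node="start", final="end"):
    """Enumerate every word sequence on a path from node to final.

    Assumes the lattice is acyclic.
    """
    if node == final:
        return [[]]
    result = []
    for word, nxt in lattice[node]:
        for tail in sentences(lattice, nxt, final):
            result.append([word] + tail)
    return result


def missing_words(lattice, dictionary):
    """Return, sorted, the lattice words that have no dictionary entry."""
    words = {word for arcs in lattice.values() for word, _ in arcs}
    return sorted(words - set(dictionary))


if __name__ == "__main__":
    for s in sentences(LATTICE):
        print(" ".join(s))
    print("missing from dictionary:", missing_words(LATTICE, DICTIONARY))
```

Running a check like `missing_words` before starting the recognizer catches lattice words that were never given a pronunciation.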

Problem #5: What about performance analysis?
Solution: Performance analysis can mean anything from speech recognition word accuracy rates to user satisfaction levels to network throughput among distributed processes in a scalable system operating under something like ANSR's STARFOG protocol. Please be specific.

Problem #6: How do I optimize accuracy?
Solution: ANSR makes use of HTK configuration parameters for controlling the "beam search" widths which expand or reduce the search space. Wider beams mean slower, longer searches, which are more likely to capture the accurate transcription. So there is a tradeoff, given a model set and test data, between accuracy and speed.

To optimize this tradeoff, one can write a test script that adjusts the beam width and runs a recognition pass on a set of pre-recorded files, keeping track of both the time for each pass and the word error counts. Run over a range of beam widths, this test will allow you to determine the optimal setting, where something acceptably close to the best level of accuracy is attained while significantly reducing the time requirement. ANSR supports such scripts through file-based operation of the sprecd server on lists of files.
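
The sweep described above might be sketched in Python as follows. The `recognize` callable is a hypothetical stand-in for whatever drives a recognition pass over a file list and returns the total word error count; how sprecd is actually invoked, and how errors are scored, is not shown here and is left as an assumption.

```python
import time
from typing import Callable, List, NamedTuple, Sequence


class PassResult(NamedTuple):
    beam: float      # beam width used for this pass
    errors: int      # total word errors over the test files
    seconds: float   # wall-clock time for the pass


def sweep(recognize: Callable[[float, Sequence[str]], int],
          beams: Sequence[float],
          files: Sequence[str]) -> List[PassResult]:
    """Run one recognition pass per beam width, timing each pass.

    `recognize` is a hypothetical hook: it takes a beam width and a list
    of files and returns the word error count for that pass.
    """
    results = []
    for beam in beams:
        start = time.perf_counter()
        errors = recognize(beam, files)
        results.append(PassResult(beam, errors, time.perf_counter() - start))
    return results


def pick(results: Sequence[PassResult], tolerance: int = 0) -> PassResult:
    """Pick the fastest pass whose error count is within `tolerance`
    word errors of the best error count seen in the sweep."""
    best = min(r.errors for r in results)
    acceptable = [r for r in results if r.errors <= best + tolerance]
    return min(acceptable, key=lambda r: r.seconds)
```

For example, `pick(sweep(my_recognize, [50, 100, 200, 400], files), tolerance=2)` would return the fastest setting that gives up at most two word errors relative to the most accurate pass, where `my_recognize` and the beam values are placeholders.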

Problem #7: How can grammar errors be fixed?
Solution: There are various kinds of grammar errors.

Problem #8: Some Precious Wisdom:
Solution: Many ambitious system integrators imagine that they can use a high-complexity (LVR) grammar for their application (why not, they think, after all, it will include all the simpler grammars as a subset, then we don't have to worry about the grammar. Aren't I smart!). And thus they buy into free or inexpensive Dragon-Dictate style recognition engines (available from research lab organizations such as those at IBM or Microsoft, or smaller companies that actually have failed, under the weight of their R&D costs, like Dragon). And they focus their efforts on trying to build a general-purpose semantic-interpretation system to process the recognition results. This is tantamount to declaring that you will solve the AI problem within your short- and medium-term product development cycle. Ambitious, fun, you get to think how smart you are, for a while, but this is truly impractical. After running to the end of your budget rope, you will be snapped to heel and scolded, if not fired, and will have wasted someone's precious funding, perhaps a lot of it. We have seen this happen over and over again.

Instead, think very carefully about what particular tasks you want the system to perform, and write down, in novelistic dialog form, a conversation flow containing a number of sample turns by both the system and the user. Then consider whether the grammar needed at each turn in the conversation can be constrained and limited to an achievable task with high accuracy, and whether the user's expectations can be guided so that they keep within the grammar. If you can build this kind of dialog flow, with confidence that users can be kept within its bounds, then a practical speech recognition based dialog system can confidently be built, within the constraints of a short- or medium-term product development cycle and within a budget.
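
One way to picture such a dialog flow is as a small table of turns, each with its own prompt and its own tightly limited grammar. The Python sketch below is only an illustration under that assumption; the turn names, prompts, and phrases are invented, and a real system would use recognizer grammars rather than exact string matching.

```python
from typing import Dict, NamedTuple, Optional


class Turn(NamedTuple):
    prompt: str                        # what the system says; it steers the user
    grammar: Dict[str, Optional[str]]  # accepted phrase -> next turn (None = done)


# Hypothetical two-turn flow; every turn accepts only a handful of phrases.
DIALOG: Dict[str, Turn] = {
    "ask_task": Turn(
        "Say 'check balance' or 'transfer funds'.",
        {"check balance": "ask_account", "transfer funds": "ask_account"},
    ),
    "ask_account": Turn(
        "Say 'checking' or 'savings'.",
        {"checking": None, "savings": None},
    ),
}


def step(state: str, utterance: str) -> Optional[str]:
    """Return the next turn name, or None when the dialog is finished.

    An out-of-grammar utterance keeps the dialog in the same turn, so the
    system can repeat the prompt and pull the user back inside the grammar.
    """
    return DIALOG[state].grammar.get(utterance, state)
```

Because each prompt names the phrases its grammar accepts, the user's expectations are guided at every turn, which is the property the paragraph above asks you to check for.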

Problem #9: Does "monophone" mean a phonetic symbol in HTK?
Solution: If your HMM set is made up of monophone models, then there is one phone per HMM. Biphone and triphone HMMs are models of single phones in the (string-adjacent) context of another one or two phones, respectively. So there is also a correspondence between phone symbols and biphones or triphones.
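
To make the correspondence concrete, the following Python sketch expands a word's phone sequence into context-dependent model names, using the `left-phone+right` naming convention that HTK uses for triphones. The function name and the example phones are illustrative assumptions; phones at the edges of the sequence have only one neighbor, so in this word-internal sketch they come out as biphones.

```python
from typing import List


def to_context_models(phones: List[str]) -> List[str]:
    """Map each phone to a model name of the form left-phone+right.

    One model is produced per phone, so the output always has the same
    length as the input: each name still stands for a single phone, just
    in a particular context.
    """
    names = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        names.append(left + phone + right)
    return names


if __name__ == "__main__":
    print(to_context_models(["k", "ae", "t"]))
```

With the monophone sequence `k ae t` as input, the middle phone becomes the triphone `k-ae+t`, while the first and last phones become the biphones `k+ae` and `ae-t`.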

These terms are different in meaning but refer to things that are often confused.

    1) phonetic unit (phonetic symbol)
    2) phonemic unit (phonemic or phonological symbol)
    3) monophone
    4) biphone
    5) triphone

"Phone" is used so as to stay ambiguous between (1) and (2), in case you have a linguist arguing with you. Good phone sets for speech recognition purposes are always sets of phonemes, not sets of phonetic units. Mono-, bi-, and tri-phones are HMMs, that is, statistical models, not units of human language that reside in human minds or brains or behavior.

Copyright © 1996-2005 Sprex, Inc. All rights reserved.
Date: August 27, 2008