Garbage modeling

On adding penalties to garbage models

When a phone loop is used as a garbage (out-of-grammar speech) model, it needs a penalty in the lattice; otherwise the recognizer may assign all speech to it. All speech is, of course, composed of phone sequences, and an unconstrained phone sequence will fit an utterance at least as well as, and normally better than, a constrained subset of all possible phone sequences. Through the phone loop the recognizer can find not just within-grammar phone sequences but any phone sequence whatsoever. Since most utterances contain some phone-level deletion or modification, the recognizer will tend to prefer transcribing them with the phone loop rather than with the expected part of the grammar, which doesn't allow for deletions and modifications. To solve this problem, a penalty is attached to the entry link of the phone loop. The recognizer will then avoid aligning speech to the phone loop unless the fit there is so much better than in the rest of the grammar that it overcomes the penalty. The penalty quantifies how much better the phone loop must fit than the rest of the grammar before the recognizer can return a garbage result. This is the basic idea.
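
The decision rule described above can be sketched numerically. This is only an illustration, not the recognizer's actual code: the function name `choose_path` and the score values are made up, and the penalty is assumed to be a log-domain score added once on entry to the phone loop.

```python
def choose_path(grammar_score: float, phone_loop_score: float,
                entry_penalty: float) -> str:
    """Return which path wins once the entry penalty is applied.

    Scores are log likelihoods (higher is better). entry_penalty is a
    negative number such as -50.0. All values are hypothetical.
    """
    penalized_loop_score = phone_loop_score + entry_penalty
    return "garbage" if penalized_loop_score > grammar_score else "grammar"


# Without a penalty, a slightly better phone-loop fit steals the utterance.
print(choose_path(-1000.0, -980.0, 0.0))
# With a penalty of -50, the phone loop must beat the grammar by more than 50.
print(choose_path(-1000.0, -980.0, -50.0))
print(choose_path(-1000.0, -900.0, -50.0))
```

In this sketch the penalty acts as a threshold on the score difference between the two paths, which is the sense in which it "quantifies how much better" the phone loop has to be.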

An important refinement of the basic idea comes when we notice that long utterances do not show much effect from this penalty. The single entry penalty is diluted over the many frames of a longer utterance, so that even an in-grammar long utterance may still be fitted better by the phone loop, because of the fit preference for unconstrained phone recognition discussed above. Consequently, we add a loop-back penalty, which effectively increases the total penalty for longer utterances, so that this dilution does not occur. The recognizer is then forced to use within-grammar phone sequences to model a longer in-grammar utterance, because the accumulated penalties of repeatedly passing through the loop-back link are so costly that the phone loop is avoided.
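
The dilution effect, and how a loop-back penalty counters it, can be seen in a small calculation. The sketch below rests on loud assumptions that are not in the original text: the phone loop gains a fixed log-score advantage per frame over the grammar path, each phone lasts a fixed number of frames, and the loop-back link is traversed once between consecutive phones. The function name and all numbers are hypothetical.

```python
def phone_loop_margin(n_frames: int, gain_per_frame: float,
                      entry_penalty: float, loopback_penalty: float = 0.0,
                      frames_per_phone: int = 8) -> float:
    """Net log-score advantage of the phone loop over the grammar path.

    A positive result means the phone loop (garbage) wins.
    """
    n_phones = max(1, n_frames // frames_per_phone)
    raw_gain = n_frames * gain_per_frame
    # One entry penalty, plus one loop-back penalty per return to the loop start.
    total_penalty = entry_penalty + loopback_penalty * (n_phones - 1)
    return raw_gain + total_penalty


# Entry penalty only: a short utterance is protected, a long one is not.
print(phone_loop_margin(40, 0.5, -50.0))         # negative: grammar wins
print(phone_loop_margin(400, 0.5, -50.0))        # positive: garbage wins
# Adding a loop-back penalty makes the total penalty grow with length.
print(phone_loop_margin(400, 0.5, -50.0, -5.0))  # negative again
```

Because the loop-back penalty is paid once per phone, the total penalty scales with utterance length in the same way the phone loop's raw advantage does.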

Speech recognizers often confront word sequences that are not part of the specified grammar: garbage speech.

The way speech recognition works is that the incoming audio stream is
compared to all the possible paths through the grammar network and
words composing the top-scoring path are returned as the recognition
result.  

If garbage speech is the input, then this method will still find the
best fitting path, which will be some word sequence in the grammar
-- which is different from the garbage word sequence!  

Therefore it is convenient to have a generic speech model, or "garbage
model" which represents an alternative path within the grammar.  If
incoming speech matches a path through the garbage model best, then
the system can infer that the speech does not fit the (remainder of)
the grammar, and take an appropriate action (e.g., ignore the input,
or give feedback asking the user to say something in the grammar).

The following high-level grammar allows either in-grammar or
out-of-grammar speech to be recognized:

	( $GARBAGE | $GRAMMAR )

Here "|" means "or". 

There are several ways of building a generic speech model, $GARBAGE in
the above grammar.  One that I understand pretty well is a 

	"penalized phone* grammar".  

"*" here means one or more instances in sequence, and "phone" here
means any phoneme in the language.  Thus a "phone* grammar" is a
grammar which allows any number of any of the phonemes in the
language, in any sequence.  Since every utterance in the language is
composed of a sequence of phones, a phone* grammar is a grammar which
will fit any utterance in the language. In fact, a phone* grammar is
so good a model of general speech that it usually fits
in-grammar, non-garbage utterances even better than the regular
word-network grammar!  (The word-network grammar is itself decomposed
into a restricted set of phones in a restricted set of orders, while
phone* allows anything in any order, so an in-grammar utterance will
fit the phone* grammar better whenever any substitute phone fits better,
since the alternatives are all there in the phone* grammar.)
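
The argument in parentheses can be illustrated with a toy example. It is a simplification under stated assumptions: the segmentation into three phone-sized segments is fixed in advance, the per-segment scores are invented, and the grammar contains a single hypothetical word with one pronunciation.

```python
# Hypothetical acoustic log scores for each candidate phone in each segment.
segment_scores = [
    {"k": -2.0, "g": -3.0, "t": -6.0},
    {"ae": -1.5, "eh": -1.0, "ah": -4.0},
    {"t": -2.5, "d": -2.0, "k": -7.0},
]


def sequence_score(phones):
    """Sum the per-segment scores of one phone sequence."""
    return sum(seg[p] for seg, p in zip(segment_scores, phones))


# Word-network grammar: only the phone sequences it spells out are allowed.
grammar_pronunciations = [["k", "ae", "t"]]
grammar_best = max(sequence_score(p) for p in grammar_pronunciations)

# phone* grammar: free choice of the best-scoring phone in every segment.
loop_phones = [max(seg, key=seg.get) for seg in segment_scores]
loop_best = sequence_score(loop_phones)

print(grammar_best, loop_phones, loop_best)
```

Here the speaker's vowel and final consonant happen to score slightly better as substitute phones, so the unconstrained sequence outscores the in-grammar one. The phone* score can never be lower than the grammar score, because the grammar's sequences are among the choices available to it.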

So an unpenalized phone* garbage model will nearly always win: the best
fitting path through the network will go through the garbage
model, even for in-grammar utterances.

For this reason, the phone* grammar is augmented by an entrance
penalty.  Paths that extend into this garbage model are penalized
severely, so that garbage is recognized only when the regular
word-network grammar fits very poorly.

A sample dictionary text file generated by hand and a lattice text
file generated by grapHvite's NetBuilder (and augmented with the
penalty score) are appended.  The dictionary is required because
networks must be specified in terms of "words" rather than in terms of
phones, so in this dictionary each word is composed of a single phone.
The lattice is simply the looping OR of all the phones (including
_SILence), with an initial guesstimate penalty of -50 attached to the
entry link of the lattice.
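
Because each dictionary "word" is just one phone with an underscore prefix, the word entries need not be typed by hand. The following is a minimal sketch that derives them from the phone set; the function name is hypothetical, and the spacing simply mimics the appended sample.

```python
# Phone set copied from the !PHONE-SET line of the appended dictionary.
PHONE_SET = (
    "aa ae ah ao aw ax ay b ch d dh eh en er ey f g hh ih iy jh k l m n ng "
    "oh ow oy p r s sh sil sp t th uh uw v w y z zh"
).split()


def garbage_dictionary_lines(phones):
    """Build one single-phone 'word' entry per phone, e.g. '_AA       aa'."""
    return [f"_{phone.upper()}       {phone}" for phone in phones]


if __name__ == "__main__":
    print("!PHONE-SET " + " ".join(PHONE_SET))
    print()
    print("\n".join(garbage_dictionary_lines(PHONE_SET)))
```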

Implementation with grapHvite/HAPI: Insert the appended lattice as a
sublattice into your recognition network; it will represent an
alternative path between initial and final nodes in your top-level
lattice.  Add the appended dictionary to your dictionary to produce a
combined dictionary.

-------- dictionary --------
!PHONE-SET aa ae ah ao aw ax ay b ch d dh eh en er ey f g hh ih iy jh k l m n ng oh ow oy p r s sh sil sp t th uh uw v w y z zh

_AA       aa
_AE       ae
_AH       ah
_AO       ao
_AW       aw
_AX       ax
_AY       ay
_B       b
_CH       ch
_D       d
_DH       dh
_EH       eh
_EN       en
_ER       er
_EY       ey
_F       f
_G       g
_HH       hh
_IH       ih
_IY       iy
_JH       jh
_K       k
_L       l
_M       m
_N       n
_NG       ng
_OH       oh
_OW       ow
_OY       oy
_P       p
_R       r
_S       s
_SH       sh
_SIL       sil
_SP       sp
_T       t
_TH       th
_UH       uh
_UW       uw
_V       v
_W       w
_Y       y
_Z       z
_ZH       zh
-------- garbage.lat --------
#
# Lattice file generated by grapHvite Version 1.0.1
#
# Main Lattice.
#
# Grammar summary.
N=52 L=94
# Node definitions.
I=0 W=!NULL x=15 y=205 
I=1 W=!NULL x=95 y=205 
I=2 W=!NULL x=465 y=255 
I=3 W=!NULL x=505 y=115 
I=4 W=_AA x=185 y=35 
I=5 W=_AE x=185 y=65 
I=6 W=_AH x=185 y=95 
I=7 W=_AO x=185 y=125 
I=8 W=_AW x=185 y=155 
I=9 W=_AX x=185 y=185 
I=10 W=_AY x=185 y=215 
I=11 W=_B x=185 y=245 
I=12 W=_CH x=185 y=275 
I=13 W=_D x=185 y=305 
I=14 W=_DH x=185 y=335 
I=15 W=_EH x=185 y=365 
I=16 W=_EN x=185 y=395 
I=17 W=_ER x=185 y=425 
I=18 W=_EY x=295 y=35 
I=19 W=_F x=295 y=65 
I=20 W=_G x=295 y=95 
I=21 W=_HH x=295 y=125 
I=22 W=_IH x=295 y=155 
I=23 W=_IY x=295 y=185 
I=24 W=_JH x=295 y=215 
I=25 W=_K x=295 y=245 
I=26 W=_L x=295 y=275 
I=27 W=_M x=295 y=305 
I=28 W=_N x=295 y=335 
I=29 W=_NG x=295 y=365 
I=30 W=_OW x=295 y=395 
I=31 W=_OY x=295 y=425 
I=32 W=_P x=395 y=35 
I=33 W=_R x=395 y=65 
I=34 W=_S x=395 y=95 
I=35 W=_SH x=395 y=125 
I=36 W=_SIL x=395 y=155 
I=37 W=_SP x=395 y=185 
I=38 W=_T x=395 y=215 
I=39 W=_TH x=395 y=245 
I=40 W=_UH x=395 y=275 
I=41 W=_UW x=395 y=305 
I=42 W=_V x=395 y=335 
I=43 W=_W x=395 y=365 
I=44 W=_Y x=395 y=395 
I=45 W=_Z x=395 y=425 
I=46 W=_ZH x=395 y=455 
I=47 W=!NULL x=255 y=195 
I=48 W=!NULL x=245 y=255 
I=49 W=!NULL x=355 y=205 
I=50 W=!NULL x=345 y=265 
I=51 W=!NULL x=515 y=245 
# Link definitions.
J=0 S=0 E=1 l=-50.0
J=1 S=1 E=4 
J=2 S=1 E=5 
J=3 S=1 E=6 
J=4 S=1 E=7 
J=5 S=1 E=8 
J=6 S=1 E=9 
J=7 S=1 E=10 
J=8 S=1 E=11 
J=9 S=1 E=12 
J=10 S=1 E=13 
J=11 S=1 E=14 
J=12 S=1 E=15 
J=13 S=1 E=16 
J=14 S=1 E=17 
J=15 S=1 E=47 
J=16 S=2 E=3 
J=17 S=3 E=1 
J=18 S=3 E=51
J=19 S=4 E=48 
J=20 S=5 E=48 
J=21 S=6 E=48 
J=22 S=7 E=48 
J=23 S=8 E=48 
J=24 S=9 E=48 
J=25 S=10 E=48 
J=26 S=11 E=48 
J=27 S=12 E=48 
J=28 S=13 E=48 
J=29 S=14 E=48 
J=30 S=15 E=48 
J=31 S=16 E=48 
J=32 S=17 E=48 
J=33 S=18 E=50 
J=34 S=19 E=50 
J=35 S=20 E=50 
J=36 S=21 E=50 
J=37 S=22 E=50 
J=38 S=23 E=50 
J=39 S=24 E=50 
J=40 S=25 E=50 
J=41 S=26 E=50 
J=42 S=27 E=50 
J=43 S=28 E=50 
J=44 S=29 E=50 
J=45 S=30 E=50 
J=46 S=31 E=50 
J=47 S=32 E=2 
J=48 S=33 E=2 
J=49 S=34 E=2 
J=50 S=35 E=2 
J=51 S=36 E=2 
J=52 S=37 E=2 
J=53 S=38 E=2 
J=54 S=39 E=2 
J=55 S=40 E=2 
J=56 S=41 E=2 
J=57 S=42 E=2 
J=58 S=43 E=2 
J=59 S=44 E=2 
J=60 S=45 E=2 
J=61 S=46 E=2 
J=62 S=47 E=18 
J=63 S=47 E=19 
J=64 S=47 E=20 
J=65 S=47 E=21 
J=66 S=47 E=22 
J=67 S=47 E=23 
J=68 S=47 E=24 
J=69 S=47 E=25 
J=70 S=47 E=26 
J=71 S=47 E=27 
J=72 S=47 E=28 
J=73 S=47 E=29 
J=74 S=47 E=30 
J=75 S=47 E=31 
J=76 S=47 E=49 
J=77 S=48 E=50 
J=78 S=49 E=32 
J=79 S=49 E=33 
J=80 S=49 E=34 
J=81 S=49 E=35 
J=82 S=49 E=36 
J=83 S=49 E=37 
J=84 S=49 E=38 
J=85 S=49 E=39 
J=86 S=49 E=40 
J=87 S=49 E=41 
J=88 S=49 E=42 
J=89 S=49 E=43 
J=90 S=49 E=44 
J=91 S=49 E=45 
J=92 S=49 E=46 
J=93 S=50 E=2 
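
When editing a lattice file like the one above by hand (for example, to add or change a penalty), it can help to check that the node and link counts still match the summary line and to list which links carry a score. The sketch below is not part of grapHvite or HAPI. It assumes only the layout visible in the listing: `#` comment lines, whitespace-separated KEY=VALUE fields, `I=` node lines, `J=` link lines, and an optional `l=` score on a link. The toy lattice and its -5.0 loop-back value are hypothetical.

```python
def summarize_lattice(text):
    """Count nodes and links and collect the links that carry an l= score."""
    summary = {"declared": {}, "nodes": 0, "links": 0, "penalties": {}}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = dict(field.split("=", 1) for field in line.split())
        if "I" in fields:                      # node definition
            summary["nodes"] += 1
        elif "J" in fields:                    # link definition
            summary["links"] += 1
            if "l" in fields:
                key = (int(fields["S"]), int(fields["E"]))
                summary["penalties"][key] = float(fields["l"])
        else:                                  # summary line such as N=52 L=94
            for name in ("N", "L"):
                if name in fields:
                    summary["declared"][name] = int(fields[name])
    return summary


TOY_LATTICE = """\
# Toy one-phone loop, for illustration only.
N=4 L=4
I=0 W=!NULL
I=1 W=!NULL
I=2 W=_AA
I=3 W=!NULL
J=0 S=0 E=1 l=-50.0
J=1 S=1 E=2
J=2 S=2 E=3
J=3 S=3 E=1 l=-5.0
"""

print(summarize_lattice(TOY_LATTICE))
```

Comparing `declared` with the counted `nodes` and `links` catches a summary line that was not updated after an edit.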

Copyright © 1996-2005 Sprex, Inc. All rights reserved.
Date: July 25, 2008