Sprex
Developer's Frequently Asked Questions

Frequently Asked Questions

Problem: What is the response time on technical support questions?
Solution: Turnaround is fastest by email. Typically one business day. If we see it will take longer we'll try to give you an estimate of how long it will take.

Problem: An error is displayed: HAPINETINITIALISE: NETWORK BUILDING FAILED
Solution: This has been solved in the past by (a) adding the required "short pause" phone, sp, to the end of the phone sequence for each word pronunciation specified in your dictionary, and (b) making sure the dictionary has an end-of-line (C: '\n', ASCII: \012) at the end of it. Files shouldn't end with half a line, but with a whole line, duh, which you mark by adding an end-of-line character.

HAPINETINITIALISE: NETWORK BUILDING FAILED
in the past this has meant such things as:
  • add sp to the prons for each word, or
  • add \n to the end of the dictionary file
HAPILOADNETOBJECT: READ LATTICE FAILURE
in the past this has meant such things as:
  • add "SIL sil" to the dictionary
HAPICODERPREPARE: COULD NOT PREPARE CHANNEL
in the past this has meant such things as:
  • the SOURCEFORMAT isn't recognized; e.g., SOURCEFORMAT=WAV with a file that's not a wav (Microsoft RIFF) file.
Initialisation completes, and null partial results are returned for a second or so, but there is no content in the result:
in the past this has meant such things as:
  • Add initial and final SIL to the grammar, since the beginning and ending of each utterance is of course silence, so that needs to be in the grammar.
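
The first two fixes above (sp on every pronunciation, newline at the end of the file) can be checked mechanically before you load a dictionary. This is a minimal Python sketch and not part of HAPI: it assumes one entry per line with the word first and the phones after it, it assumes entries ending in sil are silence words that need no sp, and the name check_dictionary is made up for illustration.

```python
def check_dictionary(text):
    """Return a list of human-readable problems found in dictionary text."""
    problems = []
    # The file must end with a whole line, i.e. a final '\n' (octal 012).
    if text and not text.endswith("\n"):
        problems.append("file does not end with a newline")
    for lineno, line in enumerate(text.splitlines(), start=1):
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        last = tokens[-1]
        # Assumption: entries ending in 'sil' are silence words and need no 'sp'.
        if last not in ("sp", "sil"):
            problems.append(f"line {lineno}: '{tokens[0]}' does not end in sp")
    return problems


if __name__ == "__main__":
    sample = "YES y eh s sp\nNO n ow"  # made-up entries, missing sp and newline
    for problem in check_dictionary(sample):
        print(problem)
```

An empty list means neither of the two known causes is present; it says nothing about other reasons the network build might fail.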

Problem: classes.zip
Solution: This is an old Entropic-era problem. grapHvite 1.x needs to be patched with a newer version.

Problem: No HAPI license checked out for user [ankimo.bigshed.com].
Solution: See the Entropic FAQ's. The user issue has been quite a problem for many HAPI users. It seems that (typically) newer versions/distributions of Linux are using the glibc C libraries, as opposed to the libc C libraries of the earlier versions. grapHvite was built at a time when virtually all Linux distributions used the libc C libraries. It appears that grapHvite falls over on the glibc versions with this user error.

A multiple-license key doesn't help, because you can already have checkouts for root and kmarx simultaneously, and not just when su'd as root.

Making sure the relevant files are readable by that other UID doesn't help since they can all be 777.

"I presume the solution must come from rebuilding the product for glibc."

Problem: File permissions on installation.
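Solution: I've been rather brutal on my own (as-root) installations and ran "chmod -R 666" on the whole distribution, to allow write access into the lattices/ and dicts/ directories to non-root users. (Directories also need the execute bit before non-root users can enter them, so a mode like "chmod -R a+rwX" is the safer variant.)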

Problem: How can I find the system calls a program has made?
Solution: Run a kernel trace (ktrace) on execution of the program to see all its system calls and other kernel activity; e.g., you can grep for 'uid' in the voluminous output (as dumped by kdump).

Problem: Anything related to keys, licenses, HAPICheckout, HAPIFree
Solution: No license manager is used any more, so the solution is to upgrade to a license-manager-free version.

Problem: NetBuilder fails on Linux
Solution: Run it on NT or Solaris where Java is stable. You can consider the NetBuilder as a completely separate development tool for use under Windows only. Assuming that you are aiming to develop a runtime application that doesn't ultimately require the NetBuilder, then this is an option. I realise that during development, switching between Windows and UNIX is not ideal when it's the same PC, but the lattices produced from the NetBuilder are usually static - designed and tested and then little changed, other than redevelopment/maintenance.

Problem: Linux Sound
Solution: I also know that the Linux mic stuff works because I ported a shareware package called 'kvoicecontrol' to work on FreeBSD. (It's a simple phrase-matching command tool. You train a set of phrase/command pairs. Link is: http://www.kiecza.de/daniel/kde/index.html, in case you're interested.) On my todo list is to rebuild this as an ELF binary and test. That would be even more confirmation that 'linux sound' works for me.

Problem: System complains that HAPI*.a is not a valid archive
Solution: This happened once with a custom HAPI_mvx.linux.a.gz; I gunzip'd it and even the gunzip'd .a still claimed to be a gzip'd file. So, I renamed it to jnk.gz and re-gunzip'd, and mv'd back to HAPI_mvx.linux.a. Probably not the problem, but kinda weird...? (Probably an error of Derek's)

Problem: Can't read weird wav files. It looks very much like garbage to everything that I know to use for looking at it on my box, EXCEPT for grapHvite. That is:

           - the 'file' command doesn't know what it is:

              % file dial01223.wav
              dial01223.wav: data
              (but) % file /tmp/t.wav
              /tmp/t.wav: Microsoft RIFF, WAVE audio data, 8 bit, mono 11025 Hz

           - cat'ing your file to /dev/dsp or /dev/audio just sounds
             like the blank area on a scratchy (vinyl) record.

           - I can't load it via mxv (a sound tool that I still don't
             know too much about). It complains of a bad magic
             number (-1448411136).

Solution: No other tools can read this wav file because it was recorded using a proprietary tool (HSLab), which puts a header on the file that HAPI reads seamlessly but standard tools don't. A Microsoft RIFF file, on the other hand, would have required you to change config variables, so using this format kept things "simple". If you want to do some testing yourself by recording a similar utterance, then let me know the type of file you will create (RIFF, etc), and I should be able to tell you the config variables to set that will allow this type of file to be read.
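
If you want to know up front whether a .wav file is a real Microsoft RIFF file or something with another header, you can look at the first 12 bytes yourself, which is roughly what the 'file' command does. A small Python sketch follows; the function names are made up for illustration, and it only checks the RIFF/WAVE magic strings, not the rest of the header.

```python
import sys


def looks_like_riff_wave(header):
    """True if the bytes start with 'RIFF', a 4-byte size, then 'WAVE'."""
    return len(header) >= 12 and header[0:4] == b"RIFF" and header[8:12] == b"WAVE"


def describe(path):
    """Read the first 12 bytes of a file and report what they look like."""
    with open(path, "rb") as f:
        header = f.read(12)
    if looks_like_riff_wave(header):
        return "Microsoft RIFF, WAVE audio data"
    return "unknown header (not RIFF)"


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {describe(path)}")
```

A file that comes back as "unknown header" may still be perfectly good audio for HAPI, as in the case above; it just means standard tools won't recognize it.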

Problem: Debugging HAPI audio:
Solution: When you are using live audio, as opposed to file based recognition, set the following in your config file:

	  HAudio:TRACE=777
	  HParm:TRACE=777

Problem: Switching input from mic to line in
Solution: By default the config file is looking for a line input rather than a mic. Check/change the following config variables in basic_us.cfg:

	   HParm:  MICIN         = F
	   HParm:  LINEIN        = T
If you have mic-in, then change this to:
	    
	      MICIN         = T
	      LINEIN        = F
The HParm: prefix is not necessary with MICIN and LINEIN.

Problem: AMD or Intel?
Solution: Intel. In a dialog between KN and DH, KN says:
> I'm running a voice recog timing test over a canned set of 59 .wav
> files, on two boxes:
>
> 1. weevil - AMDK63 400mhz w/ 64Mb ram (FreeBSD 3.2)
> 2. catfish - PIII 450mhz w/ 256Mb ram (FreeBSD 3.3)
>
> The binary is a linux ELF program that I build on weevil and
> copied to catfish. Sound drivers are not involved here.
>
> And, the winner is...
>
> 1. weevil: Total Recognition Time: 19.478 secs
> 2. catfish: Total Recognition Time: 10.528 secs
>
> I did a vmstat -w 1, as well as a swapinfo (in a 1-second loop)
> on successive runs on weevil to ensure that memory wasn't an issue.
> I never saw any swapping, and memory held >= 28K throughout.
>
> Catfish's timing is comparable to what DH saw on his
> PII 400mhz Linux box, getting things in at around 12 seconds.

Well, you can forget about this being attributed to any optimisation we made - I compiled the Linux library without our Pentium optimisation (compile flag oversight, I just checked). So it's something else. I have talked to the developer who made our optimisations, and he repeated what I think I said earlier - we saw a few percent (approx 5) speed increase. He did make some other observations though, which might affect one chip more than others. A lot of the floating point arithmetic was removed (now integer based), so if AMD concentrated efforts on the FPU then Intel may be faster for us. Also, our stuff is really not helped by caching, as everything is so dynamic and effectively random (even the recognition network, as it's being traversed, is changing all the time due to pruning, or whatever). The developer also talked about other processor techniques, such as branch prediction, where at any branch an assumption can be made which branch to take and a number of instructions can be carried out in advance. Then a check is made later what branch should actually have been taken, with the aim of speeding up loops where 998 times out of a thousand a certain branch will be taken, etc. Again, we don't benefit much from this, due to the random nature of the application. So if efforts were concentrated here by certain chip manufacturers...

Looking at the review you pointed to, Ken, I also see that the Pentium has an additional set of instructions (KNI additions), and speech recognition is cited as one of the areas benefitting from this. Although it looks like these additions were brought in for the PIII, and my PII has been getting much better results than your AMD K6III.

Given that I did not include the Pentium optimisations in the Linux build, it does look like the PIII is much better suited to this task. It would be interesting to see how other applications compare.
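
For a sense of scale, the totals quoted in the dialog work out to roughly 0.33 s per file on weevil and 0.18 s per file on catfish, a ratio of about 1.85, which is well beyond the 450/400 = 1.125 difference in clock speed. The Python below just redoes that arithmetic from the figures above; the function names are made up for illustration.

```python
def per_file_seconds(total_seconds, num_files=59):
    """Average recognition time per file for the 59-file test set."""
    return total_seconds / num_files


def speed_ratio(slower_seconds, faster_seconds):
    """How many times faster the faster box finished the same job."""
    return slower_seconds / faster_seconds


if __name__ == "__main__":
    weevil, catfish = 19.478, 10.528  # total recognition times from the dialog
    print(f"weevil:  {per_file_seconds(weevil):.3f} s per file")
    print(f"catfish: {per_file_seconds(catfish):.3f} s per file")
    print(f"ratio:   {speed_ratio(weevil, catfish):.2f}")
    print(f"clock:   {450 / 400:.3f}")
```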

Problem: 8bit signed/unsigned/mulaw audio data.
Solution: grapHvite (HAPI) will not work with 8-bit data. If you can get 16-bit data then that will work. Doing an 8bit -> 16bit conversion would work only to some extent, with a much increased number of errors on the converted data. It's not good though, you need to get working on 16bit data.

Problem: Audio data formats, e.g., 22050Hz MS .wav format
Solution:

          > /tmp/t.wav: Microsoft RIFF, WAVE audio data, 8 bit, mono 11025 Hz
grapHvite (HAPI) will not work with 8-bit data. If you can get your data into 16-bit mono 16kHz (or higher, eg 22.050kHz) Microsoft RIFF, WAVE audio data, then this should work after you have set the following in your config file:
	  SOURCEKIND=WAVEFORM
	  SOURCEFORMAT=WAV
	  SOURCERATE=454
Note that SOURCERATE=454 equates to 22.050 kHz (1/freq * 10000000) so you would have to set this to 625 for 16kHz sampling rate.
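
The SOURCERATE arithmetic is easy to get wrong by hand, so here is the same formula as a small Python sketch (the sample period expressed in units of 100ns, rounded to the nearest integer). The function name is made up for illustration; the rounding is an assumption that matches the two values quoted above.

```python
def source_rate(sample_rate_hz):
    """Sample period in 100ns units: (1 / freq) * 10000000, rounded."""
    return round(10_000_000 / sample_rate_hz)


if __name__ == "__main__":
    for hz in (22050, 16000):
        print(f"{hz} Hz -> SOURCERATE={source_rate(hz)}")
```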

Problem: Unknown formats for audio data
Solution: If it's 16-bit linear PCM data at 16kHz, we should be able to get it to work. All you need to do is tell grapHvite that this is an ALIEN file, and how large the header is (in bytes) - that way it can ignore the header but go straight to the data. Do this by setting the following in your config file:

	  SOURCEFORMAT=ALIEN
	  HEADERSIZE=n
where n is the size in bytes of the header of the data file.
	  > Uh, and how am I to know this 'n'? 
There's no easy way of finding a header size that I'm aware of - it's usually something that's documented somewhere. I have this documentation for TIMIT, NIST, SDES1, SUNAU8, OGI, and RIFF; for Microsoft's RIFF data files, the header size should be 44. Usually, it's not too important to get the header size exact, but it is important to get odd/even sequencing of the data correct. For example, if your header is truly 44 bytes, but you say it's 40, this is OK as it will treat the first 2 samples as noise, but the rest of the data is good; however, if you say the header size is 41, then your 16-bit samples are all mis-aligned, and will all appear as noise. Therefore you should have at most 2 'guesses' to get this right.
	  > Raw (no header) // I'm guessing I can use n=0 here?
You can use HEADERSIZE=0, or alternatively (and better) this is replaced by SOURCEFORMAT=NOHEAD, and HEADERSIZE is no longer required.
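
The odd/even point is easier to see with real bytes. The Python sketch below builds a stand-in file with a zero-filled 44-byte header followed by six 16-bit samples, then decodes it with header sizes of 44, 40 and 41. It assumes little-endian samples (as in RIFF files); the function name and the sample values are made up for illustration.

```python
import struct


def read_samples(data, header_size):
    """Decode little-endian 16-bit samples after skipping header_size bytes."""
    body = data[header_size:]
    count = len(body) // 2  # ignore a trailing odd byte, if any
    return list(struct.unpack(f"<{count}h", body[: count * 2]))


samples = [100, -200, 300, -400, 500, 600]
data = bytes(44) + struct.pack("<6h", *samples)  # zero-filled stand-in header

if __name__ == "__main__":
    print(read_samples(data, 44))  # exact: the six real samples
    print(read_samples(data, 40))  # two junk samples from the header, then good data
    print(read_samples(data, 41))  # odd offset: every sample is misaligned
```

With the even guess (40) only the first two values are junk; with the odd guess (41) none of the real sample values survive.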

Problem: Microphone selection:
Solution: I've used 'Andrea NC-80' microphone.

We have used Andrea microphones before (I believe the NC-80 too). It's plenty good enough for the job, if a little flimsy.

They seem all right but not the best. We've had the Andrea NC-80 before I think. They're OK, but not too great. We've now standardised on Knowles mic's, which are not expensive but give better performance.

Problem: Sound card diagnostics
Solution: dmesg and 'cat /dev/sndstat' can be used to find out a lot about your sound card.

Problem: Sound card selection
Solution: Tom Veatch has used an AWE64 Sound Blaster; I thought I'd try to forward on a list of cards just in case. DH recommends a well-known (i.e. well-established), standard, well-supported 16-bit card - and we would suggest a Sound Blaster 16 card. The AWE64 should be fine (may even be better), but we would prefer something that's been around even longer, so would go for the SB16. The driver support for the SB16 is good.

TV says: I always tell people to get the Creative Labs Sound Blaster AWE-64, a recommendation I myself received from MIT, where they buy PCs all the time, get AWE-64's in boxes of ten, rip out the sound cards that come in the PCs they buy and install those new. When I have done this under Linux I've also paid the $20 to www.4front-tech.com for their easily installed drivers: download and unpack them, and say "soundon" to turn them on. It works for me anyway. Check the OSS website for OS support. But even so you always have to tweak the volume controls a lot. And you don't need those drivers for NT.

Problem: Sound quality
Solution: See Sprex' FAQ on this at
http://sprex.com/faq/microphones.html

For some examples: where silence and speech are measured in dB as:

sil       speech    difference
39.8dB    66.8dB    27.0dB
40.4dB    72.6dB    32.2dB

These would be reasonable values, since the difference between speech and silence is around 30dB. But the silence levels are kind of high, so you might try getting the silence down to around 20dB using the mixer volume controls.

Problem: tuning things up past the basic.c tutorial program
Solution: basic.c is an example application not intended for performance testing. For example, the model set supplied with the tutorial is not optimised, and you wouldn't use it in a real application (but it's fast to load, so better for 'playing around' with the tutorial). Instead of using the basic_us.cfg config file, use one supplied for the NetBuilder (eg $GVHOME/sys/sys-US/hapi_US.cfg). You'll need to add the following entries:

	   HAPI:   DICTFILE      = ./basic_us.dct
	   HAPI:   NETFILE       = ./basic.lat
	   HAPI:   HMMLIST       = $GVHOME/models/models-US/ECRL_US_WTRI.list
	   HAPI:   MMF           = $GVHOME/models/models-US/ECRL_US_WTRI.mmf
	   HParm:  DEFCEPMEANVEC = $GVHOME/sys/sys-US/cepMean.US
And over time, change your DICTFILE and NETFILE pointers.

Note that this uses word-internal triphones. You may also like to experiment with the monophone models, in which case change the 2 occurrences of ECRL_US_WTRI to ECRL_US_MONO.

Problem: Sound drivers
Solution: On the last couple of Creative Labs Sound Blaster AWE-64 cards I've done installations for, it required a download from www.4front-tech.com of a $20 OSS audio driver set, along with running "soundon" to load all the kernel modules, plus running some mixer program to tweak the levels for quite a while.

This is standard; we have recommended that lots of Linux users with audio driver problems contact 4-front and obtain a commercial 3.7.1 OSS sound driver for a standard card (such as the Sound Blaster).

"The 4front folks say Apparently there are (lots) of known problems with running stable OSS drivers on FreeBSD. They don't even sell them for 2.2.x anymore. They were very nice and sent me a binary and kernel patch, however. Goodness gracious..."

Problem: Slow playback. Fast playback. The chipmunk effect (or the gorilla voice)
Solution: This comes from audio being played back at an assumed sample rate different from the one at which it was recorded. If you recorded at 16kHz and you play back at 8kHz, then it'll take twice as long to play it back, and it'll sound slow and big, like a gorilla. If you recorded at 8kHz and you play back at 16kHz, then it'll take half the time to play it back, and it'll sound super-fast and squeaky, like a chipmunk on caffeine.
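
The numbers behind the gorilla and the chipmunk are just samples divided by rate. A Python sketch of that arithmetic, with made-up function names and a made-up 3-second recording as the example:

```python
def playback_seconds(num_samples, playback_rate_hz):
    """How long a recording lasts when played at the given sample rate."""
    return num_samples / playback_rate_hz


def speed_factor(recorded_rate_hz, playback_rate_hz):
    """Above 1.0 is fast and squeaky (chipmunk); below 1.0 is slow (gorilla)."""
    return playback_rate_hz / recorded_rate_hz


if __name__ == "__main__":
    num_samples = 3 * 16000  # a 3-second utterance recorded at 16kHz
    print(playback_seconds(num_samples, 16000))  # correct rate
    print(playback_seconds(num_samples, 8000))   # played at 8kHz: twice as long
    print(speed_factor(8000, 16000))             # 8kHz data played at 16kHz
```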

Problem: Config parameters: general
Solution: I would suggest *never* having the same variable referenced twice, as you won't know for sure what's going on. In short, the behavior is undefined or unpredictable if the settings are referenced multiple times.
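
A quick way to catch doubly-set variables is to scan the config file before you use it. This Python sketch is not a HAPI tool: it assumes NAME = VALUE lines with an optional module prefix such as HParm:, treats lines starting with # as comments, and compares names with all whitespace removed. It does not treat a prefixed and an unprefixed setting of the same variable as duplicates, which you may also want to check by eye.

```python
from collections import defaultdict


def find_duplicates(config_text):
    """Map each variable set more than once to the line numbers that set it."""
    seen = defaultdict(list)
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # blank line, comment, or not a setting
        # Everything left of the first '=' is the name; drop all whitespace
        # so that 'HParm:  MICIN' and 'HParm:MICIN' compare equal.
        name = "".join(line.split("=", 1)[0].split())
        seen[name].append(lineno)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}


if __name__ == "__main__":
    text = "HParm:  MICIN = F\nHParm:  LINEIN = T\n# try the mic\nHParm:MICIN=T\n"
    print(find_duplicates(text))
```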

Problem: How long is the expected wait after the prompt in basic.c? After the "She sells sea shells..." prompt, I usually see about 50 seconds. Is that a fixed sample time, or a timeout?
Solution: There should be no wait - it's immediate. It's waiting for silence to do the calibration - but it's probably only ever getting noise, and eventually timing out. This is due to some configuration of your device drivers, plugin jacks, preamp and mixer volume controls and MICIN/LINEIN settings making it so that it has nothing but noisy input data.

Anyway that symptom makes it evident that live audio can't be set up right (or the emulation doesn't allow it), or the card/driver doesn't support 16bit data.

One option is to try another mic - just in case it has a bad connection, or similar? But if you can consistently get good audio using audiocat, then that idea is out of the window.

Problem: Portability of Entropic code
Solution: I must say that my respect-o-meter for ECRL's code portability has been pegged high ever since I saw HTK fully ported and recompiled in a single morning by a brand-new user (Andras Kornai at IBM Almaden, a couple years ago) for a new Unix OS (AIX, which we didn't support at the time). A morning (of) glory!

Problem: Is there a voice-driven dictionary wizard? It would be nice if there were a way to make dictionary entries via voice. Say, repeat a word N times to get an idea of its phoneme spelling, and then hand enter the word, output symbol, and 'spelling' of choice.
Solution: Presently it's a do-it-yourself project. See sprex.com/phonolyze for a partial solution. Buy a copy if you want to contribute to making it complete.

Problem: Calibration and volume settings.
Solution: Make sure that (a) speech levels are 30dB above silence levels, (b) silence is in the neighborhood of 20dB, and (c) you don't get values of N/A, or dynamic-range-used percentages below 5% or above 95% (which are pathological) or below 15% or above 85% (which are marginal and questionable). To get there: reduce background noise; increase or decrease mic volumes; bring up the signal-to-noise ratio by moving the mic closer to the mouth or moving the environmental noise sources farther from the mic; or overwhelm the electrical noise sources in the PC by setting PC-internal (mixer) levels to a low level while using a preamp to crank up the signal levels to a high level.
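
If you are logging calibration values, the rules of thumb above are simple enough to check in code. A Python sketch, under stated assumptions: the function name is made up, "in the neighborhood of 20dB" is taken to mean within 5dB (a guess you can change via the parameter), and how you obtain the three numbers from your application is not shown.

```python
def assess_calibration(silence_db, speech_db, range_used_pct,
                       silence_tolerance_db=5.0):
    """Return a list of problems; an empty list means the levels look fine."""
    problems = []
    if speech_db - silence_db < 30.0:
        problems.append("speech less than 30dB above silence")
    # Assumption: 'around 20dB' means within silence_tolerance_db of 20.
    if abs(silence_db - 20.0) > silence_tolerance_db:
        problems.append("silence not near 20dB")
    if range_used_pct < 5.0 or range_used_pct > 95.0:
        problems.append("dynamic range pathological")
    elif range_used_pct < 15.0 or range_used_pct > 85.0:
        problems.append("dynamic range marginal")
    return problems


if __name__ == "__main__":
    print(assess_calibration(20.0, 52.0, 50.0))
    print(assess_calibration(39.8, 66.8, 10.0))
```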

Problem: How to generate a detailed log file to get results from a recog session, with confidence values etc.
Solution: This is also a do-it-yourself project, although you can set TRACE values to numbers up to 777 to get lots of printed output to look at.

Problem: How to compare correct transcriptions with recognition output
Solution: HTK's HResults program does this. On the other hand you can printf the result strings and diff them against the prompts themselves.
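
If you go the printf-and-compare route, a plain diff will flag a whole line for one wrong word. Counting word-level edits is more informative. The Python below is a sketch of an ordinary word-level edit distance (substitutions, insertions and deletions all cost 1); it is not HResults and its scoring may differ from what HResults reports. The function names are made up for illustration.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)]


def word_error_rate(reference, hypothesis):
    """Errors divided by the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())


if __name__ == "__main__":
    print(word_errors("call four five", "call oh five"))
    print(word_error_rate("she sells sea shells", "she sells shells"))
```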

Problem: Accuracy is bad
Solution: Do you have 30dB SNR with 80% of the dynamic range used? Are you using triphones? Is your task reasonable, with multiple phones to distinguish each distinct pair of paths through the grammar network?

Problem: Is J-HAPI available through Sprex?
Solution: Sprex prefers to not have to support another API, and few HAPI users have used J-HAPI for actual applications, so: No. Of course, if you're Sun Microsystems or something, we can probably work something out.

Problem: Garbage models go where in the network?
Solution: Use only one, put it at the same level as the 'commands', a single simple node between the start and terminate null-nodes of your lattice, or a top-level OR-disjunct in the grammar.

Problem: Dictionary "symbols"? Or, what is the difference between: < s > [null] sil and < s > [] sil ??
Solution: Between the brackets is the "symbol" for the word, which is different from the "name" of the word (usually its spelling). This trick gives you a way to hide the non-contentful words by using [] as the symbol. If [] or [something] is not specified, then the symbol is taken to be identical to the word's spelling (in the first column of the dictionary entry).
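
To make the name/symbol distinction concrete, here is a Python sketch that splits a dictionary line into its word, its output symbol and its phones, using the rule just described. It assumes whitespace-separated fields with the optional [symbol] as the second field and no spaces inside the brackets; the function name and the sample entries are made up for illustration.

```python
def parse_entry(line):
    """Split a dictionary line into (word, output symbol, list of phones)."""
    tokens = line.split()
    word, rest = tokens[0], tokens[1:]
    if rest and rest[0].startswith("[") and rest[0].endswith("]"):
        symbol, phones = rest[0][1:-1], rest[1:]  # [] gives an empty symbol
    else:
        symbol, phones = word, rest  # no [..]: symbol is the word's spelling
    return word, symbol, phones


if __name__ == "__main__":
    for entry in ("SIL [] sil", "FOUR [4] f ow r sp", "OH ow sp"):
        print(parse_entry(entry))
```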

Problem: Windows Paths
Solution: One problem is that grapHvite sometimes struggles with Windows directory names; e.g., change NetworkDir so that it uses forward slashes instead of backslashes. Duh:
NetworkDir = "C:/Program Files/Entropic/grapHvite-1.1/lattices/lattices-US"
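
If your application builds these paths itself, it is simple to normalize them before they reach the config. A one-function Python sketch (the function name is made up for illustration):

```python
def to_forward_slashes(path):
    """Replace every backslash in a Windows path with a forward slash."""
    return path.replace("\\", "/")


if __name__ == "__main__":
    windows_path = r"C:\Program Files\Entropic\grapHvite-1.1\lattices\lattices-US"
    print(to_forward_slashes(windows_path))
```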

Problem: How to fix a specific word confusion, e.g., FOUR with OH
Solution: Check the pronunciations in the dictionary for these words. For example, FOUR may be transcribed in your dictionary with "open o" or AO as in "caught" instead of "long o", or OW, as in "cope". In most US dialects (outside the South), the OW vowel is more appropriate for the way people actually pronounce it, therefore the OH word matches better than the badly-phonemicized FOUR. So if your people aren't saying what the dictionary says they should, change your dictionary!

(This is just one of a whole class of pre-R cases though, where in the SouthEast they have /ao r/ but elsewhere /ow r/. In general where /ao/ merges with /aa/ for most speakers under 60 in the West, and in a couple of Eastern cities, the allophone of /ao/ before /r/ splits off from the others and merges with /ow/ instead of /aa/. So you can grep for "ao r" in the dictionary and get a long list of candidates to add an alternate form for.)
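
Instead of grepping and editing by hand, you can generate the candidate alternates automatically and then review them. A Python sketch, assuming one whitespace-separated entry per line with the word first and lowercase phone names as written above; the function name and the sample entries are made up for illustration. It only proposes new lines, and leaves the originals alone so both pronunciations stay in the dictionary.

```python
def ao_r_alternates(dictionary_lines):
    """Return alternate entries in which every 'ao r' becomes 'ow r'."""
    alternates = []
    for line in dictionary_lines:
        tokens = line.split()
        changed = False
        # Token 0 is the word itself, so start looking at token 1.
        for i in range(1, len(tokens) - 1):
            if tokens[i] == "ao" and tokens[i + 1] == "r":
                tokens[i] = "ow"
                changed = True
        if changed:
            alternates.append(" ".join(tokens))
    return alternates


if __name__ == "__main__":
    entries = ["FOUR f ao r sp", "OH ow sp", "CAUGHT k ao t sp"]
    for entry in ao_r_alternates(entries):
        print(entry)
```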

Another way to make things more robust in general is to look at confidence scores (a value between 0.0 and 1.0; above 0.6 you can usually be confident in the result). Apart from 'outliers' you would generally expect mis-recognised words to have a noticeably lower confidence score. There are also some discussions in the HAPI manual that you might find useful (Application Issues: System Design).
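
Once you have per-word confidence scores in hand, flagging the doubtful words is a one-liner. The Python sketch below assumes you have already collected (word, confidence) pairs from your recognizer; how to get them out of HAPI is not shown here, and the function name and sample scores are made up for illustration.

```python
def low_confidence(words, threshold=0.6):
    """Return the words whose confidence falls below the threshold."""
    return [word for word, confidence in words if confidence < threshold]


if __name__ == "__main__":
    result = [("CALL", 0.91), ("OH", 0.42), ("FIVE", 0.88)]  # made-up scores
    print(low_confidence(result))
```

An application might re-prompt the user for just the flagged words rather than reject the whole utterance.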

Problem: Speed is slow
Solution: Adjust the beamwidths (BEAMWIDTH) used in pruning. Do some experiments on a valid statistical test sample and see for a variety of beamwidth values what the accuracies and CPU-time durations are. This will generally show you a good cutoff point where the accuracy remains quite good at a fast speed. Shorten your silence timeout threshold. If it is long then the perceived delay may be longer than necessary. In one case, the main problem with speed (and some accuracy discrepancies we were seeing) was seemingly due to an old bug in which null '#' comment lines caused certain config specifications to be ignored.
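
Picking the cutoff from a beamwidth experiment can be done by rule: take the fastest setting whose accuracy is within some tolerance of the best one. A Python sketch of that selection step; the function name, the 1-point tolerance and the numbers in the example are all made up for illustration, and you would feed it measurements from your own test set.

```python
def pick_beamwidth(results, max_accuracy_loss=1.0):
    """results: list of (beamwidth, accuracy_pct, cpu_seconds) tuples.

    Returns the beamwidth with the lowest CPU time among those whose
    accuracy is within max_accuracy_loss points of the best accuracy.
    """
    best_accuracy = max(accuracy for _, accuracy, _ in results)
    acceptable = [r for r in results if best_accuracy - r[1] <= max_accuracy_loss]
    return min(acceptable, key=lambda r: r[2])[0]


if __name__ == "__main__":
    # Invented numbers, for illustration only; run your own test set.
    measurements = [
        (100.0, 80.0, 5.0),
        (150.0, 94.5, 8.0),
        (200.0, 95.0, 12.0),
        (250.0, 95.1, 20.0),
    ]
    print(pick_beamwidth(measurements))
```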

Problem: Accuracy lossage due to background noises, or misrecognition
Solution: This is normally due to using non-optimum model sets, or feeding bad or inappropriate data into the recognizer. Start with $GV/sys/sys-US/hapi_US.cfg and modify that instead of the basic tutorial config file.

Add FORCECXTEXP=TRUE (actually makes use of the loaded triphones) and USESILDET=TRUE (turns on the silence detector to not bother with recognition when it's just silence).

Check the confidence levels which may be 20-50% lower on the erroneously-recognized words.

Problem: Once things are basically working, how do we *improve* accuracy?
Solution: Basically you need to take a representative sample of test data, and try to find out why mis-recognition took place. It's time consuming but you reap the rewards later. When doing this, it might become obvious, for example, that certain words are regularly being mis-recognised, and that's a good time to check what the pronunciation should be (according to the dictionary). Where it's not a good match, simply add a new pronunciation of that word (without deleting the original).

If a word is misrecognized only at a certain location in the grammar, (e.g., FOUR at the end of the utterance but not elsewhere), then you can adjust the weights on lattice links to make that selection more or less likely. But try that as a last resort; the system should be globally optimized before doing something this fussy.

You can do adaptation, too. We are confident we can get the error rate to a very low level with adaptation (the results with large vocabulary are quite astonishing), but perhaps not as low as 2%. I really believe we won't be far off, but I have no proof yet. Only the application itself can prove this. Indeed regardless of error rates being improvable for speaker-independent recognition, getting errors down to 2-3% is hard without adaptation. So the likely goal is using speaker-independent stuff with per-speaker adaptation transforms loaded in on top.

You can do word-based models too:

Speaker-dependent word-based model training (after brief data-collection "enrollment") can be implemented with HTK and the resulting word-based models integrated into the grapHvite application, I believe. This is mildly complicated to do, but there's a well-understood step-by-step recipe for it.

A 12% error rate, versus 2% for Verbex and 8% for L&H, is pretty poor and disappointing. Entropic's MFCC US models are not great for digit recognition. The EPC models should be better than the MFCC ones, but still not that great; some UK models have whole-word digit models, and hopefully the US models will too.

Problem: Extra words are inserted
Solution: Increase the word insertion penalty WORDPEN from, say -20 to, say, -25.

Problem: Can I mix model sets?
Solution: No, you can only use one model set at a time. You can't mix them during recognition, sorry.
> Ah, I see. I thought you could add model names to an
> hmmlist file somehow to combine model sets. Isn't that
> possible with HTK, so that you can mix in word models with
> phone models? That was my memory, but that could easily be
> wrong.

Yes that's possible with HTK, but not with the models we ship with grapHvite. They are encrypted, and HTK does not know how to decrypt them, so neither the models themselves nor the model list can be modified. Encryption aside, there are a number of optimisations in the supplied models that HTK also does not know about, so HTK could not add/modify the models even if they weren't encrypted.

Problem: What is EPC?
Solution: EPC stands for Entropic Proprietary Coding, so the algorithm is not being disclosed. We believe it catches the information better than MFCCs, so should be a bit more accurate.

Problem: Speaker dependent vs. speaker adaptive
Solution: When models are trained with one speaker in mind, the resultant system is known as speaker dependent. If the models are trained for robust recognition of any speaker, this is a speaker independent system. However, speaker independent models can be easily adapted to a particular speaker given a small amount of training data (for that particular person, of course). In HAPI, you have several ways of obtaining a speaker-adapted system - and one way is to start with speaker independent models, load a transform for the currently logged in user, and then when another user logs in unload the current transform, and load the transform relevant to the new speaker. Loading/unloading of transforms is really quick, and the underlying model set is still speaker independent.

Problem: Adaptation is part of HTK; is it also part of HAPI?
Solution: Yes, HAPI 2.x includes transform generation/loading/unloading.

Problem: How can I see all the pronunciations for a given word?
Solution: You can enter the word into NetBuilder, part of grapHvite, and all its pronunciations will be displayed. If you don't own a grapHvite license, but only a HAPI license, then you can write a program to access and print the dictionary entries for any given word. (Sprex has written such a program.)

Problem: What's the tradeoff between improving accuracy using speaker-adaptation and by improving the pronunciation dictionary?
Solution: Adaptation will shift the acoustics for each phoneme from the global, speaker-independent distributions to the particular distribution of the individual speaker being adapted to. This significantly improves accuracy at a general level. However, adaptation doesn't affect the phoneme patterns in the dictionary for individual words. If you have an error or dialect miscorrespondence in the dictionary, then even after adaptation it can remain the source of repeated systematic errors when that word is (mis-)recognized. On the other hand, if you fix the dictionary without doing adaptation, you will have entirely removed that source of error. The global benefits of adaptation are basically an independent issue.

If there is an interaction between dictionary entry adjustment and adaptation, it is that adaptation on data using bad pronunciations will adapt phones that should be quite different in the direction that makes them overlap each other, so that they are harder to recognize as being different from one another. This means that dictionary errors will make adapted model sets worse.

The general lesson is that clean data is gold.

Here's another take on the issue. There are two classes of sound differences that are key here. Acoustic variation in the same sound category is what the adaptation will handle. So if a speaker says "I" in a southern way, adaptation will pick that up, because all their /ay/ phonemes are pronounced together in that southern way: this is a phonetic, low-level, acoustic difference, and the categories don't change, only their acoustic form changes. Adaptation will capture this kind of difference very nicely; it is good for handling foreign accents, dialect variation, children, people with distorted voices, and so on: global differences in acoustic patterning.

Getting the sound categories right, though, adaptation will not do that for you. So if a speaker says "roof" with the vowel in "put", rather than with the vowel in "food", that's a category difference, not an acoustic difference. The phonemes in the word are different, not the acoustics of the phonemes. If you adapted the vowel in "roof" under the assumption that it should be the same category as the vowel in "put" while the speaker actually says it with the vowel in "food", then you're going to corrupt the system by telling it the first category sometimes sounds like the second. The distributions will incorrectly be trained to overlap each other, and confusion will result. Basically the system needs to keep its categories straight, and that's a dictionary issue rather than an adaptation issue.

I hope that helps clarify the difference between these two very different issues. It's actually quite confusing; I should know, I did a whole dissertation working on this particular issue.

Problem: Is dictionary tweaking an infinite and insoluble task?
Solution: Here's a dialog about this:
> > You're right - I'm confused. In the above "roof"-like-"put" example
> > (and perhaps "root" is an even better example), what would we be
> > doing to handle this? Add a new phoneme spelling to the dictionary?
>
> Yes, assuming the dictionary doesn't include the alternate forms already.
>
> > But then there can be a small raft of such changes: roof, root, doof,
> > aloof, (more?); But not all 'oo's would qualify, such as "shoot", "coot"...
>
> Right. There are certainly an indeterminate number of these --
> there's a whole part of linguistics, dialect geography, which is
> largely concerned with this kind of thing. However you only have to
> worry about the cases that are in your application; the number is
> likely to be few. Give me your word list and I'll give you the likely
> candidates. Probably there won't be any, other than this example of
> FOUR. "eh-conomics" vs "ee-conomics" is a rare kind of alternation.
>
> > It's mildly worrisome to ponder a non-terminating set of adjustments
> > to dictionaries, especially when I was mis-understanding that adaptation
> > would be all that's required.
>
> Adaptation will certainly make major improvements anyway. This issue
> is mostly a red herring, of interest mostly to linguists only.
> Also it doesn't distinguish among technologies; every phone-based
> system in the world has the same problem.
>
> > That is, we'd like to be able to deterministically get all recognition
> > down at the < 2-3% error rate for *any* speaker. The above suggests
> > that this may not always work without developer intervention to attend
> > to dictionary issues for some problematic speaker(s).
>
> It should stabilize after a small number of speakers, since they are
> all speakers of one of a small number of highly consistent dialects.
> Once you get it right for one speaker in that dialect, you get them
> all, because they all do the same thing, as a rule. Again, the number
> of cases is not likely to be that great. Also it's not exactly a lot
> of work to add a dictionary entry.
>
> > Again the production site is a random walk through pronunciation
> > space. They have East-Asian, East-Indian, US Southern, Haitian, etc.,
> > etc. speakers.
>
> This is not as bad as you might think since the immigrants are all
> trying to speak US English, so while their pronunciation of /uw/ might
> be funny, they will be trying to pronounce all the /uw/ words with the
> same sound. What people think of as "having an accent" is mostly a
> matter of this kind of thing, and it is unnecessary to fuss with the
> dictionary for those cases.
>
> Secondly you can take another bite out of the dialect variation by
> using the UK English system, too, which is likely to be the reference
> dialect for East-Indian and Hong-Kong speakers.
>
> Finally, this is an issue that most ASR companies just don't talk
> about; the system is what it is, the dictionaries are simply as good
> as they happen to be, and the error rates are whatever falls out, and
> the only thing they care about is to try to have enough coverage that
> these kinds of things are down in the statistical noise. Just because
> we have thought of an interesting example here doesn't mean it's going
> to have a noticeable effect on accuracies. Some people say "hoop" to
> rhyme with "foot", I've heard. But it would seem to be only a matter
> of a few people, and it's a pretty rare word, unless you happen to
> have that dialect and your application needs that word. So it's
> mostly down in the noise.
>
> > Is my knee being too much of a jerk?
>
> Not really; this is a general problem. Do you know about Zipf's Law?
> In general the frequency of highly improbable events is very high. If
> you had a language generated by random selection with replacement from
> the (L-sized) lexicon, then the probability of any N gram occurring is
> 1/L^N (e.g., with 60k words and bigrams, that's 1/(36*10^8) which is
> less than 0.00000001), a probability that is so small that you'll
> never see most N-grams in any practically attainable data set. While
> the ones you do see have very low probability, still that's what you
> mostly see all the time. Enough of this translates from
> uniform-probability generation of N-grams to actual, natural English
> that the basic insight remains true on real data, and what it means is
> that there can necessarily never be enough data, and that even your
> dictionaries are always going to be inadequate because even if you
> collect 600 million words of text and sort out all the distinct words,
> you'll keep finding new ones at a non-negligible rate even in the last
> little fraction of your data set, and it keeps on going. This is the
> nature of language, sorry about that.
>
> But really it's more a matter of philosophy than practicality, since,
> fortunately, we bypass all this by defining a limited task around
> which the grammar that people use in speaking on that task can be
> fully specified after looking at a little data (or making it up),
> where all the words can be enumerated and the word sequences generated
> by a simple word-lattice grammar. We do have to look out for this,
> but a reasonably short amount of study and dictionary tweaking will
> make things pretty tight. Any robust recognition system has been
> through this kind of iterative refinement, that's just the nature of
> the problem. So anyway, as Panini or some other ancient wise person
> said, Life is short, therefore let the grammarians argue on in peace.
> ^person^sandwich^ ??

Panini was an ancient Sanskrit linguist. It's an interesting story: until the 1980's (at least) his grammar of Sanskrit was the most complete and accurate grammar of any language ever. He composed it around the 5th century B.C. (the exact date is uncertain). Actually he didn't write it down; he dreamed it up in rhyming verse and memorized it, and for a thousand years or so everyone who knew it knew it only because they had memorized it: the Brahmins would teach their sons as they plowed the fields in India, generation after generation. Finally no one spoke Sanskrit anymore, only Hindi and Bengali and Marathi and so on, and it was a good thing they still had the grammar, so they knew the rules of the language of their scriptures, which along with the grammar eventually got written down. The whole thing was a religious practice (that's what happens when linguists get to be in charge -- doing linguistics becomes a religion). Anyway...
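Coming back to the N-gram arithmetic in the dialog above, the numbers can be checked with a few lines of Python. This is only a sketch of the uniform-selection thought experiment described there (60k-word lexicon, bigrams, a 600-million-word corpus); the function name is made up for illustration, and real English is of course not uniformly distributed.

```python
def uniform_ngram_stats(lexicon_size, n, corpus_tokens):
    """Return (possible_ngrams, probability_of_each, max_fraction_seen)
    for N-grams drawn uniformly with replacement from the lexicon."""
    possible = lexicon_size ** n      # L^N distinct N-grams
    probability = 1.0 / possible      # each one is equally likely
    # A corpus of T tokens contains fewer than T N-grams, so it cannot
    # show more than T distinct ones, however it was collected.
    max_fraction_seen = min(1.0, corpus_tokens / possible)
    return possible, probability, max_fraction_seen


possible, probability, max_seen = uniform_ngram_stats(60_000, 2, 600_000_000)
print(f"possible bigrams:     {possible:,}")
print(f"probability of each:  {probability:.3e}")
print(f"max fraction covered: {max_seen:.1%}")
```

With these figures there are 3.6 billion possible bigrams, so even a 600-million-word corpus could show at most about one sixth of them, which is the sense in which there can never be enough data.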

Problem: Multi-threading on Linux
Solution: HAPI can be compiled in thread-safe mode on Linux. This requires a pthread library (available via GNU); alternatively, we may be able to build a version that doesn't require this library, but it won't be thread-safe. The following compile/link flags are needed:

	  setenv HTKCF '-ansi -O2 -DOSS_AUDIO'
	  setenv HTKLF '-L/usr/X11/lib -lpthread'

Problem: InitPronHolders has a warning about sp, sil, or missing or undefined words.
Solution: The dictionary has to have definitions for the words in the grammar. The HMMset has to have definitions for all the phones in the dictionary.

For example, I was testing with a dictionary with an imaginary phoneme named "zz", and somehow in hapiNetInitialise(), further within ExpandHRnNet(), it did a core dump in the InitPronHolders() function. When I converted the phoneme to be named "z", that particular error went away.

So one thing to check is your phone inventory. First, the line at the top of your dictionary labelled !PHONE-SET should have all the phones in it, not just some. Second, there should be no pronunciation in the dictionary using any phones not in the !PHONE-SET. Third, the phone list should match the ones in your MMF and HMM list files (which you can't determine directly, but you can go through a bunch of words with NetBuilder's dictionary and see if you have anything weird). My phone set (which is probably identical to yours) is:

!PHONE-SET aa ae ah ao aw ax ay b ch d dh eh en er ey f g hh ih iy jh k l m n ng ow oy p r s sh sil sp t th uh uw v w y z zh

So this may not solve your problem, but it gives you something to try.
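The first two checks can be automated. The following Python sketch is an assumption-laden illustration, not a Sprex tool: it assumes a plain-text dictionary with one "!PHONE-SET ..." line and one whitespace-separated "WORD phone phone ..." line per pronunciation, and the function name check_dictionary and the sample entries are hypothetical. It reports phones that are used but not declared, and grammar words that have no dictionary entry.

```python
def check_dictionary(dict_lines, grammar_words=()):
    """Return (unknown_phones, missing_words) for a pronunciation dictionary.

    Assumed layout: one '!PHONE-SET p1 p2 ...' line, plus one
    'WORD p1 p2 ...' line per pronunciation, all whitespace-separated.
    """
    phone_set = set()
    pronunciations = {}
    for line in dict_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "!PHONE-SET":
            phone_set.update(fields[1:])
        else:
            pronunciations.setdefault(fields[0], []).append(fields[1:])

    # Phones used in some pronunciation but absent from !PHONE-SET.
    unknown_phones = {}
    for word, variants in pronunciations.items():
        bad = sorted({p for phones in variants for p in phones} - phone_set)
        if bad:
            unknown_phones[word] = bad

    # Grammar words with no dictionary entry at all.
    missing_words = [w for w in grammar_words if w not in pronunciations]
    return unknown_phones, missing_words


sample = [
    "!PHONE-SET ae b k sil sp t z",
    "SIL sil",
    "CAT k ae t sp",
    "BUZZ b ah zz sp",  # 'ah' and 'zz' are not declared above
]
unknown, missing = check_dictionary(sample, grammar_words=["SIL", "CAT", "DOG"])
print(unknown, missing)
```

The third check (matching the MMF and HMM list files) is not covered here, since those files can't be inspected directly.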

One user found via -T 1 in HAPAdapt that the sp/sil words were indeed needed.

Another time an InitPronHolders error message was solved by trying a setting for the following config variable:

ADAPTADDSIL = "sp sil"

which is meant to replace ADDSILPHONES from previous HAPI releases.

Consider the following issues for InitPronHolders errors:

For sp/sil errors:

Problem: Can I change the log file for trace output whilst HAPI is running?
Solution: No. The log file can only be set via the LOGFILE config parameter. Calling hapiOverrideConfStr with LOGFILE won't have any useful effect, because a file handle is currently opened to LOGFILE once, during initialisation (hapiInitHAPI), and that handle doesn't get updated when the filename is changed.

Problem: HAPI gets stuck in the silence detector
Solution: Set the config parameter SILCNTTIMEOUT.

Problem: During silence measurement attempts I get results like:

    Mean speech level in dBs -1.00 (2.1%), silence 35.40
    Mean speech level in dBs 32.68 (0.4%), silence 17.36
    Mean speech level in dBs -1.00 (0.5%), silence 33.87
    Mean speech level in dBs -1.00 (0.7%), silence 36.15

Solution: These results indicate the levels are too low, or even that the mic is plugged in to the wrong jack. The printouts ought to be modified with some intelligence to tell the user so.
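Until the printouts get that intelligence, a small script can do the flagging. This Python sketch parses lines in the format shown above and counts the measurements whose speech level is -1.00, the value this FAQ describes as meaning no audio is present. The regular expression and function names are illustrative assumptions based only on the four sample lines.

```python
import re

# Matches e.g. "Mean speech level in dBs -1.00 (2.1%), silence 35.40"
LEVEL_LINE = re.compile(
    r"Mean speech level in dBs\s+(-?\d+(?:\.\d+)?)\s+"
    r"\((\d+(?:\.\d+)?)%\),\s+silence\s+(-?\d+(?:\.\d+)?)"
)


def parse_levels(text):
    """Return a list of (speech_db, percent, silence_db) tuples."""
    return [tuple(float(g) for g in m.groups()) for m in LEVEL_LINE.finditer(text)]


def no_audio_measurements(levels):
    """Measurements whose speech level is the -1.00 'no audio' value."""
    return [lv for lv in levels if lv[0] == -1.0]


log = """\
    Mean speech level in dBs -1.00 (2.1%), silence 35.40
    Mean speech level in dBs 32.68 (0.4%), silence 17.36
    Mean speech level in dBs -1.00 (0.5%), silence 33.87
    Mean speech level in dBs -1.00 (0.7%), silence 36.15
"""
levels = parse_levels(log)
bad = no_audio_measurements(levels)
print(f"{len(bad)} of {len(levels)} measurements report no audio")
```

If most measurements are flagged, check the mic jack and mixer levels before looking at anything in the recognizer.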

Problem: During recognition I get results like:

WARNING [-6006]  ReadAudio: Failed to read all 0 samples from OSS audio in HAPI
WARNING [-6006]  ReadAudio: Failed to read all 0 samples from OSS audio in HAPI
<...repeats about 2-20 billion times...>

Solution: This is a problem somewhere in the audio channel: the drivers, the mic, which jacks are plugged in, the volume settings, whether OSS is installed and working, or the HAPI read functions not detecting some weird condition.

6006 means no audio is present. -1.00 means no audio is present. 1% means whatever data it's getting isn't more than 1% of the whole range of a 16-bit sample.

KN: You say that the -1% indicates no audio. But I did see several %'s > 0. Doesn't that mean that something is getting through sometimes?
TV: Small numbers can indicate random voltages on the non-connected (mic in or line in) channel.

This suggests you're not getting any audio in from /dev/audio. I don't think it's hooked up, perhaps in the Linux emulation mode, perhaps underneath it somewhere. Is audio tested working? Is audio tested working within the Linux mode? Try lsmod to see if the appropriate Linux kernel modules are present. Try downloading Sprex' program audiocat from ftp://ftp.sprex.com/pub/else/audiocat and see if it runs, records an audio file, and then plays it back (audiocat -i test; audiocat test). If it doesn't even execute then I can give you access to the source to compile there.

I am suspecting that Linux emulation doesn't extend to the audio hardware. Do you know that it does?

kmix is new to me but probably from KDE which ought to be a good thing.

I'm not sure how you're going to get /dev/audio to work, because it's generally a kernel module thing, done normally via OSS kernel modules. On the last couple of Creative Labs Sound Blaster AWE-64 cards I've done installations for, it required a download from www.4front-tech.com of a $20 OSS audio driver set, along with running "soundon" to load all the kernel modules, plus running some mixer program to tweak the levels for quite a while.

Problem: anything audio related:
Solution: Use audiocat. Ask tv@sprex.com for a distribution; he might ask you to buy it.

Problem: /dev/audio on FreeBSD
Solution: Check if audiocat works perfectly.

If audiocat works for record and playback then you should be clean in my opinion; that would confirm that the drivers are in and the hardware is up and the mic is on and so forth.

I already knew /dev/audio works for non-elf (i.e., non-Linux style) binaries since I can play sound samples.

There is no lsmod in FreeBSD (that I know of). Instead you do modload/modstat operations. That works on dynamically loaded (lkm) kernel modules. I have done this previously, but now simply set an option in the kernel config file to pull these things in statically in the kernel build. There's probably a way to interrogate a running kernel for config details (other than dmesg) but I don't know it. Suffice to say that none of the grapHvite binaries or audiocat would run without this enabled, since FreeBSD 2.x doesn't use elf format (3.x does, but 2.x is the target platform for the system the voice app runs in).

Okay. In my experience of Linux, the sound drivers are kernel modules rather than compiled into the kernel, although there are a bunch of soundcard options proposed to the naive user in the kernel compilation process, which are to be ignored in favor of the OSS kernel modules. In FreeBSD I have no idea how to approach it, but if audiocat works then I personally think you ought to be getting audio into grapHvite.

Problem: Are the ~60 second wait times I saw normal?
Solution: No, it's not normal. That's the timeout when it's waiting for speech and hasn't seen any. It means speech isn't getting into the system.

Problem: Linux sound drivers?
Solution: OSS audio drivers work well on Linux, particularly those supported by 4front-tech.com. If a given Linux version has a different audio layer, then we may be stuck. We would expect an audio error to occur when we try to grab read/write access to the driver, or when we first grab data from the source - if this isn't happening, then there may be a different issue.

Problem: What if we get the message ECONNREFUSED on the client side (Linux)?
Solution: Check the following items. For further information, e-mail us at
info@sprex.com.

  • First, if you have DNS or DHCP difficulties, check that you're using numeric IP addresses rather than domain names.
  • Next, run "/sbin/ifconfig -a" and check the output.
  • Next, run "netstat -nr" to show what your routing environment looks like.
  • To confirm that sprecd is listening on a port of the correct numeric IP address, run "netstat -l" and "netstat -nl" to list listening sockets. One of the entries in the output of the first should be sprecd with its IP address; the corresponding entry in the output of the second should show the relevant active numeric IP address and port number.
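The checks above can be complemented from the client side with a small connection probe. This is a hedged sketch in Python, not part of the ANSR client: probe_tcp is a hypothetical helper that makes one TCP connection attempt to a numeric IP address and port and reports whether it was accepted, refused (ECONNREFUSED), or timed out.

```python
import socket


def probe_tcp(ip, port, timeout=3.0):
    """Try one TCP connection; return 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # ECONNREFUSED: nothing is listening on that address/port
    except socket.timeout:
        return "timeout"   # no answer at all: routing or firewall problem
    except OSError as exc:
        return f"error: {exc}"
```

Call it with the numeric address and port that netstat -nl showed for sprecd. A result of "refused" from the client machine while netstat shows the listener on the server usually points at a mismatch in address or port rather than at the network itself.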

Problem: What if there is noise or there are errors in the results?
Solution: ANSR incorporates several ways of handling uncertainty in the recognition process.

Problem: What if I need a speech recognition system for a language not presently supported by Sprex' currently available models?
Solution: ANSR supports arbitrary languages through HTK model compatibility. This means that HMM acoustic models built using the HTK toolkit from Cambridge University (with few exceptions) can be plugged into ANSR, which will then be able to recognize speech in the language modelled by those HMMs. ANSR customers can develop such models themselves or can arrange with Sprex to develop them by contract. Given customer-supplied transcribed audio data, Sprex can develop acoustic models for the requested language and supply them as part of ANSR.

For telephony/IVR systems, the models should be customized to telephone audio characteristics. Most of our models are for 16kHz sample rate audio ("clean" audio, twice the bandwidth of telephone audio), although we do have models for US English telephony.

Problem: Installing on RedHat 6.x the following error occurs:

   # rpm -i ansr-1.0-3.i586.rpm 
   error: failed dependencies:
	rpmlib(PayloadFilesHavePrefix) <= 4.0-1 is needed by ansr-1.0-3
	rpmlib(CompressedFileNames) <= 3.0.4-1 is needed by ansr-1.0-3

Solution: upgrade rpm, which contains rpmlib, to a more current version. Go to www.rpmfind.net and search for "rpm". Better yet, go to
http://rhn.redhat.com/errata/RHEA-2000-051.html where RedHat talks about how to fix this.

Problem: speech detector is slow.
Solution: Read HAPI Book Chapter 11 section 3.
Set and experiment with the following config parameters:

USESILDET = T
SILGLCHCOUNT = 7
SILSEQCOUNT = 30
SILCNTTIMEOUT = 50

Problem: How can I set up a direct test not using a web browser?
Solution: The IP address/TCP port information for our ANSR NFL demo is:
server IP address: 209.189.196.38, TCP port: 4102
Open the client's socket connection to the server IP address/port using TCP and then send audio data on the socket (16Ksamples/sec, 16 bits each) until the audio sample to be recognised has been sent.
Close the audio socket output, using "shutdown(socket,1);", which closes the socket on the client only in the sending direction, and expect/accept the recognition answer from the server. The answer will be a string.
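The exchange described above can be sketched as a short Python client. Several details here are assumptions that the description does not spell out: that the audio is headerless mono PCM, that the server closes the connection once it has sent its answer, and that the answer is plain ASCII text. The function name recognise_raw_audio is made up for illustration.

```python
import socket


def recognise_raw_audio(host, port, pcm_bytes, timeout=30.0):
    """Send raw 16 kHz, 16-bit audio and return the server's answer string."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(pcm_bytes)
        # Equivalent of shutdown(socket, 1): close only the sending direction,
        # which tells the server that the utterance is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # assumed: server closes after sending the answer
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")
```

For the demo server listed above, this would be called with "209.189.196.38", 4102, and the bytes of a raw audio recording.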

Problem: I have to install a pile of RPMs
Solution: Sure. Assuming all 5 rpm packages are in /var/tmp, do

          cd /var/tmp
          mkdir xxx
          cd xxx
          for i in ../{rpm,popt}-*.rpm ; do
             rpm2cpio $i | cpio -dim
          done
          ./bin/rpm -Uvh ../{rpm,popt}-*.rpm

  Next, do
          rpm --rebuilddb         # <- if this segfaults, do a bugzilla report
  and finally
          rpm --rebuilddb         # just to make sure everything is AOK

  > 
  > All righty.  I didn't have rpm2cpio installed on the problematic
  > machine, so I did the conversion on a working machine, and copied it
  > all back.
  > 
  >         ./bin/rpm -Uvh ../{rpm,popt}-*.rpm
  > 
  > This last step segfaults. :-(
  > 

  Not quite yet. Use rpm2cpio to install directly onto the file system.
  Check the directory perms below /var/tmp/xxx, cpio tends to unpack with
  700
          cd /var/tmp/xxx
          find . -type d -exec chmod 755 {} \;

  and then install the whole tree (check first if you're paranoid)
          cd /var/tmp/xxx
          tar cf - ./usr ./bin ./etc | (cd /; tar xvf -)

  Now try
          rpm --rebuilddb
  to fix your database (if this segfaults, it's bugzilla time).

  Finally upgrade the rpm packages the usual way
          cd /var/tmp
          rpm -Uvh {rpm,popt}-*

Problem: What are some applications ANSR can support?
Solution:
Look at this sample list of applications.
Copyright © 1996-2005 Sprex, Inc. All rights reserved. Sprex, Speech in the Network, TallyGram and ANSR are trademarks of Sprex, Inc.
All other trademarks belong to their respective owners.
Date: May 17, 2008