How fast is speech recognition?

This is more than one question.  First, one must distinguish
Computational-Real-Time (CRT) from Impressionistic-Real-Time (IRT).
CRT means the processing time equals the duration of the speech
input, that is, a multiplier of 1; a 2*CRT decoder takes twice as
long as the speech lasts.  IRT is the user's impression of not having
to wait, which depends on intelligent user-interface design,
stream-oriented processing, and CRT.  If a system keeps the user
waiting 20 seconds before responding to a 20-second spoken turn, then
even though it achieves CRT, it feels slow, so it is not IRT.
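
As a small illustration of the CRT arithmetic above, the multiplier is just processing time divided by speech duration. The function name below is a hypothetical one chosen for this sketch, not part of any Sprex product.

```python
def crt_multiplier(processing_seconds: float, speech_seconds: float) -> float:
    """Return processing time as a multiple of the speech duration.

    1.0 means CRT; 2.0 means a 2*CRT decoder; 0.5 means a .5*CRT decoder.
    (Hypothetical helper for illustration only.)
    """
    if speech_seconds <= 0:
        raise ValueError("speech_seconds must be positive")
    return processing_seconds / speech_seconds


# A 20-second turn decoded in 20 seconds is exactly CRT.
print(crt_multiplier(20.0, 20.0))
# The same turn decoded in 40 seconds is 2*CRT, and in 10 seconds is .5*CRT.
print(crt_multiplier(40.0, 20.0), crt_multiplier(10.0, 20.0))
```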
 
An important influence on IRT is whether the system does file (batch)
processing or streaming processing.  File processing requires that the
speech waveform input be complete before any further processing
starts.  Streaming processing starts decoding blocks of waveform as
they become available.
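
The difference between the two modes can be sketched with a simplified timing model. It assumes that speech arrives in real time, that the decoder spends (multiplier * block length) on each block, and that blocks are decoded one after another; the function names and the 0.5-second block size are assumptions made for this illustration.

```python
def batch_wait(speech_s: float, multiplier: float) -> float:
    """Wait after the speaker stops when decoding starts only at the end."""
    return multiplier * speech_s


def streaming_wait(speech_s: float, multiplier: float, block_s: float = 0.5) -> float:
    """Wait after the speaker stops when blocks are decoded as they arrive."""
    if block_s <= 0:
        raise ValueError("block_s must be positive")
    decoder_free_at = 0.0  # time at which the decoder finishes its current work
    block_start = 0.0
    while block_start < speech_s:
        block_end = min(block_start + block_s, speech_s)
        # A block can be decoded only once it has been fully spoken
        # and the decoder has finished the previous block.
        start = max(decoder_free_at, block_end)
        decoder_free_at = start + multiplier * (block_end - block_start)
        block_start = block_end
    return decoder_free_at - speech_s


# 20-second turn, decoder running at exactly CRT:
print(batch_wait(20.0, 1.0))      # the whole turn is decoded after the speaker stops
print(streaming_wait(20.0, 1.0))  # only the final block remains to be decoded
```

Under this model a CRT decoder in streaming mode leaves only the last block to process once the speaker stops, while a decoder slower than CRT falls steadily behind and streaming recovers only part of the delay.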
 
IRT can be improved by various interface-design techniques, such as
playing a canned part of a response (which doesn't depend on the
result of the decoding) after the speaker's turn is over, and during
the decoding process. For example, here's a phrase that can buy two
seconds of extra time for a slow decoder: "Thank you.  You will now be
connected to:"
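
The effect of such a canned phrase can be expressed as simple arithmetic: the silence the user actually hears is whatever decoding time remains after the prompt has finished playing. This is a minimal sketch; the function and parameter names are hypothetical.

```python
def perceived_wait(decode_remaining_s: float, prompt_s: float) -> float:
    """Silence the user experiences after the turn ends.

    decode_remaining_s: decoding time still needed when the speaker stops.
    prompt_s: duration of a canned prompt played while decoding continues.
    """
    return max(0.0, decode_remaining_s - prompt_s)


# A two-second prompt fully hides two seconds of remaining decoding.
print(perceived_wait(2.0, 2.0))
# With three seconds of decoding left, one second of silence remains.
print(perceived_wait(3.0, 2.0))
```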
 
"How fast?" can only be answered precisely relative to CRT, not IRT,
because the interface designer can take a 2*CRT decoder (on given
hardware) and make it seem immediately responsive, and one could take
a .5*CRT decoder (say, on faster hardware) and make it seem slow.
Obviously, interface design is in the application programmer's hands.
 
CRT depends on processor speed, grammar complexity, pruning levels,
clarity and speed of the speaker's speech, etc.  For example,
two labs in 1999 produced benchmarks like the following.
 
Task:               Wall-Street-Journal open dictation
Vocabulary size:    5000 words
Grammar Perplexity: 100

Lab       CPU              RAM       CRT multiplier
BBN       4-CPU Sparc 20   256-512M  1-3
Entropic  Pentium Pro 200  70M       1

Systems can trade memory for increased speed by storing large tables
of artfully precomputed information.  Speaker-dependent (trained)
systems can also run faster.

At the simple-task end of the scale, for example with connected-digit recognition, computing resources of about 60 MIPS / 10-15 MFLOPS (on the scale of an Intel 486 CPU, using about a megabyte of RAM) give a response time of about 1*CRT (again, if you speak clearly and not too fast). With today's much more powerful CPUs, response times faster than 1*CRT are easily possible (depending on the grammar and word models).

The ANSR (Active Networked Speech Recognition) system from Sprex uses pipelined, distributed processing of speech input so that the recognizer starts its work right away, perhaps long before the speaker completes the conversational turn. With CRT<1, pipelined speech processing brings the impressionistic response times down to user-friendly levels.


Copyright © 1996-2005 Sprex, Inc. All rights reserved.
Date: July 25, 2008