| What is the LSM? | An internet-based, automatic (or human-aided, as much as you need) track-reading service that makes timestamps for audio recordings of speech, from an as-spoken transcript. The LSM uses the transcription to produce timestamps for all the words and for their component sounds, or "phonemes". |
| Who Uses the LSM? | Animators, Producers of Random-Access Media, Scientists, Video Game Makers. |
| Why Use the LSM? | Animators. To make an
animated movie, first you record the sound tracks, then you
"track read" the recording (i.e., timestamp the words and sounds),
and then specify the lip shape of each character shown
in each frame in the whole movie.
Then the artists can draw the pictures with the lips in the right
positions. The LSM does track reading accurately, automatically, and
inexpensively, replacing or enhancing the current tedious,
error-prone manual services. Random access media means TEXT-SEARCHABLE AUDIO and audio/visual media including radio, TV, speeches, talk-shows, anything. The LSM finds the spoken text where it occurs in the media stream, so you can automatically fast-forward or randomly access particular media segments according to the text spoken in them. Scientists studying language and sound want to locate acoustic events corresponding to words or speech sounds. Multimedia and video game producers want to segment audio recordings to create controlled playback of audio. |
| How to Use the LSM? | Sign up for a free demo, and try it out.
Upload your own sample audio
files and text transcriptions; download the results. Use
it like a regular user. If it works for you, then get
a regular user account and use it whenever you need
to. If you also would like to have Sprex manually proof and correct the output, cut up audio files into appropriate-sized lengths for the LSM to handle, feed in the data for you, send you the results, or do modifications to the software to produce your preferred format for the information, we are at your service: Just send us an email and let us know how we can help you. |
| Benefits: | Accuracy:
On clean recordings
of adult American English speech, the timestamps are accurate to
better than 20 milliseconds (half of the duration of one film frame)
most of the time (70%, in our tests);
while 97% of the timestamps are within 64 ms (1.5 film frames).
This is as accurate as the most careful track
reading (the quality required for feature film animation).
Automatic: It saves lots of work and time! Available: It's up and on the net 24x7 (barring network downtime), so you can use it whenever you like! Fast: It operates at a few times real time, plus network delays, whereas humans may need to work all night long. Consistent: The LSM is never tired, bored, or distracted; its accuracy is the same whether the task is short or long. Inexpensive: The LSM is free to try out, and regular usage costs less than professional human track readers. Fees include a quarterly minimum equivalent to 10 minutes of usage, plus additional pro-rated fees depending on how much more you need to process.
|
| Where is the LSM? | It's located on the web at http://cassandra.sprex.com, or you can contact Sprex via Co-Work or at info@sprex.com or 206-367-7741, for more information. Look at it, use it, tell us what you think! |
Please learn about the LSM and use the free demo or ask us to help you.
We hope you will make the Lip Synch Machine a regular part of your work.
Good lip synching is the key to quality not just in animation,
but also in other audio/visual media, for example in music videos
where poor lip synching can ruin the entire performance. But why is it
that bad lip synching is so noticeable, distracting, and unrealistic,
when the whole point of animation is to be unrealistic?
There are three reasons for it.
First, vision in humans is genetically engineered to pay
attention to the face, and to pay attention to movement. Babies focus
intently on their mothers' faces, for example. Humans as a species
spend a remarkably large fraction of their mental capacities watching
and interpreting each other's faces. In addition, perceptual systems are
especially sensitive to movement, and humans are not much different
from other hunting animals, for which visible movement is instantly
the total focus of concentration.
Since we're genetically engineered to look at faces, and since we're
genetically engineered to focus on movement, then is it a surprise
that we pay so much attention to that part of the face that
does the most movement -- the speaking mouth?! It follows that
humans are genetically engineered to pay special attention to lip
synching.
Lip reading is little short
of miraculous. To infer sound from vision is an amazing feat. Yet
many deaf and some hearing people can do it pretty well. This could
only be true if humans were highly sensitive to lip movements.
Second, sound is the main source of communicated thought and
feeling. Human thought is expressed and communicated from person to
person by another genetically engineered system that uses the medium
of sound, namely language itself. And human emotions are most
powerfully influenced by sound, whether it's emotion carried in music
or emotion carried in a human voice.
Third, to watch is to try to believe. Every audio/visual medium
- whether TV, movies, multimedia, or animation - implicitly asks the
viewer to suspend disbelief and enter into the world represented in
the medium. Even an animated world seems real in the mind of a
captivated viewer, otherwise it would be only noise and colors on a
screen. The animated world is a more colorful, simple, and understandable world, no
doubt, but it still holds a certain kind of reality to its viewers.
These three reasons determine what happens to the viewer/listener
when the voice moves and the lips don't: The human voice is
visibly fake.
If we all focus on lip movements, and the lips are out of synch with
the sound, then when we watch, we can't believe the sound. But sound
is what carries emotion! That's why people care about it. If the sound is made
unbelievable by the most focussed-on part of the picture, the lips,
then the viewer's suspension of disbelief is ruined -- the whole
reason for success in audio/visual media. We must
disbelieve, because our perceptual system requires it, and we
care about it, because our emotions are so strongly affected by
sound.
This is why badly done lip synching is disturbing, and even offensive.
It's the reason that accurate lip synching is so important.
Another example is a 1000-Hz tone, which is often used as a start-time
marker in audio tracks for animation. This tone is not a linguistic
sound -- words aren't composed of 1000-Hz tones strung together with
other sounds. However, this tone is actually not too dissimilar from
an /ih/ sound (as in "bit"), and if you tell it to, the LSM can
accurately locate that tone in the waveform, recognizing it as an
/ih/. The point is, once again, that it is good at linguistic sounds,
which is both its capability, and its limitation.
To work with these limitations, if possible you should do your
spoken-audio recording on a separate track, and you should lip-synch
the spoken audio tracks BEFORE they are mixed in with music or audio
special effects.
The tremendously favorable pricing of the LSM also provides savings
for television cartoons, which are generally less expensive, more
quickly done, and commonly less accurate than the track reading
typically used for feature films. In short, if you do track reading
by hand at all, using the LSM instead should be as accurate or more
accurate and much less expensive to use.
An additional savings is that animation production no longer needs to
use film, but can be done entirely on computer, because manual track
reading is the only part of the process that until now still required
putting the audio tracks onto film. Now that computer-based track
reading is up to speed, that expensive and time-consuming part of the
process can be avoided, which saves you both days and dollars!
If you require manual checking and correction of LSM output, cutting
up audio files, or doing custom programming, we are at your service.
Our billing rate for registered users is $40 per labor hour. Billing
for labor hours can be applied to the quarterly minimum
enrollment fee, also.
A quarter is three calendar months. The quarterly payment schedule
begins on the date of registration as a regular user and ends at the
end of the quarter during which a subscription is terminated. Please,
no partial-quarter refunds; that's why it's called a minimum
fee.
Why people care about lip synch accuracy.
Facial Animation
When Fred Flintstone says "Yabba-Dabba-Doo", the Lip Synch Machine will
tell you exactly which video frames should show his face with his lips
shut (the frames with the /b/ sounds). The quality of the Lip Synch Machine
is similar to that of the most careful and detailed work by
professional and highly experienced human track readers, at the
level of accuracy expected in feature film production.
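As a rough illustration of the Fred Flintstone example, the sketch below maps phoneme timestamps onto film frames and lists the frames where the lips should be shut. It is only a sketch under stated assumptions: the tuple layout, the timestamp values, the 24 frames-per-second rate, and the set of lips-shut phonemes are all invented for illustration and are not the LSM's actual output format.

```python
# Hypothetical sketch: none of these names or values come from the LSM itself.
CLOSED_LIP_PHONEMES = {"b", "p", "m"}  # assumed set of lips-shut sounds


def frames_overlapping(start_ms, end_ms, fps=24):
    """Return 0-based indices of frames whose display time overlaps [start_ms, end_ms)."""
    first = start_ms * fps // 1000
    last = (end_ms * fps - 1) // 1000
    return list(range(first, last + 1))


# Invented timestamps (phoneme, start_ms, end_ms) for "Yabba-Dabba-Doo".
segments = [
    ("y", 0, 80), ("ae", 80, 200), ("b", 200, 290), ("ah", 290, 380),
    ("d", 380, 440), ("ae", 440, 520), ("b", 520, 600), ("ah", 600, 690),
    ("d", 690, 750), ("uw", 750, 1000),
]

# Collect every frame that overlaps a lips-shut phoneme.
closed_frames = sorted({
    frame
    for phoneme, start_ms, end_ms in segments
    if phoneme in CLOSED_LIP_PHONEMES
    for frame in frames_overlapping(start_ms, end_ms)
})
print(closed_frames)
```

Integer millisecond arithmetic is used here so that frame boundaries are not disturbed by floating-point rounding.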
Text-searchable media
Given a
transcribed audio segment you can select a portion of the text and
find or play back the corresponding bit of audio using the LSM. This
enables content-based access to media, as in on-demand radio,
searchable speeches, random-access TV news, etc. The
Lip Synch Machine can synchronize the transcription of the speech or the news
segment with the audio record. This gives you the information of
exactly what piece of audio contains the spoken form of any selected
piece of the text, from words to paragraphs to whole segments. Audio
indexing becomes a matter of text processing.
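To make the idea of audio indexing as text processing concrete, here is a minimal sketch that searches word-level timestamps for a phrase and returns the matching time spans. The `(word, start_s, end_s)` tuples, the function name `find_phrase`, and the sample data are assumptions made for illustration; the actual LSM output format may differ.

```python
# Hypothetical sketch: the data layout is assumed, not taken from the LSM.
def find_phrase(words, phrase):
    """Return (start_s, end_s) for every place the phrase occurs in the transcript.

    words is a list of (word, start_s, end_s) tuples in spoken order.
    """
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [word.lower() for word, _, _ in words]
    spans = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            first_word = words[i]
            last_word = words[i + len(target) - 1]
            spans.append((first_word[1], last_word[2]))
    return spans


# Invented word timestamps for a short news introduction.
transcript = [
    ("good", 0.0, 0.3), ("evening", 0.3, 0.8), ("here", 1.0, 1.2),
    ("is", 1.2, 1.3), ("the", 1.3, 1.4), ("news", 1.4, 1.9),
]
print(find_phrase(transcript, "the news"))
```

A player could then seek to the returned start time and play until the end time.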
Science
Linguists, phoneticians, speech scientists,
speech recognition and synthesis engineers, psychologists, etc., can
use the LSM to study human, spoken language. They use it to locate
acoustic events of interest from either experimental or naturalistic
studies. With the LSM's timing information, they can look at just the
relevant parts of the signal, saving time and often improving accuracy
in an otherwise tedious, painstaking, and error-prone manual process
of making signal measurements.
Video Game Production
Video games with lots of spoken audio often combine word- or
phrase-sized audio segments together. These make a running commentary
or dialog about the events in the game. Segmenting the words is an
important and otherwise painstaking task in this process. The LSM
provides a good, automatic, first-pass segmentation, and takes much of
the tedium out of this process.
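One way such a first-pass segmentation might be used is to cut a recording into clips at the reported times. The sketch below does this with Python's standard `wave` module on in-memory WAV data. The function name `cut_segment` and the millisecond start and end arguments are hypothetical; they stand in for whatever timestamps you obtain.

```python
import io
import wave


def cut_segment(wav_bytes, start_ms, end_ms):
    """Return a new WAV file (as bytes) holding only [start_ms, end_ms) of the input."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = start_ms * rate // 1000  # first sample frame to keep
        last = end_ms * rate // 1000     # one past the last sample frame
        src.setpos(first)
        frames = src.readframes(last - first)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and rate
        dst.writeframes(frames)  # the frame count in the header is corrected on close
    return out.getvalue()
```

Since the text above describes the segmentation as a first pass, clips cut this way would still deserve a listen before they are concatenated in a game.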
Overview
The LSM first takes the text transcription and
determines a pronunciation for each word. If it doesn't know the
pronunciation, the LSM will ask you for it. Then it uses Sprex's
proprietary, special-purpose, statistical signal processing technology
to identify and locate the words and their component sounds within the
audio file.
"Phonemes"
How do we define a pronunciation,
then? Words are written with letters, but they are spoken with
linguistic sounds -- "phonemes". Phonemes are simply the sounds that
make up words. While "ax" has two letters and "Veatch" has six, both
have exactly three phonemes. Other sounds, which aren't part of
words, are not phonemes, whether human-produced, like burps, or
environmental, like kitchen clatter.
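The difference between letters and phonemes, and the step of asking for unknown pronunciations, can be sketched with a toy pronunciation dictionary. Everything here is an assumption for illustration: the dictionary, the phoneme symbols, and in particular the rendering of "Veatch" as /v iy ch/ are guesses consistent with the three-phoneme count above, not the LSM's own lexicon.

```python
# Hypothetical toy lexicon; the real LSM dictionary and symbol set are not shown here.
PRONUNCIATIONS = {
    "ax": ["ae", "k", "s"],       # two letters, three phonemes
    "veatch": ["v", "iy", "ch"],  # six letters, three phonemes (assumed rendering)
}


def look_up(transcript):
    """Split a transcript into known pronunciations and words that need one supplied."""
    known, unknown = [], []
    for word in transcript.lower().split():
        if word in PRONUNCIATIONS:
            known.append((word, PRONUNCIATIONS[word]))
        else:
            unknown.append(word)
    return known, unknown


known, unknown = look_up("Veatch ax yabba")
for word, phones in known:
    print(f"{word}: {len(word)} letters, {len(phones)} phonemes")
print("needs a pronunciation:", unknown)
```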
Sounds vs. Phonemes
Since the LSM works at the level of phonemes rather than words, it
only knows how to find phonemes in the waveform. That's its sole
capability: locating phonemes. For example, the way it finds the end
of a word, is just by finding the end of the last phoneme in the word
- it's just a phoneme analysis.
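The way word boundaries fall out of phoneme boundaries can be sketched as follows: a word starts where its first phoneme starts and ends where its last phoneme ends. The data layout and the function name `word_spans` are hypothetical, chosen only to illustrate the principle.

```python
# Hypothetical sketch of deriving word times from phoneme times.
def word_spans(pronunciations, segments):
    """Derive (word, start_ms, end_ms) from phoneme segments.

    pronunciations: list of (word, [phonemes]) in spoken order.
    segments: list of (phoneme, start_ms, end_ms) in the same order.
    """
    spans = []
    i = 0
    for word, phones in pronunciations:
        if not phones:
            raise ValueError(f"no phonemes given for {word!r}")
        chunk = segments[i:i + len(phones)]
        if [p for p, _, _ in chunk] != list(phones):
            raise ValueError(f"segments do not match pronunciation of {word!r}")
        # Word start = start of first phoneme; word end = end of last phoneme.
        spans.append((word, chunk[0][1], chunk[-1][2]))
        i += len(phones)
    return spans


# Invented phoneme timestamps for the single word "ax".
segments = [("ae", 100, 220), ("k", 220, 290), ("s", 290, 410)]
print(word_spans([("ax", ["ae", "k", "s"])], segments))
```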
Who cares about phonemes?
Analyzing phonemes means that the LSM does
not recognize inbreaths, outbreaths, laughter, burps, grunts,
bangs, pops, musical notes, kitchen clatter, or other non-linguistic
sounds. These sounds are not linguistic sounds or phonemes -- they
don't make up words -- so the LSM doesn't recognize them. Instead, it
is likely to recognize them as "silence". Is the creak of a door a
phoneme? No, so the LSM won't recognize it.
Sounds as Phonemes
However, if such a sound is close enough to English sounds that
you can give it a reasonably accurate rendering in phonemes, then the
LSM is likely to find it successfully and accurately in the waveform.
For example, the sound of a breath, or of blowing out a candle, is not
part of any word, but it still sounds identical to an /h/, so if you
tell the LSM to look for an /h/, it'll locate it for you just fine.
What Resolution?
The Lip Synch Machine will locate for you, in time, the points in the
audio file where each spoken word and each pronounced sound
begins and ends, with a resolution of typically 16-64 milliseconds -
that is, if the synchronization returns a valid result.
Accuracy statistics
On clearly spoken adult speech
by American English speakers with low background noise, test results
show that 70% of the LSM's segmentation points are accurate to within
16ms; 90% are within 32ms, and 97% are within 64 ms. These
percentages were calculated by comparing many thousands of
segmentation points in a large and carefully hand-segmented speech
database against the automatic segmentation of the LSM.
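A comparison of this kind can be sketched in a few lines: pair each automatic segmentation point with its hand-marked counterpart and count how many fall within a tolerance. The numbers below are a small invented toy set, not Sprex's test data, and the function name `fraction_within` is hypothetical.

```python
# Hypothetical sketch; the toy data below is invented and unrelated to the real tests.
def fraction_within(auto_ms, hand_ms, tolerance_ms):
    """Fraction of automatic points lying within tolerance_ms of the hand-marked points."""
    if len(auto_ms) != len(hand_ms) or not auto_ms:
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(a - h) for a, h in zip(auto_ms, hand_ms)]
    return sum(e <= tolerance_ms for e in errors) / len(errors)


hand = [100, 250, 400, 620, 800, 990, 1200, 1420, 1600, 1810]
offsets = [5, -10, 12, 3, -16, 8, 30, -20, 64, 100]
auto = [h + o for h, o in zip(hand, offsets)]

for tolerance in (16, 32, 64):
    print(f"within {tolerance} ms: {fraction_within(auto, hand, tolerance):.0%}")
```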
What about errors?
On other kinds of speech including children's voices,
non-natively-spoken American English speech, non-English speech, and
non-speech sounds (kitchen clatter, etc.), the accuracy of the LSM may
be degraded to varying degrees. This is because the system is only as
good as the data that it was trained on, and the results will be
optimal only to the extent that that training data is representative
of your data. The training data covers adult American speakers pretty
well, but other kinds of speech less well. Now, as the voices of
children and baby dinosaurs and other unusual voices are gradually
collected over time through operation of the LSM, new acoustic models
can be constructed which will have much better accuracy on those kinds
of data. If you are getting poor results, this may be the issue, and
the best solution will be if you can help us by providing additional
training data to adapt our statistical models to include your kind of speech data. A
short-term solution is to try submitting a recording at 22050
samples/second, which makes the system think that the speaker is
speaking about a third slower and has a vocal tract about a third
bigger. This has been found to reduce errors on kids' speech by
about half in some cases.
How good is that?
For animation
Under proper conditions, the accuracy or resolution of the LSM is
excellent for animation purposes, since one film frame is usually
about 1/24 second or 42 milliseconds: the vast majority of sounds will
be located accurately to within a frame or two.
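The arithmetic behind "within a frame or two" can be checked directly. The sketch below converts the 16, 32, and 64 millisecond tolerances quoted on this page into film frames, assuming 24 frames per second; the function name is hypothetical.

```python
FILM_FPS = 24               # frames per second, as assumed in the text above
FRAME_MS = 1000 / FILM_FPS  # about 41.7 ms per frame


def tolerance_in_frames(tolerance_ms, frame_ms=FRAME_MS):
    """Express a timing tolerance in units of film frames."""
    return tolerance_ms / frame_ms


for tolerance_ms in (16, 32, 64):
    print(f"{tolerance_ms} ms = {tolerance_in_frames(tolerance_ms):.2f} frames")
```

At this frame rate, 64 ms is roughly one and a half frames, which matches the "1.5 film frames" figure given in the benefits summary.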
For audio indexing
The resolution is more than sufficient for audio indexing, since plus
or minus an entire second would probably still be acceptable for this
application.
For science
The resolution is not perfect for scientific purposes or for
segmenting chunks to concatenate together later in various
combinations. In these applications, a resolution of better than 10ms
is in some cases necessary (that's a quarter of a typical film frame,
or around four times as fast as our eyes can resolve the flicker in
movie images). For these purposes, the LSM should be considered not
as a complete solution, but as providing a good first-pass
segmentation, which should be examined for errors and adjusted where
necessary.
Limitations and Disclaimer
The LSM vs. Human Performance
Fees
Modified: February 1, 2005