 

All about the LSM™

 

Introduction

What is the LSM? An internet-based, automatic (or human-aided, as much as you need) track-reading service that makes timestamps for audio recordings of speech, from an as-spoken transcript. The LSM uses the transcription to produce timestamps for all the words and for their component sounds, or "phonemes".
Who Uses the LSM? Animators, Producers of Random-Access Media, Scientists, Video Game Makers.
Why Use the LSM? Animators. To make an animated movie, first you record the sound tracks, then you "track read" the recording (i.e., timestamp the words and sounds), and then specify the lip shape of each character shown in each frame in the whole movie. Then the artists can draw the pictures with the lips in the right positions. The LSM does track reading accurately, automatically, and inexpensively, replacing or enhancing the current tedious, error-prone manual services.

Random access media means TEXT-SEARCHABLE AUDIO and audio/visual media including radio, TV, speeches, talk-shows, anything. The LSM finds the spoken text where it occurs in the media stream, so you can automatically fast-forward or randomly access particular media segments according to the text spoken in them.

Scientists studying language and sound want to locate acoustic events corresponding to words or speech sounds.

Multimedia and video game producers want to segment audio recordings to create controlled playback of audio.

How to Use the LSM? Sign up for a free demo, and try it out. Upload your own sample audio files and text transcriptions; download the results. Use it like a regular user. If it works for you, then get a regular user account and use it whenever you need to.

If you also would like to have Sprex manually proof and correct the output, cut up audio files into appropriate-sized lengths for the LSM to handle, feed in the data for you, send you the results, or do modifications to the software to produce your preferred format for the information, we are at your service: Just send us an email and let us know how we can help you.

Benefits: Accuracy: On clean recordings of adult American English speech, the timestamps are accurate to better than 20 milliseconds (half of the duration of one film frame) most of the time (70%, in our tests); while 97% of the timestamps are within 64 ms (1.5 film frames). This is as accurate as the most careful track reading (the quality required for feature film animation).

Automatic: It saves lots of work and time!

Available: It's up and on the net 24x7 (barring network downtime), so you can use it whenever you like!

Fast: It operates at a few times real time, plus network delays, whereas human track readers can take all night.

Consistent: The LSM is never tired, bored, or distracted; its accuracy is the same whether the task is short or long.

Inexpensive: The LSM is free to try out, and it costs less for regular usage than professional human track readers. Fees include a quarterly minimum equivalent to 10 minutes of usage, plus additional pro-rated fees depending on how much more you need to process.

Where is the LSM? It's located on the web at http://cassandra.sprex.com, or you can contact Sprex via Co-Work or at info@sprex.com or 206-367-7741, for more information. Look at it, use it, tell us what you think!

 

Please learn about the LSM and use the free demo or ask us to help you.
We hope you will make the Lip Synch Machine a regular part of your work.

 

 

Why people care about lip synch accuracy.

Lip synch accuracy can make or break an animated production. High quality lip synching adds reality and meaningfulness to the natural focus of the viewer's gaze: the high-speed lip and face movements associated with speech.

Good lip synching is the key to quality not just in animation, but also in other audio/visual media, for example in music videos, where poor lip synching can ruin the entire performance. But why is it that bad lip synching is so noticeable, distracting, and unrealistic, when the whole point of animation is to be unrealistic? There are three reasons for it.

First, vision in humans is genetically engineered to pay attention to the face, and to pay attention to movement. Babies focus intently on their mothers' faces, for example. Humans as a species spend a remarkably large fraction of their mental capacities watching and interpreting each other's faces. In addition, perceptual systems are especially sensitive to movement, and humans are not much different from other hunting animals, for which visible movement is instantly the total focus of concentration.

Since we're genetically engineered to look at faces, and since we're genetically engineered to focus on movement, then is it a surprise that we pay so much attention to that part of the face that does the most movement -- the speaking mouth?! It follows that humans are genetically engineered to pay special attention to lip synching.

Lip reading is little short of miraculous. To infer sound from vision is an amazing feat. Yet many deaf and some hearing people can do it pretty well. This could only be true if humans were highly sensitive to lip movements.

Second, sound is the main source of communicated thought and feeling. Human thought is expressed and communicated from person to person by another genetically engineered system that uses the medium of sound, namely language itself. And human emotions are most powerfully influenced by sound, whether it's emotion carried in music or emotion carried in a human voice.

Third, to watch is to try to believe. Every audio/visual medium - whether TV, movies, multimedia, or animation - implicitly asks the viewer to suspend disbelief and enter into the world represented in the medium. Even an animated world seems real in the mind of a captivated viewer, otherwise it would be only noise and colors on a screen. The animated world is a more colorful, simple, and understandable world, no doubt, but it still holds a certain kind of reality to its viewers.

These three reasons determine what happens to the viewer/listener when the voice moves and the lips don't: The human voice is visibly fake. If we all focus on lip movements, and the lips are out of synch with the sound, then when we watch, we can't believe the sound. But sound is what carries emotion! That's why people care about it. If the sound is made unbelievable by the most focused-on part of the picture, the lips, then the viewer's suspension of disbelief -- the whole reason for success in audio/visual media -- is ruined. We must disbelieve, because our perceptual system requires it, and we care about it, because our emotions are so strongly affected by sound.

This is why badly done lip synching is disturbing, and even offensive. It's the reason that accurate lip synching is so important.

 

 

Applications for the LSM

Facial Animation

When Fred Flintstone says "Yabba-Dabba-Doo", the Lip Synch Machine will tell you exactly which video frames should show his face with his lips shut (the frames with the /b/ sounds). The quality of the Lip Synch Machine is similar to that of the most careful and detailed work by professional and highly experienced human track readers, at the level of accuracy expected in feature film production.
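As a rough illustration of that last step, the sketch below converts phoneme timestamps into film frame numbers at 24 frames per second and picks out the closed-lip frames. The timestamps, the phoneme labels, and the tuple layout are all invented for the example; they are not the LSM's actual output format.

```python
FPS = 24  # assumed film frame rate (one frame is about 42 ms)

def frames_for_segment(start_ms, end_ms, fps=FPS):
    """Return 0-based frame indices overlapping the interval [start_ms, end_ms)."""
    first = start_ms * fps // 1000
    last = (end_ms * fps - 1) // 1000  # end is exclusive
    return list(range(first, last + 1))

# Hypothetical phoneme timestamps in milliseconds for "Yabba-Dabba-Doo".
segments = [
    ("y", 0, 80), ("ae", 80, 200), ("b", 200, 270), ("ax", 270, 350),
    ("d", 350, 410), ("ae", 410, 500), ("b", 500, 560), ("ax", 560, 640),
    ("d", 640, 700), ("uw", 700, 950),
]

# Frames in which the lips should be drawn shut: every frame touched by a /b/.
closed_lip_frames = sorted(
    {frame
     for label, start, end in segments if label == "b"
     for frame in frames_for_segment(start, end)}
)
print(closed_lip_frames)
```

Integer millisecond arithmetic is used here only to keep frame boundaries exact; any consistent time unit would do.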

 

Text-searchable media

Given a transcribed audio segment you can select a portion of the text and find or play back the corresponding bit of audio using the LSM. This enables content-based access to media, as in on-demand radio, searchable speeches, random-access TV news, etc. The Lip Synch Machine can synchronize the transcription of the speech or the news segment with the audio record. This gives you the information of exactly what piece of audio contains the spoken form of any selected piece of the text, from words to paragraphs to whole segments. Audio indexing becomes a matter of text processing.
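A minimal sketch of how such an index might be queried, assuming word-level timestamps are already available as (word, start, end) tuples in seconds. That layout, the sample data, and the function name find_phrase are hypothetical and used only for illustration.

```python
def find_phrase(words, phrase):
    """Return (start_s, end_s) spans of audio where `phrase` is spoken.

    `words` is a list of (word, start_s, end_s) tuples in spoken order.
    """
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [word.lower() for word, _, _ in words]
    spans = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            first_word = words[i]
            last_word = words[i + len(target) - 1]
            spans.append((first_word[1], last_word[2]))
    return spans

# Invented timestamps for a short news introduction.
words = [
    ("good", 0.00, 0.25), ("evening", 0.25, 0.70), ("and", 0.70, 0.82),
    ("welcome", 0.82, 1.30), ("to", 1.30, 1.40), ("the", 1.40, 1.50),
    ("evening", 1.50, 1.95), ("news", 1.95, 2.40),
]

print(find_phrase(words, "evening news"))
```

The search itself is ordinary text matching; the timestamps only come into play when the matching words are mapped back to a stretch of audio.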

 

Science

Linguists, phoneticians, speech scientists, speech recognition and synthesis engineers, psychologists, etc., can use the LSM to study human, spoken language. They use it to locate acoustic events of interest from either experimental or naturalistic studies. With the LSM's timing information, they can look at just the relevant parts of the signal, saving time and often improving accuracy in an otherwise tedious, painstaking, and error-prone manual process of making signal measurements.

 

Video Game Production

Video games with lots of spoken audio often combine word- or phrase-sized audio segments together. These make a running commentary or dialog about the events in the game. Segmenting the words is an important and otherwise painstaking task in this process. The LSM provides a good, automatic, first-pass segmentation, and takes much of the tedium out of this process.
 

 

How it works

Overview

The LSM first takes the text transcription and determines a pronunciation for each word. If it doesn't know the pronunciation, the LSM will ask you for it. Then it uses Sprex's proprietary, special-purpose, statistical signal processing technology to identify and locate the words and their component sounds within the audio file.

 

"Phonemes"

How do we define a pronunciation, then? Words are written with letters, but they are spoken with linguistic sounds -- "phonemes". Phonemes are simply the sounds that make up words. While "ax" has two letters and "Veatch" has six, both have exactly three phonemes. Other sounds, which aren't part of words, are not phonemes, whether human-produced, like burps, or environmental, like kitchen clatter.

 

Sounds vs. Phonemes

Since the LSM works at the level of phonemes rather than words, it only knows how to find phonemes in the waveform. That's its sole capability: locating phonemes. For example, it finds the end of a word simply by finding the end of the last phoneme in that word; it's all phoneme analysis.

 

Who cares about phonemes?

Analyzing phonemes means that the LSM does not recognize inbreaths, outbreaths, laughter, burps, grunts, bangs, pops, musical notes, kitchen clatter, or other non-linguistic sounds. These sounds are not linguistic sounds or phonemes -- they don't make up words -- so the LSM doesn't recognize them. Instead, it is likely to recognize them as "silence". Is the creak of a door a phoneme? No, so the LSM won't recognize it.

 

Sounds as Phonemes

However, if such a sound is close enough to English sounds that you can give it a reasonably accurate rendering in phonemes, then the LSM is likely to find it successfully and accurately in the waveform. For example, the sound of a breath, or of blowing out a candle, is not part of any word, but it still sounds identical to an /h/, so if you tell the LSM to look for an /h/, it'll locate it for you just fine.

Another example is a 1000-Hz tone, which is often used as a start-time marker in audio tracks for animation. This tone is not a linguistic sound -- words aren't composed of 1000-Hz tones strung together with other sounds. However, this tone is actually not too dissimilar from an /ih/ sound (as in "bit"), and if you tell it to, the LSM can accurately locate that tone in the waveform, recognizing it as an /ih/. The point is, once again, that the LSM is good at linguistic sounds, which is both its capability and its limitation.

To work with these limitations, if possible you should do your spoken-audio recording on a separate track, and you should lip-synch the spoken audio tracks BEFORE they are mixed in with music or audio special effects.

 

 

Accuracy

 

What Resolution?

The Lip Synch Machine will locate for you, in time, the points in the audio file where each spoken word and where each pronounced sound begins and ends, with a resolution of typically 16-64 milliseconds - that is, if the synchronization returns a valid result.

 

Accuracy statistics

On clearly spoken adult speech by American English speakers with low background noise, test results show that 70% of the LSM's segmentation points are accurate to within 16 ms, 90% are within 32 ms, and 97% are within 64 ms. These percentages were calculated by comparing many thousands of segmentation points in a large and carefully hand-segmented speech database against the automatic segmentation of the LSM.
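The kind of comparison described above can be sketched as follows: pair each hand-placed segmentation point with its automatic counterpart and count how many fall within a given tolerance. The function name and the ten sample points are made up for illustration; they are not drawn from the test database and do not reproduce the figures quoted above.

```python
def fraction_within(reference_ms, automatic_ms, tolerance_ms):
    """Fraction of automatic points within tolerance_ms of the paired reference point."""
    if len(reference_ms) != len(automatic_ms) or not reference_ms:
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(ref - auto) for ref, auto in zip(reference_ms, automatic_ms)]
    return sum(err <= tolerance_ms for err in errors) / len(errors)

# Invented toy data: hand-segmented vs. automatic boundaries, in milliseconds.
reference = [100, 250, 400, 520, 700, 860, 1000, 1200, 1350, 1500]
automatic = [103, 240, 415, 515, 712, 852, 1018, 1180, 1380, 1570]

for tolerance in (16, 32, 64):
    share = fraction_within(reference, automatic, tolerance)
    print(f"within {tolerance} ms: {share:.0%}")
```

On your own material, a small hand-checked sample run through a comparison like this is one way to judge whether the accuracy is adequate.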

 

What about errors?

On other kinds of speech, including children's voices, non-natively-spoken American English speech, non-English speech, and non-speech sounds (kitchen clatter, etc.), the accuracy of the LSM may be degraded to varying degrees. This is because the system is only as good as the data that it was trained on, and the results will be optimal only to the extent that the training data is representative of your data. The training data covers adult American speakers pretty well, but other kinds of speech less well. As the voices of children and baby dinosaurs and other unusual voices are gradually collected over time through operation of the LSM, new acoustic models can be constructed which will have much better accuracy on those kinds of data. If you are getting poor results, this may be the issue, and the best solution is for you to help us by providing additional training data, so that we can adapt our statistical models to include your kind of speech data. A short-term solution is to try submitting a recording at 22050 samples/second, which makes the system think that the speaker is speaking about a third slower and has a vocal tract about a third bigger. This has been found to reduce errors on kids' speech by about half in some cases.
 

How good is that?

For animation
Under proper conditions, the accuracy or resolution of the LSM is excellent for animation purposes, since one film frame is usually about 1/24 second or 42 milliseconds: the vast majority of sounds will be located accurately to within a frame or two.

For audio indexing
The resolution is more than sufficient for audio indexing, since plus or minus an entire second would probably still be acceptable for this application.

For science
The resolution is not perfect for scientific purposes or for segmenting chunks to concatenate together later in various combinations. In these applications, a resolution of better than 10ms is in some cases necessary (that's a quarter of a typical film frame, or around four times as fast as our eyes can resolve the flicker in movie images). For these purposes, the LSM should be considered not as a complete solution, but as providing a good first-pass segmentation, which should be examined for errors and adjusted where necessary.

 

 

Limitations and Disclaimer

While the performance of the LSM has been studied in certain situations and found to have a relatively consistent, statistically-describable accuracy level, these accuracy levels depend on having accurate, literal, English transcriptions and on the input audio recordings having acoustic patterns similar to those the LSM was trained on. Any given segmentation point cannot be guaranteed to have any particular accuracy both because of the statistical nature of LSM's performance characteristics, and because your data may be different. It is your responsibility to verify that the accuracy and performance of the LSM are acceptable for your particular audio data, your application, and your needs. Your opportunity to do a limited-time free demonstration of the LSM helps you to fulfill this responsibility, by determining for yourself to what degree it is satisfactory. However, users acknowledge, via the LSM License Agreement, that the LSM provides no performance guarantee. See how it works for you, and use it if it does. Also, your comments are welcome!
 

 

The LSM vs. Human Performance

The automatic service provided by the LSM offers benefits of cost, accuracy, speed, and flexibility as compared with manual segmentation. The LSM is priced far below the comparable high-end service provided by human professional "track readers". Feature-film track readers charge $0.60 to $1.00 per film foot (equal to 16 frames, or 2/3 of a second, so that's $54 to $90 per minute), and work for hours to "read" just a few minutes of audio. By contrast, automatic LSM fees top out at $25/minute.

The tremendously favorable pricing of the LSM also provides savings for television cartoons, which are generally less expensive, more quickly done, and commonly less accurate than the track reading typically used for feature films. In short, if you do track reading by hand at all, using the LSM instead should be as accurate or more accurate and much less expensive to use. An additional savings is that animation production no longer needs to use film, but can be done entirely on computer, because manual track reading is the only part of the process that until now still required putting the audio tracks onto film. Now that computer-based track reading is up to speed, that expensive and time-consuming part of the process can be avoided, which saves you both days and dollars!

 

 

Fees

Registered regular users of the Lip Synch Machine are charged a quarterly minimum enrollment fee of $250. The rate for usage is $25/minute, so that the minimum fee can be used to cover up to 10 minutes of lip assignment, track reading, and/or indexing. Additional usage after the quarterly minimum fee is used up is prorated at the same usage rate of $25/minute. To handle multi-track recordings, the LSM requires a separate audio file for each track; each track also counts separately toward the usage total. If you are an academic institution, ask us about our academic discount rates.

If you require manual checking and correction of LSM output, cutting up of audio files, or custom programming, we are at your service. Our billing rate for registered users is $40 per labor hour. Billing for labor hours can also be applied to the quarterly minimum enrollment fee.
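One reading of the fee rules above can be sketched as a small calculation. It assumes that usage is prorated by fractions of a minute, that labor hours and usage draw on the same $250 minimum, and that the quarterly charge is whichever is larger, the minimum or the actual total. Academic discounts are not modeled, and the function name is hypothetical.

```python
QUARTERLY_MINIMUM = 250   # dollars
RATE_PER_MINUTE = 25      # dollars per minute of audio, per track
LABOR_RATE_PER_HOUR = 40  # dollars per labor hour

def quarterly_charge(track_minutes, labor_hours=0):
    """Estimate one quarter's charge in dollars.

    `track_minutes` lists the duration of every audio file processed;
    each track of a multi-track recording is listed (and billed) separately.
    """
    usage = RATE_PER_MINUTE * sum(track_minutes)
    labor = LABOR_RATE_PER_HOUR * labor_hours
    return max(QUARTERLY_MINIMUM, usage + labor)

print(quarterly_charge([4]))                  # light use: the minimum applies
print(quarterly_charge([6, 6]))               # a 6-minute scene with two voice tracks
print(quarterly_charge([10], labor_hours=1))  # full minimum used, plus an hour of labor
```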

A quarter is three calendar months. The quarterly payment schedule begins on the date of registration as a regular user and ends at the end of the quarter during which a subscription is terminated. Please, no partial-quarter refunds; that's why it's called a minimum fee.

 

Copyright © 1996-2005 Sprex, Inc. All rights reserved.
Modified: February 1, 2005