I love my Digital Influence class. It’s being taught by John Bell, managing director/executive creative director of 360 Degree Digital Influence. Public relations in the age of digital influence is about leveraging social media to create conversations with potential influencers. Heady stuff.
As much as I’m enjoying what I’m learning in class, I’m becoming more and more frustrated that deaf people will be left behind again. When A.G. Bell patented the telephone (the true inventor was Antonio Meucci), he ended up isolating the deaf community, the very people he set out to help. Bell’s famous words, “Mr. Watson, come here,” were only the first of many technological milestones that marginalized deaf people.
We were managing to catch up by working with the FCC to regulate access for television and telecommunications, and then along came the internet. At first, the internet wasn’t so bad. It was still pretty equal footing for those of us who had a computer. Now, with the advent of Web 2.0, we’re about to be screwed, or, should I say, Bellized, yet again.
YouTube is becoming more and more popular as a means of exchanging information. Ditto for podcasts. My professor assigned the class a podcast to listen to, but luckily, I was able to get a transcript from him. What about the other podcasts with information that would benefit me, such as those from NPR or The New York Times? Much of this information is so new; traditional media hasn’t caught up yet. This information could be critical to a person working in a cutting-edge industry. We’d better get on the horn, and fast, if we want to remain competitive.
Why, oh, why doesn’t anyone use universal design principles?! We wouldn’t have to go to the FCC or Congress to regulate access every time a new invention rolls along.
© Copyrighted material. This article cannot be copied, reproduced or redistributed without the express written consent of the author. As with every blog on this website, this blog does not reflect the opinion of DeafDC.com.
42 Comments
Good article. Podcasts are inherently an aural medium; saying they’re inaccessible is like saying radio is inaccessible. Online video and vid-casts are another thing, although I don’t see how we can enforce a captioning requirement for user-driven services like YouTube. Not until algorithms for automated video transcription become more robust, at least.
It would be nice to have captioning from major media outlets like CNN.com, though.
Right, but NPR has transcripts of their shows online, as do many other radio stations. That’s where I’m coming from.
Do you think automated video transcription will be available anytime soon? What is available now? Apparently, not very “robust” algorithms. :-)
Any number I give you would be pulled straight from thin air — natural language processing is a tough nut to crack in computer science. We could come up with a major breakthrough tomorrow, or in 20 years.
To the best of my knowledge, current approaches all involve training of the software before production use, which has to be done on a per-person/per-voice basis. This is the approach taken by most dictation software, and works well under controlled conditions (little environmental noise, you don’t have a head cold/ thick accent).
On the other hand, some domain-specific systems are able to accurately recognize a limited vocabulary from a broad set of vocal ranges. This is how most telephone-routing systems work, I believe.
Now that I think of it, the issue at hand is probably the balance of broadness (speaker diversity) and accuracy (without having to train the software). I’m sure we’ll get there, but it might take some time — I’m not quite sure what the state of the art is today. I could talk for hours about all this, but I think I’ll quieten up now.
If you want any further proof of the uncertainty of all this, may I point you towards xkcd. NLP is akin to computational linguistics (my first love) only in the broadest terms, but it still cracks me up.
Question for anyone who knows anything about this stuff: what’s the end result when the software can’t recognize the voice? Specifically in voice to print software. Do you usually just end up with gibberish or does a word actually manage to get printed here and there?
Normally the software will provide a best guess as to what the word could be. Think of all those Chinese-name jokes you’ve heard; the effect is similar.
edit: What I mean is the software will give the nearest English-language equivalent. So suppose you said “Sam Ting,” the software might interpret that as “same thing.”
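A rough way to picture that "nearest equivalent" behavior is a lookup that snaps unfamiliar input to the closest entry in a known vocabulary. The sketch below is only an analogy: it compares spellings with Python's standard difflib module, whereas real recognizers compare acoustic features, and the three-phrase vocabulary is made up for illustration.

```python
import difflib

# Hypothetical vocabulary; a real system would hold many thousands of entries.
VOCABULARY = ["same thing", "good morning", "thank you"]

def best_guess(heard, vocabulary=VOCABULARY):
    """Return the vocabulary entry most similar to what was heard, or None."""
    matches = difflib.get_close_matches(heard.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(best_guess("Sam Ting"))  # prints "same thing"
```

Anything that falls below the similarity cutoff comes back as None here; a dictation program typically prints its best guess anyway, which is where the odd-looking output comes from.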
So it ends up looking like the worst of real-time captioning, with every third word being incorrect?
If you train dictation software, the accuracy rate is pretty high. From my personal observations (I’ve never used it myself though), it’s about 85-90% accurate if you don’t speak too fast. It’s actually pretty neat.
But the main problem with discerning words from a range of voices (especially in a noisy background) is essentially a question of too much data, right? The program can’t sort it all out unless it’s trained?
I’ll rephrase that: it’s not a matter of too much data, but of ambiguous data that it has nothing to match against. Humans are adept at pattern recognition, a very hard thing for computers to do well unless you tell them to expect every single possible variation of every single possible input. You see where things can quickly get complicated.
What the best matching algorithms do basically is try to boil down continuous data into acceptable discrete values that ultimately produce a deterministic outcome.
You’d be surprised at how much fuzzy logic humans do every day. You see a table, but how do you identify it as such, when there are infinitely many variations of a table? If you used a chair as a table (perhaps to eat a meal) how’s a computer to know that? Stuff along those lines. *flashbacks to Philosophy 101*
The reason I ask is that I had a similar conversation on another blog (I think Der Sankt) a couple of months back. Help me to understand something: if the problem is ambiguous data, what would happen if a completely separate program were running that looked at something else, in this case motion tracking or facial recognition software? The way I figure it, a person’s lips can only make so many shapes if he or she is speaking English, right? That’s essentially how we learn to lip read. We guess based on the shapes. And I’m guessing a computer would have to employ the same fuzzy logic subroutines to guess based on motion tracking alone. But suppose voice recognition were running alongside that program, and the computer compared the two results to find a “best match.” For example, “What’s that pig outdoors?” may be the motion tracking program’s best guess, but the voice recognition program might come up with “What’s that big loud noise?”, which seems like a better guess.
I wasn’t entirely satisfied with the results of my last discussion on this and I wonder if anyone else can tell me about the drawbacks to this idea? Or even if it’s possible now with current technology?
Sure, it’s certainly possible to employ different algorithms, compare them using a ‘confidence’ score for each and make a decision based off that. It’s actually probably better to do that.
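As a minimal sketch of that idea, assume each recognizer reports its guess together with a confidence score between 0 and 1. The recognizer names and the scores below are invented for illustration; the two phrases are the ones from the question above.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    source: str        # which recognizer produced the guess (hypothetical names)
    text: str          # the transcription it proposes
    confidence: float  # the recognizer's own score, assumed to lie in [0, 1]

def pick_best(hypotheses):
    """Return the hypothesis with the highest confidence score."""
    return max(hypotheses, key=lambda h: h.confidence)

guesses = [
    Hypothesis("lip_tracker", "What's that pig outdoors?", 0.41),
    Hypothesis("speech_recognizer", "What's that big loud noise?", 0.78),
]
best = pick_best(guesses)
print(best.source, "->", best.text)
```

This only works as well as the confidence scores themselves: if a recognizer is confidently wrong, picking the higher score picks the wrong answer.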
That would make sense if you’re trying to do lipreading software, but other than that I see very limited practical use for it, because you’re assuming visual stimuli are present (and data is presumably being collected). Visual and audio data are two completely different mediums. While one can certainly complement the other, I wouldn’t approach the problem of speech recognition while depending on visual data; you’re merely adding another continuum of probability that’s non-intrinsic to the medium. What if the person had their back to the camera, or you were recording a radio show?
What I’d do is draw on characteristics that are intrinsic to the medium. Speech recognition involves identifying sounds, which represent words. Words constitute language, which has a whole set of rules unto itself apart from any phonological system.
So theoretically, once you’ve identified a few runs of words based on their phonological content, it becomes that much easier to match up the whole sentence against patterns in grammar, word usage, etc.
I don’t know if current systems actually look at the linguistic aspect of the sounds they’re trying to interpret, but they do take into consideration runs of phonemes, the probability of a certain phoneme coming adjacent to another, etc. All those probabilities are weighed, a decision is made and the results are strung together to give you bits of text that may or may not make any sense.
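To make the "probability of one phoneme coming next to another" idea concrete, here is a toy sketch. The probability table is invented for illustration and is not taken from any real recognizer; it only encodes the assumption that "s" followed by "t" is common in English while "s" followed by "d" is rare.

```python
import math

# Hypothetical bigram probabilities P(next phoneme | previous phoneme).
BIGRAM_PROB = {
    ("s", "t"): 0.20,
    ("s", "d"): 0.001,
    ("t", "r"): 0.10,
    ("d", "r"): 0.08,
    ("r", "iy"): 0.15,
    ("iy", "t"): 0.12,
}
UNSEEN = 1e-6  # small floor for pairs missing from the table

def log_score(phonemes):
    """Sum the log-probabilities of each adjacent phoneme pair."""
    total = 0.0
    for prev, nxt in zip(phonemes, phonemes[1:]):
        total += math.log(BIGRAM_PROB.get((prev, nxt), UNSEEN))
    return total

# The recognizer is unsure whether the second sound was "t" or "d".
candidates = {
    "street": ["s", "t", "r", "iy", "t"],
    "sdreet": ["s", "d", "r", "iy", "t"],
}
best_word = max(candidates, key=lambda w: log_score(candidates[w]))
print(best_word)
```

Working in log space avoids multiplying many small numbers together. Note that the winning sequence is only the most probable one under the table, which says nothing about whether the resulting text makes sense.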
Whew. I didn’t want to dive off the geek deep-end, but there you go.
I don’t think it’s geeky, ha. I think this directly relates to what Erin is talking about, and I think it has a LOT of practical applications. Take captioning. We don’t get enough of it, why? Because it’s costly (though that’s being called into question now) and time-consuming to produce. But suppose that producing captions were as easy as filming/recording a person speak and filling in the areas where he turns his back or the sound gets muffled.
Think about it. A lot of us already have cameras and microphones attached to our computers. That’s YouTube captioning, right there, if we had the right software to do the job. I can’t believe that it’s overly complicated. If there are limited pairs of phonemes, there have to be limited pairs of shapes the lips make when producing words. So if a computer is choosing from those limited pools of data (which are big pools granted but not oceans) then that should make at least a dent in the ambiguity problem, yes? And from there, if it’s easy to caption things while filming/recording whatever is going to end up on a consumer’s podcast, then we’re no longer locked out.
How many computer people are reading this right now? Anyone have some good links to info on the kinds of software that could be merged to get this effect? Or companies that might be interested in building something like this?
I’m not saying it’s not doable. Voice and video recognition can certainly augment one another, and a ‘fill-in-the-blanks’ system could be developed for content that’s already meant to be captioned (i.e., a TV show).
Where that approach fails is with non-visual data and content that’s not intended to be captioned: YouTube, radio broadcasts, podcasts, etc. That’s not tackling the problem of NLP directly, but trying to patch it piecemeal for certain cases. There are probably more cases where it wouldn’t work than where it would.
Maybe there are people already working on such an approach, but I wouldn’t myself. I’d rather try and solve the problem of natural language processing first, rather than work on hacks that are only going to be effective in a small minority of cases. To me, that makes the most economic and practical sense.
Hm. Okay. I don’t know enough about this to really get much further into it. I hope that whoever works on this can speed up language processing software so it can identify a range of voices. I have this vision where you just put on a pair of glasses (or contacts) and walk into a bar… presto, anyone you talk with has captions under his face…
It’s nice to dream…
Haha, nice vision — it’s one shared by you and me both!
That was like reading a very slow IM conversation. : )
Ha ha, my KODA son has that dream! Closed-captioning glasses - and he wants to invent that for me! :-)
Right now, *most* automated telephone systems can understand voice responses.
Within two years (I am guessing here), we will see voice-to-text in use (printing out orders for assembling machines, computers, etc.), and it is possible we can use this technology for people like us.
The bad part for us will be the lack of a LIVE person on the telephone system.
Chris:
From what I have learned about voice-to-text systems, most automated telephone systems accept voice responses; speaking your Social Security or bank account number over the telephone seems to work. Text print-outs, along with the IP address, will be forwarded to the security department for quality assurance.
Second, most computer and electronics reseller assembly plants will benefit from voice-to-text orders. A customer orders a product via phone and the assembly department gets a text print-out of the order. Companies are developing hardware and software to make this happen. There are programs that offer feedback such as “You have ordered: 2 Dell laptop computers with 1 case. Is this correct?” and then prompt for a yes or no response. This is just a start, and down the road we will see improvements.
Once that becomes a standard, we can use this technology for relay, closed captioning, etc.
I have always considered podcasts, videos, and Flash to be hindrances to access for everyone. As a web guy, I try to use those things to enhance, but not replace, content.
Text on a page is pretty accessible to all.
Erin, wish I could be taking that same class!
Anyway, you hit upon an issue in your very last sentence, “…We wouldn’t have to go to the FCC or Congress to regulate access every time a new invention rolls along.” The FCC has been noticeably deregulatory when it comes to the Internet… the good news is I think we’ve been writhing a bit this past decade for a big payoff. The Internet is going to finally take a consistent form someday and that should be when we can expect quick gains in policymaking.
That day will come. Until then, from the FCC’s standpoint, it might be best not to regulate something that has yet to be completely understood today. I hate being behind the curve, but at least we’re not being completely shut out and I’m just glad there’s a sprinkling of social responsibility and conscience out there on the web. Captions here and there. I just wish it’d be everywhere now, not then.
Then I’ll be able to watch YouTube clips of The A-Team.
Erin –
The Coalition of Organizations for Accessible Technology agrees with you about universal design principles. Legislative and regulatory action becomes necessary in the absence of voluntary action. For more information about the Coalition’s goals and objectives, go to http://www.COATaccess.org. With respect to accessible audiovisual materials on the Internet, the Coalition is focusing on IPTV and other TV-like commercial programming, bringing existing laws into the 21st century.
Rosaline
This comment is mostly in response to the deeply nested discussion thread between Chris and Josh above. A lot of the topics mentioned there touch on my research interests (automated sign language recognition and computer vision), so I feel compelled to jump in.
Speech recognition has advanced quite a bit; see for example this article. However, it is important to understand that there is still a huge gulf between a system that is being used in a quiet office and a system that has to adapt to a variety of different people, background noise, signal degradation, and so on. The automated phone systems are not even remotely comparable to dictation systems like Dragon NaturallySpeaking, because they just have to deal with an extremely small set of discrete words. It is much easier to achieve high accuracy and true speaker independence for them than for the full range of human expression. I see such systems eventually becoming good enough, to the point that the deaf can use them to ensure accessibility, but two years is a tad optimistic.
Chris Heuer was touching on the concept of data fusion in the thread above - the idea is to use multiple sources of information, such as audio and video, and by combining them you achieve better accuracy than by looking at them alone. These ideas have been a hotbed of research, especially in the computer vision community, for many years. It is one of these things that intuitively - to a human - should be dead easy, but the reality is much more complicated. Paradoxically, the higher your base accuracy with one channel of information is, the higher the chance that fusing information from multiple sources will just make overall accuracy worse, instead of better.
To explain why would require going on a lengthy excursion into probability theory and statistics, so let me just say that it boils down to determining how confident you are in each bit of information. We humans are fairly good at detecting when a video becomes garbled, or when our guesses at something are nonsense. But doing the same thing in a computer program is quite hard, and then there is also the question of what you do if the information from e.g. the audio and the video conflict. However, there have been some significant theoretical advances in machine learning and in how to deal with unreliable information in the past 10 years.
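One toy calculation shows how fusion can hurt when the confidence question is ignored. This is an illustrative model with assumptions of its own, not the full statistical argument: the two channels are right independently with fixed accuracies, and when they disagree the system has no confidence information and picks one at random. The accuracy figures are invented.

```python
def naive_fusion_accuracy(acc_a, acc_b):
    """Accuracy when disagreements between two channels are settled by coin flip."""
    both_right = acc_a * acc_b
    exactly_one_right = acc_a * (1 - acc_b) + (1 - acc_a) * acc_b
    # If both are right the answer is right; if exactly one is right,
    # the coin flip picks the correct channel half of the time.
    return both_right + 0.5 * exactly_one_right

audio_only = 0.95   # hypothetical accuracy of the audio channel alone
video_only = 0.60   # hypothetical accuracy of the video channel alone
fused = naive_fusion_accuracy(audio_only, video_only)
print(f"audio alone: {audio_only:.3f}, naively fused: {fused:.3f}")
```

Under these assumptions the fused accuracy works out to the average of the two channels, so the better the stronger channel already is, the more an unweighted second channel drags it down. Doing better requires a good estimate of when each channel can be trusted.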
Josh: Modern speech recognition programs use quite a bit of linguistic information, but it is mostly statistical in nature. It does not mean that the program really understands the language, it just means that it is taking advantage of the statistical regularities in morphology and syntax. It significantly boosts the average accuracy, at the cost of making it harder to deal with the rare cases.
Finally, I would like to point out that many technological advances have been pragmatic in nature. We are still very far from solving the general problem of artificial intelligence, whereas the things that look like hacks are often the very ones that have suddenly gone off into unexpected directions.
Hi Christian:
It’s a fascinating field. And a fascinating set of problems. That’s an interesting point, that fusing data from multiple sources will just make accuracy worse. I never thought of it that way.
You know it’s interesting how the CART system at work (captioning created by a person sitting somewhere and typing the captions as the person speaks) ends up creating some really weird mistakes. There was a Town Hall meeting a couple of months back and one of the phrases typed was “drop-kick Iran.” No kidding. Obviously that wasn’t what was said but that’s how it came out because the transcribers are using some kind of machine that sort of jumps ahead and guesses at the word so they don’t have to type the whole thing out.
Even with these weird mistakes, though, the typing usually helps me lipread what was said. I never did figure out what “drop-kick Iran” meant, but I remember once when the CART guy typed “the dancer’s baby” and then the speaker repeated what he said and I lip-read: “The answer is maybe.”
Until perfect voice recognition software comes along I’ll settle for something that gives me at least a questionable guess. It’ll make for some interesting conversations, I’m sure, but a machine will probably do better at making sense out of spoken language than I ever will.
Say, another question for anyone who knows:
This DNS thing (Dragon NaturallySpeaking), does this only work for a desktop computer or can you get a BlackBerry that can do the same thing? What I’d really like to have is a simple BlackBerry with a microphone attachment… if I ever NEED to make sense out of spoken English, I’d be willing to hold out that microphone whether my software has been trained on that speaker’s voice or not. You never know, it might make a pretty good guess.
Can that be done currently?
Depends on the hardware requirements of DNS, about which I have no idea. My guess is no.
However, the current generation of embedded low-power processors seems to be at approximately the level of a Pentium 3, so that would mean that the mobile processor generation is about 8 years behind the mainstream. We eventually will have low-power processors that are fast enough to handle this kind of task.
We already have seen a similar development in the transition from high end workstations by hardware vendors, such as SGI and Sun, down to the mainstream PC market. Most computer-science research, for example, used to require expensive cutting-edge hardware, but this is no longer true. It is only a question of time until the same thing happens to the mobile market.
Come to think of it, CART is a perfect candidate for statistical language models. It would not even be particularly hard to implement. I wonder if any of the current CART software on the market does this.
Uh what abt spinvox? http://www.spinvox.com
Apparently it can transcribe somebody’s speech right away. It already offers podcast transcribing services.
Hopefully spinvox or similar techs will become more versatile (captioning video clips for instance) and affordable.
Just to make it clear: SpinVox is not some magic technology, just a clever application of manpower, according to SpinVox’s patent application. So, this service is based on human operators, although it would not surprise me if they used or attempted to use speech recognition to make it more efficient.
So, I guess that captioning video clips is going to stay expensive for a while yet.
Ahh interesting link. Thanks!
Something just occurred to me. You can get a free transcript of a podcast/clip by making a relay call to your own phone, then place the phone over your cpu, and hit the play button lol.
I’m presuming that a phone’s able to transmit voice from a cpu well enough, of course.
Hi Ben:
Yeah I remember your article on that. I’m gonna check this out more closely…
Hey just out of curiosity, does anyone know of a company that is developing software to motion-track the way a person’s lips move when he/she speaks? And then translate that into a “best guess” of what’s being said and then print it out?
As for companies, no idea. However, there are researchers who are working or have worked on this problem. You can find them easily enough if you search http://scholar.google.com for the term “lip tracking” (without the quotes).
Please do not ask me to evaluate their work, though. I have never been very interested in lip tracking in particular, so I am not familiar with the current state of the art …
Chris Heuer,
Your comment reminded me of the McGurk effect. According to its Wikipedia page, “Study into the McGurk effect is being used to produce more accurate speech recognition programs by making use of a video camera and lip reading software.”
http://en.wikipedia.org/wiki/McGurk_effect
No, but I did watch a very interesting History Channel documentary on Hitler, and one deaf German developed lip-reading software. They used his program to figure out what Hitler was saying in his home movies, since his speech in private was different than in public.
Maybe someone can get ahold of that guy in Germany and see what he can do for us? ;)
I shall contact my German relatives ASAP!
Großvater, Großmutter, wo bist du? (Grandfather, grandmother, where are you?)
(-:
*grins* I wish I could find it on the History Channel website. I’ve seen it twice on t.v. That guy’s doing exactly what you’re talking about. The guy’s supposed to be one of the top lip-readers in the world too.
This information does not pass the smell test. Think about it this way: is it realistic to expect automated lip reading to do well, when we cannot even get good results with automated speech recognition on movies? Remember, lip reading is a much harder problem than speech recognition.
Plus, it is pretty hard to conceive that I would never have heard of a Deaf German fellow national if he had indeed managed to develop a breakthrough in automated lip reading. It would have been a huge story.
So, I did some digging, and here is what I can say: Yes, there was a documentary by the British history channel. Yes, there was a deaf German, Frank Hübner, involved. No, he is not developing lip-reading software. He owns a company that offers sign language courses, at http://www.gebaerdenfabrik.de.
Apparently he used his flesh-and-blood lipreading skills to help decipher segments of Hitler’s private videos.
The puzzling - and frankly amazing - thing is how the English news sites could get the information so horribly wrong. A German newspaper has the story much like I wrote up here. It mentions that this guy used a computer to zoom and enhance the images. It is likely that this is the part that the English newspapers and subsequently the majority of the blogosphere misinterpreted as automated lip reading.
If it looks too good to be true, it probably is.
Christian, yeah, that’s the name of the Deaf guy. But from what I saw on the History Channel, he did use ALR software to figure out what Hitler was saying.
I dunno - maybe you’re right, things got lost in translation, but I’m pretty sure of what I saw in that documentary.
Still does not add up for me. There are copies of the video floating around on Google, but some circumstances are rather puzzling.
The video is not captioned, so I have to wait for a hearing friend to help me out.
One thing that I have been able to piece together so far is that Michael Brooke is being quoted, and he *does* have a bunch of publications on ALR and has specifically investigated the possibility of remapping lip movements from video onto a 3D face model that is purported to be lip-readable. This kind of meshes with the bits of the demonstration that are shown in the video. It could be that this was one of the enhancing methods used, with bits and pieces of ALR, with the deaf lipreading experts as a backup and arbiters.
I highly doubt that Frank himself developed the ALR technology. There is nothing to indicate that he has the expertise in this specific area. It is more likely that he was provided with it, and either used it as an application, or he developed the user interface.
But these things do not add up:
- There are no recent scientific publications by Brooke on this topic. Why keep quiet on it if he has been improving it?
- There is no mention of lipreading algorithms that are that good in the scientific literature. It is not very likely that a quantum leap of this magnitude (if it were) would be kept quiet during all that time.
- It is very difficult to find official references to the video. The British channel “Five”, for instance, where it originally aired, has no mention of it anymore. Movie databases do not even identify a publisher.
- It is being mentioned frequently in the context of anti-semites and holocaust deniers.
- I could not find any records of validation of this technique against historical footage with known audio.
Hmm. There was nothing anti-semitic in the documentary. They just focused on how lip-reading worked and the software, then to what Hitler was saying in the home films. From what I remember, Frank said he didn’t develop it himself, but that he worked with a group of computer programmers to develop it. I tried to search for the documentary itself as well, and I couldn’t find it. Which bothers me a bit.
I dunno what to tell you. You’re the expert on this, not me! So I’ll trust your judgment on this, really.
I definitely agree. I feel left behind as I watch videos increasingly become more and more common. I don’t understand why the major news websites haven’t provided closed captioning yet (CNN, MSNBC, FOX, etc) - that’s a serious insult to the Deaf and HoH community.
A couple of comments…
I cannot imagine that an automated lip reading system would be any easier to develop than audio speech recognition. Just as audio speech recognition has detrimental factors like accent, style (fast talking, running words together, etc.), head colds, and sobriety, automated lip reading has to deal with mouth and lip size, whether or not the speaker is buck-toothed (I’m serious here!), or whether or not the speaker just came from the dentist.
Anyone who has seen those old videos of Hitler knows that, at least while making speeches, Hitler was very animated with his mouth and facial expressions, and he opened his mouth very wide. That has to be a significant factor.
I have a friend who is a pretty good lip reader, but she once told another friend that she could not read his lips because, as she put it, when he spoke it looked like he was eating sunflower seeds. That has to be fairly common.
Secondly, does anyone who is well versed in the subject know if anyone is researching whether or not something like Dragon NaturallySpeaking can be trained by the profoundly deaf? I would think that if the speaker is consistent, the software could be trained to recognize words even if they are not intelligible to the human ear.
Has anyone seen that awful show on TV with Flavor Flav, who wears a clock around his neck for fashion? I imagine a stripped-down laptop (no sound card, minimum video capability, no CD-ROM, etc.) with no hinge, having a keyboard on the back of the screen. The user could wear it or carry it. When the user speaks, the trained computer would recognize the sounds even though the audience cannot, and the words would magically appear on the screen. Or to make it smaller and lighter, eliminate the screen and use a voice synthesizer. The keyboard would be needed for unusual words or proper names.
As an added benefit, this might blur the lines (and the animosity) over oralism.
I also envision BlackBerries eventually being able to do this. Imagine a profoundly deaf person walking into a convenience store, and saying “ic curk eek” into his/her trained BlackBerry and “I am deaf. Pack of camel filters, please,” coming out of the BlackBerry.