Lip-reading artificial intelligence could help the deaf—or spies

For millions who can’t hear, lip reading offers a window into conversations that would be lost without it. But the practice is hard—and the results are often inaccurate (as you can see in these Bad Lip Reading videos). Now, researchers are reporting a new artificial intelligence (AI) program that outperformed professional lip readers and the best AI to date, with just half the error rate of the previous best algorithm. If perfected and integrated into smart devices, the approach could put lip reading in the palm of everyone’s hands.

“It’s a fantastic piece of work,” says Helen Bear, a computer scientist at Queen Mary University of London who was not involved with the project.

Writing computer code that can read lips is maddeningly difficult. So in the new study scientists turned to a form of AI called machine learning, in which computers learn from data. They fed their system thousands of hours of videos along with transcripts, and had the computer solve the task for itself.

The researchers started with 140,000 hours of YouTube videos of people talking in diverse situations. Then, they designed a program that created clips a few seconds long with the mouth movement for each phoneme, or word sound, annotated. The program filtered out non-English speech, nonspeaking faces, low-quality video, and video that wasn’t shot straight ahead. Then, they cropped the videos around the mouth. That yielded nearly 4000 hours of footage, including more than 127,000 English words.

The process and the resulting data set—seven times larger than anything of its kind—are “important and valuable” for anyone else who wants to train similar systems to read lips, says Hassan Akbari, a computer scientist at Columbia University who was not involved in the research.

The process relies in part on neural networks, AI algorithms containing many simple computing elements connected together that learn and process information in a way similar to the human brain. When the team fed the program unlabeled video, these networks produced cropped clips of mouth movements. The next program in the system, which also used neural networks, took those clips and came up with a list of possible phonemes and their probabilities for each video frame. A final set of algorithms took those sequences of possible phonemes and produced sequences of English words.
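To make the last two stages concrete, here is a toy sketch in Python. It is not DeepMind's method: the per-frame probabilities, the tiny lexicon, and the function names `best_phonemes` and `decode_word` are all invented for illustration. It assumes the simplest possible decoding, which picks the most likely phoneme in each frame, merges consecutive repeats, and looks the result up in a dictionary. The article says only that "a final set of algorithms" performs this step, and a real system would likely weigh many candidate sequences rather than one.

```python
from itertools import groupby

# Hypothetical stage-two output: one phoneme probability table per video frame.
frames = [
    {"b": 0.8, "p": 0.2},
    {"b": 0.6, "uw": 0.4},
    {"uw": 0.9, "iy": 0.1},
    {"uw": 0.7, "t": 0.3},
    {"t": 0.9, "d": 0.1},
]

# Hypothetical stage-three vocabulary: phoneme sequences mapped to words.
LEXICON = {
    ("b", "uw", "t"): "boot",
    ("b", "iy", "t"): "beet",
}


def best_phonemes(frames):
    """Pick the most probable phoneme per frame, then merge consecutive repeats."""
    per_frame = [max(dist, key=dist.get) for dist in frames]
    return tuple(phoneme for phoneme, _ in groupby(per_frame))


def decode_word(frames, lexicon):
    """Look up the collapsed phoneme sequence; returns None if it is not a known word."""
    return lexicon.get(best_phonemes(frames))


print(best_phonemes(frames))          # ('b', 'uw', 't')
print(decode_word(frames, LEXICON))   # boot
```

In this sketch, teaching the system a new word means adding an entry to `LEXICON`, while the phoneme stage stays untouched.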

After training, the researchers tested their system on 37 minutes of video it had not seen before. The AI misidentified only 41% of the words, they report in a paper posted this month to the website arXiv. That might not sound impressive, but the best previous computer method, which focuses on individual letters rather than phonemes, had a word error rate of 77%. In the same study, professional lip readers erred at a rate of 93% (though in real life they have context and body language to go on, which helps). The work was done by DeepMind, an AI company based in London, which declined to comment on the record.
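For readers curious how a figure like "word error rate" is usually calculated, here is a minimal Python sketch of the standard definition: the smallest number of word substitutions, deletions, and insertions needed to turn the system's output into the correct transcript, divided by the number of words in the correct transcript. The article does not describe the researchers' exact scoring procedure, so treat this as an assumption. The function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word ("sit") and one dropped word ("the"): 2 errors over 6 words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Under this definition a 41% error rate means roughly four errors for every 10 words actually spoken.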

Bear likes that the program understands that a phoneme can look different depending on what is said before and after. (For example, the mouth makes a different shape to say the “t” in “boot” than the one in “beet.”) She also likes that the system has separate stages for predicting phonemes from lips and predicting words from phonemes. That means if you want to teach the system to recognize new vocabulary words, you need to retrain only the last stage. But the AI has its weaknesses, she says. It requires clear, straight-ahead video, and a 41% error rate is far from perfect.

Integrating the program into a phone would allow the hard of hearing to take a “translator” with them wherever they go, Akbari says. Such a translator could also help people who cannot speak, for example because of damaged vocal cords. For others, it could simply help parse cocktail chatter.

Bear sees other applications, such as analyzing security video, interpreting historical footage, or hearing a Skype partner when the audio drops. The new AI approach might even solve one of the world’s greatest mysteries: In the 2006 World Cup Final, French soccer player Zinedine Zidane was ejected for dramatically headbutting an opponent in the chest. He was apparently provoked by trash talk. What was said? We may finally know, but we might regret we asked.