
Watch artificial intelligence predict Conan O’Brien’s gestures just from the sound of his voice


LONG BEACH, CALIFORNIA—Every time you talk, your body moves in sync, whether it’s something as subtle as eyes widening or more extreme movements like flailing arms. Now, researchers have designed an artificial intelligence that knows how you’re going to move based purely on the sound of your voice.

Researchers collected 144 hours of video of 10 people speaking, including a nun, a chemistry teacher, and five TV show hosts (Conan O’Brien, Ellen DeGeneres, John Oliver, Jon Stewart, and Seth Meyers). They used an existing algorithm to produce skeletal figures representing the positions of the speakers’ arms and hands. They then trained their own algorithm with the data, so it would predict gestures based on fresh audio of the speakers.

The generated gestures were closer to reality than randomly selected gestures from the same speaker, and closer than predictions from a different type of algorithm originally designed to anticipate the hand movements of pianists and violinists. Speakers’ gestures were also unique, the researchers reported here this week at the Computer Vision and Pattern Recognition conference: Training on one person and predicting another’s gestures did not work as well. Feeding the predicted gestures into an existing image-generation algorithm led to semirealistic videos, as seen in the video.

The team’s next step is to predict gestures based not only on audio but also on transcripts. Potential applications include creating animated characters, building robots that move naturally, and compiling movement signatures of people to help identify fake videos.