I've heard stories that, with special training, you can tell what someone is saying just from the silent movement of their lips. Lip2Wav is a technology along those lines: it extracts visual features with a ConvNet, an attention-based speech decoder generates mel-spectrograms from them, and a vocoder then synthesizes the waveform. The results are quite interesting (there is a demo video at the link).
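To make that pipeline a bit more concrete, here is a minimal sketch of the visual-encoder → attention-based decoder → vocoder flow in PyTorch. This is only an illustration under assumed layer sizes and module choices, not the authors' released implementation (their repo has the real model):

```python
# A minimal sketch of a Lip2Wav-style pipeline, NOT the authors' code:
# a 3D-convolutional visual encoder, an attention-based autoregressive
# decoder predicting mel-spectrogram frames, and a placeholder vocoder.
# All layer sizes and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Encodes a sequence of lip-region frames into per-timestep features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, feat_dim, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool away spatial dims, keep time
        )

    def forward(self, frames):  # frames: (batch, 3, time, height, width)
        feats = self.conv(frames)  # (batch, feat_dim, time, 1, 1)
        return feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, time, feat_dim)


class SpeechDecoder(nn.Module):
    """Attention-based decoder that emits mel-spectrogram frames one at a time."""
    def __init__(self, feat_dim=256, n_mels=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.rnn = nn.GRUCell(feat_dim + n_mels, feat_dim)
        self.proj = nn.Linear(feat_dim, n_mels)
        self.n_mels = n_mels

    def forward(self, enc_out, n_frames):
        batch = enc_out.size(0)
        hidden = enc_out.new_zeros(batch, enc_out.size(2))
        prev_mel = enc_out.new_zeros(batch, self.n_mels)
        mels = []
        for _ in range(n_frames):
            # Attend over visual features using the decoder state as the query.
            context, _ = self.attn(hidden.unsqueeze(1), enc_out, enc_out)
            hidden = self.rnn(torch.cat([context.squeeze(1), prev_mel], dim=-1), hidden)
            prev_mel = self.proj(hidden)
            mels.append(prev_mel)
        return torch.stack(mels, dim=1)  # (batch, n_frames, n_mels)


def vocoder(mel):
    """Placeholder for a neural vocoder (e.g. WaveNet or Griffin-Lim) that maps
    mel-spectrograms to a waveform; here it just returns silent dummy audio."""
    return torch.zeros(mel.size(0), mel.size(1) * 256)  # assumed hop size of 256


if __name__ == "__main__":
    frames = torch.randn(1, 3, 30, 48, 96)  # one second of lip crops at 30 fps
    mel = SpeechDecoder()(VisualEncoder()(frames), n_frames=80)
    audio = vocoder(mel)
    print(mel.shape, audio.shape)
```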
I was amazed when I saw MIT's Speech2Face, a technology that generates a face from a voice signal, and Lip2Wav is fun as well. Both studies share an encoder-decoder structure, and we can expect more "A2B"-style studies with different input and output modalities to keep appearing. Below is Lip2Wav's project page.
In addition, the authors have released the code and training data; I'm attaching a link to this as well.