LipGAN is a technique that generates lip motion for a face image from a speech signal, but when applied to real video the results were somewhat unsatisfactory, mainly because of visual artifacts and unnatural movement.
To improve on this, a follow-up study called Wav2Lip was published: it improves visual quality by feeding the discriminator several consecutive frames instead of a single frame, so that temporal correlation is taken into account, and by using a visual quality loss rather than relying on the contrastive (sync) loss alone.
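As a very rough sketch of those two ideas (this is not the authors' code; the loss weights, tensor sizes, and variable names below are placeholders I chose purely for illustration):

```python
import numpy as np

# Rough sketch of the two ideas above; NOT the Wav2Lip authors' code.
# The loss weights and tensor sizes are placeholders chosen for illustration.

def generator_loss(recon_l1, sync_loss, gan_loss, w_sync=0.03, w_gan=0.07):
    """Weighted sum of reconstruction, lip-sync, and visual-quality terms."""
    return (1.0 - w_sync - w_gan) * recon_l1 + w_sync * sync_loss + w_gan * gan_loss

# Temporal context for the discriminator: a window of T consecutive frames
# is stacked along the channel axis and scored as one input, instead of
# judging each frame in isolation.
T, C, H, W = 5, 3, 96, 96
window = np.random.randn(1, T, C, H, W).astype(np.float32)
disc_input = window.reshape(1, T * C, H, W)
```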
At the link shared below you can find the paper, the GitHub code, pre-trained models, example videos, and even an online demo where you can upload your own video and audio to try it out.
Aside from this, even before deep learning-based methods appeared, there were techniques for matching a character's lip shapes to an actual speech signal. There were various approaches; the one I worked on prepared several lip templates in advance and switched between them according to the voice signal.
Given the diversity of speech signals, one might think that far too many templates would be needed, but in practice the lip shape is determined mostly by vowels, and consonants contribute very little. Not only are there relatively few vowel types, they can also be recognized from the speech signal in a simple way; as I recall, just five vowel templates plus image interpolation gave quite usable results.
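A minimal sketch of that template-switching idea, purely for illustration (the templates, image size, and blending function here are all hypothetical; the original implementation is long gone):

```python
import numpy as np

# Illustrative only: five hypothetical mouth templates, one per vowel.
TEMPLATES = {v: np.zeros((64, 64, 3), dtype=np.float32) for v in "aeiou"}

def lip_frame(prev_vowel: str, next_vowel: str, t: float) -> np.ndarray:
    """Cross-fade between two vowel templates.

    t in [0, 1] measures progress from the previously recognized vowel to
    the next one; plain linear image interpolation smooths the transition.
    """
    a, b = TEMPLATES[prev_vowel], TEMPLATES[next_vowel]
    return (1.0 - t) * a + t * b

# Example: a mouth shape halfway between 'a' and 'o'.
frame = lip_frame("a", "o", 0.5)
```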
Of course, that was almost 20 years ago, and now I plan to apply Wav2Lip.
Rudrabha/Wav2Lip
This repository contains the codes of “A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild”, published at ACM Multimedia 2020. – Rudrabha/Wav2Lip
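As a quick usage note, the repository's README (at the time of writing) runs inference with a command of the form `python inference.py --checkpoint_path <wav2lip checkpoint> --face <input video> --audio <input audio>`; check the repository for the current flags and checkpoint download links.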