Lip2Wav: Generating a voice signal from silent lip movements
There are stories of people who, after special training, can tell what someone is saying just from the movements of their lips; the research in the link achieves this with AI.
For large-scale language models there has long been a practical difficulty: no Korean model was available. Following SKT's KoBERT, KcBERT has now been released, trained from scratch on Naver comment data so that it reflects newly coined words and slang. Not only the trained model…
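As a minimal sketch of how such a release is typically consumed, the snippet below loads KcBERT through the Hugging Face transformers API; the model id beomi/kcbert-base is an assumption on my part, not something stated in the post.

```python
# A minimal sketch, assuming KcBERT is published on the Hugging Face hub
# under the id "beomi/kcbert-base" (an assumption, not stated in the post).
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
model = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-base")

# Encode a Korean comment-style sentence and get masked-LM logits.
inputs = tokenizer("이 영화 진짜 [MASK] 재밌다", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, vocab_size)
```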
Deep learning-based super resolution has been adopted in NVIDIA's latest GPUs under the name DLSS (Deep Learning Super Sampling), making it a real consumer-facing technology. Mainly in the 4K gaming market, 2K…
LipGAN is a study on generating mouth shapes from a speech signal. It is a technique that could be useful for animating a virtual character's mouth, but its limitation is clear when applied in practice: only the lips of a character standing still move. In fact, humans...
TensorFlowTTS, an open-source project built on TensorFlow 2 that supports several recent TTS models such as Tacotron 2, MelGAN, and FastSpeech, has finally added support for Microsoft's FastSpeech2. FastSpeech2 delivers quality comparable to Transformer-based TTS while training more than twice as fast…
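For orientation, here is a sketch of FastSpeech2 inference with TensorFlowTTS, following the AutoProcessor/TFAutoModel pattern from the project's README; the model id and the exact inference signature are assumptions and may differ across versions.

```python
# A minimal sketch of FastSpeech2 inference with TensorFlowTTS. The model id
# "tensorspeech/tts-fastspeech2-ljspeech-en" and the inference signature are
# assumptions based on the repo's README pattern.
import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

# Text -> phoneme ids -> mel spectrogram. A separate vocoder (e.g. MelGAN)
# would then turn the mel spectrogram into a waveform.
input_ids = processor.text_to_sequence("Hello, this is a FastSpeech2 test.")
mel_before, mel_after, duration, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
print(mel_after.shape)  # (1, frames, 80)
```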
Text-to-SQL is the task of automatically converting natural language into SQL. The post linked at the bottom, written by Aerin Kim of Microsoft, is a well-organized overview of Text-to-SQL. A great deal of the world's data is stored in relational databases, and in these databases...
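To make the task concrete, the toy examples below show the kind of input/output mapping a Text-to-SQL system must produce; the schema and question pairs are hypothetical, not taken from the post.

```python
# An illustration of the Text-to-SQL task itself, not of any particular model:
# given a schema and a natural-language question, the system must emit SQL.
# The schema and examples here are hypothetical.
schema = "employees(id, name, department, salary)"

examples = [
    ("How many employees are there?",
     "SELECT COUNT(*) FROM employees;"),
    ("Who earns more than 50000 in the sales department?",
     "SELECT name FROM employees WHERE salary > 50000 AND department = 'sales';"),
]

for question, sql in examples:
    print(f"Q: {question}\nSQL: {sql}\n")
```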
MIT's Speech2Face is a study that generates a speaker's face from a speech signal. Rather than performing the speech-to-face transform with a single model, however, it combines the outputs of existing studies built for different purposes to produce impressive results. (The first author is now...
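The data flow can be sketched as a composition of three stages; the function bodies below are placeholder stubs, not the paper's actual networks, and only the pipeline structure is taken from the study.

```python
# A schematic of how Speech2Face composes prior work rather than training one
# end-to-end model: a voice encoder is trained to regress face-recognition
# (VGG-Face) features from speech, and a face decoder reused from earlier work
# renders a canonical face from those features. All bodies are dummy stubs.
import numpy as np

def compute_spectrogram(waveform: np.ndarray) -> np.ndarray:
    # Placeholder: the paper feeds a spectrogram of the input speech.
    return np.abs(np.fft.rfft(waveform))

def voice_encoder(spectrogram: np.ndarray) -> np.ndarray:
    # Placeholder for the voice encoder trained in this paper to predict
    # 4096-d VGG-Face features from speech.
    return np.zeros(4096)

def pretrained_face_decoder(face_features: np.ndarray) -> np.ndarray:
    # Placeholder for the face decoder reused from prior work, which renders
    # a canonical frontal face image from face-recognition features.
    return np.zeros((224, 224, 3))

def speech_to_face(waveform: np.ndarray) -> np.ndarray:
    return pretrained_face_decoder(voice_encoder(compute_spectrogram(waveform)))

print(speech_to_face(np.zeros(16000)).shape)  # (224, 224, 3)
```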
Facebook has released a pre-trained model for wav2vec 2.0, which drew attention for building a speech recognizer with only 10 minutes of labeled data after representation learning on 53,000 hours of unlabeled data. With no fine-tuning of the representation model,...
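As a minimal sketch, the snippet below runs a released wav2vec 2.0 checkpoint for speech recognition through the Hugging Face transformers API, an alternative interface to Facebook's fairseq release; the model id is my assumption.

```python
# A minimal sketch of wav2vec 2.0 inference via Hugging Face transformers.
# The checkpoint id "facebook/wav2vec2-base-960h" is an assumption; the post
# refers to the fairseq release.
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# A 16 kHz mono waveform as a float array; here one second of silence.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, frames, vocab)

# Greedy CTC decoding to characters.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```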
L.A. Noire, a 2011 game from Rockstar, surprised many with facial animation far superior to other games of its time. The technology used here, called MotionScan, essentially places the actor in a room where multiple cameras are elaborately positioned...