Lip2Wav: Generates a speech signal from silent lip movements
There are stories of people who, with special training, can tell what someone is saying just from their lip movements; the research in the link achieves this with AI.
TensorFlowTTS, an open-source project based on TensorFlow 2 that supports several recent TTS models such as Tacotron 2, MelGAN, and FastSpeech, has finally begun supporting Microsoft's FastSpeech2. FastSpeech2 shows performance similar to Transformer-based TTS, but takes more than twice as long to train…
MIT's Speech2Face is a study that generates a speaker's face from a speech signal. Rather than performing the speech-to-face transformation with a single model, it combines the results of existing studies built for different purposes to produce impressive results. (The first author is now...
MixIT, released by Google AI, is a technology that separates individual sound sources from single-channel audio in which multiple sources are mixed. It can be viewed as a blind source separation task, and unlike existing approaches it delivers excellent performance with unsupervised learning(!).
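The core idea behind MixIT (mixture invariant training) is that the model is given a mixture of two mixtures and must output sources that can be regrouped to reconstruct each original mixture, with no per-source labels. A minimal numpy sketch of that loss, assuming a toy setup where the model has already produced `est_sources` (the function names and shapes here are illustrative, not from the paper's code):

```python
import itertools
import numpy as np

def mixit_loss(x1, x2, est_sources):
    """Sketch of the mixture-invariant training loss.

    x1, x2:      the two reference mixtures, each shape (T,)
    est_sources: the model's estimated sources, shape (M, T)

    Each estimated source is assigned to one of the two mixtures; the
    loss uses the assignment that best reconstructs both mixtures.
    """
    M = est_sources.shape[0]
    best = np.inf
    # Try every way of assigning each of the M sources to mixture 1 or 2.
    for assign in itertools.product([0, 1], repeat=M):
        a = np.array(assign)
        r1 = est_sources[a == 0].sum(axis=0)  # reconstruction of x1
        r2 = est_sources[a == 1].sum(axis=0)  # reconstruction of x2
        err = np.sum((x1 - r1) ** 2) + np.sum((x2 - r2) ** 2)
        best = min(best, err)
    return best
```

If the model recovers the true sources exactly (in any order), some assignment reconstructs both mixtures perfectly and the loss goes to zero, which is what lets training proceed without isolated-source ground truth.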
Facebook has released a pre-trained model for wav2vec 2.0, which became a hot topic because it built a speech recognizer with only 10 minutes of labeled data after representation learning on 53,000 hours of unlabeled data. No fine-tuning in the representation model,...
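The representation learning that makes this possible is contrastive: at masked time steps, the context network must pick out the true quantized target among distractors sampled from other steps. A rough InfoNCE-style sketch of that objective (shapes and the cosine-similarity/temperature details follow the paper's description; the function itself is illustrative):

```python
import numpy as np

def contrastive_loss(context, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss, as in wav2vec 2.0 (sketch).

    context:   context-network output at a masked step, shape (D,)
    positive:  quantized target for that step, shape (D,)
    negatives: distractor targets from other steps, shape (K, D)
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity of the context vector to the positive and each negative.
    logits = np.array([cos(context, positive)] +
                      [cos(context, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    # Cross-entropy with the positive target at index 0.
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Minimizing this loss pushes the context representation toward the true target and away from distractors, which is why so little labeled data is needed for the downstream recognizer.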
Human emotion perception and expression are complex (e.g., anger affects facial expressions, voice, and language). Here is an open dataset of audio-video recordings labeled with emotions. The Ryerson…