Facebook has released pre-trained models for wav2vec 2.0, which became a hot topic by building a speech recognizer with only 10 minutes of labeled data after representation training on 53,000 hours of unlabeled audio.
The released checkpoints include the representation model with no fine-tuning as well as versions fine-tuned on 10 minutes, 100 hours, and 960 hours of labeled data. For many of us the main interest is probably applying it to Korean, but I am excited simply because it shows excellent performance without large amounts of data. It now feels like speech recognition technology, too, is on the path to becoming universally accessible.
The related paper, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” is shared via the link below. From the abstract:
“…audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive ta…”
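To make the idea in that excerpt a bit more concrete: the Transformer's context vector at a masked time step should be closer to the true quantized latent for that step than to distractors drawn from other masked steps. Below is a minimal sketch of such a masked contrastive objective. It is only an illustration under simplifying assumptions (a single utterance, random negatives, cosine similarity with a temperature), not the actual fairseq implementation, and all function and variable names are my own.

import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, quantized, mask,
                            num_negatives=100, temperature=0.1):
    """Rough sketch of an InfoNCE-style objective over masked time steps.

    context:   (T, D) Transformer outputs for one utterance.
    quantized: (T, D) quantized latent targets for the same utterance.
    mask:      (T,) boolean tensor, True where the input was masked.
    """
    masked_steps = mask.nonzero(as_tuple=True)[0]
    losses = []
    for t in masked_steps:
        # Positive: the true quantized latent at this masked step.
        positive = quantized[t].unsqueeze(0)                        # (1, D)
        # Negatives: quantized latents sampled from other masked steps
        # (for simplicity we do not bother excluding step t itself).
        neg_idx = masked_steps[torch.randint(len(masked_steps), (num_negatives,))]
        negatives = quantized[neg_idx]                              # (K, D)
        candidates = torch.cat([positive, negatives], dim=0)        # (1+K, D)
        # Cosine similarity between the context vector and each candidate.
        sims = F.cosine_similarity(context[t].unsqueeze(0), candidates) / temperature
        # The true latent sits at index 0, so the target "class" is always 0.
        losses.append(F.cross_entropy(sims.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()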
The basic idea is to perform representation learning with a large amount of unlabeled data and then complete the speech recognizer by fine-tuning with a small amount of labeled data. Compared to the original wav2vec, wav2vec 2.0 is said to improve performance by adopting a Transformer architecture. Looking at the published results, representation training on 53,000 hours (!) of unlabeled audio followed by fine-tuning on only 10 minutes of labeled data (about 40 sentences averaging 12.5 seconds each) reaches a WER of 5.7 on LibriSpeech's clean test set and 10.1 on the noisy (other) test set. It is amazing that reading roughly 40 sentences aloud is enough to get a working speech recognizer. (With all of the LibriSpeech training data, the WER drops to 1.9 / 3.5.)
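The released checkpoints themselves are distributed through fairseq, but to give a feel for how little recognizer-side code is needed once such a pre-trained and fine-tuned model exists, here is a sketch using the Hugging Face Transformers port of a 960-hour fine-tuned checkpoint. The model name and the audio file are assumptions on my part, and this is not the original fairseq usage.

import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Hypothetical example: a wav2vec 2.0 model fine-tuned on 960 hours of
# LibriSpeech, loaded via the Hugging Face Transformers port.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# "sample.wav" is a placeholder: 16 kHz mono speech is expected.
speech, sample_rate = sf.read("sample.wav")

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, time, vocab)

# Greedy CTC decoding: pick the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))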
Considering that labeling is a major hurdle in building speech recognizers, this is a very meaningful study in many ways. The links below point to the wav2vec 2.0 GitHub repository and a VentureBeat article, respectively.