Facebook has released pre-trained models for wav2vec 2.0, which became a hot topic by building a speech recognizer with only 10 minutes of labeled data after representation training on 53,000 hours of unlabeled audio.
The released checkpoints include the representation model with no fine-tuning as well as versions fine-tuned on 10 minutes, 100 hours, and 960 hours of labeled data. For many of us the main interest is probably applying it to Korean, but I am excited simply because it shows excellent performance without large amounts of data. It now feels like speech recognition technology, too, is on the path to becoming universally accessible.
The related paper, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” is shared via the link below. From the abstract:
“…audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive ta…”
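To make the idea in that excerpt a bit more concrete: the Transformer's context vector at a masked time step should be closer to the true quantized latent for that step than to distractors drawn from other masked steps. Below is a minimal sketch of such a masked contrastive objective. It is only an illustration under simplifying assumptions (a single utterance, random negatives, cosine similarity with a temperature), not the actual fairseq implementation, and all function and variable names are my own.

import torch
import torch.nn.functional as F

def masked_contrastive_loss(context, quantized, mask,
                            num_negatives=100, temperature=0.1):
    """Rough sketch of an InfoNCE-style objective over masked time steps.

    context:   (T, D) Transformer outputs for one utterance.
    quantized: (T, D) quantized latent targets for the same utterance.
    mask:      (T,) boolean tensor, True where the input was masked.
    """
    masked_steps = mask.nonzero(as_tuple=True)[0]
    losses = []
    for t in masked_steps:
        # Positive: the true quantized latent at this masked step.
        positive = quantized[t].unsqueeze(0)                        # (1, D)
        # Negatives: quantized latents sampled from other masked steps
        # (for simplicity we do not bother excluding step t itself).
        neg_idx = masked_steps[torch.randint(len(masked_steps), (num_negatives,))]
        negatives = quantized[neg_idx]                              # (K, D)
        candidates = torch.cat([positive, negatives], dim=0)        # (1+K, D)
        # Cosine similarity between the context vector and each candidate.
        sims = F.cosine_similarity(context[t].unsqueeze(0), candidates) / temperature
        # The true latent sits at index 0, so the target "class" is always 0.
        losses.append(F.cross_entropy(sims.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()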
The basic idea is to perform representation learning with a large amount of unlabeled data and then complete the speech recognizer by fine-tuning with a small amount of labeled data. Compared to the original wav2vec, wav2vec 2.0 is said to improve performance by adopting a Transformer architecture. Looking at the published results, representation training on 53,000 hours (!) of unlabeled audio followed by fine-tuning on only 10 minutes of labeled data (about 40 sentences averaging 12.5 seconds each) reaches a WER of 5.7 on LibriSpeech's clean test set and 10.1 on the noisy (other) test set. It is amazing that reading roughly 40 sentences aloud is enough to get a working speech recognizer. (With all of the LibriSpeech training data, the WER drops to 1.9 / 3.5.)
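The released checkpoints themselves are distributed through fairseq, but to give a feel for how little recognizer-side code is needed once such a pre-trained and fine-tuned model exists, here is a sketch using the Hugging Face Transformers port of a 960-hour fine-tuned checkpoint. The model name and the audio file are assumptions on my part, and this is not the original fairseq usage.

import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Hypothetical example: a wav2vec 2.0 model fine-tuned on 960 hours of
# LibriSpeech, loaded via the Hugging Face Transformers port.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# "sample.wav" is a placeholder: 16 kHz mono speech is expected.
speech, sample_rate = sf.read("sample.wav")

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, time, vocab)

# Greedy CTC decoding: pick the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))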
Considering that labeling is a major hurdle in building speech recognizers, this is a very meaningful study in many ways. The links below point to the wav2vec 2.0 GitHub repository and a VentureBeat article, respectively.