HuggingFace, famous for its integrated natural language processing package, adds speech recognition. Here are the related links:
facebook/wav2vec2-base-960h · Hugging Face
We're on a journey to solve and democratize artificial intelligence through natural language.
Specifically, Facebook-developed Wav2Vec 2.0 was added, which is famous for unsupervised learning first with a large amount of unlabeled data, and a learning method that uses only a very small amount of labeled data. Here is an introduction to Wav2Vec 2.0:
Wav2Vec 2.0 Revealed-Create ASR with 10 Minute Voice
After performing representation training with 53,000 hours of label-free data, a pre-trained model for Facebook's wav2vec 2.0, which became a hot topic because it created a speech recognizer with only 10 minutes of labeled data, was released. Versions with no fine-tuning, 10 minutes, 100 hours, and 960 hours fine-tuning in the representation model have been released. Perhaps one of the main concerns is the application of Korean, but it is expected that excellent performance comes out without large amounts of data...
Pororo, recently released by Kakao Brain, is an integrated package that supports both natural language tasks and speech recognition tasks at the same time. Here is an introduction to Pororo:
Pororo - KakaoBrain's Integrated Natural Language Framework
In Kakao Brain, Pororo, an integrated natural language framework capable of responding to various natural language tasks, has been released as open source. Pororo stands for Platform Of neuRal mOdels for natuRal language prOcessing and you can think of it as a similar purpose to HuggingFace. Pororo has the advantage of not only being more optimized for Korean tasks, but also supporting audio processing such as speech recognition. Here's a simple Korean MRC using Pororo…
Some time ago, examples of performing image recognition and prediction tasks based on transformers were announced, and predictions were made about whether natural language and image processing methods will be integrated in the future. Personally, I think the similarity between natural language and voice is much higher than natural language and image. Natural language and speech have many common elements in that they are only different in form of text and audio, but in the end, they are the way of expressing time series of language. (written language vs spoken language) So, I think it may be a natural option to be technologically fused.