HuggingFace, known for its integrated natural language processing package, has added speech recognition. Here are the related links:
Specifically, it added Wav2Vec 2.0, developed by Facebook. The model is known for its training method: it first learns without supervision from a large amount of unlabeled data, and then needs only a very small amount of labeled data. Here is an introduction to Wav2Vec 2.0:
Pororo, recently released by Kakao Brain, is an integrated package that supports both natural language tasks and speech recognition tasks. Here is an introduction to Pororo:
Some time ago, examples of transformer-based image recognition and prediction were announced, and people speculated about whether natural language processing and image processing methods would be integrated in the future. Personally, I think natural language and speech are much more similar to each other than natural language and images are. The two share many elements: they differ only in form, text versus audio, and in the end both are ways of expressing language as a time series (written language vs. spoken language). So I think it may be natural for them to be fused technologically.