For large-scale language models, the lack of a Korean model has always been a difficulty. Following SKT's KoBERT, KcBERT has been released: a model trained from scratch on data reflecting Naver comments and new words. In addition to the trained model, the refined data used for training has also been released, and the model can be used easily through HuggingFace.
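If the checkpoint is on the HuggingFace Hub, loading it takes only a few lines. Below is a minimal sketch, assuming the model is published under the `beomi/kcbert-base` identifier (the exact name should be confirmed on the Hub):

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed Hub identifier for KcBERT; verify the exact model name on the HuggingFace Hub.
MODEL_NAME = "beomi/kcbert-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Tokenize a comment-style colloquial Korean sentence and run a forward pass.
inputs = tokenizer("ㅋㅋ 이 영화 진짜 재밌었어요", return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.logits.shape)
```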
Below is the code released by Junbum Lee and a link to the data used in training. (link to Junbum Lee's blog)
(Pretrain Dataset release: https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments)
(The pretraining corpus is provided as the file kcbert-train.tar.)
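The corpus can also be fetched programmatically through the Kaggle API if you have credentials set up. A minimal sketch, assuming the dataset slug from the link above and an example output path of `data/`:

```python
# pip install kaggle  (requires ~/.kaggle/kaggle.json with your API token)
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Download and unpack the KcBERT pretraining corpus into ./data (example path).
api.dataset_download_files(
    "junbumlee/kcbert-pretraining-corpus-korean-news-comments",
    path="data",
    unzip=True,
)
```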
According to the code introduction, there have been many Korean language models based on BERT, but most were trained on well-refined data such as the Korean wiki, news articles, and books, and little work reflects the colloquial expressions, new words, and typos that appear in actual portal comments. KcBERT improves on this by training a tokenizer and model from scratch on collected Naver comments. In practice, many open-source language models are available, but because their training data differs from real-world data, they often fall short when I apply them to my own tasks. I think models built on real-world data have higher value for actual applications.