For large-scale language models, the lack of a Korean model has always been a difficulty. Following SKT's KoBERT, KcBERT has been released: a model trained from scratch on data reflecting Naver comments and new words. In addition to the trained model, the refined data used for training has also been released, and the model can be used easily through HuggingFace.
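If the checkpoint is on the HuggingFace Hub, loading it takes only a few lines. Below is a minimal sketch, assuming the model is published under the `beomi/kcbert-base` identifier (the exact name should be confirmed on the Hub):

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed Hub identifier for KcBERT; verify the exact model name on the HuggingFace Hub.
MODEL_NAME = "beomi/kcbert-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Tokenize a comment-style colloquial Korean sentence and run a forward pass.
inputs = tokenizer("ㅋㅋ 이 영화 진짜 재밌었어요", return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.logits.shape)
```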
Below is the code released by Junbum Lee and a link to the data used in training. (link to Junbum Lee's blog)
(Pretrain Dataset release: https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments)
(The pretraining corpus is provided as the file kcbert-train.tar.)
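The corpus can also be fetched programmatically through the Kaggle API if you have credentials set up. A minimal sketch, assuming the dataset slug from the link above and an example output path of `data/`:

```python
# pip install kaggle  (requires ~/.kaggle/kaggle.json with your API token)
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# Download and unpack the KcBERT pretraining corpus into ./data (example path).
api.dataset_download_files(
    "junbumlee/kcbert-pretraining-corpus-korean-news-comments",
    path="data",
    unzip=True,
)
```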
According to the code introduction, there have been many Korean language models based on BERT, but most were trained on well-refined data such as the Korean wiki, news articles, and books, and little work reflects the colloquial expressions, new words, and typos that appear in actual portal comments. KcBERT improves on this by training a tokenizer and model from scratch on collected Naver comments. In practice, many open-source language models are available, but because their training data differs from real-world data, they often fall short when I apply them to my own tasks. I think models built on real-world data have higher value for actual applications.