Papers with Code, which catalogs AI papers together with their linked open-source code and state-of-the-art (SOTA) results, also links to more than 3,000 datasets. Of these, 851 are text datasets, and restricting the search to Korean returns the following datasets:
Dataset Name | Description |
Universal Dependencies | Dataset annotated with grammar and morphology for many languages (104 languages in total) |
OpenSubtitles | Multilingual subtitle dataset for movies and TV content (60 languages total) |
PAWS-X | Multilingual paraphrase identification dataset built from translated sentence pairs (6 languages total) |
KorQuAD | Korean Q&A dataset |
WikiAnn | Multilingual named entity labeling dataset based on Wikipedia (295 languages in total) |
GeoCoV19 | Large-scale COVID-19 tweet dataset collected from Twitter with geographic location information |
KorNLI | Korean dataset created for natural language inference |
KorSTS | Korean dataset created for the purpose of evaluating sentence similarity |
MKQA | Multilingual knowledge question answering dataset (26 languages in total) |
ClovaCall | Large-scale Korean speech dataset of phone consultation calls |
Wikipedia Title | Wikipedia title dataset labeled in Korean, Chinese, and Japanese |
WikiLingua | Dataset with articles and summaries paired together (18 languages in total) |
JIT Dataset | Dataset pairing Jeju Island dialect text with standard Korean |
JSS Dataset | Single-speaker Jeju Island dialect speech dataset |
Korean HateSpeech Dataset | Entertainment news comment dataset labeled for hate speech |
Mega-COV | Dataset collected at scale on Twitter to study COVID-19 |
NSMC | Korean movie review dataset |
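The table mixes broad multilingual corpora with datasets that state no language count, so it can help to sort the entries by stated language coverage. The following is a minimal sketch, assuming the rows are kept as plain `name | description |` text. The `TABLE` string holds a few rows abbreviated from the table, and `parse_rows` and `LANG_COUNT` are hypothetical names introduced only for this illustration.

```python
import re

# A few rows abbreviated from the table above, in "name | description |" form.
TABLE = """\
Universal Dependencies | Grammar and morphology annotations (total of 104 languages) |
OpenSubtitles | Movie and TV subtitles (60 languages total) |
PAWS-X | Paraphrase identification (6 languages total) |
KorQuAD | Korean Q&A dataset |
WikiAnn | Labeling dataset based on Wikipedia (295 total languages) |
MKQA | Knowledge question and answer dataset (total 26 languages) |
WikiLingua | Articles paired with summaries (18 languages in total) |
NSMC | Korean movie review dataset |
"""

# Finds the number inside a parenthetical that mentions "languages",
# tolerating phrasings such as "(total of 104 languages)" or "(295 total languages)".
LANG_COUNT = re.compile(r"\([^()\d]*(\d+)[^()\d]*languages")


def parse_rows(table: str) -> list[dict]:
    """Split each pipe-delimited row into a name, a description, and a language count."""
    rows = []
    for line in table.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        name, description = cells[0], cells[1]
        match = LANG_COUNT.search(description)
        rows.append({
            "name": name,
            "description": description,
            # None means the description states no language count.
            "languages": int(match.group(1)) if match else None,
        })
    return rows


rows = parse_rows(TABLE)

# Datasets with a stated language count, widest coverage first.
multilingual = sorted(
    (r for r in rows if r["languages"] is not None),
    key=lambda r: r["languages"],
    reverse=True,
)
# Datasets whose description gives no language count.
no_count = [r["name"] for r in rows if r["languages"] is None]

for r in multilingual:
    print(f'{r["name"]}: {r["languages"]} languages')
print("No language count stated:", ", ".join(no_count))
```

A missing language count does not by itself mean a dataset is Korean-only, so the second group is best treated as a list to check by hand.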
The Korean dataset list on Papers with Code is available at the following link: