PapersWithCode's Korean dataset

Papers with Code, which catalogs AI papers together with their linked open-source code and state-of-the-art (SOTA) results, also links to more than 3,000 datasets. Of these, 851 are text datasets, and narrowing the search to Korean returns the following:

Dataset Name	Description
Universal Dependencies	Dataset annotated with the grammar and morphology of various languages (104 languages total)
OpenSubtitles	Multilingual subtitle dataset for movies and TV content (60 languages total)
PAWS-X	Multilingual paraphrase identification dataset built from translated sentence pairs (6 languages total)
KorQuAD	Korean Q&A dataset
WikiAnn	Multilingual named-entity labeling dataset based on Wikipedia (295 languages total)
GeoCoV19	A large-scale text dataset collected from Twitter, annotated with geographic location information
KorNLI	Korean dataset created for natural language inference
KorSTS	Korean dataset created for evaluating sentence similarity
MKQA	Multilingual knowledge question-answering dataset (26 languages total)
ClovaCall	Large-scale Korean speech dataset of phone consultation calls
Wikipedia Title	Wikipedia title dataset labeled in Korean, Chinese, and Japanese
WikiLingua	Dataset with articles and summaries paired together (18 languages in total)
JIT Dataset	Dataset pairing Jeju Island dialect with standard Korean
JSS Dataset	Jeju Island dialect speech dataset (single speaker)
Korean HateSpeech Dataset	Entertainment news comment dataset labeled for hate speech
Mega-COV	Dataset collected at scale on Twitter to study COVID-19
NSMC	Korean movie review dataset
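
The narrowing step described above (from a multilingual listing down to Korean-only entries) can be sketched in a few lines of Python. This is only a rough illustration, not how Papers with Code filters its catalog: the helper names `load_rows` and `korean_only` are hypothetical, the sample rows are copied from the table, and the "description starts with Korean" rule is an assumed heuristic.

```python
import csv
import io

# A few rows copied from the table above, kept tab-separated as in the post.
TABLE = (
    "Dataset Name\tDescription\n"
    "OpenSubtitles\tMultilingual subtitle dataset for movies and TV content (60 languages total)\n"
    "KorQuAD\tKorean Q&A dataset\n"
    "KorNLI\tKorean dataset created for natural language inference\n"
    "NSMC\tKorean movie review dataset\n"
)


def load_rows(text):
    """Parse the tab-separated listing into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def korean_only(rows):
    """Keep rows whose description starts with 'Korean' (an assumed, crude heuristic)."""
    return [row for row in rows if row["Description"].startswith("Korean")]


rows = load_rows(TABLE)
names = [row["Dataset Name"] for row in korean_only(rows)]
print(names)
```

A heuristic like this misses multilingual datasets that include Korean, such as OpenSubtitles, so it is best read as a way to separate Korean-specific resources from multilingual ones.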

Here is a link to PapersWithCode's Korean dataset list:

Papers with Code – Machine Learning Datasets

18 datasets • 40905 papers with code.
