We share a Korean profanity dataset collected and labeled by Joonhee Jo. It was gathered from multiple online communities and seems well suited to evaluating models on real-world data. Below is a description of the dataset:
Data Description
This is a Korean (Hangul) dataset that labels whether each sentence contains profanity.
A total of 5,825 sentences, taken from comments on community sites such as Daily Best (Ilbe) and Today's Humor, have been labeled. Each line is split by a vertical bar (|): the comment text is on the left, and the profanity label (0 or 1) is on the right.
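The line format described above can be read with a few lines of Python. This is a minimal sketch, not the author's loader: the function names `parse_line` and `parse_lines` and the sample lines are hypothetical, and it assumes one comment per line with the label after the last vertical bar.

```python
from typing import Iterable, List, Tuple


def parse_line(line: str) -> Tuple[str, int]:
    """Split one 'comment|label' line into (comment, label)."""
    # Split on the LAST vertical bar so a '|' inside the comment is kept.
    # A line without any '|' raises ValueError on unpacking.
    text, label = line.rstrip("\n").rsplit("|", 1)
    label = label.strip()
    if label not in ("0", "1"):
        raise ValueError(f"unexpected label: {label!r}")
    return text.strip(), int(label)


def parse_lines(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse every non-empty line."""
    return [parse_line(line) for line in lines if line.strip()]


# Hypothetical sample lines in the described format (not real data).
sample = ["have a nice day|0", "a comment | with a bar|1", ""]
rows = parse_lines(sample)
```

Splitting from the right is a design choice: comments are free text and may themselves contain a vertical bar, while the label is always the final field.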
Data Information
- Simple swear words, racist words, words that promote political conflict, sexual and sexist words, words that demean others, and other words that are offensive or judged abusive are classified as profanity.
- On Ilbe, the ending '-no' is intended to caricature the late President Roh Moo-hyun, so it should be viewed as abusive language; however, it is not classified as profanity in this dataset.
- Words such as 'John Taste' and 'Gai-deuk' contain profanity and could be regarded as swear words, but they are now commonly used simply for emphasis. If they are judged to be used as plain emphasis without malicious intent, they are not classified as swear words.
- Comments that may or may not be offensive depending on the situation are classified as non-profanity wherever possible.
Joonhee Jo's GitHub link is shared:
The kocohub GitHub repository also publishes Korean profanity data. It was built by collecting and labeling comments posted on entertainment news, and the details are as follows:
Data Description
The dataset consists of three parts: 1) labeled 2) unlabeled and 3) news_title.
labeled
There are a total of 9,381 human-labeled comments. These are divided into 7,896 training, 471 validation, and 974 test comments. (For a fair comparison of predictive models, the test set labels have not been released. Models can be evaluated through Kaggle submissions, which are discussed later in this document.) Each comment is annotated on two aspects, the presence of social bias and the presence of hate speech, because hate speech is closely related to bias.
For social bias, we present the labels Gender, Other, and No Bias. Given the context of Korean entertainment news, where celebrities primarily encounter gender stereotypes, we place more emphasis on this prevalent bias. We also added a binary label indicating whether the comment contains gender bias. For hate speech, we introduce the labels hate, aggressive, and none.
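The labeling scheme above can be sketched as a small record type. This is an illustration only: the class name `LabeledComment`, the field names, and the lowercase label spellings are assumptions taken from the description, and the strings in the released files may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# Assumed spellings, following the description above.
BIAS_LABELS = {"gender", "other", "none"}
HATE_LABELS = {"hate", "aggressive", "none"}


@dataclass(frozen=True)
class LabeledComment:
    text: str
    contains_gender_bias: bool  # the additional binary label
    bias: str                   # one of BIAS_LABELS
    hate: str                   # one of HATE_LABELS

    def __post_init__(self) -> None:
        # Reject label values outside the assumed label sets.
        if self.bias not in BIAS_LABELS:
            raise ValueError(f"unknown bias label: {self.bias!r}")
        if self.hate not in HATE_LABELS:
            raise ValueError(f"unknown hate label: {self.hate!r}")


def label_distribution(comments: Iterable[LabeledComment]) -> Counter:
    """Count how often each (bias, hate) pair occurs."""
    return Counter((c.bias, c.hate) for c in comments)


# Hypothetical records (not real data).
examples = [
    LabeledComment("comment a", True, "gender", "hate"),
    LabeledComment("comment b", False, "none", "none"),
    LabeledComment("comment c", False, "none", "none"),
]
dist = label_distribution(examples)
```

Counting the joint (bias, hate) pairs rather than each label separately gives a quick view of how the two aspects co-occur.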
unlabeled
Because labeled data is limited, we provide an additional 2,033,893 unlabeled comments. This unlabeled data can be used in a variety of ways, such as pretraining language models or semi-supervised learning.
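One common form of semi-supervised learning is pseudo-labeling: a model trained on the labeled comments scores the unlabeled ones, and only confident predictions are kept as extra training data. The sketch below is one possible way to do this, not something prescribed by the dataset. The function `pseudo_label`, the `predict_proba` callable, the binary simplification of the labels, and the 0.9 threshold are all assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def pseudo_label(
    unlabeled: Iterable[str],
    predict_proba: Callable[[str], float],
    threshold: float = 0.9,
) -> List[Tuple[str, int]]:
    """Keep only comments the model is confident about.

    predict_proba is a hypothetical callable returning the estimated
    probability that a comment belongs to the positive class.
    """
    selected = []
    for text in unlabeled:
        p = predict_proba(text)
        if p >= threshold:
            selected.append((text, 1))   # confident positive
        elif p <= 1.0 - threshold:
            selected.append((text, 0))   # confident negative
        # Otherwise the prediction is too uncertain and the comment is skipped.
    return selected


# Toy scorer standing in for a trained model (assumption, for illustration).
def toy_scorer(text: str) -> float:
    if "bad" in text:
        return 0.95
    if "maybe" in text:
        return 0.5
    return 0.02


pseudo = pseudo_label(["bad words", "maybe rude", "hello"], toy_scorer)
```

The selected pairs would then be added to the labeled training data and the model retrained, optionally repeating the loop.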
news_title
We provide the news title for each comment. Context is needed to fully understand the meaning of a comment. For entertainment news, both the title and the article body could serve as context; however, due to legal issues, we provide only the article titles.
Here is a link to the GitHub repository on kocohub:
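One simple way to use the title as context is to prepend it to the comment before passing the text to a classifier. This is a sketch under assumptions: the helper `with_context` and the " [SEP] " separator (a convention borrowed from BERT-style models) are illustrative and not part of the dataset.

```python
def with_context(title: str, comment: str, sep: str = " [SEP] ") -> str:
    """Join a news title and a comment into one input string."""
    title = title.strip()
    comment = comment.strip()
    if not title:
        # No title available: fall back to the comment alone.
        return comment
    return f"{title}{sep}{comment}"


# Hypothetical title and comment (not real data).
model_input = with_context("Actor announces new film", "can't wait to see it")
```

Whether the separator is a literal string or a tokenizer-specific special token depends on the model being used.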