Ubuntu Dialogue Corpus

Building a conversational system that lets humans hold natural conversations with virtual agents is a difficult task in natural language processing and the focus of much ongoing research.

The Ubuntu Dialogue Corpus consists of nearly 1 million two-person conversations extracted from Ubuntu chat logs, in which users sought technical support for various Ubuntu-related issues. Each conversation averages 8 turns and has at least 3 turns. All conversations are in text form (not audio).

The full dataset contains 930,000 conversations and more than 100,000,000 words and is available here. The Kaggle dataset is a sample of the full dataset, distributed in .csv format. It contains more than 269 million words of text spread over 26 million turns.

folder: The folder that the conversation comes from. Each file contains conversations from one folder.
dialogueID: The ID number of a specific conversation. Conversation IDs are reused across folders, so an ID is only unique within its folder.
date: The timestamp of when the turn was sent.
from: The user who sent the turn.
to: The user being replied to. On the first turn of a conversation, this field is empty.
text: The text of the turn, enclosed in double quotation marks ("). Line breaks (\n) have been removed.
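
Because dialogueID values are reused across folders, rows have to be grouped by the pair (folder, dialogueID) to reconstruct individual conversations. The following is a minimal sketch of that grouping using only Python's standard library. It assumes the .csv header uses exactly the field names listed above. The sample rows, user names, and the group_dialogues helper are invented for illustration and are not taken from the corpus.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the column layout described above;
# the values are invented for illustration, not taken from the corpus.
SAMPLE_CSV = '''folder,dialogueID,date,from,to,text
3,1.tsv,2004-11-23T11:49:00.000Z,alice,,"how do I mount a usb drive?"
3,1.tsv,2004-11-23T11:50:00.000Z,bob,alice,"try sudo mount /dev/sdb1 /mnt"
3,1.tsv,2004-11-23T11:51:00.000Z,alice,bob,"that worked, thanks"
4,1.tsv,2005-01-02T09:00:00.000Z,carol,,"wifi keeps dropping"
4,1.tsv,2005-01-02T09:01:00.000Z,dave,carol,"which card do you have?"
'''


def group_dialogues(csv_file):
    """Group rows into dialogues keyed by (folder, dialogueID)."""
    dialogues = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # dialogueID alone is not unique, so the folder is part of the key.
        key = (row["folder"], row["dialogueID"])
        dialogues[key].append((row["from"], row["to"], row["text"]))
    return dict(dialogues)


dialogues = group_dialogues(io.StringIO(SAMPLE_CSV))

# Number of turns per conversation, e.g. to check the minimum-turn claim.
turn_counts = {key: len(turns) for key, turns in dialogues.items()}
```

For a real file, the same function could be given a file object opened with open(path, newline="", encoding="utf-8"), where path points at one of the downloaded .csv files. Given the size of the corpus, processing one file at a time is likely more practical than loading everything into memory.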

Chatbots are widely used in customer support (CS), and in recent years there have been many attempts to move beyond limited rule-based conversation by incorporating natural conversation techniques. The corpus is not in Korean, but I think it can still be helpful for building skills in related research fields. Here is a link to the data published on Kaggle:

Ubuntu Dialogue Corpus

26 million turns from natural two-person dialogues
