{"id":60394,"date":"2021-06-25T12:03:48","date_gmt":"2021-06-25T03:03:48","guid":{"rendered":"https:\/\/smilegate.ai\/?p=60394"},"modified":"2021-06-25T12:10:14","modified_gmt":"2021-06-25T03:10:14","slug":"handling-imbalanced-datasets","status":"publish","type":"post","link":"https:\/\/smilegate.ai\/en\/2021\/06\/25\/handling-imbalanced-datasets\/","title":{"rendered":"Handling Imbalanced Datasets"},"content":{"rendered":"

[Service Development Team Hwang Jun-sun]<\/p>\n\n\n\n

When a machine learning model is trained with supervised learning on a dataset in which the number of samples differs greatly between labels, the model learns the samples of the minority labels poorly. If a label simply has too few samples, the model cannot learn it well, and even when there are enough samples to learn from, the model becomes biased toward the majority label if the ratio difference is extreme. This is especially common, for example, in anomaly classification and in problems with a very large number of labels to classify. In such cases, even a state-of-the-art model will struggle to reach good performance. There are four main ways to address these problems.<\/p>\n\n\n\n

<\/div>\n\n\n\n
  1. Use of appropriate evaluation metrics<\/li><\/ol>\n\n\n\n

    This is less a method for directly solving the imbalanced dataset problem than a first step: it lets you accurately interpret the currently trained model before applying the solutions described later. For example, suppose we have a binary classification problem with labels 0 and 1, where 99% of the samples in the dataset belong to label 0 and 1% belong to label 1. If the trained model classifies every sample as 0, its accuracy will be 99%. That number is not incorrect, but does it properly describe the performance of the model? With data like this, we generally want to classify the 1s correctly, not the 0s, and this model finds none of them, so accuracy alone tells us very little. It is therefore recommended to use the following evaluation metrics [1], which look at aspects of performance beyond accuracy.<\/p>\n\n\n\n

    \"\"<\/figure>\n\n\n\n
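    The 99% / 1% scenario above can be checked with a few lines of code. The following is a minimal sketch in plain Python, not the method from [1]: the helper names (confusion_counts, evaluate) are hypothetical, and the choice of precision, recall, and F1 as the additional metrics is an assumption made for illustration, since they are commonly used alongside accuracy.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true/false positives and negatives for the chosen positive label.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive=1):
    # Hypothetical helper: returns accuracy plus metrics focused on the positive label.
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Treat undefined ratios (zero denominator) as 0.0 for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# 1,000 samples: 99% belong to label 0, 1% belong to label 1.
y_true = [0] * 990 + [1] * 10
# A model that classifies every sample as 0.
y_pred = [0] * 1000

metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(name, round(value, 4))
```

    Accuracy comes out at 0.99, while recall and F1 for label 1 are 0, which makes it visible that the model never detects the class we actually care about.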