ImageNet is a dataset that has profoundly influenced the advancement of AI, to the point that virtually everyone in the field knows it. It consists of approximately 14 million images with metadata, grouped into roughly 20,000 categories. To define the categories meaningfully, the project used WordNet, a lexical hierarchy created in the mid-1980s; images for each category were collected by crawling the web and then labeled through Amazon Mechanical Turk.
The famous ImageNet competition (ILSVRC, the ImageNet Large Scale Visual Recognition Challenge) was held for years on a subset of 1,000 categories and about 1.2 million images drawn from the full dataset, and well-known network architectures such as AlexNet, VGG, GoogLeNet, and ResNet emerged from it. However, because the data was collected by crawling, the issue was raised that ImageNet contains the faces of a large number of people whose consent was never obtained. The legality of collecting AI training data is becoming increasingly important, and ImageNet is no exception: it was reported that the maintainers decided to blur the faces appearing in the images. Here are links to related articles:
According to the reported experiments, the effect of this blurring on recognition accuracy is limited. The problem, however, is not limited to ImageNet; it can arise with any large-scale dataset collected for AI training, so I think it is time to establish a “gold standard” for handling such cases. Open source software, for example, has a range of licenses such as GPL, LGPL, BSD, and MPL, each of which specifies details like whether the code must be released or may be used commercially. There are, of course, Creative Commons licenses (CCL) for photo and video content, but for data collected for AI training, rather than for direct sale of the content itself, I think a more purpose-built licensing system would be welcome.
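The face-blurring step discussed above can be sketched with a simple box blur applied to a known bounding box. This is a minimal illustration in plain NumPy, assuming the face region has already been located (the coordinates below are hypothetical); real pipelines run a face detector first and typically use a Gaussian blur.

```python
import numpy as np

def blur_region(img, x0, y0, x1, y1, k=5):
    """Box-blur the pixels inside img[y0:y1, x0:x1] with a k-by-k kernel."""
    region = img[y0:y1, x0:x1].astype(float)
    pad = k // 2
    # Pad with edge values so the kernel is defined at the region border.
    padded = np.pad(region, pad, mode="edge")
    h, w = region.shape
    out = np.zeros_like(region)
    # Sum the k*k shifted copies, then divide: a plain box filter.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    result = img.copy()
    result[y0:y1, x0:x1] = (out / (k * k)).astype(img.dtype)
    return result

# Toy grayscale image: a sharp white square stands in for a detected face.
img = np.zeros((32, 32), dtype=np.uint8)
img[8:24, 8:24] = 255
blurred = blur_region(img, 4, 4, 28, 28, k=5)
```

After blurring, the hard edge of the square is smoothed into intermediate gray values while the interior stays saturated, which is exactly why the reported impact on recognition accuracy can stay small: most of the image content survives.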
Generative models such as GANs raise even more questions. For example, if there is a licensing problem in the training data, how do you determine the license of a new image created with the GAN? The image is new, yet training images with license problems certainly contributed to its creation. If so, should those license obligations be partially inherited? A GAN also has the concept of a latent space: can you avoid the licensing problem by slightly changing the value of a latent vector in a latent space that was itself learned from data with a license problem?
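The latent-space question can be made concrete with a toy stand-in. Below, a fixed random linear map plays the role of a trained generator; a real GAN generator is a deep network, and the dimensions and names here are purely illustrative. A small perturbation of the latent vector yields a nearby but distinct output, showing that the “new” image still flows entirely through parameters learned from the original training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained generator: a fixed linear map from an
# 8-dimensional latent vector to a 16-value "image". The weights W play
# the role of everything the generator learned from its training set.
W = rng.standard_normal((16, 8))

def generate(z):
    return W @ z

z = rng.standard_normal(8)
base = generate(z)

# "Slightly changing" the latent vector produces an output that is close
# to, but not identical to, the original.
z_shifted = z + 0.01 * rng.standard_normal(8)
shifted = generate(z_shifted)

# Relative change in the output caused by the latent perturbation.
diff = np.linalg.norm(shifted - base) / np.linalg.norm(base)
```

The relative difference is small but nonzero, so the perturbed output is a new image in a literal sense; whether that severs any license obligations attached to the training data is precisely the open question.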
These problems are hard to solve because they require consensus at many levels, but they are ultimately unavoidable, so I hope a global consensus process will be established for them, for example through the International Organization for Standardization (ISO).