ImageNet-1K (a 1000-class image classification problem) is a benchmark on which progress has closely tracked the development of CNNs. AlexNet, which marked the beginning of the deep learning era, achieved a TOP-5 error of about 17%. Since the best existing approach at the time (SIFT+FV) had a TOP-5 error of about 26%, this was a significant improvement. After many further advances, the TOP-5 error has fallen to 1.3% (FixEfficientNet_L2), roughly 1/13 of AlexNet's error and 1/20 of SIFT+FV's.
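For readers unfamiliar with the metric, here is a minimal sketch of how a TOP-5 error can be computed: a prediction counts as correct if the true class is among the five highest-scoring classes. The function name `top_k_error` and the toy scores are made up for illustration, and plain Python is used only to keep the example dependency-free.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example (hypothetical scores): 3 samples, 6 classes.
scores = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # true class 0 has the lowest score
    [0.9, 0.1, 0.2, 0.3, 0.4, 0.5],  # true class 0 has the highest score
    [0.3, 0.1, 0.9, 0.2, 0.4, 0.5],  # true class 4 is ranked third
]
labels = [0, 0, 4]

top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
```

In this toy case the third sample is wrong under TOP-1 but right under TOP-5, which is why TOP-5 error is always less than or equal to TOP-1 error.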
So how does this compare with human performance? Several studies have looked at this, and for the ImageNet-1K task human performance is reported to be a TOP-5 error of roughly 5%. Of course, the human subjects are not simply shown the test images; to simulate training, they first go through a “learning” phase by looking at the images contained in each class. Judging from these results alone, CNNs appear to have already surpassed humans at image classification.
The paper below adds distortion (noise and blur) to the images and compares human and CNN performance again. Since the paper was written a while ago, it does not use recent models such as FixEfficientNet_L2; the comparison uses models such as GoogLeNet, VGG, and ResNet. Still, I don't think the conclusion would change much.
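To make the two distortions concrete, here is a rough sketch of adding noise and blur to a grayscale image. It assumes additive Gaussian noise and a Gaussian blur, which is an assumption on my part since the exact distortion settings are not given here; the function names and the toy image are hypothetical. In practice you would use an image library rather than nested lists.

```python
import math
import random


def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image (rows of values in [0, 255])."""
    rng = random.Random(seed)
    return [
        [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel covering roughly +/- 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, repeating edge pixels at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(w * row[min(n - 1, max(0, x + j - radius))] for j, w in enumerate(kernel))
            for x in range(n)
        ])
    return out


def transpose(image):
    """Swap rows and columns."""
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


# Toy 5x5 image: one bright pixel on a dark background.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 255.0

noisy = add_gaussian_noise(image, sigma=20.0)
blurred = gaussian_blur(image, sigma=1.0)
```

Increasing `sigma` in either function gives a stronger distortion, which corresponds to moving along the "distortion level" axis in this kind of experiment.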
The experiments led to the following findings.
- For both noise and blur, humans are much more robust than CNNs.
- With noise, human TOP-5 error rose from 5% to 20%, while CNN error rose from 5% to 80%.
- Fine-tuning the CNN on the same type of noise helps, but the error rate still reaches 60%.
- With blur, human TOP-5 error rose from 5% to 30%, while CNN error rose from 5% to 80%.
- Fine-tuning the CNN on the same type of blur helps, but the error rate still reaches 50%.
The authors also analyzed whether the types of errors made by humans and by CNNs became correlated as noise and blur increased, but found no clear correlation. I think this means that current CNNs cannot be regarded as mimicking the human visual system, and that there may be fundamental differences between the two. At the very least, more research still seems to be needed on the question of "what do humans do?" to cope with visual distortion.
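One simple way to check whether two classifiers make the same kinds of mistakes is to compare their confusion patterns. The sketch below is not necessarily the analysis used in the paper; it is just one illustrative approach, with hypothetical function names and toy labels. It counts how often each true class is mistaken for each other class, then computes the Pearson correlation between the human counts and the CNN counts.

```python
import math


def confusion_counts(true_labels, predicted_labels, num_classes):
    """counts[t][p] = number of samples of true class t predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts


def off_diagonal(matrix):
    """Flatten only the error cells (true class != predicted class)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def error_pattern_correlation(true_labels, human_preds, cnn_preds, num_classes):
    """Correlation between the human and CNN error counts per (true, predicted) pair."""
    human = off_diagonal(confusion_counts(true_labels, human_preds, num_classes))
    cnn = off_diagonal(confusion_counts(true_labels, cnn_preds, num_classes))
    return pearson(human, cnn)


# Toy data (hypothetical): 3 classes, 6 images.
true_labels = [0, 0, 1, 1, 2, 2]
human_preds = [1, 0, 1, 1, 2, 2]  # human confuses class 0 with class 1 once
cnn_same = [1, 0, 1, 1, 2, 2]     # CNN makes the same mistake
cnn_other = [0, 0, 1, 2, 2, 2]    # CNN confuses class 1 with class 2 instead

same = error_pattern_correlation(true_labels, human_preds, cnn_same, 3)
other = error_pattern_correlation(true_labels, human_preds, cnn_other, 3)
```

A value near 1 would suggest that humans and the CNN confuse the same class pairs, while a value near 0 or below would suggest unrelated error patterns, which is closer to what the post describes.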