Typically, Q&A systems answer questions from text. A representative example is the SQuAD task, where the system is given a paragraph stating some facts and must generate an appropriate answer to a question about it. Visual QA, in contrast, replaces the text with an image and asks the system to answer questions about that image.
Today's paper, “3D Attention is All You Need,” extends this from images to video: it describes an algorithm for generating appropriate answers to questions about a given video. At its core it is a transformer-based model, but compared with Visual QA, Video QA is harder because the added time axis means spatial and temporal context must be considered at the same time. The authors approach this by combining the SlowFast network, originally used for action recognition in video, with LXMERT for feature extraction. For data, they use FrameQA, a sub-task of the TGIF-QA dataset, which consists of GIF animations paired with questions and answers.
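To make the "spatial and temporal context at the same time" idea concrete, here is a minimal PyTorch sketch of joint spatio-temporal self-attention over video features. This is an illustrative assumption on my part, not the paper's exact architecture: the module name, shapes, and the idea of flattening the (time × height × width) grid into one token sequence are all hypothetical.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Self-attention over a flattened (T x H x W) token grid, so every
    token can attend across both space and time in a single step."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, H, W, dim) video feature map,
        # e.g. pooled features from a SlowFast-style backbone (assumption).
        b, t, h, w, d = x.shape
        tokens = self.norm(x.reshape(b, t * h * w, d))  # flatten space-time
        out, _ = self.attn(tokens, tokens, tokens)
        return (tokens + out).reshape(b, t, h, w, d)    # residual connection

# Toy usage: 4 frames of an 8x8 feature map with 256 channels.
feats = torch.randn(2, 4, 8, 8, 256)
print(SpatioTemporalAttention(256)(feats).shape)  # torch.Size([2, 4, 8, 8, 256])
```

The point of the sketch is only that one attention operation can cover both axes at once, instead of attending within each frame and then across frames separately.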
Understanding video is particularly difficult because a video can contain several modalities at once. Earlier studies mostly tried to recognize only partial characteristics of the video data, or operated on each image frame of the video independently. Other approaches extend 2D CNNs, which are well suited to spatial feature analysis, into 3D CNNs. However, attempts like this paper's to connect natural language processing, which has been developing rapidly in recent years, with image understanding are still in their infancy, and this is a field where much progress can be expected.
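As a quick illustration of that 2D-to-3D extension (a generic PyTorch sketch, not code from the paper or any of the works it discusses): a 3D convolution simply adds a temporal dimension to the kernel, so a single filter spans a short span of frames and can capture motion as well as appearance.

```python
import torch
import torch.nn as nn

# A 2D convolution sees one frame at a time: (channels, H, W).
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Its 3D extension adds a time axis to the kernel, so one filter
# covers (time, H, W) and can respond to motion across frames.
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3), padding=1)

frame = torch.randn(1, 3, 112, 112)      # one RGB frame
clip = torch.randn(1, 3, 16, 112, 112)   # a 16-frame RGB clip
print(conv2d(frame).shape)  # torch.Size([1, 16, 112, 112])
print(conv3d(clip).shape)   # torch.Size([1, 16, 16, 112, 112])
```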
The link to the paper is shared on the GitHub page.