Multimodal Content Understanding via Structural Reasoning
Tuesday, November 17, 2020, 02:30pm - 04:00pm
Speaker: Shijie Geng
Location: Remote via Webex
Committee: Prof. Yongfeng Zhang (Advisor), Prof. Hao Wang, Prof. Sungjin Ahn, Prof. Dong Deng
Event Type: Qualifying Exam
Abstract: Vision-language tasks require a model to understand multimodal input and then accomplish a specific goal (such as answer selection, text generation, or span prediction), which poses challenging multimodal representation learning and reasoning scenarios. Many existing works on vision-language tasks simply represent multimodal inputs as a sequence of embeddings (objects in an image or tokens in a sentence). However, such a representation ignores the intrinsic structure of the multimodal content, such as relations or hierarchies. We propose to leverage these structural cues to build a graph representation of multimodal content and to reason over it with multimodal graph attention networks. We present experiments on two vision-language tasks to show the effectiveness of incorporating structural representation and reasoning for understanding multimodal content: 1) audio-visual scene-aware dialog, which requires an agent to engage in a question-answer dialog with a human about audio-visual content; and 2) document visual question answering, which requires understanding the layout of document images to locate the correct answer span for a given question.