CS Events

Qualifying Exam

Multimodal Content Understanding via Structural Reasoning


Tuesday, November 17, 2020, 02:30pm - 04:00pm


Speaker: Shijie Geng

Location: Remote via Webex


Committee: Prof. Yongfeng Zhang (Advisor), Prof. Hao Wang, Prof. Sungjin Ahn, Prof. Dong Deng

Event Type: Qualifying Exam

Abstract: Vision-language tasks require a model to understand multimodal input and then accomplish a specific goal (such as answer selection, text generation, or span prediction), which poses challenging multimodal representation learning and reasoning scenarios. Many existing works on vision-language tasks simply represent multimodal inputs as a sequence of embeddings (objects in an image or tokens in a sentence). However, such a representation ignores the intrinsic structure of the multimodal content, such as relations or hierarchies. We propose to leverage these structural cues to build a graph representation of multimodal content and to reason over it with multimodal graph attention networks. We present experiments on two vision-language tasks to show the effectiveness of incorporating structural representation and reasoning for understanding multimodal content: 1) audio-visual scene-aware dialog, which requires an agent to engage in a question-answer dialog with a human about audio-visual content; and 2) document visual question answering, which requires understanding the layout of document images to locate the correct answer span for a given question.


Join Zoom Meeting

Meeting ID: 991 2725 9830
Password: 718150
One tap mobile
+16465588656,,99127259830# US (New York)
+13017158592,,99127259830# US (Germantown)

Join By Phone
+1 646 558 8656 US (New York)
+1 301 715 8592 US (Germantown)
+1 312 626 6799 US (Chicago)
+1 669 900 9128 US (San Jose)
+1 253 215 8782 US (Tacoma)
+1 346 248 7799 US (Houston)