CS Events
Qualifying Exam: Multimodal Content Understanding via Structural Reasoning
Tuesday, November 17, 2020, 02:30pm - 04:00pm
Speaker: Shijie Geng
Location: Remote via Zoom
Committee:
Prof. Yongfeng Zhang (Advisor), Prof. Hao Wang, Prof. Sungjin Ahn, Prof. Dong Deng
Event Type: Qualifying Exam
Abstract: Vision-language tasks require a model to understand multimodal input content and then accomplish a specific goal (such as answer selection, text generation, or span prediction), which poses challenging scenarios for multimodal representation learning and reasoning. Many existing works on vision-language tasks simply represent multimodal inputs as sequences of embeddings (objects in an image or tokens in a sentence). However, such representations ignore intrinsic structures of the multimodal content, such as relations and hierarchies. We propose to leverage these structural cues to build graph representations of multimodal content and to reason over them through multimodal graph attention networks. We present experiments on two vision-language tasks to show the effectiveness of incorporating structural representation and reasoning for understanding multimodal content: 1) audio-visual scene-aware dialog, which requires an agent to engage in a question-answer dialog with a human about audio-visual content; and 2) document visual question answering, which requires understanding the layout of document images to locate the correct answer span for a given question.
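To make the proposed idea concrete, the following is a minimal PyTorch sketch of a single graph attention layer over multimodal nodes. It assumes the nodes are object/token embeddings and that an adjacency matrix built from structural cues restricts which nodes may attend to each other; the class name, shapes, and scoring function are illustrative assumptions, not the speaker's actual model.

# Minimal sketch: attention restricted to edges of a multimodal graph.
# All names, shapes, and the toy adjacency are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalGraphAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) embeddings of image regions and text tokens
        # adj:   (N, N) adjacency built from structural cues (nonzero = edge)
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        scores = q @ k.t() / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))  # attend only along edges
        attn = torch.softmax(scores, dim=-1)
        return nodes + attn @ v  # residual update of node states

# Toy usage: 5 nodes (e.g., 3 visual regions + 2 tokens) on a chain with self-loops
nodes = torch.randn(5, 64)
adj = torch.eye(5) + torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
layer = MultimodalGraphAttention(64)
out = layer(nodes, adj)
print(out.shape)  # torch.Size([5, 64])

In this sketch the graph structure enters only through the attention mask, so richer relation types or hierarchies would need edge-type embeddings or stacked layers; it is meant only to illustrate how structural reasoning differs from plain sequence attention.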
Join Zoom Meeting
https://rutgers.zoom.us/j/99127259830?pwd=TlNJTmVVb2Nydzdib2hzWFlsb1pIZz09
Join by SIP
99127259830@zoomcrc.com
Meeting ID: 991 2725 9830
Password: 718150
One tap mobile
+16465588656,,99127259830# US (New York)
+13017158592,,99127259830# US (Germantown)
Join By Phone
+1 646 558 8656 US (New York)
+1 301 715 8592 US (Germantown)
+1 312 626 6799 US (Chicago)
+1 669 900 9128 US (San Jose)
+1 253 215 8782 US (Tacoma)
+1 346 248 7799 US (Houston)
Meeting ID: 991 2725 9830