Qualifying Exam
SAViR-T: Spatially Attentive Visual Reasoning with Transformers
Wednesday, October 19, 2022, 5:30 pm
Visual Reasoning (VR) serves as a measure of machine intelligence: it tests whether a system can apply previously gained knowledge in new settings. Specifically, in VR, we aim to extract and identify task-relevant information from images. For example, in Raven's Progressive Matrices (RPMs), an instance of VR, we are given an incomplete 3x3 image puzzle and must find the governing rules that generated it in order to solve it. In this talk, we will explore the importance of localized spatial information for solving RPM puzzles. Our proposed model, SAViR-T, considers the explicit spatial semantics of visual elements within each image in the puzzle, encoded as spatio-visual tokens, and learns both the intra-image and the inter-image token dependencies. Token-wise relationships, modeled through the transformer-based SAViR-T architecture and followed by a reasoning module, are used to extract the underlying rule representations between the rows of the RPM. We use these relation representations to complete the puzzle. Finally, to demonstrate the efficacy of our approach, we performed extensive experiments on both synthetic datasets (RAVEN, I-RAVEN, and RAVEN-FAIR) and the natural image-based V-PROM dataset.
Speaker: Kalliopi Basioti
Location: Virtual
Committee:
Professor Vladimir Pavlovic
Professor Srinivas Narayana Ganapathy
Professor Hao Wang
Professor Yongfeng Zhang
Event Type: Qualifying Exam
Organization:
Rutgers University School of Arts & Sciences
Contact: Professor Vladimir Pavlovic