CS Events
PhD Defense: Unlocking Visual Reasoning: Exploring Representations for Enhanced Problem-Solving
Thursday, January 18, 2024, 03:00pm - 05:00pm
Speaker: Pritish Sahu
Location: CBIM Multipurpose Room - 22
Committee:
Prof. Vladimir Pavlovic (advisor)
Prof. Dimitris Metaxas
Prof. Yongfeng Zhang
Prof. Junsong Yuan (external)
Event Type: PhD Defense
Abstract: The success of deep learning systems in various applications hinges on their ability to extract structured and invariant representations. However, visual reasoning remains challenging due to the complexity of high-dimensional sensory input and the necessity for high-level abstraction. In contrast, humans excel at this complex process by using simple design principles based on realizing low-level abstractions and their relations from the visual input. Hence, the need to understand why humans excel at this cognitive task, while current computational models fall short of solving visual reasoning tasks in a human-like hierarchical manner, is increasingly apparent. Current reasoning models require enormous amounts of training data, exhibit sensitivity to perturbations, and lack the capacity to generalize to new reasoning tasks. In this dissertation, we aim to address these limitations of visual perception and visual reasoning.

The thesis comprises two main parts. The first part, devoted to Visual Reasoning via Disentangled Representations, delves into extracting high-quality disentangled representations and devising modules to tackle reasoning tasks using these representations. We begin with the pursuit of understanding and learning disentangled representations that encode the salient (data-generative) factors of variation in the data independently. To achieve this, we present a novel VAE-based approach capable of disentangling latent representations in a fully unsupervised manner. Our approach harnesses the total correlation (TC) within the latent space by introducing a relevance indicator variable. This variable pinpoints and emphasizes significant factors, characterized by substantial prior KL divergence, while filtering out noise-associated factors with minimal variation. Our method automatically identifies and assimilates genuine factors, even when the number of such factors is not known in advance, and it outperforms existing methods both quantitatively and qualitatively. These disentangled latent factors, which independently map generative factors, prove invaluable in reasoning puzzles where visual attributes correspond to specific rules such as constancy, progression, or arithmetic: they enable the derivation of rules capable of solving various puzzles. Additionally, these representations exhibit sample efficiency and superior generalization, rendering them ideal for solving visual reasoning problems.

Expanding on this concept, we propose a computational model that addresses visual reasoning tasks as an end-to-end joint representation-reasoning learning framework. This framework leverages the weak inductive bias present in reasoning datasets to accomplish both tasks concurrently. Focusing on Raven's Progressive Matrices (RPMs) as our reasoning task, we introduce a general generative graphical model (GM-RPM). Subsequently, we propose the "Disentangling-based Abstract Reasoning Network (DAReN)", aligned with the principles of GM-RPM. Evaluating our model across disentanglement and reasoning benchmarks demonstrates consistent improvement over existing state-of-the-art models in both domains. Our results underscore the necessity of structured representations for solving visual reasoning tasks.
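To make the relevance-indicator idea concrete, below is a minimal PyTorch sketch of one way such gating over per-dimension prior KL could be wired up. The function names, the sigmoid gate, and the temperature `tau` are illustrative assumptions, not the dissertation's actual implementation, and the full method's total-correlation term is omitted here.

```python
import torch

def gaussian_kl_per_dim(mu, logvar):
    # Per-dimension KL( N(mu, sigma^2) || N(0, 1) ), averaged over the batch:
    # 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).mean(dim=0)

def relevance_weighted_kl(mu, logvar, tau=0.1):
    # Hypothetical relevance indicator: latent dimensions with large prior KL
    # are treated as genuine generative factors; near-zero-KL dimensions are
    # treated as noise and down-weighted.
    kl_per_dim = gaussian_kl_per_dim(mu, logvar)                       # (latent_dim,)
    relevance = torch.sigmoid((kl_per_dim - kl_per_dim.mean()) / tau)  # soft gate in (0, 1)
    # detach() keeps the optimizer from "gaming" the gate itself; only the
    # weighted KL term receives gradients.
    return (relevance.detach() * kl_per_dim).sum(), relevance

# Usage with dummy encoder outputs:
mu, logvar = torch.randn(32, 10), torch.randn(32, 10)
kl_loss, gate = relevance_weighted_kl(mu, logvar)
```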
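As a toy illustration of how disentangled attributes reduce RPM solving to rule checking, consider integer-coded attribute values (e.g., object count) extracted per panel of a row. The rule names follow the abstract; the code itself is a hypothetical simplification, not the dissertation's method.

```python
def row_rule(values):
    """Classify a row of three attribute values (e.g., object counts recovered
    from a disentangled latent) under common RPM rules."""
    a, b, c = values
    if a == b == c:
        return "constancy"
    if b - a == c - b != 0:
        return "progression"
    if a + b == c:
        return "arithmetic"
    return "unknown"

# A solver would infer the rule from the two complete rows, then pick the
# candidate answer that makes the third row satisfy the same rule.
assert row_rule([2, 2, 2]) == "constancy"
assert row_rule([1, 2, 3]) == "progression"
assert row_rule([1, 3, 4]) == "arithmetic"
```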
The second part of my dissertation is devoted to learning tokenized spatial representations that capture low-level visual concepts within each RPM image. We introduce "Spatially Attentive Transformers for Abstract Visual Reasoning (SARN)", a novel computational model that integrates spatial semantics within visual elements, represented as spatio-visual tokens capturing both intra-image and inter-image relationships within the puzzle. The reasoning module groups these tokens (by row or column) to capture the underlying rule binding the puzzle, thereby solving the visual reasoning task. Through extensive experiments on established RPM benchmarks, we demonstrate that our model surpasses existing approaches. Furthermore, we validate that the learned rule representation is more robust on novel tasks and generalizes better to test-time domain shifts than current methods.
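To illustrate what spatio-visual tokens carrying both intra-image and inter-image position information might look like, here is a hedged PyTorch sketch. The class name, patch size, and embedding scheme are assumptions made for exposition, not SARN's actual implementation.

```python
import torch
import torch.nn as nn

class SpatioVisualTokenizer(nn.Module):
    """Illustrative sketch: turn each RPM panel into patch tokens that carry
    both a spatial position (within the panel) and a panel position (within
    the 3x3 puzzle grid). Sizes below assume 80x80 grayscale panels."""
    def __init__(self, patch=20, dim=128, panels=9, grid=4):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # patchify
        self.spatial_pos = nn.Parameter(torch.randn(grid * grid, dim))  # intra-image
        self.panel_pos = nn.Embedding(panels, dim)                      # inter-image

    def forward(self, panels):                 # panels: (B, 9, 1, 80, 80)
        B, P = panels.shape[:2]
        x = self.proj(panels.flatten(0, 1))    # (B*9, dim, 4, 4)
        x = x.flatten(2).transpose(1, 2)       # (B*9, 16, dim) patch tokens
        x = x + self.spatial_pos               # add within-panel positions
        x = x.view(B, P, -1, x.shape[-1])
        ids = torch.arange(P, device=x.device)
        x = x + self.panel_pos(ids)[None, :, None, :]  # add panel identity
        return x.flatten(1, 2)                 # (B, 9*16, dim) spatio-visual tokens
```

A reasoning head could then run a transformer encoder over these tokens and pool them by puzzle row or column to form rule embeddings for scoring candidate answers, as the abstract describes.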
In a nutshell, this work underscores the necessity of acquiring structured representations to enhance visual reasoning performance. In doing so, we address certain limitations in AI model design and narrow the gap between machine intelligence and human cognitive abilities.
Contact: Professor Vladimir Pavlovic