CS Events

PhD Defense

Unlocking Artificial Intelligent Video Understanding through Object-Centric Relational Reasoning

 


Monday, April 24, 2023, 02:00pm - 04:00pm

 

Our experience as humans is deeply shaped by our perception of what happens to the objects in the visual world. Rather than building a machine that attempts to attain visual intelligence solely from static images or directly from low-level pixels, we may need to accomplish that through dynamic relational reasoning about objects in videos. Relational reasoning infers the saliency of objects and their dynamic interactions. As advanced video understanding is crucial for machines to effectively interpret and interact with the dynamic visual world that humans live in, we propose using relational reasoning about objects to address three key challenges in video understanding: (i) spatiotemporal understanding, (ii) high-level semantic understanding, and (iii) foundational video representation learning. Specifically, we begin by considering the problem of object tracking, where multiple objects move and are frequently occluded by one another. We propose Hopper, which uses a Multi-hop Transformer to reason about object permanence through multi-step relational modeling. Moving toward higher-level semantic understanding of videos, we tackle the task of Group Activity Recognition, which detects the activity collectively performed by a group of actors in a video. We propose COMPOSER, a Multi-scale Transformer based architecture that performs multi-scale relational reasoning to learn group activity compositionally. Finally, video representations are the basis of video understanding. We propose Paprika, a pre-training procedure that leverages graph-based relational reasoning to learn a generalizable video representation, targeting improved downstream procedural understanding in instructional videos, and thus paving the way for robots to learn from human-provided demonstrations and offer humans better assistance and interactions. Our solutions outperform the current state-of-the-art methods on several video datasets.
Our results indicate that object-centric relational reasoning is key to artificially intelligent video understanding. By introducing inductive biases such as multi-step reasoning, multiple hierarchies, graph-structured knowledge, suitable modalities, and/or external sources of information, relational reasoning can be further enhanced, ultimately leading to a machine-learning-based video system that is more efficient to develop and more effective at solving a variety of video-related tasks.

 

Speaker: Honglu Zhou

Location: CBIM 22

Committee

Professor Mubbasir Kapadia (Chair)

Professor Vladimir Pavlovic

Professor Dimitris Metaxas

Dr. Iqbal Mohomed (Toronto AI Research Centre)

Event Type: PhD Defense

Abstract: See above

Organization

Rutgers University

School of Arts & Sciences

Department of Computer Science

 

Contact: Professor Mubbasir Kapadia