CS Events

PhD Defense

Multimodal Story Comprehension: Datasets, Tasks and Neural Models


Thursday, August 12, 2021, 10:00am - 12:00pm


Speaker: Hareesh Ravi

Location: Remote via Zoom


Committee:

Mubbasir Kapadia

Gerard De Melo

Dimitris Metaxas

Nasrin Mostafazadeh (External)

Event Type: PhD Defense

Abstract: Storytelling is a uniquely human skill that plays a central role in how we learn about and experience the world. Stories play a crucial part in the mental development of humans, as they help us encode a wide range of shared knowledge, including common-sense physics, cause and effect, human psychology, and morality. We postulate that machine intelligence requires comparable skills, particularly when interacting with people. Much of the current research on understanding the visual and textual world operates only at a superficial, factual level, using data that align atomic, one-to-one descriptive, factual text with an image. This dissertation aims to close the gap between current research and what we hope AI systems could do by developing novel datasets, tasks, and neural methods for true multimodal story creation and comprehension. We start by showing that existing datasets and generic vision-language tasks do not demand an understanding of factors such as coherence and causality that are important for story comprehension. We propose Story Illustration, the automatic illustration of an input story with a sequence of images, as a measure of story comprehension that imitates how humans visualize when they read or comprehend a story. We develop an end-to-end trained neural module with explicit entity-grid-based coherence modelling that is able to illustrate a story with a clear understanding of coherence and coreference resolution. We then extend Story Illustration to a more general Many-to-Many Story Illustration formulation, and create a new dataset and a novel machine translation approach to story illustration. Our model is shown to generalize well and to achieve high scores in creating coherent illustrations by virtue of its explicit causal decoding. In our work, we observe that generation is a much closer imitation of the human visualization process than retrieval.
Moreover, the existing datasets primarily capture the perceptive process associated with comprehension rather than the creative process associated with storytelling. We create AESOP, a new dataset that captures the creative process associated with visual storytelling. Using AESOP, we propose foundational storytelling tasks that are generative variants of story cloze tests, to better measure the creative and causal reasoning ability required for visual storytelling. We develop a generalized story completion framework that models multimodal stories as the co-evolution of visual and textual concepts. Our dataset and model treat images as compositions of objects and related concepts, which makes it possible to place them in different scenarios to generate new illustrations. An AI system that can comprehend a story can essentially comprehend any kind of data. We hope that this dissertation, along with all the publicly released resources such as data and code, drives further research towards building complex and intelligent systems that can create and comprehend multimodal stories.


Join Zoom Meeting

Meeting ID: 987 4061 6466
Password: 142673
One tap mobile
+13017158592,,98740616466# US (Washington DC)
+13126266799,,98740616466# US (Chicago)

Join By Phone
+1 301 715 8592 US (Washington DC)
+1 312 626 6799 US (Chicago)
+1 646 558 8656 US (New York)
+1 253 215 8782 US (Tacoma)
+1 346 248 7799 US (Houston)
+1 669 900 9128 US (San Jose)
Meeting ID: 987 4061 6466
Find your local number: https://rutgers.zoom.us/u/acZzzbSBwr