Multimodal data understanding is a well-studied problem and an integral part of an intelligent AI system. However, current state-of-the-art techniques are limited to learning one-to-one correspondences between images and text, with a focus on descriptive text (e.g., descriptions of objects and scenes in an image). This is in sharp contrast to how humans tend to communicate and establish associations between multimodal data. For example, a user may post an image of a sunset on Facebook with the caption “I owe it to myself to be happy with my life. A look at what nature has to offer is all it takes.” Here the caption expresses the poster’s opinion, in multiple sentences, rather than what is actually in the image. This type of text and co-occurring images make up most of the information in social networks. Associations in narratives, conversations, stories and dialogues are of a similar nature. My work concentrates on learning representations for sequential text or co-occurring images and text, where the text can be abstract and narrative, as in a story or a conversation, and possibly colloquial. To this end, I want to answer the following questions as part of my thesis: (1) Can human-like understanding of, and discourse about, multimodal information be learned? (2) Can the learned model be used to tell or show a convincing story? (3) Can learning be extended to encode multimodal conversations that happen in an information cascade in a social network? (4) Can such a cascade encoding better predict the evolution of information spread, of the cascade, or of the network itself?