Seminar
4/29/2015 11:00 am
CoRE A (Room 301)

Random Walks on Context Spaces: Towards an explanation of the mysteries of semantic word embeddings

Sanjeev Arora, Princeton University

Organizer(s): Swastik Kopparty and Shubhangi Saraf

Abstract

Semantic word embeddings represent words as vectors in R^d, for say d = 300. They are useful for many NLP tasks and are often constructed using nonlinear/nonconvex techniques such as deep nets and energy-based models. Recently, Mikolov et al. (2013) showed that such embeddings exhibit linear structure that can be used to solve "word analogy tasks" such as man : woman :: king : ??. Subsequently, Levy and Goldberg (2014) and Pennington et al. (2014) tried to explain why such linear structure should arise in embeddings derived from nonlinear methods. We point out gaps in these explanations and provide a more complete explanation in the form of a loglinear generative model for text corpora that directly models latent semantic structure in words and involves a random walk in the context space. A rigorous mathematical analysis yields explicit expressions for word co-occurrence probabilities, which leads to a surprising explanation for why word embeddings exist and why they exhibit linear structure that allows solving analogies. Our model and its analysis lead to several counterintuitive predictions, which are also verified empirically.
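To make the "linear structure" claim concrete: an analogy a : b :: c : ?? is solved by finding the word whose vector lies closest, in cosine similarity, to v_b - v_a + v_c. The sketch below uses toy random vectors with the relation planted by hand (a real test would load trained embeddings such as word2vec or GloVe); the nearest-neighbor recipe itself follows Mikolov et al. (2013).

```python
import numpy as np

# Toy vocabulary with hypothetical vectors; real embeddings would be
# loaded from a trained model rather than generated randomly.
rng = np.random.default_rng(0)
d = 300
words = ["man", "woman", "king", "queen", "apple", "banana"]
emb = {w: rng.standard_normal(d) for w in words}
# Plant the linear relation so the toy data behaves like real embeddings.
emb["queen"] = emb["king"] - emb["man"] + emb["woman"] + 0.1 * rng.standard_normal(d)

def solve_analogy(a, b, c, emb):
    """Return the word (excluding a, b, c) whose vector has the highest
    cosine similarity to v_b - v_a + v_c."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    candidates = ((w, v @ target / np.linalg.norm(v))
                  for w, v in emb.items() if w not in (a, b, c))
    return max(candidates, key=lambda wv: wv[1])[0]

print(solve_analogy("man", "woman", "king", emb))  # -> queen
```

The generative model can likewise be sketched in a few lines: a context (discourse) vector c_t takes a slow random walk, and at each step a word w is emitted with loglinear probability proportional to exp(<v_w, c_t>). The Gaussian word vectors, step size, and spherical projection below are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_words, n_steps = 50, 1000, 200
word_vecs = rng.standard_normal((n_words, d))  # hypothetical word vectors

def normalize(v):
    return v / np.linalg.norm(v)

c = normalize(rng.standard_normal(d))  # initial context vector
corpus = []
for _ in range(n_steps):
    # Loglinear emission: Pr[w | c] proportional to exp(<v_w, c>).
    logits = word_vecs @ c
    p = np.exp(logits - logits.max())
    p /= p.sum()
    corpus.append(int(rng.choice(n_words, p=p)))
    # Slow drift: nearby time steps share similar contexts, so words that
    # co-occur in a window tend to have correlated vectors.
    c = normalize(c + 0.05 * rng.standard_normal(d))

print(corpus[:20])  # token ids from the simulated corpus
```

Because the walk is slow, co-occurrence statistics collected from such a corpus reflect inner products among the word vectors, which is the quantity the analysis ties to embedding geometry.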

We think our methodology may be useful in other settings.

Joint work with Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski (listed in alphabetical order).

Manuscript at http://arxiv.org/abs/1502.03520