1/9/2019 11:00 am
CoRE A 301

Searching Heterogeneous Personal Data

Daniela Vianna, Rutgers

Defense Committee: Prof. Amelie Marian (Chair), Prof. Alexander Borgida, Prof. Thu D. Nguyen


Personal data is now pervasive, as digital devices are capturing every part of our lives. Users are constantly collecting and saving more data, either actively in files, emails, social media interactions, etc., or passively through GPS tracking of mobile devices or records of financial transactions. Unlike traditional information seeking, which focuses on discovering new information, search on personal data is usually focused on retrieving information that users know exists in their own data set, even though most of the time they do not have a perfect recollection of where it is stored. Attempting to retrieve and cross-reference personal information leads to a tedious process of individually accessing all the relevant sources of data and manually linking their information. In this scenario, traditional searches are often inefficient, making it critical for search tools to be capable of accessing heterogeneous and decentralized data in a flexible and accurate way by taking into consideration the additional knowledge the user is likely to have about the target information.

In this work, we propose a set of techniques that allow users to easily access their own data. We start by proposing a unified and intuitive multidimensional data model built on a combination of dimensions that naturally summarize various aspects of the data collection: who, when, where, what, why, how. We then propose frequency-based scoring models that leverage the correlation between users (who), time (when), location (where), data topics (what), and provenance (how) to improve search over personal data. Since the proposed scoring model needs to generalize well over user-specific data sets, we extend our static scoring function by adopting a learning-to-rank approach using the state-of-the-art LambdaMART algorithm. Due to the lack of pre-existing personal training data, we build our own training sets using a combination of known-item query generation techniques and an unsupervised ranking model (field-based BM25).
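As a rough illustration of how a six-dimension record and a field-based ranking function can fit together, the sketch below stores each item as a dictionary of token lists keyed by dimension and scores it with a weighted sum of per-dimension BM25 scores. This is an assumption-laden toy, not the scoring model or the BM25 variant used in the thesis: the record layout, the function names (`bm25_field`, `rank`), the per-dimension weights, and the parameter values `k1=1.2` and `b=0.75` are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# The six dimensions named in the abstract; the dict-of-token-lists
# record layout is an assumption made for this sketch.
DIMENSIONS = ("who", "when", "where", "what", "why", "how")


def bm25_field(query_terms, field_tokens, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75):
    """Standard BM25 score of one field against the query terms."""
    tf = Counter(field_tokens)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        df = doc_freq.get(term, 0)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * len(field_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
    return score


def rank(records, query, weights=None):
    """Rank records by a weighted sum of per-dimension BM25 scores.

    records: list of dicts mapping a dimension name to a list of tokens.
    query:   dict mapping a dimension name to a list of query tokens.
    weights: optional per-dimension weights (hypothetical, default 1.0).
    Returns (record indices sorted best first, list of raw scores).
    """
    n_docs = len(records)
    if n_docs == 0:
        return [], []
    weights = weights or {dim: 1.0 for dim in DIMENSIONS}
    scores = [0.0] * n_docs
    for dim in DIMENSIONS:
        terms = query.get(dim, [])
        if not terms:
            continue
        fields = [record.get(dim, []) for record in records]
        # Fall back to 1.0 when every record leaves this dimension empty.
        avg_len = sum(len(f) for f in fields) / n_docs or 1.0
        # Number of records whose field contains each token.
        doc_freq = Counter(tok for f in fields for tok in set(f))
        for i, field in enumerate(fields):
            scores[i] += weights.get(dim, 1.0) * bm25_field(
                terms, field, doc_freq, n_docs, avg_len)
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return order, scores
```

A query such as `{"who": ["alice"], "what": ["trip"]}` would then favor records that match on both dimensions over records that match on only one. In a learning-to-rank setting, scores like these could serve as noisy relevance labels for generated known-item queries, although the exact labeling procedure is not described in this announcement.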

To validate the data and scoring models, we implemented tools for data extraction, classification, entity recognition, and topic modeling. A thorough qualitative evaluation over real user data from a variety of data sources such as Facebook, Dropbox, and Gmail shows that our approach significantly improves search accuracy when compared with traditional personal search tools such as Apple’s Spotlight and Apache Solr, and techniques like TF-IDF and BM25.