CS Events
PhD Defense: Large Language Models for Data Driven Applications
Friday, March 15, 2024, 08:45pm - 10:00pm
Speaker: Runhui Wang
Location: Fully Remote
Committee:
Prof. Yongfeng Zhang (advisor)
Prof. Dong Deng (co-advisor)
Prof. Hao Wang
Dr. Xiao Qin (external member)
Event Type: PhD Defense
Abstract: This dissertation presents a series of approaches that leverage deep learning and Large Language Models to address challenges across the pipeline of real-world data-driven applications.

Firstly, we explore the enhancement of locality-sensitive hashing (LSH) for entity blocking through a neuralization approach. Entity blocking is an important data pre-processing step that finds similar data records that might refer to the same real-world entity. We train deep neural networks to act as hashing functions for complex metrics, surpassing the limitations of the generic similarity metrics used in traditional LSH-based methods. Our methodology, embodied in NLSHBlock (Neural-LSH Block), leverages pre-trained language models fine-tuned with a novel LSH-based loss function. NLSHBlock achieves significant performance improvements on entity blocking tasks and also boosts the performance of later steps in the data processing pipeline.

Further, we introduce Sudowoodo, a multi-purpose data integration framework based on contrastive representation learning and large language models, which offers a unified solution for data integration tasks such as entity matching. Entity matching determines whether a pair of data records represent the same real-world entity and plays an essential role in many applications. To tackle the common problem of insufficient high-quality labeled data, Sudowoodo learns similarity-aware data representations without labels and enables effective fine-tuning in the semi-supervised setting, where only a small amount of labeled data is available. Sudowoodo also applies to other data integration tasks such as data cleaning and semantic type detection.

Finally, we propose a Generate-and-Retrieve with Reasoning (GTR) framework for recommender systems, inspired by generative large language models. Entity recommendation is usually the last step in a data-driven application pipeline. The proposed framework views recommendation as instruction following by generative large language models, using natural language instructions to express and decipher user preferences and intentions. GTR directly generates item names, employs state-of-the-art retrieval models for item alignment, and enhances model performance through reasoning distillation.

Through rigorous experimentation on diverse real-world datasets, we validate the effectiveness of these approaches, setting new benchmarks in their respective domains. The findings of this dissertation advance the state of the art in crucial steps of industrial application pipelines, including entity blocking, entity matching, and entity recommendation, and open promising avenues for applying deep learning and large language models to complex data integration and recommendation tasks, improving accuracy, efficiency, and user interaction.
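To make the blocking step concrete, the sketch below illustrates the general idea of LSH-style entity blocking over embeddings from a pre-trained language model: records are encoded, hashed with random hyperplanes, and only records that land in the same bucket become candidate pairs for matching. This is a minimal sketch under assumed choices (the sentence-transformers package, the all-MiniLM-L6-v2 model, the hyperplane count, and the toy records are illustrative); the dissertation's NLSHBlock loss function and fine-tuning procedure are not reproduced here.

```python
# Illustrative sketch: LSH-style blocking over learned embeddings.
# Assumptions: sentence-transformers is installed; model name, number of
# hyperplanes, and sample records are hypothetical, not the thesis setup.
from collections import defaultdict

import numpy as np
from sentence_transformers import SentenceTransformer

records = [
    "iPhone 12 Pro 128GB Graphite",
    "Apple iPhone 12 Pro, 128 GB, graphite",
    "Samsung Galaxy S21 Ultra 256GB",
]

# Step 1: embed each record with a pre-trained language model.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(records, normalize_embeddings=True)  # shape (n, d)

# Step 2: random-hyperplane LSH. Records whose embeddings fall on the same
# side of every hyperplane share a bit code, i.e. a blocking bucket.
rng = np.random.default_rng(0)
num_planes = 8
planes = rng.standard_normal((embeddings.shape[1], num_planes))
codes = (embeddings @ planes > 0).astype(int)  # (n, num_planes) bit codes

blocks = defaultdict(list)
for idx, code in enumerate(codes):
    blocks[tuple(code)].append(idx)

# Step 3: only pairs inside the same block are passed on to entity matching.
for code, members in blocks.items():
    if len(members) > 1:
        print(code, [records[i] for i in members])
```

The neuralization idea described above replaces the fixed, generic hash with an encoder fine-tuned so that records referring to the same entity are more likely to collide.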
Contact: Professor Yongfeng Zhang