Past Events

PhD Defense

Methods for Leveraging Auxiliary Signals for Low-Resource NLP

Wednesday, December 21, 2022, 03:00pm - 04:30pm


Speaker: Xin Dong

Location: Virtual, https://rutgers.zoom.us/j/8203272500?pwd=R0hGYkNBQllsaWdKMWN3OEh5V0dpZz09

Committee

Prof Gerard de Melo (Chair)

Prof Yongfeng Zhang

Prof Karl Stratos

Prof Handong Zhao (External Member)

Event Type: PhD Defense

Abstract: There is a growing need for NLP systems that support low-resource settings, in which task-specific training data may be lacking and domain-specific corpora are too scarce to build a reliable system. Over the past decade, co-occurrence-based training objectives, as in methods such as word2vec, were the first to offer word-level semantic information for specific domains. More recently, pretrained language model architectures such as BERT have been shown capable of learning monolingual or multilingual representations with self-supervised objectives under a shared vocabulary, simply by combining input from one or multiple languages. Such representations greatly facilitate low-resource language applications. Still, the success of such cross-domain transfer hinges on how close the involved domains are, with substantial drops observed for more distant domain pairs, such as English to Korean, or Wikipedia text to social media comments. To address this, domain-specific unlabeled corpora can serve as auxiliary signals to enhance low-resource NLP systems. In this dissertation, we present a series of methods for leveraging such auxiliary signals. In particular, we propose cross-lingual sentiment embeddings with transfer learning to improve sentiment analysis. For cross-lingual text classification, we present a self-learning framework that takes advantage of unlabeled data. Furthermore, we propose a framework based on data augmentation with adversarial training for cross-lingual NLI to address the low-resource problem in the target domain. Extensive experimental results demonstrate the effectiveness of the proposed methods across a variety of NLP tasks.
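The self-learning idea mentioned in the abstract is typically an iterative pseudo-labeling loop: train on the labeled set, predict the unlabeled pool, keep only confident predictions as new training examples, and repeat. A minimal sketch of such a loop, using a toy nearest-centroid classifier in place of the actual cross-lingual model (the classifier, confidence measure, and data here are illustrative assumptions, not the dissertation's method):

```python
# Minimal self-training sketch. A nearest-centroid classifier stands in
# for the real cross-lingual model; all names/data are hypothetical.

def centroid_fit(X, y):
    # Compute one mean feature vector (centroid) per class label.
    sums, counts = {}, {}
    for x, label in zip(X, y):
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}

def centroid_predict(centroids, x):
    # Return (label, confidence); confidence is the relative margin
    # between the two nearest class centroids (larger = more confident).
    def dist(a, b):
        return sum((u - w) ** 2 for u, w in zip(a, b)) ** 0.5
    ranked = sorted((dist(c, x), label) for label, c in centroids.items())
    (d1, best), (d2, _) = ranked[0], ranked[1]
    conf = (d2 - d1) / (d2 + d1 + 1e-9)
    return best, conf

def self_train(X_lab, y_lab, X_unlab, threshold=0.3, rounds=3):
    # Iteratively grow the labeled set with confident pseudo-labels.
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(rounds):
        model = centroid_fit(X, y)
        deferred = []
        for x in pool:
            label, conf = centroid_predict(model, x)
            if conf >= threshold:
                X.append(x)          # accept confident pseudo-label
                y.append(label)
            else:
                deferred.append(x)   # revisit uncertain examples later
        pool = deferred
    return centroid_fit(X, y)
```

In practice the classifier would be a pretrained multilingual encoder fine-tuned on the source language, and the confidence threshold controls the trade-off between pseudo-label coverage and noise.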