Computer Science Department Colloquium

Unstructured Data Management at Scale for Machine Learning

Thursday, November 17, 2022, 10:30am

Speaker: Dong Deng

Bio

Dong Deng is an assistant professor in the Computer Science Department at Rutgers University. His research interests include large-scale data management, data science, data curation, and database systems. Before joining Rutgers, he was a postdoc in the Database Group at MIT, where he worked with Mike Stonebraker and Sam Madden on data curation systems. He received his Ph.D. from Tsinghua University with honors. He has published over 30 research papers in top data management venues, mainly SIGMOD, VLDB, and ICDE. According to Google Scholar, his publications have attracted over 2000 citations. His research is supported by a couple of NSF awards. He regularly serves on the program committees of data management, data mining, and information retrieval conferences, and on the organizing committees of several data management conferences.

Location: CoRE 301

Event Type: Computer Science Department Colloquium

Abstract: A clear trend in machine learning is that models keep growing larger and are trained on ever more data. This is especially true for foundation models, such as GPT-3 and DALL-E 2. In this talk, I will discuss a few unstructured data management problems arising in machine learning. First, recent studies show that large language models (LLMs) unintentionally memorize part of their training data, which brings significant privacy risks. These studies mostly focus on exact duplicates. However, how many texts generated by LLMs have near-duplicate sequences in the training data? Are sequences with more near-duplicates in the training data more likely to be memorized by LLMs? To answer these questions, I will introduce a series of works (SIGMOD-21 and SIGMOD-22a) from my group that enable efficient near-duplicate sequence search in terabytes of LLM training texts on a single machine. A major challenge here is that the number of sequences in a text is quadratic in the text length. Second, real-world objects such as images and texts can be represented as dense vectors. I will briefly introduce our work on large-scale vector data management, a project funded by NSF IIS. Finally, I will conclude the talk by outlining a few near-future works to be conducted in my group.
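
The quadratic growth mentioned in the abstract can be illustrated with a small sketch. It assumes that "sequence" means a non-empty contiguous run of tokens, which the abstract does not define, and the function names are hypothetical. This is a brute-force illustration of the counting argument only, not the search method presented in the talk.

```python
def count_sequences(n: int) -> int:
    """Number of non-empty contiguous token sequences in a text of n tokens.

    Assumption: a "sequence" is a contiguous run tokens[i:j] with i < j,
    so there are n choices of length 1, n - 1 of length 2, ..., 1 of length n.
    """
    return n * (n + 1) // 2


def enumerate_sequences(tokens: list[str]) -> list[tuple[str, ...]]:
    """Brute-force enumeration of every contiguous sequence (tiny inputs only)."""
    return [
        tuple(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    ]


tokens = "the quick brown fox".split()
# The closed form agrees with explicit enumeration on a small example.
assert len(enumerate_sequences(tokens)) == count_sequences(len(tokens))
```

Under this assumption a text of n tokens has n(n + 1)/2 sequences, so explicit enumeration becomes impractical well before terabyte scale.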

Contact: Matthew Stone