CS Events

Computer Science Department Colloquium

Efficient Query Processing of Large-Scale Unstructured Data

 

Tuesday, September 17, 2024, 10:30am - 12:00pm

 

Speaker: Dong Deng

Bio

Dong Deng is an Assistant Professor in the Department of Computer Science at Rutgers University. Before joining Rutgers, he was a postdoc in the Database Group at MIT, where he worked with Mike Stonebraker and Sam Madden on data curation systems. He obtained his Ph.D. in Computer Science from Tsinghua University. His research interests include large-scale data management, vector databases, data science, data curation, and database systems. He has published over 50 research papers in top database venues, including SIGMOD, VLDB, and ICDE. His research is supported by the National Science Foundation, Adobe, and The ASSISTments Foundation.

Location: CoRE 301

Event Type: Computer Science Department Colloquium

Abstract: Large-scale unstructured data, such as vectors and texts, are ubiquitous nowadays due to the rapid development of deep learning and large language models. Many machine learning models have been developed to effectively represent real-world objects as high-dimensional feature vectors. Meanwhile, real-world objects (e.g., products, video frames) are often associated with structured attributes (e.g., price, timestamp). In many scenarios, both the feature vectors and the structured attributes of these objects need to be jointly queried. In this talk, I will present our recent work on multi-modal approximate nearest neighbor search, which finds the approximate nearest neighbors of a query vector while applying predicates to the structured attributes. Additionally, near-duplicate text alignment identifies snippets in a collection of long texts that are similar to a short query text. This computation-intensive operation has applications in bioinformatics, large language model evaluation, and copyright protection. I will report on our recent progress in efficient near-duplicate text alignment over large-scale text corpora, such as the training data of large language models.

Contact: Professor Ulrich Kremer

Join Zoom Meeting
https://rutgers.zoom.us/j/2014444359?pwd=WW9ybFNCNVFrUWlycHowSHdNZjhzUT09

Meeting ID: 201 444 4359
Password: 550978