CS Events
Computer Science Department Colloquium
Efficient Query Processing of Large-Scale Unstructured Data
Tuesday, September 17, 2024, 10:30am - 12:00pm
Speaker: Dong Deng
Bio
Dong Deng is an Assistant Professor in the Department of Computer Science at Rutgers University. Before joining Rutgers, he was a postdoc in the Database Group at MIT, where he worked with Mike Stonebraker and Sam Madden on data curation systems. He obtained his Ph.D. in Computer Science from Tsinghua University. His research interests include large-scale data management, vector databases, data science, data curation, and database systems. He has published over 50 research papers in top database venues, including SIGMOD, VLDB, and ICDE. His research is supported by the National Science Foundation, Adobe, and The ASSISTments Foundation.
Location: CoRE 301
Event Type: Computer Science Department Colloquium
Abstract: Large-scale unstructured data, such as vectors and texts, are ubiquitous nowadays due to the rapid development of deep learning and large language models. Many machine learning models have been developed to effectively represent real-world objects as high-dimensional feature vectors. Meanwhile, real-world objects (e.g., products, video frames) are often associated with structured attributes (e.g., price, timestamp). In many scenarios, both the feature vectors and the structured attributes of these objects need to be jointly queried. In this talk, I will present our recent work on multi-modal approximate nearest neighbor search, which finds the approximate nearest neighbors of a query vector while applying predicates to the structured attributes. I will also cover near-duplicate text alignment, which identifies snippets in a collection of long texts that are similar to a short query text. This computation-intensive operation has applications in bioinformatics, large language model evaluation, and copyright protection. I will report on our recent progress in efficient near-duplicate text alignment over large-scale text corpora, such as the training data of large language models.
Contact: Professor Ulrich Kremer