CS Events

Qualifying Exam

Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches

 

Download as iCal file

Thursday, January 30, 2025, 01:30pm - 03:00pm

 

Speaker: Zhizhi Wang

Location : CoRE 305

Committee

Assistant Professor Dong Deng (Advisor)

Associate Professor Yongfeng Zhang

Assistant Professor He Zhu

Assistant Professor Arpita Biswas

Event Type: Qualifying Exam

Abstract: The near-duplicate text alignment search problem—aimed at identifying all near-duplicate passage pairs between a query document and a collection of source documents—presents a computationally intensive challenge. This task is crucial for applications like plagiarism detection and LLM memorization. Brute-force methods require comparing O(n^2 m^2) passage pairs between a single query document with n words and a single source document with m words, making them impractical for large corpora. Existing solutions often rely on heuristics like "seeding-extension-filter," which lack theoretical guarantees and involve numerous difficult-to-tune hyperparameters. A previous work ALLIGN employs min-hash sketches to solve this problem. However, it is limited to comparisons between pairs of documents.To address these limitations, we propose leveraging bottom-k sketching to estimate Jaccard similarity between passages. A key insight in our approach is that contiguous passages share the same bottom-k sketch. We introduce the concept of “compact window” to represent all passages that share the same sketch in a concise manner and develop an efficient algorithm to group these passages in O(nlog⁡n+nk)time, reducing the total number of sketches from O(n^2) to O(nk) for a document with n words. Experimental results on real-world datasets demonstrate that our techniques achieve high efficiency.List of Publication(s):TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection (SIGMOD2022)

Contact  Assistant Professor Dong Deng

https://rutgers.zoom.us/j/95841467337?pwd=Ivci4myiYFIcUb5kVafYnoatF5owc4.1