CS Events
Qualifying ExamEfficient Near-Duplicate Text Alignment Search via Bottom-k Sketches |
|
||
Thursday, January 30, 2025, 01:30pm - 03:00pm |
|||
Speaker: Zhizhi Wang
Location : CoRE 305
Committee:
Assistant Professor Dong Deng (Advisor)
Associate Professor Yongfeng Zhang
Assistant Professor He Zhu
Assistant Professor Arpita Biswas
Event Type: Qualifying Exam
Abstract: The near-duplicate text alignment search problem—aimed at identifying all near-duplicate passage pairs between a query document and a collection of source documents—presents a computationally intensive challenge. This task is crucial for applications like plagiarism detection and LLM memorization. Brute-force methods require comparing O(n^2 m^2) passage pairs between a single query document with n words and a single source document with m words, making them impractical for large corpora. Existing solutions often rely on heuristics like "seeding-extension-filter," which lack theoretical guarantees and involve numerous difficult-to-tune hyperparameters. A previous work ALLIGN employs min-hash sketches to solve this problem. However, it is limited to comparisons between pairs of documents.To address these limitations, we propose leveraging bottom-k sketching to estimate Jaccard similarity between passages. A key insight in our approach is that contiguous passages share the same bottom-k sketch. We introduce the concept of “compact window” to represent all passages that share the same sketch in a concise manner and develop an efficient algorithm to group these passages in O(nlogn+nk)time, reducing the total number of sketches from O(n^2) to O(nk) for a document with n words. Experimental results on real-world datasets demonstrate that our techniques achieve high efficiency.List of Publication(s):TxtAlign: Efficient Near-Duplicate Text Alignment Search via Bottom-k Sketches for Plagiarism Detection (SIGMOD2022)
:
Contact Assistant Professor Dong Deng