CS Events

PhD Defense

Distributed Frameworks for Approximate Data Analytics



Monday, July 27, 2020, 01:00pm - 03:00pm



Guangyan Hu 

Location: Remote via Webex


Prof. Thu Nguyen (chair)

Prof. Uli Kremer

Prof. Yongfeng Zhang

Prof. Zhenhua Liu (Stony Brook University)

Event Type: PhD Defense

Abstract: Data-driven discovery has become critical to the mission of many enterprises and to scientific research. At the same time, the rate of data production and collection is outpacing technology scaling, suggesting that significant future investment, time, and energy will be needed for data processing. Adding hardware resources can meet the extra processing needs, either by adding more CPU cores or memory (scale-up) or by adding more worker nodes (scale-out). However, doing so raises computing costs, which may not be feasible when the budget is limited. One powerful tool for addressing this challenge is approximate computing, which trades computational accuracy for savings in time and resources by reducing the amount of data that must be processed. Fortunately, many data analytics applications, such as data mining, log processing, and video/image processing, are amenable to approximation.

In this thesis, we describe the design and implementation of approximation frameworks to accelerate distributed data analytics. We present frameworks targeting a variety of tasks and datasets, including log aggregation, text analytics, and video querying and aggregation.

Our first work targets approximate aggregation jobs with error estimation. Aggregation is central to many decision support queries. It is an important component of OLAP (Online Analytical Processing) systems and is frequently used to summarize data patterns in business intelligence. Aggregation jobs often involve multiple transformation steps in a data processing pipeline. We design and implement a sampling-based approximation framework called ApproxSpark that can rigorously derive estimators with error bounds for approximate aggregation.

Our second work targets approximate text analytics tasks. We propose and evaluate a framework called EmApprox that uses sampling-based approximation to speed up the processing of a wide range of queries over large text datasets. EmApprox builds an index for a dataset by learning a natural language processing model, producing vectors that represent words and subcollections of documents.

Finally, we target approximate video analytics. Video data embed rich, high-quality information. Yet video analytics is particularly compute-intensive, as it often involves invoking a deep convolutional neural network (CNN) for object detection. We design and implement an approximate video analytics framework called VidApprox for accelerating video queries that involve object detection. VidApprox first leverages cheap CNNs to learn vector representations of video segments, and then processes the vectors into a persistent index structure. At query processing time, index lookups serve as auxiliary information for retrieving only a subset of the more similar video segments.


Guangyan Hu defense

Meeting number: 120 798 1696
Password: MSnM9r4gi5J


Join by video system
You can also dial and enter your meeting number.

Join by phone
+1-650-429-3300 USA Toll
Access code: 120 798 1696