PhD Defense: Boundless Data Analytics through Progressive Mining

Abstract: 

Multidimensional distributions in data mining are often represented as plots: scatter plots between two numerical variables, heat maps, bar graphs, histograms, and box plots. They either relate two variables or show the frequency distribution of one variable. What makes one distribution more interesting than another? What if we could generate all possible relationships and rank the most interesting at the top, all automatically, thus saving days of repetitive human work?

We define an attribute-value pair from a dimension as a descriptor, and a conjunction of k descriptors is used to slice a dataset. The problem of generating all possible large data slices is formalized as the frequent itemset mining problem. Because the dimensions may also include derived dimensions, we do not know ahead of time how long the process will take; it may even take an unbounded amount of time. We explore solutions that can answer the following questions: 1) Can we provide some progress indicator during this process? 2) Is the best-so-far partial solution available at any time? To this end, we investigate anytime algorithms and propose a dynamic approach called ALPINE that allows us to achieve flexible trade-offs between efficiency and completeness. ALPINE is, to our knowledge, the first algorithm to progressively mine itemsets and closed itemsets support-wise. It guarantees that all itemsets with support exceeding the current checkpoint's support have been found before it proceeds further. ALPINE can run indefinitely and does not require a minsup value to be decided a priori. The ALPINE approach is also generalized to multiple tables based on E-R modeling, without joining the tables into a single big table.
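To make the idea of support-wise progressive mining concrete, the following is a minimal sketch and not the ALPINE algorithm itself. It assumes a small in-memory table given as a list of dictionaries and uses a naive best-first search over a max-heap. The function name progressive_itemsets and the sample rows are hypothetical, introduced only for illustration. Because adding a descriptor to a conjunction can never increase its support, popping the highest-support conjunction first yields slices in non-increasing order of support, so everything emitted so far is a valid best-so-far answer.

```python
import heapq
from itertools import count, islice


def progressive_itemsets(rows):
    """Yield (support, descriptors) in non-increasing order of support.

    Illustrative sketch only: a naive best-first enumeration, not ALPINE.
    Each row is a dict of attribute -> value; a descriptor is an
    (attribute, value) pair, and an itemset is a tuple of descriptors.
    """
    # Map every descriptor to the set of row indices that contain it.
    tidsets = {}
    for i, row in enumerate(rows):
        for descriptor in row.items():
            tidsets.setdefault(descriptor, set()).add(i)
    items = sorted(tidsets)  # fixed order so each itemset is built once

    tie = count()  # unique tie-breaker so the heap never compares itemsets
    heap = []
    for idx, descriptor in enumerate(items):
        tids = frozenset(tidsets[descriptor])
        heapq.heappush(heap, (-len(tids), next(tie), (descriptor,), tids, idx))

    while heap:
        neg_support, _, itemset, tids, last = heapq.heappop(heap)
        # Every itemset with strictly higher support was emitted earlier.
        yield -neg_support, itemset
        used_attributes = {attribute for attribute, _ in itemset}
        # Extend only with later descriptors on attributes not yet used.
        for idx in range(last + 1, len(items)):
            attribute, _ = items[idx]
            if attribute in used_attributes:
                continue
            child_tids = tids & tidsets[items[idx]]
            if child_tids:  # skip empty slices
                child = itemset + (items[idx],)
                heapq.heappush(
                    heap, (-len(child_tids), next(tie), child, child_tids, idx)
                )


# Hypothetical sample data, for illustration only.
rows = [
    {"team": "A", "pos": "G"},
    {"team": "A", "pos": "G"},
    {"team": "A", "pos": "F"},
    {"team": "B", "pos": "G"},
]

# The caller can stop at any time; here we take only the first three slices.
first_three = list(islice(progressive_itemsets(rows), 3))
for support, descriptors in first_three:
    print(support, descriptors)
```

Since the function is a generator, no minimum support has to be fixed in advance: the consumer simply stops reading when it has seen enough, and the support of the last slice read plays the role of a checkpoint.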

Finally, we propose a boundless analytics engine that can progressively slice a given dataset in all possible ways and generate a very large (unbounded) number of plots. A search system can be built to explore all the plots. The engine is used to analyze extensive historical NBA player stats data, with a demo system for exploring the dataset.

Speaker: 
Qiong Hu
Location: 
CoRE B (305)
Event Date: 
07/11/2018 - 10:00am
Committee: 
Prof. Tomasz Imielinski (Chair), Prof. Alex Borgida, Prof. Shan Muthukrishnan, Prof. Wojciech Szpankowski (Purdue University)
Event Type: 
PhD Defense
Organization: 
Dept. of Computer Science