
Pre-Defense: Forever Data Analytics through Progressive Mining

Abstract: 

Multidimensional distributions in data mining are often represented as plots: scatter plots between two numerical variables, heat maps, bar graphs, histograms, and boxplots. Each plot either relates two variables or shows the frequency distribution of one variable. What makes one distribution more interesting than another? What if we could generate all possible relationships and rank the most interesting at the top, all automatically, thus saving days of repetitive human work?

We define a categorical attribute-value pair of the form attribute = value as a descriptor; a conjunction of k descriptors is used to slice a dataset. The problem of generating all possible data slices with some minimum support is formalized as the frequent itemset mining problem. Because the task is hard and compute-intensive, the mining process may be long and may even take forever. We explore solutions that answer the following questions: 1) Can we provide a progress indicator during this process? 2) Is the best-so-far partial solution available at any time? To this end, we investigate anytime algorithms and propose a dynamic approach called ALPINE that allows flexible trade-offs between efficiency and completeness. ALPINE is, to our knowledge, the first algorithm to progressively mine itemsets and closed itemsets “support-wise”. It guarantees that all itemsets with support exceeding the current checkpoint’s support have been found before it proceeds further. ALPINE runs literally “forever” and without a minsup value decided a priori. The ALPINE approach is also generalized to multiple tables based on E-R modeling, without joining the tables into a single big table.
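To make the "support-wise" idea concrete, here is a minimal Python sketch. It is not ALPINE itself, whose internals the abstract does not describe. It is a plain best-first search over conjunctions of descriptors, assuming a single small table held in memory. The function name mine_by_decreasing_support and the toy rows are hypothetical. Because adding a descriptor can never increase support, popping from a max-heap yields itemsets in non-increasing support order, so a caller can stop at any time and still hold every itemset above the last support value seen.

```python
import heapq
from itertools import islice
from typing import Dict, Iterator, List, Tuple

Descriptor = Tuple[str, str]  # (attribute, value), i.e. "attribute = value"


def mine_by_decreasing_support(
    rows: List[Dict[str, str]],
) -> Iterator[Tuple[int, Tuple[Descriptor, ...]]]:
    """Yield (support, itemset) pairs in non-increasing support order.

    Illustrative best-first enumeration only; not the ALPINE algorithm.
    """
    # Map every descriptor to the set of row ids it covers.
    cover: Dict[Descriptor, set] = {}
    for rid, row in enumerate(rows):
        for attr, val in row.items():
            cover.setdefault((attr, val), set()).add(rid)
    items = sorted(cover)  # fixed order so each itemset is built exactly once

    # Heap entries: (-support, itemset, index of last descriptor, covered rows).
    heap = [
        (-len(cover[d]), (d,), i, frozenset(cover[d]))
        for i, d in enumerate(items)
    ]
    heapq.heapify(heap)

    while heap:
        neg_sup, itemset, last, rids = heapq.heappop(heap)
        yield -neg_sup, itemset
        # Extend only with later descriptors; drop empty slices.
        for j in range(last + 1, len(items)):
            child = rids & cover[items[j]]
            if child:
                heapq.heappush(
                    heap, (-len(child), itemset + (items[j],), j, child)
                )


if __name__ == "__main__":
    # Made-up toy rows, not real data.
    rows = [
        {"team": "A", "pos": "G"},
        {"team": "A", "pos": "F"},
        {"team": "A", "pos": "G"},
        {"team": "B", "pos": "G"},
    ]
    # Stop after three results; the generator could simply keep running.
    for support, itemset in islice(mine_by_decreasing_support(rows), 3):
        print(support, " AND ".join(f"{a} = {v}" for a, v in itemset))
```

No minimum support is fixed in advance here: the support of the most recently yielded itemset plays the role of the current checkpoint. This sketch keeps whole row-id sets on the heap, which would not scale to large datasets.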

Finally, we propose a forever analytics engine that progressively slices and dices a given dataset in all possible ways, on top of which a search system can be built to explore all the generated relationships. The engine is used to analyze extensive historical NBA player statistics, with a demo system for exploring the dataset.

Speaker: 
Qiong Hu
Location: 
CoRE B (305)
Event Date: 
05/16/2018 - 11:00am
Committee: 
Prof. Tomasz Imielinski (Chair); Prof. Alex Borgida; Prof. Shan Muthukrishnan; Prof. Wojciech Szpankowski (Purdue University)
Event Type: 
Pre-Defense
Organization: 
Dept. of Computer Science