Rutgers University DCIS Colloquium

Date: Thursday, March 11, 2004

Time: 11:00 AM

Location: CoRE Building, Room 301, Busch Campus, Rutgers University
Abstract: Data mining methods have grown in importance as data sets have become larger and more numerous. Many of the fundamental problems in data mining rely on statistical estimation and modeling. However, many of the computational advances in statistical analysis have paid little attention to scaling to massive data sets, in contrast to much of the work on the database systems side, where scalability is a central theme. In this talk, we present several algorithms and considerations for scaling data mining algorithms to large databases. The approaches rely on decomposing algorithms into basic components that lend themselves more easily to scaling. It turns out that most popular algorithms can be decomposed into components that need to be close to the data and others that can operate over reduced forms or sufficient statistics of the data. The key to a good decomposition is to keep the components that need to "touch the data" simple and fast. It is also important to consider how many scans of the data an algorithm requires. After covering a couple of illustrative examples of scaling algorithms to large databases, we consider the converse approach: can we use fundamental notions from data mining to help solve classical database problems such as indexing high-dimensional data and estimating query selectivity? The theme here is that database considerations are important in data mining, while statistical and data mining considerations play an important role in database systems. We wrap up the discussion of databases with brief coverage of some work on integrating data mining into a major commercial database system (Microsoft SQL Server). We conclude the talk with a summary of the numerous remaining technical challenges facing the field of data mining.
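As an illustration of the decomposition the abstract describes (this is not code from the talk), the following minimal Python sketch splits a simple estimation task into a component that touches the data and one that works only on sufficient statistics. The names `SuffStats`, `scan`, and `estimate` are hypothetical, and the example assumes a single numeric column delivered in chunks, as a database cursor might deliver it.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass
class SuffStats:
    """Sufficient statistics for the mean and variance of one numeric column."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def update(self, chunk: Sequence[float]) -> None:
        # The only step that touches raw data: simple, fast, one pass per chunk.
        for x in chunk:
            self.n += 1
            self.total += x
            self.total_sq += x * x


def scan(chunks: Iterable[Sequence[float]]) -> SuffStats:
    """Make a single scan over the data, keeping only the statistics."""
    stats = SuffStats()
    for chunk in chunks:
        stats.update(chunk)
    return stats


def estimate(stats: SuffStats) -> Tuple[float, float]:
    """Return (mean, population variance) from the statistics alone.

    This step never sees the raw rows, so it can run far from the data.
    Note: the sum-of-squares formula is used for brevity; it can lose
    precision on real data, where a more stable update is preferable.
    """
    mean = stats.total / stats.n
    variance = stats.total_sq / stats.n - mean * mean
    return mean, variance


if __name__ == "__main__":
    # Two small chunks stand in for pages read from a large table.
    stats = scan([[1.0, 2.0], [3.0, 4.0]])
    print(estimate(stats))
```

Only `update` reads rows, and it does so in one scan; everything after that operates on three numbers regardless of how large the table is.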
Speaker Bio: Usama Fayyad is President & CEO of DMX Group, a data mining services firm that delivers advanced predictive and strategic data applications to some of the world's largest organizations. He is also Chairman of digiMine, Inc. (now Revenue Science Inc.), which he co-founded in early 2000 and led as President and CEO until 2003. Prior to digiMine, Dr. Fayyad founded and led Microsoft Research's Data Mining & Exploration Group. At Microsoft he also led the development of data mining components within Microsoft products, including SQL Server 2000. From 1989 to 1995, Dr. Fayyad was at NASA's Jet Propulsion Laboratory (JPL), California Institute of Technology, where he founded and grew a multimillion-dollar advanced research program to develop data mining systems for the analysis of large scientific databases. This work earned him numerous awards, including the most distinguished research excellence award from Caltech/JPL and a U.S. Government Medal from NASA. Dr. Fayyad has published over 100 technical articles and is co-editor of two books on data mining and knowledge discovery in databases. He served as program co-chair of KDD-94 and KDD-95 (the 1st International Conference on Knowledge Discovery and Data Mining) and as general chair of KDD-96 and KDD-99. He is Editor-in-Chief of the technical journal Data Mining and Knowledge Discovery, and is a founder and director of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). He serves on several editorial boards, including that of the Communications of the ACM. He received his Ph.D. in Computer Science and Engineering in 1991 from the University of Michigan, Ann Arbor. He holds two BSEs in engineering, an M.Sc. in Mathematics, and an MSE in Computer Engineering.