Title: Top-k Queries over Structured and Semi-structured Data Spaker: Amelie Marian. ----------------------------------------------------------------- Abstract: Traditionally, query processing strategies for structured (e.g., relational) and semi-structured (e.g., XML) data identify the ``exact matches'' for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy --often expressing user preferences and not hard Boolean constraints-- and are best answered with a ranked, or ``top-k,'' list of the best matching data objects. In this talk, I will present a variety of top-k query processing algorithms. For efficiency, these algorithms focus on the objects that are most likely to be in the top-k query answers, and discard --as early as possible-- objects that are guaranteed not to qualify for the answers. I will center the discussion around a number of important scenarios, each presenting fundamental challenges for top-k query definition and processing. Specifically, I will first focus on a common web-application scenario where the data is structured and only available through autonomous web services with heterogeneous access interfaces and constraints. By taking into account the peculiarities of the sources and potentially choosing a different query execution plan for each individual candidate object, my adaptive algorithms significantly reduce the query processing time over previously existing algorithms, which select coarser ``global'' query execution plans. I will also discuss an important XML data integration scenario where XML data comes from heterogeneous sources, and therefore may not share the same schema. In this scenario, exact query matches are too rigid, so XML query answers should be ranked based on their ``similarity'' to the queries in terms of both content and structure. Processing top-k queries efficiently in such a scenario is challenging, as the number of candidate answers increases dramatically with the query size. By pruning irrelevant data fragments as early as possible, my algorithms minimize the number of candidate answers considered during query evaluation. In summary, I will discuss the general problem of ranking query answers over structured and semi-structured data, and returning the ``best'' objects, according to user preferences, in a time-efficient manner. -----------------------------------------------------------------------