Data curation (ingest, transformation, cleaning, schema mapping, deduplication, and consolidation) of raw data sets consumes up to 80% of a data scientist’s time. Integrating silos of enterprise data is also a major challenge for business users. To address these issues, we have built an end-to-end data curation system, Data Civilizer, in cooperation with the Qatar Computing Research Institute. In this talk, I will start with a brief introduction to the Data Civilizer system and then discuss two of the components that I have built. First, I will discuss entity consolidation in Data Civilizer. This module accepts a collection of clusters of records, each thought to represent the same entity (i.e., duplicates), and merges each cluster into a single “golden” record. Next, I will show how to address the key challenges in making entity matching scalable in Data Civilizer. Finally, I will conclude the talk with my vision for the future of data curation for end users and of massive data lake management.
Dong Deng is a postdoctoral associate in the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT, where he works with Prof. Michael Stonebraker and Prof. Samuel Madden. He is interested in data management and data science, with a special focus on tackling the theoretical and system-building challenges in data curation. Dong obtained his PhD from Tsinghua University in 2016 with the highest doctoral dissertation award. He has also received scholarships from the Siebel Foundation, Google, Microsoft, Intel, and the Boeing Company, and he publishes regularly in top venues including SIGMOD, PVLDB, and ICDE.