This class is related to 541 (graduate databases) and to 539 (physical databases), but it is much more systems-oriented and focuses on the practical, hands-on aspects of storing and managing massive amounts of structured and unstructured data.
Since we are focusing on the latest research and technology, this course does not strictly follow any designated textbook. However, the following (optional) books may be useful.
- Holden Karau et al. Learning Spark. Lightning-Fast Data Analysis. O'Reilly, 2015.
Note: The book is helpful but not required, especially since this is a fast-paced field and some of the latest changes to Spark are not yet covered in the book.
- Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets.
Note: Available for free online.
- Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing.
The main course requirement is a semester-long course project involving Apache Spark and/or deep learning. See the Course Project Slides for details. There will also be homework assignments related to the project during the semester (Note: currently just the intermediate report).
Additionally, there will be small in-class quizzes.
The objective is to teach the skills needed to manage (query, analyze, and learn from) very large amounts of structured and unstructured data.