- Course Number: 16:198:543
- Course Type: Graduate
- Semester 1: Fall
- Credits: 3
Today's massive data volumes call for new methods and techniques for deeper analysis, along with hands-on practical work on real data sets. This course covers the major existing algorithmic primitives necessary for processing "massive" data sets in external memory and data streaming settings. It also introduces various deep learning frameworks and corresponding neural methods in a variety of application areas. Students have the opportunity to gain experience with major platforms for storage and retrieval, including the Hadoop Distributed File System (HDFS) and data streaming with Apache Spark. One "large data" project focuses on techniques for information retrieval, natural language processing, and the indexing methods needed to retrieve semantically relevant information. A second "large data" project focuses on applying a deep learning framework to an application area of each student's choice.
This class is related to CS541 (graduate databases) and to CS539 (physical databases), but it is much more systems-oriented and focuses on the practical, hands-on aspects of storing, processing, and managing massive amounts of structured and unstructured data.
- M.S. Course Category: Data Storage and Retrieval
- Category: B (Ph.D.)
- Prerequisite Information:
Basic familiarity with data structures and algorithms (such as that provided in CS512), mathematical foundations and probability theory (CS501), and database foundations (such as CS527 and/or CS539). Some of the tools to be used require knowledge of Python, C++, Java, or Scala. Other examples will be based on Apache Spark. Prior knowledge of Spark (using Scala) is not required, though it will certainly help.
Module 1. Introduction to Massive Computing, MapReduce, and the Hadoop Distributed File System.
Module 2. Streaming Data and Apache Spark. Examples: Text Data and Natural Language Processing Pipelines.
Module 3. Clustering and PageRank.
Module 4. Introduction to External Memory Algorithms.
Module 5. Search and Storage: Inverted Indexes for Search, Elasticsearch, Semantic Hashing, Vector Representations of Data.
Module 6. Similarity Search, Locality-Sensitive Hashing, Spatial Indexing.
Module 7. Deep Learning Principles: Gradient Descent, Backpropagation, Feedforward Networks, Activation Functions, Softmax.
Module 8. Deep Learning for Classification, Retrieval, and Labeling. Convolutional Networks for Sentiment Analysis, Text Classification, Neural Information Retrieval, and Named Entity Recognition.
Module 9. Selected Topics from:
a. Language Models for Retrieval and Prediction: Recurrent Neural Networks, LSTMs, Seq2Seq.
b. Semantic Analysis: LSTMs with Attention, Neural Machine Translation.
c. Question Answering and Structured Data.
d. Graph Streaming Algorithms and Classical Combinatorial Problems.
Since we focus on the latest research and technology, this course does not strictly follow any designated textbook. However, the following (optional) books may be useful.
- Holden Karau et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.
Note: The book is helpful but not required, especially since this is a fast-moving field and the book does not cover some of the latest changes to Spark.
- Jure Leskovec, Anand Rajaraman, Jeff Ullman. Mining of Massive Datasets
Note: Available for free online.
- Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing
- Expected Work: The main course requirement is a semester-long course project involving Apache Spark and/or deep learning. See Course Project Slides for details. Homework and reading assignments related to the course project will be given during the semester. Additionally, there will be short in-class quizzes.
- Learning Goals:
The objective is to train students in the major skills necessary to manage (store, retrieve, query, analyze, and learn from) very large amounts of structured and unstructured data.