
Massive Data Storage, Retrieval and Deep Learning


In today's world of massive amounts of data, new methods and techniques are needed. In this course, we discuss methods to dive deeper into such data. In terms of areas, the course will focus on techniques for information retrieval, natural language processing, and relevant indexing methods to enable retrieval based on the semantics of text (or multimodal content).
In terms of methods, the course will focus on recent Deep Learning and neural network methods for these areas.
The course will include hands-on practical work on real data sets, based on deep learning frameworks, as well as optionally the Apache Spark platform.

This class is related to 541 (graduate databases) and to 539 (physical databases), but it is much more systems-oriented and focuses on the practical, hands-on aspects of storing and managing massive amounts of structured and unstructured data.

B (Ph.D.)
Basic familiarity with data structures (from Introduction to Computer Science class) and basic mathematics and probability theory.
Basic programming ability. Some of our examples will use Deep Learning tools, most of which require knowledge of Python, C++, Java, or Scala. Other examples will be based on Apache Spark. Prior knowledge of Spark (especially using the Scala programming language) is not required, but it would certainly help.
09-07: Logistics, Introduction to Massive Data and Deep Learning. Reading: Chapter 1 from Data-Intensive Text Processing with MapReduce
09-14: Textual Data Processing with Spark
09-21: Text Processing with Spark & Natural Language Processing with Spark Text Processing Pipeline
09-28: Information Retrieval and Textual Data Processing with Spark
10-05: Search and Storage. Inverted Indexes for Search, Elasticsearch, Semantic Hashing, Vector Representations of Data
10-12: Embeddings and Learning with SGD. Word Vectors, SGD
10-19: Storage and Algorithms. Duplicate Detection; Nearest-Neighbor/Similarity Search with LSH, FAISS, etc.; Spatial Indexing; NoSQL
10-26: Deep Learning. Backpropagation, Feedforward Networks, Activation Functions, Softmax
11-02: Deep Learning for Classification, Retrieval, and Labeling. Convolutional Neural Networks; Sentiment Analysis, Text Classification, Neural Information Retrieval, Named Entity Recognition
11-09: Sequence Modeling. Language Models for Retrieval and Prediction; Recurrent Neural Networks, LSTMs, Seq2Seq
11-16: Semantic Analysis. LSTMs with Attention, Neural Machine Translation
11-21: Question Answering and Structured Data. IBM's Watson, Question Answering over Structured Data; Deep Learning approaches; Siri
11-23: No class (Thanksgiving Recess)
11-30: Short Project Presentations
12-07: Question Answering, Recap
Course Material: 

Since we are focusing on the latest research and technology, this course does not strictly follow any designated coursebook. However, the following (optional) books may be useful.

Expected Work: 

The main course requirement will be a semester-long course project, involving Apache Spark and/or Deep Learning. See Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (Note: currently just the intermediate report).
Additionally, there will be small in-class quizzes.

Learning Goals: 

The objective is to teach the skills needed to manage (query, analyze, and learn from) very large amounts of structured and unstructured data.

Teaching Professors' Names: 
James Abello
Gerard De Melo