Course Details

  • Course Number: 16:198:543
  • Course Type: Graduate
  • Semester 1: Fall
  • Credits: 3
  • Description:

    Today's massive volumes of data call for new methods and techniques. This course discusses methods for digging deeper into such data. It focuses on techniques for information retrieval, natural language processing, and the indexing methods that enable retrieval based on the semantics of text (or multimodal content).

    In terms of methods, the course will focus on recent Deep Learning and neural network methods for these areas.

    The course will include hands-on practical work on real data sets, using deep learning frameworks and, optionally, the Apache Spark platform.

    This class is related to 541 (graduate databases) and to 539 (physical databases), but it is much more systems-oriented and focuses on the practical, hands-on aspects of storing and managing massive amounts of structured and unstructured data.

  • Category: B (Ph.D.)
  • Prerequisite Information:

    Basic familiarity with data structures (from an Introduction to Computer Science class), basic mathematics, and probability theory.


    Basic programming ability. Some of our examples will use Deep Learning tools, most of which require knowledge of Python, C++, Java, or Scala. Other examples will be based on Apache Spark. Prior knowledge of Spark (especially using the Scala programming language) is not required, but it certainly won't hurt.

  • Topics:

    09-07 Logistics, Introduction to Massive Data and Deep Learning. Reading: Chapter 1 of Data-Intensive Text Processing with MapReduce

    09-14 Textual Data Processing with Spark

    09-21 Text Processing with Spark & Natural Language Processing with Spark Text Processing Pipeline

    09-28 Information Retrieval and Textual Data Processing with Spark

    10-05 Search and Storage: Inverted Indexes for Search, Elasticsearch; Semantic Hashing, Vector Representations of Data

    10-12 Embeddings and Learning with SGD: Word Vectors; SGD

    10-19 Storage and Algorithms: Duplicate Detection, Nearest-Neighbor/Similarity Search with LSH, FAISS, etc.; Spatial Indexing; NoSQL

    10-26 Deep Learning: Backpropagation, Feedforward Networks, Activation Functions, Softmax

    11-02 Deep Learning for Classification, Retrieval, and Labeling: Convolutional Neural Networks; Sentiment Analysis, Text Classification, Neural Information Retrieval, Named Entity Recognition

    11-09 Sequence Modeling: Language Models for Retrieval and Prediction; Recurrent Neural Networks, LSTMs, Seq2Seq

    11-16 Semantic Analysis: LSTMs with Attention, Neural Machine Translation

    11-21 Question Answering and Structured Data: IBM's Watson, Question Answering over Structured Data; Deep Learning approaches; Siri

    11-23 No class: Thanksgiving Recess

    11-30 Short Project Presentations

    12-07 Question Answering, Recap


    Course Material: 

    Since we are focusing on the latest research and technology, this course does not strictly follow any designated coursebook. However, the following (optional) books may be useful.

  • Expected Work: The main course requirement will be a semester-long course project involving Apache Spark and/or Deep Learning. See Course Project Slides for details. Related to this course project, there will be homework assignments during the semester (Note: currently just the intermediate report). Additionally, there will be small in-class quizzes.
  • Learning Goals:

    The objective is to teach the skills needed to manage (query, analyze, and learn from) very large amounts of structured and unstructured data.