16:198:550 - Massive Data Mining

Course Number: 16:198:550
Course Type: Graduate
Semester 1: Spring
Credits: 3
Description:
This class is recommended for students interested in pursuing a concentration in Massive Data Analytics or Machine Learning. Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in Random Access Memory. The material covered here equips students with the main backend algorithms and infrastructure necessary for the completion of the required Capstone Project for MS graduation with a Massive Data Analytics concentration.

M.S. Course Category: Algorithms & Complexity, Data Storage and Retrieval
Category: A (M.S.), B (M.S.), B (Ph.D.)
Prerequisite Information:
- CS 512 or CS 513 (Fundamental Algorithms)
- Linear Algebra, Basic Probability (Moments, Typical Distributions, MLE)
- Programming Languages: C++/Java
- Infrastructure: Hadoop Cluster
Course Links: 16:198:512 - Introduction to Data Structures and Algorithms, 16:198:513 - Design and Analysis of Data Structures and Algorithms
Topics:
1. Scalable Algorithms, Computing Architectures, Automation for Large Data Handling (2 lectures)

2. Models of Computation: MapReduce, Streams (Online Algorithms), Internal/External Memory (2 lectures)

3. Data Types, Algorithmic Tools (Similarity Measures, SVD, Stochastic Gradient Descent, Frequent Item Sets, LSH, Bloom Filters) (5 lectures)

4. B-trees, Log-Structured Merge-Trees, and B^epsilon-Trees (2 lectures)

5. Sample Applications (Recommender Systems, Association Rules, Duplicate Detection, …) (2 lectures)

6. MapReduce and New Software Stack (1 lecture)

7. Locality-Sensitive Hashing (2 lectures)

8. Data Streams and Bounded Memory Processing: Querying, Counting, Bucketing, Windowing (3 lectures)

9. Link Analytics: PageRank, Flows, and Eigenvectors (2 lectures)

10. Association Rules: Frequent Item Sets (2 lectures)

11. Clustering: Spectral, k-Means, Markov Clustering (2 lectures)

Selected Presentations (4 lectures)
1. Advertising: Online Matching, Bidding Algorithms
2. Recommendation Systems: Content-Based, Collaborative Filtering, Latent Models, Optimization-Based Methods
3. Social Networks: Community Detection, Modularity, Betweenness, Models and Maximum Likelihood Estimation
4. Dimensionality Reduction: SVD vs CUR Decompositions; SVD vs Non-Linear Methods
12. Case Study in Smart Cities (4 lectures)
1. Urban Data Collection, Management, Processing, and Visualization
2. Spatial-Temporal Urban Data Analytics
3. Data Predictive Control in Smart Cities
4. Data Privacy and Security in Smart Cities
Selected papers from the literature on

– Algorithmic Analytics, Visualization, and Human-Computer Interaction; Data Ethics, Privacy, Security, Sharing, Provenance

Course Material:

Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, J. D. Ullman
Learning Goals:
Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in Random Access Memory.
Notes:
Objective: The volume and velocity of massive data require specialized processing algorithms to access, navigate, extract, protect, validate, and synthesize “useful” information that is unreachable by, or hidden from, a superficial search. Students in this class will be exposed to the major algorithms and state-of-the-art techniques used in massive data mining. Students will complete a Big Data project.

Grading: Homework and Quizzes (16%), Midterms (34%), Big Data Project (50%). The project evaluation will be guided by the “value” of the extracted information at different scales, the methods and models used, and the usefulness of the final application.

Contacts

Badri Nath

Andrew Chun
