Course Details
16:198:550 - Massive Data Mining
- Course Number: 16:198:550
- Course Type: Graduate
- Semester 1: Spring
- Credits: 3
- Description:
This class is recommended for students interested in pursuing a concentration in Massive Data Analytics or Machine Learning. Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in Random Access Memory. The material covered here equips students with the main backend algorithms and infrastructure necessary for the completion of the required Capstone Project for MS graduation with a Massive Data Analytics concentration.
- M.S. Course Category: Algorithms & Complexity, Data Storage and Retrieval
- Category: A (M.S.), B (M.S.), B (Ph.D.)
- Prerequisite Information:
CS 512 or CS 513 (Fundamental Algorithms)
- Linear Algebra, Basic Probability (Moments, Typical Distributions, MLE)
- Programming Languages: C++/Java
- Infrastructure: Hadoop Cluster
- Course Links: 16:198:512 - Introduction to Data Structures and Algorithms, 16:198:513 - Design and Analysis of Data Structures and Algorithms
- Topics:
1. Scalable Algorithms, Computing Architectures, Automation for Large Data Handling (2 lectures)
2. Models of Computation: MapReduce, Streams (Online Algorithms), Internal/External Memory (2 lectures)
3. Data Types, Algorithmic Tools (Similarity Measures, SVD, Stochastic Gradient Descent, Frequent Item Sets, LSH, Bloom Filters) (5 lectures)
4. B-trees, Log-structured Merge-Trees, and B^epsilon Trees (2 lectures)
5. Sample Applications (Recommender Systems, Association Rules, Duplicate Detection, …) (2 lectures)
6. MapReduce and New Software Stack (1 lecture)
7. Locality-Sensitive Hashing (2 lectures)
8. Data Streams and Bounded Memory Processing: Querying, Counting, Bucketing, Windowing (3 lectures)
9. Link Analytics: PageRank, Flows, and Eigenvectors (2 lectures)
10. Association Rules: Frequent Item Sets (2 lectures)
11. Clustering: Spectral, k-Means, Markov Clustering (2 lectures)
Selected Presentations (4 lectures)
- Advertising: Online Matching, Bidding Algorithms
- Recommendation Systems: Content-Based, Collaborative Filtering, Latent Models, Optimization-Based Methods
- Social Networks: Community Detection, Modularity, Betweenness, Models and Maximum Likelihood Estimation
- Dimensionality Reduction: SVD vs CUR Decompositions; SVD vs Non-Linear Methods
12. Case Study in Smart Cities (4 lectures)
- Urban Data Collection, Management, Processing, and Visualization
- Spatial-Temporal Urban Data Analytics
- Data Predictive Control in Smart Cities
- Data Privacy and Security in Smart Cities
Selected papers from the literature on:
- Algorithmic Analytics, Visualization, and Computer-Human Interaction; Data Ethics, Privacy, Security, Sharing, and Provenance
Course Material:
Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, and J. D. Ullman
- Learning Goals:
Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in Random Access Memory.
- Notes:
Objective: The volume and velocity of massive data require specialized algorithms to access, navigate, extract, protect, validate, and synthesize “useful” information that is hidden from, or unreachable by, a superficial search. Students in this class will be exposed to the major algorithms and state-of-the-art techniques used in massive data mining. Students will complete a Big Data project.
Grading: Homework and Quizzes (16%), Midterms (34%), Big Data Project (50%). The project evaluation will be guided by the “value” of the extracted information at different scales, the methods and models used, and the usefulness of the final application.