
Massive Data Mining

16:198:542

This is a foundational class for the newly created Professional Master in Data Science within the Computer Science Department. Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in Random Access Memory. The material covered here equips students with the main backend algorithms and infrastructure necessary for the completion of the required Capstone Project for MSDS graduation.

Credits: 
3
Prerequisite: 

CS 512 or CS 513 (Fundamental Algorithms)

  • Linear Algebra, Basic Probability (Moments, Typical Distributions, MLE)
  • Programming Languages: C++/Java
  • Infrastructure: Hadoop Cluster
Topics: 

1. Scalable Algorithms, Computing Architectures, Automation for Large Data Handling (2 lectures).

2. Models of Computation: MapReduce, Streams (Online Algorithms), Internal/External Memory (2 lectures)

3. Data Types, Algorithmic Tools (Similarity Measures, SVD, Stochastic Gradient Descent, Frequent Item Sets, LSH, Bloom Filters) ( 5 lectures )

4. B-trees, Log-structured Merge-Trees and B^epsilon Trees ( 2 lectures )

5. Sample Applications (Recommender Systems, Association Rules, Duplicate Detection, …) ( 2 lectures )

6. MapReduce and New Software Stack ( 1 lecture )

7. Locality-Sensitive Hashing ( 2 lectures )

8. Data Streams and Bounded Memory Processing: Querying, Counting, Bucketing, Windowing ( 3 lectures )

9. Link Analytics: PageRank, Flows, and Eigenvectors ( 2 lectures )

10. Association Rules: Frequent Item Sets ( 2 lectures )

11. Clustering: Spectral, k-Means, Markov Clustering ( 2 lectures )

Selected Presentations ( 4 lectures )

a. Advertising: Online Matching, Bidding Algorithms

b. Recommendation Systems: Content-Based, Collaborative Filtering, Latent Models, Optimization-Based Methods

c. Social Networks: Community Detection, Modularity, Betweenness, Models and Maximum Likelihood Estimation

d. Dimensionality Reduction: SVD vs CUR Decompositions; SVD vs Non-Linear Methods

12. Case study in Smart Cities ( 4 lectures )

a. Urban Data Collection, Management, Processing, and Visualization

b. Spatial-Temporal Urban Data Analytics

c. Data Predictive Control in Smart Cities

d. Data Privacy and Security in Smart Cities

• Selected papers from the literature on

– Algorithmic Analytics, Visualization and Computer Human Interaction. Data Ethics, Privacy, Security, Sharing, Provenance

Course Material: 

Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, J. D. Ullman

Learning Goals: 

Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in Random Access Memory.

Notes: 

Objective: The volume and velocity of massive data require specialized processing algorithms to access, navigate, extract, protect, validate, and synthesize “useful” information that is unreachable or hidden from a superficial searcher. Students in this class will be exposed to the major algorithms and state-of-the-art techniques used in massive data mining. Students will complete a Big Data project.

Grading: Homework and Quizzes (16%), Midterms (34%), Big Data Project (50%). The project evaluation will be guided by the “value” of the extracted information at different scales, the methods and models used, and the usefulness of the final application.

Professor: 
James Abello
Semester: 
Spring
Course Type: 
Graduate
