198:671 Seminar on Processing Massive Data Sets

Professor: S. Muthukrishnan.

Alternative Lecturer: Graham Cormode.

ANNOUNCEMENT: Lecture will begin on Thursday, 30th of Jan. At Core B (3rd Floor, Core)

Course Description. Every
generation has considered one or the other data set they had as being

"massive". Today, we (are positing to) have truly massive datasets. This is primarily because

we now collect data via automatic datafeeds from the internet, from satellites, from the financial industry,

using sensors, etc. which challenges the transmitting (T), computing (C) and storage (S) capacity of our systems.

Dealing with this challenge to TCS calls for fundamental ideas and techniques from many areas, and a

new mindset to embed them all into novel systems. This course will explore all aspects of meeting this

challenge. We are looking for students from any (or all!) of the following areas:

Statistics: What statistical analysis is feasible and useful on massive data?

Algorithms: How to design new algorithms that work within the resource constraints?

Complexity Theory: Develop a theory of what can be processed on massive streams?

Probability: Develop the theory of random variables and sampling that underpin many of the algorithms.

Harmonic Analysis, Mathematical Approximation Theory: Fundamental questions in these areas arise.

Databases: How to extend the current database technology to deal with massive streams?

Networking: What networking issues can be resolved using massive data analysis on traffic logs?

Sensors: How to build a massive information gathering infrastructure using sensors?

Hardware: How to build specialized hardware to process massive data streams?

Programming Languages: What special programming language support is needed?

Course Organization: We meet once a week for 2 hours. Tentatively it will be on Thursdays, 6---8PM. First few

weeks will be lectures on the basics. In second half of the course, you will present research papers or projects or

perspectives from each of the different fields, drawing from whatever is your area of expertise or interest. There is

scope to explore only programming projects, only mathematical papers, or whatever.

Lecture 1: pdf.

Lecture 2: pdf.

Lecture 3: pdf.

Lecture 4: pdf.

Lecture 5: pdf.

Code fragments for generating values from stable distributions (Lecture 4)

"massive". Today, we (are positing to) have truly massive datasets. This is primarily because

we now collect data via automatic datafeeds from the internet, from satellites, from the financial industry,

using sensors, etc. which challenges the transmitting (T), computing (C) and storage (S) capacity of our systems.

Dealing with this challenge to TCS calls for fundamental ideas and techniques from many areas, and a

new mindset to embed them all into novel systems. This course will explore all aspects of meeting this

challenge. We are looking for students from any (or all!) of the following areas:

Statistics: What statistical analysis is feasible and useful on massive data?

Algorithms: How to design new algorithms that work within the resource constraints?

Complexity Theory: Develop a theory of what can be processed on massive streams?

Probability: Develop the theory of random variables and sampling that underpin many of the algorithms.

Harmonic Analysis, Mathematical Approximation Theory: Fundamental questions in these areas arise.

Databases: How to extend the current database technology to deal with massive streams?

Networking: What networking issues can be resolved using massive data analysis on traffic logs?

Sensors: How to build a massive information gathering infrastructure using sensors?

Hardware: How to build specialized hardware to process massive data streams?

Programming Languages: What special programming language support is needed?

Course Organization: We meet once a week for 2 hours. Tentatively it will be on Thursdays, 6---8PM. First few

weeks will be lectures on the basics. In second half of the course, you will present research papers or projects or

perspectives from each of the different fields, drawing from whatever is your area of expertise or interest. There is

scope to explore only programming projects, only mathematical papers, or whatever.

Lecture 1: pdf.

Lecture 2: pdf.

Lecture 3: pdf.

Lecture 4: pdf.

Lecture 5: pdf.

Code fragments for generating values from stable distributions (Lecture 4)