Light Seminar: Mapreduce and Mobile Algorithms.
Course number is 16:198:500:03 index number 46134

Instructors:
S. Muthukrishnan, x2379, Core 319. muthu@cs.rutgers.edu.
Graham Cormode, graham@dimacs.rutgers.edu

Meeting:  Thurs 4 -- 6 pm, Room 305 (Core B) for Mapreduce Algorithms. Thurs 6:30 -- 7:30 PM, Room 305 (Core B) for Mobile.

Schedule (Mapreduce Algorithms).

This course covers the fundamentals of the MapReduce framework and the Hadoop system for scaling huge computations to distributed clusters.  It surveys recent research papers on the topic to address problems on large data aggregation and analysis, such as for massive data logs, social network graphs, and distributed databases.


Jan 20
Intro to Mapreduce/Hadoop
Muthu+Graham
Jan 27

Cancelled
Feb 3
Design mapreduce algorithms for examples, PRAM M+G
Feb 10
Finish mapreduce examples
G
Feb 17
Guest Lecture
Howard Karloff (AT&T Research)
Feb 24
PigLatin+Hbase
Bill Katsak
Mar 3
Graph twiddling and pegasus
Matt Muscari
Mar 10
Sawzall
Vanchi Anathanarayanan
Mar 17
Spring Break
Attend this workshop 14--16
Mar 24
Set cover
Alex Nikolov
Mar 31
Counting triangles
Priya Govindan
Apr 7
Behavioral Targeting
Jinyun Yan
Apr 14
Sequence alignment
Daniela Vianna
Apr 21
Parallel GO
Sergiu Goschin
Apr 28


May 5




MapReduce Algorithms:

Introductory slides:
http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt

Talk videos:
http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html

Other tutorials:
http://www.cloudera.com/wp-content/uploads/2010/01/5-MapReduceAlgorithms.pdf
http://www.cloudera.com/videos/mapreduce_algorithms

http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/session3-slides.pdf

UMD Class from Spring 2010:
http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/syllabus.html
Accompanying text:
http://www.umiacs.umd.edu/~jimmylin/book.html
http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf

Papers to read/discuss in class:

Models

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
OSDI 2004
http://labs.google.com/papers/mapreduce.html
Communications of the ACM, 2010
http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext

On the Complexity of Processing Massive, Unordered, Distributed Data.
J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein and Z. Svitkina,
SODA 2008
http://arxiv.org/abs/cs/0611108

A Model of Computation for MapReduce
H. Karloff, S. Suri, and S. Vassilvitskii
SODA 2010
http://www.sidsuri.com/About_Me_files/mrc2.pdf

Sorting, Searching, and Simulation in the MapReduce Framework
Michael T. Goodrich, Nodari Sitchinava, Qin Zhang
Under submission, 2011
http://arxiv.org/abs/1101.1902

Systems on top of MR:

Interpreting the Data: Parallel Analysis with Sawzall
Scientific Programming Journal 2005
Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan
http://research.google.com/archive/sawzall.html

Pig latin: a not-so-foreign language for data processing
SIGMOD 2008
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins     
https://portal.acm.org//citation.cfm?id=1376616.1376726&coll=DL&dl=GUIDE&CFID=5697894&CFTOKEN=14842407

Hive: a warehousing solution over a map-reduce framework
VLDB 2009
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy
http://portal.acm.org/ft_gateway.cfm?id=1687609&type=pdf&coll=DL&dl=GUIDE&CFID=5697894&CFTOKEN=14842407

Hive - A Petabyte Scale Data Warehouse Using Hadoop
ICDE 2010
http://infolab.stanford.edu/~ragho/hive-icde2010.pdf

Alternatives/Extensions

Dryad: Distributed data-parallel programs from sequential building blocks.
Eurosys 2007
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
http://research.microsoft.com/pubs/63785/eurosys07.pdf

MapReduce Online
Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy and Russell Sears
NSDI 2010, SIGMOD 2010
neilconway.org/docs/sigmod2010_hop_demo.pdf
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.pdf

Algorithms

Simulating Parallel Algorithms in the MapReduce Framework with Applications to Parallel Computational Geometry
Michael T. Goodrich
Massive 2010
http://arxiv.org/abs/1004.4708

A New Computation Model for Cluster Computing
Foto Afrati, Jeff Ullman
http://infolab.stanford.edu/%7Eullman/pub/mapred-model-report.pdf

Max-cover in map-reduce
Flavio Chierichetti, Ravi Kumar, Andrew Tomkins
WWW 2010
http://portal.acm.org/citation.cfm?id=1772715   

Graphs and Matrices

Graph Twiddling in a MapReduce World
Jonathan Cohen
Computing in Science and Engineering, vol. 11, no. 4, pp. 29-41, July/August, 2009.
http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
           
Parallelizing Random Walk with Restart for large-scale query recommendation
Meng-Fen Chiang, Tsung-Wei Wang, Wen-Chih Peng
2010 Workshop on Massive Data Analytics on the Cloud
http://portal.acm.org/citation.cfm?id=1779599.1779607

Distributed non-negative matrix factorization for dyadic data analysis on mapreduce

Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang WWW 2010

http://research.microsoft.com/pubs/119077/DNMF.pdf

DOULION: Counting Triangles in Massive Graphs with a Coin
Charalampos E. Tsourakakis, U. Kang, Gary L. Miller, Christos Faloutsos
KDD 2009
Fast counting of triangles in real-world networks: proofs, algorithms and observations
http://reports-archive.adm.cs.cmu.edu/anon/ml2008/CMU-ML-08-103.pdf

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
U Kang, Charalampos E. Tsourakakis, Christos Faloutsos
IEEE International Conference on Data Mining (ICDM 2009)
http://www.cs.cmu.edu/~ctsourak/pegasusICDM09.pdf

Database

Optimizing Joins in a Map-Reduce Environment
Foto Afrati, Jeff Ullman
http://infolab.stanford.edu/%7Eullman/pub/join-mr.pdf

Efficient parallel set-similarity joins using MapReduce
Rares Vernica, Michael J. Carey, Chen Li
SIGMOD 2010
http://portal.acm.org/citation.cfm?id=1807222

Applications


Scaling Up Classifiers to Cloud Computers
Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, Nitesh V. Chawla
ICDM 08
http://www.cse.nd.edu/~dthain/papers/classify-icdm08.pdf

Large-Scale Behavioral Targeting
KDD 09
Ye Chen, Dmitriy Pavlov, John Canny
http://www.cc.gatech.edu/~zha/CSE8801/ad/p209-chen.pdf

MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
BMC Bioinformatics 2010
Suzanne J Matthews email and Tiffani L Williams email
http://www.biomedcentral.com/1471-2105/11/S1/S15

A novel approach to multiple sequence alignment using hadoop data grids
2010 Workshop on Massive Data Analytics on the Cloud
G. Sudha Sadasivam, G. Baktavatchalam
http://portal.acm.org/citation.cfm?id=1779599.1779601

Experiences on Processing Spatial Data with MapReduce
Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe
21st International Conference on Scientific and Statistical Database Management
http://users.cis.fiu.edu/~vagelis/publications/Spatial-MapReduce-SSDBM2009.pdf

Web-Scale Distributional Similarity and Entity Set Expansion
Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, Vishnu Vyas
2009 Conference on Empirical Methods in Natural Language Processing
http://www.aclweb.org/anthology/D/D09/D09-1098.pdf

Other lists of papers:

http://atbrox.com/2010/05/08/mapreduce-hadoop-algorithms-in-academic-papers-may-2010-update/
http://www.columbia.edu/~ak2834/mapreduce.html

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Schedule (Mobile Apps)

Jan 20: First meeting and planning projects.