CS 553 Questions on Computing, Part 1

A. MapReduce: Simplified Data Processing on Large Clusters
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

1. Describe what a map function does and what a reduce function does.
2. Write the MapReduce code that implements an inverted index as described in section 2.3, in the same style as example 2.1. Assume the input to map() is a documentID and a document. Assume you have a function called parse() that returns a list of all words in the document, and a sort() function.
3. How does the MapReduce system handle the case when two tasks complete on the same input?
4. Explain the "straggler" problem and how MapReduce reduces the impact of stragglers.

B. Spark: Cluster Computing with Working Sets
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

1. What is a closure? Give a simple example in any language of your choice, for example JavaScript or Python.
2. How does Spark distribute the Scala objects that make up a program?
3. What is Spark's strategy for recovering an RDD if a node fails?
4. How are RDDs similar to a shared memory model with checkpointing? How are they different?
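
Note on question A.2: the "style of example 2.1" refers to the word-count example in section 2.1 of the MapReduce paper, where map() emits intermediate key/value pairs and reduce() receives one key together with all of its values. For orientation only, here is a minimal Python sketch of that word-count shape, run on a single machine. The driver `run_mapreduce` and the sample documents are hypothetical stand-ins for the framework and its input, not part of the paper's API, and the sketch ignores partitioning, distribution, and fault tolerance.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Word-count map: emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Word-count reduce: sum all counts emitted for one word."""
    return sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical local driver (an assumption, not the real framework).

    It applies map_fn to every input pair, groups the intermediate
    values by key, and calls reduce_fn once per key in sorted key order.
    """
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}


# Sample input, invented for illustration.
docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

An answer to A.2 would keep the same two-function structure but change what map() emits and what reduce() does with the grouped values.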