CS 553 Fall 2015 
Questions on readings on reliability

A. Reading: Algirdas Avizienis, Jean-Claude Laprie, and Brian Randell, Fundamental Concepts of Dependability

1. What are 4 key metrics the authors claim characterize computing systems?
Name two additional metrics, i.e., things that make a system desirable.

2. What is the difference between a fault and an error? 

3. What is a latent error? 

4. Describe the difference between availability and reliability.

5. What (probabilistic) mathematical models capture the concepts of availability and reliability? Describe them in text, but what you describe should be something that is computable. An example of the level of description I am looking for: one way to describe the concept of a circle would be to define an equation such that the square root of the sum of the squares of the X and Y coordinates is a constant.

6. Explain what is meant by the authors' claim that fault tolerance is
recursive, and give an example.


B. Reading: David Oppenheimer, Archana Ganapathi, and David A. Patterson, Why do Internet services fail, and what can be done about it?

1. What are the authors' goals for this work?

2. Using the results in the paper, describe how you would quantify how
effective the services are at masking component failures from the users.

3. Rank the three services from most effective to least effective at masking component failures. 

4. What are the top three root causes of failures of Internet services?

5. What is performability? What is a performability benchmark? Give an example of one such benchmark that applies to Internet Services. What would be the independent variable (input) and what would be the dependent variable (output)? 

6. Given the authors' results, if you were in charge of allocating resources (e.g., salary and equipment) to improve service uptime, what would you do? Justify your answer for full credit.

C. Reading: Bianca Schroeder and Garth A. Gibson, A Large-Scale Study of Failures in High-Performance Computing Systems

1. Describe the workload difference between this paper and the Oppenheimer, Ganapathi, and Patterson work. 

2. What is the biggest difference in the root causes of failures between this paper and the Oppenheimer paper? Give a plausible explanation.

3. What distribution best fits the measured time between failures?

4. In this work, does a failure have any predictive value for future failures? That is, if we observe a failure during a time interval
t, does another failure soon afterward become more likely, less likely, or neither (the failure has no predictive value, i.e., failures are memoryless)? What evidence in the paper supports your answer?

5. Why might memoryless failures make it easier to construct a reliable system than failures that are not memoryless?

6. What are some impacts if we built a system assuming a distribution of time between failures that was a poor fit to the actual one?

7. Given the authors' results, if you were in charge of allocating resources (e.g., salary and equipment) to improve service uptime for these systems, what would you do? How would your approach be similar to or different from your answer to the corresponding question for the Oppenheimer et al. paper?