Seminar on Self-Healing Systems

Fault Studies

Software Defects and their Impact on System Availability: A Study of Field Failures in Operating Systems
Mark Sullivan, Ram Chillarege, IBM Thomas J. Watson Research Center, 1991

Comparing the Robustness of POSIX Operating Systems
Philip Koopman, John DeVale, Proceedings of FTCS’99, 15-18 June 1999, Madison, Wisconsin.

Whither Generic Recovery From Application Faults? A Fault Study using Open-Source Software
Subhachandra Chandra, Peter M. Chen, Proceedings of the 2000 International Conference on Dependable Systems and Networks / Symposium on Fault-Tolerant Computing (DSN/FTCS), June 2000.

Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code
Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, Benjamin Chelf. SOSP, 2001

Why do Internet services fail, and what can be done about it? (ROC)
David Oppenheimer, Archana Ganapathi, David A. Patterson, 4th USENIX Symposium on Internet Technologies and Systems (USITS '03), March 2003

Fault-tolerance and Dependability

Dealing With Disaster: Surviving Misbehaved Kernel Extensions
Margo Seltzer, Proceedings of the 2nd Symposium on Operating Systems Design and Implementation, 1996

The Rio File Cache: Surviving Operating System Crashes
Peter M. Chen, Wee Teck Ng, Subhachandra Chandra, Christopher Aycock, Gurushankar Rajamani, David Lowell, Proceedings of the 1996 International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1996.

Fundamental Concepts of Dependability
A. Avizienis, J.-C. Laprie and B. Randell. Research Report N01145, LAAS-CNRS, April 2001

Improving the Reliability of Commodity Operating Systems
Michael M. Swift, Brian N. Bershad, Henry M. Levy, SOSP'03

Exploring Failure Transparency and the Limits of Generic Recovery
David E. Lowell, Subhachandra Chandra, Peter M. Chen, Proceedings of the 2000 Symposium on Operating Systems Design and Implementation (OSDI), October 2000.

The Impact of Recovery Mechanisms on the Likelihood of Saving Corrupted State
Subhachandra Chandra, Peter M. Chen, Proceedings of the 2002 International Symposium on Software Reliability Engineering (ISSRE), November 2002.

FIG: A prototype tool for online verification of recovery mechanisms. (ROC)
P. Broadwell, N. Sastry, and J. Traupman. In Workshop on Self-Healing, Adaptive and Self-Managed Systems, June 2002.

Information and control in gray-box systems
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, SOSP'01

Transforming Policies into Mechanisms with Infokernel
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Nathan C. Burnett, Timothy E. Denehy, Thomas J. Engle, Haryadi S. Gunawi, James A. Nugent, Florentina I. Popovici. SOSP'03

Hypervisor-based fault tolerance
T. C. Bressoud, F. B. Schneider, SOSP'95

Progress-based regulation of low-importance processes
John R. Douceur, William J. Bolosky, SOSP'99

Defensive Programming: Using an Annotation Toolkit to Build DoS-Resistant Software
Xiaohu Qie, Ruoming Pang, Larry Peterson, OSDI 2002

ReEnact: Using Thread-Level Speculation Mechanisms to Debug Data Races in Multithreaded Codes
Milos Prvulovic and Josep Torrellas, ISCA 2003.

A "Flight Data Recorder" for Enabling Full-system Multiprocessor Deterministic Replay
Min Xu, University of Wisconsin-Madison; Rastislav Bodik, University of California, Berkeley; Mark D. Hill, University of Wisconsin-Madison, ISCA 2003.

Designing for Disasters (FAST'04)
Kimberley Keeton, Cipriano Santos, and Dirk Beyer, Hewlett-Packard Labs; Jeff Chase, Duke University; John Wilkes, Hewlett-Packard Labs

Constructing Services with Interposable Virtual Hardware (NSDI'04)
Andrew Whitaker, Richard S. Cox, Marianne Shaw, and Steven D. Gribble, University of Washington

Path-based Failure and Evolution Management (NSDI'04)
Mike Chen, University of California, Berkeley; Anthony Accardi, Tellme; Emre Kiciman, Stanford University; Dave Patterson, University of California, Berkeley; Armando Fox, Stanford University; Eric Brewer, University of California, Berkeley

Consistent and Automatic Service Regeneration (NSDI'04)
Haifeng Yu, Intel Research Pittsburgh; Amin Vahdat, University of California, San Diego

Software Architecture for Self-Healing

An Active Events Model for Systems Monitoring
Gross, P.N., Gupta, S., Kaiser, G.E., Kc, G.S., Parekh, J.J. Proceedings of the Working Conference on Complex and Dynamic System Architecture, Brisbane, Australia, Dec 2001

Adaptive Mirroring of System of Systems Architectures (WOSS '02)
Combs, N., Vagel, J. Proceedings of the First ACM SIGSOFT Workshop on Self-Healing Systems

Towards architecture-based self-healing systems (WOSS'02)
Eric M. Dashofy, André van der Hoek, Richard N. Taylor

Model-based adaptation for self-healing systems (WOSS'02)
David Garlan, Bradley Schmerl

Self-organizing software architectures for distributed systems (WOSS'02)
Ioannis Georgiadis, Jeff Magee, Jeff Kramer

An architectural support for self-adaptive software for treating faults (WOSS'02)
Rogério de Lemos, José Luiz Fiadeiro

Architectural style requirements for self-healing systems (WOSS'02)
Marija Mikic-Rakic, Nikunj Mehta, Nenad Medvidovic

An instrumentation and control-based approach for distributed application management and adaptation (WOSS'02)
D. Reilly, A. Taleb-Bendiab, A. Laws, N. Badr

Remote Healing Architecture

Nonintrusive Failure Detection and Recovery for Internet Services using Backdoors.
F. Sultan, A. Bohra, Y. Pan, S. Smaldone, I. Neamtiu, P. Gallard and L. Iftode. Rutgers University Technical Report, DCS-TR-524, December 2003. Submitted for publication.

Nonintrusive Remote Healing Using Backdoors
F. Sultan, A. Bohra, I. Neamtiu, L. Iftode. Proceedings of First Workshop on Algorithms and Architectures for Self-Managing Systems, in conjunction with ISCA '03, June 2003. An initial version was published as Rutgers University Technical Report DCS-TR-522, April 2003.

Diagnosis and Detection

An Active Approach to Characterizing Dynamic Dependencies for Problem Determination in a Distributed Environment. (ROC)
Brown, A., G. Kar, and A. Keller. Proceedings of the Seventh IFIP/IEEE International Symposium on Integrated Network Management (IM 2001), Seattle, WA, May 2001

Pinpoint: Problem Determination in Large, Dynamic, Internet Services. (ROC)
Chen, M., E. Kiciman, E. Fratkin, E. Brewer and A. Fox. Proceedings of the International Conference on Dependable Systems and Networks (IPDS Track), Washington D.C., 2002.

Providing Persistent and Consistent Resources through Event Log
Ramendra K. Sahoo, SSRS'03

Bug Isolation via Remote Program Sampling
Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan. In ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 2003)

On-Line Intrusion Detection and Attack Prevention Using Diversity, Generate-and-Test, and Generalization
James C. Reynolds, James Just, Larry Clough, Ryan Maglich, Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS’03)

Measurement and Analysis of Spyware Infections in a University Environment (NSDI'04)
Stefan Saroiu, Steven D. Gribble, and Henry M. Levy, University of Washington

A Framework for Model Checking Network Protocols (NSDI'04)
Madanlal Musuvathi, David L. Dill, and Dawson R. Engler, Stanford University

Recovery and Repair

Cost-Sensitive Fault Remediation for Autonomic Computing
Michael L. Littman, Thu Nguyen, Haym Hirsh, Eitan M. Fenson, and Richard Howard. Workshop on AI and Autonomic Computing: Developing a Research Agenda for Self-Managing Computer Systems, 2003

Undo for Operators: Building an Undoable E-mail Store. (ROC)
Brown, A. and D. A. Patterson. In Proceedings of the 2003 USENIX Annual Technical Conference, San Antonio, TX, June 2003 (Best Paper Award).

Software Analysis

Software analysis: A roadmap
D. Jackson and M. C. Rinard. In Proceedings of the 22nd International Conference on Software Engineering (ICSE’00), 2000

Consistency management with repair actions
C. Nentwich, W. Emmerich, and A. Finkelstein. In Proceedings of the 25th International Conference on Software Engineering, May 2003.

Auditdraw: Generating audits the FAST way
N. Gupta, L. Jagadeesan, E. Koutsofios, and D. Weiss. In Proceedings of the 3rd IEEE International Symposium on Requirements Engineering, 1997.

Automatic Detection and Repair of Errors in Data Structures (Slides)
Brian Demsky and Martin C. Rinard, Proceedings of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2003

Biological Approach

A Biological Programming Model for Self-Healing
Selvin George, David Evans and Steven Marchette. First ACM Workshop on Survivable and Self-Regenerative Systems, October 31, 2003. (Swarm)

Autonomic Computing

The Vision of Autonomic Computing
Jeffrey Kephart, David M. Chess

Enabling autonomic behavior in systems software with hot swapping
J. Appavoo, K. Hui, C. A. N. Soules, R. W. Wisniewski, D. M. Da Silva, O. Krieger, M. A. Auslander, D. J. Edelsohn, B. Gamsa, G. R. Ganger, P. McKenney, M. Ostrowski, B. Rosenburg, M. Stumm, J. Xenidis

Network Repairing

The Capacity of Wireless Networks
Piyush Gupta, P.R. Kumar, Technical report, University of Illinois, Urbana-Champaign, 1999.

Mobility Increases the Capacity of Ad-hoc Wireless Networks
Matthias Grossglauser, David Tse, INFOCOM 2001

User-level Internet Path Diagnosis
Ratul Mahajan, Neil Spring, David Wetherall, Thomas Anderson.