Seminar on Self-Healing Systems
Software Defects and their Impact on System Availability: A Study of Field Failures in Operating Systems
Mark Sullivan, Ram Chillarege, IBM Thomas J. Watson Research Center, 1991
Comparing the Robustness of POSIX Operating Systems
Philip Koopman, John DeVale, Proceedings of FTCS’99, 15-18 June 1999, Madison, Wisconsin.
Whither Generic Recovery From Application Faults? A Fault Study using Open-Source Software
Subhachandra Chandra, Peter M. Chen, Proceedings of the 2000 International Conference on Dependable Systems and Networks / Symposium on Fault-Tolerant Computing (DSN/FTCS), June 2000.
Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code
Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, Benjamin Chelf. SOSP, 2001
Why do Internet services fail, and what can be done about it? (ROC)
David Oppenheimer, Archana Ganapathi, David A. Patterson, 4th USENIX Symposium on Internet Technologies and Systems (USITS '03), March 2003
Fault-tolerance and Dependability
Dealing With Disaster: Surviving Misbehaved Kernel Extensions
Margo Seltzer, Proceedings of the 2nd Symposium on Operating Systems Design and Implementation, 1996
The Rio File Cache: Surviving Operating System Crashes
Peter M. Chen, Wee Teck Ng, Subhachandra Chandra, Christopher Aycock, Gurushankar Rajamani, David Lowell, Proceedings of the 1996 International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1996.
Fundamental Concepts of Dependability
A. Avizienis, J.-C. Laprie and B. Randell. Research Report N01145, LAAS-CNRS, April 2001
Improving the Reliability of Commodity Operating Systems
Michael M. Swift, Brian N. Bershad, Henry M. Levy, SOSP'03
Exploring Failure Transparency and the Limits of Generic Recovery
David E. Lowell, Subhachandra Chandra, Peter M. Chen, Proceedings of the 2000 Symposium on Operating Systems Design and Implementation (OSDI), October 2000.
The Impact of Recovery Mechanisms on the Likelihood of Saving Corrupted State
Subhachandra Chandra, Peter M. Chen, Proceedings of the 2002 International Symposium on Software Reliability Engineering (ISSRE), November 2002.
FIG: A prototype tool for online verification of recovery mechanisms. (ROC)
P. Broadwell, N. Sastry, and J. Traupman. In Workshop on Self-Healing, Adaptive and Self-Managed Systems, June 2002.
Information and control in gray-box systems
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, SOSP'01
Transforming Policies into Mechanisms with Infokernel
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Nathan C. Burnett, Timothy E. Denehy, Thomas J. Engle, Haryadi S. Gunawi, James A. Nugent, Florentina I. Popovici. SOSP'03
Hypervisor-based fault tolerance
T. C. Bressoud, F. B. Schneider, SOSP'95
Progress-based regulation of low-importance processes
John R. Douceur, William J. Bolosky, SOSP 99
Defensive Programming: Using an Annotation Toolkit to Build DoS-Resistant Software
Xiaohu Qie, Ruoming Pang, Larry Peterson, OSDI 2002
ReEnact: Using Thread-Level Speculation Mechanisms to Debug Data Races in Multithreaded Codes
Milos Prvulovic and Josep Torrellas, ISCA 2003.
A "Flight Data Recorder" for Enabling Full-system Multiprocessor Deterministic Replay
Min Xu, University of Wisconsin-Madison; Rastislav Bodik, University of California, Berkeley; Mark D. Hill, University of Wisconsin-Madison. ISCA 2003.
Designing for Disasters (FAST'04)
Kimberley Keeton, Cipriano Santos, and Dirk Beyer, Hewlett-Packard Labs; Jeff Chase, Duke University; John Wilkes, Hewlett-Packard Labs
Constructing Services with Interposable Virtual Hardware (NSDI'04)
Andrew Whitaker, Richard S. Cox, Marianne Shaw, and Steven D. Gribble, University of Washington
Path-based Failure and Evolution Management (NSDI'04)
Mike Chen, University of California, Berkeley; Anthony Accardi, Tellme; Emre Kiciman, Stanford University; Dave Patterson, University of California, Berkeley; Armando Fox, Stanford University; Eric Brewer, University of California, Berkeley
Consistent and Automatic Service Regeneration (NSDI'04)
Haifeng Yu, Intel Research Pittsburgh; Amin Vahdat, University of California, San Diego
Software Architecture for Self-Healing
An Active Events Model for Systems Monitoring
Gross, P.N., Gupta, S., Kaiser, G.E., Kc, G.S., Parekh, J.J. Proceedings of the Working Conference on Complex and Dynamic System Architecture, Brisbane, Australia, Dec 2001
Adaptive Mirroring of System of Systems Architectures (WOSS '02)
Combs, N., Vagel, J. Proceedings of the First ACM SIGSOFT Workshop on Self-Healing Systems
Towards architecture-based self-healing systems (WOSS'02)
Eric M. Dashofy, André van der Hoek, Richard N. Taylor
Model-based adaptation for self-healing systems (WOSS'02)
David Garlan, Bradley Schmerl
Self-organizing software architectures for distributed systems (WOSS'02)
Ioannis Georgiadis, Jeff Magee, Jeff Kramer
An architectural support for self-adaptive software for treating faults (WOSS'02)
Rogério de Lemos, José Luiz Fiadeiro
Architectural style requirements for self-healing systems (WOSS'02)
Marija Mikic-Rakic, Nikunj Mehta, Nenad Medvidovic
An instrumentation and control-based approach for distributed application management and adaptation (WOSS'02)
D. Reilly, A. Taleb-Bendiab, A. Laws, N. Badr
Remote Healing Architecture
Nonintrusive Failure Detection and Recovery for Internet Services using Backdoors.
F. Sultan, A. Bohra, Y. Pan, S. Smaldone, I. Neamtiu, P. Gallard and L. Iftode. Rutgers University Technical Report, DCS-TR-524, December 2003. Submitted for publication.
Nonintrusive Remote Healing Using Backdoors
F. Sultan, A. Bohra, I. Neamtiu, L. Iftode. Proceedings of First Workshop on Algorithms and Architectures for Self-Managing Systems, in conjunction with ISCA '03, June 2003. An initial version was published as Rutgers University Technical Report DCS-TR-522, April 2003.
Diagnosis and Detection
An Active Approach to Characterizing Dynamic Dependencies for Problem Determination in a Distributed Environment. (ROC)
Brown, A., G. Kar, and A. Keller. Proceedings of the Seventh IFIP/IEEE International Symposium on Integrated Network Management (IM 2001), Seattle, WA, May 2001
Pinpoint: Problem Determination in Large, Dynamic, Internet Services. (ROC)
Chen, M., E. Kiciman, E. Fratkin, E. Brewer and A. Fox. Proceedings of the International Conference on Dependable Systems and Networks (IPDS Track), Washington D.C., 2002.
Providing Persistent and Consistent Resources through Event Log
Ramendra K. Sahoo, SSRS'03
Bug Isolation via Remote Program Sampling
Ben Liblit, Alex Aiken, Alice X. Zheng, and Michael I. Jordan. In ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 2003)
On-Line Intrusion Detection and Attack Prevention Using Diversity, Generate-and-Test, and Generalization
James C. Reynolds, James Just, Larry Clough, Ryan Maglich, Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS’03)
Measurement and Analysis of Spyware Infections in a University Environment (NSDI'04)
Stefan Saroiu, Steven D. Gribble, and Henry M. Levy, University of Washington
A Framework for Model Checking Network Protocols (NSDI'04)
Madanlal Musuvathi, David L. Dill, and Dawson R. Engler, Stanford University
Recovery and Repair
Cost-Sensitive Fault Remediation for Autonomic Computing
Michael L. Littman, Thu Nguyen, Haym Hirsh, Eitan M. Fenson, and Richard Howard. Workshop on AI and Autonomic Computing: Developing a Research Agenda for Self-Managing Computer Systems, 2003
Undo for Operators: Building an Undoable E-mail Store.
Brown, A. and D. A. Patterson. In Proceedings of the 2003 USENIX Annual Technical Conference, San Antonio, TX, June 2003 (Best Paper Award).
Software analysis: A roadmap
D. Jackson and M. C. Rinard. In Proceedings of the 22nd International Conference on Software Engineering (ICSE’00), 2000
Consistency management with repair actions
C. Nentwich, W. Emmerich, and A. Finkelstein. In Proceedings of the 25th International Conference on Software Engineering, May 2003.
Auditdraw: Generating Audits the FAST Way
N. Gupta, L. Jagadeesan, E. Koutsofios, and D. Weiss. In Proceedings of the 3rd IEEE International Symposium on Requirements Engineering, 1997.
Automatic Detection and Repair of Errors in Data Structures (Slides)
Brian Demsky and Martin C. Rinard, Proceedings of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2003
A Biological Programming Model for Self-Healing
The Vision of Autonomic Computing
Jeffrey Kephart, David M. Chess
Enabling autonomic behavior in systems software with hot swapping
J. Appavoo, K. Hui, C. A. N. Soules, R. W. Wisniewski, D. M. Da Silva, O. Krieger, M. A. Auslander, D. J. Edelsohn, B. Gamsa, G. R. Ganger, P. McKenney, M. Ostrowski, B. Rosenburg, M. Stumm, J. Xenidis
The Capacity of Wireless Networks
Piyush Gupta, P. R. Kumar, Technical report, University of Illinois, Urbana-Champaign, 1999.
Mobility Increases the Capacity of Ad-hoc Wireless Networks
Matthias Grossglauser, David Tse, INFOCOM 2001
User-level Internet Path Diagnosis
Ratul Mahajan, Neil Spring, David Wetherall, Thomas Anderson. SOSP'03.