System Support for Scalable, Fault-Tolerant Computing and Services on PC Clusters

NSF Information

PI: Thu D. Nguyen, co-PIs: Ricardo Bianchini, Liviu Iftode, and Richard P. Martin

Overall Project Summary. A high-end PC cluster is the crucial computing infrastructure for our research in the recently established Distributed Computing Laboratory (DISCO Lab) in the Department of Computer Science at Rutgers University. The requested cluster consists of 8 quad-processor PCs, a 16-port Alteon Gb/s Ethernet LAN with layer 4 switching capabilities and 16 NICs, and a 16-port Myrinet Gb/s LAN with 8 NICs. This cluster will support research in several projects that, jointly, are intended to improve the state of the art in cluster computing, moving towards the realization of a robust and efficient distributed computing environment for clusters of PCs. In particular, this proposal describes four projects: (i) developing a robust software distributed shared-memory environment to support emerging cluster applications, (ii) building system support for efficient global utilization of cluster resources, (iii) exploring the construction of highly available operating systems, and (iv) implementing distributed applications such as data mining, scalable servers, and interactive continuous media applications, along with the system mechanisms and policies needed to support them efficiently on clusters.

Project Abstracts

Project 1: Towards an Efficient and Robust Software DSM System for New Application Domains on Clusters - Iftode, Bianchini, and Nguyen.
In this project, we propose to implement a high-performance, fault-tolerant software distributed shared-memory (DSM) system. While researchers have investigated DSM systems extensively, most current DSM protocols do not meet the performance and functionality demands of emerging cluster applications such as data mining and interactive continuous media applications. We will investigate how to improve DSM support for these applications using multi-threading, prefetching, and fault-tolerance techniques. We focus on combining various policies for thread scheduling and migration, prefetching, page migration, and fault-tolerance.


Project 2: Towards a Globally Coherent Computing Environment on Clusters - Bianchini, Iftode, and Nguyen.
In this project, we propose to build system support for integrating the resources of a cluster into a coherent global computing environment. In particular, we propose to develop three kernel extensions and investigate the relevant design tradeoffs: (1) a lightweight DSM protocol, (2) a shared logical disk with a global file cache, and (3) a multiple-coherence protocol for sharing the state of a distributed kernel. We also plan to investigate how to exploit advanced communication support to reduce system overhead and how to use the idle cycles of the SMP processors to make clustering more efficient.


Project 3: Fault-Tolerant Operating Systems - Martin and Nguyen.
This project will explore the construction of highly available cluster operating systems. One goal of the project is to demonstrate a practical cluster with 99.9996% uptime, or roughly three to four minutes of downtime every two years. We begin with the premise that hardware and software will experience failures and bugs. We plan to follow a methodology inspired by the field of manufacturing, where statistical information about input components is used to create overall designs that are robust to component failures. Much of the research will thus focus on the definition and isolation of both hardware and software components and their impact on the operating system structure and cluster availability.
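
The relationship between the uptime target and the downtime budget can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical and a 365-day year is assumed. Under that assumption, 99.9996% uptime works out to about 4.2 minutes of downtime over two years, while exactly three minutes over two years corresponds to about 99.9997% uptime, so the two quoted figures agree to within rounding.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_percent, days):
    """Downtime budget in minutes over `days` at the given availability."""
    return (1 - availability_percent / 100) * days * MINUTES_PER_DAY


def availability_percent(downtime_min, days):
    """Availability (in percent) implied by `downtime_min` minutes of downtime over `days`."""
    return 100 * (1 - downtime_min / (days * MINUTES_PER_DAY))


if __name__ == "__main__":
    two_years = 2 * 365  # assumption: 365-day years, leap days ignored

    # Downtime budget implied by the 99.9996% uptime goal.
    print(round(downtime_minutes(99.9996, two_years), 2), "minutes")

    # Uptime implied by three minutes of downtime in two years.
    print(round(availability_percent(3, two_years), 5), "percent")
```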


Project 4: System Support for Emerging Distributed Applications - Nguyen, Bianchini, Iftode, and Martin.
In this project, we propose to investigate and build system support for emerging cluster applications such as data mining, scalable servers, and interactive continuous media applications. Each of these application domains places unique requirements on the underlying system. Data mining is extremely compute-intensive and has huge memory requirements. Distributed servers such as web servers have large I/O requirements and unique load-balancing characteristics that must be leveraged to achieve scalability and good performance. Interactive continuous media applications such as real-time 3D rendering and object tracking introduce soft real-time requirements. In addition, these applications have the intriguing property that they can trade off quality or accuracy against processing requirements, which calls for new APIs and resource management policies to maximize system efficiency in a multi-programmed distributed environment. Exploring these application domains will allow us to implement system mechanisms and policies appropriate for real cluster applications.