PI: Thu D. Nguyen, co-PIs: Ricardo Bianchini, Liviu Iftode, and Richard P. Martin
Overall Project Summary. A high-end PC cluster is the crucial computing infrastructure for our research in the recently established Distributed Computing Laboratory (DISCO Lab) in the Department of Computer Science at Rutgers University. The requested cluster consists of 8 quad-processor PCs, a 16-port Alteon Gb/s Ethernet LAN with layer 4 switching capabilities and 16 NICs, and a 16-port Myrinet Gb/s LAN with 8 NICs. This cluster will support research in several projects that, jointly, are intended to advance the state of the art in cluster computing, moving towards a robust and efficient distributed computing environment for clusters of PCs. In particular, this proposal describes four projects: (i) developing a robust software distributed shared-memory environment to support emerging cluster applications, (ii) building system support for efficient global utilization of cluster resources, (iii) exploring the construction of highly available operating systems, and (iv) implementing distributed applications such as data mining, scalable servers, and interactive continuous media applications, together with the system mechanisms and policies needed to support them efficiently on clusters.
Project Abstracts
Project 1: Towards an Efficient and Robust Software DSM System for New Application
Domains on Clusters - Iftode, Bianchini, and Nguyen.
In this project, we propose to implement a high-performance, fault-tolerant
software distributed shared-memory (DSM) system. While researchers have investigated
DSM systems extensively, most current DSM protocols do not meet the performance
and functionality demands of emerging cluster applications such as data mining
and interactive continuous media applications. We will investigate how to improve
DSM support for these applications using multi-threading, prefetching, and fault-tolerance
techniques. We will focus on combining various policies for thread scheduling and
migration, prefetching, page migration, and fault tolerance.
Project 2: Towards a Globally Coherent Computing Environment on Clusters
- Bianchini, Iftode, and Nguyen.
In this project, we propose to build system support for integrating the resources
of a cluster into a coherent global computing environment. In particular, we
propose to develop three kernel extensions and investigate relevant design tradeoffs:
(1) a lightweight DSM protocol, (2) a shared logical disk with a global file
cache, and (3) a multiple-coherence protocol for sharing the state of a distributed
kernel. We also plan to investigate how to exploit advanced communication support
to reduce the system overhead and how to use the idle cycles of the SMP processors
to make clustering more efficient.
Project 3: Fault-Tolerant Operating Systems - Martin and Nguyen.
This project will explore the construction of highly available cluster operating systems.
A goal of the project is to demonstrate a practical cluster with 99.9996%
uptime, or roughly four minutes of downtime every two years. We begin with the premise
that hardware will fail and software will contain bugs. We plan to follow
a methodology inspired by the field of manufacturing, where statistical information
about input components is used to create overall designs that are robust to
component failures. Much of the research will thus focus on the definition and
isolation of both hardware and software components and their impact on the operating
system structure and cluster availability.
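As a rough check on the availability target above, the following sketch converts an uptime percentage into expected downtime. It is illustrative only: the function names are hypothetical, and it assumes 365-day years. Under that assumption, 99.9996% uptime corresponds to about 4.2 minutes of downtime over two years, so any figure of "a few minutes" is an approximation.

```python
# Minutes in one year, assuming a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes(availability_percent: float, years: float) -> float:
    """Expected downtime, in minutes, at the given availability over `years`."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * years * MINUTES_PER_YEAR


def availability_percent(downtime_min: float, years: float) -> float:
    """Availability, in percent, implied by `downtime_min` minutes over `years`."""
    return 100.0 * (1.0 - downtime_min / (years * MINUTES_PER_YEAR))


if __name__ == "__main__":
    # Downtime budget implied by the stated uptime target over two years.
    print(f"{downtime_minutes(99.9996, 2):.2f} minutes of downtime")
    # Uptime implied by a three-minute downtime budget over two years.
    print(f"{availability_percent(3, 2):.5f}% uptime")
```

The same two functions can be used to compare alternative targets, for example to see how much downtime each additional "nine" of availability removes from the budget.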
Project 4: System Support for Emerging Distributed Applications - Nguyen,
Bianchini, Iftode, and Martin.
In this project, we propose to investigate and build system support for emerging
cluster applications such as data mining, scalable servers, and interactive
continuous media applications. Each of these application domains places unique
requirements on the underlying system. Data mining is extremely compute-intensive
and has huge memory requirements. Distributed servers such as web servers have
large I/O requirements and unique load-balancing characteristics that must be
leveraged to achieve scalability and good performance. Interactive continuous
media applications such as real-time 3D rendering and object tracking introduce
soft real-time requirements. In addition, these applications have the intriguing
property that they can trade off quality or accuracy against processing requirements,
necessitating new APIs and resource management policies to maximize system efficiency
in a multi-programmed distributed environment. Exploring these application domains
will allow us to implement system mechanisms and policies appropriate for real
cluster applications.